Teradata Warehouse Miner User Guide


Teradata Warehouse Miner User Guide - Volume 1 Introduction and Profiling
Release 5.4.4
B035-2300-077K
July 2017

The product or products described in this book are licensed products of Teradata Corporation or its affiliates.

Teradata, Applications-Within, Aster, BYNET, Claraview, DecisionCast, Gridscale, QueryGrid, SQL-MapReduce, Teradata Decision Experts, "Teradata Labs" logo, Teradata ServiceConnect, Teradata Source Experts, WebAnalyst, and Xkoto are trademarks or registered trademarks of Teradata Corporation or its affiliates in the United States and other countries. Adaptec and SCSISelect are trademarks or registered trademarks of Adaptec, Inc. Amazon Web Services, AWS, Amazon Elastic Compute Cloud, Amazon EC2, Amazon Simple Storage Service, Amazon S3, AWS CloudFormation, and AWS Marketplace are trademarks of Amazon.com, Inc. or its affiliates in the United States and/or other countries. AMD Opteron and Opteron are trademarks of Advanced Micro Devices, Inc. Apache, Apache Avro, Apache Hadoop, Apache Hive, Hadoop, and the yellow elephant logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. Apple, Mac, and OS X all are registered trademarks of Apple Inc. Axeda is a registered trademark of Axeda Corporation. Axeda Agents, Axeda Applications, Axeda Policy Manager, Axeda Enterprise, Axeda Access, Axeda Software Management, Axeda Service, Axeda ServiceLink, and Firewall-Friendly are trademarks and Maximum Results and Maximum Support are servicemarks of Axeda Corporation. CENTOS is a trademark of Red Hat, Inc., registered in the U.S. and other countries. Cloudera and CDH are trademarks or registered trademarks of Cloudera Inc. in the United States, and in jurisdictions throughout the world. Data Domain, EMC, PowerPath, SRDF, and Symmetrix are either registered trademarks or trademarks of EMC Corporation in the United States and/or other countries. GoldenGate is a trademark of Oracle. Hewlett-Packard and HP are registered trademarks of Hewlett-Packard Company. Hortonworks, the Hortonworks logo and other Hortonworks trademarks are trademarks of Hortonworks Inc. in the United States and other countries. 
Intel, Pentium, and XEON are registered trademarks of Intel Corporation. IBM, CICS, RACF, Tivoli, and z/OS are registered trademarks of International Business Machines Corporation. Linux is a registered trademark of Linus Torvalds. LSI is a registered trademark of LSI Corporation. Microsoft, Active Directory, Windows, Windows NT, and Windows Server are registered trademarks of Microsoft Corporation in the United States and other countries. NetVault is a trademark of Quest Software, Inc. Novell and SUSE are registered trademarks of Novell, Inc., in the United States and other countries. Oracle, Java, and Solaris are registered trademarks of Oracle and/or its affiliates. QLogic and SANbox are trademarks or registered trademarks of QLogic Corporation. Quantum and the Quantum logo are trademarks of Quantum Corporation, registered in the U.S.A. and other countries. Red Hat is a trademark of Red Hat, Inc., registered in the U.S. and other countries. Used under license. SAP is the trademark or registered trademark of SAP AG in Germany and in several other countries. SAS and SAS/C are trademarks or registered trademarks of SAS Institute Inc. Simba, the Simba logo, SimbaEngine, SimbaEngine C/S, SimbaExpress and SimbaLib are registered trademarks of Simba Technologies Inc. SPARC is a registered trademark of SPARC International, Inc. Unicode is a registered trademark of Unicode, Inc. in the United States and other countries. UNIX is a registered trademark of The Open Group in the United States and other countries. Veritas, the Veritas Logo and NetBackup are trademarks or registered trademarks of Veritas Technologies LLC or its affiliates in the U.S. and other countries. Other product and company names mentioned herein may be the trademarks of their respective owners. 
The information contained in this document is provided on an "as-is" basis, without warranty of any kind, either express or implied, including the implied warranties of merchantability, fitness for a particular purpose, or non-infringement. Some jurisdictions do not allow the exclusion of implied warranties, so the above exclusion may not apply to you. In no event will Teradata Corporation be liable for any indirect, direct, special, incidental, or consequential damages, including lost profits or lost savings, even if expressly advised of the possibility of such damages.

The information contained in this document may contain references or cross-references to features, functions, products, or services that are not announced or available in your country. Such references do not imply that Teradata Corporation intends to announce such features, functions, products, or services in your country. Please consult your local Teradata Corporation representative for those features, functions, products, or services available in your country.

Information contained in this document may contain technical inaccuracies or typographical errors. Information may be changed or updated without notice. Teradata Corporation may also make improvements or changes in the products or services described in this information at any time without notice.

To maintain the quality of our products and services, we would like your comments on the accuracy, clarity, organization, and value of this document. Please e-mail: [email protected]

Any comments or materials (collectively referred to as "Feedback") sent to Teradata Corporation will be deemed non-confidential. Teradata Corporation will have no obligation of any kind with respect to Feedback and will be free to use, reproduce, disclose, exhibit, display, transform, create derivative works of, and distribute the Feedback and derivative works thereof without limitation on a royalty-free basis. Further, Teradata Corporation will be free to use any ideas, concepts, know-how, or techniques contained in such Feedback for any purpose whatsoever, including developing, manufacturing, or marketing products or services incorporating Feedback.

Copyright © 1999 - 2017 by Teradata. All Rights Reserved.

Preface

Purpose

This volume introduces the features of Teradata® Warehouse Miner and its derivative products, Teradata ADS Generator and Teradata Profiler. It describes in detail the features associated primarily with Teradata Profiler, which helps users understand the nature and quality of data in a Teradata data warehouse.

Audience

This volume is written for users of Teradata Warehouse Miner, Teradata ADS Generator or Teradata Profiler. Readers should be familiar with Teradata SQL, the operation and administration of the Teradata RDBMS (Relational Database Management System), and statistical techniques. They should also be familiar with the Microsoft Windows operating environment and standard Microsoft Windows operating techniques.

Revision History

Date             Release   Description
July 2017        5.4.4     Maintenance Release
October 2016     5.4.2     Maintenance Release
January 2016     5.4.1     Maintenance Release
July 2015        5.4.0     Feature Release
June 2014        5.3.5     Maintenance Release
September 2013   5.3.4     Maintenance Release
June 2012        5.3.3     Maintenance Release
June 2011        5.3.2     Maintenance Release
June 2010        5.3.1     Maintenance Release
October 2009     5.3.0     Feature Release
February 2009    5.2.2     Maintenance Release
December 2008    5.2.1     Maintenance Release
May 2008         5.2.0     Feature Release
January 2008     5.1.1     Maintenance Release
July 2007        5.1.0     Feature Release
November 2006    5.0.1     Maintenance Release
September 2006   5.0.0     Major Release

Additional Information

Related Links

URL                         Description
https://tays.teradata.com   Use Teradata At Your Service to access Orange Books, technical alerts, and knowledge repositories, view and join forums, and download software packages.
http://www.teradata.com     External site for product, service, resource, support, and other customer information.

Related Documents

Publications are located at: http://www.info.teradata.com

Title                                                              Publication ID
SQL Data Manipulation Language                                     B035-1146
Teradata Warehouse Miner Release Definition                        B035-2494
Teradata Warehouse Miner User Guide, Volume 2 ADS Generation       B035-2301
Teradata Warehouse Miner User Guide, Volume 3 Analytic Functions   B035-2302
Teradata Warehouse Miner Model Manager User Guide                  B035-2303

Customer Education

URL                     Description
www.teradata.com/TEN/   Teradata Education Network Portal

Product Safety Information

This document may contain information addressing product safety practices related to data or property damage, identified by the word Notice. A notice indicates a situation which, if not avoided, could result in damage to property, such as equipment or data, but not related to personal injury.

Example

Notice: Improper use of the Reconfiguration utility can result in data loss.


CHAPTER 1 Introduction to Teradata Warehouse Miner

Overview

This document describes how to use the interfaces and the analytic functionality available in Teradata Warehouse Miner. It assumes a working knowledge of the operation and administration of the Teradata Relational Database Management System (RDBMS), including Teradata SQL, and a working knowledge of statistical techniques.

Overview of Teradata Warehouse Miner

Data warehousing has become a required component of business information technology today. In recent years, data mining has become a key aspect of decision support and customer relationship management applications built on top of the data warehouse and a crucial component in exploiting their inherent value. The Teradata Relational Database Management System (RDBMS) software is the leading technology available today for building data warehouses. Although Teradata data warehouses span a wide range of system sizes, from entry level servers to the largest massively parallel data warehouses in the world, they have in common unparalleled decision support performance and scalability.

Teradata Warehouse Miner is software that allows users to perform data mining entirely within a Teradata warehouse. Representing a dramatic shift from past non-warehouse-resident data mining architectures, Teradata Warehouse Miner users perform data mining without the additional hardware, software, and associated data management processes those architectures require. Additionally, the product is separated into three distinct offerings, allowing different types of Teradata users the functionality they need to perform data profiling, analytic data creation and model building, scoring and evaluation.

The first of these three offerings is the Teradata Profiler. The components available in this offering were developed to provide a comprehensive data profiling tool that does not require any movement of data outside of the warehouse, using as much of the data as desired, storing results directly in the database, and utilizing the parallel, scalable processing power of Teradata to perform data intensive operations. A wide variety of descriptive statistics are available to generate reports and graphics with drill down capabilities, pointing out potential issues with data quality.

The second offering is the Teradata Data Set Builder for SAS (also known as the Teradata Analytic Data Set (ADS) Generator). This includes all of the components within Teradata Profiler in addition to analyses that aid in the generation of Analytic Data Sets, analyses that can build and export a Correlation Matrix (and related matrix types), an analysis to score models that are described using the Predictive Model Markup Language (PMML) and an analysis to publish analytic data sets and/or models for deployment through the Teradata Model Manager web-based application. The need to build Analytic Data Sets derives from the fact that the data associated with the highly normalized data models within the Warehouse are not suited for mining directly. A precursor to the creation of analytic models is therefore the creation of an Analytic Data Set (ADS), which is a denormalized data structure often referred to as being in “observation format.” That is, the analytic algorithms need to be presented the data in a flat structure in which all of the variables are present for each entity (customer, household, account, etc.) being modeled. Teradata ADS Generator has components that aid in the creation of these variables, dimensioning or denormalizing them, as well as statistically transforming them. Additional components allow tables to be sampled, partitioned, denormalized and joined together.

The third offering is Teradata Warehouse Miner. In addition to the features of the Teradata Profiler and ADS Generator, it provides analytic algorithms that make binomial, multinomial and continuous predictions via Logistic Regression, Decision Tree and Linear Regression algorithms. Dimensionality reduction is offered with several flavors of Factor Analysis, while the Clustering algorithm provides an interactive mechanism for customer segmentation and for solving various business problems where similar grouping is desired. The Association Rules algorithm with optional Sequence Analysis provides a solution for problems such as market basket analysis and channel usage analysis. With the exception of Association Rules, all of the models produced by the algorithms can be evaluated and scored using features of the product. Finally, the third offering includes a collection of 17 Statistical Tests, including various Binomial, Kolmogorov-Smirnov, Parametric, Rank and Contingency Table tests.

TWM Architecture

Teradata Warehouse Miner includes a graphical user interface, programmatic Microsoft® .NET interfaces for each analysis, as well as Metadata and Teradata database interfaces. All of these interfaces reside on the client platform and are outlined in this section.

The Teradata Warehouse Miner architecture consists of three logical tiers. The first tier, known as the “User Interface” tier, contains the Teradata Warehouse Miner Graphical User Interface (GUI). The second tier, known as the “Business” tier, contains core Teradata Warehouse Miner analysis components, as well as components that allow access to the Teradata database. The Teradata Database represents the third tier, which is known as the “Database” tier.

In addition to running analyses against the database, Teradata Warehouse Miner also uses the database to store information (metadata) about these analyses. Although the first two tiers execute on the client, all analyses operate directly against a Teradata system and are persisted within a metadata model that resides in Teradata. This metadata is a collection of one or more “Analyses” contained within a Teradata Warehouse Miner Data Mining “Project.” The Data Mining Project acts as an index into the metadata, allowing analyses to be created, saved, removed, modified and executed from the Project. The Teradata Warehouse Miner framework is illustrated graphically below:

Figure: TWM Architecture

User Interface

The Teradata Warehouse Miner user interface is the entry point for end user interaction with the product. The interface runs on the Microsoft Windows® platform, sharing common functionality with other Microsoft products through the use of common controls such as menus, child windows, dialog boxes, and context-sensitive help. Through this interface, the end user can utilize all of the business component functionality using the mouse and keyboard.

The graphical layout of the main executable consists of a background workspace, a menu bar, a toolbar, and two tool windows. The first tool window, named the “Project Window,” contains any open Teradata Warehouse Miner projects that you are working with. Additionally, any valid Windows object can be added as an “attachment” to the project. These attachments are saved directly in the Teradata database. The second tool window, named the “Execution Status Window,” displays meaningful information about the analyses executed from within the project. The toolbar contains shortcut buttons for many common menu items. The executable allows multiple child windows to be displayed in the background workspace at any time. These windows represent each of the Teradata Warehouse Miner analyses in any open project.

Business Services

Analyses

The analysis user interfaces are grouped according to functionality. Each component contains multiple controls, one for each algorithm. Each control uses smaller controls and dialog boxes to display all of the parameters for the algorithm that it represents. The actual components available depend on the particular Teradata Warehouse Miner product installed and whether you are connected to Aster or Teradata. All possible components are listed below.

• ADS*
  1. Build ADS*
  2. Variable Creation*
  3. Variable Transformation*
  4. Refresh*
• Analytics
  1. Association Rules
  2. Clustering
  3. Decision Tree
  4. Factor Analysis
  5. Linear Regression
  6. Logistic Regression
• Descriptive Statistics*
  1. Adaptive Histogram
  2. Correlation Matrix
  3. Data Explorer*
  4. Frequency*
  5. Histogram*
  6. Overlap*
  7. Scatter Plot*
  8. Statistics*
  9. Text Field Analyzer*
  10. Values*
• Matrix Functions
  1. Export Matrix
  2. Matrix
• Miscellaneous*
  1. Free-Form SQL*
• Publishing
  1. Publish Models*
• Reorganization*
  1. Denorm*
  2. Join*
  3. Merge*
  4. Partition
  5. Sample*
• Scoring
  1. Cluster Scoring
  2. Decision Tree Scoring
  3. Factor Analysis Scoring
  4. Linear Regression Scoring
  5. Logistic Regression Scoring
  6. PMML Scoring
• Statistical Tests
  1. Parametric Tests
     a. T-Test
     b. F-Test N-Way
     c. F-Test / Analysis of Variance
  2. Binomial Tests
     a. Binomial Test
     b. Binomial Sign Test
  3. Kolmogorov-Smirnov Tests
     a. Kolmogorov-Smirnov Test
     b. Lilliefors Test
     c. Shapiro-Wilk Test
     d. D'Agostino and Pearson Test
     e. Smirnov Test
  4. Tests Based on Contingency Tables
     a. Chi Square Tests
     b. Median Test
  5. Rank Tests
     a. Mann-Whitney/Kruskal Wallis Test
     b. Wilcoxon Signed Ranks Test
     c. Friedman Test

Note: Those analyses supported on both Aster and Teradata are noted with an asterisk (*).

Each analysis creates results when it completes. These results consist of four different visualizations: text, tables, data, and graphics. Once an analysis is complete, results are retrieved and displayed within the results tab of the analysis. When you select a page of the report by clicking on a hyperlink, the contents of that page are loaded into the form on the right side of the control. Available graphs (again, depending on the Teradata Warehouse Miner product installed) include:

• Descriptive Statistics*
  1. Values Circular and Bar Graphs*
  2. Statistical Box and Whisker Plots*
  3. 2 and 3-D Histograms*
  4. 2 and 3-D Frequency Graphs*
  5. 2 and 3-D Scatter Plots*
  6. Data Explorer Thumbnails and Graphs*
• Analytics
  1. Association Rules Tile Map
  2. Cluster Analysis Size and Distances Plot and Similarity Chart
  3. Decision Tree interactive Tree Browser, Rule Text and Lift Chart
  4. Factor Analysis Factor Patterns and Scree Plot
  5. Linear Regression Bar chart for coefficients and T-statistics
  6. Logistic Regression Lift Chart
  7. Logistic Regression Bar chart for estimated standard coefficients, T-statistics, Partial R, Log Odds Ratios and Wald statistics
• Scoring
  1. Decision Tree Lift Chart
  2. Logistic Regression Lift Chart

Note: Those analyses supported on both Aster and Teradata are noted with an asterisk (*).

Analysis Execution

Teradata Warehouse Miner uses database connection session pooling so that a persistent connection to the database is not required. When an analysis is executed through the Teradata Warehouse Miner user interface, it gets a connection from the pool, generates the SQL appropriate for the analysis, and dynamically executes it through the same connection. Either an entire project or individual analyses can be executed; the analyses in a project are executed sequentially. Because analyses are executed on a separate thread, other analyses within the project can be viewed, changed or executed in the meantime. There is no theoretical limit to the number of analyses that can be executed at one time, but each one results in a separate session in Teradata, so care should be taken not to execute too many at once.
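The execution pattern described above (take a session from a pool, generate SQL, run it on a worker thread, return the session) can be sketched in Python as follows. This is only an illustration, not the product's implementation: the SessionPool class, the run_analysis function and the fixed pool size of 2 are hypothetical, and no real database is contacted. The pool size stands in for the advice not to run too many analyses at once; the product itself documents no such cap.

```python
import queue
import threading


class SessionPool:
    """Toy stand-in for a database session pool (hypothetical, no real database)."""

    def __init__(self, size):
        self._free = queue.Queue()
        for i in range(size):
            self._free.put(f"session-{i}")

    def acquire(self):
        # Blocks until a session is free, which bounds concurrent sessions.
        return self._free.get()

    def release(self, session):
        self._free.put(session)


def run_analysis(pool, name, results, lock):
    """Borrow a session, 'generate' SQL for the analysis, record it, return the session."""
    session = pool.acquire()
    try:
        sql = f"/* generated SQL for {name} */"  # placeholder for real SQL generation
        with lock:
            results[name] = (session, sql)
    finally:
        pool.release(session)


pool = SessionPool(size=2)  # assumed cap, chosen only for this sketch
results = {}
lock = threading.Lock()
threads = [
    threading.Thread(target=run_analysis, args=(pool, name, results, lock))
    for name in ("Frequency", "Histogram", "Statistics")
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(results))
```

Because each worker releases its session in a finally block, the pool is full again once all threads have finished, even if an analysis fails.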

Project and Analysis Representation

All of the Teradata Warehouse Miner analyses are stored as intermediate objects within the project. Even though the underlying algorithms for each analysis are different, all of the analyses are stored here using the same object. The parameters and results properties of each object contain .NET classes that differ in appearance according to the underlying algorithm. As parameters are set through the user interface, the analysis is updated automatically prior to executing or saving the analysis.

Each analysis must exist in one and only one project. An existing analysis can be added to multiple projects, but each project then receives its own copy of the analysis rather than a shared instance. Analyses are not saved individually; when a project is saved, all new and changed analyses are saved. Any number of projects can be created and/or opened within a given instance of Teradata Warehouse Miner.
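The copy-on-add behavior can be illustrated with a small Python sketch. The Analysis and Project classes and the add_existing method are hypothetical names chosen for this example; the product stores analyses as .NET objects, whose actual structure is not described here.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Analysis:
    """One common container for every analysis type (hypothetical model)."""
    name: str
    algorithm: str
    parameters: dict = field(default_factory=dict)
    results: dict = field(default_factory=dict)


@dataclass
class Project:
    name: str
    analyses: list = field(default_factory=list)

    def add_existing(self, analysis):
        # Adding an existing analysis stores an independent copy, not a shared reference.
        clone = copy.deepcopy(analysis)
        self.analyses.append(clone)
        return clone


project_a = Project("Project A")
original = Analysis("freq1", "Frequency", {"columns": ["age"]})
project_a.analyses.append(original)

project_b = Project("Project B")
clone = project_b.add_existing(original)
clone.parameters["columns"].append("income")  # change only the copy

print(original.parameters)
print(clone.parameters)
```

Changing the copy in the second project leaves the analysis in the first project untouched, which is the consequence of copying rather than sharing.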

Data Services

Teradata Access

The Teradata Access service is used by all of the user interface components as well as the analysis components to access data in the Teradata database. This service encapsulates all of the SQL generation for the other services in a single location. The service is split up into separate components to handle specific SQL functions: connecting, metadata access, DBC access, and data access.

Teradata Warehouse Miner no longer requires a persistent connection to the Teradata database. A Teradata ODBC connection is required only when selecting data to operate against, and when an analysis is executed. Teradata Warehouse Miner operates exclusively against data stored in Teradata. See the Teradata Warehouse Miner Release Definition document, B035-2494, for the appropriate release of Teradata Warehouse Miner for information about the releases of the Teradata RDBMS that are supported with the product.

Teradata Query Banding

Teradata Warehouse Miner utilizes the Query Banding feature introduced in Teradata 12.0. This feature provides a mechanism for applications to associate metadata with Teradata queries that may be useful for various Teradata utilities in identifying, managing and accounting for those queries. Teradata Warehouse Miner utilizes Query Banding only when it executes an analysis or project, not when performing maintenance functions or metadata queries and not inside any stored procedures that may be generated by the product. Query Banding is also not performed when executing an analytic algorithm, algorithm scoring or matrix building analysis.

Note: Query Banding is performed even when an analysis is marked “Generate-SQL-Only”.

Teradata Warehouse Miner sets certain query band name-value pairs at the SESSION level, not at the TRANSACTION level. These name-value pairs include:
ClientUser=
ApplicationName=TWM
Version=
Source=

For example, when user twm executes project Test Project using TWM 05.03.03, the query band is set internally using the following command:
SET QUERY_BAND = 'ClientUser=twm;ApplicationName=TWM;Version=05.03.03;Source=Test Project;' FOR SESSION;
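A minimal Python sketch of how such a statement can be assembled from the four name-value pairs is shown below. The function name build_query_band is hypothetical, and rejecting values that contain a semicolon, equals sign or single quote is a conservative assumption of this sketch, not documented product behavior.

```python
def build_query_band(client_user, version, source, application_name="TWM"):
    """Return a session-level SET QUERY_BAND statement for the four pairs above."""
    pairs = [
        ("ClientUser", client_user),
        ("ApplicationName", application_name),
        ("Version", version),
        ("Source", source),
    ]
    for name, value in pairs:
        # Assumption of this sketch: refuse characters that would break the
        # name=value; layout or the surrounding single-quoted SQL literal.
        if any(ch in value for ch in (";", "=", "'")):
            raise ValueError(f"unsupported character in {name}: {value!r}")
    band = "".join(f"{name}={value};" for name, value in pairs)
    return f"SET QUERY_BAND = '{band}' FOR SESSION;"


statement = build_query_band("twm", "05.03.03", "Test Project")
print(statement)
```

With the inputs from the example above, the printed statement matches the command shown in the text.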


CHAPTER 2 Installation and Configuration

Overview

The following sections outline the hardware and software requirements for the machine running Teradata Warehouse Miner. Additionally, instructions for generating the Metadata model required for Teradata Warehouse Miner are given.

Installation

The Teradata Warehouse Miner installation process installs all components and Microsoft dependent software required to run Teradata Warehouse Miner.

Operating Environment

The Teradata Warehouse Miner components are designed to run in a two-tier client/server environment.

Hardware Requirements

Hardware requirements for Teradata Warehouse Miner are:
• 1 GHz or greater Pentium Processor
• 1 GB or greater RAM
• 500 MB or greater disk space
• Minimum Screen Resolution of 1024x768

Software/System Requirements

One of the following operating systems is required:
• Microsoft Windows 7 Professional
• Microsoft Windows 8 Enterprise
• Microsoft Windows 8 Professional, Service Pack 1 or later
• Microsoft Windows 10 Professional

Software Dependencies

Teradata Warehouse Miner requires the following software packages to be installed on the client PC:
• Microsoft Windows Internet Explorer Version 8.0 or later
• Teradata ODBC Driver (32-bit) Versions 15.00, 15.10, 16.00 or 16.10, using the latest fix release
• Aster ODBC Driver (32-bit) Versions 6.0, 6.10 or 6.20, using the latest fix release
• Microsoft MDAC (Data Access Components) 2.8
• Microsoft .NET Runtime 4.0 (or later)

Teradata Database Dependencies

See the Teradata Warehouse Miner Release Definition document, B035-2494, for the appropriate release of Teradata Warehouse Miner for information about the releases of the Teradata RDBMS that are supported with the product and for updates regarding supported releases of other software. Initially, the following Teradata releases are supported:
• Teradata 15.00
• Teradata 15.10
• Teradata 16.00
• Teradata 16.10

Aster Database Dependencies

See the Teradata Warehouse Miner Release Definition document, B035-2494, for the appropriate release of Teradata Warehouse Miner for information about the releases of the Aster RDBMS that are supported with the product and for updates regarding supported releases of other software. Initially, the following Aster database releases are supported:
• Aster Database 6.0
• Aster Database 6.10
• Aster Database 6.20

Before Starting the Installation

Before starting the installation, it is very important to shut down all other programs. Additionally, any previous version of any of the Teradata Warehouse Miner products, version 4 or higher, must be uninstalled before beginning installation. Instructions for removing Teradata Warehouse Miner are given in subsequent sections.

Running the TWM Setup Program

1. Insert the CD-ROM containing the Teradata Warehouse Miner product being installed. If the installation program does not automatically begin, open the TWM.msi file on the installation media. The initial installation screen will then appear.

Information about the specific release dependencies for products that Teradata Warehouse Miner depends on is found in the Teradata Warehouse Miner Release Definition document, B035-2494, for the release of Teradata Warehouse Miner that is being installed. It is particularly critical that a sufficiently recent version of the Teradata ODBC Driver be installed on the machine on which Teradata Warehouse Miner is installed.

Installing TWM

When installing one of the Teradata Warehouse Miner products, the installation program determines if all dependent Microsoft software has been installed. This includes:
• Microsoft .NET Framework 4.0

If the .NET Framework is not installed, the installation program prompts you to download it.
• Click Yes to download the latest .NET Framework version 4.0 from Microsoft.com.

Once all dependencies have been determined or installed, the installation of the Teradata Warehouse Miner product begins with the Welcome dialog.
1. Click Next to go to the Information dialog.
2. Read the notice carefully and click Next to go to the Select Installation Folder dialog.
3. Click Next if this is a desirable installation location. Otherwise, click Browse to change the installation path.
4. Browse to or type in the desired installation location and click OK to return to the Select Installation Folder dialog.

Note: If you would like to see the disk space required to install the Teradata Warehouse Miner product, click Disk Cost.

5. Click OK to return to the Select Installation Folder dialog, and click Next to go to the Confirmation dialog.
6. Click Next to install the Teradata Warehouse Miner product. The installation may take a minute or two depending upon the system configuration. A screen indicating the progress of the installation is displayed.

Notice: If a status message is displayed indicating that a reboot is necessary, it is VERY important to reboot as indicated.

If the installation process determines that a reboot is not necessary, the Installation Complete dialog appears.
7. Click Close to complete the installation.

Uninstalling TWM

Perform the following steps to uninstall the Teradata Warehouse Miner product that is currently installed.
1. Go to the Control Panel and click on Programs and Features. The Programs and Features dialog appears.
2. Highlight the currently installed Teradata Warehouse Miner product.
3. Click Remove to begin the “Uninstall Process.”
4. Click on Yes to the Add/Remove Programs confirmation dialog. A series of Microsoft Windows Installer dialogs are shown on the screen, but no interaction is required.

Upgrading TWM

In order to upgrade or install an update of one of the Teradata Warehouse Miner products, it is very important to uninstall the product before installing the updated product. See Uninstalling TWM. Once the existing installation of TWM has been uninstalled, the new version can be safely installed. See Installing TWM.

It is important to note that all user configuration, projects and metadata are retained even after uninstalling TWM. This includes data stored both locally and on the server side, as outlined below:
• Local - The following data is retained on the user’s local system:
  ∘ User Preferences
  ∘ Connection Options (by each data source)
• Database - The following data is retained on the database server:
  ∘ Project Metadata
  ∘ Publish Metadata
  ∘ Advertise Output Metadata

Upon re-installation of TWM, this data is seamlessly available to the user as it was in the previous installation.

Configuration

Use the configuration guidelines in the following sections to ensure optimal performance of Teradata Warehouse Miner. The installation instructions for the Teradata Warehouse Miner metadata and demonstration data are also outlined.
• Manual Configuration
• Teradata Software/System Requirements
• TWM Databases
• TWM Metadata
• Installing Support Tables and Functions
• Teradata Warehouse Miner ODBC Data Sources
• Workload Management

Manual Configuration

If you receive the error “Requested registry access is not allowed” while attempting to Load Demo Data, Load Stats Test Data or Install UDFs, you must run the utility manually from a Command Prompt started with “Run as administrator”.

1. On the Start menu, navigate to All Programs > Accessories > Command Prompt.
2. Right-click Command Prompt and select Run as administrator from the pop-up menu.
3. In the Command Prompt window, change directory to the TWM installation directory (e.g., cd "C:\Program Files (x86)\Teradata\Teradata Warehouse Miner 5.4.4") and run one of the following commands, depending on which utility you want to execute:
• Load Teradata Demo Data: NewDataLoad.exe
• Load Aster Demo Data: NewDataLoad.exe -a
• Load Teradata Stats Test Data: NewDataLoad.exe -s
• Install UDFs: cd "Scripts\UDF Install and Scripts" and then InstallUdfs.exe
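The manual commands listed above can be summarized as a small lookup. The following Python sketch only assembles the working directory and argument list for each utility; it does not launch anything. The function name, the dictionary keys, and the use of the example installation path as a default are assumptions made for illustration and are not part of the product.

```python
from pathlib import PureWindowsPath

# Installation folder taken from the example in this guide; adjust for your system.
TWM_HOME = PureWindowsPath(
    r"C:\Program Files (x86)\Teradata\Teradata Warehouse Miner 5.4.4"
)

# Hypothetical key -> (subfolder under TWM_HOME, executable, arguments).
UTILITIES = {
    "teradata_demo": ("", "NewDataLoad.exe", []),
    "aster_demo": ("", "NewDataLoad.exe", ["-a"]),
    "stats_test": ("", "NewDataLoad.exe", ["-s"]),
    "install_udfs": (r"Scripts\UDF Install and Scripts", "InstallUdfs.exe", []),
}


def utility_command(name, home=TWM_HOME):
    """Return (working_directory, argv) for one of the utilities above."""
    subfolder, exe, args = UTILITIES[name]
    cwd = (home / subfolder) if subfolder else home
    return str(cwd), [exe, *args]


if __name__ == "__main__":
    for key in UTILITIES:
        cwd, argv = utility_command(key)
        print(f'cd "{cwd}"')
        print(" ".join(argv))
```

The printed lines correspond to what would be typed in the administrator Command Prompt described in the steps above.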

Teradata Software/System Requirements

The Teradata configuration requirements and required DBS control parameters (PERMSPACE and SPOOLSPACE) are shown below. See the Teradata Warehouse Miner Release Definition document, B035-2494, for information about the releases of the Teradata RDBMS that are supported with the product.

PERMSPACE

The amount of PERMSPACE required by Teradata Warehouse Miner depends on how the product is used. All functions can create tables and views, or the results can simply be selected out. Creating a User/Database with "PERM=1000000000" is considered minimal. This is not critical unless results are persisted in Teradata tables or views.

SPOOLSPACE

The amount of SPOOLSPACE required by Teradata Warehouse Miner is dependent upon the size of the tables being operated on. Creating a User/Database with "SPOOL=1000000000" is considered minimal. Ideally, Teradata Warehouse Miner users inherit the maximum available spoolspace from DBC.
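As an illustration of the minimal PERM and SPOOL values above, the following Python sketch builds a CREATE USER statement as a string. The user name, owner and password are hypothetical placeholders, and the statement form is an assumption based on general Teradata DDL usage; verify it against the SQL reference for your Teradata release before running it.

```python
# Minimal values quoted in this guide, in bytes.
MIN_PERM_BYTES = 1_000_000_000   # "PERM=1000000000"
MIN_SPOOL_BYTES = 1_000_000_000  # "SPOOL=1000000000"


def create_user_sql(user, owner, password,
                    perm=MIN_PERM_BYTES, spool=MIN_SPOOL_BYTES):
    """Build a CREATE USER statement using the guide's minimal space settings.

    All names passed in are illustrative; nothing is sent to a database.
    """
    return (
        f"CREATE USER {user} FROM {owner} AS "
        f"PERM = {perm}, SPOOL = {spool}, PASSWORD = {password};"
    )


if __name__ == "__main__":
    # Hypothetical user, owner and password.
    print(create_user_sql("twm_user", "dbc", "change_me"))
```

Because PERM only matters when results are persisted, a site that only selects results out might choose a different value; the defaults here simply mirror the figures in the text.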

TWM Databases

Teradata Warehouse Miner has several database concepts including a Source Database, Result Database, Metadata Database, Statistical Test Database, Publish Database and Advertise Database. These can all refer to the same physical Teradata Database or to six distinct databases. The following table defines each database, along with the necessary Teradata access rights.

Database Definitions

Database: Source Database
Definition: This is the database where the tables to be analyzed exist. By default, this is equivalent to the Default Database defined in the Teradata ODBC data source, but it can be modified globally on the Connection Properties dialog, or changed for each analysis on the Output-storage panel.
Access Rights: SELECT

Database: Result Database
Definition: This is the database where Teradata Warehouse Miner builds result tables/views. By default, this is equivalent to the “Userid” defined in the Teradata ODBC data source, but it can be modified globally on the Connection Properties dialog, or changed for each analysis on the Output-storage panel.
Access Rights: CREATE and SELECT WITH GRANT if other analyses will be executed against Views created in Result Database.

Database: Metadata Database
Definition: This is the database where the Teradata Warehouse Miner metadata resides. By default, this is equivalent to the “Userid” defined in the Teradata ODBC data source, but it can be modified on the Connection Properties dialog.
Access Rights: CREATE TABLE, VIEW, MACRO, PROCEDURE, FUNCTION, as well as ALTER FUNCTION to use the Metadata Creation feature, UDF creation or initial Drill-Down saving; UPDATE otherwise.

Database: Statistical Test Database
Definition: This is the database where the metadata that supports the Statistical Test functions resides. The metadata is created using the program item "Load Statistical Test Metadata".
Access Rights: CREATE to use the program item to create Statistical Test Tables; SELECT otherwise.

Database: Publish Database
Definition: This is the database where the metadata that supports the Publish function resides. It is created using the Publish Tables Creation item on the Tools menu.
Access Rights: CREATE to use the program item to create Publish Tables or to publish for the first time after migrating to 5.2 or later; UPDATE otherwise.

Database: Advertise Database
Definition: This is the database where the Advertise Output metadata resides. It is created using the Advertise Tables Creation items on the Tools menu.
Access Rights: CREATE to use the program item to create Advertise Tables; UPDATE otherwise.

TWM Metadata

Teradata Warehouse Miner requires that certain metadata tables associated with the product be present to operate properly. The Teradata Warehouse Miner metadata tables store information about projects, analyses and attachments created by users. When using Teradata Warehouse Miner, these tables must be present in one of two places:

• By default, they must be present in the “containing” database (Userid) that you connect with via an ODBC data source. This database is used as the default for the Teradata Warehouse Miner Metadata Database.
• Optionally, this can be changed to a database other than the “containing” database by changing the name of the “Metadata Database” in the Connection Properties dialog. This is discussed in the following sections.
  ∘ Installing Metadata
  ∘ Installing Publish Metadata
  ∘ Installing Advertise Output Metadata

Installing Metadata

In order to install the metadata tables, you must execute the Metadata Creation… wizard from the Tools menu. The Metadata Creation wizard will either create the new metadata in the specified metadata database or overwrite the current metadata.

1. Create the new metadata in the specified Metadata Database.
2. Overwrite current TWM 4 or higher metadata with new metadata in the specified Metadata Database.
3. Select Metadata Creation from the Tools menu. The Metadata tables can be created either With Fallback or Without Fallback as shown below.

Metadata Creation

If there is currently no connection established to Teradata, the Metadata Creation wizard prompts you with the Select Data Source dialog as previously described. It is highly advisable to first establish a connection and verify the Metadata Database name on the Connection Properties dialog. This is the database where the creation and/or migration will take place. If no metadata exists in the current metadata database, a confirmation dialog is displayed as shown below. The database is displayed along with the choice of with or without fallback protection.

Metadata Creation Confirmation Dialog

Metadata Creation Complete Dialog

4. Click Yes. The process takes from seconds to minutes, depending upon the contention with the Teradata dictionary. A message is displayed indicating the destination database, followed eventually by the final confirmation shown below.

5. Click OK to return to the Teradata Warehouse Miner user interface. Now projects can be saved in the metadata database.

If TWM 4 or higher metadata exists in the current metadata database, the Metadata Creation wizard prompts you as follows:

Metadata Creation: Overwrite Current

6. Click Yes to overwrite the metadata tables or No to return to the Teradata Warehouse Miner user interface. If Yes is selected, a confirmation dialog is displayed indicating the database, the number of analyses about to be destroyed, and the requested fallback option (with or without).

Metadata Creation: Overwrite Confirmation Dialog

If this confirmation dialog is answered Yes, the same completion dialog is eventually displayed as shown above. Click OK to return to the Teradata Warehouse Miner user interface. If Yes is answered, a confirmation dialog is displayed indicating the database and fallback protection option (with or without), just as it would be if an older version of metadata was not present. Answering this confirmation dialog Yes causes the existing metadata tables to be removed and replaced by new tables that contain no projects.

Installing Publish Metadata

If not using the Teradata Profiler product, before using the Publish analysis to publish models for the Model Manager application, special metadata tables must be created for use by this analysis and the Model Manager.

1. From the Tools menu, select the desired table option to create tables with or without fallback protection as shown below:
• Publish Tables Creation > With Fallback
• Publish Tables Creation > Without Fallback

The Teradata Profiler does not include publishing related functions.

Publish Tables

2. If the Publish Tables do not exist in the current publish database, the following confirmation dialog appears. The database name is displayed along with the choice of with or without fallback protection.

Publish Tables Confirmation Dialog

3. If Yes is answered, the publish tables are created. If, however, the Publish Tables already exist in the current publish database, the following message is given.

Publish Tables: Overwrite Current

4. If Yes is answered, the following confirmation dialog appears. Note that it indicates how many models will be destroyed if the Publish Tables are overwritten.

Publish Tables: Overwrite Confirmation Dialog

5. If Yes is answered, the publish tables are created, eventually displaying the following completion message.

Publish Tables Complete Dialog

6. Click OK to return to the Teradata Warehouse Miner user interface. Now published analyses can be saved in the special metadata tables in the publish database.

Installing Advertise Output Metadata

Before advertising output, special metadata tables must be created. The creation of these tables can be requested from the Tools > Advertise Tables Creation option in a manner similar to requesting creation of the Metadata or Publish Tables, indicating whether the tables should be created With Fallback or Without Fallback protection. Note that there is a third option to Replace Views and Macros Only so that the advertise views and macros can be updated if necessary without disturbing the underlying advertise metadata tables.

Advertise Tables Creation

The remaining dialogs are similar to those in Installing Publish Metadata.

Installing Support Tables and Functions

Depending on the Teradata Warehouse Miner product, additional support tables and user defined functions may need to be installed in support of specific analytic functions.

Installing Statistical Test Tables

Note: Users of the Teradata Profiler, Analytic Data Set (ADS) Generator or Data Set Builder For SAS products may skip this section.

The Teradata Warehouse Miner product, and products that are built upon it, offer Statistical Test functions that require that additional tables be installed in Teradata. In order to install these Statistical Test Tables, you must execute the Load Teradata Stats Test Data program item in the Teradata Warehouse Miner program group. This program loads the data by locating the Teradata Client Utility FastLoad. If it exists on the system, you can proceed with the load. Otherwise, an error occurs. You will be prompted to enter the Hostname, Userid, Password and Database to load the data into. This requires an appropriate entry in the Hosts file to connect properly. See Executing Teradata Client Utilities. The FastLoad program is then invoked to load all of the Statistical Test Tables - the FastLoad log can be analyzed upon completion.

Note: Any existing copies of the Statistical Test Tables will be replaced during the execution of the Load Teradata Stats Test Data program.

Installing or Uninstalling PMML Scoring UDFs

Note: Users of the Teradata Profiler product may skip this section.

The Teradata Data Set Builder for SAS and Analytic Data Set (ADS) Generator products, and products that are built upon them, offer Predictive Model Markup Language (PMML) scoring functions that require that certain User Defined Functions (UDFs) be installed in Teradata.

1. Execute the Install or Uninstall UDFs program item in the Teradata Warehouse Miner program group. This program creates the UDFs by locating the Teradata Client Utility BTEQ. If it exists on the system, you can proceed with the load. Otherwise, an error occurs.
2. You will be prompted to select which UDF install or uninstall function to perform and to enter the Hostname, Userid, Password and Database (which should be the Metadata Database) to load the UDFs into. This requires an appropriate entry in the Hosts file to connect properly. See Executing Teradata Client Utilities. The BTEQ utility is then invoked to create all of the necessary UDFs.

Note: Any existing UDFs with the same names will be replaced during the execution of the PMML install function.

3. If the UDFs fail to install properly, even though the BTEQ utility, Hosts file entry, and proper database permissions are all in place, an alternative method of installation can be attempted. From the command line, change the current directory: cd \Scripts\UDF Install and Scripts
4. Issue the following command: InstallUdfs -c ASCII. This will lead to the prompts described above.

Installing or Uninstalling TD_Analyze UDFs

Note: Users of the Teradata Profiler or ADS Generator product can skip this section.

Certain functions or options require that the TD_Analyze external stored procedure and table operators tda_kmeans and tda_dt_calc be installed. To install these UDFs, you must execute the Install or Uninstall UDFs program item in the Teradata Warehouse Miner program group, selecting the option to “Install TD_Analyze Udfs.” This program creates the UDFs by locating the Teradata Client Utility BTEQ. If it exists on the system, you can proceed with the load. Otherwise, an error occurs. The options enabled by the installation of these UDFs include the Fast K-Means clustering option and the Gain Ratio Extreme decision tree option. See Installing or Uninstalling PMML Scoring UDFs for details about executing the Install or Uninstall UDFs program.

Installing or Uninstalling the Aster SQL-MR Library function (Aster Database)

In order to use Descriptive Statistics on Aster, you must first install the Aster SQL-MR Library function Aster Profiler.

1. Connect to the Aster system that you want to analyze.
2. Select Tools > Aster Profiler > Install. The ACT program is invoked to create the SQL-MR function. A dialog will display once the installation is complete.
3. Click OK.

Note: If a version of the Aster Profiler function is already installed and you want to install a new one, you must uninstall the existing one first using the Tools > Aster Profiler > Remove option.

Installing Teradata Warehouse Miner Demonstration Data

Teradata Warehouse Miner demonstration data can be installed for tutorial purposes. This data is the same data used in the documentation and help system provided with the product. The tables comprising the demonstration data use less than 10 megabytes of permanent space and are built without FALLBACK protection. The table names begin with “twm_” to avoid interference with the names of other tables in the database. The demonstration table names listed below show that the data emulates customer, account and transaction information from a fictitious bank.

1. twm_accounts
2. twm_checking_acct
3. twm_checking_tran
4. twm_credit_acct
5. twm_credit_tran
6. twm_customer
7. twm_customer_analysis
8. twm_customer_dqa
9. twm_savings_acct
10. twm_savings_tran
11. twm_transactions

In order to install the product demonstration tables, you must execute the Load Demonstration Data program item in the Teradata Warehouse Miner program group. This program loads the data by locating the Teradata Client Utility FastLoad. If it exists on the system, you can proceed with the load. Otherwise, an error occurs. You will be prompted to enter the Hostname, Userid, Password and Database to load the data into. Note that this requires an appropriate entry in the Hosts file to connect properly. See Executing Teradata Client Utilities. The FastLoad program is then invoked to load all the demonstration data; the FastLoad log can be analyzed upon completion.

Note: Any existing copies of the demonstration tables will be replaced during the execution of the Load Demonstration Data program.

Installing Tutorial Projects

If the Teradata Warehouse Miner demonstration data has been installed, the tutorial examples described in the User Guide and Help System may be installed and executed to aid in learning about the product. Export files are provided that contain projects with these examples, and are in the Scripts/Tutorials path under the application folder where the Teradata Warehouse Miner product is installed. To install these tutorial projects, the user can select Help > Import Tutorial Projects and map the twm_source and twm_results databases to the desired databases. The following figure shows the Import Tutorial Projects dialog, with both the Aster and Teradata related tutorials noted in red and orange, respectively:

Import Tutorial Projects

Aster Database

Included in the SQL-MR Examples tutorial is a project named Aster Tutorials - SQL-MR Examples, which contains more than 60 analyses. Some of these SQL-MR analyses contain Run Unit scripts that call ACT. Those analyses must be updated to supply login credentials and point to the path on the user system where ACT is installed, as shown in the following figure:

Run Unit Properties

The Install Aster Example Data analysis can use ACT to install the Aster SQL-MR example data (if it is not already installed on your system) in order for the tutorials to function properly. See the Aster Analytics Foundation User Guide for details on each SQL-MR analysis.

Note: The SQL-MR Examples tutorial will take a few minutes longer to load than the other tutorials.

Executing Teradata Client Utilities

In order to perform any of the installation tasks described above, it is necessary to execute a Teradata Client Utility (either FastLoad or BTEQ) on the client machine where Teradata Warehouse Miner is installed. This requires a user with administrative privileges to add one or more entries to the Hosts file on this machine, one for every Teradata server where tables or user defined functions are to be installed. The Hosts file can typically be found in one of the following locations, depending on the client operating system:
• C:\WINDOWS\system32\drivers\etc
• C:\WINNT\system32\drivers\etc

For example, if a demonstration version of the Teradata database on the same client machine is the target, 127.0.0.1 dbc dbccop1 might be added.

Note: This line is entered without quotes, and the three items should be separated from each other by a tab character. The first item is the IP address of the target machine which, in this example, is a special value indicating "local host". The second item is the host name, dbc. The third item is the host name with the suffix cop1 added.
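The entry format described in the note can be sketched in a few lines of Python. The function below only builds the text of one Hosts file line (three tab-separated items); it does not modify the Hosts file. The function and parameter names are illustrative assumptions.

```python
def hosts_entry(ip_address, host_name, suffix="cop1"):
    """Build one Hosts file line: IP address, host name, host name + suffix.

    The three items are separated by tab characters and no quotes are used,
    as described in the note above.
    """
    return "\t".join([ip_address, host_name, host_name + suffix])


if __name__ == "__main__":
    # The example from this guide: a demonstration database on the local machine.
    print(hosts_entry("127.0.0.1", "dbc"))
```

For the local demonstration database this yields the three items 127.0.0.1, dbc and dbccop1, which a user with administrative privileges would then add to the Hosts file.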

Teradata Warehouse Miner ODBC Data Sources

Teradata Warehouse Miner requires data sources to be defined in a certain manner. File, User, System or “Simple” (IP Address or Host Name) data sources can be used. A connection is only required when selecting data from the user interface, when an analysis executes or when result tables are displayed.

When a data source is defined, the name of the Default Database is used as the Teradata Warehouse Miner Source Database. This is the Teradata database where the tables to be analyzed reside. This is only a default and can be changed on the Connection Properties dialog as described in Using Teradata Warehouse Miner.

Similarly, the name given for the Username is used as the default value for the Teradata Warehouse Miner databases.
• The Metadata Database is the Teradata database containing the Teradata Warehouse Miner metadata.
• The Result Database will contain any result tables or views.
• The Statistical Test Database contains the metadata tables that support Statistical Test functions.
• The Publish Database contains the metadata tables that hold “published” analyses.
• The Advertise Database contains the metadata tables that hold information about “advertised” output tables, views and procedures.

These are also only default values and can be changed on the Connection Properties dialog. Once these Teradata Warehouse Miner databases are changed, this information is saved on a per Data Source basis. When you reconnect utilizing a particular data source, Teradata Warehouse Miner remembers the Source, Result and Metadata database settings from the prior connection.
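The defaulting rules described in this section can be sketched as follows. This is an illustration of the rules as stated in the text, not the product's implementation; the function name, role keys and example database names are assumptions.

```python
# Roles that default to the data source's Username, per the text above.
ROLES_DEFAULTING_TO_USERNAME = (
    "metadata", "result", "statistical_test", "publish", "advertise",
)


def resolve_databases(default_database, username, overrides=None):
    """Return the database used for each TWM role.

    The Source Database defaults to the data source's Default Database;
    every other role defaults to the Username. Values in `overrides`
    stand in for changes made on the Connection Properties dialog.
    """
    databases = {"source": default_database}
    databases.update({role: username for role in ROLES_DEFAULTING_TO_USERNAME})
    databases.update(overrides or {})
    return databases


if __name__ == "__main__":
    # Hypothetical data source settings with one override.
    print(resolve_databases("twm_source", "twm_user", {"result": "twm_results"}))
```

In this sketch the overrides dictionary plays the role of the per data source settings that are remembered between connections.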

Note: If the Teradata System that Teradata Warehouse Miner will be executed against is configured in ANSI mode, one additional setting needs to be changed when the DSN is defined. In this case, select Options and change the Session Mode from System Default or ANSI to Teradata.

Depending on the version of Teradata and the version of the ODBC driver for Teradata being used, the session character set requested on an ODBC data source may need to be set. See the Orange Book publication 541-0004068-C02 Getting Started: International Character Sets and the Teradata Database for more information.

Workload Management

If you want to manage the Teradata Warehouse Miner database workload, the following points are worth noting. Teradata Warehouse Miner (and each of its derivative products) uses session pooling and therefore maintains more than one concurrent connection to the Teradata database through ODBC. Certain operations may require more than one connection, such as the creation of metadata, which may require 4 connections. There are different options for limiting the number of concurrent queries:
• If using the Teradata Workload Manager utility, set the value for User Throttle for Teradata Warehouse Miner to 4 to allow the creation of metadata.
• If using a Workload Class object, the setting to limit concurrent queries can be as low as 1; it should be possible to create metadata successfully. In this case, however, the expert options tab for each Data Explorer, Matrix and Logistic Regression analysis should be used to limit threads or concurrent queries appropriately. The Decision Tree, however, may require more than one concurrent connection.

CHAPTER 3 Using Teradata Warehouse Miner

Overview

The Teradata Warehouse Miner front end is a graphical interface to the Teradata Warehouse Miner analytic components. Use the interactive front end to parameterize, run, graph and view results of the analytic components.

Main Menu

The Teradata Warehouse Miner Main Menu consists of the standard File, Edit, View, Window and Help menus available with most standard Microsoft Windows products. Additionally, there are Project and Tools menus described below.

File Menu

The following options are available from the File (Alt-F) menu:
• Add New Project - Create a new project, ready to contain a collection of available analyses.
• Add Existing Project… - Add one or more existing projects to the project workspace. For more information, see Creating or Opening a Project.
• Close - Close the specified project.
• Close All Projects - Close all projects that are open in the workspace.
• Add New Analysis… - Add a new analysis to the project.
• Add Existing Analysis… - Add copies of one or more existing analyses to the currently selected project. For more information, see Adding Analyses to a Project.
• Save - Save the specified project and its contents. If another user has saved the same project in another session after the project was last saved or opened in this session, a message is displayed, along with an option to save a copy of the project; overwriting the original project is not allowed in this case.
• Save As... - Save the specified project and its contents under a new name. Note that this option creates a new project containing copies of all analyses within the current project.
• Save and Archive - Save the specified project and its contents and archive the project to an export file in the standard automatic archive. The automatic archive facility is described in Metadata Maintenance, specifically in the portion describing the Archive button.
• Save All Projects - Save all of the currently open projects and their contents.
• Import… - Display the Import Wizard, which is used to import the projects contained within a binary file created by the Export Wizard into the current Project Window.
• Export… - Display the Export Wizard, which is used to save projects to a binary file for possible retrieval on the same or a different system using the Import Wizard.

• Archive/Restore - The Archive/Restore options on the File menu are the same as those found in the Metadata Maintenance dialog when the Archive button is selected, minus the Export option. See Metadata Maintenance for more information.
• Exit - End the program.

Import Wizard

The Import Wizard allows the user to import the project or projects contained within a binary file created by the Export Wizard into the current Project Window. Alternately, if there is already a project in the Project Window that is currently selected, the analyses contained in the project or projects in the binary file can be imported directly into the currently selected project.

In order to invoke the Import Wizard, the user must be connected to an ODBC data source. The metadata database associated with this data source (or the one currently specified in the Connection Properties dialog) is the metadata database in which the imported projects or analyses will be stored if the user chooses to save the current project or projects.

The Import Wizard contains facilities for mapping input and output database, table and view names, as well as input column names to other values, based entirely on matching one name to another name. Depending on the type of analysis, some references to these database entities will not be remapped in this process, and may need to be changed in individual analyses after importing. For example, changing the name of an input column will not change references to the original column in a primary index or join path, and changing a database name may not change a reference to that database in the specification of a User Defined Function call.

The Import Wizard is activated by selecting File > Import, bringing up an Open File dialog from which one or more import files (ending with .bin and created using the Export menu option) can be selected. After selecting the file or files to open, the following Import Wizard dialog is displayed. On this first dialog, the Import Wizard allows you to change the database names used in the imported analyses. It does this by determining the Available and Matched Databases as displayed below.

Import Wizard

• Import Wizard - The header for this dialog indicates that it allows the user to Match Databases while giving a reminder of the name of the import file (in this case, Import Demo).
• Available Databases - This is a list of available databases associated with the ODBC connection. This includes the databases listed as Source Databases in the Connection Properties dialog available from the Tools menu, as well as the Result Database from the Connection Properties dialog.
• Matched Databases - On the left side of the => symbol are all the database names acquired from the analyses within the imported binary file. On the right side of the symbol is the Available database that matches the imported database by name (not case sensitive). If a database is not matched, the “NOT MATCHED” phrase will show next to it and action must be taken to resolve the conflict. For example, in the screen above, the financial database does not match any databases listed in the Available Databases list, so NOT MATCHED is shown. The other databases, twm_source and twm_results, were matched and no further action is required.

Note: If a foreign database to be accessed via QueryGrid is present in this list (distinguished by the presence of a ‘@’ symbol in the name), the Next button leading to Table and Column matching is disabled.

• Match button - Clicking the Match button will match the database selected on the left with the database selected on the right. The same result can be obtained by double-clicking on the desired database on the left to match with the selected database on the right.
• New Name - The name of a matching database may be entered directly in this field and it will be matched with the currently selected database on the right. This removes the need to add a Source Database using the Connection Properties dialog just to allow its use as a matching database.

Note: A matched database can be unmatched by entering an empty name in this field with the Enter or Return key.

• Use connection's Result Database for all output databases - Selecting this option overrides the value or values that output databases are mapped to and maps all output databases to the Result Database on the current Connection Properties dialog. This is particularly useful when a project that accesses and saves tables into the same database is imported into an environment where tables need to be created in a separate database.

Note: Tables used for both output and input will need to be adjusted individually after a project is imported in this manner.

• Merge with selected project - By default, the project or projects in the import file are imported into a new project or projects. If there is a selected project in the Project Window, however, this option is enabled. Click it to import the analyses into the currently selected project. Even if the import file contains more than one project, all of the analyses contained in the projects in the import file are imported into the currently selected project in the Project Window if this option is selected.
• Import button - Clicking the Import button results in all the projects and analyses in the binary import file being imported into the current Teradata Warehouse Miner Project Window.

Note: These projects and/or analyses are not saved in the current metadata database until the Save or Save All button is selected in the usual manner. Also, the Import button can be selected even though additional screens are available by selecting the Next button. A validation warning is given if there are any unmatched databases, tables or columns present, even if the screens to view all of these objects have not been viewed by selecting the Next button.

• Cancel button — Clicking the Cancel button cancels the import operation without changing the current Project Window workspace.
• Next button — Clicking the Next button takes the user to the next screen, which allows the user to map imported tables to tables with a different name. The resulting dialog is shown below, followed by explanations of the fields in the dialog. Note that if a foreign database to be accessed via QueryGrid is present in the Matched Databases list (distinguished by the presence of a ‘@’ symbol in the name), the Next button leading to Table and Column matching is disabled.

Import Wizard: Map Imported Tables

• Import Wizard — The header for this dialog indicates that it allows the user to “Match Tables” while giving a reminder of the name of the import file (in this case, Import Demo). • Matched Databases — These are the matched databases from the previous screen. • Available Tables — These are the tables and views contained in the currently selected matched database in the Available Databases pull-down list. • Matched Tables — On the left side of the symbol are all the table names acquired from the analyses within the imported binary file. These may include the names of tables that are accessed strictly for input, tables that are created as output, and tables that are both created as output and read in as input (output/ input). For tables that are accessed strictly for input, the right side of the symbol identifies the Available table that matches the imported table by name (not case sensitive) and that resides in the database matched previously to the original database of the imported table. If an input table is not matched, NOT MATCHED appears next to it and action must be taken to resolve the conflict.

Note: It is not possible to match an imported table if the database it is imported from was not matched on the previous screen.
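The default matching rule for input tables described above can be sketched in a few lines of Python. This is only an illustration of the rule, not the product's implementation: the function name, the argument shapes and the `database.table` display format are assumptions made for the example.

```python
from typing import Dict, List, Tuple

NOT_MATCHED = "NOT MATCHED"


def match_input_tables(
    imported: List[Tuple[str, str]],   # (original database, table) pairs from the import file
    database_map: Dict[str, str],      # original database -> matched database (previous screen)
    available: Dict[str, List[str]],   # matched database -> tables and views it contains
) -> Dict[Tuple[str, str], str]:
    """Return the default match for each imported input table."""
    result = {}
    for db, table in imported:
        match = None
        target_db = database_map.get(db)  # None if the database was never matched
        if target_db is not None:
            for candidate in available.get(target_db, []):
                # Names are compared without regard to case.
                if candidate.lower() == table.lower():
                    match = f"{target_db}.{candidate}"
                    break
        result[(db, table)] = match if match is not None else NOT_MATCHED
    return result
```

A table whose original database was not matched on the previous screen can never be matched here, which mirrors the note above.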

To make table matching easier, when an item under Matched Tables is selected on the right, the database matched previously to the original database of the imported table is automatically selected, if possible, under Available Databases on the left and the corresponding tables are loaded under Available Tables. Tables that are created as output are labeled either as an “Output Table” or an “Output/Input Table” and matched by default to a table with the same name (i.e., the name is not changed). For example, in the screen above, the table “output1” is labeled as an Output/Input table, and “output2” is labeled as an Output table, both with the same name. The names of output tables can be changed either by matching or entering a New Name in the usual manner.
• Match button — Clicking the Match button matches the table selected on the left with the table selected on the right. The same result can be obtained by double-clicking on the desired table on the left to match with the selected table on the right. Note, however, that table matching can only take place if both the selected table on the left and the selected table on the right are contained in the same matched database.
• New Name — The name of a matching table may be entered directly in this field and it will be matched with the currently selected table on the right, provided the currently selected database under Available Databases on the left is the same as the database previously matched to the original database of the imported table selected under Matched Tables on the right. Note that a matched table can be unmatched by entering an empty name in this field and pressing the Enter or Return key.
• Display qualified names — This option displays matched tables qualified by the name of their containing database.
It is checked by default only if the same table name occurs in more than one matched database.
• Display input only — This option displays only input tables under Matched Tables, excluding Output and Output/Input tables.
• Merge with selected project — By default, the project or projects in the import file are imported into a new project or projects. If there is a selected project in the Project Window, however, this option is enabled. Click it to import the analyses into the currently selected project. Even if the import file contains more than one project, all of the analyses contained in the projects in the import file are imported into the currently selected project in the Project Window if this option is selected.
• Import button — Clicking the Import button results in all the projects and analyses in the binary import file being imported into the current Teradata Warehouse Miner Project Window. These projects and/or analyses are not saved in the current metadata database until the Save or Save All button is explicitly selected in the usual manner.
• Cancel button — Clicking the Cancel button cancels the import operation without changing the current Project Window workspace.
• Next button — Clicking the Next button takes the user to the next screen, which allows the user to map imported columns to columns with a different name. The resulting dialog is shown below, followed by explanations of the fields in the dialog.

Import Wizard: Map Imported Columns

• Import Wizard — The header for this dialog indicates that it allows the user to “Match Columns” while giving a reminder of the name of the import file (in this case, Import Demo). • Matched Tables — These are the matched tables from the previous screen, each displayed as qualified by its containing database. • Available Columns — These are the columns contained in the currently selected matched table in the Matched Tables pull-down list. • Matched Columns — On the left side of the symbol are all the input column names acquired from the analyses within the imported binary file. On the right side of the symbol is displayed the Available column that matches the imported input column by name (not case sensitive) and that resides in the table or view matched previously to the original table or view that contains the imported column. If an input column is not matched, the “NOT MATCHED” phrase will show next to it and action must be taken to resolve the conflict. Note that it is not possible to match an imported column if the table it is imported from was not matched on the previous screen. To simplify column matching, when an item under Matched Columns is selected on the right, the table matched previously to the original table of the imported column is automatically selected, if possible, under Matched Tables on the left and the corresponding columns are loaded under Available Columns. Note that only input columns can be changed using Matched Columns, not primary index columns, join columns or anchor columns. Therefore, only input columns that do not have these special uses should be matched to other columns. • Match button — Clicking the Match button matches the column selected on the left with the column selected on the right. The same result can be obtained by double-clicking on the desired column on the left to match with the selected column on the right. 
Note, however, that column matching can only take place if both the selected column on the left and the selected column on the right are contained in the same matched table. • Retain alias when column changes — Column aliases are retained when a column is mapped to the same column. When a column is mapped to a different column name, the alias is dropped unless this option is selected.

• Display qualified names — This option displays matched columns qualified by the name of their containing table and possibly their containing database also. It is checked by default only if the same column name occurs in more than one matched table. The database is displayed as part of the qualified names only if the same table name occurs in more than one database in the qualified matched column names.
• Display unmatched only — This option displays only unmatched input columns under Matched Columns.
• Merge with selected project — By default, the project or projects in the import file are imported into a new project or projects. If there is a selected project in the Project Window, however, this option is enabled. Click it to import the analyses into the currently selected project. Even if the import file contains more than one project, all of the analyses contained in the projects in the import file are imported into the currently selected project in the Project Window if this option is selected.
• Import button — Clicking the Import button results in all the projects and analyses in the binary import file being imported into the current Teradata Warehouse Miner Project Window. These projects and/or analyses are not saved in the current metadata database until the Save or Save All button is explicitly selected in the usual manner.
• Cancel button — Clicking the Cancel button cancels the import operation without changing the current Project Window workspace.

Note: If, when importing, any database, table or column is mapped to another value and a Free-Form SQL analysis is imported, a warning is given that identifies the analysis by name and indicates that it should be checked for objects that may need to be manually mapped. This warning is also given for expert clauses in various types of analysis and for SQL text elements in Variable Creation analyses. If given, the warning identifies each analysis that contains freely entered SQL text.

Export Wizard

The Export Wizard allows the user to save a binary file containing Teradata Warehouse Miner projects and analyses metadata to a hard disk. This metadata can then be imported to a different metadata database on a different Teradata system. To invoke the Export Wizard, open the project(s) that you wish to export.
1. Click on File > Export...

Export Wizard

In this case, two projects were open under Teradata Warehouse Miner: Help Tutorials - Free Form SQL and Help Tutorials - Publishing. If the project window contains multiple projects when Export is selected, none of the projects or their contained attachments and analyses is initially selected.
2. To export an entire project, click on the box next to the project name. Any number of projects can be saved to the same binary file. Optionally, individual analyses can be selected for export. Note, however, that the parent project is always exported. Also, if an individual analysis is checked that contains input references to other analyses, the analyses referred to are automatically checked also. If there is only one project selected, that project is initially selected by default for export, along with everything it contains.
3. To export the selected items to a file without first saving the projects to metadata, check the Export Without Saving First option. When a connection to the database is lost, use of this option may make it possible to save work (so that it is not lost as well), making it available to retrieve later using the File > Import option. However, even under normal conditions it may save time to use this option.

Note: This option is not available when the Export Wizard dialog is requested from the Metadata Maintenance dialog, since exported projects do not need to be saved first in this case. 4. Click on Next to bring up the standard Windows Save dialog.

Export Wizard: Save

5. Type in the name of the binary file and click on Save to complete the export process.

View Menu

The following options are available from the View (Alt-V) menu:
• View — View the currently selected analysis.
• View Advertised Output — View a summary of advertised objects, with options to view the detailed properties of an object, to view the data in an object, to view the object's SQL definition and to perform maintenance operations. For more information, see Advertised Maintenance and Advertise Output. Sub-menu items are displayed for the following options:
  ∘ View All Advertised Output
  ∘ View ADS Output Only
  ∘ View Score Output Only
  ∘ View Profile Output Only
  ∘ View Other Output Only
• View Project Window On Right — Display the Project Area docked to the right of the main display. If it was previously hidden by closing the Project Area window, this restores it on the right. As an alternative, click the left-pointing directional button in the upper right portion of the main window to reopen the Project Window on the side it was previously displayed.
• View Project Window On Left — Display the Project Area docked to the left of the main display. If it was previously hidden by closing the Project Area window, this restores it on the left. As an alternative, click the left-pointing directional button in the upper right portion of the main window to reopen the Project Window on the side it was previously displayed.

• Execution Window — Turn on the Execution Status Area, if it had been previously closed. As an alternative, click the upward-pointing directional button in the upper right portion of the main window to re-open the Execution Window.

Project Menu

The following options are available from the Project (Alt-P) menu:
• Run F5 — Run the currently selected analysis within the project window. Alternatively, the F5 function key executes a selected analysis as well. Before an analysis is executed, each analysis that it may refer to as the source of its input is automatically executed first. Since a referenced analysis may refer to other analyses as well, executing an individual analysis may result in the execution of a series of analyses.
• Run F5 — Run the currently selected project within the project window. Alternatively, the F5 function key executes a selected project as well. The analyses in the project are executed in order, one by one, except that analyses referenced for input are always executed before the analysis that refers to them.
• Run To End F6 — Run the currently selected analysis within the project window followed by each of the subsequent analyses until the end of the project. Alternatively, the F6 function key executes a selected analysis and subsequent analyses as well.

Note: Execution order may be affected when an analysis refers to another analysis in the project for its input, in which case the referenced analysis is always executed before the analysis that refers to it.
• Run Stand-Alone F7 — Run the currently selected analysis within the project window without executing any analyses that it may reference for input. Alternatively, the F7 function key executes a selected analysis as well. This option may result in an error if an analysis referenced for input has not created the required input table or if a created volatile table is no longer available.

Note: This option may be ignored in the Teradata Profiler product. • Skip during Project Execution — This option can be used to skip the execution of an individual analysis when the project that contains it is executed, whether in whole or in part when the Run to End option is used. It does not, however, skip the execution of an analysis that is executed by itself or as the result of being referenced by another analysis that is being executed by itself. The option works by toggling a flag that is saved with the analysis and is honored both in subsequent interactive and batch executions. Skipping the execution of an analysis during project execution may be useful in many instances. For example, the option can be used to avoid rebuilding a matrix every time an algorithm that uses it is executed to build a model, or similarly to avoid rebuilding a model every time an analysis that scores it is executed. Similarly, it can be used to avoid executing a chain of analyses an extra time when they are being refreshed or published with the Refresh or Publish analysis, respectively. • Stop — Stop the currently executing analysis or project. • Add New Analysis… — Add a new analysis to the project. • Add Existing Analysis … — Add copies of one or more existing analyses to the currently selected project. For more information, see Adding Analyses to a Project. • Add Attachment … — Create an attachment folder within the specified project and copy any valid windows object into it, such as a document file.
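The ordering rules described for Run, Run To End, Run Stand-Alone and Skip during Project Execution can be illustrated with a small Python sketch. The `Analysis` class and the function names are hypothetical, and two points are assumptions the text does not settle: each analysis runs at most once per request, and a flagged analysis is skipped during project execution even when another analysis references it.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Analysis:
    name: str
    references: List[str] = field(default_factory=list)  # analyses referenced for input
    skip: bool = False  # the "Skip during Project Execution" flag


def _visit(name: str, by_name: Dict[str, Analysis], done: Set[str],
           order: List[str], honor_skip: bool) -> None:
    """Append `name` to `order` after everything it references for input."""
    if name in done:  # already handled in this run (also guards against cycles)
        return
    done.add(name)
    analysis = by_name[name]
    if honor_skip and analysis.skip:
        return
    for ref in analysis.references:
        _visit(ref, by_name, done, order, honor_skip)
    order.append(name)


def run(project: List[Analysis], name: str) -> List[str]:
    """Run (F5) on one analysis: referenced analyses first, skip flags not honored."""
    by_name = {a.name: a for a in project}
    order: List[str] = []
    _visit(name, by_name, set(), order, honor_skip=False)
    return order


def run_to_end(project: List[Analysis], start: Optional[str] = None) -> List[str]:
    """Run To End (F6) from `start`, or the whole project when `start` is None."""
    by_name = {a.name: a for a in project}
    names = [a.name for a in project]
    begin = names.index(start) if start is not None else 0
    order: List[str] = []
    done: Set[str] = set()
    for name in names[begin:]:
        _visit(name, by_name, done, order, honor_skip=True)
    return order


def run_stand_alone(name: str) -> List[str]:
    """Run Stand-Alone (F7): only the selected analysis, no referenced analyses."""
    return [name]
```

For a project listed as `score`, `model_input`, `ads`, where `score` references `model_input` and `model_input` references `ads`, both `run(project, "score")` and `run_to_end(project)` yield the order `ads`, `model_input`, `score`.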

• Delete — Delete the currently selected analysis or project within the project window. Before the analysis or project is deleted, a verification message is given that provides an opportunity to cancel the deletion. If the analysis to be deleted is referred to by another analysis for its input, the verification message indicates this fact.
• Log Analysis Info — Information about the currently selected analysis that might be useful for support purposes is written to the information log. The log can be viewed by selecting the Tools > View Logs > View Info Log… option.
• Extract SQL — This option extracts the generated SQL from the results of the currently selected analysis, or from all of a project's analyses if a project is currently selected, placing the SQL in a display so that it can be reviewed, edited, copied, written to a file or placed in a Free Form SQL test node. When the SQL from an entire project is extracted, headers are added as comments to identify the beginning of the SQL for each included analysis.

Note: The Extract SQL dialog can remain open and change display when a different project or analysis is selected in the project window. Further, the Write button can be used to create an SQL file for several projects or analyses successively without closing the dialog.

It is important to note that SQL is not extracted from the results of every analysis. There are several reasons why SQL might not be extracted from an analysis: 1. The analysis may not have been executed yet. 2. It may be a type of analysis that does not store SQL in its results, including Scatter Plot and Matrix analyses, analytic algorithms other than Association, and Scoring analyses that do not have the Score Only option selected. 3. When extracting SQL from an entire project, the SQL from an analysis that has been refreshed or published is not included in the extracted SQL. 4. When extracting SQL from an entire project, any analysis marked as Skip during Project Execution does not have SQL extracted. 5. When extracting SQL from an entire project, any SQL generated for a derived table, subquery or With query is not extracted because it is already included in its referencing analysis. 6. When extracting SQL from an entire project, any SQL generated by a Publish analysis is not included because it is not in an executable form (i.e., it includes special substitution tags). The following information is displayed by the Extract SQL dialog. ∘ Extracted SQL file — The full path of the file containing the extracted SQL is displayed here after the Write button has been selected and the file has been written. ∘ Number of statements extracted — The total number of SQL statements extracted. Note that some analyses generate more than one SQL statement. ∘ Number of characters extracted — The total number of characters in the extracted SQL, including the headers added to identify analyses when extracting SQL for an entire project. ∘ Analyses in Project — The total number of analyses in a project is displayed if and only if a project is currently selected. ∘ Analyses displaying SQL — The number of analyses which contained results SQL is displayed if and only if a project is currently selected. 
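As a rough illustration of how project-level extraction and the displayed counts fit together, the following Python sketch assembles SQL from a list of analyses. The `AnalysisResult` class, its flags and the comment header format are all hypothetical; the product's actual header text is not shown in this guide. Only three of the exclusion rules listed above are modeled.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class AnalysisResult:
    name: str
    statements: List[str] = field(default_factory=list)  # empty if not executed or no SQL stored
    skip: bool = False                    # marked Skip during Project Execution
    refreshed_or_published: bool = False  # has been refreshed or published


def extract_project_sql(analyses: List[AnalysisResult]) -> Tuple[str, int, int, int]:
    """Return (sql_text, statements_extracted, characters_extracted, analyses_displaying_sql)."""
    parts: List[str] = []
    n_statements = 0
    n_displaying = 0
    for a in analyses:
        if not a.statements or a.skip or a.refreshed_or_published:
            continue  # nothing is extracted for this analysis
        # Hypothetical comment header marking where this analysis's SQL begins.
        parts.append(f"/* ===== Analysis: {a.name} ===== */")
        # Normalize each statement to end with exactly one semicolon.
        parts.extend(s.rstrip().rstrip(";") + ";" for s in a.statements)
        n_statements += len(a.statements)
        n_displaying += 1
    text = "\n".join(parts)
    # The character count includes the added headers, as described above.
    return text, n_statements, len(text), n_displaying
```

The total number of analyses in the project is simply `len(analyses)`, so it can differ from the number of analyses displaying SQL.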
The following options are provided:
  ▪ Display default procedure CALL in place of procedure SQL — When this box is checked, an SQL CALL statement is displayed in place of the SQL to create a stored procedure. As closely as possible, the values of the parameters in the CALL statement are set equal to the values that would be in effect if the underlying analysis were executed without creating a procedure. This option is given for an analysis only if the analysis creates a stored procedure, and for a project only if the project contains at least one such analysis. Anchor table parameters may appear in a default procedure call as artificial “tags” (beginning with '_twm') if there is no replaceable, non-volatile anchor table.
  ▪ Edit SQL — When this check box is checked, the displayed SQL can be edited prior to writing, copying or testing. Using this option "locks" the dialog display so that it will not change if another project or analysis is selected.
  ▪ Word Wrap — When this check box is checked, SQL statements that do not fit in the window wrap to the next line.
  ▪ Up and Down Arrows — The up and down arrows may be used to position the display to the previous or next analysis when the SQL for an entire project is displayed.
  ▪ Test Node — Selecting the Test Node button creates a Free Form SQL analysis containing the extracted SQL and marks the analysis as Skip during Project Execution. This can be used to test the displayed SQL.
  ▪ Write — Selecting the Write button leads to a dialog to locate, name and write to a file the SQL displayed on the form. This will include any programmatically added analysis headers or user edited changes.
  ▪ Right-Click Menu Options — The standard right-click menu options are provided in the SQL display area, including Undo, Redo, Cut, Copy, Paste, Delete, Find, Replace and Select All. Not all options may be enabled, depending on whether the Edit SQL box is checked and whether there is displayed text and a history of actions to be undone or redone.
• Create Results Files — This option enables the creation of tab-delimited files from the result sets of a selected analysis or project in the Project Area. A dialog is displayed to enable the selection of a destination folder for the files, along with options to write, view and export them to Excel.

Note: The Results menu offers an option to export results to Excel, so this option is simply a convenience to make it easier to keep track of created files and to create them for many analyses at once.

Most types of analysis produce results from the SQL they generate and execute. However, here are some reasons why a results file may not be created for a particular analysis.
  ∘ The analysis may not have been executed yet.
  ∘ It may be a type of analysis that does not produce results data, particularly Scatter Plot and most Matrix analyses, and analytic algorithms other than Association.
  ∘ There is no results data from any executed analysis marked as “Generate SQL Only”.
  ∘ There is no results data from an analysis that has been published.
  ∘ There is no results data from a Refresh or Publish analysis.
  ∘ There is no results data from an analysis that is referenced as a volatile table.
  ∘ There is no results data when creating a stored procedure.

Here are some other characteristics of this feature.
  ∘ If an analysis creates multiple result sets, they are all written to the same file with blank lines in between.
  ∘ Result sets with no rows are not written to a file.
  ∘ Result set columns are separated by an added tab character, and character data is enclosed in double-quotes.
  ∘ Each result set file name is taken from the name of the analysis that produced the result set, with illegal filename characters replaced with space characters.
  ∘ A warning is given if processing a project that contains analyses with duplicate names.
  ∘ A warning is given if any of the file names about to be created already exist in the destination folder.
  ∘ An option is provided to export all of the files or those that are selected to Excel. Each file becomes a separate window in the work area.
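The file layout rules above can be sketched in Python as follows. This is an approximation for illustration only: the set of characters treated as illegal, the `.txt` extension, the absence of a column-header row and the handling of embedded quotes are assumptions, since the guide does not specify them.

```python
import re
from pathlib import Path
from typing import List, Optional, Sequence

# Assumption: the characters Windows does not allow in file names.
ILLEGAL_CHARS = re.compile(r'[\\/:*?"<>|]')

Row = Sequence[object]


def results_file_name(analysis_name: str) -> str:
    """Derive a file name from the analysis name (the extension is an assumption)."""
    return ILLEGAL_CHARS.sub(" ", analysis_name) + ".txt"


def format_value(value: object) -> str:
    """Enclose character data in double quotes; leave other values bare."""
    if isinstance(value, str):
        return f'"{value}"'
    return "" if value is None else str(value)


def format_result_sets(result_sets: Sequence[Sequence[Row]]) -> str:
    """Tab-separate columns and put a blank line between result sets."""
    blocks: List[str] = []
    for rows in result_sets:
        if not rows:  # result sets with no rows are not written
            continue
        blocks.append("\n".join("\t".join(format_value(v) for v in row) for row in rows))
    return "\n\n".join(blocks)


def write_results_file(folder: str, analysis_name: str,
                       result_sets: Sequence[Sequence[Row]]) -> Optional[Path]:
    """Write one file per analysis; return None when there is nothing to write."""
    text = format_result_sets(result_sets)
    if not text:
        return None
    path = Path(folder) / results_file_name(analysis_name)
    path.write_text(text + "\n", encoding="utf-8")
    return path
```

Because the file name comes only from the analysis name, two analyses with the same name would target the same file, which is why the feature warns about duplicate names.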
• Create SQL Node to Drop Tables — This option creates in the current project a Free Form SQL analysis containing statements to drop every table and view created in the currently selected project or analysis. • Create SQL Node to Collect Statistics — This option creates in the current project a Free Form SQL analysis containing statements to collect statistics on the primary index of every table created in the currently selected project or analysis.
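The kind of SQL text these two options place in a Free Form SQL analysis might look like the output of the following Python sketch. The function names and input shapes are hypothetical, and the exact statements the product generates are not shown in this guide; the sketch assumes the `DROP TABLE`/`DROP VIEW` and `COLLECT STATISTICS ON ... INDEX (...)` statement forms, and writes a deliberately incomplete statement with a trailing comment when the primary index columns are unknown.

```python
from typing import Dict, List, Sequence, Tuple


def drop_statements(objects: Sequence[Tuple[str, str]]) -> List[str]:
    """objects holds (kind, qualified_name) pairs, where kind is 'TABLE' or 'VIEW'."""
    return [f"DROP {kind} {name};" for kind, name in objects]


def collect_statistics_statements(
    tables: Sequence[str],
    primary_index: Dict[str, List[str]],  # qualified table name -> primary index columns
) -> List[str]:
    """One statement per table; incomplete, with a comment, if the index is unknown."""
    statements = []
    for table in tables:
        columns = primary_index.get(table)
        if columns:
            statements.append(
                f"COLLECT STATISTICS ON {table} INDEX ({', '.join(columns)});"
            )
        else:
            # The table does not exist yet, so its primary index cannot be looked up.
            statements.append(
                f"COLLECT STATISTICS ON {table} INDEX ( );"
                "  -- incomplete: primary index columns unknown"
            )
    return statements
```

The statements would need to be reviewed before execution, in particular any that are flagged as incomplete.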

Note: If an output table does not exist, the primary index columns will not be known and the SQL to collect statistics on that table will be incomplete, which will be noted in a comment after the command. • Properties — The Properties dialog may be displayed for the currently selected project or analysis by selecting this option, as described below. When a project is selected in the project window, the Properties option may be selected from the Project menu or by right-clicking on the project to display the Project Properties dialog. When an analysis is selected, the Analysis Properties dialog may be similarly displayed, with the differences noted in the descriptions that follow. Note that the same Properties display can be requested from the Metadata Maintenance screen, available from the Tools menu.

Note: The Properties dialog can remain open and change display when a different project or analysis is selected in the project window. Further, the Apply button can be used to enter a Description for several projects or analyses successively without closing the dialog.

The Analysis Properties dialog displays a Name and Type field instead of Project Name. It also displays a third tab, References, that provides an outline of analysis references and dependencies.
• Description — The description of the project or analysis may be entered here.
• Usage Info tab — The Usage Info tab displays information about the creation, modification and execution of the project or analysis.
• Parameters tab — The Parameters tab, when selected for an analysis, contains the same information that would be logged to the Info Log file if the Log Analysis Info option were selected from the Project menu (or by right-clicking on an analysis in the Project window and selecting the same option). When selected for a project, summary information is displayed.
• References tab — The References tab outlines analysis references and dependencies and is available only when displaying the Analysis Properties dialog. Here is a summary of relationships outlined on this tab:
  ∘ References for input — This indicates that the listed analyses are referenced for input using the analysis references feature or a facsimile of it (as is the case with the Free Form SQL analysis). Note that the display indicates not only referenced analyses, but also any analyses that these analyses may reference, any that those may reference in turn, and so forth. Indentation is used in the display for this purpose.
  ∘ Is referenced by — Indicates that the output of this analysis is referenced for input by the listed analysis.
In the case where Refresh or Publish is the referencing analysis, it is more accurate to say that the generated SQL is referenced rather than the output. Only one level of references is displayed for this relationship.
  ∘ Depends on analysis — Indicates that this analysis depends on the indicated analysis in some way other than using the output table or view as input. This case includes an algorithm dependent on a matrix, or a scoring analysis dependent on an algorithm.
  ∘ Is depended on by — Indicates that this analysis is depended on by the indicated analysis in some way other than an output table or view being used as input. This case includes an algorithm dependent on a matrix, or a Scoring analysis dependent on an algorithm.
• Display names of output tables, views and procedures… — When this option is checked, the dialog expands to list these objects. Note that each type of object is distinguished by a different icon. Note also that in the case of Project Properties, this extended display lists the potential output tables, views and procedures created by all of the analyses contained in the project.
• Show Object — This button attempts to display the definition of the selected table, view or procedure, provided that only one object is selected.

Note: This button is only available when connected to a Teradata system. • Drop Objects — This button attempts to delete the selected tables, views and/or procedures from the database containing them.

Note: On a Teradata system, the names are displayed in the format Database.Table, Database.View or Database.Procedure. On an Aster system, the names are displayed in the format Schema.Table, Schema.View or Schema.Procedure.

The box displaying the object names changes to indicate a status and possible error message. The status will indicate whether or not the selected table or view was DROPPED or NOT FOUND (it may never have been created or it may have subsequently been dropped), or whether an ERROR occurred, in which case an error message is also displayed. Note that the columns in this display may be resized by dragging the vertical bars in the column header with the mouse.
• Collect Statistics — This button attempts to collect statistics on the primary index of the selected tables. Note that statistics cannot be collected on a view or procedure. The box displaying the object names changes to indicate a status and possible error message. The status will indicate whether or not the selected table statistics were COLLECTED, the table was NOT FOUND, the item was SKIPPED (due to its not being a table or due to a failure to determine the primary index), or whether an ERROR occurred, in which case the error message is also displayed. Note that the columns in this display may be resized by dragging the vertical bars in the column header with the mouse.
• OK/Cancel/Apply — The OK button accepts any changes that may have been made to an editable field and closes the dialog. The Cancel button discards any changes that may have been made and likewise closes the dialog. The Apply button is only enabled if changes have been made and simply saves the changes and keeps the dialog open.

Note: When this dialog is displayed from within Metadata Maintenance, the Cancel and Apply buttons do not appear, because no fields may be edited in this case.

Tools Menu
The following options are available from the Tools (Alt-T) menu.
• Change Connection… — Bring up the ODBC connection dialog to specify which system this session of Teradata Warehouse Miner will connect to. This is described in detail below.
• Connection Properties — Bring up the Connection Properties dialog to display and/or change connection, project and analysis properties. This is described in detail below.
• Metadata Maintenance… — Bring up the Metadata Maintenance dialog to maintain the projects and analyses stored in the current metadata database. This is described in detail below.
• Publish Maintenance… — If not using the Teradata Profiler product, bring up the Published Models dialog to maintain the analytic data sets and models stored in the current Publish Tables that pass information to the Teradata Model Manager application. This is described in detail below. Note, however, that the Teradata Profiler product does not include publishing related functions.
• Advertise Maintenance… — Bring up the Advertise Maintenance dialog to maintain the output advertisements in the Advertise Output metadata tables in the current advertise database. This is described in detail below.
• Metadata Creation — The metadata tables can be created either With Fallback or Without Fallback. See Installing Metadata for more information.
• Publish Tables Creation — If not using the Teradata Profiler product, the metadata tables that support the publishing of models for the Model Manager application can be created either With Fallback or Without Fallback using this option. See Installing Publish Metadata for more information. Note, however, that the Teradata Profiler product does not include publishing related functions.
• Advertise Tables Creation — The Advertise Output metadata tables can be created either With Fallback or Without Fallback. When either of these options is selected, the metadata tables, views and macros for advertising output are created in the Advertise Database, as specified on the Connection Properties dialog (replacing any previous tables, views and macros with the same names that might already exist there). An additional option to Replace Views and Macros Only is offered to replace just the views and macros without affecting the underlying metadata tables.
• ODBC Administrator… — Bring up the ODBC Administrator to add, delete or modify ODBC data sources.
• View Logs — Sub-menu items to View Error Log, View Execution Log or View Info Log are offered. The Error Log, Execution Log and Info Log can be requested to help provide support for the product if the information is not considered sensitive.
∘ The Error Log contains information about errors that have occurred during execution, including SQL errors. Information logged to the Error Log typically includes the SQL that failed, if appropriate, and information about the parameters in effect for the analysis that failed, if appropriate (i.e., the same information logged by selecting Project > Log Analysis Info or by right-clicking on an analysis in the Project window and selecting the same option).
∘ The Execution Log includes all messages generated by the Teradata Warehouse Miner analyses that are displayed in the Execution Status area. Each entry includes the name of the Analysis, the Status and the Message.
∘ The Info Log includes information about the last three times the application has been executed, including the versions of various supporting software components and the options used in ODBC connections.
• Delete Logs — Sub-menu items to Delete Error Log, Delete Execution Log or Delete Info Log are offered. The Error Log accumulates error messages until deleted, as does the Execution Log. The Info Log keeps information about only the last three executions of the product, so is limited in its growth.
• Preferences — Bring up the Preferences screens as described in detail below.
∘ Change Connection...
∘ Connection Properties
∘ Metadata Maintenance
∘ Publish Maintenance
∘ Model Properties
∘ Advertised Maintenance
∘ Advertised Object Properties
∘ Preferences

Change Connection... The Change Connection dialog lets you specify which system this session of Teradata Warehouse Miner will connect to. Selecting this option brings up the Teradata Warehouse Miner custom Select Data Source dialog.

Select Data Source

• System Data Source — A list of all the System data source names, as defined by the ODBC Administrator.
• User Data Source — A list of all the User data source names, as defined by the ODBC Administrator.
• File Data Source — A list of all the File data source names, as defined by the ODBC Administrator.
• Simple Data Source — Specify either a Host name or IP address and optionally a Character Set to use (selected from the list supplied or entered manually). You will automatically be prompted for a valid userid and password.

Connection Properties
The Connection Properties dialog allows you to display and/or change connection properties. The dialog contains the following tabs:
• General
• Databases/Schemas
• Servers
These tabs are discussed in the following sections.

General This tab provides information about the current Teradata ODBC connection, including the data source name, user name associated with the connection, default database of the current connection and the version of Teradata on the connected system.

Connection Properties: General tab

A text box is provided for entering any desired pre-execution commands. This is provided primarily to allow specification of the calendar to use for a session (for example, set session calendar=ISO;). The pre-execution commands, separated by semicolons if more than one is specified, are executed once before a project is executed, once before a chain of referenced analyses is executed, or once before an individual analysis is executed. See Calendar Support (Teradata Database) for more details about this feature, especially regarding its effect on the stored procedure output option, executing algorithms and scoring, and publishing to the Model Manager application.
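As an illustration of the semicolon-separated format described above, the following minimal Python sketch splits a pre-execution command string into individual commands. The function name split_pre_execution_commands is hypothetical and is not part of Teradata Warehouse Miner; the sketch also assumes that no command contains a semicolon inside a quoted string, which the product itself may handle differently.

```python
def split_pre_execution_commands(text: str) -> list[str]:
    """Split a semicolon-separated command string into individual commands.

    Assumption (not from the product): semicolons never appear inside
    quoted literals, so a plain split is sufficient for this sketch.
    """
    # Drop empty fragments such as the one after a trailing semicolon.
    return [part.strip() for part in text.split(";") if part.strip()]


# The example command from the text above.
commands = split_pre_execution_commands("set session calendar=ISO;")
for command in commands:
    print(command)
```

Each resulting command would then be submitted once before the project, chain of referenced analyses, or individual analysis is executed, as described above.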

Aster Database When connected to an Aster Database, the Analytics version installed on that system can be specified in order to use the correct argument names for SQL-MR functions.

Connection Properties: General tab (Aster)

• Analytics Version — Select the desired Analytics version from the drop-down menu. By default, AA 6.20 or earlier is selected since the argument names for AA 6.20 or earlier are valid for AA 6.21 or later. However, if you have AA 6.21 or later installed, then you can set that as your Analytics version here.

Databases/Schemas This tab is where Source Databases are defined and Result/Metadata Databases changed. Once changes are made, they are saved for that particular data source for future connections.

Note: This tab is labeled Databases when connected to Teradata, and Schemas when connected to Aster, based on each system’s nomenclature for data sources. Consequently, references to data sources in this menu refer to Databases when connected to Teradata and Schemas when connected to Aster.

Connection Properties: Databases tab

The Databases/Schemas tab contains the following fields:
• Source Databases/Schemas — By default, the Source Database/Schema is the Default Database/Schema defined in the Teradata ODBC data source. This is the data source where the tables to be analyzed exist. Other data sources can be added by clicking Add. Similarly, data sources can be removed by clicking Remove. To sort the data sources alphabetically, click Sort; otherwise, they will be displayed both here and in the pull-down of the database/schema selector on the Analysis Input screen in the order they are added.
• Result Database/Schema — The name of the data source specified as the Result Database/Schema. By default, this is equivalent to the Userid defined in the ODBC data source, but it can be modified here. This is the data source where Teradata Warehouse Miner will build result tables/views.
• Save to Local Files — This feature redefines the way projects are saved from and loaded into the Project Explorer work area. When this option is checked, requests to save or load projects to/from metadata are satisfied by exporting/importing projects into a folder on the client machine. This allows a user with no permanent space allocated in the data source to save and retrieve projects locally (typically at a considerably faster rate). However, there are the following restrictions to this feature:
∘ This feature is not available in batch processing (i.e., a Metadata Database is required).
∘ The Add Existing Analysis function is not available with this feature.
∘ Metadata Maintenance is not available with this feature.
∘ The standard Archive function is not available with this feature (but Auto-Archive is).
See TWM Data Mining Projects for more information.
• Metadata Database/Schema — The name of the data source specified as the Metadata Database/Schema. By default, this is equivalent to the Userid defined in the ODBC data source, but it can be modified here. This is the data source where the Teradata Warehouse Miner metadata has been built.
• Statistical Test Database/Schema — The name of the data source specified as the Statistical Test Database/Schema. By default, this is equivalent to the Userid defined in the ODBC data source, but it can be modified here. This should be set to the data source where the Teradata Warehouse Miner Statistical Test Metadata has been built using the Load Statistical Test Metadata program item.
• Publish Database/Schema — The name of the data source specified as the Publish Database/Schema. By default, this is equivalent to the Userid defined in the ODBC data source, but it can be modified here. This is the data source where the Teradata Warehouse Miner metadata supporting the Publish analysis will be built on request (see the Tools > Publish Tables Creation menu option). It is also the TWM publish DB referred to in the Teradata Warehouse Miner Model Manager User Guide, B035-2303, in the section Installing Teradata Model Manager under the step to Create Metadata Database. This step grants all privileges on this data source to the Model Manager user.
• Advertise Database/Schema — The name of the data source specified as the Advertise Database/Schema. By default, this is equivalent to the Userid defined in the ODBC data source, but it can be modified here. This is the data source where the Teradata Warehouse Miner Advertise Output Metadata tables, views and macros are created (see the Tools > Advertise Tables Creation menu option).
• Always Advertise — When checked, the Advertise Output option is assumed to be on for all projects and analyses executed. When this is the case, the corresponding check box on individual output panels is disabled and assumed to be checked. For more information, see Advertise Output.
Click OK to save changes and exit, or Cancel to exit without saving any changes.

Servers
This tab is where AppCenter servers may be defined. They may be added, removed or edited by selecting the appropriate buttons on this dialog:
• Edit — When a server name is displayed and selected in the list in the upper part of this tab, clicking the Edit button leads to the Edit Server Dialog where the properties of the selected server may be changed.
• Add — Clicking this button leads to the Add Server Dialog where a new server may be defined.
• Remove — Clicking this button removes the selected server after a warning message is given.

Connection Properties: Servers tab

Add/Edit Server Dialog If either the Add or Edit buttons are clicked with a particular server name highlighted, the following screen is displayed:

Connection Properties: Servers tab > Edit Server Dialog

• Server Name — The state of the Server Name field depends on which version of this dialog has been invoked:
∘ Add Server Dialog — The Server Name field must be entered by the user.
∘ Edit Server Dialog — The selected Server Name is displayed in read-only fashion and cannot be changed.
• Description — The Description field is optional.
• Host ID — The host ID is the URL (without the "https://" or "/login") or IP address of the Teradata AppCenter server. For example, "appcenter-prod.ac.uda.io".
• User ID — The user ID for logging into the Teradata AppCenter web application.
• Password — The Password field is optional.
∘ If a password is supplied, it is stored in an encrypted form in the Windows Registry and is used along with the Host ID and User ID to log onto an AppCenter server.
∘ If a password is not supplied, the user is prompted for the password when publishing to an AppCenter server.
Click OK to save changes and exit, or Cancel to exit without saving any changes.
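The Host ID format described above (a URL with the leading "https://" and the trailing "/login" removed) can be illustrated with a short Python sketch. The function normalize_host_id is hypothetical and only shows how a login URL copied from a browser might be reduced to the expected value; it is not part of the product, and the dialog itself may or may not perform any such normalization.

```python
def normalize_host_id(value: str) -> str:
    """Reduce an AppCenter login URL to the bare Host ID form.

    Hypothetical helper for illustration: strips a leading "https://"
    and a trailing "/login", the two parts the text says to omit.
    """
    host = value.strip()
    prefix = "https://"
    if host.lower().startswith(prefix):
        host = host[len(prefix):]
    host = host.rstrip("/")
    suffix = "/login"
    if host.lower().endswith(suffix):
        host = host[: -len(suffix)]
    return host.rstrip("/")


# Using the example host from the text above.
print(normalize_host_id("https://appcenter-prod.ac.uda.io/login"))
```

A value that is already in the expected form, such as an IP address or bare host name, passes through unchanged.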

Metadata Maintenance This option brings up the Metadata Maintenance dialog to maintain the projects and analyses stored in the current Metadata Database, as shown below.

Metadata Maintenance

• Title — The title indicates the database containing the metadata tables with the metadata to be maintained.
• Projects — When the Projects folder is highlighted in the view on the left, it causes a summary of projects contained in the metadata to be displayed in the view on the right, with the total number of projects and/or folders indicated in the header of this view. By highlighting one or more projects in the view on the right and selecting either the Properties, Add, Edit, or Archive button, the appropriate dialog, described below, is displayed with the selected projects shown in the initial view. Note that the Properties button is enabled only if a single project or folder is selected in the view on the right.
• Analyses — When a project is highlighted underneath the Projects folder in the view on the left, the analyses contained in that project are summarized in the view on the right, with the total number of analyses in this project indicated in the header of this view, as shown below. The second column in this display indicates the sequence number of the analysis within the project. Note that a special category labeled (Analyses without a Project) may be displayed underneath the Analyses folder in the view on the left if there exist analyses which are not part of any project in the metadata. This should typically only be the case if there were projects deleted in a version of Teradata Warehouse Miner earlier than version 5.0 and these projects contained analyses. When highlighted, the analyses without a project will be summarized on the right.

Metadata Maintenance: Analyses In A Project

• Properties — When a single project or analysis is selected in the view on the left or right, the Properties button may be selected to display the Project Properties or Analysis Properties dialog, as described in Project Menu. Note, however, that when displayed inside Metadata Maintenance, the Description field may not be used to enter a description for the project or analysis, but only to view the current description. Also, within the Metadata Maintenance dialog, the display of a project or analysis description is limited to 512 characters, even though the length is not similarly limited when entered or viewed in the Properties dialog when accessed from the Project window. If the Parameters tab, the References tab, or the Display names of output tables… option is selected on the Properties display, the selected project or the project containing the selected analysis is loaded behind the scenes from metadata in order to determine the requested information. If this occurs, then the shading of referenced analyses is supported for that project if the analyses are displayed in the view on the right, but not otherwise.
• Add — When a single project is selected in the view on the left, or when one or more projects or analyses are selected on the right, the Add button may be used to add the projects or analysis copies to the project workspace from metadata. In particular, when one or more analyses are selected, the Add button leads to a menu of choices to add copies of the analyses either to the current project or a new project, and with or without database object mapping (after the manner of the Import Wizard).
• Edit — The Edit button may be used to request the following options:
∘ New Folder — One or more new folders may be added to the main projects folder on the left side, or to any folders previously added to the projects folder. When highlighted on the left, the folders and/or projects contained in a given folder are displayed on the right. When a folder on the right side is double-clicked, it expands to show the folders and/or projects it contains.
∘ Rename — The Rename option is available only to rename folders underneath the main projects folder on the left side display. Such folders may also be renamed by clicking on the name to the right of the folder.
∘ Delete — See Edit-Delete below.
∘ Expand All — Expands all of the analyses and projects in the left pane of the Metadata Maintenance dialog.
∘ Collapse All — Collapses all of the analyses and projects in the left pane of the Metadata Maintenance dialog.
• Edit-Delete — The delete option is available to delete one or more projects or analyses displayed on the right side, or a single project or project folder selected on the left side. When a project folder is deleted, all of the projects or folders it contains are also deleted. Selecting Edit > Delete leads to the dialogs described below.
∘ Deleting Projects — In the example below, options are checked to delete both the displayed projects and their analyses and any tables or views created by the analyses.
Metadata Deletion

These options may also be used to delete projects and analyses without tables or views, or to delete created tables and views without deleting projects and analyses. When Next is selected, the following screen is displayed.

Metadata Deletion: Confirm Project Deletion

If one or more items is selected in either the upper or lower view of this display (or both), the appropriate Preserve button or buttons will be enabled and may be used to remove items from the lists of items to delete, thus “preserving” them. If the Delete button is selected, the listed projects, tables and/or views are deleted, if possible. Projects are deleted from metadata, along with all of the analyses they contain. Tables and/or views are deleted from the database containing them (note that the names are displayed in the format database.table or database.view). If the Auto-Archive deleted items option is checked, the items are archived to the standard archive used by the Auto-Archive feature prior to deletion. The default value of the check box is determined by the user preference option for automatically archiving deleted items. For more information about the archive/restore feature and automatic archiving, see the Archive bullet below. After selecting Delete, each view changes to indicate a status and possible error message. The status for the lower view will indicate whether or not the table or view was DROPPED, NOT FOUND (the tables or views may never have been created or they may have subsequently been dropped), or whether an ERROR occurred, in which case the error message is also displayed. Note that the columns in these displays may be resized by dragging the vertical bars in the column header with the mouse. To exit this final display, click Finish. ∘ Deleting Analyses — When one or more analyses is selected on the Metadata Maintenance dialog and the Delete… button is selected, the following dialog appears.

Metadata Deletion: Analysis Deletion Options

If Next is clicked, a Confirm Analysis Deletion dialog, similar to the Confirm Project Deletion dialog shown previously, appears. Note that in the Analysis Deletion Options dialog above, there is an indented check box giving the option to delete any analyses that reference these analyses. This includes analyses that reference these analyses for input or that reference a matrix or model created by these analyses. If this option is checked, and such referencing analyses exist in the containing project, the referencing analyses will be included on the Confirm Analysis Deletion dialog. This option is available, however, only if the analyses to delete are all contained in the same project.
• Archive — The Archive/Restore features utilize the Export and Import functions, respectively, along with an Archive Record File and logic to control the allocation and naming of multiple export files. These features can be useful in moving metadata to another system or database, recovering from the loss of metadata tables or recovering from the loss of an individual project, analysis or attachment. An automatic archive feature is also available to request the archiving of a project, analysis or attachment whenever one is deleted, or by request when a project is saved. A preference option controls archiving upon deletion, and archiving when saving is achieved with the Save and Archive menu option available on the File menu and on the Project Area menu. When archiving projects, the user names and chooses a location for the archive record file, and selects the number of projects to be written to each export file (from 1 to 25). If one project is written to each export file, an option is provided to use the project names to name the export files. Otherwise, the archive record file name is used with the addition of an incrementing integer value. When restoring projects, the user may select the individual projects to be restored using the Import function. Options are provided so that projects may be restored with or without using the Import function’s database mapping screen. The default is to use database mapping. The Archive button leads to a menu with the following options:

• Export to a single file — This option leads to the standard Export function as described in Export Wizard along with other File menu options. Note that when a project is selected in the view on the left, a single project is exported. When one or more projects are selected in the view on the right, one or more projects are exported. When a single analysis is selected in the view on the right, the project containing it is exported with only that analysis selected on the Export form, along with any analyses it refers to or is dependent on.
• Archive to one or more files — Options are provided to set the number of projects per export file and to use project names for export file names (available only when the number of projects per export file is set to 1). Clicking Archive leads to a dialog to locate and name the archive record file, after which warning messages may be given to indicate that the file already exists and/or that export files about to be created already exist. The exports are then performed, displaying the status of each in the File column of the display. In particular, messages are given indicating that a particular project is loading, has loaded or is now contained in the indicated file.

Note: Projects must first be loaded into memory, not the project workspace, before the exports can be performed.

After the operation is complete, click View Record to open the actual archive record file for inspection, but the contents should not be altered. Viewing the record file should not be necessary but might be useful to more easily view error messages if any should be present.
• Restore projects from archive — Selection of this option leads to a dialog to locate the desired archive record file. After a file is selected, the Restore Projects form appears, showing the archived projects and the files that contain them. The projects to be restored may then be selected manually or by clicking Select All near the bottom of the form. Clicking Restore brings up the Import Wizard, provided that the option to Import projects with database mapping is selected. If the option to Save projects directly to metadata is selected, however, a warning message is given to use caution with this option. Note that if a project is selected to be restored, all projects with the same name in the archive file that contains it will also be restored. Other buttons are provided, including an Open button for accessing another archive record file and a Delete button for deleting export files listed in the archive record file. Note that if the Delete button is selected, all of the projects for a selected file are selected to indicate that part of a file may not be deleted. Also, if all the files in an archive are selected for deletion, the archive record file is also deleted.
• Restore auto-archived projects — The form that appears when this option is selected is almost the same as that displayed when the Restore projects from archive option is selected. One difference is that the automatic-archive record file AutoArchive.txt is automatically located in the application file storage area and opened (if it exists), and thus there is no Open button. Also, additional columns are displayed, notably Item and Archive Date. The Item column indicates an analysis or attachment name if an individual item was archived automatically, or otherwise it is blank if an entire project was archived.

Note: The standard archive always works at the project level; it is only the automatic archive that will archive an individual analysis or attachment, as it may do when an analysis or attachment is deleted.

• Delete all auto-archive files — This option removes the automatic-archive record file and all of the archive files it lists. In order to delete a subset of automatic-archive export files, click Delete on the Restore auto-archived projects form instead.
• Set auto-archive options — This option is provided for convenience as an alternate way to access the Archive tab on the Preferences dialog, normally accessed from the Tools menu. The Archive tab contains an option to Automatically archive deleted projects, analyses and attachments.
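The allocation of projects to export files described under Archive above (1 to 25 projects per export file, with project names usable as file names only when one project is written per file) can be sketched in Python as follows. This is an illustration only: the function plan_export_files is hypothetical, and the exact file name pattern, starting integer and file extension used by the product are assumptions, since the text states only that the archive record file name is used with an incrementing integer.

```python
def plan_export_files(projects, record_name, per_file=1, use_project_names=False):
    """Group projects into export files and propose a name for each file.

    Returns a list of (file_name, [project, ...]) pairs. The naming scheme
    (record name followed by 1, 2, 3, ...) is an assumed pattern.
    """
    if not 1 <= per_file <= 25:
        raise ValueError("projects per export file must be between 1 and 25")
    if use_project_names and per_file != 1:
        raise ValueError("project names can name export files only with 1 project per file")

    plan = []
    for start in range(0, len(projects), per_file):
        chunk = projects[start:start + per_file]
        if use_project_names:
            file_name = chunk[0]  # exactly one project in this chunk
        else:
            file_name = f"{record_name}{start // per_file + 1}"
        plan.append((file_name, chunk))
    return plan


# Hypothetical project names, two projects per export file.
for name, chunk in plan_export_files(["Churn", "Fraud", "Upsell"], "archive", per_file=2):
    print(name, chunk)
```

A list of pairs is returned rather than a dictionary so that two projects with the same name do not overwrite each other in the plan.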

Publish Maintenance
This option brings up the Published Models dialog to maintain the analytic data sets and scored models stored in the current Publish Tables that pass information to the Teradata Model Manager application. When a Publish analysis publishes a scoring or analytic data set analysis, a model is created that Teradata Model Manager can use to build a score table and/or analytic data set. If a published analysis is a scoring analysis that references an analytic data set analysis, both an analytic data set and a score table are published as part of the model.
The Published Models dialog displays the following fields for each published model.
• ID — An internal identifying number for the published model.
• Name — The name of the model, assigned by the user when publishing the model. The Name may be up to 128 characters in length.
• Description — An optional description of the model, assigned by the user when publishing. The Description may be up to 512 characters in length.
• Version — A version number or identifier, assigned by the user when publishing. The Version may be up to 10 characters in length.
• Published Date — The date that the model was published.
• Expiration Date — The date that the model will expire and not be visible in the Teradata Model Manager application. When a model expires it remains in the metadata until deleted. The status of an expired model is displayed as Expired and cannot be modified on the Model Properties dialog.
• Status — The status may be Active, Inactive or Expired. Expired is displayed only if the expiration date equals or precedes the current date. If the status is Expired, the status cannot be changed on the Model Properties dialog unless the Expiration Date is changed first. Otherwise, the status may be changed from Active to Inactive or vice versa. Any model that is not Active will not be visible in the Teradata Model Manager application.
• ADS — The number of instances of an analytic data set that has been saved for this model by the Teradata Model Manager user. Each instance is visible on the Model Properties dialog by using the selector arrows. Note that the value will be zero if the model represents an analytic model for scoring without an analytic data set.
• Score — The number of instances of an analytic model for scoring that have been saved for this model by the Teradata Model Manager user. Each instance is visible on the Model Properties dialog by using the selector arrows. Note that the value will be zero if the model represents an analytic data set without an analytic model for scoring.
The Published Models dialog provides the following options.
• Properties — This button leads to the Model Properties dialog described below in detail.
• Edit
∘ Delete from Metadata — Selecting this option leads to the Metadata Deletion Wizard described previously in Metadata Maintenance. The first screen of the wizard is titled Delete Models dialog and the options are to Delete the models above from metadata and Delete any ADS or Score tables created by these models. The second screen is titled Confirm Model Deletion and it does not contain the option to Auto-Archive deleted items. Otherwise, the operation is similar and provides in this case the ability to remove models from the publish metadata (as opposed to deleting projects or analyses from the project metadata) as well as the ability to delete the tables created by these ADS and/or Score models in the Teradata Model Manager application (as opposed to tables created by executing the projects themselves).
∘ Mark as Inactive — Selecting this option marks all of the selected models as inactive. Note that this option is not available if any of the selected models have expired.
∘ Mark as Active — Selecting this option marks all of the selected models as active. Note that this option is not available if any of the selected models have expired.
∘ Set Expiration Date — Selecting this option leads to a form to select an expiration date and apply it to all of the selected models. By default, an expiration date 3 months beyond the current date is selected.
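The status and expiration rules described above can be summarized in a small Python sketch: a model is shown as Expired when its expiration date equals or precedes the current date, only Active models are visible in Teradata Model Manager, and the default expiration date is 3 months beyond the current date. The function names are hypothetical, and clamping to the last day of a shorter month (for example, November 30 plus 3 months) is an assumption, since the text does not say how the product handles that case.

```python
import calendar
from datetime import date


def effective_status(stored_status: str, expiration: date, today: date) -> str:
    """Return the status as displayed: Expired overrides Active/Inactive."""
    if expiration <= today:  # "equals or precedes the current date"
        return "Expired"
    return stored_status


def visible_in_model_manager(stored_status: str, expiration: date, today: date) -> bool:
    """Only models whose displayed status is Active are visible."""
    return effective_status(stored_status, expiration, today) == "Active"


def default_expiration(today: date, months: int = 3) -> date:
    """Date `months` months after `today`, clamped to the month's last day (assumed)."""
    month_index = today.month - 1 + months
    year = today.year + month_index // 12
    month = month_index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(today.day, last_day))


today = date(2017, 7, 15)
print(default_expiration(today))
print(effective_status("Active", date(2017, 7, 15), today))
```

In this sketch the stored status is never overwritten when a model expires, which mirrors the statement that the status cannot be changed unless the Expiration Date is changed first.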

Model Properties The Model Properties dialog is accessed by clicking Properties on the Published Models dialog. It displays model properties and variables, ADS properties and literal parameters, and Score properties and columns for the currently selected model. Note that a model may contain one or more instances of an analytic data set (ADS), one or more instances of a score table, or one or more instances of both. The first instance of either is created when an analysis is published. Additional instances are created each time a Teradata Model Manager user saves a model’s parameter settings. The Model Properties dialog displays the following fields for each published model. Note that the model name, description, version, status and expiration date may be modified using this dialog*. • Model Properties ∘ Name* ∘ Description* ∘ Version* ∘ Status* ∘ Published Date ∘ Expiration Date* • Model Variables — These are the columns in the analytic data set if the model contains one. • ADS Properties ∘ Anchor Database and Table — The anchor table determines the key values included in the data set. ∘ ADS Database and Table — The analytic data set table in the database. ∘ Target Date — An optional parameter used in time sensitive data set variables. ∘ Last Executed — The time that the data set was last built by the Teradata Model Manager. ∘ Published -or- Executed/Updated — For the first instance of ADS properties, this field displays the date and time that the ADS was published. For subsequent instances, this field displays the date and time that this ADS instance was either built or had its parameters updated by the Teradata Model Manager. ∘ Seconds — If this instance of ADS properties is not the first instance and represents the building of the ADS by the Teradata Model Manager, this field indicates the execution time in seconds. 
∘ Result — If this instance of ADS properties is not the first instance and represents the building of the ADS by the Teradata Model Manager, this field indicates either successful completion or any resulting error message. It may also indicate that parameters were updated by a particular user. • ADS Parameters — These are the optional literal parameters associated with the Variable Creation or Variable Transformation that contributed to building the analytic data set. The values displayed are either the original values at the time the model was published, or the values saved by a user of the Teradata Model Manager application after they were altered.

∘ Name — The name of the literal parameter.
∘ Type — The data type of the literal parameter, either String, Number, Date, Time, Timestamp or Text (free-form SQL text).
∘ Value — The value of the literal parameter.
∘ Fixed — This indicator field is set to ‘0’ if the parameter can be changed in the Teradata Model Manager application, or ‘1’ if the parameter is fixed and cannot be changed there.
∘ Description — The optional description of the literal parameter.
• Score Properties
∘ Input Database and Table — The name and location of the input table that was scored.
∘ Output Database and Table — The name and location of the output score table.
∘ UDF database — The database that contains any User Defined Functions used in scoring.
∘ Last Executed — The time that the score table was last built by the Teradata Model Manager.
∘ Published -or- Executed/Updated — For the first instance of Score properties, this field displays the date and time that the Score SQL was published. For subsequent instances, this field displays the date and time that this Score table instance was either built or had its parameters updated by the Teradata Model Manager.
∘ Seconds — If this instance of Score properties is not the first instance and represents the building of the Score table by the Teradata Model Manager, this field indicates the execution time in seconds.
∘ Result — If this instance of Score properties is not the first instance and represents the building of the Score table by the Teradata Model Manager, this field indicates either successful completion or any resulting error message. It may also indicate that parameters were updated by a particular user.
• Score Columns — These are the columns included in the score table if one is included in the model, along with their optional descriptions.
They typically include such items as key values, predicted values, probabilities or other evaluation measures and any columns carried through from the score input table. Note that the columns may be displayed in a desired order by clicking on the display column headers.

Advertised Maintenance
This option on the Tools menu brings up the Advertised Objects dialog to maintain the information about advertised output tables, views and procedures stored in the current Advertise Output metadata tables. Advertised Maintenance can also be accessed from the View > View Advertised Output menu option. The Advertised Objects dialog displays the following fields for each advertised object.
• Database — The name of the database containing the advertised object.
• Name — The name of the advertised object.
• Kind — The type of database object (T-Table, V-View, P-Procedure).
• Created By — The database user name that created the object.
• Created — The date and time that the object was advertised.
• Comment — The database comment, if any, that was associated with the advertised object.
• Category — The output category of the advertised object: ADS (data set), SCORE (score table), PROFILE, 0 - other.
• Note — An optional note or comment associated with the advertisement of an object (for example, to indicate a user project). The maximum size is 30 characters.
• Status — The output status of the advertisement, which is set by clicking Synchronize, described below:
∘ Missing: not found in the Teradata data dictionary
∘ Outdated: the data dictionary entry indicates the object was created after the date and time in the Advertise metadata's Database Objects table
∘ blank
• Application — The name of the application that advertised this object. This field is blank if advertised by Teradata Warehouse Miner or one of its derivative products.
The Advertised Objects dialog provides the following options.
• Properties — This button brings up the properties dialog, described below.
• Macros — This button brings up a dialog that allows the selection of any of the supplied Advertise Output macros and specification of their parameters, if any. Database, table and column drop-down boxes can be used to help ensure that valid names are used. If desired, however, database, table and column names can be typed into the parameter fields directly, keeping in mind that only minimal validation is performed on these fields. When a macro is selected and its parameters, if any, are specified, clicking Execute executes the requested macro and displays the results in the standard analysis results viewer. For more information, see RESULTS Tab. Note that if new macros have been added in the latest product update, these will be installed if necessary when this button is clicked.
• View/Edit — Sub-menu items are displayed for the following options:
∘ View Object Definition — This option performs a SHOW TABLE, SHOW VIEW or SHOW PROCEDURE command as appropriate, to display the SQL definition of the advertised object.
∘ View Data — If the advertised object is of “kind” Table, this option launches the results data viewer used elsewhere in the product for displaying the results stored in a table.
∘ View All Advertised Output — Changes the display to include all advertised objects, regardless of category.
∘ View ADS Output Only — Changes the display to include only objects in the ADS category.
∘ View Score Output Only — Changes the display to include only objects in the Score category. ∘ View Profile Output Only — Changes the display to include only objects in the Profile category. ∘ View Other Output Only — Changes the display to include only objects in the Other category. (For example, an Export Matrix output table, a Decision Tree Profile table, or an Association analysis saved “reduced input table”). ∘ Delete From Metadata… — Brings up a variation of the Metadata Deletion Wizard described in Metadata Maintenance, adapted to delete advertisements rather than projects and analyses. The following features are notable. When the advertisement for a procedure has been selected to delete, any advertisements for objects created by the procedure are automatically also selected to delete, with a count given in the message area at the bottom of the display. This is to avoid leaving partial information in the metadata. An option is provided to “Drop the actual database objects advertised above”. ∘ Synchronize — Clicking this button compares the advertisements of database objects with the Teradata data dictionary entries for these objects in order to set the Status field described above. Note that access to the DBC Tables view is required for this option to function successfully.
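As a rough illustration of two behaviors described above, the following Python sketch picks the SHOW command from the Kind code and derives the Status value the way Synchronize is described as doing. The function names, the example object names, and the idea of passing the dictionary creation time as an optional value are all assumptions made for this sketch; the product's actual dictionary queries are not shown in this guide.

```python
from datetime import datetime
from typing import Optional

# Kind codes as listed for the Advertised Objects dialog.
SHOW_COMMANDS = {"T": "SHOW TABLE", "V": "SHOW VIEW", "P": "SHOW PROCEDURE"}


def show_statement(database: str, name: str, kind: str) -> str:
    """Build the SHOW statement appropriate to the object's kind."""
    return f"{SHOW_COMMANDS[kind]} {database}.{name};"


def advertised_status(advertised_at: datetime,
                      dictionary_created_at: Optional[datetime]) -> str:
    """Classify an advertisement against the data dictionary.

    dictionary_created_at is None when the object is not found in the
    dictionary (hypothetical convention for this sketch).
    """
    if dictionary_created_at is None:
        return "Missing"
    if dictionary_created_at > advertised_at:
        # The dictionary entry is newer than the advertisement.
        return "Outdated"
    return ""  # blank status
```

For example, `show_statement("twm_results", "churn_ads", "T")` produces `SHOW TABLE twm_results.churn_ads;`, where both object names are invented.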

Advertised Object Properties The Advertised Object Properties dialog is accessed by selecting the Properties button on the Advertised Objects dialog, displaying the properties of the currently selected object, as outlined below.

Advertise Object Properties

• Advertise Object — General properties of the advertised object.
∘ Database — The database that contains the advertised object (table, view or procedure).
∘ Name — The name of the advertised object.
∘ Kind — T for Table, V for View, or P for Procedure.
∘ Category — ADS, SCORE, PROFILE or OTHER.
∘ Status — This attribute is used to indicate an object that is either Missing (not in the dictionary) or Outdated (dictionary has a later entry with the same name) with respect to the Teradata data dictionary.
∘ Created By — The database user that created the object.
∘ Created — The date and time the object was created.
∘ Comment — An optional comment describing the object.
∘ Advertise Note — Optional free-form text that may be used to categorize the object (for example, by internal project or purpose).
∘ Row Count — Using the Get Count button, the number of rows in a table may be displayed (but only if the advertised object is a table and not a view or procedure).
• ADS Properties — ADS-specific properties of the advertised object.
∘ AnchorDB — The database containing the anchor table, if any.
∘ AnchorTable — The name of the anchor table, if any.
∘ TargetDate — The target date if used in the underlying analyses.
• ADS Parameters — Repeats for each literal parameter, if any.
∘ Name — The name of the literal parameter.
∘ Type — String, Numeric, Date, Time, Timestamp, or Text.
∘ Value — A character string representation of the original value of the parameter.
∘ Description — The optional description of the literal parameter.
• Variables — Repeats for each variable… provided for both ADS and Score objects.
∘ Nbr — Sequential integer ordering the variables in the ADS.
∘ Variable Name — The name of the variable (column).
∘ Description — The optional description of the variable.
∘ Variable SQL — Pseudo-SQL representing how the variable was derived, concatenating the contributions of referenced analyses, if any, separated by |||. Note that if a column is simply passed along from a referenced analysis, it is not recorded unless the column name changes via aliasing.
• Variable Columns — Repeats for each source column contributing to each variable… provided for both ADS and Score objects.
∘ Nbr — Sequential integer ordering the variables, and within each variable, the source columns.
∘ Var Name — The name of the variable (column).
∘ Source DB — Database containing the table or view that contains the contributing column.
∘ Source Table — The name of the table or view that contains the contributing column.
∘ Source Column — The name of the column that contributed to the creation of the indicated variable.
• Score Properties — Score-specific properties of the advertised object.
∘ Input Database — The database containing the scoring input table.
∘ Input Table — The name of the scoring input table.
• Score Columns — Repeats for each column in the score table.
∘ Nbr — Sequential integer ordering the score output columns.
∘ Column Name — The name of the score output column.
∘ Description — If available, a generated description of the use of the score output column (for example, Index Column).
• Profile Subject Columns — Repeats for each column in the profile table.
∘ Subject Database — The database that contains the subject table.
∘ Subject Table — The name of the profiled subject table.
∘ Subject Column — The profiled column.
∘ Description — Description of the subject column, if available (for example, Group By).
• Procedure Properties — Only displayed if the advertised object is a procedure.
∘ Default Call — Procedure call with parameters simulating the results that would occur if a procedure wasn't used, including default literal parameters, if any.
∘ Product Version — Version of TWM or application that created the procedure.
∘ Database Version — Version of Teradata that was in use when the procedure was created.
• Procedure Parameters — Only displayed if the advertised object is a procedure, repeating for each procedure parameter, if any.
∘ Nbr — Sequential integer ordering the parameters.
∘ Name — The name of the parameter.
∘ Description — The optional description of the parameter.
∘ SQLType — The SQL type string of the parameter. For example, VARCHAR(30).
∘ Original Value — A character string representation of the original value of the parameter.
• Tables Built — Repeats for each table/view built by the procedure, if any.
∘ Database — The database that contains the table/view created by the procedure.
∘ Name — The name of the table/view created by the procedure.
∘ Procedure Call — The procedure call statement in use when the table/view was created by the procedure.
∘ Start Time — The date and time when procedure execution began.
∘ Stop Time — The date and time when procedure execution completed.
• Created By Procedure — Only displayed if the advertised object was created by a procedure.
∘ Procedure Database — The database that contains the procedure that created the advertised object.
∘ Procedure Name — The name of the procedure that created the advertised object.
∘ Procedure Call — The actual procedure call reflecting the parameters used.
∘ Start Time — Start time of procedure execution.
∘ Stop Time — Stop time of procedure execution.
• Created By Analysis/Chained — Describes the analysis that created the advertised object, or the Analysis procedure that created it if applicable. If chained to other analyses, they may be viewed using the selector underneath the display.
∘ AnalysisType — The type of the analysis that created the object (or is chained to it).
∘ Analysis Name — The name of the creating or chained analysis.
∘ Analysis Created — Date and time the creating or chained analysis was created.
∘ Analysis Description — Optional description of the creating or chained analysis.
∘ Metadata Database — Metadata database containing the analysis.
∘ Project Name — Name of the project containing the analysis.
∘ Project Created — Date and time the project was created.
∘ Project Description — Optional description of the containing project.
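Because the Variable SQL property concatenates the contributions of referenced analyses with a ||| separator, a value copied from the dialog can be split back into its parts. The Python sketch below shows one way to do that; the function name and the sample pseudo-SQL string are invented for illustration and are not taken from the product.

```python
from typing import List


def split_variable_sql(variable_sql: str) -> List[str]:
    """Split a Variable SQL string into per-analysis contributions.

    The separator is the literal three-bar sequence described for the
    Variable SQL property; empty fragments are dropped.
    """
    return [part.strip() for part in variable_sql.split("|||") if part.strip()]


# Hypothetical example value with two contributions.
example = "SUM(amount) ||| total_amount / 100"
contributions = split_variable_sql(example)
```

Note that the SQL concatenation operator || contains only two bars, so it is not mistaken for the separator.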

Preferences From the Tools menu, select the Preferences menu option to bring up the Preferences dialog. Preferences

Timeouts • General ∘ Database Connection Timeout(s) — Length of time the user interface or analyses waits to connect to Teradata. The default is 30 seconds. ∘ Dictionary Query Timeout(s) — Length of time the user interface waits to interrogate the Teradata dictionary for database, table, view and column information. The default is 30 seconds. ∘ Metadata Query Timeout(s) — Length of time the user interface waits to read and/or update the Teradata Warehouse Miner metadata when projects are opened or saved. The default is 30 seconds. ∘ Data Retrieval Query Timeout(s) — Length of time the data panel or graphics waits to read a result set (table/view) in Teradata. The default is 30 seconds.

Note: The timeout feature is currently only available when connected to a Teradata database.

Controls
• Selectors
∘ List columns alphabetically — Column names are sorted alphabetically in column selectors. This option may be overridden when using a particular selector by right-clicking and using the pop-up menu to change the option.
∘ Use bold characters for primary index columns — Primary index column names are displayed in bold font in column selectors. This option may be overridden when using a particular selector by right-clicking and using the pop-up menu to change the option.
∘ Enable multi-column tree view selection — Tree view selectors displaying selected columns will include check boxes to allow selection of multiple columns for use with right-click menu options. These options include options for copying, pasting or selecting column names based on names copied to the Clipboard. This option may not be overridden for an individual selector. Also, if this preference is changed, it does not take effect on currently opened windows until they are closed and reopened.

Limits
• Analysis
∘ Maximum Result Rows to Display — Number of rows to display in the data panel and graphics for analyses that generate a SELECT, CREATE TABLE or VIEW. The default is 1,000 rows. A maximum of 100,000 rows may be specified. A value of -1 displays all rows. A value of 0 is invalid and results in an error.
• Wizards
∘ Maximum Distinct Values to Display — Number of distinct values that are fetched by the wizard for analyses that have a wizard to determine distinct values (Recode, Design Code, Denormalization, Logistic Regression, Decision Tree and Variable Creation). The default is 1,000 distinct values.
∘ Use sampling to retrieve distinct values data — When enabled, a sample of rows is initially requested and used to determine distinct values for analyses that have a wizard for determining distinct values. The default sample size is 10,000 rows.
∘ Number of Rows to Sample — If Use sampling to retrieve distinct values data is enabled, the number of rows to sample is specified here.
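The rules for Maximum Result Rows to Display can be summarized in a short validation sketch. The Python below is illustrative only: the names are hypothetical, and rejecting negative values other than -1 and values above 100,000 with an exception is an assumption about how out-of-range input would be treated.

```python
from typing import List, Sequence

DEFAULT_MAX_RESULT_ROWS = 1_000   # documented default
MAX_RESULT_ROWS_LIMIT = 100_000   # documented maximum
ALL_ROWS = -1                     # documented "display all rows" value


def validate_max_result_rows(value: int) -> int:
    """Return the value if it is acceptable, otherwise raise ValueError."""
    if value == ALL_ROWS:
        return value
    if value == 0:
        raise ValueError("0 is not a valid Maximum Result Rows to Display value")
    if value < 0 or value > MAX_RESULT_ROWS_LIMIT:
        # Assumption: anything else outside 1..100,000 is rejected.
        raise ValueError(f"value must be -1 or between 1 and {MAX_RESULT_ROWS_LIMIT}")
    return value


def rows_to_display(result_rows: Sequence, max_rows: int = DEFAULT_MAX_RESULT_ROWS) -> List:
    """Apply the limit to a result set held in memory."""
    limit = validate_max_result_rows(max_rows)
    if limit == ALL_ROWS:
        return list(result_rows)
    return list(result_rows[:limit])
```

With the default setting, a 5,000-row result set is truncated to its first 1,000 rows, while a setting of -1 returns all 5,000.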

Execution • Projects ∘ Continue project execution upon analysis error — If an analysis fails while executing a project, execution continues with the next analysis.

Archive • Automatic Archive Options ∘ Automatically archive deleted projects, analyses, etc. — Whenever a project, analysis or attachment is deleted, it is automatically archived to the standard location in the application file storage area.

Note: The entire project is automatically saved prior to the deletion of the requested item.

Defaults
• Default Output Comment — When a value is entered here, any new analysis of a type that supports the post processing output panel will contain this value as the initial value of the Comment on Output Table field. Note that the text string entered here can contain substitution parameters representing an output category, the project name or the analysis name as follows. The resulting comment must not exceed 255 characters.
∘ - Score / ADS / Stats / Other
∘ - The name of the project that contains the analysis
∘ - The name of the analysis that is creating the output
For example, the string TWM : : results in a default output comment of TWM : Score : MyProject for the output of a scoring analysis in the project MyProject. TWM : : is the default value of this option.
• Default Procedure Comment — When a value is entered here, any new analysis of a type that supports the option to create a stored procedure will contain this value as the initial value of the Procedure Comment field. Note that the text string entered here can contain the same substitution parameters that are supported in the Default Output Comment field described above. The resulting comment must not exceed 255 characters. TWM : : is the default value of this option.
• Default SQL Font — A font type and size may be entered here or selected via a standard dialog by using the Browse button. The specified font will be used as the default font when entering SQL for any new Free Form SQL analyses (although the font in this case may be changed via a right-click option). The specified font will also be used in all new and existing analyses that allow entering free-form SQL text, such as in expert where clause panels and in SQL Text and Formula elements in Variable Creation analyses. The default SQL font is further used in the display of results SQL in all analyses that provide such display and in the display of SQL text for Variable expressions in Variable Creation (through the use of the SQL dialog button). If a font size is entered as text rather than selected in the Browse dialog, the value entered must be a decimal value between 6 and 36, entered in the current locale format. Note also that if a font style such as Bold or Italic is selected in the Browse dialog, it is ignored.
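The substitution behavior of the Default Output Comment and Default Procedure Comment options can be sketched as simple template expansion with a length check. In the Python below, the placeholder tokens `{category}`, `{project}` and `{analysis}` are hypothetical stand-ins chosen for this sketch, not the product's actual parameter syntax, and raising an error for an over-long comment is likewise an assumption.

```python
MAX_COMMENT_LENGTH = 255  # documented limit on the resulting comment


def expand_comment(template: str, category: str, project: str, analysis: str) -> str:
    """Expand the hypothetical placeholders and enforce the length limit."""
    comment = (
        template.replace("{category}", category)   # Score / ADS / Stats / Other
        .replace("{project}", project)             # project containing the analysis
        .replace("{analysis}", analysis)           # analysis creating the output
    )
    if len(comment) > MAX_COMMENT_LENGTH:
        raise ValueError(
            f"comment is {len(comment)} characters; the limit is {MAX_COMMENT_LENGTH}"
        )
    return comment


# Mirrors the guide's example result for a scoring analysis in MyProject.
comment = expand_comment("TWM : {category} : {project}", "Score", "MyProject", "Score1")
```

Here `comment` is the string TWM : Score : MyProject, matching the example given for a scoring analysis in the project MyProject.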

Startup • Data Source to connect to — When a data source is entered here, for successive executions of TWM, the user will be automatically connected to this data source. • First project to load — If the user wishes to automatically load a project associated with the data source specified above, the name of the first project to load is entered here. A drop-down list of all projects for the data source is provided for project selection. • Second project to load — The user can load up to three projects upon startup. The optional second project to load is entered here. • Third project to load — The optional third project to load is entered here. • Open new project — Check this option to open a new project automatically upon startup. If checked, the new project is opened after any requested projects are loaded.

Window Menu
The following options are available from the Window (Alt-W) menu:
• Tile Horizontally — Lays out all open analysis, graphs and results windows horizontally.
• Tile Vertically — Lays out all open analysis, graphs and results windows vertically.
• Cascade — Lays out all open analysis, graphs and results windows in a cascade.
• Arrange Icons — Arranges minimized windows.
• Close All Open Windows — Closes all open analysis windows and updates the analyses they correspond to in the project workspace. To update them in metadata, one of the Project > Save options must be used.
• Current Windows — Displays a list of available windows to make current.

Help Menu
The following options are available from the Help (Alt-H) menu:
• Contents & Index… (F1) — Displays the Teradata Warehouse Miner Help System.
• Import Tutorial Projects… — Imports projects containing tutorial examples, including the tutorial examples in the User Guide and Help System. These projects require that the Teradata Warehouse Miner Demonstration Data be installed in order to execute properly. For more information, see Installing Support Tables and Functions.
• View Analysis Template Instructions… — Displays the instructions that accompany the selection of an analysis template when adding a new analysis to a project. The following template types are available.
∘ Derived Table
∘ Subquery
∘ With Query
∘ Recursive Query
∘ Recursive View
∘ Union
• View Teradata Documentation… — A menu of links is provided to Teradata and TWM reference manuals on http://www.info.teradata.com, including manuals from the current and several previous releases.
• About Teradata Warehouse Miner… — Displays the Teradata Warehouse Miner About screen showing version number.

Toolbar The Teradata Warehouse Miner toolbar consists of several icons that provide a single-click interface to many important Teradata Warehouse Miner features.

Toolbar Icons

Each entry below gives the icon's tooltip followed by its description.
• Add New Project — Create a new Teradata Warehouse Miner data mining project.
• Add New Analysis — Add a Teradata Warehouse Miner analysis to the currently open data mining project. Brings up a dialog to select any of the available analyses.
• Add Existing Analysis — Add copies of one or more existing analyses to the currently selected project.
• Open Project — Add one or more existing projects to the project workspace.
• Save Project — Save the specified data mining project, and all the analyses within the project.
• Save All — Save all the currently open data mining projects, and all the analyses within those projects.
• Run — Execute the currently selected analysis.
• Stop — Stop the currently executing and selected analysis.
• Open Connection — Display the Open Connection dialog. This is much like the standard ODBC connection dialog except that it supports Simple (IP Address/Host Name) connections in addition to System, User and File DSNs.
• Connection Properties — Display the Connection Properties dialog as illustrated above.

The last item on the toolbar is the name of the data source that will be used to connect, or blank if no connection has yet been established. Note that this is not a persistent connection; connection pooling is used to obtain a connection only when needed. Other than the New Project, Open Project, Open Connection, and Connection Properties icons, any toolbar icon can be grayed out, depending upon the connection state, and whether or not a project has been opened and/or created by the GUI.

Project Area
The Project Area displays all currently loaded projects and their analyses and attachments, if any. To hide the Project Area, click on the X in the upper right hand portion of the window. To show the Project Area when it is hidden, select the left-arrow button that is displayed just above the closed Project Area, or select the View > Project Window option.
Selecting an analysis in the Project Area highlights not only the selected analysis but also each analysis that the selected analysis references for input, directly or indirectly, temporarily italicizing the name of each referenced analysis. Similarly, each analysis that the selected analysis depends on (such as a scoring analysis depending on an algorithm) is also highlighted with the analysis name temporarily underlined. Note that when an individual selected analysis is executed, referenced analyses with italicized names will be executed first, whereas dependent analyses with underlined names will not be automatically executed.
When a project is selected, the following right-click menu options are available.
• View — View the currently selected project.
• Run — Run the currently selected project within the project window. Alternatively, pressing the F5 function key also executes the selected project. To execute a project, each analysis in the project is executed in order, except that before each analysis is executed, each analysis that it may refer to as the source of its input is automatically executed first. This is done in such a way, however, that each analysis in a project is executed only once.
• Stop — Stop the currently running and selected project.
• Add New Analysis… — Add a new analysis to the project.
• Add Existing Analysis … — Add copies of one or more existing analyses to the currently selected project. For more information, see Adding Analyses to a Project.
• Add Attachment … — Create an attachment folder within the specified project and copy any valid Windows object into it, such as a document file.

• Save — Save the specified project and its contents. If another user has saved the same project in another session after the project was last saved or opened in this session, a message is given, along with an option to save a copy of the project, but it is not allowed to overwrite the original project in this case.
• Save As — Save the specified project and its contents under a new name. Note that this option creates a new project containing copies of all analyses within the current project.
• Save and Archive — Save the specified project and its contents along with archiving the project to an export file in the standard automatic archive. (The automatic archive facility is described in Metadata Maintenance, specifically in the portion describing the Archive button.)
• Save All Projects — Save all of the currently open projects and their contents.
• Close — Close the specified project.
• Close All Projects — Close all projects that are open in the workspace.
• Delete — Delete the currently selected project within the project window. Before the project is deleted, a verification message is given that provides an opportunity to cancel the deletion.
• Rename — Rename the specified project.
• Export… — Display the Export Wizard in order to save the currently selected project to a binary file for possible retrieval on the same or different system using the Import Wizard.
• Extract SQL — This option extracts the generated SQL from the results of all the analyses contained in the project, placing it in a display so that it can be reviewed, edited, copied, written to a file or placed in a Free Form SQL test node. For more information, see the description of the Extract SQL option in Project Menu.
• Create Results Files — This option enables the creation of tab-delimited files from the result sets of all of the analyses in a project. A dialog is displayed to enable the selection of a destination folder for the files and to watch the progress of the building of the result files, along with an option to export selected result sets to Excel. For more information, see the description of the Create Results Files option in Project Menu.
• Create SQL Node to Drop Tables — This option creates a Free Form SQL analysis in the current project that contains statements to drop every table and view created in the currently selected project or analysis.
• Create SQL Node to Collect Statistics — This option creates a Free Form SQL analysis in the current project that contains statements to collect statistics on the primary index of every table created in the currently selected project or analysis. Note that if an output table does not exist, the primary index columns will not be known and the SQL to collect statistics on that table will be incomplete, which will be noted in a comment after the command.
• Properties — The Properties dialog may be displayed for the currently selected project or analysis by selecting this option, as described below.
When an Analysis is selected, the following right-click menu options are available.
• View — View the currently selected analysis.
• Run — Run the currently selected analysis within the project window. Alternatively, pressing the F5 function key also executes the selected analysis. Before an analysis is executed, each analysis that it may refer to as the source of its input is automatically executed first. Since a referenced analysis may refer to other analyses as well, executing an individual analysis may result in the execution of a series of analyses.
• Run To End — Run the currently selected analysis within the project window followed by each of the subsequent analyses until the end of the project.
Alternatively, press the F6 function key to run the selected analysis and the subsequent analyses. Note that execution order may be affected when an analysis refers to another analysis in the project for its input, in which case the referenced analysis is always executed before the analysis that refers to it. No analysis is executed more than once.
• Run Stand-Alone — Run the currently selected analysis within the project window without executing any analyses that it may reference for input. Alternatively, press the F7 function key to run the selected analysis this way. Note that this option may result in an error if an analysis referenced for input has not created the required input table or if a created volatile table is no longer available. This option may be ignored in the Teradata Profiler product.
• Skip during Project Execution — Use this option to skip the execution of an individual analysis when the project that contains it is executed, whether in whole or in part when the Run To End option is used. It does not, however, skip the execution of an analysis that is executed by itself or as the result of being referenced by another analysis that is being executed by itself. The option works by toggling a flag that is saved with the analysis and is honored in both subsequent interactive and batch executions. Skipping the execution of an analysis during project execution may be useful in many instances. For example, the option can be used to avoid rebuilding a matrix every time an algorithm that uses it is executed to build a model, or similarly to avoid rebuilding a model every time an analysis that scores it is executed. Similarly, it can be used to avoid executing a chain of analyses an extra time when they are being refreshed or published with the Refresh or Publish analysis, respectively.
• Stop — Stop the currently running and selected analysis.
• Delete — Delete the currently selected analysis within the project window. Before the analysis is deleted, a verification message is given that provides an opportunity to cancel the deletion. If the analysis to be deleted is referred to by another analysis for its input, the verification message indicates this fact.
• Delete without archiving/Delete after archiving — This option allows the user to override the current setting of the preference option to Automatically archive deleted projects, analyses etc. when deleting the currently selected analysis. If the automatic archive preference option is checked, this option reads Delete without archiving, and if not checked this option reads Delete after archiving. Otherwise, the Delete option is performed as described above.
• Rename — Rename the specified analysis.
• Export… — Display the Export Wizard in order to save the currently selected analysis to a binary file for possible retrieval on the same or different system using the Import Wizard.
• Clone — Make a copy of the currently selected analysis and insert it into the project just after the analysis being copied.
• Move to Top — Move the selected analysis to the top of the project's list of analyses.
• Move to Bottom — Move the selected analysis to the bottom of the project's list of analyses.
• Log Analysis Info — Information about the currently selected analysis that might be useful for support purposes is written to the information log. The log can be viewed by selecting the Tools > View Logs - View Info Log… option.
• Extract SQL — Extract the generated SQL from the results of the selected analysis, placing it in a display so that it can be reviewed, edited, copied, written to a file or placed in a Free Form SQL test node. For more information, see the description of the Extract SQL option in Project Menu.
• Create Results Files — Enable the creation of tab-delimited files from the result sets of a selected analysis.
A dialog appears that lets you select a destination folder for the files, along with an option to export the result set(s) of the selected analysis to Excel. For more information, see the description of the Create Results Files option in Project Menu.
• Create SQL Node to Drop Tables — Create a Free Form SQL analysis in the current project that contains statements to drop every table and view created in the currently selected project or analysis.
• Create SQL Node to Collect Statistics — Create a Free Form SQL analysis in the current project that contains statements to collect statistics on the primary index of every table created in the currently selected project or analysis. If an output table does not exist, the primary index columns will not be known and the SQL to collect statistics on that table will be incomplete, which will be noted in a comment after the command.
• Properties — The Properties dialog may be displayed for the currently selected project or analysis by selecting this option, as described below.

When an Attachment is selected, the following right-click menu options are available.
• View — View the currently selected attachment (i.e., the attachment is opened with the appropriate program based on the association with the file's extension, such as .txt).
• Delete — Delete the currently selected attachment.
• Rename — Rename the currently selected attachment.
• Export… — Display the Export Wizard in order to save the currently selected attachment to a binary file for possible retrieval on the same or different system using the Import Wizard.

When an Attachment folder is selected, the following right-click menu options are available.
• Add Attachment… — A file may be selected and added to the project's Attachment Folder.
• Export… — Display the Export Wizard in order to save all of the attachments in the selected Attachment folder to a binary file for possible retrieval on the same or different system using the Import Wizard.
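
The execution rules described above for Run, Run To End, Run Stand-Alone and Skip during Project Execution can be sketched in a few lines of Python. This is only an illustration of the stated rules, not the product's implementation: the function name run_order, the mode strings and the sample analysis names are hypothetical, and the sketch assumes that the skip flag applies to every analysis reached during Run To End.

```python
from typing import Dict, FrozenSet, List


def run_order(project: List[str], refs: Dict[str, List[str]], start: str,
              mode: str = "run",
              skipped: FrozenSet[str] = frozenset()) -> List[str]:
    """Return the order in which analyses would execute (hypothetical model).

    project: analyses in project order
    refs:    analysis -> analyses it references for input
    mode:    "run", "run_to_end" or "stand_alone"
    skipped: analyses flagged with Skip during Project Execution
    """
    executed: List[str] = []
    honor_skip = mode == "run_to_end"   # flag is ignored for a single run

    def execute(name: str) -> None:
        if name in executed:            # no analysis is executed more than once
            return
        if honor_skip and name in skipped:
            return
        if mode != "stand_alone":       # Run Stand-Alone ignores references
            for ref in refs.get(name, []):
                execute(ref)            # referenced analyses always run first
        executed.append(name)

    execute(start)
    if mode == "run_to_end":
        for name in project[project.index(start) + 1:]:
            execute(name)
    return executed


project = ["matrix", "model", "score", "report"]
refs = {"model": ["matrix"], "score": ["model"]}
print(run_order(project, refs, "score"))
print(run_order(project, refs, "score", mode="stand_alone"))
print(run_order(project, refs, "model", mode="run_to_end",
                skipped=frozenset({"matrix"})))
```

In this model, running score by itself pulls in matrix and model first, while Run To End from model with matrix flagged as skipped leaves the matrix untouched.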

Execution Status Area

The Execution Status Area displays all messages generated by the Teradata Warehouse Miner analyses. Each entry includes the name of the Analysis, the Status and the Message. To hide the Execution Status Area, click the X in the upper right-hand portion of the window. To show the Execution Status Area when it is hidden, select the View > View Execution Window option. As an alternative, click the upward-pointing directional button in the upper right portion of the main window to reopen the Execution Window. The following right-click menu options are provided in the Execution Status Area:
• Clear All Messages
• Clear Selected Messages
• View Error Log
• View Execution Log

TWM Data Mining Projects

Creating or Opening a Project

1. To create a new Teradata Warehouse Miner Data Mining project, do one of the following:
• Click the Add New Project icon on the toolbar.
• Select File > Add New Project.
This creates a new Data Mining Project that contains any created Analyses available within the product. You can rename the project using one of the following options:
• Highlight the project name and single-click it.
• Right-click the project name and select the Rename option.
A text box appears with the project name highlighted, much like a standard rename in Windows Explorer.
2. To open a previously saved Teradata Warehouse Miner Data Mining project, do one of the following:
• Click the Add Existing Project icon on the toolbar.
• Select File > Add Existing Project.
This brings up a dialog similar to the Metadata Maintenance dialog with a simplified set of option buttons. Selecting the desired projects to open and clicking OK adds them to the Project Explorer window. However, if the Save to Local Files option is checked on the Connection Properties dialog, a file dialog is displayed instead of the Metadata Maintenance dialog, from which the desired project or projects can be selected and displayed in the Project Explorer window.

Saving a Project

As previously mentioned, a Teradata Warehouse Miner Data Mining project is a collection of available analyses. These analyses must exist in one and only one project. At any time, an entire project and all associated analyses can be saved via any of the following mechanisms:
• File Menu
∘ Save — Save the specified project and its contents, either to metadata tables or to export files in a special folder in the application work space. If another user has saved the same project to metadata in another session after the project was last saved or opened in this session, a message appears, along with an option to save a copy of the project, but the original project cannot be overwritten.
∘ Save As — Save the specified project and its contents under a new name. Note that this option creates a new project containing copies of all analyses within the current project.
∘ Save and Archive — Save the specified project and its contents along with archiving the project to an export file in the standard automatic archive. The automatic archive facility is described in Metadata Maintenance, specifically in the portion describing the Archive button.
∘ Save All Projects — Save all of the currently open projects and their contents.

Note: Any of these options can be invoked either from the File menu on the Toolbar or from the right-click menu in the Project window.

When a project is closed, when the connection to the Teradata database is changed, or when the application is exited, the user is prompted to save anything not currently saved using the following dialog.

Save Changes to Project

If the box next to the project is checked, it is saved along with all analyses. Alternatively, uncheck the box to skip the save for any particular project. Then click one of the following buttons:
▪ Yes — Save all indicated projects.
▪ No — Do not save anything and continue with the operation that prompted the save.
▪ Cancel — Cancel the command without saving and return to the application.

Removing a Project

A project in the Project Explorer window can be permanently removed.
1. Do one of the following:
• Right-click on the project name within the Project Explorer window and select the Delete option.
• Select Project > Delete when the project is highlighted.
This permanently removes the project either from the metadata tables or from the special folder of projects saved with the Save to Local Files option. For this reason, a confirmation dialog appears.
2. Click OK to permanently delete the project or Cancel to abort the removal.
If the Automatic Archive feature is selected on the Preferences dialog, a copy of the deleted project is saved in the standard automatic archive before it is deleted. The automatic archive facility is described in Metadata Maintenance, specifically in the portion describing the Archive button.

Adding Analyses to a Project

A new or existing analysis can be added to the specified Teradata Warehouse Miner Project via the Add New Analysis… or Add Existing Analysis… icons on the Toolbar, or the File or Project menu options with the same names. Additionally, right-clicking within the Project Window will bring up a menu with both the Add New Analysis… and Add Existing Analysis… options on it.

Adding a New Analysis

1. From the File menu, select Add New Analysis.

Add New Analysis

2. Double-click the desired analysis, or highlight it and click OK. The window for the selected analysis appears.
3. Click the Variable Creation analysis icon. One or more additional fields are displayed, as shown in the following example.

Add New Analysis: Variable Creation

• Analysis name — This is the name to assign to the new analysis. The default shown is based on the type of analysis and a counter to make it unique within the project.
• Analysis template — This is an optional template indicator (default is none). When a template type is selected here and OK is selected, two or more analyses are created as called for by the template, with connecting fields and required options pre-set. Immediately after the analyses are created, a pop-up viewer is displayed with further instructions to be followed by the user to complete the template. The templates provided include:
∘ Query with Derived Table — A derived query analysis and base query analysis are created.
∘ Query with Subquery — A subquery analysis and base query analysis are created.
∘ With Query — A With query analysis and base analysis are created.
∘ With Recursive Query — A With Seed query, With Recursive query and base query analysis are created.
∘ With Recursive View — A With Seed query, With Recursive query and base view analysis are created.
∘ Union of Queries — Two analyses to combine with a set operator are created.
• Operator — This drop-down selector is displayed only when the requested template type requires it. The possible values are outlined below.

Query with Subquery
∘ Between
∘ Exists
∘ In
∘ Is Null
∘ Is Not Null
∘ Like
∘ Not Between
∘ Not Exists
∘ Not In
∘ Not Like

The SQL predicate operators above are the operators most frequently used with subqueries. See the Variable Creation – INPUT – Variables – SQL Elements section in the Teradata Warehouse Miner User Guide (Volume 2), B035-2301, for more information about these SQL operators.

Union of Queries
∘ UNION
∘ UNION ALL
∘ INTERSECT
∘ INTERSECT ALL
∘ EXCEPT
∘ EXCEPT ALL

The keyword ALL in the list above means to retain duplicate rows in the result set. The UNION keyword indicates that rows output from both the first and second analysis should be included in the answer set. INTERSECT is used to indicate that only rows included in both input data sets should be included in the output data set. And finally, EXCEPT is used to indicate that all rows from the first data set should be included in the output data set except those occurring in the second data set.
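
The effect of the six set operators, and of the ALL keyword in particular, can be illustrated with a small Python model. This is a sketch that assumes the usual SQL multiset semantics (ALL keeps duplicates, the plain forms return distinct rows); the function name combine and the sample rows are hypothetical, and the code does not run against a database.

```python
from collections import Counter
from typing import Hashable, Iterable, List


def combine(first: Iterable[Hashable], second: Iterable[Hashable],
            operator: str) -> List[Hashable]:
    """Model a set operator applied to two result sets (sorted for display)."""
    a, b = Counter(first), Counter(second)
    op = operator.upper()
    if op == "UNION ALL":
        rows = a + b                        # keep every row from both inputs
    elif op == "INTERSECT ALL":
        rows = a & b                        # keep the smaller duplicate count
    elif op == "EXCEPT ALL":
        rows = a - b                        # remove one copy per matching row
    elif op == "UNION":
        rows = Counter(set(a) | set(b))     # distinct rows from either input
    elif op == "INTERSECT":
        rows = Counter(set(a) & set(b))     # distinct rows found in both
    elif op == "EXCEPT":
        rows = Counter(set(a) - set(b))     # distinct rows only in the first
    else:
        raise ValueError(f"unknown set operator: {operator}")
    return sorted(rows.elements())


first = [1, 1, 1, 2, 3]
second = [1, 1, 2, 2, 4]
for op in ("UNION", "UNION ALL", "INTERSECT", "INTERSECT ALL",
           "EXCEPT", "EXCEPT ALL"):
    print(op, combine(first, second, op))
```

With these inputs, INTERSECT returns the rows 1 and 2 once each, while INTERSECT ALL returns 1 twice because it appears at least twice in both inputs.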

Adding an Existing Analysis

Note: The Add Existing Analysis option is not available when the Save to Local Files option is selected on the Connection Properties dialog.

When adding an existing analysis, a dialog similar to the Metadata Maintenance dialog with a simplified set of option buttons is displayed. By selecting the top-most Analyses folder on the left side display or one of the category folders underneath it, the desired analysis or analyses to add can be selected from a list of all analyses in the repository or all the analyses of a given type category displayed on the right-side view. Similarly, by selecting a project on the left-side view, one or more analyses contained in the selected project may be selected from the right-side view. Adding an analysis creates a copy of the analysis within the current project or a new project and opens the form for the analysis, displaying the parameters and any results that were saved in the analysis that was copied. It will also create a copy of any analysis referred to by the specified analysis and any analyses that they in turn refer to. If, due to analysis references being present, any additional analyses will be added to the current project, a message appears with the option to continue or cancel the operation.

Note: To see ahead of time what analyses, if any, are referred to by a singly selected analysis, analysis reference shading may be activated by selecting the Properties display and by further selecting an option that requires the analysis to be memory resident (i.e., the Parameters tab, the References tab or the Display names of output tables… option).

Clicking on a display column header sorts the entries by the contents of that column, first in ascending order, then in descending order on the next click. This can be useful for such tasks as finding a particular analysis in a particular project, or in finding the last analysis modified.

Checking the Add to new project option causes the selected analyses to be added to a newly created project. This means that an analysis can be added to the project workspace even when there are currently no projects in the workspace. Checking the Map database objects option and clicking OK after selecting an analysis brings up a dialog similar to that seen when importing a project (the labels change to reflect that an analysis is being loaded). This makes it possible to map to other values the databases, tables or columns in the analysis copy and the analyses it refers to, if any, which are also being loaded. For more information on mapping database objects, see Import Wizard.

Removing an Analysis

Although individual analyses cannot be saved on their own, as they must exist in one and only one Teradata Warehouse Miner Project, they can be permanently removed by right-clicking on the analysis name within the Project Window and selecting the Delete option, or by selecting the Delete option from the Project menu when the analysis is highlighted. As this will permanently remove the analysis from the project, a confirmation dialog appears. If the analysis to be deleted is referred to either directly or indirectly by another analysis via an input, matrix or model reference, the verification message both counts and lists the names of the referencing analyses. Click OK to permanently delete the analysis or Cancel to abort the removal. The referencing analyses, if any, are not deleted. If the Automatic Archive feature is selected on the Preferences dialog, a copy of the deleted analysis is saved in the standard automatic archive before it is deleted. The automatic archive facility is described in Metadata Maintenance, specifically in the portion describing the Archive button. When the Save to Local Files option on the Connection Properties dialog is checked, removing an analysis removes it from the Project Explorer work area, but not from the export file with the matching project name in its special folder.

Properties

When a project is selected in the Project window, the Properties option may be selected to display the Project Properties dialog, as described in Project Menu. When an analysis is selected, the Analysis Properties dialog may be similarly displayed. The Properties display may be requested from the Project Menu, from the right-click menu in the Project window, or from the Metadata Maintenance dialog available from the Tools Menu.

Analysis Input Screen

Though there are variations for some types of analysis, the typical analysis input screen looks like the following:

Adaptive Histogram Input Screen

Each analysis form has three tabs - INPUT, OUTPUT, and RESULTS.

INPUT Tab

The typical INPUT tab contains an initial data selection tab with the fields described in the following sections:
• Select Input Source
• Select Columns From...

Select Input Source

Note: Teradata Profiler users may skip this section.

• Input Source — Selecting the Table option lets the user select from available databases, tables (or views) and columns in the usual manner. Selecting the Analysis option lets the user select directly from the output of another analysis of qualifying type in the current project. Selectable analyses include all of the Analytic Data Set and Reorganization analyses except Refresh, namely Build ADS, Variable Creation, Variable Transformation, Denorm, Join, Merge, Partition and Sample. In addition, the Free Form SQL analysis may be selected as input directly when certain information has been provided in the analysis. See Free Form SQL for more information.
• In place of Available Databases, the user may select from Available Analyses, including all “qualifying” analyses of the above types in the current project. An analysis does not qualify for inclusion in the pull-down list if it does not create a table or view and if the referencing analysis is an algorithm, matrix, scoring or Data Explorer analysis. This is because such analyses cannot access a volatile table for input. Available Tables offers a list of all the output tables that will eventually be produced by the selected analysis, or a single entry with the name of the analysis under the label Volatile Table if the analysis does not create an output table or view.

Selecting an analysis as input rather than a table or view has several advantages.
1. It ties together the analyses needed to create an analytic data set. This is a prerequisite for refreshing a data set with a possibly different target date, anchor table or output table, and for publishing the SQL to create an analytic data set in the Model Manager application.
2. It can be used to automatically substitute a volatile table for the output of an analysis when an output table or view is not created, thus saving time and permanent space in creating data sets, as well as removing the need to name intermediate tables or views.
3. When an analysis produces an output table or view, the analysis that reads it does not have to know the name of the table or view it is using as input.

When an analysis is executed, either individually or within the context of an entire project, any referenced analyses are automatically executed first by the application. Other considerations are given when adding or deleting an existing analysis that refers to other analyses.
One disadvantage of selecting for input an analysis that does not create a table or view (and therefore creates a volatile table) is that any values wizard designed to select column values out of the input table will typically not work. If values retrieval fails in this case, a message appears to explain why and provides suggestions. Another disadvantage is that use of the output option to create a view (in the referencing analysis) is inappropriate, since the referenced volatile table will not be available if the view is accessed later in another context. Finally, analyses that do not create a table or view cannot be selected for input by algorithm, matrix, scoring or Data Explorer analyses.

It may also be useful to note that referencing an analysis for input can be used to embed the SQL generated by another analysis as a derived table in the referencing analysis. To achieve this result, both the referencing and the referenced analysis must be a Denorm, Join, Merge, Partition, Sample, Variable Creation or Build ADS analysis (i.e., one of the “selectable” analyses other than Variable Transformation). A Free Form SQL analysis may also serve as the referenced analysis, as described in more detail in Free Form SQL. Further, the referenced analysis must have its Output Storage option to Generate the SQL for this analysis, but do not execute it set to true, while its option to Store the tabular output of this analysis in the database must not be set to true. Although the net effect of executing the resulting derived table should be the same as having the referenced analysis create a volatile table, it may be preferable to generate the SQL with a derived table instead (possibly to use the SQL in another context). Note that with this technique it is even possible to create nested derived tables.
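
The conditions for embedding a referenced analysis as a derived table, and the difference between that form and the volatile-table form, can be sketched as follows. This is a hedged illustration only: the function names, the table alias and the SQL text are hypothetical and simplified, and the SQL that Teradata Warehouse Miner actually generates may look quite different.

```python
from typing import List

# Analysis types that may take part in derived-table embedding (from the text).
EMBEDDABLE = {"Denorm", "Join", "Merge", "Partition", "Sample",
              "Variable Creation", "Build ADS"}


def uses_derived_table(referencing_type: str, referenced_type: str,
                       generate_sql_only: bool, store_output: bool) -> bool:
    """True when the referenced SQL would be embedded as a derived table."""
    referenced_ok = (referenced_type in EMBEDDABLE
                     or referenced_type == "Free Form SQL")
    return bool(referencing_type in EMBEDDABLE and referenced_ok
                and generate_sql_only and not store_output)


def build_sql(columns: List[str], referenced_sql: str, name: str,
              derived: bool) -> List[str]:
    """Return illustrative statements for the referencing analysis."""
    cols = ", ".join(columns)
    if derived:
        # Referenced SQL is nested inside the referencing query.
        return [f"SELECT {cols} FROM ({referenced_sql}) AS {name};"]
    # Otherwise the referenced analysis first materializes a volatile table.
    return [
        f"CREATE VOLATILE TABLE {name} AS ({referenced_sql}) "
        "WITH DATA ON COMMIT PRESERVE ROWS;",
        f"SELECT {cols} FROM {name};",
    ]


derived = uses_derived_table("Join", "Sample", generate_sql_only=True,
                             store_output=False)
for statement in build_sql(["cust_id", "balance"],
                           "SELECT cust_id, balance FROM accounts",
                           "src", derived):
    print(statement)
```

Because the derived-table form is a single statement, it can be copied into another context without depending on a volatile table that exists only in the current session.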

Select Columns From...

Column Selection Methods

• The arrow buttons on the Analysis Input screen can be used to select or de-select columns displayed on the left into or out of one or more tree-structured views on the right.
• Columns may be selected or de-selected by dragging them from the available columns view to a selected columns view, or back again. This is done by first highlighting one or more columns with left-mouse clicks, holding down the control or shift key to highlight multiple columns, and then holding down the left-mouse button while the mouse is dragged to the desired window and dropped by releasing the left-mouse button.
• Columns can be selected individually (but not de-selected) by double-clicking on them to select them into the currently highlighted tree-structure view.

Note that most data selectors have a specialized selection mechanism specified by the “…” icon. This is analysis dependent and is a shortcut for data selection. When the icon is clicked, one or more tooltips are displayed, one of which can be chosen for data selection. For example, the specialized selection mechanism for the Data Explorer is “Select All Columns.” In this case, all columns are automatically moved from the Available to Selected lists. For Statistics, this mechanism is “Select all Numeric Columns.”

Tool-Tip Information

Tool-tip information is available about columns in the Available Columns and Selected Columns panels by holding the mouse pointer over a particular column name. The displayed fields include the following. Data Type, Character Set and Comparison are not displayed for columns from a referenced analysis.
• Name - the name of the column
• Type - a category (Numeric, Character, Date, Time, Timestamp, BigInteger, BigDecimal, Unknown and NotSupported) based on the actual data type and determining the color of the icon used for the column in list and tree views
• Analytic Type - continuous or categorical (based on data type only, not use)
• Data Type - the actual Teradata data type
• Character Set - Latin, Unicode etc. (only if Character type)
• Comparison - Case Specific or Not Case Specific (only if Character type)

Right-Click Menu Options - Available Columns

In the Available Columns panel on the left side of the input screen, the user can right-click and receive menu options as described below. These options work together with the Preferences set from the Tools menu. The Primary Index and Sort Order options may be used to temporarily override the default settings in Preferences.
• Clipboard
∘ Copy — Copies names of selected columns to the Clipboard.
∘ Paste — (Not available for Available Columns)
∘ Select — Selects (highlights) columns with matching names on the Clipboard.
∘ Primary Index
▪ In Bold — Displays primary index columns in bold font.
▪ Normal — Displays primary index columns in normal font.
∘ Sort Order
▪ Alphabetical — Displays column names in alphabetical order.
▪ Table Order — Displays column names in table order.
∘ Show Table/View — This option displays the results of a SHOW TABLE or SHOW VIEW SQL command for the currently selected table or view. For a table, various table options such as FALLBACK are displayed, along with column names and types and a primary index clause, if any. For a view, the statement that was used to create the view is displayed. The dialog used to display the results is resizable and contains a Word Wrap option.

Right-Click Menu Options - Selected Columns

In the Selected Columns panel on the right side of the input screen, the user can right-click and receive menu options as described below. These options work together with the Preferences set from the Tools menu. In particular, the option to Enable multi-column tree view selection must be set in order for Clipboard Copy and Select options to work with Selected Columns.
• Clipboard
∘ Copy — Copies names of selected columns to the Clipboard.
∘ Paste — Takes the list of columns on the Clipboard, if any, and if they occur in the Available Columns list, copies them into the Selected Columns list.
∘ Select — Selects (highlights) columns with matching names on the Clipboard.
• Switch Input To — This option applies when a database, table or column is highlighted on the right side of the input screen. When this option is selected, the database, table and/or column selectors on the left side of the input screen are adjusted to match the highlighted database, table, analysis or column on the right side of the screen.

OUTPUT Tab

The OUTPUT tab contains one or more of the following tabs depending on the particular analysis and whether the chosen output style is to write output to a table.
• Storage
• Primary Index (Teradata Database)
• Post Processing

Storage

The specific options are explained in the sections defining the output for specific analyses. For more information regarding the Stored Procedure and Advertise Output options, see Stored Procedure Support (Teradata Database) and Advertise Output.

Teradata Database

When connected to a Teradata database, the storage tab may vary depending on the analysis, but typically looks like the panel shown below.

Output Storage on Teradata Database

Aster Database

When connected to an Aster database, the storage tab may vary depending on the analysis, but typically looks like the panel shown below (which applies to Descriptive Statistics functions).

Output Storage on Aster Database

Note that the Stored Procedure, Procedure Comment, Create output table using the FALLBACK keyword and Create output table using the MULTISET keyword fields are not available in Aster. The following storage tab is a variation that applies to the ADS and Reorganization analyses.

Output Storage on Aster Database (ADS and Reorganization Analyses)

The following output table options are specific to Aster database connections:
• Table Type
∘ Fact Table
∘ Dimension Table
• Distribution Type
∘ Distribute by Hash
∘ Distribute by Replication
• Persistence Type
∘ Permanent Table
∘ Analytic Table
• Storage Type
∘ Row Storage
∘ Column Storage
• Compression Type
∘ None
∘ Low
∘ Medium
∘ High
• Hash Column (only if Distribute by Hash)

Primary Index (Teradata Database)

Note: None of the fields on this screen is enabled when connected to an Aster database.

The primary index tab is only available when connected to a Teradata database. The primary index tab below is available for most analyses that produce an output table containing user data (or more specifically, for analyses in the Reorganization and ADS categories other than Denorm and Refresh). The specific options are explained in the sections defining the output for specific analyses. The default selection is most typically determined by the primary index columns of the input table.

Output - Primary Index

The options displayed on the primary index tab are dependent on the options selected on the storage tab and the release of Teradata in use. In particular, no options are displayed on the primary index tab when creating a database view or explaining the query plan. Note, however, that options are displayed even when the output style is to select data and not create an output table. In this case, the primary index information is used if the analysis is referenced and changed to produce a volatile output table. The Additional Information text box can be used to specify a request for index partitioning, or in some cases, column partitioning. For more information about column partitioning, see Teradata Columnar Support (Teradata Database). The option to Create Table with NO PRIMARY INDEX is displayed only when the release of Teradata in use is 13.00 or greater, and a table is being created or results are being selected. If this option is selected, the options relating to collecting statistics on the post processing tab described in Post Processing are not given.

Post Processing

Note: None of the fields on this screen except Further SQL Commands to Execute is enabled when connected to an Aster database.

The post processing tab is available for most analyses that produce an output table or view. The panel shown below is given if output is directed to a database table. If it is directed to a database view, then only the first option is provided, with the label changed to Comment on Output View. In general, Collect Statistics and Further SQL commands are applied to both permanent and volatile tables with exceptions noted below, while Comments are ignored for volatile tables.
• The Variable Transformation analysis does not collect statistics for the volatile table created when null value replacement is performed in combination with a specific transformation.
• The Data Explorer analysis can produce multiple output tables, and so applies the requested post processing equally to each of the output tables, but notably does not offer the Further SQL commands to execute option.
• The Sample analysis can produce a table, multiple tables or a table plus multiple views, and so applies the requested post processing to each output table or view as appropriate. In particular, the Comment on Output Table option is applied to each permanent table and view, the Collect Statistics command is applied to both volatile and permanent tables, and the Further SQL commands option is applied only to permanent output tables.

Note: The Sample analysis is only available when connected to a Teradata database.

Output Post Processing on Teradata Database

The following options are available only when connected to a Teradata database:

• Comment on Output Table — Enter a comment without quotes (up to 255 characters) to be applied to the output table or view via an SQL COMMENT statement. It may contain substitution parameters for the output category (Score, ADS, Stats or Other), project name and/or analysis name (using the tags , and , respectively). Note that the default value of this field may be set on the Defaults tab of the Preferences dialog box that is available from the Tools > Preferences menu option. This field is ignored for volatile tables.
• Collect Statistics on Primary Index of Output Table — Option to collect statistics on the primary index of the output table via an SQL COLLECT STATISTICS statement. This request is performed for both permanent and volatile tables except as already noted.
  ∘ Use a Sample in collecting the statistics — Option to specify that a system-selected percentage of rows be used in collecting the statistics rather than performing a full-table scan.
• Further SQL commands to execute — Free-form SQL commands can be entered here to be executed against the output table during post-processing. It is important to substitute the symbols <T> in place of the name of the output table in these commands so that if the output table or database changes in this analysis, in a Refresh analysis or in the Model Manager web-based application, the correct output table name will be in force. This request is performed for both permanent and volatile tables except as already noted.

Note: If stored procedure output is requested for the analysis or for a Refresh chained to the analysis, the last SQL command must be terminated by a semi-colon.

RESULTS Tab

The RESULTS tab is empty until the analysis is executed. The RESULTS tab contains one or more of the following sub-tabs, each of which selects the results to browse. Subsequent sections describe the results pertinent to each analysis.
• Reports — The Reports tab includes all the messages generated by the analysis as well as report tables generated by the analysis. Typically, these are statistical reports generated by the analytical algorithms or scoring functions.
• SQL — The SQL tab changes the form to show any SQL generated by the analysis. By default, all SQL statements are shown; however, statements can be shown individually through the Show Statement pull-down list. The following options are provided, both via buttons and right-click menu options.
  ∘ The Copy function copies all the highlighted SQL text to the Clipboard.
  ∘ The Find function provides an option to search for whole words and an option to search for a case specific match, along with an option to start the search from the beginning or from the current cursor position.
  ∘ The Select All function selects (highlights) the entire SQL text.
  ∘ The Create SQL Node function places the entire SQL text into a new Free Form SQL analysis in the same project.
• Data — The Data tab changes the form to show any data generated by the analysis. If tables are created, a Load button must be clicked to show the data generated within a data grid. Once this is done, or if a SELECT statement was requested, the data grid is displayed. Note that if the Maximum result rows to display field in Tools > Preferences > Limits is set to a value less than the rows in the result table, only the rows specified by the limit are shown. Scoring is a special case in which a sample of the number of rows specified by the limit is taken. If the analysis generated multiple tables, each can be viewed via the result table pull-down list. The following buttons provide options when viewing results data. These same options are also available as right-click menu options.
  ∘ Edit — Clicking Edit provides a pop-up menu with the following options:
    ▪ The Copy option copies all the highlighted SQL text to the Clipboard.
    ▪ The Find option provides an option to search the current column only (i.e., the column that contains the currently selected cell), and an option to search for a case specific match, along with an option to start the search from the beginning or the current cursor position. If the specified text is found in a cell, that cell is selected. In order to further find the desired text within the selected cell, the View Cell function can be used.
    ▪ The Select All option selects (highlights) the entire SQL text.
    ▪ The View Cell option copies the text within a cell in the results data grid into a text viewer for easier viewing. The text viewer is resizable and provides a “word wrap” option, as well as the right-click menu options Copy, Find and Select All. The View Cell option can also be invoked by double-clicking on a cell. (The View Cell option is useful when the cell contains more data than can be easily viewed or when it contains embedded line feed characters, such as with the results of a SHOW TABLE command.) Note that the View Cell window can be kept open while different cells are selected in succession and their contents viewed in the open window.
    ▪ The Show Table/View option is available only when the analysis has created an output table or view.
      It executes an SQL SHOW TABLE or SHOW VIEW command against the output table or view and displays it in a text viewer. As with the View Cell window, the text viewer is resizable and provides a “word wrap” option, as well as the right-click menu options Copy, Find and Select All.
  ∘ Format — The data grid supports both numeric and date formatting options for each column in the data set. Two values are used to format numeric data—the ‘0’ and the ‘#’ characters.
    ▪ The ‘0’ (zero) character forces the formatting for a number regardless of the precision of the number. For instance, the number ‘1’ with the format ‘0.00’ would become ‘1.00’ and the number ‘1.2345’ would become ‘1.23’.
    ▪ The ‘#’ (pound) character will display a digit only if it exists. For example, the number ‘1’ with the format ‘#.##’ would become ‘1.’ and the number ‘1.2345’ would become ‘1.23’. An exception to this is that with scientific notation, leading '#' characters behave like '0' characters and don't work as expected for decimal or exponent places.
    Date and timestamp formatting use the characters ‘y, M, d’ for date formats, and ‘H, h, m, s’ for time formats.
    ▪ The year format character ‘y’ may appear as two or four characters.
    ▪ The ‘M’ character for month may appear as one, two, three or four characters, giving a one/two digit, two digit, three character abbreviated or full name version of the month, respectively.
    ▪ The ‘d’ character for day may appear as one, two, three or four characters, giving a one/two digit, two digit, three character abbreviated or full name version of the day of the week, respectively.
    The 'H', 'h', 'm' and 's' characters may appear as one or two characters.
    ▪ The 'H' character gives an hour value in a 24-hour format.
    ▪ The 'h' character provides an hour value in a 12-hour format.
    ▪ The 'm' character represents minutes and the 's' character seconds.
    ▪ Decimal seconds may be shown by following 's' or 'ss' with decimal places, as in 'ss.00'.

The letters 'tt' after a timestamp format add AM/PM as appropriate. Any separator may be used to separate the formatting, although the most popular for dates are the ‘/’ and ‘-’ characters. Formatting of result columns of type Time is not provided.

Several built-in formats are automatically included as Available Format Strings. These may vary depending on locale and the standard numeric and date formats for that locale. Custom formats may also be entered, which are automatically added to the Available Format Strings and retained as long as the Results window remains open. The following table lists the types of built-in formatting based on a U.S. English locale.

Built-In Formatting

Format String                     Column Type       Description
(None)                            All               No formatting
0                                 Numeric           Integer format
0.00                              Numeric           Decimal format with standard decimal places
#,##0                             Numeric           Integer format with standard grouped digits
#,##0.00                          Numeric           Decimal format with standard grouped digits and decimal places
0%                                Numeric           Integer percentage
0.00%                             Numeric           Decimal percentage with standard decimal places
0.00E+00                          Numeric           Scientific notation with standard decimal places and one higher order digit
##0.00E+00                        Numeric           Scientific notation with standard decimal places and 3 higher order digits
M/d/yyyy                          Date/Timestamp    Brief date format
M/d/yyyy h:mm tt                  Date/Timestamp    Brief date and time format
dddd, MMMM dd, yyyy               Date/Timestamp    Complete date format
dddd, MMMM dd, yyyy h:mm:ss tt    Date/Timestamp    Complete date and time format
yyyy'-'MM'-'dd'T'HH':'mm':'ss     Date/Timestamp    Date and time in a sortable format
MMMM dd                           Date/Timestamp    Month name and day
MMMM, yyyy                        Date/Timestamp    Month name and year
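The difference between the ‘0’ and ‘#’ placeholders can be illustrated with a short Python sketch. This is not the product's formatter: `apply_simple_format` is a hypothetical helper written for this illustration, it assumes non-negative numbers, and it handles only plain patterns such as ‘0.00’ or ‘#.##’ (no digit grouping, percentages or scientific notation). It is written to reproduce the four examples given in the text above.

```python
def apply_simple_format(value, pattern):
    """Emulate a plain numeric pattern made of '0', '#' and one optional '.'."""
    int_pat, point, frac_pat = pattern.partition(".")

    # Round to the number of decimal places named in the pattern.
    text = f"{value:.{len(frac_pat)}f}"
    int_part, _, frac_part = text.partition(".")

    # '0' places are always shown; trailing '#' places are dropped when zero.
    min_frac = frac_pat.rfind("0") + 1
    while len(frac_part) > min_frac and frac_part.endswith("0"):
        frac_part = frac_part[:-1]

    # A pattern with no '0' before the point shows no leading zero digit.
    if "0" not in int_pat and int_part == "0":
        int_part = ""
    int_part = int_part.zfill(int_pat.count("0"))

    # The decimal point is kept whenever the pattern contains one.
    return int_part + point + frac_part


for number, fmt in [(1, "0.00"), (1.2345, "0.00"), (1, "#.##"), (1.2345, "#.##")]:
    print(number, fmt, apply_simple_format(number, fmt))
```

The helper keeps the decimal point even when no fractional digits remain, matching the documented result ‘1.’ for the number ‘1’ with the format ‘#.##’.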

Clicking Format displays the following dialog.

Format Columns

Highlight one or more columns to format, and either type in the desired format within the Selected Format String text box or select it from the Available Format Strings list to apply it to the selected columns. Click OK to accept all changes. Note that the first highlighted column in the list of columns determines the initial entry in the Selected Format String text box.
  ∘ Sort — The data grid supports an N-Column sort. Clicking Sort displays the following dialog.

Sort Columns

Select the column(s) you wish to sort the data set by, either by dragging and dropping from Available Columns to Selected Columns, or highlighting the desired column(s) and clicking on the right arrow. Once the appropriate columns have been selected, use the up/down arrows to specify the order in which the data should be sorted, with the first column listed taking precedence. Double-clicking on any given column changes the sort mode from Asc(ending) to Desc(ending). The Sort Columns dialog may be resized by clicking on one of the edges or corners and moving the mouse while holding the button down.
  ∘ Export — Clicking Export exports the data in the data grid to a new instance of a Microsoft Excel worksheet, which the user may then use to manipulate, print and/or save the data. If no rows are selected, all of the displayed rows are exported. If one or more rows are selected, only the selected rows are exported. If more than one Results table is available, an option is provided to export all of the tables at once rather than just the table being viewed. When this option is selected, all of the data available for all of the tables in the Results viewer is exported to a single worksheet, with blank rows separating the rows for each results table. (If this option is selected, any row selections in the table being viewed are ignored, taking all of the available rows.)

Note: The Results viewer limits the number of rows available for each result table to the limit determined by the Tools > Preferences option Maximum result rows to display on the Limits tab, a value that is by default 1000 rows per result table. This value must be increased in order to export more data to Excel. Note, however, that Excel will not accept more than 65,536 rows or 256 columns in total, and will ignore rows and/or columns beyond these limits (after loading the rows or columns up to these limits).

• Graphs — The Graphs tab displays any visualizations created by an analysis.
  ∘ Drill Down — The graphs for the Descriptive Statistics functions Frequency, Histogram, Values, Statistics and Data Explorer additionally have graph drill down capabilities. Drill down queries can be saved with the analyses that create them, but are not retained when an analysis is exported to a file. Available graphs and their drill down capabilities, if any, are described in detail in the sections that describe results graphs for the analyses that provide them.

Note: Drill Down will not work properly when the table being analyzed is a volatile table created by a referenced analysis that doesn't create an output table or view.

The following right-click menu options are available from the display of drill down data rows after drill down is performed. They can be used to copy data to Microsoft Excel or to the Clipboard. When copying data to the Clipboard, columns are delineated by tab characters.
    ▪ Export to Excel - All Rows
    ▪ Export to Excel - Selected Rows
    ▪ Copy (to Clipboard) - All Rows
    ▪ Copy (to Clipboard) - Selected Rows

In addition, all Teradata Warehouse Miner graphs have the following properties that can be selected by right-clicking on the graph to bring up a menu of options:
  ∘ Maximize — With the exception of the Decision Tree Browser graphical object, all Teradata Warehouse Miner graphs can be maximized to the full size of the screen. In order to maximize a graph, right-click on a graph and select the Maximize option.
  ∘ Print — All graphs can be printed directly from the graph option menu by selecting the Print option. If this option is selected, a print dialog is displayed.

  ∘ Export — Graphs can be exported to the Microsoft Clipboard, saved to a file, or printed in a variety of formats and styles through the following dialog. The dialog dynamically displays appropriate choices based on the selected export format, as illustrated below.

Graph Export

This dialog is shown below: Exporting dialog

∘ Export Format — In general, graph images can be saved to file or Clipboard using either MetaFile, BMP or JPG formats, and printed using the MetaFile format. All graph aspect ratios (width / height) need to remain between 0.333 and 10.0 to preserve image quality. Default image sizes reflect currently displayed graph dimensions.

MetaFiles are images compressed using a Microsoft vector-based codec that allows images to be resized without distortion or loss of detail when the aspect ratio is maintained. The MetaFile ‘No Specific Size’ option creates images using the current graph aspect ratio (width / height). If the aspect ratio needs to be modified, then graph size dimensions can be specified as well. MetaFiles can be printed either full-page or by specifying a size in millimeters, inches or points. When the Print option is selected, clicking Export displays an additional printer dialog so the target printer can be selected. Bitmaps and JPEG encoded images saved to file or Clipboard can be sized using Pixel dimensions.
    ▪ MetaFile — Export the graph as a Microsoft Metafile.
    ▪ BMP — Export the graph as a bitmap file.
    ▪ JPG — Export the graph as a JPEG file.
    ▪ Text/Data Only — Export the graph as text only. Only Histogram and Frequency graphs support the exporting of data and column labels to text files (and these only when not accessed from the Data Explorer thumbnail graphs). Either all of the data displayed in the flex grid (graph options tab) or just the selected rows can be exported. Exported text files store the data in list formats (with tab or comma delimiters) or alternately in table formats.
  ∘ Export Destination
    ▪ ClipBoard — Export the graph to the Microsoft Clipboard.
    ▪ File — Export the graph to a file.
    ▪ Printer — Export the graph to a system attached printer.
  ∘ Graph Size — Specify the Graph Size of the Windows Metafile, in either:
    ▪ No Specific Size — A resolution independent dimensioning of the graph image.
    ▪ Millimeters — Specify the image size in millimeters.
    ▪ Inches — Specify the image size in inches.
    ▪ Points — Specify the image size in points.
    Or, specify the Graph Size of the BMP or JPEG, in:
    ▪ Pixels — Specify the image size and aspect ratio in pixels.
    Or, specify how the text should be exported:
    ▪ List
      - Tab — Export the data in the data grid in a tab delimited format.
      - Comma — Export the data in the data grid in a comma delimited format.
    ▪ Table — Export the data in the data grid as a Table.
  ∘ Data to Export
    ▪ All Data — Export all the data in the data grid.
    ▪ Selected Data — Export just the data currently selected in the data grid.
  Click on the Export button to perform the selected export operation, or Cancel to cancel the Save Graph and return to Teradata Warehouse Miner.
  ∘ Zoom Out (All Data) — Graphing very large amounts of data often requires the graphs to have scroll bars. If this is not the desired visualization, and you want to see all the data within a single screen snapshot, the Zoom Out (All Data) option can be used. This produces a compressed image that can be displayed without scrolling.

Executing TWM Analyses

Teradata Warehouse Miner projects and analyses can be executed as they are created from the front-end, or saved and executed at a later date, either from the front-end or in “batch-style” mode.

On-Line Execution

Analyses or projects can be executed from the front-end by highlighting the project or analysis and then clicking Run, selecting Project Menu > Run, right-clicking on the analysis name and selecting the Run option, or pressing the F5 function key. If an analysis is executing, other analyses can still be executed in any open project. While an analysis is being executed, the analysis name is highlighted gray, an “Executing” tag is displayed to its right, and status messages begin to appear in the Execution Status area.

Projects can be executed in the same manner. Highlight a project and select one of the execution mechanisms specified above. Analyses are executed in the sequence in which they appear in the project (with the possible exception that if an analysis refers to one or more analyses for its input, the analyses it refers to are executed before it is executed). Project execution ends when the last analysis within the project completes. If an error occurs during the execution of the project, all execution normally stops, and subsequent analyses are not executed. The user may, however, request that project execution continue on error by setting the Continue project execution upon analysis error option on the Execution tab, accessed from the Tools > Preferences menu option.

Batch Execution

Teradata Warehouse Miner projects and analyses can be executed in batch mode using an XML file as input to the TWM application. The projects to be executed can be either existing projects or new projects as defined in the XML. In addition, new analyses of type Data Explorer, Refresh Analysis, Frequency, Histogram, Statistics and Values can be defined. All analyses that have output properties defined can use the XML structure to modify an existing analysis in order to change any of the output properties.

To invoke execution of Teradata Warehouse Miner, use the command prompt and enter the following:

    \TWM.exe” “Fully Qualified XML Batch File Name”

The various elements and attributes are described below.
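When batch runs are launched from a script rather than typed at the command prompt, the same two-part command line can be assembled programmatically. The following Python sketch is only an illustration: the helper names `build_batch_command` and `run_batch` are hypothetical, the caller must supply the actual location of TWM.exe, and no meaning is assumed here for the process return code.

```python
import subprocess
from pathlib import PureWindowsPath
from typing import List


def build_batch_command(twm_exe: str, batch_xml: str) -> List[str]:
    """Return the argument list: the TWM executable, then the batch XML file."""
    xml_path = PureWindowsPath(batch_xml)
    # The batch file name is expected to be fully qualified.
    if not xml_path.is_absolute():
        raise ValueError(f"batch XML file name is not fully qualified: {batch_xml}")
    return [twm_exe, str(xml_path)]


def run_batch(twm_exe: str, batch_xml: str) -> int:
    """Start TWM with the batch file and wait for it to exit (Windows only)."""
    # Passing a list lets subprocess quote paths that contain spaces.
    return subprocess.run(build_batch_command(twm_exe, batch_xml)).returncode


# Both paths below are placeholders, not the product's real locations.
print(build_batch_command(r"C:\TWM\TWM.exe", r"C:\Temp\MyBatch.xml"))
```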

ErrorLog

The ErrorLog element is optional. If it is not specified and one or more errors occur, they will be logged to the TWM application folder, C:\Documents and Settings\\Local Settings\Application Data\TWM, with the default name of TWMBatchErrorLog.txt. The ErrorLog element, if specified, takes a single attribute called fileName, which is the fully qualified name of the file to which errors will be logged. If the path specified in fileName is invalid, then any errors will be logged to the TWM application folder under the default file name.

Silent

The Silent element is optional and is used to specify that the run will be in silent mode: the front end will not be displayed, and the product will exit when all of the projects specified in this XML file have been executed. This is the preferred method for most batch processing. The Silent element takes a single attribute, fileName, that is optional but recommended, and specifies the fully qualified file that will record output from the silent session. Note, however, that if the silent log path is invalid, TWMSilentLog.txt is created in the TWM application directory.

Note: In silent mode, some analysis options such as not storing the output in a table, view or procedure, or only generating (but not executing) the SQL have limited or no value because, in these cases, the results will never be seen.

If the Silent element is not specified, the projects specified will be loaded and the last loaded project may optionally be executed in interactive mode. Note that the XML elements to define new analyses or to modify existing analyses are not supported in non-silent mode, nor is the execution of individual analyses or the run-to-end option.

Connection

The Connection element takes multiple attributes: name, type, userid, password, account, metadatadatabase, resultdatabase, statisticaltestdatabase, publishdatabase, advertisedatabase and alwaysadvertise (=true/false). The name and type attributes are required.
• The name attribute is the name of the connection for the particular type of connection being requested.
• The type attribute corresponds to the four different types of connection that TWM supports: System, User, File, and Simple.

The userid and password attributes are needed only if the information cannot be obtained from the connection. The database attributes are optional as well. They only need to be set if the user wishes to override the ones specified in the Connection Properties. The account attribute is completely optional. If the connection type is System or User, the connection name must match a defined ODBC datasource of the appropriate type (System or User). If the specified connection name is not found in the TWM connections saved from interactive sessions, a warning is given and default values are used for metadatadatabase, resultdatabase, statisticaltestdatabase, publishdatabase and advertisedatabase, unless overridden through the connection attributes listed above.

An example of specifying a user connection using the existing data source name “MyConnection”, which defines the userid and password in the data source, and overriding the result database to “MyResultDatabase” follows:
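A Connection element of the kind described in this section can be generated with Python's standard `xml.etree.ElementTree` module. This is a sketch under stated assumptions rather than the guide's own example: the element and attribute names come from the description above, while the spelling of the type value ("User") and the element's placement within the batch file are assumptions.

```python
import xml.etree.ElementTree as ET

# name and type are the two required attributes; userid and password are
# omitted because the data source is assumed to supply them.
connection = ET.Element(
    "Connection",
    {
        "name": "MyConnection",                # existing ODBC data source name
        "type": "User",                        # assumed spelling of the User type
        "resultdatabase": "MyResultDatabase",  # override of the result database
    },
)

connection_xml = ET.tostring(connection, encoding="unicode")
print(connection_xml)
```

Generating the element this way, instead of editing the file in an XML editor, keeps unintended formatting characters out of the parameter file.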

Project

The Project element takes a name attribute, and the following optional attributes: new, execute, save, runToEnd, analysisName and continueOnError. This element tells TWM to load the specified project by name from the current connection.

If the new attribute is used, a new project with the specified name will be created. Note that a new project may contain only new analyses of type Frequency, Histogram, Statistics, Values and/or Data Explorer. Analyses of type Refresh Analysis that are specified as new need to be referenced from an existing project, not a new one, because they must reference an existing analysis in the same project. Also, only one new project is allowed to be created in batch mode execution. If the new attribute is set, then it is assumed that all analyses specified for the project are new analyses.

If the save attribute is set and the project is running in silent mode, then the project will be saved in the TWM metadata. If save is set to false, the analysis will not be saved and will not be available to be accessed for future runs. The default for the save attribute is true.

The continueOnError attribute will continue execution of the project’s analyses if the execute attribute has been specified and one or more of the project's analyses fails to complete properly. If this attribute is not specified, then the Continue project execution on analysis error preferences option setting is used.

The runToEnd attribute allows the project to start execution at a specific analysis and continue to the last analysis that is defined for that project. If the runToEnd attribute is set, then the analysisName attribute must also be provided to specify the analysis at which execution is to begin.

To load an existing project, execute it and save it, use:

The user can specify an individual analysis to be executed by setting the analysisName attribute (causing the first analysis in the project with the requested name to be selected).
If more than one analysis is to be run in a project, then the user would need to specify the project and analysis pair for each analysis to be run. For example: When an analysis name is specified with the runToEnd attribute, all analyses following the requested analysis in the project are executed after the requested analysis. In this case, however, analysis specifications for new and modify cannot also be used. If an analysisName is not specified, then all the analyses in the project will be executed. There is an alternative method, described below, that enables the user to specify and execute multiple analyses within one or more projects with the ability to modify output properties or run the analysis without modification. Examples of this type of batch execution are provided in Examples of Batch XML Files.
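The Project attribute combinations described above can be sketched with Python's standard `xml.etree.ElementTree` module. The helper `project_element` and the project and analysis names are hypothetical; the attribute names come from the text, and writing boolean values as "true"/"false" follows the convention shown for other attributes in this chapter. Whether execute must accompany analysisName is an assumption of this sketch.

```python
import xml.etree.ElementTree as ET


def project_element(name, **options):
    """Build a Project element; boolean options are written as true/false."""
    # The text requires analysisName whenever runToEnd is set.
    if options.get("runToEnd") and "analysisName" not in options:
        raise ValueError("runToEnd requires analysisName")
    attrs = {"name": name}
    for key, value in options.items():
        attrs[key] = str(value).lower() if isinstance(value, bool) else str(value)
    return ET.Element("Project", attrs)


# Load an existing project, execute it and save it.
run_all = project_element("MyProject", execute=True, save=True)

# Run two individual analyses: one Project/analysis pair per analysis.
pairs = [
    project_element("MyProject", execute=True, analysisName=analysis)
    for analysis in ("MyFrequency", "MyHistogram")
]

# Start at one analysis and continue to the last analysis in the project.
from_here = project_element(
    "MyProject", execute=True, runToEnd=True, analysisName="MyFrequency"
)

for element in [run_all, *pairs, from_here]:
    print(ET.tostring(element, encoding="unicode"))
```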

Analysis

The Analysis element, if specified within a Project element, takes a name attribute, and the following optional attributes: new, modify, and type. If the new attribute is specified, then the type of the analysis to be created must be set using the type attribute, along with the analysis parameters that will define this new analysis. Analysis types of Data Explorer, Refresh Analysis, Frequency, Histogram, Statistics and Values are supported as new analyses. Note that if the modify attribute is specified, the first analysis with a matching name attribute is selected if there is more than one. Also, a given analysis can only be executed once within a single project definition.

Batch File Validation

Validation of the XML attributes and elements is attempted wherever possible. An invalid path for the error log is logged to the silent log, if there is one, and to the default error log. The Connection elements themselves, if invalid, will appear as ODBC errors in the error log.

Note: Take care not to introduce formatting characters into the XML parameter file through the use of an XML editor.

Examples of Batch XML Files

Several examples of batch XML files are given here, showcasing types of analyses that may not be available in all Teradata Warehouse Miner products, specifically in Teradata Profiler and Teradata ADS Generator. An example XML file to modify the output attributes and post processing attributes of an existing Decision Tree score analysis follows.

An example XML file to create a new project and a new Data Explorer analysis follows. See the Data Explorer documentation in the help file or Data Explorer for the definition of each of the XML elements and attributes relating to the Data Explorer analysis.


An example XML file that specifies multiple existing projects, modifies the output properties of the first before executing it, and just executes the second follows.

    sql="sel * from twm.MyBuildADS order by 1;" collectStats="true"/>


Supplied Example Batch Files

Examples of batch execution files are provided with the product in conjunction with the help system. The Help Menu item to Import Tutorial Projects makes available projects that can be imported and optionally executed provided the demonstration tables have been installed (see the TWM program item Load Demonstration Data). A folder containing examples of batch XML files is provided in the same location as the tutorial import files. These XML files may need to be modified before they can be used. They assume that a ‘system’ data source with name ‘twm batch demo’ has been created, and they also use the C:\Temp folder to store messages and any errors that result from the batch run.

To execute one of these example batch scripts, open a command window and change the current directory to the location of the tutorial scripts. For example:

    cd "\Scripts\Tutorials\BatchXMLFiles"
    ..\..\..\TWM.exe ModifyDataExplorerColumns.xml

Modifying Output Batch Properties And Post Processing Properties

For all analyses in the ADS, Descriptive Stats, Reorganization, Scoring and Statistical Tests categories, batch output properties and post processing properties can be modified using the batch XML interface. All of these analyses use the batch properties defined in this section, except for Data Explorer, Sample, Scoring and PMML Scoring, which have their own definitions of these properties, given in the sections that follow.

The output and post processing properties used to modify an existing analysis in the ADS, Descriptive Stats, Reorganization and Statistical Tests categories are defined below. Note that post processing properties are only defined when the output style is create table or create view, unless otherwise stated. Also, OutputStyle should be set before any other output properties if it is being changed or if the analysis is new. Most output properties only apply when the OutputStyle option is CreateTable or CreateView, with exceptions noted below.

Note: The convention of representing the output table as '<T>' requires some awkward syntax in XML, because the quotation marks and angle brackets must be written as XML entities. In the following example, the trailing SQL

create index("income") on <T>;

is represented as

trailingSql="create index(&quot;income&quot;) on &lt;T&gt;;"/>
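The escaping described in the note can be produced mechanically rather than by hand. The following is a minimal Python sketch, assuming only that the trailing SQL is written into an XML attribute delimited by double quotes. The helper name trailing_sql_attribute is hypothetical and is not part of the product.

```python
from xml.sax.saxutils import escape


def trailing_sql_attribute(sql: str) -> str:
    """Return a trailingSql="..." attribute with XML-special characters escaped.

    Hypothetical helper for illustration; not part of Teradata Warehouse Miner.
    """
    # escape() rewrites &, < and >. The extra mapping also rewrites double
    # quotes, which would otherwise end the attribute value early.
    escaped = escape(sql, {'"': "&quot;"})
    return f'trailingSql="{escaped}"'


print(trailing_sql_attribute('create index("income") on <T>;'))
# prints: trailingSql="create index(&quot;income&quot;) on &lt;T&gt;;"
```

An XML parser reverses these substitutions when it reads the attribute, so the entities decode back to the original SQL text containing <T>.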

An XML example of modifying the Output Properties of an existing ADS analysis follows.

Teradata Database

The output properties and their possible values when connected to a Teradata database follow:
• OutputStyle — “CreateTable”, “CreateView”, “Select” or “Explain”
• OutputDatabase — name of existing database
• OutputName — name of output table to be created
• GenerateSqlOnly (valid with all values of OutputStyle) — “true” or “false”
• AdvertiseOutput — “true” or “false”
• AdvertiseNote — optional comment or note for advertised output (only used if AdvertiseOutput is “true” or if the Connection property AlwaysAdvertise is “true”)

The post processing properties, valid only for outputStyle “CreateTable” or “CreateView”, and their possible values are:
• TrailingSql — the post processing SQL command(s) to execute.
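The Teradata property list above can be turned into a simple pre-flight check before a batch file is written. The following Python sketch is an illustration only: the function name check_output_properties and the dictionary layout are assumptions, and the check covers only the values and the TrailingSql restriction stated in this section.

```python
# Enumerated values taken from the Teradata Database property list above.
ALLOWED_VALUES = {
    "OutputStyle": {"CreateTable", "CreateView", "Select", "Explain"},
    "GenerateSqlOnly": {"true", "false"},
    "AdvertiseOutput": {"true", "false"},
}
# Properties whose value is free text (names, notes, SQL).
FREE_TEXT = {"OutputDatabase", "OutputName", "AdvertiseNote", "TrailingSql"}
# Post processing is only defined for these output styles.
POST_PROCESSING_STYLES = {"CreateTable", "CreateView"}


def check_output_properties(props: dict[str, str]) -> list[str]:
    """Return a list of problems found in a set of output properties.

    Hypothetical helper; the product performs its own validation.
    """
    problems = []
    for name, value in props.items():
        if name in ALLOWED_VALUES:
            if value not in ALLOWED_VALUES[name]:
                problems.append(f"{name}: unexpected value {value!r}")
        elif name not in FREE_TEXT:
            problems.append(f"{name}: unknown property")
    style = props.get("OutputStyle")
    # Only judge TrailingSql when the style is given alongside it; the style
    # already stored in an existing analysis is not visible to this check.
    if "TrailingSql" in props and style is not None and style not in POST_PROCESSING_STYLES:
        problems.append("TrailingSql: only valid with CreateTable or CreateView")
    return problems


print(check_output_properties({"OutputStyle": "Select", "TrailingSql": "x"}))
```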

Aster Database

The output properties and their possible values for Descriptive Statistics functions when connected to an Aster database follow:
• OutputStyle — “CreateTable”, “CreateView”, “Select” or “Explain”
• OutputSchema — name of existing schema
• OutputName — name of output table to be created
• GenerateSqlOnly (valid with all values of OutputStyle) — “true” or “false”
• AdvertiseOutput — “true” or “false”
• AdvertiseNote — optional comment or note for advertised output (only used if AdvertiseOutput is “true” or if the Connection property AlwaysAdvertise is “true”)

The post processing properties, valid only for outputStyle “CreateTable” or “CreateView”, and their possible values are:
• TrailingSql — the post processing SQL command(s) to execute.

The output properties and their possible values for ADS and Reorganization functions (including Sample) when connected to an Aster database are:
• OutputStyle — “CreateTable”, “CreateView”, “Select” or “Explain”
• OutputSchema — name of existing schema
• OutputName — name of output table to be created
• GenerateSqlOnly (valid with all values of OutputStyle) — “true” or “false”
• AdvertiseOutput — “true” or “false”
• AdvertiseNote — optional comment or note for advertised output (only used if AdvertiseOutput is “true” or if the Connection property AlwaysAdvertise is “true”)
• TableUsage — “FACT” or “DIMENSION”
• TableDistribution — “BYHASH” or “BYREPLICATION”
• TablePersistence — “PERMANENT” or “ANALYTIC”
• TableStorage — “ROW” or “COLUMN”
• TableCompression — “LOW”, “MEDIUM”, “HIGH” or “NONE”
• HashDistributionColumn — name of column to hash distribute by

The post processing properties, valid only for outputStyle “CreateTable” or “CreateView”, and their possible values follow:
• TrailingSql — the post processing SQL command(s) to execute.

Modifying Output Batch Properties For Scoring Algorithms

Modifying scoring analyses other than PMML scoring is defined below. Post Processing definitions are the same as defined above. The output properties for scoring analyses and their possible values are:
• OutputDatabase — name of existing database
• OutputName — name of output table to be created
• Fallback — “true” or “false”
• GenerateSqlOnly — “true” or “false”
• AdvertiseOutput — “true” or “false”
• AdvertiseNote — optional comment or note for advertised output (only used if AdvertiseOutput is “true” or if the Connection property AlwaysAdvertise is “true”)

An XML example of modifying the Output Properties of an existing Logistic Score follows.

Modifying Output Batch Properties For PMML Scoring

The output properties needed to modify an existing PMML scoring analysis follow. Post Processing properties are the same as defined above.
• OutputDatabase — name of existing database
• OutputName — name of output table to be created
• Fallback — “true” or “false”
• GenerateSqlOnly — “true” or “false”
• MaxStatementSize (valid if GenerateSqlOnly is “true”) — the integer value of the maximum statement size
• GenerateStoredProcedure (no longer valid) — This parameter is no longer valid. Stored procedure output for PMML may now be requested using a Refresh Analysis, referencing a scoring analysis.
• StoredProcedureName (no longer valid) — This parameter is no longer valid. See GenerateStoredProcedure above.
• AdvertiseOutput — “true” or “false”
• AdvertiseNote — optional comment or note for advertised output (only used if AdvertiseOutput is “true” or if the Connection element AlwaysAdvertise attribute is “true”)

An XML example to modify an existing PMML score follows:

collectStats="true" outputComment="My Comment"/>

Modifying Output Batch Properties And Post Processing Properties For Sample Analysis

Listed below are the output properties to modify an existing Sample analysis. The output properties and their possible values follow:
• OutputStyle — “Select”, “CreateTable”, “CreateMultipleTables” or “CreateMultipleViews”
• OutputDatabase — name of existing database (valid for “CreateTable”, “CreateMultipleTables”, “CreateMultipleViews”)
• OutputName — name of output table to be created (valid for “CreateTable”, “CreateMultipleViews”)
• Fallback (valid for “CreateTable”, “CreateMultipleTables”) — “true” or “false”
• Multiset (valid only for “CreateTable”, “CreateMultipleTables”) — “true” or “false”
• GenerateSqlOnly — “true” or “false”
• AdvertiseOutput (valid for all Create options) — “true” or “false”
• AdvertiseNote (valid for all Create options) — optional comment or note for advertised output (only used if AdvertiseOutput attribute is “true” or if the Connection element AlwaysAdvertise attribute is “true”)

An XML example to modify the SampleOutputProperties to “CreateTable” follows.

For “CreateMultipleTables”, each group of tables is defined by the following XML structure.

For “CreateMultipleViews”, each group of views is defined by the following XML structure.

The post processing properties, valid only for outputStyle “CreateTable”, “CreateMultipleTables” or “CreateMultipleViews”, have the following possible values:
• OutputComment — the output comment string to be used
• CollectStats (valid for “CreateTable”, “CreateMultipleTables”) — “true” or “false”
• CollectSample (valid for “CreateTable”, “CreateMultipleTables”) — “true” or “false”
• TrailingSql — the post processing SQL commands to execute
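Because each Sample property is valid only for certain output styles, the restrictions listed in this section can be collected into a lookup table. The Python sketch below is one way to do that. The names VALID_STYLES and invalid_properties are hypothetical, and the table simply transcribes the "valid for" notes above, treating properties with no stated restriction as valid for every style.

```python
CREATE_STYLES = {"CreateTable", "CreateMultipleTables", "CreateMultipleViews"}
TABLE_STYLES = {"CreateTable", "CreateMultipleTables"}
ALL_STYLES = CREATE_STYLES | {"Select"}

# Property name -> output styles for which the section says it is valid.
VALID_STYLES = {
    "OutputDatabase": CREATE_STYLES,
    "OutputName": {"CreateTable", "CreateMultipleViews"},
    "Fallback": TABLE_STYLES,
    "Multiset": TABLE_STYLES,
    "GenerateSqlOnly": ALL_STYLES,
    "AdvertiseOutput": CREATE_STYLES,
    "AdvertiseNote": CREATE_STYLES,
    # Post processing properties.
    "OutputComment": CREATE_STYLES,
    "CollectStats": TABLE_STYLES,
    "CollectSample": TABLE_STYLES,
    "TrailingSql": CREATE_STYLES,
}


def invalid_properties(output_style: str, property_names: list[str]) -> list[str]:
    """Return the properties that are not valid for the given Sample output style."""
    if output_style not in ALL_STYLES:
        raise ValueError(f"unknown OutputStyle: {output_style!r}")
    # Names missing from the table are reported as invalid as well.
    return sorted(
        name for name in property_names
        if output_style not in VALID_STYLES.get(name, set())
    )


print(invalid_properties("CreateMultipleViews", ["OutputName", "CollectStats", "TrailingSql"]))
```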

Creating a New or Modifying an Existing Data Explorer Analysis

In order to create a new or modify an existing Data Explorer analysis, the following analysis properties are defined.

Analysis Properties
• Type — “Data Explorer” (needed only if “new” is “true”)
• Name — the name of the new Data Explorer analysis or the name of an existing Data Explorer analysis to modify
• New — “true” (needed to define a new Data Explorer analysis)
• Modify — “true” (needed to modify an existing Data Explorer analysis)

InputDataProperties need to be defined if this is a new analysis. InputDataProperties can take a list of tables (multi-table) or a list of columns. If the analysis is being modified, the InputDataProperties can be redefined. They will replace the existing set of tables or columns that were originally defined for the analysis.

If multi-table input is used, the Table Properties need to define the name of the tables and databases and an optional where clause associated with each table. For each table to be defined, the following table attributes are available.

Input Data Properties (multi-table input)
• Name — the name of the existing table
• Database — the name of the database that the table is defined in
• WhereClause (optional) — the value of the where clause associated with the database/table

An XML example to define multi-tables for a new analysis follows.

If column input is used, the Column Properties need to define the name of the tables and databases. For each column to be defined, the following column attributes are available.

Input Data Properties (column input)
• Name — the name of the column
• Table — the name of the table associated with the column
• Database — the name of the database that the table is defined in

An XML example to define columns for a new analysis follows.

Input Data Analysis properties can be specified to override the default values for a new Data Explorer analysis or to modify the values of an existing Data Explorer analysis.

Input Data Analysis Properties
• ComputeUniques — “true” to get the unique values for the values analysis. Default value is “false”.
• StatisticsOptions — “minimumSet” or “all” for the statistics analysis. Default is “minimumSet”.
• StatisticalMethod — “population” or “sample” used for the statistics analysis. Default is “population”.
• HistogramStyle — “bins” or “quantiles” used for the histogram analysis. Default is “bins”.

An XML example to define InputDataAnalysis properties follows.

Output properties can be specified to override the default names of the output tables for the Data Explorer analysis, as well as defining or modifying other output properties.

Output Properties
• OutputDatabase — name of existing database
• OutputFrequencyTableName — name of frequency output table to be created
• OutputHistogramTableName — name of histogram output table to be created
• OutputStatisticsTableName — name of statistics output table to be created
• OutputValuesTableName — name of values output table to be created
• Fallback — “true” or “false” (default is false)
• Restart — “true” or “false” (default is false)
• StoredProcedureEnabled — “true” or “false” (default is false)
• ProcedureName — name of stored procedure to be created after execution, if StoredProcedureEnabled is set to true
• ProcedureComment — comment associated with stored procedure to be created after execution, if StoredProcedureEnabled is set to true
• AdvertiseOutput — “true” or “false”
• AdvertiseNote — optional comment or note for advertised output (only used if AdvertiseOutput is “true” or if the Connection property AlwaysAdvertise is “true”)

An XML example to define the OutputProperties follows.

outputHistogramTableName="MyHisto" outputStatisticsTableName="MyStats" outputValuesTableName="MyValues"/>
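The surviving attribute fragments in this section (for example outputHistogramTableName) suggest that each property name is written as an XML attribute with its first letter in lower case. The Python sketch below illustrates that mapping under two loudly stated assumptions: that the naming pattern holds for every property, and that the attributes sit on an element named OutputProperties, a name taken from the prose above. The enclosing batch file structure is not shown in this section and is not reproduced here.

```python
import xml.etree.ElementTree as ET


def to_attribute_name(property_name: str) -> str:
    """Lower-case the first letter, e.g. OutputDatabase -> outputDatabase (assumed pattern)."""
    return property_name[0].lower() + property_name[1:]


def output_properties_element(props: dict[str, str]) -> str:
    """Serialize properties as attributes of a single element.

    The element name OutputProperties is an assumption for illustration.
    """
    element = ET.Element("OutputProperties")
    for name, value in props.items():
        element.set(to_attribute_name(name), value)
    # ElementTree escapes special characters in attribute values itself.
    return ET.tostring(element, encoding="unicode")


print(output_properties_element({
    "OutputHistogramTableName": "MyHisto",
    "OutputStatisticsTableName": "MyStats",
    "OutputValuesTableName": "MyValues",
}))
```

Building the element with a library instead of string concatenation keeps quotation marks, ampersands and angle brackets in values correctly escaped.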

Expert Properties

For the Data Explorer analysis, the expert where clauses are defined in a separate XML structure when using column input for the analysis. Where clauses take the following attributes and values:
• Table — name of table associated with an input table defined for a column in the XML Columns structure
• Database — name of database that the table is defined in
• WhereClause — the where clause to be used

An XML example to define the Expert option where clause(s) follows.

Sample XML Definition for a New Data Explorer Analysis

An XML example defining a new Data Explorer analysis using multi-table input follows.

Sample XML Definition to Modify an Existing Data Explorer Analysis

database="twm_source"/>

Creating a New or Modifying an Existing Refresh Analysis

In order to create a new or modify an existing Refresh analysis, the following analysis properties are defined:

Analysis Properties
• Type — “Refresh Analysis” (needed only if “new” is “true”)
• Name — the name of the new Refresh analysis or the name of an existing Refresh analysis to modify
• New — “true” (needed to define a new Refresh analysis)
• Modify — “true” (needed to modify an existing Refresh analysis)

Input Data Analysis properties can be specified to override the default values for a new Refresh analysis or to modify the values of an existing Refresh analysis.

Input Data Analysis Properties
• Analysis — the name of the new Refresh analysis or the name of an existing Refresh analysis to modify
• ModifyOutput — “true” if the output properties are to be defined or modified. Default is “false”.
• OutputDatabase — the name of the existing database. ModifyOutput must be set to “true”.
• OutputTable — the name of the table. ModifyOutput must be set to “true”.
• ModifyAnchorTable — “true” if the anchor table is to be defined or modified. Default is “false”.
• AnchorDatabase — the name of the existing database. ModifyAnchorTable must be set to “true”.
• AnchorTable — the name of the anchor table. ModifyAnchorTable must be set to “true”.
• ModifyTargetDate — “true” if the target date is to be defined or modified. Default is “false”.
• TargetDate — the target date to be defined or modified. ModifyTargetDate must be set to “true”.
• ModifyLiteralParameters — “true” if the literal parameters associated with the analysis to be refreshed are to be modified or defined. Default is “false”.
• GenerateSqlOnly — “true” or “false” (default is false)
• StoredProcedureEnabled — “true” or “false” (default is false)
• ProcedureName — name of stored procedure to be created, if StoredProcedureEnabled is set to true
• IncludeProcedureParameters — “true” or “false” (default is false), valid if StoredProcedureEnabled is set to true
• ProcedureComment — comment associated with stored procedure to be created, if StoredProcedureEnabled is set to true
• AdvertiseOutput (only valid if the analysis being refreshed creates a table or view) — “true” or “false”
• AdvertiseNote — optional comment or note for advertised output (only used if AdvertiseOutput is “true” or if the Connection property AlwaysAdvertise is “true”)
• RetainAllColumns — “true” or “false”. Default is false.

If ModifyLiteralParameters has been set to “true” in the input data analysis properties, the following XML structure defines each literal parameter that is to be modified:

Literal Parameters
• Name — the name of the literal parameter to be modified
• Value — the value of the literal parameter. The type of the parameter must be consistent with the definition of its original type.

An example of the XML definition to modify three literal parameters follows.

An XML example to define a new Refresh analysis follows.

Creating a New or Modifying an Existing Frequency Analysis

In order to create a new or modify an existing Frequency analysis, the following analysis properties are defined:

Analysis Properties
• Type — “Frequency” (needed only if “new” is “true”)
• Name — the name of the new Frequency analysis or the name of an existing Frequency analysis to modify
• New — “true” (needed to define a new Frequency analysis)
• Modify — “true” (needed to modify an existing Frequency analysis)

InputDataProperties needs to be defined if this is a “new” analysis. InputDataProperties takes a database, table, and a list of columns. If the analysis is being modified, the InputDataProperties can be redefined. They will replace the existing set of columns that were originally defined for the analysis.

Column Input Data (must be defined for any ‘frequency style’)
• Database — the name of the database
• Table — the name of the table
• Columns — a list of column names
  ∘ Name — the name of the column

An XML example to define columns for a new analysis follows.
Database="twm_source" Table="twm_customer">

Statistics Column Input Data (optional for frequencyStyle=“basic”)
• Statistics Columns — a list of column names of numeric or date type
  ∘ Name — the name of the column to collect statistics on

An XML example to define Columns and StatisticsColumns when frequencyStyle=“basic” follows.
Database="twm_source" Table="twm_customer">

Pairwise Column Input Data (optional for frequencyStyle=“pairwise”)
• Pairwise Columns — a list of column names
  ∘ Name — the name of the pairwise column

An XML example to define Columns and PairwiseColumns when frequencyStyle=“pairwise” follows.
Database="twm_source" Table="twm_customer">

Input Data Analysis properties can be specified to override the default values for a new frequency analysis or to modify the values of an existing frequency analysis.

Input Data Analysis Properties
• FrequencyStyle — “basic”, “pairwise”, or “crosstab”. Default is “basic”.
• IncludeMinimumPercent — “true” or “false”. Default is false.
• MinimumFrequencyToReturn (defined only if IncludeMinimumPercent=“true”) — the decimal or integer value of the minimum frequency to return
• IncludeCumulativeMeasures — “true” or “false”. Default is false.
• TopRankingResultsToReturn (defined only if IncludeCumulativeMeasures=“true”) — the integer value of the top ranking results to return

An XML example to define InputDataAnalysis properties follows.

Output Properties

For the definition of output properties, see Modifying Output Batch Properties And Post Processing Properties.

Expert Properties
• WhereClause — the where clause to be defined
• HavingClause — the having clause to be defined (only valid if IncludeMinimumPercent is not set to “true”)
• QualifyClause — the qualify clause to be defined (only valid if IncludeCumulativeMeasures=“true” and TopRankingResultsToReturn is not set)

An XML example to define Expert properties follows.
whereClause="age>50"/>
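The conditions attached to the Frequency analysis and expert properties interact, so a small rule check can make them easier to follow. The Python sketch below encodes only the conditions stated in the two property lists above. The function name frequency_problems and the message texts are hypothetical.

```python
FREQUENCY_STYLES = {"basic", "pairwise", "crosstab"}


def frequency_problems(props: dict[str, str]) -> list[str]:
    """Return the rule violations found in a set of Frequency properties."""

    def enabled(name: str) -> bool:
        # Both Include... flags default to false.
        return props.get(name, "false") == "true"

    problems = []
    if props.get("FrequencyStyle", "basic") not in FREQUENCY_STYLES:
        problems.append("FrequencyStyle must be basic, pairwise or crosstab")
    if "MinimumFrequencyToReturn" in props and not enabled("IncludeMinimumPercent"):
        problems.append("MinimumFrequencyToReturn needs IncludeMinimumPercent=true")
    if "TopRankingResultsToReturn" in props and not enabled("IncludeCumulativeMeasures"):
        problems.append("TopRankingResultsToReturn needs IncludeCumulativeMeasures=true")
    if "HavingClause" in props and enabled("IncludeMinimumPercent"):
        problems.append("HavingClause is not valid with IncludeMinimumPercent=true")
    if "QualifyClause" in props and (
        not enabled("IncludeCumulativeMeasures") or "TopRankingResultsToReturn" in props
    ):
        problems.append("QualifyClause needs IncludeCumulativeMeasures=true and no TopRankingResultsToReturn")
    return problems


print(frequency_problems({"WhereClause": "age>50", "HavingClause": "x", "IncludeMinimumPercent": "true"}))
```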

Sample XML Definition for a Frequency Analysis

outputStyle="CreateTable" outputDatabase="twm_results" outputName="MyFrequencyOutput"

Creating a New or Modifying an Existing Histogram Analysis

In order to create a new or modify an existing Histogram analysis, the following analysis properties are defined:

Analysis Properties
• Type — “Histogram” (needed only if “new” is “true”)
• Name — the name of the new Histogram analysis or the name of an existing Histogram analysis to modify
• New — “true” (needed to define a new Histogram analysis)
• Modify — “true” (needed to modify an existing Histogram analysis)

InputDataProperties needs to be defined if this is a “new” analysis. InputDataProperties takes a database, table, and a list of various column definitions, depending on the type of histogram desired. Histogram takes 5 types of column input. They define a collection of bins, quantiles, widths, boundaries, or bins with minimum and maximum values.

Column Input Data
• Database (required for any histogram type) — the name of the database
• Table (required for any histogram type) — the name of the table

The following defines each type of Histogram input:

BinColumns — a list of bin column names and their bin values
• BinColumn
  ∘ Name — the name of the column
  ∘ Bins — the number of bins

An XML example to define bin columns follows.
Database="twm_source" Table="twm_customer">

QuantileColumns — a list of quantile column names and their quantile values
• QuantileColumn
  ∘ Name — the name of the column
  ∘ Quantiles — the number of quantiles

An XML example to define quantile columns follows.

WidthColumns — a list of column names and their width values
• WidthColumn
  ∘ Name — the name of the column
  ∘ Width — the width values

An XML example to define width columns follows.

BoundaryColumns — a list of column names and the list of boundary values for each column
• BoundaryColumn
  ∘ Name — the name of the column
• Boundaries
  ∘ Boundary Value — the value of each boundary

An XML example to define boundary columns follows.


BinsMinMaxColumns — a list of column names that includes the bins and minimum and maximum values
• BinMinMaxColumn
  ∘ Name — the name of the column
  ∘ Bins — the number of bins
  ∘ MinimumValue — the minimum value to be binned for the column
  ∘ MaximumValue — the maximum value to be binned for the column

An XML example to define columns with bins and minimum and maximum values follows:

An XML example to define BinColumns, StatisticsColumns and OverlayColumns for a new analysis follows.
Database="twm_source" Table="twm_customer">

Statistics Column Input Data (optional)
• Statistics Columns — a list of column names of numeric or date type
  ∘ Name — the name of the column to collect statistics on

Overlay Column Input Data (optional)
• Overlay Columns — a list of column names that cannot include any of the histogram binning columns
  ∘ Name — the name of the overlay column
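The restriction that overlay columns cannot include any histogram binning column is easy to check before a batch file is submitted. The Python sketch below shows one way. The function name overlay_conflicts is hypothetical, the column names are illustrative, and the case-insensitive comparison is an assumption about how column names are matched.

```python
def overlay_conflicts(binning_columns: list[str], overlay_columns: list[str]) -> list[str]:
    """Return the overlay columns that are also used as histogram binning columns."""
    # Assumption: column names are compared without regard to case.
    binning = {name.lower() for name in binning_columns}
    return [name for name in overlay_columns if name.lower() in binning]


# "income" and "age" are binned; "gender" is an illustrative overlay column.
conflicts = overlay_conflicts(["income", "age"], ["gender", "Age"])
print(conflicts)  # prints: ['Age']
```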

Input Data Analysis Properties
• CrossTabulateBins — “true” or “false”. Default is false.

An XML example to define InputDataAnalysis properties follows.

Output Properties

For the definition of output properties, see Modifying Output Batch Properties And Post Processing Properties.

Expert Properties
• WhereClause — the where clause to be defined

An XML example to define Expert properties follows.
whereClause="age>50"/>

Sample XML Definition for a New Histogram Analysis

Database="twm_source" Table="twm_customer">

Creating a New or Modifying an Existing Statistics Analysis

In order to create a new or modify an existing Statistics analysis, the following analysis properties are defined:

Analysis Properties
• Type — “Statistics” (needed only if “new” is “true”)
• Name — the name of the new Statistics analysis or the name of an existing Statistics analysis to modify
• New — “true” (needed to define a new Statistics analysis)
• Modify — “true” (needed to modify an existing Statistics analysis)

InputDataProperties needs to be defined if this is a “new” analysis. InputDataProperties takes a database, table, and a list of numeric or date type columns. If the analysis is being modified, the InputDataProperties can be redefined. They will replace the existing set of columns that were originally defined for the analysis.

Column Input Data
• Database — the name of the database
• Table — the name of the table
• Columns — a list of numeric or date type column names
  ∘ Name — the name of the column

An XML example to define columns for a new analysis follows.
Database="twm_source" Table="twm_customer">

GroupBy Column Input Data (optional)
• GroupBy Columns
  ∘ Name — the name of the group by column

An XML example to define Columns and GroupBy Columns follows.
Database="twm_source" Table="twm_customer">

Input Data Analysis Properties
• BasicOptions — “None”, “MinimumOptions”, “AllOptions”. Default is “MinimumOptions”. To define specific statistics, one or more of the following options can be selected: “Minimum”, “Maximum”, “Mean”, “StandardDeviation”, “Skewness”, “Kurtosis”, “StandardError”, “CoefficientOfVariance”, “Variance”, “UncorrectedSumsOfSquares”, “CorrectedSumsOfSquares”
• ExtendedOptions — “None”, “Values”, “Quantiles”, “Rank” or “Modes”. Default is “None”.
• StatisticalMethod — “Population” or “Sample”. Default is “Population”.

An XML example to define InputDataAnalysis properties follows.

Output Properties

For the definition of output properties, see Modifying Output Batch Properties And Post Processing Properties.

Expert Properties
• WhereClause — the where clause to be defined

An XML example to define Expert properties follows.
whereClause="age>50"/>

Sample XML Definition for a New Statistics Analysis

Creating a New or Modifying an Existing Values Analysis

In order to create a new or modify an existing Values analysis, the following analysis properties are defined:

Analysis Properties
• Type — “Values” (needed only if “new” is “true”)
• Name — the name of the new Values analysis or the name of an existing Values analysis to modify
• New — “true” (needed to define a new Values analysis)
• Modify — “true” (needed to modify an existing Values analysis)

InputDataProperties needs to be defined if this is a “new” analysis. InputDataProperties takes a database, table, and a list of numeric or date type columns. If the analysis is being modified, the InputDataProperties can be redefined. They will replace the existing set of columns that were originally defined for the analysis.

Column Input Data
• Database — the name of the database
• Table — the name of the table
• Columns — a list of column names
  ∘ Name — the name of the column

An XML example to define columns for a new analysis follows.
Database="twm_source" Table="twm_customer">

Input Data Analysis Properties
• ComputeUniqueValues — “true” or “false”. Default is “true”.

An XML example to define InputDataAnalysis properties follows.

Output Properties

For the definition of output properties, see Modifying Output Batch Properties And Post Processing Properties.

Expert Properties
• WhereClause — the where clause to be defined

An XML example to define Expert properties follows.
whereClause="age>50"/>

Sample XML Definition for a New Values Analysis

Analysis Validation

Validation of the parameters specified for an analysis is done both in real time and at execution time. Real-time validation is done for manually entered input, such as the number of bins for the Histogram function or the number of data points for the Scatter Plot function. If an invalid entry is made within a text box, a yellow indicator appears next to the invalid item. Passing the mouse over the indicator shows a tooltip with the appropriate error message.

Run-time validation is done for other parameters. These can be warnings or hard errors that must be corrected before the analysis can be executed. Both warnings and errors are displayed in a pop-up dialog as follows.

Error Warning Dialog

Errors are indicated by the red circle/white ! icon, while warnings are indicated by a yellow triangle/black ! icon. If all the messages are warnings, the Continue button is enabled, and the warnings can be ignored by clicking it. Double-clicking any message brings the focus to the error or warning in question on the analysis screen.
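The rule for the Continue button can be stated compactly in code. The Python sketch below is an illustration of that rule only and is not taken from the product: the ValidationMessage type, its field names and the sample message texts are all hypothetical, and treating an empty message list as "not enabled" is an assumption, since the dialog is described only for the case where messages exist.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ValidationMessage:
    severity: str  # "error" or "warning" (hypothetical encoding)
    text: str


def continue_enabled(messages: list[ValidationMessage]) -> bool:
    """Continue is enabled only when every displayed message is a warning."""
    # Assumption: with no messages there is no dialog, so nothing to continue from.
    return bool(messages) and all(m.severity == "warning" for m in messages)


only_warnings = [ValidationMessage("warning", "illustrative warning text")]
mixed = only_warnings + [ValidationMessage("error", "illustrative error text")]
print(continue_enabled(only_warnings), continue_enabled(mixed))  # prints: True False
```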

Stopping an Analysis Execution

An analysis can be terminated prior to normal completion by highlighting its name and clicking Stop on the Toolbar, or by right-clicking the analysis name and clicking Stop.

Stored Procedure Support (Teradata Database)

Output options are provided for the creation of stored procedures from the SQL generated for various types of analysis. The Teradata DBMS supports the creation of stored procedures containing SQL, control statements and optional parameters that are passed when the procedure is invoked with a CALL statement. The specific functions that provide this option, the parameters if any that are available, and the means of requesting this option are described in the following sections.
• Procedure Output Storage Option (Teradata Database)
• Procedure Options for Data Explorer (Teradata Database)
• Procedure Options for Refresh (Teradata Database)
• Miscellaneous Procedure Features (Teradata Database)

Procedure Output Storage Option (Teradata Database) The standard Output Storage options panel that is available with most types of analysis includes options to name and comment on a stored procedure, but only when the option to Store the tabular output of this analysis in the database is selected. If the selected Output Type is Table, the stored procedure is created with parameters for the output database and table. If the selected Output Type is View, the stored

Teradata Warehouse Miner User Guide - Volume 1 Introduction and Profiling, Release 5.4.4 121 Chapter 3: Using Teradata Warehouse Miner Stored Procedure Support (Teradata Database) procedure is created without parameters. If a stored procedure is requested along with the option to Generate the SQL for this analysis, but do not execute it, the SQL to create the stored procedure and optionally to add a descriptive comment to its data dictionary entry is created but not executed. When an optional Procedure Comment is entered, it can be up to 255 characters in length and contain substitution parameters for the output category (Score, ADS, Stats or Other), project name and/or analysis name (using the tags , and , respectively). Note that the default value of this field may be set on the Defaults tab of the Preferences dialog box that is available from the Tools > Preferences menu option. When a stored procedure is requested, a note is displayed at the bottom of the Output Storage panel to View the 'RESULTS--data' tab to see any errors/warnings creating the stored procedure. This simply points out that the creation of the stored procedure returns warning and error messages as data, which should be viewed to ensure that the stored procedure will have the desired effect. The functions that provide the Output Storage option to create a stored procedure include all ADS functions except Refresh (which has stored procedure options of its own), all Descriptive Statistics functions except Correlation Matrix, Data Explorer (which has options of its own), Text Field Analyzer and Scatter Plot, all Reorganization functions except Sample, and all Statistical Test functions. The Output Storage option is not provided for Analytic analyses (algorithms), Matrix functions, Miscellaneous (Free Form SQL), Publish or Scoring functions. 
Note, however, that the Sample function, Free Form SQL and various Scoring functions can produce a stored procedure using options provided by the Refresh analysis, described in Procedure Options for Refresh (Teradata Database).

Some consideration must be given to any analysis that is referenced for input when the output option to create a stored procedure is requested.

• If the reference is to a created table, a warning message is given that the desired results may not be achieved, since the stored procedure will not recreate the input table.
• If the reference is to an analysis that does not build an output table (and therefore automatically builds a volatile table), an error is given, since the procedure cannot work correctly in this case. An exception to this is when the referenced analysis uses the Generate the SQL for this analysis, but do not execute it option, leading to the SQL being included as a derived table in the referencing analysis, in which case no error is given and all of the required SQL is included in the stored procedure.

For ADS and Reorganization analyses, as well as for Free Form SQL analyses, the Refresh analysis may be used to include the SQL for all referenced analyses in the stored procedure, as described in Procedure Options for Refresh (Teradata Database).
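
To make the difference between the two output types concrete, the following is a minimal sketch of how the resulting procedures might be called. The user database twm, the results database twm_results and both procedure names are hypothetical; the CALL form follows the example given under Miscellaneous Procedure Features.

```sql
-- Output Type = Table: the procedure takes the output database and table
-- as parameters (the procedure name here is hypothetical).
CALL "twm"."freq_table_proc" ('twm_results', 'freq_output');

-- Output Type = View: the procedure is created without parameters,
-- so the call carries an empty argument list (hypothetical name again).
CALL "twm"."freq_view_proc" ();
```

In both cases the messages returned on the RESULTS-data tab when the procedure was created are worth checking before the procedure is relied upon.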

Procedure Options for Data Explorer (Teradata Database)

The Data Explorer function in the Descriptive Statistics category of analyses also provides on its custom Output Storage panel options to create and comment on a stored procedure. The creation of a stored procedure for a Data Explorer analysis has some unique features and requirements.

• The procedure contains a parameter for the output database and a separate parameter for each of the output tables that may be created depending on options selected in the analysis, namely a Values, Statistics, Frequency and Histogram Output Table.
• Due to the dynamic nature of the creation of SQL for the Data Explorer analysis, the stored procedure is created only after executing the SQL for the analysis. There is no option in the Data Explorer function to “generate SQL only”, whether or not a stored procedure is requested.
• If the Restart option is selected, the option to create a stored procedure is not available.

Procedure Options for Refresh (Teradata Database)

Note: This section can be skipped by users of the Teradata Profiler product, since it does not include this function.

The Refresh analysis in the ADS category of analyses provides options to create and comment on a stored procedure, with or without parameters. This makes the option to create a stored procedure available to some functions for which the Output Storage option is not available, including the various Scoring functions, the Free Form SQL analysis and the Sample function. It also provides a number of capabilities not offered via the standard Output option to create a stored procedure, including the following:

• Refresh includes the SQL for not only the analysis being “refreshed”, but for the entire chain of analyses, if any, referenced for input by the refreshed analysis.
• Refresh not only supports parameters for output database and table but for anchor database and table, target date and literal parameters as well.

Note: For technical reasons, the inclusion of literal parameters in a generated stored procedure is provided only when the release of the Teradata DBMS is 12 or later.

Parameters for these entities are included only if they are used and replaceable in the analysis being refreshed or in a referenced analysis. In particular, parameters for anchor database and table are not replaceable in an analysis that creates a score table or is a Variable Creation or Build ADS analysis that has no anchor columns selected.

• If the Output Type of the analysis being refreshed is set to View, the stored procedure is created without parameters.
• When parameters are not requested, or are not provided for some reason, the values assumed within the SQL of the stored procedure are either the current values or those explicitly requested by the Refresh analysis.
• When refreshing a Scoring analysis with parameters requested, parameters are included automatically for the input database and table and the output database and table, unless the Scoring analysis refers to another analysis for input, in which case only the output database and table parameters are included, along with any parameters contributed by the referenced analyses. Even if a Variable Creation marked as a score analysis contains literal parameters or target date references, they will be treated as fixed literal values and not as parameters in any stored procedure built by refreshing the analysis.
• When refreshing a Scoring analysis that offers options to score, evaluate or do both, a requested stored procedure will not include evaluation SQL but will include scoring SQL regardless of the option requested in the scoring analysis.
• When refreshing a Free Form SQL analysis, the analysis is treated either as an analytic data set or scoring analysis, depending on options selected in the Free Form SQL analysis, with availability of parameters determined accordingly. That is, anchor table, target date and literal parameters are possible if the Creates Score Table option is not selected in the Free Form SQL analysis, and input and output database and tables are available otherwise.

Note: Any literal parameters must be used as literals in the free form SQL and not as database objects or other constructs, in order for any generated stored procedure to work.

• When refreshing a Sample analysis that creates multiple tables or views, no parameters for output database and table are included.
• When refreshing a Decision Tree Score analysis, the SQL to create profiling tables is not included in the stored procedure.

Miscellaneous Procedure Features (Teradata Database)

There are some general features and limitations concerning generated stored procedures that should be noted.

• Stored procedures are always generated in the login user database. This is necessary so that SQL Data Definition Language (DDL) statements can be included in the procedure.
• Stored procedures will include any post-processing SQL that is requested in an included analysis.
• Some SQL constructs are not allowed in a stored procedure, including SQL to create a recursive view (an option available in the Variable Creation analysis).
• To execute a stored procedure, a CALL statement may be included in a Free Form SQL analysis or in another application. For example, the following executes a stored procedure called “test_procedure” where the user is “twm”, the results database is “twm_results” and the desired output table name is “test_output”:

    CALL "twm"."test_procedure" ('twm_results', 'test_output');

• If the SQL to create a stored procedure is ever placed in a Free Form SQL analysis, the option to Execute as a single statement must be used to execute it.
• In order to support the use of SQL to drop and create tables, generated stored procedures include an error handling control statement to allow a table-not-found condition. This error handler, however, may hide other errors such as improper access rights and incorrect positional assignment lists (on inserts). The user is cautioned to check warnings and messages on the RESULTS-data tab that may alert the user to these and other errors.

Advertise Output

Overview

Output options are provided to store the results of most types of analysis as one or more output tables or views. An output option is also provided in most cases to store the SQL generated by these analyses in a Stored Procedure that can easily be re-executed in the same or another application using a CALL statement. In order to make the availability of these output tables, views and stored procedures known, additional metadata tables are provided for advertised output. In addition to identifying available output tables, views and procedures, this metadata contains, if available, information about the data in output tables, the analyses that created them, and the stored procedures, if any, that are available to recreate them.

Users have the opportunity to request that an output table, view or procedure be “advertised” by checking a check box and optionally entering an identifying “advertise note”. This can be done on the output panel available with most analyses, or by using the same options on the analysis parameters panel of a Refresh analysis or an Export Matrix analysis. In this way, metadata can be created about the output of any analysis that creates a permanent output table or view.

Note: When requesting advertised output for a Refresh analysis that refreshes a scoring analysis chained to an ADS analysis, both the output of the scoring analysis and the ADS analysis are advertised. Also, if the creation of profiling tables is requested while scoring a decision tree, both the profiling tables and the score output table will be advertised if advertising is requested on the decision tree scoring output panel.

Advertise output options are stored with each analysis that they are requested for. As an alternative to requesting the advertising of output on each and every analysis, an Always Advertise option is available on the Connection Properties dialog to request that all output be advertised when using a particular datasource.

Note: This may generate a great deal of metadata during the course of using the application.

Advertise Stored Procedure Output

When a stored procedure is advertised, the metadata entries reflect as far as possible the values of features and parameters that would have been in effect if a procedure wasn't created for the analysis or analyses the procedure represents. When such a procedure is executed, however, with an SQL CALL statement, parameter values may change if the procedure contains parameters, so certain metadata tables require additional entries keyed by the created table or view name. This means that every advertised procedure must contain additional coding lines to update the appropriate Advertise Output metadata tables. If it is later desired to not advertise the tables or views created by a procedure, the procedure will need to be rebuilt without selecting the “advertise output” option.

This also means that in order to provide all the information for a table or view created by an advertised procedure, the metadata tables must be joined or combined in various ways to combine the entries originally made for the procedure with those made for its created table or view. This task is performed by the advertise output macros (described in Advertise Output Macros) that are created when the advertise output metadata tables and views are created.

Special Cases

• When a procedure is created by a Data Explorer analysis, the profiled columns are associated equally with each of the output tables, even though all columns will typically not have been profiled in each of them.
• When a procedure advertises its output and an output name containing a quote or double quote is given in the procedure call, the object won't be advertised correctly.

Note: The following cases do not pertain to the Teradata Profiler product.

• When a procedure that advertises its output contains a String or Text parameter, the actual call advertised reflects the parameter evaluated as a string (for example, USER might become 'TWM').
• When calling a stored procedure that advertises output, passing a literal parameter that contains an embedded quote or double quote will typically result in either a blank field or missing rows in the advertise metadata. For example, the string 'That''s all right' or the list of values '''A'', ''B'', ''C''' when passed as a parameter would not be seen in the advertised metadata.
• When a procedure advertises its output, numeric parameters are placed in quotes in the metadata. The quoted numbers work, but aren't needed when called.

Note: Advertising a procedure that contains a combined ADS and Score will delete any preceding entry for an ADS with the same name as the combined ADS at the time the procedure is created (even before executing it).

• When Refresh creates a stored procedure and either advertise output or always advertise is requested, the only output advertised in the procedure is the output of the analysis being refreshed and an ADS preceding a score, if applicable. If neither advertise output nor always advertise is requested, no output is advertised by the procedure.
• When Refresh does not create a stored procedure, the advertise output option overrides any setting on the analysis being refreshed and an ADS preceding a score if applicable, but not on other analyses in the chain. Of course, when always advertise is requested, all refreshed output is advertised.

Advertise Output Metadata

The tables outlined below provide metadata about output tables, views and stored procedures. They must be created using the Tools > Advertise Tables Creation menu option prior to executing an analysis with output advertising requested. The database in which they are created, and the database that is used to locate them when output is actually advertised, is the Advertise Database on the Connection Properties dialog.

Note that the primary key for most of these tables is the output database and table or view name, which is unique across the entire Teradata system due to the design of the Teradata data dictionary. This means that users of different TWM application metadata tables (i.e., users with different metadata databases) can, if desired, share one set of Advertise Output metadata tables. In this scenario, only one user should create the tables using the Tools > Advertise Tables Creation menu option.

It is not intended, however, that the Advertise Output metadata tables be accessed directly by users, but always through provided views and macros that access the tables. Note that these views and macros are created at the same time the Advertise Output metadata tables are created. They can also be recreated separately, something that may be useful in a future release if the views and/or macros must be updated while leaving the underlying metadata tables unchanged. The provided macros always reference the metadata through the provided views, which access the underlying metadata tables after locking them for access. Depending on when the Advertise Output metadata tables were created, it may be necessary to recreate the views and macros using the feature just described in order to benefit from the potential performance improvement from locking tables for access and to make available any new macros included with the product.
In addition to providing database views and macros through which to access the Advertise Output metadata, a maintenance function called Advertise Maintenance is provided on the Tools menu to maintain or view the advertised objects or to execute the supplied macros and view the results (also available as View Advertised Output on the View menu). This dialog lists the advertised objects that may be viewed, and optionally filtered to include only those of a particular category (ADS, Score, Profile or Other). When an individual advertised object is selected, detailed properties may be displayed, the SQL definition viewed or the data itself viewed if the object is a table or view. In addition, one or more advertisements may be selected for deletion, along with the underlying objects themselves if desired. Finally, a Synchronize option is provided to match up the entries with entries in the Teradata data dictionary, marking as “Missing” any objects that have been dropped from Teradata or as “Outdated” those objects that have been newly created without advertising after they were previously advertised. For more information, see Advertised Maintenance.

The design of the Advertise Output metadata tables reflects the fact that TWM output tables can generally be categorized as analytic data sets, score tables or profile tables. An Analytic Data Set, or ADS, may contain data needed to build or score a predictive model, or it may contain analytic measures derived from data in the warehouse. A score table contains the results created when a predictive model (generally built by a data mining algorithm) is applied to an analytic data set. Finally, a profile table or view is essentially a report resulting from a statistical analysis of data in the warehouse, such as a Frequency or Histogram analysis.

Special Features

Note: These special features do not pertain to the Teradata Profiler product.

• If an advertised variable has no description, then analyses chained through analysis references are searched for a description. If a description is found for a variable with a matching name, it is used in the advertised metadata. See the Get Variables section in Macro Results. Note that this is a feature unique to advertising output and is not available when publishing to the Model Manager application, for example.
• Advertised variables include a column containing pseudo-SQL representing how the variable was derived, concatenating the contributions of referenced analyses, if any, separated by three vertical bars |||. Note that if a column is simply passed along from a referenced analysis this is not recorded, unless the column name changes via aliasing. See the Get Variables section in Macro Results.
• Information about the columns contributing to each advertised variable is stored in the metadata, identifying the database, table and column for each contributing column. Columns that only contribute to an expert or join clause and do not contribute directly to a variable or a dimension applied to a variable are not included in this information. See the Get Variable Columns section in Macro Results.

The Advertise Output metadata tables include:

TWMX_DatabaseObjects *
TWMX_ObjectAnalyses
TWMX_AnalysisProperties
TWMX_Procedures
TWMX_ProcedureParameters
TWMX_ProcedureObjects *
TWMX_ProfileSubjects
TWMX_ADSFeatures *
TWMX_ADSParameters *
TWMX_Variables
TWMX_VariableColumns
TWMX_ScoreFeatures *
TWMX_ScoreColumns
TWMX_ModelObjects

* updated by stored procedures

The views provided to access the Advertise Output metadata tables include:

TWMXV_DatabaseObjects
TWMXV_ObjectAnalyses
TWMXV_AnalysisProperties
TWMXV_Procedures
TWMXV_ProcedureParameters
TWMXV_ProcedureObjects
TWMXV_ProfileSubjects
TWMXV_ADSFeatures
TWMXV_ADSParameters
TWMXV_Variables
TWMXV_VariableColumns
TWMXV_ScoreFeatures
TWMXV_ScoreColumns
TWMXV_ModelObjects
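
Since the metadata is meant to be read through the views rather than the base tables, a quick inspection might look like the sketch below. It assumes the Advertise Database is named twm, which is only an example value; the column lists of the views are not reproduced in this section, so the queries simply select all columns.

```sql
-- List every advertised object through the provided view
-- ("twm" is an assumed Advertise Database name).
SELECT * FROM "twm".TWMXV_DatabaseObjects;

-- Inspect the advertised stored procedures and their parameters.
SELECT * FROM "twm".TWMXV_Procedures;
SELECT * FROM "twm".TWMXV_ProcedureParameters;
```

For most questions the supplied macros are likely the better entry point, since they already combine the entries made for a procedure with those made for the tables or views it creates.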

Advertise Output Macros

The macros provided to access the Advertise Output metadata tables include:

What are available ADS tables/views/procedures and their properties and procedures? TWMX_Get_All_ADS

What are the available Score tables/procedures and their properties and procedures? TWMX_Get_All_Scores

What are Profile tables/views/procedures profiling tables in a particular database? TWMX_Get_Profiles_By_Database(subjectDatabase VARCHAR(128) CHARACTER SET UNICODE)

What are the Profile tables/views/procedures profiling a particular table? TWMX_Get_Profiles_By_Table(subjectDatabase VARCHAR(128) CHARACTER SET UNICODE, subjectTable VARCHAR(128) CHARACTER SET UNICODE)

What are columns profiled by tables/views/procedures profiling a particular table? TWMX_Get_Profiled_Columns(subjectDatabase VARCHAR(128) CHARACTER SET UNICODE, subjectTable VARCHAR(128) CHARACTER SET UNICODE)

What are the Model Variables of a particular ADS or Score by name? TWMX_Get_Variables(objectDatabase VARCHAR(128) CHARACTER SET UNICODE, objectName VARCHAR(128) CHARACTER SET UNICODE)

What columns contributed to the Model Variables for this ADS or Score table/procedure? TWMX_Get_Variable_Columns(objectDatabase VARCHAR(128) CHARACTER SET UNICODE, objectName VARCHAR(128) CHARACTER SET UNICODE)

What are the literal parameters of a particular ADS by name? TWMX_Get_ADS_Parameters(objectDatabase VARCHAR(128) CHARACTER SET UNICODE, objectName VARCHAR(128) CHARACTER SET UNICODE)

What are the output columns contained in this Score table/procedure? TWMX_Get_Score_Columns(objectDatabase VARCHAR(128) CHARACTER SET UNICODE, objectName VARCHAR(128) CHARACTER SET UNICODE)

What was the analysis or chain of analyses that built this table/view/procedure? TWMX_Get_Analyses(objectDatabase VARCHAR(128) CHARACTER SET UNICODE, objectName VARCHAR(128) CHARACTER SET UNICODE)

What are the properties of the analyses that built this table/view/procedure? TWMX_Get_Analysis_Properties(objectDatabase VARCHAR(128) CHARACTER SET UNICODE, objectName VARCHAR(128) CHARACTER SET UNICODE)

What are the parameters of a particular procedure by name? TWMX_Get_Procedure_Parameters(objectDatabase VARCHAR(128) CHARACTER SET UNICODE, objectName VARCHAR(128) CHARACTER SET UNICODE)

What are the tables/views created by a particular procedure by name? TWMX_Get_Procedure_Tables(objectDatabase VARCHAR(128) CHARACTER SET UNICODE, objectName VARCHAR(128) CHARACTER SET UNICODE)

What are all the advertised tables/views/procedures? TWMX_Get_Advertised_Objects

What are the projects that select from a particular table to build a profile, ADS or score table? TWMX_Get_Projects_Table_Input(inputDatabase CHAR(30), inputTable CHAR(30))

What are the profile, score and ADS analyses that select from a particular table? TWMX_Get_Analyses_Table_Input(inputDatabase CHAR(30), inputTable CHAR(30))

What are the profile, score and ADS analyses that select from a particular column? TWMX_Get_Analyses_Column_Input(inputDatabase CHAR(30), inputTable CHAR(30), inputColumn CHAR(30))

What project and analysis created the given table/view/procedure? TWMX_Get_Creating_Analysis(objectDatabase CHAR(30), objectName CHAR(30))

The macros are used by executing them and supplying parameters if any, with or without parameter names, such as in:

    EXEC "twm".TWMX_Get_Profiles_By_Table('twm_source','twm_customer');

-or-

    EXEC "twm".TWMX_Get_Profiles_By_Database(subjectDatabase='twm_source');
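
A typical lookup might chain several of these macros, starting broad and then drilling into one object. The following is only a sketch: it assumes the Advertise Database is twm, and the advertised ADS named customer_ads in twm_results is hypothetical.

```sql
-- 1. List all advertised ADS objects (this macro takes no parameters).
EXEC "twm".TWMX_Get_All_ADS;

-- 2. Show the literal parameters of one ADS, passing parameters by name.
EXEC "twm".TWMX_Get_ADS_Parameters(objectDatabase='twm_results', objectName='customer_ads');

-- 3. Show which source columns contributed to its variables,
--    passing the same two parameters by position.
EXEC "twm".TWMX_Get_Variable_Columns('twm_results', 'customer_ads');

-- 4. Find the analyses that select from a particular source table.
EXEC "twm".TWMX_Get_Analyses_Table_Input('twm_source', 'twm_customer');
```

Named parameters make the longer calls easier to read, while positional parameters keep ad hoc queries short; both forms are described above.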

Macro Results

The following tables display the results for each macro.

Get All ADS

What are available ADS tables/views/procedures and their properties and procedures?

Macro Results: Get All ADS

Column Name           Column Description
Database              The database that contains the advertised object (table, view or procedure).
Name                  The name of the advertised object.
Kind                  T = Table, V = View, P = Procedure.
Comment               An optional comment describing the object.
Created By            The database user that created the object.
Created               The date and time the object was created.
AnchorDB              The database containing the Anchor Table, if any.
AnchorTable           The name of the Anchor Table, if any.
TargetDate            The Target Date, if used in the underlying analyses.
NbrLiteralParameters  The number of literal parameters used in the ADS.
NbrVariables          The number of variables (columns) in the ADS.
ProcDB                If created by procedure, the database procedure resides in.
ProcName              If created by procedure, the name of the procedure.
DefaultCall           Procedure call with parameters simulating the results if a procedure wasn't used.
ActualCall            The actual procedure call reflecting the parameters used.
Start                 Start time of procedure execution.
Stop                  Stop time of procedure execution.
MetadataDB            Metadata database containing analysis that created object.
AnalysisName          The name of the analysis that created the object.
AnalysisType          The type of the analysis that created the object.
AnalysisDesc          Optional description of the analysis that created the object.
AnalysisCreated       Date and time the analysis was created.
ProjectName           Name of the project containing the creating analysis.
ProjectDesc           Optional description of the containing project.
ProjectCreated        Date and time the project was created.

Get All Scores

What are the available Score tables/procedures and their properties and procedures?

Macro Results: Get All Scores

Column Name           Column Description
Database              The database that contains the advertised object (table, view or procedure).
Name                  The name of the advertised object.
Kind                  T = Table, P = Procedure.
Comment               An optional comment describing the object.
Created By            The database user that created the object.
Created               The date and time the object was created.
InputDB               The database containing the scoring input table.
InputTable            The name of the scoring input table.
NbrModelVariables     The number of ADS variables used in scoring.
ProcDB                If created by procedure, the database procedure resides in.
ProcName              If created by procedure, the name of the procedure.
DefaultCall           Procedure call with parameters simulating the results if a procedure wasn't used.
ActualCall            The actual procedure call reflecting the parameters used.
Start                 Start time of procedure execution.
Stop                  Stop time of procedure execution.
MetadataDB            Metadata database containing analysis that created object.
AnalysisName          The name of the analysis that created the object.
AnalysisType          The type of the analysis that created the object.
AnalysisDesc          Optional description of the analysis that created the object.
AnalysisCreated       Date and time the analysis was created.
ProjectName           Name of the project containing the creating analysis.
ProjectDesc           Optional description of the containing project.
ProjectCreated        Date and time the project was created.

Get Profiles By Database

What are the Profile tables/views/procedures profiling tables in a particular database?

Macro Results: Get Profiles By Database

Column Name           Column Description
SubjectTable          The profiled table or view.
Database              The database that contains the advertised object (table, view or procedure).
Name                  The name of the advertised object.
Kind                  T = Table, V = View, P = Procedure.
Comment               An optional comment describing the object.
Created By            The database user that created the object.
Created               The date and time the object was created.
ProcDB                If created by procedure, the database procedure resides in.
ProcName              If created by procedure, the name of the procedure.
DefaultCall           Procedure call with parameters simulating the results if a procedure wasn't used.
ActualCall            The actual procedure call reflecting the parameters used.
Start                 Start time of procedure execution.
Stop                  Stop time of procedure execution.

Get Profiles By Table

What are the Profile tables/views/procedures profiling a particular table?

Macro Results: Get Profiles By Table

Column Name           Column Description
Database              The database that contains the advertised object (table, view or procedure).
Name                  The name of the advertised object.
Kind                  T = Table, V = View, P = Procedure.
Comment               An optional comment describing the object.
Created By            The database user that created the object.
Created               The date and time the object was created.
ProcDB                If created by procedure, the database procedure resides in.
ProcName              If created by procedure, the name of the procedure.
DefaultCall           Procedure call with parameters simulating the results if a procedure wasn't used.
ActualCall            The actual procedure call reflecting the parameters used.
Start                 Start time of procedure execution.
Stop                  Stop time of procedure execution.

Get Profiled Columns

What are columns profiled by tables/views/procedures profiling a particular table?

Macro Results: Get Profiled Columns

Column Name           Column Description
Database              The database that contains the advertised object (table, view or procedure).
Name                  The name of the advertised object.
Kind                  T = Table, V = View, P = Procedure.
Column                The profiled column.
Description           Description of column use, for example, Group By.

Get Variables

What are the Model Variables of a particular ADS or Score by name?

Macro Results: Get Variables

Column Name           Column Description
SequenceId            Sequential integer from 0 to n-1 where n is the number of Variables in the ADS or Score Model.
VarName               The name of the Variable (column).
VarDesc               The optional description of the variable.
VarSql                Pseudo-SQL representing how the variable was derived, concatenating the contributions of referenced analyses, if any, separated by |||. Note that if a column is simply passed along from a referenced analysis this is not recorded, unless the column name changes via aliasing.

Get Variable Columns

What columns contributed to the Model Variables for this ADS or Score table/procedure?

Macro Results: Get Variable Columns

Column Name           Column Description
SequenceId            Sequential integer from 0 to n-1 where n is the number of Variables in the ADS or Score Model.
VarName               The name of the Variable (column).
SourceDB              Database containing the table or view that contains the contributing column.
SourceTable           The name of the table or view that contains the contributing column.
SourceColumn          The name of the column that contributed to the creation of the indicated variable.

Get ADS Parameters

What are the literal parameters of a particular ADS by name?

Macro Results: Get ADS Parameters

Column Name           Column Description
ParameterName         The name of the literal parameter.
ParameterDescription  The optional description of the literal parameter.
ParameterType         String, Numeric, Date, Time, Timestamp, or Text.
ParameterValue        A character string representation of the original value of the parameter.

Get Score Columns

What are the output columns contained in this Score table/procedure?

Macro Results: Get Score Columns

Column Name           Column Description
SequenceId            Sequential integer from 0 to n-1 where n is the number of score output columns.
ColumnName            The name of the score output column.
ColumnDescription     If used, a generated description of the use of the score output column, for example, “Index Column”.

Get Analyses

What was the analysis or chain of analyses that built this table/view/procedure?

Macro Results: Get Analyses

Column Name           Column Description
SequenceId            Sequential integer from 0 to n-1 where n is the number of analyses that contributed to building this table/view/procedure.
MetaDB                Metadata database containing analysis that created object.
AnalysisName          The name of the analysis that created the object.
AnalysisType          The type of the analysis that created the object.
AnalysisDesc          Optional description of the analysis that created the object.
AnalysisCreated       Date and time the analysis was created.
ProjectName           Name of the project containing the creating analysis.
ProjectDesc           Optional description of the containing project.
ProjectCreated        Date and time the project was created.

Get Analysis Properties

What are the properties of the analyses that built this table/view/procedure?

Macro Results: Get Analysis Properties

Column Name         Column Description
AnalysisName        The name of one of the analyses that contributed to creating the given table/view/procedure.
AnalysisProperty    Description of a property of an analysis as logged in error logs or displayed interactively in the Property panel.

Get Procedure Parameters

What are the parameters of a particular procedure by name?

Macro Results: Get Procedure Parameters

Column Name    Column Description
ParmName       The name of the parameter.
ParmDesc       The optional description of the parameter.
SQLType        The SQL type string of the parameter, for example, VARCHAR(30).
OrigValue      A character string representation of the original value of the parameter.

Get Procedure Tables

What are the tables/views created by a particular procedure by name?

Macro Results: Get Procedure Tables

Column Name    Column Description
Database       The database that contains the table/view created by the procedure.
Name           The name of the table/view created by the procedure.
Call           The procedure call statement in use when the object was created.
Start          The date and time when procedure execution began.
Stop           The date and time when procedure execution completed.

Get Advertised Objects

What are all the advertised tables/views/procedures?

Macro Results: Get Advertised Objects

Column Name        Column Description
ObjectDatabase     The database that contains the advertised object (table, view or procedure).
ObjectName         The name of the advertised object.
ObjectKind         T for Table, V for View, or P for Procedure.
CreateUser         The database user that created the object.
CreateTimestamp    The date and time the object was created.
CommentString      An optional comment describing the object.
OutputCategory     1 for Data Set, 2 for Score table, 3 for Profile or 0 for Other.
AdvertiseNote      Optional free-form text that may be used to categorize the object, for example by internal project or purpose.
OutputStatus       Used by maintenance functions to indicate an object that is either Missing (not in the dictionary) or Outdated (the dictionary has a later entry with the same name) with respect to the Teradata data dictionary.
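The coded columns in this result can be made easier to read with a CASE expression. The following is only a sketch: it assumes the rows returned by Get Advertised Objects have been saved in a hypothetical table named my_work.advertised_objects that uses the column names listed above. The code-to-label mappings are the ones given in the table.

```sql
-- Assumption: my_work.advertised_objects is a hypothetical table holding
-- the rows returned by Get Advertised Objects.
SELECT ObjectDatabase
      ,ObjectName
      ,CASE ObjectKind
         WHEN 'T' THEN 'Table'
         WHEN 'V' THEN 'View'
         WHEN 'P' THEN 'Procedure'
       END AS ObjectKindName
      ,CASE OutputCategory
         WHEN 1 THEN 'Data Set'
         WHEN 2 THEN 'Score table'
         WHEN 3 THEN 'Profile'
         WHEN 0 THEN 'Other'
       END AS OutputCategoryName
      ,OutputStatus
FROM my_work.advertised_objects
ORDER BY ObjectDatabase, ObjectName;
```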

Get Projects Table Input

What are the projects that select from a particular table to build a profile, ADS or score table?

Macro Results: Get Projects Table Input

Column Name    Column Description
MetadataDB     Metadata database containing the selecting project.
ProjectName    Name of the selecting project.
ProjectDesc    Description of the selecting project.

Get Analyses Table Input

What are the profile, score and ADS analyses that select from a particular table?

Macro Results: Get Analyses Table Input

Column Name       Column Description
AnalysisDesc      Description of the selecting analysis.
ObjectDatabase    Database containing the object created by the selecting analysis.
ObjectName        Name of the object created by the selecting analysis.

Get Analyses Column Input

What are the profile, score and ADS analyses that select from a particular column?

Macro Results: Get Analyses Column Input

Column Name       Column Description
MetadataDB        Metadata database containing the selecting analysis.
ProjectName       Name of the selecting project.
ProjectDesc       Description of the selecting project.
AnalysisName      Name of the selecting analysis.
AnalysisDesc      Description of the selecting analysis.
ObjectDatabase    Database containing the object created by the selecting analysis.
ObjectName        Name of the object created by the selecting analysis.

Get Creating Analysis

What project and analysis created the given table/view/procedure?

Macro Results: Get Creating Analysis

Column Name     Column Description
MetadataDB      Metadata database containing the creating analysis.
ProjectName     Name of the project containing the creating analysis.
ProjectDesc     Description of the project containing the creating analysis.
AnalysisName    Name of the creating analysis.
AnalysisDesc    Description of the creating analysis.

Teradata Data Types (Teradata Database)

Teradata Warehouse Miner analyses support most Teradata data types. All data types may be analyzed in a Values analysis. Otherwise, data types other than simple data types will typically not return meaningful results in descriptive statistics functions, and may result in an SQL error. Type DATE data can often be analyzed in numeric functions, but TIME and TIMESTAMP data will typically not return meaningful results except when transformed by date/time functions in the ADS Variable Creation analysis.
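To illustrate the kind of transformation meant here, the sketch below derives simple numeric and date values from a TIMESTAMP column so that they can be summarized. The table my_db.web_visits and its columns visit_date (DATE) and visit_ts (TIMESTAMP) are hypothetical, and the SQL is hand-written rather than the SQL a Variable Creation analysis would generate.

```sql
-- Assumption: my_db.web_visits(visit_date DATE, visit_ts TIMESTAMP(0))
-- is a hypothetical table used only for illustration.
SELECT MIN(visit_date)                                   AS first_visit
      ,MAX(visit_date)                                   AS last_visit
       -- Extract a numeric part of the timestamp before aggregating it
      ,AVG(CAST(EXTRACT(HOUR FROM visit_ts) AS FLOAT))   AS avg_visit_hour
       -- Reduce the timestamp to a DATE before counting distinct days
      ,COUNT(DISTINCT CAST(visit_ts AS DATE))            AS active_days
FROM my_db.web_visits;
```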

Large Number Support (Teradata Database)

The data types for Big Integers (BIGINT, containing up to 19 digits) and Big Decimals (DECIMAL type with precision greater than 18, up to 38 digits), introduced in Teradata V2R6.2, are supported in Teradata Warehouse Miner in a limited manner. This is also true for the NUMBER data type introduced in Teradata 14.00, which may be used to store floating point or large decimal data and perform calculations with a guaranteed precision of 38 digits. NUMBER(*) and NUMBER(*,n) hold floating point values, and NUMBER(m) and NUMBER(m,n) store decimal values.

With some limitations (NUMBER(*) and NUMBER(*,n) are displayed as floating point type with reduced precision), it is possible to view these large number types in the results data grid and to select them in the various Values wizards, as well as to view the appropriate type information in column selectors. In general, big number columns may experience precision loss or graphing problems in Descriptive Statistics. ADS and Reorganization functions support the new types in most cases, but a warning is given about possible precision loss in the numeric transformations Rescale, Sigmoid and Z-Score. Though allowed with a warning, the big number types are not recommended for the analytic algorithms, and should be analyzed cautiously when using the Statistical Tests.

The following table summarizes the support for big numbers in various types of analyses.

Support for Big Numbers per Analysis

Analysis                       Big Number Support
Values                         Treats as a numeric type
Statistics                     Operates normally but gives a warning
Frequency                      Operates normally but use of big numbers as Statistics columns gives a warning
Histogram                      Operates normally but gives a warning
Adaptive Histogram             Operates normally but gives a warning
Correlation Matrix             Gives a warning
Overlap                        Supported
Scatter Plot                   Not recommended (gives a warning)
Data Explorer                  Operates normally but gives a warning
Text Field Analyzer            Supported
Free Form SQL                  Big numbers may be selected and viewed
Variable Creation              Supported
Variable Transformation        Supported with a warning for Rescale, Sigmoid and Z-Score
Build ADS                      Supported
Refresh                        Not applicable
Reorganization analyses        Supported
Association                    Supports big numbers as input columns
Analytic (other) algorithms    Gives a warning, but caution recommended
Statistical tests              Gives a warning but caution is recommended; not recommended where numeric input is required (e.g., F(2)way Parametric Test)
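The precision loss that these warnings refer to can be seen by converting a big decimal value to floating point, which holds far fewer significant digits. The statement below is a minimal sketch with an arbitrary 38-digit value; how each column is displayed depends on the client and session settings, so no output is shown.

```sql
-- Illustrative only: the same 38-digit value as an exact big decimal
-- and as a floating point approximation.
SELECT CAST('12345678901234567890123456789012345678' AS DECIMAL(38,0)) AS exact_value
      ,CAST(CAST('12345678901234567890123456789012345678' AS DECIMAL(38,0)) AS FLOAT) AS approx_value;
```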

When entering a big decimal literal value, such as in the Variable Creation or Variable Transformation analysis, values may be entered either in the format expected by Teradata (i.e., with a period as the decimal separator) or in the format appropriate to the current locale. However, a number separator should not be used when the decimal separator is other than a period; only a decimal separator should be used.

Interval Data Type Support (Teradata Database)

The standard interval data types and literals are supported in the Variable Creation analysis and may be used to build date/time expressions, but will not return meaningful results if used directly in most descriptive statistics analyses. Interval data cannot be viewed directly in Teradata Warehouse Miner results displays unless it is first cast to a VARCHAR type.
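A cast of this kind might look like the following sketch, which computes an interval from two timestamps and converts it to a character string for display. The table my_db.service_calls and its columns are hypothetical.

```sql
-- Assumption: my_db.service_calls(call_id INTEGER, opened_ts TIMESTAMP(0),
-- closed_ts TIMESTAMP(0)) is a hypothetical table.
SELECT call_id
       -- The subtraction yields an INTERVAL DAY TO SECOND value,
       -- which is cast to VARCHAR so that it can be displayed.
      ,CAST((closed_ts - opened_ts) DAY(4) TO SECOND AS VARCHAR(30)) AS open_duration
FROM my_db.service_calls;
```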

Period Data Type Support (Teradata Database)

The various non-standard Period data types and literals may be used to create Period expressions in the Variable Creation analysis. Initially, Period Time With Time Zone and Period Timestamp With Time Zone are not supported as SQL elements in Variable Creation but may be created in a free-form manner using a SQL Text element.

Note: Period data cannot be viewed directly in Teradata Warehouse Miner results displays unless it is first cast to a VARCHAR type.
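As a sketch of the cast described in the note, the query below converts a Period column to a character string and also extracts its bounds with the BEGIN and END functions. The table my_db.policy and its PERIOD(DATE) column coverage are hypothetical.

```sql
-- Assumption: my_db.policy(policy_id INTEGER, coverage PERIOD(DATE))
-- is a hypothetical table.
SELECT policy_id
      ,CAST(coverage AS VARCHAR(40)) AS coverage_text   -- displayable form
      ,BEGIN(coverage)               AS coverage_start  -- DATE value
      ,END(coverage)                 AS coverage_end    -- DATE value
FROM my_db.policy;
```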

User Defined Type Support (Teradata Database)

Limited support is provided for User Defined Types or UDTs. This support consists primarily in providing a SQL element in the Variable Creation Analysis (not available in the Teradata Profiler product) called User Defined Method, with functionality similar to that provided by the User Defined Function SQL element. Both constructor and instance methods are supported. For instance methods, the user may apply a method to a column or expression of the same User Defined Type and may even “chain” method calls together.

User Defined Type support also includes recognizing User Defined Type as a column type, displaying the name of the type when hovering over an available column for various types of analysis, and including the type name in places where type information is displayed. Note, however, that columns of type User Defined Type cannot generally be processed directly by any of the available types of analysis, with the notable exception of the Values analysis (and of course, Variable Creation and Transformation).

Note: Data of User Defined Type cannot be viewed directly in Teradata Warehouse Miner results displays unless it is first cast to a VARCHAR type.

Temporal Table Support (Teradata Database)

Temporal tables, as available in Teradata 13.10, are supported in Teradata Warehouse Miner primarily through the support of temporal qualifiers in the Variable Creation analysis (not available in the Teradata Profiler product). Temporal qualifiers may precede the SELECT keyword or follow the table name, derived table or joined table in the FROM clause of a Select statement. Temporal support is also provided through the support of the Period data types, literals, functions, operators and built-in functions, as well as the EXPAND ON expert clause in the Variable Creation analysis.

Refer also to the Teradata User Documentation beginning with version 13.10 of the database, particularly the volume entitled Temporal Table Support, B035-1182, as well as the volume SQL Data Manipulation Language, B035-1146, in the chapter entitled The SELECT Statement and the topic EXPAND ON Clause.

When a temporal or bitemporal table is referenced in a query or used as input to an analysis, the current session Temporal Qualifier is used by default when selecting from that table, and the default session Temporal Qualifier is CURRENT VALIDTIME AND CURRENT TRANSACTIONTIME. This means that temporal tables may be analyzed by any Teradata Warehouse Miner analysis using the default session Temporal Qualifier. If a different qualifier is desired, the most straightforward solution is to create a view that incorporates the desired qualifier. For example, the SQL below would create a view incorporating the CURRENT VALIDTIME qualifier.

    REPLACE VIEW "twm_source"."twm_temporal_customer_view" AS
    CURRENT VALIDTIME
    SELECT * FROM "twm_source"."twm_temporal_customer";

A view such as the one defined above can be used to perform any available analysis on the underlying table in the view. Such a view definition can be built using either a Variable Creation analysis, utilizing the selectors available on the analysis parameters panel, or a Free Form SQL analysis.

If it is not practical to create a view with an embedded temporal qualifier, the current session Temporal Qualifier can be changed for some types of analyses by using a statement such as the following, executed before the desired analyses in a Free Form SQL analysis.

    SET SESSION CURRENT VALIDTIME AND CURRENT TRANSACTIONTIME;

All of the desired analyses to be performed with this temporal qualifier must be executed in the same session, and this is not possible for those types of analysis which make their own connection or connections, including the analytic algorithms and their scoring analyses, the Matrix analysis and the Data Explorer analysis. For other types of analyses, it is still necessary to ensure that the same connection is used for the set session command and each of the analyses that depend on it, which is best guaranteed by executing an entire project that contains the Free Form SQL analysis and dependent analyses.
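The two approaches can be sketched for a qualifier other than the default, here an AS OF qualifier. The view name and the date are illustrative assumptions, and the exact qualifier syntax should be confirmed against the Temporal Table Support volume for the database release in use.

```sql
-- Approach 1: embed the qualifier in a view (hypothetical view name),
-- which can then be used as input to any analysis.
REPLACE VIEW "twm_source"."twm_temporal_customer_asof" AS
VALIDTIME AS OF DATE '2016-12-31'
SELECT * FROM "twm_source"."twm_temporal_customer";

-- Approach 2: change the session qualifier in a Free Form SQL analysis
-- that runs first in the same project and on the same connection.
SET SESSION VALIDTIME AS OF DATE '2016-12-31';

-- A later Free Form SQL analysis might restore the default qualifier.
SET SESSION CURRENT VALIDTIME AND CURRENT TRANSACTIONTIME;
```

The view is the more robust choice, since it does not depend on which connection an analysis happens to use.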

Geospatial Support (Teradata Database)

Geospatial functions, as available in Teradata 13.10, are supported in Teradata Warehouse Miner primarily through the support of User Defined Types and User Defined Methods in the Variable Creation analysis (not available in the Teradata Profiler product). This is possible because the ST_GEOMETRY and MBR (Minimum Bounding Rectangle) geospatial data types are implemented internally in Teradata as User Defined Types. For more information on the support of User Defined Types, see User Defined Type Support (Teradata Database).

In order to include a call to a Geospatial function in the SQL to create an analytic data set (as created using the Variable Creation analysis), a User Defined Method SQL element can be used, found in the Other category. For convenience, the elements ST_GEOMETRY and MBR in the Geospatial category may be used as an alternative. When selected, these elements result in a User Defined Method with the appropriate User Defined Type (ST_GEOMETRY or MBR) being used.

One feature that is provided to make calling Geospatial functions easier is in the properties display of the User Defined Method SQL element in the Variable Creation analysis. This feature adds comments for certain functions to the parameter description display, based on any limitations on the sub-types (e.g. points, lines, and so forth) that may be processed by the method. Any limitation on the sub-type returned is also indicated in parentheses next to the result type of the method.

Another feature in Variable Creation that is useful in performing geospatial analysis is the ability to create literal values that represent geospatial objects. Using the standard String and String Parameter SQL elements, geospatial objects can be represented in what is called Well Known Text (WKT) format, such as ‘POINT(0 0)’. Such literal values are sometimes used as arguments to the geospatial methods. In addition, the SQL element ‘CAST’ can be used to convert geospatial objects to WKT format strings. This is particularly important considering that geospatial data cannot be viewed directly in Teradata Warehouse Miner unless it is first cast to a VARCHAR type.

A few Geospatial functions are available only as System functions or as User Defined Functions (UDFs). For example, AggGeomIntersection and AggGeomUnion are available as System functions rather than ST_Geometry methods because they are aggregate functions. Note that there is also a mechanism provided in the Variable Creation analysis to support geospatial functions that are implemented as Table functions, including GeoSequenceFromRows, GeoSequenceToRows, Tessellate and Tessellate_Search.

Note also that columns of ST_GEOMETRY or MBR type (or any User Defined Type) cannot generally be analyzed directly in Teradata Warehouse Miner profile analyses or algorithms (with the notable exception of the Values analysis). In order to perform this type of analysis on geospatial data in Teradata Warehouse Miner, the data must first be converted to some other type, possibly through the use of a database view.
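A view of the kind suggested above might look like the following sketch. It assumes a hypothetical table my_db.stores whose location column is of type ST_GEOMETRY and holds points; the view exposes a WKT string, numeric coordinates and a distance to a WKT literal, all of which are ordinary types that profile analyses can process.

```sql
-- Assumption: my_db.stores(store_id INTEGER, location ST_GEOMETRY)
-- is a hypothetical table whose location values are points.
REPLACE VIEW my_db.stores_flat AS
SELECT store_id
      ,CAST(location AS VARCHAR(200)) AS location_wkt      -- WKT text for display
      ,location.ST_X()                AS x_coord           -- numeric coordinate
      ,location.ST_Y()                AS y_coord           -- numeric coordinate
       -- A WKT literal passed to the constructor, then used as a method argument
      ,location.ST_Distance(NEW ST_Geometry('POINT(0 0)')) AS dist_from_origin
FROM my_db.stores;
```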

Teradata Columnar Support (Teradata Database)

The Teradata Columnar feature was first made available in Teradata 14.00, but only when purchased as a supplementary feature. The feature is invoked using special table partitioning syntax following the primary index clause of a Create Table statement. A simple example accessing the TWM tutorial data is given below.

    CREATE MULTISET TABLE twm_results.columnar_test AS (
      SELECT cust_id
            ,income
            ,city_name
            ,state_code
      FROM twm_source.twm_customer
    ) WITH DATA
    NO PRIMARY INDEX
    PARTITION BY COLUMN;

The phrase PARTITION BY COLUMN can be specified in the Additional Information text box on the OUTPUT-primary index tab of an ADS or Reorganization analysis other than Denorm or Refresh. Additional information about this screen can be found in OUTPUT Tab. If it should become necessary to specify a column attribute such as NOT NULL, this can be done using the Properties dialog of a Variable Creation or Variable Transformation analysis, but only using one of these analyses.

Calendar Support (Teradata Database)

Note: The Teradata Profiler product does not make use of calendar functions, but calendar views and UDF calls could be included in a Free Form SQL analysis.

Prior to release 14.00 of the Teradata RDBMS, a Teradata calendar view called CALENDAR in the SYS_CALENDAR database was the only calendar available. The Teradata Warehouse Miner product mimics the measures provided by this view, calculating them “inline” in the SQL generated by a Variable Creation analysis, as a higher performing and more convenient option than joining in the CALENDAR view.

In Teradata 14.00, two new calendars were introduced in addition to the Teradata calendar, one that follows the conventions of the International Organization for Standardization (ISO), and the other an Oracle compatible calendar. Any of these calendars (named Teradata, ISO and Compatible) may be accessed through either the CALENDAR view previously mentioned, or through a BusinessCalendar view that can be customized to specify holidays, work days, and so forth. When either the CALENDAR or BusinessCalendar view is joined into a query, it is the session calendar property that determines whether the Teradata, ISO or Compatible calendar is used.

Note: The user may replace the CALENDAR view with a Teradata Database version for performance reasons, but the best performance using the Teradata calendar should be available using the Teradata Warehouse Miner ‘inline’ SQL Calendar functions.
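The difference between joining in the CALENDAR view and calculating a measure inline can be sketched as follows. The table my_db.transactions and its DATE column tran_date are hypothetical, and the inline expressions are hand-written illustrations of the idea, not the SQL that Teradata Warehouse Miner actually generates.

```sql
-- Assumption: my_db.transactions(tran_id INTEGER, tran_date DATE)
-- is a hypothetical table.

-- Option 1: join in the calendar view to obtain the measures.
SELECT t.tran_id
      ,c.month_of_year
      ,c.quarter_of_year
FROM my_db.transactions t
JOIN SYS_CALENDAR.CALENDAR c
  ON c.calendar_date = t.tran_date;

-- Option 2: calculate the same measures inline, with no join.
SELECT tran_id
      ,EXTRACT(MONTH FROM tran_date)           AS month_of_year
      ,(EXTRACT(MONTH FROM tran_date) + 2) / 3 AS quarter_of_year  -- integer division
FROM my_db.transactions;
```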

In addition, almost all of the measures available by joining in the CALENDAR or BusinessCalendar view are also available by calling supplied User Defined Functions (UDFs). One of the parameters common to the BusinessCalendar UDFs is the calendar name (Teradata, ISO or Compatible), providing an alternative to setting the session calendar prior to calling the UDF or prior to executing a query that joins in the BusinessCalendar view.

To support this extensive calendar functionality, the following features are provided in the Variable Creation analysis.
• SQL elements for the original Teradata calendar measures, with inline SQL calculation.
• SQL elements for Calendar Functions that calculate the original measures and one additional one (Week Number of Quarter) by creating calls to supplied calendar UDFs.
• SQL elements for various before/after functions, such as the Sunday before a given date (these functions, introduced in Teradata 14.10, return either a date, timestamp or timestamp with time zone, as opposed to an integer).
• If it is desired to execute the basic CALENDAR functions as UDFs, they may be found using the User Defined Function SQL element in the SYSLIB database (for example, TD_DAY_OF_WEEK).
• If it is desired to execute calendar functions by joining to the system supplied business calendar views, the views may be selected from and joined to in the usual manner in a Variable Creation analysis. Note that some available business calendar functions are only available by joining to these views, including the following:

Views to Join Business Calendar functions to

Function                        Description
BusinessWeekBegin and End       First/last working day in which date occurs
BusinessMonthBegin and End      First/last working month in which date occurs
BusinessQuarterBegin and End    First/last working quarter in which date occurs
BusinessYearBegin and End       First/last working year in which date occurs

An additional, more general feature is provided on the Connection Properties dialog to allow the specification of one or more pre-execution commands, primarily to allow specification of the calendar to use for a session. The pre-execution commands (separated by semi-colons if more than one is specified) are executed once before a project is executed, a chain of referenced analyses is executed, or an individual analysis is executed.

Note that pre-execution commands are not performed inside a stored procedure produced by the product as an output option or when performing maintenance functions or metadata queries. In order to have the desired effect when the stored procedure output option has been requested, the pre-execution commands must be performed by the user just prior to calling the stored procedure. Pre-execution commands are, however, inserted prior to published SQL when the Publish analysis is used to send a model to the Model Manager web-based application. They are placed either ahead of all ADS SQL, or if scoring only, ahead of all scoring SQL. Also, pre-execution commands that set session values will have no effect on analytic algorithm, scoring or matrix analyses, since these types of analysis create their own sessions independent of this feature.

Finally, since the application uses session pooling when accessing the RDBMS, setting the session calendar can lead to unexpected results if care is not taken because a session with a particular setting may be reused unknowingly. For this reason, a warning is given if the pre-execution commands are changed, to the effect that the user should change or reset any previously set session parameters, or else restart the application. This warning is not given if pre-execution commands are supplied when previously there were none in effect.

When publishing to the Model Manager web-based application, it is recommended that any SET SESSION commands be reset in a post-processing command on the last executed analysis, once again due to the use of session pooling.
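As a sketch of how this might be used, the first line below is a list of two pre-execution commands as they could be entered on the Connection Properties dialog, and the statements that follow are the kind of reset that could be placed in a post-processing field. The particular settings are illustrative, and the reset assumes that the Teradata calendar and the INTEGERDATE date form were in effect beforehand.

```sql
-- Pre-execution commands, separated by a semi-colon:
SET SESSION CALENDAR = ISO; SET SESSION DATEFORM = ANSIDATE

-- Possible reset in a post-processing field of the last executed analysis
-- (assumes these were the settings in effect before the change):
SET SESSION CALENDAR = TERADATA;
SET SESSION DATEFORM = INTEGERDATE;
```

Resetting in this way reduces the chance that a pooled session is later reused with an unexpected calendar setting.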

QueryGrid Support

QueryGrid support differs depending on whether you are connected to a Teradata or an Aster database. See the applicable section for details.
• QueryGrid Support (Teradata Database)
• QueryGrid Support (Aster Database)

QueryGrid Support (Teradata Database)

QueryGrid Introduction (Teradata Database)

The QueryGrid™ feature in Teradata provides a way to read data from or write data to a remote database host. Packages are available as add-ons to the Teradata RDBMS for several varieties of foreign host, including Teradata, Aster, Oracle and several variations of Hadoop. In order to use the QueryGrid features for a particular host type within Teradata Warehouse Miner, the underlying QueryGrid package must have been purchased and installed within Teradata, and Teradata Warehouse Miner must have been certified with that particular package (as documented in the Teradata Warehouse Miner Release Definition, B035-2494). Although a QueryGrid package may have some unique features for a particular host type, the same basic user interface is provided for each host type.

QueryGrid Features (Teradata Database)

To get started, the user or DBA must define a Foreign Server object that identifies, among other things, the host type to connect to and the import and export stored procedures to use in receiving data from or sending data to the remote host. Using a special syntax that adds an ‘@’ symbol followed by a Foreign Server name to a database name, the user may refer in selectors to a table on a remote system. The name of a foreign database using the @server syntax can be manually typed into the databases pull-down selector, or added to the source databases on the databases tab of the Connection Properties dialog, and then selected using the pull-down selector. When columns from such tables are selected for certain types of analysis, the tables that contain them appear in generated SQL with the trailing ‘@’ symbol and server name. For example:

    SELECT * FROM twm_customer@hadoop1

The types of analysis that support this special syntax include Descriptive Statistics (except Correlation Matrix and Scatter Plot), ADS, Reorganization and Statistical Tests. The syntax is not supported in analytic algorithms, scoring analyses, matrix analyses and graphical drill-down, although these types of analysis may be performed through the use of a view that selects from a foreign table.

The Teradata-to-Hadoop QueryGrid feature allows inclusion of a RETURNS clause following the name of a foreign table that includes the @server syntax. The query with the RETURNS clause may, however, only be specified using a SQL text element representing a derived table (not a table function or operator) on the Tables tab in a Variable Creation analysis. Once specified on the Tables tab, the column selector on the left side of the screen can be utilized after setting the input source to Function Table and selecting the appropriate Function Table.

A query with a RETURNS clause might look like the following:

    SELECT make, model, price
    FROM tdsqlh_test@hadoop1
    RETURNS (make VARCHAR(2), model VARCHAR(50))

Support is also provided for the FOREIGN TABLE syntax via a SQL element in the Other category called ForeignTable. The user may drop this element on the Tables tab, set the values for the arguments (i.e., foreign host and query), and select columns from the foreign table by setting the input source to Function Tables and performing the steps described earlier. The purpose of the FOREIGN TABLE syntax is to provide a way to execute a query on the foreign host and limit the amount of data transferred over the network by including an appropriate WHERE clause. A query defined with the FOREIGN TABLE clause might look like the following:

    SELECT cust_id, income, age
    FROM FOREIGN TABLE (SELECT cust_id, income, age FROM twm_customer)@hadoop1 T1
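Two variations on the examples above may be helpful. The first adds a WHERE clause to the pushed-down query so that the filtering takes place on the foreign host. The second defines a view over a foreign table, which is the approach described earlier for analyses that do not support the @server syntax. The income threshold and the view name are illustrative assumptions; hadoop1 is the foreign server name used in the examples above.

```sql
-- Filter on the foreign host so that fewer rows cross the network.
SELECT cust_id, income, age
FROM FOREIGN TABLE (
       SELECT cust_id, income, age
       FROM twm_customer
       WHERE income > 50000        -- illustrative filter
     )@hadoop1 T1;

-- A view over a foreign table (hypothetical view name), usable as input
-- to analyses that cannot reference the @server syntax directly.
REPLACE VIEW twm_source.v_hadoop_customer AS
SELECT cust_id, income, age
FROM twm_customer@hadoop1;
```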

Note that tables on foreign servers may not be specified as output tables (for example, on the Output tab), and foreign databases may not be specified as containing metadata tables on the databases tab of the Connection Properties dialog. Further, foreign tables do not offer right-click options to SHOW TABLE or SHOW TYPE in the list of available columns as part of input selectors, although a SHOW SERVER option is provided. Although inserting into a foreign table is allowed in some cases, INSERT statements may only be used in Free Form SQL analyses and post-processing fields.

In addition to supporting the @server syntax, the Variable Creation analysis supports the direct invocation of the import and export table operators associated with QueryGrid where allowed (the Teradata-to-Teradata connector does not allow this, for example). It also provides access to various stored procedures associated with the QueryGrid feature as Run Units. These initially include HCTAS, which creates a table on a Hadoop system, HDROP, which drops a table on a Hadoop system, and ExecuteForeignSQL, which executes SQL on a foreign Teradata system using the Teradata-to-Teradata connector, or on a foreign Aster system using the Teradata-to-Aster connector. Also, a table operator called AsterExecute can be used to export data from a Teradata system to an Aster server, execute a remote function and return the results. AsterExecute can be found in the Table Operators category of SQL Elements, in the Foreign Server sub-category.

QueryGrid Limitations (Teradata Database)

The reference manual for each type or variation of QueryGrid describes considerations and limitations in its introduction. These should ideally be studied before making use of the feature.

Specifying A Foreign Database To Appear In Input Selectors (in Teradata)

Selecting A Foreign Database, Table And Columns For Input (in Teradata)

For Teradata-to-Aster QueryGrid, you must type in the Aster table name in the Tables combo box and then press the Tab key to load the columns for that table. This is because the QueryGrid Help Foreign Database on Aster returns only the Schemas in an Aster database. There is currently no way to retrieve the Aster tables for a schema.

QueryGrid Support (Aster Database)

QueryGrid Introduction (Aster Database)

The QueryGrid™ feature in Aster provides a way to read data from or write data to a remote database host. Initially, the remote host may be a Teradata or Hadoop host. Previously, the ability to access a remote database was provided by the SQL-MapReduce functions load_from_teradata, load_to_teradata, load_from_hcatalog and load_to_hcatalog. While these functions remain available, the ability to define a foreign server object has been added so that foreign databases, tables and columns can be referenced directly in SQL statements, as described in QueryGrid Features (Aster Database).

QueryGrid Features (Aster Database)

To get started, the user or DBA must define a Foreign Server object that identifies, among other things, the import and export SQL-MapReduce functions to use in receiving data from or sending data to the remote host. Using a special syntax that adds an “@” symbol followed by a Foreign Server name to a database name, the user may refer in selectors to a table on a remote system. The name of a foreign database using the @server syntax can be manually typed into the databases pull-down selector, or added to the source databases on the Databases tab of the Connection Properties dialog and then selected using the pull-down selector. When columns from such tables are selected for certain types of analysis, the tables that contain them appear in generated SQL with the trailing “@” symbol and server name (for example, SELECT * FROM twm_customer@hadoop1).

The types of analyses that support this special syntax include ADS and Reorganization. The syntax is not supported in graphical drill-down, although drill-down and the Scatter Plot analysis may be performed with a foreign host through the use of a view that selects from a foreign table.

Support is provided for the FOREIGN SERVER syntax or push-down query via a SQL element in the Other category called Foreign Table. The user may drop this element on the (Function) Tables tab, set the values for the arguments as Text elements (i.e., Foreign Host and Foreign Query) and select columns from the foreign table by setting the input source to Function Tables and then selecting the appropriate Function Table and columns. The purpose of the FOREIGN SERVER syntax is to provide a way to execute a query on the foreign host and limit the amount of data transferred over the network by including an appropriate WHERE clause. A query defined with the FOREIGN SERVER clause might look like the following:

    SELECT *
    FROM FOREIGN SERVER ($$ SELECT cust_id, income, age FROM twm_customer WHERE cust_id < 1362490 $$)@hadoop1

In addition to supporting the @server syntax, the Variable Creation analysis supports the direct invocation of the import and export SQL-MapReduce functions associated with QueryGrid, namely load_from_teradata, load_to_teradata, load_from_hcatalog and load_to_hcatalog.

QueryGrid Limitations (Aster Database)

The reference manual for each type or variation of QueryGrid describes considerations and limitations in its introduction. These should ideally be studied before making use of the feature.
• Tables on foreign servers may not be specified as output tables (for example, on the Output tab), and foreign databases may not be specified as containing metadata tables on the Databases tab of the Connection Properties dialog. Further, foreign tables do not offer right-click options to SHOW TABLE or SHOW TYPE in the list of available columns as part of input selectors. Although inserting into a foreign table is allowed in some cases, INSERT statements may only be used in Free Form SQL analyses and post-processing fields.
• Descriptive Statistics functions are not supported in QueryGrid for Aster.
• Foreign Server is not supported in the Sample analysis, since it is implemented as a SQL-MR Function.

Specifying A Foreign Database To Appear In Input Selectors (in Aster)

Selecting A Foreign Database, Table And Columns For Input (in Aster)

AppCenter Server Support

As described in the Aster AppCenter User Guide, AppCenter is a platform that enables developers to build, execute, and share Big-Data apps. It enables non-technical people to run the apps, visually study the data, and share their insights with others. Using the Publish Analysis, Teradata Warehouse Miner users can transform the product's generated SQL for Analytic Data Set (ADS) and Scoring analyses into Apps residing on an AppCenter server. The functionality enabled by this is quite similar to that provided by publishing to the Model Manager web-based application that is a component of the Teradata Warehouse Miner or Teradata ADS Generator products.

Depending on the type of database you are connected to, AppCenter support is provided for either Aster or Teradata. The ADS and Reorganization categories of analysis, along with the Free Form SQL analysis, can be used to create Apps. The Variable Creation analysis can be used to call SQL-MR functions, and can also be used to create score tables, as can the Free Form SQL analysis. While analyses in the Descriptive Statistics or Statistical Tests categories cannot be published directly, the SQL can be captured and placed in a Free Form SQL analysis and published from there. Note, however, that the SQL generated by the algorithms in the Analytics category may not be published.

Tips for Using TWM

SQL Options

Extra care should be taken when entering SQL clauses (Group, Where, Order, Having and Qualify) in the Expert Options of the Descriptive Statistics, Transformation and Data Reorganization analyses. The GUI does not verify the syntax of manual entries made in the SQL Options, and a SQL error will occur if the syntax is invalid.

Avoiding Teradata Warehouse Miner Keywords

Much like Teradata keywords, Teradata Warehouse Miner has “special” reserved words which may conflict with an analysis. This is particularly true for the Descriptive Statistics analyses. Such conflicts are rare, and in many cases they can be avoided by using an alias. In some cases, it may be necessary to define an SQL View on the table to be processed in order to rename the conflicting columns.

The Teradata Warehouse Miner keywords are defined in the Results—data or equivalent section for each analysis with pre-defined output column names. In addition to avoiding the keywords defined in the Results—data section, avoid analyzing columns whose names begin with “_twm”, as this is the convention used when a Teradata Warehouse Miner analysis has to generate a temporary column name.

Teradata Object Names

Teradata Warehouse Miner supports standard ASCII character object names, as well as Teradata special characters (such as White Space, Asterisk, etc.) and 8-bit characters. All SQL generated by Teradata Warehouse Miner contains double quotes around Teradata object names. Multi-byte object names are also supported.

Note: Beginning with Teradata 14.10, a system option is available in the DBS Control Record to allow object names to have up to 128 Unicode characters rather than the usual 30 characters of the default character set. When this Teradata release and option are in use, Teradata Warehouse Miner supports these Extended Object Names for input databases, tables, columns and indexes. Output databases, tables, columns and procedures are still limited to 30 characters, however.

One limitation of Extended Object Name support is that, when publishing, the name of an Input Database, Input Table, Anchor Database or Anchor Table that uses an Extended Object Name is truncated to 30 characters. These names must be manually restored by the user to their full value in the Model Manager using the Edit feature.

Note: Object names with trailing spaces are not supported in Teradata Warehouse Miner. This is in part because some database help information is returned with object names padded with spaces which must be removed, making it difficult to find a match if the real object names have one or more trailing spaces.
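The matching problem behind this restriction can be seen in a short Python sketch. The fixed width of 30 characters and the two table names are assumptions chosen for illustration; the actual format of the database help output is not described here.

```python
PAD_WIDTH = 30  # assumed fixed width of names in help output (illustrative)


def pad(name: str) -> str:
    """Simulate help output that pads an object name with spaces."""
    return name.ljust(PAD_WIDTH)


# Two hypothetical objects; the second really ends with a space.
real_names = ["sales", "sales "]

# After padding, the two names are indistinguishable.
reported = [pad(name) for name in real_names]

# Removing the padding also removes the genuine trailing space,
# so the second object can no longer be matched by name.
stripped = [name.rstrip() for name in reported]

print(reported[0] == reported[1])
print(stripped)
```

Because both padded names are identical, no stripping rule can recover which trailing spaces were real.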

Note: Temporary or work tables and views created by Teradata Warehouse Miner usually begin with “_twm”, so it is recommended to not begin object names with these characters.

Precision Loss

Precision loss may occur during statistical calculations on large numeric values (i.e., values that are close to the maximum value supported for the data type).

Create Table/View Options

The use of the option to Store the tabular output of this analysis in the database on the OUTPUT tab, together with an Output Type of Table, results in the generation of the necessary SQL to create a Table in the named database and to insert the selected answer set into the Table. Selecting an Output Type of View results in the generation of the necessary SQL to create a View from the selected answer set. A View appears in the database like a Table, but actually stores a SELECT statement for later execution. The underlying SELECT statement to perform the analysis is not executed until data is selected from the View.
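The practical difference between the two Output Types can be illustrated with a small sketch. It uses Python's built-in sqlite3 module purely as a stand-in for a Teradata connection, and the names out_table and out_view are hypothetical; the SQL that Teradata Warehouse Miner generates will differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE twm_customer (cust_id INTEGER, income INTEGER)")
conn.executemany(
    "INSERT INTO twm_customer VALUES (?, ?)", [(1, 100), (2, 300)]
)

# Output Type = Table: the answer set is computed and stored immediately.
conn.execute(
    "CREATE TABLE out_table AS "
    "SELECT AVG(income) AS avg_income FROM twm_customer"
)

# Output Type = View: only the SELECT statement is stored.
conn.execute(
    "CREATE VIEW out_view AS "
    "SELECT AVG(income) AS avg_income FROM twm_customer"
)

# The source data changes after the outputs were created.
conn.execute("INSERT INTO twm_customer VALUES (3, 800)")

# The table still holds the old answer; the view is evaluated now.
table_result = conn.execute("SELECT avg_income FROM out_table").fetchone()[0]
view_result = conn.execute("SELECT avg_income FROM out_view").fetchone()[0]

print(table_result, view_result)
```

A Table is therefore a snapshot of the analysis, while a View reruns the analysis each time it is queried.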

Data Formats Digit grouping is not supported when entering numbers as input parameters. For example, “1,234” may not be entered as the number of partitions in a Partition analysis, but “1234” may.

Graph Fonts

It is possible that, when working with database object names containing multi-byte characters, these object names may not appear optimally in Teradata Warehouse Miner Graphs, depending on available fonts. The following are suggested fonts that can be added in various language environments to possibly improve the display of multi-byte object names in Graphs.
• Japanese — MS PGothic
• Korean — Gulim
• Traditional Chinese — PMingLiU
• Simplified Chinese — SimSun

Editing Input Screens For many analyses, the Available columns for Input must come from a single table or analysis. When modifying such an analysis by choosing a different table or analysis for input, be aware that previously identified Selected Columns will remain in the input field until you move at least one column from a newly selected input table or analysis into the Selected Columns area.

Note: An exception to this behavior is made for Variable Transformation analyses, where an error is displayed upon selection of a new column in order to protect you from inadvertently losing work.


CHAPTER 4 Descriptive Statistics

Overview

The following sections discuss the descriptive statistics available with Teradata Warehouse Miner. These descriptive statistics provide a variety of functions to statistically analyze and explore a Teradata Database.

Descriptive Statistics

The descriptive statistics available with Teradata Warehouse Miner provide a variety of functions to statistically analyze and explore a Teradata database. Descriptive statistical analysis is valuable for the following reasons.
• It can provide business insight in its own right.
• It uncovers data quality issues which, if not corrected or compensated for, can jeopardize the accuracy of any analytic models that are based on the data.
• It isolates the data that should be used in building analytic models. For example, outlying values should sometimes be excluded from a model; in other cases, these values might be required to solve a particular business problem. Further, some statistical processes used in analytic modeling require a certain type of distribution of data. Descriptive statistical analysis can determine the suitability of various data elements for model input and can suggest transformations that may be required for these data elements.

In the case of the Descriptive Statistics, NULL values are handled through the generated SQL’s aggregate functions. In this case, SQL ignores the NULL value and adjusts the number of observations in its calculation. This effectively provides a listwise deletion of NULL values.

The following are the descriptive statistical functions currently available in Teradata Warehouse Miner:
• Adaptive Histogram — Determine the distribution of a numeric column(s) giving counts, sub-binning column(s) with higher counts and determining data spikes.
• Correlation Matrix — Build and view a correlation matrix.
• Data Explorer — Automated exploration of any number of tables or views within an entire database.
• Frequency — Compute frequency of column values or multi-column combined values. Optionally, compute frequency of values for pairs of columns in a single column list or two column lists, and generate simple statistics for any other column within a table.
• Histogram — Determine the distribution of a numeric column(s) giving counts with optional overlay counts and statistics.
• Overlap — Count overlapping column values in combinations of tables (i.e., find “key” values in common between tables).
• Scatter Plot — Plot sampled values of two to three variables in 2-D or 3-D.
• Statistical Analysis — Determine any of the following descriptive statistics for numeric column(s):
1. Minimum Value
2. Maximum Value
3. Mean Value
4. Standard Deviation
5. Skewness
6. Kurtosis
7. Standard Mean Error
8. Coefficient of Variance
9. Variance
10. Sum
11. Uncorrected Sums of squares
12. Corrected Sums of squares
13. Values Count
14. Modal Value
15. Percentiles
16. Top 5/Bottom 5 Rank and Values
• Text Field Analyzer — Analyze character data and help distinguish whether the field is a numeric type, a date, a time, a timestamp, or character data.
• Values Analysis — Count the number of values of various kinds for a given column or columns, including:
1. Number of Rows
2. Rows with Non-NULL Values
3. Rows with NULL Values
4. Unique Values
5. Rows with Value ‘0’
6. Rows with a Positive Value
7. Rows with a Negative Value
8. Rows Containing Blank Values

In order to add a Descriptive Statistical analysis to a Teradata Warehouse Miner Data Mining Project, create a new analysis with any of the mechanisms described in Using Teradata Warehouse Miner. This will produce the following dialog.

Add New Analysis dialog

Double-click or highlight the desired analysis, optionally change the default name and click OK. Each of these analyses is described in detail in the subsequent sections.
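The NULL handling described earlier for the Descriptive Statistics functions follows from how SQL aggregate functions behave. The sketch below uses Python's built-in sqlite3 module as a stand-in for a Teradata connection (an assumption made so that the example is runnable); the table t and its values are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (income INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?)", [(10,), (None,), (30,), (None,), (20,)]
)

# COUNT(*) counts rows; COUNT(income), SUM and AVG skip NULL values.
row_count, value_count, total, mean = conn.execute(
    "SELECT COUNT(*), COUNT(income), SUM(income), AVG(income) FROM t"
).fetchone()

print(row_count, value_count, total, mean)
```

The mean is computed over the three non-NULL observations only, which is the adjustment to the number of observations mentioned above.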

Values

Values analysis is often useful as the first type of analysis to perform on data which is relatively unknown to the analyst. It helps determine the nature and overall quality of the data: for example, whether the data is categorical or continuously numeric, how many null values it contains, and so forth. Values analysis can readily be applied to any type of character or numeric data, even date fields.

Given a table name and the name of a column, the Values analysis provides a count of the number of rows, rows with non-null values, rows with null values, rows with value 0, rows with a positive value, rows with a negative value, and the number of rows containing blanks in the given column. Optionally, unique values are calculated within the analysis as well. Note that for a column of nonnumeric type, the zero, positive and negative counts will always be zero (for example, “000” is not counted as 0).

If multiple columns are requested, a VOLATILE table is built, and all columns are processed in a single CREATE VOLATILE TABLE AS SELECT… statement. Data is reformatted with individual INSERT/SELECT statements into the final output dataset as described below. In this case, the create view option may not be requested.

The Values analysis is parameterized by specifying the table and column(s) to analyze, options unique to the Values analysis, as well as specifying the desired results and SQL or Expert Options.
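The counts that a Values analysis reports for a single numeric column can be approximated with conditional aggregation. The following is a minimal sketch and not the SQL that Teradata Warehouse Miner generates: it runs against Python's built-in sqlite3 module as a stand-in, the table demo and its data are invented, and it assumes that the row count includes rows with NULL values. The blank count is omitted because the column is numeric.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (balance INTEGER)")
conn.executemany(
    "INSERT INTO demo VALUES (?)",
    [(0,), (25,), (-5,), (None,), (25,), (0,), (40,)],
)

# One pass over the table; comparisons with NULL fall through to ELSE 0.
VALUES_SQL = """
SELECT COUNT(*)                                         AS xcnt,
       SUM(CASE WHEN balance IS NULL THEN 1 ELSE 0 END) AS xnull,
       COUNT(DISTINCT balance)                          AS xunique,
       SUM(CASE WHEN balance = 0 THEN 1 ELSE 0 END)     AS xzero,
       SUM(CASE WHEN balance > 0 THEN 1 ELSE 0 END)     AS xpos,
       SUM(CASE WHEN balance < 0 THEN 1 ELSE 0 END)     AS xneg
FROM demo
"""

names = ["xcnt", "xnull", "xunique", "xzero", "xpos", "xneg"]
profile = dict(zip(names, conn.execute(VALUES_SQL).fetchone()))

print(profile)
```

Note that COUNT(DISTINCT ...) ignores NULL, so the unique count here covers non-NULL values only.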

Note: For general information about output, see OUTPUT Tab.

Initiating a Values Analysis

Use the following procedure to initiate a new Values analysis in Teradata Warehouse Miner.
1. Click the Add New Analysis icon in the toolbar.
Add New Analysis: Descriptive Statistics

2. In the resulting Add New Analysis dialog box, double-click the Values icon. Add New Analysis from toolbar

The Values dialog box appears, in which you will enter INPUT and OUTPUT options to parameterize the analysis as described in the following sections.
• Values - INPUT - Data Selection
• Values - INPUT - Analysis Parameters
• Values - INPUT - Expert Options
• Values - OUTPUT

Values - INPUT - Data Selection 1. On the Values dialog box, click on INPUT.

Values > Input > Data Selection

2. Click on data selection. The resulting screen has the following options available: • Select Input Source — Users who are not using the Teradata Profiler program may select between different sources of input. By selecting the Table option, the user can select from available databases, tables (or views) and columns in the usual manner. By selecting the Analysis option, however, the user can select directly from the output of another analysis of qualifying type in the current project. Analyses that may be selected from directly include all of the Analytic Data Set (ADS) and Reorganization analyses (except Refresh). In place of Available Databases, the user may select from Available Analyses, while Available Tables then contains a list of all the output tables that will eventually be produced by the selected Analysis, or it contains a single entry with the name of the analysis under the label Volatile Table, representing the output of the analysis that is ordinarily produced by a Select statement. For more information, see INPUT Tab.

Note: View is only available when a single column is selected.
• Select Columns From a Single Table
∘ Available Databases (or Analyses) — Choose the database (or analysis) from which you will select data tables.
∘ Available Tables — Select the table from which you will select columns.
∘ Available Columns — Select columns by highlighting and then either dragging and dropping into the Selected Columns window, or click on the arrow button to move highlighted columns into the Selected Columns window.
∘ Group By Columns — Expand the Group By columns selector by clicking on the “double-up-arrow”. Select columns by highlighting and then either dragging and dropping into the Group By Columns window, or click on the arrow button to move highlighted columns into the Group By Columns window. If 2 or more Group By Columns are selected, no graphs will be available.

Values - INPUT - Analysis Parameters 1. On the Values dialog box, click on INPUT. 2. Click on analysis parameters. Values > Input > Analysis Parameters

The resulting screen has the following option available:
• Compute the number of unique values in each column — By default, the Values analysis will calculate the number of unique values within the column specified. Disabling this option removes that calculation from the analysis.

Values - INPUT - Expert Options 1. On the Values dialog box, click on INPUT. 2. Click on expert options. Values > Input > Expert Options

The resulting screen has the following option available: • WHERE Clause text — Option to generate a SQL WHERE clause(s) to restrict rows selected for analysis.

Values - OUTPUT Before executing the analysis, define output options. 1. On the Values dialog box, click on OUTPUT. Values > Output

The resulting screen has the following options:
• Storage Options
∘ Use the Teradata EXPLAIN feature to display the execution path for this analysis — Option to generate a SQL EXPLAIN SELECT statement, which returns a Teradata Execution Plan.
∘ Store the tabular output of this analysis in the database — Option to generate a Teradata TABLE or VIEW populated with the results of the analysis. Once enabled, the following three fields must be specified:
▪ Database Name — Text box to specify the name of the Teradata database in which the resultant Table or View will be created. By default, this is the “Result Database.”
▪ Output Name — Text box to specify the name of the Teradata Table or View.
▪ Output Type — Pull-down list to specify Table or View.
▪ Stored Procedure — The creation of a stored procedure containing the SQL generated for this analysis can be requested by entering the desired name of the stored procedure here. This creates a stored procedure in the user's login database in place of the execution of the SQL generated by the analysis. For more information, see Stored Procedure Support (Teradata Database).
▪ Procedure Comment — When an optional procedure comment is entered, it is applied to a requested stored procedure with an SQL Comment statement. It can be up to 255 characters in length and contain substitution parameters for the output category (Score, ADS, Stats or Other), project name and/or analysis name (using the tags , and , respectively). The default value of this field may be set on the Defaults tab of the Preferences dialog box that is available from the Tools > Preferences menu option.
▪ Create output table using the FALLBACK keyword — If a table is selected, it will be built with FALLBACK if this option is selected.
▪ Create output table using the MULTISET keyword — If a table is selected, it will be built as a MULTISET table if this option is selected.
▪ Advertise Output — The Advertise Output option may be requested when creating a table, view or procedure. This feature “advertises” output by inserting information into one or more of the Advertise Output metadata tables according to the type of analysis and the options selected in the analysis. For more information, see Advertise Output.
▪ Advertise Note — An advertise note may be specified if desired when the Advertise Output option is selected or when the Always Advertise option is selected on the Connection Properties dialog box. It is a free-form text field of up to 30 characters that may be used to categorize or describe the output.
∘ Generate the SQL for this analysis, but do not execute it — If this option is selected, the analysis will only generate SQL, returning it and terminating immediately.

Running the Values Analysis After setting parameters on the INPUT and OUTPUT screens as described above, you are ready to run the analysis. 1. To run the analysis, you can either:

• Click the Run icon on the toolbar, or
• Select Run on the Project menu, or
• Press the F5 key on your keyboard

Results - Values

The results of running the Values analysis include the generated SQL itself, the results of executing the generated SQL, Circular and/or Bar Charts and, if the Create Table (or View) option is chosen, a Teradata table (or view). All of these results are outlined below.
• Values - RESULTS - Data
• Values - RESULTS - Graph
• Values - RESULTS - SQL

Note: The RESULTS tab will be grayed-out (disabled) until you have run the analysis.

Values - RESULTS - Data
1. On the Values dialog box, click on RESULTS.
2. Click on data.
Values > Results > Data

Results data, if any, is displayed in a data grid as described in RESULTS Tab. The following is a description of the results returned by the analysis, depending on the options selected. If an output table is created, the columns in bold below comprise the Unique Primary Index (UPI) of the output table.

Data Results for Values

Name | Type | Definition
COL | Group By Column Type | Column(s) will be created only if the “group” parameter is specified. Multiple columns, if specified, will be created or displayed as the column name. These column(s) will contain the unique values that the group by column takes on.
xtbl | VARCHAR (30) | Table that the variable for the values operation resides in, as specified by the “table” parameter.
xcol | VARCHAR (30) | Variable that the values operation will be run against, as specified by the “column” parameter.
xtype | VARCHAR (30) | The data type of this variable.
xcnt | FLOAT | The total number of occurrences of this variable.
xnull | FLOAT | Total number of rows where this variable takes on a null value.
xunique | FLOAT | Total number of rows where this variable takes on a unique value.
xblank | FLOAT | Total number of rows where this variable is blank.
xzero | FLOAT | Total number of rows where this variable is equal to 0.
xpos | FLOAT | Total number of rows where this variable has a positive value.
xneg | FLOAT | Total number of rows where this variable has a negative value.

Values - RESULTS - Graph
1. On the Values dialog box, click on RESULTS.
2. Click on graph.
Values > Results > Graph

Circular Graphs and/or Bar Charts are available for each column specified for the Values analysis, provided that 2 or more Group By Columns have not been selected. Some of the features of the graphs provided for the Values analysis are described in the following sections:
• Values Graph Drill Down Functions
• Show Graph
• Graph Options

Values Graph Drill Down Functions

Drill down into one or more graph values to see source data rows. Each value selected will be entered into an editable drill down list automatically (a drill down window pops up as values are clicked). All values in the list will be used as drill down criteria, resulting in a drill down data table containing rows belonging to all of the values selected.

First, select graph values that target the source data of interest (such as all rows that contain positive values). There are two ways to select graph values for drill down:
1. In the Zoom View, click on one or more bars or circle graph segments to add values to the drill down list.
2. Or, send values to the drill down list by selecting rows in the data selector, right-clicking on the rows and clicking on the ‘Drill Down’ pop-up menu item. All non-null and non-zero count values will be added to the drill down list for each column selected.

The drill down list can be edited after graph selections are added, and saved for future reference.

Second, select the ‘drill down’ tab on the Drill Down window to query the source data table. Filtered rows (rows that fit the drill down criteria) will be returned and displayed in the drill down table.

In the header of the drill down display are options to “show graph data only” or “include related columns”. The tabular icon to the right of these two options in the header can be used to select up to 20 columns for display with the second option. By default, the first 20 columns when sorted alphabetically are displayed with the second option. The “sql” to the left of the tabular icon can be used to copy the SQL Where clause for the drill down query to the Clipboard for pasting into another text area or application.

Values Drill Down List

Additional drill down features:
• Save each list before creating a new one if it needs to be used again. After saving, a small icon appears next to the drill down icon in the upper right corner of the graph page.
• Clicking on the thumbnail reloads the saved list, enabling repeated reviews of the source data. In addition, the saved list can be copied, edited by deletions or additions, and then replaced over the old list or saved as new. Clicking on the drill down icon reveals a list of default descriptive labels for the saved drill downs, which can be changed to something the user finds more meaningful.
• Saved drill down lists can be deleted by right-clicking on thumbnail icons, or from the drill down label list.

The following right-click menu options are available from the display of drill down data rows after drill down is performed. They can be used to copy data to Microsoft Excel or to the Clipboard. When copying data to the Clipboard, columns are delineated by tab characters.
• Export to Excel — All Rows
• Export to Excel — Selected Rows
• Copy (to Clipboard) — All Rows
• Copy (to Clipboard) — Selected Rows

Note: Drill down cannot retrieve source rows for a column that is renamed with an alias. Drill down will also not work properly when the table being analyzed is a volatile table created by a referenced analysis that doesn't create an output table or view.

Note: Drill down with the histogram crosstab option selected will not work properly if either subject column name is greater than 24 characters in length, in which case the workaround is to create a view to analyze.
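How a drill down list turns into a row filter can be sketched as follows. Everything here is hypothetical: the mapping from graph value categories to SQL predicates, the function drill_down_where, and the table demo are invented for illustration, and the criteria are assumed to be combined so that a row matching any selected value is returned. The Where clause that Teradata Warehouse Miner actually produces can be copied with the “sql” option described above. Python's built-in sqlite3 module stands in for a Teradata connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (cust_id INTEGER, balance INTEGER)")
conn.executemany(
    "INSERT INTO demo VALUES (?, ?)",
    [(1, 0), (2, 25), (3, -5), (4, None), (5, 40)],
)

# Hypothetical mapping from a graph value category to a SQL predicate.
PREDICATES = {
    "NULL": "{col} IS NULL",
    "Zero": "{col} = 0",
    "Positive": "{col} > 0",
    "Negative": "{col} < 0",
}


def drill_down_where(column, selected):
    """Combine the selected categories into one WHERE clause body."""
    parts = [PREDICATES[category].format(col=column) for category in selected]
    return " OR ".join(f"({part})" for part in parts)


# Drill down list containing two selected graph values.
where = drill_down_where("balance", ["NULL", "Negative"])
rows = conn.execute(
    f"SELECT cust_id FROM demo WHERE {where} ORDER BY cust_id"
).fetchall()

print(where)
print(rows)
```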

Show Graph

The following right-click options are available:
• Maximize, Print and Export — The standard options Maximize, Print and Export are described in RESULTS Tab.
• Absolute Counts (bar graph) — By default, a bar graph is given showing the absolute counts of NULL, Unique, Zero, Positive and Negative values for numeric and date columns, and NULL, Unique and Blank values for all other data types. Unless the value returned is 0, a gray bar is shown for all values, overlaid by colors for each of the following value types:
∘ NULL — Red
∘ Unique — Green
∘ Zero — Orange
∘ Positive — Teal
∘ Negative — Purple
∘ Blank — Blue
• Hide Remaining Rows — When Absolute Counts is selected, an option to Hide Remaining Rows may be selected to not display the top (gray) portion of each bar on the graph. By default, the inverse of each portion of the Values analysis is shown in gray. For example, a table with 1000 rows may have 150 NULL values. The bar for NULLs would then show 1000 rows, 850 in gray and 150 in red. When Hide Remaining Rows is in effect, the option to Show Remaining Rows may be selected to return to the original view.
• Relative Counts (circle graph) — Optionally, a circular graph can be displayed, showing the proportion of each value category to the total count. It should not be confused with a “Pie Chart”, where the pieces of the pie are assumed to sum to 1. Instead, each value category, together with its inverse, sums to 1, using the same color coding scheme as described above. Each inverse proportion (i.e., Not NULL, Not Unique, and so forth) is shown in gray and labeled accordingly. The relative counts of each value category are also displayed above the circular graph, along with the relative percentage. Relative percentages are also shown on the graph itself for each value category and its associated inverse.
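The arithmetic behind the gray “remaining rows” portion and the relative percentages can be worked through with the 1000-row example given above. The function name relative_counts is hypothetical and is used only for this sketch.

```python
def relative_counts(total_rows, counts):
    """Return count, inverse count and both percentages per category."""
    result = {}
    for category, count in counts.items():
        inverse = total_rows - count  # the gray portion of the bar
        result[category] = {
            "count": count,
            "inverse": inverse,
            "pct": 100 * count / total_rows,
            "inverse_pct": 100 * inverse / total_rows,
        }
    return result


# A table with 1000 rows, of which 150 hold NULL values.
shares = relative_counts(1000, {"NULL": 150})
null_share = shares["NULL"]

print(null_share)
```

Each category is paired only with its own inverse, which is why the percentages of different categories need not add up to 100.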

Graph Options

A data-grid control is used to select the data to display. Much like a Microsoft Excel pivot-table, the Data Grid has the following properties:
• Select — The data to be graphed can be selected by either clicking in the upper-left square of the Data Grid to select the entire data set, or holding the left mouse button down, dragging over the desired rows and releasing the mouse button. The rows highlighted will be graphed automatically.
• Sort — To sort the data in the Data Grid, click the right mouse button on any column header. The sort is always done from left to right with respect to the columns of data. Subsequent right mouse clicks toggle the sort from ascending to descending.
• Pivot — Pivoting of data is accomplished by holding the left mouse button down on the column header that you wish to pivot, dragging the column to its new desired position and releasing the mouse button. The pivoted column will appear to the left of the column that it was dragged on top of.

For a Values analysis, column headers can take on the following values:
• Table — Table that the variable for the values operation resides in, as specified by the “table” parameter.
• Column — Variable that the values operation will be run against, as specified by the “column” parameter.
• Type — Data type of this variable.
• Count — Total number of occurrences of this variable.
• Null — Total number of rows where this variable takes on a null value.
• Unique — Total number of rows where this variable takes on a unique value.
• Blank — Total number of rows where this variable is blank.
• Zero — Total number of rows where this variable is equal to 0.
• Pos — Total number of rows where this variable has a positive value.
• Neg — Total number of rows where this variable has a negative value.

If a single group by column was specified, an additional group by selector is displayed, with a column header that is the name of the group by column. All distinct values of the group by column are shown and can be selected for display.

Values - RESULTS - SQL 1. On the Values dialog box, click on RESULTS. 2. Click on SQL. Values > Results > SQL

On this screen, the generated SQL is returned as text which can be copied by using the Select All and Copy buttons.

Tutorial - Values Analysis

Values - Example #1
1. Parameterize a Values analysis as follows:
• Columns to Analyze
∘ twm_customer.income
∘ twm_customer.age
∘ twm_customer.years_with_bank
∘ twm_customer.nbr_children
∘ twm_customer.gender
∘ twm_customer.marital_status
• Compute the Number of Unique Values for each column — Enabled
2. Run the analysis.
3. When it completes, click on the RESULTS tab.

For this example, the Values analysis generated the following results. The SQL is not shown for brevity.

Values Example #1 Data

xtbl | xcol | xtype | xcnt | xnull | xunique | xblank | xzero | xpos | xneg
TWM_CUST… | income | INTEGER | 747 | 0 | 640 | 0 | 102 | 645 | 0
TWM_CUST… | age | SMALLINT | 747 | 0 | 77 | 0 | 0 | 747 | 0
TWM_CUST… | years_with_bank | SMALLINT | 747 | 0 | 10 | 0 | 88 | 659 | 0
TWM_CUST… | nbr_children | SMALLINT | 747 | 0 | 6 | 0 | 466 | 281 | 0
TWM_CUST… | gender | CHAR(1) | 747 | 0 | 2 | 0 | 0 | 0 | 0
TWM_CUST… | marital_status | CHAR(1) | 747 | 0 | 4 | 0 | 0 | 0 | 0

Values Example #1 Graph

Thumbnails for each column in the analysis are shown in Absolute Counts (bar graph) format. As the mouse passes over each thumbnail, the cursor changes to indicate a hyperlink, and the individual graphs can be maximized by clicking on the thumbnail:

Values Example #1 Graph: Maximize Thumbnail

The maximized graphic becomes modal and must be closed to return to the user interface.
4. Right-click on a graph and select Relative Counts (circle graph) to view that graphic.

Values Example #1 Graph: Relative Counts

Values - Example #2
1. Parameterize a Values analysis as follows:
• Columns to Analyze
∘ twm_customer.income
∘ twm_customer.age
∘ twm_customer.years_with_bank
∘ twm_customer.nbr_children
∘ twm_customer.marital_status
• Compute the Number of Unique Values for each column — Enabled
• Columns to Include in the Group By Clause — gender
2. Run the analysis in the same manner described above.

This time, the following Results were generated. The SQL is not shown for brevity:

Values Example #2 Data

Gender | xtbl | xcol | xtype | xcnt | xnull | xunique | xblank | xzero | xpos | xneg
M | TWM_C… | income | INTEGER | 329 | 0 | 284 | 0 | 45 | 284 | 0
F | TWM_C… | income | INTEGER | 418 | 0 | 359 | 0 | 57 | 361 | 0
M | TWM_C… | age | SMALLINT | 329 | 0 | 71 | 0 | 0 | 329 | 0
F | TWM_C… | age | SMALLINT | 418 | 0 | 74 | 0 | 0 | 418 | 0
M | TWM_C… | years_with_bank | SMALLINT | 329 | 0 | 10 | 0 | 37 | 292 | 0
F | TWM_C… | years_with_bank | SMALLINT | 418 | 0 | 10 | 0 | 51 | 367 | 0
M | TWM_C… | nbr_children | SMALLINT | 329 | 0 | 6 | 0 | 206 | 123 | 0
F | TWM_C… | nbr_children | SMALLINT | 418 | 0 | 6 | 0 | 260 | 158 | 0
M | TWM_C… | marital_status | CHAR(1) | 329 | 0 | 4 | 0 | 0 | 0 | 0
F | TWM_C… | marital_status | CHAR(1) | 418 | 0 | 4 | 0 | 0 | 0 | 0

Values Example #2 Graph

As above, thumbnails are displayed, this time for each column and each group by column value, in Absolute Counts (bar graph) format. As the mouse passes over each thumbnail, the cursor changes to indicate a hyperlink, and the individual graphs can be maximized by clicking on the thumbnail. The Relative Counts (circle graph) are also available, each graph displaying both the column and group by column value. In the picture above, the data selection tab was first used to select all of the data, resulting in the display of page 1 of 2.

Statistical Analysis

When dealing with numeric data columns, it is useful to have several statistical measures to understand the characteristics and properties of each of those numeric columns, to assess their quality, and to look for outlying values and other possible anomalies. Statistical analysis provides several common and not so common statistical measures for numeric data columns. Extended options include additional analyses and measures such as Values, Modes, Quantiles, and Ranks (top 5 and bottom 5 values, and their respective counts). The Values analysis provided is also available separately, as described in Values.

Given a table name and the name(s) of numeric column(s), Statistical analysis determines descriptive statistics for each of the column(s). Univariate Statistics provided include the following:
• Count
• Minimum
• Maximum
• Mean
• Standard Deviation
• Skewness
• Kurtosis
• Standard Error
• Coefficient of Variance
• Variance
• Sum
• Uncorrected Sums of squares
• Corrected Sums of squares

For columns of type DATE, statistics other than count, minimum, maximum and mean are calculated by first converting to the number of days since 1900.

In addition to these basic numerical statistics, extended statistics can be requested to add the following to the analysis:
• Values — Count of rows with value 0, rows with a unique value, rows with a positive value, rows with a negative value, and the number of rows containing blanks in the given column
• Modes — Modal value and number of modes
• Quantiles — Bottom ten, top ten, deciles, quartiles, and tertiles
• Rank — Top 5 and bottom 5 ranked values, with respective counts

With the extended option Quantiles, the bottom 10, top 10, deciles, quartiles and tertiles are determined by dividing the data set into the respective equal size groups and providing the max value from that group.

With the extended option Modes, only the minimum modal value is returned and an additional column called “xnbrmodes” for the number of modes or modality is generated. For example, if the modes are 10 and 20, the value 10 is returned for the mode with xnbrmodes equal to 2.

With the extended option Rank, the top 5 and bottom 5 values are determined as distinct values, with counts also provided for each value. For example, if the top values are 5,6,7,8,9,10,10,10, the top 5 values returned are 6,7,8,9,10 with counts 1,1,1,1,3.
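The Modes and Rank behavior described above can be sketched in a few lines of Python. This is only an illustration of the two worked examples, not the SQL the product generates; the function names `min_mode`, `top_ranked`, and `bottom_ranked` are hypothetical.

```python
from collections import Counter


def min_mode(values):
    """Return (smallest modal value, number of modes) for a non-empty list."""
    counts = Counter(values)
    highest = max(counts.values())
    modes = sorted(v for v, c in counts.items() if c == highest)
    return modes[0], len(modes)


def top_ranked(values, k=5):
    """Return the k largest distinct values (ascending) with their counts."""
    counts = Counter(values)
    return [(v, counts[v]) for v in sorted(counts)[-k:]]


def bottom_ranked(values, k=5):
    """Return the k smallest distinct values (ascending) with their counts."""
    counts = Counter(values)
    return [(v, counts[v]) for v in sorted(counts)[:k]]


if __name__ == "__main__":
    # Modes 10 and 20 tie, so the minimum mode is reported with 2 modes.
    print(min_mode([10, 10, 20, 20, 30]))
    # Distinct top values 6..10, with 10 occurring three times.
    print(top_ranked([5, 6, 7, 8, 9, 10, 10, 10]))
```

Ranking distinct values rather than rows is what makes the count column meaningful: a value that occurs many times takes up one rank position and reports its frequency alongside.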

If multiple columns are requested, a VOLATILE table is built, and all columns are processed in a single CREATE VOLATILE TABLE AS SELECT… statement. Data is reformatted with individual INSERT/SELECT statements into the final output dataset as described below. In this case, the create view option may not be requested. By default, population statistics are generated unless the Sample option is selected.

The Statistical analysis is parameterized by specifying the table and column(s) to analyze, options unique to the Statistical analysis, as well as specifying the desired results and SQL or Expert Options.

Note: For general information about output, see OUTPUT Tab.

Initiating a Statistical Analysis Use the following procedure to initiate a new Statistical analysis in Teradata Warehouse Miner. 1. Click on the Add New Analysis icon in the toolbar. Add New Analysis from Toolbar

2. In the resulting Add New Analysis dialog box, double-click on the Statistics icon. Add New Analysis: Descriptive Statistics

The Statistics dialog box appears, in which you will enter INPUT and OUTPUT options to parameterize the analysis as described in the following sections.
• Statistics - INPUT - Data Selection
• Statistics - INPUT - Analysis Parameters
• Statistics - INPUT - Expert Options
• Statistics - OUTPUT

Statistics - INPUT - Data Selection 1. On the Statistics dialog box, click on INPUT. 2. Click on data selection. Statistics > Input > Data Selection

The resulting screen has the following options available: • Select Input Source — Users who are not using the Teradata Profiler program may select between different sources of input. By selecting the Table option, the user can select from available databases, tables (or views) and columns in the usual manner. By selecting the Analysis option, however, the user can select directly from the output of another analysis of qualifying type in the current project. Analyses that may be selected from directly include all of the Analytic Data Set (ADS) and Reorganization analyses (except Refresh). In place of Available Databases, the user may select from Available Analyses, while Available Tables then contains a list of all the output tables that will eventually be produced by the selected analysis, or it contains a single entry with the name of the analysis under the label Volatile Table, representing the output of the analysis that is ordinarily produced by a Select statement.

Statistics - INPUT - Analysis Parameters
1. On the Statistics dialog box, click on INPUT.
2. Click on analysis parameters.
Statistics > Input > Analysis Parameters

The resulting screen has the following options available: • Basic Statistics Options — The following basic univariate statistics are individually selectable for the analysis. By default, the Number of Values, Minimum Value, Maximum Value, Mean Value and Standard Deviation are selected (and must be selected for graphs to be available). The Check All and Clear All buttons can be used to enable or disable all options. ∘ Number of Values (required for graphs) — A count of the total number of rows (observations) with values for the specified column. ∘ Minimum Value (required for graphs) — The smallest value taken on by the column:

∘ Maximum Value (required for graphs) — The largest value taken on by the column:

∘ Mean Value (required for graphs) — The average value of the column:

where n is the total number of rows (observations) with values for the variable x. ∘ Standard Deviation (required for graphs) — The standard deviation of the variable. The standard deviation is a measure of how widely values are dispersed from the average value (the mean), and is calculated as follows, based on the entire population (by default):

If Sample Statistics are chosen, the following formula is used:


In both cases, n is the total number of rows (observations) with values for the variable x. ∘ Skewness — The skewness of the variable is a characterization of the degree of asymmetry of a distribution around its mean. Positive skewness indicates a distribution with an asymmetric tail extending toward more positive values. Negative skewness indicates a distribution with an asymmetric tail extending toward more negative values.

Note: The measures for Skewness (and Kurtosis) that are provided by Teradata Warehouse Miner are also known as the “Fisher g statistics,” related to the “momental skewness and kurtosis” [D’Agostino, Belanger, and D’Agostino Jr.].

Skewness is calculated as follows, based on the entire population (by default):

If Sample Statistics are chosen, the sample standard deviation, as shown above, is used for s. Otherwise, the population standard deviation is used. In the above equation, n is the total number of rows (observations) with values for the variable x. Note that skewness is undefined when either the standard deviation of the variable is equal to 0, or the number of occurrences is less than 3. ∘ Kurtosis — The kurtosis of the variable is a characterization of the relative peakedness or flatness of a distribution compared with the normal distribution. Positive kurtosis indicates a relatively peaked distribution. Negative kurtosis indicates a relatively flat distribution.

Note: The measures for Kurtosis (and Skewness) that are provided by Teradata Warehouse Miner are also known as the “Fisher g statistics,” related to the “momental skewness and kurtosis” [D’Agostino, Belanger, and D’Agostino Jr.].

Kurtosis is calculated as follows, based on the entire population (by default):

If Sample Statistics are chosen, the sample standard deviation, as shown above, is used for s. Otherwise, the population standard deviation is used. In the equation above, n is the total number of rows (observations) with values for the variable x. Note that kurtosis is undefined when either the standard deviation of the variable is equal to 0, or the number of occurrences is less than 4.

∘ Standard Error — The standard error of the variable, calculated as the standard deviation divided by the square root of the number of occurrences. Standard error is calculated as follows, based on the entire population (by default):

If Sample Statistics are chosen, the following formula is used:

In both cases, n is the total number of rows (observations) with values for the variable x. ∘ Coefficient of Variance — The coefficient of variance of the variable, calculated as 100 times the standard deviation divided by the mean. Coefficient of variance is calculated as follows, based on the entire population (by default):

If Sample Statistics are chosen, the following formula is used:

In both cases, n is the total number of rows (observations) with values for the variable x. Note that coefficient of variance is undefined when the average of the variable is 0. ∘ Variance — The variance of the variable, calculated as the square of the standard deviation. Variance is calculated as follows, based on the entire population (by default):


If Sample Statistics are chosen, the following formula is used:

In both cases, n is the total number of rows (observations) with values for the variable x. ∘ Sum — The sum of the variable, calculated as:

where n is the total number of occurrences of this variable. ∘ Uncorrected Sums of squares — The uncorrected sums of squares of the variable, calculated as:

where n is the total number of occurrences of this variable. ∘ Corrected Sums of squares — The corrected sums of squares of the variable, calculated as:

where n is the total number of occurrences of this variable. • Number Select List Items ∘ Auto-Calculate — When checked, an attempt is made to determine the number of select list items that should be included in the SQL for the Basic Statistics Options. In some cases however, the SQL may fail due to too many select list items being generated, dependent on the number of input columns and the Basic Statistics Options requested. In this case the Auto-Calculate option should be unchecked and a value provided in the Maximum Number... text box below it.

Tip: When processing more than 300 input columns with the first five basic statistics requested, try setting the maximum items to 1000 or less in the text box below.
∘ Maximum — An integer greater than 0 representing the maximum number of items that will appear in any given SELECT statement generated for the Basic Statistics Options.
• Extended Statistics Options — The following additional statistics are individually selectable for the analysis. By default, none are selected. The Check All and Clear All buttons can be used to enable or disable all options.

∘ Values — Extend the Statistical analysis by adding counts of various kinds for the selected column(s), including:
a. Number of Rows
b. Rows with Non-NULL Values
c. Rows with NULL Values
d. Unique Values
e. Rows with Value ‘0’
f. Rows with a Positive Value
g. Rows with a Negative Value
h. Rows Containing Blank Values
∘ Modes — Extend the Statistical analysis by adding the calculation of Modal or most frequently occurring values.
∘ Quantiles — Extend the Statistical analysis by adding the calculation of the bottom and top ten percentiles, deciles, quartiles and tertiles.
∘ Rank — Extend the Statistical analysis by adding the bottom five and top five values and their respective counts.
• Statistical Method
∘ Sample — Use sample statistics for those statistical calculations where a Sample formula was given.
∘ Population — Use population statistics for the statistical calculations.
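The basic statistics and the Sample/Population choice can be sketched in Python as follows. The sketch follows the verbal descriptions in this section and uses textbook definitions, which are assumed rather than confirmed to match the formulas the product applies. Skewness and kurtosis are left out because their exact formulas are not recoverable from the text. The function name `basic_statistics` is hypothetical; the dictionary keys borrow the result column names used in this chapter.

```python
import math


def basic_statistics(values, sample=False):
    """Basic univariate statistics for a non-empty list of numbers.

    With sample=True the variance divisor is n - 1 instead of n
    (this requires at least two values).
    """
    n = len(values)
    total = sum(values)
    mean = total / n
    uss = sum(x * x for x in values)             # uncorrected sum of squares
    css = sum((x - mean) ** 2 for x in values)   # corrected sum of squares
    variance = css / (n - 1 if sample else n)
    std = math.sqrt(variance)
    return {
        "xcnt": n,
        "xmin": min(values),
        "xmax": max(values),
        "xmean": mean,
        "xstd": std,
        "xste": std / math.sqrt(n),
        # The coefficient of variance is undefined when the mean is 0.
        "xcv": 100 * std / mean if mean != 0 else None,
        "xvar": variance,
        "xsum": total,
        "xuss": uss,
        "xcss": css,
    }


if __name__ == "__main__":
    data = [2, 4, 4, 4, 5, 5, 7, 9]
    print(basic_statistics(data))               # population statistics
    print(basic_statistics(data, sample=True))  # sample statistics
```

Only the statistics derived from the standard deviation (standard error, coefficient of variance, variance) change between the two methods; count, minimum, maximum, mean, sum, and both sums of squares are the same either way.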

Statistics - INPUT - Expert Options 1. On the Statistics dialog box, click on INPUT. 2. Click on expert options. Statistics > Input > Expert Options

The resulting screen has the WHERE option available:
• Where Clause text — Option to generate one or more SQL WHERE clauses that restrict the rows selected for analysis.

Statistics - OUTPUT Before executing the analysis, define output options. 1. On the Statistics dialog box, click on OUTPUT.

Statistics > Output

The resulting screen has the following options:
• Use the Teradata Explain Feature… — Option to generate a SQL EXPLAIN SELECT statement, which returns a Teradata Execution Plan to the RESULTS tab.
• Store the tabular output of this analysis in the database — Option to generate a Teradata TABLE or VIEW populated with the results of the analysis. Once enabled, the following three fields must be specified:
∘ Database Name — Text box to specify the name of the Teradata database in which the resultant Table or View will be created. By default, this is the “Result Database.”
∘ Output Name — Text box to specify the name of the Teradata Table or View.
∘ Output Type — Pull-down to specify Table or View.
∘ Stored Procedure — The creation of a stored procedure containing the SQL generated for this analysis can be requested by entering the desired name of the stored procedure here. This will result in the creation of a stored procedure in the user's login database in place of the execution of the SQL generated by the analysis. For more information, see Stored Procedure Support (Teradata Database).
∘ Procedure Comment — When an optional procedure comment is entered, it is applied to a requested Stored Procedure with an SQL Comment statement. It can be up to 255 characters in length and contain substitution parameters for the output category (Score, ADS, Stats or Other), project name and/or analysis name (using the tags , and , respectively). Note that the default value of this field may be set on the Defaults tab of the Preferences dialog box that is available from the Tools > Preferences menu option.
∘ Create output table using the FALLBACK keyword — If a table is selected, it will be built with FALLBACK if this option is selected.
∘ Create output table using the MULTISET keyword — If a table is selected, it will be built as a MULTISET table if this option is selected.
∘ Advertise Output — This option may be requested when creating a table, view or procedure. This feature “advertises” output by inserting information into one or more of the Advertise Output metadata tables according to the type of analysis and the options selected in the analysis.
∘ Advertise Note — An advertise note may be specified when the Advertise Output option is selected, or when the Always Advertise option is selected on the Connection Properties dialog box. It is a free-form text field of up to 30 characters that may be used to categorize or describe the output.
• Generate the SQL for this analysis, but do not execute it — If this option is selected, the analysis will only generate SQL, returning it and terminating immediately.

Running the Statistics Analysis After setting parameters on the INPUT and OUTPUT screens as described above, you are ready to run the analysis. 1. To run the analysis, you can either:


• Click the Run icon on the toolbar, or
• Select Run on the Project menu, or
• Press the F5 key on your keyboard

Results - Statistics Analysis The results of running the Teradata Warehouse Miner Statistical analysis include the generated SQL itself, the results of executing the generated SQL, Box and Whisker plots and, if the Create Table (or View) option is chosen, a Teradata table (or view). All of these results are outlined below. • Statistics - RESULTS - Data • Statistics - RESULTS - Graph • Statistics - RESULTS - SQL

Note: The RESULTS tab will be grayed-out (disabled) until you have run the analysis.

Statistics - RESULTS - Data 1. On the Statistics dialog box, click on RESULTS. 2. Click on data (note that the RESULTS tab will be grayed-out/disabled until after the analysis is completed). Statistics > Results > Data

Statistics Result Data

Name | Type | Definition
COL | Group By Column Type | Column(s) will be created only if the Group By option is specified. Multiple columns, if specified, will be created or displayed as the column name. These column(s) will contain the unique values that the group by column takes on.
xtbl | VARCHAR (30) | Table that the variable for the statistics operation resides in, as specified by the “table” parameter.
xcol | VARCHAR (30) | Variable that the statistics operation will be run against, as specified by the “column” parameter.
xcnt | FLOAT | The total number of occurrences of this variable.
xmin | FLOAT | The minimum value of the variable. Created only if the Minimum Value option is selected.
xmax | FLOAT | The maximum value of the variable. Created only if the Maximum Value option is selected.
xmean | FLOAT | The arithmetic mean of the variable. Created only if the Mean Value option is selected.
xstd | FLOAT | The standard deviation of the variable. Created only if the Standard Deviation option is selected.
xskew | FLOAT | The skewness of the variable. Created only if the Skewness option is selected.
xkurt | FLOAT | The kurtosis of the variable. Created only if the Kurtosis option is selected.
xste | FLOAT | The standard error of the variable. Created only if the Standard Error option is selected.
xcv | FLOAT | The coefficient of variance of the variable. Created only if the Coefficient of Variance option is selected.
xvar | FLOAT | The variance of the variable. Created only if the Variance option is selected.
xsum | FLOAT | The sum of the variable. Created only if the Sum option is selected.
xuss | FLOAT | The uncorrected sums of squares of the variable. Created only if the Uncorrected Sums of squares option is selected.
xcss | FLOAT | The corrected sums of squares of the variable. Created only if the Corrected Sums of squares option is selected.
xnull | FLOAT | Total number of rows where this variable takes on a null value. Created only if the Values option is selected.
xunique | FLOAT | Total number of unique (distinct) values that this variable takes on. Created only if the Values option is selected.
xblank | FLOAT | Total number of rows where this variable is blank. Created only if the Values option is selected.
xzero | FLOAT | Total number of rows where this variable is equal to 0. Created only if the Values option is selected.
xpos | FLOAT | Total number of rows where this variable has a positive value. Created only if the Values option is selected.
xneg | FLOAT | Total number of rows where this variable has a negative value. Created only if the Values option is selected.
xmode | FLOAT | Modal, or most frequently occurring value that the variable is equal to. Note that this is the minimum modal value; if there are multiple modes, xnbrmodes will exist and contain the total number of modal values. Created only if the Modes option is selected.
xmode_cnt | FLOAT | The number of times the modal value occurs. If the group option is used, the count specified is within each group. Created only if the Modes option is selected.
xmode_pct | FLOAT | The percentage that the modal value occurs with respect to all selected records. If the group option is used, the percentage is within each group. Created only if the Modes option is selected.
xnbrmodes | FLOAT | The total number of modal values. Created only if the Modes option is selected.
xpctileN | FLOAT | Percentiles, deciles, quartiles and tertiles of the variable. N is defined as follows: bottom 10 percentiles (N=0-9); deciles (N=10, 20, 30, 40, 50, 60, 70, 80, 90); quartiles (N=25, 50, 75); tertiles (N=33, 67); top 10 percentiles (N=91-100). Created only if the Quantiles option is selected.
xmin_N | FLOAT | The five smallest occurring values that the variable takes on. Created only if the Rank option is selected. N=1-5
xmincnt_N | FLOAT | The number of times that the variable takes on the five smallest occurring values. Created only if the Rank option is selected. N=1-5
xmax_N | FLOAT | The five largest occurring values that the variable takes on. Created only if the Rank option is selected. N=1-5
xmaxcnt_N | FLOAT | The number of times that the variable takes on the five largest occurring values. Created only if the Rank option is selected. N=1-5
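The set of xpctileN columns implied by the definition of N can be enumerated with a small Python sketch. The function name `quantile_column_names` is hypothetical. Treating an N that belongs to more than one family (for example 50, which is both a decile and a quartile) as a single column is an assumption, although it agrees with the Example #2 output later in this chapter, where xpctile50 is listed once.

```python
def quantile_column_names():
    """Return the xpctileN column names in ascending order of N."""
    n_values = set(range(0, 10))          # bottom 10 percentiles: 0-9
    n_values |= set(range(10, 100, 10))   # deciles: 10, 20, ..., 90
    n_values |= {25, 50, 75}              # quartiles
    n_values |= {33, 67}                  # tertiles
    n_values |= set(range(91, 101))       # top 10 percentiles: 91-100
    return [f"xpctile{n}" for n in sorted(n_values)]


if __name__ == "__main__":
    names = quantile_column_names()
    print(len(names))
    print(names)
```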

Statistics - RESULTS - Graph 1. On the Statistics dialog box, click on RESULTS.

2. Click on graph (note that the RESULTS tab will be grayed-out/disabled until after the analysis is completed).
Statistics > Results > Graph

Box and Whisker Plots are available for each column specified for the Statistical analysis, provided that no more than one Group By column has been selected. Some of the features of the graphs provided for the Statistics analysis are described in the following sections:
• Statistics Graph Drill Down Functions
• Show Graph
• Statistical Data Grid

Statistics Graph Drill Down Functions

Drill down into one or more graphs to see column statistics from the source data. Each column statistic selected will be entered into an editable drill down list automatically (a drill down window pops up as graphs are clicked). All columns in the list will be used as drill down criteria, resulting in a drill down data table containing rows belonging to all of the columns selected.

First, select one or more graphs from the Zoom View for drill down. There are two ways to select graphs:
1. Click on one or more graphs to add column statistics to the drill down list. Within each graph, select either the displayed box or one of the statistics listed below it, i.e., min, max, mean - SD (mean minus standard deviation) or mean + SD (mean by itself cannot be selected for drill down).
2. Or, send values to the drill down list by selecting rows in the data selector, right-clicking on the rows and clicking on the ‘Drill Down’ pop-up menu item. In this case, the statistics min, max and mean +/- SD will be provided for the columns indicated in the selected rows.

The drill down list can be edited after graph selections are added, and saved for future reference. Note that the drill down list column labeled Col Row Count gives the number of rows containing a non-null value for the selected column (not the number of rows expected from the drill down for the indicated statistic).

Second, select the drill down tab on the Drill Down window to query the source data table. Rows from each column statistic selected will be returned and displayed in the drill down table. In the header of the drill down display are options to “show graph data only” or “include related columns”. The tabular icon to the right of these two options in the header can be used to select up to 20 columns for display with the second option. By default, the first 20 columns when sorted alphabetically are displayed with the second option.
The “sql” to the left of the tabular icon can be used to copy the SQL Where clause for the drill down query to the Clipboard for pasting into another text area or application. Additional drill down features: • Save each list before creating a new one if it needs to be used again. After saving, a small icon appears next to the drill down icon in the upper right corner of the graph page. Clicking on the thumbnail will reload the saved list, enabling repeated reviews of the source data.

• The saved list can be copied, edited by deletions or additions, and then replaced over the old list or saved as new. Clicking on the drill down icon reveals a list of default descriptive labels for the saved drill downs, which can be changed to something the user finds more meaningful.
• Saved drill down lists can be deleted by right-clicking on thumbnail icons, or from the drill down label list.
Statistics Drill Down

Show Graph The following right-click options are available: • Maximize, Print and Export — The standard options Maximize, Print and Export are described in RESULTS Tab.

Statistical Data Grid

A data-grid control is used to select the data to display. Much like a Microsoft Excel pivot table, the Data Grid has the following properties:
• Select — Select the data to be graphed either by clicking the upper-left square of the Data Grid to select the entire data set, or by holding the left mouse button down, dragging over the desired rows, and releasing the mouse button. The highlighted rows are graphed automatically.
• Sort — To sort the data in the Data Grid, right-click any column header. The sort is always done from left to right with respect to the columns of data. Subsequent right-clicks toggle the sort between ascending and descending.
• Pivot — To pivot the data, hold the left mouse button down on the column header that you wish to pivot, drag the column to its new position, and release the mouse button. The pivoted column appears to the left of the column onto which it was dragged.

For the Statistical analysis, the column headers can take on the following values:
• Table — Table that the variable for the statistics operation resides in, as specified by the “table” parameter.
• Column — Variable that the statistics operation will be run against, as specified by the “column” parameter.
• Count — The total number of occurrences of this variable.
• Min — The minimum value of the variable. Created only if the Minimum Value option is selected.
• Max — The maximum value of the variable. Created only if the Maximum Value option is selected.
• Mean — The arithmetic mean of the variable. Created only if the Mean Value option is selected.
• Std — The standard deviation of the variable. Created only if the Standard Deviation option is selected.

Statistics - RESULTS - SQL 1. On the Statistics dialog box, click on RESULTS. 2. Click on SQL (note that the RESULTS tab will be grayed-out/disabled until after the analysis is completed). Statistics > Results > SQL

On this screen, the generated SQL is returned as text which can be copied by using the Select All and Copy buttons.

Tutorial - Statistical Analysis

Statistics - Example #1
1. Parameterize a Statistical analysis as follows:
• Columns to Analyze
∘ twm_customer.income
∘ twm_customer.age
• Basic Statistics Options
∘ Number of Values
∘ Minimum Value
∘ Maximum Value
∘ Mean Value
∘ Standard Deviation
∘ Skewness
∘ Kurtosis
∘ Standard Error
∘ Coefficient of Variance
∘ Variance
∘ Sum
∘ Uncorrected Sum of Squares
∘ Corrected Sum of Squares
2. Run the analysis.
3. When it completes, click the RESULTS tab.

For this example, the Statistical analysis generated the following results. Note that for brevity, the SQL is not shown.

Statistics - Example #1 Data (Part 1)

xtbl    xcol    xcnt  xmin  xmax    xmean          xstd           xskew     xkurt
TWM_C…  income  747   0     144157  22728.2811245  22192.3521561  1.751929  4.3293003
TWM_C…  age     747   13    89      42.4792503     19.1020801     .2342409  -.7738404

Statistics - Example #1 Data (Part 2)

xtbl    xcol    xste         xcv         xvar            xsum      xuss          xcss
TWM_C…  income  811.9757039  97.6420172  492500494.2181  16978026  753779217048  367897869180.964
TWM_C…  age     .6989086     44.9680253  364.8894624     31732     1620524       272572.4283802
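The derived columns in the two tables above appear to follow from xcnt, xmean and xstd. The sketch below checks that reading against the published numbers. The formulas are inferred from the data, not taken from the product documentation, and the function names `derived` and `consistent` are hypothetical. In particular, xcss matching xcnt times xvar suggests a population (divide by n) variance, which is an assumption here.

```python
import math

# Published Example #1 values, using only the digits that are unambiguous
# in the tables above.
published = {
    "income": {"xcnt": 747, "xmean": 22728.2811245, "xstd": 22192.3521561,
               "xste": 811.9757039, "xcv": 97.6420172,
               "xvar": 492500494.218, "xcss": 367897869180.96},
    "age": {"xcnt": 747, "xmean": 42.4792503, "xstd": 19.1020801,
            "xste": 0.6989086, "xcv": 44.9680253,
            "xvar": 364.8894624, "xcss": 272572.4283802},
}

def derived(xcnt, xmean, xstd):
    """Recompute the derived statistics (inferred formulas, see lead-in)."""
    xvar = xstd ** 2
    return {
        "xvar": xvar,                    # variance = std squared
        "xste": xstd / math.sqrt(xcnt),  # standard error of the mean
        "xcv": 100.0 * xstd / xmean,     # coefficient of variation, percent
        "xcss": xcnt * xvar,             # corrected sum of squares
    }

def consistent(row, rel_tol=1e-6):
    """True if every recomputed value matches the published one."""
    calc = derived(row["xcnt"], row["xmean"], row["xstd"])
    return all(math.isclose(calc[k], row[k], rel_tol=rel_tol) for k in calc)

for name, row in published.items():
    print(name, consistent(row))
```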

Statistics Example #1 Graph

Thumbnails for each column in the analyses are shown in a box and whisker plot. The “box” is comprised of a green dot within a blue rectangle. The dot represents the mean value, while the rectangle itself represents plus/minus one standard deviation from the mean. Two “whiskers” are drawn from minus one standard deviation to the minimum value, and from plus one standard deviation to the maximum value. As the mouse passes over each thumbnail, the cursor changes to indicate a hyperlink, and the individual graphs can be maximized by clicking on the thumbnail:
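The plot geometry described above can be sketched as follows, using the Example #1 values. The `BoxWhisker` type and the `box_whisker` function are hypothetical names for illustration only. How the product draws the plot when the mean minus one standard deviation falls below the minimum is not stated, so the sketch does no clamping.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class BoxWhisker:
    lower_whisker: Tuple[float, float]  # from the minimum to mean - std
    box: Tuple[float, float]            # from mean - std to mean + std
    dot: float                          # the mean
    upper_whisker: Tuple[float, float]  # from mean + std to the maximum

def box_whisker(xmin, xmax, xmean, xstd):
    """Return the plot segments described in the text for one column."""
    low, high = xmean - xstd, xmean + xstd
    return BoxWhisker((xmin, low), (low, high), xmean, (high, xmax))

income = box_whisker(0, 144157, 22728.2811245, 22192.3521561)
age = box_whisker(13, 89, 42.4792503, 19.1020801)
print(income.box)
print(age.box)
```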

Statistics Example #1 Graph: Maximize Thumbnail

Statistics - Example #2
1. Parameterize a Statistical analysis as follows:
   • Columns to Analyze
     ∘ twm_customer.income
     ∘ twm_customer.age
   • Extended Statistics Options
     ∘ Values
     ∘ Modes
     ∘ Quantiles
     ∘ Rank
2. Run the analysis in the same manner described above.

This time, the following Results were generated. Again, the SQL is not shown:

Note: Data table pivoted for readability.

Statistics - Example #2 Data

xtbl        TWM_CUSTOMER  TWM_CUSTOMER
xcol        income        age
xnull       0             0
xunique     640           77
xblank      0             0
xzero       102           0
xpos        645           747
xneg        0             0
xmode       0             15
xmode_cnt   102           32
xmode_pct   13.6546185    4.2838019
xnbrmodes   1             1
xpctile0    0             13
xpctile1    0             13
xpctile2    0             13
xpctile3    0             14
xpctile4    0             14
xpctile5    0             15
xpctile6    0             15
xpctile7    0             15
xpctile8    0             15
xpctile9    0             16
xpctile10   0             16
xpctile20   4859          22
xpctile25   7083          28
xpctile30   8575          31
xpctile33   9738          32
xpctile40   12485         36
xpctile50   17242         42
xpctile60   22025         48
xpctile67   26548         52
xpctile70   28518         54
xpctile75   31379         56
xpctile80   36761         59
xpctile90   50704         68
xpctile91   54713         69
xpctile92   56384         70
xpctile93   57723         73
xpctile94   60242         74
xpctile95   66930         77
xpctile96   75890         79
xpctile97   79057         80
xpctile98   86055         83
xpctile99   98566         85
xpctile100  144157        89
xmin_1      0             13
xmin_2      1039          14
xmin_3      1565          15
xmin_4      1591          16
xmin_5      1814          17
xmincnt_1   102           15
xmincnt_2   1             15
xmincnt_3   1             32
xmincnt_4   1             19
xmincnt_5   1             21
xmax_5      111004        85
xmax_4      127848        86
xmax_3      129196        87
xmax_2      142274        88
xmax_1      144157        89
xmaxcnt_5   1             1
xmaxcnt_4   1             1
xmaxcnt_3   1             1
xmaxcnt_2   1             3
xmaxcnt_1   1             2

Note: No graphs are available for these Stats Options.
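To make the Values, Modes and Rank columns in the Example #2 output easier to read, here is a minimal sketch that computes similarly named measures for a small, made-up list of ages. The meaning of each measure is inferred from the column names and the published values (for example, xmode_pct for income equals 100 × 102 / 747), so treat it as an assumption. The function name `values_and_modes` is hypothetical. Quantiles are left out because the guide does not state the percentile method used.

```python
from collections import Counter

def values_and_modes(column):
    """Compute value, mode and extreme-value measures for one column.

    None stands in for a database NULL. Assumes at least one non-null value.
    """
    non_null = [v for v in column if v is not None]
    counts = Counter(non_null)
    top = max(counts.values())
    modes = sorted(v for v, c in counts.items() if c == top)
    ordered = sorted(counts)
    return {
        "xcnt": len(non_null),
        "xnull": len(column) - len(non_null),
        "xunique": len(counts),                  # number of distinct values
        "xzero": counts.get(0, 0),
        "xpos": sum(1 for v in non_null if v > 0),
        "xneg": sum(1 for v in non_null if v < 0),
        "xmode": modes[0],                       # smallest mode if several tie
        "xmode_cnt": top,
        "xmode_pct": 100.0 * top / len(non_null),
        "xnbrmodes": len(modes),
        # Five smallest and five largest distinct values with their counts.
        "xmin": [(v, counts[v]) for v in ordered[:5]],
        "xmax": [(v, counts[v]) for v in ordered[::-1][:5]],
    }

ages = [15, 15, 15, 13, 14, 14, 89, 88, None, 0]  # illustrative data only
result = values_and_modes(ages)
print(result["xmode"], result["xmode_cnt"], result["xunique"])
```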

Statistics - Example #3
1. Parameterize a Statistical analysis as follows:
   • Columns to Analyze
     ∘ twm_customer.income
     ∘ twm_customer.age
   • Columns to Group By — gender
   • Statistics Options
     ∘ Number of Values
     ∘ Minimum Value
     ∘ Maximum Value
     ∘ Mean Value
     ∘ Standard Deviation
2. Run the analysis in the same manner described above.

This time, the following Results were generated. Again, the SQL is not shown:

Statistics - Example #3 Data

gender  xtbl          xcol    xcnt  xmin  xmax    xmean          xstd
M       TWM_CUSTOMER  income  329   0     144157  26405.6930091  25700.3655032
F       TWM_CUSTOMER  income  418   0     102286  19833.8588517  18472.7539608
M       TWM_CUSTOMER  age     329   13    88      42.662614      19.4246694
F       TWM_CUSTOMER  age     418   13    89      42.3349282     18.8430378
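A grouped run produces one row of statistics per group value. The sketch below shows the idea on a handful of made-up rows. The `grouped_stats` function is a hypothetical illustration, and its use of a population standard deviation is an assumption. The last lines check the published table: the per-gender counts and means for income reproduce the overall xsum of 16978026 reported in Example #1.

```python
import math
from collections import defaultdict

def grouped_stats(rows, group_col, value_col):
    """Count, min, max, mean and population std of value_col per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_col]].append(row[value_col])
    out = {}
    for key, values in groups.items():
        n = len(values)
        mean = sum(values) / n
        std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
        out[key] = {"xcnt": n, "xmin": min(values), "xmax": max(values),
                    "xmean": mean, "xstd": std}
    return out

# Illustrative rows only; not taken from twm_customer.
customers = [
    {"gender": "M", "age": 20}, {"gender": "M", "age": 40},
    {"gender": "F", "age": 30}, {"gender": "F", "age": 30},
    {"gender": "F", "age": 60},
]
stats = grouped_stats(customers, "gender", "age")
print(stats["M"], stats["F"])

# Cross-check of the published income rows: sum over groups of count * mean.
income_by_gender = {"M": (329, 26405.6930091), "F": (418, 19833.8588517)}
total_count = sum(n for n, _ in income_by_gender.values())
total_sum = sum(n * mean for n, mean in income_by_gender.values())
print(total_count, round(total_sum))
```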

Statistics Example #3 Graph

As above, thumbnails are displayed, this time for each column and each group by column value, in box and whisker format. As the mouse passes over each thumbnail, the cursor changes to indicate a hyperlink, and the individual graphs can be maximized by clicking on the thumbnail.

Frequency
Frequency analysis is designed to count the occurrence of individual data values in columns that contain categorical data. It can be useful in understanding the meaning of a particular data element, and it may point out the need to recode some of the data values found, either permanently or in the course of building an analytic data set. This function can also be useful in analyzing combinations of values occurring in two or more columns.

Given a table name and the name of one or more columns, the Frequency analysis calculates the number of occurrences of each value of the column or columns, individually or in combination. It also calculates the percentage of rows in the selected table that each value accounts for. Results are listed in descending order, starting with the most frequently occurring value.

Optionally, you may request:
• Frequencies of column values in combination rather than individually, using the Compute Cross-Tabulation option.
• Pair-wise frequencies from one or two lists of columns, using the Compute Pairwise Frequencies option.
• Basic statistics (min, max, mean, standard deviation) on one or more columns.
• Cumulative sums over the frequencies and percents, in addition to the associated rank, using the “Cumulative Options”. This feature is not available when the Compute Pairwise Frequencies option is selected.
• A different sort order, such as by the selected column(s).
• A WHERE clause, reducing the rows before aggregation.
• A HAVING clause, reducing the answer set after aggregating, which must refer to the requested column(s), xcnt or xpct. In the case of Compute Pairwise Frequencies, reference to col1, col2, xcnt or xpct is required (this is implemented as a final WHERE clause).
• A QUALIFY clause, reducing the answer set after aggregating, which may refer to any returned column, but is most useful in conjunction with xrank to specify the maximum number of rows to return, for instance “xrank ≤ 50” (requires setting the “cumulative” option).

The following rules apply to the Frequency analysis:
1. If the Compute Cross-Tabulation option is not requested (the default case) and multiple columns are requested, the analysis is repeated individually for each requested column. In this case the CREATE VIEW option may not be requested, and if the CREATE TABLE option is requested, the create occurs only once, with subsequent INSERT/SELECT statements generated.
2. If the Compute Cross-Tabulation option is requested, one select is generated for the entire column list taken together.
3. If the Cumulative Options feature is requested, cumulative sums for the column(s) being analyzed are provided. Specifically, cumulative sums of the frequencies and percents are provided, as well as the associated rank. This is not available when the Compute Pairwise Frequencies option is selected.
4. If the Compute Pairwise Frequencies option is requested, a count of the number of occurrences in the table for each pair-wise combination of values in the selected columns is given, along with the percentage of the total number of rows. Alternatively, two lists of columns can be given, with each column in the first list combined with each column in the second list to count the number of occurrences of each combination of values as above.
5. If multiple column lists are given, the Compute Cross-Tabulation option is not allowed.
6. Explain and Create View options are not allowed when multiple columns are selected.
7. Statistics column(s) can only be specified with basic frequency, not in combination with cumulative, crosstab or pairwise options.
8. When multiple columns are requested with the Select result option, columns are combined within a single query using a volatile table, with a final select at the end to format the results.
9. BYTE types are supported only with the Select result option and individual column requests.

The Frequency analysis is parameterized by specifying the table and column(s) to analyze, options unique to the Frequency analysis, and the desired results and SQL or Expert Options.
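The three frequency styles described above differ only in what is counted. The sketch below illustrates that difference on a few made-up rows. The function names are hypothetical, the output keys follow the result column names used in this chapter, and the code is not the SQL that the product generates.

```python
from collections import Counter
from itertools import combinations, product

def basic(rows, columns):
    """Basic style: each column is counted on its own, most frequent first."""
    total = len(rows)
    out = []
    for col in columns:
        counts = Counter(row[col] for row in rows)
        for val, cnt in counts.most_common():
            out.append({"xcol": col, "xval": val,
                        "xcnt": cnt, "xpct": 100.0 * cnt / total})
    return out

def cross_tab(rows, columns):
    """Cross-tabulation: all listed columns are counted in combination."""
    total = len(rows)
    counts = Counter(tuple(row[c] for c in columns) for row in rows)
    return [dict(zip(columns, key), xcnt=cnt, xpct=100.0 * cnt / total)
            for key, cnt in counts.most_common()]

def pairwise(rows, columns, columns2=None):
    """Pairwise: every pair within one list, or every pair across two lists."""
    total = len(rows)
    pairs = product(columns, columns2) if columns2 else combinations(columns, 2)
    out = []
    for col1, col2 in pairs:
        counts = Counter((row[col1], row[col2]) for row in rows)
        for (val1, val2), cnt in counts.most_common():
            out.append({"col1": col1, "val1": val1, "col2": col2, "val2": val2,
                        "xcnt": cnt, "xpct": 100.0 * cnt / total})
    return out

# Illustrative rows only.
rows = [
    {"gender": "F", "marital_status": 2, "nbr_children": 0},
    {"gender": "F", "marital_status": 2, "nbr_children": 1},
    {"gender": "F", "marital_status": 1, "nbr_children": 0},
    {"gender": "M", "marital_status": 2, "nbr_children": 0},
]
print(basic(rows, ["gender"]))
print(cross_tab(rows, ["gender", "marital_status"])[0])
print(len(pairwise(rows, ["gender", "marital_status", "nbr_children"])))
```

With three columns in a single list, the pairwise style produces three column pairs, whereas cross-tabulation produces one combined result over all three columns.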

Note: For general information about output, see OUTPUT Tab.

Initiating a Frequency Analysis
Use the following procedure to initiate a new Frequency analysis in Teradata Warehouse Miner.
1. Click on the Add New Analysis icon in the toolbar.

Add New Analysis From Toolbar

2. In the resulting Add New Analysis dialog box, double-click on the Frequency icon.

Add New Analysis: Descriptive Statistics

The Frequency dialog box appears, in which you will enter INPUT and OUTPUT options to parameterize the analysis as described in the following sections.
• Frequency - INPUT - Data Selection
• Frequency - INPUT - Analysis Parameters
• Frequency - INPUT - Expert Options
• Frequency - OUTPUT

Frequency - INPUT - Data Selection
1. On the Frequency dialog box, click on INPUT.
2. Click on data selection.

Frequency > Input > Data Selection

Note: View is only available when a single column is selected or when pairwise is selected.

• Select Columns From a Single Table
  ∘ Available Databases (or Analyses) — Choose the database (or analysis) from which you will select data tables.
  ∘ Available Tables — Select the table from which you will select columns.
  ∘ Available Columns — Select columns by highlighting them and then either dragging and dropping them into the Selected Columns window, or clicking on the arrow button to move the highlighted columns into the Selected Columns window.
• Select Frequency Style
  ∘ Frequency Style
    ▪ Basic — Option to count frequencies of individual column values.
    ▪ Pairwise Frequencies — Option to count frequencies of pair-wise combinations of values of selected columns rather than individually. Not available if the Compute Cross-Tabulation option or the Include cumulative measure and rank option has been selected.
    ▪ Cross-Tabulation — Option to count frequencies of combinations of values of selected columns rather than individually. Results in an error if columns have been selected in Selected Cross Columns/Aliases, and not available if the Compute Pairwise Frequencies option has been selected.
• Select Optional Columns
  ∘ Statistics Columns — If Basic Frequency Style is selected, expand the Statistics Columns selector by clicking on the “double-up-arrow”. Select columns by highlighting them and then either dragging and dropping them into the Statistics Columns window, or clicking on the arrow button to move the highlighted columns into the Statistics Columns window.
  ∘ Pairwise Columns — If Pairwise Frequency Style is selected, expand the Pairwise Columns selector by clicking on the “double-up-arrow”. Select columns by highlighting them and then either dragging and dropping them into the Pairwise Columns window, or clicking on the arrow button to move the highlighted columns into the Pairwise Columns window. When Pairwise Columns are selected, the pairwise combinations of Selected Columns with Pairwise Columns are taken. Otherwise, only the pairwise combinations among the Selected Columns are taken.

Frequency - INPUT - Analysis Parameters
1. On the Frequency dialog box, click on INPUT.
2. Click on analysis parameters.

Frequency > Input > Analysis Parameters

This screen has the following options available:
• Only include frequency values that occur a minimum percentage of the time
  ∘ Minimum Percentage — If checked, this option will show frequency only for those values that occur in at least this percentage of rows. Enter an integer or decimal value between 0 and 100. The default value is 1.0 for 1 percent.
• Include rank, cumulative count, and cumulative percent information for each frequency value
  ∘ If checked, show these cumulative statistics for each frequency generated. This option is not available when the Frequency Style 'Pairwise' has been selected on the data selection tab.
• Only include the top 'n' frequency values
  ∘ Number of frequency values to include — If checked, show frequency only for the number of top occurring values entered. This option is enabled only if Include rank, cumulative count, and cumulative percent information … is selected above.
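The effect of these three options can be sketched on the nbr_children counts from Frequency Example #1 later in this chapter. The `frequency_table` function is a hypothetical illustration. The order in which the product applies the filters and the way it ranks tied counts are not stated in this guide, so both are assumptions here.

```python
def frequency_table(counts, min_pct=None, cumulative=False, top_n=None):
    """Build frequency rows from a {value: count} mapping.

    min_pct keeps values occurring in at least that percentage of rows.
    cumulative adds xrank, xcum_cnt and xcum_pct.
    top_n keeps the first n ranked rows (only meaningful with cumulative).
    """
    total = sum(counts.values())
    ordered = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    rows = [{"xval": v, "xcnt": c, "xpct": 100.0 * c / total}
            for v, c in ordered]
    if min_pct is not None:
        rows = [r for r in rows if r["xpct"] >= min_pct]
    if cumulative:
        running = 0
        for rank, r in enumerate(rows, start=1):
            running += r["xcnt"]
            r.update(xrank=rank, xcum_cnt=running,
                     xcum_pct=100.0 * running / total)
        if top_n is not None:
            rows = rows[:top_n]
    return rows

# nbr_children counts as published in Frequency Example #1 (747 rows).
nbr_children = {0: 466, 1: 114, 2: 110, 3: 38, 5: 10, 4: 9}

at_least_5_pct = frequency_table(nbr_children, min_pct=5.0)
top_3 = frequency_table(nbr_children, cumulative=True, top_n=3)
print([r["xval"] for r in at_least_5_pct])
print(top_3[-1])
```

In this sketch the three most frequent values together cover 690 of the 747 rows.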

Frequency - INPUT - Expert Options
1. On the Frequency dialog box, click on INPUT.
2. Click on expert options.

Frequency > Input > Expert Options

This screen has the following options:
• WHERE Clause text — Option to have the specified SQL WHERE clause generated within the Frequency SQL to filter rows selected for analysis (for example, cust_id > 0).
• HAVING Clause text — Option to have the specified SQL HAVING clause generated within the Frequency SQL to restrict returned aggregations (for example, xpct > 1). This option is enabled only if Only include frequency values … is not checked.
• QUALIFY Clause text — Option to have the specified SQL QUALIFY clause generated within the Frequency SQL to restrict OLAP results returned (for example, RANK(xval) ≤ 1000). This option is enabled only if Include rank … is checked and Only include the top 'n' frequency values is not checked.
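The difference between the clauses is when they act: WHERE removes rows before counting, HAVING removes groups after counting, and QUALIFY filters on ranked results. The sketch below shows this with Python's built-in sqlite3 module on a made-up table. SQLite is used only as a stand-in, the query is not the SQL generated by Teradata Warehouse Miner, and because QUALIFY is not part of SQLite, the rank filter is emulated in Python by the hypothetical `qualify_top` helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (cust_id INTEGER, gender TEXT)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?)",
    [(0, "M"), (1, "F"), (2, "F"), (3, "F"), (4, "M"), (5, "U")],
)

def frequency(where=None, having=None):
    """Count gender values; WHERE filters rows, HAVING filters groups."""
    sql = "SELECT gender, COUNT(*) FROM customer"
    if where:
        sql += " WHERE " + where
    sql += " GROUP BY gender"
    if having:
        sql += " HAVING " + having
    sql += " ORDER BY COUNT(*) DESC, gender"
    return conn.execute(sql).fetchall()

def qualify_top(rows, max_rank):
    """Stand-in for a rank filter: keep rows with 1-based rank <= max_rank."""
    return [row for rank, row in enumerate(rows, start=1) if rank <= max_rank]

all_rows = frequency()
after_where = frequency(where="cust_id > 0")     # drops one "M" row first
after_having = frequency(having="COUNT(*) >= 2") # drops the rare "U" group
top_one = qualify_top(all_rows, 1)
print(all_rows, after_where, after_having, top_one)
```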

Frequency - OUTPUT
Before running the analysis, define Output options.
1. On the Frequency dialog box, click on OUTPUT.

Frequency > Output

Use this screen to define the following options:
• Use the Teradata EXPLAIN Feature… — Option to generate a SQL EXPLAIN SELECT statement, which returns a Teradata Execution Plan to the RESULTS tab.
• Store the tabular output of this analysis in the database — Option to generate a Teradata TABLE or VIEW populated with the results of the analysis. Once enabled, the following three fields must be specified:
  ∘ Database Name — Text box to specify the database name.
  ∘ Output Name — Text box to specify the name of the Teradata Table or View.
  ∘ Output Type — Pull-down to specify Table or View. A Table is always available, but a View cannot be specified when multiple columns are selected and neither the Compute Cross-Tabulation nor the Compute Pairwise Frequencies option is selected.
  ∘ Stored Procedure — The creation of a stored procedure containing the SQL generated for this analysis can be requested by entering the desired name of the stored procedure here. This will result in the creation of a stored procedure in the user's login database in place of the execution of the SQL generated by the analysis. For more information, see Stored Procedure Support (Teradata Database).
  ∘ Procedure Comment — When an optional Procedure Comment is entered, it is applied to a requested Stored Procedure with an SQL Comment statement. It can be up to 255 characters in length and contain substitution parameters for the output category (Score, ADS, Stats or Other), project name and/or analysis name (using the tags , and , respectively). The default value of this field may be set on the Defaults tab of the Preferences dialog box, available from the Tools > Preferences menu option.
  ∘ Create output table using the FALLBACK keyword — If a table is selected, it will be built with FALLBACK if this option is selected.
  ∘ Create output table using the MULTISET keyword — If a table is selected, it will be built as a MULTISET table if this option is selected.
  ∘ Advertise Output — The Advertise Output option may be requested when creating a table, view or procedure. This feature “advertises” output by inserting information into one or more of the Advertise Output metadata tables according to the type of analysis and the options selected in the analysis. For more information, see Advertise Output.
  ∘ Advertise Note — An advertise note may be specified when the Advertise Output option is selected or when the Always Advertise option is selected on the Connection Properties dialog box. It is a free-form text field of up to 30 characters that may be used to categorize or describe the output.
• Generate the SQL for this analysis, but do not execute it — If this option is selected, the analysis will only generate SQL, returning it and terminating immediately.

Running the Frequency Analysis
After setting parameters on the INPUT and OUTPUT screens as described above, you are ready to run the analysis.
1. To run the analysis, you can either:
   • Click the Run icon on the toolbar, or
   • Select Run on the Project menu, or
   • Press the F5 key on your keyboard

Results - Frequency Analysis
The results of running the Teradata Warehouse Miner Frequency analysis include the generated SQL itself, the results of executing the generated SQL, two and/or three dimensional histograms, and, if the Create Table (or View) option is chosen, a Teradata table (or view). All of these results are outlined below.
• Frequency - RESULTS - Data
• Frequency - RESULTS - Graph
• Frequency - RESULTS - SQL

Click on the RESULTS tab of the Frequency analysis when it has completed to view these results.

Frequency - RESULTS - Data
1. On the Frequency dialog box, click on RESULTS.
2. Click on data (note that the RESULTS tab will be grayed-out/disabled until after the analysis is completed).

Frequency > Results > Data

Table Created Without Cross-Tabulation or Pair-wise Frequencies Options

Name       Type           Definition
xtbl       VARCHAR (30)   Table that the frequency variable resides in, as specified by the “table” parameter.
xcol       VARCHAR (30)   Column name representing the frequency variable, as specified by the “column” parameter.
xval       VARCHAR (256)  Distinct values of the frequency variable.
xcnt       FLOAT          Number of rows in which the frequency variable is equal to the distinct value specified in xval.
xpct       FLOAT          Percentage of the total records in which the frequency variable is equal to the distinct value specified in xval.
xmin_COL   FLOAT          If the Statistics Columns option is selected, then this column is created and represents the minimum value of the column specified for the given frequency value (xval). Not available if the Include cumulative measures and rank option is selected.
xmax_COL   FLOAT          If the Statistics Columns option is selected, then this column is created and represents the maximum value of the column specified for the given frequency value (xval). Not available if the Include cumulative measures and rank option is selected.
xmean_COL  FLOAT          If the Statistics Columns option is selected, then this column is created and represents the mean value of the column specified for the given frequency value (xval). Not available if the Include cumulative measures and rank option is selected.
xstd_COL   FLOAT          If the Statistics Columns option is selected, then this column is created and represents the standard deviation of the column specified for the given frequency value (xval). Not available if the Include cumulative measures and rank option is selected.
xcum_cnt   FLOAT          A running cumulative count of the number of rows in which the frequency variable is equal to the distinct value specified in xval. The Include cumulative measures and rank option must be selected to get this result column. Not valid if the Columns for Statistical Analysis option is selected.
xcum_pct   FLOAT          A running cumulative percentage of the total records in which the frequency variable is equal to the distinct value specified in xval. The Include cumulative measures and rank option must be selected to get this result column. Not valid if the Columns for Statistical Analysis option is selected.
xrank      FLOAT          The rank, with respect to all frequencies, of the distinct value specified in xval. The Include cumulative measures and rank option must be selected to get this result column. Not valid if the Columns for Statistical Analysis option is selected.

With Cross-Tabulation Option Selected

Name      Type               Definition
COL       Input Column Type  Column for which the crosstab frequency will be calculated, as specified by the “column” parameter. If multiple columns are specified, one result column is created or displayed for each, named after the input column.
xcnt      FLOAT              Number of rows containing the combination of distinct values shown in the COL column(s).
xpct      FLOAT              Percentage of the total records containing the combination of distinct values shown in the COL column(s).
xcum_cnt  FLOAT              A running cumulative count of the number of rows containing the combination of distinct values shown in the COL column(s). The Include cumulative measures and rank option must be selected to get this result column.
xcum_pct  FLOAT              A running cumulative percentage of the total records containing the combination of distinct values shown in the COL column(s). The Include cumulative measures and rank option must be selected to get this result column.
xrank     FLOAT              The rank, with respect to all frequencies, of the combination of distinct values shown in the COL column(s). The Include cumulative measures and rank option must be selected to get this result column.

With Pair-wise Frequencies Option Selected

Name  Type           Definition
xtbl  VARCHAR (30)   Table that the frequency columns reside in, as specified by the “table” parameter.
col1  VARCHAR (30)   Column representing the first frequency variable, as specified by the “column” parameter.
val1  VARCHAR (256)  Distinct values of the first frequency column.
col2  VARCHAR (30)   Column representing the second frequency variable, as specified by the “column” or, alternatively, “column2” parameter.
val2  VARCHAR (256)  Distinct values of the second frequency column.
xcnt  FLOAT          Number of rows in which the pair of frequency columns is equal to the distinct values specified in val1 and val2, respectively.
xpct  FLOAT          Percentage of the total records in which the pair of frequency columns is equal to the distinct values specified in val1 and val2, respectively.

Frequency - RESULTS - Graph
1. On the Frequency dialog box, click on RESULTS.
2. Click on graph (note that the RESULTS tab will be grayed-out/disabled until after the analysis is completed).

Frequency > Results > Graph

There are several different graphs available in Histogram format for the Frequency analysis, depending upon the parameterization. These graphs are scrollable, with up to 16 frequency counts shown per viewing area, unless the Zoom Out (All Data) option is selected by right-clicking on the graph image (this option is available on two-dimensional graphs only). When the three-dimensional view is desired, note that the number in the upper-most left side of the graph indicates the degrees of rotation about the vertical axis.

The Frequency analysis graph options are detailed in the following topics:
• Frequency Graph Drill Down Functions
• Show Graph
• Frequency Data Grid
• Frequency Graph Options

Frequency Graph Drill Down Functions
Drill down into one or more frequency counts from one or more columns to see source data rows. Each count category selected will be entered into an editable drill down list automatically (a drill down window appears as categories are selected). All categories in the list will be used as drill down criteria, resulting in a drill down data table containing rows belonging to all of the categories selected.

Drill down is performed in two steps.

First, select frequency counts that target the source data of interest. Multiple counts can be selected from different graphs to build up composite views of source data. There are two ways to select frequency counts for drill down:
1. Click on one or more bars to add column categories to the drill down list.
2. Or, send column categories to the drill down list by selecting rows in the data selector, right-clicking on the rows and clicking on the ‘Drill Down’ pop-up menu item.
The drill down list can be edited after graph selections are added, and saved for future reference.

Second, select the ‘drill down’ tab on the Drill Down window to query the source data table. Filtered rows (rows that fit the drill down criteria) will be returned and displayed in the drill down table. In addition to the graph data, all related rows from the source table can be viewed as well. In the header of the drill down display are options to “show graph data only” or “include related columns”. The tabular icon to the right of these two options in the header can be used to select up to 20 columns for display with the second option. By default, the first 20 columns when sorted alphabetically are displayed with the second option. The “sql” to the left of the tabular icon can be used to copy the SQL Where clause for the drill down query to the Clipboard for pasting into another text area or application.

Frequency Drill Down List

The following right-click menu options are available from the display of drill down data rows after drill down is performed. They can be used to copy data to Microsoft Excel or to the Clipboard. When copying data to the Clipboard, columns are delimited by tab characters.
• Export to Excel — All Rows
• Export to Excel — Selected Rows
• Copy (to Clipboard) — All Rows
• Copy (to Clipboard) — Selected Rows

Note that drill down cannot retrieve source rows for a column that is renamed with an alias. Also, drill down will not work properly when the table being analyzed is a volatile table created by a referenced analysis that does not create an output table or view.

Show Graph
The following right-click options are available:
• Maximize, Print, Export and Zoom Out — The standard options Maximize, Print, Export and Zoom Out are described in RESULTS Tab.

Frequency Data Grid
A data grid control is used to select the data to display. Much like a Microsoft Excel pivot-table, the Data Grid has the following properties:
• Select — The data to be graphed can be selected by either clicking in the upper-left-most square of the Data Grid to select the entire data set, or holding the left mouse button down, dragging over the desired rows and releasing the mouse button. The rows highlighted will be graphed automatically.
• Sort — To sort the data in the Data Grid, click the right mouse button on any column header. The sort is always done from left to right with respect to the columns of data. Subsequent right mouse clicks toggle the sort from ascending to descending.
• Pivot — Pivoting of data is accomplished by holding the left mouse button down on the column header that you wish to pivot, dragging the column to its new desired position and releasing the mouse button. The pivoted column will appear to the left of the column that it was dragged on top of.

For the Frequency analysis, the column headers can take on the following values:
• — The name of the column that the Frequency analysis was executed against. When neither the Compute Cross-Tabulation nor the Pair-wise Frequencies option is selected, there will be a single column in the Data Grid. When the Pair-wise Frequencies option is selected, there will be two columns. In the case of the Compute Cross-Tabulation option, there will be one

Frequency Graph Options
The following options are available, depending upon the parameterization of the Frequency analysis:
• Select Frequency Column — Pull-down list containing variable names if more than one was selected. This is present only if neither the Compute Cross-Tabulation nor the Pair-wise Frequencies option is selected.
• Show Frequency Counts — Radio button to display the individual frequency counts instead of cumulative counts. Only present if the Include cumulative measure and rank option is selected.
  ∘ 2D Graph/3D Graph — All frequency graphs are available in two dimensions. If the Compute Cross-Tabulation or Pair-wise Frequencies option is selected, both 2D and 3D graphs are available. When the Compute Cross-Tabulation option is selected with more than two variables, a 3D view is not available.
• Show Cumulative Counts — Radio button to display the cumulative statistics in addition to the frequency counts. Only available if the Include cumulative measure and rank option is selected.
• Select Col1/Select Col2 — Pull-down list containing all the variable names available for display when the Pair-wise Frequencies option is selected. The variable specified in the Select Col1: pull-down will be displayed on the x-axis in a three-dimensional view.
• Show Counts and Statistics — Radio button to display the statistical values of the variables chosen by the Statistics Columns option, in addition to the frequency.
• Select Stats column — Pull-down list containing variable names specified by the Statistics Columns option, if that option is selected.
• Frequency Graph Types — There are six distinct graph types available for the Frequency analysis, once again depending upon the parameterization. These include the following:
  1. Basic Frequency
  2. Basic Frequency with Cumulative Stats
  3. Basic Frequency with Basic Stats
  4. Cross-tabulated Frequency
  5. Cross-tabulated Frequency with Cumulative Stats
  6. Pair-wise Frequency

  There are some practical limitations on the Cross-tabulated Frequency graphs (that is, graphs produced when the Compute Cross-Tabulation option is selected). These are documented below:
  ∘ Cross-tabulated Frequency — A Cross-Tabulation graph will provide the best display if the number of nonnumeric values selected within the groups is limited to between 8 and 15, depending on the number of columns selected. The specific limits are:

Cross-Tabulated Frequency Limits

Number Of Columns            Distinct Values Displayed
>5 Column Cross-Tabulation   Graph is not generated
5 Column Cross-Tabulation    8 categorical values
4 Column Cross-Tabulation    10 categorical values
3 Column Cross-Tabulation    12 categorical values
2 Column Cross-Tabulation    15 categorical values
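When preparing a cross-tabulation, it can help to check the column count against the limits in the table above before running the analysis. The lookup below simply encodes that table. The names `CROSS_TAB_LIMITS` and `max_categories` are hypothetical and are not part of the product.

```python
# Limits copied from the Cross-Tabulated Frequency Limits table.
CROSS_TAB_LIMITS = {2: 15, 3: 12, 4: 10, 5: 8}

def max_categories(num_columns):
    """Return the number of categorical values displayed for a
    cross-tabulation graph, or None when no graph is generated."""
    if num_columns < 2:
        raise ValueError("a cross-tabulation needs at least two columns")
    if num_columns > 5:
        return None  # more than 5 columns: graph is not generated
    return CROSS_TAB_LIMITS[num_columns]

for n in (2, 3, 4, 5, 6):
    print(n, max_categories(n))
```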

Frequency - RESULTS - SQL
1. On the Frequency dialog box, click on RESULTS.
2. Click on SQL (note that the RESULTS tab will be grayed-out/disabled until after the analysis is completed).

Frequency > Results > SQL

The SQL generated for the analysis is returned here as text, which can be copied by using the Select All and Copy buttons.

Tutorial - Frequency Analysis

Frequency - Example #1
1. Parameterize a Frequency analysis as follows:
   • Columns to Analyze:
     ∘ twm_customer.gender
     ∘ twm_customer.marital_status
     ∘ twm_customer.nbr_children
2. Run the analysis.
3. When it completes, click the RESULTS tab.

For this example, the Frequency analysis generated the following results. Note that for brevity, the SQL is not shown.

Frequency Analysis - Example #1 Data

xtbl          xcol            xval  xcnt  xpct
TWM_CUSTOMER  gender          F     418   55.957162
TWM_CUSTOMER  gender          M     329   44.042838
TWM_CUSTOMER  marital_status  2     353   47.2556894
TWM_CUSTOMER  marital_status  1     276   36.9477912
TWM_CUSTOMER  marital_status  4     70    9.3708166
TWM_CUSTOMER  marital_status  3     48    6.4257028
TWM_CUSTOMER  nbr_children    0     466   62.3828648
TWM_CUSTOMER  nbr_children    1     114   15.2610442
TWM_CUSTOMER  nbr_children    2     110   14.7255689
TWM_CUSTOMER  nbr_children    3     38    5.0870147
TWM_CUSTOMER  nbr_children    5     10    1.3386881
TWM_CUSTOMER  nbr_children    4     9     1.2048193

Frequency Analysis - Example #1 Graph

Note that the “gender” column takes on the value of “F” 418 times, and the value “M” 329 times. Click on the Graph Options tab as described above to change the data being graphed from “gender” to “marital_status,” or “nbr_children.”
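The xcnt and xpct values in the Example #1 data can be reproduced by simple counting. The sketch below is a plain-Python illustration, not the SQL that Teradata Warehouse Miner generates. The function name frequency_table is hypothetical, and the input list is rebuilt from the gender counts reported in the table rather than read from twm_customer.

```python
from collections import Counter


def frequency_table(values):
    """Return (value, count, percent) rows sorted by descending count."""
    counts = Counter(values)
    total = sum(counts.values())
    return [
        (value, count, 100.0 * count / total)
        for value, count in counts.most_common()
    ]


# Rebuild the gender column from the counts reported in Example #1.
gender = ["F"] * 418 + ["M"] * 329
for value, count, percent in frequency_table(gender):
    print(value, count, round(percent, 7))
```

The percentages are taken over all 747 rows, which is why the two gender percentages sum to 100.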

Frequency - Example #2
1. Parameterize a Frequency analysis as follows:
  • Columns to Analyze
    ∘ twm_customer.gender
    ∘ twm_customer.marital_status
  • Frequency Style — Cross-Tabulation
2. Run the analysis in the same manner described above.
This time, the following Results were generated. Again, the SQL is not shown.

Frequency Analysis Example #2 Data

gender  marital_status  xcnt  xpct
F       2               189   25.3012048
M       2               164   21.9544846
F       1               159   21.2851406
M       1               117   15.6626506
F       4               40    5.3547523
F       3               30    4.0160643
M       4               30    4.0160643
M       3               18    2.4096386

Frequency Analysis Example #2 Graph

Note that the “gender” column cross-tabulated with “marital_status” takes on the values “F” and “1”, respectively, 159 times.
3. Click on the Graph Options tab as described above to change the view to three dimensions.

Frequency Analysis Example #2 Graph: Three-Dimensional View

This three-dimensional view can be rotated by dragging the vertical or horizontal scroll bars, or automatically by double-clicking on the graph. The rotation can be stopped by double-clicking the image a second time.

Frequency - Example #3
1. Parameterize a Frequency analysis as follows:
  • Columns to Analyze
    ∘ twm_customer.gender
    ∘ twm_customer.marital_status
  • Pairwise Columns to Analyze — nbr_children
  • Frequency Style — Pair-wise
2. Run the analysis in the same manner described above.
This time, the following Results were generated. Again, not all SQL is shown.

Frequency Analysis Example #3 Data

xtbl          col1            val1  col2          val2  xcnt  xpct
TWM_CUSTOMER  marital_status  1     nbr_children  0     276   36.9477912
TWM_CUSTOMER  gender          F     nbr_children  0     260   34.8058902
TWM_CUSTOMER  gender          M     nbr_children  0     206   27.5769746
TWM_CUSTOMER  marital_status  2     nbr_children  0     156   20.8835341
TWM_CUSTOMER  marital_status  2     nbr_children  1     85    11.3788487
TWM_CUSTOMER  marital_status  2     nbr_children  2     78    10.4417671
TWM_CUSTOMER  gender          F     nbr_children  1     64    8.5676037
TWM_CUSTOMER  gender          F     nbr_children  2     64    8.5676037
TWM_CUSTOMER  gender          M     nbr_children  1     50    6.6934404
TWM_CUSTOMER  gender          M     nbr_children  2     46    6.1579652
TWM_CUSTOMER  marital_status  2     nbr_children  3     24    3.2128514
TWM_CUSTOMER  gender          F     nbr_children  3     21    2.811245
TWM_CUSTOMER  marital_status  4     nbr_children  1     20    2.6773762
TWM_CUSTOMER  marital_status  4     nbr_children  0     18    2.4096386
TWM_CUSTOMER  gender          M     nbr_children  3     17    2.2757697
TWM_CUSTOMER  marital_status  3     nbr_children  2     17    2.2757697
TWM_CUSTOMER  marital_status  3     nbr_children  0     16    2.1419009
TWM_CUSTOMER  marital_status  4     nbr_children  2     15    2.0080321
TWM_CUSTOMER  marital_status  4     nbr_children  3     10    1.3386881
TWM_CUSTOMER  marital_status  3     nbr_children  1     9     1.2048193
TWM_CUSTOMER  gender          M     nbr_children  5     6     .8032129
TWM_CUSTOMER  marital_status  2     nbr_children  4     5     .669344
TWM_CUSTOMER  marital_status  2     nbr_children  5     5     .669344
TWM_CUSTOMER  gender          F     nbr_children  4     5     .669344
TWM_CUSTOMER  marital_status  4     nbr_children  4     4     .5354752
TWM_CUSTOMER  marital_status  3     nbr_children  3     4     .5354752
TWM_CUSTOMER  gender          F     nbr_children  5     4     .5354752
TWM_CUSTOMER  gender          M     nbr_children  4     4     .5354752
TWM_CUSTOMER  marital_status  4     nbr_children  5     3     .4016064
TWM_CUSTOMER  marital_status  3     nbr_children  5     2     .2677376

Frequency Analysis Example #3 Graph

Note that the “gender” column taken with “nbr_children” takes on the values “F” and “0” 260 times. Click on the Graph Options tab as described above to change the view to “marital_status” and “nbr_children”.
3. Click on the Graph Options tab as described above to change the view to three dimensions.
Frequency Analysis Example #3 Graph: Three-Dimensional View

This three-dimensional view can be rotated by dragging the vertical or horizontal scroll bars, or automatically by double-clicking on the graph. The rotation can be stopped by double-clicking the image a second time.

Frequency - Example #4
1. Parameterize a Frequency analysis as follows:
  • Columns to Analyze
    ∘ twm_customer.gender
    ∘ twm_customer.marital_status
  • Frequency Style — Cross-Tabulation
  • Include rank, cumulative count, and cumulative percentage… — Enabled
2. Run the analysis in the same manner described above.
This time, the following Results were generated. Again, the SQL is not shown.

Frequency Analysis Example #4 Data

gender  marital_status  xcnt  xpct        xcum_cnt  xcum_pct    xrank
F       2               189   25.3012048  189       25.3012048  1
M       2               164   21.9544846  353       47.2556894  2
F       1               159   21.2851406  512       68.54083    3
M       1               117   15.6626506  629       84.2034806  4
F       4               40    5.3547523   669       89.5582329  5
F       3               30    4.0160643   699       93.5742972  6
M       4               30    4.0160643   729       97.5903614  7
M       3               18    2.4096386   747       100         8

Note that clicking on the Frequency Graph page will by default bring up the same two-dimensional or three-dimensional graph as in Frequency - Example #2.
3. Click on the Graph Options tab and select Show cumulative statistics to view the following graph.

Frequency Analysis Example #4 Graph

Note that the raw frequency counts are plotted by a red bar, while the cumulative statistics are shown in blue.
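The xcum_cnt, xcum_pct and xrank columns in the Example #4 data follow directly from the xcnt values once the rows are sorted by descending count. The Python sketch below illustrates the arithmetic only. It is not the SQL the analysis generates, the function name add_cumulative is hypothetical, and it assumes that tied counts receive consecutive ranks, as the two rows with a count of 30 do in the table.

```python
from itertools import accumulate


def add_cumulative(counts):
    """Given counts sorted in descending order, return one
    (count, pct, cum_count, cum_pct, rank) tuple per row."""
    total = sum(counts)
    rows = []
    for rank, (count, running) in enumerate(
        zip(counts, accumulate(counts)), start=1
    ):
        rows.append(
            (count, 100.0 * count / total, running, 100.0 * running / total, rank)
        )
    return rows


# xcnt values copied from the Example #4 data, already in descending order.
xcnt = [189, 164, 159, 117, 40, 30, 30, 18]
for row in add_cumulative(xcnt):
    print(row)
```

The final cumulative count equals the total number of rows (747), so the final cumulative percentage is 100.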

Frequency - Example #5
1. Parameterize a Frequency analysis as follows:
  • Columns to Analyze
    ∘ twm_customer.gender
    ∘ twm_customer.marital_status
  • Columns for Statistical Analysis — twm_customer.income
  • Frequency Style — Basic
2. Run the analysis in the same manner described above.
This time, the following results were generated. Again, the SQL is not shown.

Frequency Analysis Example #5 Data

xtbl          xcol            xval  xcnt  xpct   xmin_income  xmax_income  xmean_income  xstd_income
twm_customer  gender          F     418   55.96  0.00         102286.00    19833.86      18472.75
twm_customer  gender          M     329   44.04  0.00         144157.00    26405.69      25700.37
twm_customer  marital_status  2     353   47.26  1039.00      144157.00    26587.44      21107.38
twm_customer  marital_status  1     276   36.95  0.00         111004.00    14167.23      18384.49
twm_customer  marital_status  4     70    9.37   2772.00      90248.00     26914.53      21090.97
twm_customer  marital_status  3     48    6.43   3303.00      142274.00    37468.50      31971.36

3. Click on Graph Options.
4. Select Show Counts and Statistics.
Frequency Analysis Example #5 Graph

Note that box and whisker plots are shown for income when gender is F and M in the lower display. In the upper display, the total counts for gender are shown.

Histogram
Histogram analysis is designed to study the distribution of continuous numeric or date values in a column by providing the data necessary to create a histogram graph. This type of analysis is sometimes also referred to as binning because it counts the occurrence of values in a series of numeric ranges called bins. The histogram analysis provided in Teradata Warehouse Miner is particularly rich in functionality, providing a number of ways to define bins and allowing multi-dimensional binning, overlaying of categorical data, and the calculation of numeric statistics within bins.
Given a table name, the name of one or more numeric columns, and one of the following: the desired number of equal sized data bins, the desired number of bins with a nearly equal number of values, a desired width, or the specific boundaries, the Histogram analysis separates the data to show its distributional properties. It does this by separating the data by “bin” number and giving counts and percentages over the requested rows. Percentages will always sum to 100%.
Separate options are available to specify a number of equal sized data bins for which the analysis determines the minimum and maximum values, or for which the user specifies the minimum and maximum values. If the minimum and maximum are specified, all values less than the minimum are put into “bin 0,” while all values greater than the maximum are put into “bin N+1.” The same is true when the boundary option is specified.
The Histogram analysis optionally provides subtotals within each bin of the count, percentage within the bin, and percentage overall for each value or combination of values of one or more overlaid columns. Another option is provided to collect simple statistics for a binned column or another column of numeric or date type within the table, providing the minimum, maximum, mean, and standard deviation. When statistics are collected for a date type column, the standard deviation is given in units of days.

Part of the information returned for each bin is the beginning and ending range value delimiting the bin. In general, beginning range values are inclusive and ending range values are exclusive except for the last bin. One exception to this is that if equally populated bins are requested the ending range values are inclusive since they are actually the maximum values in the ranges. Another exception is that the beginning and ending values of bins of a date type column may be truncated to a whole date value when equal width or specified width bins are requested. When this happens to an ending range value, the value is inclusive. If this happens to a beginning range value, the value is exclusive. Therefore, since it will probably not be obvious when truncation has occurred, beginning and ending date range values for equal width or specified width bins should be considered approximate.
If multiple columns are requested, the select is repeated for each column, unless the Cross-tab (Multidimensional analysis) option is selected. In this case, all columns are cross-tabulated with one another within a single select statement. If the create table option is requested along with multiple columns, the create table occurs only once with the insert/select repeated. An optional WHERE clause may be used to reduce the range of bins or to reduce the rows to bin in some other way.
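The rules described above for equal sized bins with a user specified minimum and maximum (bin 0 below the minimum, bin N+1 above the maximum, beginning values inclusive, ending values exclusive except in the last bin) can be sketched in Python as follows. This is one reading of the text for numeric columns only, not the SQL that the analysis generates, and the function name assign_bin is hypothetical.

```python
def assign_bin(value, minimum, maximum, num_bins):
    """Return a bin number for equal-width bins between minimum and maximum.

    Bin 0 holds values below minimum and bin num_bins + 1 holds values
    above maximum. Each bin includes its beginning value and excludes its
    ending value, except the last bin, which includes both.
    """
    if num_bins <= 0 or maximum <= minimum:
        raise ValueError("need num_bins > 0 and maximum > minimum")
    if value < minimum:
        return 0
    if value > maximum:
        return num_bins + 1
    width = (maximum - minimum) / num_bins
    # min() keeps the maximum itself (and any rounding spill) in the last bin.
    return min(int((value - minimum) // width) + 1, num_bins)
```

For example, with a minimum of 0, a maximum of 100 and 10 bins, the value 10 falls in bin 2 because it is the inclusive beginning of that bin, while the value 100 stays in bin 10.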

Initiating a Histogram Analysis
Use the following procedure to initiate a new Histogram analysis.
1. Click on the Add New Analysis icon in the toolbar.
Add New Analysis: From Toolbar

2. In the resulting Add New Analysis dialog box, double-click on the Histogram icon.

Add New Analysis: Descriptive Statistics

The Histogram dialog box appears, in which you will enter INPUT and OUTPUT options to parameterize the analysis as described in the following sections.
• Histogram - INPUT - Data Selection
• Histogram - INPUT - Analysis Parameters
• Histogram - INPUT - Expert Options
• Histogram - OUTPUT

Histogram - INPUT - Data Selection
1. On the Histogram dialog box, click on INPUT.
2. Click on data selection.
Histogram > Input > Data Selection

3. On this screen select:
• Select Input Source — Users who are not using the Teradata Profiler program may select between different sources of input. By selecting the Table option, the user can select from available databases, tables (or views) and columns in the usual manner. By selecting the Analysis option, however, the user can select directly from the output of another analysis of qualifying type in the current project. Analyses that may be selected from directly include all of the Analytic Data Set (ADS) and Reorganization analyses (except Refresh). In place of Available Databases, the user may select from Available Analyses, while Available Tables then contains a list of all the output tables that will eventually be produced by the selected analysis, or it contains a single entry with the name of the analysis under the label Volatile Table, representing the output of the analysis that is ordinarily produced by a Select statement. For more information, see INPUT Tab.

Note: View is only available when a single column is selected or when crosstab is selected.
• Select Columns From a Single Table
  ∘ Available Databases (or Analyses) — Choose the database (or analysis) from which you will select data tables.
  ∘ Available Tables — Select the table from which you will select columns.
• Available Columns — Select columns by highlighting and then either dragging and dropping into the Selected Columns window, or click on the arrow button to move highlighted columns into the Selected Columns window.
• Select Histogram Style
  ∘ Histogram Style
    ▪ Basic — Option to create a histogram for individual columns.
    ▪ Crosstab — Option to create a multidimensional histogram by combining columns.
  ∘ Select Optional Columns
    ▪ Overlay Columns — Expand the Overlay Columns selector by clicking on the “double-up-arrow” icon. This represents a list of overlay columns to subdivide each bin. An overlay column is typically a categorical variable with only a few values. If an overlay column is specified, frequencies within each bin are calculated for each value of that overlay column (frequencies for crosstabs of values are given if more than one overlay column is requested). A specific column can be used in either Overlay Columns or Statistics Columns, but not both. Select columns by highlighting and then either dragging and dropping into the Overlay Columns window, or click on the arrow button to move highlighted columns into the Overlay Columns window.
    ▪ Statistics Columns — Expand the Statistics Columns selector by clicking on the “double-up-arrow” icon. This represents a list of numeric columns/aliases for which simple statistics (minimum, maximum, mean and standard deviation) will be calculated in each bin. Not available for DATE columns. A specific column can be used in either Statistics Columns or Overlay Columns but not both. Select columns by highlighting and then either dragging and dropping into the Statistics Columns window, or click on the arrow button to move highlighted columns into the Statistics Columns window.

Histogram - INPUT - Analysis Parameters
1. On the Histogram dialog box, click on INPUT.
2. Click on analysis parameters.
Histogram > Input > Analysis Parameters

3. On this screen, select:
• Bin Style
  ∘ Bins — Specify a number of equal sized data bins. By default 10 bins are derived for each column.
  ∘ Widths — Specify the desired width of each bin.
  ∘ Quantiles — Specify a number of bins with a nearly equal number of values. By default, 10 bins are derived for each column.
  ∘ Boundaries — Specify a list of the desired boundaries for each bin to start, with the final value indicating the end of the last bin. Note that bin 0 will be generated if necessary to contain data values less than the first boundary specified, and bin N+1 will be generated if necessary for those data values greater than the final boundary value.
  ∘ Bins with Boundaries — Specify the number of desired equal sized data bins, along with minimum and maximum values. By default, 10 bins are derived for each column. Note that bin 0 will be generated to contain data values less than the minimum specified, and bin N+1 will be generated for those data values greater than the maximum specified.
• Bin Values for Selected Columns
Each column selected for the Histogram analysis appears in this list, along with the default bin values, depending upon the Bin Style selected. Next to Column Name, one of the following appears:
  ∘ Bins — If Bins is selected, 10 appears as the number of bins to generate next to the column selected for the Histogram analysis. Highlight the Number of Bins to change the desired number of equal sized data bins. Entry must be an integer greater than 0.
  ∘ Widths — If Widths is selected, 0 appears next to the column selected for the histogram analysis. Highlight the Bin Width to enter the desired width of each bin. The values specified must be greater than 0.
  ∘ Quantiles — If Quantiles is selected, 10 appears as the number of bins to generate next to the column selected for the Histogram analysis. Highlight the Number of Quantiles to change the desired number of bins with a nearly equal number of values. Entry must be an integer greater than 0.
  ∘ Boundaries — If Boundaries is selected, enter for each requested column a list of numeric values corresponding to the starting values of each bin, plus one final value indicating the closing boundary of the final bin. A boundary list must contain two or more increasing numeric values, with dates entered as integer values in YYYYMMDD format.
  ∘ Bins with Boundaries — If Bins with Boundaries is selected, 10 appears by default as the number of bins to generate next to each column selected for the Histogram analysis. Highlight the Number of Bins to change the number of equal sized data bins if desired, and then enter a minimum and maximum value.
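The Boundaries bin style described above can be sketched in Python using the standard bisect module. This is an illustration of the stated rules only (two or more increasing boundary values, bin 0 below the first boundary, bin N+1 above the final boundary), not the SQL the analysis generates. The function name bin_by_boundaries is hypothetical, and the sketch assumes the usual convention of inclusive beginning values, with the final boundary belonging to the last bin.

```python
from bisect import bisect_right


def bin_by_boundaries(value, boundaries):
    """Return the bin number for value given a list of increasing boundaries.

    N + 1 boundary values define bins 1..N. Values below the first
    boundary go to bin 0 and values above the final boundary go to N + 1.
    """
    if len(boundaries) < 2 or any(
        low >= high for low, high in zip(boundaries, boundaries[1:])
    ):
        raise ValueError("boundaries must be two or more increasing values")
    last_bin = len(boundaries) - 1
    if value < boundaries[0]:
        return 0
    if value > boundaries[-1]:
        return last_bin + 1
    # bisect_right makes each beginning value inclusive; min() keeps the
    # final boundary value inside the last bin.
    return min(bisect_right(boundaries, value), last_bin)


# Dates are entered as integers in YYYYMMDD format.
quarter_starts = [20170101, 20170401, 20170701]
print(bin_by_boundaries(20170215, quarter_starts))
```

Because YYYYMMDD integers sort in the same order as the dates they represent, the same comparison logic works for date boundaries and numeric boundaries.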

Histogram - INPUT - Expert Options
1. On the Histogram dialog box, click on INPUT.
2. Click on expert options.
Histogram > Input > Expert Options

This screen has the following option:
• WHERE Clause text — Option to generate one or more SQL WHERE clauses to restrict the rows selected for analysis.

Histogram - OUTPUT
Before running the analysis, define Output options.
1. On the Histogram dialog box, click on OUTPUT.
Histogram > Output

On this screen, select from the following options. If no options are specified, by default a SELECT statement is generated.
• Use the Teradata EXPLAIN Feature… — Option to generate a SQL EXPLAIN SELECT statement, which returns a Teradata Execution Plan to the RESULTS tab.
• Store the tabular output of this analysis in the database — Option to generate a Teradata TABLE or VIEW populated with the results of the analysis. Once enabled, the following three fields must be specified:
  ∘ Database Name — Text box to specify the name of the Teradata database where the resultant Table or View will be created. By default, this is the “Result Database.”
  ∘ Output Name — Text box to specify the name of the Teradata Table or View.
  ∘ Output Type — Pull-down to specify Table or View. Although a Table is always available, Views cannot be specified due to limitations in Teradata.
  ∘ Stored Procedure — The creation of a stored procedure containing the SQL generated for this analysis can be requested by entering the desired name of the stored procedure here. This will result in the creation of a stored procedure in the user's login database in place of the execution of the SQL generated by the analysis. For more information, see Stored Procedure Support (Teradata Database).
  ∘ Procedure Comment — When an optional procedure comment is entered, it is applied to a requested stored procedure with an SQL Comment statement. It can be up to 255 characters in length and contain substitution parameters for the output category (Score, ADS, Stats or Other), project name and/or analysis name (using the tags , and , respectively). The default value of this field may be set on the Defaults tab of the Preferences dialog box, available from the Tools > Preferences menu option.
  ∘ Create output table using the FALLBACK keyword — If a table is selected, it will be built with FALLBACK if this option is selected.
  ∘ Create output table using the MULTISET keyword — If a table is selected, it will be built as a MULTISET table if this option is selected.
  ∘ Advertise Output — The Advertise Output option may be requested when creating a table, view or procedure. This feature “advertises” output by inserting information into one or more of the Advertise Output metadata tables according to the type of analysis and the options selected in the analysis. For more information, see Advertise Output.
  ∘ Advertise Note — An advertise note may be specified if desired when the Advertise Output option is selected or when the Always Advertise option is selected on the Connection Properties dialog box. It is a free-form text field of up to 30 characters that may be used to categorize or describe the output.
• Generate the SQL for this analysis, but do not execute it — If this option is selected, the analysis will only generate SQL, returning it and terminating immediately.

Note: For general information about output, see OUTPUT Tab.

Running the Histogram Analysis
After setting parameters on the INPUT and OUTPUT screens as described above, you are ready to run the analysis.
1. To run the analysis, you can either:
  • Click the Run icon on the toolbar, or
  • Select Run on the Project menu, or
  • Press the F5 key on your keyboard

Results - Histogram
The results of running the Teradata Warehouse Miner Histogram analysis include the generated SQL itself, the results of executing the generated SQL, two and/or three dimensional histograms, and, if the Create Table (or View) option is chosen, a Teradata table (or view). All of these results are outlined below.
• Histogram - RESULTS - Data
• Histogram - RESULTS - Graph
• Histogram - RESULTS - SQL
Click on the RESULTS tab of the Histogram analysis when it has completed to view these results.

Histogram - RESULTS - Data
1. On the Histogram dialog box, click on RESULTS.
2. Click on data (note that the RESULTS tab will be grayed-out/disabled until after the analysis is completed).
Histogram > Results > Data

Results data, if any, is displayed in a data grid as described in RESULTS Tab.

The following is a description of the results returned by the analysis, depending on the options selected. If an output table is created, the columns in bold below will comprise the Unique Primary Index (UPI) of the output table.

Without Cross-tab (Multidimensional Analysis)

Name       Type                 Definition
xtbl       VARCHAR (30)         Table that the bin values are computed against, as specified in the “table” parameter.
xcol       VARCHAR (30)         Column name or optional alias that the bin values were computed against, as specified by the “column” parameter.
xbin       INTEGER              An integer representing the bin number.
xbeg       FLOAT                Value that represents the beginning boundary for the bin of the column specified.
xend       FLOAT                Value that represents the ending boundary for the bin of the column specified.
xcnt       FLOAT                Number of records in the bin.
xpct       FLOAT                Percentage of total records that this bin represents.
ovly_COL   Overlay Column Type  Value of column to overlay within the bin. Only exists if Overlay Columns is specified. COL indicates the name or alias of the requested overlay column. This column will consist of unique categorical values that the bin is subdivided into.
xocnt      FLOAT                Number of records within the bin in which this overlay column value is present. Created only if Overlay Columns is specified.
xobpct     FLOAT                Percentage of records within the bin in which this overlay column value is present. Created only if Overlay Columns is specified.
xopct      FLOAT                Percentage of total records in which this overlay column value is present. Created only if Overlay Columns is specified.
xmin_COL   FLOAT                Minimum value of a requested “stats” column within the bin. Only exists if the Stats Columns option is specified. COL indicates the name or alias of the requested “stats” column. Multiple columns, if specified, will be created or displayed as a repeating group consisting of xmin_COL, xmax_COL, xmean_COL and xstd_COL for each COL name or alias (see xmax, xmean and xstd below).
xmax_COL   FLOAT                Maximum value of a requested “stats” column within the bin. Only exists if the Stats Columns option is specified. COL indicates the name or alias of the requested “stats” column.
xmean_COL  FLOAT                Mean (average) value of a requested “stats” column within the bin. Only exists if the Stats Columns option is specified. COL indicates the name or alias of the requested “stats” column.
xstd_COL   FLOAT                Standard deviation of a requested “stats” column within the bin. Only exists if the Stats Columns option is specified. COL indicates the name or alias of the requested “stats” column.

With Cross-tab (Multidimensional Analysis) Option

Name       Type                 Definition
xbin_COL   INTEGER              Bin number of the column specified in the mandatory “column” parameter. COL is replaced with the name or alias of the requested column. Multiple columns, if specified, will be created or displayed as a repeating group consisting of xbin_COL, xbeg_COL, xend_COL for each COL name or alias.
xbeg_COL   FLOAT                Value that represents the beginning boundary for the bin of the column specified. COL indicates the name or alias of the requested column.
xend_COL   FLOAT                Value that represents the ending boundary for the bin of the column specified. COL indicates the name or alias of the requested column.
xcnt       FLOAT                Number of records in the bin.
xpct       FLOAT                Percentage of total records that this bin represents.
ovly_COL   Overlay Column Type  Value of column to overlay within the bin. Only exists if the Overlay Columns option is specified. COL indicates the name or alias of the requested overlay column. This column will consist of unique categorical values that the bin is subdivided into.
xocnt      FLOAT                Number of records within the bin in which this overlay column value is present. Created only if the Overlay Columns option is requested.
xobpct     FLOAT                Percentage of records within the bin in which this overlay column value is present. Created only if the Overlay Columns option is requested.
xopct      FLOAT                Percentage of total records in which this overlay column value is present. Created only if the Overlay Columns option is requested.
xmin_COL   FLOAT                Minimum value of a requested “stats” column within the bin. Only exists if the Stats Columns option is specified. COL indicates the name or alias of the requested “stats” column. Multiple columns, if specified, will be created or displayed as a repeating group consisting of xmin_COL, xmax_COL, xmean_COL and xstd_COL for each COL name or alias (see xmax, xmean and xstd below).
xmax_COL   FLOAT                Maximum value of a requested “stats” column within the bin. Only exists if the Stats Columns option is specified. COL indicates the name or alias of the requested “stats” column.
xmean_COL  FLOAT                Mean (average) value of a requested “stats” column within the bin. Only exists if the Stats Columns option is specified. COL indicates the name or alias of the requested “stats” column.
xstd_COL   FLOAT                Standard deviation of a requested “stats” column within the bin. Only exists if the Stats Columns option is specified. COL indicates the name or alias of the requested “stats” column.
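The three overlay measures (xocnt, xobpct and xopct) can be illustrated with a short Python sketch. It assumes one reading of the definitions above: xocnt counts the records in a bin that carry a given overlay value, xobpct divides that count by the number of records in the bin, and xopct divides it by the total number of records. The function name overlay_subtotals and the four sample records are made up for illustration and do not come from twm_customer.

```python
from collections import Counter


def overlay_subtotals(records):
    """records is a list of (bin_number, overlay_value) pairs, one per row.

    Returns {(bin_number, overlay_value): (xocnt, xobpct, xopct)}.
    """
    total = len(records)
    bin_counts = Counter(bin_number for bin_number, _ in records)
    pair_counts = Counter(records)
    return {
        (bin_number, overlay): (
            count,
            100.0 * count / bin_counts[bin_number],  # share within the bin
            100.0 * count / total,                   # share of all records
        )
        for (bin_number, overlay), count in pair_counts.items()
    }


# Made-up sample: three records in bin 1 and one record in bin 2.
sample = [(1, "F"), (1, "F"), (1, "M"), (2, "M")]
for key, measures in sorted(overlay_subtotals(sample).items()):
    print(key, measures)
```

Under this reading, the xobpct values within a single bin sum to 100, and the xopct values across all bins and overlay values sum to 100.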

Histogram - RESULTS - Graph
There are several different graphs available for the Histogram analysis, depending upon the parameterization. These graphs are scrollable, with up to 16 bin counts shown per viewing area, unless the Zoom Out (All Data) option is selected by right-clicking on the graph image. This option is only available on two-dimensional graphs. When the three-dimensional view option is desired, the number in the upper-most left side of the graph indicates the degrees of rotation about the vertical axis. Three-dimensional graphs are limited to fifty bins along each axis.
Note that when the Boundaries or Width options are selected, there is the potential to generate a NULL bin (i.e., a bin with no occurrences). In that case, the NULL bin(s) are displayed as such. If more than one column is selected within Overlay Columns, the graph will not be available.
1. On the Histogram dialog box, click on RESULTS.
2. Click on graph (note that the RESULTS tab will be grayed-out/disabled until after the analysis is completed).
Histogram > Results > Graph

The Histogram analysis graph options are detailed in the following topics:
• Histogram Graph Drill Down Functions
• Show Graph
• Histogram Data Grid
• Histogram Graph Options
• Histogram Graph Types

Histogram Graph Drill Down Functions

Drill down into one or more bins from multiple columns to see source data rows. First, select bins that target the range of data to be retrieved. The inclusiveness or exclusivity of bin beginning and ending values can vary for columns of type 'Date' and for the ending value of the final bin. A simplification has been made such that, for drill down purposes, both the beginning and ending values of a bin's range are treated as inclusive, so it is possible that a row with a value matching either the beginning or ending value may be displayed as belonging to more than one bin.

Multiple bins can be selected from different graphs (columns) to build up composite views of source data. Each bin selected will be entered into an editable drill down list automatically (a drill down window pops up as values are clicked). All bins in the list will be used as drill down criteria, resulting in a drill down data table containing rows belonging to all of the ranges selected.

There are two ways to select bins for drill down:
1. Click on one or more count bars or bin ranges to add bins to the drill down list.
2. Or, send bins to the drill down list by selecting rows in the data selector, right-clicking on the rows and clicking on the ‘Drill Down’ pop-up menu item.

The drill down list can be edited after graph selections are added, and saved for future reference. Finally, to see the result of the drill down selections, click the ‘drill down’ tab in the Drill Down window to query the source data table. Filtered rows (rows that fit the drill down criteria) will be returned and displayed in the drill down table.

In addition to the isolated graph data, all related rows from the source table can be viewed as well. In the header of the drill down display are options to “show graph data only” or “include related columns”. The tabular icon to the right of these two options in the header can be used to select up to 20 columns for display with the second option. By default, the first 20 columns when sorted alphabetically are displayed with the second option. The “sql” to the left of the tabular icon can be used to copy the SQL Where clause for the drill down query to the Clipboard for pasting into another text area or application.

Histogram Drill Down List
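The effect of treating both ends of a bin range as inclusive can be illustrated with a minimal Python sketch. The function name `matching_bins` and the dictionary layout are hypothetical and exist only for this illustration; the ranges are the first three "age" bins from the tutorial later in this chapter. This is not the drill down query the product generates, only a demonstration of the boundary rule described above.

```python
# Bin number -> (beginning, ending), taken from the "age" bins in the tutorial.
bins = {1: (13.0, 20.6), 2: (20.6, 28.2), 3: (28.2, 35.8)}


def matching_bins(value, bins):
    """Return the bin numbers whose range contains value.

    Both ends of every range are treated as inclusive, mirroring the
    drill down simplification described in the text.
    """
    return [n for n, (beg, end) in sorted(bins.items()) if beg <= value <= end]


# A value that sits exactly on a shared boundary matches two bins.
print(matching_bins(20.6, bins))
# A value strictly inside a range matches only one bin.
print(matching_bins(25.0, bins))
```

A row whose value equals a shared boundary (20.6 here) is therefore reported under both adjoining bins when both are in the drill down list.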

Additional drill down features:

• Save each list before creating a new one if it needs to be used again. After saving, a small icon appears next to the drill down icon in the upper right corner of the graph page.
• Clicking on the thumbnail will reload the saved list, enabling repeated reviews of the source data. In addition, the saved list can be copied, edited by deletions or additions, and then replaced over the old list or saved as new. Clicking on the drill down icon reveals a list of default descriptive labels for the saved drill downs, which can be changed to something the user finds more meaningful.
• Saved drill down lists can be deleted by right-clicking on thumbnail icons, or from the drill down label list.

The following right-click menu options are available from the display of drill down data rows after drill down is performed. They can be used to copy data to Microsoft Excel or to the Clipboard. When copying data to the Clipboard, columns are delimited by tab characters.
• Export to Excel — All Rows
• Export to Excel — Selected Rows
• Copy (to Clipboard) — All Rows
• Copy (to Clipboard) — Selected Rows

Note: Drill down cannot retrieve source rows for a column that is renamed with an alias. Also, drill down will not work properly when the table being analyzed is a volatile table created by a referenced analysis that doesn't create an output table or view.

Note: Drill down with the Histogram crosstab option selected will not work properly if either subject column name is greater than 24 characters in length, in which case, the workaround is to create a view to analyze.

Show Graph

The following right-click items are available:
• Maximize, Print, Export and Zoom Out — The standard options Maximize, Print, Export and Zoom Out are described in RESULTS Tab.

Histogram Data Grid

A data-grid control is used to select the data to display. Much like a Microsoft Excel pivot-table, the Data Grid has the following properties:
• Select — The data to be graphed can be selected by either clicking in the upper leftmost square of the Data Grid to select the entire data set, or holding the left mouse button down, dragging over the desired rows and releasing the mouse button. The rows highlighted will be graphed automatically.
• Sort — To sort the data in the Data Grid, click the right mouse button on any column header. The sort is always done from left to right with respect to the columns of data. Subsequent right mouse clicks toggle the sort from ascending to descending.
• Pivot — Pivoting of data is accomplished by holding the left mouse button down on the column header that you wish to pivot, dragging the column to its new desired position and releasing the mouse button. The pivoted column will appear to the left of the column that it was dragged on top of.

For the Histogram analysis, the column headers can take on the following values:
• Bin Range — The high and low values in each bin. This column header will be present unless the Crosstab (Multi-dimensional Analysis) option is selected.
• Single Count — The number of occurrences within the specified bin. This column header will be present when the Show Single Counts radio button is enabled; see below.
• Overlay — The distinct values of the variable specified by the Overlay Columns option.
• Overlay Count — The number of occurrences of the overlay value within the specified bin. This column header will only be present if the Overlay Columns option is selected.
• — The name of the column that the Histogram analysis was executed against. When the Crosstab option is selected, there will be one column for each variable selected in the analysis.
• Count — The number of occurrences of the cross-tabulated variable values. This column header will only be present when the Crosstab option is selected.
• Min/Max/Mean/Std — Respectively, the minimum, maximum, mean and standard deviation of the variable specified by the Statistics Columns option.

Histogram Graph Options

A data-grid control is used to select the data to display. Much like a Microsoft Excel pivot-table, the Data Grid has the following properties:
• Select Bin Column — Pull-down list containing variable names specified by the Columns to Analyze option. This will only be present if the Crosstab option is NOT selected.
• Show Single Counts — Radio button to display an individual variable's distribution. Only present if variables have been specified in Overlay Columns.
• Show Overlay Counts — Radio button to display the overlay values in addition to the distribution. Only available if variables were specified by the Overlay Columns option.
• 2D Graph/3D Graph — All histogram graphs are available in two dimensions. If the Overlay Columns or Crosstab options are selected, both 2D and 3D graphs are available. When the Crosstab option is selected with more than two variables, a 3D view is not available.
• Show Bin Stats — Radio button to display the statistical values of the variables chosen by the Statistics Columns option, in addition to the distribution.
• Select Stats column — Pull-down list containing variable names specified by the Statistics Columns option, if that option is selected.

Histogram Graph Types

There are five distinct graph types available for the Histogram analysis, depending upon the parameterization:
1. Single Column Histogram
2. Single Column Histogram with Overlay
3. Single Column Histogram with Statistics
4. Single Column Histogram with Overlay and Statistics
5. Multi-dimensional/Cross-tab Histogram

Additionally, when Boundaries or Bins with Boundaries is specified, the graphics show all data less than the lower bound in bin 0, and all data greater than the upper bound in bin n+1.

There are some practical limitations on the histogram graphs:
• A Crosstab graph will provide the best display if the non-numeric values selected within the groups are limited to between 8 and 15, depending on the number of columns selected.
• If more than one column is selected in Overlay Columns, a graph is not available.
• If more than five columns are selected as a Crosstab Column, no graph is available.
• Show Overlay Counts and Show Bin Stats are mutually exclusive.
• Only 15 overlay values can be shown at a time before they begin to overlap.
• 3D is only available when the Show Overlay Counts option is selected, or Crosstab is selected with exactly two columns specified in Columns to Analyze.
• Multi-dimensional graphs are not available when the Show Bin Stats option is selected for the Histogram analysis.

Histogram - RESULTS - SQL

1. On the Histogram dialog box, click on RESULTS.
2. Click on SQL (note that the RESULTS tab will be grayed-out/disabled until after the analysis is completed).

Histogram > Results > SQL

On this screen, the generated SQL is returned as text, which can be copied by using the Select All and Copy buttons.

Tutorial - Histogram Analysis

Histogram - Example #1

1. Parameterize a Histogram analysis as follows:
• Histogram Style — Basic
• Columns to Analyze
  ∘ twm_customer.age
  ∘ twm_customer.income
2. Run the analysis, and when it completes, click on the RESULTS tab.

For this example, the Histogram analysis generated the following results. Note that the SQL is not shown for brevity:

Histogram Analysis Example # 1 Data

xtbl          xcol    xbin  xbeg      xend      xcnt  xpct
TWM_CUSTOMER  age     1     13        20.6      140   18.7416332
TWM_CUSTOMER  age     2     20.6      28.2      56    7.4966533
TWM_CUSTOMER  age     3     28.2      35.8      92    12.3159304
TWM_CUSTOMER  age     4     35.8      43.4      107   14.3239625
TWM_CUSTOMER  age     5     43.4      51        88    11.7804552
TWM_CUSTOMER  age     6     51        58.6      110   14.7255689
TWM_CUSTOMER  age     7     58.6      66.2      71    9.5046854
TWM_CUSTOMER  age     8     66.2      73.8      35    4.6854083
TWM_CUSTOMER  age     9     73.8      81.4      28    3.7483266
TWM_CUSTOMER  age     10    81.4      89        20    2.6773762
TWM_CUSTOMER  income  1     0         14415.7   332   44.4444444
TWM_CUSTOMER  income  2     14415.7   28831.4   191   25.5689424
TWM_CUSTOMER  income  3     28831.4   43247.1   108   14.4578313
TWM_CUSTOMER  income  4     43247.1   57662.8   63    8.4337349
TWM_CUSTOMER  income  5     57662.8   72078.5   20    2.6773762
TWM_CUSTOMER  income  6     72078.5   86494.2   19    2.5435074
TWM_CUSTOMER  income  7     86494.2   100909.9  7     .9370817
TWM_CUSTOMER  income  8     100909.9  115325.6  3     .4016064
TWM_CUSTOMER  income  9     115325.6  129741.3  2     .2677376
TWM_CUSTOMER  income  10    129741.3  144157    2     .2677376

Histogram Analysis Example #1 Graph

Note that the “age” column ranges from the value “13” to “89,” and that the first bin range is 13-20.6, where there are 140 occurrences.

3. Click on the Graph Options tab as described above to change the data being graphed from “age” to “income.”
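The bin ranges in the Example #1 results are consistent with dividing the span between each column's minimum and maximum into ten equal-width bins, and the xpct values are consistent with each bin's share of the 747 rows counted across the ten bins. The Python sketch below reproduces a few of those figures. The function name `equal_width_edges` is hypothetical, and rounding the edges to one decimal place is an assumption made only to match the displayed values; the SQL generated by the product is not shown in this guide.

```python
def equal_width_edges(minimum, maximum, nbins):
    """Return nbins + 1 edges that split [minimum, maximum] into equal widths."""
    width = (maximum - minimum) / nbins
    # Rounding to one decimal is an assumption made to match the display.
    return [round(minimum + i * width, 1) for i in range(nbins + 1)]


# xcnt values for the ten "age" bins, copied from the Example #1 results.
age_counts = [140, 56, 92, 107, 88, 110, 71, 35, 28, 20]
total_rows = sum(age_counts)

age_edges = equal_width_edges(13, 89, 10)
income_edges = equal_width_edges(0, 144157, 10)

# Percentage of all rows that fall in the first "age" bin.
first_bin_pct = round(100 * age_counts[0] / total_rows, 7)

print(age_edges[:3])
print(income_edges[:2])
print(total_rows, first_bin_pct)
```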

Histogram - Example #2

1. Parameterize a Histogram analysis as follows:
• Histogram Style — Basic
• Columns to Analyze
  ∘ twm_customer.age
  ∘ twm_customer.income
• Overlay Columns — gender
• Statistics Columns — nbr_children
2. Run the analysis in the same manner as described above.

This time, the following results should be generated. Again, the SQL is not shown.

Histogram Analysis Example #2 Table

xtbl      xcol    xbin  xbeg      xend      xcnt  xpct        ovly_gen  xocnt  xopbct      xopct
twm_c...  age     1     13        20.6      140   18.7416332  F         78     55.7142857  10.4417671
twm_c...  age     1     13        20.6      140   18.7416332  M         62     44.2857143  8.2998661
twm_c...  age     2     20.6      28.2      56    7.4966533   F         33     58.9285714  4.4176707
twm_c...  age     2     20.6      28.2      56    7.4966533   M         23     41.0714286  3.0789826
twm_c...  age     3     28.2      35.8      92    12.3159304  F         49     53.2608696  6.5595716
twm_c...  age     3     28.2      35.8      92    12.3159304  M         43     46.7391304  5.7563588
twm_c...  age     4     35.8      43.4      107   14.3239625  F         63     58.8785047  8.4337349
twm_c...  age     4     35.8      43.4      107   14.3239625  M         44     41.1214953  5.8902276
twm_c...  age     5     43.4      51        88    11.7804552  F         52     59.0909091  6.961178
twm_c...  age     5     43.4      51        88    11.7804552  M         36     40.9090909  4.8192771
twm_c...  age     6     51        58.6      110   14.7255689  F         58     52.7272727  7.7643909
twm_c...  age     6     51        58.6      110   14.7255689  M         52     47.2727273  6.961178
twm_c...  age     7     58.6      66.2      71    9.5046854   F         41     57.7464789  5.4886212
twm_c...  age     7     58.6      66.2      71    9.5046854   M         30     42.2535211  4.0160643
twm_c...  age     8     66.2      73.8      35    4.6854083   F         17     48.5714286  2.2757697
twm_c...  age     8     66.2      73.8      35    4.6854083   M         18     51.4285714  2.4096386
twm_c...  age     9     73.8      81.4      28    3.7483266   F         18     64.2857143  2.4096386
twm_c...  age     9     73.8      81.4      28    3.7483266   M         10     35.7142857  1.3386881
twm_c...  age     10    81.4      89        20    2.6773762   F         9      45          1.2048193
twm_c...  age     10    81.4      89        20    2.6773762   M         11     55          1.4725569
twm_c...  income  1     0         14415.7   332   44.4444444  F         200    60.2409639  26.7737617
twm_c...  income  1     0         14415.7   332   44.4444444  M         132    39.7590361  17.6706827
twm_c...  income  2     14415.7   28831.4   191   25.5689424  F         117    61.2565445  15.6626506
twm_c...  income  2     14415.7   28831.4   191   25.5689424  M         74     38.7434555  9.9062918
twm_c...  income  3     28831.4   43247.1   108   14.4578313  F         50     46.2962963  6.6934404
twm_c...  income  3     28831.4   43247.1   108   14.4578313  M         58     53.7037037  7.7643909
twm_c...  income  4     43247.1   57662.8   63    8.4337349   F         30     47.6190476  4.0160643
twm_c...  income  4     43247.1   57662.8   63    8.4337349   M         33     52.3809524  4.4176707
twm_c...  income  5     57662.8   72078.5   20    2.6773762   F         12     60          1.6064257
twm_c...  income  5     57662.8   72078.5   20    2.6773762   M         8      40          1.0709505
twm_c...  income  6     72078.5   86494.2   19    2.5435074   F         6      31.5789474  .8032129
twm_c...  income  6     72078.5   86494.2   19    2.5435074   M         13     68.4210526  1.7402945
twm_c...  income  7     86494.2   100909.9  7     .9370817    F         1      14.2857143  .1338688
twm_c...  income  7     86494.2   100909.9  7     .9370817    M         6      85.7142857  .8032129
twm_c...  income  8     100909.9  115325.6  3     .4016064    F         2      66.6666667  .2677376
twm_c...  income  8     100909.9  115325.6  3     .4016064    M         1      33.3333333  .1338688
twm_c...  income  9     115325.6  129741.3  2     .2677376    M         2      100         .2677376
twm_c...  income  10    129741.3  144157    2     .2677376    M         2      100         .2677376

Histogram Analysis Example #2 Data

xtbl      xmin_nbr...  xmax_nbr...  xmean_nbr...   xstd_nbr...
twm_c...  0            0            0              0
twm_c...  0            0            0              0
twm_c...  0            2            .7878788       .7690047
twm_c...  0            2            .5217391       .650723
twm_c...  0            3            1.6326531      1.1192905
twm_c...  0            3            1.5813953      1.1858185
twm_c...  0            5            1.3333333      1.5013222
twm_c...  0            5            1.5227273      1.3566107
twm_c...  0            5            .8653846       1.2093998
twm_c...  0            5            1.2222222      1.7497795
twm_c...  0            2            .9655172       .6939521
twm_c...  0            2            .8461538       .7174907
twm_c...  0            2            9.7560976E-02  .3698964
twm_c...  0            2            .1333333       .4268749
twm_c...  0            0            0              0
twm_c...  0            0            0              0
twm_c...  0            0            0              0
twm_c...  0            0            0              0
twm_c...  0            0            0              0
twm_c...  0            0            0              0
twm_c...  0            3            .315           .7181748
twm_c...  0            3            .25            .6077155
twm_c...  0            5            1              1.132277
twm_c...  0            4            .7297297       1.0691914
twm_c...  0            5            .82            1.1779643
twm_c...  0            5            1.2413793      1.3684921
twm_c...  0            5            1.4666667      1.2578642
twm_c...  0            5            1.6666667      1.6080605
twm_c...  0            4            1.75           1.4215602
twm_c...  0            2            1              .8660254
twm_c...  0            4            1              1.4142136
twm_c...  0            3            .8461538       .9483714
twm_c...  1            1            1              0
twm_c...  0            2            .5             .7637626
twm_c...  0            2            1              1
twm_c...  0            0            0              0
twm_c...  1            2            1.0            .5
twm_c...  0            0            0              0

By default, the same two-dimensional graph shown in Example #1 appears.

3. Go to the Graph Options tab.
4. Select the Show Overlay Counts and 3D Graph radio buttons. This graph is shown below.

Histogram Analysis Example #2 Graph: Three Dimensional View

Note the same ranges for the “age” column as before (“13” to “89”). Now, however, each bin has been overlaid with the distinct values of “gender” (“M” and “F”). The counts for each overlay are represented by height. Note that, for the first bin range of “age” (“13 to 20.6”), there are approximately 75 females (where “gender” = “F”) and 60 males (where “gender” = “M”). This image can be rotated either by double-clicking anywhere on the graph (automatic), or by using the vertical and/or horizontal scroll-bars. When rotating, the number from 0-359 in the uppermost left-hand corner of the graph is the degrees of rotation about the z-axis. This value changes as the horizontal scroll-bar is adjusted.

5. Click on the Graph Options tab as described above to change the data being graphed from “age” to “income.”
6. Select the Show Bin Stats radio button. Note that this disables the Show Overlay Counts option as well as 3D Graph. This graph is shown here.

Histogram Analysis Example #2 Graph

Note that the first bin range of the “income” variable is 0-14415.7. This is broken down into two pieces, one for each overlay value (“M” and “F”). So, within this first bin of income, we have 200 females (where “gender” = “F”) and 132 males (where “gender” = “M”). Further, the statistics for “nbr_children” are shown. Note that, for the females in this bin, the minimum value of nbr_children is 0, the maximum is 3, the mean is .315 and the standard deviation is .7181748. This is illustrated graphically by the orange square (mean), the wide blue bar (+/- one standard deviation), and the upper and lower blue lines (minimum and maximum). Note that, since minus one standard deviation encompasses the minimum value, no lower blue line is shown.
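The overlay percentages in the Example #2 table can be checked by hand. The values are consistent with xopbct being the overlay count as a percentage of its bin's count, and xopct being the overlay count as a percentage of all 747 rows; that reading is inferred from the numbers rather than stated in this guide. The sketch below, with hypothetical function names, reproduces the figures for the first “income” bin and also shows why no lower line is drawn for the female group, assuming a line is drawn only when the minimum or maximum lies outside the one-standard-deviation bar.

```python
def overlay_percentages(xocnt, xcnt, total_rows):
    """Return (share of the bin, share of all rows), both as percentages."""
    return (round(100 * xocnt / xcnt, 7), round(100 * xocnt / total_rows, 7))


def lower_line_shown(minimum, mean, std):
    """Assumed rule: the lower line appears only if the minimum is below mean - std."""
    return minimum < mean - std


def upper_line_shown(maximum, mean, std):
    """Assumed rule: the upper line appears only if the maximum is above mean + std."""
    return maximum > mean + std


total_rows = 747   # sum of xcnt over the ten income bins
bin_count = 332    # xcnt for income bin 1 (0 to 14415.7)

print(overlay_percentages(200, bin_count, total_rows))   # gender = F
print(overlay_percentages(132, bin_count, total_rows))   # gender = M

# nbr_children statistics for gender = F in income bin 1.
print(lower_line_shown(0, 0.315, 0.7181748))
print(upper_line_shown(3, 0.315, 0.7181748))
```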

Histogram - Example #3

1. Parameterize a Histogram analysis as follows:
• Columns to Analyze
  ∘ twm_customer.age
  ∘ twm_customer.income
• Histogram Style — Cross-Tab
2. Run the analysis in the same manner described above.

This time, the following results should be generated. Again, the SQL is not shown.

Histogram Analysis Example #3 Table

xbin_income  xbeg_income  xend_income  xbin_age  xbeg_age  xend_age  xcnt  xpct
1            0            14415.7      1         13        20.6      132   17.6706827
1            0            14415.7      2         20.6      28.2      37    4.9531459
1            0            14415.7      3         28.2      35.8      26    3.480589
1            0            14415.7      4         35.8      43.4      21    2.811245
1            0            14415.7      5         43.4      51        14    1.8741633
1            0            14415.7      6         51        58.6      11    1.4725569
1            0            14415.7      7         58.6      66.2      34    4.5515395
1            0            14415.7      8         66.2      73.8      16    2.1419009
1            0            14415.7      9         73.8      81.4      22    2.9451138
1            0            14415.7      10        81.4      89        19    2.5435074
2            14415.7      28831.4      1         13        20.6      8     1.0709505
2            14415.7      28831.4      2         20.6      28.2      17    2.2757697
2            14415.7      28831.4      3         28.2      35.8      31    4.1499331
2            14415.7      28831.4      4         35.8      43.4      32    4.2838019
2            14415.7      28831.4      5         43.4      51        36    4.8192771
2            14415.7      28831.4      6         51        58.6      25    3.3467202
2            14415.7      28831.4      7         58.6      66.2      21    2.811245
2            14415.7      28831.4      8         66.2      73.8      15    2.0080321
2            14415.7      28831.4      9         73.8      81.4      5     .669344
2            14415.7      28831.4      10        81.4      89        1     .1338688
3            28831.4      43247.1      2         20.6      28.2      2     .2677376
3            28831.4      43247.1      3         28.2      35.8      16    2.1419009
3            28831.4      43247.1      4         35.8      43.4      29    3.8821954
3            28831.4      43247.1      5         43.4      51        16    2.1419009
3            28831.4      43247.1      6         51        58.6      31    4.1499331
3            28831.4      43247.1      7         58.6      66.2      10    1.3386881
3            28831.4      43247.1      8         66.2      73.8      3     .4016064
3            28831.4      43247.1      9         73.8      81.4      1     .1338688
4            43247.1      57662.8      3         28.2      35.8      16    2.1419009
4            43247.1      57662.8      4         35.8      43.4      14    1.8741633
4            43247.1      57662.8      5         43.4      51        12    1.6064257
4            43247.1      57662.8      6         51        58.6      14    1.8741633
4            43247.1      57662.8      7         58.6      66.2      6     .8032129
4            43247.1      57662.8      8         66.2      73.8      1     .1338688
5            57662.8      72078.5      3         28.2      35.8      1     .1338688
5            57662.8      72078.5      4         35.8      43.4      3     .4016064
5            57662.8      72078.5      5         43.4      51        5     .669344
5            57662.8      72078.5      6         51        58.6      11    1.4725569
6            72078.5      86494.2      3         28.2      35.8      2     .2677376
6            72078.5      86494.2      4         35.8      43.4      7     .9370817
6            72078.5      86494.2      5         43.4      51        2     .2677376
6            72078.5      86494.2      6         51        58.6      8     1.0709505
7            86494.2      100909.9     4         35.8      43.4      1     .1338688
7            86494.2      100909.9     5         43.4      51        2     .2677376
7            86494.2      100909.9     6         51        58.6      4     .5354752
8            100909.9     115325.6     5         43.4      51        1     .1338688
8            100909.9     115325.6     6         51        58.6      2     .2677376
9            115325.6     129741.3     6         51        58.6      2     .2677376
10           129741.3     144157       6         51        58.6      2     .2677376

3. Go to the Graph Options tab.
4. Select the 3D Graph radio button. This graph is shown below.

Histogram Analysis Example #3 Graph: Three-Dimensional Display

Here, the three-dimensional graph shows a Cross-Tabulation of each bin of “income” and “age.” As an example, look at the tallest bar, the Cross-Tabulation of “income” within the range of 0-14415.7, with the age column in the range of 13-20.6. There are approximately 130 occurrences within that particular Cross-Tabulation.

5. Click on the Graph Options tab as described above to change the axis that the columns are displayed on, and to display a two-dimensional stacked view of the cross-tabulated results.
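A cross-tab histogram can be pictured as counting rows by the pair of bin numbers they fall into. The Python sketch below illustrates the idea using the same ten equal-width bins as Example #3. The function name `bin_number`, the sample rows, and the boundary handling (beginning inclusive, ending exclusive, maximum value placed in the last bin) are assumptions made for illustration; the rows are invented and are not taken from twm_customer, and the product's generated SQL may treat boundaries differently.

```python
from collections import Counter


def bin_number(value, minimum, maximum, nbins):
    """Return a 1-based equal-width bin number for value.

    Assumption: the beginning of a bin is inclusive, the ending is
    exclusive, and the maximum value belongs to the last bin.
    """
    if value >= maximum:
        return nbins
    width = (maximum - minimum) / nbins
    return int((value - minimum) / width) + 1


# Hypothetical (income, age) rows, invented for this illustration.
rows = [(5000, 15), (12000, 19), (30000, 40), (144157, 55)]

# Count rows per (income bin, age bin) pair, as in xbin_income / xbin_age.
counts = Counter(
    (bin_number(income, 0, 144157, 10), bin_number(age, 13, 89, 10))
    for income, age in rows
)

for (xbin_income, xbin_age), xcnt in sorted(counts.items()):
    xpct = 100 * xcnt / len(rows)
    print(xbin_income, xbin_age, xcnt, xpct)
```

Only combinations that actually occur appear in the output, which matches the Example #3 table, where empty income and age combinations are absent.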

Histogram - Example #4

1. Parameterize a Histogram analysis as follows:
• Columns to Analyze — twm_customer.income
• Bin Style — Bins with Boundaries
  ∘ Minimum — 10000
  ∘ Maximum — 100000
2. Run the analysis.
3. When it completes, click on the RESULTS tab.

For this example, the Histogram analysis generated the following results. Again, the SQL is not shown.

Histogram Analysis Example #4 Table

xtbl          xcol    xbin  xbeg       xend       xcnt    xpct
twm_customer  income  0                10000.00   252.00  33.73
twm_customer  income  1     10000.00   19000.00   152.00  20.35
twm_customer  income  2     19000.00   28000.00   114.00  15.26
twm_customer  income  3     28000.00   37000.00   82.00   10.98
twm_customer  income  4     37000.00   46000.00   56.00   7.50
twm_customer  income  5     46000.00   55000.00   25.00   3.35
twm_customer  income  6     55000.00   64000.00   26.00   3.48
twm_customer  income  7     64000.00   73000.00   8.00    1.07
twm_customer  income  8     73000.00   82000.00   11.00   1.47
twm_customer  income  9     82000.00   91000.00   10.00   1.34
twm_customer  income  10    91000.00   100000.00  4.00    0.54
twm_customer  income  11    100000.00             7.00    0.94

Histogram Analysis Example #4 Graph

Although 10 bins were requested, 12 were generated. The two additional bins, bin 0 and bin 11, represent values less than the minimum specified (bin 0) and greater than the maximum specified (bin 11).
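The bin assignment in this example can be sketched in a few lines of Python. The function name `boundary_bin` is hypothetical, and the treatment of values that fall exactly on an interior boundary (beginning inclusive, ending exclusive, with the maximum placed in the last bin) is an assumption; only the handling of values below the minimum and above the maximum is taken from the description above.

```python
def boundary_bin(value, minimum, maximum, nbins):
    """Return the bin number for value under Bins with Boundaries.

    Bin 0 collects values below the minimum and bin nbins + 1 collects
    values above the maximum. Interior boundary handling is an assumption.
    """
    if value < minimum:
        return 0
    if value > maximum:
        return nbins + 1
    if value == maximum:
        return nbins
    width = (maximum - minimum) / nbins
    return int((value - minimum) / width) + 1


# Boundaries from Example #4: minimum 10000, maximum 100000, 10 bins.
for income in (5000, 10000, 19000, 100000, 120000):
    print(income, boundary_bin(income, 10000, 100000, 10))
```

With these boundaries each of the ten requested bins is 9000 wide, which agrees with the xbeg and xend values in the Example #4 table.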

Adaptive Histogram

The Adaptive Histogram analysis supplements the Histogram analysis by offering options to further subdivide the distribution. This analysis determines the frequency percentage above which a value should be treated as a Spike, and a similar percentage above which a bin is “Overpopulated.” A Spike is a specific value of a variable at which a disproportionately large (user defined) number of rows occurs, while an “Overpopulated Bin” is a range of values of a variable that contains a disproportionately large (user defined) number of rows. In this case, the Adaptive Histogram analysis modifies the computed equal sized bins to include a separate bin for each spike value and to further subdivide an overpopulated bin, returning counts and boundaries for each resulting bin.

This subdivision is performed by first dividing by the same number of bins and then merging this with a subdivision in the region of the mean value within the bin. Subdivision near the mean is done by subdividing, by the same number of bins, the region around the mean, -/+ the standard deviation (if outside of the original bin, then from the bin boundary). Subdividing may optionally be done using quantiles, giving approximately equally distributed bins.

Adaptive binning is useful in making an initial investigation of the distribution of a column or columns in a table in order to decide what analysis to perform next. Without adaptive binning, spike values and overpopulated bins can distort the bin counts, because they are not separated or subdivided. However, adaptive binning does not offer many of the specialized options that the normal Histogram analysis does, such as binning by width, quantile, boundary, or over multiple dimensions. Also, it does not allow use of overlay or statistics on other columns.

Beginning range values are generally inclusive and ending range values are generally exclusive. There are some exceptions to this:
• The last ending range value is inclusive.
• The ending range value of a spike is inclusive (because the beginning and ending values of a spike are the same).
• The beginning range value of a bin that follows and adjoins a spike is exclusive (since this value is the same as the spike value).
• The ending range value of a quantile sub-bin is inclusive.

An optional WHERE clause may be used to reduce the range of bins or to reduce the rows to bin in some other way.

The Adaptive Histogram analysis is parameterized by specifying the table and column(s) to analyze, options unique to the Adaptive Histogram analysis, as well as specifying the desired results and SQL or Expert Options.
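The means-based subdivision of an overpopulated bin can be sketched as follows. This is only one reading of the description above, not the product's implementation: the function names are hypothetical, the use of the population standard deviation is an assumption, and edges are rounded to six decimals purely so that coinciding edges merge cleanly.

```python
from statistics import mean, pstdev


def equal_edges(beg, end, nbins):
    """Return nbins + 1 equally spaced edges from beg to end."""
    width = (end - beg) / nbins
    return [round(beg + i * width, 6) for i in range(nbins + 1)]


def subdivide_by_mean(beg, end, values, nbins):
    """Return merged sub-bin edges for an overpopulated bin [beg, end].

    Step 1: divide the bin into nbins equal sub-bins.
    Step 2: divide the region mean -/+ standard deviation, clipped to
            the bin boundaries, into nbins equal sub-bins.
    Step 3: merge both sets of edges.
    Assumption: population standard deviation of the values in the bin.
    """
    inside = [v for v in values if beg <= v <= end]
    centre, spread = mean(inside), pstdev(inside)
    low = max(beg, centre - spread)    # clip to the bin boundary
    high = min(end, centre + spread)   # clip to the bin boundary
    merged = set(equal_edges(beg, end, nbins)) | set(equal_edges(low, high, nbins))
    return sorted(merged)


# Hypothetical values inside a bin that runs from 0 to 10.
edges = subdivide_by_mean(0.0, 10.0, [4.0, 6.0], 2)
print(edges)
```

In this small case the equal subdivision contributes the edges 0, 5 and 10, and the region around the mean contributes 4, 5 and 6, so the merged result has finer sub-bins where the data is concentrated.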

Note: For general information about output, see OUTPUT Tab.

Initiating an Adaptive Histogram Analysis

Use the following procedure to initiate a new Adaptive Histogram analysis.

1. Click on the Add New Analysis icon in the toolbar.

Add New Analysis from toolbar

2. In the resulting dialog box, double-click on the Adaptive Histogram icon.

Add New Analysis: Descriptive Statistics

The Adaptive Histogram dialog box appears, in which you will enter INPUT and OUTPUT options to parameterize the analysis as described in the following sections.
• Adaptive Histogram - INPUT - Data Selection
• Adaptive Histogram - INPUT - Analysis Parameters
• Adaptive Histogram - INPUT - Expert Options
• Adaptive Histogram - OUTPUT

Adaptive Histogram - INPUT - Data Selection

1. On the Adaptive Histogram dialog box, click on INPUT.
2. Click on data selection.

Adaptive Histogram > Input > Data Selection

analysis under the label Volatile Table, representing the output of the analysis that is ordinarily produced by a Select statement. For more information, see INPUT Tab.
• Select Columns From a Single Table
  ∘ Available Databases (or Analyses) — Choose the database (or analysis) from which you will select data tables.
  ∘ Available Tables — Select the table from which you will select columns.
  ∘ Available Columns — Select columns by highlighting and then either dragging and dropping into the Selected Columns window, or click on the arrow button to move highlighted columns into the Selected Columns window. Identity columns should not be selected for analysis (i.e., columns defined with the attribute GENERATE … AS IDENTITY).

Adaptive Histogram - INPUT - Analysis Parameters

1. On the Adaptive Histogram dialog box, click on INPUT.
2. Click on analysis parameters.

Adaptive Histogram > Input > Analysis Parameters

3. On this screen select:
• Adaptive Histogram Options
  ∘ Spike Threshold — A percentage of rows, expressed as an integer (1 to 100), above which an individual value of a variable will be identified as a separate bin. The default percentage is 10 (i.e., 10% of the total number of rows). Values that have this or a larger percentage of rows are identified as a Spike.
  ∘ Subdivision Threshold — A percentage of rows, expressed as an integer (0 to 100), above which a bin will be subdivided into sub-bins. The default percentage is 30 (i.e., 30% of the total number of rows). Bins that have this or a larger percentage of rows are subdivided into sub-bins using an algorithm that uses means and standard deviations.
• Subdivision Method
  ∘ Means — Option to subdivide overpopulated bins using means and standard deviations.
  ∘ Quantiles — Option to subdivide overpopulated bins using quantiles.
• Bin Values for Selected Columns — Each column selected for the Adaptive Histogram analysis appears in this list, along with the default bin values, depending upon the Bin Style selected. Next to Column Name, the following will appear:
  ∘ Bins — If Bins is selected, 10 appears as the number of bins to generate next to the column selected for the Histogram analysis. Click on the Change… button to change the desired number of equal sized data bins. Entry must be an integer greater than 0.
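The two thresholds can be illustrated with a short Python sketch. The function names are hypothetical and the sample values are invented; the defaults of 10 and 30 and the "this or a larger percentage" comparison come from the option descriptions above. Whether the product removes spike rows before testing bins against the Subdivision Threshold is not stated here, so the sketch simply tests the counts it is given.

```python
from collections import Counter


def find_spikes(values, spike_threshold=10):
    """Return the values whose share of rows is at least spike_threshold percent."""
    total = len(values)
    counts = Counter(values)
    return sorted(v for v, c in counts.items() if 100 * c / total >= spike_threshold)


def overpopulated_bins(bin_counts, total_rows, subdivision_threshold=30):
    """Return bin numbers whose share of rows is at least subdivision_threshold percent."""
    return [
        b for b, c in sorted(bin_counts.items())
        if 100 * c / total_rows >= subdivision_threshold
    ]


# Hypothetical column: the value 0 occurs in 3 of 20 rows (15%).
values = [0, 0, 0] + list(range(1, 18))
print(find_spikes(values))

# Counts for the first three "income" bins from Histogram Example #1 (747 rows).
print(overpopulated_bins({1: 332, 2: 191, 3: 108}, 747))
```

With the default thresholds, only the first income bin (about 44% of rows) would be flagged for subdivision in this illustration.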

Adaptive Histogram - INPUT - Expert Options

1. On the Adaptive Histogram dialog box, click on INPUT.
2. Click on expert options.

Adaptive Histogram > Input > Expert Options

This screen has the following option:
• WHERE Clause text — Option to generate a SQL WHERE clause(s) to restrict rows selected for analysis.

Adaptive Histogram - OUTPUT

Before running the analysis, define Output options.

1. On the Adaptive Histogram dialog box, click on OUTPUT.

Adaptive Histogram > Output

2. On this screen, select:
• Use the Teradata Explain Feature… — Option to generate a SQL EXPLAIN SELECT statement, which returns a Teradata Execution Plan to the RESULTS tab.
• Store the tabular output of this analysis in the database — Option to generate a Teradata TABLE or VIEW populated with the results of the analysis. Once enabled, the following three fields must be specified:
  ∘ Database Name — Text box to specify the name of the Teradata database where the resultant Table or View will be created. By default, this is the “Result Database.”
  ∘ Output Name — Text box to specify the name of the Teradata Table or View.
  ∘ Output Type — Pull-down to specify Table or View. Although a Table is always available, Views cannot be specified due to limitations in Teradata.
  ∘ Stored Procedure — The creation of a stored procedure containing the SQL generated for this analysis can be requested by entering the desired name of the stored procedure here. This will result in the creation of a stored procedure in the user's login database in place of the execution of the SQL generated by the analysis. For more information, see Stored Procedure Support (Teradata Database).
  ∘ Procedure Comment — When an optional procedure comment is entered, it is applied to a requested stored procedure with an SQL Comment statement. It can be up to 255 characters in length and contain substitution parameters for the output category (Score, ADS, Stats or Other), project name and/or analysis name (using the tags , and , respectively). The default value of this field may be set on the Defaults tab of the Preferences dialog box, available from the Tools > Preferences menu option.
  ∘ Create output table using the FALLBACK keyword — If a table is selected, it will be built with FALLBACK if this option is selected.
  ∘ Create output table using the MULTISET keyword — If a table is selected, it will be built as a MULTISET table if this option is selected.
  ∘ Advertise Output — The Advertise Output option may be requested when creating a table, view or procedure. This feature “advertises” output by inserting information into one or more of the Advertise Output metadata tables according to the type of analysis and the options selected in the analysis. For more information, see Advertise Output.
  ∘ Advertise Note — An advertise note may be specified if desired when the Advertise Output option is selected, or when the Always Advertise option is selected on the Connection Properties dialog box. It is a free-form text field of up to 30 characters that may be used to categorize or describe the output.
• Generate the SQL for this analysis, but do not execute it — If this option is selected, the analysis will only generate SQL, returning it and terminating immediately.

Running the Adaptive Histogram Analysis

After setting parameters on the INPUT and OUTPUT screens as described above, you are ready to run the analysis.

1. To run the analysis, you can either:
• Click the Run icon on the toolbar, or
• Select Run on the Project menu, or
• Press the F5 key on your keyboard

Results - Adaptive Histogram The results of running the Teradata Warehouse Miner Adaptive Histogram analysis include the generated SQL itself, the results of executing the generated SQL, two dimensional histograms, and, if the Create Table (or View) option is chosen, a Teradata table (or view). All of these results are outlined below. • Adaptive Histogram - RESULTS - Data • Adaptive Histogram - RESULTS - Graph • Adaptive Histogram - RESULTS - SQL

Adaptive Histogram - RESULTS - Data
1. On the Adaptive Histogram dialog box, click on RESULTS.
2. Click on data (note that the RESULTS tab will be grayed-out/disabled until after the analysis is completed).
Adaptive Histogram > Results > Data

Adaptive Histogram Analysis Data

Name   Type          Definition
xtbl   VARCHAR (30)  Table that the bin values are computed against, as specified in the “table” parameter.
xcol   VARCHAR (30)  Column name or optional alias that the bin values were computed against, as specified by the “column” parameter.
xbeg   FLOAT         Value that represents the beginning boundary for the bin of the column specified.
xend   FLOAT         Value that represents the ending boundary for the bin of the column specified.
xtype  BYTEINT       The category of the bin calculated. In the range 0-3, where:
                     • 0 (Spike) — A single value that occurs more than the given % in the Adaptive Binning option.
                     • 1 (Bin) — An equal sized bin computed on the first pass.
                     • 2 (Sub-Bin) — A subdivision by the same number of bins of a computed equal size bin that was larger than the percentage specified in the Adaptive Binning option.
                     • 3 (Mean-Bin) — A subdivision near the mean of a computed equal size bin that was larger than the percentage specified in the Adaptive Binning option. The region around the mean, plus or minus the standard deviation, is subdivided by the same number of bins (if that region extends outside the original bin, the bin boundary is used instead).
xdesc  CHAR (5)      The description of the category of the bin calculated. Valid values are:
                     • spike — Corresponds to xtype 0 above for a spike
                     • bin — Corresponds to xtype 1 above for a bin
                     • --bin — Corresponds to xtype 2 above for a sub-bin
                     • **bin — Corresponds to xtype 3 above for a mean-bin
xcnt   FLOAT         Number of records in the bin.
xpct   FLOAT         Percentage of total records that this bin represents.

Adaptive Histogram - RESULTS - Graph 1. On the Adaptive Histogram dialog box, click on RESULTS. 2. Click on graph (note that the RESULTS tab will be grayed-out/disabled until after the analysis is completed). Adaptive Histogram > Results > Graph

There are several different graphs available for the Adaptive Histogram analysis depending upon the parameterization. These graphs are scrollable, with up to 16 bin counts shown per viewing area, unless the Zoom Out (All Data) option is selected by right clicking on the graph image. Note that there is the potential to generate a NULL bin (i.e., a bin with no occurrences) or spike. In that case, the NULL bin(s) and/or spike(s) are displayed as such. The Adaptive Histogram analysis graph options are discussed in the following sections: • Adaptive Histogram Graph Drill Down Functions • Show Graph • Adaptive Histogram Data Grid • Adaptive Histogram Graph Options

Adaptive Histogram Graph Drill Down Functions See Histogram - RESULTS - Graph.

Show Graph The following right-click options are available: • Maximize, Print, Export and Zoom Out — The standard options Maximize, Print, Export and Zoom Out are described in RESULTS Tab.

Adaptive Histogram Data Grid
A data-grid control is used to select the data to display. Much like a Microsoft Excel pivot-table, the Data Grid has the following properties:
• Select — The data to be graphed can be selected either by clicking in the upper-left square of the Data Grid to select the entire data set, or by holding the left mouse button down, dragging over the desired rows and releasing the mouse button. The highlighted rows will be graphed automatically.
• Sort — To sort the data in the Data Grid, click the right mouse button on any column header. The sort is always done from left to right with respect to the columns of data. Subsequent right mouse clicks toggle the sort from ascending to descending.
• Pivot — Pivoting of data is accomplished by holding the left mouse button down on the column header that you wish to pivot, dragging the column to its new desired position and releasing the mouse button. The pivoted column will appear to the left of the column that it was dragged on top of.
For the Histogram analysis, the column headers can take on the following values:
• Bin Range — The high and low values in each bin.
• Description — This column header contains the words “spike” or “bin” depending upon how the data was binned. See Adaptive Histogram and Tutorial - Adaptive Histogram Analysis.
• Single Count — The number of occurrences within the specified bin.

Adaptive Histogram Graph Options The following options are available, depending upon the parameterization of the Histogram analysis: • Select Bin Column — Pull-down list containing variable names specified by the Columns to Analyze option. • << Back to bins — Option button enabled on the Show Graph tab when a subdivided bin has been drilled into by clicking on it. A subdivided bin is color-coded and can be drilled into by clicking on the distribution bar. See the Tutorial - Adaptive Histogram Analysis.

Adaptive Histogram - RESULTS - SQL 1. On the Adaptive Histogram dialog box, click on RESULTS. 2. Click on SQL (note that the RESULTS tab will be grayed-out/disabled until after the analysis is completed). Adaptive Histogram > Results > SQL

The SQL generated for the analysis is returned on this screen as text, which can be copied by using the Select All and Copy buttons.

Tutorial - Adaptive Histogram Analysis

Adaptive Histogram - Example #1
1. Parameterize an Adaptive Histogram analysis as follows:
• Columns to Analyze — twm_customer.income
• Spike Threshold — 10
• Subdivision Threshold — 30
• Subdivision Method — Means
• Number of Bins — 10
2. Run the analysis.
3. When it completes, click the RESULTS tab.
For this example, the Adaptive Histogram analysis generated the following results. Note that the SQL is not shown for brevity.

Adaptive Histogram Analysis Example #1 Data

xtbl          xcol    xbeg           xend           xtype  xdesc  xcnt  xpct
TWM_CUSTOMER  income  0              0              0      spike  102   13.6546185
TWM_CUSTOMER  income  1039           15350.8        1      bin    243   32.5301205
TWM_CUSTOMER  income  1039           2470.18        2      --bin  10    1.3386881
TWM_CUSTOMER  income  2470.18        3901.36        2      --bin  23    3.0789826
TWM_CUSTOMER  income  3901.36        4956.7726871   2      --bin  15    2.0080321
TWM_CUSTOMER  income  4956.7726871   5696.7572443   3      **bin  4     .5354752
TWM_CUSTOMER  income  5696.7572443   6436.7418016   3      **bin  22    2.9451138
TWM_CUSTOMER  income  6436.7418016   7176.7263588   3      **bin  13    1.7402945
TWM_CUSTOMER  income  7176.7263588   7916.710916    3      **bin  18    2.4096386
TWM_CUSTOMER  income  7916.710916    8656.6954733   3      **bin  18    2.4096386
TWM_CUSTOMER  income  8656.6954733   9396.6800305   3      **bin  16    2.1419009
TWM_CUSTOMER  income  9396.6800305   10136.6645877  3      **bin  13    1.7402945
TWM_CUSTOMER  income  10136.6645877  10876.6491449  3      **bin  10    1.3386881
TWM_CUSTOMER  income  10876.6491449  11616.6337022  3      **bin  12    1.6064257
TWM_CUSTOMER  income  11616.6337022  12356.6182594  3      **bin  17    2.2757697
TWM_CUSTOMER  income  12356.6182594  12488.44       2      --bin  6     .8032129
TWM_CUSTOMER  income  12488.44       13919.62       2      --bin  30    4.0160643
TWM_CUSTOMER  income  13919.62       15350.8        2      --bin  16    2.1419009
TWM_CUSTOMER  income  15350.8        29662.6        1      bin    194   25.9705489
TWM_CUSTOMER  income  29662.6        43974.4        1      bin    104   13.9223561
TWM_CUSTOMER  income  43974.4        58286.2        1      bin    54    7.2289157
TWM_CUSTOMER  income  58286.2        72598          1      bin    18    2.4096386
TWM_CUSTOMER  income  72598          86909.8        1      bin    19    2.5435074
TWM_CUSTOMER  income  86909.8        101221.6       1      bin    7     .9370817
TWM_CUSTOMER  income  101221.6       115533.4       1      bin    2     .2677376
TWM_CUSTOMER  income  115533.4       129845.2       1      bin    2     .2677376
TWM_CUSTOMER  income  129845.2       144157         1      bin    2     .2677376
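The arithmetic behind the example results can be checked with a short Python sketch. It assumes that the first-pass bins are equal-width over the range 1039 to 144157 (the non-spike range shown in the results) and that the total row count of TWM_CUSTOMER is 747, the record count reported in the Overlap example later in this chapter. The function names are hypothetical, and the sketch does not reproduce the mean-bin (**bin) boundaries, which depend on the mean and standard deviation of the data.

```python
def equal_width_edges(low, high, bins):
    """Return bins + 1 boundaries that split [low, high] into equal-width bins."""
    width = (high - low) / bins
    return [round(low + i * width, 4) for i in range(bins + 1)]


def percent_of_total(count, total_rows):
    """Percentage of all rows that fall in one bin (the xpct column)."""
    return round(100.0 * count / total_rows, 7)


TOTAL_ROWS = 747             # assumption: row count of TWM_CUSTOMER
SPIKE_THRESHOLD = 10         # percent, from the example parameters
SUBDIVISION_THRESHOLD = 30   # percent, from the example parameters

# First pass: 10 equal-width bins over the non-spike range of income.
first_pass = equal_width_edges(1039, 144157, 10)
print(first_pass[:3])

# The value 0 occurs 102 times; compare its share with the spike threshold.
spike_pct = percent_of_total(102, TOTAL_ROWS)
print(spike_pct, spike_pct >= SPIKE_THRESHOLD)

# The first bin holds 243 rows; compare its share with the subdivision threshold.
bin_pct = percent_of_total(243, TOTAL_ROWS)
print(bin_pct, bin_pct > SUBDIVISION_THRESHOLD)

# Subdividing that bin by the same number of bins gives the --bin boundaries.
sub_bins = equal_width_edges(first_pass[0], first_pass[1], 10)
print(sub_bins[:3])
```

Under these assumptions the first boundary after 1039 comes out at 15350.8, the spike accounts for 13.6546185 percent of the rows, and the first two sub-bin boundaries are 2470.18 and 3901.36, which agrees with the xbeg, xend and xpct values in the table.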

By default, the Adaptive Histogram Graph page should display a two-dimensional graph showing the distribution of the “income” column, as shown below. Adaptive Histogram Analysis Example #1 Graph

This two-dimensional view shows the distribution of the “income” column (lower graph) as well as the range of values within each bin (upper graph). Also on the upper graph is an indicator that signals either a data spike (red triangle) or a bin that has been subdivided (purple range of values). Note that the value of “0” is a spike, defined as having 10% or more occurrences overall. The second income bin, in the range of 1039-15350.8, has more than 30% of the values and has therefore been subdivided. This subdivision can be displayed on a separate graph by left mouse click followed by selection of “sub-bins” on the blue distribution bar within the subdivided range of 1039-15350.8:

Adaptive Histogram Analysis Example #1 Graph: Subdivision

This histogram shows the distribution of data within the bin range 1039-15350.8, along with a range of values for each subdivision. 4. Click on << back to Histogram to go back to the original distribution and re-enable the Graph Options tab.

Correlation Matrix The Correlation Matrix analysis (not available in the Teradata Profiler product) allows you to build and view a Pearson Product-Moment Correlation matrix. A Pearson Product-Moment Correlation value is calculated for each pairwise combination of columns within the selected table. This is calculated as follows, for each pairwise combination of columns X and Y.

where n is the total number of rows in the calculation. The Correlation Matrix analysis must operate on numeric data. Columns of type DATE will not produce meaningful results. For NULL values, listwise deletion is automatically performed. This means that if the value of any column to be included in the matrix is NULL, the entire row is omitted during matrix calculations. Additionally, the matrix width, or the maximum number of select list items in each SQL statement, is set to 35, while the number of threads or simultaneous connections to the data source is set to 5.

Note that this does not result in a Matrix being saved in metadata for use in Export Matrix, Linear Regression and Factor Analysis. Use the Matrix Function (Matrix) for this. The Matrix function should also be used if you desire a different matrix width or number of connections.
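To make the calculation concrete, here is a minimal Python sketch of a Pearson Product-Moment Correlation matrix with listwise deletion of NULL values. It uses the common computational form r = (nΣXY − ΣXΣY) / sqrt((nΣX² − (ΣX)²)(nΣY² − (ΣY)²)), where n is the number of rows that remain after deletion. This form is an assumption and may be presented differently from the guide's own equation, and the function names are hypothetical. The product itself performs the calculation in generated SQL, not in Python.

```python
import math


def listwise_delete(rows):
    """Drop every row in which any selected column is NULL (None)."""
    return [row for row in rows if all(value is not None for value in row)]


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    sum_yy = sum(y * y for y in ys)
    # Note: a constant column makes the denominator zero and is not handled here.
    denominator = math.sqrt((n * sum_xx - sum_x ** 2) * (n * sum_yy - sum_y ** 2))
    return (n * sum_xy - sum_x * sum_y) / denominator


def correlation_matrix(rows):
    """Full symmetric matrix over all columns, after listwise deletion."""
    complete = listwise_delete(rows)
    columns = list(zip(*complete))
    size = len(columns)
    return [[pearson(columns[i], columns[j]) for j in range(size)]
            for i in range(size)]


# Made-up data: the third row contains a NULL and is omitted entirely.
sample_rows = [(1, 2, 3), (2, 4, 1), (3, 6, None), (4, 8, 0)]
for line in correlation_matrix(sample_rows):
    print(["%.4f" % value for value in line])
```

Because the whole row is dropped when any column is NULL, every cell of the matrix is computed over the same set of rows.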

Initiating a Correlation Matrix Analysis Use the following procedure to initiate a new Correlation Matrix analysis. 1. Click on the Add New Analysis icon in the toolbar. Add New Analysis from toolbar

2. In the resulting Add New Analysis dialog box, under Categories, click on Descriptive Statistics. 3. Under Analyses, double-click on the Correlation Matrix icon. Add New Analysis: Descriptive Statistics

The Correlation Matrix dialog box appears, in which you will enter INPUT options as described in the following section. • Correlation Matrix - INPUT

Correlation Matrix - INPUT 1. On the Correlation Matrix dialog box, click on INPUT.

Correlation Matrix > Input

2. On this screen, select: • Select Input Source — Users who are not using the Teradata Profiler program may select between different sources of input. By selecting the Table option, the user can select from available databases, tables (or views) and columns in the usual manner. By selecting the Analysis option, however, the user can select directly from the output of another analysis of qualifying type in the current project. Analyses that may be selected from directly include all of the Analytic Data Set (ADS) and Reorganization analyses (except Refresh). In place of Available Databases, the user may select from Available Analyses, while Available Tables then contains a list of all the output tables that will eventually be produced by the selected analysis, or it contains a single entry with the name of the analysis under the label Volatile Table, representing the output of the analysis that is ordinarily produced by a Select statement. For more information, see INPUT Tab. • Select Columns From a Single Table ∘ Available Databases (or Analyses) — Choose the database (or analysis) from which you will select data tables. ∘ Available Tables — Select the table from which you will select columns. ∘ Available Columns — Columns available from table selected above. ∘ Selected Columns — Select columns by highlighting and then either dragging and dropping into the Selected Columns window, or click on the arrow button to move highlighted columns into the Selected Columns window.

Running the Correlation Matrix Analysis After setting parameters on the INPUT screen as described above, you are ready to run the analysis. 1. To run the analysis, you can either:

• Click the Run icon on the toolbar, or • Select Run on the Project menu, or • Press the F5 key on your keyboard

Results - Correlation Matrix
1. On the Correlation Matrix dialog box, click on RESULTS (note that the RESULTS tab will be grayed-out/disabled until after the analysis is completed).

Correlation Matrix > Results

Results data, if any, is displayed in a data grid as described in RESULTS Tab. The results for the Correlation Matrix are displayed for each pairwise combination of columns within the selected table. The correlation value ranges from -1 to 1 where 1 is total positive linear correlation, 0 is no linear correlation, and −1 is total negative linear correlation.

Tutorial - Correlation Matrix Analysis

Correlation Matrix - Example #1
1. Parameterize a Correlation Matrix analysis as follows:
• Available Tables — TWM_CUSTOMER_ANALYSIS
• Selected Columns — Select all Numeric Columns
2. Run the analysis.
3. Click on Results to view the Correlation Matrix, as follows. Note that the entire matrix is not shown for brevity.

Correlation Matrix Results Example #1 Data

                 age     avg_cc_bal  avg_cc_tran_amt  avg_cc_tran_cnt  avg_ck_bal  avg_ck_tran_amt  ...
age              1.0000  ?           ?                ?                ?           ?                ...
avg_cc_bal       -0.1452 1.0000      ?                ?                ?           ?                ...
avg_cc_tran_amt  0.0389  -0.2318     1.0000           ?                ?           ?                ...
avg_cc_tran_cnt  0.0996  -0.4919     0.1720           1.0000           ?           ?                ...
avg_ck_bal       0.0709  -0.3428     0.1626           0.0859           1.0000      ?                ...
avg_ck_tran_amt  0.1529  -0.4209     0.2404           0.1429           0.6371      1.0000           ...
...

Overlap Overlap analysis is designed to make it safer to combine information in multiple tables into an analytic data set. It does this by providing counts of overlapping key fields amongst pairs of tables. For example, if an analytic data set is being built to describe customers, it is useful to know whether the customer, account and transaction tables that provide information about customers actually refer to the same customers.

Given a column name and a list of table names, the Overlap analysis determines the number of instances of that column which each pair-wise combination of tables has in common. The same may also be performed for multiple columns taken together. The Overlap analysis is parameterized by specifying the table and column(s) to analyze, options unique to the Overlap analysis, as well as specifying the desired results and SQL or Expert Options.

Note: For general information about output, see OUTPUT Tab.
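The idea behind the Overlap counts can be sketched in Python as follows. This is an illustrative sketch only: the function name `overlap_counts`, the dictionary layout, and the sample key values are hypothetical. It assumes that "instances in common" means distinct key values present in both tables, and the product computes these counts with generated SQL rather than in Python.

```python
from itertools import combinations


def overlap_counts(tables):
    """Count records, unique keys, and pairwise shared keys.

    tables maps a table name to its list of key values. A multi-column key
    can be represented as a tuple of values.
    """
    unique_keys = {name: set(keys) for name, keys in tables.items()}
    result = {"#records": {}, "#uniques": {}}
    for name, keys in tables.items():
        result["#records"][name] = len(keys)
        result["#uniques"][name] = len(unique_keys[name])
    # One count per pair of tables, stored under the first table of the pair.
    for first, second in combinations(tables, 2):
        shared = unique_keys[first] & unique_keys[second]
        result.setdefault(first, {})[second] = len(shared)
    return result


# Made-up key values for three small tables.
example = {
    "customer": [1, 2, 3, 4],
    "checking": [1, 2, 2, 5],
    "savings": [3, 4, 4],
}
for row_name, counts in overlap_counts(example).items():
    print(row_name, counts)
```

A pair whose shared count is lower than the unique count of either table indicates keys that would be lost or left unmatched if the two tables were joined.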

Initiating an Overlap Analysis Use the following procedure to initiate a new Overlap analysis. 1. Click on the Add New Analysis icon in the toolbar. Add New Analysis from toolbar

2. In the resulting Add New Analysis dialog box, double-click on the Overlap icon. Add New Analysis: Descriptive Statistics

The Overlap dialog box appears, in which you will enter INPUT and OUTPUT options to parameterize the analysis as described in the following sections.
• Overlap - INPUT - Data Selection
• Overlap - INPUT - Analysis Parameters
• Overlap - OUTPUT

Overlap - INPUT - Data Selection 1. On the Overlap dialog box, click on INPUT. 2. Click on data selection. Overlap > Input > Data Selection

3. On this screen, select: • Select Input Source — Users who are not using the Teradata Profiler program may select between different sources of input. By selecting the Table option, the user can select from available databases, tables (or views) and columns in the usual manner. By selecting the Analysis option, however, the user can select directly from the output of another analysis of qualifying type in the current project. Analyses that may be selected from directly include all of the Analytic Data Set (ADS) and Reorganization analyses (except Refresh). In place of Available Databases, the user may select from Available Analyses, while Available Tables then contains a list of all the output tables that will eventually be produced by the selected analysis, or it contains a single entry with the name of the analysis under the label Volatile Table, representing the output of the analysis that is ordinarily produced by a Select statement. For more information, see INPUT Tab. • Select Overlap Columns From Multiple Tables ∘ Available Databases (or Analyses) — Choose the database (or analysis) from which you will select data tables. ∘ Available Tables — Select the table from which you will select columns. ∘ Available Columns — Columns available from table selected above. ∘ Selected Columns — Select columns by highlighting and then either dragging and dropping into the Selected Overlap Columns window, or click on the arrow button to move highlighted columns into the Selected Overlap Columns window.

Overlap - INPUT - Analysis Parameters 1. On the Overlap dialog box, click on INPUT. 2. Click on analysis parameters. Overlap > Input > Analysis Parameters

3. On this screen, specify the columns from each table which will be used for the overlap index:
• Available Tables — Select the table from which you will select columns.
• Available Columns — Select the key columns to use for this table for the Overlap analysis. Drag and drop them into the Selected Overlap Index Columns window, or click on the arrow button to move highlighted columns into the Selected Overlap Index Columns window.
• Selected Overlap Index Columns — For each table, a different key column can be used for the Overlap analysis. Although the key columns can have different names, the same number of columns must be specified for each table.

Overlap - OUTPUT Before running the analysis, define Output options. 1. On the Overlap dialog box, click on OUTPUT: Overlap > Output

2. On this screen, select from the following options. If no options are specified, by default a SELECT statement is generated.
• Use the Teradata EXPLAIN Feature… — Option to generate a SQL EXPLAIN SELECT statement, which returns a Teradata Execution Plan to the RESULTS tab.
• Store the tabular output of this analysis in the database — Option to generate a Teradata TABLE or VIEW populated with the results of the analysis. Once enabled, the following three fields must be specified:
∘ Database Name — Text box to specify the name of the Teradata database in which the resultant Table or View will be created. By default, this is the “Result Database.”
∘ Output Name — Text box to specify the name of the Teradata Table or View.
∘ Output Type — Pull-down to specify Table or View.
∘ Stored Procedure — The creation of a stored procedure containing the SQL generated for this analysis can be requested by entering the desired name of the stored procedure here. This creates a stored procedure in the user's login database in place of the execution of the SQL generated by the analysis. For more information, see Stored Procedure Support (Teradata Database).
∘ Procedure Comment — When an optional procedure comment is entered, it is applied to a requested Stored Procedure with an SQL Comment statement. It can be up to 255 characters in length and contain substitution parameters for the output category (Score, ADS, Stats or Other), project name and/or analysis name (using the tags , and , respectively). The default value of this field may be set on the Defaults tab of the Preferences dialog box, available from the Tools > Preferences menu option.
∘ Create output table using the FALLBACK keyword — If a table is selected, it will be built with FALLBACK if this option is selected.
∘ Create output table using the MULTISET keyword — If a table is selected, it will be built as a MULTISET table if this option is selected.
∘ Advertise Output — The Advertise Output option may be requested when creating a table, view or procedure. This feature “advertises” output by inserting information into one or more of the Advertise Output metadata tables according to the type of analysis and the options selected in the analysis. For more information, see Advertise Output.
∘ Advertise Note — An advertise note may be specified when the Advertise Output option is selected, or when the Always Advertise option is selected on the Connection Properties dialog box. It is a free-form text field of up to 30 characters that may be used to categorize or describe the output.
• Generate the SQL for this analysis, but do not execute it — If this option is selected, the analysis will only generate SQL, returning it and terminating immediately.

Running the Overlap Analysis After setting parameters on the INPUT and OUTPUT screens as described above, you are ready to run the analysis. 1. To run the analysis, you can either:

• Click the Run icon on the toolbar, or • Select Run on the Project menu, or • Press the F5 key on your keyboard

Results - Overlap Analysis The results of running the Overlap analysis include the generated SQL itself, the results of executing the generated SQL, and, if the Create Table (or View) option is chosen, a Teradata table (or view). All of these results are outlined below. • Overlap - RESULTS - Data • Overlap - RESULTS - SQL

Overlap - RESULTS - Data 1. On the Overlap dialog box, click on RESULTS. 2. Click on data (note that the RESULTS tab will be grayed-out/disabled until after the analysis is completed). Overlap > Results > Data

Results data, if any, is displayed in a data grid as described in RESULTS Tab. The following is a description of the results returned by the analysis. Note that the number of tables selected affects the structure of the table. If an output table is created, the columns in bold below comprise the Unique Primary Index (UPI) of the output table.

Overlap Results Data

Name    Type      Definition
xidx    BYTEINT   Generated table identification number. Note that the values 1 and 2 are reserved for the number of records and number of unique records in the table.
xtable  CHAR (8)  Table that the overlap key resides in, as specified by the “table” parameter. Note that “#records” and “#uniques” are included as pseudo-table names so that COL will consist of the number of records and number of unique records in the table. The length of this attribute is a minimum of 8, or the maximum length of all table/table alias names.
COL     FLOAT     Count of the overlapping records between each pair-wise combination of tables. There will be a COL column generated for each table defined.

Overlap - RESULTS - SQL 1. On the Overlap dialog box, click on RESULTS. 2. Click on SQL (note that the RESULTS tab will be grayed-out/disabled until after the analysis is completed). Overlap > Results > SQL

On this screen, the generated SQL is returned as text, which can be copied by using the Select All and Copy buttons.

Tutorial - Overlap Analysis

Overlap - Example #1
1. Parameterize an Overlap analysis as follows:
• Selected Overlap Columns
∘ TWM_CUSTOMER.cust_id
∘ TWM_CHECKING_ACCT.cust_id
∘ TWM_CHECKING_TRAN.cust_id
∘ TWM_CREDIT_ACCT.cust_id
∘ TWM_CREDIT_TRAN.cust_id
∘ TWM_SAVINGS_ACCT.cust_id
∘ TWM_SAVINGS_TRAN.cust_id
2. Run the analysis.
3. When it completes, click the RESULTS tab.
For this example, the Overlap analysis generated the following results. Note that the SQL is not shown for brevity.

Overlap Analysis Example #1 Data (Part 1)

xidx  xtable             TWM_CUSTOMER  TWM_CHECKING_ACCT  TWM_CHECKING_TRAN  TWM_CREDIT_ACCT
1     #records           747           520                46204              468
2     #uniques           747           520                520                468
3     TWM_CUSTOMER                     520                520                468
4     TWM_CHECKING_ACCT                                   520                384
5     TWM_CHECKING_TRAN                                                      384
6     TWM_CREDIT_ACCT
7     TWM_CREDIT_TRAN
8     TWM_SAVINGS_ACCT

Overlap Analysis Example #1 Data (Part 2)

xidx  xtable             TWM_CREDIT_TRAN  TWM_SAVINGS_ACCT  TWM_SAVINGS_TRAN
1     #records           20167            421               11189
2     #uniques           457              421               420
3     TWM_CUSTOMER       457              421               420
4     TWM_CHECKING_ACCT  373              315               315
5     TWM_CHECKING_TRAN  373              315               315
6     TWM_CREDIT_ACCT    457              297               297
7     TWM_CREDIT_TRAN                     287               287
8     TWM_SAVINGS_ACCT                                      420

Scatter Plot
Scatter plots are useful to identify relationships and outliers across combinations of two or three variables. These types of plots are used to investigate the possible relationship between two or three variables that relate to the same “event”. Often, inferences can be made depending upon the cluster of points within the scatter plot. For example:
• There may be a positive correlation if the points are clustered in a band running from lower left (i.e., (0,0)) to upper right.
• There may be a negative correlation if the points are clustered in a band running from upper left to lower right.
• If a straight line or curve is drawn through the data so that it “fits” as well as possible, then the more closely the points cluster around this imaginary line of best fit, the stronger the relationship between the two variables.
The Scatter Plot analysis can readily be applied to any type of numeric data, using the Teradata SAMPLE extension to plot a random selection of data points across two or three dimensions. The axes are scaled based on the range of data returned from the sample size specified. A practical limit has been set at 30000 data points. The Scatter Plot analysis is parameterized by specifying the databases, tables and columns to analyze, options unique to the Scatter Plot analysis, as well as specifying the desired results and SQL or Expert Options.

Initiating a Scatter Plot Analysis Use the following procedure to initiate a new Scatter Plot analysis. 1. Click on the Add New Analysis icon in the toolbar. Add New Analysis from toolbar

2. In the resulting Add New Analysis dialog box, double-click on the Scatter Plot icon.

Add New Analysis: Descriptive Statistics

The Scatter Plot dialog box appears, in which you will enter INPUT options to parameterize the analysis as described in the following sections. • Scatter Plot - INPUT - Data Selection • Scatter Plot - INPUT - Join Columns • Scatter Plot - INPUT - Analysis Parameters • Scatter Plot - INPUT - Expert Options

Scatter Plot - INPUT - Data Selection 1. On the Scatter Plot dialog box, click on INPUT. 2. Click on data selection. Scatter Plot > Input > Data Selection

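3. On this screen, select:
• Select Input Source — Users who are not using the Teradata Profiler program may select between different sources of input. By selecting the Table option, the user can select from available databases, tables (or views) and columns in the usual manner. By selecting the Analysis option, however, the user can select directly from the output of another analysis of qualifying type in the current project. Analyses that may be selected from directly include all of the Analytic Data Set (ADS) and Reorganization analyses (except Refresh). In place of Available Databases, the user may select from Available Analyses, while Available Tables then contains a list of all the output tables that will eventually be produced by the selected analysis, or it contains a single entry with the name of the analysis under the label Volatile Table, representing the output of the analysis that is ordinarily produced by a Select statement. For more information, see INPUT Tab.
• Select Columns From Any Number of Tables
∘ Available Databases (or Analyses) — Choose each database (or analysis) from which you will select data tables or views. Selected columns may come from tables or views in different databases or analyses.
∘ Available Tables — Select each table or view from which you will select columns. Selected columns may come from different tables or views.
∘ Available Columns — The columns which the Scatter Plot analysis will be executed against. Any number of columns can be selected, from which two or three may be selected for plotting at a time from within the graphics interface. Select columns by highlighting and then either dragging and dropping into the Selected Columns window, or click on the arrow button to move highlighted columns into the Selected Columns window.
∘ Selected Columns — All the columns (any number) specified for the Scatter Plot analysis.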

Scatter Plot - INPUT - Join Columns 1. On the Scatter Plot dialog box, click on INPUT. 2. Click on join columns. Scatter Plot > Input > Join Columns

This screen contains the join columns for the tables or views selected in the data selection panel. The join columns are used to join together the tables or views containing the requested columns whenever columns from more than one table or view are being analyzed in the graphics interface. The same number of columns must be present for each requested table or view, as they are matched by position, one for one. If possible, the join columns are by default set automatically to the primary index columns of the requested tables. The primary index columns cannot be determined automatically for requested views. • Available Tables — Each table or view that contains the columns which the Scatter Plot analysis will be executed against. The table or view names are qualified (preceded) by the name of the database that contains them. • Available Columns — The columns that are candidates for “join columns” that should be used to join together the tables or views that contain the requested columns. Select columns by highlighting and then either dragging and dropping into the Selected Join Columns window, or click on the arrow button to move highlighted columns into the Selected Join Columns window. • Selected Join Columns — The columns that should be used to join together the tables or views that contain the requested columns for scatter plot analysis. The same number of columns must be present for each requested table or view. When joining tables or views together, the join columns are matched one to one in the order listed.
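The positional matching of join columns can be illustrated with a small Python sketch that builds a join condition from two column lists. The function name `build_join_condition`, the table names, and the column names are hypothetical, and the output is only an assumed shape for such a condition, not the SQL the product generates.

```python
def build_join_condition(left_table, left_columns, right_table, right_columns):
    """Pair join columns one for one, in the order listed, into an ON condition."""
    if len(left_columns) != len(right_columns):
        # The same number of join columns must be present for each table or view.
        raise ValueError("each table or view needs the same number of join columns")
    pairs = [
        f"{left_table}.{left_col} = {right_table}.{right_col}"
        for left_col, right_col in zip(left_columns, right_columns)
    ]
    return " AND ".join(pairs)


# Join columns may have different names; only their position matters.
print(build_join_condition("db1.accounts", ["cust_id", "acct_nbr"],
                           "db2.transactions", ["customer_id", "account_nbr"]))
```

Because the columns are matched by position, reordering the Selected Join Columns list for one table changes which columns are compared.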

Scatter Plot - INPUT - Analysis Parameters
1. On the Scatter Plot dialog box, click on INPUT.
2. Click on analysis parameters.
Scatter Plot > Input > Analysis Parameters

3. On this screen, select:
• Number of Rows to Sample — By default, the Scatter Plot analysis takes a sample of 1000 data points to create the Scatter Plot graphic. This can be changed here, although a practical limit has been set at 30000. A value greater than this limit will result in a warning, which can be overridden.

Scatter Plot - INPUT - Expert Options
1. On the Scatter Plot dialog box, click on INPUT.
2. Click on expert options.
Scatter Plot > Input > Expert Options

This screen has the following option:
• WHERE Clause text — Option to generate one or more SQL WHERE clauses to restrict the rows selected for analysis.

Running the Scatter Plot Analysis
After setting parameters on the INPUT screens as described above, you are ready to run the analysis.
1. To run the analysis, you can either:
• Click the Run icon on the toolbar, or
• Select Run on the Project menu, or
• Press the F5 key on your keyboard

Results - Scatter Plot
The results of running the Teradata Warehouse Miner Scatter Plot analysis include two- and three-dimensional Scatter Plot graphs. These are described in Scatter Plot - RESULTS.

Scatter Plot - RESULTS
1. On the Scatter Plot dialog box, click on RESULTS (note that the RESULTS tab will be grayed-out/disabled until after the analysis is completed).
Scatter Plot > Results

Either two- or three-dimensional scatter plots are available by selecting two or three columns at a time from among the set of columns originally requested for analysis. Variables are selected to be graphed on the graph options tab.
• 2D Graph -- show regression line — If checked and there are only two selected columns, a regression or fit line is displayed on the graph.
• Data Point Size — The size of the data points on the graph can be controlled by selecting small, medium or large as the data point size.
• Available Columns — The columns selected on the data selection panel.
• Selected Columns — The columns selected to be graphed by the Scatter Plot analysis. If two variables are selected here, a two-dimensional view is given. Three variables generate a three-dimensional graph.
2. To view the requested graph, click on the show graph tab. See Scatter Plot - Example #1.

Tutorial - Scatter Plot Analysis

Scatter Plot - Example #1
1. Parameterize a Scatter Plot analysis as follows:
• Columns to Analyze
∘ twm_savings_tran.new_balance
∘ twm_savings_tran.principal_amt
∘ twm_savings_tran.interest_amt
2. Run the analysis.
3. When it completes, click the RESULTS tab.
For this example, the Scatter Plot analysis generates the following results, graphing by default the first two columns to analyze (“new_balance” and “principal_amt”).

Scatter Plot Analysis Example #1 Graph

4. Click on the Graph Options tab.
5. Select “interest_amt” in addition to “new_balance” and “principal_amt.” This creates a three-dimensional view.
Scatter Plot Analysis Example #1 Graph: Three-Dimensional View

Data Explorer
The Data Explorer performs basic statistical analysis on a set of selected tables or on selected columns from selected tables in one or more databases. It stores results from four fundamental types of analysis based on simplified versions of the Descriptive Statistics analyses:
• Values
• Statistics
• Frequency
• Histogram
An answer table is produced for each requested type of analysis. The output includes the requested table names and column names so that results from multiple tables can be included in each answer table. Each analysis can be selected individually, with the following exceptions:
1. If Frequency is selected, Values must be selected.
2. If Histogram is selected, Values must be selected and Statistics must be selected including the Count, Minimum, Maximum, Mean and Standard Deviation.
Data Explorer includes intelligence about which functions should be performed on which columns, with decisions based partly on column type and partly on results obtained. It also includes performance enhancements resulting in minimal passes on the input data. You may also specify a separate SQL Where Clause to apply to each of the input tables selected for analysis.
The Data Explorer normal processing scheme is outlined below. Note that underlined values given in the following topics are threshold values which can be set by the user.
The program first builds up to four output tables, then the steps below are applied to each requested input table, one at a time. If parallel processing is requested, however, the tables are, in a sense, processed n at a time, where n is the number of tables to process in parallel. That is, the program establishes n threads and performs the steps below for each input table in a separate thread until all tables are processed.
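The selection rules and the n-at-a-time processing described above can be sketched as follows. This is only an illustration under stated assumptions: `resolve_analyses` and `process_tables` are hypothetical names, `analyze_table` stands in for whatever per-table work the product performs, and the default of 3 parallel tables is taken from the expert options described later in this chapter.

```python
from concurrent.futures import ThreadPoolExecutor

ANALYSIS_ORDER = ["Values", "Statistics", "Frequency", "Histogram"]


def resolve_analyses(requested):
    """Add the analyses that a requested analysis depends on."""
    selected = set(requested)
    if "Frequency" in selected:
        selected.add("Values")                      # Frequency requires Values
    if "Histogram" in selected:
        selected.update({"Values", "Statistics"})   # Histogram requires both
    return [name for name in ANALYSIS_ORDER if name in selected]


def process_tables(tables, analyze_table, parallel_tables=3):
    """Run analyze_table once per table, at most parallel_tables at a time."""
    with ThreadPoolExecutor(max_workers=parallel_tables) as pool:
        results = list(pool.map(analyze_table, tables))
    return dict(zip(tables, results))
```

A thread pool is one simple way to model "n threads until all tables are processed"; the product's own threading model may differ.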

Note: For general information about output, see OUTPUT Tab.

Data Explorer - Values Analysis
If requested, a Values analysis is performed on every requested column in the set of input tables. The measures provided include the count of non-null values, null values, unique values, blank values (character types only), and zero, positive and negative values (numeric types only).

Note: The analysis parameter “Compute unique values for each column selected” must be selected in order for the unique values measure to be returned with a non-null value.
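As a rough illustration of the measures listed above, the sketch below computes them for a single column held in a Python list, with `None` standing for a null. The function name and dictionary keys are hypothetical, and treating a string that is empty or all spaces as "blank" is an assumption, since this section does not define the term.

```python
def values_profile(values, is_character=False, compute_unique=False):
    """Return Values-style counts for one column given as a list (None = null)."""
    non_null = [v for v in values if v is not None]
    profile = {
        "non_null": len(non_null),
        "null": len(values) - len(non_null),
        # The unique count stays null unless it is explicitly requested.
        "unique": len(set(non_null)) if compute_unique else None,
        "blank": None,
        "zero": None,
        "positive": None,
        "negative": None,
    }
    if is_character:
        # Assumption: "blank" means empty or whitespace-only.
        profile["blank"] = sum(1 for v in non_null if v.strip() == "")
    else:
        profile["zero"] = sum(1 for v in non_null if v == 0)
        profile["positive"] = sum(1 for v in non_null if v > 0)
        profile["negative"] = sum(1 for v in non_null if v < 0)
    return profile
```

The product computes these counts in SQL, combining many columns per statement as described next; the list-based version only shows what each measure counts.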

The general strategy in computing the Values function is to combine as many of the counts and measures for the various columns in as few Select statements as possible. Results from each Select statement are automatically placed in a temporary table; that is, each Select is actually an Insert-Select statement. The data for possibly multiple columns is then reorganized by way of Insert-Select statements that move each variable’s results one at a time into the final answer table. Note that distinct counts are performed separately and updated in the answer table in a final step. If the count of distinct values should fail, such as when a column is too wide, the count for that column can be set to null without affecting other results.

Data Explorer - Statistics Analysis
If requested, a Statistics analysis is performed on every requested column of numeric or date type. You may select the statistics to be calculated, but the minimum, maximum, mean and standard deviation are always calculated if the Histogram analysis is selected. Other measures available include skewness, kurtosis, standard error, coefficient of variance, variance, sum, uncorrected sum of squares and corrected sum of squares. You may choose to compute sample statistics or population measures.
For columns of type date, the minimum, maximum and mean dates are converted to integers that look like dates in a ‘YYYYMMDD’ style, such as 20020823 for 2002-08-23, and other measures such as standard deviation are computed in units of days. Sum and sum of squares measures for dates are in terms of days since 1900 and are presumably not very useful.
The general strategy in computing the Statistics function is to combine as many of the counts and measures for the various columns in as few Select statements as possible. Results from each Select statement are automatically placed in a temporary table; that is, each Select is actually an Insert-Select statement. The data for possibly multiple columns is then reorganized by way of Insert-Select statements that move each variable’s results one at a time into the final answer table.
In computing statistical measures, the Teradata aggregations for minimum, maximum, mean, standard deviation, skew, and kurtosis are used. When population measures are requested rather than sample statistics, formulas expressing population skew and population kurtosis in terms of their sample counterparts are used since these measures are not provided directly in Teradata.
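The date handling described above can be sketched in a few lines. This is a minimal illustration, not the product's SQL: the function name is hypothetical, 1900-01-01 is used as the base date, and rounding the mean to a whole day before formatting it is an assumption.

```python
from datetime import date, timedelta
from statistics import mean, pstdev, stdev

BASE_DATE = date(1900, 1, 1)


def date_statistics(dates, population=False):
    """Summarize dates as YYYYMMDD-style integers, with spread in days."""
    days = [(d - BASE_DATE).days for d in dates]

    def as_yyyymmdd(day_count):
        # Assumption: fractional days (possible for the mean) are rounded.
        d = BASE_DATE + timedelta(days=round(day_count))
        return d.year * 10000 + d.month * 100 + d.day

    spread = pstdev(days) if population else stdev(days)
    return {
        "min": as_yyyymmdd(min(days)),
        "max": as_yyyymmdd(max(days)),
        "mean": as_yyyymmdd(mean(days)),
        "std_days": spread,
    }
```

For example, the three consecutive dates 2002-08-22 through 2002-08-24 give a mean of 20020823 and a sample standard deviation of one day.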

Data Explorer - Frequency Analysis
If a Frequency analysis is requested, and the option to “Compute unique values for each column selected” is also requested along with the Values analysis, a Frequency analysis is performed on every requested numeric and date type column that has no more than a user-specified number of unique values (by default 20), and on every character type column that has no more than a user-specified number of unique values (by default 100). Character type columns with more values can be analyzed with a restricted Frequency analysis, which returns only 'prominent' values that occur in at least a user-determined x % of rows (by default 1%), provided the ratio of unique values to rows is less than 100 - x % (by default 99%). The option to perform a restricted Frequency analysis, as well as the threshold values underlined above, can be set on the expert options tab.
If both restricted and regular frequency processing are to be performed, restricted frequency processing is actually performed first in order to facilitate restart processing, should it become necessary.
Once restricted frequency processing is performed, a strategy for efficiently calculating regular frequencies must be determined. One strategy is simply to calculate each frequency individually (i.e., one at a time). The other strategy is to combine columns into an intermediate table of counts and then select individual column frequencies from the intermediate table. This can enhance performance dramatically in cases where there are not too many combinations of values and where there are enough rows to make the effort worthwhile. Too many combined values can, however, lead to greatly degraded performance.
Two parameters control the calculation strategy for regular frequency processing.
• The minimum number of rows to use the combining strategy with, by default 25000.
• The maximum number of possible combined values in combined columns, by default 10000. In order to use this parameter, the columns to analyze are first placed in ascending order based on the number of values in the columns, as previously calculated in the Values analysis. Then, the number of possible combined values is calculated as the running product of the number of values in successive columns. As many columns are combined as possible without exceeding the parameter for the maximum number of combined values. Any leftover single columns are processed individually.
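The running-product rule above can be sketched as a small planning function. The text does not say whether columns are combined into one group or several, so this sketch assumes successive groups are formed greedily and that any group left with a single column is processed individually. The function name and the column names in the test data are hypothetical.

```python
def plan_frequency_groups(unique_counts, row_count,
                          min_rows=25000, max_combined=10000):
    """Split columns into combined groups and individually processed columns.

    unique_counts maps column name -> number of unique values (from Values).
    Returns (combined_groups, single_columns).
    """
    # Columns are first placed in ascending order of their unique-value counts.
    ordered = sorted(unique_counts, key=unique_counts.get)
    if row_count < min_rows:
        return [], ordered  # too few rows: every column is processed alone

    groups, current, product = [], [], 1
    for column in ordered:
        count = unique_counts[column]
        # Close the current group when adding this column would exceed the cap.
        if current and product * count > max_combined:
            groups.append(current)
            current, product = [], 1
        current.append(column)
        product *= count
    if current:
        groups.append(current)

    combined = [g for g in groups if len(g) > 1]
    singles = [g[0] for g in groups if len(g) == 1]
    return combined, singles
```

With unique counts of 2, 5, 10, 50, 90 and 200, the first four columns multiply to 5000 and are combined; adding the fifth would give 450000, so the last two columns end up processed individually.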

Note: Data is inserted into a volatile table first to avoid lock contention on the final result table when multiple threads are used. Also, the threshold values underlined above can be set on the expert options tab.
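The threshold rules for choosing between regular and restricted frequency processing can be summarized in a short decision function. This is a sketch under stated assumptions: the function name and return labels are hypothetical, the defaults are the ones quoted in this section, and the behavior exactly at the 100 - x % boundary follows the "less than" wording of the Frequency Analysis description.

```python
def frequency_mode(is_character, unique_count, row_count,
                   max_numeric_unique=20, max_char_unique=100,
                   restricted_enabled=True, min_percent=1.0):
    """Return 'regular', 'restricted', 'histogram' or 'none' for one column."""
    limit = max_char_unique if is_character else max_numeric_unique
    if unique_count <= limit:
        return "regular"
    if not is_character:
        # Numeric and date columns over the limit are histogram candidates.
        return "histogram"
    if restricted_enabled and row_count > 0:
        unique_ratio_percent = 100.0 * unique_count / row_count
        if unique_ratio_percent < 100.0 - min_percent:
            return "restricted"
    return "none"
```

For instance, a character column with 500 unique values in 100000 rows has a unique-to-row ratio of 0.5%, well under 99%, so it qualifies for restricted processing.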

Data Explorer - Histogram Analysis
If requested, a Histogram analysis is performed on every numeric and date column for which a Frequency analysis was not performed. The analysis is performed either with a user-specified number of equal-width bins (by default 10) between the minimum and maximum values encountered in the data, or with the same number of quantile bins, depending on option settings.
There are many similarities in Histogram and Frequency processing. Histogram processing may even be thought of as Frequency processing where the data is first ‘bin-coded’. That is, numeric values are first replaced with the histogram bin or bar that they fall into.
As with a Frequency analysis, a strategy for efficiently calculating histograms or bins must be determined. One strategy is simply to calculate each histogram individually (i.e., one column at a time). The other strategy is to combine columns into an intermediate table of counts and then select individual column histogram data from the intermediate table. This can enhance performance dramatically in cases where there are not too many combinations of bins and where there are enough rows to make the effort worthwhile. Too many combined bin values can, however, lead to greatly degraded performance.
The same two parameters used to control the calculation strategy in a Frequency analysis are used here as well.
• The minimum number of rows to use the combining strategy with, by default 25000.
• The maximum number of possible combined values (in this case, bins) in combined columns, by default 10000. It is not necessary to order the columns based on number of values because all columns have the same number of potential bins, by default 10. The number of possible combined values is calculated as the running product of the number of bins in successive columns. As many columns are combined as possible without exceeding the parameter for the maximum number of combined values. Any leftover single columns are processed individually.

Columns of type DATE are handled by subtracting the date ‘1900-01-01’ from the date and from the minimum and maximum values, so that the calculations are based on the number of days since 1900.
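The 'bin-coding' idea, including the date handling just described, can be sketched for the equal-width case. The function name is hypothetical, and numbering bins from 1 with the maximum value placed in the last bin is an assumption; quantile bins are not covered.

```python
from datetime import date


def equal_width_bin(value, minimum, maximum, bins=10):
    """Return the 1-based equal-width bin that value falls into."""
    if isinstance(value, date):
        # Dates are converted to days since 1900-01-01 before binning.
        base = date(1900, 1, 1)
        value, minimum, maximum = (
            (v - base).days for v in (value, minimum, maximum)
        )
    if maximum == minimum:
        return 1  # degenerate range: everything lands in one bin
    index = int((value - minimum) * bins / (maximum - minimum)) + 1
    # The maximum itself would compute to bins + 1, so clamp it.
    return min(index, bins)
```

Once every value is replaced by its bin number, counting rows per bin is the same operation as a Frequency analysis on the bin-coded column.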

Restart Processing
Restart processing is designed to allow users to either restart an interrupted execution of Data Explorer or to add more tables to a completed execution. With certain limitations, it may even be used to add more functions to the analysis.
In order to restart an interrupted execution, simply request the Restart option on the OUTPUT tab, along with the same parameters specified for the interrupted execution. Result tables that are not present are created as needed. Any processing that completed is preserved, and processing that did not complete is performed anew based on the content of the result tables present. That is, for each function, processing is repeated only for those columns without answers. An exception to this is restricted frequency processing, which is repeated in whole for a given table or not at all. See Data Explorer - Frequency Analysis for more information.
In order to add to a previous execution of Data Explorer, the same functions and options should be requested except that new tables (or columns from new tables) should either be added to the request or replace the old tables entirely. If, however, a new function is also requested (as described in Data Explorer - INPUT - Analysis Parameters), the previously selected tables should be included in the request to ensure that the newly requested function is also applied to the originally requested tables. Note that adding new columns to a previously requested table, or replacing old columns, may lead to errors or unexpected results.
As mentioned previously, if a function such as Statistics, Frequency or Histogram was not requested in the previous run, it may be added with any options that are specific to it. For example, if the previous run successfully performed Values processing only, the other functions may be added to the renewed request for Values processing in the continuation run. In this example, the program will determine that the results for Values processing are already present, so they will be preserved and not rebuilt.
The Data Explorer analysis is parameterized by specifying the tables and/or columns to analyze as well as the desired analysis parameters, expert options, expert where options and OUTPUT options.
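The core of the restart rule described in Restart Processing, "processing is repeated only for those columns without answers", amounts to a set difference per function. The sketch below is a hypothetical illustration in which columns are identified by (database, table, column) tuples; it does not model the restricted frequency exception.

```python
def pending_columns(requested, completed):
    """Return the requested columns that have no entry in a result table yet.

    Both arguments hold (database, table, column) tuples; the order of
    the requested columns is preserved.
    """
    done = set(completed)
    return [column for column in requested if column not in done]
```

Applied once per function (Values, Statistics, and so on), this yields the work remaining after an interrupted run.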

Initiating a Data Explorer Analysis
Use the following procedure to initiate a new Data Explorer analysis.
1. Click on the Add New Analysis icon in the toolbar.
Add New Analysis from toolbar

2. In the resulting Add New Analysis dialog box, double-click on the Data Explorer icon.

Add New Analysis: Descriptive Statistics

The Data Explorer dialog box appears, in which you will enter INPUT and OUTPUT options to parameterize the analysis as described in the following sections.
• Data Explorer - INPUT - Data Selection
• Data Explorer - INPUT - Analysis Parameters
• Data Explorer - INPUT - Expert Options
• Data Explorer - INPUT - Expert Where Options
• Data Explorer - OUTPUT

Data Explorer - INPUT - Data Selection
1. On the Data Explorer dialog box, click on INPUT.
2. Click on data selection.
Data Explorer > Input > Data Selection

3. On this screen, select:
• Select Input Source — Select Table, Analysis or MultiTable as the source of input. In the Teradata Profiler product, the Analysis choice does not apply and cannot be selected.
If Table is selected as the input source, the user may choose from available databases, tables and columns as described below.
• Select Columns From Any Number of Tables
∘ Available Databases — Choose the database from which you will select tables.
∘ Available Tables — Select the table from which you will select columns.
∘ Available Columns — Select columns by highlighting and then either dragging and dropping into the Selected Columns window, or click on the arrow button to move highlighted columns into the Selected Columns window.

Note: Identity columns should not be selected for analysis (i.e., columns defined with the attribute GENERATED … AS IDENTITY).

If Analysis is selected as the input source, the user may choose from available analyses, tables and columns as described below. For more information, see Analysis Input Screen.
• Select Columns From Any Number of Tables
∘ Available Analyses — Choose the analysis from which you will select an output table for input. Note that only analyses that create a table or view as output will be available for input to the Data Explorer analysis and not those that merely select data.
∘ Available Tables — Select an output table to be used for input.
∘ Available Columns — Select columns by highlighting and then either dragging and dropping into the Selected Columns window, or click on the arrow button to move highlighted columns into the Selected Columns window.

Note: Identity columns should not be selected for analysis (i.e., columns defined with the attribute GENERATED … AS IDENTITY).

If MultiTable is selected as the input source, the user may choose from available databases and tables as described below.
• Select Tables From Any Number of Databases
∘ Available Databases — Choose the database from which you will select data tables.
∘ Available Tables — Select one or more tables that you wish to analyze in their entirety by highlighting and then either dragging and dropping them into the Selected Tables window, or click on the arrow button to move the highlighted tables into the Selected Tables window.

Note: When the MultiTable input source is selected, any columns in the selected tables that are defined with the attribute GENERATED … AS IDENTITY are automatically excluded from analysis.

Data Explorer - INPUT - Analysis Parameters
1. On the Data Explorer dialog box, click on INPUT.
2. Click on analysis parameters.
Data Explorer > Input > Analysis Parameters

3. On this screen, select:
• Analyses to Perform
∘ Values — Check box to include the Values analysis as part of the Data Explorer analysis execution.
▪ Compute unique values for each column selected — By default, the Data Explorer Values analysis will not calculate the number of unique values within the column specified. Enabling this option adds that calculation to the analysis. Enabling this option is required in order to run a basic “unrestricted” frequency analysis. For more information, see Data Explorer - Frequency Analysis.
∘ Statistics — Check box to include the Statistics analysis as part of the Data Explorer analysis execution. Each of the following basic univariate statistics is individually selectable for the analysis, except if Histogram is selected or Statistics graphs are desired. If either of these is the case, at least the Number of Values, Minimum Value, Maximum Value, Mean Value and Standard Deviation must be selected. By default, these same five calculations are selected when the Statistics option is enabled. The Check All and Clear All buttons can be used to enable or disable all options. See Statistical Analysis for the mathematical equations for each univariate statistic. The following options are available:
▪ Number of Values (required for Statistics graphs) — Include a count of the total number of rows (observations) with values for the specified column.
▪ Minimum Value (required for Statistics graphs) — Include the calculation for the smallest value of the column.
▪ Maximum Value (required for Statistics graphs) — Include the calculation for the largest value of the column.
▪ Mean Value (required for Statistics graphs) — Include the calculation for the average value of the column.
▪ Standard Deviation (required for Statistics graphs) — Include the calculation for the standard deviation of the variable. The standard deviation is a measure of how widely values are dispersed from the average value (the mean). The measure changes depending on whether Population or Sample Statistics are chosen.
▪ Skewness — Include the calculation for skewness of the variable. The skewness of the variable is a characterization of the degree of asymmetry of a distribution around its mean. Positive skewness indicates a distribution with an asymmetric tail extending toward more positive values. Negative skewness indicates a distribution with an asymmetric tail extending toward more negative values.

The equation for Skewness changes depending on whether Population or Sample Statistics are chosen. Note that skewness is undefined when either the standard deviation of the variable is equal to 0, or the number of occurrences is less than 3.
▪ Kurtosis — Include the calculation for the kurtosis of the variable. The kurtosis of the variable is a characterization of the relative peakedness or flatness of a distribution compared with the normal distribution. Positive kurtosis indicates a relatively peaked distribution. Negative kurtosis indicates a relatively flat distribution.

Note: The measures for Kurtosis (and Skewness) that are provided by Teradata Warehouse Miner are also known as the “Fisher g statistics,” related to the “momental skewness and kurtosis” [D’Agostino, Belanger, and D’Agostino Jr.].

The equation for Kurtosis changes depending on whether Population or Sample Statistics are chosen. Note that kurtosis is undefined when either the standard deviation of the variable is equal to 0, or the number of occurrences is less than 4.
▪ Standard Error — Include the calculation for the standard error of the variable. The standard error of the variable is calculated as the standard deviation divided by the square root of the number of occurrences. Different equations for calculating standard error are used depending on whether Population or Sample Statistics are chosen.
▪ Coefficient of Variance — Include the calculation for the coefficient of variance of the variable. The coefficient of variance of the variable is calculated as 100 times the standard deviation divided by the mean. The equation for coefficient of variance changes depending on whether Population or Sample Statistics are chosen. Note that coefficient of variance is undefined when the average of the variable is 0.
▪ Variance — Include the calculation for the variance of the variable. The variance of the variable is calculated as the square of the standard deviation. The equation for Variance changes depending on whether Population or Sample Statistics are chosen.
▪ Sum — Include the calculation for the sum of the variable.
▪ Uncorrected Sums of squares — Include the calculation for the uncorrected sums of squares of the variable.
▪ Corrected Sums of squares — Include the calculation for the corrected sums of squares of the variable.
▪ Statistical Method
- Population — Use population statistics for the statistical calculations.
- Sample — Use sample statistics for those statistical calculations where the calculation changes.
∘ Frequency — Include a Frequency analysis in the execution of the Data Explorer. Note that selecting this option automatically enables a Values analysis. If this option is selected, either the Compute unique values for each column selected option under Values must be selected, or the Restricted Frequency Processing option on the expert options tab must be selected. See the description of the options in Data Explorer - INPUT - Expert Options for an explanation of those parameters that influence the Frequency analysis.
∘ Histogram — Include a quantile or equally distributed Histogram analysis in the execution of the Data Explorer. Note that selecting this option automatically enables a Statistics analysis. See the description of options in Data Explorer - INPUT - Expert Options for an explanation of those parameters that influence the Histogram analysis.
▪ Number of Bins — The total number of quantile or equally distributed bins. Defaults to 10.
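The three derived measures defined in the Statistics options above (standard error, coefficient of variance and variance) follow directly from the count, mean and standard deviation. The sketch below applies those definitions literally; the function name is hypothetical, and whether the standard deviation passed in is the population or sample version is left to the caller.

```python
import math


def derived_statistics(count, mean, std_dev):
    """Compute variance, standard error and coefficient of variance."""
    variance = std_dev ** 2
    # Standard error: standard deviation over the square root of the count.
    std_error = std_dev / math.sqrt(count) if count > 0 else None
    # Coefficient of variance is undefined when the mean is 0.
    coeff_of_variance = 100.0 * std_dev / mean if mean != 0 else None
    return {
        "variance": variance,
        "standard_error": std_error,
        "coefficient_of_variance": coeff_of_variance,
    }
```

With 25 values, a mean of 50 and a standard deviation of 10, this gives a variance of 100, a standard error of 2 and a coefficient of variance of 20.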

Data Explorer - INPUT - Expert Options
1. On the Data Explorer dialog box, click on INPUT.
2. Click on expert options.
Data Explorer > Input > Expert Options

3. On this screen, select:
• Number of Tables to Process in Parallel — The total number of threads to use in order to process tables in parallel. This threading is described above. Defaults to 3 tables.
• Maximum Unique Character Values for Unrestricted Frequency Analysis — The maximum number of unique values for character type columns to perform unrestricted frequency analysis on (by default, the value is 100 unique values). Increasing this will add to the processing time for the Data Explorer Frequency analysis, as a complete frequency will be done against more unique values. If the total number of unique values exceeds the number given here, a restricted frequency is automatically done. See Restricted Frequency Processing below.
• Maximum Unique Numeric/Date Values for Frequency Analysis — The maximum number of unique values for numeric and date type columns to perform a frequency analysis on (by default, the value is 20 unique values). If a numeric or date column has more unique values than this, a histogram is performed instead.
• Minimum Rows Before Frequency/Histogram Combining Attempted — The minimum number of rows to use the combining strategy within frequency and histogram analysis. This strategy is defined for both analyses above. Defaults to 25000 rows. Note that combining columns for those analyses has shown no performance improvement with fewer rows than this.
• Maximum Number of Combined Values for Frequency/Histogram Analysis — The maximum number of possible combined values to allow when combining columns in frequency and histogram analysis. Performance problems and/or SQL errors may result when this is increased. Defaults to 10000 combined values.
• Restricted Frequency Processing (Include Prominent Values) — Check box to enable a restricted frequency analysis. A restricted frequency analysis applies to character columns with more unique values than the threshold given by the Maximum Unique Character Values for Unrestricted Frequency Analysis parameter, and includes in the results only those values that occur in at least a minimum percentage of rows. Defaults to enabled.
∘ Minimum Fraction of Rows Frequency Value Must Occur In — If the ratio of unique values to rows is greater than 100 minus this percentage (by default, 100 - 1 = 99%), the restricted frequency analysis is skipped. If not, the restricted frequency analysis is executed. Defaults to 1 percent.
• Auto-Calculate the Number of Select List Items — When checked, an attempt is made to determine the number of select list items that should be included in the SQL generated for the Values and Statistics analyses. In some cases, however, the SQL for the Statistics analysis may fail due to too many select list items being generated, dependent on the number of input columns and the Basic Statistics Options requested. In this case, the Auto-Calculate option should be unchecked and a value provided in the Maximum Number... text box below it.

Tip: When processing more than 300 input columns with the first five statistics requested, try setting the maximum items to 1000 or less in the text box below.

∘ Maximum Number of Select List Items — An integer greater than 0 representing the maximum number of items that will appear in any given SELECT statement generated by the Data Explorer Values and Statistics analyses.
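To see why a cap on select list items matters, the sketch below splits a list of per-column measures into batches that each fit under a maximum. It assumes one select list item per column and measure, which is a simplification; the SQL actually generated by Data Explorer is not shown in this guide, and the function name and measure labels are hypothetical.

```python
def batch_select_items(columns, measures, max_items):
    """Split column/measure pairs into batches of at most max_items each."""
    if max_items <= 0:
        raise ValueError("max_items must be an integer greater than 0")
    # One item per (column, measure) pair, e.g. "MIN(new_balance)".
    items = [f"{measure}({column})" for column in columns for measure in measures]
    return [items[i:i + max_items] for i in range(0, len(items), max_items)]
```

Under this assumption, 300 columns with five statistics each would produce 1500 items, which a maximum of 1000 would split across two statements.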

Data Explorer - INPUT - Expert Where Options
1. On the Data Explorer dialog box, click on INPUT.
2. Click on expert where options.
Data Explorer > Input > Expert Where Options

One or more optional Where Clauses may be entered on this screen. Each Where Clause entered is applied only to the table currently selected on the screen.
3. On this screen, select:
• Select table to associate WHERE clause with — Select the table to associate an optional SQL Where Clause with.
• Optional WHERE clause text — Enter the optional SQL Where Clause text to be associated with the selected table, restricting the selected rows. Do not include the word “WHERE” at the beginning of the text. It will be added automatically.
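The rule that the word “WHERE” is added automatically can be illustrated with a small helper. This is a hypothetical sketch of the convention, not the product's implementation: the function name is invented, and rejecting text that already starts with WHERE is an assumption about how a caller might guard against the mistake.

```python
def apply_where(select_sql, where_text):
    """Append optional WHERE clause text to a SELECT statement."""
    text = (where_text or "").strip()
    if not text:
        return select_sql  # no restriction entered for this table
    if text.split()[0].upper() == "WHERE":
        # The keyword is supplied automatically, so it must not be repeated.
        raise ValueError("do not include the word WHERE in the clause text")
    return f"{select_sql} WHERE {text}"


# Table and column names are taken from the Scatter Plot example above.
sql = apply_where("SELECT COUNT(*) FROM twm_savings_tran", "new_balance > 0")
```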

Data Explorer - OUTPUT
Before running the analysis, define Output options.
1. On the Data Explorer dialog box, click on OUTPUT:
Data Explorer > Output

2. On this screen, select from the following options.
• Output Database Name — Name of the database to create the Data Explorer result tables in. By default, this is the Result Database specified in the Connection Properties dialog box.
• Values Output Table Name — Text box to specify the name of the Teradata table to persist the Values analysis results in. By default, the Values output table name is TwmExploreValues.
• Statistics Output Table Name — Text box to specify the name of the Teradata table to persist the Statistics analysis results in. By default, the Statistics output table name is TwmExploreStatistics.
• Frequency Output Table Name — Text box to specify the name of the Teradata table to persist the Frequency analysis results in. By default, the Frequency output table name is TwmExploreFrequency.
• Histogram Output Table Name — Text box to specify the name of the Teradata table to persist the Histogram analysis results in. By default, the Histogram output table name is TwmExploreHistogram.

• Create output table using the FALLBACK keyword — If selected, all output tables will be built with FALLBACK.
• Restart — Check box to enable the Restart/Append feature. When enabled, the output tables entered above are analyzed to see which Columns to Analyze still need to be processed. In this case, the answer tables are built only if not present, and, with some exceptions, any entries already present in existing answer tables are skipped during processing.
• Stored Procedure — The creation of a stored procedure containing the SQL generated for this analysis can be requested by entering the desired name of the stored procedure here. This will result in the creation of a stored procedure in the user's login database following the execution of the SQL generated by the analysis. For more information, see Stored Procedure Support (Teradata Database).
• Procedure Comment — When an optional procedure comment is entered, it is applied to a requested Stored Procedure with an SQL Comment statement. It can be up to 255 characters in length and contain substitution parameters for the output category (Score, ADS, Stats or Other), project name and/or analysis name (using the tags , and , respectively). The default value of this field may be set on the Defaults tab of the Preferences dialog, available from the Tools > Preferences menu option.
• Advertise Output — This option “advertises” output by inserting information into one or more of the Advertise Output metadata tables according to the type of analysis and the options selected in the analysis. For more information, see Advertise Output.
• Advertise Note — An advertise note may be specified when the Advertise Output option is selected or when the Always Advertise option is selected on the Connection Properties dialog box. It is a free-form text field of up to 30 characters that may be used to categorize or describe the output.
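The Restart option described above can be pictured as a set difference between the Columns to Analyze and the entries already stored in an output table. The following is a minimal sketch of that idea only, not the product's implementation: the function name, the case-insensitive comparison, and the sample database and column names are assumptions made for illustration, and the "some exceptions" mentioned above are not modeled.

```python
def columns_still_to_process(columns_to_analyze, existing_rows):
    """Return the selected columns that have no entry yet in an output table.

    columns_to_analyze: list of (database, table, column) tuples.
    existing_rows: iterable of (xdb, xtbl, xcol) tuples already in the table.
    """
    # Assumption: object names are compared case-insensitively.
    done = {tuple(part.lower() for part in row) for row in existing_rows}
    return [col for col in columns_to_analyze
            if tuple(part.lower() for part in col) not in done]


# Hypothetical selection and hypothetical existing output rows.
selected = [("demo", "TWM_CUSTOMER", "age"),
            ("demo", "TWM_CUSTOMER", "income"),
            ("demo", "TWM_CUSTOMER", "gender")]
already_there = [("demo", "twm_customer", "age")]

remaining = columns_still_to_process(selected, already_there)
print(remaining)
```

In this sketch only the columns without an existing entry would be processed on a restarted run.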

Running the Data Explorer Analysis After setting parameters on the INPUT and OUTPUT screens as described above, you are ready to run the analysis. 1. To run the analysis, you can either:

• Click the Run icon on the toolbar, or • Select Run on the Project menu, or • Press the F5 key on your keyboard

Results - Data Explorer
The results of running the Teradata Warehouse Miner Data Explorer analysis include the results of executing the generated SQL, a Teradata table, and integrated graphics for each selected analysis. All of these results are outlined below.
• Data Explorer - RESULTS - Data
• Data Explorer - RESULTS - Graphs
• Data Explorer - RESULTS - SQL
When the Data Explorer analysis has completed, click on the RESULTS tab to view these results.

Data Explorer - RESULTS - Data 1. On the Data Explorer dialog box, click on RESULTS.

2. Click on data (note that the RESULTS tab will be grayed-out/disabled until after the analysis is completed).
Data Explorer > Results > Data

Results data, if any, is displayed in a data grid as described in RESULTS Tab. The Data Explorer analysis always creates a table for each descriptive statistical analysis specified. The structure of these output tables is specified below. Those columns in bold below comprise the Unique Primary Index (UPI).

Data Explorer Values Output Table

Name | Type | Definition
xdb | VARCHAR(30) | Database that contains the table that contains the column to be analyzed.
xtbl | VARCHAR(30) | Table that contains the column to be analyzed.
xcol | VARCHAR(30) | Column (variable) to be analyzed.
xtype | VARCHAR(30) | Data type of this variable.
xcnt | FLOAT | Total number of occurrences of this variable.
xnull | FLOAT | Total number of rows where this variable takes on a null value.
xunique | FLOAT | Total number of rows where this variable takes on a unique value.
xblank | FLOAT | Total number of rows where this variable is blank.
xzero | FLOAT | Total number of rows where this variable is equal to 0.
xpos | FLOAT | Total number of rows where this variable has a positive value.
xneg | FLOAT | Total number of rows where this variable has a negative value.
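To make the count columns of the Values output table concrete, the following sketch computes the same kinds of counts for a single in-memory column. It is an illustration under stated assumptions, not the SQL that the analysis generates: the function name is hypothetical, xcnt is taken to be the total row count, xunique the number of distinct non-null values, and a blank value is taken to be a string that is empty or contains only spaces.

```python
def values_profile(values):
    """Compute Values-style counts for one column held as a Python list."""
    non_null = [v for v in values if v is not None]
    numeric = [v for v in non_null
               if isinstance(v, (int, float)) and not isinstance(v, bool)]
    text = [v for v in non_null if isinstance(v, str)]
    return {
        "xcnt": len(values),                          # assumed: all rows
        "xnull": len(values) - len(non_null),         # rows that are NULL
        "xunique": len(set(non_null)),                # assumed: distinct non-null values
        "xblank": sum(1 for v in text if v.strip() == ""),
        "xzero": sum(1 for v in numeric if v == 0),
        "xpos": sum(1 for v in numeric if v > 0),
        "xneg": sum(1 for v in numeric if v < 0),
    }


numeric_profile = values_profile([5, 0, -2, None, 5])
text_profile = values_profile(["a", " ", None, ""])
print(numeric_profile)
print(text_profile)
```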

Data Explorer Statistics Output Table

Name | Type | Definition
xdb | VARCHAR(30) | Database that contains the table that contains the column to be analyzed.
xtbl | VARCHAR(30) | Table that contains the column to be analyzed.
xcol | VARCHAR(30) | Column (variable) to be analyzed.
xcnt | FLOAT | Total number of occurrences of this variable.
xmin | FLOAT | Minimum value of the variable. Created only if the Minimum Value option is selected.
xmax | FLOAT | Maximum value of the variable. Created only if the Maximum Value option is selected.
xmean | FLOAT | Arithmetic mean of the variable. Created only if the Mean Value option is selected.
xstd | FLOAT | Standard deviation of the variable. Created only if the Standard Deviation option is selected.
xskew | FLOAT | Skewness of the variable. Created only if the Skewness option is selected.
xkurt | FLOAT | Kurtosis of the variable. Created only if the Kurtosis option is selected.
xste | FLOAT | Standard error of the variable. Created only if the Standard Error option is selected.
xcv | FLOAT | Coefficient of variance of the variable. Created only if the Coefficient of Variance option is selected.
xvar | FLOAT | Variance of the variable. Created only if the Variance option is selected.
xsum | FLOAT | Sum of the variable. Created only if the Sum option is selected.
xuss | FLOAT | Uncorrected sums of squares of the variable. Created only if the Uncorrected Sums of squares option is selected.
xcss | FLOAT | Corrected sums of squares of the variable. Created only if the Corrected Sums of squares option is selected.
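Several of the Statistics columns are related by simple formulas, which the following sketch illustrates for a small in-memory list. This guide does not state whether sample or population formulas are used, or how the coefficient of variance is scaled, so the choices below (sample variance with n - 1, coefficient expressed as a percentage of the mean) are assumptions, as is the function name. Skewness and kurtosis are omitted because their exact definitions are not given here.

```python
import math


def describe(values):
    """Compute Statistics-style measures; needs at least 2 values and a nonzero mean."""
    n = len(values)
    total = sum(values)
    mean = total / n
    uss = sum(v * v for v in values)             # uncorrected sum of squares
    css = sum((v - mean) ** 2 for v in values)   # corrected sum of squares
    var = css / (n - 1)                          # assumed: sample variance
    std = math.sqrt(var)
    return {
        "xcnt": n,
        "xmin": min(values),
        "xmax": max(values),
        "xmean": mean,
        "xstd": std,
        "xste": std / math.sqrt(n),              # standard error of the mean
        "xcv": 100.0 * std / mean,               # assumed: percent of the mean
        "xvar": var,
        "xsum": total,
        "xuss": uss,
        "xcss": css,
    }


stats = describe([1.0, 2.0, 3.0, 4.0])
print(stats)
```

The corrected and uncorrected sums of squares are linked by css = uss - n * mean², which is a convenient check on any implementation.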

Data Explorer Frequency Output Table

Data Explorer Histogram Output Table

Name | Type | Definition
xdb | VARCHAR(30) | Database that contains the table that contains the column to be analyzed.
xtbl | VARCHAR(30) | Table that contains the column to be analyzed.
xcol | VARCHAR(30) | Column (variable) to be analyzed.
xbin | INTEGER | Integer representing the bin number.
xbeg | FLOAT | Value that represents the beginning boundary for the bin of the column specified.
xend | FLOAT | Value that represents the ending boundary for the bin of the column specified.
xcnt | FLOAT | Number of records in the bin.
xpct | FLOAT | Percentage of total records that this bin represents.
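The Histogram output columns can be illustrated with a simple binning routine. The sketch below is one way to produce rows of this shape and is not the SQL generated by the analysis: it assumes equal-width bins, bin numbers starting at 1, the maximum value counted in the last bin, and xpct expressed on a 0 to 100 scale. The function name is hypothetical.

```python
def equal_width_histogram(values, bins):
    """Return one dict per bin with Histogram-style fields."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # Values equal to the maximum fall into the last bin.
        idx = int((v - lo) / width) if width > 0 else 0
        counts[min(idx, bins - 1)] += 1
    total = len(values)
    return [
        {
            "xbin": i + 1,                    # assumed: bins numbered from 1
            "xbeg": lo + i * width,
            "xend": lo + (i + 1) * width,
            "xcnt": count,
            "xpct": 100.0 * count / total,    # assumed: 0 to 100 scale
        }
        for i, count in enumerate(counts)
    ]


rows = equal_width_histogram([0, 1, 2, 3, 4, 5, 6, 7, 8, 10], bins=2)
for row in rows:
    print(row)
```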

Data Explorer - RESULTS - Graphs 1. On the Data Explorer dialog box, click on RESULTS. 2. Click on graphs (note that the RESULTS tab will be grayed-out/disabled until after the analysis is completed). Data Explorer > Results > Graphs

A thumbnail for the Data Explorer Graphics appears on the Graphs tab when the analysis completes. 3. Click on the analysis to launch the Data Explorer Graphics Environment, as shown below.

Data Explorer Graphics

Options to this environment are discussed in the following sections: • Data Explorer Graph Drill Down Functions • Available Tables and Columns • Selected Tables and Columns • Update Graphs • Values

Data Explorer Graph Drill Down Functions
The Data Explorer offers graph drill down from the Zoom Views. The drill down tasks in the Data Explorer are identical to those in the individual Descriptive Stats functions, with the exception of selecting from a data selector grid. Refer to the Values, Frequency, Histogram, and Statistics graph drill down descriptions for details on each of the graph types. These can be found in the following sections:
• Values Graph Drill Down Functions (Values)
• Frequency Graph Drill Down Functions (Frequency)
• Histogram Graph Drill Down Functions (Histogram)
• Statistics Graph Drill Down Functions (Statistics)

Available Tables and Columns
This window lists all of the tables/columns that were processed by the Data Explorer analysis but are not currently selected for the Graphics Environment. Entire tables and/or specific columns within a table can be selected by highlighting them and clicking on the right arrow button, or by dragging and dropping them into the Selected Tables and Columns window.

Selected Tables and Columns
This window lists all of the tables/columns that were processed by the Data Explorer analysis and are currently selected for the Graphics Environment. Entire tables and/or specific columns within a table can be deselected by highlighting them and clicking on the left arrow button, or by dragging and dropping them into the Available Tables and Columns window. The double left arrow button can be used to clear all Selected Tables and Columns, while the up and down arrows can change the order in which specific tables and/or columns are displayed.

Update Graphs
Once all desired data is specified within the Selected Tables and Columns window, clicking on the Update Graphs button loads the data into the graphics environment.

Values
The Data Explorer Values analysis is executed against all columns within the Selected Tables and Columns window. For character data, NULL, Unique and Blank values are plotted. For numeric and date data, NULL, Unique, Zero, Positive and Negative values are plotted. By default, the results of the analysis are plotted in bar charts and displayed as thumbnails. The thumbnails can be scrolled by clicking on the double right (>>) or double left (<<) arrow buttons, or displayed on a full screen by clicking on the Full Page >> button. The Full Page mode has additional double right (>>) or double left (<<) arrow buttons for scrolling through the values data. Each individual thumbnail can be maximized for viewing details by double clicking on it. Right-clicking on any individual thumbnail produces a dialog with the following options:
• Maximize, Print and Export — The standard options Maximize, Print and Export are described in RESULTS Tab.
• Hide Remaining Rows — When Absolute Counts is selected, an option to Hide Remaining Rows may be selected to not display the top (gray) portion of each bar on the graph. (By default, the inverse of the portions of the Values analysis are shown in gray. For example, a table with 1000 rows may have 150 NULL values. The box plot for NULLs would then show 1000 rows, 850 in gray color and 150 in red.) When Hide Remaining Rows is in effect, the option to Show Remaining Rows may be selected to return to the original view.
• Relative Counts (Circle Graph) — This option changes the values thumbnails and/or graphics to a different display, showing the relative counts within a circular graph. Toggling between these two views is illustrated below.

Absolute Counts (Circle Graph)

Relative Counts (Circle Graph)

Each category (e.g., Zero) sums to 1 with its inverse (e.g., Not Zero). The following colors are used to represent each category:
∘ NULL — Red
∘ Unique — Green
∘ Zero — Orange
∘ Positive — Teal
∘ Negative — Purple
∘ Blank — Blue
The full page thumbnail view provides an efficient mechanism for quickly identifying data quality problems across many tables.
• Frequency — The Data Explorer Frequency analysis is executed against columns within the Selected Tables and Columns window that are character data, or numeric data within the specified thresholds. By default, the results of the analysis are plotted in bar charts and displayed as thumbnails. The thumbnails can be scrolled by clicking on the double right (>>) or double left (<<) arrow buttons, or displayed on a full screen by clicking on the Full Page >> button. The Full Page mode has additional double right (>>) or double left (<<) arrow buttons for scrolling through the values data.

Each individual thumbnail can be maximized for viewing details by double clicking on it.
• Histogram — The Data Explorer Histogram analysis is executed against columns within the Selected Tables and Columns window that are numeric and/or dates outside the specified thresholds. By default, the results of the analysis are plotted in histograms and displayed as thumbnails. The thumbnails have the same properties and modes as the Frequency graphs.
• Statistics — The Data Explorer Statistics analysis is executed against columns within the Selected Tables and Columns window that are numeric and/or dates. By default, the results of the analysis are plotted in box and whisker plots and displayed as thumbnails. The thumbnails have the same properties and modes as the Frequency and Histogram graphs.
Note that the Data Explorer Graphics Environment is always modal; it must be exited before use of Teradata Warehouse Miner can continue.
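The relationship between the absolute and relative counts described for the Values graphs can be written out as a small calculation. The sketch below reuses the example of a 1000-row table with 150 NULL values; the function name and the shape of the returned dictionary are assumptions made for illustration.

```python
def category_shares(total_rows, category_counts):
    """For each category, return its count, the remaining rows, and both shares."""
    shares = {}
    for name, count in category_counts.items():
        relative = count / total_rows
        shares[name] = {
            "count": count,
            "remaining": total_rows - count,   # the gray part of an absolute bar
            "relative": relative,              # share shown in the circle graph
            "inverse": 1 - relative,           # category and inverse sum to 1
        }
    return shares


shares = category_shares(1000, {"NULL": 150})
print(shares["NULL"])
```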

Data Explorer - RESULTS - SQL 1. On the Data Explorer dialog box, click on RESULTS. 2. Click on SQL (note that the RESULTS tab will be grayed-out/disabled until after the analysis is completed). Data Explorer > Results > SQL

On this screen, the SQL generated for the analysis is returned as text, which can be copied by using the Select All and Copy buttons.

Tutorial - Data Explorer Analysis

Data Explorer - Example #1
1. Parameterize a Data Explorer analysis as follows:
• Input Source — MultiTable
• Available Databases — The database where the demonstration data was installed.
• Available Tables
∘ TWM_CHECKING_ACCT
∘ TWM_CREDIT_ACCT
∘ TWM_CUSTOMER
∘ TWM_SAVINGS_ACCT
• Analyses to Perform
∘ Values — Enabled
∘ Compute Unique Values … — Enabled
∘ Statistics — Enabled
∘ Frequency — Enabled
∘ Histogram — Enabled
• Output
∘ Values Analysis Output Table — twm_values
∘ Statistics Analysis Output Table — twm_stats
∘ Frequency Analysis Output Table — twm_freq
∘ Histogram Analysis Output Table — twm_hist
2. Run the analysis.
3. When it completes, click on the RESULTS tab.
• Data — The data generated by the Data Explorer analysis is too extensive to show here, but four output tables are generated.
∘ twm_values (42 rows)
∘ twm_stats (26 rows)
∘ twm_freq (198 rows)
∘ twm_hist (193 rows)
• Graph — The following is a snapshot of the graph object created for this example.
Example Data Explorer Graph

• SQL — The generated SQL is omitted for brevity.

Text Field Analyzer
When dealing with character data, it is sometimes helpful to examine the data and determine what data type it could actually be stored as within the database. The Text Field Analyzer analysis can analyze character data and help distinguish whether the field is a numeric type, a date, a time, a timestamp, or character data. Text field analysis can readily be applied to any type of character data. Non-character data types go unprocessed and are passed along to the output just as they are defined in the input table. Given a table name and the name of a column, the Text Field Analyzer analysis performs a series of tests to determine what the correct underlying type should be.
1. A MIN and MAX test, in which the MIN and MAX values of a column are retrieved from the database and tested to determine what type the values are.
2. A sample test, which retrieves a small sample of data for each column and again assesses what type the values should be.
3. An extended test for fields that have already been determined to be numeric, which tries to classify them in a more specific category if possible. For instance, a field that is considered a FLOAT type after the first two tests might really be a DECIMAL type with 2 decimal places. A date type is validated to make sure all values in that column are truly dates.
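The three tests above can be sketched in a few lines to show how they narrow down a type. This is an illustrative approximation only, not the product's algorithm: the function names, the ISO-style date and time formats, the sample size, and the rule that any exponent notation means FLOAT are all assumptions, and this sketch additionally validates every value of a numeric column before refining it.

```python
from datetime import datetime
from decimal import Decimal, InvalidOperation

# Assumed text formats for dates, timestamps and times.
FORMATS = [("DATE", "%Y-%m-%d"),
           ("TIMESTAMP", "%Y-%m-%d %H:%M:%S"),
           ("TIME", "%H:%M:%S")]


def classify(value):
    """Classify one string as NUMERIC, DATE, TIMESTAMP, TIME or CHAR."""
    text = value.strip()
    try:
        if Decimal(text).is_finite():
            return "NUMERIC"
    except InvalidOperation:
        pass
    for kind, fmt in FORMATS:
        try:
            datetime.strptime(text, fmt)
            return kind
        except ValueError:
            continue
    return "CHAR"


def refine_numeric(values):
    """Extended numeric test: INTEGER, DECIMAL with a scale, or FLOAT."""
    scale = 0
    for value in values:
        text = value.strip().lower()
        if "e" in text:                       # assumed: exponent notation means FLOAT
            return "FLOAT"
        scale = max(scale, -Decimal(text).as_tuple().exponent)
    return "INTEGER" if scale == 0 else f"DECIMAL with {scale} decimal places"


def analyze_column(values, sample_size=3):
    present = [v for v in values if v is not None and v.strip()]
    if not present:
        return "CHAR"
    # Test 1: the MIN and MAX values must agree on a type.
    kinds = {classify(min(present)), classify(max(present))}
    if len(kinds) != 1:
        return "CHAR"
    kind = kinds.pop()
    # Test 2: a small sample must agree as well.
    if any(classify(v) != kind for v in present[:sample_size]):
        return "CHAR"
    # Test 3: extended pass over all values for numeric and date fields.
    if kind in ("NUMERIC", "DATE"):
        if any(classify(v) != kind for v in present):
            return "CHAR"
        if kind == "NUMERIC":
            return refine_numeric(present)
    return kind


print(analyze_column(["12.50", "3", "7.1"]))
print(analyze_column(["2017-07-01", "2017-02-30"]))
```

Skipping the third test, as the extended analysis option allows, would save the full pass over the data at the cost of a less specific result.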

Note: For general information about output, see OUTPUT Tab.

Initiating a Text Field Analyzer Analysis Use the following procedure to initiate a new Text Field Analyzer analysis in Teradata Warehouse Miner. 1. Click on the Add New Analysis icon in the toolbar. Add New Analysis from Toolbar

2. In the resulting Add New Analysis dialog box, double-click on the Text Field Analyzer icon.

Add New Analysis: Descriptive Statistics

The Text Field Analyzer dialog appears, in which you will enter INPUT and OUTPUT options to parameterize the analysis as described in the following sections. • Text Field Analyzer - INPUT - Data Selection • Text Field Analyzer - INPUT - Analysis Parameters • Text Field Analyzer - OUTPUT

Text Field Analyzer - INPUT - Data Selection 1. On the Text Field Analyzer dialog box, click on INPUT. 2. Click on data selection. Text Field Analyzer > INPUT > Data Selection

The resulting screen has the following options available: • Available Databases — Choose the database from which you will select data tables. • Available Tables — Select the table from which you will select columns. • Available Columns — Select columns by highlighting and then either dragging and dropping into the Selected Columns window, or click on the arrow button to move highlighted columns into the Selected Columns window.

Text Field Analyzer - INPUT - Analysis Parameters 1. On the Text Field Analyzer dialog box, click on INPUT.

2. Click on analysis parameters.
Text Field Analyzer > INPUT > Analysis Parameters

The resulting screen has the following options available: • Perform extended numeric analysis — By default, the Text Field Analyzer analysis tests each numeric field by doing a test to determine whether it is Integer, Float, or Decimal. A date field undergoes a test to validate that all values are valid dates. Unchecking this option removes the test and saves an extra pass of the data for each numeric or date field. • Perform extended analysis of Unicode character fields — If Unicode columns are selected as input, they undergo another pass of the data which determines how many values of that column can be represented as regular Latin characters.

Text Field Analyzer - OUTPUT Before executing the analysis, define Output options. 1. On the Text Field Analyzer dialog box, click on OUTPUT. Text Field Analyzer > OUTPUT

The resulting screen has the following options: • Storage ∘ Output Database Name — Text box to specify the name of the Teradata database which will be specified in the SQL that is generated. By default, this is the “Result Database.” ∘ Output Table Name — Text box to specify the name of the Teradata Table that will be added to the SQL.

Running the Text Field Analyzer Analysis After setting parameters on the INPUT and OUTPUT screens as described above, you are ready to run the analysis. 1. To run the analysis, you can either:

• Click the Run icon on the toolbar, or • Select Run on the Project menu, or

• Press the F5 key on your keyboard

Results - Text Field Analyzer The results of running the Text Field Analyzer analysis include the generated SQL itself and a generated matrix showing each column and what the chosen data type is upon completing each test. The NA value inside the matrix means that it is Not Applicable such as for the numeric test when the type is not numeric. The results are discussed in the following sections. • Text Field Analyzer - RESULTS - Reports • Text Field Analyzer - RESULTS - SQL

Note: The RESULTS tab will be grayed-out (disabled) until you have run the analysis.

Text Field Analyzer - RESULTS - Reports 1. On the Text Field Analyzer dialog box, click on RESULTS. 2. Click on reports. Text Field Analyzer > RESULTS > Results

The results of running the Text Field Analyzer analysis include a report on column type information and a report for Unicode Analysis. These are outlined below. • Column Types Matrix — Each column has a row in the matrix showing what the type of the column was throughout its transformation. Non-character data types are at the end of the matrix. • Unicode Analysis — When the Perform extended analysis of Unicode character fields option is selected, a matrix showing all Unicode character columns will be produced. A row in the matrix will contain the column name, how many of its data values can be translated correctly to the Latin character set, and the total number of values that column contains.
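The Unicode Analysis report can be approximated by checking, for each value, whether it survives conversion to a Latin character set. The sketch below uses Python's "latin-1" codec as a stand-in for the database's Latin character set, which is an assumption (the two are not guaranteed to cover the same characters); the function names and sample data are hypothetical.

```python
def is_latin(text):
    """Return True if the text can be encoded with the stand-in Latin codec."""
    try:
        text.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


def unicode_report(columns):
    """Return (column name, translatable values, total values) per column."""
    report = []
    for name, values in columns.items():
        present = [v for v in values if v is not None]
        translatable = sum(1 for v in present if is_latin(v))
        report.append((name, translatable, len(present)))
    return report


report = unicode_report({"city": ["Zürich", "東京", "Paris", None]})
print(report)
```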

Text Field Analyzer - RESULTS - SQL 1. On the Text Field Analyzer dialog box, click on RESULTS. 2. Click on SQL. Text Field Analyzer > RESULTS > SQL

On this screen, the generated CREATE TABLE SQL is returned as text, which can be copied by using the Select All and Copy buttons.

CHAPTER 5 Free Form SQL

Overview The following sections discuss using Free Form SQL with Teradata Warehouse Miner. Free Form SQL provides an interface for directly entering and running SQL commands in Teradata Warehouse Miner.

Initiating the Free Form SQL Analysis This analysis provides an interface for directly entering and running SQL commands, with several facilities provided to allow it to behave like other analyses in various ways. For example, when not using the Teradata Profiler product, a Free Form SQL analysis can be refreshed or published, either as an analysis that builds an analytic data set or as one that builds a score table, with the possibility even of creating a stored procedure from the SQL contained in it. For more information, see Procedure Options for Refresh (Teradata Database). Also, through the use of special substitution parameters and a data selector, other parts of the application can be given insight into the tables and columns used by or created by the otherwise “free-form” analysis. This insight allows such features as column limiting during refresh and referencing the Free Form SQL analysis for input to be provided (again, when not using the Teradata Profiler product). The SQL specified in a Free Form SQL analysis can also be made to appear as a derived table, Table Function or Table Operator in the FROM clause of certain types of analysis by checking the Generate Sql Only option, and possibly the Table Function or Operator option, on the literal parameters tab of the Free Form SQL analysis. To do this, it is also necessary to reference the Free Form SQL analysis for input in an ADS or Reorganization analysis other than Variable Transformation. If a Free Form SQL analysis is used as an Analysis input source in this way, the Free Form SQL analysis must not create an output table or view, but rather, either select data or represent a Table Function or Operator. As when creating an output table or view, it should generally not be necessary however to use the data selection tab to determine the available columns created by a Free Form SQL analysis. 
This is because the output columns and types are automatically determined, if possible, by a special technique involving the Teradata Data Dictionary. Use the following procedure to initiate Free Form SQL in Teradata Warehouse Miner. 1. Click on the Add New Analysis icon in the toolbar. Add New Analysis from toolbar

2. In the resulting Add New Analysis dialog box, click on Miscellaneous under Categories.
3. Under Analyses double-click on Free Form SQL.
Add New Analysis: Miscellaneous

The Free Form SQL dialog box appears, in which you will enter INPUT options to parameterize the analysis, described in the following sections. • Free Form SQL - INPUT - Data Selection • Free Form SQL - INPUT - Analysis Parameters • Free Form SQL - INPUT - Literal Parameters

Free Form SQL - INPUT - Analysis Parameters 1. On the Free Form SQL dialog, click on INPUT. 2. Click on analysis parameters. Free Form SQL Input

This screen contains a text window in which one or more SQL statements and optional comments can be typed and/or pasted. When satisfied that the SQL statement or statements are complete, the analysis can be executed, either in whole or in part. To execute just a selected portion of the text, simply highlight the statements to execute as shown in the example above and run the analysis.
3. If desired, comments can be added to the Free Form SQL text in one of two ways:
• Multiple line comments can be entered beginning with a forward slash and asterisk “/*” and terminated with an asterisk and forward slash “*/”, but not with the Execute as a single statement option. (If a comment of this type appears after all SQL statements in the text, the option to Submit SQL without comments should be used.)
• Single line comments can be entered beginning with two consecutive dash characters “--” and terminated with a carriage return.
The text window on this screen is resizable and supports and retains formatted text, either through the use of the limited formatting options offered via right-click menu options, or through pasting formatted text from a compatible rich-text format source, such as the Teradata SQL Assistant utility. The following options are available as right-click menu options.
• Standard text functions are provided, including Undo, Redo, Cut, Copy, Paste, Delete, Find, Replace and Select All.
• The Find/Replace function provides options for Whole word, Case specific, Search from Beginning of text or Current position, as well as Replace and Replace All buttons.
• Limited support for text formatting is also provided, including Font, Bold, Italic, Underline and Font Color (with access to a limited number of font colors).
• An option is provided to set Parameter Color, changing the font color of all parameter tags to Black, Red, Blue or Green.
• If an SQL DROP command fails to find the database object being dropped, the error is ignored and not reported to the user (much as this error is ignored in other types of analysis).
The following execution options can be selected at the bottom of the screen.
• The Continue execution on error option can be used when multiple statements are entered and it is desired to continue executing statements even if an error occurs on one of them. Note that when this option is used and one or more errors occur, the last error encountered will be returned as the result of the analysis. Therefore it may also be necessary to select the Tools > Preferences > Execution menu option to Continue project execution on analysis error in order to continue project execution as well.
• The option to Submit SQL without comments is provided in case it is desired to simplify viewing the SQL in Results, or in case it is suspected that submitting the comments leads to an error. For example, this option should be used if a comment beginning with “/*” and ending with “*/” follows all SQL statements in the text window. Note that this option is not available when the option to Execute as a single statement is also selected.
• If multiple statements have been entered that need to be executed as a single statement (such as with REPLACE MACRO or REPLACE PROCEDURE), use the Execute as a single statement checkbox. Note that when this option is selected, comments beginning with “/*” and ending with “*/” are not allowed. Note also that when this option is used, large decimal data (i.e., more than 18 decimal digits) cannot be returned in the results. Further, with this option, the number of rows returned will not be limited according to user preference settings.
• When the Multiple result sets per command option is selected, individual statements will be allowed to return more than one result set. This is particularly useful when a CALL statement returns more than one data set. Note that with this option, the number of rows returned will not be limited according to user preference settings, and large decimal data (more than 18 digits of precision) cannot be returned, just as with the “Execute as a single statement” option.
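The effect of submitting SQL without comments can be illustrated with a small scanner that drops both comment styles described above while leaving single-quoted string literals untouched. This is a sketch of the general technique, not the product's own comment handling: the function name is hypothetical, and double-quoted identifiers and nested comments are deliberately not handled.

```python
def strip_sql_comments(sql):
    """Remove /* ... */ and -- comments, keeping single-quoted literals intact."""
    out = []
    i, n = 0, len(sql)
    while i < n:
        if sql[i] == "'":
            # Copy a string literal verbatim; '' inside it is an escaped quote.
            j = i + 1
            while j < n:
                if sql[j] == "'":
                    if j + 1 < n and sql[j + 1] == "'":
                        j += 2
                        continue
                    break
                j += 1
            out.append(sql[i:j + 1])
            i = j + 1
        elif sql.startswith("/*", i):
            end = sql.find("*/", i + 2)
            i = n if end == -1 else end + 2
        elif sql.startswith("--", i):
            end = sql.find("\n", i)
            i = n if end == -1 else end       # keep the line break itself
        else:
            out.append(sql[i])
            i += 1
    return "".join(out)


text = "SELECT 1; -- first\nSELECT '--x' /* c */;\n/* trailing */"
print(strip_sql_comments(text))
```

The trailing block comment in this example is the case for which the Submit SQL without comments option is recommended above.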

Free Form SQL - INPUT - Literal Parameters 1. On the Free Form SQL dialog box, click on INPUT.

2. Click literal parameters.
Free Form SQL Input: Literal Parameters

This screen contains information about literal substitution parameters and various advanced options. SQL literal values of string, numeric, date, time and timestamp types may be defined on this screen, with their values being substituted in the free-form SQL on the analysis parameters tab wherever they appear enclosed in less-than and greater-than signs. The substituted literal values automatically include quotes and a keyword if necessary. For example, the parameter p1 shown above appears in the free-form SQL as <p1>, which is converted to DATE '2008-02-28'. Note that literal values are entered in a format consistent with the workstation’s current locale setting, yet appear in the free-form SQL in an invariant format. In addition to the supported SQL literal types, a text type is provided that receives no formatting and is substituted in the free-form SQL “as-is”.
If desired, a special use can be defined for a parameter, such as the uses of Input and Output Database and Input and Output Table shown above. These uses are recognized in other parts of the application as described in more detail below. Note that each special use can be assigned to only one parameter, so that, for example, there can be only one Output Database parameter.
Note that Free Form SQL literal parameters are of limited use in the Teradata Profiler product since many of the features described below apply only to the Teradata ADS Generator and Teradata Warehouse Miner products. In particular, all references to Variable Creation, Variable Transformation, Refresh, Publish, ADS and scoring analyses do not apply to the Teradata Profiler product. Note also that in the sample screen shown above, the Creates Score Table and Depends on Analysis options are not visible in the Teradata Profiler.
3. On this screen, select:

• The Add option can be used to add a literal parameter to the analysis. A default name consisting of the letter ‘p’ followed by an integer will automatically be assigned. The default name can be changed by the user after clicking the Properties button. A menu with available types is displayed, including:
∘ Date
∘ Numeric
∘ String
∘ Text (used as-is without formatting)
∘ Time
∘ Timestamp
Also displayed are the appropriate special use options, depending on the product and the Creates Score Table option. When one of these options is selected a parameter of type Text is added with the appropriate special use (with the exception of Target Date which creates a parameter of type Date). Note that when using a special use text parameter for a database object name, do not include double quotes in the parameter, but rather in the Free Form SQL text surrounding the parameter name if they are needed (for example, “”).
• Generally Available Special Use Options
The generally available special use options include:
∘ Input Database — When used together with Input Table, the value is recognized when importing or adding an existing analysis. The value specified may not be a foreign database accessed via QueryGrid.

Note: For those parameters that don’t support foreign objects accessed via QueryGrid, a View may be used instead. ∘ Input Table — When used together with Input Database, the value is recognized when importing or adding an existing analysis. The value specified may not be a foreign table accessed via QueryGrid. ∘ Output Database — When used together with Output Table or Output View, the value is recognized when importing, adding an existing analysis, displaying properties or performing metadata maintenance as an output table or view that can be deleted or changed. Another special feature of this parameter use is that when the value is left blank, the default result database from the connection properties is automatically used. If, however, this table or view needs to match the output table or view of another analysis it is better to use an explicit value here so that if database mapping is ever performed upon importing or adding this analysis the database will be matched to the correct value. Finally, the Output Database and Output Table or View parameters must be present if it is desired for the Free Form SQL analysis to be “selectable” (that is, display its output table and columns when selected as an analysis for input). ∘ Output Table — See the description of Output Database above for special considerations. ∘ Output View — See the description of Output Database above for special considerations. ∘ User ID — This special use option can be used for the database in which a volatile table is created. The logon User ID will automatically be substituted at run time, regardless of the value specified. • ADS Special Use Options — In products other than the Teradata Profiler when the Creates Score Table option is not set, the special use options below may be used. 
  With the exception of the generally available special use options already described, these options are of use only when using the Refresh or Publish analyses or the Model Manager application, making the Free Form SQL analysis appear as if it were a Variable Creation analysis with these properties. Depending on the context (i.e., in Refresh, Publish or the Model Manager application), these fields may be displayed as separate fields, possibly renamed, as literal parameters or simply used internally, as is the case with User ID.
  ∘ Target Date: An optional parameter used in time-sensitive data set variables.
  ∘ Anchor Database: The value specified may not be a foreign database accessed via QueryGrid.
  ∘ Anchor Table: The value specified may not be a foreign table accessed via QueryGrid.
  ∘ Input Database: The value specified may not be a foreign database accessed via QueryGrid.
  ∘ Input Table: The value specified may not be a foreign table accessed via QueryGrid.
  ∘ Output Database: (defined above)
  ∘ Output Table: (defined above)
  ∘ Output View: (defined above)
  ∘ User ID: (defined above)
• Scoring Special Use Options: In products other than the Teradata Profiler, when the Creates Score Table option is set, the following special use options may be used. With the exception of the generally available special use options already described, these options are of use only when using the Refresh or Publish analyses or the Model Manager application, making the Free Form SQL analysis appear as if it were a scoring analysis with these properties. Depending on the context (i.e., in Refresh, Publish or the Model Manager application), these fields may be displayed as separate fields, possibly renamed, as literal parameters or simply used internally, as is the case with User ID.
  ∘ Input Database: The value specified may not be a foreign database accessed via QueryGrid.
  ∘ Input Table: The value specified may not be a foreign table accessed via QueryGrid.
  ∘ Output Database: (defined above)
  ∘ Output Table: (defined above)
  ∘ Output View: (defined above)
  ∘ User ID: (defined above)
  ∘ UDF Database: This special use option can be used for the database in which PMML scoring User Defined Functions are located. The current Metadata Database will automatically be substituted at run time, regardless of the value specified.
• The Remove option can be used to remove a literal parameter from the analysis.
• The Sort option can be used to sort the literal parameters in the display grid, By Name, By Type or By Entry (i.e., the order entered into the analysis).
• The Properties option can be used to change the Name, Value and/or Use of a parameter, as well as to enter a Description if desired. For the Value and Use fields, it provides a more fully-featured alternative to editing them in place in the grid.
• The Generate SQL Only option can be used to inhibit the execution of the free-form SQL while generating the SQL (with all parameters substituted) that would ordinarily be executed. The SQL can be viewed on the Results screen.
• The Table Function or Operator option can be used to indicate that the Free Form SQL Analysis, when referenced by another analysis, should be placed in the From clause as a Table Function or Table Operator (i.e., without the parentheses that would be added if it represented a derived table). Refer to the similar option on the Function Table Properties dialog box.
• The Creates Score Table option (not available in the Teradata Profiler product) can be used to indicate that this analysis should be treated as a scoring analysis when published to the Model Manager application using the Publish analysis. As such, this option can only be set on the last analysis in an analysis reference chain as described below in the description of the Depends on Analysis option.
• The Depends on Analysis option (not available in the Teradata Profiler product) can be used to chain this analysis to another analysis that it is dependent on for input, causing the analysis it depends on to be executed before this one (along with any analyses it may reference). This feature, together with the data selection tab, allows one or more Free Form SQL analyses to appear anywhere in an analysis reference chain.
• The Advertise Output option may be requested only when substitution parameters with “special use” Output Database and Output Table or Output View have been specified. This feature “advertises” output by inserting information into one or more of the Advertise Output metadata tables according to the type of analysis and the options selected in the analysis. For more information, see Advertise Output.
• An Advertise Note may be specified if desired when the Advertise Output option is selected or when the Always Advertise option is selected on the Connection Properties dialog box. It is a free-form text field of up to 30 characters that may be used to categorize or describe the output. The Advertise Note is ignored, however, if the appropriate “special use” substitution parameters are not present.
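The literal parameter behavior described above can be sketched in a few lines of Python. This is only an illustration, not the product's implementation: the `<name>` placeholder syntax, the `Param` class, the function names and the literal formatting rules (quoted strings, `DATE '...'` style literals) are all assumptions made for the example. Only the special-use rules (a blank Output Database falls back to the default result database, User ID is replaced by the logon user, UDF Database is replaced by the metadata database) come from the text above.

```python
import re
from dataclasses import dataclass


@dataclass
class Param:
    name: str       # e.g. "p1"; default names are 'p' plus an integer
    type: str       # Date, Numeric, String, Text, Time or Timestamp
    value: str
    use: str = ""   # optional special use, e.g. "Output Database"


def format_value(p, logon_user, default_result_db, metadata_db):
    """Return the text substituted for one parameter."""
    value = p.value
    # Special uses described in the guide override or default the value.
    if p.use == "User ID":
        value = logon_user          # always the logon user
    elif p.use == "UDF Database":
        value = metadata_db         # always the metadata database
    elif p.use == "Output Database" and not value.strip():
        value = default_result_db   # blank means the default result database
    # ASSUMPTION: the literal formatting below is illustrative only.
    if p.type == "String":
        return "'" + value.replace("'", "''") + "'"
    if p.type in ("Date", "Time", "Timestamp"):
        return f"{p.type.upper()} '{value}'"
    # Text is used as-is (per the guide); Numeric is passed through here.
    return value


def render_sql(sql, params, logon_user, default_result_db, metadata_db):
    """Substitute every known <name> placeholder; nothing is executed."""
    by_name = {p.name: p for p in params}

    def replace(match):
        p = by_name.get(match.group(1))
        if p is None:
            return match.group(0)   # leave unknown placeholders untouched
        return format_value(p, logon_user, default_result_db, metadata_db)

    return re.sub(r"<(\w+)>", replace, sql)


# Hypothetical parameters and SQL text, for illustration only.
params = [
    Param("p1", "Text", "", use="Output Database"),
    Param("p2", "Text", "scores", use="Output Table"),
    Param("p3", "Date", "2017-07-01"),
    Param("p4", "String", "O'Brien"),
]
sql = ('CREATE TABLE "<p1>"."<p2>" AS (SELECT * FROM t '
       'WHERE d >= <p3> AND name = <p4>) WITH DATA;')
rendered = render_sql(sql, params, logon_user="jdoe",
                      default_result_db="twm_results", metadata_db="twm_md")
print(rendered)
```

Because `render_sql` only produces text, it also mirrors the idea behind the Generate SQL Only option: the fully substituted SQL can be inspected without being run. Note that the double quotes around the object names sit in the SQL text, not in the parameter values, as recommended above.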

Free Form SQL - INPUT - Data Selection

1. On the Free Form SQL dialog box, click on INPUT.
2. Click data selection.

Free Form SQL > Input > Data Select

Note: This panel is not needed in the Teradata Profiler product and is therefore not displayed therein.

A standard column selector form is displayed with instructions stating that this form may be used to indicate the output columns created by this analysis or the input columns (model variables) if it creates a score table. When used to indicate output columns, these columns are used to determine what columns may be selected from this analysis when it is selected as an analysis input source. When used to indicate the input columns of a scoring analysis that is being refreshed or published, the Refresh or Publish analysis can use this information to limit columns in any “selectable” analyses (other than Free Form SQL analyses) referenced by it.

If the free-form SQL text specified for this analysis uses “special use” substitution parameters for Output Database and Output Table and builds an analytic data set (and not a score table), the display of this form will use those values to attempt to pre-set the Available Databases, Tables and Columns, provided columns have not been selected thus far. For this reason, it is useful to run the analysis first, creating the output table, and then use this form to declare what the output columns are. As an alternative to using this short-cut, any table containing the data set or model columns can be used to select the appropriate columns. Note that it is only the column names that are used by the Refresh or Publish analysis in limiting the columns in referenced analyses, but the column types are also important when the analysis is referenced for input.

Similarly, if the free-form SQL text specified for this analysis uses “special use” literal parameters for Input Database and Input Table and builds a score table (i.e., the Creates Score Table option is selected on the literal parameters tab), the display of this form will use those values to attempt to pre-set the Available Databases, Tables and Columns, provided columns have not been selected thus far. In other words, it is the output of a data set analysis or the input to a scoring analysis that defines the “variables in the model” that are being defined here.

Finally, note that even if this tab or panel is not used to specify output columns, it may still be possible to successfully reference a Free Form SQL analysis for input. If the special use parameters for Output Database and Output Table or View are specified, the Teradata data dictionary is queried to get column and index information. If, on the other hand, the Free Form SQL analysis either selects data or defines a Table Function or Operator, and the Generate SQL Only option (and the Table Function or Operator option if applicable) is checked, the output columns are determined, if possible, using a special technique involving the Teradata data dictionary.
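To give a feel for the kind of data dictionary lookup mentioned above, the following Python sketch builds a column query for an output table. The guide does not state which dictionary objects the product reads, so the use of the standard Teradata view `DBC.ColumnsV` here is an assumption, and the function names, database name and table name are hypothetical.

```python
def sql_literal(text):
    """Quote a value as a SQL string literal, doubling embedded quotes."""
    return "'" + text.replace("'", "''") + "'"


def build_column_query(database, table):
    """Build a query listing the columns of one table in definition order.

    ASSUMPTION: DBC.ColumnsV is used for illustration; the product may
    consult different dictionary objects.
    """
    return (
        "SELECT ColumnName, ColumnType "
        "FROM DBC.ColumnsV "
        f"WHERE DatabaseName = {sql_literal(database)} "
        f"AND TableName = {sql_literal(table)} "
        "ORDER BY ColumnId;"
    )


# Hypothetical output database and table names.
query = build_column_query("twm_results", "scores")
print(query)
```

The values passed in would correspond to the Output Database and Output Table (or Output View) special use parameters after substitution.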

Running the Free Form SQL Analysis

After setting parameters on the INPUT screens as described above, you are ready to run the analysis.

1. To run the analysis, you can either:

• Click the Run icon on the toolbar, or
• Select Run on the Project menu, or
• Press the F5 key on your keyboard.
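When an analysis that uses the Depends on Analysis option is run, the analysis it depends on is executed first, along with any analyses that one references. The ordering can be sketched as a depth-first walk over the dependency chain. The analysis names and the dictionary layout below are hypothetical, and the circular-dependency check is a safeguard added for this sketch rather than documented product behavior.

```python
def execution_order(target, depends_on):
    """Return analyses in the order they would run, ending with target.

    depends_on maps an analysis name to the names it depends on or references.
    """
    order = []
    visiting = set()   # analyses on the current path, for cycle detection
    done = set()       # analyses already placed in the order

    def visit(name):
        if name in done:
            return
        if name in visiting:
            raise ValueError(f"circular dependency at {name!r}")
        visiting.add(name)
        for dep in depends_on.get(name, []):
            visit(dep)             # prerequisites run first
        visiting.discard(name)
        done.add(name)
        order.append(name)

    visit(target)
    return order


# Hypothetical chain: a scoring analysis at the end of a reference chain.
chain = {
    "score_customers": ["build_ads"],
    "build_ads": ["stage_input"],
}
order = execution_order("score_customers", chain)
print(order)
```

In this sketch the scoring analysis comes last, which matches the rule that Creates Score Table can only be set on the last analysis in a reference chain.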

Results - Free Form SQL

The results of running the Free Form SQL analysis are available under the RESULTS tab. Click on the RESULTS tab to view the results. Note that the RESULTS tab will be grayed-out/disabled until after the analysis has finished running.

Two tabs are available under the RESULTS tab, data and SQL. One or more result sets can be viewed on the data tab depending on the number and nature of the SQL statements specified on the INPUT - analysis parameters tab. The pull-down list can be used to select the desired result table. Similarly, the SQL statements executed or generated can be viewed together or individually on the SQL tab. The pull-down list labeled Show Statement can be used to select either All Statements or individual statements as desired. Note, however, that if the option to Execute as a single statement was selected, individual statements cannot be selected on the Results - SQL tab.

Note: Results will not be returned correctly by the Free Form SQL analysis when a stored procedure is called that contains an output or input/output parameter (i.e., a parameter that returns data).

