InfoSphere DataStage: Parallel Framework Standard Practices

InfoSphere DataStage: Parallel Framework Standard Practices

Develop highly efficient and scalable information integration applications
Investigate, design, and develop data flow jobs
Get guidelines for cost effective performance

Julius Lerm
Paul Christensen

ibm.com/redbooks

International Technical Support Organization
InfoSphere DataStage: Parallel Framework Standard Practices
September 2010
SG24-7830-00

Note: Before using this information and the product it supports, read the information in “Notices” on page xiii.

First Edition (September 2010)

This edition applies to Version 8, Release 1 of IBM InfoSphere Information Server (5724-Q36) and Version 9, Release 0, Modification 1 of IBM InfoSphere Master Data Management Server (5724-V51), and Version 5.3.2 of RDP.

© Copyright International Business Machines Corporation 2010. All rights reserved.
Note to U.S. Government Users Restricted Rights -- Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

Contents

Notices
  Trademarks
Preface
  The team who wrote this book
  Now you can become a published author, too!
  Comments welcome
  Stay connected to IBM Redbooks

Chapter 1. Data integration with Information Server and DataStage
  1.1 Information Server 8
    1.1.1 Architecture and information tiers
  1.2 IBM Information Management InfoSphere Services
  1.3 Center of Excellence for Data Integration (CEDI)
  1.4 Workshops for IBM InfoSphere DataStage

Chapter 2. Data integration overview
  2.1 Job sequences
  2.2 Job types
    2.2.1 Transformation jobs
    2.2.2 Hybrid jobs
    2.2.3 Provisioning jobs

Chapter 3. Standards
  3.1 Directory structures
    3.1.1 Metadata layer
    3.1.2 Data, install, and project directory structures
    3.1.3 Extending the DataStage project for external entities
    3.1.4 File staging
  3.2 Naming conventions
    3.2.1 Key attributes of the naming convention
    3.2.2 Designer object layout
    3.2.3 Documentation and metadata capture
    3.2.4 Naming conventions by object type
  3.3 Documentation and annotation
  3.4 Working with source code control systems
    3.4.1 Source code control standards
    3.4.2 Using object categorization standards
    3.4.3 Export to source code control system

Chapter 4. Job parameter and environment variable management
  4.1 DataStage environment variables
    4.1.1 DataStage environment variable scope
    4.1.2 Special values for DataStage environment variables
    4.1.3 Environment variable settings
    4.1.4 Migrating project-level environment variables
  4.2 DataStage job parameters
    4.2.1 When to use parameters
    4.2.2 Parameter standard practices
    4.2.3 Specifying default parameter values
    4.2.4 Parameter sets

Chapter 5. Development guidelines
  5.1 Modular development
  5.2 Establishing job boundaries
  5.3 Job design templates
  5.4 Default job design
  5.5 Parallel shared containers
  5.6 Error and reject record handling
    5.6.1 Reject handling with the Sequential File stage
    5.6.2 Reject handling with the Lookup stage
    5.6.3 Reject handling with the Transformer stage
    5.6.4 Reject handling with Target Database stages
    5.6.5 Error processing requirements
  5.7 Component usage
    5.7.1 Server Edition components
    5.7.2 Copy stage
    5.7.3 Parallel datasets
    5.7.4 Parallel Transformer stages
    5.7.5 BuildOp stages
  5.8 Job design considerations for usage and impact analysis
    5.8.1 Maintaining JobDesign:Table definition connection
    5.8.2 Verifying the job design:table definition connection

Chapter 6. Partitioning and collecting
  6.1 Partition types
    6.1.1 Auto partitioning
    6.1.2 Keyless partitioning
    6.1.3 Keyed partitioning
    6.1.4 Hash partitioning
  6.2 Monitoring partitions
  6.3 Partition methodology
  6.4 Partitioning examples
    6.4.1 Partitioning example 1: Optimized partitioning
    6.4.2 Partitioning example 2: Use of Entire partitioning
  6.5 Collector types
    6.5.1 Auto collector
    6.5.2 Round-robin collector
    6.5.3 Ordered collector
    6.5.4 Sort Merge collector
  6.6 Collecting methodology

Chapter 7. Sorting
  7.1 Partition and sort keys
  7.2 Complete (Total) sort
  7.3 Link sort and Sort stage
    7.3.1 Link sort
    7.3.2 Sort stage
  7.4 Stable sort
  ...

