<<

Managing Data in Motion
Best Practice Techniques and Technologies

April Reeve

AMSTERDAM • BOSTON • HEIDELBERG • LONDON NEW YORK • OXFORD • PARIS • SAN DIEGO SAN FRANCISCO • SINGAPORE • SYDNEY • TOKYO

Morgan Kaufmann is an imprint of Elsevier
225 Wyman Street, Waltham, MA 02451, USA

Acquiring Editor: Andrea Dierna
Development Editor: Heather Scherer
Project Manager: Mohanambal Natarajan
Designer: Russell Purdy

Copyright © 2013 Elsevier Inc. All rights reserved.

No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher’s permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions.

This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).

Notices
Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods or professional practices may become necessary. Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information or methods described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.

To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.

Library of Congress Cataloging-in-Publication Data
Application submitted

British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library

ISBN: 978-0-12-397167-8

For information on all MK publications visit our website at www.mkp.com

Printed in the USA
13 14 15 16 17    10 9 8 7 6 5 4 3 2 1

For my sons:
Henry, who knows everything and, although he hasn’t figured out exactly what I do for a living, advised me to “put words on paper,”
and
David, who is so talented, so much fun to be with, and always willing to go with me to Disney.

Contents

Foreword
Acknowledgements
Biography
Introduction

PART 1 INTRODUCTION TO DATA INTEGRATION
Chapter 1 The Importance of Data Integration
    The natural complexity of data interfaces
    The rise of purchased vendor packages
    Key enablement of big data and virtualization
Chapter 2 What Is Data Integration?
    Data in motion
    Integrating into a common format—transforming data
    Migrating data from one system to another
    Moving data around the organization
    Pulling information from unstructured data
    Moving process to data
Chapter 3 Types and Complexity of Data Integration
    The differences and similarities in managing data in motion and persistent data
    Batch data integration
    Real-time data integration
    Big data integration
    Data virtualization
Chapter 4 The Process of Data Integration Development
    The data integration development life cycle
    Inclusion of business knowledge and expertise

PART 2 BATCH DATA INTEGRATION
Chapter 5 Introduction to Batch Data Integration
    What is batch data integration?
    Batch data integration life cycle


Chapter 6 Extract, Transform, and Load
    What is ETL?
    Profiling
    Extract
    Staging
    Access layers
    Transform
    Simple mapping
    Lookups
    Aggregation and normalization
    Calculation
    Load
Chapter 7 Data Warehousing
    What is data warehousing?
    Layers in an enterprise data warehouse architecture
    Operational application layer
    External data
    Data staging areas coming into a data warehouse
    Data warehouse data structure
    Staging from data warehouse to data mart or business intelligence
    Business intelligence layer
    Types of data to load in a data warehouse
    Master data in a data warehouse
    Balance and snapshot data in a data warehouse
    Transactional data in a data warehouse
    Events
    Reconciliation
    Interview with an expert: Krish Krishnan on data warehousing and data integration
Chapter 8 Data Conversion
    What is data conversion?
    Data conversion life cycle
    Data conversion analysis
    Best practice data loading
    Improving source data quality

    Mapping to target
    Configuration data
    Testing and dependencies
    Private data
    Proving
    Environments
Chapter 9 Data Archiving
    What is data archiving?
    Selecting data to archive
    Can the archived data be retrieved?
    Conforming data structures in the archiving environment
    Flexible data structures
    Interview with an expert: John Anderson on data archiving and data integration
Chapter 10 Batch Data Integration Architecture and Metadata
    What is batch data integration architecture?
    Profiling tool
    Modeling tool
    Metadata repository
    Data movement
    Transformation
    Scheduling
    Interview with an expert: Adrienne Tannenbaum on metadata and data integration

PART 3 REAL-TIME DATA INTEGRATION
Chapter 11 Introduction to Real-Time Data Integration
    Why real-time data integration?
    Why two sets of technologies?
Chapter 12 Data Integration Patterns
    Interaction patterns
    Loose coupling
    Hub and spoke
    Synchronous and asynchronous interaction

    Request and reply
    Publish and subscribe
    Two-phase commit
    Integrating interaction types
Chapter 13 Core Real-Time Data Integration Technologies
    Confusing terminology
    Enterprise service bus (ESB)
    Interview with an expert: David S. Linthicum on ESB and data integration
    Service-oriented architecture (SOA)
    Extensible markup language (XML)
    Interview with an expert: M. David Allen on XML and data integration
    Data replication and change data capture
    Enterprise application integration (EAI)
    Enterprise information integration (EII)
Chapter 14 Data Integration Modeling
    Canonical modeling
    Interview with an expert: Dagna Gaythorpe on canonical modeling and data integration
    Message modeling
Chapter 15 Master Data Management
    Introduction to master data management
    Reasons for a master data management solution
    Purchased packages and master data
    Reference data
    Masters and slaves
    External data
    Master data management functionality
    Types of master data management solutions—registry and data hub
Chapter 16 Data Warehousing with Real-Time Updates
    Corporate information factory
    Operational data store

    Master data moving to the data warehouse
    Interview with an expert: Krish Krishnan on real-time data warehousing updates
Chapter 17 Real-Time Data Integration Architecture and Metadata
    What is real-time data integration metadata?
    Modeling
    Profiling
    Metadata repository
    Enterprise service bus— and orchestration
    Technical mediation
    Business content
    Data movement and
    External interaction

PART 4 BIG, CLOUD, VIRTUAL DATA
Chapter 18 Introduction to Big Data Integration
    Data integration and unstructured data
    Big data, cloud data, and data virtualization

Chapter 19 Cloud Architecture and Data Integration
    Why is data integration important in the cloud?
    Public cloud
    Cloud security
    Cloud latency
    Cloud redundancy

Chapter 20 Data Virtualization
    A technology whose time has come
    Business uses of data virtualization
    Business intelligence solutions
    Integrating different types of data
    Quickly add or prototype adding data to a data warehouse
    Present physically disparate data together
    Leverage various data and models triggering transactions

    Data virtualization architecture
    Sources and adapters
    Mappings and models and views
    Transformation and presentation
Chapter 21 Big Data Integration
    What is big data?
    Big data dimension—volume
    Massive parallel processing—moving process to data
    Hadoop and MapReduce
    Integrating with external data
    Visualization
    Big data dimension—variety
    Types of data
    Integrating different types of data
    Interview with an expert: William McKnight on Hadoop and data integration
    Big data dimension—velocity
    Streaming data
    Sensor and GPS data
    Social media data
    Traditional big data use cases
    More big data use cases
    Health care
    Logistics
    National security
    Leveraging the power of big data—real-time decision support
    Triggering action
    Speed of data retrieval from memory versus disk
    From data analytics to models, from streaming data to decisions
    Big data architecture
    Operational systems and data sources
    Intermediate data hubs
    Business intelligence tools
    Data virtualization server

    Batch and real-time data integration tools
    Analytic sandbox
    Risk response systems/recommendation engines
    Interview with an expert: John Haddad on Big Data and data integration
Chapter 22 Conclusion to Managing Data in Motion
    Data integration architecture
    Why data integration architecture?
    Data integration life cycle and expertise
    Security and privacy
    Data integration engines
    Operational continuity
    ETL engine
    Enterprise service bus
    Data virtualization server
    Data movement
    Data integration hubs
    Master data
    Data warehouse and operational data store
    Enterprise content management
    Data archive
    Metadata management
    Data discovery
    Data profiling
    Data modeling
    Data flow modeling
    Metadata repository
    The end

References
Index


Foreword

Data integration has been the information systems profession’s most enduring challenge. It is almost four decades since Richard Nolan nominated data administration as the penultimate stage of his data processing maturity model, recognizing that the development of applications to support business processes would, unless properly managed, create masses of duplicated and uncoordinated data.

In the early days of technology, some of us had a dream that we could achieve Nolan’s objective by building all of our organizations’ databases in a coordinated manner to eliminate data duplication: “Capture data once, store it in one place, and make it available to everyone who needs it” was the mantra.

Decentralized computing, packaged software, and plain old self-interest put an end to that dream, but in many organizations the underlying ideas lived on in the form of data management initiatives based on planning and coordination of databases—notably in the form of enterprise data models. Their success was limited, and organizations turned to tactical solutions to solve the most pressing problems. They built interfaces to transfer data between applications rather than capturing it multiple times, and they pulled it together for reporting purposes in what became data warehouses and marts. This pragmatic approach embodied a willingness to accept duplicated data as a given that was not attractive to the purists.

The tension between a strategic, organization-wide approach based on the disposition of data and after-the-fact spot solutions remains today. But the scale of the problem has grown beyond anything envisaged in the 1970s.

We have witnessed extraordinary advances in computing power, storage technology, and development tools. Information technology has become ubiquitous in business and government, and even midsized organizations count their applications in the thousands and their data in petabytes. But each new application, each new solution, adds to the proliferation of data. Increasingly, these solutions are “off the shelf,” offering the buyer little say in the database design and how it overlaps with existing and future purchases.

Not only has the number of applications exploded, but the complexity of the data within them is worlds away from the simple structures of early files and databases. The Internet and smartphones generate enormous volumes of less structured data; “data” embraces documents, audio, and video; and cloud computing both extends the boundary of the organization’s data and further facilitates acquisition of new applications.

The need for data integration has grown proportionately—or more correctly, disproportionately, as the number of possible interfaces between systems increases exponentially. What was once an opportunistic activity is becoming, in many organizations, the focus of their systems’ development efforts.


The last decade has seen important advances in tools to support data integration through messaging and virtualization. This book fills a vital gap in providing an overview of this technology in a form that is accessible to nonspecialists: planners, managers, and developers. April Reeve brings a rare combination of business perspective and detailed knowledge from many years of designing, implementing, and operating applications for organizations as an IT technician, manager and, more recently, a consultant using the technologies in a variety of different environments.

Perhaps the most important audience will be data managers, in particular those who have stuck resolutely to the static data management model and its associated tools. As the management of data in motion comes to represent an increasing proportion of the information technology budget, it demands strategic attention, and data managers, with their organization-wide remit, are ideally placed to take responsibility. The techniques in this book now form the mainstream of data integration thinking and represent the current best hope of achieving the data administration goals Nolan articulated so long ago.

—Graeme Simsion


Acknowledgements

First of all, I want to acknowledge the contribution of my husband, Tom Reeve, who said I had to acknowledge him for making me dinner. During the course of writing this book he made me dinner hundreds of times. Additionally, he put up with my constant mantra that “I have to write” instead of doing so many other things such as exercising or cleaning the house.

Of course I want to acknowledge the generosity of time and effort from all the data management experts that gave me interviews to use in this book:

• Let me start with David Allen, who was my co-presenter at the tutorial we gave on this subject at the EDW conference in Chicago in 2011 and teaches me something fascinating about XML and JSON every time I see him, without him even knowing he is doing it.
• Krish Krishnan provided an abundance of information on his experience with data integration in data warehousing and set a wonderfully high standard for the expert interviews.
• James Anderson jumped in quickly when I lost my previous data archiving expert, and it turns out we used to work together and now we are reconnected again.
• It was a pleasure to reconnect also with Adrienne Tannenbaum and get her perspective on metadata and data integration.
• I’ve always said that the more experienced we are the more we hate our tools, because we know the limitations, and it was great to get Dave Linthicum to look back on some of his experiences with enterprise service buses and the times when he might have had a love/hate relationship with them.
• I was very excited to get an interview with Dagna Gaythorpe, with all her experience in data modeling and its challenges, who provided a practitioner’s perspective on canonical data modeling that I found surprisingly optimistic.
• William McKnight helped me to get a handle on Hadoop, which is a critical subject for a current book on data integration.
• I met John Haddad when we were both on a Big Data panel at EDW 2012 and he generously offered to help me with this book, which he did by reading and providing feedback on some of the early sections as well as a perfect interview on big data.
• Although Mike Ferguson couldn’t provide an interview, attending his workshop did provide me with a great deal of my understanding of data virtualization, which is a core concept in big data integration.

Karl Glenn was one of the technical reviewers of the book and I appreciated his perspective and advice very much. It was amazing to discover someone who lived on another continent and yet shared much of my understanding and perspective regarding best practices with data integration.

And one last shout out for my editor, Andrea Dierna, who is amazingly calm in a crisis.


Biography

April Reeve has spent the last 25 years working as an enterprise architect and program manager for large multinational organizations, developing data strategies and managing development and operation of solutions. April is an expert in multiple data management disciplines including data conversion, data warehousing, business intelligence, master data management, data integration, and data governance. Currently, she is working for EMC2 Consulting as an Advisory Consultant in the Enterprise Information Management practice.

Introduction

What this book is about and why it’s necessary
Most organizations of middle to large size have hundreds or, more probably, thousands of applications, each with its own various databases and other data stores. Whether the data stores are from traditional technologies and database management systems, emerging technologies, or document management systems, it is critical to the usefulness of these applications to the organization that they share information between them. Developing and managing solutions for moving data between applications becomes an overwhelmingly complex problem unless addressed using a coordinated approach across an organization. This book describes a reasonable approach and architecture that enables managing the cacophony of interfaces in an application portfolio.

The focus of data management by information technology functions tends to be around the efficient management of data sitting in databases or “persistent” data that sits still. Since currently in most organizations applications are primarily purchased vendor solutions, managing the data that travels between systems, applications, data stores, and organizations—the “data in motion”—should be a central focus of information technology for any organization. Custom development in most organizations will continue to be more around the data traveling between applications than the development of new applications.

What the reader will learn
This book provides an overview of the different techniques, technologies, and best practices used to manage the passing of data between computer systems and integrating disparate data together in a large organization environment.

The reader will learn about best practice techniques for developing and managing data integration solutions for an organization and the set of tools, or architecture, that an organization should have to support a robust organizational data integration capability, broken down by the various types of data integration that are relevant: batch, real-time, or big data integration.

All should read Chapter 1 on “The Importance of Data Integration and the Natural Complexity of Data Interfaces” and Chapter 12 on “Data Integration Patterns and Hub and Spoke,” which is key to turning the unmanageable complexity of interfaces in an organization into a data layer that can be reasonably managed. The information presented in these chapters is the basis for developing a financial justification for a data integration program in their organization.

Who should read this book
This book was written for five categories of people:
• Senior-level business and IT managers
• Enterprise data, application, and technical architects
• Data-related program and project managers, including those for data warehouses, master data management programs, data conversion and migration, and data archiving
• Data analysts, data modelers, database practitioners, and data integration programmers
• Data management students

This book contains references to various types of technology solutions to data integration problems, but does not require an extensive technical background for understanding.

Senior Business and Information Technology Managers
Managing the cacophony of interactions and interfaces between the thousands of applications and databases in most organizations is one of the primary challenges associated with managing an IT portfolio. Senior-level managers, both those in information technology and those who need to understand the issues involving managing an application system portfolio, may be interested in an overview of the techniques, technologies, and best practices associated with managing the data traveling around their organization and to other organizations. This area of data management has previously been viewed as of concern almost exclusively to the most detailed technical areas of an organization. Besides the fact that managers should have an understanding of the main parts of their technology investments, significant gains in productivity are to be obtained through some simple architectural decisions in data integration technology and management that no organization can afford to ignore.

Enterprise Data, Application, and Technical Architects
Enterprise architects, especially those associated with data and application architecture but also those dealing in more technical components, should certainly have an understanding of the data integration solutions needed in an organization’s technical portfolio and architectural planning. The data architecture of every organization should include layers supporting business intelligence, database management, document management, and, certainly, data integration.

Data-Related Program and Project Managers
Program and project managers in data warehousing, master data management, data conversion, data archiving, business intelligence, metadata, and so on, should have an understanding of the techniques for moving data between applications and integrating data into and out of data hubs.

Data Analysts, Data Modelers, Database Practitioners, and Data Integration Programmers
Many people are working on data integration projects as data analysts, programmers, or under other titles, performing all kinds of detailed functions in roles such as ETL programmer, SOA architect, and data service modeler. A broader understanding of why certain functions are designed or performed the way they are, and of the neighboring and emerging technologies in this area, can only help to improve their current effectiveness and future prospects. An understanding of how the work they are currently doing fits into the broader framework of the organization’s technology can help them focus on the most important goals of their work and bring to light situations where things are not occurring as intended. Learning about some of the emerging technologies in this area can help to identify areas of possible interest and career development.

Data Management Students
In universities, information technology training tends to focus on programming and technology training. Database training focuses on creating new individual databases. In practice, organizations develop a few new databases, but mostly struggle to manage those already existing. The interaction of thousands of applications and data stores may be hard for a student to visualize. The set of techniques and technologies developed to solve the various data integration problems provides an important, wide base of understanding for a student to grasp.

How this book is organized
This book provides an overview of the different techniques, technologies, and best practices used to manage the passing of data between computer systems and integrating disparate data together in a large organization environment. Each section describes the architecture, or set of tools and technologies, needed to support the type of data integration capability specified in that section: batch, real-time, and big data integration. Also included in each section are interviews with experts in different technologies associated with data integration and case studies showcasing my own experiences.

Current practitioners in data integration may want to skip forward past the basic sections to the areas on data integration with which they are not yet familiar, such as Part IV on big data integration.

Most existing books in this area tend to focus on the implementation of specific technologies or solutions, but here is a broad inventory of data integration solutions, the various techniques, their benefits, and challenges.

Part 1: Introduction to data integration
This first section of the book focuses on the importance of achieving data integration in an organization’s data management planning and making managing data in motion much more efficient.

Chapter 1: The importance of data integration
This chapter presents the reasons why data integration should be an important topic for managing data in an organization and why almost every organization should have a data integration layer in its IT architecture.

Chapter 2: What is data integration?
This chapter presents an overview of the processes that collectively can be referred to as data integration: moving data, transforming data, migrating data from one application to another, pulling information together, and distributing processes out to data.

Chapter 3: Types and complexity of data integration
This chapter presents an overview of the types of data integration, each of which is covered in detail in a separate section of the book: batch data integration, real-time data integration, and big data integration.

Chapter 4: The process of data integration development
This chapter deals with the standard development life cycle on data integration projects and the types of resources needed to be successful.

Part 2: Batch data integration
The second section of the book discusses data integration techniques and technologies associated with very large-volume data movement, which has usually been performed with batch, or asynchronous, data integration, used for activities such as data conversions and moving data into data warehouses.

Chapter 5: Introduction to batch data integration
This chapter describes the batch data integration capability.

Chapter 6: Extract, transform, and load
This chapter describes the core data integration ETL process flow that is used for all types of data integration, and most especially for batch data integration.

Chapter 7: Data warehousing
This chapter provides a description of data warehousing and most especially how data is loaded into and extracted out of a data warehouse. There is also an interview with data warehousing expert Krish Krishnan on data warehousing and data integration.

Chapter 8: Data conversion
This chapter describes the process of data conversion, also known as data migration, and how data integration is integral to the data conversion process.

Chapter 9: Data archiving
This chapter focuses on the process of data archiving and the importance of data integration. It includes an interview with expert James Anderson on data archiving and data integration.

Chapter 10: Batch data integration architecture and metadata
This chapter outlines the tools necessary to implement a batch data integration capability for profiling, modeling, storing metadata, performing data movement, transformation, and scheduling. It includes an interview with metadata expert Adrienne Tannenbaum on metadata and data integration.

Part 3: Real-time data integration
The third section of the book discusses data integration best practices associated with real-time or synchronous data integration, used for passing information between interactive operational applications and systems.

Chapter 11: Introduction to real-time data integration
This chapter describes the real-time data integration capability.

Chapter 12: Data integration patterns
This chapter categorizes and describes the basic patterns used in real-time data integration, especially the critical “hub-and-spoke” approach to data interfaces.

Chapter 13: Core real-time data integration technologies
This chapter provides a brief description of the key technologies used in real-time data integration: ESB, SOA, XML, EAI, and EII. Included is an interview with XML expert M. David Allen on the importance of XML in real-time data integration and an interview with data integration expert David Linthicum on enterprise service buses.

Chapter 14: Data integration modeling
This chapter describes the critical area of data modeling in data integration, especially real-time data integration. Data-modeling expert Dagna Gaythorpe provides an interview on some of her experiences with canonical data modeling, integral to hub-and-spoke interface architecture.

Chapter 15: Master data management
This chapter describes the area of master data management and how data integration is intertwined and essential to its success.

Chapter 16: Data warehousing with real-time updates
This chapter provides a description of real-time updates in data warehousing. Data warehousing expert Krish Krishnan continues his interview on data warehousing and real-time data integration.

Chapter 17: Real-time data integration architecture and metadata
This chapter outlines the tools necessary to implement a real-time data integration capability for profiling, modeling, storing metadata, performing data movement, transformation, and event orchestration.

Part 4: Big data integration
The last section of the book is about data integration associated with emerging technologies such as cloud computing, visualization, massively parallel processing, and data virtualization.

Chapter 18: Introduction to big data integration
This chapter provides an introduction to the data integration capability with big data.

Chapter 19: Cloud architecture and data integration
This chapter provides a brief description of cloud architecture and the additional issues and concerns associated with integrating data from within an organization with data located in a cloud solution.

Chapter 20: Data virtualization
This chapter describes data virtualization, which is the technology central to big data integration.

Chapter 21: Big data integration
This chapter gives an overview of big data and then outlines the tools necessary to implement a big data integration capability.

Chapter 22: Conclusion to managing data in motion
This chapter concludes the presentation of data integration and summarizes the technical solutions required for implementing various data integration capabilities: batch data integration, real-time data integration, and big data integration.


PART 1 Introduction to Data Integration

CHAPTER 1 The Importance of Data Integration

INFORMATION IN THIS CHAPTER
• The natural complexity of data interfaces
• The rise of purchased vendor packages
• Key enablement of big data and virtualization

The natural complexity of data interfaces
The average corporate computing environment is comprised of hundreds or even thousands of disparate and changing computer systems that have been built, purchased, and acquired. The data from these various systems needs to be integrated for reporting and analysis, shared for business transaction processing, and converted from one system format to another when old systems are replaced and new systems are acquired. Effectively managing the data passing between systems is a major challenge and a concern for every information technology organization.

Most data management focus is around data stored in structures such as databases and files, and a much smaller focus is on the data flowing between and around the data structures. Yet, management of the data interfaces in organizations is rapidly becoming a main concern for business and information technology (IT) management. As more systems are added to an organization’s portfolio, the number and complexity of the interfaces between the systems grow dramatically, making management of those interfaces overwhelming.

Traditional interface development quickly leads to a level of complexity that is unmanageable. The number of potential interfaces between applications and systems grows far faster than the number of systems, roughly with the square of the number of systems. In practice, not every system needs to interface with every other, but there may be multiple interfaces between systems for different types of data or needs. For an organization with 100 applications, there may be something like 5000 interfaces. A portfolio of 1000 applications may provide half a million interfaces to manage.

FIGURE 1.1 Point-to-Point Interface Complexity.

An implementation of some data integration best practice techniques can make the management of an organization’s interfaces much more reasonable than the traditional “point to point” data integration solutions, as depicted in Figure 1.1, which generate this type of management challenge. An organization that develops interfaces without an enterprise data integration strategy can quickly find managing the vast numbers of interfaces that ensue impossible.
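For readers who prefer to see the arithmetic, the short Python sketch below compares the worst-case number of point-to-point interfaces with the number of connections needed when every application talks to a single shared hub. The counts assume that every pair of applications exchanges data, which is an upper bound rather than what any real portfolio looks like.

```python
# Illustrative only: point-to-point interface counts versus hub connections.

def point_to_point_interfaces(applications: int) -> int:
    """Worst case: every pair of applications exchanges data directly."""
    return applications * (applications - 1) // 2

def hub_and_spoke_interfaces(applications: int) -> int:
    """Each application connects once to a shared integration hub."""
    return applications

for n in (100, 1000):
    print(f"{n} applications: "
          f"{point_to_point_interfaces(n):,} point-to-point interfaces vs "
          f"{hub_and_spoke_interfaces(n):,} hub connections")
```

Running it gives roughly 4,950 pairwise interfaces for 100 applications and about 499,500 for 1,000, which is where figures like “something like 5000” and “half a million” come from.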

The rise of purchased vendor packages
It has been the general consensus for years that, in most cases, except for strategically differentiating applications, it is more cost effective to purchase packages and configure them appropriately for the organization, thus sharing the cost of developing functionality, supporting new feature development, and detecting and resolving problems across the various organizations using the software. Another term for purchased package is COTS (Commercial Off the Shelf) software.

Since the vast majority of software applications being implemented at organizations now are purchased vendor packages, the work and process of integrating the specific portfolio of software being run in a particular organization is one of the few remaining custom development activities. Software vendors can develop systems in ways to support integration and interactions with the other systems in the portfolio, but the specific portfolio of systems requiring integration in an organization, and therefore the data integration solution, are unique to each organization.

Most additions to the application system portfolio are purchased packages, but packages invariably will contain their own definition of important master data structures, such as customer, product, and organization hierarchy. Since master data will invariably exist in any custom applications as well as any other packages in the portfolio, it will be necessary to integrate the master data across the applications. Therefore, although the practice of purchasing application solutions rather than building custom makes the management and ongoing support of an application portfolio somewhat easier, it also makes the required integration of data across the application portfolio more complex than if all applications were custom developed and utilized common data structures.

Key enablement of big data and virtualization
In the emerging areas of big data, cloud processing, and data virtualization, critical components of the implementation of these technologies and solutions are data integration techniques. With big data, it is frequently a better solution, rather than consolidating data prior to analysis, to leave the vast amounts and various types of data where they are and distribute the processing out to the data, that is, a parallel processing solution. When the results of the requests have acted on the distributed data, the results need to be consolidated and returned. Data integration is critical to big data solutions, but the solutions may be significantly different from traditional data integration.

FIGURE 1.2 Big Data Architecture.

As depicted in Figure 1.2, the arrows indicate the existence of data integration solutions to transport and consolidate data between the various and sundry data structures.

Cloud architectures with the external and virtual server solutions, data replication, and need for fault-tolerant solutions rely on data integration solutions. Again, however, the implementation of data integration solutions in a cloud environment is very different from those in traditional data centers, but builds on the basic concepts developed over the last two decades in data integration.

Infrastructure and server virtualization is widely implemented in many organizations because of the flexibility in infrastructure management it allows. These solutions play nicely with some data integration solutions but require adjustments with others, such as enterprise service bus technology.

In-memory data structures and processing can provide performance improvements of multiple magnitudes and rely on many data integration techniques, but they need to be implemented in ways that leverage the strengths of in-memory processing capabilities.

The capabilities of data virtualization represent the culmination of more than two decades of experience with data integration and many thousands of hours fine tuning various data integration techniques and technologies. With all its benefits, data virtualization can be viewed as a breakthrough that stands upon the experience with the disciplines of data warehousing, business intelligence, and, most critically, data integration.


CHAPTER 2 What Is Data Integration?

INFORMATION IN THIS CHAPTER
• Data in motion
• Integrating into a common format—transforming data
• Migrating data from one system to another
• Moving data around the organization
• Pulling information from unstructured data
• Moving process to data

Data in motion
Planning the management of data in data stores is about “persistent” data that sits still. Managing the data that travels between systems, applications, data stores, and organizations—the “data in motion”—is central to the effectiveness of any organization and the primary subject of this book.

It shouldn’t be news that available, trusted data is absolutely critical to the success of every organization. The process of making the data “trusted” is the subject of data governance and data quality, but making the data “available”—getting the data to the right place, at the right time, and in the right format—is the subject of data integration.

The practice associated with managing data that travels between applications, data stores, systems, and organizations is traditionally called data integration (DAMA International, 2009). This terminology may be a little misleading to those who are not used to the term. Data integration intuitively sounds to be more about the consolidation of data, but it is the movement, not the persistence, that is the focus. Data interface refers to an application written to implement the movement of data between systems.

Integrating into a common format—transforming data
Usually, the most complex and difficult part of integrating data is transforming data into a common format. Understanding the data to be combined and understanding (and possibly defining) the structure of the combined data requires both a technical and business understanding of the data and data structures in order to define how the data needs to be transformed.

FIGURE 2.1 Transforming Data into a Common Format.

In Figure 2.1, multiple sources of data of different formats are transformed into an integrated target data set. Many data transformations are accomplished simply by changing the technical format of the data, but frequently, as depicted in the diagram, additional information needs to be provided to look up how the source data should be transformed from one set of values to another.
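To make the lookup idea concrete, here is a minimal Python sketch of a source-to-target transformation; the field names, date format, and code values are invented for illustration, and real mappings come from business analysis of the actual source and target systems.

```python
# A minimal sketch of transforming source records into a common target format.
# The field names, code values, and lookup table are hypothetical.

GENDER_LOOKUP = {"M": "Male", "F": "Female", "1": "Male", "2": "Female"}

def transform(record: dict) -> dict:
    """Map one source record to the agreed common format."""
    return {
        "customer_id": record["cust_no"].strip(),
        # Simple technical reformatting: MMDDYYYY reshaped into an ISO date.
        "birth_date": f"{record['dob'][4:]}-{record['dob'][0:2]}-{record['dob'][2:4]}",
        # Value translation via a lookup table rather than a format change.
        "gender": GENDER_LOOKUP.get(record["sex_cd"], "Unknown"),
    }

source = {"cust_no": " 00123 ", "dob": "07041989", "sex_cd": "2"}
print(transform(source))
# {'customer_id': '00123', 'birth_date': '1989-07-04', 'gender': 'Female'}
```

The first two fields change only in technical format, while the gender field needs the extra lookup table that the diagram alludes to, because the source and target use different sets of values.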

Migrating data from one system to another
When an application in an organization is replaced, either by a new custom application or by a purchased package, data from the old system needs to be migrated to the new application. The new application may already be in production use and additional data is being added, or the application may not yet be in use and the data being added will populate empty data structures.

FIGURE 2.2 Migrating Data from One Application to Another.

As shown in Figure 2.2, the data conversion process interacts with the source and target application systems to move and transform from the technical format needed by the source system to the format and structure needed by the target system. This is best practice, especially to allow a data update to be performed by the owning application code rather than updating the target data structures directly. There are times, however, when the process interacts directly with the source or target data structures instead of the application interfaces.
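A rough sketch of that migration pattern follows; the class and function names are hypothetical placeholders rather than any particular product’s API, and the point is simply that records flow through the target application’s own load interface so its validation rules run.

```python
# Sketch of a conversion loop: extract, transform, load via the target
# application's own interface. All names here are illustrative placeholders.

def migrate(source_rows, transform, target_api):
    """Move rows from the old system into the new application."""
    loaded, rejected = 0, []
    for row in source_rows:
        candidate = transform(row)
        try:
            # Preferred: call the target application's load interface so its
            # business rules run, rather than writing its tables directly.
            target_api.create_record(candidate)
            loaded += 1
        except ValueError as err:            # target rejected the record
            rejected.append((row, str(err)))
    return loaded, rejected

class TargetAPIStub:
    """Stand-in for the new application's load interface."""
    def create_record(self, record):
        if not record.get("customer_id"):
            raise ValueError("customer_id is required")

loaded, rejected = migrate(
    source_rows=[{"id": "42"}, {"id": ""}],
    transform=lambda row: {"customer_id": row["id"]},
    target_api=TargetAPIStub(),
)
print(loaded, rejected)   # 1 record loaded, 1 rejected with its error message
```

Keeping the rejected records with their error messages supports the analysis and remediation work that conversion projects always need.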

Moving data around the organization
Most organizations of middle to large size have hundreds or, more probably, thousands of applications, each with its own various databases and other data stores. Whether the data stores are from traditional technologies and database management systems, emerging technologies, or other types of structures such as documents, messages, or audio files, it is critical to the organization that these applications can share information between them. Independent, stand-alone applications that do not share data with the rest of the organization are becoming less and less useful.

The focus of information technology planning in most organizations tends to be around the efficient management of data in databases and other data stores. This may be because ownership of the spaces between the applications running in an organization may be unclear, and so somewhat ignored. Data integration solutions have tended to be implemented as accompanying persistent data solutions such as data warehouses, master data management, business intelligence solutions, and metadata repositories.

Although traditional data interfaces were usually built between two systems “point to point,” with one sending and another receiving data, most data integration requirements really involve multiple application systems that want to be informed in real time of changes to data from multiple source application systems. Implementing all data interfaces as point-to-point solutions quickly becomes overwhelmingly complex and practically impossible for an organization to manage.

FIGURE 2.3 Moving Data into and out of Central Consolidation Points.

As depicted in Figure 2.3, specific data management solutions have been designed to centralize data for particular uses to simplify and standardize data integration for an organization, such as data warehousing and master data management.

FIGURE 2.4 Moving Data around the Organization.

Real-time data integration strategies and solutions now involve designs of data movement that are significantly more efficient than point to point, as depicted in Figure 2.4.
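The hub idea can be illustrated with a small in-memory publish-and-subscribe sketch; a production implementation would sit on messaging middleware rather than a Python class, and the subject name and messages are made up.

```python
# Minimal in-memory hub: sources publish a change once, any number of
# consuming systems subscribe, instead of building point-to-point feeds.
from collections import defaultdict
from typing import Callable

class IntegrationHub:
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, subject: str, handler: Callable[[dict], None]) -> None:
        self._subscribers[subject].append(handler)

    def publish(self, subject: str, message: dict) -> None:
        for handler in self._subscribers[subject]:
            handler(message)

hub = IntegrationHub()
hub.subscribe("customer.changed", lambda m: print("warehouse received", m))
hub.subscribe("customer.changed", lambda m: print("billing received", m))
hub.publish("customer.changed", {"customer_id": "00123", "status": "active"})
```

Each source publishes a change once, and adding a new consuming system means adding a subscription rather than building yet another point-to-point interface.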

Pulling information from unstructured data
In the past, most data integration projects involved almost exclusively data stored in databases. Now, it is imperative that organizations integrate their database (or structured) data with data in documents, e-mail, websites, social media, audio, and video files. The common term for data outside of databases is unstructured data.

Integration of data of various types and formats usually involves use of the keys or tags (or metadata) associated with unstructured data that contains information relating the data to a customer, product, employee, or other piece of master data. By analyzing unstructured data containing text, it may be possible to associate the unstructured data with a customer or product. Thus, an e-mail may contain references to customers or products that can be identified from the text and added as tags to the e-mail. A video may contain images of a customer that can be matched to the customer image, tagged, and linked to the customer information. Metadata and master data are important concepts that are used to integrate structured and unstructured data.

FIGURE 2.5 Pulling Information from Unstructured Data.

As shown in Figure 2.5, data found outside of databases, such as documents, e-mail, audio, and video files, can be searched for customers, products, employees, or other important master data references. Master data references are attached to the unstructured data as metadata tags that then can be used to integrate the data with other sources and types.
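A highly simplified sketch of that tagging step appears below; real implementations rely on entity resolution and fuzzy matching rather than exact string search, and the customer names and identifiers are invented.

```python
# Scan unstructured text for known master data values and attach the matching
# identifiers as metadata tags. Names and IDs are made up for illustration.
import re

CUSTOMER_MASTER = {
    "C001": "Acme Industries",
    "C002": "Globex Corporation",
}

def tag_document(text: str) -> dict:
    """Return the document plus master data reference tags found in its text."""
    tags = [
        cust_id
        for cust_id, name in CUSTOMER_MASTER.items()
        if re.search(re.escape(name), text, re.IGNORECASE)
    ]
    return {"text": text, "customer_tags": tags}

email = "Please update the shipping address for Acme Industries before Friday."
print(tag_document(email))     # customer_tags: ['C001']
```

Once the tags are attached, the e-mail can be joined to the structured customer data just like any other record carrying the customer key.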

Moving process to data
In an age of huge expansion in the volume of data available to an organization (big data), sometimes it is more efficient to distribute processing to the multiple locations of the data rather than collecting data together (and thus duplicating) in order to process it. Big data solutions frequently approach data integration from a significantly different perspective than the traditional data integration solutions.

FIGURE 2.6 Moving Process to Data.

As shown in Figure 2.6, in some cases of working with very large volumes, it is more effective to move the process to the data and then consolidate the much smaller results. Emerging big data solutions are mostly used by programmers and technologists or highly skilled specialists such as data scientists.
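The following toy example illustrates the principle: each partition of data computes a small local summary where it lives, and only those summaries are brought together. The partitions here are ordinary Python lists standing in for data that would really be spread across nodes or systems.

```python
# "Moving process to data": compute small local summaries, consolidate those.
from collections import Counter

# Pretend each list lives on a different node or system.
partitions = [
    ["web", "web", "store"],      # node 1's transactions by channel
    ["store", "phone", "web"],    # node 2
    ["web", "phone"],             # node 3
]

def local_summary(records):
    """Runs where the data lives; returns a small result."""
    return Counter(records)

# Only the small Counter objects are "shipped" and merged centrally.
total = Counter()
for summary in map(local_summary, partitions):
    total += summary

print(total)   # Counter({'web': 4, 'store': 2, 'phone': 2})
```

Only three small summaries travel to the consolidation point instead of all of the underlying records, which is the essence of moving process to data.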

CHAPTER 3 Types and Complexity of Data Integration

INFORMATION IN THIS CHAPTER
• The differences and similarities in managing data in motion and persistent data
• Batch data integration
• Real-time data integration
• Big data integration
• Data virtualization

The differences and similarities in managing data in motion and persistent data
Data access and security management are primary concerns for both persistent and moving data. Persistent data security is usually managed in layers: physical, network, server, application, and at the data store. Data traveling between applications and organizations needs additional security to protect the data in transit from unauthorized access. Methods for securing data while in transit, using encryption at the sending point and decryption at the intended receiving point, are robust and complex areas of study with interesting histories involving spies and devices going back to the beginning of recorded history, none of which is a topic in this book.

Recovery from failures in processing is a critical subject for both persistent data processing and transient data processing. The techniques for recovery are very different but with some related approaches for the two methods. Actually, every technology and tool may have different methods for recovery that will make them appropriate for various business and technical solutions. The two things that are important to determine in choosing an appropriate solution are how much data can be allowed to be lost in a failure and how long the systems can be down before recovery must occur. The smaller the amount of data it is acceptable to lose and the smaller the amount of acceptable downtime, the more expensive the appropriate recovery solution. But the subject of business continuity and recovery from failure is also not a part of this book.

With persistent data, much of the concern is about the model or structure of the data being stored. In managing data in motion, the largest concern is how to associate, map, and transform data between different systems. There is one important part of implementing data integration solutions that does involve modeling the data in transit: the use of a central model of the data passing between applications, which is called canonical modeling.

Batch data integration
Batch data integration occurs when data to be transferred from a source to a target is grouped together and sent periodically, such as daily, weekly, or monthly. Most interfaces between systems in the past used to be in the form of passing a large file of data from one system to another on a periodic basis. The contents of the file would be records of consistent layout, and the sending and receiving application systems would agree to and understand the format. The passing of data between two systems whereby a sending system passes data to a target receiving system is called “point to point.” The data file would be processed by the receiving system at some point in time, not necessarily instantaneously; thus the interface would be “asynchronous” because the sending system would not be waiting for an immediate confirmation before the transaction would be considered complete.

The “batch” approach to data integration is still appropriate and effective for very large data interactions such as data conversions and loading data snapshots into data warehouses. This type of interface can be tuned to be extremely fast and is appropriate where very large volumes of data need to be loaded as quickly as possible.

It is also described as “tightly coupled” because the format of the data file must be agreed to between the systems and can only change successfully if the two systems involved incorporate knowledge of the change simultaneously. Tight coupling requires carefully managed orchestration of changes to multiple systems to be implemented at the same time in order for the interfaces not to “break,” or stop working either correctly or at all. In managing large portfolios of application systems, it is preferable to have a looser coupling of system interfaces in order to allow changes to applications that don’t immediately break other systems and don’t require such careful coordination of simultaneous changes. Therefore, it is usually preferable that data integration solutions are “loosely coupled.”
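A small sketch of such a file-based batch interface is shown below; the file name and column layout are illustrative, but the agreed column list is exactly the kind of shared contract that makes this style tightly coupled.

```python
# Classic batch interface: the sender writes a delimited extract file in an
# agreed layout, and the receiver processes it later (asynchronously).
import csv

AGREED_COLUMNS = ["account_id", "balance", "as_of_date"]   # the shared contract

def write_extract(path, rows):
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=AGREED_COLUMNS)
        writer.writeheader()
        writer.writerows(rows)

def load_extract(path):
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

write_extract("balances_20240131.csv", [
    {"account_id": "A-100", "balance": "2500.00", "as_of_date": "2024-01-31"},
])
print(load_extract("balances_20240131.csv"))
```

If either side changes the agreed column list without coordinating with the other, the interface breaks, which is why looser coupling is usually preferred across a large portfolio.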

Real-time data integration
Interfaces that are necessary across systems immediately in order to complete a single business transaction are called “real-time” interfaces. Usually they would involve a much smaller amount of data being passed in the form of a “message.” Most real-time interfaces are still point to point between a sending and receiving system and tightly coupled, because the sending and receiving systems still have specific agreement as to the format, such that any change must be made to the two systems simultaneously. Real-time interfaces are usually called synchronous because the transaction will wait for the data interface to complete its processing in both the sending and receiving systems.

Best practices in real-time data integration solutions break away from the complexity problems of point-to-point and tightly coupled interface design. There are logical design solutions that can be implemented in various technologies. These technologies can be used to implement inefficient data integration as well, if the underlying design concerns are not understood.
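The difference between synchronous and asynchronous interaction can be sketched in a few lines; the receiving system here is just a Python function and the queue is in-process, where a real solution would use messaging middleware such as an enterprise service bus or message broker.

```python
# A stylized contrast between synchronous and asynchronous interaction.
import queue

def receiving_system(message: dict) -> dict:
    """Target system processing; returns a confirmation."""
    return {"status": "accepted", "order_id": message["order_id"]}

# Synchronous (real-time): the business transaction waits for the confirmation
# before it can be considered complete.
confirmation = receiving_system({"order_id": "SO-987", "qty": 3})
assert confirmation["status"] == "accepted"

# Asynchronous (batch or message queue): the sender posts the message and
# moves on; the receiver picks it up whenever it runs.
outbox = queue.Queue()
outbox.put({"order_id": "SO-988", "qty": 1})
# ... later, possibly in another process or on another schedule ...
while not outbox.empty():
    receiving_system(outbox.get())
```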

Big data integration
The term “big data” indicates that there are large volumes of data, as well as data of various technologies and types. Taking into account the extra volumes and various formats, data integration of big data may involve distributing the processing of the data to be performed across the source data in parallel and only integrating the results, because consolidating the data first may take too much time and cost too much in extra storage space. Integrating structured and unstructured data involves tying together common information between them, which is probably represented as master data or keys in structured data in databases and as metadata tags or embedded content in unstructured data.

Data virtualization
Data virtualization involves using various data integration techniques to consolidate data real-time from various sources and technologies, not just structured data. “Data warehousing” is a practice in data management whereby data is copied from various operational systems into a persistent data store in a consistent format to be used for analysis and reporting. The practice is used to do analysis across snapshots of historical data, among other things, which is difficult using active operational data. Even when the data required for analysis is only current data, the reporting and analysis architecture usually involves some persistent data store, such as a “data mart,” because the real-time integration and harmonizing of data from various sources has previously been found to be too slow for real-time consumption.

However, new data virtualization technologies make real-time data integration for analysis feasible, especially when used in conjunction with data warehousing. Emerging technologies using in-memory data stores and other virtualization approaches allow for very fast data integration solutions that do not have to rely on intermediate persistent data stores, such as data warehouses and data marts.
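A toy illustration of the idea follows: the combined “view” is computed at query time from live sources rather than persisted anywhere. The two sources here are plain Python functions standing in for the databases, files, or services a real data virtualization server would federate.

```python
# Results assembled on demand from live sources, with no intermediate
# warehouse or mart. Source names and fields are invented for illustration.

def crm_source():                 # e.g., a relational database
    return [{"customer_id": "C001", "name": "Acme Industries"}]

def order_source():               # e.g., a service or document store
    return [{"customer_id": "C001", "open_orders": 7},
            {"customer_id": "C002", "open_orders": 2}]

def virtual_customer_view():
    """A 'view' computed at query time rather than persisted."""
    orders = {row["customer_id"]: row["open_orders"] for row in order_source()}
    for customer in crm_source():
        yield {**customer, "open_orders": orders.get(customer["customer_id"], 0)}

print(list(virtual_customer_view()))
```

The consumer sees one integrated result, while the underlying data stays where it lives and is fetched only when the view is queried.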

CHAPTER 4 The Process of Data Integration Development

INFORMATION IN THIS CHAPTER
• The data integration development life cycle
• Inclusion of business knowledge and expertise

The data integration development life cycle The life cycle for developing a new interface between systems is very similar to that of other data-related development disciplines. Critical to success is the analysis of the actual data on both the source and target sides of the proposed data move- ment. Also, although Figure 4.1 shows that the steps toward implementation and operation are sequential and distinct, they are actually more iterative and overlap- ping than depicted, with testing of assumptions and designs being performed as early as possible using analysis tools and prototypes. Figure 4.1 depicts the data integration life cycle. The first part of the life cycle is the scoping of the project: high-level requirements, high-level design, data requirements, identification of sources and targets. The process starts with high- level requirements: What is the basic data movement requirement that needs to be met? It could be that customer data needs to be synchronized across the enterprise, some data from an external organization needs to be made available, additional data should be included in reports, social media data should be used for predictive analytics, or a myriad of other possible data movement requirements. Then, some basic concept of design is specified: Does this need to happen once a day in batch or real time? Is there a data integration platform in place ready for use, or is some- thing additional needed? Another more detailed round of requirements and design should identify what data is needed and potential sources and targets of the data in question. The next part of the data integration life cycle is frequently ignored: profiling. Because data integration is regarded as a technical discipline and because, right- fully, organizations are sensitive about giving access to production data, it may be difficult to analyze the data that currently resides in the data stores of potential 19 20 CHAPTER 4 The Process of Data Integration Development

FIGURE 4.1 Data Integration Life Cycle.

The step of profiling the actual data to be involved in the data integration is critical to successful development. Every data integration project discovers something in the actual data of the potential sources and targets that significantly affects the design of the solution, whether it is unexpected content, the lack of content, poor quality data, or even that the data required doesn't actually exist where it was expected. Negotiation with data owners and data security groups should be continued until an acceptable solution is achieved that allows data profiling on the proposed source and target data. All data integration solutions should include some proving process that is executed periodically while the data interface is in production to ensure that the data from the source applications has successfully been incorporated into the data structures of the target applications. Data proving should be done using a means of extract alternative to the data interface itself, such as running the same report on the source and target systems and comparing the results. Proving is essential for data conversion projects but is important for all data integration.

Inclusion of business knowledge and expertise

Data integration has been viewed as a very technical practice within data management, wholly the domain of technologists, as opposed to the other extreme of data governance and data quality, which are almost wholly business process-oriented practices. However, effective data integration also requires a business understanding of the data being passed between systems.

Many applications and projects that rely on data integration (conversion, data warehousing, master data management) are subject to delays in implementation, not because of a lack of technological solutions but because of a lack of business knowledge and business involvement. One of the core processes in data integration development, specifying the data transformations, must be reviewed and verified by someone with detailed business understanding of the data. Many tools can be used to try to deduce the relationships between data in various systems, either through similarities in the field names or an actual analysis of the data contents. Ultimately, however, these deductions are only educated guesses that must be verified by the actual business users of the data. The real challenge is that some of the steps in the data project life cycle as depicted in Figure 4.1 cannot be easily segregated between steps to assign to a technical resource versus those to assign to a business resource. The requirements steps and the specific mappings and transformation rules must be completed by someone who understands both aspects of the data in question or by people from multiple groups working closely together. To define the mappings and transformations, for example, knowledge of both the physical implementation of the data and how that same data is actually used in business processes is necessary. A technical resource might know, or can find out, that a particular field on a screen is mapped to a particular field in a data store that might or might not have a similar name. A business resource may know which field on the screen they want to extract for some purpose but may not know where that field is stored in the physical data structures. Profiling the candidate source data, once potential fields have been identified, will almost certainly highlight information about the actual data of which neither technical nor business resources were aware, but will require both technical and business resources to understand the profile results. All this information will be necessary in order to specify the correct mapping and data transformation requirements for the source data. Both technical and business knowledge of the source and target data are critical to the success of data integration projects, and this need for multiple resources from usually different functional areas to coordinate on all steps in the solution development life cycle is the most challenging aspect of data integration development.

PART 2 Batch Data Integration

CHAPTER 5 Introduction to Batch Data Integration

INFORMATION IN THIS CHAPTER
• What is batch data integration?
• Batch data integration life cycle

What is batch data integration?

Most interfaces between systems traditionally have been in the form of passing a large file of data from one system to another on a periodic basis, such as daily, weekly, or monthly. The contents of the file would be records of a consistent layout, and the format would be agreed to and understood between the sending and receiving application systems. This process is called batch mode because the data is "batched" into groups and sent periodically rather than individually in real time. This standard batch data integration process is depicted in Figure 5.1. The passing of data between two systems in which a sending system passes data to a target receiving system is called point to point. This "batch" approach to data integration is still appropriate and effective for very large data interactions such as data conversions and loading periodic snapshots into data warehouses. The data loading can be tuned to be extremely fast and is useful where very large volumes of data need to be loaded as quickly as possible. It is also described as "tightly coupled" because the systems must agree on the format of the data file, and the format can only change successfully if the two systems involved implement knowledge of the change simultaneously. Replacing either the sending or receiving system usually involves the need to rewrite the interface for the other system. Data backup and recovery is not usually addressed under the subject of data integration, and it won't be covered in this book. Data backup and recovery is usually handled by specialized hardware and software utilities that are proprietary to the data structures in which the data is stored and tend to be highly optimized and efficient.


FIGURE 5.1 Extract, Transformation, and Load Data Flow.

The subject is discussed by specific data structure and storage vendors and under the topic of business continuity.

Batch data integration life cycle

The batch data integration development life cycle is similar to that of other data-related projects but has a slightly different recommended life cycle from projects that are not data centric. Prior to the development of a batch data integration flow, or a batch data interface, the scope of the interface has to be defined regarding affected sources and targets, with an initial concept of the attributes to be involved. The scoping exercise involves high-level requirements and design in order to determine that batch data integration is an appropriate solution. An initial definition of the data to be involved in the interface is specified, including what attributes are required to be populated in the target and an educated guess as to what attributes would need to be extracted from the source or sources in order to determine the appropriate values of the target. A highly recommended best practice critical to the success of building a data interface is to profile the actual data in the source and target data structures. Profiling actual production data is essential to understanding the way the target data should look and in which sources the appropriate attributes will be found. Basic profiling will involve understanding various aspects of what the data in question actually looks like, not just what has been documented or thought to be the case, including uniqueness, density (nulls or blanks), format, and valid values. Specification of the required transformation of data from the source format to the target requires a combination of business and technical knowledge. People who are both business and technologically knowledgeable should review and agree to the specified design of the data transformation based on high-level requirements and profiling results.

FIGURE 5.2 Batch Data Integration Life Cycle.

The development of a data interface should include some kind of data-proving process that runs periodically to check that the data from the source systems is in agreement with what has been passed to the target systems. Figure 5.2 portrays a highly delineated series of steps. In practice, it is best to use an iterative, agile, or prototype approach to developing the data interface, where the data movement is proven without a large amount of time or effort passing between design and test.

CHAPTER 6 Extract, Transform, and Load

INFORMATION IN THIS CHAPTER
• What is ETL?
• Profiling
• Extract
• Staging
• Access layers
• Transform
• Simple mapping
• Lookups
• Aggregation and normalization
• Calculation
• Load

What is ETL?

The core function of data integration is to obtain data from where it currently resides, changing it to be compatible in format with the desired destination, and putting it into the destination system. These three steps are called extract, transform, and load (ETL). All data integration, regardless of whether it is performed in batch or real time, synchronously or asynchronously, physically or virtually, revolves around these basic actions. Although it is an extremely simple concept, large possible differences in designs and technology solutions exist for implementation. There are development and support roles with titles that begin with ETL (and other related acronyms) for the analysts, modelers, and programmers who develop these functions. Many vendor packages exist for performing ETL, some of which are specialized for particular styles of data integration in varying levels of complexity and prices. Some of the available packages are extremely complicated, and most require programmers with specialized training and experience in the particular package or tool in order to effectively implement solutions.
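To make the three steps concrete, the following is a minimal end-to-end sketch under assumed source and target layouts: rows are extracted from a source file, transformed into the target format, and loaded into a target table. The file name, field names, and table are illustrative assumptions, not an example from the book or any vendor tool.

```python
# A minimal ETL sketch: extract from a delimited source file, transform the
# record layout, and load into a target table.
import csv
import sqlite3

def extract(path="source_orders.csv"):                      # hypothetical source file
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Map source fields to target fields and adjust formats.
    return [
        (row["ORDER_NO"].strip(), row["CUST_ID"].strip(), float(row["AMT"]))
        for row in rows
    ]

def load(records, db_path="warehouse.db"):                   # hypothetical target store
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders (order_id TEXT, customer_id TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", records)
    conn.commit()
    conn.close()

if __name__ == "__main__":
    load(transform(extract()))
```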

Profiling

Profiling data involves reviewing and analyzing the actual data potentially to be included in an extract to further understand its format and content. Profiling should be performed when potential data for an extract has been identified through high-level requirements gathering. It can sometimes be difficult to arrange for profiling because the data may include sensitive or personal information, which makes it harder to grant access authority to the person who is to perform the profiling or analyze the results. It is critical to the success of the project that profiling not be skipped and that acceptable access authority be granted to enable this step to be performed. Profiling has been found to be absolutely essential to developing extracts that meet business needs. Profiling is used to confirm that the format and content of the source data meet the description given by business users, metadata, and documentation. In cases where one or more of these levels of documentation do not exist, profiling can be used to develop the necessary metadata and documentation. In practice, profiling usually identifies substantial differences from what was expected in the actual source data and can help to prevent unexpected project delays. Profiling tools can be used to report the inferred format and content of the actual source data: percentage of null or blank contents, number of distinct values and the instances with the highest occurrences, and format of contents. Some profiling tools can even find relationships between data across data stores, using the field names or the actual field contents to infer relationships. During data profiling, it is often found that the source data may need to be corrected and cleaned prior to implementation of ETL. In some cases the corrections can be automated, but it is more frequently the case that the corrections have to be done manually by a businessperson familiar with the data. A business decision must be made as to whether the corrections are made in the source data structure or whether the problems found with the data will be passed on to the target data store. No one wants to hear about unexpected issues found with the source data. Issues involving the source data can greatly affect project time estimates and resource requirements. However, the earlier in the project life cycle that issues are uncovered, the smaller the impact will be. It is critically important that profiling is performed early in the project life cycle to minimize the impact on the projects that are planning to use the data.
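As a rough sketch of the kind of output described above, the following computes, for each column of an already-extracted file, the percentage of blanks (density), the number of distinct values, and the most frequent values. It assumes the data has been extracted to a delimited file; it is not a substitute for a full profiling tool.

```python
# A minimal column-profiling sketch over an extracted delimited file.
import csv
from collections import Counter

def profile(path):
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    if not rows:
        return {}
    total = len(rows)
    report = {}
    for column in rows[0].keys():
        values = [(r[column] or "").strip() for r in rows]
        non_blank = [v for v in values if v]
        counts = Counter(non_blank)
        report[column] = {
            "percent_blank": 100.0 * (total - len(non_blank)) / total,
            "distinct_values": len(counts),
            "top_values": counts.most_common(5),   # highest-occurring instances
        }
    return report
```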

Extract

In order to perform the "extract" part of ETL, it is necessary to access the data in the system or data store in which it currently resides. A basic understanding of the format and meaning of the data is needed to select what data of interest is to be copied.

There are two basic approaches to extracting data: Either the current system, the source system, makes a copy of the data, or another system, say a specialized extract system, comes in and grabs the data. The benefit of the source system taking the extract is that the system currently storing the data (and the current system support staff) understands the meaning of the data and, of course, the technological solution in which it is stored. However, multiple potential resource problems arise when the source system performs the extract. Frequently, the source systems from which we want to extract data are production operational systems, and we don't want to negatively impact their production performance by adding more operations than they were designed to perform. Also, the source system support staff may be too busy to create an extract, may not be trained in the technology or tools being used for the extract, or may not consider the extract to be a priority. Using a specialized extract application to pull data from the system currently storing the data has the smallest impact on the source system, although there is probably still some use of the data store engine, such as the database management system or the file management system, and so there may still be some resource contention. A specialized extract application would be staffed with personnel trained in using the specific extract tools and technology. In practice, however, it frequently works best if the source system support staff creates the extract. Knowledge of the system and technology in which the data currently resides, as well as the format and meaning of the source data, tends to be the biggest stumbling block in trying to centralize extract development. Most specialized extract applications are efficient at extracting data from relational databases, either with generic SQL adapters or with specialized adapters for the particular database management system (DBMS). Many of these tools struggle with the oddities inherent in mainframe file structures, such as a COBOL (COmmon Business-Oriented Language) REDEFINES clause, in which the same storage may be described by more than one data layout, and other technology-specific structures.

Staging

The extract process implies that the result of the extract will be a "file" of data that will then be copied and transported to another location, usually another server or platform. Storing the extracted file on the source server platform and storing the extracted data copied onto the target server platform, as well as at any intermediate points, is called "staging the data." There is commonly a staging point on the source system server platform as well as on the target. Staging the data allows for an audit trail of what was sent and what was received. It also allows the timing of the processing of the data between the source and target systems to be loosely coupled, or asynchronous: The two systems do not have to be coordinated to process the data at the same time, but they can do so when it is most appropriate for each separately, as long as the extract from the source system occurs first.

However, reading and writing from disk, called I/O (for input/output), is sig- nificantly slower than processing in memory. Bypassing staging points in the design of an ETL solution may make the process as much as ten times faster, a significant difference when speed is critical, but would lose the benefits of the audit trail and loose system coupling.

Access layers

In order to design an extract solution, it is usually necessary to understand the various security access layers involved in protecting the data to be extracted, although some or all of the layers may be logically hidden or virtualized, where managing the access is performed automatically. The access layers involved include at least the organization network and firewall layers, the server layer (or physical layer), the operating system layer, the application layer, and the data structure layer. Where integration is across organizations, there is a security access layer that may have to be crossed into an organization's area of concern. Frequently, this is logically implemented with firewalls protecting networks. We constantly access data on the Internet virtually using Internet protocols, where a logical address is located within a particular organization on a particular server. Most organizations separate their internal activity from their interactions with external organizations and systems by implementing separate networks for each, separated by firewalls (logical protection) and very carefully designed, limited abilities to move data between the two. The logical area where interactions are made with external systems is referred to as the "DMZ" or demilitarized zone—a reference to demarcation areas between combatant forces during war. Different systems and data stores are frequently located on different physical servers, although the separation of servers may not be physical but may use logical partitioning and cloud computing technologies. Access to the server (logical or physical) will have its own security protection. Physical and logical servers will run operating system software that will also contain access security capabilities. If an extract process is accessing data using the application code API (application programming interface), a best practice, then the extract process may have to pass through the application security access layer. Ultimately, the data structure storing the data to be extracted, whether it is a database management system or a file storage system, will also have a security access layer. Some of these access layers will be irrelevant if an extract process is running in the same organization's network as the source and target, or if the extract process bypasses the application layer to access the data structure directly. Most of the access will be through logical layers that manage the complexity automatically with registries, such as a private cloud or database management system, where the physical server location of the data is virtualized by the access layer and need not be known by the extract process.

However, in designing an extract process, it will be necessary to know how all the security access layers are or are not being implemented. Data security is also frequently implemented using data masking, which limits or encrypts data of particular sensitivity or confidentiality while allowing access to the rest of the data in the data structure. This method is useful for allowing technologists, programmers, or consultants to analyze data in a data structure while limiting access to particularly sensitive information. Data masking may be implemented by transforming the data in question or by limiting access through the query.
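A minimal sketch of the transformation style of masking follows, assuming a known list of sensitive column names: sensitive values are replaced with a one-way hash so analysts can still count and join on them without seeing the underlying data. The column names are illustrative assumptions.

```python
# A minimal data masking sketch: replace sensitive fields with irreversible tokens.
import hashlib

SENSITIVE_COLUMNS = {"ssn", "date_of_birth", "account_number"}  # hypothetical list

def mask_row(row):
    """Return a copy of the row with sensitive fields replaced by hashed tokens."""
    masked = dict(row)
    for column in SENSITIVE_COLUMNS & masked.keys():
        digest = hashlib.sha256(str(masked[column]).encode("utf-8")).hexdigest()
        masked[column] = digest[:12]   # shortened, one-way token
    return masked
```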

Transform

The process of transforming data to be compatible with the target data structure may vary from being extremely simple to being impossible, or may require the collection of additional information. The transformation process requires detailed business requirements that the business owners of the data in question have either developed or approved, whereas the extract and load processes may be almost wholly technical processes requiring very little business review except to ensure that the correct data is being extracted.

Simple mapping

The simplest type of transformation is that in which an individual text or numeric field is defined and formatted consistently in both the source and target and simply has to be "mapped" between the two: It is specified that the particular source field should be copied to the particular target field. It might be almost as simple to make a slight adjustment in format, either because of differing technical implementations in the source and target or because of slight definition differences, such as the target being somewhat larger than the source field.
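The following sketch shows the simple mapping case under assumed field names: each source field is copied to its specified target field, with a trivial format adjustment along the way. The field map itself is hypothetical.

```python
# A minimal simple-mapping sketch: copy source fields to target fields as specified.
FIELD_MAP = {"CUST_NM": "customer_name", "CUST_NO": "customer_id"}  # hypothetical mapping

def map_record(source_record):
    target_record = {}
    for source_field, target_field in FIELD_MAP.items():
        # A slight format adjustment (trimming) between source and target.
        target_record[target_field] = str(source_record[source_field]).strip()
    return target_record

map_record({"CUST_NO": " 000123 ", "CUST_NM": "Acme Corp "})
# {'customer_name': 'Acme Corp', 'customer_id': '000123'}
```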

Lookups

A field in a data source may contain only one of a set of possible values that must be transformed to one of a different set in the target. The requirements will have to specify the mappings from source values to target values. If the source field is text and the target field is a set of possible values, then the transformation may have to be performed through a combination of string manipulation and manual data standardization.
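A minimal lookup sketch under assumed code values: the specified source-to-target value mapping is applied, and any unmapped value is routed to an error list for business review rather than silently passed through.

```python
# A minimal lookup transformation sketch with handling of unmapped values.
STATUS_LOOKUP = {"O": "OPEN", "C": "CLOSED", "P": "PENDING"}  # hypothetical mapping

def translate_status(source_value, errors):
    target_value = STATUS_LOOKUP.get(source_value.strip().upper())
    if target_value is None:
        errors.append(source_value)   # unmapped value: needs a business decision
    return target_value

errors = []
translated = [translate_status(v, errors) for v in ["o", "C", "X"]]
# translated == ["OPEN", "CLOSED", None]; errors == ["X"]
```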

Aggregation and normalization

Frequently, the transformation of data is much more complex than these simple cases suggest. Determining the value in the target data structure may require retrieving multiple pieces of data from the source that may be in various structures.

One record or row of data in the source may map to multiple records in the target, or vice versa. Where there are multiple sources of the same data, there may have to be rules specified that indicate how to determine whether two pieces of source data represent the same entity in the target and how to determine which source should be used if the information is different in the multiple sources. This type of processing is called "matching" and is a sophisticated area used especially in master data management (MDM).
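As a rough sketch of the many-to-one case, the following aggregates several source records into one target record by summing amounts per customer. In a real project the grouping keys and rules would come from approved business requirements; these names are illustrative assumptions.

```python
# A minimal aggregation sketch: many source transactions roll up to one target row.
from collections import defaultdict

def aggregate_by_customer(transactions):
    """transactions: iterable of dicts with 'customer_id' and 'amount' fields."""
    totals = defaultdict(float)
    for txn in transactions:
        totals[txn["customer_id"]] += float(txn["amount"])
    return [{"customer_id": cid, "total_amount": amt} for cid, amt in totals.items()]
```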

Calculation

The values to be stored in the target data structures may have to be derived from other data using some type of formula based on business rules, or based on data quality rules in cases where the source data is missing or incomplete. Ultimately, the data in the source data structure may be insufficient to provide the data required in the target data structure. The business requirements in this case may specify defaulting the target values, or a separate effort may be required to collect the required data through another internal or external source or a manual operation. It is important that profiling of the source and, if possible, the target data has occurred prior to finalizing the transformation requirements, since it is normal to find formats and contents during profiling that were not expected or previously documented. As with profiling, no one wants to hear about problems specifying data transformation, and it is critical that transformation business requirements are created early in any project so that additional required data collection can be planned and performed in a timely fashion. Business requirements for transformation must be reviewed and approved by businesspeople familiar with the data in question, as technical analysts and programmers may miss business implications and should not have final responsibility for these decisions.
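The sketch below shows a calculated target value with defaulting for missing source data. The rule itself is an illustrative assumption standing in for a rule that would come from approved business requirements.

```python
# A minimal derived-value sketch: default a missing discount and compute a net amount.
def derive_net_amount(row, default_discount=0.0):
    discount = float(row.get("discount") or default_discount)   # default when missing/blank
    return round(float(row["gross_amount"]) * (1.0 - discount), 2)

derive_net_amount({"gross_amount": "100.00", "discount": ""})     # 100.0
derive_net_amount({"gross_amount": "100.00", "discount": "0.2"})  # 80.0
```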

Load The “load” part of ETL has to do with getting the data into the target data structure. There is some discussion of whether the most efficient order of processing is extract, transform, load or rather extract, load, transform, the difference having to do with whether the transformation step should occur on the source server or the target server (or a separate server). This discussion is very relevant with regard to high-volume processing and to which engine will be able to do so fastest. Whether there are staging points for the extracted data at one or more points in the process will greatly affect the speed of processing as well. Ultimately, the last step in the ETL is getting the data into the target data structure, either physical or virtual. Load 35

The two main approaches to loading data into the target data store are either to write code to insert the data directly or to utilize the application code that already exists for inserting data into the target data store. Whenever possible, it is greatly preferable to use the existing application code, since built into this code is an understanding of how the data is supposed to be structured in the target data store. It is possible that the current application code is not written for high-volume loads and cannot be used in the load window available. Still, tuning the load process to enable use of the current application code is preferable to writing independent code to use for the data load. When the target data store is part of a purchased package, there are two additional reasons for using the application code to load data. Many purchased packages specify that all data must be loaded through their code because the data structures by themselves do not sufficiently describe the format of the data expected by the application. Package vendors usually provide API code that is recommended to be used to insert data into the target data store. Custom or legacy applications may not have formal API code available for use, but it is best to try to reuse the code that normally inserts the data into the data stores in order to load the data. This code may not be accessible, or it may be insufficient for reasons such as defaulting dates to the current date. Regardless of which method of loading data is used, rules built into the target data structures should not be disabled during the load process unless it is absolutely certain that all validation checks are being done prior to the data load process. Frequently, technicians will suggest that the loads can be done more swiftly with the data rules turned off, but there was a reason these rules were defined in the first place, and data that does not follow those encoded business rules is not meant to be in the target data structure. A process must be in place for reporting and resolving errors with the data to be loaded.
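The following sketch illustrates that last point under an assumed table layout: rows are loaded with the target's rules left in force, and rejected rows are reported for resolution rather than the rules being disabled. It is not a vendor API, just an illustration of the principle.

```python
# A minimal load sketch: keep target rules enabled and report rejects.
import sqlite3

def load_with_validation(records, db_path="warehouse.db"):   # hypothetical target
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS accounts "
        "(account_id TEXT NOT NULL, balance REAL NOT NULL CHECK (balance >= 0))"
    )
    rejects = []
    for account_id, balance in records:
        try:
            conn.execute("INSERT INTO accounts VALUES (?, ?)", (account_id, balance))
        except sqlite3.IntegrityError as exc:
            rejects.append((account_id, balance, str(exc)))   # report, don't bypass the rule
    conn.commit()
    conn.close()
    return rejects
```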

CHAPTER 7 Data Warehousing

INFORMATION IN THIS CHAPTER
• What is data warehousing?
• Layers in an enterprise data warehouse architecture
• Operational application layer
• External data
• Data staging areas coming into a data warehouse
• Data warehouse data structure
• Staging from data warehouse to data mart or business intelligence
• Business Intelligence Layer
• Types of data to load in a data warehouse
• Master data in a data warehouse
• Balance and snapshot data in a data warehouse
• Transactional data in a data warehouse
• Events
• Reconciliation
• Interview with an expert: Krish Krishnan on data warehousing and data integration

What is data warehousing?

Data warehouses are data constructs (and associated applications) used as central repositories of data to provide consistent sources for analysis and reporting. Enterprise data warehouses (EDWs) are created for the entire organization to be able to analyze information from across the entire organization. Frequently, very large organizations will have multiple enterprise data warehouses, each having data either from large parts of the organization, such as regions, or from large functional areas. Batch data integration solutions are generally used for putting data into and taking data out of a data warehouse. Data warehousing architectures are designed to have consistent data available for the entire organization to use for analysis, to format data particularly for analysis and reporting purposes, to take the stress of analytical reporting needs off the operational systems, and to allow for historical snapshots of data.1


Data integration techniques are so critical to a functioning data warehouse that some experts in data warehousing consider data integration to be a subset of data warehousing architecture techniques. However, data integration is critical to other data management areas as well and is an independent area of data management practice.

Layers in an enterprise data warehouse architecture

Data coming into the data warehouse and leaving the data warehouse use extract, transform, and load (ETL) to pass through logical structural layers of the architecture that are connected using data integration technologies, as depicted in Figure 7.1, where the data passes from left to right, from source systems to the data warehouse and then to the business intelligence layer. In many organizations, the enterprise data warehouse is the primary user of data integration and may have sophisticated vendor data integration tools specifically to support the data warehousing requirements. Data integration provides the flow of data between the various layers of the data warehouse architecture, entering and leaving.

Operational application layer

The operational application layer consists of the various sources of data to be fed into the data warehouse from the applications that perform the primary operational functions of the organization. This layer is where the portfolio of core application systems for the organization resides. Not all reporting is necessarily transferred to the data warehouse. Operational reporting concerning the processing within a particular application may remain within the application because the concerns are specific to the particular functionality and needs associated with the users of the application.

External data

Some data for the data warehouse may be coming from outside the organization. Data may be supplied for the warehouse, with further detail sourced from the organization's customers, suppliers, or other partners. Standard codes, valid values, and other reference data may be provided from government sources, industry organizations, or business exchanges. Additionally, many data warehouses enhance the data available in the organization with purchased data concerning consumers or customers.

1. Inmon, W. H. Building the Data Warehouse (Hoboken, NJ: John Wiley & Sons, 1992).

FIGURE 7.1 Data Warehouse Data Flow. (The figure shows data flowing from the operational application layer and external sources, through staging areas and extract files, into the data warehouse layer, and on to the data marts, analytics applications, and reporting systems of the business intelligence layer.)

External data must pass through additional security access layers for the network and organization, protecting the organization from harmful data and attacks. External data should be viewed as less likely to conform to the expected structure of its contents, since communication and agreement between separate organizations is usually somewhat harder than communication within the same organization. Profiling and quality monitoring of data acquired from external sources is very important, possibly even more critical than for data from internal sources. Integration with external data should be kept loosely coupled, with the expectation of potential changes in format and content.

Data staging areas coming into a data warehouse

Data coming into a data warehouse is usually staged, or stored in the original source format, in order to allow a loose coupling of the timing between the source and the data warehouse in terms of when the data is sent from the source and when it is loaded into the warehouse. The data staging area also allows for an audit trail of what data was sent, which can be used to analyze problems with data found in the warehouse or in reports. There is usually a staging area located with each of the data sources, as well as a staging area for all data coming in to the warehouse.

Some data warehouse architectures include an operational data store (ODS) for having data available in real time or near real time for analysis and reporting. Real-time data integration techniques will be described in later sections of this book.

Data warehouse data structure

The data in the data warehouse is usually formatted into a consistent logical structure for the enterprise, no longer dependent on the structure of the various sources of data. The structure of data in the data warehouse may be optimized for quick loading of high volumes of data from the various sources. If some analysis is performed directly on data in the warehouse, it may also be structured for efficient high-volume access, but usually that is done in separate data marts and specialized analytical structures in the business intelligence layer. Metadata concerning data in the data warehouse is very important for its effective use and is an important part of the data warehouse architecture: a clear understanding of the meaning of the data (business metadata), where it came from or its lineage (technical metadata), and when things happened (operational metadata). The metadata associated with the data in the warehouse should accompany the data that is provided to the business intelligence layer for analysis.

Staging from data warehouse to data mart or business intelligence

There may be separate staging areas for data coming out of the data warehouse and into the business intelligence structures in order to provide loose coupling and audit trails, as described earlier for data coming into the data warehouse. However, since writing data to disk and reading from disk (I/O operations) are very slow compared with processing, it may be deemed more efficient to tightly couple the data warehouse and business intelligence structures and skip much of the overhead of staging data coming out of the data warehouse as well as going into the business intelligence structures. An audit trail between the data warehouse and data marts may be a low priority, as it is less important than knowing when the data was last acquired or updated in the data warehouse and in the source application systems. Speed in making the data available for analysis is a larger concern.

Business Intelligence Layer

The business intelligence layer focuses on storing data efficiently for access and analysis. Data marts are data structures created to provide a particular part of an organization with data relevant to its analytical needs, structured for fast access. Data marts may also be for enterprise-wide use but using specialized structures or technologies. Extract files from the data warehouse are requested for local user use, for analysis, and for preparation of reports and presentations. Extract files should not usually be manually loaded into analytical and reporting systems.

Besides the inefficiency of manually transporting data between systems, the data may be changed in the process between the data warehouse and the target system, losing the chain of custody information that would concern an auditor. A more effective and trusted audit trail is created by automatically feeding data between systems. Extract files sometimes also need to be passed to external organizations and entities. As with all data passing out from the data warehouse, metadata fully describing the data should accompany extract files leaving the organization. Data from the data warehouse may also be fed into highly specialized reporting systems, such as those for customer statements or regulatory reporting, which may have their own data structures or may read data directly from the data warehouse. Data in the business intelligence layer may be accessed using internal or external web solutions, specialized reporting and analytical tools, or generic desktop tools. Appropriate access authority should be enforced, and audit trails should be stored tracking all data accesses into the data warehouse or business intelligence layers.

Types of data to load in a data warehouse

The various types of data to be loaded into the data warehouse are treated differently and have different life cycles. Some data is kept synchronized with data in the transaction system, and only the changes are tracked and passed into the data warehouse. For other types of data, a "snapshot" of the full set of data is taken at regular periods, regardless of whether there have been any changes, and the data is fully replicated each time it is loaded into the data warehouse.

Master data in a data warehouse

Master data is the data about the important items in an organization by which one would want to sort information, such as customers, products, suppliers, and employees. Reference data is the set of valid values associated with certain fields and may overlap somewhat with master data, in that one may want to view data sorted by reference data and it may be dynamic and have to be updated frequently: geographic areas, organizational hierarchies, industry codes. Both master data and reference data may include hierarchical information or associations between items, such as companies being owned by other companies or states located within countries. The full expression of hierarchical information with explicit data instances, such as the specific zip codes within each state and the states in a country, is known as a taxonomy. In data warehousing, master data and reference data are extremely important because these are the items by which people want to sort and display information and around which data is being consolidated for the enterprise (or the scope of the data warehouse). When structuring the data in the data warehouse and in the business intelligence layer, the items by which the data is usually sorted are called the dimensions.

Therefore, in data warehousing and business intelligence the terms master data and dimensions are frequently used as synonyms. Usually, the primary sources of the master data and reference data are outside the data warehouse, and any changes, additions, or deletions to the master data are fed into the data warehouse before associated transactions are loaded. Changes to master data and reference data may be fed to the data warehouse as soon as they occur in the source system or on a periodic basis such as daily, weekly, or monthly, as long as that is at the same frequency as, or more frequently than, the associated transaction data. In the data warehouse, changes to master data and reference data will be stored with associated effective start and end dates indicating when the changes are valid. Usually data in the data warehouse is structured to enable fast high-volume loading rather than fast high-volume retrieval. If structuring the data for query, however, the master data changes may be structured in three alternative ways: as if the master data had always been in its current effective state; as if each change to the master data created a new piece of master data; or as the change to the master data with its start and end effective dates associated. An example is a customer who reaches an elite classification level: Should queries on that customer show all their history in the elite classification or only the transactions following the point at which the customer became elite? The data can be structured for query either way, or to allow both, but it can be confusing to leave the decision open to the query; different analyses may report different results, depending on slight differences in making the query. In the staging area coming into the data warehouse, it is usually necessary to have a cross-reference table (or tables) to translate the identifiers and codes assigned to the master data and reference data in the various operational applications to those used in the data warehouse. These cross references themselves may have updates that are automatically fed from a source, possibly the master data applications, or updates may be a manual data warehouse system support activity that must be coordinated with any changes to codes used in the source applications. Master or reference data that is used as a primary item for sorting and searching in the data warehouse or in data marts is called a "dimension."2 Depending on the velocity of change of a particular set of master data or reference data, it may be described as a slowly changing or rapidly changing dimension. Rapidly changing dimensions may have to be specially tuned to handle particularly high volumes or high rates of change.
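The sketch below illustrates the third alternative described above, keeping master data changes with effective start and end dates (often described elsewhere as a "Type 2" slowly changing dimension). The structure, identifiers, and dates are illustrative assumptions.

```python
# A minimal sketch of storing master data changes with effective start/end dates.
from datetime import date

def apply_master_data_change(history, customer_id, new_attributes, change_date):
    """history: list of dicts holding each version of the master data record."""
    for row in history:
        if row["customer_id"] == customer_id and row["effective_end"] is None:
            row["effective_end"] = change_date           # close the current version
    history.append({
        "customer_id": customer_id,
        "attributes": new_attributes,
        "effective_start": change_date,
        "effective_end": None,                           # open-ended current version
    })

history = []
apply_master_data_change(history, "C001", {"tier": "standard"}, date(2012, 1, 1))
apply_master_data_change(history, "C001", {"tier": "elite"}, date(2012, 6, 1))
# history now holds both versions, so queries can present either view of the customer.
```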

Balance and snapshot data in a data warehouse

Usually, any kind of balance information from source applications, such as an inventory balance or currency balance, is copied as a point-in-time snapshot on a periodic basis, such as daily, weekly, or monthly, but it might be at smaller increments such as hourly.

2. Kimball, R. The Data Warehouse Toolkit (New York: John Wiley & Sons, 2002).

So, a copy of the entire set of balances is sent from the operational applications to the data warehouse. Any master data associated with the balances is transformed using the cross-reference tables in staging, and the balances are stored in the data warehouse with the appropriate point-in-time interval, the date, or the time the balance was copied. With multiple snapshots of a balance, historical analysis on the data can be made across the balances associated with a particular piece of master data (or dimension).

Transactional data in a data warehouse

There are two types of transactions: business transactions that have a life cycle with multiple events and statuses, and accounting transactions that post as credits and debits and offsets but are never changed. Each transaction is usually associated with each of the dimensions in the data warehouse, such as customer, product, geography, organizational structure, or standard industry codes. Accounting types of transactions are only added to the data warehouse, never changed. New accounting transactions since the last time an extract was taken from the operational applications are copied and passed to the data warehouse, or when a new accounting transaction is added to the operational application a copy is passed to the data warehouse. Business transactions can be managed in multiple ways with a data warehouse. Either periodic snapshots are taken of the business transactions or updates to the business transactions are passed to the data warehouse. Usually within the data warehouse only one copy of the business transaction is kept, with the history of when events surrounding the transaction occurred and when the status of the transaction changed. Keeping multiple snapshots of a business transaction in the data warehouse is rare, as analysis is usually on the events that caused changes or across transactions rather than on the state of a particular transaction over time. Business transactions, depending on the type of business, may each have very long life cycles. Sometimes business transactions themselves are treated as master data or dimensions, such as a contract or account.

Events

Events are things that occur at a point in time. They may be business transactions or be associated with business transactions. They are usually associated with the dimensions in the data warehouse. Events are like accounting transactions in their behavior: They are added to the data warehouse but usually are not updated.

Reconciliation

Periodic checks should be made that the data sent from the operational application systems and other sources into the data warehouse was received and saved correctly.

Comparison of the number of items sent and received, as well as comparison of financial totals, should be made for each source and type of data. Within the data warehouse, periodic checks should be made that events and transactions add up to balance changes.
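A minimal reconciliation sketch follows: it compares record counts and financial totals between what a source sent and what the warehouse received, and checks that transactions explain the change in a balance. The field names and tolerance are illustrative assumptions.

```python
# A minimal reconciliation sketch for a data warehouse load.
def reconcile_counts_and_totals(source_rows, warehouse_rows, amount_field="amount"):
    return {
        "count_match": len(source_rows) == len(warehouse_rows),
        "total_match": abs(
            sum(float(r[amount_field]) for r in source_rows)
            - sum(float(r[amount_field]) for r in warehouse_rows)
        ) < 0.01,
    }

def transactions_explain_balance(opening_balance, transactions, closing_balance):
    """Check that the transactions add up to the change in the balance."""
    expected = opening_balance + sum(float(t["amount"]) for t in transactions)
    return abs(expected - closing_balance) < 0.01
```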

A SIDEBAR FROM THE AUTHOR . . . ON DATA WAREHOUSING AND PROFILING

In my own experience, profiling is critical to data warehousing projects, as well as to every data-oriented project. Most data warehouse projects that I have worked on or observed have had an unexpected period just prior to going live where parts of the data structures have to be redesigned and other significant changes made, because the data in the source systems was different than assumed, described, or documented in existing metadata, and so the data in the data warehouse did not match what was expected. In one data warehouse project where I was acting as the data modeler and managing the DBAs (database administrators), a senior business analyst approached me with some independent (and unscheduled) analysis he had performed: The data coming from the accounting system looked entirely different from what we had been expecting in both the content of each individual record and the number of records. We were unprepared for the large volume of records being generated from the accounting system, where we were interested in only a small number of the types of records. Also, the content of individual records was extremely different than expected, and multiple records from the accounting system would be needed to acquire the full set of information we were expecting from each record. This discovery required me to redesign a section of the data warehouse logical model and add a new processing section to the data warehouse functionality to compute the information we were expecting from groups of accounting transactions. Luckily, we made this discovery early in the code development stage of the project, and so we were able to make the corrections to the model and requirements and add the extra coding requirements without affecting the overall project. From this and other similar experiences, I learned that data profiling of production data is a required analysis activity that cannot be skipped in the life cycle of any data-oriented project and should be performed before requirements are completed. Tension frequently exists between the need to perform profiling as part of analysis and the need to keep production data private, but in choosing battles to fight in a data warehouse project, this is not the battle to skip: usually an acceptable compromise can be reached, or else months should be added to the implementation schedule for the redesign needed prior to system implementation.

INTERVIEW WITH AN EXPERT: KRISH KRISHNAN ON DATA WAREHOUSING AND DATA INTEGRATION

Following is an interview with Krish Krishnan on the importance and use of data integration in data warehousing. Krish Krishnan has spent 12 years designing, architecting, and developing data warehouse solutions, in the last five years focusing on unstructured data integration into the data warehouse. He has been involved in some of the largest and most complex designs and architectures of next-generation data warehouses.

He is co-author of "Building the Unstructured Data Warehouse" with Bill Inmon.

How is data integration (i.e., moving and transforming data) related to data warehousing?

The underlying goal of data warehousing is to create an enterprise platform for providing a unified view of data in a single, integrated environment. The data in this environment is cleansed, transformed, and ready for consumption by the downstream applications. Data integration is the layer of processing data that manages all these activities, including movement, quality management, and transformation of data. The most complex activity in data integration is the transformation from one format to another, which includes cleaning, standardization, enrichment, and metadata processing. The absence of a data integration layer for the data warehouse would certainly lead to colossal failures, as data within the data warehouse would be loosely integrated, repetitive, and possibly processed multiple times for each business rule. The end state of such a chaotic architecture is the failure of the data warehouse. Hence, data integration is an integral part of the data warehouse for both structured and unstructured data.

How and why is data moved for data warehousing?

Data is the heartbeat of a data warehouse. In order for information to be current in a data warehouse, there is a continuum of data movement from different types of source applications to the data warehouse. In a traditional setup, the processing of data is divided into collection, transportation, cleansing, enrichment, and integration. There are multiple techniques to move and process data, including:

• ETL—The most common approach is "extract, transform, and load," known as ETL. Source data is collected from different data generators and processed to an intermediate layer called the staging area (optionally to an operational data store in some designs). The data is cleansed and enriched in the staging environment, which is the first transformation exercise, and further processed into the data warehouse, where the business rules and data integration rules are applied as the larger transformation.
• ELT—Another popular approach is called extract, load, and transform (ELT). In this approach, the data is extracted from the source applications and databases, loaded into the staging area with data enrichment and cleansing rules applied, and further processed into the data warehouse for pure integration-related transformation. This approach is preferred in situations where data is agile and very structured in nature, and requires minimal integration.
• CDC—A third technique is called change data capture (CDC), where a third-party application is installed on the source systems to collect changes occurring to the data. This method normally retrieves changes from the database log, an efficient approach that doesn't impact the source database. The changes are then extracted from the source and moved to the data warehouse. On the target side, the same third-party application is installed to process the data collected from every extract and load it into the data warehouse staging area. The data is then cleansed, enriched, and transformed to its final destination within the data warehouse. CDC is very useful for processing data in a near real-time situation where the data availability is very critical.
In certain architectures, the CDC data is directly loaded into the ODS for operational purposes and then loaded to the data warehouse staging layers. There are many custom derivations of the three techniques, and often a hybrid approach is adopted to satisfy the multiple competing requirements from the data warehouse for current data.

What kinds of transformations are done to the data being moved to the data warehouse (e.g., different data structures, conforming data structures, dimensional models)?

Several data transformations can be done with data processing to a data warehouse. The baseline transformation is to extract data from the source systems and move it into an integrated model in the data warehouse, which means a transformation from a highly normalized structure to a transformed and denormalized or dimensional structure. The transformations include the following.

• Data denormalization—transform highly normalized structures to a denormalized data warehouse format. Popular with the top-down or Inmon approach to building a data warehouse.
• Dimensionalization—transform highly normalized structures into dimensions and facts; a bottom-up approach or Kimball star schema model of a data warehouse.
• Metadata processing—integrate a business vocabulary with the data as it is processed in the data warehouse. Enables harmonization of different nomenclatures into one single definition.
• Master data processing—transform key structures of data to conform to a master data set, creating a highly integrated layer of data.
• Generating surrogate keys—using a common technique for dimensional data transformation to preserve history, create surrogate keys in the data warehouse.
• Coding data—create lookup tables and lists for compressing duplicate data.
• Pivoting data—employ multidimensional data transformation for loading data.
• Splitting data—change multivalued columns to single columns.
• Merging data—integrate data into single tables or data structures.
• Lookup data—create and enrich lookup data. Lookup data is a reference library where the identity value for the data is substituted in the place of the actual data.

Apart from the data transformations mentioned in this section, there are data model and data architecture-driven transformations through which the data undergoes several layers of transformations. Then the aggregate, hierarchy drill-down, and semantic layer transformations form the next set of transformations for data to be consumed by the business users. An often-overlooked transformation is the physical database-driven transformation, where tables are split vertically, partitioned into discrete structures, indexed, and sorted for better optimization at the storage level. These multiple transformations are the major operations in moving data to the data warehouse.

Are the data integration considerations for putting data into a data warehouse different from those for taking data out of a data warehouse?

Absolutely, yes. The data integration considerations for moving data into the data warehouse are to integrate the data for consumption by the business and other users. The inbound transformations will be to migrate the data from OLTP-like structures to a data warehouse model, which is mostly dimensional in nature or, at least, is more denormalized than the on-line transaction processing (OLTP) structures. On the other hand, extracting data from the data warehouse for downstream applications will be to satisfy specific reporting or analytical requirements. The data extracted for these purposes will be different in layout and structure. Typically, the data extracted for these purposes will include the transformed data structure and all the reference data, copies of master data, metadata, and any special data such as geospatial data.

In the extraction process of the data from the data warehouse, the data is extracted to maintain the referential integrity as it was loaded to the data warehouse; this is different from the referential integrity model of the base OLTP applications.

How is data warehousing for structured data different from data warehousing for unstructured data? Is the data integration different?

Data warehousing for structured data has a defined cycle. We start with requirements, then move to data modeling. Once we create the data model, we acquire and transform data to be stored into this data model for integration and downstream processing. This is a very read-efficient operation inasmuch as we can define the final state structure, and once data is loaded, the reads from the structures are very efficient. Data warehousing for unstructured data is very volatile in nature. We cannot anticipate the type of data, its format, structure, and quality of information until the data is acquired. The data model for this data is created after processing the data. This type of data processing is called "schema-less." The data is ingested as files and processed as files. In processing unstructured data, all the acquisition and processing of data is done prior to finding the data elements that can be used for integration with the data warehouse. In this approach there is no data integration, in a traditional sense, yet there is data integration from an analytics perspective.

Have you seen any data warehousing projects where data integration issues led to significant problems?

Yes, in my experience as an architect and an expert consultant, I have personally seen programs fail when clients have not understood the need to build a robust data integration architecture. In many situations where architects and managers of data warehouses have called for an independent expert opinion on the reasons for failure, evaluations of the underlying problems have pointed to poor data integration architecture. The most common impacts that I have seen from poor data integration in a data warehouse include

• Data quality problems
• Multiple values for a single column
• Date and time format problems
• Character sets and language translation
• Unicode support issues
• Currency formats

Poor data integration architecture has brought some of the best designed data warehouses to their knees and caused deployment failure and cost overruns.

Have you had experiences where particular attention was paid to data integration in data warehousing?

Yes. There are some brilliant data integration architectures that have created a success story for these data warehouses. The processing complexity of these warehouses is of the highest order: They run in a 24×7 service mode and have very stringent performance requirements. With such tight deadlines, the only failproof method to process data is by designing a scalable and flexible data integration architecture. The impressive aspect of this architecture is its ability to process multinational and multilingual data sets into a global data warehouse, with data enrichment, transformation, transportation, and integration all rolled into the processing architecture. Some of these clients have been awarded best practice awards for their adherence to a very efficient data integration design.

Some of these clients have been awarded best practices awards for their adherence to a very efficient data integration design.

Have you had experiences where data integration was neglected on a data warehousing project?

In most of the projects that I have worked on as an architect or a consultant, the emphasis has been on data integration. In some situations the client decided against implementing a full data integration architecture, and the team had to face several hurdles to implement the data warehouse. As end results of these situations, the cycle time to implement was two or three times the original duration, and the cost was three times the original cost for development.

Do you think there should be a separate staging area for loading data into a data warehouse? Why or why not?

I definitely support the need for a staging area for the data warehouse. The primary reason for this architectural decision arises from the fact that we need to acquire and preprocess data and perform data-cleansing activities before loading the data to the data warehouse. Whether you decide to have an active or real-time data warehouse or the traditional Kimball architecture of conformed dimensions, without a separate area for the complex data integration activities, the real scalability and flexibility of the data warehouse cannot be complete. In most of the programs in which I have participated, I have always built and deployed a staging area.

Are there special considerations relating to the movement of metadata in a data warehouse, i.e., how is the business, operational, and technical metadata of data in the data warehouse passed to the business intelligence layer or accessed from the business intelligence layer?

Metadata has been the most ignored aspect of 90 percent of the world's data warehouses. In many cases, there are multiple teams from business analysis to design, development, and deployment of the data warehouse, and each one often blames the other for not maintaining metadata. In my opinion, metadata of different types needs to be developed in the data warehouse. The business metadata is defined during the requirements phase, and this metadata is implemented in the semantic layers and database views. Technical metadata is implemented in the data processing, data model, physical database design, and business intelligence layers. For the business intelligence layer, most of the software packages today provide a metadata integration layer. Metadata has to be migrated across the layers of the data warehouse along with the data itself. Without this architectural implementation, the data warehouse will fail. In fact, to implement an integration of unstructured data, the metadata layer is the key. Web 2.0-based data architectures rely heavily on metadata layers for data integration and processing across multiple databases.

Do you think people neglect data integration in data warehousing? If so, why?

In many cases data warehouse architects, designers, and managers consider the data integration architecture and implementation to be expensive, and so they usually cut corners in this development activity, only to end up spending two or more times the amount to complete the development of the data warehouse. Also, people often assume that data processing software will automatically address data integration and overlook the planning for this important activity.

What kinds of tools or technologies are used to support data integration for data warehousing?
In a typical structured data processing architecture, ETL, ELT, CDC, service-oriented architecture (SOA), and custom-coded software packages are implemented for data movement; data quality tools are implemented for data cleansing; master data management tools are implemented for reference data processing; and metadata reference libraries are implemented for managing metadata.

Are the tools for data integration for data warehousing different for structured and unstructured data?

Yes, for unstructured data integration the tools used include Hadoop, NoSQL databases, and textual ETL engines. Some traditional data integration vendors have announced support for unstructured data integration, but these are still in primitive stages of evolution. The special technologies can process unstructured data of multiple formats in large volumes.

How do you think the area of data warehousing is changing? In what direction do you think this area is headed?

Data warehousing as a concept is here to stay for a long time to come. However, the rapid change occurring in the space is the integration of unstructured data into the data warehouse and the momentum to incorporate Hadoop as the corporate data repository while retaining the enterprise data warehouse to process complex analytics. The future data warehouse is going to evolve as a semantic interface-driven data repository, based on relational databases and unstructured technology architectures, that will provide answers to all sorts of queries from enterprise users.

Do you think the technologies for performing data warehousing, especially around data integration, are changing?

Definitely, yes, the data integration technologies are changing to support unstructured data integration. This requires a significant engineering effort from the provider community. Another popular technique is to use data visualization tools to create a data discovery environment for pre-processing and analysis of unstructured data.

CHAPTER 8 Data Conversion

INFORMATION IN THIS CHAPTER
What is data conversion?
Data conversion life cycle
Data conversion analysis
Best practice data loading
Improving source data quality
Mapping to target
Configuration data
Testing and dependencies
Private data
Proving
Environments

What is data conversion?
When implementing a new application system or changing operations from one application system to another, it is necessary to populate the data structures of the new application system. Sometimes the data structures of the new application system are empty, and other times, when consolidating applications, there is already data in the new data structure and it is necessary to add to it. All the same techniques and strategies are necessary as described in the section on extract, transform, and load, as well as much from the section on data warehousing.

Data conversion life cycle
The basic systems development life cycle for a data conversion project is the same as for any application development endeavor, with activity centered around planning, analysis, requirements, development, testing, and implementation. A pure "waterfall" methodology is not implied or necessary.

Like other data-related projects, the activities in the analysis phase should include profiling the data in the source and target data structures. The requirements phase should include verifying that the assumptions made are true by trying the load of very small amounts of data. Unlike application development projects, there is no support phase in the data conversion life cycle, unless additional data sources are to be loaded to the target application later, such as when multiple systems are being consolidated over time, data is being moved from one system to another in phases, or an organizational merger or acquisition takes place.

Data conversion analysis
Analysis and planning for data conversion should be started as early as possible, as soon as analysis and planning for using the new application are completed. Too frequently, data conversion is thought of as a quick, last-minute activity, and this view may then cause delay to the entire application implementation and problems with the converted data. In some cases, this may necessitate falling back to the original application and create serious customer service issues. In truth, a new application implementation should begin with data conversion planning. If the data conversion activities can get started early enough, then the data conversion development process can provide the data for all stages of testing, not just the final phase of user testing, which must include converted data.
An assessment should include identifying which data areas need to be converted. This basic decision should be performed and approved by business and technology subject matter experts from both the source and target systems before a detailed estimate of the data conversion effort can be performed. This would include a list of significant reference data, a specific list of master data, as well as transactions, events, and balance data areas. Then some profiling of the source data should be performed to determine the volume and quality of the source data. Sometimes this initial assessment of the source data shows that the data may be of insufficient quality in some way for the target application system. It therefore becomes necessary to add another track to clean or enhance the source data prior to conversion. Of course it is much better to make this determination as early in the data conversion project as possible.
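The kind of source-data profiling described above can start with very simple tooling. The following is a minimal sketch using only the Python standard library; the file and column names are invented, and a real profiling tool would go much further (data types, patterns, cross-column relationships).

```python
import csv
from collections import Counter

def profile_source_file(path):
    """Profile a delimited source extract: row volume, null/blank counts per column,
    and the most frequent values per column (a first cut at quality assessment)."""
    row_count = 0
    blanks = Counter()
    top_values = {}
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        for row in reader:
            row_count += 1
            for column, value in row.items():
                if value is None or value.strip() == "":
                    blanks[column] += 1
                else:
                    top_values.setdefault(column, Counter())[value] += 1
    return {
        "rows": row_count,
        "blank_counts": dict(blanks),
        "top_values": {c: counts.most_common(3) for c, counts in top_values.items()},
    }

# Hypothetical usage:
# print(profile_source_file("source_customer_extract.csv"))
```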

Best practice data loading
Whenever available, data conversion should use the target application API to load data into the target data structures. This is best practice for data conversion and for any data loading that doesn't use the application data entry front end. Many vendor package applications insist that all data loading should be through the provided API.

Should an API be unavailable or insufficient for data conversion needs, an assessment should be made as to whether it is more cost effective and less risky for some of the data, say the master data and reference data, to be added to the application manually. It should be kept in mind that data conversion is not performed just once but needs to be repeated many, many times throughout the testing process. Thus, data that is manually added needs to be backed up and restored at the beginning of each data conversion and application test cycle. If a sufficient application API is not available, then it may be necessary to load data directly into the data structure of the new application, the target data structure.
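A minimal sketch of loading through an application API is shown below. The wrapper and the target_api.create_customer() call are entirely hypothetical—a real vendor package would publish its own load interface and error codes—but the pattern of letting the API apply the application's edits while collecting rejects for rework is the point.

```python
def load_customers_via_api(target_api, customer_records):
    """Load records through the (hypothetical) application API, collecting rejects
    for rework rather than stopping on the first failure."""
    loaded, rejects = 0, []
    for record in customer_records:
        try:
            target_api.create_customer(record)   # the API applies the application's own edits
            loaded += 1
        except Exception as error:               # a real API would raise specific error types
            rejects.append((record, str(error)))
    return loaded, rejects
```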

Improving source data quality
Issues associated with data quality tend to become highlighted when using the data for something different than its original or previous use. Therefore, when loading data into a data warehouse for the first time or converting to a new application data structure, problems with the data may be uncovered.
During the analysis phase of a data conversion project, it is best practice to profile the data in the source data structures and assess whether the quality of the data is sufficient for use in the new application. This is not a search for perfection in the data quality but only seeks to identify whether a special project needs to be performed to improve the data prior to production conversion. Issues with the data may include data that does not meet the business rules requirements of the target application or data that is missing required information. The source system may not have a place to store data required for the target application, or the field in the source system is not currently or fully populated. Most frequently, the data requiring cleansing or enhancing is master data. It may sometimes be possible to perform enhancement or data improvement projects automatically through custom code or by acquiring the missing data and enhancing the source data automatically. Usually, the data is improved manually by the source operations business team or by temporary business staff who research and enhance the source data to bring it up to the standard required for conversion to the target application.
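As a sketch of how source records might be checked against target-application business rules before conversion, consider the following; the rules (a required tax identifier and a valid country code) and field names are invented for illustration, and in practice the rules would come from the target application's documentation and implementation team.

```python
VALID_COUNTRIES = {"US", "CA", "GB"}          # invented reference list for illustration

def find_records_needing_cleanup(records):
    """Return (record, reasons) pairs for records that would fail the assumed
    target-application business rules and need manual or automated enhancement."""
    needs_cleanup = []
    for record in records:
        reasons = []
        if not record.get("tax_id"):
            reasons.append("missing tax_id")
        if record.get("country") not in VALID_COUNTRIES:
            reasons.append("unknown country code")
        if reasons:
            needs_cleanup.append((record, reasons))
    return needs_cleanup

print(find_records_needing_cleanup([{"tax_id": "", "country": "XX"},
                                    {"tax_id": "12-345", "country": "US"}]))
```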

Mapping to target
Whether loading data using an API or directly into the application data structure, an important recommendation is to focus on the target data layout and then determine where that data may be obtained. Possibly because the staff working on a data conversion project need to be most knowledgeable of the source system—and so they tend to focus on what they know best—in practice the data conversion staff spend too much time focusing on the source data structures and not enough time on the target data layout and structure.

In developing the mapping requirements for mapping data from source to target, best practice is to start with the target data layout and then find where in the source the data can be found. Before starting development using these requirements, one should test manually, entering data through the front end, and verify that the data is loaded by the application into the positions in the target data structure that were anticipated. It is also a good idea to test that an API works as anticipated before entering the development stage. Because inferring the behavior of the target application from the names of data fields or metadata can be very risky, assumptions should be verified by testing.
A double check should be performed at the end of the requirements process that important aspects of the source data are not being ignored. Items that are possibly being left behind in the data conversion should be highlighted and approved. Sometimes important functionality in the source system that is not in the target system should be brought to light. The target data structures may be modeled or configured, and the source structures may be considered as the model for the target. Using the physical data structures of an application being retired is tricky, since the application system may have existing limitations inherent in the data structures or related to the particular technology implementation. If a modeling activity for the target data structures is needed, that activity should be as independent of the source data structures as possible, and the source data model should only be used as a double check that all required data has been accounted for in the target model.
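One way to make the target-first discipline tangible is to express the mapping as a specification keyed by target fields, as in the sketch below. All field names and transformation rules are invented for illustration; the point is that every target field is listed first, and only then is its source and transformation recorded.

```python
# Target-driven mapping specification: every *target* field appears first,
# followed by where it comes from and how it is transformed.
TARGET_MAPPING = {
    "cust_name":   {"source": "CUSTNAME",  "transform": str.strip},
    "cust_status": {"source": "STATUS_CD",
                    "transform": lambda v: {"A": "ACTIVE", "I": "INACTIVE"}.get(v, "UNKNOWN")},
    "open_date":   {"source": "OPN_DT",    "transform": lambda v: v[:10]},  # keep YYYY-MM-DD
}

def map_record(source_row):
    """Build a target record from a source row using the target-first mapping."""
    return {target: spec["transform"](source_row[spec["source"]])
            for target, spec in TARGET_MAPPING.items()}

print(map_record({"CUSTNAME": " Acme Corp ", "STATUS_CD": "A", "OPN_DT": "2012-07-01T00:00:00"}))
```

A useful side effect of this form is that any target field without a source stands out immediately as a gap to resolve with the implementation team.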

Configuration data
In a new implementation of a purchased vendor application, there are usually configuration settings for the application that are stored in the application data structures: business rules, field definitions, and edits, as well as a vast set of other data that must be set up appropriately in order for the target application to run as expected. This information should be classified as reference data or may have its own category of configuration data. This configuration information should be treated as application code that has to go through change control and application testing before being changed in a production environment. Frequently, the proper settings for this data are obtained from the target application implementation team, and the correct values may change during the application testing process.
For a new application implementation, a backup of the target data structures with the configuration and reference data appropriate for the first day of production application processing ("day zero" data) should be kept and retrieved for the start of each round of conversion and application testing. If some or all of the master data is being converted manually, this master data may be part of the "day zero" set of data that is retrieved for the start of each testing cycle. Lookup tables or transformation mapping tables may also be included in this data set.

Testing and dependencies
Tests of data conversion are done independently as well as in conjunction with application testing. Each section and type of data being converted is tested independently, after which types of data should be tested together and finally all conversion data tested together. There may be timing requirements and dependencies from the existing production application system from which the data is being sourced or from the target application system into which data is being loaded. Is there a break in the production process of the systems while conversion is being performed? If not, what is the process if the conversion needs to be rolled back or fails?
If possible, data conversion should populate the environment used in application testing for as many types of testing as possible. The list of types of application testing includes unit testing, integration testing, quality assurance (QA) testing, and end-user testing. There may be additional types of testing, special conversion testing, or QA and end-user testing may be combined.
Business users will sometimes not recognize application functionality as being correct or incorrect without their own data. Therefore, it is particularly important to be using converted data by the time real business users participate in testing. It is a good idea to integrate data conversion and application testing before first-time business users participate in testing; this will eliminate as many issues as possible prior to their involvement. The last round of user acceptance testing prior to production conversion must be prefaced by a full data conversion test and sign-off by the business users.

Private data
In some cases, the developers or application support staff for the target application, or even for the data conversion, are not allowed to see some of the production data for security purposes. Possibly, associated with the source data, there may already exist a set of test data that has been scrubbed or masked of sensitive or private data that can be used for data conversion testing. If scrubbed source data doesn't already exist, then an additional track needs to be added to create scrubbed source data for data conversion and application testing. The development of requirements for scrubbing or masking of test data needs to be included in the requirements phase and approved by authorized security staff and business owners of the source data. However, it is risky not to be using production data in the early stages of data conversion testing; hopefully, the scrubbed data used for testing is full volume and created from recent production data.
It is even riskier not to perform profiling on actual production data during the analysis phase. If at all possible, profiling should be performed on production data by analysts with appropriate access authority. If profiling is not allowed to be performed on production data, this should be listed as a project risk and mitigation steps should be considered.

The final phases of data conversion testing and final user acceptance testing with converted data will need to use unscrubbed data, as well as the production data conversion. Plans should be made for adequate support of the source and target data structures during final testing with staff authorized to gain access to the sensitive data.
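As discussed above, one common way to create scrubbed test data is to replace sensitive values with deterministic tokens. The following is a minimal sketch, with invented field names; a real masking approach would be defined with, and approved by, the security staff and business data owners.

```python
import hashlib

def mask_record(record, sensitive_fields=("ssn", "date_of_birth", "email")):
    """Replace sensitive values with a shortened one-way hash so test data keeps its
    shape and uniqueness without exposing the original values (field names invented)."""
    masked = dict(record)
    for field in sensitive_fields:
        if masked.get(field):
            digest = hashlib.sha256(masked[field].encode("utf-8")).hexdigest()
            masked[field] = digest[:12]   # deterministic token; same input -> same token
    return masked

print(mask_record({"name": "A. Customer", "ssn": "123-45-6789", "email": "a@example.com"}))
```

Because the same input always produces the same token, records masked this way can still be joined and counted consistently across test cycles, which matters for conversion proving.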

Proving
Along with developing the conversion process and code, data conversion must include a process for proving that the conversion is correct. Proof of the data conversion should be entirely independent of the data conversion itself, using none of the same code. Proving processes should be developed for each section of data and an overall proof for the entire data conversion. Where possible, the overall proving process should use financial data for performing the proving.
Data conversion proving is best performed by comparing application financial reports generated from the source and target systems. This would usually be reports of current financial balances that could be compared between the two systems. Best practice is to use actual application reports from the source and target applications for proving whenever available. Master data conversions may have to be proven by counts of the number of instances in the source and target data structures. Events, transactions, and balances may be proven by both the counts of instances and a sum of numeric fields of both financial and nonfinancial types (such as inventory counts). A financial proof is usually performed for each customer, depending on the nature of the application.
The earlier that the proving process between the two systems is planned and developed, the earlier in the development process that incorrect assumptions and mistakes can be corrected. Most data conversions that leave proving out of planning or leave it until the last minute encounter significant unexpected delays in the production conversion.
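The counts-and-sums style of proof described above can be sketched very simply. The field name and tolerance below are illustrative only, and in practice the proof should be built independently of the conversion code and, where possible, from actual application reports rather than raw extracts.

```python
def prove_conversion(source_rows, target_rows, amount_field="balance"):
    """Compare record counts and summed balances between source and target extracts;
    the amount field name and the 0.01 tolerance are assumptions for illustration."""
    source_count, target_count = len(source_rows), len(target_rows)
    source_total = sum(float(r[amount_field]) for r in source_rows)
    target_total = sum(float(r[amount_field]) for r in target_rows)
    return {
        "counts_match": source_count == target_count,
        "totals_match": abs(source_total - target_total) < 0.01,
        "source": (source_count, round(source_total, 2)),
        "target": (target_count, round(target_total, 2)),
    }

print(prove_conversion(
    [{"balance": "100.00"}, {"balance": "250.50"}],
    [{"balance": "100.00"}, {"balance": "250.50"}],
))
```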

Environments
Having sufficient environments for application testing as well as conversion testing is always a challenge, and it will seem that every person on the project is asking for a separate test environment and cannot possibly share. From the start of the project, coordinating testing will be important. Figure 8.1 shows a possible configuration of environments during application and conversion development.

FIGURE 8.1 Environments Data Flow.

Application developers will want to have environments for both unit testing and integrated system testing. It is usually possible to coordinate a single test environment for both unit and integrated system testing. Depending on how application testing is organized, there may be a request for separate environments for QA testing from user acceptance testing.

However, this can usually be coordinated to form one environment that can be used for different testing during different phases of the project. At least two test environments usually exist separately from the production environment after the application has been turned on for production operation: the unit/system testing environment (sometimes called development) and the QA/user acceptance testing environment. Since these environments are needed on a permanent basis, they are usually included in the project estimates.
Data conversion testing usually requires at least one separate environment from development and QA. In fact, data conversion testing probably needs one environment for data conversion testing and another environment for data conversion proving testing. Let's assume that the two data conversion development streams (conversion and proving) can coordinate their testing and coexist. It is still very difficult to coordinate all the data conversion testing with the application testing and limit environment needs to the two application test environments (development and QA). It is usually most efficient to have a separate environment for full-volume data conversion testing, if at all possible. When a separate environment is not possible for data conversion, it may be possible to coordinate the project plan so that data conversion testing occurs slightly upstream of application testing: While unit and integrated system testing are occurring in the development environment, data conversion testing occurs in the QA environment. In the case where a new application system is being implemented, it may be possible to continue data conversion and proving testing in the production environment (which is not yet turned on), while QA and user acceptance testing is occurring in the QA environment.

Using cloud resources for temporary testing environments can relieve some of the pressure for extra environment resources. Depending on the location and security of the cloud resources, it may be better to put the unit and system testing environments on temporary cloud resources and the conversion testing on more secure resources, eliminating the extra environment(s) when in production and moving unit and integrated system (development) testing back to the permanent environment.
Business user application testing usually occurs in multiple cycles, each starting with a reset of the application data store and population by data conversion. Changes to software code and configuration may be planned to occur only at the start of each testing cycle. Data conversion may be responsible for an initial setting of the data stores with configuration and reference data. Each testing cycle would begin with resetting the target data stores in the test environment with the initial settings of configuration and reference data, followed by an execution of the data conversion to populate the test environment, then by a data conversion proof, and finally by the execution of the application testing scenarios scheduled for the cycle. Problems with the data conversion or application are logged and addressed in the respective code and process.
Agile development uses short cycles of development and testing, called scrums, to ensure that application code is developed efficiently to meet what business users actually want and need. Whether multiple scrums or just one scrum is scheduled before production implementation, data conversion development, testing, and proving must be part of the agile development team in order to stay coordinated.
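The test-cycle sequence described above—reset "day zero" data, run the conversion, run the proof, then execute the scheduled test scenarios—can be sketched as a simple orchestration. The four callables are placeholders for whatever scripts or tools a project actually uses; nothing here reflects a specific product.

```python
def run_conversion_test_cycle(restore_day_zero, run_conversion, run_proof, run_app_tests):
    """Orchestrate one testing cycle in the order described above. All four callables
    are hypothetical stand-ins for real project scripts or tools."""
    restore_day_zero()            # reset configuration/reference ("day zero") data
    run_conversion()              # populate the test environment from the source systems
    proof = run_proof()           # independent proving step (counts, sums, reports)
    if not proof.get("totals_match", False):
        raise RuntimeError("Conversion proof failed; do not start application testing")
    return run_app_tests()        # only then execute the scheduled test scenarios
```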

A SIDEBAR FROM THE AUTHOR ... ON DATA CONVERSION
I have spent too many of my weekends during my career doing data conversions or data conversion dress rehearsals. Some might say one is too many, but I found myself with an expertise after a while, and so I would get called in to apply my skills to subsequent data conversion planning and execution. During one data warehouse project, a data architect who was responsible for designing and managing the data conversion financial proving process started her analysis extremely early in the project and discovered a myriad of unexpected information about the source systems and the data that she was trying to use to perform the financial proof. Most interestingly, there had been times in the history of the accounting system when the system had made mistakes. The system had been fixed and adjusting accounting entries had been made in the system, but at a higher organizational level than we were using as input to our data warehouse. Therefore, the data in the source system at the level to which we were supposed to prove was incorrect and would never match. The information that the data architect identified by designing and prototyping the financial data proving process prior to the data warehouse code development probably saved us months of delay that would have been caused if we had started the data conversion design later in the project life cycle.
The moral of this story is that it is never too early to start designing and developing the conversion proving process. The earlier issues are identified, the smaller the negative impact they will have on the overall project schedule. When developing a data conversion financial proving process, early in the project you will probably ask the lead business user the question: "What if we find that the information in the source system that we are attempting to prove to is incorrect? Do you want us to prove to the source system or to what is correct?" They will answer "What is correct." This will not turn out to be true. You will need to prove to the source system AND to what is correct, in some way.

CHAPTER 9 Data Archiving

INFORMATION IN THIS CHAPTER
What is data archiving?
Selecting data to archive
Can the archived data be retrieved?
Conforming data structures in the archiving environment
Flexible data structures
Interview with an expert: John Anderson on data archiving and data integration

What is data archiving?
What we don't currently emphasize greatly in data management is the end of the data life cycle, when data is archived and, possibly later, deleted. This has been the case because we've wanted to have available as much data as our technical solutions could store, and if we couldn't store all of it, then the oldest data would just be deleted after backup. Now, in the era of big data, there is exponential growth in the amount of data being produced, and the ability to move data aside and retrieve it is more important. Also, and maybe more importantly, it is being discovered that data backups are not sufficient as archiving solutions since they usually don't allow selective retrieval and they lose validity when the live data structures change or applications and technology stacks retire. Archiving data assumes that the data is moved to a less costly (and possibly less accessible) platform from which it can be retrieved or accessed in the future—either brought back to the original application for access or possibly accessed within the archive environment.
Data archiving is an important area of concern for all organizations, but it is particularly important for mergers and acquisitions, system consolidations, and application replacement. It is important for all data conversions where some data is not being converted to the new environment, especially in heavily regulated industries.


Archiving is heavily utilized with unstructured data, such as e-mail and documents, where there are huge volumes of materials that usually are less likely to be required for access as they become older but for which there may not be a deletion policy. Archiving and retrieval capabilities are frequently built in to e-mail and document management systems, and there are also third-party tools that focus on unstructured data types.

Selecting data to archive
Identifying what data needs to be archived is usually done automatically based on an organizational policy that incorporates any regulatory requirements. Selection criteria may have to do with the age of the data, either when it was created or, more likely, the last time it was updated or accessed. Regulatory requirements may specify the minimal time that data is maintained. For example, consumer loan information must be available for seven years after a loan is rejected or completed, which may require that data be archived if the organizational policy is to keep that data online for less time.
Once the appropriate policy is identified for a data set, the selection and archiving of the appropriate data is usually scheduled to be performed automatically by the archiving system. As with all automated business rules, the archival rules should be reviewed by the business area that is responsible for the data, to ensure they are in sync with the organizational and regulatory policies. The rules should then be tested to ensure that the appropriate data is being archived.
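A minimal sketch of age-based selection follows. The retention periods and the online-retention window are invented policy values used only to illustrate the shape of the rule; real values come from the organizational policy and the applicable regulation.

```python
from datetime import date, timedelta

RETENTION = {"consumer_loan": timedelta(days=7 * 365)}   # illustrative regulatory retention

def select_for_archive(records, data_set, keep_online=timedelta(days=2 * 365), today=None):
    """Select records whose last activity is older than the (assumed) online-retention
    window but still inside the (assumed) regulatory retention period."""
    today = today or date.today()
    archive, purge_candidates = [], []
    for record in records:
        age = today - record["last_activity"]
        if age > RETENTION[data_set]:
            purge_candidates.append(record)      # past regulatory retention
        elif age > keep_online:
            archive.append(record)               # keep, but move off the live system
    return archive, purge_candidates

rows = [{"id": 1, "last_activity": date(2005, 1, 15)},
        {"id": 2, "last_activity": date(2009, 6, 1)}]
print(select_for_archive(rows, "consumer_loan", today=date(2012, 12, 31)))
```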

Can the archived data be retrieved?
Regulatory requirements in many industries, such as the pharmaceutical industry, mandate that data from an application being shut down be archived, along with the application code and hardware on which it runs. This is a very intelligent approach to recoverability, since a simple data backup is not recoverable if the hardware that it ran on is no longer available. Less strict regulation may simply require that the data itself be retrievable. Standard data backup capabilities can move the data to offline storage, but bringing the data back into the operational application can be problematic if the structure or schema of the data in the application changes after the data is archived. The structure of data even from the same original data structure or table might change over time, and archived data taken from the same application at various times might be incompatible. A comprehensive archiving solution should be able to retrieve data even if the application data structures have changed since, or even if the application is no longer available.

Conforming data structures in the archiving environment
Archived data is usually placed in storage that is less expensive and less speedy than the storage used by real-time transaction processing applications, and is usually in a compressed format. When the archived data only needs to be accessible and does not need to be brought back into the operational application environment, it may be useful to transform the data being archived into a common technical format. It may be decided to store all the data in the data archive environment in the same database management system, for example.
However, trying to create a single logical data model in the data archive environment and to transform the archive data into that format is both cost prohibitive and extremely risky. While it is appropriate for an enterprise data warehouse to be able to report and analyze organizational data using a single logical data model, or for an enterprise service bus to be able to move data around an organization, it can be very expensive and has very different goals than archiving. Also, the transformation of the data from its original form to that of the model in the archive environment could produce errors and loss of some of the data's meaning.

Flexible data structures
It is critical that data be archived along with its associated metadata, which explains the meaning of the data and its history. When archiving data that may have to be accessed independently of the application from which it came, it may be best to use a data structure or storage solution that can both keep the data with its metadata and allow flexible and changing data structures. An XML type of solution allows for a changing data structure and associated metadata, and still permits queries across data that had been archived from the same application—even if the source data structures had changed.
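The idea of keeping the metadata with the data in a self-describing format can be sketched as follows. JSON is used here rather than XML purely for brevity; the metadata fields shown are illustrative, not a prescribed standard.

```python
import json
from datetime import datetime, timezone

def archive_record(record, source_system, schema_version):
    """Wrap a record with its descriptive metadata in a self-describing JSON document,
    so it can still be interpreted after the source structures change."""
    return json.dumps({
        "metadata": {
            "source_system": source_system,
            "schema_version": schema_version,
            "archived_at": datetime.now(timezone.utc).isoformat(),
            "columns": sorted(record.keys()),     # the structure travels with the data
        },
        "data": record,
    }, indent=2)

print(archive_record({"loan_id": "L-42", "status": "CLOSED"}, "legacy_loans", "3.1"))
```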

A SIDEBAR FROM THE AUTHOR ...ON DATA ARCHIVING

I worked on a project in which a retail bank had purchased other banks in the United States. They wanted to consolidate their operations onto one set of application programs and had selected the systems they would use per product. Not all the historical data was to be converted to the core applications, but regulatory requirements, as well as customer service requirements, specified that the data from the applications to be retired needed to be maintained for quite a few more years. Although I proposed copying the historical data to data structures that mirrored those of the applications to be retired, management chose to transform the data to a common logical model in a central new database. My original proposal was probably not optimal because of the requirement that multiple technologies be maintained to store the archived data. It is preferable to have a single archive solution with all the data accessible.

On the other hand, even traditional business intelligence solutions can access data from multiple relational database management systems, so accessing the data from multiple database management systems should not have been a particular challenge. The problems with transforming the data into a common logical model are twofold: First, it is a very big project to define a common logical model and map various data structures to it; and second, errors may be introduced in the transformation that are not detected until long after the data archiving, when the data needs to be retrieved. As anyone who has worked on an enterprise data warehouse project can vouch, creating a central common data model is a difficult and time-consuming task. Additionally, mapping data from its original structure to a common central structure is difficult and time consuming. Unless it is central to the success of a data project, such as an enterprise data warehouse or master data management application, it is best practice to avoid creating and mapping to a central logical data model. It is generally agreed that it is best to archive data in its original logical structure using a technology that keeps the metadata with the data and can handle variations in data structures, such as an XML or JSON solution. Data can be accessed and integrated in one technical solution, with little risk of losing some of the original data content.

INTERVIEW WITH AN EXPERT: JOHN ANDERSON ON DATA ARCHIVING AND DATA INTEGRATION

Following is an interview with John Anderson on the importance and use of data integration in data archiving. John Anderson is an enterprise system project manager and architect with more than 14 years of professional experience in information systems management and development in the Life Sciences and Financial Services industries. He has worked on a large number of enterprise archiving projects throughout these industries as both architect and manager.

How is data integration (i.e., moving and transforming data) related to data archiving?

My approach to archiving has always been to analyze the usage for the data and from that analysis create scenarios for archiving based on risk, compliance, needs for data access, costs, and value. These scenarios then naturally devolve into larger categories of archiving patterns. A key part of most of these patterns (with perhaps the exception of "Live" archiving) is extract, transform, and load (ETL) activities. ETL work in larger projects is often the long pole in the tent, from a project management standpoint: These activities often take the longest time to complete and are the most error-prone, all while being the most essential. The definition of archiving boils down to something like the following: "Archiving is the storage, searching, and retrieval of digital content." Among its many uses, archiving serves as an aid to the migration of systems and data; for the preservation of intellectual property; to improve system health and performance; and to support litigation and other legal requirements. Simply put, the three corners of archiving (storage, searching, and retrieval) don't occur without data integration; there is always a source and target system.

How and why is data moved for data archiving?

The how is often the bigger question, but we won't forget about the why. Data is usually moved from source system to target archive through some ETL platform and/or data-dumping mechanism from an underlying database structure.

In cases of unstructured data migration—data stored in files, which may be in coded form (e.g., Microsoft Office) or noncoded, pictorial form (e.g., scanned TIF images)—the data of the underlying operating system or storage system is used as the key for tagging and tracking the data. In many cases, simple OS copy utilities can be utilized. The key, however, to any data moving for archiving relies heavily on two of the three corners of archiving mentioned earlier: searching and retrieval. Data moved without the ability to be appropriately indexed and retrieved is useless. I have seen many an archiving project spend countless hours migrating terabytes upon terabytes of useless data. What's important to understand as a part of the ETL or migration activities is how the data will be used in an archived state and what is actually required to be kept.
This discussion moves us along to the why part of the question. In my opinion it is vital to understand why data is being archived. Many companies have taken a "keep it all" strategy. At first glance this can seem to be a viable option because who can understand what the future may hold, and nowadays disk (i.e., storage) is relatively cheap. However, experience has taught us that from a liability standpoint, and sometimes from a regulatory standpoint, storing more data than is required is not the best business choice. Often older data may open a company up to legal exposure that was not necessary given legal, regulatory, and often even the company's own retention policies. As part of my own personal methodology for archiving, any target application or source of data needs to pass the "Why archive?" question. I need to understand what legal, regulatory, and business practices mandate that this data be held onto. That analysis all comes down to whether it saves or makes money, keeps us compliant (i.e., out of jail), or significantly increases efficiency. Once those issues are understood and evaluated, then archiving can proceed.

What kinds of transformations are done to the data being archived? (e.g., compression, different data structures, etc.)

Any number of transformations may be done on data that is being archived, depending on the archiving platform. Typically, at a minimum, some type of compression can be applied to the underlying data store after the data has quiesced. Sometimes this compression is done through manual effort from archiving tools or by the underlying hardware itself through complicated compression algorithms and disk striping methodologies. Another common transformation that occurs in archiving, regardless of industry, is the transformation to a durable format. A durable format could be a number of things, from the media standards used to store the data (i.e., WORM, COLD media types, optical disk, etc.) to the actual format of the data that is being stored being encapsulated or snapshotted to XML or PDF. The latter two format changes seem to reflect the most common trend in archiving, as they are industry recognized as more standard and relatable formats. The use of XML encapsulation/transformation and its self-describing nature has resulted in the usage of a wide number of archiving tools that leverage XML databases as opposed to typical relational databases.

How is data integration for data archiving different from backup and recovery?

Archiving is concerned with preserving the data and information stored in the system and not necessarily the source system itself.
The format the data has in an archive may not be a mirror image of the "application context" it had in the source system.

How is data archiving for structured data (i.e., relational databases and XML) different from data archiving for unstructured data?

The primary difference is actually in the data analysis and the tools. As stated earlier, two of the key parts of archiving are search and retrieval of the data. Structured data, whether in XML or relational databases, is especially designed for that searching and retrieving. Unstructured data by its nature is free form.

Typically, the only way to search and index this data is full-text indexing, which involves a whole other level of complexity as well, to ensure that whatever tools or code are developed can read the variety of unstructured content that exists in your source systems (multiple office formats, PDFs, text, spreadsheets, proprietary databases, and images). Images can pose a particular set of issues, given the huge variety of formats and encapsulation techniques used. I recall projects in which we could not move forward with archiving purely because we were dealing with a variety of imaging formats and had to develop specific code to deal with this variety.

Have you seen any data archiving projects where data integration issues led to significant problems?

Absolutely. By data integration I am referring to the ETL process. Often in archiving projects, the ETL from legacy systems has presented problems. In one situation I recall that my team was trying to extract data from a legacy mainframe system that was built around ADABAS. Some of the constructs specific to Natural programming and ADABAS presented significant issues to our ETL process. The programming construct that basically provided dynamic duplication of code pages and values was a nightmare to convert to a more easily understood format. We would often end up with thousands of blank fields of data as the ETL process tried to interpret empty arrays and data structures based purely on their dimension as opposed to actually utilized elements.

Have you had experiences where data integration was neglected on a data archiving project?

Yes. I have worked on projects in which the idea was simply to get the data out of the active applications and into an archive repository. Typically, these projects are those that deal with semistructured or unstructured report-type data with well-understood attributes or a common format. In these projects, the ability to just dump the system into some type of content management repository made the archiving relatively straightforward since the enterprise content management (ECM) system provided some level of full-text indexing and allowed more robust searching.

What kind of issues can arise with data recoverability from archive?

Data that cannot be easily retrieved from the archive is essentially useless. Often the business case around archiving is built around return on investment (ROI) from decommissioning redundant and legacy systems and being able to respond to legal discovery concerns. If the archived data is not readily recoverable, then its use from a discovery standpoint is limited, given the time constraints on legal matters.

What do you think about transforming data into a common format in the archive, either a common technology or a common schema? Have you had any problems with this in the past?

The creation of a durable archive format is the intent of any good archiving system. Transformation to that format, and in fact determining that format, is the rub. The format types have changed over the years. When I first became involved with archiving, back in the early 1990s, everything was about WORM drives and optical disks as the format that would last forever. There was no concern for the data form, but the media format became the focus. Only later did people begin to realize that the media format was not as durable as originally thought, and if you don't change the data format you now need to mothball these old systems in order to read the format.

What kinds of tools or technologies are used to support data integration for data archiving?
The technologies include ETL tools and platforms, scripting languages (such as Perl), OS-related command tools, XML utilities, converters, and parsers. The key aspect of all of these tools is driver support for the multitude of databases and data formats that exist.

I have had more than one project sidelined because the tools we chose could not handle a certain type of database well or efficiently.

Are the tools for data archiving different for structured and unstructured data?

Yes, one of the key areas for archiving structured data is data analysis. You need to understand the data you are archiving to ensure that archive users can get to what they need. This lends itself to any variety of database toolsets that do entity relationship diagramming and more complex data analysis. Now that's all well and good for structured data, where you have some tool set to easily navigate, but when dealing with images and other types of unstructured and semistructured data, your primary tools are going to be focused on being able to inject more order and structure into the data. Unstructured data analysis tools now look at patterns in the objects and text in the objects, as well as where the file is stored, who last accessed it, what location it is in, the file name, size, and so on. It becomes more of a forensic science than just interrogating the data itself.

How do you think the area of data archiving is changing? Where do you think this area is headed?

Most companies are focusing on portfolio rationalization, often resulting in data/application archiving as an offset driven primarily by regulatory compliance needs. Forrester Research estimates that, on average, structured data repositories for large applications grow by 50 percent annually and that 85 percent of data stored in databases is inactive.

Do you think the technologies for performing data archiving, especially around data integration, are changing?

Yes. Varying types of data warehouses (containing both structured and unstructured data) are in evidence, as is a move toward XML-based databases for use in the archive and the use of other nonrelational data stores.

CHAPTER 10 Batch Data Integration Architecture and Metadata

INFORMATION IN THIS CHAPTER
What is batch data integration architecture?
Profiling tool
Modeling tool
Metadata repository
Data movement
Transformation
Scheduling
Interview with an expert: Adrienne Tannenbaum on metadata and data integration

What is batch data integration architecture?
In order to enable batch data integration, it is necessary to have the tools to support analysis and development of the data integration code as well as the tools to support the operation of the data integration activities. Some tools can provide multiple functions. This is a very mature area of data management, and some of the available tools are extremely sophisticated: They are able to handle very high volumes, fast load times, and extremely complex transformations, with operational data movement, transformation, and scheduling built in. Very complex capabilities come with a relatively high price tag and the need to be operated by specialists who are experts in using the tools. Most medium to large organizations invest in a sophisticated batch data integration capability for at least one area of need, such as loading the data warehouses and data marts. Figure 10.1 shows the tools and systems needed to implement a batch data integration capability.

FIGURE 10.1 Batch Data Integration Architecture.

Profiling tool
A profiling tool is important for performing analysis of the source and target data structures for data integration, whether the transformation will be performed in a batch or real-time environment. It is possible to perform profiling using just the utilities of the source data environment, but specialized profiling tools make the analysis process much more efficient, especially on large volumes of data. Some data-profiling tools can actually infer the relationships between data based on the actual data contents of the various data structures. Basic metadata tools can infer relationships between the data based on the names of the fields or attributes in the data structures.
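A very simplified version of content-based relationship inference follows: if most of the distinct values in one column also appear in another, the columns may be related, much as a foreign key references a primary key. Commercial profiling tools do far more than this; the sketch and the sample values are purely illustrative.

```python
def value_overlap(column_a, column_b):
    """Fraction of distinct values in column_a that also appear in column_b;
    a high overlap hints at a foreign-key-like relationship."""
    distinct_a, distinct_b = set(column_a), set(column_b)
    return len(distinct_a & distinct_b) / len(distinct_a) if distinct_a else 0.0

orders_customer_id = ["C1", "C2", "C2", "C7"]
customers_id = ["C1", "C2", "C3", "C4", "C5", "C6", "C7"]
print(value_overlap(orders_customer_id, customers_id))   # 1.0 -> likely references customers
```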

Modeling tool
Modeling may be needed for the intermediate staging and file layouts, not just the source and target data structures. The most efficient use of resources is to develop robust models that can be reused for multiple functions, such as data file layouts that can be used for multiple data conversions or for loading data into the data warehouse from multiple sources. Modeling tools should be able to model interface and file layouts as well as operational data structures, such as databases. In addition to data modeling, process modeling tools may be needed to design the flow of data between applications.

Metadata repository
Most of the tools listed for use in data integration will have their own underlying metadata repository. One benefit of an integrated tool that performs multiple functions needed for data integration is that it is probably using the same metadata repository across those functions. This is not always the case, however, as a tool might be using different repositories for different purposes. Sometimes a vendor will buy a tool to perform a function that is missing from its current set of capabilities, or to perform it better, and multiple metadata repositories may be functioning in the background. All tools that act on data in the organization, including database management systems and document management systems, work with a specialized metadata repository that contains information on the data under management.
There are three basic categories of metadata: business, technical, and operational. Business metadata is the information of particular interest to business users and was probably developed by business people, such as the business definitions of terms associated with data. Technical metadata is the information that is needed for the programs to function, such as the technical name of a field in a database or the calculation that needs to occur to transform a piece of data from one format to another. Operational metadata is the log of what actually occurred, such as the date and time that a particular piece of data was loaded into the target data structure. There is metadata about every layer of the technical environment and its use: network, servers, databases, files, programs, as well as who is allowed to access all these different layers.
For data integration, there are some new pieces of metadata that are particularly of interest: data transformation information and lineage. Data transformation metadata is the information about how data from one environment should be transformed to another. The common term for data transformation is mapping. Mapping is both a noun and a verb. It is the act of defining what the data transformation should be, and the resulting specification of this transformation is also called the mapping. Lineage is the history of where data came from and how it was transformed along the way. Lineage information is operational metadata about the source of data. Lineage metadata presents an inherent challenge because it is based on the technical metadata mappings that were performed on the data, and although it is of great interest to business users, it tends to be incomprehensible. It is usually necessary to create additional business metadata that explains in business terms the lineage of data.
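The distinction between technical mapping metadata and operational lineage metadata can be illustrated with a toy repository; the names and structures below are invented and much simpler than what a real metadata repository stores.

```python
from datetime import datetime, timezone

# Technical metadata: the mapping definitions themselves (names are invented).
mappings = {
    "m_customer_name": {"source": "CRM.CUSTNAME", "target": "DW.CUSTOMER_DIM.NAME",
                        "rule": "trim and title-case"},
}
# Operational metadata: lineage records written each time a mapping executes.
lineage_log = []

def record_lineage(mapping_id, rows_moved):
    """Append an operational lineage entry for one execution of a mapping."""
    lineage_log.append({
        "mapping_id": mapping_id,
        "executed_at": datetime.now(timezone.utc).isoformat(),
        "rows_moved": rows_moved,
        "source": mappings[mapping_id]["source"],
        "target": mappings[mapping_id]["target"],
    })

record_lineage("m_customer_name", rows_moved=12042)
print(lineage_log[-1])
```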

Data movement
Some tool is needed to move data from location to location, between servers or networks or organizations.

Utilities specialized for different environments, or a special tool that handles multiple environments and any physical data storage transformation, might be used. The data movement tool may require sophisticated security capabilities as well, if passing data between organizations.

Transformation
Data transformation can be extremely simple or extremely complex. It is necessary to develop the data transformation, or mapping, instructions. The tools for developing transformation code provide options for the various types of transformation categories, whether one is simply moving the data from one field to another or one needs to merge data from multiple sources and look up how values from the source should be transformed to valid values in the target. Usually the transformation development tools have sophisticated user interfaces. The more complex the tool capabilities, the more likely the need to involve specialist programmers for the transformation development who are experts in using that tool. Vendors may show demonstrations that appear to be very easy and may indicate that business people who are not programmers can use the data transformation tool, but this is not realistic.
The data transformation development tool will store the metadata about the transformation in the metadata repository. Either the transformation tool will generate the code that can be executed to perform the transformation, or it will perform the operational transformation itself. Some data transformation tools are complex operational systems that execute the transformations using the stored mapping metadata as instructions in how to transform the data. These operational transformation engines are designed to handle very high volumes of data transformations at very high speeds. The operational transformation engines (or data integration engines) are designed exactly for this function and can be tuned with professional advice to be extremely efficient. Most organizations use a data transformation engine for batch data integration.
Operational transformation engines have a few drawbacks that should be considered when designing the batch data transformation architecture: If the engine isn't working, the data transformation can't be performed; the organizational needs may be much simpler than the capabilities of the tool; and the organization may have the need for even greater speed than the operational engine can provide. Adding a data transformation operational engine may raise the cost of operating the data transformation environment because technical specialists must be available to support any problems that occur in the execution of the data transformation. If the engine is down, the execution can't occur, so the engine must become part of the set of applications that require immediate problem resolution even in the middle of the night.

If the need for batch data transformation is rather simple, running the data transformation engine may be the most complex and costly aspect of operating the environment, and batch data transformation could be performed more efficiently using generated code. Very rarely, organizations with extremely complex batch data transformation needs find that they cannot tune the data transformation operational engines to be sufficiently fast for their needs. Further, they will find that they are able to tune data transformations using code that can be compiled and other alternative approaches to meet their batch data transformation time limitations. The very few organizations that have more complex data transformation needs than can be handled by the standard transformation engines will probably be investing in a custom solution, which is more expensive than the cost of purchase and operation of the vendor transformation engines.
For various reasons, as listed earlier, batch data transformation is sometimes performed using code generated from the data transformation tool, or it can be hand written, which can then be compiled and executed. This arrangement may suit organizations that have few and simple data transformation needs or vastly complex and difficult data transformation needs, the two ends of the spectrum. Not having a data transformation engine in the operational environment makes the environment much simpler, but requires greater manual care of the metadata and scheduling. A data transformation engine will generate lineage metadata that is unlikely to be otherwise available, unless particularly programmed.
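Whether executed by an engine or compiled into generated code, the transformation is driven by the stored mapping metadata. The sketch below illustrates that idea only: mapping metadata is treated as data, and an executable transformation function is produced from it. It does not represent any vendor's generation mechanism, and the field names and rules are invented.

```python
# Mapping metadata as (source_field, target_field, rule) tuples — invented examples.
mapping_metadata = [
    ("ACCT_NO", "account_number", "copy"),
    ("BAL_AMT", "balance", "to_float"),
]

def generate_transform(metadata):
    """Produce an executable transformation function from mapping metadata."""
    rules = {"copy": lambda v: v, "to_float": lambda v: float(v)}
    def transform(source_row):
        return {target: rules[rule](source_row[source])
                for source, target, rule in metadata}
    return transform

transform = generate_transform(mapping_metadata)
print(transform({"ACCT_NO": "0012345", "BAL_AMT": "1500.25"}))
```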

Scheduling

Many organizations have standard scheduling tools for executing batch operations. Data transformation engines have their own scheduling capabilities or schedules that can be invoked from a standard scheduler. Code generated from a data transformation tool, or custom code, can be executed by a scheduler.

Most organizations prefer to have their batch data transformation triggered from their standard scheduler as part of the integrated "nightly" batch processing for the entire organization. Much of the data to be extracted from the organization's application systems and transformed is dependent on being at a particular "state" in the processing cycle. Each application may have a certain amount of batch processing that needs to occur before the data in the application reaches the correct state, ready to be extracted. By running all batch processing from the same scheduler, any dependencies between systems can be automated into the schedule. Also, an organization may prefer to have all schedules run from the same scheduler so that production support of the schedules can be performed and managed centrally using resources trained on one standard set of tools. For organizations that don't have a standard scheduler and don't run and support a centralized production schedule, batch data transformation tools provide a scheduler. It is necessary to establish a trigger from each application providing data to the batch data transformation, to indicate when the data is "done," or in the proper state and ready for extract.
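As a concrete illustration of that last point, here is a minimal Python sketch of a readiness check a scheduler might run before launching the extract: it waits for each source application to signal, via a trigger file, that its data has reached the proper state. The file paths and the downstream command are illustrative assumptions, not any scheduler's actual mechanism.

```python
# A minimal sketch of a trigger check run before a batch transformation job.
import pathlib
import subprocess
import time

# Hypothetical "done" files dropped by each source application's nightly batch.
TRIGGER_FILES = [pathlib.Path("/data/triggers/orders.done"),
                 pathlib.Path("/data/triggers/billing.done")]

def wait_for_triggers(timeout_seconds: int = 3600, poll_seconds: int = 60) -> bool:
    """Return True once every source application has signalled readiness."""
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        if all(trigger.exists() for trigger in TRIGGER_FILES):
            return True
        time.sleep(poll_seconds)
    return False

if __name__ == "__main__":
    if wait_for_triggers():
        # Hand off to the transformation job once all sources are in the correct state.
        subprocess.run(["python", "run_batch_transform.py"], check=True)
    else:
        raise SystemExit("Source applications never reached the ready state")
```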

A SIDEBAR FROM THE AUTHOR ...ON BATCH DATA INTEGRATION METADATA

My advice is to take everything you read (as in existing metadata) or hear about the contents of data structures with a grain of salt. I have created many data dictionaries that provide definitions of the business terms in an organization and usually map them to the actual database fields where the data can be found. When I was a consultant, and therefore not an employee of the organization in question, I found it very difficult to get permission to access the actual production data, so I was generating these definitions by looking at the field names, checking existing metadata about the fields, and talking to business users and the technologists who support the data. I tried to warn clients that the probability of these definitions and mappings being wholly correct was remote. Yet I frequently ended up creating data definitions with little idea of what was actually being stored in the data structures. Future analysts would then read the metadata I had generated and possibly assume that it was correct. A colleague of mine used to say that anything learned about a piece of data before actually looking at it was just rumor, and my experience has been that what is found in a production data environment contains unexpected results almost every time. Data contents are frequently null or blank, contain all defaults, or present entirely different information than the field name would lead you to believe. Once again, I am suggesting that data profiling is absolutely crucial to data-oriented projects.

A SIDEBAR FROM THE AUTHOR ...ON METADATA REPOSITORIES

Every tool and data structure technology has an underlying metadata repository for its associated configuration and, at least, technical metadata. However, data projects frequently seek to consolidate metadata into a single repository in order to be able to analyze and report on the metadata across types, relationships, lineage, and so on. Depending on the organization and the analysis need, the return on investment for a metadata repository project can be very compelling. But the investment required to purchase and implement a central metadata repository can be very high, close to or over one million dollars. The first time I assessed the market in central metadata repositories, in the late 1990s, I decided that the players were too new and didn't have sufficient functionality to make an investment a good choice at that time and for that project, a data warehouse project. Instead, we used the tool repository from the ETL tool for the analysis and reporting needs. Years later, when I again needed to assess metadata repositories, I found that the maturity of the market had not significantly changed from my previous analysis. In fact, it seemed that most of the vendors were entirely different except for a couple of players. The top tools had developed compatibility with more types of metadata, but the market still seemed immature, and the top solutions sometimes ran on nearly obsolete technology platforms. The big advance in centralized metadata has come from leadership in metadata standardization, especially from the Object Management Group (OMG), a consortium focused on modeling and model-based standards. Since OMG has identified common formats for the expression of metadata, central metadata repository vendors can more efficiently build integration with various tool repositories without having to deal with myriad proprietary data structures.

Vendors of central metadata repositories are selling very robust and effective products these days. Central metadata repository implementation projects are very much like data warehouse projects, where metadata from various source systems or tools is fed into a central repository. Be careful when entering into such a project, however, and make sure there is a very concrete expression of exactly what will be gained, as these projects are notoriously expensive with strangely elusive return on investment.

INTERVIEW WITH AN EXPERT: ADRIENNE TANNENBAUM ON METADATA AND DATA INTEGRATION

Following is an interview with Adrienne Tannenbaum on the importance and use of metadata in data integration. Adrienne Tannenbaum has specialized in metadata aspects of IT since 1994, when Microsoft partnered with Texas Instruments to architect the first metadata repository. She is a recognized leading expert on metadata and the author of two books on the subject: Implementing a Corporate Repository (Wiley, 1994) and Metadata Solutions (Addison-Wesley, 2002). She is a frequent conference speaker and the author of many articles in various technical publications.

How is data integration (i.e., moving and transforming data) related to metadata?
Moving and transforming data, from one source to another, is based on an underlying metamodel, which describes the physical structure of each source. Simply speaking, the "data" is described by "content" in a metamodel, and this "content" depicts the full transformation story.

What kind of metadata is kept regarding data movement and transformation (business, technical, operational)?
All of the above. In order to move data, we need to know the source (technical), we need to know what it means (business), what will happen to it (business and technical), and where it is going (technical). The process of getting it from one place to another is operational.

Have you seen any data integration projects where metadata issues led to significant problems?
Virtually every project that I have worked on has failed to give the right amount of emphasis to "business" metadata. Even when the technical metadata is complete, it is not associated with sufficient business metadata, such as classifications and business rules, and is therefore misunderstood by the most important people: the business owners, the people who need and use the data that is being integrated.

Have you had experiences where particular attention was paid to metadata for data integration?
I always emphasize the need to capture the business aspects of all data integration efforts. For example, most ETL tools do not require "descriptions" to be associated with the transformation logic, but a good analyst makes sure that those fields are populated.

Have you had experiences where metadata was neglected on a data integration project?
Always. I have yet to work on a data integration effort where the metadata aspects were not neglected. Most project participants do not realize the value since they are convinced that everyone else has the same knowledge base that they do regarding the data itself.

If everyone already knows the answers, they can't understand why they should take the time to document them. I always hear about not having time, being under a tight deadline, and so on.

What kinds of tools or technologies are used to support data integration metadata?
ETL tools have fields such as "definition" and "description" which can be used to capture the "business" aspects of data integration metadata. Of course, the metamodel for each tool has the ability to keep the metadata, which covers the full integration picture (business, technical, operational). Centralized metadata repositories support all types of metadata and are able to integrate specific aspects of "integration" metadata with the details that may already exist in the repository with respect to the sources and targets, as well as the business metadata that surrounds the entire lineage. And of course, unfortunately, the most popular tool that supports data integration metadata is an Excel spreadsheet.

Do you think there should be a central metadata repository in addition to having data integration metadata in the separate tool repositories? Why or why not?
Central repositories contain the only view that is enterprise focused. Data integration is based on sources that are already in existence, and perhaps are also being used for other "integration efforts." Unless one integration effort is aware of another integration effort, an organization runs the risk of what I call the "batch in, batch out" architecture. Unless there is one view across the organization, no one will really know what is going on with major data sources, even if they are well documented for a major data transformation effort.

Are there special considerations around the movement of metadata for data integration into a central metadata repository or into a data warehouse? Are there other places that data integration metadata is needed?
Every source needs to be named consistently so that the same source is not identified twice. But physical sources are typically "scanned" into central repositories, and this issue is resolved based on their identification within a database management system (DBMS) catalog. Problems can arise with manual identification, such as when sourcing information from a spreadsheet.

How is data integration of unstructured data different from that of structured data?
Unstructured data is integrated with structured data by association of both types of data with a fixed classification or taxonomy.

How do you think the area of metadata is changing? Where do you think this area is going?
Metadata is no longer being considered an "area." Instead it is a recognized characteristic of life within IT. The concepts are now becoming "governance" concepts and are more related to metadata content: when it should be populated, why, how, and what valid values should be used.

Do you think the technologies for metadata, especially around data integration, are changing?
Metadata itself is becoming more of an automated capture. The business side of metadata will always require manual intervention of sorts, at some point on the timeline. At best, keyword searches can make best guesses as to how data should be identified or categorized, but the definitions and associations will always require a person.

PART 3 Real-Time Data Integration

CHAPTER 11 Introduction to Real-Time Data Integration

INFORMATION IN THIS CHAPTER
Why real-time data integration? ...... 77
Why two sets of technologies? ...... 78

Why real-time data integration?

For most data integration requirements, overnight batch data movement is simply no longer acceptable. It is not acceptable to be unable to see the results of a business transaction until the next day. Nor is it any longer acceptable for a customer to open an account with an organization but be unable to transact the same day.

Real-time data interactions usually have limitations on the amount or size of data that can be involved in one interaction. The block of data involved in one real-time interaction is referred to as a "message." On the other hand, there are few limits on the size of data involved in a batch interaction. It is also necessary for each real-time message to traverse all the levels of security described for batch interaction, but since this needs to occur for each smaller set of data, or message, real-time data movement tends to be slower for large volumes of data than batch data movement. In some types of applications, the volume-handling capabilities of batch data integration are advantageous and lend themselves to a batch approach to data movement, but most data integration activity is now performed in real time or close to real time.

Real-time interactions between application systems are usually called interfaces, the same term that is used for the batch interactions between application systems. The portfolio of applications to be managed by an organization, which can be daunting for even medium-sized organizations that have only hundreds of distinct active applications, can sometimes be overwhelmed by the complexity of managing the interfaces needed between the applications.

The technology for handling real-time data integration is slightly more complicated than that for batch data integration. The basic activities of extract, transform, and load are still present, but, of course, they are occurring in real time and at a business transaction level.


Managing real-time interfaces between each pair of applications, or "point to point," is significantly less efficient than managing the set of all necessary interactions between the applications in a portfolio, as will be explained in this chapter. Therefore, it is critical that every organization have an enterprise data integration architecture and management capability, for attempting to manage interfaces without one can quickly become overwhelmingly complex.

Why two sets of technologies?

Is it necessary for an organization to have both batch and real-time data integration tools? Why not just have one set of tools that can handle real-time integration and then use the same toolset and engine to load the batched data on the necessary schedule? Why pay for two sets of technology licenses and two sets of experts to code the transformations and support the interfaces?

The first reason for keeping the batch integration capability after creating a real-time data integration capability is that the existing batch interfaces are written, tested, and in production. Moving the interfaces to another technology would be a substantial cost in time and resources. The change might be cost justified by not having to maintain licenses on two sets of technologies and pay for two sets of technical experts. It turns out, however, that the ultimate reason for maintaining two sets of technologies for both batch and real-time data integration is that the speed at which real-time data integration works is usually insufficient to process the volumes of transactions in a normal batch data interface window available for loading a data warehouse, or even to complete a data conversion in the available time period. Real-time data integration is inherently slower because, for each piece of information being moved, all the layers of security access need to be invoked and accessed through the application programming interface (API). Changing a data warehouse architecture to load data as the data in the source system is changed, rather than daily or weekly, would probably alleviate some of the time pressure, but snapshots of data taken at a particular time (such as end of day) may still be too much volume for a real-time interface to process in the available batch window.

Therefore, most organizations implement tools for both batch data integration and real-time data integration and use the appropriate tools for the appropriate task. The batch data integration tools are usually owned by the data warehouse applications and are used for any data conversions when it becomes appropriate.

CHAPTER 12 Data Integration Patterns

INFORMATION IN THIS CHAPTER
Interaction patterns ...... 79
Loose coupling ...... 79
Hub and spoke ...... 80
Synchronous and asynchronous interaction ...... 83
Request and reply ...... 83
Publish and subscribe ...... 84
Two-phase commit ...... 84
Integrating interaction types ...... 85

Interaction patterns

Although the area of data integration is seen as purely technical, the most significant best practices are not about technologies but about designs and patterns of integration. Certain techniques for managing data in motion have been developed to lessen complexity and allow for change and scalability. These are not technologies but approaches to orchestrating the movement of data that, though possibly not intuitive, provide significant advantages over traditional interface development techniques.

Loose coupling

In designing interactions between applications and organizations, one best practice is to design the connection to be as "loose" as possible. What this implies is that a failure of one system doesn't necessarily mean that both systems fail, and that either system could be replaced without the need to change the remaining system. Traditional real-time interfaces between systems tended to "tightly couple" the systems together, especially in that the code of the interface was specific to the connection between only those two systems, and replacing either usually required rewriting the interface.

Designing loose coupling between systems usually requires the clear definition of an API (application programming interface) for at least one side of the interaction. The API defines exactly how information should be formatted to request a function of that application: to provide information, to store information, or to perform some other operation. This is the equivalent of calling a program procedure or function. If the APIs of the applications are well defined, then it should be possible to replace either side of the interaction without the need to recode the other side. The emphasis here is that a well-defined interaction process exists, but it is not necessary to know anything specific about the technology in which it is implemented.

Loosely coupled interactions tend to be "near real time" rather than real time, in that it should be possible for one of the applications to be, at least briefly, unavailable without both applications being unavailable. Therefore, interactions should be designed not to wait for one another. This is not always possible if an application needs information from another application in order to proceed, or if the operation to be performed is a dependency.
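To illustrate, here is a minimal Python sketch of loose coupling through a well-defined interface: the calling code depends only on the agreed contract, so either implementation behind it can be replaced without recoding the caller. The class and method names are illustrative assumptions, not a specific product's API.

```python
# A minimal sketch of loose coupling via a well-defined interface (contract).
from abc import ABC, abstractmethod

class CustomerDirectory(ABC):
    """The agreed API: how a customer lookup must be requested and answered."""

    @abstractmethod
    def get_customer_name(self, customer_id: str) -> str: ...

class LegacyDirectory(CustomerDirectory):
    def get_customer_name(self, customer_id: str) -> str:
        return f"legacy-{customer_id}"      # stands in for a call to an older system

class NewDirectory(CustomerDirectory):
    def get_customer_name(self, customer_id: str) -> str:
        return f"new-{customer_id}"         # stands in for a call to a replacement system

def greet(directory: CustomerDirectory, customer_id: str) -> str:
    # The caller knows nothing about which technology sits behind the interface.
    return f"Hello, {directory.get_customer_name(customer_id)}"

print(greet(LegacyDirectory(), "1001"))
print(greet(NewDirectory(), "1001"))
```

Because the caller is written against the interface rather than either implementation, swapping one system for the other requires no change on the calling side, which is the essence of the loose coupling described above.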

Hub and spoke

The most significant and most important design pattern for architecting real-time data integration solutions is the "hub-and-spoke" design for data interactions. This pattern should be used in order to implement real-time data integration for an organization of even medium size; otherwise the interfaces become overly complex and unmanageable.

Traditional interface designs use a "point to point" interface whereby each system directly interacts with each system with which it wants to share data. So, between any two systems that need to share data, at least one interface is designed and built where data from one system is transformed and passed to the other system. This type of interface design is illustrated in Figure 12.1, showing just five applications in the portfolio. The data transformation, as well as the orchestration of when and how the data is passed, must be specified, and then the programs can be written on both sides. If there are two systems that need to share data, then only one interface needs to be written. If there are four systems that all need to share data, then six interfaces between them are needed. The formula for the number of point-to-point interfaces, if the number of systems involved is equal to "n," is n × (n − 1) / 2. Thus, if an organization has 10 systems that need to share data, the number of interfaces needed is 10 × 9 / 2 = 45. If there are 100 systems, then the number of interfaces is 100 × 99 / 2 = 4950. This formula is an estimate, since not every system in an organization needs to interface with every other; usually, however, there needs to be more than one interface between any two systems with different kinds of data to be shared.

FIGURE 12.1 Exponential Number of Interactions with Point to Point Interfaces.

The number of interfaces grows with the square of the number of systems and quickly becomes unreasonable. Even medium-sized companies usually have more than 100 systems, and the number of interfaces needed for interactions among 1000 systems is almost half a million. Every time a new application is added to the portfolio, interfaces need to be developed to "n" other systems: all the systems already existing in the organization. Figure 12.2 shows the rapid growth in the number of interfaces against the number of applications in an organization's portfolio using point-to-point interfaces.

It is no wonder that most of the complexity in managing a portfolio of applications arises from maintaining the interfaces. As mentioned earlier, the current practice of including purchased vendor packages in the application portfolio whenever possible only exacerbates the problem, since every package uses its own set of tables or files that must be kept in synch with those of the rest of the organization. Modern techniques of data architecture use the creation of central data stores for master data management, business intelligence, and data warehousing to clarify what the proper source is for particular purposes, but these are not the hubs referred to in "hub and spoke." The data integration technique of hub and spoke is most effective if used for every real-time interaction, but it need not replace every existing production interface in order to help bring the massive complexity of interfaces back to a manageable level. The formula for the number of interfaces needed for "n" systems using the hub-and-spoke model is just "n." Every time a new application is added to the portfolio, only one interface needs to be developed, from the new system to the "hub." This technique can potentially change the number of interfaces that an organization needs to manage from an unmanageable quadratic number to a much more reasonable linear number.

FIGURE 12.2 Growth in Number of Interfaces.
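For readers who want to check the arithmetic behind Figure 12.2, here is a minimal Python sketch comparing the two interface-count formulas; the results match the figures cited above.

```python
# Interface counts: point-to-point grows with the square of the number of
# applications, while hub-and-spoke grows linearly.
def point_to_point(n: int) -> int:
    return n * (n - 1) // 2

def hub_and_spoke(n: int) -> int:
    return n

for n in (10, 100, 1000):
    print(f"{n} systems: {point_to_point(n)} point-to-point vs {hub_and_spoke(n)} hub-and-spoke")
# 10 systems:   45 vs 10
# 100 systems:  4950 vs 100
# 1000 systems: 499500 (almost half a million) vs 1000
```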

The hub-and-spoke approach to managing interfaces is the most significant process improvement technique and best practice for application portfolio management efficiency. Figure 12.3 depicts how much fewer and simpler the interfaces are for a portfolio of five applications using the hub-and-spoke design approach. As shown in Figure 12.2, the number of interfaces using a hub-and-spoke architecture is linear, equal to the number of applications.

The data interaction "hub" is a technology solution data structure that business users may not be aware of, even if such a structure is running in their organization. The hub maintains the layout of the organization's data, but it is not a repository of data. In other words, all data passing between systems needs to be transformed into the format defined in the hub, but the data is not stored in the hub. In actuality, the transformed data doesn't really go to a central place but is just transformed into a shared format. The format of data in the hub needs to be well thought out and designed to support all the data interfaces in the organization. The design of the data format in the hub is usually called a canonical model, because it needs to be the definitive design of the data passed between applications, or the "canon."

FIGURE 12.3 Linear Number of Interactions with Hub and Spoke Interface Design.
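As a small illustration of the canonical model idea described above, the following Python sketch transforms a source message into a shared hub format and then into a target format, so neither spoke needs to know the other's layout. The field names are illustrative assumptions, not a real canonical model.

```python
# A minimal sketch of hub-style transformation through a shared (canonical) format.
def billing_to_canonical(msg: dict) -> dict:
    """Spoke in: map the billing system's layout to the canonical model."""
    return {"customer_id": msg["acct"], "amount": float(msg["amt"]), "currency": msg["ccy"]}

def canonical_to_ledger(msg: dict) -> dict:
    """Spoke out: map the canonical model to the ledger system's layout."""
    return {"CUST": msg["customer_id"], "VALUE": msg["amount"], "CUR": msg["currency"]}

billing_message = {"acct": "1001", "amt": "250.00", "ccy": "USD"}
canonical = billing_to_canonical(billing_message)   # source format -> hub format
ledger_message = canonical_to_ledger(canonical)     # hub format -> target format
print(ledger_message)
```

Adding a new application then requires only two mappings (into and out of the canonical format) rather than one mapping per existing system.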

Synchronous and asynchronous interaction

Synchronous interactions are those in which one system sends a message to another and waits for an acknowledgment or response before it proceeds. This type of interaction is common where the requesting system needs something from the second system in order to continue its process, such as further detail on master data or a preference. Clearly, this is a tightly coupled interaction, and both systems must be operational and online for the process to be successful.

Asynchronous interactions are those in which the requesting system doesn't wait for an answer before proceeding, but continues processing and doesn't require information back from the second system in order to do so. An asynchronous interaction may request that some action be taken with no response sent back, or a response may be expected eventually, perhaps even fairly quickly, but the requesting system does not sit and wait for it.
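Here is a minimal Python sketch contrasting the two styles: the synchronous caller blocks until it gets the reply it needs, while the asynchronous caller places its message on a queue and keeps working. The function and message names are illustrative assumptions.

```python
# A minimal sketch of synchronous versus asynchronous interaction.
import queue

def lookup_preference(customer_id: str) -> str:
    return "paperless"                      # stands in for the second system's reply

# Synchronous: the requester waits for the answer before continuing.
preference = lookup_preference("1001")
print(f"continue only after the reply arrives: {preference}")

# Asynchronous: the requester enqueues its message and moves on; a separate
# consumer (here, a simple loop) picks it up later.
outbox: queue.Queue = queue.Queue()
outbox.put({"type": "address_change", "customer_id": "1001"})
print("requester continues without waiting")

while not outbox.empty():
    message = outbox.get()
    print(f"consumer processes {message['type']} later")
```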

Request and reply

A standard interaction model is the request and reply model, whereby one system sends a message to another and expects a response back with either the information requested or an acknowledgment that the request was received and the service requested was completed. This interaction may be either synchronous or asynchronous, and it may be direct or through a hub. Figure 12.4 depicts the simple steps of a request and reply interaction.

FIGURE 12.4 Request and Reply Interaction.

Publish and subscribe

Another standard interaction model is the publish and subscribe model. Application systems declare their interest in a type or piece of information by "subscribing" to the information in question. Systems that produce or acquire the information of interest send any new or revised information out to the systems that have previously declared their interest; that is, they "publish" the information to the network of interested systems. Tracking the subscriptions, and determining whether a subscribing system has access rights to a particular piece of information, may be handled by the publishing system or by a separate orchestration system that manages the data interactions.
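The following is a minimal Python sketch of the publish and subscribe pattern: handlers register their interest in a topic, and the publisher pushes each new message to every current subscriber. The topic and system names are illustrative assumptions, not a real messaging product.

```python
# A minimal sketch of the publish and subscribe interaction pattern.
from collections import defaultdict
from typing import Callable

subscriptions: dict[str, list[Callable[[dict], None]]] = defaultdict(list)

def subscribe(topic: str, handler: Callable[[dict], None]) -> None:
    """A system declares its interest in a type of information."""
    subscriptions[topic].append(handler)

def publish(topic: str, message: dict) -> None:
    """The producing system sends new information to every interested subscriber."""
    for handler in subscriptions[topic]:
        handler(message)

subscribe("customer.updated", lambda m: print("billing received", m))
subscribe("customer.updated", lambda m: print("warehouse received", m))
publish("customer.updated", {"customer_id": "1001", "name": "J. Smith"})
```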

Two-phase commit

A single business transaction may require updates to be made by multiple applications. It may be critical that either all updates are performed or none; in no case should only some of the updates be performed, even if some of the applications fail to complete their assignments. An example is a bank savings account update: if a deposit comes in, then the transaction should be logged and the account balance should be updated, but under no circumstances should one activity take place without the other, even if a disaster occurs (such as a power outage).

As shown in Figure 12.5, a two-phase commit transaction involves confirming first that all of the applications involved are prepared to perform their respective updates and second that all of the applications performed them successfully. If any of the applications involved fails to acknowledge that it performed its part of the transaction, then all the other applications are instructed to undo, or back out, any updates they had made.

FIGURE 12.5 Two-Phase Commit Transaction Back Out. (The figure shows a transaction orchestration component coordinating a General Ledger application, a Transaction Processing application, and an Account Balances application through the steps: 1. Ready to update? 2. Yes. 3. Update. 4. Done or Fail. 5. Undo Update.)
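For illustration, here is a minimal Python sketch of the two-phase commit flow shown in Figure 12.5: the coordinator collects a "ready" vote from every participant and commits only if all vote yes; otherwise every participant is told to back out. The participant classes are illustrative stand-ins, not a real transaction manager.

```python
# A minimal sketch of a two-phase commit coordinator.
class Participant:
    def __init__(self, name: str, will_succeed: bool = True):
        self.name = name
        self.will_succeed = will_succeed

    def prepare(self) -> bool:
        # Phase 1: "Ready to update?"
        print(f"{self.name}: {'ready' if self.will_succeed else 'cannot update'}")
        return self.will_succeed

    def commit(self) -> None:
        print(f"{self.name}: update committed")   # Phase 2: "Update" / "Done"

    def rollback(self) -> None:
        print(f"{self.name}: update undone")      # "Undo update"

def two_phase_commit(participants: list) -> bool:
    votes = [p.prepare() for p in participants]   # collect every phase-1 vote
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.rollback()                              # any failure backs out all updates
    return False

two_phase_commit([Participant("transaction log"),
                  Participant("account balance", will_succeed=False)])
```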

Integrating interaction types

It is sometimes necessary to build interactions between systems that have different inherent interaction models. In order to build an interface between a publish and subscribe system and a request and reply system, it may be necessary to create a somewhat complex store and forward solution that stores all the updates made by the publishing system until a request is made for a particular piece of information. Since this effectively creates another copy of the source system or master file, it is probably more efficient to ensure that master data hubs can interact using both the publish and subscribe and the request and reply interaction methods.

CHAPTER 13 Core Real-Time Data Integration Technologies

INFORMATION IN THIS CHAPTER
Confusing terminology ...... 87
Enterprise service bus (ESB) ...... 88
Interview with an expert: David S. Linthicum on ESB and data integration ...... 89
Service-oriented architecture (SOA) ...... 90
Extensible markup language (XML) ...... 92
Interview with an expert: M. David Allen on XML and data integration ...... 92
Data replication and change data capture ...... 95
Enterprise application integration (EAI) ...... 97
Enterprise information integration (EII) ...... 97

Confusing terminology

Even some experts in the area of data management use the various real-time data integration terms interchangeably, although these concepts have distinct meanings. Therefore, it is not surprising that those who are not experts find the various shades of meaning somewhat confusing.

In summary, enterprise application integration (EAI) is an approach to integrating or passing data between applications of various technologies, whereas service-oriented architecture (SOA) is an interaction approach for applications that use a common technical protocol for interfacing with or calling one another. An enterprise service bus (ESB) is the tool most commonly used for EAI. Since most organizations want to integrate at least a few of their legacy applications, which use different technology than their newer applications (usually written using SOA), plus purchased application packages using whatever technology they happen to be implemented in, most organizations use an ESB for real-time data integration and orchestration. Enterprise information integration (EII) is sometimes used as a synonym for EAI, but it specifically refers to cases where the data underlying an application is accessed directly for integration rather than through the application code and services.

Enterprise service bus (ESB)

Central to most real-time data integration solutions, and especially "hub and spoke" solutions, is the implementation of an enterprise service bus. An ESB is an application that usually runs on a separate server from any of the other applications and tools in the organization's portfolio of business applications. Almost all enterprise, operational, real-time, or near real-time data integration solutions use an enterprise service bus. In other words, most data interfaces that happen while the application systems are up and running and supporting daily work are usually implemented using an ESB. Data interfaces focused on getting information to a central hub at the end of daily work, such as a data warehouse, may use other methods for data integration, such as a batch data integration solution.

An enterprise service bus is used to coordinate the movement of data messages across different servers that may be running different technologies. Figure 13.1 depicts an ESB configuration. Each server (physical or virtual) being connected by the ESB will have an adapter installed on that server and incoming and outgoing message queues for each application. The adapter will handle any transformations that need to occur because of the different technologies involved in the interface. Therefore, the adapters are specific to the operating system and data structure technologies running on their respective servers.

The enterprise service bus will continuously "poll" every outgoing message queue for each application connected, process any messages found, and put messages on the incoming message queue of any relevant application. If a server is not available, or "down," then the ESB will hold any messages for that server until it is available. If the server is available but an application on the server is down, then the messages for that application will sit on the application's message queue until the application is back up and picks up the messages.

FIGURE 13.1 Enterprise Service Bus.

The implementation of an ESB will usually also include the logical implementation of a hub and spoke interaction model. The application that implements the hub and spoke, transforming data into the common format (or canonical model) of the hub and then into the specific format of the receiving systems, may be co-located with the ESB engine or may be a separate application. Similarly, the software that orchestrates the interactions between the applications may be independent of, or packaged with, the other coordination components of the ESB. The orchestration software will handle subscriptions by applications showing interest in particular types of information and the subsequent distribution of the relevant information when it is published. The ESB will also need a monitoring application that a systems administrator can use to view and manage the data movement of the ESB. Enterprise service buses are generally classified as "middleware," as they support the application software across operating systems.
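To make the polling behavior concrete, here is a minimal Python sketch of a bus loop that checks each application's outgoing queue and routes any messages, after a simple adapter-style transformation, to the incoming queues of the interested applications. The queue names, routing table, and transformation are illustrative assumptions, not the design of a real ESB product.

```python
# A minimal sketch of an ESB-style polling and routing loop.
import queue

outgoing = {"orders": queue.Queue(), "billing": queue.Queue()}
incoming = {"orders": queue.Queue(), "billing": queue.Queue(), "warehouse": queue.Queue()}

# Which applications receive messages produced by each source application.
ROUTES = {"orders": ["billing", "warehouse"], "billing": ["warehouse"]}

def transform(message: dict, target: str) -> dict:
    """Adapter step: reshape the message for the target's expected format."""
    return {**message, "target": target}

def poll_once() -> None:
    """Poll every outgoing queue and deliver messages to the relevant incoming queues."""
    for source, out_queue in outgoing.items():
        while not out_queue.empty():
            message = out_queue.get()
            for target in ROUTES.get(source, []):
                incoming[target].put(transform(message, target))

outgoing["orders"].put({"order_id": 42, "status": "shipped"})
poll_once()
print(incoming["warehouse"].get())
```

In a real bus the loop runs continuously, and messages for an unavailable application simply remain on its queue until it comes back up, as described above.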

INTERVIEW WITH AN EXPERT: DAVID S. LINTHICUM ON ESB AND DATA INTEGRATION

Following is an interview with David S. Linthicum on the importance and use of ESBs in data integration. David (Dave) S. Linthicum is the Chief Technology Officer and founder of Blue Mountain Labs. He is an internationally recognized industry expert and thought leader, and the author and coauthor of 13 books on computing, including the best-selling Enterprise Application Integration (Addison-Wesley). He serves as keynote speaker at many leading technology conferences on cloud computing, SOA, enterprise application integration, and enterprise architecture, and has appeared on a number of television and radio shows as a computing expert. His latest book is Cloud Computing and SOA Convergence in Your Enterprise, a Step-by-Step Approach.

What is an enterprise service bus (ESB)?
An enterprise service bus (ESB) is a type of software used for designing and implementing the interaction and communication between applications in real time. Its primary use is in enterprise application integration (EAI) of heterogeneous and complex architectures.

What are the capabilities that an ESB needs to provide for data integration?
ESBs provide the ability to consume data from source data storage, process that data in flight, with operations such as transformation and semantic mediation, and then place the data in the target data storage. While not typically suited for ETL-type operations, the ESB provides event-driven data integration that may occur in and between several heterogeneous systems and databases. This is an evolution of the message brokers, created in the 1990s, which provide similar features.

What kind of metadata does an ESB require or keep regarding data movement and transformation?
It largely depends on the ESB, but typically they store:
• Data structure
• Data
• Data constraints
• Data service information
• Data ownership

• Data security information
• Data governance information

What is involved in an ESB configuration (what has to be done, and what kind of resources and skills are required)?
Again, it depends on the ESB. However, general activities should include:
• Understanding the source and the target systems, including data semantics, data structure, and mechanisms for connectivity.
• Designing the flows between source and target systems.
• Designing the transformations of the data as it flows from source to target.
• Defining service governance and other controls that should be placed on the data at rest and in flight.
• Testing the integration to determine the system's ability to meet the requirements.

What is the effort (activities, resources, skills) needed to operate an ESB in production?
You need somebody who understands all sources and targets, all flows, all connections and adapters, the ESB software, and how to deal with errors and exceptions.

Have you had experiences in which the ESB operation was neglected (underresourced)?
Most of the failures I've seen don't really involve lack of planning and architecture. Rather, failure occurs when an attempt is made to leverage an ESB inappropriately, perhaps because the vendors have been promoting the product as "SOA in a box." Thus, when the vendors get the technology implemented, they seek to use it as a services broker. ESBs are more about information integration, while typically dealing with services as mechanisms to push and pull data off of the ESB; that does not make it an SOA.

Is data integration different for structured and unstructured data? Does an ESB operation work for structured and unstructured data? Are there different tools?
In order for an ESB to work with unstructured data, the data must be placed into some kind of structure that the ESB can handle. However, a few tools are beginning to support unstructured data as a native format as big data becomes more of a driving force. I suspect that most ESBs will morph around the ability to process unstructured data moving forward.

How do you think ESB tools are changing? In what direction do you think this area is headed?
As mentioned above, ESB tools are morphing around the integration requirements of big data systems, including the ability to deal with both structured and unstructured data, as well as larger volumes of data and higher transfer rates. Of course, the interest in cloud computing means that more of them will be localized for the more popular public and private clouds, and some ESBs will be provided as a service out of a cloud.

Service-oriented architecture (SOA)

Service-oriented architecture is a design principle in which software is designed and built in pieces that provide well-defined services when requested. The services may involve performing some activity or returning some information or answer. Most current custom software development is designed and programmed using SOA. Vendor packages that can be purchased usually also provide the capability to interact as services that can be invoked through an API. Service-oriented architecture is particularly relevant in integration because the well-defined nature of the services directs the various systems and technologies on the protocol for interactions.

SOA best practice principles include loose coupling of services, such that the particular technology used in the implementation of a service should not be of interest to the users or applications invoking it, only the result. Data services in SOA are those that either provide or update information.

The differences between an ESB and SOA are, first, that an ESB is a tool or engine whereas SOA is a design philosophy, and second, that an ESB is usually used in a heterogeneous application environment of differing technologies that need to pass data around, whereas SOA implies a more homogeneous technology environment. An ESB might connect some applications that have been written as services, but it usually also includes applications that have not been written consistently, and so the ESB provides the needed layers of technical translation and transformation. If all the applications that need to interact have been developed using consistent standards and protocols, then an ESB might be unnecessary, although it still provides the orchestration needed for the interactions. In the real world of integrating new applications with older legacy applications, an ESB is needed. Coding the overhead of orchestrating an application's interactions into the application software itself would be a great burden that is easier to offload onto specialized ESB technology, just as we offload the overhead of data storage to a database management system.

A bit of tension exists between efficient real-time data integration solutions and pure service-oriented architecture. Software designers and application architects may want to implement all things in an application as services, in order to stay true to the design principles of SOA. However, in most applications it may be more efficient for the application programmers to understand how the data is actually being accessed, such as through an ESB or a database management system (DBMS), and to design and code according to the efficiencies of the underlying data solutions. This issue can be contentious. Much of the functionality of the service layer, or code, around a data service, such as security access checking, may be redundant, duplicating the services performed by the underlying database management software. Solutions coded to retrieve or update data without taking the underlying solution into account may be so slow as to be unworkable, with unacceptable delay in response, or latency.

Improvements in solutions and technologies have started to make the need to understand the underlying implementation of data services somewhat less necessary. See the discussion of data virtualization in Chapter 20 later in this book. Achieving a balance between data integration solutions and SOA working together depends on finding the right granularity when defining services. For operational efficiency, it is not a good idea to define data services of too fine a granularity; it is better to define application services that perform recognizable business functions. Again, this may contrast with some SOA design principles, but in practice it may be necessary to loosen some pure theory in order to achieve better performance efficiency.

Service-oriented architectures include a registry of available services and how they are invoked. Although originally the protocol used with SOA was SOAP (Simple Object Access Protocol), it has become common to use alternative protocols such as REST (REpresentational State Transfer).
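As a simple illustration of invoking a RESTful data service, here is a minimal Python sketch using only the standard library; the caller needs to know only the agreed resource URL and representation, not the technology behind the service. The endpoint and response fields are hypothetical.

```python
# A minimal sketch of calling a RESTful data service (hypothetical endpoint).
import json
import urllib.request

def get_customer(customer_id: str) -> dict:
    """Retrieve one customer resource as a JSON representation."""
    url = f"https://example.com/api/customers/{customer_id}"
    with urllib.request.urlopen(url) as response:      # HTTP GET on the resource
        return json.loads(response.read().decode("utf-8"))

# Example use (commented out because the endpoint is hypothetical):
# customer = get_customer("1001")
# print(customer["name"])
```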

Extensible markup language (XML)

The more removed from one another the things that need to interact are, whether organizations, applications, or services, the more necessary it is that the interaction method be flexible and well defined. Most service interactions are defined to use XML (extensible markup language) as the data interchange format for passing information. Use of XML is not a requirement, and there are some performance inefficiencies involved in parsing XML, but XML has been the de facto standard for message interactions for the last decade. It is a useful approach to defining interactions because it is readable by both humans and machines, and each piece of data or content being passed includes a description of what that data is, as a "markup" or "tag." Therefore, all data in the XML format is accompanied by at least a brief explanation of what the data means. Data interchanges using XML don't have to have a predetermined agreement regarding the format of the data, since this metadata can accompany the data contents. Since XML is a textual data format, it is very useful for managing documents as well as messages and interactions. Other popular self-documenting data interchange formats include JSON (JavaScript Object Notation).

Just because an interface is said to be in XML does not indicate more than a basic technology choice. It certainly does not indicate what type of data can be found in the XML. It is like saying that software is written using a certain programming language; in fact, there are many languages based on XML syntax. XML may be presented in a predefined format, following a DTD (document type definition), or it may contain information in an unexpected format that still follows the rules of a "well-formed" XML document. Interactions with an SOA service would need to follow an expected format, or schema, as defined by the service.
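To show what "self-describing" means in practice, here is a minimal Python sketch that parses a small XML message using the standard library; each value arrives wrapped in a tag naming what it is, so the receiver can pick fields out by name. The message layout is an illustrative assumption, not a standard schema.

```python
# A minimal sketch of parsing a self-describing XML message.
import xml.etree.ElementTree as ET

message = """
<customer>
    <id>1001</id>
    <name>Jane Smith</name>
    <country>USA</country>
</customer>
"""

root = ET.fromstring(message)
# Each piece of data carries its own tag, so fields can be retrieved by name.
print(root.find("id").text, root.find("name").text, root.find("country").text)
```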

INTERVIEW WITH AN EXPERT: M. DAVID ALLEN ON XML AND DATA INTEGRATION

Following is an interview with M. David Allen on the importance and use of XML in data integration. M. David Allen has been working on data integration for about 13 years, focused mostly on the use of XML. He is co-author of XML in Data Management, published by Morgan Kaufmann, and has written a number of articles on aspects of XML schema development and the use of XML messaging formats in data interoperability solutions.

What is XML?
XML is a markup language. It is a set of rules that specify how to encode data in a way that is easily readable by software. XML is not just one format for data, but it contains ways of building as many different formats as a user might need.

Most of the time, when people refer to XML they are actually talking about a particular vocabulary or data model built using XML technologies, but XML itself is extremely general and flexible. What is common to all applications of XML is that they all use elements, attributes, and content to create documents.

How is XML used in data integration?
When asking how XML is related to data integration and how it is used, we need to look at the big picture of XML, which can refer to many different families of technologies. They are all used across the gamut of data integration. XML databases store instance documents; extraction tools pull data out of non-XML databases and format it as XML messages; enterprise service buses can inspect XML messages and route them based on their contents; XML describes the bits and bytes of what often crosses the wire in an integration; software written in XML (i.e., XSLT) can change the form of the data as it moves; and other tools provide native interfaces between XML documents and legacy software. While most integration challenges won't use XML everywhere, XML solutions are present and in use for just about every sub-niche of data integration.

Why is XML particularly useful for data integration?
XML is useful because it is well suited to both structured and unstructured information (flexible). XML also represents an open standard that helps prevent lock-in to any particular proprietary technology. That was one of its original strengths when it came out, and while that advantage is no longer unique, it is still valuable. It integrates well with many other widely adopted lower-level technical standards (URIs, URLs, Unicode). XML has a very mature technology and tool base, and it has a wide variety of technical practitioners in the job market who are familiar with it. It came along at a time when computing was just breaking away from proprietary standards, and this "early mover" advantage means that many tools which don't even process XML will at least have some functionality to export their data as XML.

Have you seen any data integration projects where XML issues led to significant problems, or where insufficient focus was put on XML issues?
Yes. Often the problems with XML stem from naïve mappings to and from XML. Simply put, XML is a document-oriented, hierarchical data format. What people are moving in and out of XML typically isn't, and the mapping may be problematic. For example, I've seen projects attempt to replicate a relational model inside of XML, resulting in XML schema layouts that were much too large to be feasible. I've also seen projects miss the point of the structure that XML can offer, ending up packing complex text records inside a single element attribute that later needed to be parsed out with specialized software. Just like anything else that requires a data model, there are many possible pitfalls surrounding the modeling stage.

Have you had experiences where particular attention was paid to the XML for data integration?
Generally, XML does not get the attention; most of the attention goes to the business need, the budgeting realities, and the technical choices that have to be made to minimize the impact on legacy systems. XML, then, is just one of many technical approaches used to create a business effect in a much larger context.
Some might even argue that the best integration solutions are those that work so well that users do not even need to know or care whether XML was involved. Even within technical teams, XML only infrequently gets the attention. Technical teams are often composed of those who really understand the underlying systems (whether they are Oracle databases, document repositories, or whatever else) or who really understand the available tools (ETL, mapping, and so on). Much of the time, they'll have background knowledge of XML, but the focus will be not on the XML but on the underlying data store, the tool, or the design approach.

What kinds of tools or technologies are used to support data integration and XML?
A host of data-mapping tools are available which generally focus on describing correspondences between one model and another, and on generating transform code. There are a great number of modeling tools that permit the creation of high-level models in languages like UML, which can be "forward engineered" into XML models. Many ETL tools have options for extracting or transforming data through XML. Some databases store nothing but XML documents (MarkLogic, eXist, Sedna). These are often used as native stores in the case of unstructured information integration, and they may be used as a staging area or as part of a processing pipeline for a larger application that needs to work with XML. At lower layers of application development, there are absolutely critical tools that perform Object/XML mapping (XMLBeans, JAXB), permitting developers to serialize and deserialize XML documents into a series of in-memory objects in Java or C#. This class of tools also includes database-to-XML mapping packages, which can expose the contents of a relational database directly as XML documents. One way to summarize this: XML is so ubiquitous that for just about any other technology (programming language, relational database, modeling language, etc.) there is a tool available that specializes in mapping from that technology to XML documents and back again. All of them are used to support data integration efforts in different niches.

How is data integration XML for structured data different from data integration XML for unstructured data? Is the data integration different?
The integration approaches are very different. Integration of highly structured data can often take advantage of approaches that exploit the details of the data model. Unstructured data by its nature usually doesn't have as much of (or any) a formal model describing its contents, which calls for a different set of techniques for integration and also imposes certain limitations on the uses you can expect to get out of the data.

How do you think the area of XML and data integration is changing? Where do you think this area is headed?
The area is changing quite a bit. Several kinds of changes affect XML and data integration, which I'd like to discuss individually:
• Overall integration architecture (EAI and EII versus SOAP versus RESTful architecture)
• How to manage the heterogeneity of formats and models: Should we have one standard?
• How applications are developed (software development methodology)
• How data is processed and stored (big data, map/reduce, etc.)
When XML was first created, the focus of integration architecture appeared to be on heavyweight EAI and EII approaches. As time passed and service-oriented architecture began to get the lion's share of attention, large stacks of XML technologies (the Web services architecture stack) were developed to support that approach, including Simple Object Access Protocol (SOAP), Web Services Description Language (WSDL), Universal Description Discovery and Integration (UDDI), and so on. Today mainstream development focuses on RESTful architecture (Representational State Transfer) at least as much (if not more) than SOA.
The design approach behind RESTful services has built into it the notion that resources may have multiple representations, in JSON, XML, and other formats as well. So in roughly 15 years of XML, we've gone through three major architecture shifts, and I wouldn't put my money on RESTful architecture being the last one!

A good amount of data integration work has focused on trying to get data into one single XML format that everyone would agree upon. The spoke-and-hub diagram is often shown to describe the benefit of this approach; instead of having many data formats and an order of n-squared translation problem, the spoke-and-hub design with a single XML format requires only n translations, which is the best possible case. While a single message format is optimal, in many cases it may not be practical, either because all of the stakeholders cannot agree on any one format, or because their uses for the data differ too much. One change I think we're seeing now is a move toward a more pragmatic compromise. Indeed, if there are 15 different formats, the translation burden is too high. But the name of the game is not necessarily to eliminate all of this heterogeneity, but to get it down to a manageable amount. Many modern RESTful services support both XML and JSON, and sometimes a third format. Supporting 3 formats isn't ideal, but it is much better than supporting 10, and may be much easier than getting an entire community of users and developers with very different needs to agree on just one. So what's changing? Using more than one approach for integration at the same time seems to be gaining acceptance as a good engineering practice to balance trade-offs. In the past, it was often seen as a failure to implement a proper hub-and-spoke approach. In my view, this is a very good development and a move towards pragmatic design, since the number of systems and the complexity of integration problems are only increasing with time.

Another thing that's changing a lot is the application development ecosystem. There's much more focus on agile development methodologies and their offshoots, as well as on big data and nontraditional data stores that aren't relational under the covers. Changes to application development methodologies matter a great deal for data integration work because they impact not only how the applications we're trying to integrate are built, but how the integration solutions themselves are built. Two of the key correct assumptions behind these new approaches are that you cannot know all of the requirements ahead of time and that systems should be flexible enough to permit constant change and adaptation. Big data databases are focusing on different data structures such as directed graphs, JSON documents, and key/value stores. New processing methods such as map/reduce have come along with big data approaches. This all constitutes a lot of change for data integration, because creating the big data stores is in itself often an integration exercise, and because today's NoSQL database is tomorrow's data integration headache!

When would the use of JSON be more appropriate or preferred to XML?
JSON's strengths shine when the data sent needs to change rapidly, or when the data format needs to support Web application development with a heavy JavaScript component. Because JSON is generally schema-less, it works very well for rapid prototyping, when the format of the message may change several times per week as an application begins to take shape. This is in contrast to most XML approaches, which would require maintenance of the XML schema, and possibly other software artifacts as well.
Many Web applications nowadays will opt to make their data available as JSON because they expect that user applications written in a mixture of HTML5, CSS, and JavaScript will need to take advantage of the data, and JSON is the friendliest possible format for JavaScript programmers.

Data replication and change data capture

The original purpose of data replication was to enable two data stores to be kept in synch with one another, with a minimal amount of information required to be passed between them as well as a minimal amount of impact on the source application.

passed between them as well as a minimal amount of impact on the source appli- cation. Data replication was used frequently because of the tremendous latency, or time delay, involved in data being accessed or updated across long distances. If an organization had a data set that they wanted to use around the world, for example, it was more efficient to keep multiple copies of the data close to each of the places where it needed to be updated or used and then keep the data synchro- nized using data replication. The downside was having the multiple copies and having to pay for the disk storage each required. Data replication, or change data capture, is usually a capability offered by a database management system or other data storage solution. Rather than send cop- ies of the data structures changed between instances of the replicated data, data replication usually sends copies of the data structure change log entries. Since log entries are usually very small compared to the entire data structure, sending log entries is smaller and faster. Also, data replication minimizes the impact on the applications: The monitoring of changes is done on the log and not on the data used by the application. Change data capture (see Figure 13.2) is a very effective way of allowing data to be used across remote locations; it was very necessary at a time when wide area network access was very, very slow. Now, emerging technologies are finding this to be an effective way of solving many other latency and impact issues. Because the source data structure itself is not accessed, the impact on the source system is minimal and it is useful in cases where the response time of the source system can’t be affected.
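The sketch below illustrates the log-based idea in miniature, using Python and SQLite. The change_log table here is only a stand-in for the database management system's own transaction log, which real replication and change data capture products read directly; the table and column names are invented for the example.

```python
import sqlite3

# Two in-memory databases stand in for the source system and the remote replica.
source = sqlite3.connect(":memory:")
replica = sqlite3.connect(":memory:")

source.execute("CREATE TABLE customer (id TEXT PRIMARY KEY, city TEXT)")
# Hypothetical change log; a real DBMS replication feature would read its own
# transaction log rather than an application-visible table like this.
source.execute("CREATE TABLE change_log (seq INTEGER PRIMARY KEY, id TEXT, city TEXT)")
replica.execute("CREATE TABLE customer (id TEXT PRIMARY KEY, city TEXT)")

# The source application makes a change; the log entry shipped to the replica
# is small compared with copying the whole table.
source.execute("INSERT INTO customer VALUES ('C-1001', 'Boston')")
source.execute("INSERT INTO change_log (id, city) VALUES ('C-1001', 'Boston')")

def replicate(last_applied_seq: int) -> int:
    """Apply any log entries newer than last_applied_seq to the replica."""
    rows = source.execute(
        "SELECT seq, id, city FROM change_log WHERE seq > ? ORDER BY seq",
        (last_applied_seq,),
    ).fetchall()
    for seq, cid, city in rows:
        replica.execute("INSERT OR REPLACE INTO customer VALUES (?, ?)", (cid, city))
        last_applied_seq = seq
    replica.commit()
    return last_applied_seq

last = replicate(0)
print(replica.execute("SELECT * FROM customer").fetchall())  # [('C-1001', 'Boston')]
```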

FIGURE 13.2 Change Data Capture.

Enterprise application integration (EAI)
Enterprise application integration (EAI) is a type of data integration architecture or approach. Its subject covers all the techniques of integrating applications of various technologies, including the hub and spoke approach, using ESBs, and various interaction patterns such as publish and subscribe. The point of enterprise application integration is to connect applications in the portfolio that were developed and implemented at differing times and using differing technologies, thus integrating heterogeneous applications.

The best practice in data integration is to interact through the applications and let the application code handle access to the underlying data structures. Even if an application does not have a predefined set of services or APIs, it is still best to try to build a wrapper that invokes the application code to access the data or data function needed. Many vendor packages are quite insistent that interfaces should not bypass the application code, and it is best not to, especially for any update to the data stores underlying an application.
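A minimal sketch of that wrapper idea follows, assuming a hypothetical legacy routine (legacy_update_customer_address) is the application's only supported entry point. The wrapper exposes a service-style call that other systems or an ESB adapter can invoke, so no interface ever writes to the application's tables directly.

```python
# Hypothetical legacy entry point; in practice this might be a COBOL program,
# a stored procedure, or a vendor package API rather than a Python function.
def legacy_update_customer_address(customer_id: str, address: str) -> bool:
    # The legacy application code enforces its own validation and business rules.
    return bool(customer_id and address)

class CustomerServiceWrapper:
    """Service-style facade that routes every update through the application
    code instead of writing to its underlying database tables."""

    def update_address(self, customer_id: str, address: str) -> dict:
        ok = legacy_update_customer_address(customer_id, address)
        # Return a message the ESB or calling application can consume.
        return {"customer_id": customer_id,
                "status": "updated" if ok else "rejected"}

wrapper = CustomerServiceWrapper()
print(wrapper.update_address("C-1001", "1 Main Street"))
# {'customer_id': 'C-1001', 'status': 'updated'}
```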

Enterprise information integration (EII)
In the unfortunate event that the application code to access the needed data cannot be invoked but must be bypassed, then it may be necessary to access the application data structures directly. A data integration interface directly to the data structures underlying a legacy application bypasses the application code because the code needed to access the data structures is not accessible.

FIGURE 13.3 Accessing a Data Structure Directly in Enterprise Information Integration.

As shown in Figure 13.3, a small amount of code needs to be written to bypass the application and access the data directly. The data integration code to access the legacy data structures would be written using the standard application development tools, the standard data integration tools, the legacy data structure tools, or an appropriate combination. For example, if the application where customer data is produced is a legacy system with an underlying relational database and the application doesn't have an API or service defined, code may be written in the database (bypassing the application). This code will monitor for any updates and write the updated customer information onto the outgoing queue for the adapter running on that server, to be published to interested applications by the enterprise service bus.
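The sketch below imitates that pattern with SQLite, which is only a stand-in for whatever trigger or monitoring mechanism the legacy database actually offers; the customer and outgoing_queue tables are invented for the example. A database trigger captures each update and writes it to a staging queue that the adapter would read and publish on the bus.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE customer (id TEXT PRIMARY KEY, address TEXT);
-- Hypothetical staging table that the ESB adapter on this server polls.
CREATE TABLE outgoing_queue (seq INTEGER PRIMARY KEY, customer_id TEXT, address TEXT);

-- Database-side code that watches for updates, bypassing the application.
CREATE TRIGGER customer_changed AFTER UPDATE ON customer
BEGIN
    INSERT INTO outgoing_queue (customer_id, address) VALUES (NEW.id, NEW.address);
END;
""")

db.execute("INSERT INTO customer VALUES ('C-1001', '1 Main Street')")
db.execute("UPDATE customer SET address = '2 High Street' WHERE id = 'C-1001'")

# The adapter would read this queue and publish each entry on the service bus.
print(db.execute("SELECT * FROM outgoing_queue").fetchall())
# [(1, 'C-1001', '2 High Street')]
```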

CHAPTER 14 Data Integration Modeling

INFORMATION IN THIS CHAPTER
Canonical modeling
Interview with an expert: Dagna Gaythorpe on canonical modeling and data integration
Message modeling

Canonical modeling
Creating a hub and spoke model for real-time system interoperability has significant benefits, as it allows an additional system to be placed in the portfolio and integrated with the other systems simply by establishing interfaces from the new system to only one other system: the "hub," or central system that interfaces with all others. Using a hub and spoke architecture can significantly reduce the complexity of the system interactions and interfaces that must be maintained, from a number of interfaces on the order of n-squared to a linear number of interfaces. The additional investment needed to use a hub and spoke approach, however, is the creation of one additional (possibly only virtual) system: The "hub" must be added to the portfolio. The "hub" application of the portfolio will usually be part of an enterprise service bus or a middleware component of a vendor package that provides the required functionality. The hard part for an organization is not just implementing the technology but defining a data model for the hub that can sufficiently handle all the passing of data necessary between the applications: a canonical model for the shared data of the organization.

The canonical model for the organization is not necessarily all the data of the organization, but all the data that needs to be shared. The word "canon" means a rule, but it has come to be used to mean, as in this case, the unique distinguishing standard of something. A "canonical data model" for an organization can mean the documentation of the important business terms and items in that organization, with definitions of each business term and the relationships between the terms. In regards to data integration, it also means defining the data in the central hub of interoperability, which can be used for the pass-through of all real-time interfaces in the organization. Data in the hub is not persistent: The hub is virtual and refers to data that is in the process of being translated to or from the format of one of the applications in the portfolio. So, although there may be no data stored in the hub, the data model of the hub may be extensive, since it must incorporate all the data in the organization that is to be shared between systems. The goal of this model is to provide a set of reusable common objects at an enterprise or business-domain level to enhance system interoperability.
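As a toy illustration of why the hub pays off, the sketch below (with invented systems and field names) maps a billing-system record into a canonical form and then out to a CRM format. Each application contributes only its own pair of translations to and from the canonical model, so adding a new system does not require touching any of the existing ones.

```python
# Canonical form for shared customer data (illustrative fields only).
CANONICAL_FIELDS = ("customer_id", "full_name")

def billing_to_canonical(rec: dict) -> dict:
    # The billing system calls the same concepts CUST_NO and CUST_NAME.
    return {"customer_id": rec["CUST_NO"], "full_name": rec["CUST_NAME"]}

def canonical_to_crm(rec: dict) -> dict:
    # The CRM package expects its own field names.
    return {"id": rec["customer_id"], "displayName": rec["full_name"]}

# A message passing through the (virtual) hub: source format -> canonical -> target format.
billing_record = {"CUST_NO": "C-1001", "CUST_NAME": "Acme Ltd."}
canonical = billing_to_canonical(billing_record)
crm_record = canonical_to_crm(canonical)
print(crm_record)  # {'id': 'C-1001', 'displayName': 'Acme Ltd.'}
```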

INTERVIEW WITH AN EXPERT: DAGNA GAYTHORPE ON CANONICAL MODELING AND DATA INTEGRATION
Dagna Gaythorpe has been involved in data modeling, architecture, and management for 20 years. She has performed the technical proofreading of two books: Data Modelling Essentials, by Graeme Simsion and Graham Witt, published by Elsevier in 2005; and Data Model Patterns—Metadata by David Hay, published by Elsevier in 2006. She is a regular attendee and speaker at conferences in the United States and Europe, and is on the boards of DAMA International and DAMA UK.

What is Canonical Modeling?
Canonical modeling is the basis of a spoke and hub approach to data integration. There are other approaches; the main one is point to point, which is far more time consuming. The canonical approach uses a common central model and maps everything to that, so that (assuming the mapping has been done correctly) it is possible to say with certainty that field A in this system holds the same data as field B in that one (although they have different names and formats). This speeds up the data-mapping approach, as each field only has to be mapped once, to the canonical model. After the first data set is mapped, the point to point mappings for each pair of systems become part of the canonical metadata set, centered on the common canonical data item. If bespoke systems are being developed (so the database is being designed for the purpose), then using the canonical model for the database design means that no transformation of the data coming in or going out is needed. In my view, use of the canonical model as the basis for all bespoke database design is an excellent strategic approach. Over time it will reduce the amount of transformation that needs to be done, as well as lessen the "what does this system mean by X?" sort of confusion that makes life "interesting" for business and management information reporting.

How is Data Integration (i.e., Moving and Transforming Data) Related to Canonical Modeling?
If the data being transferred is in a common format, then it only needs to be transformed to or from that common format on its way out of, or into, each database. So instead of a transformation routine for every source, the targets only need one routine, from the canonical format—unless the target used the canonical data model for the database design, in which case they don't need to transform, just load. And each system (that doesn't use the canonical forms) does need to transform its data on the way out. For most organizations, the move to a canonical approach may take years and will happen as old applications are replaced or as a side effect of other work. Thus, in reality most systems will be doing some point-to-point transformations alongside the canonical common ones. This is why I mentioned that it should be a strategic approach: It will take a lot of time (unless there is some big-bang transformation going on), but it will deliver more and more benefits as that time passes and more systems "come into the fold."

What Kind of Metadata is Kept Regarding Canonical Modeling and Data Integration?
In addition to the metadata specifying the canonical model and the mappings from the source systems to target systems, I would also keep the cross references, recording where each item (entity or attribute) in the canonical model appears in all the other mapped systems; identifying the systems where it gets created (bonus points if there is only one); and, for major entities, holding some sort of life cycle, showing how it moves through the systems, what they do to it and what they use it for, including where it gets sent outside the organization, if it does.

At What Point in a Data Integration Project is a Canonical Model Created? Why? What Activities are Involved in Developing a Canonical Model?
Creation of a canonical model begins as soon as you start looking at sources and targets. Ideally, you start with the enterprise data model, if there is one. If there isn't, then the canonical data model can be the start of the enterprise model. If there is an enterprise model, or a suitable logical model, then I start by mapping one system to that. This is almost always in a spreadsheet, which makes sorting and reviewing easier. If there is no suitable model, then I have to build one, and map the other systems to it. If there is no existing model, but an obvious core system exists, use that as the starting point.

How is Canonical Modeling Similar to or Different From Other Kinds of Modeling? Is it Necessary to Create a Logical and Physical Version of a Canonical Model? Other Versions?
Canonical modeling has a lot in common with enterprise and programming modeling. In all of them, the model is describing the common vocabulary of the part of the enterprise being modeled, and that vocabulary will be used by other people as the basis for what they do. It is also likely that few, if any, of the "consumers" of these models will use the whole thing; they are likely to use a subset of it, possibly most of it, but few if any systems, programs, or other developments will use the whole model. The main exceptions to this are ERP systems and data warehouses (so if there is an ERP system or warehouse connected to the area the canonical model is covering, that is a good place to start looking for or developing the canonical model). In general, I am in favor of keeping the modeling as simple as possible. This includes minimizing the number of models. In my opinion, rarely does any one thing need more than two models—one higher level (e.g., logical) and one detailed (e.g., physical). In the case of canonical modeling, I favor a single model. It will probably start out logical, and gradually get more physical as time passes and more of the rules and definitions are recorded. This happens because the canonical model isn't actually a model: It is (if it is working) a repository/dictionary where things get translated and identified.

Have you Seen any Data Integration Projects Where Canonical Modeling Issues Led to Significant Problems?
I have encountered instances where a large project ran into trouble because the various groups working on it saw the canonical model as an academic exercise that was getting in their way. As a result, they put a lot of effort into developing point-to-point interfaces, and worked in their own silos, developing their own models, which was a huge waste of effort.

Have you Had Experiences Where Particular Attention was Paid to Canonical Modeling for Data Integration? How did that Work? What Was the Impact on the Project Schedule?
Two instances come to mind. One was a data warehouse project involving taking and merging data from a number (more than 10) of systems that all had the same basic function, but all handled it differently (even the two copies of the same package that had been implemented). They had to merge the data and then pass a coherent version on to another set of systems. In the second instance, the developers decided that they were implementing SOA not as a program, but as the way they would do things in the future. In both cases, the canonical model was used as the basis for all the data transfers, ETL, and messaging—not as a "this is how you must do it" approach, but as a "this is how we do it and if you want to do it any other way, please justify the extra work and time you will take" approach. The canonical model was used as a way of getting things done much faster than a piecemeal approach would have allowed.

Have you Had Experiences Where Canonical Modeling was Neglected on a Data Integration Project? What Happened? What was the Impact on the Project Schedule?
I tend to sneak a canonical model into a data integration project as a matter of course. I describe it as the standard layout, and people tend to accept it.

Is it Generally Accepted on Real-Time Data Integration Projects that a Canonical Model is Needed? Why or Why Not?
I myself accept it, but I have encountered resistance from time to time, mostly from people who want everyone else to do things the way their piece of the world does. Further, they think that it is perfectly reasonable to expect everyone else to make the extra effort to accommodate them. This is never a happy situation and can lead to annoyance among the people who are expected to make the accommodation.

What Kinds of Tools or Technologies are Used to Support Canonical Modeling? Do you think the Technologies for Canonical Modeling, Especially Around Data Integration, are Changing?
Modeling tools are getting more sophisticated now and are integrating with each other more than they did formerly. But there are still problems getting process and data models linked, for example, and if an organization has more than one data-modeling tool, then canonical modeling can become challenging, since those tools often don't talk to each other. This is one of the reasons that spreadsheets still seem to be the main tool and type of repository, along with the ease of adding things to a spreadsheet. The term "canonical model" seems to cover a whole range of ideas, from the messages used to exchange data to the core data being exchanged. The notation used varies widely, depending on what the people developing the model are comfortable with. I think that as the idea of canonical modeling develops, and as it settles down and matures, we will see it incorporated into data- and process-modeling tools.

Are the Tools for Canonical Modeling Different for Structured and Unstructured Data?
They are different in the same way that the tools for modeling structured and unstructured data are different—though I think it has more to do with the notations and methods used than with the tools used to do them.

How do you Think the Area of Canonical Modeling is Changing? Where do you Think this Area is Headed?
I think that canonical modeling will eventually stop being something we have to ask about, and will start to be seen as one of the inputs into data design, in the same way that a logical model is now. And I also think (and hope) that it will be done as part of the enterprise information/data architecture function, and supplied to anyone who needs it, alongside the common logical model, ideally as part of (and an extension to) that logical model.

Message modeling
Modeling the layouts of the interfaces between systems or organizations is very similar to, but slightly different from, modeling persistent data in a database. It may save time and effort, depending on the modeler, to model in the language of the interface technology and not just a logical representation. In other words, the goal of the message model is to implement message layouts, not just a logical model. Much time is spent these days transforming persistent data models from object to relational presentations (or vice versa) because of the differences between the programming language and the data structures in relational databases. To avoid this, modeling messages should be done in a format or manner consistent with the target implementation. This usually means using an XML or JSON format, or some format consistent with the enterprise service bus (ESB) or SOA technology to be used.

Industry standard models are a very good starting point for an organization's message model because they usually deal with the common business objects for the organization's industry. Extensive numbers of industry models are currently available, representing thousands of man-hours of development and review. Electronic Data Interchange (EDI) is ubiquitous for retail industry interactions between organizations, and Society for Worldwide Interbank Financial Telecommunication (SWIFT) messages are ubiquitous for banking and financial securities interactions between financial organizations. Most industries have standard industry models. These industry models may only handle the types of transactions that would take place between organizations in the industry and may have to be extended for interactions necessary between applications within the same organization.

CHAPTER 15 Master Data Management

INFORMATION IN THIS CHAPTER
Introduction to master data management
Reasons for a master data management solution
Purchased packages and master data
Reference data
Masters and slaves
External data
Master data management functionality
Types of master data management solutions—registry and data hub

Introduction to master data management
Master data management (MDM) is an architecture that has been developed for managing the subset of data in the organization around important areas such as customer, product, and organizational structure; information that we call master data. These items frequently are the "keys" in the transactional data of the organization; for example, a product is sold to a customer and the revenue or costs are attributed to an employee or a part of the sales organization. It is critically important that the information about master data is correct, because mistakes can become greatly magnified when used in transactions. It is also of great importance that all parts of the organization are working off the same version of master data. Customers no longer find it acceptable to have to change their address or other information multiple times with different parts of an organization just because different computer systems happen to store the customer's information in multiple places.

Reasons for a master data management solution
For most organizations, justifying an investment in the creation of a master data hub requires only a simple analysis of the costs associated with previous inconsistencies and mistakes, since master data problems can mushroom into huge issues in the transaction processing systems. We want to place particular focus on the data quality of the organization's master data, and we want to have an agreed-upon single or consolidated source of master data that can be used by the whole organization. Another straightforward justification for a master data solution is the cost of updating master data in multiple systems.

Data integration is critical to master data management to manage the consolidation of data from multiple operational systems and sources, as well as to distribute the master data to the systems and groups that want to use it. Managing the movement of data to and from the hub of master data is so crucial to the success of MDM system implementations that master data management architects sometimes describe the whole area of data integration as a part within the subject of master data management.

Organizations usually implement MDM solutions for one or both of two reasons: to provide a consolidated view of an organization's master data for reporting purposes, or to provide a central source of master data for transaction purposes. If all an organization is trying to do is consolidate master data from the transaction processing systems for reporting or for their data warehouse, then the movement of the master data to the master data hub might very well be implemented using a nightly batching of updates from the transaction systems to the hub using a batch data movement solution. In the case of nightly batch updates of master data, the consolidation of information might only be occurring in the data warehouse, and so there may be no need for a separate master data hub.

For organizations that are attempting to consolidate master data for real-time use by transactional systems, a separate master data hub is usually created by acquiring and implementing a vendor master data software solution. The vendor solutions have a great deal of the functionality needed built in, sometimes including a starting data model and capabilities to link to many standard real-time data movement solutions.

Purchased packages and master data
The challenge of managing master data is getting more difficult. When most application systems in an organization were custom developed, there was usually just one set of master data in the organization that would be used by all the applications. With the advent of purchased vendor applications to perform common functions such as financial consolidation, customer relationship management (CRM), and enterprise resource planning (ERP), each vendor application has its own set of master data structures which have to be kept in synch with the custom master data. Since buying application solutions rather than building them is considered best practice where possible, the problems of integrating master data are ubiquitous for all organizations. ERP systems sometimes promote themselves by saying that all the functionality is integrated and there is only one set of master data used across all the applications in the ERP. However, many organizations end up implementing multiple ERP applications or multiple instances of one ERP. No organization can get away with implementing only the ERP and no additional systems.

Reference data
Sometimes reference data is differentiated from master data. Reference data usually refers to the standard sets of values that certain fields might be able to contain, such as countries or other geographical options, statuses, or types. Frequently, the source of definitive reference data is external to the organization. Providing one source of a particular type of reference data in an organization can help avoid a lot of duplicate effort and prevent inconsistent results and reporting. Differentiating reference data from other types of master data can become very confusing. Certain aspects of an organization's product data, for example, may be sourced from an external organization and may be called either reference data or product data. Both master data and reference data are used as key attributes in operational transactional data and financial data. Although reference data is usually less changeable or dynamic than master data, certain forms of reference data are dynamic and so require constant updating. Usually there is little to be gained in maintaining a strict separation between master data and reference data. As with master data, each type of reference data should have a person or group responsible for overseeing the management, update, and use of that data in the organization.

Masters and slaves
The architecture of a master data management solution is very similar to the hub and spoke architecture used for managing the complexity of data interfaces. The difference between the two is that a master data hub is instantiated, not just logical, whereas a data integration hub is simply a logical concept. Also, business users may be aware of a master data hub, but may not be aware of the data integration communication "hub" in a hub and spoke architecture, because in effect all that happens is that the data gets transformed into a common intermediate model or canonical model.

Figure 15.1 shows some applications, all of which are maintaining the same kind of master data, such as product data, and need to share any updates between them, plus the other reporting and transactional systems that need a consolidated view of the master data. This kind of peer-to-peer sharing of updates is very difficult and complex to control, as well as requiring on the order of n-squared interfaces to manage.

FIGURE 15.1 Integration Without a Master Involves an Exponential Number of Interactions.

FIGURE 15.2 Master Data Application Interaction Only Requires a Linear Number of Interactions.

Figure 15.2 shows the configuration of applications where the master has been identified for the particular data domain. Updates are only made in the master or are all passed to the master, which then provides a consolidated view of that master data domain to all applications that require it and have appropriate access.

The master data application may be different for each data domain, since the appropriate master for product data, for example, may be different than for customer data. Similarly, the applications that are getting master data updates may be different for each domain.

The architecture for master data management does not necessarily require an additional application to be implemented that will play the part of the master for a domain of data, such as customer, product, or employee. The critical first step may be to determine that an existing operational system holds the de facto master of data for that domain, or the "system of record." All other places where the master data resides are designated as the "slaves," and the ability to make updates to the master data is disabled on the slave systems. The financial consolidation and reporting application, for example, is usually considered to be the master for clients who are billed, since it is necessary to report consolidated financials and profitability by client. Depending on the business and application architecture, many of the core master data domains may be effectively mastered in the organization's central enterprise resource planning (ERP) system.

In practice, however, assigning the role of master to an existing operational or financial system is usually less than optimal. Financial systems are generally focused on updates at the end of the financial period, such as monthly, quarterly, or yearly, and the responsible system users are not concerned with making sure the master data is updated in real time. Operational systems and ERPs focus on customers ordering goods and services, and while they may make a very good source for a master data hub, they usually have behavior and actual content that makes playing the role of the master data application difficult. Frequently, not all customers or all products are located in a single operational system, and adding more items just for the sake of master data management is inconvenient for the central users of the system. In addition, maintaining that data is not a high priority. Adding more information, functionality, or fields to either ERPs or custom operational systems for the sake of master data management is usually a very low priority, if even acceptable, to the central users and support groups around production operational systems. For most organizations and for most types of master data, the volume of master data is not a struggle to manage, especially in comparison to transactional and event information. However, we usually want to maintain a full history of all the changes to master data, especially for auditing and customer service, and operational systems only want to maintain current information for use in transaction processing.

In practice, therefore, it has been found that creating a separate application in the existing application portfolio as the master data hub is a best practice. Sometimes a separate master data application covers each data domain. Sometimes one application is created that supports multiple domains. Having one cross-domain application seems to make the most sense initially, since this system will have to possess many of the same interfaces for each of the mastered domains, such as to the enterprise data warehouse. However, if different business groups are the owners of the different domains of master data, they may find it more convenient to have separate applications that can be managed and controlled independently by the business groups that are the central users and managers.

A federated solution by domain may provide more agility to the separate business and application support teams, but will be more costly to operate because there are separate support teams and little shared code.

Organizations usually start by implementing one key master data domain that is particularly critical to their business and that justifies an investment in a master data solution. Some organizations start with the customer domain or some aspect of it, whereas other organizations find the product domain to be a natural starting place. Many organizations find that starting by mastering their reference data provides a great deal of return and allows them to start or pilot in a "small" area.

External data
In many cases, some or all of the master data in a particular domain is acquired from a source external to the organization, either free from a government or standards site, or purchased. The organization needs to provide a location where the external data is instantiated, reviewed, cleansed or reformatted, and made available to the entire organization. In certain businesses, external master data is so critical that large organizations sometimes find they are purchasing the same data multiple times for different business functions. Financial securities definitions and prices are critical to businesses dealing with financial market data. Drug information is critical to life sciences businesses. These are just a few examples of the types of data acquired from external sources.

Master data management functionality
A key functionality that master data solutions need to provide is the capability to identify when the same piece of master data is coming from multiple sources, or multiple times from the same source. Among the potential duplicates, it is determined, either automatically or manually, which one is the correct and most recent version. The matching code can be very complex, using "fuzzy logic" and identifying different spellings or misspellings of names and addresses. The cross reference of identifiers from the various source and transaction systems that refer to the same piece of master data is maintained for reporting consolidation and future updates. Sometimes the master data hub is configured to send the best version or any updates of the master data back to all the source and transaction systems that need it, as well as to the reporting systems and data warehouse. If operational up-to-date master data needs to be provided, it is necessary to have real-time data transfer between the master data hub solution and the source and transaction systems that provide the operational master data.
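The sketch below gives a deliberately simplified flavor of the matching and cross-referencing step, using Python's difflib for a crude name-similarity score; the records, threshold, and survivorship rule are invented, and commercial MDM engines use far richer fuzzy-matching logic and manual review workflows.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Crude name similarity; real matching engines combine many signals."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Candidate master records arriving from two source systems (invented data).
incoming = [
    {"source": "CRM", "source_id": "17",  "name": "Acme Limited", "city": "Boston"},
    {"source": "ERP", "source_id": "A-9", "name": "ACME Ltd",     "city": "Boston"},
]

THRESHOLD = 0.7  # Tune with business input; borderline pairs go to manual review.
golden, cross_reference = [], {}

for rec in incoming:
    match = next((g for g in golden
                  if similarity(g["name"], rec["name"]) >= THRESHOLD
                  and g["city"] == rec["city"]), None)
    if match is None:
        match = {"master_id": f"M-{len(golden) + 1}",
                 "name": rec["name"], "city": rec["city"]}
        golden.append(match)
    # Keep the cross reference of source identifiers for consolidation and updates.
    cross_reference[(rec["source"], rec["source_id"])] = match["master_id"]

print(cross_reference)  # {('CRM', '17'): 'M-1', ('ERP', 'A-9'): 'M-1'}
```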

Although they add another application to the organization's portfolio and are frequently very expensive, vendor solutions usually provide tremendous benefits to an organization's ability to quickly implement a master data management solution: They provide the matching and de-duplication code; integrate easily with most standard real-time data integration solutions or provide one themselves; provide a cross reference of identifiers used for the master data across the organization; and sometimes even provide a potential canonical model for important master data. Justifying the expense of purchasing an MDM solution is usually quite easy to do.

Obviously, a master data hub is a data hub. Therefore, if an organization does not already have a hub and spoke solution implemented for other data integration reasons, it may not necessarily choose to do so just to implement the master data hub, but rather implement real-time, point to point integration for master data. The master data solution plays the part of the hub (in a hub and spoke architecture) for the subset of data in the organization that is master data, thus simplifying the potential complexity of application interfaces.

Types of master data management solutions—registry and data hub
There are two flavors of master data management solutions, in addition to whether the solution is only for reporting or for use by operational systems. One flavor is to provide a registry of master data information, or a cross reference of where the master data is located and the various identifiers that refer to the same piece of master data. Although a registry solution may still identify multiple places within and across applications where the same piece of master data has been assigned multiple references or identifiers, the registry doesn't necessarily store all the master data but rather just a pointer to where the master data is located. The second flavor is to provide a data store where all the master data is copied and maintained. The master data hub doesn't just provide a cross reference to the sources of master data but also supplies a copy of the data and is itself the system of record for the master data. This type of master data hub is also said to provide the "golden copy," or definitive best version, of the data for the organization.

A primary difference is that with a central hub of master data, changes and corrections to the master data are probably made to the data in the hub and are automatically fed back to other systems. In contrast, changes to master data in the registry model need to be made in the source systems where the master data actually resides.
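A minimal sketch of the registry flavor follows, with invented systems and identifiers: the registry holds only the cross reference of local identifiers and a pointer to where each record lives, whereas a data hub would also hold the golden copy itself.

```python
# Registry-style MDM: no master data values are stored here, only pointers to
# where each real-world customer lives in the source systems (invented names).
registry = {
    "M-1": [  # one real-world customer known under different local identifiers
        {"system": "CRM",     "local_id": "17"},
        {"system": "ERP",     "local_id": "A-9"},
        {"system": "Billing", "local_id": "0001001"},
    ]
}

def locate(master_id: str) -> list:
    """Return the places a consuming application must go to fetch the data."""
    return registry.get(master_id, [])

print(locate("M-1"))

# A data-hub-style solution would instead hold the golden copy directly, e.g.:
golden_copy = {"M-1": {"name": "Acme Ltd.", "city": "Boston"}}
```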

CHAPTER 16 Data Warehousing with Real-Time Updates

INFORMATION IN THIS CHAPTER
Corporate information factory
Operational data store
Master data moving to the data warehouse
Interview with an expert: Krish Krishnan on real-time data warehousing updates

Corporate information factory
Bill Inmon defines a data warehouse as "a subject-oriented, integrated, time-variant, nonvolatile collection of data in support of management's decision-making process" (Inmon, 1992). Data warehouses are usually populated periodically (daily, weekly, or monthly) with updates and snapshots from the master data and transaction processing applications in the organization. Figure 16.1 depicts the architecture of a data warehouse and business intelligence solution that receives batch updates periodically.

Operational data store
More current data may be needed for some operational reporting. To meet the need for real-time or near real-time data, an additional data structure may be added to the "corporate information factory" (Inmon, The Corporate Information Factory, 1994) or data warehouse architecture: an operational data store (ODS) that integrates real-time updates with master and transactional data for use by operational reports. Since data movement into and out of a data warehouse is usually done in batch mode, in order to populate the ODS and provide real-time data for reports, the architecture for the corporate information factory needs a real-time data movement capability. Usually, another real-time data movement capability is added, or an existing real-time data movement capability is leveraged.


FIGURE 16.1 Corporate Information Factory Architecture.

Unfortunately, the batch ETL metadata and transformation logic that already exists for the data warehouse cannot usually be leveraged for the real-time data movement, so the transformation has to be written separately for the real-time data movement. Figure 16.2 depicts the architecture of a corporate information factory including the addition of an operational data store (ODS).

FIGURE 16.2 Corporate Information Factory with ODS.

In some data warehouse architectures, the operational data store is fed from the operational systems in real time, and then updates to the data warehouse structure are made on a periodic basis from the ODS. More frequently, however, the batch updates to the data warehouse predate the creation of the ODS, and so it is easier to add the ODS as another structure but not as a replacement source of data for the data warehouse. Additionally, one problem with data warehouse implementations is that projects tend to take a long time to add more data sources to the data warehouse and then to reports from the data warehouse. Adding another step to the process of getting new data into reports, by making it necessary to also add any additional sources into the ODS, would just exacerbate that issue.

Master data moving to the data warehouse
Master data in the data warehouse environment is usually maintained with updates from the operational systems or master data environment, rather than with snapshots of the entire set of data for each periodic update of the warehouse. If a real-time update capability is added to the warehouse in support of maintaining information in the ODS, then it is also possible, and preferable, to use that same capability to maintain the master data in the warehouse in real time. Changing master data in the data warehouse on a different schedule than the transactional data could change the results viewed in certain reports, depending on how the reports are defined. The master data is usually used in reporting for sorting and grouping the transactional data and is what is called the "dimensions" (Kimball) of the data. Reports have to choose whether to group the transactional data based on the state of the master data at the time of the transaction or at the time the report is written. Different choices may be appropriate for different reports. For example, for a report of sales, if a salesman moves to a different territory, should the report show his historical sales in the new territory or the old? In calculating commissions, a report needs to show his historical sales under him wherever he works in the company, but in showing year-over-year sales by territory a report needs to show sales under the historical territory.
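The small worked example below (with invented sales figures) makes the choice concrete: grouping by the salesman keeps all of his historical sales with him for commission purposes, while grouping by the territory recorded at the time of each sale supports year-over-year territory reporting.

```python
from collections import defaultdict

# Invented transactions; each sale recorded the territory in force at the time.
sales = [
    {"salesman": "Jones", "territory_at_sale": "East", "amount": 100},
    {"salesman": "Jones", "territory_at_sale": "East", "amount": 150},
    {"salesman": "Jones", "territory_at_sale": "West", "amount": 200},  # after his move
]

# Commission view: all of his sales stay with him wherever he works now.
commission = defaultdict(int)
for sale in sales:
    commission[sale["salesman"]] += sale["amount"]

# Year-over-year view: each sale is grouped under its historical territory.
by_territory = defaultdict(int)
for sale in sales:
    by_territory[sale["territory_at_sale"]] += sale["amount"]

print(dict(commission))    # {'Jones': 450}
print(dict(by_territory))  # {'East': 250, 'West': 200}
```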

INTERVIEW WITH AN EXPERT: KRISH KRISHNAN ON REAL-TIME DATA WAREHOUSING UPDATES

Following is a continuation of the interview with Krish Krishnan on the importance and use of data integration in data warehousing. The following discussion is specifically focused on real-time data integration. Krish Krishnan has spent 12 years designing, architecting, and developing data warehouse solutions, in the last five years focusing on unstructured data integration into the data warehouse. He has been involved in some of the largest and most complex designs and architecture of next-generation data warehouses. He is co-author of Building the Unstructured Data Warehouse with Bill Inmon.

When is it Appropriate to Use Real-Time Data Integration with a Data Warehouse? Do you think it is Worthwhile to have both Batch and Real-Time Data Integration Solutions Working with a Data Warehouse?
Real time is a relative term that has been used in the data warehouse industry for a long time. Real-time data integration is useful in financial services, health care, and sensor data management in the current data layers in the data warehouse. In today's fast-paced data environment, the data processing techniques will involve real-time, micro-batch, and batch processing. We can architect different techniques using these combinations to effectively process data into the data warehouse. Remember that metadata and master data management are key aspects of this mixed-integration architecture.

Are there Special Issues Associated with Real-Time Data Integration and Data Warehousing?
To manage real-time data integration in the data warehouse, we need the following architecture layers:
• Scalable storage
• Scalable processor and memory architecture
• Fast networks
• Flexible data architecture
• Robust reference data
• Metadata

What are the Considerations for Real-Time Update of Master Data in the Data Warehouse?
Master data management consists of operational, analytical, and "gold copy" layers in the architecture. In order to accommodate the real-time data integration requirements to maintain master data, we need to expand the data processing across the three layers of the master data set. By expanding the operational data architecture, the master data architecture needs to support an object-driven model or an SOA model where the data sets can be self-contained. This approach will provide the best scalable and distributed architecture to manage MDM in real time.

What are the Considerations, Especially Around Data Integration, for Adding an Operational Data Store to a Data Warehouse Configuration?
The ODS, or operational data store, is an optional layer for data integration in the data warehouse. There are no data model, data type, or data architecture changes when comparing an ODS to the source database. In many cases, the ODS is a snapshot of the source databases, and hence there is no specific data integration requirement. However, if the ODS needs to accommodate more than one source database, the data model will need to be modified to accommodate multiple tenants and integrate data in the ODS. Apart from modifying the data model, the physical architecture of the database needs multiple partitioning and indexing techniques to be implemented for scalability. Data quality is not a prime step in the ODS, but master data and metadata integration is needed to create an auditable environment. In the case of unstructured or big data, I advise that no ODS be created or considered for any form of data processing of this type of data.

CHAPTER 17 Real-Time Data Integration Architecture and Metadata

INFORMATION IN THIS CHAPTER
What is real-time data integration metadata?
Modeling
Profiling
Metadata repository
Enterprise service bus—data transformation and orchestration
Technical mediation
Business content
Data movement and middleware
External interaction

What is real-time data integration metadata?
The metadata associated with real-time data integration is very much the same as for batch data integration. We categorize metadata into three types: business, technical, and operational.

The business metadata for real-time data integration includes the business definitions for the data to be moved and integrated around and between organizations. Security access information, what data can be passed or seen by what applications and users, can be classified under business metadata, although there is a large technical and operational aspect as well.

The technical metadata associated with real-time data integration includes the logical and physical models and layout of the source data, target data, and intermediate canonical model. It also includes the transformations and mappings between the source, target, and intermediate models and physical implementations. The orchestration of behaviors, what data and changes to monitor and what to do when a relevant event occurs, is technical metadata comparable to the batch data interface schedules. The technical metadata provides "lineage" information concerning the exact source for data on a screen, report, or field and how it was transformed.


The operational metadata generated from the execution of real-time data integration is very valuable to business users and technical users alike, as well as to auditors and regulators. The operational metadata provides "lineage" information about when data was generated or changed. Operational metadata will also provide information on who has changed and accessed data, and when.

The technical components and tools that comprise a real-time data integration architecture are depicted in Figure 17.1. These are the pieces that are necessary to develop and operate a real-time data integration solution.

Modeling
Usually, it is necessary to use a tool to support the development of models needed for real-time data integration, including individual point to point interface models, common interaction or hub model, and data service models. Most tools used for data modeling can be used for real-time data integration modeling as well, although not all have the capability to generate the physical implementation-level models needed, such as XML schemas or data service object classes. It will probably be necessary to reenter the model into the tool being used for the implementation: the ESB, the XML tool or database, and/or the object programming environment. It may be possible to use the implementation tool for initial modeling, but usually modelers like to use some visualization capability while modeling, which is provided by the traditional data-modeling tools and the most frequently used modeling tool: Visio. In addition to data modeling, the data flows of the interactions are usually modeled using some kind of process modeling tool.

Profiling
As with batch data integration, it is critical to the success of any data integration development project to profile the actual production source and target data prior to the beginning of the project. In this way, it becomes possible to understand whether the potential data involved is of sufficient quality for the purpose intended and whether the appropriate sources and targets have been identified. The same tools and issues associated with batch data integration apply here as well.

Metadata repository
The tools for modeling, profiling, and transforming (the ESB) will have their own metadata repositories, which may be sufficient for managing the relevant metadata for the development and operation of real-time data integration solutions. As with batch data integration metadata, the organization may benefit significantly from a consolidated metadata repository that links the metadata from the various tools and engines and provides an enterprise view and audit trail of the movement and transformation of data around the organization. Of course, a consolidated metadata repository should include the metadata from both batch and real-time data integration solutions. An enterprise metadata repository can become a core capability, providing information to business users on where data in the organization can be found, its relative quality, how it is appropriate to use the data, its history of update, and its access. The ability to provide such an audit trail is also considered best practice, and in many instances, a regulatory requirement.

FIGURE 17.1 Real-Time Data Integration Architecture.

Enterprise service bus—data transformation and orchestration
The primary tool needed for implementing real-time data integration is an enterprise service bus (ESB), which provides myriad functions necessary for managing the real-time data interfaces in an organization. An ESB provides technical mediation services as well as applying the business semantics information that has been developed.

Technically, the ESB provides or integrates with the transport mechanisms for physically moving data between servers, orchestrating the sequence of events and interactions, the interactions and translations needed between the different technologies running on the different servers, and monitoring and error recovery. The ESB will integrate with the organization's data security solution as well. The organization will have to specify what data needs to move and when, and then those business decisions can be configured into the ESB.

Technical mediation
In a real-time data integration architecture there needs to be software that will handle the technical issues surrounding managing the interactions and messaging. This includes transport for physically moving the data between the servers, which is usually handled by a utility that is installed on all the physical and logical servers involved and includes queuing capability for all servers and applications, in order to hold messages until the relevant application is ready to process them. Orchestration capabilities must be present to manage what messages need to be sent to what locations, handling publish and subscribe interactions as well as request and reply. Orchestration includes managing the sequence of events and error processing and recovery.

The technical mediation aspects of the real-time data integration architecture will handle any transformations and interactions necessary for different technologies on different logical and physical servers, including operating systems, file systems, databases, and other data structures. The real-time data integration architecture will need to integrate with the organization's enterprise security solution, to ensure that data is only provided to authorized applications and users.

Business content
The real-time data integration architecture will need a data semantics layer that provides the instructions for mapping the messages between formats recognized by the different applications in the organization and external organizations. This aspect of the architecture is difficult because it requires a good understanding of the business processes and the actual contents of the various data structures of the applications. In order to make real-time data integration work using the hub and spoke architecture approach, it is necessary to have a canonical message model that provides a common format for the data in the organization that needs to be shared between systems or with external organizations. This is an implementation model in the language used by the enterprise service bus, usually something like XML or JSON.

The real-time data integration architecture usually includes a registry of the data services available, so that when adding more applications to the organization portfolio it will be relatively easy to identify what data interfaces have already been defined.

Data movement and middleware
Underlying the tools and engine (ESB) to support real-time data integration are usually some standard utilities and middleware that allow the whole to operate. Of course, there are the various operating systems and database management systems of the applications to be linked, on which the data integration tools and engines run. Also, the enterprise service bus as well as most of the SOA applications will require a standard application middleware engine (J2EE). Within each technical environment being integrated there will have to be a data movement utility that the ESB can use to move data from and to the various servers. Usually the preference would be for one utility that has flavors that run on the various platforms required.

External interaction
Interfaces with external organizations tend to be more loosely coupled than interfaces between applications within an organization. Certainly the systems usually don't require the external party to be operating in order to be able to continue operating, although as in all things there are exceptions. The interfaces with external organizations have to be more fault tolerant, since the communications between the organizations may be more difficult. Security must be a high concern, as the external organization is outside of the organization's firewall. Interfaces with an external organization almost always involve a well-defined API or industry messaging standard.

Interactions with external organizations are usually in one direction or of a higher latency than interfaces internal to an organization. One-directional interfaces would involve sending or receiving information without immediate responses required other than an acknowledgment. Most interactions with external organizations do not require sub-second response times. For those interactions with external organizations that do require very fast response times, such as the interactions with trading systems or exchanges, the message interaction model is well defined and the latency is well tuned between the organizations to ensure the fastest possible communications.

A SIDEBAR FROM THE AUTHOR—ON REAL-TIME DATA INTEGRATION ARCHITECTURE

Many organizations require real-time data integration capability, and every organization struggles with managing their interfaces and the overwhelming complexity inherent in doing so. It is easily demonstrated, as shown in the section on hub and spoke architecture, that without an enterprise data integration strategy the sheer number of interfaces quickly becomes unmanageable. The creation of business data hubs, such as data warehouses and master data hubs, helps to alleviate the potentially overwhelming complexity of managing data across an enterprise. Every organization of even middle size should have an enterprise service bus to support, at least, real-time updates for the master data and for the data warehouse operational data stores, but also for the real-time movement of transactional data through the organization.

While working for an organization as the enterprise data architect, I proposed investment in an enterprise service bus architecture. My boss, who was the lead architect for the organization, counseled me not to be too disappointed if the technology strategy committee were not very enthusiastic about the idea, as they were not in general interested in future investment. At the presentation to the technology strategy committee, everyone on the committee ended up talking about how an integration strategy was crucial and about the vast number of issues and headaches they each routinely had to deal with associated with data integration problems. I ended up speaking very little as the members of the committee detailed, many times over, all the justification the investment needed.

Issues of moving and integrating data are, in fact, probably of high priority to most managers in both IT and business functions of every organization. These are tactical problems for which they will welcome a strategic solution. Attempting to solve the integration needs individually for each application in the organization is an overwhelming problem. It will cost the organization many times the cost of a strategic approach and an investment in an enterprise service bus and a canonical data model and even an enterprise metadata repository.

PART 4 Big, Cloud, Virtual Data

PART 4 Big, Cloud, Virtual Data

CHAPTER 18 Introduction to Big Data Integration

INFORMATION IN THIS CHAPTER
• Data integration and unstructured data
• Big data, cloud data, and data virtualization

Data integration and unstructured data The discussion in the previous sections on batch data integration and real-time data integration addressed moving and integrating data stored in structured data structures, such as relational databases. However, databases only contain a small percentage of the data in an organization. The vast majority of organizational data is unstructured. It is important to be able to analyze unstructured data as well as integrate unstructured data with structured data. A combination of recent breakthroughs in data integration technology and the knowledge developed in various data management disciplines such as business intelligence, data warehousing, and enterprise content management over the last two decades has culminated in many new and emerging data integration solutions. These new solutions do not replace the data integration techniques and solutions in current use but extend and refine what is already available.

Big data, cloud data, and data virtualization
Cloud architecture not only extends outside of the enterprise the data that needs to be integrated with internal data, but also introduces techniques and technologies around data and process replication. Big data technologies bring the benefits of distributed processing, with accompanying data integration challenges. Data virtualization is the culmination of two decades of development in data management, incorporating solutions from batch data integration (ETL) and real-time data integration (ESB and hub and spoke), along with techniques from other data management disciplines and new technologies around unstructured data. Data virtualization servers don't replace the business intelligence tools, data warehouses, enterprise service buses, web services, or enterprise content management solutions, but build on them.

The technologies developed around cloud computing and big data recognize the shift of interest from the relatively small portion of data located in relational databases to leveraging the massive and varied available data located in both structured and unstructured form, in and out of the confines of the organization's data centers.

CHAPTER 19 Cloud Architecture and Data Integration

INFORMATION IN THIS CHAPTER
• Why is data integration important in the cloud?
• Public cloud
• Cloud security
• Cloud latency
• Cloud redundancy

Why is data integration important in the cloud?
Cloud solutions sound exceedingly attractive to management: It becomes unnecessary to manage all that infrastructure anymore, and management can quickly scale up (and down) volumes if needed. Yet, in using cloud architecture, an organization should have concerns about latency caused by the physical distribution of the data and, even more importantly, the security of the data being stored in the cloud. Cloud solutions may be slower than local solutions because of potential delay caused by the speed of data traveling to and from the physical location of the cloud service provider, and the extra time needed to traverse additional security requirements for the data located in the cloud. How secure is data located in public cloud solutions? What are the legal ramifications of storing data in the geopolitical domains of the cloud vendors regarding privacy? In general, the solutions for data integration regarding data in a public or private cloud are the same as for local data, since access to data is usually through virtual addressing. As long as the data consumer has sufficient access to the data located on the cloud, the data integration solutions are the same as for local data, with some additional concerns about latency and security.

Public cloud
With cloud solutions, organizations can rent computing power and software rather than buy it. This allows an organization to have additional servers, environments,

and even applications with standard configurations available in minutes from a service provider operating out of a remote data center. Access to the rented computing environment is usually through an Internet protocol. There are many potential benefits possible for an organization using this kind of model; the primary advantages are the agility it affords the organization to move very quickly and the cost savings of not having to buy and manage resources needed for infrequent peak demand. In Figure 19.1, an application located within the organization's data center passes data to and from an application located in a public cloud location.

FIGURE 19.1 Integration with Cloud Applications.

Cloud security
Primary concerns around cloud solutions have to do with security. In the most basic situation, the cloud provider is serving many organizations within the same network environment. An organization could be concerned that their data might be hacked (accessed without permission) by another organization operating in the same area of the cloud data center. Even in cases where the cloud service provider has created a separate private cloud environment for an organization, operating on a separate network, behind a separate firewall, there must be concern for whether the service provider is providing adequate security from intruders. The physical security provisions of the cloud service provider may be a concern, although since the provider is supporting security for all their customers and security is a differentiator, they are probably providing physical security which exceeds the internal capabilities of most individual organizations. Additionally, the data security laws of the country where the cloud provider is operating its physical data center will be of concern to some organizations. For example, a Canadian company may not want to use a cloud service provider operating in the United States because its data could be subpoenaed by an American court. Certain types of organizations will not be able to utilize public cloud solutions for their most private and sensitive information, such as the customer data from financial institutions or classified data from government organizations, but most organizations may find that the capabilities offered by cloud service providers are both less expensive and more secure than those they could support internally and would have many uses. Even the most security-conscious organization may find it useful to be able to create development environments in the cloud quickly, thus speeding up development of custom applications and familiarity with new vendor packages, while their internal organizations are provisioning environments within their own firewalls and data centers. What many chief security officers are discovering, to their horror, is that cloud services are so easy and inexpensive to acquire that parts of their organizations may already have data out in public cloud environments without having been concerned with the issues of adequate security. Cloud services are so easy to obtain that the inventory of organizational data assets may suddenly be uncertain. Like data on laptops and mobile devices, data in the cloud is outside the organization's physical control and adds greater complexity to the problems of managing data security.

Cloud latency
There are three basic reasons that the speed of data integration with data housed in a cloud environment might be slower than data located in a local data center: the speed of the network infrastructure might be slower, extra time is needed to pass through the cloud security, and extra time is needed for the data to traverse to the physical location of the cloud data center. The network infrastructure of an internal data center might or might not be constructed with faster connections than a cloud data center. Although an internal data center would probably be using expensive and fast components for their network, especially for production systems (i.e., fiber-optic network), it is likely that a cloud data center would also be investing in fast network infrastructure even though they would be using commodity (cheap) hardware. Delays may not be

within the cloud data center but rather within the path data must take to get to and from the cloud data center. Moving data to or from a cloud data center, or accessing data in a cloud data center, will involve passing through the extra security layers (firewalls) around the cloud data center, with the extra time that would be involved, even though that may be minimal. What cloud service purveyors minimize in their advertising is that cloud data centers actually do exist in the real world in an actual physical location. Data passing to and from these physical data centers is limited by real-world constraints such as how long it takes for digital information to travel to and from the physical site of the cloud data center. The physical distance to a cloud data center introduces latency, just as interaction between sites in different regions of the world has latency. The physical distance from the cloud data center combined with the network infrastructure to and from the cloud data center may exacerbate any delay. Although data integration solutions don't necessarily need to be different in including data from a public cloud than they would be for local data integration, if very low latency is a requirement, it may be necessary to architect a data integration solution similar to the integration of geographically separated hubs of data located on different continents. Solutions such as database replication can be used to make up for the latency of geographically distributed data, but the extra disk required may negate much of the savings benefits of the cloud solution.
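One rough way to quantify the latency considerations discussed above is to time the same round trip against a local service and a cloud-hosted service. The sketch below is a minimal illustration only; the endpoint URLs are hypothetical placeholders, not real services, and it is not tied to any particular cloud provider.

```python
import time
import urllib.request

# Hypothetical endpoints -- substitute real service URLs for an actual test.
LOCAL_ENDPOINT = "http://data-service.internal.example/ping"
CLOUD_ENDPOINT = "https://data-service.cloud.example/ping"

def measure_round_trip(url, samples=5):
    """Return the average round-trip time in milliseconds over several samples."""
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        with urllib.request.urlopen(url, timeout=10) as response:
            response.read()  # force the full payload to traverse the network
        timings.append((time.perf_counter() - start) * 1000)
    return sum(timings) / len(timings)

if __name__ == "__main__":
    local_ms = measure_round_trip(LOCAL_ENDPOINT)
    cloud_ms = measure_round_trip(CLOUD_ENDPOINT)
    print(f"Local average:  {local_ms:.1f} ms")
    print(f"Cloud average:  {cloud_ms:.1f} ms")
    print(f"Cloud overhead: {cloud_ms - local_ms:.1f} ms per request")
```

Repeating such a measurement from the locations where data consumers actually run gives a concrete basis for deciding whether replication or another low-latency design is warranted.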

Cloud redundancy
The servers and disk being used in most cloud configurations are commodity devices: inexpensive, easy to acquire, install, and configure. Therefore, the management of these commodity servers includes an assumption that there will be more frequent errors than in traditional in-house server configurations. That is, the mean time to failure is shorter on commodity hardware. In order to create a fault-tolerant environment using commodity hardware, most cloud-oriented architectures use some form of data redundancy to enable smooth continuity of processing. Cloud operating systems and data management systems, such as Hadoop, keep an odd number (as in not even) of copies of data. Additionally, data is usually distributed across multiple servers or nodes. When a server fails, processing falls back to one of the data copies. Having an odd number of copies allows the nodes to compare versions of the data to verify that none of the copies have been corrupted or lost. The more critical the data, the greater the number of copies that are specified in the configuration, and, of course, the greater the rental cost. The disk usually used in internal production environments is "smart" disk with redundancy and fault tolerance built in, costing as much as 10 times that of commodity disk. Having three or five copies of data on commodity disk in a

cloud environment should still be less expensive than internal disk, especially when including support costs. More than with data kept internally, data kept in the cloud should include an inventory and auditing that no data has been lost or misplaced. With thousands and millions of commodity servers being constantly provisioned and deactivated, cloud services users should ensure that they have access to and are processing all the data they think they are. Also, when deactivating servers in the cloud, some care should be taken to ensure that all data is entirely deleted prior to surrendering the servers.
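To make the earlier point in this section about keeping an odd number of copies concrete, the sketch below compares checksums of three replicas and treats the value held by the majority as the uncorrupted one. This is an illustration of the principle only, assuming in-memory byte strings as stand-ins for data blocks; it is not how Hadoop itself implements replica verification.

```python
import hashlib
from collections import Counter

def checksum(data: bytes) -> str:
    """Checksum used to compare replicas without shipping the full data around."""
    return hashlib.sha256(data).hexdigest()

def majority_copy(replicas: list[bytes]) -> bytes:
    """With an odd number of replicas, the version held by the majority wins."""
    counts = Counter(checksum(r) for r in replicas)
    winning_checksum, votes = counts.most_common(1)[0]
    if votes <= len(replicas) // 2:
        raise RuntimeError("No majority -- too many replicas disagree")
    for replica in replicas:
        if checksum(replica) == winning_checksum:
            return replica
    raise RuntimeError("unreachable")

# Three copies of the same block; one has been silently corrupted.
copies = [b"customer,balance\n1001,250.00\n",
          b"customer,balance\n1001,250.00\n",
          b"customer,balance\n1001,999.99\n"]   # corrupted replica
print(majority_copy(copies).decode())
```

An even number of copies can produce a tie with no way to decide which version is correct, which is why an odd replica count makes this comparison decisive.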

CHAPTER 20 Data Virtualization

INFORMATION IN THIS CHAPTER
• A technology whose time has come
• Business uses of data virtualization
  • Business intelligence solutions
  • Integrating different types of data
  • Quickly add or prototype adding data to a data warehouse
  • Present physically disparate data together
  • Leverage various data and models triggering transactions
• Data virtualization architecture
  • Sources and adapters
  • Mappings and models and views
  • Transformation and presentation

A technology whose time has come
Data virtualization solutions allow an organization to present to their data consumers (people and systems) a real-time integrated view of data brought together from various locations and technologies and transformed into the format required. This is not a new business desire; rather, it's just that previous technical solutions tended to be too slow to make the real-time transformation and consolidation usable. Data warehouses were created primarily to be an instantiation of an integrated view of data because it wasn't feasible to do so in real time with a response time useful for business analysts. The most exciting aspect is that the information integrated and transformed by data virtualization includes unstructured sources as well as traditional structured sources of business intelligence. Data virtualization is the culmination of all the techniques and technologies perfected in data integration and business intelligence over the last two decades. Data virtualization solutions are not meant to replace data warehouses but to build on top of them to integrate historical data in the data warehouse in real time with current data from various types of data structures that are local, remote,

FIGURE 20.1 Data Virtualization Inputs and Outputs. (Data consumers, including direct users, business intelligence tools, and applications, access the data virtualization server through SQL, XQuery, JSON, and web services; the server draws on data sources including document management systems (ECM), spreadsheets, data and applications in relational databases, cloud data and applications, email and messages, web content and social media (RSS), data streams (ESB), and data warehouses.)

structured, unstructured, and transient and then to present them for instant use to the applications or users who need them in the format required. The data virtualization server as depicted in Figure 20.1 provides a connection to various source data stores and technologies, transforming and integrating the data into a common view, and then providing the data in an appropriate format to an application, tool, or person in an expected form. This depiction is based on a presentation by Mike Ferguson at Enterprise Data World (Ferguson, 2012). The data virtualization server can access data stores from the remote data of a public cloud to very local individual files, and differing technologies from mainframe indexed files to documents, web content, and spreadsheets. The data virtualization server leverages all the lessons of data integration to most effectively access and transform the various data into a common view or views and builds on that with enterprise content management and metadata management to integrate structured and unstructured data. The organization's various data warehouses, document management systems, and Hadoop file stores are definitely among the sources to the data virtualization server. The data consumers who use the integrated views from the data virtualization server are not just individuals, but more likely applications that leverage the integrated data for real-time decision making, or business intelligence tools that present the integrated data on screens and reports. The format of the data coming from the data virtualization server can be one that is most appropriate for the user,

application, or tool: a web service, a relational database view, an XML file, a spreadsheet, etc.

Business uses of data virtualization
Business intelligence solutions
Data virtualization solutions will not return data faster than a data warehouse query. The reason is that a data warehouse is specially designed to return data quickly, while a data virtualization solution needs to go through extra steps to transform and integrate the data. However, additional new types of data can be integrated more quickly with a data virtualization solution, and current data can be included and integrated in real time.

Integrating different types of data
Data virtualization solutions are focused not just on the real-time aspect of data integration, which is an important goal in itself, but on the inclusion of a much broader set of data types than was previously included in business intelligence solutions. Data virtualization solutions include the integration of data from unstructured data sources such as documents, e-mail, websites, social media posts, and data feeds, as well as the more traditional data from relational and other types of databases. In addition, data can be integrated from personal spreadsheets (where so much of an organization's data seems to live), files and databases on central servers, and data from the cloud and external sites. Data virtualization solutions can include data from real-time message solutions and data streams.

Quickly add or prototype adding data to a data warehouse
Data warehouses are not replaced by data virtualization solutions for two reasons: Data warehouses provide historical data, and data warehouses are faster. Data warehouses are a source for a data virtualization solution, which makes both the data virtualization server and the data warehouse more powerful. A data warehouse may also be a target of a data virtualization server, receiving data transformed from other sources, possibly including unstructured sources transformed into a structured format the data warehouse can use. Data warehouses can be very powerful and useful solutions for an organization to use in data consolidation and reporting. However, it tends to take a very long time to add a new data source to a data warehouse, from concept to implementation. Data virtualization solutions can be used to quickly integrate additional data sources with data warehouse data to determine if the result is useful and to provide a temporary solution until the data source can be added to the data warehouse.

Data warehouses are designed for large amounts of data to be accessed and analyzed quickly. Data virtualization solutions must perform additional steps of collecting, transforming, and consolidating data from various data structures. Therefore, it is reasonable that data warehouse data retrieval will be faster than data virtualization retrieval. Most data warehouses contain periodic snapshots of the state of data at certain points in time—historical data. These snapshots can be compared to one another to show historical changes or trends. This historical data is almost certainly not available in the operational systems and is difficult to show through a business intelligence or data virtualization solution without the data warehouse providing persistence or long-term storage of the historical data snapshots.

Present physically disparate data together
Besides the capability of transforming data in various formats into a consistent view, data virtualization solutions allow data from separate locations to be logically presented together (integrated). This is very useful when operational systems as well as reporting solutions have multiple instantiations in various locations, such as regional or business-line ERPs and data warehouses.

Leverage various data and models triggering transactions
The ultimate power of big data integration and management is in the ability to utilize not only various data types in analysis, but also the results and process models that come out of that analysis to trigger real-time responses to events. Data virtualization allows large amounts of integrated data to be pumped through analytical and risk models to trigger notification of organization decision makers and even the execution of transactions.

Data virtualization architecture
Sources and adapters
The data virtualization server architecture of course needs to include connections to the various data sources to be accessed. Data virtualization server products will provide adapters for the various types of sources: relational databases, files, documents, web content, and the like. Each particular data source to be accessed needs to be made known to the data virtualization server and its metadata imported and integrated.

Mappings and models and views
It is necessary to define the mappings from each data source to the common integrated data model or virtual view of the data defined in the data virtualization server, the canonical model.

FIGURE 20.2 Data Virtualization Server Components. (Data consumers, including direct users, business intelligence tools, and applications, access virtual views in the data virtualization server through SQL, XQuery, JSON, and web services; mappings connect each virtual view to data sources such as document management systems (ECM), spreadsheets, relational databases and applications, cloud data and applications, email and messages, web content and social media (RSS), data streams (ESB), and data warehouses.)

This defines how the data from each source system needs to be transformed and reformatted in order to integrate the data. It is an important but difficult task to define the common integrated data model (Ferguson, 2012), or canonical data model, for the organization. It is the agreed common virtual view of the integrated data of the organization. Both defining the canonical data model for the organization and mapping the data from the sources to the canonical model require both a business and a technical understanding of the data. It is very powerful to have one common virtual view in the data virtualization server that supports the needs of the entire organization, but it is also possible to have multiple virtual views (van der Lans, 2012) of the data represented in the data virtualization server, as shown in Figure 20.2. Integrating data from different technologies into a single view involves mapping the metadata attributes (or tags) from unstructured data and the key or index attributes from structured data to common attributes in the virtual views in the data virtualization server.
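A minimal sketch of the mapping idea follows, assuming two hypothetical source record layouts and a deliberately simplified canonical customer model. Real data virtualization products generate and manage such mappings through their own metadata, but the underlying transformation looks much like this.

```python
# Hypothetical source records from two systems with different layouts.
crm_record = {"cust_id": "C-1001", "fname": "April", "lname": "Reeve", "phone_no": "555-0100"}
billing_record = {"customer_number": 1001, "customer_name": "Reeve, April", "tel": "5550100"}

# Mapping from each source field to the canonical attribute names.
CRM_MAP = {"cust_id": "customer_id", "fname": "first_name", "lname": "last_name", "phone_no": "phone"}
BILLING_MAP = {"customer_number": "customer_id", "customer_name": "full_name", "tel": "phone"}

def to_canonical(record: dict, mapping: dict) -> dict:
    """Rename source attributes to the canonical model, dropping unmapped fields."""
    return {canonical: record[source] for source, canonical in mapping.items() if source in record}

print(to_canonical(crm_record, CRM_MAP))
print(to_canonical(billing_record, BILLING_MAP))
```

The hard work, as the text notes, is agreeing on the canonical attribute names and meanings; once that is settled, the per-source mappings are comparatively mechanical.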

Transformation and presentation
From the virtual views or virtual tables (van der Lans, 2012) in the data virtualization server, the integrated data can be presented to the data consumer in the

appropriate format. If the data consumer is an application, the data can be presented in the application programming interface (API) or web service (service-oriented architecture service) format specific to the application. If the data consumer is a business intelligence tool or an application making a database call, the data can be presented in Structured Query Language (SQL) or Multidimensional Expressions (MDX) format. Similarly, the data may be presented in XML or JSON if that is the preference of the data consumers. The transformation necessary to the data for presentation is just a technology reformat and is less difficult to automate than transforming data to the common canonical data model.
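The reformatting step described above is largely mechanical once the canonical view exists. As a sketch, the same canonical record can be rendered as JSON for a web service consumer, as delimited text for a reporting tool, or as an XML document; the record and field names are illustrative only.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

canonical_record = {"customer_id": "C-1001", "first_name": "April", "last_name": "Reeve"}

# JSON for an application or web service consumer.
as_json = json.dumps(canonical_record)

# Delimited text for a reporting or business intelligence tool.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=canonical_record.keys())
writer.writeheader()
writer.writerow(canonical_record)
as_csv = buffer.getvalue()

# XML for consumers that expect a document format.
root = ET.Element("customer")
for name, value in canonical_record.items():
    ET.SubElement(root, name).text = value
as_xml = ET.tostring(root, encoding="unicode")

print(as_json)
print(as_csv)
print(as_xml)
```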

CHAPTER 21 Big Data Integration

INFORMATION IN THIS CHAPTER
• What is big data?
• Big data dimension—volume
  • Massive parallel processing—moving process to data
  • Hadoop and MapReduce
  • Integrating with external data
  • Visualization
• Big data dimension—variety
  • Types of data
  • Integrating different types of data
  • Interview with an expert: William McKnight on Hadoop and data integration
• Big data dimension—velocity
  • Streaming data
  • Sensor and GPS data
  • Social media data
• Traditional big data use cases
• More big data use cases
  • Health care
  • Logistics
  • National security
• Leveraging the power of big data—real-time decision support
  • Triggering action
  • Speed of data retrieval from memory versus disk
  • From data analytics to models, from streaming data to decisions
• Big data architecture
  • Operational systems and data sources
  • Intermediate data hubs
  • Business intelligence tools
  • Structured business intelligence
  • Search business intelligence
  • Hadoop and MapReduce business intelligence
  • Visualization
  • Data virtualization server
  • Batch and real-time data integration tools
  • Analytic sandbox
  • Risk response systems/recommendation engines
• Interview with an expert: John Haddad on Big Data and data integration

What is big data?
The definition of big data on Wikipedia is given as "data sets so large and complex that they become awkward to work with using on-hand database management tools." The term big, as well as this initial definition, implies that we are simply talking about the amount of data. However, big data also includes data of various types found outside of relational databases, such as documents, audio, video, and e-mail, as well as the increased speed and availability of data. Therefore, big data can be said to have dimensions of volume, variety, and velocity.

Big data dimension—volume
The volume of data that every organization needs to manage is now growing at a nonlinear pace. The amount of data that "on-hand" database management tools can handle is constantly expanding. "Traditional" relational database management systems can handle the volumes of structured, or relational, data of practically any organization. Although existing relational database management tools can handle the onslaught of the additional data required, they may no longer be optimal for managing huge volumes, nor may they be optimized for the various new use cases around big data. The traditional relational database management systems implement distributed storage solutions called grid or cluster storage, where the data may be distributed across multiple storage structures randomly, based on the time received and stored, or based on other key pieces of information, such as the client or product. However, the cost of high-volume solutions in traditional database management systems does not scale linearly, and the volumes these solutions will support will ultimately hit an upper limit. "Big data" storage solutions such as those described in Chapter 19 on cloud architecture usually scale linearly in cost and promise no ultimate upper limit to the volumes supported.

Massive parallel processing—moving process to data
When the volumes of data become extremely large, the normal paradigm of consolidating data prior to performing operations on the consolidated set of data becomes less attractive because of the time and extra disk needed to move and

consolidate the data, plus the time it takes to parse linearly through a single, consolidated data set. A shift in paradigm for handling these massive volumes of data includes using parallel processing to simultaneously process against multiple chunks of data rather than attempting to process against a consolidated set. This includes leaving the data on the various distributed servers where it is already stored and distributing the query and update requests across the data sets, so-called "moving the process to the data" because the process is performed simultaneously on all the distributed data servers. Ultimately, the responses, results, and acknowledgments from the distributed servers have to be consolidated at the process distribution point, so there are important data integration steps, but the volume of the results to be integrated is much smaller than the original full set of data.
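The following sketch illustrates the scatter-and-gather pattern described above in miniature: the same aggregation function is run against each data partition where it lives (here simulated as in-memory lists standing in for data on separate nodes), and only the small per-partition results are brought back and consolidated. The partition contents and the aggregation are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Each "node" holds its own partition of the data; the full set is never consolidated.
partitions = {
    "node-1": [120, 340, 99, 410],
    "node-2": [87, 560, 230],
    "node-3": [775, 12, 44, 310, 66],
}

def local_aggregate(values):
    """The processing is 'moved to the data': each node computes its own partial result."""
    return {"count": len(values), "total": sum(values)}

with ThreadPoolExecutor() as pool:
    partials = list(pool.map(local_aggregate, partitions.values()))

# Only the small partial results travel back to the distribution point for consolidation.
overall = {
    "count": sum(p["count"] for p in partials),
    "total": sum(p["total"] for p in partials),
}
print(overall)  # {'count': 12, 'total': 3053}
```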

Hadoop and MapReduce
Massive parallel processing is not a new idea but has been around since the advent of low-cost central processing units and personal computers. It was not used more widely previously because of design rather than technology limitations: it is difficult to create algorithms and solve problems in a distributed manner versus central processing. However, the problem of "search" lends itself to distributed processing. The volumes of Internet data faced by organizations such as Google, Facebook, and Yahoo led to the creation of new tools that could solve standard problems using massive parallel processing. Distributed storage requires the use of distributed file systems (such as Hadoop) that recognize the data across the physical devices as being part of one data set, but still know where the pieces are located in order to distribute processing. Data for these sites is usually stored in a data set that is divided across a large number of physical devices using the Hadoop Distributed File System, which is not a relational database solution. Requests against the data are performed by dividing up the request into multiple queries that are sent out to act against the distributed data simultaneously, after which the results are consolidated. The primary software solution for this problem, called MapReduce, was originally developed by Google for its web search engine. The standard now used is an open-source solution called Hadoop MapReduce, developed in 2006 with funding by Yahoo. It became "production strength" (or "web scale") in 2008 and is currently hosted by Apache as part of the Hadoop Framework. Hadoop and MapReduce solutions still need to distribute the processing requests and then re-integrate the various results. These distribution and consolidation steps are performed by MapReduce, which can also be categorized as an orchestration or even a data integration tool. The individual programmer defines what function needs to be performed against all the distributed data servers, and the underlying capabilities of MapReduce perform the distribution of the function

and the consolidation of the results. Hadoop and MapReduce are usually implemented to perform in a batch mode. Real-time search and analysis is performed against the precalculated result sets, not the vast distribution of raw data. The data most frequently stored in Hadoop file structures are web logs and web data, which are usually considered to be unstructured.
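A word count is the classic illustration of the MapReduce programming model. The sketch below imitates the map, shuffle, and reduce phases in plain Python simply to show the shape of the model; in an actual Hadoop MapReduce job the framework distributes the map and reduce functions across the cluster and handles the shuffle and consolidation itself.

```python
from collections import defaultdict
from itertools import chain

def map_phase(line: str):
    """Map: emit a (word, 1) pair for every word in an input record."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all values emitted for the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: consolidate the values for one key into a single result."""
    return key, sum(values)

# Input records, standing in for blocks of a file spread across HDFS data nodes.
records = ["data in motion", "data at rest", "moving data around"]

mapped = chain.from_iterable(map_phase(r) for r in records)
reduced = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(reduced)  # {'data': 3, 'in': 1, 'motion': 1, 'at': 1, 'rest': 1, 'moving': 1, 'around': 1}
```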

Integrating with external data
There are vast amounts of data from outside an organization available to be integrated with an organization's internal data. Although many of these sources may have previously been available, at least in print form, the fact that it is now extremely simple, and in many cases free, to access petabytes of relevant information from external sources makes leveraging this information necessary for every organization. There are huge data sets available from federal governments, such as data.gov from the United States federal government, from social media, and from information companies such as Google. The massive amounts of data traditionally purchased from companies such as LexisNexis and Dun & Bradstreet are still available, with more and more value being added to the purchased data to compete with what is available for free. The integration issues on these vast amounts of available external data become exacerbated: Copying the external data internally is a challenge because of the amount of time, network bandwidth, and disk space this would require. The processes can't be distributed to the external servers because these are owned by another organizational entity. Therefore, it is necessary, in general, to read and process the external data and transform it to the format for use by the organization, keeping only the data needed for integration with internal data. Decisions about the data life cycle are particularly critical for data from external sources: How much of it is needed? Does it need to be persisted (stored locally)? How long does it need to be retained? Of course, these questions need to be answered for an organization's internal data as well, though frequently the default answer has been to keep all the internal data, which is not usually an option with external data.
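A common pattern with large external data sets, as described above, is to read them in a streaming fashion and retain only the columns and rows needed for integration with internal data rather than copying everything. The sketch below assumes a hypothetical external CSV feed and made-up field names purely for illustration.

```python
import csv

# Hypothetical external feed already downloaded or exposed as a local file.
EXTERNAL_FILE = "external_business_listings.csv"
KEEP_FIELDS = ["business_id", "name", "postal_code"]   # only what we need internally

def filter_external_data(path, keep_fields, region_prefix="021"):
    """Stream the external file and retain only needed fields for relevant rows."""
    kept = []
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            if row.get("postal_code", "").startswith(region_prefix):
                kept.append({field: row.get(field, "") for field in keep_fields})
    return kept

# rows = filter_external_data(EXTERNAL_FILE, KEEP_FIELDS)
# Only this much smaller result set would be persisted and integrated with internal data.
```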

Visualization
The presentation of big data is a challenge because, with the large volume, variety, and velocity of the information, presenting detailed information becomes difficult or inappropriate for human consumption. Presentation in visual form frequently provides the capability to summarize vast amounts of information into a format that is conducive to human consumption and that may provide the ability to drill down to a further level of detail, if requested.

Big data dimension—variety
Types of data
The amount or volume of data is not the only aspect that makes data "big." Frequently, when considering what data is available for analysis, organizations have solely looked at data stored in relational databases. Now, organizations have the ability to analyze data stored in files, documents, e-mail, audio and video files, and a vast array of database types in addition to relational databases. In addition, huge amounts of external data are available that are not entirely "structured," such as social media data from Twitter, Facebook, YouTube, and blogs.

Integrating different types of data
The key to integrating the different types of data is to use metadata that "tags" unstructured data with attributes that can be linked. Thus, "unstructured data" such as images is "tagged" with metadata that identifies, for example, what an image shows or who a person in an image or audio file is, as well as when and where the data was created, updated, and accessed. Documents and e-mail can be indexed by words, phrases, and names found in the text or associated with the data. Logical organization of the information is made through taxonomies (hierarchies) and ontologies (groupings). These "tags" on unstructured data can be linked to keys and indexes in databases, thus bringing the unstructured and structured data together. An example of integrating data of various types may be that, for a customer, an organization has document images of contracts with that customer; contact information and transaction data in databases; e-mail with and about that customer; audio files with the customer's instructions and service calls; and video of the customer visiting the organization's office. All this information on the customer may be integrated and available to the customer service representative when a customer calls the organization.
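The sketch below shows the linking idea in its simplest form: metadata tags on unstructured items carry a customer identifier that matches the key of the structured customer record, so a single lookup can assemble everything known about that customer. The tag names, identifiers, and records are illustrative only.

```python
# Structured data: customer rows keyed by customer_id.
customers = {
    "C-1001": {"name": "April Reeve", "segment": "retail"},
}

# Unstructured content: each item is tagged with metadata, including the linking key.
content_items = [
    {"type": "contract_image", "uri": "docs/contract-8812.tif",
     "tags": {"customer_id": "C-1001", "signed": "2012-06-01"}},
    {"type": "service_call_audio", "uri": "audio/call-5531.wav",
     "tags": {"customer_id": "C-1001", "topic": "billing"}},
    {"type": "email", "uri": "mail/msg-99231.eml",
     "tags": {"customer_id": "C-2044"}},
]

def customer_view(customer_id):
    """Join the structured record with every tagged unstructured item for one customer."""
    return {
        "customer": customers.get(customer_id),
        "content": [item for item in content_items
                    if item["tags"].get("customer_id") == customer_id],
    }

print(customer_view("C-1001"))  # structured profile plus the contract image and service call
```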

INTERVIEW WITH AN EXPERT: WILLIAM MCKNIGHT ON HADOOP AND DATA INTEGRATION

Following is an interview with William McKnight on the importance and use of Hadoop in data integration. William McKnight is an expert on data management and has been working with data integration for many years, populating data warehouses, data warehouse appliances, and master data management hubs. Recently, he has been working with data integration solutions to populate Hadoop.

What is Hadoop?
Hadoop makes "source data-database-data access" economical for the management of large-scale Web logs and other forms of big data. Hadoop, an important part of the NoSQL (Not Only SQL) movement, usually refers to a family of open-source products, including the Hadoop Distributed File System (HDFS) and MapReduce. The Hadoop family of products extends into a

rapidly growing set of tools and utilities. Although open source, many vendors have created some closed-source additional capabilities and/or added Hadoop support in their product sets. Other open-source Hadoop family products include Avro, Pig, Hive, HBase, Zookeeper, and Sqoop. The Hadoop Distributed File System (HDFS) part of the Hadoop project is like a "big data" extract and load (EL) tool with good data-screening ability. HDFS runs on a large cluster of commodity nodes.

How is Data Integration (i.e., moving and transforming data) Related to Hadoop?
As a nonoperational platform or file system, Hadoop must receive its data from another source. Given the enormous volume of the data—and the trade-off that not falling behind takes the place of delays to clean the data—little is done to transform the data being loaded into HDFS; the data is largely just loaded into the Hadoop environment in its raw form.

Have you Seen any Projects Using Hadoop Where Data Integration Issues Led to Significant Problems? Or Where Insufficient Focus Was Put on Data Integration Issues?
Analyzing Hadoop data without the perspective of other corporate data, currently maintained mostly in relational systems, is very limiting. Master data management can take the value of this data in Hadoop up exponentially, but tools like data virtualization must be available to integrate data between Hadoop and relational database management systems.

What Kinds of Tools or Technologies are Used to Support Data Integration with Hadoop?
• Traditional ETL Vendors
• Data Virtualization
• Data Integration for the Masses
• Federated Big Data Analysis

How do you Think the Area of Hadoop and Data Integration is Changing? Are the Tools Supporting it Changing? Where do you Think this Area is Going?
Hadoop is emerging as data integration is changing, and indeed Hadoop is influencing that direction. Data integration is becoming more personal and accessible. Companies are also looking at balancing physical movement with the implementation of virtualization across heterogeneous data stores. However, with the vast increase in data store types that add value, data is increasingly becoming decentralized, continues to be replicated, is doubling every 18 months, and is increasingly seen as a competitive differentiator. As such, its movement will continue to be important. Each shop will have its numerous specific points of data integration. Every organization will approach its information architecture uniquely. There is a need in every organization for a robust, enterprise-ready, fully deployable, low-cost integration tool. Standardizing on a single data integration tool or suite across the organization will be best so that harmony of skills, practices, and experiences can be maintained across the enterprise.

Big data dimension—velocity
Because many devices, such as sensors, mobile phones, and geospatial tracking (GPS) devices, are getting cheaper and more ubiquitous, there is now an expectation that this information will be stored and made available. Tagging inventory and assets with radio transponders allows a constant trail of location information on manufactured goods. People voluntarily tag themselves with mobile phone location information and their cars with GPS devices and toll sensors. In short, a huge raft of additional sensor data is now expected to be available for analysis.

Not only is data coming at us faster, but the expectation is that the organization will be able to use the data immediately to make decisions. The velocity aspect of big data lies not only in the speed of incoming data but also in the speed of expected use.

Streaming data
Streams of data are now available from various sources internal and external to every organization, both free and at a cost. Even though the cost of disk space is very low compared to what it was in the past, the cost of storing this vast amount of data may still quickly become unwarranted. Every organization needs to make decisions about the retention period for the various available data streams. For data that does not contain confidential or private information, cloud storage solutions may provide support for temporary, inexpensive persistence of the data streams with defined redundancy and retention schedules.

Sensor and GPS data
Internal data may be available from sensors on people and physical assets, such as inventory and trucks. Real-time sensor data can be used for real-time decision making, such as fast inventory adjustments and detouring product deliveries in support of higher business priorities or issues. Historical sensor data can be used to improve processes such as standard delivery routes, gasoline consumption, and productivity. This type of internal sensor data is available to most organizations. The availability of general sensor data and sensor data specific to individual industries and businesses has made possible a potential golden age of analytics.

Social media data
Also available to organizations is the consumer data on social media sites. Most organizations monitor data referencing themselves as well as their competitors on Facebook, Twitter, LinkedIn, and major blog sites in support of customer service and their organizational reputation.

Traditional big data use cases
Certain big data problems and use cases are common to every organization, such as e-mail, contracts and documents, web data, and social media data. In addition, almost every industry has specific cases where it has big data management problems to solve. Many industries have always had to deal with huge volumes and varied types of data. Organizations in the telecommunications industry must

keep track of the huge network of nodes through which communications can pass and the actual activity and history of connections. Finance has to process both the history of prices of financial products and the detailed history of financial transactions. Organizations in airplane manufacturing and operation must track the history of every part and screw of every airplane and vehicle planned and operated. Publishing organizations must track all the components of documents through the development versions and production process. Interestingly, pharmaceutical firms have similarly strict document management requirements for their drug submissions to the FDA in the United States, and comparable requirements in other countries; thus, for pharmaceutical firms advanced document management capabilities are a core competency and a traditional "big data" problem.

More big data use cases
Big data use cases are emerging in every industry and organization where technology solutions can now perform functions, with vast amounts of information and in real-time situations, that were not previously possible.

Health care
The ability to analyze vast amounts of information allows breakthrough analytical capabilities in health care at the individual and macro levels. It is now possible to analyze an entire individual human DNA sequence and compare it against those of other individuals and groups. The current relatively low cost to perform individual DNA analysis (thousands of dollars) has made this tool accessible to a substantial number of people, compared to the initial cost of millions of dollars a few years ago, just after the first full human genome was analyzed. Analysis of the health care data of millions of individuals and groups is now a routine capability that may lead to great strides in public health administration as well as individual health care. The New York Times recently carried an article ("Mining Electronic Records for Revealing Health Data" by Peter Jaret, January 15, 2013) positing that analysis of electronic health records may be able to supplement costly clinical trial data collection, providing vastly more data for testing theories in health care at much lower cost.

Logistics
Big data use cases in manufacturing include those in the product development process itself as well as in inventory and distribution. The low cost of sensors makes constant monitoring of product manufacturing regarding quality and productivity a best practice. Identification tags broadcasting unique product identifiers are now relatively low cost. They enable constant monitoring of inventory and the ability to make real-time distribution adjustments.

National security
Some fascinating big data use cases are emerging within the area of public security. Telecommunications information analysis has been a traditional function of law enforcement in monitoring communications to or from specific individuals. Now, the actual contents of vast numbers of audio conversations can be monitored for the use of specific words or phrases such as "bomb." Networks of associates of known and potential individuals who pose a security threat can be developed from the telephone history of billions of phone calls. That information now includes the location of where calls were made and received as well as where the mobile phone is currently located.

Leveraging the power of big data—real-time decision support
Triggering action
Important big data use cases are moving from analysis of vast amounts of historical data to analysis of vast amounts of current data that can be used to make real-time decisions. Changes in available data in the area of utilities and traffic control are making historical data analysis fascinating and real-time data analysis amazing. Practically every vehicle on the road is now equipped with a GPS or other tracking device, making it possible to perform traffic analysis at the individual vehicle behavior level and macro analysis of the combined behavior of all the vehicles. This information is being used in real time to manage traffic lights and long term to affect infrastructure investment. The reduced cost of sensors has made it cost effective for utilities to have vast numbers of sensors in place providing environmental and use data that can be analyzed and acted upon in real time, plus stored and analyzed for process improvement. Models that have been created from big data of utility grid behavior just prior to an emergency can be used in the future in real time to automatically trigger utilities to cut off, bypass, or obtain additional resources. Risk management calculations in financial services, using a vast amount of historical information on holdings and prices and price changes, can provide worst-case scenarios and probabilities concerning financial portfolios. The models created from these calculations are then used in real-time decision support, triggering warnings and automatic trading based on streaming pricing data. Financial institutions model every individual's buying and use behavior. Then, when the individual's activity falls outside of the regular pattern, warnings are automatically distributed to the individual and client support specialists to verify that the activity is not fraudulent. In a less frightening direction, big data analysis of customer behavior for buying and renting movies, books, and clothes can now yield recommendations to individual customers. The following phrase is now common: "other customers like you

also purchased ...". Even analysis of the social media content of individuals and their networks can be used to recommend movies, music, and other entertainment.

Speed of data retrieval from memory versus disk
Faced with the increased velocity of big data decision-making requirements, small real-time delays in incoming data processing, data access, and response can be significant. Once again, the lower costs of fast disk and memory can be utilized to take advantage of big data velocity. A simplistic way of thinking about the relative time to retrieve data is that if it takes a certain amount of time in nanoseconds to retrieve something in memory, then it will be approximately 1000 times that to retrieve data from disk (milliseconds). Depending on the infrastructure configuration, retrieving data over a local area network (LAN) or from the Internet may be ten to 1000 times slower than that. If we load our most heavily used data into memory in advance, or something that behaves like memory, then processing of that data should be sped up by multiple orders of magnitude. Using solid-state disk for heavily used data can achieve access and update response times similar to having data in memory. Computer memory and solid-state drives, though not as inexpensive as traditional disk, are certainly substantially less expensive than they used to be and are getting cheaper all the time. Why is this concept relevant to data integration? These differences in processing time are changing the way many applications are designed and the way analysis and analytics are performed. Traditional data analysis would pull together data (data integration) into a report that an analyst would view and derive insight from. In many situations, a model would be created to attempt to predict future behavior or manage risk. The model would have to be loaded with large amounts of historical data. The models generated through traditional analysis or through big data analysis are now being used against the large volume and velocity of data available in real time, in addition to historical data, and actions are being triggered immediately based on the results. Risk calculations are used to immediately trigger actions based on price fluctuations. Customer sentiment calculations immediately trigger responses. Current sales information can immediately trigger inventory redistribution. Traffic volume can immediately trigger additional support.
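One simple way to exploit the memory-versus-disk difference discussed earlier in this section is to pre-load, or cache, the most heavily used reference data in memory so that repeated lookups never go back to disk. The sketch below wraps a file read in a plain dictionary cache; the file name and data layout are hypothetical.

```python
import json

REFERENCE_FILE = "product_reference.json"   # hypothetical heavily used lookup data
_cache = {}                                  # in-memory copy, populated on first use

def load_reference_data():
    """Read the reference data from disk; this is the slow, once-only step."""
    with open(REFERENCE_FILE, encoding="utf-8") as handle:
        return json.load(handle)             # assumed to be a dict keyed by product id

def lookup_product(product_id):
    """Serve repeated lookups from memory; only the first call touches the disk."""
    if not _cache:
        _cache.update(load_reference_data())
    return _cache.get(product_id)

# The first call pays the disk (or network) cost; subsequent calls are orders of
# magnitude faster because the data is already resident in memory.
```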

From data analytics to models, from streaming data to decisions
The ultimate goal of big data then is to tap into the vast amount of information available to make better real-time decisions. Large volumes of data of various types and sources are analyzed for patterns that identify potential risks and opportunities for an organization. This analysis is probably performed by highly skilled data analysis specialists such as data scientists in a flexible analytical environment called an analytics sandbox. The analysts

create process and event models that identify opportunities to be leveraged or risks to be avoided. The models are then integrated with real-time streaming data so that, when certain patterns or situations occur, they trigger warnings and transactions to respond immediately to the highlighted situation. These opportunities and risks are currently available in every industry and organization. All consumer organizations, for example, monitor social media sources for feedback that triggers action. Organizations that don't immediately respond to negative information in social media could quickly find their reputation in crisis.
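In outline, the pattern described above is a loop that scores each incoming event against a model produced in the analytics sandbox and triggers an action when the score crosses a threshold. The sketch below uses a trivial scoring rule and made-up account data as a stand-in for a real risk or sentiment model.

```python
def score(event, baseline):
    """Stand-in for a model from the analytics sandbox: deviation from expected behavior."""
    expected = baseline.get(event["account"], 0.0)
    return abs(event["amount"] - expected) / (expected or 1.0)

def process_stream(events, baseline, threshold=3.0):
    """Score each streaming event and trigger a warning when the model flags it."""
    for event in events:
        if score(event, baseline) > threshold:
            yield {"action": "warn", "account": event["account"], "amount": event["amount"]}

baseline_spend = {"A-17": 120.0, "A-42": 45.0}          # learned from historical analysis
incoming = [
    {"account": "A-17", "amount": 95.0},                 # within the normal pattern
    {"account": "A-42", "amount": 2500.0},               # far outside the pattern
]
for alert in process_stream(incoming, baseline_spend):
    print(alert)   # {'action': 'warn', 'account': 'A-42', 'amount': 2500.0}
```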

Big data architecture
Tools and technologies for big data include those for other enterprise information management activities: enterprise content management (or document management), data warehousing and business intelligence (including data visualization), data analytics, data integration, metadata management, and data virtualization. The need for tools in support of these capabilities becomes more important with big data because the volumes are usually beyond a human manageable scale and require some degree of automation. Figure 21.1 depicts the logical pieces involved not only in big data integration but in big data solution architecture as well.

Operational systems and data sources The sources of data in a big data architecture may include not only the traditional structured data from relational databases and application files, but unstructured data files that contain operations logs, audio, video, text and images, and e-mail, as well as local files such as spreadsheets, external data from social media, and real-time streaming data from sources internal and external to the organization.

Intermediate data hubs
The organization is probably using intermediate data hubs for structured data (data warehouses, marts, and operational data stores), various documents and files (email and enterprise content management servers), social media and web data (Hadoop), and master data, with various flavors, implementations, locations, and configurations of all of these. Existing data warehouses, data marts, and analytic appliance implementations are an important part of the full big data architecture, although these data structures are probably only storing structured data. Integrating the data in these various warehouse stores with other types of data and current operational data is where the power of big data analysis can be achieved.

FIGURE 21.1 Big Data Architecture Components. (Data consumers such as business managers, business analysts, and data scientists work through reports, structured BI tools, data visualization tools, search BI tools, MapReduce BI tools, the analytic sandbox, and risk response systems; a data virtualization server connects these tools to intermediate data hubs including data warehouses and analytics appliances, enterprise content management, the Hadoop file system, and master data; underlying sources include operational systems, relational and graph databases, documents and content, spreadsheets, XML, audio and video files, image files, operations logs, unstructured and semi-structured data, social media, email, and data streams; lines between systems indicate data movement and transformation (ETL and ESB), usually in real time.)

Various data streams will be coming in to the big data environment, and some of this data will persist, for however long is specified by data type and content, as appropriate to the organization. Cloud-based solutions and Hadoop may be appropriate for temporary persistence of this type of high-volume, low-value data.

Business intelligence tools
Structured business intelligence
Along with the data warehouse data stores are the traditional business intelligence tools that operate primarily on structured data in relational databases. The traditional business intelligence tools become more powerful than ever when fed data in an appropriate structured format from the unstructured data sources through the data virtualization server.

Search business intelligence
Critical to big data architecture is the inclusion of tools for managing documents and e-mail, including business intelligence tools focused on analyzing this data,

which is commonly referred to as a "search" type of analysis. The search analysis tools can search across data of many types and in many locations.

Hadoop and MapReduce business intelligence In many cases, “big data” and Hadoop are used as synonyms. As part of the Hadoop solution set, MapReduce is generally used for orchestration and analysis of the data stored in the Hadoop File System.

Visualization Data visualization tools for presenting massive amounts of information are used against data from most of the intermediate data hubs, including data warehouses, data streams, and Hadoop.

Data virtualization server
The data virtualization server will be a central and critical component in the big data architecture, as it facilitates integrating data from various technologies and sources and formatting it for use by various data consumer tools and applications.

Batch and real-time data integration tools
Moving and presenting the data between data structures and tools in the big data architecture can utilize both batch (ETL) and real-time (ESB) data integration tools, as well as the capabilities of the various components, including the data virtualization server, MapReduce applications, and business intelligence tools.

Analytic sandbox Big data architecture will usually include an area set aside for analysis of the available data sources by analytic process specialists. The analytic sandbox allows the review of large amounts of data from various sources and of various types by specialists using sophisticated tools to identify patterns in the data. The analysts produce reports of results and process models to be used for real-time decision making.

Risk response systems/recommendation engines
The ultimate goal of big data is to leverage the vast information for real-time decision making. The risk response systems use complex event processing (CEP) and the process models developed by the data analysts and data scientists to respond to real-time information and trigger warnings to business managers as well as to trigger transactions to respond to opportunities and risks highlighted during the big data analysis processes.
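Complex event processing engines watch for patterns across events rather than single values. As a minimal sketch of the idea, assuming a made-up event feed and thresholds, the code below raises a warning when too many failed-login events for the same account arrive inside a short time window.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_FAILURES = 3

recent_failures = defaultdict(deque)   # account -> timestamps of recent failures

def on_event(event):
    """Evaluate one streaming event against the windowed pattern and trigger a response."""
    if event["type"] != "login_failed":
        return None
    window = recent_failures[event["account"]]
    window.append(event["ts"])
    while window and event["ts"] - window[0] > WINDOW_SECONDS:
        window.popleft()                # discard events older than the window
    if len(window) >= MAX_FAILURES:
        return {"action": "alert_security", "account": event["account"], "failures": len(window)}
    return None

feed = [{"type": "login_failed", "account": "A-17", "ts": t} for t in (0, 20, 40)]
for event in feed:
    response = on_event(event)
    if response:
        print(response)   # fires on the third failure inside the 60-second window
```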

INTERVIEW WITH AN EXPERT: JOHN HADDAD ON BIG DATA AND DATA INTEGRATION

Following is an interview with John Haddad on Big Data and Data Integration. John Haddad is Director of Product Marketing for Big Data at Informatica Corporation. He has over 25 years' experience developing and marketing enterprise applications. Prior to Informatica, John was Director of Product Management and Marketing at Right Hemisphere, held various positions in R&D and Business Development at Oracle Corporation, and began his career as a scientific programmer at Stanford University.

How do you Define "Big Data"?
Big data is the confluence of big transaction data (e.g., RDBMS), big interaction data (e.g., social data, web logs, sensor devices, e-mails), and big data processing (e.g., Hadoop) primarily resulting from the adoption of social, mobile, and cloud computing. Big data can drive rapid innovation by analyzing and extracting value from more data, more types of data, and at faster speeds.

How is Data Integration (i.e., moving and transforming data) Related to Big Data?
It turns out that 80% of the work in a big data project involves data integration. When I speak about data integration, I'm referring to the ability to access, parse, normalize, standardize, integrate, cleanse, extract, match, classify, mask, and deliver data. According to D. J. Patil in his book Data Jujitsu, "80% of the work in any data project is in cleaning the data." And in a recent study involving 35 data scientists from 25 companies, one of the participants stated, "I spend more than half my time integrating, cleansing, and transforming data without doing any actual analysis. Most of the time I'm lucky if I get to do any 'analysis' at all." (Kandel et al. Enterprise Data Analysis and Visualization: An Interview Study. IEEE Visual Analytics Science and Technology (VAST), 2012). In other words, before you can do anything useful with big data, you need to integrate it. This is because big data is coming from so many different types of sources and in many different formats.

How and Why is Data Moved for Big Data?
Not only is there much more data, but there are many different types of data sources in various types of structures and formats. Data is being generated and consumed at a massive scale from both inside and outside the enterprise from customer and supplier transactions, the Internet, and social, cloud, and sensor devices, to name but a few. To extract value from big data requires that data be moved from the point of origin and source systems to big data platforms that integrate, analyze, and deliver value from all this raw data.

Are There Other Aspects of Data Integration in Big Data Besides Moving and Transforming Data?
Yes, in some cases you may want to avoid moving data by using data virtualization. Data virtualization allows you to create a data abstraction layer that hides the underlying complexity of the data sources. From this data abstraction layer you can decide whether to federate across the data sources or move the combined data into a physical target. Another key aspect of data integration is metadata management and data governance. Metadata management creates a semantic layer to better understand the data and support data governance initiatives.

Are There Different Data Integration Considerations for Structured Data Integration and Unstructured Data Integration?
There are; however, I find it more useful to delineate between traditional row-column formatted relational and flat file data versus multistructured (e.g., hierarchical, graph) and unstructured data (e.g., text). The former in many cases can only be processed at scale using traditional data platforms (e.g., RDBMS), while the latter can be stored and processed more cost-effectively using emerging NoSQL technologies such as Hadoop. You can also draw a distinction between high-density, high-value data (e.g., typically stored in RDBMS) and low-density raw data (e.g., web logs, social media text) to help decide where to best store, integrate, and process data.

Have you Seen any Big Data Projects Where Data Integration Issues Led to Significant Problems?
Yes, when data integration is not done properly, it all too often results in project delays, failed projects, and low end-user adoption and can directly impact the business, resulting in poor customer service, inferior products, inefficient operations, and bad decision making. Consider the implications of incomplete, inconsistent, inaccurate, and untimely data being delivered to the business. The impact can be inconsistent customer experience across order channels, declining loyalty due to delivery delays or billing errors, or lost revenue from lack of optimal cross-sell/up-sell.

Have you Had Experiences Where Particular Attention was Paid to Data Integration in Big Data?
The best practice process for data integration in big data projects is one that involves the ability to access and ingest, parse and prepare, discover and profile, transform and cleanse, and extract and deliver data. As I mentioned earlier, 80% of the work in a big data project is data integration. For example, large global banks use data integration for big data projects related to fraud detection, risk and portfolio analysis, investment recommendations, regulatory compliance, and proactive customer engagement. I like what Amazon CTO Werner Vogels said during his keynote, entitled "Data Without Limits," at the Cebit trade show earlier this year: "Big data is not only about analytics, it's about the whole pipeline. So when you think about big data solutions you have to think about all the different steps: collect, store, organize, analyze and share."

Have you Had Experiences Where Data Integration Was Neglected on a Big Data Project? Why do you Think Data Integration Was Neglected?
Yes, it is usually neglected because of a quick and dirty approach to integration that takes the path of least resistance. These projects don't fully consider the scope and necessary requirements to support and maintain big data projects in production as data volumes grow and new data types are added. Organizations need a data integration platform that can scale linearly, provides 24x7 reliability, supports an architecture flexible to change, and provides tools that enhance productivity and facilitate collaboration.

Are There Special Considerations Around the Movement of Metadata for Big Data?
I would say that there are special considerations related to metadata for big data in general. Bear in mind that not all data is modeled on ingestion with big data projects. Raw interaction data (e.g., social data, web logs, sensor devices, e-mails, etc.) tends to be processed as schema-on-read as opposed to schema-on-write. Therefore there is an inherent lack of metadata in big data projects. This is where data governance plays a critical role in big data projects. Metadata can be acquired through data discovery (e.g., domains, relationships) and curated (e.g., normalized, cleansed) through data stewardship. Some metadata can be accumulated automatically as data is accessed, integrated, analyzed, and used throughout the enterprise. For example, data lineage for compliance audits and usage patterns can be automatically acquired with some data integration tools. There are different types of metadata (technical, business, operational) that are very useful in managing big data projects so as to enhance search, simplify data audits, ensure trust, improve collaboration, reduce rework, and increase security.

What Kinds of Tools or Technologies are Used to Support Data Integration for Big Data?
Big data needs a data integration platform optimized to support a heterogeneous data environment, including productivity tools, one that scales for production use and multiple projects and makes it easier to support and maintain projects throughout the project life cycle. Big data projects require data integration tools that provide consistent and reliable connectivity for both transaction and interaction data, prebuilt ETL and data quality transformations, parsing libraries (parsers), a visual integrated development environment (IDE) to build data flows, and data profiling. Organizations need a data integration platform that supports all volumes and types of data and also supports real-time and batch processing through data replication, data streaming, and complex event processing (CEP). Data integration should be considered as part of a complete big data reference architecture that also includes MDM.

When Might you Use Batch Data Integration versus Real-time Data Integration in a Big Data Project? Do you Think it is Worthwhile to Have Both Batch and Real-time Data Integration Solutions Working with a Big Data Solution?
Batch integration is primarily used to preprocess large volumes of data for analytics and to identify patterns and trends that can be exploited by the business. Business value is derived from batch integration by processing more data, and more types of data, faster. Real-time data integration has a couple of uses: to smooth out the big data processing load by capturing and integrating only changed data, avoiding unnecessary staging of data and lengthy batch windows, and to proactively respond to events based on situational awareness. Both batch and real-time data integration can provide some very interesting big data solutions. For example, it is common in fraud detection to analyze a very large corpus of historical data in batch to identify patterns of fraud, and to use real-time data integration to establish situational context and determine in real time the likeliness that a fraud event is happening, which in turn generates an alert.

How do you Think the Area of Big Data is Changing? Where do you Think This Area is Heading?
The technology side of big data is changing rapidly. However, I'm afraid (as often is the case with new technologies and trends) that the people and process side of the equation is not adopting best practices fast enough to assimilate the advantages big data has to offer. Ultimately, success depends on the business and IT working much more efficiently and in collaboration with each other. The formation of teams devoted to managing data as an asset to create innovative data products and services requires a variety of skills, some of which need to be acquired from outside or through additional training. Big data projects deviate from traditional business intelligence in that organizations need a more aligned top-down business technology strategy that continuously explores ways to maximize the return on data, monetize data assets by introducing new products and services, and improve business operations. We can expect to see executive sponsorship of data science teams aligned with strategic business initiatives (e.g., increase customer acquisition and retention).

Do you Think the Technologies for Supporting Big Data, especially Around Data Integration, are Changing?
Big data technologies are changing and evolving rapidly. The open-source community and commercial vendors are working with their customers to mature newly emerging technologies and make sure these technologies work with existing data management infrastructure. We will see more purpose-built applications based on common design patterns (e.g., recommendation engines) and vertical-specific big data use cases (e.g., risk and portfolio analysis, predicting patient outcomes, vehicle telematics). A lot of new technologies require specialized skills that add to the complexity of big data projects. So we will see vendors begin to integrate these technologies and create abstraction layers that hide the underlying complexities of these technologies.

CHAPTER 22 Conclusion to Managing Data in Motion

INFORMATION IN THIS CHAPTER
Data integration architecture: why data integration architecture; data integration life cycle and expertise; security and privacy
Data integration engines: operational continuity; ETL engine; enterprise service bus; data virtualization server; data movement
Data integration hubs: master data; data warehouse and operational data store; enterprise content management; data archive
Metadata management: data discovery; data profiling; data modeling; data flow modeling; metadata repository
The end

Data integration architecture
Why data integration architecture?
Every organization should have a data integration strategy, as the volumes and complexity of the data they are required to manage escalate rapidly. Without a data integration strategy, the natural complexity of interfaces between applications quickly becomes unmanageable. The data integration strategy may be part of the data management strategy or simply part of the technology strategy, but it should include batch, real-time, and big data integration management. This chapter will review the components of a data integration strategy.

Data integration life cycle and expertise
The project life cycle of any data integration project is similar to the flow of a batch data integration project, as depicted in Figure 5.2. Additionally, as in Figure 22.1, a data discovery process or impact analysis may be needed early in the analysis to determine where and how much data will be impacted by the project. This depiction does not include the specification or acquisition of data integration tools or additional disk or servers. Data discovery is usually a necessity for most data integration projects, including master data management and data archiving. It is usually less of a need for data conversion and data warehousing projects, where the sources may be known. The need for a data discovery phase is becoming even more relevant with the advent of big data volumes and complexity.
Identifying the right resources to specify and code data integration solutions has presented a conundrum from the early days of batch ETL development. Designing overall solutions requires a combination of specific technical knowledge of data integration technology and tools, high-level business process flow, and detailed business data understanding. It is necessary to develop the overall data integration design using a combination of people working together, with either a technical or business resource leading the effort. Specifying detailed interfaces requires a combination of detailed tool knowledge, detailed technical data knowledge, and detailed business data knowledge. Once again, the effort may be led by either a business or technology person. Coding solutions requires expertise in the chosen data integration technology and tools, but with detailed specifications approved by the business experts on the source and target data and by the application developers supporting the source and target systems.
Best practice and consensus is that the creation of a Center of Excellence for data integration within an organization's data or information technology function is most productive for developing and managing data integration solutions, where the participants have expertise on the life cycle of data integration development projects as well as expertise in the data integration technologies. Requests for interfaces or data integration solutions can be made to the center of excellence, which can then involve the necessary business and technical expertise from the organization as well as ensure that solutions follow organization policies, standards, and best practices.

FIGURE 22.1 Data Integration Project Life Cycle.

Security and privacy
As with all data and application implementations in organizations these days, data integration development must ensure that the overall data integration architecture, individual data integration solutions, and all data integration development processes support organizational policies, standards, and best practices in security and privacy.
This can be particularly challenging for data integration projects, starting with the need to perform data discovery and profiling of real production data. It is usually necessary to negotiate with the information security organization and have data discovery and profiling performed by individuals with appropriate access authority, which may mean assigning more expensive individuals or even training individuals who have the appropriate authority to perform the profiling tasks.
Since one of the main functions of data integration is moving data, all data in motion (or between persistent data stores) must be adequately secured and, if necessary, encrypted, to certify that it cannot be accessed by unauthorized persons or systems.
For data that is presented through data integration solutions, either through copies or views, it is necessary to ensure that all individuals accessing the data have the authority to do so. Moreover, any systems or applications that are given copies of or access to data should only allow access to individuals and systems that have the appropriate authority. Since any system that gives access to data has the responsibility to manage appropriate access authority, it is necessary to negotiate with source applications to get access; it is also necessary to ensure that target applications only allow appropriate access.
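Where policy permits, one common way to reconcile development and profiling needs with privacy requirements is to mask sensitive fields as part of the extract, before the data moves. The sketch below is one minimal way to do that; the protected field names, the salt handling, and the file names are assumptions, and any such approach would need to be agreed with the information security organization.

```python
# A hedged sketch of protecting sensitive fields before data leaves the source
# environment: values are one-way hashed so downstream work never sees the raw
# data but consistent values still join correctly. Illustrative only.
import csv
import hashlib

SENSITIVE_FIELDS = {"ssn", "credit_card"}   # assumed list of protected columns
SALT = b"replace-with-a-managed-secret"     # in practice, from a key vault

def mask(value):
    """One-way mask: consistent for matching, not reversible."""
    return hashlib.sha256(SALT + value.encode("utf-8")).hexdigest()[:16]

def masked_extract(src_path, dst_path):
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            for field in SENSITIVE_FIELDS & set(row):
                row[field] = mask(row[field])
            writer.writerow(row)

if __name__ == "__main__":
    masked_extract("customers_raw.csv", "customers_masked.csv")  # assumed files
```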

Data integration engines
Operational continuity
The processing engines for data integration are operational systems; hence they have to be up and running to operate and orchestrate the system interfaces. These systems have to have immediate system support available and be included in the disaster recovery tier of the most critical systems being integrated. The data integration engines are designed to operate across multiple technology environments, operating systems, database management systems, and vendor applications.

ETL engine
For batch data integration, which is used to support data warehousing, data archiving, and data conversions, among other things, the organization may be operating one or more ETL engines. There may be more than one because the organization may have multiple data warehouses that were developed separately using various ETL technology solutions, or it may even be using the same technology but multiple instances. Figure 22.2 depicts the basic batch data integration architecture supported by the ETL engine.

FIGURE 22.2 Batch Data Integration Architecture. (The figure shows the applications portfolio feeding data warehouses, data marts, consolidated data stores, and data archives, which in turn feed business intelligence reporting, dashboards, and analytics.)

Batch ETL engines are most frequently used to load data into data warehouses and to extract data out for business intelligence, analytical, and reporting tools.
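As an illustration of the extract, transform, and load steps such an engine automates, here is a minimal hand-coded batch sketch. The source file name, the column names, and the target staging table are assumptions; a real ETL engine adds scheduling, restartability, logging, and far richer transformation logic.

```python
# A minimal batch ETL sketch using only the Python standard library; commercial
# ETL engines provide these steps as configurable, monitored components.
import csv
import sqlite3

def extract(path):
    """Read the source extract file into dictionaries."""
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def transform(rows):
    """Standardize formats and apply simple business rules."""
    for row in rows:
        yield {
            "customer_id": row["cust_no"].strip(),          # assumed source column
            "name": row["name"].strip().title(),
            "country": row.get("country", "US").upper(),    # default is an assumption
        }

def load(rows, conn):
    """Load the transformed rows into the warehouse staging table."""
    conn.execute("CREATE TABLE IF NOT EXISTS stg_customer "
                 "(customer_id TEXT, name TEXT, country TEXT)")
    conn.executemany(
        "INSERT INTO stg_customer VALUES (:customer_id, :name, :country)", rows)
    conn.commit()

if __name__ == "__main__":
    with sqlite3.connect("warehouse.db") as conn:
        load(transform(extract("customers.csv")), conn)  # assumed file name
```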

Enterprise service bus
Real-time interfaces are most frequently implemented using an enterprise service bus (ESB) to orchestrate the interactions between applications and systems. Real-time interfaces are used to support master data management (moving data into and out of the master data hubs in real time) as well as the movement of transactional data updates between applications. Figure 22.3 shows the real-time data integration architecture supported by an enterprise service bus.
The enterprise service bus implements the movement of data between applications and the transformation of data from the specific source application format to the common canonical model format and to the format of the target systems. The enterprise service bus supports the interaction patterns of "publish and subscribe" and "request and reply". Supporting the enterprise service bus are local utilities to handle data movement, event monitoring, and transaction processing middleware.
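To show the publish-and-subscribe pattern and the role of the canonical model in miniature, the toy sketch below lets a source application publish a customer change once, in canonical form, and lets any number of target systems subscribe to it. The topic name, field names, and the in-process bus itself are assumptions standing in for a real ESB with durable queues, routing, and monitoring.

```python
# A toy publish-and-subscribe sketch illustrating the ESB interaction pattern.
from collections import defaultdict

class MessageBus:
    def __init__(self):
        self._subscribers = defaultdict(list)  # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        for callback in self._subscribers[topic]:
            callback(message)

def to_canonical(crm_record):
    """Transform the source application's format to the canonical model."""
    return {"customer_id": crm_record["id"], "name": crm_record["full_name"]}

if __name__ == "__main__":
    bus = MessageBus()
    # Two target systems subscribe to customer updates.
    bus.subscribe("customer.updated", lambda m: print("billing got", m))
    bus.subscribe("customer.updated", lambda m: print("warehouse got", m))
    # The source publishes once; the bus delivers the canonical form to all.
    bus.publish("customer.updated",
                to_canonical({"id": "42", "full_name": "Ada Lovelace"}))
```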

Data virtualization server
The data virtualization server provides real-time integration of data from various technologies and types, both structured and unstructured. It pulls the data, reformats it into an integrated view, and then presents it in appropriate formats for various consumers, including people and systems, without staging or instantiating the intermediate data. Figure 22.4 depicts the data flow of a data virtualization server.

FIGURE 22.3 Real-Time Data Integration Architecture. (The figure shows a controller/monitor and the enterprise service bus linking the applications.)

FIGURE 22.4 Data Virtualization Server. (The figure shows data consumers, such as direct users, business intelligence tools, and applications, accessing the data virtualization server through SQL, XQuery, JSON, and web services; the server draws on the underlying data sources.)

A data virtualization server may utilize data from an ETL engine or an enterprise service bus as a source, as well as structured and unstructured data hubs and operational systems, and may provide data to a data warehouse or an operational data store as a target, as well as to various operational applications and business intelligence tools.
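The following is a deliberately small sketch of the idea behind a virtual view: relational rows and a JSON document source are joined on demand and returned as one integrated record set, without the combined data ever being staged. The table, file, and field names are assumptions; a commercial data virtualization server would expose such a view through SQL, XQuery, or web services rather than a Python function.

```python
# A minimal "virtual view" sketch: combine a relational table and a JSON
# document source at query time, without persisting the joined result.
import json
import sqlite3

def virtual_customer_view(conn, json_path):
    """Yield integrated records joining a relational table with a JSON source."""
    with open(json_path) as f:
        notes = {d["customer_id"]: d["notes"] for d in json.load(f)}
    for customer_id, name in conn.execute("SELECT customer_id, name FROM customer"):
        yield {"customer_id": customer_id, "name": name,
               "notes": notes.get(customer_id, "")}  # unstructured side of the join

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customer (customer_id TEXT, name TEXT)")
    conn.execute("INSERT INTO customer VALUES ('42', 'Ada Lovelace')")
    # notes.json is an assumed file holding entries such as
    # [{"customer_id": "42", "notes": "prefers e-mail contact"}]
    for record in virtual_customer_view(conn, "notes.json"):
        print(record)
```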

Data movement
The data integration engines generally use the utility capabilities of the various environments where the data and applications are located to perform the actual movement of data, if necessary, including schedulers, database monitors and triggers, business process and transaction management middleware, and other process and data movement utilities.

Data integration hubs
Figure 22.5 depicts the components of a big data environment. Managing data integration through hubs helps to simplify management of the organization's interfaces. Data hubs can bring the number of interfaces from being an exponential function of the number of applications to being a linear function of the number of applications. The business-oriented hubs of master data, data warehousing, and data archiving strongly support enabling management of the portfolio of application interfaces. The hubs depicted in Figure 22.5 are known by the business areas; this is in contrast to the hub-and-spoke approach to real-time data integration, which is a more technical approach to managing interfaces and may not be known outside of the information technology organization.

FIGURE 22.5 Big Data Architecture. (The figure shows reports, business managers, business analysts, and data scientists served by risk response systems, structured BI tools, data visualization tools, search BI tools, MapReduce BI tools, and the analytic sandbox, all drawing on the data virtualization server, which in turn draws on data warehouses and analytics appliances, enterprise content management, master data, and the Hadoop file system.)
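A rough back-of-the-envelope comparison shows why hubs tame the interface count: with point-to-point integration every pair of applications may need its own interface, while a hub needs only one interface per application. The sketch below simply computes the two counts for a few portfolio sizes; the pairwise figure is an upper bound used only for comparison.

```python
# Compare the worst-case point-to-point interface count with the hub count.
def point_to_point_interfaces(n_apps):
    return n_apps * (n_apps - 1) // 2   # one interface per pair of applications

def hub_interfaces(n_apps):
    return n_apps                        # one interface per application to the hub

for n in (10, 50, 200):
    print(f"{n} apps: point-to-point={point_to_point_interfaces(n)}, "
          f"hub={hub_interfaces(n)}")
```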

Master data
Master data is the critical key data in the organization, such as customers, products, employees, vendors, financial accounts, and reference data. A master data hub provides a central location for managing and providing master data to the organization. A master data hub that is supported using batch data integration can provide support for business intelligence and reporting functions, but usually it is necessary to have real-time data integration to support managing master data and supplying updates to operational applications.

Data warehouse and operational data store
Data warehouses are an integrated hub of data that can be used to support business analysis and reporting. Many organizations actually implement multiple data warehouses to support different geographies or functions within the organization. Again, a data warehouse helps to make data integration in an organization manageable by providing a central hub of data to be used for reporting and analysis. All the consumers who want to access that data can get it from a single place rather than having to go to various operational applications directly.
Data warehouses are usually updated using batch data integration; if real-time consolidated information is needed, then an operational data store would be created, which would consolidate operational application data using real-time data integration.

Enterprise content management
Unstructured data objects such as documents, images, audio, and video may be managed through a central enterprise content management repository. This type of repository serves the same purpose as the structured data hubs, by enabling consolidated management of many types of unstructured data across the organization.

Data archive
When data is no longer needed for operational processing, it may be more cost-effective to archive the data for some period of time to a less expensive data store, until it is certain that the organization no longer needs it or while it is required to be maintained by regulations. Data archival may also be necessary when an application is retired or replaced. Data backups are specific to a particular application, technology, and schema, and may not be easily recoverable if the data structures are modified or the application is no longer in operation. Additionally, providing a central capability that will manage archived data and make it accessible across all the applications in the organization and various data types is more cost-effective and flexible.

Metadata management
Metadata management is becoming a critical capability within data integration and within most data management areas. Metadata allows the linking or integration of structured and unstructured data types together. Also, it is quickly becoming impossible to hand-craft metadata with the volumes and types of data being managed, and it is becoming a necessity to have tools that can automatically create much of the needed metadata, and not just capture the metadata created during the development and operation processes.

Data discovery
Data discovery is becoming a necessary capability as the volume of data in organizations is growing exponentially, making it almost impossible to know in advance the full impact of potential changes, development, or interest. The most advanced data discovery products can identify where data is located using the names of fields and tags, as well as the actual content of the data structures.

Data profiling
Data profiling is a required function in advance of every data-related development project. Once the general location of the data of interest has been identified, it is absolutely essential that data profiling be performed on the actual production data. Without data profiling, it cannot be reasonably assured that the data in question is fit for the purpose intended, nor can one have a reasonable estimate of the effort needed to develop the requested data integration solution. Data-profiling tools will automatically perform the standard assessments and provide reports. Further analysis against the proposed source or target data can then be performed.
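The standard assessments a profiling tool automates include row counts, null rates, distinct values, and value ranges per column. The sketch below computes just those basics for a delimited extract; the input file name is an assumption, and real profiling tools add pattern analysis, cross-column dependencies, and report generation.

```python
# A minimal data-profiling sketch: per column, report row count, null/blank
# count, distinct values, and min/max. Illustrative only.
import csv
from collections import defaultdict

def profile(path):
    stats = defaultdict(lambda: {"nulls": 0, "values": set()})
    rows = 0
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            rows += 1
            for column, value in row.items():
                if value is None or value.strip() == "":
                    stats[column]["nulls"] += 1
                else:
                    stats[column]["values"].add(value)
    for column, s in stats.items():
        values = s["values"]
        print(f"{column}: rows={rows} nulls={s['nulls']} distinct={len(values)} "
              f"min={min(values, default='')} max={max(values, default='')}")

if __name__ == "__main__":
    profile("customers.csv")  # hypothetical extract of the candidate source
```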

Data modeling
Almost certainly, central to the organization's data integration strategy is the creation of a canonical data model, or common central data model, providing the common format to which all the application interfaces will be transformed. This hub-and-spoke technique can change the data integration problem from an exponential function of the number of applications in the organization to a linear function, potentially making the number of interfaces a manageable problem. The interfaces and messages and views necessary for implementation of the data integration solutions all have to be modeled. Most data-modeling tools that are used for relational data modeling may be used for the development of a canonical model, or the development tools for the integration solutions may also provide this capability.
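As a small, concrete picture of what a canonical customer model might look like, the sketch below defines one shared structure and a mapping function per source system, so that each application is transformed to the common format once rather than once per counterpart. The class and the two source formats are purely illustrative assumptions.

```python
# A small sketch of a canonical data model: each source system maps its own
# field names to the shared canonical form exactly once.
from dataclasses import dataclass

@dataclass
class CanonicalCustomer:
    customer_id: str
    name: str
    country: str

def from_crm(record):       # CRM system's native field names (assumed)
    return CanonicalCustomer(record["id"], record["full_name"], record["country_code"])

def from_billing(record):   # billing system's native field names (assumed)
    return CanonicalCustomer(record["acct"], record["customer_name"], record["cntry"])

if __name__ == "__main__":
    print(from_crm({"id": "42", "full_name": "Ada Lovelace", "country_code": "GB"}))
    print(from_billing({"acct": "42", "customer_name": "Ada Lovelace", "cntry": "GB"}))
```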

Data flow modeling
The flow of data between the applications and organizations needs to be designed and documented for all data integration solutions in batch and real time, for both structured and unstructured data. This capability may be available within the data integration engines, but it is also necessary to have an integrated view of the data flow across all the integration engines, interfaces, and applications.

Metadata repository
Metadata is the core of data integration solutions. It describes the structure of the sources, targets, and intermediate points, as well as the business meaning, the technical structure and transformation, and the origin of the data and how it was transformed. Every tool and engine described above has its own metadata repository. Some, but not all, of these repositories can share metadata between them. One last technology investment that should be considered in a data integration strategy is a central metadata repository that pulls all the metadata together. This can be a significant investment cost, for both the purchase and ongoing operation of an enterprise metadata repository. However, it has been demonstrated that such a repository provides an excellent way to govern data across the organization, provide both a business and technical view of the organization's data assets, and show an audit trail of data lineage suitable for regulatory reporting requirements.

The end
Data integration as an area of data management technology has tended to be somewhat neglected. This is understandable inasmuch as data integration solutions operate between areas of responsibility (that is, between applications), and so it has been difficult to identify to whom responsibility for the interfaces between applications and organizations should belong.
Without central planning for data movement and data integration in an organization, and without a data integration strategy, the organization will quickly be faced with an overwhelming and unmanageable number of interfaces. It is critically important for every organization to implement some central planning for data integration, using data hubs and a canonical data model, in order to be able to reasonably manage the portfolio of data interfaces between applications and other organizations.

References
Aiken, P., & Allen, D. M. (2004). XML in data management. Burlington, MA: Morgan Kaufmann.
DAMA International (2009). The DAMA guide to the data management body of knowledge. Bradley Beach, NJ: Technics Publications.
Ferguson, M. (2012). Maximizing the business value of data virtualization. Enterprise Data World. Atlanta, GA: Dataversity.
Giordano, A. D. (2011). Data integration blueprint and modeling. Upper Saddle River, NJ: IBM Press/Pearson plc.
Inmon, W. (1996). Building the data warehouse (2nd ed.). New York: John Wiley & Sons.
Inmon, W., Imhoff, C., & Battas, G. (1996). Building the operational data store. New York: John Wiley & Sons.
Inmon, W., Imhoff, C., & Sousa, R. (1998). The corporate information factory. New York: John Wiley & Sons.
Inmon, W., & Krishnan, K. (2011). Building the unstructured data warehouse. Bradley Beach, NJ: Technics Publications, LLC.
Kimball, R., & Ross, M. (2002). The data warehouse toolkit (2nd ed.). New York: John Wiley & Sons.
Linthicum, D. S. (1999). Enterprise application integration. Addison-Wesley Professional.
Linthicum, D. S. (2003). Next generation application integration. Addison-Wesley Professional.
Linthicum, D. S. (2009). Cloud computing and SOA convergence in your enterprise. Addison-Wesley Professional.
Loshin, D. (2009). Master data management. Burlington, MA: Morgan Kaufmann.
Tannenbaum, A. (1994). Implementing a corporate repository. New York: John Wiley & Sons.
Tannenbaum, A. (2001). Metadata solutions. Addison-Wesley Professional.
van der Lans, R. F. (2012). Data virtualization for business intelligence systems. Waltham, MA: Morgan Kaufmann.
