Pentaho® Kettle Solutions

Pentaho® Kettle Solutions: Building Open Source ETL Solutions with Pentaho Data Integration

Matt Casters
Roland Bouman
Jos van Dongen

Pentaho® Kettle Solutions: Building Open Source ETL Solutions with Pentaho Data Integration

Published by Wiley Publishing, Inc.
10475 Crosspoint Boulevard
Indianapolis, IN 46256
www.wiley.com

Copyright © 2010 by Wiley Publishing, Inc., Indianapolis, Indiana
Published simultaneously in Canada

ISBN: 978-0-470-63517-9
ISBN: 9780470942420 (ebk)
ISBN: 9780470947524 (ebk)
ISBN: 9780470947531 (ebk)

Manufactured in the United States of America

10 9 8 7 6 5 4 3 2 1

No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 646-8600. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.

Limit of Liability/Disclaimer of Warranty: The publisher and the author make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation warranties of fitness for a particular purpose. No warranty may be created or extended by sales or promotional materials. The advice and strategies contained herein may not be suitable for every situation. This work is sold with the understanding that the publisher is not engaged in rendering legal, accounting, or other professional services. If professional assistance is required, the services of a competent professional person should be sought. Neither the publisher nor the author shall be liable for damages arising herefrom. The fact that an organization or Web site is referred to in this work as a citation and/or a potential source of further information does not mean that the author or the publisher endorses the information the organization or Web site may provide or recommendations it may make. Further, readers should be aware that Internet Web sites listed in this work may have changed or disappeared between when this work was written and when it is read.

For general information on our other products and services please contact our Customer Care Department within the United States at (877) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.

Library of Congress Control Number: 2010932421

Trademarks: Wiley and the Wiley logo are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or its affiliates, in the United States and other countries, and may not be used without written permission. Pentaho is a registered trademark of Pentaho, Inc. All other trademarks are the property of their respective owners. Wiley Publishing, Inc. is not associated with any product or vendor mentioned in this book.

For my wife and kids, Kathleen, Sam and Hannelore. Your love and joy keeps me sane in crazy times.

—Matt

For my wife, Annemarie, and my children, David, Roos, Anne and Maarten. Thanks for bearing with me—I love you!

—Roland

For my children Thomas and Lisa, and for Yvonne, to whom I owe more than words can express.

—Jos

About the Authors

Matt Casters has been an independent consultant for many years and has implemented numerous data warehouses and BI solutions for large companies. For the last 8 years, Matt kept himself busy with the development of an ETL tool called Kettle. This tool was open sourced in December 2005 and acquired by Pentaho early in 2006. Since then, Matt took up the position of Chief Data Integration at Pentaho. His responsibility is to continue to be lead developer for Kettle. Matt tries to help the Kettle community in any way possible; he answers questions on the forum and speaks occasionally at conferences all around the world. He has a blog at http://www.ibridge.be and you can follow his @mattcasters account on Twitter.

Roland Bouman has been working in the IT industry since 1998 and is currently working as a web and business intelligence developer. Over the years he has focused on open source software, in particular database technology, business intelligence, and web development frameworks. He's an active member of the MySQL and Pentaho communities, and a regular speaker at international conferences, such as the MySQL User Conference, OSCON, and Pentaho community events. Roland co-authored the MySQL 5.1 Cluster Certification Guide and Pentaho Solutions, and was a technical reviewer for a number of MySQL and Pentaho related book titles. He maintains a technical blog at http://rpbouman.blogspot.com and tweets as @rolandbouman on Twitter.

Jos van Dongen is a seasoned business intelligence professional and well-known author and presenter. He has been involved in software development, business intelligence, and data warehousing since 1991. Before starting his own consulting practice, Tholis Consulting, in 1998, he worked for a top tier systems integrator and a leading management consulting firm. Over the past years, he has successfully implemented BI and data warehouse solutions for a variety of organizations, both commercial and non-profit. Jos covers new BI developments for the Dutch Database Magazine and speaks regularly at national and international conferences. He authored one book on open source BI and is co-author of the book Pentaho Solutions. You can find more information about Jos on http://www.tholis.com or follow @josvandongen on Twitter.

Credits

Executive Editor: Robert Elliott
Project Editor: Sara Shlaer
Technical Editors: Jens Bleuel, Sven Boden, Kasper de Graaf, Daniel Einspanjer, Nick Goodman, Mark Hall, Samatar Hassan, Benjamin Kallmann, Bryan Senseman, Johannes van den Bosch
Production Editor: Daniel Scribner
Copy Editor: Nancy Rapoport
Editorial Director: Robyn B. Siesky
Editorial Manager: Mary Beth Wakefield
Marketing Manager: Ashley Zurcher
Production Manager: Tim Tate
Vice President and Executive Group Publisher: Richard Swadley
Vice President and Executive Publisher: Barry Pruett
Associate Publisher: Jim Minatel
Project Coordinator, Cover: Lynsey Stanford
Compositor: Maureen Forys, Happenstance Type-O-Rama
Proofreader: Nancy Bell
Indexer: Robert Swanson
Cover Designer: Ryan Sneed

Acknowledgments

This book is the result of the efforts of many individuals. By convention, authors receive explicit credit, and get to have their names printed on the book cover. But creating this book would not have been possible without a lot of hard work behind the scenes. We, the authors, would like to express our gratitude to a number of people who provided substantial contributions, and thus helped define and shape the final result that is Pentaho Kettle Solutions. First, we'd like to thank those individuals that contributed directly to the material that appears in the book:

- Ingo Klose suggested an elegant solution to generate keys starting from a given offset within a single transformation (this solution is discussed in Chapter 8, "Handling Dimension Tables," subsection "Generating Surrogate Keys Based on a Counter," shown in Figure 8-2).
- Samatar Hassan provided text as well as working example transformations to demonstrate Kettle's RSS capabilities. Samatar's contribution is included almost completely and appears in the RSS section of Chapter 21, "Web Services."
- Thanks to Mike Hillyer and the MySQL documentation team for creating and maintaining the Sakila sample database, which is introduced in Chapter 4 and appears in many examples throughout this book.
- Although only three authors appear on the cover, there was actually a fourth one: We cannot thank Kasper de Graaf of DIKW-Academy enough for writing the Data Vault chapter, which has benefited greatly from his deep expertise on this subject. Special thanks also to Johannes van den Bosch who did a great job reviewing Kasper's work and gave another boost to the overall quality and clarity of the chapter.
- Thanks to Bernd Aschauer and Robert Wintner, both from Aschauer EDV (http://www.aschauer-edv.at/en), for providing the examples and screenshots used in the section dedicated to SAP of Chapter 6, "Data Extraction."
- Daniel Einspanjer of the Mozilla Foundation provided sample transformations for Chapter 7, "Cleansing and Conforming."


Thanks for your contributions. This book benefited substantially from your efforts.

Much gratitude goes out to all of our technical reviewers. Providing a good technical review is hard and time-consuming, and we have been very lucky to find a collection of such talented and seasoned Pentaho and Kettle experts willing to find some time in their busy schedules to provide us with the kind of quality review required to write a book of this size and scope.

We'd like to thank the Kettle and Pentaho communities. During and before the writing of this book, individuals from these communities provided valuable suggestions and ideas to all three authors for topics to cover in a book that focuses on ETL, data integration, and Kettle. We hope this book will be useful and practical for everybody who is using or planning to use Kettle. Whether we succeeded is up to the reader, but if we did, we have to thank individuals in the Kettle and Pentaho communities for helping us achieve it.

We owe many thanks to all contributors and developers of the Kettle software project. The authors are all enthusiastic users of Kettle: we love it, because it solves our daily data integration problems in a straightforward and efficient manner without getting in the way. Kettle is a joy to work with, and this is what provided much of the drive to write this book.

Finally, we'd like to thank our publisher, Wiley, for giving us the opportunity to write this book, and for the excellent support and management from their end. In particular, we'd like to thank our Project Editor, Sara Shlaer. Despite the often delayed deliveries from our end, Sara always kept her cool and somehow managed to make deadlines work out. Her advice, patience, encouragement, care, and sense of humor made all the difference and form an important contribution to this book. In addition, we'd like to thank our Executive Editor, Robert Elliott. We appreciate the trust he put into our small team of authors to do our job, and his efforts to realize Pentaho Kettle Solutions.

—The authors

Writing a technical book like the one you are reading right now is very hard to do all by yourself. Because of the extremely busy agenda caused by the release process of Kettle 4, I probably should never have agreed to co-author. It's only thanks to the dedication and professionalism of Jos and Roland that we managed to write this book at all. I thank both friends very much for their invitation to co-author. Even though writing a book is a hard and painful process, working with Jos and Roland made it all worthwhile.

When Kettle was not yet released as open source code it often received a lukewarm reaction. The reason was that nobody was really waiting for yet another closed source ETL tool. Kettle came from that position to being the most widely deployed open source ETL tool in the world. This happened only thanks to the thousands of volunteers who offered to help out with various tasks. Ever since Kettle was open sourced it became a project with an ever-growing community. It's impossible to thank this community enough. Without the help of the developers, the translators, the testers, the bug reporters, the folks who participate in the forums, the people with the great ideas, and even the folks who like to complain, Kettle would not be where it is today. I would like to especially thank one important member of our community: Pentaho. Pentaho CEO Richard Daley and his team have done an excellent job in supporting the Kettle project ever since they got involved with it. Without their support it would not have been possible for Kettle to be on the accelerated growth path that it is on today. It's been a pleasure and a privilege to work with the Pentaho crew.

A few select members of our community also picked up the tough job of reviewing the often technical content of this book. The reviewers of my chapters, Nicholas Goodman, Daniel Einspanjer, Bryan Senseman, Jens Bleuel, Samatar Hassan, and Mark Hall had the added disadvantage that this was the first time that I was going through the process of writing a book. It must not have been pretty at times. All the same they spent a lot of time coming up with insightful additions, spot-on advice, and to-the-point comments. I do enormously appreciate the vast amount of time and effort that they put into the reviewing. The book wouldn't have been the same without you guys!

—Matt Casters

I'd like to thank both my co-authors, Jos and Matt. It's an honor to be working with such knowledgeable and skilled professionals, and I hope we will collaborate again in the future. I feel our different backgrounds and expertise have truly complemented each other and helped us all to cover the many different subjects covered in this book. I'd also like to thank the reviewers of my chapters: Benjamin Kallmann, Bryan Senseman, Daniel Einspanjer, Sven Boden, and Samatar Hassan. Your comments and suggestions made all the difference and I thank you for your frank and constructive criticism. Finally, I'd like to thank the readers of my blog at http://rpbouman.blogspot.com/. I got a lot of inspiration from the comments posted there, and I got a lot of good feedback in response to the blog posts announcing the writing of Pentaho Kettle Solutions.

—Roland Bouman

Back in October 2009, when Pentaho Solutions had only been on the shelves for two months and Roland and I agreed never to write another book, Bob Elliott approached us asking us to do just that. Yes, we had been discussing some ideas and already concluded that if there were to be another book, it would have to be about Kettle. And this was exactly what Bob asked us to do: write a book about data integration using Kettle. We quickly found out that Matt Casters was not only interested in reviewing, but in actually becoming a full author as well, an offer we gladly accepted. Looking back, I can hardly believe that we pulled it off, considering everything else that was going on in our lives. So many thanks to Roland and Matt for bearing with me, and thank you Bob and especially Sara for your relentless efforts of keeping us on track. A special thank you is also warranted for Ralph Kimball, whose ideas you'll find throughout this book. Ralph gave us permission to use the Kimball Group's 34 ETL subsystems as the framework for much of the material presented in this book. Ralph also took the time to review Chapter 5, and thanks to his long list of excellent comments the chapter became a perfect foundation for Parts II, III, and IV of the book. Finally I'd like to thank Daniel Einspanjer, Bryan Senseman, Jens Bleuel, Sven Boden, Samatar Hassan, and Benjamin Kallmann for being an absolute pain in the neck and thus doing a great job as technical reviewers for my chapters. Your comments, questions and suggestions definitely gave a big boost to the overall quality of this book.

—Jos van Dongen

Contents at a Glance

Introduction xxxi

Part I Getting Started 1
Chapter 1 ETL Primer 3
Chapter 2 Kettle Concepts 23
Chapter 3 Installation and Configuration 53
Chapter 4 An Example ETL Solution—Sakila 73

Part II ETL 111
Chapter 5 ETL Subsystems 113
Chapter 6 Data Extraction 127
Chapter 7 Cleansing and Conforming 167
Chapter 8 Handling Dimension Tables 207
Chapter 9 Loading Fact Tables 245
Chapter 10 Working with OLAP Data 269

Part III Management and Deployment 293
Chapter 11 ETL Development Lifecycle 295
Chapter 12 Scheduling and Monitoring 321


Chapter 13 Versioning and Migration 341
Chapter 14 Lineage and Auditing 357

Part IV Performance and Scalability 375
Chapter 15 Performance Tuning 377
Chapter 16 Parallelization, Clustering, and Partitioning 403
Chapter 17 Dynamic Clustering in the Cloud 433
Chapter 18 Real-Time Data Integration 449

Part V Advanced Topics 463
Chapter 19 Data Vault Management 465
Chapter 20 Handling Complex Data Formats 497
Chapter 21 Web Services 515
Chapter 22 Kettle Integration 569
Chapter 23 Extending Kettle 593

Appendix A The Kettle Ecosystem 629
Appendix B Kettle Enterprise Edition Features 635
Appendix C Built-in Variables and Properties Reference 637

Index 643

Contents

Introduction xxxi Part I Getting Started 1 Chapter 1 ETL Primer 3 OLTP versus Data Warehousing 3 What Is ETL? 5 The Evolution of ETL Solutions 5 ETL Building Blocks 7 ETL, ELT, and EII 8 ELT 9 EII: Virtual Data Integration 10 Data Integration Challenges 11 Methodology: Agile BI 12 ETL Design 14 Data Acquisition 14 Beware of Spreadsheets 15 Design for Failure 15 Change Data Capture 16 Data Quality 16 Data Profiling 16 Data Validation 17 ETL Tool Requirements 17 Connectivity 17 Platform Independence 18 Scalability 18 Design Flexibility 19 Reuse 19 Extensibility 19


Data Transformations 20 Testing and Debugging 21 Lineage and Impact Analysis 21 Logging and Auditing 22 Summary 22 Chapter 2 Kettle Concepts 23 Design Principles 23 The Building Blocks of Kettle Design 25 Transformations 25 Steps 26 Transformation Hops 26 Parallelism 27 Rows of Data 27 Data Conversion 29 Jobs 30 Job Entries 31 Job Hops 31 Multiple Paths and Backtracking 32 Parallel Execution 33 Job Entry Results 34 Transformation or Job 36 Database Connections 37 Special Options 38 The Power of the Relational Database 39 Connections and Transactions 39 Database Clustering 40 Tools and Utilities 41 Repositories 41 Virtual File Systems 42 Parameters and Variables 43 Defining Variables 43 Named Parameters 44 Using Variables 44 Visual Programming 45 Getting Started 46 Creating New Steps 47 Putting It All Together 49 Summary 51 Chapter 3 Installation and Configuration 53 Kettle Software Overview 53 Integrated Development Environment: Spoon 55 Command-Line Launchers: Kitchen and Pan 57 Job Server: Carte 57 Encr.bat and encr.sh 58 Installation 58 Contents xvii

Java Environment 58 Installing Java Manually 58 Using Your Package Management System 59 Installing Kettle 59 Versions and Releases 59 Archive Names and Formats 60 Downloading and Uncompressing 60 Running Kettle Programs 61 Creating a Shortcut Icon or Launcher for Spoon 62 Configuration 63 Configuration Files and the .kettle Directory 63 The Kettle Shell Scripts 69 General Structure of the Startup Scripts 70 Adding an Entry to the Classpath 70 Changing the Maximum Heap Size 71 Managing JDBC Drivers 72 Summary 72 Chapter 4 An Example ETL Solution—Sakila 73 Sakila 73 The Sakila Sample Database 74 DVD Rental Business Process 74 Sakila Database Schema Diagram 75 Sakila Database Subject Areas 75 General Design Considerations 77 Installing the Sakila Sample Database 77 The Rental Star Schema 78 Rental Star Schema Diagram 78 Rental Fact Table 79 Dimension Tables 79 Keys and Change Data Capture 80 Installing the Rental Star Schema 81 Prerequisites and Some Basic Spoon Skills 81 Setting Up the ETL Solution 82 Creating Database Accounts 82 Working with Spoon 82 Opening Transformation and Job Files 82 Opening the Step's Configuration Dialog 83 Examining Streams 83 Running Jobs and Transformations 83 The Sample ETL Solution 84 Static, Generated Dimensions 84 Loading the dim_date Dimension Table 84 Loading the dim_time Dimension Table 86 Recurring Load 87 The load_rentals Job 88

The load_dim_staff Transformation 91 Database Connections 91 The load_dim_customer Transformation 95 The load_dim_store Transformation 98 The fetch_address Subtransformation 99 The load_dim_actor Transformation 101 The load_dim_film Transformation 102 The load_fact_rental Transformation 107 Summary 109 Part II ETL 111 Chapter 5 ETL Subsystems 113 Introduction to the 34 Subsystems 114 Extraction 114 Subsystems 1–3: Data Profiling, Change Data Capture, and Extraction 115 Cleaning and Conforming Data 116 Subsystem 4: Data Cleaning and Quality Screen Handler System 116 Subsystem 5: Error Event Handler 117 Subsystem 6: Audit Dimension Assembler 117 Subsystem 7: Deduplication System 117 Subsystem 8: Data Conformer 118 Data Delivery 118 Subsystem 9: Slowly Changing Dimension Processor 118 Subsystem 10: Surrogate Key Creation System 119 Subsystem 11: Hierarchy Dimension Builder 119 Subsystem 12: Special Dimension Builder 120 Subsystem 13: Fact Table Loader 121 Subsystem 14: Surrogate Key Pipeline 121 Subsystem 15: Multi-Valued Dimension Bridge Table Builder 121 Subsystem 16: Late-Arriving Data Handler 122 Subsystem 17: Dimension Manager System 122 Subsystem 18: Fact Table Provider System 122 Subsystem 19: Aggregate Builder 123 Subsystem 20: Multidimensional (OLAP) Cube Builder 123 Subsystem 21: Data Integration Manager 123 Managing the ETL Environment 123 Summary 126 Chapter 6 Data Extraction 127 Kettle Data Extraction Overview 128 File-Based Extraction 128 Working with Text Files 128 Working with XML files 133 Special File Types 134 Contents xix

Database-Based Extraction 134 Web-Based Extraction 137 Text-Based Web Extraction 137 HTTP Client 137 Using SOAP 138 Stream-Based and Real-Time Extraction 138 Working with ERP and CRM Systems 138 ERP Challenges 139 Kettle ERP Plugins 140 Working with SAP Data 140 ERP and CDC Issues 146 Data Profiling 146 Using eobjects.org DataCleaner 147 Adding Profile Tasks 149 Adding Database Connections 149 Doing an Initial Profile 151 Working with Regular Expressions 151 Profiling and Exploring Results 152 Validating and Comparing Data 153 Using a Dictionary for Column Dependency Checks 153 Alternative Solutions 154 Text Profiling with Kettle 154 CDC: Change Data Capture 154 Source Data–Based CDC 155 Trigger-Based CDC 157 Snapshot-Based CDC 158 Log-Based CDC 162 Which CDC Alternative Should You Choose? 163 Delivering Data 164 Summary 164 Chapter 7 Cleansing and Conforming 167 Data Cleansing 168 Data-Cleansing Steps 169 Using Reference Tables 172 Conforming Data Using Lookup Tables 172 Conforming Data Using Reference Tables 175 Data Validation 179 Applying Validation Rules 180 Validating Dependency Constraints 183 Error Handling 183 Handling Process Errors 184 Transformation Errors 186 Handling Data (Validation) Errors 187 Auditing Data and Process Quality 191 Deduplicating Data 192 xx Contents

Handling Exact Duplicates 193 The Problem of Non-Exact Duplicates 194 Building Deduplication Transforms 195 Step 1: Fuzzy Match 197 Step 2: Select Suspects 198 Step 3: Lookup Validation Value 198 Step 4: Filter Duplicates 199 Scripting 200 Formula 201 JavaScript 202 User-Defined Java Expressions 202 Regular Expressions 203 Summary 205 Chapter 8 Handling Dimension Tables 207 Managing Keys 208 Managing Business Keys 209 Keys in the Source System 209 Keys in the Data Warehouse 209 Business Keys 209 Storing Business Keys 210 Looking Up Keys with Kettle 210 Generating Surrogate Keys 210 The “Add sequence” Step 211 Working with auto_increment or IDENTITY Columns 217 Keys for Slowly Changing Dimensions 217 Loading Dimension Tables 218 Snowflaked Dimension Tables 218 Top-Down Level-Wise Loading 219 Sakila Snowflake Example 219 Sample Transformation 221 Database Lookup Configuration 222 Sample Job 225 Star Schema Dimension Tables 226 Denormalization 226 Denormalizing to 1NF with the “Database lookup” Step 226 Change Data Capture 227 Slowly Changing Dimensions 228 Types of Slowly Changing Dimensions 228 Type 1 Slowly Changing Dimensions 229 The Insert / Update Step 229 Type 2 Slowly Changing Dimensions 232 The “Dimension lookup / update” Step 232 Other Types of Slowly Changing Dimensions 237 Type 3 Slowly Changing Dimensions 237 Hybrid Slowly Changing Dimensions 238 Contents xxi

More Dimensions 239 Generated Dimensions 239 Date and Time Dimensions 239 Generated Mini-Dimensions 239 Junk Dimensions 241 Recursive Hierarchies 242 Summary 243 Chapter 9 Loading Fact Tables 245 Loading in Bulk 246 STDIN and FIFO 247 Kettle Bulk Loaders 248 MySQL Bulk Loading 249 LucidDB Bulk Loader 249 Oracle Bulk Loader 249 PostgreSQL Bulk Loader 250 Table Output Step 250 General Bulk Load Considerations 250 Dimension Lookups 251 Maintaining Referential Integrity 251 The Surrogate Key Pipeline 252 Using In-Memory Lookups 253 Stream Lookups 253 Late-Arriving Data 255 Late-Arriving Facts 256 Late-Arriving Dimensions 256 Fact Table Handling 260 Periodic and Accumulating Snapshots 260 Introducing State-Oriented Fact Tables 261 Loading Periodic Snapshots 263 Loading Accumulating Snapshots 264 Loading State-Oriented Fact Tables 265 Loading Aggregate Tables 266 Summary 267 Chapter 10 Working with OLAP Data 269 OLAP Benefits and Challenges 270 OLAP Storage Types 272 Positioning OLAP 272 Kettle OLAP Options 273 Working with Mondrian 274 Working with XML/A Servers 277 Working with Palo 282 Setting Up the Palo Connection 283 Palo Architecture 284 Reading Palo Data 285 Writing Palo Data 289 Summary 291 xxii Contents

Part III Management and Deployment 293 Chapter 11 ETL Development Lifecycle 295 Solution Design 295 Best and Bad Practices 296 Data Mapping 297 Naming and Commentary Conventions 298 Common Pitfalls 299 ETL Flow Design 300 Reusability and Maintainability 300 Agile Development 301 Testing and Debugging 306 Test Activities 307 ETL Testing 308 Test Data Requirements 308 Testing for Completeness 309 Testing Data Transformations 311 Test Automation and Continuous Integration 311 Upgrade Tests 312 Debugging 312 Documenting the Solution 315 Why Isn’t There Any Documentation? 316 Myth 1: My Software Is Self-Explanatory 316 Myth 2: Documentation Is Always Outdated 316 Myth 3: Who Reads Documentation Anyway? 317 Kettle Documentation Features 317 Generating Documentation 319 Summary 320 Chapter 12 Scheduling and Monitoring 321 Scheduling 321 Operating System–Level Scheduling 322 Executing Kettle Jobs and Transformations from the Command Line 322 UNIX-Based Systems: cron 326 Windows: The at utility and the Task Scheduler 327 Using Pentaho’s Built-in Scheduler 327 Creating an Action Sequence to Run Kettle Jobs and Transformations 328 Kettle Transformations in Action Sequences 329 Creating and Maintaining Schedules with the Administration Console 330 Attaching an Action Sequence to a Schedule 333 Monitoring 333 Logging 333 Inspecting the Log 333 Contents xxiii

Logging Levels 335 Writing Custom Messages to the Log 336 E-mail Notifications 336 Configuring the Mail Job Entry 337 Summary 340 Chapter 13 Versioning and Migration 341 Version Control Systems 341 File-Based Version Control Systems 342 Organization 342 Leading File-Based VCSs 343 Content Management Systems 344 Kettle Metadata 344 Kettle XML Metadata 345 Transformation XML 345 Job XML 346 Global Replace 347 Kettle Repository Metadata 348 The Kettle Database Repository Type 348 The Kettle File Repository Type 349 The Kettle Enterprise Repository Type 350 Managing Repositories 350 Exporting and Importing Repositories 350 Upgrading Your Repository 351 Version Migration System 352 Managing XML Files 352 Managing Repositories 352 Parameterizing Your Solution 353 Summary 356 Chapter 14 Lineage and Auditing 357 Batch-Level Lineage Extraction 358 Lineage 359 Lineage Information 359 Impact Analysis Information 361 Logging and Operational Metadata 363 Logging Basics 363 Logging Architecture 364 Setting a Maximum Buffer Size 365 Setting a Maximum Log Line Age 365 Log Channels 366 Log Text Capturing in a Job 366 Logging Tables 367 Transformation Logging Tables 367 Job Logging Tables 373 Summary 374 xxiv Contents

Part IV Performance and Scalability 375 Chapter 15 Performance Tuning 377 Transformation Performance: Finding the Weakest Link 377 Finding Bottlenecks by Simplifying 379 Finding Bottlenecks by Measuring 380 Copying Rows of Data 382 Improving Transformation Performance 384 Improving Performance in Reading Text Files 384 Using Lazy Conversion for Reading Text Files 385 Single-File Parallel Reading 385 Multi-File Parallel Reading 386 Configuring the NIO Block Size 386 Changing Disks and Reading Text Files 386 Improving Performance in Writing Text Files 387 Using Lazy Conversion for Writing Text Files 387 Parallel Files Writing 387 Changing Disks and Writing Text Files 387 Improving Database Performance 388 Avoiding Dynamic SQL 388 Handling Roundtrips 388 Handling Relational 390 Sorting Data 392 Sorting on the Database 393 Sorting in Parallel 393 Reducing CPU Usage 394 Optimizing the Use of JavaScript 394 Launching Multiple Copies of a Step 396 Selecting and Removing Values 397 Managing Thread Priorities 397 Adding Static Data to Rows of Data 397 Limiting the Number of Step Copies 398 Avoiding Excessive Logging 398 Improving Job Performance 399 Loops in Jobs 399 Database Connection Pools 400 Summary 401 Chapter 16 Parallelization, Clustering, and Partitioning 403 Multi-Threading 403 Distribution 404 Row Merging 405 Row Redistribution 406 Data Pipelining 407 Consequences of Multi-Threading 408 Database Connections 408 Contents xxv

Order of Execution 409 Parallel Execution in a Job 411 Using Carte as a Slave Server 411 The Configuration File 411 Defining Slave Servers 412 Remote Execution 413 Monitoring Slave Servers 413 Carte Security 414 Services 414 Clustering Transformations 417 Defining a Cluster Schema 417 Designing Clustered Transformations 418 Execution and Monitoring 420 Metadata Transformations 421 Rules 422 Data Pipelining 425 Partitioning 425 Defining a Partitioning Schema 425 Objectives of Partitioning 427 Implementing Partitioning 428 Internal Variables 428 Database Partitions 429 Partitioning in a Clustered Transformation 430 Summary 430 Chapter 17 Dynamic Clustering in the Cloud 433 Dynamic Clustering 433 Setting Up a Dynamic Cluster 434 Using the Dynamic Cluster 436 Cloud Computing 437 EC2 438 Getting Started with EC2 438 Costs 438 Customizing an AMI 439 Packaging a New AMI 442 Terminating an AMI 442 Running a Master 442 Running the Slaves 443 Using the EC2 Cluster 444 Monitoring 445 The Lightweight Principle and Persistence Options 446 Summary 447 Chapter 18 Real-Time Data Integration 449 Introduction to Real-Time ETL 449 Real-Time Challenges 450 Requirements 451 xxvi Contents

Transformation Streaming 452 A Practical Example of Transformation Streaming 454 Debugging 457 Third-Party Software and Real-Time Integration 458 Java Message Service 459 Creating a JMS Connection and Session 459 Consuming Messages 460 Producing Messages 460 Closing Shop 460 Summary 461 Part V Advanced Topics 463 Chapter 19 Data Vault Management 465 Introduction to Data Vault 466 Do You Need a Data Vault? 466 Data Vault Building Blocks 467 Hubs 467 Links 468 Satellites 469 Data Vault Characteristics 471 Building a Data Vault 471 Transforming Sakila to the Data Vault Model 472 Sakila Hubs 472 Sakila Links 473 Sakila Satellites 474 Loading the Data Vault: A Sample ETL Solution 477 Installing the Sakila Data Vault 477 Setting Up the ETL Solution 477 Creating a Database Account 477 The Sample ETL Data Vault Solution 478 Sample Hub: hub_actor 478 Sample Link: link_customer_store 480 Sample Satellite: sat_actor 483 Loading the Data Vault Tables 485 Updating a Star Schema from a Data Vault 486 The Sample ETL Solution 486 The dim_actor Transformation 486 The dim_customer Transformation 488 The dim_film Transformation 492 The dim_film_actor_bridge Transformation 492 The fact_rental Transformation 493 Loading the Star Schema Tables 495 Summary 495

Chapter 20 Handling Complex Data Formats 497 Non-Relational and Non-Tabular Data Formats 498 Non-Relational Tabular Formats 498 Handling Multi-Valued Attributes 498 Using the Split Field to Rows Step 499 Handling Repeating Groups 500 Using the Row Normaliser Step 500 Semi- and Unstructured Data 501 Kettle Regular Expression Example 503 Configuring the Regex Evaluation Step 504 Verifying the Match 507 Key/Value Pairs 508 Kettle Key/Value Pairs Example 509 Text File Input 509 Regex Evaluation 510 Grouping Lines into Records 511 Denormaliser: Turning Rows into Columns 512 Summary 513 Chapter 21 Web Services 515 Web Pages and Web Services 515 Kettle Web Features 516 General HTTP Steps 516 Simple Object Access Protocol 517 Really Simple Syndication 517 Apache Virtual File System Integration 517 Data Formats 517 XML 518 Kettle Steps for Working with XML 518 Kettle Job Entries for XML 519 HTML 520 JavaScript Object Notation 520 Syntax 521 JSON, Kettle, and ETL/DI 522 XML Examples 523 Example XML Document 523 XML Document Structure 523 Mapping to the Sakila Sample Database 524 Extracting Data from XML 525 Overall Design: The import_xml_into_db Transformation 526 Using the XSD Validator Step 528 Using the “Get Data from XML” Step 530 Generating XML Documents 537 Overall Design: The export_xml_from_db Transformation 537 Generating XML with the Add XML Step 538 Using the XML Join Step 541 xxviii Contents

SOAP Examples 544 Using the “Web services lookup” Step 544 Configuring the “Web services lookup” Step 544 Accessing SOAP Services Directly 546 JSON Example 549 The Freebase Project 549 Freebase Versus Wikipedia 549 Freebase Web Services 550 The Freebase Read Service 550 The Metaweb Query Language 551 Extracting Freebase Data with Kettle 553 Generate Rows 554 Issuing a Freebase Read Request 555 Processing the Freebase Result Envelope 556 Filtering Out the Original Row 557 Storing to File 558 RSS 558 RSS Structure 558 Channel 558 Item 559 RSS Support in Kettle 560 RSS Input 561 RSS Output 562 Summary 567 Chapter 22 Kettle Integration 569 The Kettle API 569 The LGPL License 569 The Kettle Java API 570 Source Code 570 Building Kettle 571 Building javadoc 571 Libraries and the Class Path 571 Executing Existing Transformations and Jobs 571 Executing a Transformation 572 Executing a Job 573 Embedding Kettle 574 Pentaho Reporting 574 Putting Data into a Transformation 576 Dynamic Transformations 580 Dynamic Template 583 Dynamic Jobs 584 Executing Dynamic ETL in Kettle 586 Result 587 Replacing Metadata 588 Direct Changes with the API 589 Using a Shared Objects File 589 Contents xxix

OEM Versions and Forks 590 Creating an OEM Version of PDI 590 Forking Kettle 591 Summary 592 Chapter 23 Extending Kettle 593 Plugin Architecture Overview 593 Plugin Types 594 Architecture 595 Prerequisites 596 Kettle API Documentation 596 Libraries 596 Integrated Development Environment 596 Eclipse Project Setup 597 Examples 598 Transformation Step Plugins 599 StepMetaInterface 599 Value Metadata 605 Row Metadata 606 StepDataInterface 607 StepDialogInterface 607 Eclipse SWT 607 Form Layout 607 Kettle UI Elements 609 Hello World Example Dialog 609 StepInterface 614 Reading Rows from Specific Steps 616 Writing Rows to Specific Steps 616 Writing Rows to Error Handling 617 Identifying a Step Copy 617 Result Feedback 618 Variable Substitution 618 Apache VFS 619 Step Plugin Deployment 619 The User-Defined Java Class Step 620 Passing Metadata 620 Accessing Input and Fields 620 Snippets 620 Example 620 Job Entry Plugins 621 JobEntryInterface 622 JobEntryDialogInterface 624 Partitioning Method Plugins 624 Partitioner 625 Repository Type Plugins 626 Database Type Plugins 627 Summary 628 xxx Contents

Appendix A The Kettle Ecosystem 629 Kettle Development and Versions 629 The Pentaho Community Wiki 631 Using the Forums 631 Jira 632 ##pentaho 633 Appendix B Kettle Enterprise Edition Features 635 Appendix C Built-in Variables and Properties Reference 637 Internal Variables 637 Kettle Variables 640 Variables for Configuring VFS 641 Noteworthy JRE Variables 642 Index 643

Introduction

More than 50 years ago the first computers for general use emerged, and we saw a gradually increasing adoption of their use by the scientific and business world. In those early days, most organizations had just one computer with a single display and printer attached to it, so the need for integrating data stored in different systems simply didn't exist. This changed when in the late 1970s the relational database made inroads into the corporate world. The 1980s saw a further proliferation of both computers and databases, all holding different bits and pieces of an organization's total collection of information. Ultimately, this led to the start of a whole new industry, which was sparked by IBM researchers Dr. Barry Devlin and Paul Murphy in their seminal paper "An architecture for a business and information system" (first published in 1988 in IBM Systems Journal, Volume 27, Number 1). The concept of a business data warehouse was introduced for the first time as being "the single logical storehouse of all the information used to report on the business." Less than five years later, Bill Inmon published his landmark book, Building the Data Warehouse, which further popularized the concepts and technologies needed to build this "logical storehouse."

One of the core themes in all data warehouse–related literature is the concept of integrating data. The term data integration refers to the process of combining data from different sources to provide a single comprehensible view on all of the combined data. A typical example of data integration would be combining the data from a warehouse inventory system with that of the order entry system to allow order fulfillment to be directly related to changes in the inventory. Another example of data integration is merging customer and contact data from separate departmental customer relationship management (CRM) systems into a corporate customer relationship management system.

NOTE Throughout this book, you'll find the terms "data integration" and "ETL" (short for extract, transform, and load) used interchangeably. Although technically not entirely correct (ETL is only one of the possible data integration scenarios, as you'll see in Chapter 1), most developers treat these terms as synonyms, a sin that we've adopted over the years as well.


In an ideal world, there wouldn't be a need for data integration. All the data needed for running a business and reporting on its past and future performance would be stored and managed in a single system, all master data would be 100 percent correct, and every piece of external data needed for analysis and decision making would be automatically linked to our own data. This system wouldn't have any problems with storing all available historical data, nor with offering split-second response times when querying and analyzing this data.

Unfortunately, we don't live in an ideal world. In the real world, most organizations use different systems for different purposes. They have systems for CRM (Customer Relationship Management), for accounting, for sales and sales support, for supporting a help desk, for managing inventory, for supporting a logistics process, and the list goes on and on. To make things worse, the same data is often stored and maintained independently, and probably inconsistently, in different systems. Customer and product data might be available in all the aforementioned systems, and when a customer calls to pass on a new telephone number or a change of address, chances are that this information is only updated in the CRM system, causing inconsistency of the customer information within the organization.

To cope with all these challenges and create a single, integrated, conformed, and trustworthy data store for reporting and analysis, data integration tools are needed. One of the more popular and powerful solutions available is Kettle, also known as Pentaho Data Integration, which is the topic of this book.

The Origins of Kettle

Kettle originated ten years ago, at the turn of the century. Back then, ETL tools could be found in all sorts of shapes and forms. At least 50 known tools competed in this software segment. Beneath that collection of software, there was an even larger set of ETL frameworks. In general, you could split up the tools into different types based on their respective origin and level of sophistication, as shown in Figure 1.

Figure 1: ETL tool generations (from quick hacks and frameworks to code generators and engines)

- Quick hacks: These tools typically were responsible for extraction of data or the load of text files. A lot of these solutions existed out there and still do. Words such as "hacker" and "hacking" have an undeservedly negative connotation. Business intelligence can get really complex and in most cases, the quick hacks make the difference between project disaster and success. As such, they pop up quite easily because someone simply has a job to do with limited time and money. Typically, these ETL quick hack solutions are created by consultancy firms and are meant to be one-time solutions.
- Frameworks: Usually when a business intelligence consultant does a few similar projects, the idea begins to emerge that code needs to be written in such a way that it can be re-used on other projects with a few minor adjustments. At one point in time it seemed like every self-respecting consultancy company had an ETL framework out there. The reason for this is that these frameworks offer a great way to build up knowledge regarding the ETL processes. Typically, it is easy to change parameters for extraction, loading, logging, change data capture, database connections, and such.
- Code generators: When a development interface is added as an extra level of abstraction for the frameworks, it is possible to generate code for a certain platform (C, Java, SQL, and so on) based on sets of metadata. These code generators come in different types, varying from one-shot generators that require you to maintain the code afterward to full-fledged ETL tools that can generate everything you need. These kinds of ETL tools were also written by consultancy companies left and right, but mostly by known, established vendors.
- Engines: In the continuing quest by ETL vendors to make life easier for their users, ETL engines were created so that no code had to be generated. With these engines, the entire ETL process can be executed based on parameterization and configuration, i.e., the description of the ETL process itself as described throughout this book. This by itself does away with any code generation, compilation, and deployment difficulties.

Based on totally non-scientific samples of projects that were executed back then, it's safe to say that over half of the projects used quick hacks or frameworks. Code generators in all sorts of shapes and forms accounted for most of the rest, with ETL engines only being used in exceptional cases, usually for very large projects.

NOTE Very few tools were available under an open source license ten years ago. The only known available tool was Enhydra Octopus, a Java-based code generator type of ETL tool (available at www.enhydra.org/tech/octopus/). To its credit and benefit to its users, it's still available as a shining example of the persistence of open source.

It's in this software landscape that Matt Casters, the author of Kettle and one of the authors of this book, was busy with consultancy, writing quick hacks and frameworks, and deploying all sorts of code generators.

Back in the early 2000s he was working as a business intelligence consultant, usually in the position of a data warehouse architect or administrator. In such a position, you have to take care of bridging the well-known gap between information and communication technology and business needs. Usually this sort of work was done without a big-vendor ETL tool because those things were prohibitively costly back then. As such, these tools were too expensive for most, if not all, small-to-medium–sized projects. In that situation, you don't have much of a choice: You face the problem time after time and you do the best you can with all sorts of frameworks and code generation. Poor examples of that sort of work include a program written in C and embedded SQL (ESQL/C) to extract data from Informix; an extraction tool written in Microsoft Visual Basic to get data from an IBM AS/400 Mainframe system; and a complete data warehouse consisting of 35 fact tables and 90 slowly changing dimensions for a large bank, written manually in Oracle PL/SQL and shell scripts.

Thus, it would be fair to say that Matt knew what he was up to when he started thinking about writing his own ETL tool. Nevertheless, the idea to write it goes back as far as 2001:

Matt: "I'm going to write a new piece of software to do ETL. It's going to take up some time left and right in the evenings and weekends."

Kathleen (Matt’s wife): “Oh, that’s great! How long is this going to take?”

Matt: “If all goes well, I should have a first somewhat working version in three years and a complete version in five years.”

The Design of Kettle

After more than ten years of wrestling with ETL tools of dubious quality, one of the main design goals of Kettle was to be as open as possible. Back then that specifically meant:

- Open, readable metadata (XML) format (illustrated in the sketch following this list)
- Open, readable relational repository format
- Open API
- Easy to set up (less than 2 minutes)
- Open to all kinds of databases
- Easy-to-use graphical user interface
- Easy to pass data around
- Easy to convert data from/to any possible format

During the first two years, progress was slow while a lot of work was spent figuring out what the ultimate list of capabilities would be for the new ETL tool. The idea to create a parallel ETL engine comes from that time frame. Multiple negative experiences with quick hacks, frameworks, and code generators led to the conviction that the solution had to be engine-based.

Because Matt's background was primarily with the C programming language, he started dabbling with things like client/server code to test passing data between processes and servers. Testing different scenarios taught him a lot about performance bottlenecks in the encoding/decoding of data. As a consequence, one of the major design principles became to leave rows of data untouched as much as possible. To date, this principle is still present in Kettle.
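To give a flavor of the first design goal listed above (open, readable XML metadata), here is a deliberately simplified sketch of what a Kettle transformation definition can look like when saved as a .ktr file. The transformation name, step names, and the two step types shown are illustrative choices, and a real file written by Spoon contains many more settings per step (field layouts, distribution options, screen coordinates, and so on), so read this as an abridged sketch rather than an authoritative schema.

```xml
<!-- Abridged, illustrative sketch of a Kettle transformation (.ktr) file.
     Real files generated by Spoon contain many additional elements. -->
<transformation>
  <info>
    <!-- The transformation name as it appears in Spoon -->
    <name>hello_world</name>
  </info>

  <!-- Each step has a human-readable name and a step type identifier -->
  <step>
    <name>Generate Rows</name>
    <type>RowGenerator</type>
  </step>
  <step>
    <name>Write to file</name>
    <type>TextFileOutput</type>
  </step>

  <!-- Hops describe how rows flow from one step to another -->
  <order>
    <hop>
      <from>Generate Rows</from>
      <to>Write to file</to>
      <enabled>Y</enabled>
    </hop>
  </order>
</transformation>
```

Because the metadata is plain, self-describing XML, transformations can be searched, diffed, generated, and version-controlled with ordinary text tools; Chapter 13 looks at the transformation and job XML in more detail.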

NOTE The name "Kettle" came originally from "KDE ETTL Environment" because the original plan was to write the software on top of the K Desktop Environment (www.kde.org). It was later renamed recursively "Kettle ETTL Environment" after that plan was dropped.

Ultimately, it was the lack of decent drivers for the multitude of relational databases that drove the development to the new and upcoming Java programming language. Work on that started in early 2003. The Standard Widget Toolkit (SWT) was chosen because Matt had prior negative experiences with the performance and look of the then available Java AWT (Abstract Window Toolkit). In contrast, SWT used native operating system widgets to speed up things on the screen and comparatively looked good on all operating systems.

Combine a Java newbie with advanced ETL topics, and you will not be surprised to hear that for the first year of development, the Kettle codebase was a complete mess. The code didn't have packages; it was unstructured and had funky (C-style) naming conventions. Exception handling was unheard of and crashes were not exceptional. The only thing this embryonic version of Kettle had going for it really was the fact that it worked. It was capable of reading text files, reading from databases, and writing to databases, and it had a versatile JavaScript step that allowed you to get by most tough problems. Most of all, it was very flexible and easy to use. This was, after all, a business intelligence tool, not a Java project.

However, it was clear that at some point it needed to become a lot better. So help arrived in the early years in the form of a friend, Wim De Clercq, the co-owner of ixor (www.ixor.be) and a senior enterprise Java architect. He explained the basics of core Java concepts such as packages and exception handling. Time was spent reading up on design patterns such as singletons to simplify the code. Listening to that advice meant performing massive amounts of code changes. As a consequence, it was not unusual back then for Matt to spend weekends doing nothing but re-factoring code in Eclipse, rewriting tens of thousands of lines of code. But, bit by bit, over the course of many weeks and months, things kept going in the right direction.

Kettle Gets Some Traction

These initial positive results were shared with peers, colleagues, and other senior BI consultants to hear what they thought of Kettle. That's how the news spread around slowly, and that's how in 2004, Kettle got deployed at the Flemish Traffic Center (www.verkeerscentrum.be) where billions of rows of data had to be integrated from thousands of data sources all over Belgium. There was no time to write new code and no money to buy a big name ETL tool—so Kettle entered the picture. The specific tasks that had to be performed at the traffic center led to many improvements that could be implemented in a full-time capacity for the first time. Consequently, Kettle improved very fast in that period. For example, the database drivers improved dramatically because now there were really diverse test cases.

It was also around that time that messages got out to the world to let people know they could download a gratis (free of charge, not open source) copy of Kettle for their own use. Reactions were few but mostly positive. The most interesting response came from a nice person named Jens Bleuel in Germany who asked if it was possible to integrate third-party software into Kettle, more specifically an SAP/R3 connector. Kettle version 1.2 was just deployed at the Traffic Center and it would certainly be possible to put code in there, but it didn't have a plugin architecture. Jens' request to integrate existing SAP/R3 code into Kettle was the main reason to develop the plugin system, and this became version 2.0 of Kettle. Ultimately, this effort took until the end of 2004. It was a fairly complete release with support for slowly changing dimensions, junk dimensions, 28 steps, and 13 databases. It was then that the real potential of the tool started to show. This in turn led to the creation by Jens Bleuel of the very first Kettle plugin, ProSAPCON, used to read data from an SAP/R3 server.

Kettle Goes Open Source

There was a lot of excitement during that period, and Matt and Jens agreed to start promoting the sale of Kettle from the kettle.be website and via the newfound partner Proratio (www.proratio.de), the company where Jens was working at that time.

Improvements kept coming and evaluation requests and interest mounted. However, doing development and sales for a complete ETL tool is too big a task for any single person. Matt discovered also that working on Kettle was fun, but selling it was not. He had to find a way to concentrate on the fun part of Kettle development. So by late summer 2005, the decision was made to go open source. This would let Kettle sell itself as well as attract contributions from external developers.

When the code and free downloads of version 2.2 were first published in December 2005, the response was massive. The download package that was put up on JavaForge got downloaded around 35,000 times in the first week alone. The news spread all over the world pretty quickly. Given the large number of open source projects that had turned into "abandonware," it was important to build a community around Kettle as fast as possible. That meant answering (literally) thousands of e-mails and forum posts in the next few years. Fortunately, help quickly arrived in the form of an open source business intelligence company called Pentaho (www.pentaho.com), which acquired the rights to the source code and employed Matt as lead developer of Kettle. Later, Kettle was re-branded as Pentaho Data Integration. Help also arrived in the form of many developers, translators, reviewers, doc writers, and thousands of bug reports without whose help Kettle would not be where it is today.

About This Book

The beginnings of Pentaho Kettle Solutions go back to August 2009, around the time when the first book by Roland and Jos, Pentaho Solutions, was released. Just like anyone who has run his or her first marathon, they proclaimed to each other "never again." But, as they saw the enthusiastic responses to the first book, the tone of the conversation gradually changed and soon they were saying that if there were to be another book, it had to be about Kettle and data integration. When Bob Elliott, the publisher of the first Pentaho book, approached them about the possibility of writing a Kettle book, the topics and table of contents were already well underway. To their relief (and surprise), both their spouses encouraged them to go ahead. The good news kept coming; after consulting Matt Casters about helping out with the review process, he offered to become one of the main authors, an offer that was gladly accepted, of course.

The same motivation that spurred the writing of Pentaho Solutions still holds today: There's an ongoing and increasing interest in open source and free software solutions, combined with a growing recognition that business intelligence (BI) solutions are essential in measuring and improving an organization's performance. These BI solutions require an integrated collection of data that is prepared in such a way that it is directly usable for analysis, reporting, and dashboarding. This is the key reason why most BI projects start with a data integration effort, and why this book is invaluable in assisting you with this.

Over the past decade, open source variants of more and more types of software have become commonly accepted and respected alternatives to their more costly and less flexible proprietary counterparts. The fact that software is open source is often mistaken for being free of cost, and although that might be true if you only look at the license costs, a BI solution cannot (and never will) be free of cost. There are costs associated with hardware, implementation, maintenance, training, and migration, and when this is all added up it turns out that licenses make up only a small portion of the total lifecycle cost of any software solution.

Open source, however, is much more than a cheaper way of acquiring software. The fact that the source code is freely available to anyone ensures better code quality because it is more likely that bugs are found when more people have access to the source than just the core developers. The fact that open source software is built on open standards using standard programming languages (mostly Java) makes it extremely flexible and extensible. And the fact that most open source software is not tied to a particular operating system extends this flexibility and freedom even further.

What is usually lacking, however, is a good set of documentation and manuals. Most open source projects provide excellent quality software, but developers usually care more about getting great software out than delivering proper documentation. And although you can find many good sources of information about Kettle, we felt there was a need for a single source of information to help an ETL developer on his or her way in discovering the Kettle toolset and building robust data integration solutions. That is exactly what this book is for—to help you to build data integration solutions using Kettle.

Who Should Read This Book

This book is meant for anyone who wants to know how to deliver ETL solutions using Kettle. Maybe you're an IT manager looking for a cost-efficient ETL solution, an IT professional looking to broaden your skill set, or a BI or data warehouse consultant responsible for developing ETL solutions in your organization. Maybe you're a software developer with a lot of experience building open source solutions but still new to the world of data integration. And maybe you're already an experienced ETL developer with deep knowledge of one or more of the existing proprietary tools. In any case, we assume you have a hands-on mentality because this is a hands-on book. We do expect some familiarity with using computers to deliver information, installing software, and working with databases, but most of the topics will be explained right from the start. Of course, the data integration concepts are explained as well, but the primary focus is on how to transform these concepts into a working solution. That is exactly why the book is called Pentaho Kettle Solutions.

What You Will Need to Use This Book

In order to use this book, you need only two things: a computer and an Internet connection. All the software we discuss and use in this book is freely available over the Internet for download and use. The system requirements for the computer you will need are fairly moderate; in fact, any computer that is less than four years old will do the job just fine, as long as you have at least 1 gigabyte of RAM installed and 2 gigabytes of free disk space available for downloading and installing software. The various chapters contain URLs where you can find and download the software being used and the accompanying installation instructions. As for Pentaho, there are, apart from the actual source code of course, four versions of the software that you can use:

■ GA (General Availability) releases: These are the stable builds of the software, usually not the most recent ones but surely the most reliable.
■ Release candidates: The "almost ready" next versions of the software, possibly with a few minor bugs still in them.
■ Milestone releases: These are created more frequently and allow you to work with recent versions introducing new features.
■ Nightly builds: The most up-to-date versions of the software, but also the least stable ones.

When writing this book, we mostly worked with the nightly builds that generally precede the GA releases by three months or more. At the time of writing, the GA version of Kettle 4.0 has just been released. This means that when you read this book, the software used in this book will have already been put through its paces and most initial bugs will be fixed. This allows you to work through the material using a stable, bug-free product, and you can concentrate on building solutions, not fixing bugs. The complete list with download options is available online at http://wiki.pentaho.com/display/COM/Community+Edition+Downloads.

What You Will Learn from This Book

This book will teach you:

■ What data integration is, and why you need it
■ The concepts that form the foundation of the Kettle solution
■ How to install and configure Kettle, both on a single computer and in a client/server environment
■ How to build a complete end-to-end ETL solution for the MySQL Sakila demo database
■ What the 34 subsystems of ETL are and how they translate to the Kettle toolkit
■ How Kettle can be used for data extraction, cleansing and conforming, handling dimension tables, loading fact tables, and working with OLAP cubes
■ What the Kettle development lifecycle looks like
■ How to take advantage of Pentaho's Agile BI tools from within the Kettle development environment
■ How to schedule and monitor jobs and transformations
■ How to work with multiple developers and manage different versions of an ETL solution
■ What data lineage, impact analysis, and auditing are, and how Kettle supports these concepts
■ How to increase the performance and throughput of Kettle using partitioning, parallelization, and dynamic clustering
■ How to use complex files, web services, and web APIs
■ How to use Kettle to load an enterprise data warehouse designed according to the Data Vault principles
■ How to integrate Kettle with other solutions and how to extend Kettle by developing your own plugins

How This Book Is Organized

This book explains ETL concepts, technologies, and solutions. Rather than using a single example, we use several scenarios to illustrate the various concepts, although the MySQL Sakila example database is heavily used throughout the book. When the example relies on a database, we have taken care to ensure the sample code is compatible with the popular and ubiquitous MySQL database (version 5.1). These samples provide the technical details necessary to understand how you can build ETL solutions for real-world situations. The scope of these ETL solutions ranges from the level of the departmental data mart to the enterprise data warehouse.

Part I: Getting Started

Part I of this book focuses on gaining a quick and high-level understanding of the Kettle software, its architecture, and its capabilities. This part consists of the following chapters:

Chapter 1: ETL Primer—Introduces the main concepts and challenges found in data-integration projects. We explain the difference between transactional and analytical systems, where ETL fits in, and how the various components of an ETL solution work to solve data-integration problems.

Chapter 2: Kettle Concepts—Provides an overview of the design principles used as the foundations for Kettle and the underlying architecture of the software. We explain the basic building blocks, consisting of jobs, transformations, steps, and hops, and how they interact with each other. You'll also learn how Kettle interacts with databases and how this can be influenced by setting database-specific options. This chapter also contains a hands-on mini tutorial to quickly walk you through Kettle's user interface.

Chapter 3: Installation and Configuration—Explains how and where to obtain the Kettle software and how to install it. We explain which programs make up Kettle, and how these different programs relate to each other and to building ETL solutions. Finally, we explain various configuration options and files and where to find them.

Chapter 4: An Example ETL Solution—Sakila—Explains how to build a complete ETL solution based on the popular MySQL Sakila sample database. Based on a standard star schema designed for this chapter, you'll learn how to work with slowly changing dimensions and how to work with lookup steps for loading fact tables. An important topic is the use of mapping steps in a transformation to be able to re-use existing transformations.

Part II: ETL

The second part of this book is entirely devoted to the 34 ETL subsystems as laid out by Dr. Ralph Kimball and his colleagues from the Kimball Group in their latest book, The Kimball Group Reader (Wiley, 2010), and before that in the second edition of The Data Warehouse Lifecycle Toolkit (Wiley, 2008), one of the best-selling data warehousing books in history. This part includes the following chapters:

Chapter 5: ETL Subsystems—Provides an introduction to the various subsystems and their categorization. The four categories used are Extracting, Cleansing and Conforming, Delivering, and Managing. In this chapter, we also explain how Kettle supports each of the subsystems. The chapter is not only the foundation for the other chapters in this part of the book, but is also essential reading to understand the way ETL solutions in general should be architected.

Chapter 6: Data Extraction—Covers the first main category of subsystems, consisting of data profiling, change data capture, and the extract system itself. We explain what data profiling is and why this should always be the first activity in any ETL project. Change data capture (CDC) is aimed at detecting changes in a source system, for which we provide several solutions that can be used in conjunction with Kettle.

Chapter 7: Cleansing and Conforming—This second main ETL subsystem category is where the real action of the ETL process takes place. In most cases, it's not enough to read data from different sources and deliver it somewhere else. Data must be made uniform; redundant or duplicate data needs to be deleted; and multiple encoding schemes need to be conformed to a single uniform encoding for the data warehouse. Many of the Kettle steps that can be used for transforming data are explained and used in example transformations, including the new Fuzzy Lookup step, which uses advanced string matching algorithms that can be used to deduplicate data.

Chapter 8: Handling Dimension Tables—This is part of the third subsystem category, Delivering. We start by explaining what dimension tables are. We describe the various load and update types for these tables using the "Dimension lookup / update" step, and give special attention to subsystem 10, the surrogate key generator. Special dimension types such as time dimensions, junk or heterogeneous dimensions, and mini-dimensions are covered as well, and we conclude by explaining recursive hierarchies.

Chapter 9: Loading Fact Tables—This chapter covers loading the different types of fact tables. The first half of the chapter is devoted to the various load strategies that can be used to update fact tables and explains how to accommodate late- or early-arriving facts. We introduce and demonstrate the different bulk loaders present in Kettle. Apart from the most familiar fact table type, the transaction fact table, we also explain the periodic and accumulating fact tables. Finally, a new type of fact table is introduced, the state-oriented fact table.

Chapter 10: Working with OLAP Data—This is an entire chapter devoted to only one of the subsystems (20, OLAP cube builder). In this chapter, we illustrate how to work with the three types of OLAP sources and targets: reading data from XML/A and Mondrian cubes, and reading from and loading Palo cubes.

Part III: Management and Deployment

Where the previous part of this book focused on how to build solutions, the chapters in this part focus on how to deploy and manage them.

Chapter 11: ETL Development Lifecycle—This chapter takes one step back and discusses how to design, develop, and test an ETL solution using the tools available in Kettle. We cover the new Agile BI tools and how they can help speed up the development process, and explain what types of testing are required before a solution can be delivered.

Chapter 12: Scheduling and Monitoring—Covers the different scheduling options. We describe standard operating system scheduling tools such as cron and the Windows Task Scheduler, and the built-in Pentaho BI scheduler. Monitoring running jobs and transformations can be done directly from the design environment, but we also show you how to use the logging tables to retrieve information about completed or failed jobs.

Chapter 13: Versioning and Migration—Explains how to keep different versions of Kettle jobs and transformations, enabling a roll back to a previous version if necessary. Another important topic covered in this chapter is the separation of development, test, acceptance, and production environments and how to migrate or promote Kettle objects from one stage to the next.

Chapter 14: Lineage and Auditing—In this chapter, you learn how to use the Kettle metadata to find out where data came from and where it is used. For auditing purposes, it's important to be able to keep track of when jobs ran, how long they took, and how many and what changes were made to the data. To accommodate this, Kettle has extensive logging capabilities, which we describe in detail.

Part IV: Performance and Scalability

This part of the book is all about speeding up the ETL process. Several options are available for increasing extract, transform, and load speeds, and each chapter covers a specific class of solutions that can be used in a Kettle environment. The following chapters cover these topics:

Chapter 15: Performance Tuning—Explains the inner workings of the transformation engine and assists you in detecting performance bottlenecks. We present several solutions to improve performance and throughput. A large part of this chapter is devoted to speeding up the processing of text files and pushing data through the Kettle stream as fast as possible.

Chapter 16: Parallelization, Clustering, and Partitioning—Describes more techniques to increase Kettle's performance. Two classes of strategies exist: scale up and scale out. A scale-up strategy aims at taking advantage of the processing power in a single machine by leveraging multi-core CPUs and large amounts of memory. Scale out means distributing a task to multiple different servers. We explain both strategies and the way they can be used from within Kettle.

Chapter 17: Dynamic Clustering in the Cloud—Shows you how to take advantage of the huge computing power that's available on demand nowadays. The chapter explains how to apply the clustering principles presented in Chapter 16 to a cloud environment, in this case the Amazon Elastic Compute Cloud (EC2). You'll learn how to dynamically expand and shrink a computing cluster based on the expected workload in order to minimize cost and maximize peak performance.

Chapter 18: Real-Time Data Integration—Takes a look at how a perpetual stream of data can be handled by Kettle. As explained in this chapter, the Kettle engine is stream-based in its genes. All it takes to handle real-time, streaming data is to connect to a permanent data stream or message queue and fire up the transformation.

Part V: Advanced Topics

The Advanced Topics part of this book covers miscellaneous subjects to illustrate the power of Kettle, how to extend this power, and how to use it from third-party applications, for instance a custom Java application.

Chapter 19: Data Vault Management—Explains what the Data Vault (DV) modeling technique for enterprise data warehousing is and how Kettle can be used to load the three types of tables that make up a DV schema: hubs, links, and satellites.

Chapter 20: Handling Complex Data Formats—Shows you how to work with data of a non-relational nature. Different types of semi-structured and unstructured data are covered, including the way data in a key/value pair format can be transformed into regular tables, how to use regular expressions to structure seemingly unstructured data, and how to deal with repeating groups and multi-valued attributes. The ability to handle key/value pair data stores is becoming more relevant as many of the so-called schema-less or NoSQL databases store their data in this format.

Chapter 21: Web Services—Takes a deep dive into the world of data available on the World Wide Web. The chapter covers the main components of the Web, and describes the various methods for accessing data from the Web within Kettle, such as HTTP GET, POST, and SOAP. We also thoroughly explain data formats such as XML, JSON, and RSS, and show how to process these formats with Kettle.

Chapter 22: Kettle Integration—Illustrates the various ways in which Kettle can be used from external applications. The Kettle API is described, and several examples help you on your way to embedding Kettle jobs and transformations in a custom application. One of the easiest ways to do this is to use Pentaho Reports, which can use a Kettle transformation as a data source.

Chapter 23: Extending Kettle—Teaches you how to write your own plugins to augment the already extensive capabilities of Kettle. This final chapter covers the prerequisites and tools you need, and describes how to develop the various types of plugins: steps, job entries, partitioning methods, repository types, and database types.

Appendixes

We conclude the book with a few quick references.

Appendix A: The Kettle Ecosystem—Draws a map of the Kettle world and explains who's involved, where to get (or give!) help, and how to interact with others using the forums. We also explain how to work with Jira, Pentaho's issue management system, to find out about and track bugs, monitor their status, and see what's on the roadmaps.

Appendix B: Kettle Enterprise Edition Features—Explains the differences between the two editions of Kettle and highlights the extra features available in the Enterprise Edition.

Appendix C: Built-in Variables and Properties Reference—Provides an overview of all default Kettle variables and properties you can call and use from within your Kettle solutions.

Prerequisites

This book is mainly about Kettle, and installing and using this software are the primary topics of this book. Chapter 3 describes the installation and configuration of Kettle, but there are other pieces of software you'll need in order to follow along with all the examples and instructions in this book. This section points you to these tools and shows you how to get and install them.

Java

Kettle (and the rest of the Pentaho stack for that matter) runs on the Java platform. Although Java is ubiquitous and probably already installed on your system, we do provide installation instructions and considerations in Chapter 3.

MySQL

The MySQL database is the default database we use throughout the book. It can be obtained from the MySQL website at http://dev.mysql.com/downloads/mysql. The MySQL database is available for almost any current operating system and can be installed either by downloading and running the installer for your environment, or by using the standard repositories for a Linux system. If you're running Ubuntu and don't have a MySQL server installed yet, you can do so very quickly by opening a terminal screen and executing the command sudo apt-get install mysql-server. Because this will only install the server, you might also want to download and install the GUI tools to work with the database outside of the Kettle environment. Starting with version 5.2, the MySQL Workbench now contains the database modeling, management, and query tools in one integrated solution.

SQL Power Architect

To model your target environment, we strongly recommend using a data modeling tool because this capability is not available in the Kettle solution, nor anywhere else in the Pentaho toolset. One of the best open source solutions around is Power Architect, which is used in this book as well. You can find the tool on the download page of SQLPower, the Canadian company that develops and maintains Power Architect, which is located at www.sqlpower.ca/page/architect.

Eclipse

Kettle is programmed in Java, using the Eclipse IDE (Integrated Development Environment), and the final chapter of this book contains instructions for developing your own Kettle plugins using Eclipse. Eclipse is a versatile tool that can be used to program solutions in any programming language you can think of. Thanks to the architecture of the tool, it can also be used for data modeling, report creation, data mining, and so on by using plugins and switching to different perspectives. Eclipse can be obtained from the download page at www.eclipse.org/downloads. If you're running Ubuntu, it's in the standard repositories and can be installed either from the Software Center (Developer Tools ➪ IDEs in 10.04 or Programming in 9.10), or by running sudo apt-get install eclipse from the command line.

On the Website

All the example material used in the book is available for download from the companion website at Wiley (www.wiley.com/go/kettlesolutions). The downloads are organized into folders for each chapter, in which you will find:

■ Power*Architect data models for the sample databases in the book
■ All PDI jobs and transformations
■ SQL scripts for examples and modifications

Further Resources

Numerous books are available on the specific topics covered in this book. Many chapters contain references for further reading and links to websites that contain additional information. If you are new to business intelligence and data warehousing in general (or want to keep up with the latest developments), here are some good places to start:

■ http://en.wikipedia.org/wiki/Business_intelligence
■ http://www.kimballgroup.com
■ http://b-eye-network.com
■ http://www.tdwi.org

We also encourage you to visit our websites, where you can find our contact information in case you want to get in touch with us directly:

■ Matt Casters: www.ibridge.be
■ Roland Bouman: rpbouman.blogspot.com
■ Jos van Dongen: www.tholis.com

Part I: Getting Started

In This Part

Chapter 1: ETL Primer
Chapter 2: Kettle Concepts
Chapter 3: Installation and Configuration
Chapter 4: An Example ETL Solution—Sakila

Chapter 1: ETL Primer

The introduction of this book described the need for data integration. This chapter provides a starting point to the wonderful world of data integration and explains the differences and similarities among the three main forms of data integration: ETL, ELT, and EII. To fully understand the reasoning behind using a data warehouse and an ETL solution to load and update data, we start by explaining the differences between a transaction and an analysis database.

OLTP versus Data Warehousing

The first question one might ask is how source data systems differ from business intelligence (BI) systems (sometimes still called decision support systems or DSS). An individual transaction system, often denoted by the acronym OLTP (short for OnLine Transaction Processing), needs to be able to very quickly retrieve a single record of information. When multiple records are needed they are usually tied to a single key that has been retrieved before. Think of an order with the accompanying order lines in an order entry system or a personnel record with all salary and bonus information in an HR system. What's more: this data often needs to be updated as well, usually just one record at a time.

The biggest difference between an OLTP and a BI database (the data warehouse, or DWH) is the amount of data analyzed in a single transaction. Whereas an OLTP handles many concurrent users and queries touching only a single record or limited groups of records at a time, a data warehouse must have the capability to operate on millions of records to answer a single query. Table 1-1 shows an overview of the major differences between an OLTP and a data warehouse.

Table 1-1: OLTP versus Data Warehouse

CHARACTERISTIC               OLTP                      DATA WAREHOUSE
System scope/view            Single business process   Multiple business subjects
Data sources                 One                       Many
Data model                   Static                    Dynamic
Dominant query type          Insert/update             Read
Data volume per transaction  Small                     Big
Data volume                  Small/medium              Large
Data currency                Current timestamp         Seconds to days old
Bulk load/insert/update      No                        Yes
Full history available       No                        Yes
Response times               < 1 second                < 10 seconds
System availability          24/7                      8/5
Typical user                 Front office              Staff
Number of users              Large                     Small/medium

Of course, it’s not as black and white as this table might indicate. The distinctions listed are a rather classic way of looking at the two types of systems. More and more often, busi- ness intelligence systems are being used as part of the primary business process. A call center agent might have a screen in front of her with not only customer details such as name and address, but also information about order and payment history retrieved from an (ODS) or a data warehouse. Many CRM systems are already capable of showing a credit or customer score on-the-fly, items that have been pre-calculated in the data warehouse and are available on demand for front office workers. This means that the more the data warehouse is used for operational purposes, the more the same requirements apply as for OLTP systems, especially regarding system availability and data currency. Probably the most discussed characteristic of the data warehouse is the required response time. Ten years ago, it wasn’t a problem when a report query took one or two minutes to retrieve and display its data. Nowadays users expect response times similar to what they’re accustomed to when using a search engine. More than ten seconds and users get impatient, start clicking refresh buttons (which will sometimes re-issue the query, making the problem even worse), and eventually avoid using the data warehouse because it’s so slow. On the other hand, when the data warehouse is used for data min- ing purposes, analysts find a response time of several hours totally acceptable, as long as the result to their inquiry is valuable. Chapter 1 N ETL Primer 5

What Is ETL?

You know of course that ETL is short for extract, transform, and load; no secrets here. But what exactly do we mean by ETL? A simple definition could be "the set of processes for getting data from OLTP systems into a data warehouse." When we look at the roots of ETL it's probably a viable definition, but for modern ETL solutions it grossly oversimplifies the term. Data is not only coming from OLTP systems but from websites, flat files, e-mail databases, spreadsheets, and personal databases such as Access as well. ETL is not only used to load a single data warehouse but can have many other use cases, like loading data marts, generating spreadsheets, scoring customers using models, or even loading forecasts back into OLTP systems. The main ETL steps, however, can still be grouped into three sections:

1. Extract: All processing required to connect to various data sources, extract the data from these data sources, and make the data available to the subsequent processing steps. This may sound trivial but can in fact be one of the main obstacles in getting an ETL solution off the ground.

2. Transform: Any function applied to the extracted data between the extraction from sources and loading into targets. These functions can contain (but are not limited to) the following operations:
   ■ Movement of data
   ■ Validation of data against data quality rules
   ■ Modification of the content or structure of the data
   ■ Integration of the data with data from other sources
   ■ Calculation of derived or aggregated values based on processed data

3. Load: All processing required to load the data in a target system. As we show in Chapter 5, this part of the process consists of a lot more than just bulk loading transformed data into a target table. Parts of the loading process include, for instance, surrogate key management and dimension table management.

The remainder of this section examines how ETL solutions evolved over time and what the main ETL building blocks look like.
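Before looking at that evolution, the following minimal Java sketch makes the three steps tangible: it extracts rows over JDBC, applies a trivial transformation, and loads the result into a target table. This is not how Kettle works internally (Kettle lets you design this declaratively as a transformation), and the connection strings, tables, and columns are invented purely for illustration.

    import java.sql.*;

    public class MinimalEtl {
        public static void main(String[] args) throws SQLException {
            // Extract: connect to a (hypothetical) source database and read raw rows.
            try (Connection src = DriverManager.getConnection("jdbc:mysql://localhost/source_db", "etl", "secret");
                 Connection tgt = DriverManager.getConnection("jdbc:mysql://localhost/dwh", "etl", "secret");
                 Statement extract = src.createStatement();
                 ResultSet rows = extract.executeQuery(
                         "SELECT customer_id, first_name, last_name, country FROM customer");
                 PreparedStatement load = tgt.prepareStatement(
                         "INSERT INTO dim_customer (customer_id, full_name, country_code) VALUES (?, ?, ?)")) {

                while (rows.next()) {
                    // Transform: combine and standardize fields before loading.
                    String fullName = (rows.getString("first_name") + " " + rows.getString("last_name")).trim();
                    String country  = rows.getString("country");
                    String countryCode = (country == null || country.isEmpty())
                            ? "??" : country.substring(0, Math.min(2, country.length())).toUpperCase();

                    // Load: write the transformed row into the target dimension table.
                    load.setInt(1, rows.getInt("customer_id"));
                    load.setString(2, fullName);
                    load.setString(3, countryCode);
                    load.addBatch();
                }
                load.executeBatch();
            }
        }
    }

In Kettle the same pattern would typically be expressed as a Table input step, a few transformation steps, and a Table output step connected by hops, as Chapter 2 explains.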

The Evolution of ETL Solutions

Data integration needs have existed as long as data has been available in a digital format. In the early computing days, before ETL tools existed, the only way to get data from different sources and integrate it in one way or another was to hand-code scripts in languages such as COBOL, RPG, and later in Perl or PL/SQL. Although this is called the first generation of ETL solutions, it may surprise you that today, about 45 percent of all ETL work is still conducted by using hand-coded programs/scripts. This might have made sense in the days when ETL tools had a six-figure price tag attached to them, but currently there are many open source and other low-cost alternatives available, so there's really no point in hand-coding ETL jobs anymore. The main drawbacks of hand-coding are that it is:

■ Error prone
■ Slow in terms of development time
■ Hard to maintain
■ Lacking metadata
■ Lacking consistent logging/error handling

The second generation of ETL tools (actually the first if we're talking about "tools" rather than the broader "solutions") tried to overcome these weaknesses by generating the required code based on the design of an ETL flow. In the early 1990s, products such as Prism, Carlton, and ETI emerged, but most were acquired later by other ETL vendors. ETI is probably the only independent vendor left from those early days that still offers a code-generating solution. The fact that code generators are listed here as second-generation ETL solutions doesn't necessarily mean they are outdated. It's rather the contrary; code generators are alive and kicking, with Oracle's Warehouse Builder arguably being the most well-known product in this category. The popular open source tool Talend is another example of a code-generation solution. Code generators have their pros and cons; the biggest disadvantage is that most code generators can work with only a limited set of databases for which they can generate code.

Soon after the code generators came into use, a third generation of ETL tools emerged. These were based on an engine where all the data processing took place, and a set of metadata that stored all the connection and transformation rules. Because engines have a generic way of working and all the transformation logic is independent from both the source and the target data stores, engine-based ETL tools, in general, are more versatile than code-generating tools. Kettle is a typical example of an engine-based tool; other familiar names in this area are Informatica PowerCenter and SQL Server Integration Services.

Both code generators and engine-based tools offer some help in discovering the structure of the underlying data sources and the relationships between them, although some tools are more capable in doing this than others. They also require that a target data model is developed either before or during the design of the steps. After this design phase, the target schema has to be mapped against the source schema(s). This whole process is still very time consuming, and as a result, a new generation of data warehouse tools emerged that are model driven. MDA tools (for Model Driven Architecture) try to automate the data warehouse and data mart design process from the ground up by reading the source data model and generating both the target schema and all required data mappings to populate the target tables. There are only a few such tools on the market, with Kalido and BIReady being the most well known. They are no silver bullets, however; MDA tools still require a skilled data warehouse architect to reap the benefits from them. Although they cannot solve every data integration challenge, they can be a huge time (and thus money) saver.

DATA WAREHOUSE VERSUS DATA MART

In this book, the terms data warehouse and data mart are often used as if they are interchangeable items. They're not, and they differ widely in scope, model, and applicability. A data warehouse is meant to be the single, integrated storehouse of (historical) data that can be used for supporting an organization's decision process. As such, it contains data covering a wide range of topics and business processes, for instance finance, logistics, marketing, and customer support. Often, a data warehouse cannot be accessed directly by end user tools. A data mart, in contrast, is meant for direct access by end users and end user tools, and has a limited specific analytical purpose, for instance Retail Sales or Customer Calls.

ETL Building Blocks

The best way to look at an ETL solution is to view it as a business process. A business process has input, output, and one or more units of work, the process steps. These steps in turn also have inputs and outputs, and perform an operation to transform the input into the output. Think, for example, of a claims department at an insurance company. There's a big sign on the door that says Claims Department, which tells the purpose and main process of the department: handling claims. Within the department, each desk or sub-department might have its own specialty: health insurance claims, car insurance claims, travel insurance claims, and so on. When a claim is received at the office, it is checked to find out to which desk it should be sent. The claims officer can then determine whether all required information to handle the claim is available and if not, send it back with further instructions to the submitter. Each day at 9 a.m. this process of handling claims starts, and it runs until 5 p.m.

This example is a lot like an ETL process: data arrives or is retrieved and a validation step determines what kind of data it is. The data is then sent to a specific transformation that is designed to handle that specific data. When the transformation can process the data, it's delivered to the next transformation or a destination table, and in the case of errors, it is transferred to an error handling routine. Each night at 3 a.m., the job is started by a scheduler and it ends when all data is processed.

You might now have a global feeling of how ETL processes are designed. From the preceding examples you can deduce that there must be some mechanism to control the overall process flow, and other more specific parts of the process that do the actual transformation. The first part is called a job in Kettle terminology, and the latter part consists of transformations. Jobs are the traffic agents of an ETL solution, and transformations are the basic building blocks. Individual transformations can be chained together in a logical order, just like a business process, to form a job that can be scheduled and executed. A transformation in turn can also consist of several steps. A step is the third basic building block of a Kettle solution, and the connection between steps and transformations is formed by hops. You'll read a lot more about jobs, transformations, steps, and hops in the remainder of this book, but these four building blocks enable you to develop any imaginable ETL solution. Chapter 2 provides a more detailed introduction to these four concepts.
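As an illustration of how these building blocks surface outside of the graphical designer, the sketch below runs a transformation from plain Java, roughly following the Kettle 4 Java API that Chapter 22 describes in detail. The file name load_dim_customer.ktr is hypothetical, and a real application would add proper logging and error handling.

    import org.pentaho.di.core.KettleEnvironment;
    import org.pentaho.di.trans.Trans;
    import org.pentaho.di.trans.TransMeta;

    public class RunTransformation {
        public static void main(String[] args) throws Exception {
            // Initialize the Kettle environment (loads step and plugin registries).
            KettleEnvironment.init();

            // Load the transformation metadata from a .ktr file designed in Spoon.
            TransMeta transMeta = new TransMeta("load_dim_customer.ktr");

            // Create and execute the transformation engine instance.
            Trans trans = new Trans(transMeta);
            trans.execute(null);          // no command-line arguments
            trans.waitUntilFinished();    // block until all steps have completed

            if (trans.getErrors() > 0) {
                throw new RuntimeException("Transformation finished with errors");
            }
        }
    }

Jobs can be executed in a similar fashion from code; Chapter 22 shows the details.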

ETL, ELT, and EII

The term data integration encompasses more than just ETL. With ETL, data is extracted from one or more source systems and, possibly after one or more transformation steps, physically stored in a target environment, usually a data warehouse. To be able to distinguish between ETL and other forms of data integration, we need a way of classifying and describing these other mechanisms. Figure 1-1 shows a classic example of a data warehouse architecture. In this figure there are multiple source systems, a staging area where data is extracted to, a central warehouse for storing all historical data, and finally data marts that enable end users to work with the data. Between each of these building blocks a data integration process is used, as shown by the ETL blocks.

[Figure 1-1: Classic data warehouse architecture. Sources (files, DBMS, CSV, ERP) feed a staging area through ETL; further ETL processes load the central warehouse and the data marts that form the end user layer.]

This is an architecture that has been used for the past 20 years and has served us well. In fact, many current data warehouse projects still use an architecture similar to the one shown in Figure 1-1. This picture clearly shows that ETL tools are used not only to extract data and load a data warehouse, but also to populate data marts and possibly other databases like an operational data store (not present in the diagram). Figure 1-1 also shows that there is an intermediate step between the source systems and the data warehouse, called the staging area. This part of the overall architecture is merely a drop zone for data; it serves as an intermediate area to get data out of the source systems as quickly as possible. A staging area doesn't necessarily need to be a database system; in many cases, using plain ASCII files to stage data works just as well and is sometimes a faster solution than first inserting the data into a database table.

NON-ETL USE FOR ETL TOOLS

ETL tools are used for more than data warehouse purposes alone. Because they offer a wide range of connectivity and transformation options, another often seen use case is data migration. With the help of a tool like Kettle it is fairly easy to connect to database A and migrate all the data to database B. In fact, Kettle has two wizards (Copy Table and Copy Tables) that will handle this for you automatically, including generating the new target tables using the target database SQL syntax. A third and more complex use case is data synchronization, meaning that two (or more) databases are being kept in sync using ETL tools. Although this can be achieved to a certain level, it is not the most common way of using ETL. Usually there are time constraints (changes made in database A need to be available in database B within the shortest achievable amount of time), which make the batch orientation of most ETL tools an unlikely choice for synchronization purposes.

ELT

ELT (short for extract, load, and transform) is a slightly different approach to data integration than ETL. In the case of ELT, the data is first extracted from the source(s), loaded into the target database, and then transformed and integrated into the desired format. All the heavy data processing takes place inside the target database. The advantage of this approach is that in general, a database system is better suited for handling large workloads where hundreds of millions of records need to be integrated. Database systems are also usually optimized for I/O (throughput), which helps to process data faster, too. There is a big "but" here: In order to benefit from an ELT approach, the ELT tool needs to know how to use the target database platform and the specific SQL dialect being used. This is the reason there aren't a lot of ELT solutions on the market and why a general-purpose ETL tool such as Kettle lacks these capabilities. Nevertheless, most of the traditional closed source ETL vendors augmented their tools with pushdown SQL capabilities, basically resulting in supporting ETLT (extract, transform, load, transform) scenarios, where transformation can take place either within the engine (especially for operations not supported by the target database), or after loading inside the database. Leading database vendors such as Microsoft (SQL Server Integration Services) and Oracle (Oracle Warehouse Builder) have a head start here and have had ETLT capabilities by design because their tools were already tightly integrated with the database. Oracle even bought Sunopsis some time ago, a company that created one of the few specialized ELT solutions on the market (now available as Oracle Data Integrator). Others, like Informatica and Business Objects, have added pushdown SQL capabilities to their products in later releases. An excellent overview of the pros and cons of ETL and ELT can be found in this blog post by Dan Linstedt, the inventor of the Data Vault data warehouse modeling technique: http://www.b-eye-network.com/blogs/linstedt/archives/2006/12/etl_elt_-_chall.php.
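To contrast the two approaches: where an ETL engine would pull the rows through its own transformation steps, an ELT (or pushdown) approach hands the work to the target database as a single set-based statement. The sketch below assumes hypothetical stg_customer and dim_customer tables in a MySQL data warehouse; it only shows the pattern, not a complete load process.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class EltPushdownExample {
        public static void main(String[] args) throws Exception {
            // In an ELT approach the raw data has already been loaded into a staging table;
            // the transformation is pushed down to the target database as set-based SQL.
            try (Connection dwh = DriverManager.getConnection("jdbc:mysql://localhost/dwh", "etl", "secret");
                 Statement stmt = dwh.createStatement()) {
                stmt.executeUpdate(
                    "INSERT INTO dim_customer (customer_id, full_name, country_code) " +
                    "SELECT s.customer_id, " +
                    "       CONCAT(s.first_name, ' ', s.last_name), " +
                    "       UPPER(LEFT(s.country, 2)) " +
                    "FROM   stg_customer s " +
                    "LEFT JOIN dim_customer d ON d.customer_id = s.customer_id " +
                    "WHERE  d.customer_id IS NULL");   // only insert customers not yet present
            }
        }
    }

Whether this outperforms an engine-based transformation depends entirely on the target database and its SQL dialect, which is exactly the trade-off described above.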

A special product that should be mentioned here is LucidDB. This open source columnar BI database took the ETL and ELT concepts one step further and is capable of handling all the ETL functionality inside the database using extensions to standard ANSI SQL. To do this, LucidDB uses so-called wrappers around different data sources. After a wrapper is defined for a source (which could be a database, a text file, or even a Web service), the source can be accessed using standard SQL to perform any operation that the SQL language supports. This architecture, of course, makes LucidDB also capable of acting as a lightweight EII solution (Enterprise Information Integration), which we cover in the next section.

EII: Virtual Data Integration

Both ETL and ELT move or copy data physically to another data store, from the OLTP to the data warehouse system. The reasons for using a separate data warehouse and hence, moving the data to that data store, were explained in the earlier section "OLTP versus Data Warehousing." In more and more cases, however, there is no need to move or copy data. In fact, most users don't even care whether there is an ETL process and a data warehouse complemented with data marts: They just want access to their data! In a way, the data warehouse architecture displayed in Figure 1-1 is like the kitchen of a restaurant. As a customer, I don't really care how my food is prepared—I just want it served in a timely manner and it should taste great. Whatever happens behind those swinging doors is really none of my business. The same applies to a data warehouse: Users don't really care how their data is processed; they just want to access it quickly and easily.

So instead of physically integrating data, it is virtually integrated, making the data accessible in real time when it is needed. This is called enterprise information integration, or EII; other terms such as data federation and data virtualization are used as well and have the same meaning. The main advantage of this approach is, of course, the fact that data is always up-to-date. Another advantage is that there is no extra storage layer and no extra data duplication. Some data warehouse environments copy the same data three or four times: once in a staging area, then an operational data store (ODS), the data warehouse itself, and finally the data marts. By using virtual data integration techniques, the data is accessible for an end user as if it were a data mart, but in reality the EII tool takes care of all the translations and transformations in the background.

Although EII sounds like a winning strategy, it does have some drawbacks. Table 1-2 highlights the differences between using a physical and virtual data integration approach. You can draw some conclusions from Table 1-2. One is that managing large volumes of cleansed, current data using a virtual approach will be challenging, if not impossible. Another conclusion might be that ETL is a tool that typically belongs in the physical integration category, but as you will see in Chapter 22, Pentaho Reporting can be used to invoke Kettle data integration jobs as a data source on an ad-hoc basis, offering some of the advantages of a virtual data warehouse solution combined with all the functionality of a full-fledged ETL tool.

Table 1-2: Virtual versus Physical Data Integration

CHARACTERISTIC                    PHYSICAL     VIRTUAL
Data currency                     Weak         Strong
Query performance/latency         Strong       Acceptable
Frequency of access               Strong       Acceptable
Diversity of data sources         Acceptable   Strong
Diversity of data types           Acceptable   Strong
Non-relational data sources       Weak         Strong
Transformation and cleansing      Strong       Weak
Performance predictability        Strong       Acceptable
Multiple interfaces to same data  Weak         Strong
Large query/data volume           Strong       Weak
Need for history/aggregation      Strong       Weak

Ratings: Weak, Acceptable, Strong. ©Mark Madsen, Third Nature, Inc., 2009. All Rights Reserved. Used with Permission.

Data Integration Challenges

Data integration typically poses a number of challenges that need to be addressed and resolved before your solution is up and running. These challenges can be of a political, organizational, functional, or technical nature. First and foremost, you'll need to find out which data is needed to answer the questions that your organization wants answered, and build a solid business case and project plan for delivering that required information. Without a proper business case for starting a business intelligence project, you'll likely fail to get the necessary sponsorship. Technological barriers can be challenging but are in most cases removable; organizational barriers are much harder to take away. Although we won't cover these topics further in this book, we just wanted to raise awareness about this important subject.

NOTE For more information, see Ralph Kimball’s Data Warehouse Lifecycle Toolkit (2nd edition). Chapter 3 addresses gathering business requirements.

A good plan and a business case might get you the necessary support to start a project, but they are not enough to deliver successful solutions. For this, a solid methodology is needed as well, and of course a team of bright and experienced people won't hurt either. For many years IT projects were run using a waterfall approach where a project had its initiation phase, followed by design, development, testing, and moving to production. For business intelligence projects, of which ETL is an important part, this never worked quite well. As you'll see in the following section, a more agile approach fits the typical steps in a BI project much better.

On a more detailed level, you need to face the ETL design challenges, and define how your jobs and transformations will be built, not in a pure technical sense, but in a more functional way. There are many ways in which an ETL tool can be used to solve a specific problem, and no matter which approach is taken, it's mandatory that the same conceptual design is used to tackle similar problems. For instance, if the team decides to stage data to files first, stick to that and don't mix in staging data to a database for some parts of the solution, unless absolutely necessary.

After solving the organizational, project, and design challenges, the first technical issue is finding out where to get the data from, in what format it is available, and what exactly makes up the data you're interested in. Not only might it be a challenge to get access to the data, but connecting to the systems that host the data can be a major issue, too. A lot of the data available in enterprise information systems resides on mainframe computers or other hard-to-access systems such as older proprietary UNIX editions.

Large data volumes are also a challenge. Extracting all the data from the source systems every time you run an ETL job is not feasible in most circumstances. Therefore you need to resolve the issue of identifying what has changed in your source systems to be able to retrieve only the data that has been inserted, updated, or deleted. In some cases, this issue cannot be gracefully resolved and a brute force approach needs to be taken that compares the full source data set to the existing data set in the data warehouse. Other challenges have to do with the way the data needs to be integrated: What if there are three different systems where customer data is stored, and the information in these systems is inconsistent or conflicting? And how do you handle incomplete, inconsistent, or missing data?

Methodology: Agile BI

One of the first challenges in any project is to find a good way to build and deliver the solution, including proper documentation. This holds true for any software package, not only for ETL. Over the years, many project management and software development methodologies have seen the light of day. Maybe you remember the days of the structured analysis and design methodologies, as developed in the 70s by people like Ed Yourdon and Tom DeMarco. These approaches all had a so-called waterfall model in common, meaning that one step in the analysis or design phase needs to be complete before you can move on to the next one. You can find more background information about these methods at http://en.wikipedia.org/wiki/Structured_Analysis.

During the 80s and 90s, developers found that these structured, waterfall-based methods weren't always helpful, especially when requirements changed during the project. To cope with these changing requirements, different "agile" development methods emerged, with Scrum arguably being the best-known example. What's so special about agile development? To make that clear, the founders and proponents of agile methodologies came up with the Agile Manifesto, which declares the values of the agile methodology:

■ Individuals and interactions over processes and tools
■ Working software over comprehensive documentation
■ Customer collaboration over contract negotiation
■ Responding to change over following a plan

The Agile Manifesto (full text available at http://agilemanifesto.org) also contains 12 guiding principles that define what Agile is. In short, it's about:

■ Early and frequent delivery of working software
■ Welcoming changing requirements
■ Business and IT working closely together
■ Reliance on self-motivated developers and self-organizing teams
■ Frequent, face-to-face conversations to discuss issues and progress
■ Keeping it simple: maximizing the amount of work not done

NOTE There is an abundant amount of information about agile development and the Scrum methodology available online. A good place to start is http://en.wikipedia.org/wiki/Scrum_development.

Now what has all this to do with business intelligence, and more specifically, ETL? Well, in the case of Pentaho and Kettle: everything! Pentaho has always embraced agile development methods (especially Scrum) to incrementally develop and release new versions of their BI platform and components. Development phases are measured and communicated in Sprints and Milestones, which is also reflected in the download versions you can obtain from the CI (Continuous Integration) repositories.

During 2009, Pentaho decided to translate the experience the company had with using agile development methods into an agile BI approach. The intention is not only to support BI developers with a solid methodology, but also to adapt the Pentaho BI suite and all constituent components in such a way that they enforce, enable, and support an agile way of working. The first part of the BI suite that was changed and extended is Kettle, which is the reason we introduced the agile concepts here. You'll see more about the agile capabilities of Kettle later in the book, but as you might already wonder how Kettle supports an agile way of working, here's a look at the basic capabilities Kettle has to offer.

Once you've installed Kettle, it will take only a couple of minutes before you have connected to a data source, read some data, added a transformation, and delivered the data to a destination table. Because Kettle is an engine-based solution, it automatically takes care of a lot of things for you, which helps speed up the process. Kettle also contains a vast (and perhaps at first glance overwhelming) number of standard components and transformation steps. These are prebuilt code blocks that also help in minimizing the development effort and maximizing the speed of solution delivery. Changes in data fields or data types are automatically propagated to subsequent steps in the process, and Kettle can also generate the change scripts needed to alter the final destination table. The integrated modeling and ad-hoc visualization tools enable you to directly show the results of your work to the end user and play with the data in an iterative way. Developers and business users can therefore work closely together using the same toolset. Deviations from plan or from the user's expectations can be taken into account immediately and the jobs and transformations can be changed accordingly.

Because Kettle is tightly linked to the Agile BI initiative by Pentaho, it might be worthwhile to read up on the methodology and how Kettle supports it on the Agile BI wiki. You can find this info at http://wiki.pentaho.com/display/AGILEBI/Welcome+to+Agile+Business+Intelligence.

ETL Design

Even when (or perhaps, especially when) an agile approach is taken, your ETL process needs to be designed in one way or another. Because an ETL solution in many respects resembles a workflow or business process, it might make sense to use a flowchart drawing tool to create a high-level design before you start building. Most users will be familiar with flowchart diagrams and can comment and help make your design better. On a more detailed level, it is important to define which parts of the solution are to be reusable and which are not. For instance, creating a date dimension is usually a one-time effort within a data warehouse project. So for an individual project it makes sense to not spend too much time developing a flexible and database-independent solution and just develop a standard script for this. On the other hand, when you're a consultant working for many different customers, it absolutely makes sense to have a generic date dimension generator in your toolbox.

From this discussion, it's easy to see what the most important question is when you start building a data transformation: Should it be reusable in other parts of the solution or not? Depending on the answer, you might spend some extra time in making the transformation generic, for instance by adding extra parameters that enable you to choose a type of database, a date/time format, or other things that change between different solutions.

Data Acquisition

As explained earlier, getting access to and retrieving data from source systems is the first challenge you encounter when an ETL project is started. Don't automatically assume that this is only a technical problem; in many cases not being able to access data directly is caused by internal politics or guidelines. ERP vendors also try to make it difficult or even impossible to access the data in their systems directly. The widely used SAP/R3 system, for instance, has specific clauses in the software license that prohibit direct connections to the underlying database other than by the means provided by SAP. Most financial institutions that run their mission-critical systems on a mainframe also won't let you access these systems directly, so you're dependent on data feeds delivered by FTP or via a web service. This isn't necessarily a disadvantage; the SAP system consists of more than 70,000 tables, so finding the ones with the data you're interested in might be a very time-consuming task. For situations like this, you need tools that are able to interpret the ERP metadata, which displays a business view of the data. Fortunately, Kettle not only contains a standard data input step for acquiring data from SAP/R3, but also for getting data out of Salesforce.com, arguably the most used and advanced online CRM application. For other standard ERP and CRM solutions such as SugarCRM, OpenERP, ADempiere, or PeopleSoft, you might have to revert to third-party solutions or build your own input step. The capability to read data from a mainframe directly is unfortunately not available, so in those cases it's best to have this data delivered from the system in a readable format such as ASCII or Unicode (older mainframe systems still use EBCDIC). More information about accessing ancient COBOL systems and the specific file format challenges involved can be found at http://jymengant.ifrance.com/jymengant/jurassicfaq.html.

Beware of Spreadsheets

A major challenge in the field of data acquisition has to do with the way and the format in which the data is delivered. A notorious troublemaker in this area is Excel, so the best advice we can give is to just never accept a data delivery in Excel, unless you can be sure it’s system-generated and that it’s created on the same machine by a process that’s owned by the same user. Excel data problems occur frequently when different internationalization settings are used, causing dates and numeric fields to change from one session to another. As a result, your carefully designed and tested transformation will fail, or at least will generate incorrect results (which is basically the same but harder to track). Most problems, however, are caused by users who will tell you they haven’t changed anything but who did, perhaps even unknowingly.

Design for Failure

Even if access to the data isn’t a problem and the solution you’ve built looks rock solid, you always need to make sure that the data source is available before you kick off a process. One basic design principle is that your ETL job needs to be able to fail gracefully when a data availability test fails. Kettle contains many features to do this. You can:

■ Test a repository connection.
■ Ping a host to check whether it's available.
■ Wait for a SQL command to return success/failure based on a row count condition.
■ Check for empty folders.
■ Check for the existence of a file, table, or column.
■ Compare files or folders.
■ Set a timeout on FTP and SSH connections.
■ Create failure/success outputs on every available job step.

It's also a good idea to add error handling at the job and transformation level. When loading a data warehouse, dependencies often exist between tables. A good example is that a fact table cannot be loaded before all dimension loads have completed. When one of the dimension loads fails, the complete job should fail, too. A good design then lets you correct the errors and enables you to restart the job where only the failed and not-yet-run parts will execute.
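In Kettle these checks are job entries you drag onto the canvas. To make the idea of failing gracefully concrete outside of the tool, the following minimal Java sketch performs two typical pre-flight checks; the host name and file path are made-up examples, and the non-zero exit code is what a scheduler would act on.

    import java.io.File;
    import java.net.InetAddress;

    public class PreFlightCheck {
        public static void main(String[] args) throws Exception {
            // Check that the source host is reachable before starting the load.
            if (!InetAddress.getByName("erp.example.com").isReachable(5000)) {
                System.err.println("Source host not reachable; aborting load gracefully.");
                System.exit(1);   // non-zero exit code signals failure to the scheduler
            }

            // Check that the expected extract file has actually been delivered.
            File extract = new File("/data/incoming/orders_extract.csv");
            if (!extract.exists() || extract.length() == 0) {
                System.err.println("Extract file missing or empty; aborting load gracefully.");
                System.exit(1);
            }

            System.out.println("All pre-flight checks passed; the ETL job can start.");
        }
    }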

Change Data Capture

The first step in an ETL process is the extraction of data from various source systems and passing the data to the next step in the process. A best practice here is the intermediate storage of the extracted data in staging tables or files to make restarts possible without the need of retrieving all data again. This seems like a trivial task, and in the case of initially loading a data warehouse it usually is, apart from challenges incurred from data volumes and slow network connections. But after the initial load, you don't want to repeat the process of completely extracting all data again. This wouldn't be of much use anyway because you already have an almost complete set of data, and it only needs to be refreshed to reflect the current status. All you're interested in is what has changed since the last data load, so you need to identify which records have been inserted, modified, or even deleted. The process of identifying these changes and only retrieving records that are different from what you already loaded in the data warehouse is called Change Data Capture, or CDC.

Basically, there are two main categories of CDC processes, intrusive and non-intrusive. By intrusive, we mean that a CDC operation has a possible performance impact on the system the data is retrieved from. It is fair to say that any operation that requires executing SQL statements in one form or another is an intrusive technique. The bad news is that most available methods to capture changed data are intrusive, leaving only one non-intrusive option. CDC is covered in depth in Chapter 6.
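One of the simplest (and, in the sense just described, intrusive) CDC techniques is timestamp-based: the source table carries a last-update column, and each run extracts only the rows modified since the previous successful load. The sketch below assumes a hypothetical orders table with such a column; note that this approach cannot detect deleted rows. Chapter 6 discusses this and the other CDC options in detail.

    import java.sql.*;

    public class TimestampCdc {
        public static void main(String[] args) throws SQLException {
            // The load window: everything changed since the previous successful run.
            // In practice this value would be read from an ETL audit/log table.
            Timestamp lastRun = Timestamp.valueOf("2010-06-01 03:00:00");

            try (Connection src = DriverManager.getConnection("jdbc:mysql://localhost/source_db", "etl", "secret");
                 PreparedStatement ps = src.prepareStatement(
                     "SELECT order_id, customer_id, amount, last_update " +
                     "FROM   orders " +
                     "WHERE  last_update > ?")) {     // only rows inserted or modified since the last run
                ps.setTimestamp(1, lastRun);
                try (ResultSet changed = ps.executeQuery()) {
                    while (changed.next()) {
                        // Pass the changed rows on to the staging/transformation step.
                        System.out.println("Changed order: " + changed.getLong("order_id"));
                    }
                }
            }
        }
    }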

Data Quality

Much of what is said in the previous section applies here as well: you have to assume that there are quality problems in your data and therefore need to design your transformations to handle these problems. In fact, this isn't entirely true: data quality problems need to be resolved in the source systems, not in the ETL process. However, fixing data quality issues before starting a data warehouse project is a luxury that not many organizations can afford. There's always a pressing need to deliver a solution quickly, and even if serious data quality problems are discovered during the project, they are usually dealt with later or not at all. Hence, knowing what's wrong with your data and knowing what to do about it are essential parts of the ETL developer's job description. Two categories of tools are available to deal with data quality problems. First, you'll need to define a baseline by profiling the data and investigating how good the quality actually is, using a data profiling tool. The purpose of this exercise is twofold: to communicate the results back to the data owner (hopefully a business manager with the authority to do something about it), and to serve as input for the data validation steps in the ETL jobs. Second, there are data quality tools that constantly monitor and augment the data based on business and quality rules. In Kettle, the Data Validation step serves as a built-in data quality tool.

Data Profiling

One of the first things to do when starting an ETL project is to profile the source data. Profiling will tell you how much data there is and what it looks like, both technically and statistically. The most common form of profiling is column profiling, where the appropriate statistics are created for each column in a table. Depending on the data type, this will give you insight into things like the following:

■ Number of NULL or empty values
■ Number of distinct values
■ Minimum, maximum, and average value (numeric fields)
■ Minimum, maximum, and average length (string fields)
■ Patterns (for example, ###-###-#### for phone numbers)
■ Data distribution

Although most of these operations can be performed using Kettle transformations or just plain SQL, it's better to use a specialized tool such as DataCleaner from eobjects.org. Chapter 6 covers data profiling in more detail.
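As a rough illustration of what column profiling with plain SQL looks like, a single aggregate query already delivers several of the statistics listed above. The connection details, table, and column names below are made up for the example:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ColumnProfileExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection; any JDBC-accessible source will do.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/crm", "profiler", "secret");
             Statement st = conn.createStatement()) {

            // One pass over the table collects the row count, NULL count,
            // distinct count, and min/max of an assumed birth_date column.
            String sql =
                "SELECT COUNT(*)                     AS row_count, " +
                "       COUNT(*) - COUNT(birth_date) AS null_count, " +
                "       COUNT(DISTINCT birth_date)   AS distinct_count, " +
                "       MIN(birth_date)              AS min_value, " +
                "       MAX(birth_date)              AS max_value " +
                "FROM customers";

            try (ResultSet rs = st.executeQuery(sql)) {
                if (rs.next()) {
                    System.out.println("rows     : " + rs.getLong("row_count"));
                    System.out.println("nulls    : " + rs.getLong("null_count"));
                    System.out.println("distinct : " + rs.getLong("distinct_count"));
                    System.out.println("min      : " + rs.getDate("min_value"));
                    System.out.println("max      : " + rs.getDate("max_value"));
                }
            }
        }
    }
}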

WARNING Data profiling will only get you so far; logical and/or cross-system quality issues cannot be detected by most data profiling tools, and in order to detect them, a global business glossary and metadata system needs to be in place first. These systems are still very rare.

Data Validation

Profiling is meant for reporting and setting a baseline, while validation is part of the regular ETL jobs. A simple example is as follows: Some column in a source system can technically contain NULL values, but there is a business rule stating that this is a required field. Profiling revealed that there are several records with a NULL value in this column. To cope with this situation, a validation step is needed that contains the rule NOT NULL for this column, and when the column does contain a NULL value, an alternative action should be started. This could be omitting the record and writing it to an error table, replacing the NULL value with a default value such as Unknown, flagging the record as unreliable, or any other action that is deemed necessary.
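A minimal sketch of such a rule outside Kettle, assuming a hypothetical Customer record with a required region field: rows that violate the NOT NULL rule are routed to an error list and repaired with a default value. In Kettle itself this is what a validation step configures for you.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class NotNullValidationExample {

    static class Customer {
        final long id;
        final String region;
        Customer(long id, String region) { this.id = id; this.region = region; }
        public String toString() { return id + "/" + region; }
    }

    public static void main(String[] args) {
        List<Customer> input = Arrays.asList(
                new Customer(1, "EMEA"),
                new Customer(2, null),      // violates the NOT NULL business rule
                new Customer(3, "APAC"));

        List<Customer> validRows = new ArrayList<Customer>();
        List<Customer> errorRows = new ArrayList<Customer>();

        for (Customer c : input) {
            if (c.region == null || c.region.length() == 0) {
                // Route the offending row to an error table or file ...
                errorRows.add(c);
                // ... and/or repair it with a default value such as "Unknown":
                validRows.add(new Customer(c.id, "Unknown"));
            } else {
                validRows.add(c);
            }
        }

        System.out.println("valid rows: " + validRows);
        System.out.println("error rows: " + errorRows);
    }
}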

ETL Tool Requirements

While this book is specifically about Kettle, it’s useful to have an overview of the required features and functionality of an ETL tool in general. This will enable you to decide whether Kettle is the right tool for the job at hand. Each of the following sections first describes the requirement in general and then explains how Kettle provides the required functionality or feature.

Connectivity

Any ETL tool should provide connectivity to a wide range of source systems and data formats. For the most common relational database systems, a native connector (such as OCI for Oracle) should be available. At a minimum, the ETL tool should be able to do the following:

■ Connect to and get data from the most common relational database systems including Oracle, MS SQL Server, IBM DB/2, Ingres, MySQL, or PostgreSQL.
■ Read data from ASCII files in a delimited or fixed format.
■ Read data from XML files (XML is the lingua franca of data interchange).
■ Read data from popular Office formats such as Access databases or Excel spreadsheets.
■ Get files from external sites using FTP, SFTP, or SSH (preferably without scripting).

In addition to this, there might be the need to read data using a web service, or to read an RSS feed. In case you need to get data from an ERP system such as Oracle E-Business Suite, SAP/R3, PeopleSoft, or JD/Edwards, the ETL tool should provide connectivity options for these systems as well. Out of the box, Kettle has input steps for Salesforce.com and SAP/R3. For other ERP or financial systems, an alternative or additional solution might be required. Of course it's always possible to have these systems export a data set to an ASCII file and use that as a source.

Platform Independence

An ETL tool should be able to run on any platform and even a combination of different platforms. Maybe a 32-bit operating system works for the initial development phase, but when data volumes increase and available batch windows decrease, a more powerful solution is required. In other cases, development takes place on a Windows or Mac development PC, but production jobs run on a Linux cluster. You shouldn't have to take special measures to accommodate this in your ETL solution.

Scalability

Scalability is a big issue; data volumes increase year after year and your systems need to be able to handle this. Three options should be available for processing large amounts of data:

■ Parallelism: Enables a transformation to run many streams in parallel, thus utilizing modern multi-core hardware architectures
■ Partitioning: Enables the ETL tool to take advantage of specific partitioning schemes to distribute the data over the parallel streams
■ Clustering: Enables the ETL process to divide the workload over more than one machine

This last option especially can be cost prohibitive with proprietary ETL solutions that are licensed per server or per CPU.

Kettle, being a Java-based solution, runs on any computer that has a Java Virtual Machine installed. Any step in a transformation can be started multiple times in parallel to speed up processing. Kettle will then determine how the data is distributed over the different streams. For better control, a partitioning scheme can be used to make sure that each parallel stream contains data with the same characteristics. This resembles how database partitioning works, but Kettle has no specific facilities to work with database partitions. (The benefit of having such a capability is debatable because the database itself is probably better capable of distributing data over the partitions than an ETL tool would be.) The most advanced scalability feature arguably is the clustering option, which lets Kettle spread the workload over as many machines as deemed necessary. Part IV of this book covers all these scalability features in depth, but a good source to whet your appetite is the white paper written by one of the technical reviewers of this book, Nicholas Goodman of Bayon Technologies. It can be found at http://www.bayontechnologies.com/bt/ourwork/pdi_scale_out_whitepaper.php.

Design Flexibility

An ETL tool should provide a developer the freedom to use any desirable flow design and should not limit people's creativity or design requirements by offering only a fixed way of working. ETL tools can be classified as either process- or map-based. A map-based tool offers a fixed set of steps between source and target data, thus severely limiting the freedom to design jobs. Map-based tools are often easy to learn and get you started very quickly, but for more complex tasks, a process-based tool is most likely the better choice. With a process-based tool like Kettle, you can always add additional steps or transformations if needed because of changes in the data or business requirements.

Reuse

Being able to reuse existing parts of your ETL solution is also an indispensable feature. An easy way of doing this is to copy and paste or duplicate existing transformation steps, but that's not really reuse. The term reuse refers to the capability to define a step or transformation once and call the same component from different places. Within Kettle this is achieved by the Mapping step, which lets you reuse existing transformations over and over as subcomponents in other transformations. Transformations themselves can be used multiple times in multiple jobs, and the same applies to jobs, which can be reused as subjobs in other jobs as well.

Extensibility

There isn't a single ETL tool in the world that offers everything that's needed for every imaginable data transformation task, not even Kettle. This means that it must be possible to extend the basic functionality of the tool in some way or another. Almost all ETL tools offer some kind of scripting option to programmatically perform complex tasks not available in the program itself. Only a few ETL tools, however, offer the option to add standard components yourself by offering an API or other means to extend the toolset. In between these options is a third way that lets you define functions that can be written using a script language and called from other transformations or scripts. With Kettle, you get it all. Scripting is provided by the Java Script step, and by saving this as a transformation it can be reused in a mapping, resulting in a standard reusable function. In fact, any transformation can be reused in a mapping, so creating standard components this way isn't limited to scripting alone. And Kettle is, of course, built with extensibility in mind, offering a plugin-enabled platform. The plugin architecture makes it possible for third parties to develop additional components for the Kettle platform. Several examples of these additional plugins are covered in this book, but it's important to note that all components you find in Kettle, even the ones that are available by default, are actually plugins. The only difference between built-in and third-party plugins could be the available support: If you buy a third-party plugin (for instance a SugarCRM connector), support is provided by the third party, not by Pentaho.

Data Transformations

A good deal of the work involved with an ETL project has something to do with transforming data. Between acquisition and delivery, the data needs to be validated, joined, split, combined, transposed, sorted, merged, cloned, de-duplicated, filtered, deleted, replaced, and whatnot. It's hard to tell what the minimum set of available transformations should be because data transformation requirements differ greatly between organizations, projects, and solutions. Nevertheless, there seems to be a common denominator of basic functions that most of the leading ETL tools (including Kettle) offer:

■ Slowly Changing Dimension support
■ Lookup values
■ Pivot and unpivot
■ Conditional split
■ Sort, merge, and join
■ Aggregate

The only difference between tools is the way these transformations need to be defined. Some tools, for instance, offer a standard SCD (Slowly Changing Dimension) transformation in a single step, while others generate the needed transformations with a wizard. Even Kettle doesn't cover all transformation requirements out of the box. A good example of a missing component is a hierarchy flattener. By hierarchy we mean a single table that refers to itself, for instance an employee table where each employee record has an employee ID and a manager ID. The manager ID in the employee record points to an employee ID of another employee who is the manager. Oracle has had a standard "connect by prior" function to cope with this for ages and some ETL tools have a similar feature; in Kettle you'd have to manually handle this issue.

Testing and Debugging

This requirement hardly needs further explanation. Even though an ETL solution is not (at least, we hope not) written in a programming language such as Java or C++, it can be looked at as such. This means that what applies to application programming also applies to ETL development: Testing should be an integral part of the project. To be able to test, you need test cases that cover any possible (or at least, the most likely) scenario in a "what if" kind of way. The following are some examples of such scenarios:

■ What if we don't get the data delivered on time?
■ What if the process breaks halfway through the transformation?
■ What if the data in column XYZ contains NULL values?
■ What if the total number of rows transformed doesn't match the total number of rows extracted?
■ What if the result of this calculation doesn't match the total value retrieved from another system?

Again, the message here is to design for failure. Don't expect that things will work; just assume that things will fail at some point. When designing tests, it's important to differentiate between black box testing (also known as functional testing) and white box testing. In the case of the former, the ETL solution is considered a black box where the inner workings are not known to the tester. The only known variables are the inputs and the expected outputs. White box testing (also known as structural testing), on the other hand, specifically requires that the tester knows the inner workings of the solution and develops tests to check whether specific transformations behave as expected. Both methods have their advantages and disadvantages, which are covered in Chapter 11. Debugging is an implicit part of white box testing and enables a developer or tester to run a program step by step to investigate what exactly goes wrong at what point. Not all ETL tools offer extensive debugging functionality where you can step through a transformation row by row, inspecting individual rows and variable allocations. Kettle offers extensive debugging features for both jobs and transformations, as covered in Chapter 11.

Lineage and Impact Analysis

A mandatory feature of any ETL tool is the ability to read the metadata to extract information about the flow of data through the different transformations. Data lineage and impact analysis are two related features that are based on the ability to read this metadata. Lineage is a backward-looking mechanism that shows, for any data item, where it came from and which transformations were applied to it. This includes calculations and new mappings, such as when price and quantity are used as input fields to calculate revenue. Even if the fields price and quantity are omitted from further processing, the data lineage function should reveal that the field revenue is actually based on price and quantity.

Impact analysis works the other way around: Based on a source field, the impact on the subsequent transformations and ultimately, destination tables is revealed. You can find in-depth coverage of these subjects in Chapter 14.

Logging and Auditing

The data in the data warehouse needs to be reliable and trustworthy because that's one of the purposes of a data warehouse: to provide an organization with a reliable source of information. To guarantee this trustworthiness and have a system of record for all data transformations, the ETL tool should provide facilities for logging and auditing. Logging takes care of recording all the steps that are executed when an ETL job is run, including the exact start and end timestamps for every step. Auditing facilities create a complete trace of the actions performed on the data, including the number of rows read, the number of rows transformed, and the number of rows written. This is a topic where Kettle actually leads the market, as you will see in Chapters 12 and 14.

Summary

This chapter introduced ETL and its history, and explained why data integration is needed. The basic building blocks of a Kettle solution were introduced briefly to give you a feeling for what will be covered in the rest of the book. We also explained the differences and similarities between ETL, ELT, and EII and showed the advantages of each method. We presented the major challenges you might face when developing ETL solutions:

■ Getting business sponsorship and creating a business case
■ Choosing a good methodology to guide your work
■ Designing ETL solutions
■ Data acquisition and the problem with spreadsheets
■ Handling data quality issues using profiling and validation

Finally, we highlighted the general requirements of an ETL tool and briefly described how Kettle meets the requirements for the following:

■ Connectivity
■ Platform independence and scalability
■ Design flexibility and component reuse
■ Extensibility
■ Data transformations
■ Testing and debugging
■ Lineage and impact analysis
■ Logging and auditing

All the topics introduced in this chapter are covered extensively in the rest of this book.

Chapter 2: Kettle Concepts

In this chapter we cover the various concepts behind Kettle. We take a look at the general design principles and describe the various data integration building blocks. First we show you how row-level data integration is performed using transformations. Then we explain how you can handle basic workflow using jobs. You will learn about the following Kettle concepts:

■ Database connections
■ Tools and utilities
■ Repositories
■ Virtual File Systems
■ Parameters and variables
■ Visual programming

Design Principles

Let’s start with a look at some of the core design principles that have been put in place since the very beginning of Kettle’s development. Previous negative experiences with other tools and frameworks obviously colored the decisions that have been taken. However, it’s especially interesting to look at the positive things that were retained from these experiences.


■ Ease of development: It's clear that as a data warehouse and ETL developer, you want to spend your time on the creation of a business intelligence solution. Every hour you spend on the installation of software is wasted. The same principle applies to configuration as well. For example, when Kettle came on the market, pretty much every Java-based tool that existed forced the user to explicitly specify the Java driver class name and JDBC URL just to create a database connection (a small example of that boilerplate appears after this list). This is not the sort of problem that can't be overcome with a few Internet searches, but it is something that draws attention away from the real issues. Because of that, Kettle has always tried to steer clear of these types of problems.
■ Avoiding the need for custom program code: In general, an ETL tool needs to make simple things simple and hard things possible. That means that such a tool needs to provide standard building blocks to perform those tasks and operations that are repeatedly required by the ETL developer. There is no barrier to programming something in Java or even JavaScript if it gets the job done. However, every line of code adds complexity and a maintenance cost to your project, so it makes a lot of sense not to have to deal with it.
■ All functionality is available in the user interface: There are very few exceptions to this golden principle. (The kettle.properties and shared.xml files in the Kettle home directory are the exceptions.) If you don't expose all functionality through a user interface, you are wasting the time of both the developer and the end user, because expert ETL users still need to learn the hidden features behind the engine. In Kettle, ETL metadata can be expressed in the form of XML, via a repository, or by using the Java API. One hundred percent of the features in these forms of ETL metadata can be edited through the graphical user interface.
■ No naming limitation: ETL solutions are full of names: database connections, transformations and their steps, data fields, jobs, and so on all need to have a proper name. It's no fun creating ETL functionality if you have to worry about naming restrictions (such as length and choice of characters) imposed by your ETL tool. Rather, the ETL tool needs to be clever enough to deal with whatever identifier the ETL developer sees fit. This in turn allows the ETL solution to be as self-descriptive as possible, partly reducing the need for extra documentation and further reducing the ever-present maintenance cost of a project.
■ Transparency: Any ETL tool that doesn't let you see how a certain piece of work is done suffers from a lack of transparency. After all, if you were to write the same functionality yourself, you would know exactly what is going on, or you could at least figure it out. Allowing people to see what is going on in the various parts of a defined ETL workload is crucial. This, in turn, will speed up development and lower maintenance costs. The various parts of an ETL workload should not have a direct influence on one another; they should also pass data in the same order as defined. This principle of data isolation has a big impact on transparency that can only be appreciated by people who have worked with ETL tools that didn't enforce this principle.

■ Flexible data paths: For an ETL developer, creativity is extremely important. Creativity allows you to not only enjoy your work but also find the quickest path to a certain ETL solution. Kettle was designed from the ground up to be as flexible as possible with respect to the data paths that can be put in place. Distributing or copying data among various targets such as text files and relational databases should be possible. The inverse, merging data from various data sources, should also be very simple and part of the core engine.
■ Only map impacted fields: While some find it visually pleasing to see hundreds of arrows map input and output fields in various ETL tools, it is also a maintenance nightmare in a lot of cases. Every building block in the ETL workload adds to the maintenance cost because fields are added and changed all the time during the development of any ETL solution. An important core principle of Kettle is that all fields that are not specified are automatically passed on to the next building block in the ETL workload. This massively reduces the maintenance cost. It means, for example, that source fields can be added to the input and they will automatically show up in the output unless you take measures to prevent this.
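To put the ease-of-development point in perspective, the snippet below shows roughly the connection boilerplate a hand-written Java/JDBC program needs. The driver class, URL, and credentials are made-up examples and differ per database vendor; Kettle's connection dialog reduces all of this to filling in a handful of fields.

import java.sql.Connection;
import java.sql.DriverManager;

public class PlainJdbcBoilerplate {
    public static void main(String[] args) throws Exception {
        // With older JDBC drivers the driver class had to be loaded explicitly,
        // and the URL syntax is different for every database vendor.
        Class.forName("org.postgresql.Driver");

        Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://dbserver:5432/warehouse",   // vendor-specific URL
                "etl_user", "secret");

        System.out.println("Connected: " + !conn.isClosed());
        conn.close();
    }
}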

The Building Blocks of Kettle Design

Every data integration tool uses different names to identify the various parts of the tool, its underlying concepts, and the things you can build with it, and Kettle is no exception. This section introduces and explains some of the Kettle-specific terminology. After reading this section you will not only have an understanding of how data is transformed on a row level in a transformation and how workflow is handled with jobs, but you will also learn about details like data types and data conversion options.

Transformations

A transformation is the workhorse of your ETL solution. It handles the manipulation of rows of data in the broadest possible meaning of the extraction, transformation, and loading acronym. It consists of one or more steps that perform core ETL work such as reading data from files, filtering out rows, data cleansing, or loading data into a database. The steps in a transformation are connected by transformation hops. The hops define a one-way channel that allows data to flow between the steps that are connected by the hop. In Kettle, the unit of data is the row, and a data flow is the movement of rows from one step to another step. Another word for such a data flow is a record stream. Figure 2-1 shows an example of a transformation in which data is read from a database table and written to a text file. In addition to steps and hops, transformations can also contain notes. Notes are little boxes that can be placed anywhere in a transformation and can contain arbitrary text. Notes are intended to allow the transformation to be documented.

Figure 2-1: A simple transformation example

Steps

A step is a core building block in a transformation. It is graphically represented in the form of an icon; Figure 2-1 shows two steps, “Table input” and “Text file output.” Here are some of the key characteristics of a step:

■ A step needs to have a name that is unique in a single transformation.
■ Virtually every step is capable of reading as well as writing rows of data (the single exception is the Generate Rows step, which only writes data).
■ Steps write data to one or more outgoing hops. To the step that is connected at the other end of the hop, that hop is an incoming hop. Steps read data arriving through the incoming hops.
■ Most steps can have multiple outgoing hops. A step can be configured to either distribute or copy data to its outgoing hops. When distributing data, the step alternates between all outgoing hops for each outbound row (this is known as a round robin). When copying data, each row is sent to all outgoing hops.
■ When running a transformation, one or more copies of each step are started, each running in its own thread. During the run, all step copies run simultaneously, with rows of data constantly flowing through their connecting hops.

Beyond these standard capabilities, each step obviously has a distinct functionality that is represented by the step type. For example, in Figure 2-1, the "Table input" step executes an SQL query to read data from a relational database and writes the data as rows to its outgoing hop(s), and the "Text file output" step reads rows from its incoming hop(s) and writes them to a text file.

Transformation Hops

A hop, represented by an arrow between two steps, defines the data path between the steps. The hop also represents a row buffer called a row set between two steps. (The size of the row sets can be defined in the transformation settings.) When a row set is full, the step that writes the rows halts until there is room again. When a row set is empty, the step doing the reading will wait a bit until rows are available again.
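The row-set behavior just described is essentially a bounded producer/consumer buffer. The sketch below is not Kettle's actual implementation, only a minimal illustration of the same idea using a standard Java blocking queue: the writing step blocks when the buffer is full and the reading step blocks when it is empty.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class RowSetIllustration {
    public static void main(String[] args) throws InterruptedException {
        // A tiny "row set": at most 10 rows may be buffered between the two steps.
        final BlockingQueue<String> rowSet = new ArrayBlockingQueue<String>(10);
        final String END_OF_DATA = "__END__";

        // "Writing" step: produces rows and blocks when the row set is full.
        Thread writer = new Thread(new Runnable() {
            public void run() {
                try {
                    for (int i = 1; i <= 100; i++) {
                        rowSet.put("row-" + i);   // blocks while the buffer is full
                    }
                    rowSet.put(END_OF_DATA);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });

        // "Reading" step: consumes rows and blocks when the row set is empty.
        Thread reader = new Thread(new Runnable() {
            public void run() {
                try {
                    String row;
                    while (!(row = rowSet.take()).equals(END_OF_DATA)) { // blocks while empty
                        System.out.println("processed " + row);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });

        writer.start();
        reader.start();
        writer.join();
        reader.join();
    }
}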

NOTE While creating new hops, please remember that loops are not allowed in transformations. That is because a transformation heavily depends on the previous steps to determine the field values that are passed from one step to another.

Parallelism

The simple rules enforced by the hops allow steps to be executed in parallel in separate threads. Rows are forced through the transformation step network, resulting in a maximum amount of parallelism. The rules also allow data to be processed in a streaming fashion with minimal memory consumption. In data warehousing, you are often dealing with massive amounts of data, so this is a core requirement for any serious ETL tool. As far as Kettle is concerned, it is not possible to define an order of execution, and it is not possible or necessary to identify any start or end to a transformation. This is because all steps are executed in parallel: when a transformation is started, all its steps are started and keep reading rows from their incoming hops and pushing out rows to their outgoing hops until there are no more rows left, terminating the step's run. The transformation as a whole stops after the last step has terminated. That said, functionally a transformation almost always does have a definite start and an end. For example, the transformation shown in Figure 2-1 "starts" at the "Table input" step (because that step generates rows) and "ends" at the "Text file output" step (because that step writes the rows to file and does not lead them into another subsequent step for further processing). The remarks made earlier about the (im)possibility of identifying a transformation's start and end may seem paradoxical or even contradictory. In reality, it isn't that complicated; it's just a matter of perspective. Although you may envision an individual row flowing through the transformation, visiting subsequent steps and thus following a definitive path from start to end, many rows are flowing through the transformation during the run. While the transformation is running, all steps are working simultaneously, in such a way that it is impossible to pinpoint which step the transformation has progressed to. If you need to perform tasks in a specific order, refer to the "Jobs" section later in this chapter.

Rows of Data

The data that passes from step to step over a hop comes in the form of a row of data. A row is a collection of zero or more fields that can contain the data in any of the following data types:

■ String: Any type of character data without any particular limit.
■ Number: A double precision floating point number.
■ Integer: A signed long integer (64-bit).
■ BigNumber: A number with arbitrary (unlimited) precision.
■ Date: A date-time value with millisecond precision.
■ Boolean: A Boolean value can contain true or false.
■ Binary: Binary fields can contain images, sounds, videos, and other types of binary data.

Each step is capable of describing the rows that are being put out. This row description is also called row metadata. It contains these pieces of information:

■ Name: The name of the field should be unique in a row.
■ Data type: The data type of the field.
■ Length: The length of a String or a BigNumber data type.
■ Precision: The decimal precision of a Number or a BigNumber data type.
■ Mask: The representation format (or conversion mask). This comes into play if you convert numeric (Number, Integer, BigNumber) or Date data types to String. This happens, for example, during data preview in the user interface or during serialization to text or XML.
■ Decimal: The decimal symbol in a number. This symbol is culturally defined and is typically either a dot (.) or a comma (,).
■ Group: The grouping symbol. This symbol is also culturally defined and is typically either a comma (,), a dot (.), or a single quotation mark (').
■ Step origin: Kettle keeps track of the origin of a field in the row metadata. This allows you to quickly identify at which step in the transformation the field was last modified.

Here are a few data type rules to remember when designing your transformations:

■ All the rows in a row set always need to have the same layout or structure. This means that when you lead outgoing hops from several different steps to one receiving step, the rows of each of these hops need to have the same fields, with the same data types, and in the same order.
■ Beyond the data type and name, field metadata is not enforced during the execution of a transformation. This means that a String is not automatically cut to the specified length and that floating point numbers are not rounded to the specified precision. This functionality is explicitly available in a few steps.
■ By default, empty Strings ("") are considered to be the equivalent of NULL (empty).

NOTE The behavior that empty strings and NULL are considered equivalent can be changed by setting the KETTLE_EMPTY_STRING_DIFFERS_FROM_NULL variable. Further details can be found in Appendix C.

Data Conversion

Data conversion takes place either explicitly in a step like Select Values, where you can change the data type of a field, or implicitly, for example when you are storing numeric data in a VARCHAR column in a relational database. Both types of data conversion are handled in the exact same way: by using a combination of the data and the description of the data.

Date to String Conversion The internal 9ViZ representation contains all the information you need to represent any date/time with millisecond precision. To convert between the Hig^c\ and 9ViZ data types you only need to specify a conversion mask. For information on 9ViZ and I^bZ formats, see from the table under “Date and Time Patterns” in the Sun Java API documentation, located at ]iie/$$_VkV#hjc#Xdb$_'hZ$&#)#'$YdXh$Ve^$_VkV$ iZmi$H^beaZ9ViZ;dgbVi#]iba. The mask can contain any of the Letter codes shown in that table for representation purposes. Other characters can be included if they are put between single quotes. For example, Table 2-1 shows a few popular examples of date conversion masks for December 6, 2009 at 21 hours, 6 minutes and 54.321 seconds.

Table 2-1: Date Conversion Examples

CONVERSION MASK (FORMAT)        RESULT
yyyy/MM/dd'T'HH:mm:ss.SSS       2009/12/06T21:06:54.321
h:mm a                          9:06 PM
HH:mm:ss                        21:06:54
M-d-yy                          12-6-09
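These masks follow the java.text.SimpleDateFormat pattern syntax, so the results in Table 2-1 can be reproduced directly in plain Java:

import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;

public class DateMaskExamples {
    public static void main(String[] args) {
        // December 6, 2009, 21:06:54.321 (local time).
        Calendar cal = Calendar.getInstance();
        cal.set(2009, Calendar.DECEMBER, 6, 21, 6, 54);
        cal.set(Calendar.MILLISECOND, 321);
        Date date = cal.getTime();

        String[] masks = {"yyyy/MM/dd'T'HH:mm:ss.SSS", "h:mm a", "HH:mm:ss", "M-d-yy"};
        for (String mask : masks) {
            // Literal text (the T) is escaped with single quotes, as in Kettle.
            // Note that the AM/PM marker of "h:mm a" depends on the default locale.
            System.out.println(mask + " -> " + new SimpleDateFormat(mask).format(date));
        }
    }
}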

Numeric to String Conversion

Numeric data (Number, Integer, and BigNumber) is converted to and from String using these field metadata components:

■ Conversion mask
■ Decimal symbol
■ Grouping symbol
■ Currency symbol

The numeric conversion mask determines how a numeric value is represented in a textual format. It has no bearing on the actual precision or rounding of the numeric data itself. You can find all the allowed symbols and formatting rules in the Java API documentation at http://java.sun.com/j2se/1.4.2/docs/api/java/text/DecimalFormat.html. Table 2-2 shows a few popular examples of numeric conversion masks.

Table 2-2: A Few Numeric Conversion Mask Examples

VALUE        CONVERSION MASK    DECIMAL SYMBOL   GROUPING SYMBOL   RESULT
1234.5678    #,###.##           .                ,                 1,234.57
1234.5678    000,000.00000      ,                .                 001.234,56780
-1.9         #.00;-#.00         .                ,                 -1.90
1.9          #.00;-#.00         .                ,                 1.90
12           00000;-00000                                          00012
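The masks use the java.text.DecimalFormat pattern syntax, and the decimal and grouping symbols are applied through DecimalFormatSymbols. A short sketch reproducing the first two rows of Table 2-2:

import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;

public class NumericMaskExamples {

    static String format(double value, String mask, char decimal, char grouping) {
        DecimalFormatSymbols symbols = new DecimalFormatSymbols();
        symbols.setDecimalSeparator(decimal);   // symbol used for the decimal point
        symbols.setGroupingSeparator(grouping); // symbol used to group thousands
        return new DecimalFormat(mask, symbols).format(value);
    }

    public static void main(String[] args) {
        // Row 1 of Table 2-2: decimal ".", grouping ","  ->  1,234.57
        System.out.println(format(1234.5678, "#,###.##", '.', ','));

        // Row 2 of Table 2-2: decimal ",", grouping "."  ->  001.234,56780
        System.out.println(format(1234.5678, "000,000.00000", ',', '.'));
    }
}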

Other Conversions

Table 2-3 provides a short list of other conversions that occur between data types.

Table 2-3: Other Data Type Conversions

FROM              TO        DESCRIPTION
Boolean           String    Converted to Y or N, unless the length is 3 or higher; in that case the result is true or false.
String            Boolean   A case-insensitive comparison is made: Y, True, Yes, and 1 all convert to true. Any other String is converted to false.
Integer <-> Date            The Integer long value is considered to be the number of milliseconds that passed since January 1, 1970, 00:00:00 GMT. For example, September 12th 2010 at noon is converted to Integer 1284112800000 and vice-versa.
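The Integer/Date rule is simply the standard Java epoch-millisecond convention. A quick sketch with a deliberately simple value, formatted in GMT to keep the output timezone-independent:

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class EpochConversionExample {
    public static void main(String[] args) {
        // A long value interpreted as milliseconds since January 1, 1970, 00:00:00 GMT.
        long millis = 86400000L;           // exactly one day worth of milliseconds
        Date date = new Date(millis);      // Integer -> Date

        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        fmt.setTimeZone(TimeZone.getTimeZone("GMT"));  // avoid local-time surprises

        System.out.println(millis + " -> " + fmt.format(date)); // 1970-01-02 00:00:00
        System.out.println(date.getTime());                     // Date -> Integer: 86400000
    }
}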

Jobs

In most ETL projects, you need to perform all sorts of maintenance tasks. For example, you want to define what needs to be done in case something goes wrong and how files need to be transferred; you want to verify if database tables exist, and so on. It's also important that these tasks be performed in a certain order. Because transformations execute all steps in parallel, you need a job to handle this. A job consists of one or more job entries that are executed in a certain order. The order of execution is determined by the job hops between job entries as well as the result of the execution itself. Figure 2-2 shows a typical job that handles the loading of a data warehouse. Like a transformation, a job can contain notes for documentation purposes.

Figure 2-2: A typical job that updates a data warehouse

Job Entries

A job entry is a core building block of a job. Like a step, it is also graphically represented in the form of an icon. However, if you look a bit closer, you see that job entries differ in a number of ways:

■ While the name of a new job entry needs to be unique, it is possible to create so-called shadow copies of a job entry. This allows you to place the same job entry in a job in multiple locations. The shadow copy is backed by the same information, so if you edit one copy, you edit them all.
■ A job entry passes a result object between job entries. While this result object can contain rows of data, they are not passed in a streaming fashion. Rather, after the job entry finishes, the rows are transferred all at once to the subsequent job entry.
■ All job entries are executed in a certain sequence by default. Only in special cases are they executed in parallel.

Because a job executes job entries sequentially, you must define a starting point. This starting point comes in the form of the special job entry called Start. As a consequence, you can only put one Start entry in any one job.

Job Hops

Job hops are used in a job to define an execution path between job entries. This is done in the form of the link between two job entries as well as a result evaluation type. This evaluation type can be any of these:

■ Unconditional: The next job entry is executed no matter what happened in the previous one. This evaluation type is indicated by a lock icon over a black hop arrow, as shown in Figure 2-2.
■ Follow when result is true: This job hop path is followed when the result of the previous job entry execution was true. This typically means that it ran without a problem. This type is indicated by a green success icon drawn over a green hop arrow.
■ Follow when result is false: This job hop path is followed when the result of the previous job entry execution was false, or unsuccessful. This is indicated by a red stop icon drawn over a red hop arrow.

The evaluation type can be set by using the hop's right-click menu or by cycling through the options by clicking the small hop icons.

Multiple Paths and Backtracking

Job entries are executed using a so-called backtracking algorithm. The path itself is determined by the actual outcome (success or failure) of the job entries. However, a backtracking algorithm means that a path of job entries is always followed until the very end before a next possibility is considered. Consider the example shown in Figure 2-3.

Figure 2-3: Serial execution of multi-paths using back-tracking

In the situation shown in Figure 2-3, job entries A, B, and C are executed in sequence:

■ First the "Start" job entry searches all next job entries and finds "A" and "C".
■ "A" is executed.
■ "A" then searches all next job entries and finds "B".
■ "B" is executed.
■ "B" searches all the next job entries but comes up blank.
■ "A" has no further job entries to execute.
■ "Start" has one more job entry to execute: "C".
■ "C" is executed.
■ "C" searches all next job entries but comes up blank.
■ "Start" has no more job entries to execute.
■ The job finishes.

However, because no order is defined, it could have been CAB as well. The backtracking nature of a job is important for two main reasons:

■ The result of a job itself (for nesting purposes) is taken from the last job entry that was executed. Because the execution order of the job entries could be ABC or CAB, we have no guarantee that the result of job entry C is taken; it could just as well be B.
■ In cases where you create loops (which is allowed in a job) you will be putting a burden on the application stack, as all the job entries and their results will be kept in memory for evaluation. See Chapter 15 for more information on this topic.
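The traversal described above is a plain depth-first walk over the job entry graph. The following sketch is only an illustration of that ordering, not Kettle's actual job engine; the entry names and the always-true success flag are assumptions made for the example:

import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BacktrackingIllustration {

    // A tiny job graph: Start -> A -> B, and Start -> C (as in Figure 2-3).
    static final Map<String, List<String>> NEXT_ENTRIES = new HashMap<String, List<String>>();
    static {
        NEXT_ENTRIES.put("Start", Arrays.asList("A", "C"));
        NEXT_ENTRIES.put("A", Arrays.asList("B"));
        NEXT_ENTRIES.put("B", Collections.<String>emptyList());
        NEXT_ENTRIES.put("C", Collections.<String>emptyList());
    }

    static boolean execute(String entry) {
        System.out.println("executing job entry " + entry);
        boolean result = true;  // assume success; a real engine would evaluate hop conditions

        // Backtracking: follow each outgoing path to its very end
        // before considering the next possibility.
        for (String next : NEXT_ENTRIES.get(entry)) {
            result = execute(next);
        }
        return result;          // the job result comes from the last entry executed
    }

    public static void main(String[] args) {
        execute("Start");       // prints the order: Start, A, B, C
    }
}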

Parallel Execution

Sometimes it is necessary to execute job entries or entire jobs in parallel. This is possible, too. A job entry can be told to execute the next job entries in parallel, as shown in Figure 2-4.

Figure 2-4: Parallel execution of job entries

In the example in Figure 2-4, job entries A and C are started at the same time. Please note that in instances where you have a number of sequential job entries, these are executed in parallel as well. For example, take the case illustrated in Figure 2-5.

Figure 2-5: Two simultaneous sequences of job entries

In this situation, the job entries [A, B, Log1] and [C, D, Cleanup Tables] are executed in parallel in two threads. Usually this is a desired behavior. However, sometimes you want to have a few job entries executed in parallel and then continue with other work. In that situation, it’s important to put all the job entries that need to be executed in parallel in a new job. This job can then be executed in another job, as shown in Figure 2-6.

Figure 2-6: Parallel load as part of a larger job

Job Entry Results

The result of the execution of a job entry not only determines which job entry will be executed next; it also passes the result object itself to the next job entry. The result object contains the following pieces of information:

■ A list of result rows: These result rows can be set in a transformation using the "Copy rows to result" step. They can be read back using the "Get rows from result" step. Certain job entries such as Shell, Transformation, or Job have an option to loop over the result rows, allowing for advanced parameterization of transformations or jobs.
■ A list of file names: These file names are assembled during the execution of a job entry. The list contains the names of all the files a job entry comes in contact with. For example, if a transformation read and processed ten XML files, the names of the files read will be present in the result object. It is possible to access these file names in a transformation using the "Get files from result" step. This step also exposes the type of file: type GENERAL is associated with all forms of input and output files, whereas the LOG type is reserved for Kettle log files that are written.
■ The number of lines read, written, input, output, updated, deleted, rejected, and in error by a transformation: For more information on how to configure these metrics to be passed, please see Chapter 14.
■ The exit status of a Shell job entry: This allows you to evaluate the return value of a shell script.

In addition to simple usage such as looping and file handling in certain job entries, it is also possible to perform more advanced evaluations using the JavaScript job entry. Table 2-4 describes the exposed objects and variables.

Table 2-4: Expressions You Can Use in the JavaScript Job Entry

EXPRESSION                                              DATA TYPE  MEANING
previous_result.getResult                               boolean    true if the previous job entry was executed successfully, false if there was some error.
previous_result.getExitStatus or exit_status            int        Exit status of the previous shell script job entry.
previous_result.getEntryNr or nr                        int        The entry number; it is increased every time a job entry is executed.
previous_result.getNrErrors or errors                   long       The number of errors, also available as the variable errors.
previous_result.getNrLinesInput or lines_input          long       The number of rows read from a file or database.
previous_result.getNrLinesOutput or lines_output        long       The number of rows written to a file or database.
previous_result.getNrLinesRead or lines_read            long       The number of rows read from previous steps.
previous_result.getNrLinesUpdated or lines_updated      long       The number of rows updated in a file or database.
previous_result.getNrLinesWritten or lines_written      long       The number of rows written to the next step.
previous_result.getNrLinesDeleted or lines_deleted      long       The number of deleted rows.
previous_result.getNrLinesRejected or lines_rejected    long       The number of rows rejected and passed to another step via error handling.
previous_result.getRows                                 List       The result rows.
previous_result.getResultFilesList                      List       The list of all the files used in the previous job entry (or entries).
previous_result.getNrFilesRetrieved or files_retrieved  int        The number of files retrieved from FTP, SFTP, and so on.

In the JavaScript job entry, you can use these expressions to specify alternative conditions that determine whether or not a certain path should be followed. For example, it is possible to count the number of rejected rows in a transformation. If that number is higher than 50, you could consider the transformation failed, even though this is not the standard behavior. The script would be:

lines_rejected <= 50

Transformation or Job Metadata

Transformations and jobs are core building blocks in the Kettle ETL tool. As discussed, they can be represented in XML, stored in a repository, or expressed in the form of the Java API. This makes it crucial to highlight a few core metadata properties:

■ Name: The name of the transformation or job. Even though this is not an absolute requirement, we recommend that you use a name that is unique not only within a project, but even among various ETL projects and departments. This will help during remote execution or storage in a central repository.
■ Filename: This is the file name or URL where the transformation or job is loaded from. This property is set only when the object is stored in the form of an XML file. When loaded from a repository, this property is not set.
■ Directory: This is the directory (folder) in a Kettle repository where the transformation or job is loaded from. When it is loaded from an XML file, this property is not set.
■ Description: You can use this optional field to give a short description of what the transformation or job does. If you use a Kettle repository, this description will be shown in the file listing of the Repository explorer dialog.
■ Extended description: Another optional field that you can use to give an extended description of the transformation or job.

Database Connections

Kettle database connections are used by transformations and jobs alike to connect to relational databases. Kettle database connections are actually database connection descriptors: they are a recipe that can be used to open an actual connection with a database. The actual connection with the RDBMS is only made at runtime; the act of defining the Kettle database connection by itself does not open a connection to the database. Unfortunately, very few databases behave in exactly the same way. As a result, we've seen a growing number of options appearing in the Database Connection dialog, shown in Figure 2-7, to cover all the possibilities.

Figure 2-7: The Database Connection dialog

You need to specify three main options in the Database Connection dialog:

■ Connection Name: The name identifies a connection. As such, it needs to be a unique name in any job or transformation.
■ Connection Type: Select the type of relational database server that you want to access from the list. Your choice will determine the Settings and Access options that will be available. It will also influence what SQL dialect certain Kettle steps and job entries will use to generate SQL statements.
■ Access: From this list, you select the type of access you want to use. Typically, we recommend that you use a Native (JDBC) connection. However, it is also possible to use a predefined ODBC data source (DSN), a JNDI data source, or an Oracle OCI (using Oracle naming services) connection.

Depending on the options you selected, you will be presented with a set of database-specific options to fill in on the right side of the dialog. For example, in Figure 2-7 you can see Oracle-specific tablespace options that will only appear for that specific database type. These typically include the following options:

■ Host Name: This is the hostname or IP address of the database server.
■ Database Name: For database servers that support multiple databases, you can specify the name of the database to access here.
■ Port Number: The default port for the selected connection type will be filled in. Make sure to verify that this value is correct.
■ User Name and Password: The user name and password you want to use to log in to the database server.

Special Options

For most users, the default settings of the database connections available from the General tab of the dialog are sufficient. However, once in a while you might need to use one of the options in the Advanced section of the dialog:

■ Supports boolean data type: The Boolean (or bit) data type is treated differently in most types of database servers. Even between different versions of the same connection type there are important differences to be found. A lot of databases don't have any support for Boolean data types at all. For this reason, Kettle defaults to the population of a single character (char(1)) field with either Y or N for Boolean fields in a transformation. However, if you enable this option, Kettle will generate the correct SQL dialect for the chosen database type if the database supports a Boolean data type.
■ Quote all in database: This forces all identifiers (table and field names) to be quoted. This is useful if you suspect that Kettle's internal list of reserved keywords is not up-to-date or if you want to retain the case of identifiers in the database.
■ Force all to lower case: This option converts all identifiers (table and field names) to lowercase.
■ Force all to upper case: This option converts all identifiers (table and field names) to uppercase.
■ The preferred schema name: You can specify the preferred schema (also called catalog in certain databases) name for this connection. This schema is used when no other value is known.
■ SQL statements to execute after connecting: These SQL statements can be used to modify certain connection parameters. For example, they are used to pass session license–related information to the database or to enable certain debugging features.

In addition to these advanced options, a number of database-specific features can be specified in the Options section of the dialog. There you can specify a list of driver-specific parameters. For a few database types (MySQL, for example) Kettle will set a few default values to help out. For a detailed list of options, consult the documentation of the database driver that you are using. For a number of database connection types, Kettle will present this documentation in the form of a browser page that appears when you press the Help button on the Options tab of the dialog. Finally, it is possible to enable the Apache Commons database connection pooling back end that ships with Kettle. Doing so makes sense only in situations where you are running lots of very small transformations or jobs with short-lived database connections. It will not limit the number of simultaneous database connections in any way. See also Chapter 15 for more information.

The Power of the Relational Database

Relational databases are advanced pieces of software that specialize in areas such as joining, sorting, and merging data. Compared to a streaming data engine such as Kettle, they have one big advantage: the data they reference is already stored on disk. When a relational database performs a join or sort operation, it can simply use references to the data and avoid loading all the data in memory. This gives it a distinct performance advantage. The drawback, obviously, is that loading the data into a relational database is a separate performance bottleneck. For ETL developers, this means that, when possible, it is usually beneficial to let the database perform a join or sort operation. If it is not possible to perform a join in the database because the data comes from different sources, you can still have the database perform the sort operation to prepare for the join in the ETL tool.

Connections and Transactions

A Kettle database connection is only used during the execution of a job or transformation. In a job, every job entry opens and closes the actual connections to the database independently. The same happens in a transformation. However, because of the parallel nature of a transformation, each step that uses a Kettle database connection will open a separate actual database connection and start a separate transaction. While this generally provides great performance for most common situations, it can cause severe locking and referential integrity problems when different steps update information in the same table. To remedy the problems that may arise from opening multiple actual database connections, Kettle allows you to enable the "Make transformation database transactional" option. This option can be set for the entire transformation in the Transformation Settings dialog. When this option is enabled, the transformation opens a single actual database connection for each different Kettle database connection. Furthermore, Kettle will perform a COMMIT when the transformation finishes successfully and a ROLLBACK if this is not the case.

Database Clustering

When one large database is no longer up to the task, you can think about using multiple smaller databases to handle the load. One way to spread the load is by partitioning the data with a technique that is called database partitioning or database sharding. With this method, you divide the entire dataset into a number of groups called partitions (or shards), each of which is stored in a separate database server instance. This architecture has the distinct advantage that it can greatly reduce the number of rows per table and per database instance. The combination of all the shards is called a database cluster in Kettle. Typically, the partitioning method involves calculating the remainder of a division on an identifier to determine the target database server instance that stores the particular shard. This and other partitioning methods are available in Kettle (see Chapter 16 for more information). The result of this partitioning calculation is a number between 0 and the number of partitions minus one. If you want to use this feature, this is the number of actual database connections that you need to specify in the Clustering section of the database connection dialog. For example, suppose you have defined five database connections to five different shards in a cluster. You could then execute a query in a Table Input step that you are executing in a partitioned fashion, as shown in Figure 2-8.

Figure 2-8: Reading information from a database cluster

As a result, the same query will be executed five times against the five shards. All steps in Kettle that open database connections have been modified to make use of this partitioning feature. For example, the Table Output step will make sure that the correct row arrives at the correct shard when executed in a partitioned fashion.
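The partitioning rule described above (the remainder of a division on an identifier) can be sketched in a few lines. The five connection names below are made up and merely stand in for the shard connections you would define in the Clustering section of the dialog:

public class ModuloPartitioningExample {

    private static final String[] SHARDS = {
            "shard_0", "shard_1", "shard_2", "shard_3", "shard_4" };

    // The remainder of dividing the identifier by the number of partitions
    // yields a number between 0 and (number of partitions - 1).
    static int partitionFor(long customerId) {
        return (int) (customerId % SHARDS.length);
    }

    public static void main(String[] args) {
        long[] customerIds = { 10, 11, 12, 13, 14, 15 };
        for (long id : customerIds) {
            System.out.println("customer " + id + " -> " + SHARDS[partitionFor(id)]);
        }
    }
}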

Tools and Utilities

Kettle contains a number of tools and utilities that help you in various ways and in various stages of your ETL project. The core tools of the Kettle software stack include:

■ Spoon: A graphical user interface that will allow you to quickly design and manage complex ETL workloads.
■ Kitchen: A command-line tool that allows you to run jobs.
■ Pan: A command-line tool that allows you to run transformations.
■ Carte: A lightweight (around 1MB) web server that enables remote execution of transformations and jobs. A Carte instance also represents a slave server, a key part of Kettle clustering (MPP).

Chapter 3 provides more detailed information on these tools.

Repositories

When you are faced with larger ETL projects with many ETL developers working together, it's important to have facilities in place that enable cooperation. Kettle provides a way of defining repository types in a pluggable and flexible way. This has opened up the way for different types of repositories to be used. However, the basics will remain the same regardless of which repository type is used: They will all use the same graphical user interface elements and the stored metadata will also be the same. There are several types of repositories: database, Pentaho, and file repositories.

■ Database repository: The database repository was created to centrally store ETL information in a relational database. A repository of this type is easy to create: Simply create a new database connection to an empty database schema. If you don't have anything in that schema yet, you can use the Database Repository dialog to create a new set of repository tables and indexes. Because of this ease of use, the database repository has remained popular despite the lack of important version management and relational integrity features.
■ Pentaho repository: The Pentaho repository type is a plug-in that is installed as part of the Enterprise Edition of Pentaho Data Integration. It is backed by a content management system (CMS) that will make sure that all the important characteristics of the ideal repository are met, including version management and referential integrity checks.
■ File repository: This simplified repository type allows you to define a repository in any kind of folder. Because Kettle uses a Virtual File System back end, this includes non-trivial locations such as zip files, web servers, and FTP servers.

Regardless of the type, the ideal repository should possess these features:

- Central storage: Store your transformations and jobs in a central location. This will give ETL users access to the latest view on the complete project.

- File locking: This feature is useful to prevent other users from changing a transformation or job you are working on.
- Revision management: An ideal repository stores all previous versions of transformations and jobs for future reference. It allows you to open previous versions as well as see the detailed change log. For more information see Chapter 13.
- Referential integrity checking: This feature will verify the referential integrity of a repository. This capability is needed to make sure there won’t be any missing links, missing transformations, jobs, or database connections in a repository.
- Security: A secure repository prevents unauthorized users from changing or executing ETL logic.
- Referencing: It is very useful for a data integration developer to be able to reorganize transformations and jobs, or to be able to simply rename them. It should be possible to do this in such a way that references to these transformations and jobs are kept intact.

Virtual File Systems

Flexible and uniform file handling is very important to any ETL tool. That is why Kettle supports the specification of files in the broadest sense as URLs. The Apache Commons VFS back end that was put in place will then take care of the complexity for you. For example, with Apache VFS, it is possible to process a selection of files inside a .zip archive in exactly the same way as you would process a list of files in a local folder. For more information on how to specify VFS files, visit the Apache VFS website at http://commons.apache.org/vfs/. Table 2-5 shows a few typical examples.

Table 2-5: Examples of VFS File Specifications

FILENAME                                      DESCRIPTION
Filename: /data/input/customers.dat           This file is defined using the classic (non-VFS) way and will be found and read as such.
Filename: file:///data/input/customers.dat    This same file will be read from the local file system using the Apache VFS driver.
Job: http://www.kettle.be/                    This file can be loaded in Spoon, executed

Parameters and Variables

It’s very important when using a data integration tool that certain aspects of your work can be parameterized. This makes it easy to keep your work maintainable. For example, it’s important that the location of a folder with input files be specified in one central location so that the various components of the data integration tool can make use of that value. In Kettle, you do this with variables.

Defining Variables

Each variable has a unique name and contains a string of any length. Variables are set either at system level or dynamically in a job. They have a specific scope that makes it possible to run the same job or transformation in parallel with different variables set on the same system, Java Virtual Machine, or J2EE container without a problem. There are two main ways that a variable is instantiated: It can be set by the system or defined by the user. System variables include those defined by the Java Virtual Machine (such as java.io.tmpdir, the system’s location for temporary files) and those defined by Kettle (such as Internal.Kettle.Version, containing the version of Kettle you’re using).

NOTE For more information on the various types of system variables and how they influence your environment, please see Appendix C.

Variables can be set in multiple ways. The most common way of defining a variable is to place it in the kettle.properties file in the Kettle home directory (${KETTLE_HOME}/.kettle). The most convenient way of editing this file is by using the editor, located in version 4.0 under the Edit ➪ “Edit the kettle.properties file” menu. You can also set variables dynamically during the execution of a job. There are various ways to do this, but the easiest way to do it is with a transformation. The Set Variables step allows you to set a variable with the appropriate scope for use in the current job. For more dynamic situations you can also use a JavaScript step or the User Defined Java Class step. Figure 2-9 shows a job that first sets a number of variables after which it uses them in a sub-job.

Figure 2-9: Defining and using variables in a job
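For example, an entry in kettle.properties is a simple key/value pair on its own line; a variable such as the INPUT_FOLDER used in the examples that follow could be defined like this (the path shown is just an illustration):

    INPUT_FOLDER=/data/input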

Finally, variables can also be formally specified when you declare them as named parameters.

Named Parameters

If there are variables that you consider to be parameters for your transformation or job, you can declare them in the Parameters tab of the respective settings dialogs. Figure 2-10 shows the Parameters tab, where you can enter the parameter name, its default value, and a description.

Figure 2-10: Defining named parameters

The main advantage of using named parameters is that they are explicitly listed, documented with a description, and carry an optional default value. This makes it obvious that you can set values for them in a job or a transformation. Once you have defined parameters, you can specify values for them in all possible locations where you execute a transformation or job: in the execution dialogs, at the Pan or Kitchen command line, or in the Transformation or Job job entries. For example, you could run a transformation with a non-default value for the input folder location:

user@host: sh pan.sh -file:/pentaho/read-input-file.ktr -param:INPUT_FOLDER=/tmp/input/

You can find more details on using Kitchen and Pan and specifying parameter and variable values on the command line in Chapters 3 and 12.
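Kitchen accepts named parameters for jobs in the same way. As a hedged sketch (the job file name is made up for illustration; Chapter 12 documents the exact command-line syntax), such an invocation could look like this:

    sh kitchen.sh -file:/pentaho/load-customers.kjb -param:INPUT_FOLDER=/tmp/input/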

Using Variables

All input fields where variables can be used in Kettle are annotated with a diamond-shaped icon showing a red dollar sign on the upper-right of the input field. For example, the CSV Input step contains many fields that can be specified using variables, including the Filename, Delimiter, and Enclosure fields, as shown in Figure 2-11.

Figure 2-11: Using variables in Spoon

Entering variables is easy. Simply press Ctrl+Space to see the list of all variables that are known to Kettle at that time. If you hover over the input field, you will see the actual value of the complete expression. As you can see in the example, variables are recognizable by the enclosing ${} symbols. The name of the variable is always enclosed within the curly brackets, for example, ${INPUT_FOLDER}. Even though it is less common, you can also surround the name of the variable with double percent signs (%%), for example, %%INPUT_FOLDER%%. One final special form of variable notation is the specification of hexadecimal values. You can specify binary values by listing a comma-separated list of hexadecimal values inside the [] symbols (note that these are square brackets, not curly brackets). For example, the value [31,32,33,34,35,36] encodes the ASCII string 123456. This method of entering data is useful on the rare occasion when you need to specify non-readable binary “character” symbols like [01].

NOTE Variables are parsed at runtime in a recursive fashion. Because of this, it is possible to use variables in the value of another variable. This makes it easy to define variables in a very generic and re-usable fashion.

Visual Programming

Kettle can be categorized under the group of visual programming languages (VPLs) because it lets its users create complex ETL programs and workflows simply by the graphical construction of diagrams. The diagrams in the case of Kettle are transformations and jobs. Visual programming is a core concept of Kettle because it allows you to quickly set up complex ETL jobs and lowers maintenance. It also brings the world of IT closer to the business requirements by hiding a lot of non-essential technical complexities.

NOTE For more information on visual programming languages, see http://en.wikipedia.org/wiki/Visual_programming_language.

In this section, we show you how you can get started right away with Kettle. We do this with a simple example that reads data from a text file and stores it into a database. We do not intend this as a full primer or beginners’ guide to Kettle. Rather, it is an introduction to get started for those who are in a hurry. For a more detailed practical example of how you can solve complex ETL problems, see Chapter 4.

NOTE This chapter does not cover the installation or configuration of Kettle. If you have not yet installed Kettle, you’ll need to refer to Chapter 3 and install Kettle to follow along with the example.

Getting Started

Visual programming in Kettle is done with a graphical user interface called Spoon. Once you start Spoon on your system, you are presented with a welcome page leading you to information on how to get started, samples, documentation, and much more. In our case, we want to create a new transformation that will read data from a text file and store it into a database table. To do this, we click the “New file” icon (represented by a page of paper with a dog-eared corner) in the toolbar and select Transformation, as shown in Figure 2-12.

NOTE Note that the principles described in this section for transformations are applicable to jobs as well.

Figure 2-12: Creating a new transformation

When you create a new transformation, you are presented with an empty canvas on which you can design your ETL workload, as in Figure 2-13.

Figure 2-13: An empty canvas

Creating New Steps

On the left-hand side of the blank page, you can see all sorts of categories listed. Under these category headings, there are many steps that can be used to design the transformation. For this example, you are looking for a step that can read from a CSV file. Typing CSV in the quick search box next to the Steps label, as shown in Figure 2-14, will list all steps that might be useful.

Figure 2-14: Finding CSV-related steps

In this case, you want to use the “CSV file input” step to read data. To place it in your transformation, simply click on the “CSV file input” step in the Input category and drag it onto the canvas on the right. A new icon will appear on the blank page with a “CSV file input” label below. This icon, shown in Figure 2-15, represents the functionality of reading a CSV file.

Figure 2-15: A brand new step

To configure this step, double-click on the icon or click the Edit (pencil) icon in the graphical slide-out menu and select “Edit step.” The slide-out menu appears when you hover over a step icon for a few seconds, as shown in Figure 2-16.

Figure 2-16: The graphical slide-out menu

You will be presented with a dialog that allows you to edit all aspects of the step, as shown in Figure 2-17.

Figure 2-17: Configuring the “CSV file input” step

NOTE You can also access the dialog by selecting the step context menu (the downward arrow) from the graphical slide-out menu and selecting the “Edit step” option.

For this example, you want to provide Kettle with the location of the file you’re reading, the delimiter, the optional enclosure, and various other fields. Once that is done, you can test the step by clicking the Preview button that is found in most steps that read data. When you are satisfied with the output of the step, click the OK button and that part of the challenge will be completed.

Putting It All Together

The next step is to store the data from the CSV file in a database table. To accomplish this, you will use the quick search box again to look for a step that could accomplish this feat. Type in the word table and select the “Table output” step from the results. The mouse-over tooltip of that entry says: “Write information to a database table” and that is exactly what you want to do. Drag this step to the right of the “CSV file input” step that is already on the canvas. Your canvas should look like Figure 2-18.

Figure 2-18: Adding the “Table output” step

Before you configure a step, it’s usually a good idea to connect it with a hop to its predecessor. This is useful because it allows Kettle to know what kind of information is being sent to the step. That in turn allows for the automatic configuration of a lot of step options. The creation of a hop is easy: Simply click the Output icon (a page with a green arrow in the lower left corner) in the step’s graphical slide-out menu. That will cause a gray arrow to be drawn on the canvas that will turn blue once you move it to the “Table output” step, as shown in Figure 2-19.

Figure 2-19: Creating a new hop

Clicking the left mouse button once when the arrow is blue will create a new hop. (Other ways of creating new hops include dragging from one step icon to another with the middle mouse button or with the left mouse button while keeping the Shift key pressed.) Because the “CSV file input” step is capable of error handling, you are asked which type of output you want to handle with this hop. If read errors in the step (conversion errors, missing fields, and so on) are not handled, the whole transformation

will fail. However, if you use error handling, the rows with an error in them will be sent to another step where they can be placed in another file or database table, as shown in Figure 2-20.

Figure 2-20: Selecting the type of output

In this simple example, you want to use the main output of the “CSV file input” step, so select that option. Now both steps are connected with a hop. All that’s left to do now is edit the settings of the “Table output” step and you’re done. Again, you can double-click on the “Table output” step, use the slide-out menu, or use the context menu to bring up the step’s dialog. Figure 2-21 shows the dialog to configure the “Table output” step.

Figure 2-21: Configuring the “Table output” step

The first thing you notice when you open the dialog is that you need a database connection to make this step function properly. You can do this easily with the New button and the information in the “Database connections” section earlier in this chapter. You can then complete the required information in the dialog by specifying the schema and table name where the data is to go to. If the database table does not exist yet or needs to be altered, you can use the SQL button at any time to generate the appropriate DDL for the specified database type. When you are done, click OK and the transformation is finished. You can now click the Run icon (the green arrow pointing to the right) to execute the transformation. This will allow you to save the transformation and give it a name and description. While this is a simple example, a lot of technology was used, including file reading, writing to a database, data conversion, error handling, parallel processing, and much more. However, to make it work, you didn’t have to write a single line of programming code, nor was any code generated. The two main tasks, reading and writing, are clearly represented by the two icons on the transformation canvas, which makes it very easy to maintain. The transformation also addressed the complex issue of routing data simply with the creation of hops between steps. All this leads to a very short development time for your ETL solution.

Summary

In this chapter, we introduced some Kettle design considerations and explained the core Kettle building blocks. We finished with an introduction to visual programming in Kettle. Here are a few notable things you learned:

- Data, in the form of rows, is manipulated in a transformation by steps.
- Job entries are the basic components of a job. They are executed in sequence based on the result of a previous job entry.
- Database connections can be defined in transformations and jobs. You learned how to use the various parameters in the database dialog.
- There are different repository types and they each have specific benefits.
- You can use the Virtual File System to flexibly specify files in various locations.
- You can use variables and named parameters to make your transformations and jobs re-usable and easy to configure.
- You learned how to get started with visual programming and how to easily create a transformation.

CHAPTER 3

Installation and Configuration

This chapter provides a high-level overview of the collection of tools included in a Kettle installation, and provides detailed instructions for their installation and configuration. Fortunately, installing Kettle is a fairly straightforward task, but it is helpful to refer to a single overview of all tasks involved, which is what this chapter aims to provide.

NOTE If you’ve already successfully installed Kettle, it’s likely you’re already familiar with many of the topics discussed here. You may want to skim over some of this chapter.

In addition to providing an overview, this chapter covers a few detailed configuration topics that apply to particular real-world Kettle scenarios. These sections may not make a lot of sense until you encounter those scenarios, so the subsequent chapters of the book will refer back to the relevant sections of this chapter.

Kettle Software Overview

Kettle is a single product, but consists of multiple programs that are used in different phases of the ETL development and deployment cycle. Each program serves a particular purpose and is more or less independent of the others. However, all of the programs depend on a common set of Java archives that make up the actual data integration engine. An overview of the main Kettle programs is shown in Figure 3-1.


Figure 3-1: Overview of Kettle programs. (The diagram shows Spoon on a development workstation with a GUI; job and transformation definitions stored either as files on the file system or in a repository backed by a database or version control system; and, after deployment, Kitchen and Pan on servers with a command-line interface plus Carte running as an HTTP daemon on a scale-out cluster of servers.)

The following list briefly explains the purpose of each of the different programs listed in Figure 3-1:

- Spoon: The integrated development environment. Offers a graphical user interface for creating and editing job and transformation definitions. Spoon can also be used to execute and debug jobs and transformations, and it also includes functionality for performance monitoring.
- Kitchen: A command line–driven job runner, which can be used to integrate Kettle with OS-level scripts. It is typically used to schedule jobs with a scheduler such as cron, at, or the Windows Task Scheduler.
- Pan: A command line–driven program just like Kitchen, but it is used for executing transformations instead of jobs.

- Carte: A light-weight server (based on the Jetty HTTP server) that runs in the background and listens for requests to run a job. Carte is used to distribute and coordinate job execution across a collection of computers forming a Kettle cluster.

These programs are described in more detail in the following sections.

Integrated Development Environment: Spoon

Spoon is Kettle’s integrated development environment (IDE). It offers a graphical user interface based on SWT, the standard widget toolkit. As such, it is predominantly used as an ETL development tool. You can start Spoon by executing the corresponding shell script found in the Kettle home directory. For Windows users, the script is called Spoon.bat, and for UNIX-like operating systems, the script is called spoon.sh. Windows users can also use the executable Kettle.exe to start Spoon.

NOTE In this book, Spoon is heavily used throughout Chapters 4 to 11, and you’re guaranteed to know it inside out by the time you’ve finished this book.

A screenshot of Spoon is shown in Figure 3-2.

Figure 3-2: Spoon

In Figure 3-2, you can clearly see the main features of Spoon: a main menu bar across the top, and below that, an application window vertically divided in two main parts. The workspace on the right-hand side provides a tab interface to work with all jobs and

transformations that are currently open, and the left side of the application window features a tree view. The workspace itself is horizontally divided: the upper part is called the canvas, where ETL developers can create jobs and transformations by simply drawing a diagram that models the effect of the job or transformation. In Figure 3-2, the activated tab shows a transformation diagram. Diagramming a job or transformation is a matter of adding job entries and transformation steps onto the canvas. You can add elements to the canvas from the context menu or drag them from the tree view if it is in Design mode, as in Figure 3-2. These job entries and transformation steps can then be connected using hops. In the diagram, the hops are shown as lines drawn from the center of one job entry or transformation step to the other. The hops define the flow of control (in case of jobs) or flow of data (in case of transformations). The mode of the tree view can be controlled using the two large toolbar buttons immediately above the tree view. In View mode, shown in Figure 3-3, the tree view offers an alternate view of the opened jobs and transformations, allowing easy access to the individual steps, hops and any other resources, such as database connections.

Figure 3-3: The tree view in View mode

Utilities for executing and debugging jobs and transformations are integrated into Spoon, allowing ETL developers to test their jobs and transformations without leaving the IDE. A number of these functionalities can be accessed through the buttons on the toolbar, which is located in the workspace right above the canvas.

NOTE Testing and debugging are discussed in detail in Chapter 11.

Spoon also features facilities for logging and monitoring the execution of jobs and transformations. Some of these functionalities are shown in the Execution Results panel, which is shown in the bottom half of the workspace in Figure 3-2.

NOTE Monitoring is discussed further in Chapters 12 and 15.

Command-Line Launchers: Kitchen and Pan

Jobs and transformations can be executed from within the graphical Spoon environment, but it is not very practical to do so unless you’re developing, testing, or debugging. After the development phase, when you need to deploy your jobs and transformations, Spoon is not of much use. For deployment, you typically need to be able to invoke your jobs and transformations from the command line so they can be integrated with shell scripts and the operating system’s job scheduler. The Kitchen and Pan command-line tools were designed especially for this purpose. As such, Kitchen and Pan are typically used after the ETL development phase to execute jobs and transformations on production environments. Pan and Kitchen are very similar in concept and usage, and the list of available command-line options is virtually identical for both tools. The only difference between these tools is that Kitchen is designed for running jobs, whereas Pan runs transformations. Kitchen can be started by running its corresponding shell script from the Kettle home directory. The script is called Kitchen.bat for Windows users, and kitchen.sh for UNIX-like systems. Similarly, the script for Pan is called Pan.bat on Windows, and pan.sh for UNIX-like operating systems.

NOTE Kitchen and Pan are discussed in detail in Chapter 12.

Job Server: Carte

The Carte program is a job runner, just like Kitchen. But unlike Kitchen, which runs immediately after invocation on the command line, Carte is started and then continues running in the background as a server (daemon). While Carte is running, it waits and listens for requests on a predefined network port (TCP/IP). Clients on a remote computer can make a request to a machine running Carte, sending a job definition as part of the message that makes up the request. When a running Carte instance receives such a request, it authenticates the request, and then executes the job contained within it. Carte supports a few other types of requests, which can be used to communicate progress and monitor information. Carte is a crucial building block in Kettle clustering. Clustering allows a single job or transformation to be divided and executed in parallel by multiple computers that are running the Carte server, thus distributing the workload.

NOTE Carte and clustering are discussed in detail in Chapters 16 and 17.

Encr.bat and encr.sh

The Kettle home directory contains a few more .bat and .sh scripts, which are better classified as utilities rather than standalone programs like Spoon, Pan, Kitchen, and Carte. One of these utilities deserves to be mentioned here: the Encr.bat and encr.sh scripts can be used to encrypt a plain-text password. While using Kettle, passwords pop up in a number of places:

- In database connections defined in jobs and transformations.
- In job entries and transformation steps that connect to a server (such as an SMTP server, FTP server, or HTTP server).
- In Carte, which requires authentication in order to process requests for job execution and progress information. With the default authentication method, the password is specified in the kettle.pwd file, which is discussed in more detail later in this chapter.
- In configuration files that are read by the various Kettle programs, such as the kettle.properties file discussed later in this chapter.

Although passwords can be entered in plaintext in all these instances, it is good practice to encrypt them with the Encr.bat or encr.sh script. This prevents an unauthorized person from seeing and abusing the plaintext password.
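As a rough sketch of the workflow (the -carte option and the sample password shown here are assumptions; check the usage message of the script in your Kettle version), you run the script once and paste its output wherever the obfuscated password is needed:

    shell> sh encr.sh -carte MySecretPassword

The resulting OBF: string can then be used, for example, in the kettle.pwd file discussed later in this chapter.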

Installation

The following section describes the steps that are necessary to install and run Kettle programs.

Java Environment

Kettle is a Java program; it requires a Java runtime (that is, a Java Virtual Machine or JVM, and a set of standard classes) to run. You may already have Java installed, but for completeness, we discuss Java installation in the remainder of this section. We recommend using the Sun Java Runtime Environment (JRE), version 1.6, if you just want to run Kettle programs. If you want to build Kettle from source and/or develop Kettle plugins, you should obtain the Sun Java Development Kit (JDK), also version 1.6.

NOTE You can try to use a JRE or JDK from other vendors if you like, but all examples that appear in this book were tested and developed using Sun JDK version 1.6.

Installing Java Manually

The site http://java.sun.com/javase/downloads/index.jsp lets you download executable installation programs to install Java for a variety of operating systems, including Windows, Sun Solaris, and Linux. Simply choose the Java version of your preference and the appropriate platform (for example, Linux x64 or Windows) to download an installation executable. After downloading, run the executable and follow the instructions provided by the installer. If you are running an operating system for which the Sun website does not offer a suitable version, you should refer to the website of your operating system vendor. For example, Mac OS X users should go to http://developer.apple.com/java/download/ and find a suitable installation executable there. As with the installation executables provided by Sun, executing such an installer should provide the appropriate instructions to complete the installation process.

Using Your Linux Package Management System

If you’re using a popular Linux distribution such as Debian (or compatible distributions such as Ubuntu), Red Hat (or compatible distributions such as Fedora or CentOS), or SUSE/openSUSE, installing your Java environment can be done using the package management system of that particular distribution. For example, on Ubuntu, Java can be installed conveniently using the Synaptic package manager. Just search for a package called sun-java6-jre or sun-java6-jdk (for the runtime environment and the development kit, respectively), select it, and click to install. Alternatively, you can install it from the command line using a utility such as apt-get:

shell> sudo apt-get install sun-java6-jdk

If your operating system does not offer package management, or the package management system does not provide the required Java version, you should go to Oracle’s Java downloads website and download an appropriate installation program yourself.

Installing Kettle

Kettle is distributed as a single compressed archive. Downloads are hosted on sourceforge.net as part of the Pentaho business intelligence project at http://sourceforge.net/projects/pentaho, and you can find all available versions beneath the Data Integration folder at http://sourceforge.net/projects/pentaho/files.

Versions and Releases

On the SourceForge site you’ll find a separate folder for each available version, where the folder name indicates the version. For example, at the time of this writing, the current version is 4.0.0, which is contained in the 4.0.0-stable folder, and the folder called 3.2.0-stable contains archives for the stable release of the previous 3.2.0 version.

NOTE In this book, all examples were tested on version 4.0, but it’s possible and even likely that many things described in this book also apply to other Kettle versions.

You may also find folders for more recent versions that have a -RCxx or -Mxx postfix instead of -stable. These folders correspond to a release candidate (RC) and a milestone (M) release respectively, where xx is a number indicating the specific version. For example, the folder 4.0.1-RC1 contains the release candidate for the upcoming 4.0.1 version.

NOTE SourceForge is the proper site for downloading the actual Kettle releases, but it does not provide the latest bugfix releases nor binaries for the latest development builds. Binaries for the latest development builds for Kettle as well as other Pentaho products are continuously produced through the Hudson continuous integration platform. You can download the Hudson builds from http://ci.pentaho.com/. For a specific product, simply click the respective tab and choose the appropriate download file there. For example, for Kettle, click the Data Integration tab. Bugfix releases are released periodically through SourceForge, but in some cases, you may not be able to wait that long. In such cases, you should consider checking out the Kettle source code from Subversion and building Kettle yourself from source. You can check out the Kettle source from the Subversion URL svn://source.pentaho.org/svnkettleroot/Kettle/trunk. This process is described in detail in Chapter 22. You can build using the Ant build tool, which is available from http://ant.apache.org/.

Archive Names and Formats

The version-specific folders contain the actual archive file, which is available in .zip and .tar.gz formats. The archive files are named according to the pattern pdi-ce-version.extension, where pdi stands for Pentaho Data Integration and ce stands for Community Edition. For example, pdi-ce-4.0.0-stable.zip is the zip archive file for the current 4.0.0 stable release. Windows users would typically download the .zip archive, whereas users of UNIX-like systems should download a .tar.gz archive.

Downloading and Uncompressing

You can download the archive file of your choice directly from the SourceForge web page using your web browser. Alternatively, you can download it from the terminal using a command-line utility such as wget. For example, the following line would download the .tar.gz archive for the 4.0.0-stable release to the current working directory (the command should all be on one line, but is wrapped here to fit on the page):

shell> wget http://sourceforge.net/projects/pentaho/files/Data%20Integration/4.0.0-M2/pdi-ce-javadoc-4.0.0-M2.tar.gz/download

Uncompressing the archive is done in the usual way for the respective format. For example, on UNIX-like systems the .tar.gz archive can be conveniently extracted using tar with a command line like this:

shell> tar -zxvf pdi-ce-4.0.0-stable.tar.gz

On Windows, you can use a utility such as Peazip, or the Windows integrated zip utility. You can typically invoke these by right-clicking on the .zip archive and choosing “Extract here” or “Extract to folder” from the context menu. Kettle doesn’t care to which location it is extracted on your system, so you should extract it to a location that is most suitable for your situation. For example, on a Windows development machine, it makes a lot of sense to create a kettle or a pentaho directory in the Program Files directory for this task. On UNIX-like systems, you could create such a directory in your home directory in case you’re setting up your own development environment, while a location such as /opt/pentaho or /opt/kettle may be more suitable for a production environment. Extracting the archive file yields a directory called data-integration, which contains the actual software and resources. It is advisable to rename the data-integration directory so that it reflects the version you originally obtained. A sensible choice is to rename it simply to the exact name of the archive file, minus the extension:

shell> mv data-integration pdi-ce-4.0.0-M2

In the remainder of this book, we will refer to the Kettle home directory to indicate any such directory containing the Kettle software. Renaming the directory to something that clearly contains the Kettle version allows you and possibly others working in the same environment to see at a glance which version of Kettle is being used. It also allows you to maintain different versions of Kettle next to each other in a common parent directory, which can be convenient in case you need to test a newer version before upgrading.

Running Kettle Programs

All Kettle programs can be started using shell scripts which are located in the Kettle home directory. There are some minor differences between Windows and UNIX-like platforms because of the differences in the command shells in these respective platforms, but there are more similarities than differences. Note that the script files assume that the Kettle home directory is also the current working directory. This means that you’ll have to explicitly set the current working directory to the Kettle home directory when calling the Kettle scripts from your own scripts.
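For example, a wrapper script for a scheduled job could first change to the Kettle home directory and then call Kitchen. This is only a sketch; the installation path and job file name are hypothetical:

    #!/bin/sh
    # Kettle's startup scripts expect the Kettle home directory to be the
    # current working directory, so change into it first.
    cd /opt/pentaho/pdi-ce-4.0.0-stable
    sh kitchen.sh -file:/pentaho/load-customers.kjb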

Windows

After extraction, Windows users can run Kettle programs simply by executing a batch file located within the Kettle home directory. For example, to design transformations and jobs,

you can double-click Spoon.bat to start up Spoon, or you can run Kitchen.bat directly from the command line, or call it from within your own .bat files to execute jobs.

UNIX-like systems

For UNIX-like systems, you can execute Kettle programs by running the corresponding .sh script. It is possible you still need to make the .sh files executable before you can run the Kettle programs. For example, if your current working directory is your Kettle home directory, you can run:

shell> chmod ug+x *.sh

You should now be able to run the Kettle programs by executing their respective scripts, provided Kettle home is your current working directory.

Creating a Shortcut Icon or Launcher for Spoon

Because you’ll be running Spoon mostly from within a graphically enabled workstation, you might want to add a shortcut or launcher to your taskbar, menu, or desktop (or its equivalent on your system).

Adding a Windows Shortcut

Windows users can open the Windows Explorer and navigate to the Kettle home directory. From there, right-click on Spoon.bat and choose the “Create shortcut” option from the context menu. This will create a new shortcut (.lnk file) in the Kettle home directory that can be used to launch Spoon. Right-click the newly created shortcut file and choose Properties from the context menu. This will open the properties dialog showing the shortcut tab page. In that tab page, the Target and Start in fields should be filled in correctly already, so do not edit those fields. It’s a good idea to add the Kettle icon to your shortcut to make it easier to recognize when it appears on your desktop. To do this, click the “Change icon” button, and use the “Browse…” button to navigate to the Kettle home directory. Select the spoon.ico file and confirm the changes by pressing OK. Then, confirm the shortcut’s properties dialog, again by pressing OK. You can now drag the shortcut to your desktop, quick launch toolbar, or Start menu. Typically, Windows will make a copy of the shortcut as soon as you drop it on any of these items, so you can place more shortcuts in other places without having to re-create the shortcut and associate the icon again.

Creating a Launcher for the GNOME Desktop

For the GNOME desktop, which is the default desktop environment on many popular Linux distributions, you can create a launcher on your desktop by right-clicking the desktop background and choosing Create Launcher from the context menu. This opens the Create Launcher dialog. In the Create Launcher dialog, ensure that the Type field is set to Application. Enter Spoon as the name for the new launcher. Click the Browse button next to the Command

field. This will open a file browser dialog. Navigate to the Kettle home directory, and select the spoon.sh file. Click the Open button to close the file browser and assign the full command line to the command field of the launcher. To specify an icon, click the button with the launchpad icon on the left of the Create Launcher dialog to open the “Browse icons” dialog. Use the Browse button to open a file browser, and navigate to the Kettle home directory. Click the Open button to return to the “Browse icons” dialog, which should now display the spoon.ico and spoon.png icons. Select the spoon.png icon and press OK to confirm. Optionally, fill in a description and then confirm the Create Launcher dialog to place the launcher on your desktop. If you like, you can drag the launcher and drop it on the main menu to add it there. To create a launcher for one of the submenus, open the menu editor via System ➪ Preferences ➪ Main menu. There, navigate to the appropriate submenu (for example, Programming) and click the New Item button. This will open the Create Launcher dialog, which you can fill in as described above to add a launcher on the desktop.

Configuration

There are a number of factors in Kettle’s environment that influence the way Kettle behaves. Some are genuine configuration files; others are pieces of external software that integrate with Kettle. Together, we refer to this as Kettle’s configuration. In the remainder of this section, you will learn which components make up Kettle’s configuration, and how you should manage these when working with Kettle.

Configuration Files and the .kettle Directory

A number of files influence the behavior of Kettle programs. As such, these files are considered to be part of Kettle’s configuration, and in many cases they should be managed when migrating and/or upgrading. The files are:

- .spoonrc
- jdbc.properties
- kettle.properties
- kettle.pwd
- repositories.xml
- shared.xml

Some of these files are read only by one Kettle program, and others are shared by more than one. For most (but not all) of these files, a separate instance is created in a .kettle directory, which by default resides beneath the home directory of each user, allowing each user to have their own settings. (On UNIX-like systems, the home directory is typically /home/<user>; on Windows, the home directory is by default located at C:\Documents and Settings\user, where user stands for the actual username.)

The location of the .kettle directory can be changed explicitly by setting it in a KETTLE_HOME environment variable. For example, on a production machine, you most likely want to ensure that all users use the same configuration for running jobs and transformations, and explicitly setting KETTLE_HOME to a single location ensures everybody will be using the same configuration files. Alternatively, it may be convenient to maintain one particular configuration per ETL project, in which case you would set the value of the KETTLE_HOME environment variable accordingly before running any of the Kettle program scripts. The following subsections discuss each file in more detail.
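As a quick illustration of the per-project approach (the directory and job file names are hypothetical), you could point Kettle at a project-specific configuration location before invoking Kitchen; Kettle then looks for its configuration files in the .kettle directory beneath that location:

    shell> export KETTLE_HOME=/opt/etl/projects/sales_dwh
    shell> sh kitchen.sh -file:/pentaho/load-customers.kjb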

.spoonrc

As the name implies, the .spoonrc file is used to store preferences and program state of Spoon. Other Kettle programs do not use this file. It is stored in the .kettle directory beneath the user’s home directory, so there are multiple instances of this file, one for each Spoon user. Items that are typically stored in .spoonrc are:

- General settings and defaults: In Spoon, these settings can be viewed and edited in the General tab of the Kettle Options dialog. You can open the Options dialog via the main menu ➪ Edit ➪ Options.
- Look and feel preferences, such as font sizes and colors: In Spoon, these can be viewed and edited in the Look and Feel tab of the Kettle properties dialog.
- Program state data: Such as the most recently used files list.

Normally you should not edit .spoonrc manually. However, it does make sense to overwrite the .spoonrc file of a new Kettle installation with one that contains your preferences. For this reason it also makes sense to ensure .spoonrc is included in your file system backups.

jdbc.properties

For each Kettle installation there is a single jdbc.properties file, and it is stored in the simple-jndi directory beneath the Kettle home directory. This file is used for storing database connection data for database connection objects of the JNDI type. Kettle can utilize JNDI for referring to JDBC connection data such as the host IP address, and for user credentials, which can then be used for specifying database connection objects for transformations and jobs.

NOTE JNDI stands for Java Naming and Directory Interface, which is a Java standard that allows names to be used to refer to a server or service. You can find out more about JNDI at the Sun website at http://java.sun.com/products/jndi/. Note that JNDI is just one way to specify database connection data in Kettle; database connection data may also be stored inside the local database connection objects of a transformation or job, or as part of the PDI repository. We include it here because JNDI database connections form part of the Kettle configuration.

In jdbc.properties, JNDI connection data is stored as a set of multiple lines, each of which specifies a key/value pair, separated by an equal sign. The value appears after the equal sign. The key appears before the equal sign, and is itself composed of the JNDI name and a property name, separated by a forward slash. The lines for one particular connection are defined by the JNDI name, tying together the set of properties that make up the data required to establish a connection. The following properties are defined:

- type: Value is always javax.sql.DataSource.
- driver: Fully qualified class name of the Java class that implements the JDBC Driver class.
- url: The JDBC URL that the driver should use to connect with the database.
- user: The username that should be used when establishing the connection.
- password: The user's password.

The following example illustrates how a JNDI connection is stored in jdbc.properties:

SampleData/type=javax.sql.DataSource
SampleData/driver=org.hsqldb.jdbcDriver
SampleData/url=jdbc:hsqldb:hsql://localhost/sampledata
SampleData/user=pentaho_user
SampleData/password=password

In this example, the JNDI name is SampleData, which can be used to establish a connection to an HSQL database as the user pentaho_user having the password password. You can add your own connections in jdbc.properties, following the format of the SampleData JNDI connection snippet. Kettle currently does not offer any graphical user interface to do this, but it is quite easy to use a simple text editor for this task. Because the connections defined in jdbc.properties can be used by transformations and jobs, you should take proper measures to manage this file. At the very least, you should include it in system backups. Another thing to be aware of is how to handle deployment. After developing transformations and jobs that rely on JNDI connections, you should find some way to create JNDI connections of the same name in the jdbc.properties file located at the deployment target. Depending on your particular scenario, you may want to assign exactly the same values to the properties as you specified in your local jdbc.properties file, ensuring that the connections you used in your development environment are exactly the same as the ones you use on your deployment platform. However, in many cases it is more likely that there are dedicated development and/or testing databases, in which case you want to set up your local jdbc.properties so that it uses your local development database. On the deployment target, your jdbc.properties file would define exactly the same JNDI names, but specify connection data corresponding to the production environment. Because the transformations and jobs only refer to the JNDI connection by name, you can transparently apply your transformations and jobs on any environment without having to change the job and transformations themselves.

kettle.properties

The kettle.properties file is a general store for global Kettle properties. Properties are to Kettle what environment variables are to an operating system shell: they are global string variables that you can use to parameterize your jobs and transformations. For example, you can create properties to hold database connection data, paths on the file system, or simply constants that play a role in your transformation or job. You can edit the kettle.properties file using a plain-text editor. Each property is denoted on its own line as a key/value pair, separated by an equal sign. The key appears before the equal sign, and is used as property name; whatever comes after the equal sign is the property value. After defining such a property, you can refer to its value using the key name. Here’s an example to illustrate what the contents of kettle.properties might look like:

# connection parameters for the db server
DB_HOST=dbhost.domain.org
DB_NAME=sakila
DB_USER=sakila_user
DB_PASSWORD=sakila_password

# path from where to read input files
INPUT_PATH=/home/sakila/import

# path to store the error reports
ERROR_PATH=/home/sakila/import_errors

With these properties in kettle.properties, transformations and jobs can refer to these values by using a notation like ${<propertyname>} or %%<propertyname>%% in any configuration field of the transformation step or job entry that supports variables. For example, Figure 3-4 shows the configuration dialog of the CSV input step.

Figure 3-4: Referencing variables set in the kettle.properties file

As you can see, instead of “hard wiring” the directory in the transformation, the variable reference ${INPUT_PATH} is used inside the value for the Filename field. You can use variables this way whenever there’s a small dollar icon immediately next to the configuration dialog field. At runtime, the variable reference will evaluate to /home/sakila/import because this was specified in the kettle.properties file. The way in which properties can be used, and even the syntax for creating them, is similar to what you saw for JNDI connection data in jdbc.properties. For example, you can use different kettle.properties files on your development and production environment, with the express purpose of managing the differences between them. Despite the similarities between using jdbc.properties and kettle.properties, it is worthwhile to point out a few differences. First, currently JNDI can be used for database connections only, whereas properties can be used for any purpose. Also, keys for kettle.properties can have more or less arbitrary names, whereas the properties for JNDI connections are predefined and geared at configuring JDBC database connections. A final point worth noting about the kettle.properties file is that there are a few predefined properties that are meant to configure a default repository. These settings have an effect only when you’re using a repository to store transformations and jobs. The keys for these properties are:

- KETTLE_REPOSITORY: The name of the default repository
- KETTLE_USER: The name of the repository user
- KETTLE_PASSWORD: The password for the repository user

With these properties in place, Kettle programs will automatically use the repository identified by the value set for the KETTLE_REPOSITORY property and connect to it using the values set for KETTLE_USER and KETTLE_PASSWORD as credentials.
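A corresponding fragment of kettle.properties might look like the following; the repository name and credentials are hypothetical placeholders:

    KETTLE_REPOSITORY=ProductionRepository
    KETTLE_USER=etl_admin
    KETTLE_PASSWORD=secret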

NOTE Repositories are covered in more detail in the section about repositories.xml, later in this chapter, and in Chapter 13.

kettle.pwd

Executing a job using the Carte service requires authentication. The exact method of authentication can be configured, but this is a specialized topic that is outside the scope of this chapter. By default, Carte is configured to support only basic authentication. For this type of authentication, the credentials are stored in the kettle.pwd file, which is located in the pwd directory beneath the Kettle home directory. By default, the contents of the kettle.pwd file are:

# Please note that the default password cluster is obfuscated
# using the Encr script provided in this release
# Passwords can also be entered in plaintext as before
#
cluster: OBF:1v8w1uh21z7k1ym71z7i1ugo1v9q

The last line is the only functional line. It defines a user called cluster, as well as the obfuscated password (which happens to be cluster as well). As the comment

indicates, the obfuscated password was generated using the Encr.bat or encr.sh script discussed previously. If you are using the Carte service (especially if that service is accessible from outside your private network), you should most certainly edit kettle.pwd and at least change the default credentials. There is no separate editor or tool available to do this: simply use any text editor and make the changes by directly editing the file.
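For example, you might replace the default entry with a user of your own. The username below is made up, and the OBF: placeholder stands for whatever your own Encr run produces:

    etl_operator: OBF:<output of the Encr script for your password>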

repositories.xml

Kettle has the capability to manage transformations, jobs, and resources such as database connections using a centralized repository. If you do not use a repository, transformations and jobs will be stored in files, and each job or transformation will keep its own copy of configuration data for items such as database connections. Kettle repositories can have a relational database as backend, or you can have plugins that utilize another data store as backend, such as a version control system like Subversion. These features are discussed in Chapter 13. To make it easier to work with repositories, Kettle maintains a list of known repositories in a repositories.xml file. There are two standard locations for a repositories.xml file:

- The .kettle directory beneath the user's home directory. This file is read by Spoon, Kitchen, and Pan.
- The Carte service will read the repositories.xml file in the directory from which Carte was started. If there is no repositories.xml file, it will use the default repositories.xml file stored in the .kettle directory beneath the user's home directory.

For the development platform, you are not required to edit this file manually; it is automatically maintained by Spoon whenever you connect to a new repository. But for deployment, things may be different. For each repository name referenced in the deployed transformations or jobs, an entry must be found with a matching name in whatever repositories.xml file is being used. And just as you saw for the jdbc.properties and kettle.properties files, it is possible or even likely that the actual repository that should be used on the deployment platform differs from that being used on the development platform. In practice, it is most convenient to simply copy the repositories.xml file from the development environment and manually modify it for the deployment environment. This topic is covered in detail in Chapter 13.

shared.xml

Kettle supports a concept referred to as shared objects. Shared objects are things like transformation steps, database connection definitions, slave server definitions, and so on that are defined once and then reused (shared) among multiple transformations or jobs. Note that there is some overlap in functionality with repositories, which can also be used to share connections and slave server definitions. There are a few differences, however. A repository is typically a central store that can be accessed by multiple developers; as such it is an ideal way to maintain all objects pertaining to a particular project.

To share an object, activate the tree view View mode in Spoon, and locate the object that you wish to share. Right-click it, and then choose Share from the context menu. Be sure to save the file; otherwise the share is not recorded. Objects shared in this way will automatically be available in each new job or transformation you create, and can always be located in the tree view when in the View mode. However, shared steps and job entries are not automatically placed on the canvas; you’ll have to drag them from the tree view onto the canvas in order to use them in your transformation or job. Shared objects are stored in a file called shared.xml. By default, a shared.xml file is stored in the .kettle directory beneath the home directory of the current user. However, jobs and transformations can specify a custom location for the shared objects file. This allows you to conveniently manage and reuse all objects that are repeatedly needed in all transformations and jobs for a specific project. For transformations as well as jobs, you can use Spoon to set the location of the shared objects file. For both jobs and transformations, this is done in the Properties dialog, which you can open via the Settings option of the main menu. For jobs, you can specify the file in the Shared objects file field on the Log tab page. For transformations, you’ll find this field on the Miscellaneous tab page.

NOTE You may use variables to specify the location of the shared properties file. For example, in a transformation you could use a location like the following: ${Internal.Transformation.Filename.Directory}/shared.xml

This allows all transformations in the same directory to use the same shared objects file, regardless of the actual directory in which the transformation resides.

For deployment, you need to ensure that any shared object files that are directly or indirectly used in the developed transformations and jobs are also available on the deployment platform. Typically, the shared objects files should be identical for both environments, and any environment-specific settings should be taken care of using the kettle.properties file.

The Kettle Shell Scripts

In some cases, you may need to tweak the shell scripts that are used to launch the various Kettle programs. The most common reasons to do this are:

- To add additional entries to the Java classpath. This is necessary in case your job or transformation directly or indirectly (via plugins) refers to Java classes that are not by default accessible.
- To change the settings of the Java Virtual Machine, such as the amount of memory that it can utilize.

General Structure of the Startup Scripts

The structure of all the Kettle startup scripts is quite similar:

- An initial string is built to set the classpath later on. In this step, a number of core .jar files are added to the classpath.
- A string is built containing the file names of all .jar files residing in and below the libext directory.
- File names of other .jar files are added to the classpath. These file names are specific for the particular program that is invoked by the script. For example, the Spoon startup scripts add the names of a number of SWT .jar files in this step, which are used to build the Spoon graphical user interface.
- A string is built with a number of options for the Java Virtual Machine. The classpath built in the previous steps is included in that string. The option that sets the maximum heap size that can be utilized is added to the string in this step.
- The Java executable is executed using the string of options built in the previous steps. In this line the fully qualified name of the Java class that actually implements the specific Kettle program is passed to the Java executable.

Adding an Entry to the Classpath

Kettle allows you to use Java expressions in your transformations. For example, the JavaScript step allows you to instantiate Java objects and call their methods, and the User Defined Java Expression step lets you write Java expressions directly. These topics are discussed in detail in the "Scripting" section in Chapter 7. You have to make sure that any classes you want to use in your Java expressions are included in the classpath. The easiest way to achieve this is to create a new directory beneath the libext directory in the Kettle home directory and put any .jar files you want to add to the classpath in there. For the .sh scripts, this works without further changes because of the following lines:

  # JDBC & other libraries used by Kettle:
  for f in `find $BASEDIR/libext -type f -name "*.jar"` \
           `find $BASEDIR/libext -type f -name "*.zip"`
  do
    CLASSPATH=$CLASSPATH:$f
  done

These lines simply loop over all .jar and .zip files located in or below the libext directory. Unfortunately, the equivalent section in the .bat files looks like this:

  REM Loop the libext directory and add the classpath.
  REM The following command would only add the last jar:
  REM FOR %%F IN (libext\*.jar) DO call set CLASSPATH=%CLASSPATH%;%%F
  REM So the circumvention with a subroutine solves this ;-)

  FOR %%F IN (libext\*.jar) DO call :addcp %%F
  FOR %%F IN (libext\JDBC\*.jar) DO call :addcp %%F
  FOR %%F IN (libext\webservices\*.jar) DO call :addcp %%F
  FOR %%F IN (libext\commons\*.jar) DO call :addcp %%F
  FOR %%F IN (libext\web\*.jar) DO call :addcp %%F
  FOR %%F IN (libext\pentaho\*.jar) DO call :addcp %%F
  FOR %%F IN (libext\spring\*.jar) DO call :addcp %%F
  FOR %%F IN (libext\jfree\*.jar) DO call :addcp %%F
  FOR %%F IN (libext\mondrian\*.jar) DO call :addcp %%F
  FOR %%F IN (libext\salesforce\*.jar) DO call :addcp %%F
  FOR %%F IN (libext\feeds\*.jar) DO call :addcp %%F

As you can see, in the .bat scripts the subdirectories of the libext directory are hard-coded within the script. This means you have to add a line of your own to ensure that the .jar files inside any subdirectory you add to libext are also picked up and added to the classpath.

Changing the Maximum Heap Size

All Kettle startup scripts explicitly specify a maximum heap size. For example, in pan.bat you can find a line that reads:

  REM
  REM Set java runtime options
  REM Change 512m to higher values in case you run out of memory.
  REM

  set OPT=-Xmx512M -cp %CLASSPATH% ...more options go here...

If you are experiencing Out of Memory errors when running certain jobs or transformations, or if the computer on which you're running Java has substantially more physical memory available than 512MB, you can try increasing this value. When changing the amount of memory, be careful to change only the number 512 to some other integer. Other modifications, such as accidentally removing the trailing M, will lead to unexpected results that may be hard to debug.

NOTE For the original documentation on the command-line parameters, see http://java.sun.com/javase/6/docs/technotes/tools/solaris/java.html.

Managing JDBC Drivers

Kettle ships with a large number of JDBC drivers. A given JDBC driver typically resides in a single Java archive (.jar) file. Kettle keeps all of its JDBC drivers in the JDBC directory beneath the libext directory, which resides in the Kettle home directory. To add a new driver, simply drop the .jar file containing the driver into the libext/JDBC directory. The startup scripts for Spoon, Kitchen, and Pan automatically loop through the contents of this directory and add all of these .jar files to the classpath.

NOTE A new database driver does not automatically become available to any Kettle program that happens to be running at the time you add it. Only those .jar files that are present in the libext/JDBC directory before the startup script is run will be available to Kettle.

When you upgrade or replace a driver, be sure to also remove the old .jar file. If you want to keep the old .jar file, move it to a directory that is completely outside the Kettle home directory and its descendant directories. This ensures that you can't accidentally load the old driver again.

Summary

This chapter provided an overview of the Kettle programs. We discussed how to install Kettle, and how to manage several aspects of its configuration. In particular, you learned:

- Spoon is the integrated development environment you use to create and design transformations and jobs.
- Kitchen and Pan are command-line launchers for running Kettle jobs and transformations, respectively.
- Carte is a server for running Kettle jobs remotely.
- How to install Java (which is a prerequisite for running the Kettle programs).
- How to obtain Kettle from SourceForge, and how to install it.
- How to use the Kettle scripts to run the individual Kettle programs.
- The different files that make up Kettle's configuration.

Chapter 4: An Example ETL Solution—Sakila

After the gentle introduction to ETL and Kettle provided by the first two chapters and the installation and configuration guide provided in Chapter 3, it's finally time to get some hands-on experience with Kettle. Using a fairly uncomplicated yet sufficiently realistic ETL solution, this chapter gives you a quick impression of Kettle's features and capabilities. In addition, you'll gain just enough experience in using the Spoon program to follow the more complicated examples in the remainder of this book.

This chapter does not provide detailed step-by-step instructions for building this solution yourself. Instead, we provide all the instructions necessary to set up the solution on your own system using transformations and jobs that are available on this book's website at www.wiley.com/go/kettlesolutions. As we discuss the design and constructs used in the example, we refer to the specific chapters and sections in the remainder of this book that describe a particular feature or technique in more detail.

Sakila

Our example ETL solution is based on a fairly simple star schema that can be used to analyze rentals for a fictitious DVD rental store chain called Sakila. This star schema is based on the sakila database schema, which is a freely obtainable sample database for MySQL.


NOTE The sakila sample database was originally developed by Mike Hillyer, who was at the time a technical writer for MySQL AB. Since its release in March 2006, the sakila sample database has been maintained and distributed by the MySQL documentation team. More information about the sample database can be found in the relevant MySQL documentation pages at http://dev.mysql.com/doc/sakila/en/sakila.html. You can find download and installation instructions in that document. For your convenience, the SQL scripts for setting up the sakila database are also available on the book's website in the download area for this chapter, and this chapter provides detailed installation instructions.
There is a port of sakila for PostgreSQL called pagila. It can be obtained via the dbsamples project hosted on pgFoundry at http://pgfoundry.org/projects/dbsamples. You can find instructions for setting up pagila at http://www.postgresonline.com/journal/index.php?/archives/36-REST-in-PostgreSQL-Part-1-The-DB-components.htm.

The sample ETL solution described in this chapter is designed to periodically extract new or changed data from the original sakila schema, which thus acts as the source database. The data is then transformed to fit the heavily denormalized rental star schema. Finally, the data is loaded into the rental star schema, which thus acts as the target database.

The following sections briefly discuss the schemas of both the source and target databases. This should help you later on to understand how the sample ETL solution works and why it was built this way.

The Sakila Sample Database

For a full description of all objects in the sakila database schema, you should refer to the official documentation at http://dev.mysql.com/doc/sakila/en/sakila.html. However, the schema is quite simple and easy to grasp, and its essence can be easily understood when looking at the business process.

DVD Rental Business Process

The sakila database is a sample database that supports the primary business process of a chain of brick-and-mortar DVD rental stores. The following list describes a few key activities of this business process to help you understand how the sakila database supports it:

- Each store maintains its own inventory of films for rental, which is updated by one of the store's staff members whenever customers pick up or return DVDs.
- Descriptive data about each film is maintained, such as its categories (action, adventure, comedy, and so on), cast, rating, and whether the DVD has special features (such as deleted scenes and theatrical trailers). This data may be used to print labels to physically mark DVD boxes.
- To become a customer, people have to register with one of the stores belonging to the chain.
- Customers can enter any of the stores and pick up one or more DVDs for rental. Customers are expected to return previously rented DVDs within a fixed rental duration that is specified per DVD.
- Rentals must be paid for. Any customer can make a payment at any time for any rented item.

Sakila Database Schema Diagram

Figure 4-1 shows the database schema for the sakila sample database. The diagram reveals a typical normalized schema (having a respectable number of heavily interrelated tables), which is optimized for online transactional processing (OLTP).

NOTE For your convenience, the diagram is available as a Power*Architect file on this book's website. You can find the file, sakila.architect, in the download area beneath the folder for this chapter. The diagram was prepared with version 0.9.16 of the software.

Sakila Database Subject Areas

Figure 4-1 suggests an organization into four subject areas. This categorization allows the reader to quickly identify the key tables of the database schema along with their most important related tables. While you could think of other useful ways to categorize the tables in the database schema, the categories we suggest are:

- Films: Comprises the film table and a number of tables containing additional information pertaining to films, such as category, actor, and language.
- Stores: Has the store table and the related staff and inventory tables.
- Customers: Holds the customer table along with the customer-related rental and payment tables.
- Location: Holds the country, city, and address tables, which participate in the normalized storage of customer, store, and staff addresses.

Figure 4-1: The sakila sample database schema (the diagram groups the tables into the films, location, stores, and customers subject areas)

General Design Considerations

To further understand the sakila database schema, it helps to be aware of a few general design considerations:

- The sakila schema uses singular nouns as table names.
- Every table has an automatically incrementing surrogate primary key. The column that stores the key value is easily recognized as the column named <table-name>_id. For example, the key of the film table is the film_id column.
- Foreign key constraints consistently refer to the primary key, and the foreign key column keeps the name of the original primary key column. For example, the address_id column of the store table refers to the address_id column of the address table.
- Every table has a last_update column of the TIMESTAMP type, which is automatically updated to the current date and time whenever a row is added or modified.
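To make these conventions concrete, here is a minimal, hypothetical MySQL sketch of a table that follows them. It is deliberately abbreviated and is not the actual sakila definition of the store table (which has more columns); it assumes the address table already exists.

  -- Hypothetical, abbreviated table definition illustrating the conventions
  -- described above (not the actual sakila DDL):
  CREATE TABLE store (
    store_id    TINYINT UNSIGNED  NOT NULL AUTO_INCREMENT, -- surrogate key named <table-name>_id
    address_id  SMALLINT UNSIGNED NOT NULL,                -- FK keeps the referenced column name
    last_update TIMESTAMP         NOT NULL
                DEFAULT CURRENT_TIMESTAMP
                ON UPDATE CURRENT_TIMESTAMP,               -- maintained automatically by MySQL
    PRIMARY KEY (store_id),
    FOREIGN KEY (address_id) REFERENCES address (address_id)
  );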

Installing the Sakila Sample Database

You can download the sakila sample database from the MySQL documentation website. It is currently distributed as a compressed archive containing MySQL-compatible SQL script files. The archive is available in the .tar.gz format as well as the .zip format. You can find the links to the actual archives on this page: http://dev.mysql.com/doc/index-other.html. Alternatively, you can download these archives from this book's website in the download folder for Chapter 4.

Before you can install the sakila sample database, you should first install a recent version of the MySQL RDBMS software. You'll need at least version 5.0, but we recommend the latest stable version. For instructions on obtaining, installing, and configuring MySQL, please refer to the official MySQL documentation at http://dev.mysql.com/doc/refman/5.1/en/installing.html.

After you have installed the MySQL RDBMS software and downloaded the sakila sample database archive, refer to the sakila setup guide at http://dev.mysql.com/doc/sakila/en/sakila.html for detailed installation instructions. Alternatively, use the following instructions:

1. Uncompress the .tar.gz or .zip archive containing the SQL script files. You should now have two files, sakila-schema.sql and sakila-data.sql.
2. Connect to MySQL using the mysql command-line client. Be sure to log on as a user that has the appropriate permissions to create a new database.
3. Run the sakila-schema.sql and sakila-data.sql scripts by using the SOURCE command. For example, if you unpacked the scripts in /home/me/sakila, you could type:
     mysql> SOURCE /home/me/sakila/sakila-schema.sql
   and then:
     mysql> SOURCE /home/me/sakila/sakila-data.sql

The Rental Star Schema

The rental star schema is directly derived from the original sakila sample database. It is just one example of a number of possible dimensional models that focus on the rental business process.

Rental Star Schema Diagram

A diagram of the rental star schema is shown in Figure 4-2.

Figure 4-2: The rental star schema (the dimension tables are grouped into who?, what?, when?, and where? areas around the central fact_rental table, which forms the how much? area)

NOTE If you want to examine the star schema more closely yourself, you can find a copy of the Power*Architect file on this book's website, located in this chapter's folder. It's called sakila-rental-star.architect. The diagram was prepared with SQL Power Architect version 0.9.16. The version is important, as older versions of the software may not be able to load the file.

Figure 4-2 reveals a typical dimensionally modeled database schema, having one central fact table called fact_rental, which is related to multiple dimension tables. A dimensional model like this is optimal for online analytical processing (OLAP) requirements. It is also a typical star schema, because almost all dimensions are modeled as a single, denormalized dimension table that is related only to the fact table and not to the other dimension tables.

NOTE There is one notable exception to the dimension tables not being related to each other: the dim_actor and dim_film tables are related to each other through a so-called bridge table called dim_film_actor_bridge. The purpose of this construct is described in the subsection "Dimension Tables" later in this chapter.

The following sections describe the elements found in this star schema and their purpose.

Rental Fact Table

The fact table is called fact_rental and contains a few columns that store the quantitative metrics pertaining to the performance of the rental business process (count_returns, count_rentals, and rental_duration). In addition, it contains a set of columns that refer to the keys of the dimension tables. The dimension table rows identified by these key columns provide the context for the state of the business at the time the particular values for the metrics were obtained. The fact_rental table corresponds directly to the original rental table in the sakila schema: one row in the rental table generates one row in the fact_rental table.

Dimension Tables

We previously explained that the rental star schema models each dimension as a single dimension table. The dimension tables all follow the naming convention dim_<dimension-name>, where <dimension-name> is a descriptive name for the subject of the dimension. Analogous to how we categorized the sakila database diagram (Figure 4-1) into a number of subject areas, Figure 4-2 suggests that the dimension tables are organized into four groups containing conceptually related dimensions (plus a fifth group, "how much?", formed by the fact table):

- who?: In this group we find the dim_customer and dim_staff dimension tables, which represent the customer and the staff member who participate in a rental. Both dimensions are modeled as type 2 slowly changing dimensions: the special _version_number, _valid_from, and _valid_through columns serve to distinguish between multiple records pertaining to the same physical customer or staff member. This allows you to track history.
- when?: This group contains the dimension tables that mark the point in time at which a rental or a return occurred. The dim_date dimension represents the calendar day. It is a so-called role-playing dimension because it is used to mark both the rental date and the return date. The dim_time dimension is used to identify the moment during the day at which a rental occurred.
- where?: The dim_store dimension is used to identify from which store the DVD was rented. Like the dim_staff and dim_customer dimension tables, this is also a type 2 slowly changing dimension, having a set of columns to keep track of the different store record versions through time.
- what?: This group contains the dim_actor and dim_film dimension tables, which play the role of subject of a rental. Only the dim_film table is directly related to the fact_rental table, because the film is the actual object that is rented or returned. But because a film has a cast consisting of multiple actors, the actors are in a sense also a subject of the rental. This is where the so-called bridge table dim_film_actor_bridge comes into play, which relates actors to films. In addition, it stores a weighting factor that can be used to attenuate the values of the metrics in the fact table in proportion to how much a particular actor contributed to a film. By multiplying the raw metrics by the weighting factor, you can analyze rentals in terms of actors and still treat the attenuated metrics as if they were additive. For example, you can sensibly answer the question: In the last month, how many rentals did we have for films with Robert De Niro or Al Pacino (or both)? A query along these lines is sketched below.

In the case of the rental star schema, the derivation of dimension tables from the original sakila schema is so straightforward that each dimension table (save for the dim_date and dim_time tables) corresponds directly with one table in the original sakila schema. For example, the dim_store dimension table corresponds with the original store table, and the dim_actor table corresponds with the original actor table.
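As an illustration of how the weighting factor can be used, the following query sketches one way to answer that question in plain SQL. The join columns follow the key naming convention described in the next section, and the date filter assumes a yyyymmdd smart key, so treat the exact column names and key format as assumptions rather than the literal schema.

  -- Hypothetical query: rentals in July 2006, attenuated by the bridge
  -- table's weighting factor so the measure remains (approximately) additive.
  SELECT   a.actor_first_name,
           a.actor_last_name,
           SUM(f.count_rentals * b.actor_weighting_factor) AS weighted_rentals
  FROM     fact_rental           AS f
  JOIN     dim_film_actor_bridge AS b ON b.film_key  = f.film_key
  JOIN     dim_actor             AS a ON a.actor_key = b.actor_key
  WHERE    a.actor_last_name IN ('DE NIRO', 'PACINO')
  AND      f.rental_date_key BETWEEN 20060701 AND 20060731   -- assumed yyyymmdd smart key
  GROUP BY a.actor_first_name, a.actor_last_name;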

Keys and Change Data Capture

With the exception of dim_date and dim_time, the dimension tables each have their own automatically incrementing column that serves as the surrogate primary key. The dim_date and dim_time tables also have a key, but as you will see later, it is a so-called smart key. These smart keys are derived directly from parts of the date and time values respectively, which has some practical advantages for the ETL process, as well as for partitioning the fact table.
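The sketch below illustrates the idea in MySQL: the key value is computed directly from the date or time it represents. The yyyymmdd and hhmmss layouts shown here are assumptions for illustration; the sample solution computes its keys in JavaScript inside the transformations.

  -- Hypothetical illustration of "smart" keys derived from the value itself:
  SELECT CAST(DATE_FORMAT('2009-03-09', '%Y%m%d') AS UNSIGNED) AS date_key,  -- 20090309
         CAST(TIME_FORMAT('13:45:30',   '%H%i%s') AS UNSIGNED) AS time_key;  -- 134530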

The key values of the dimension tables are used by the fact_rental table to refer to the dimension tables. For any given dimension table, the key column is called <dimension-name>_key, where <dimension-name> is the dimension name without the dim_ prefix used in the table name.

When we discussed the general design considerations of the sakila database schema, we mentioned that every table in the source schema has a last_update column, which stores a TIMESTAMP value of the last change (or addition) on a per-row basis. As you will see in the following section, the existence of these columns is extremely convenient for capturing data changes, which is necessary for incrementally loading the rental star schema. Although there are several ways to exploit this feature, the rental star schema takes a very straightforward approach: each dimension table keeps its own copy of the last_update column to store the values from the last_update column of its corresponding table in the original sakila schema. This allows you to run a single query on each dimension table to obtain the date/time of the most recently loaded row, which can then be used to identify and extract all rows that were added or changed in the corresponding table in the source database.

In addition to the surrogate primary key, each dimension table also contains a column that stores the value of the primary key column from the original sakila schema. For example, the dim_film table in the rental star schema corresponds to the film table in the original sakila schema and thus has a film_id column to store the values from the corresponding film.film_id column. As you will see in the next section, these columns are extremely important for determining whether to add or update rows in the dimension tables to reflect the changes that occurred in the source database since the last load.

Installing the Rental Star Schema

You can download SQL script files for the rental star schema from this book's website. Just like the sakila sample database, the script files are archived and available as a .zip and a .tar.gz archive. The installation procedure for the rental star schema is therefore the same as for the original sakila sample database: simply unpack the archive and use the SOURCE command in the MySQL command-line client to execute the scripts. The only difference from the sakila sample database is the actual file names: for the rental star schema, the files for the schema and the data are called sakila_dwh_schema.sql and sakila_dwh_data.sql, respectively.

Note that the ETL solution described later in this chapter will actually load the tables in the rental star schema based on the contents of an existing sakila schema, so you might want to consider installing only the schema and not the data. If you do load the data from the SQL script file, the ETL solution will still run fine: it just won't find any new changes to load.

Prerequisites and Some Basic Spoon Skills

We are about to dive into the details of our sample ETL solution, and you are encouraged to follow along and examine its nuts and bolts directly—"live," so to speak, on your own computer, running your own copy of Spoon. So before you actually look at the individual transformations and jobs of the sample ETL solution, it is a good idea to first obtain the files that make up the sample solution and verify that you can open them.

In addition, for those readers who are not yet familiar with Spoon, it might be a good idea to acquire a few basic skills in working with Spoon. Working through the "Visual Programming" section in Chapter 2 is an excellent way to get started. The following sections list a few basic operations just to get you started. As the Spoon user interface is quite intuitive, we're confident that you'll be able to figure out many things about working with Spoon yourself as you go.

Setting Up the ETL Solution

You can obtain all the transformation and job files that make up the sample solution from this book's website in the folder for Chapter 4. All files are available in .zip and .tar.gz archives called ch4_ktr_and_kjb_files. Simply download the archive, and extract it to some location on your hard disk.

Creating Database Accounts

The transformations for the ETL solution use two particular database accounts to access the sakila and rental star schemas: a sakila account, which is used to read from the sakila sample database, and a sakila_dwh account, which is used to read from and write to the rental star schema. To create these accounts, use the mysql command-line client to log in to MySQL as a user with SUPER privileges, such as the built-in root account. Then, enter these commands:

  CREATE USER sakila IDENTIFIED BY 'sakila';
  GRANT ALL PRIVILEGES ON sakila.* TO sakila;

  CREATE USER sakila_dwh IDENTIFIED BY 'sakila_dwh';
  GRANT ALL PRIVILEGES ON sakila_dwh.* TO sakila_dwh;

For your convenience, these commands are also available as a SQL script file (create_sakila_accounts.sql), which is included in the ch4_ktr_and_kjb_files archive along with the Kettle transformation and job files.

Working with Spoon

We assume you have already installed Kettle and can successfully run the Spoon program prior to setting up the sample ETL solution. For instructions on how to install Kettle and run the Spoon program, see Chapter 3.

Opening Transformation and Job Files

You can open individual files in Spoon via the main menu by choosing File ➪ Open. This opens a file browser, which you can use to navigate to the directory where you extracted the archive.

Opening the Step’s Configuration Dialog

We pointed out previously that this chapter does not provide step-by-step instructions for building the ETL solution yourself. Instead, we are confident that you'll get an idea of how to do this by simply going back and forth between the book and the transformation in Spoon. So, if you're curious about some detail of the inner workings of the transformations, good! Simply double-click any step of a transformation to open a dialog that reveals the specific configuration of that step.

Don't worry if you don't immediately understand everything that you might find inside these configuration dialogs. In the subsequent chapters, a great number of different steps are covered in detail, with specific examples of how to configure and use them in practical situations.

Examining Streams

Kettle transformations are all about manipulating streams of records, so it is natural to try to understand the effect of a particular transformation step by examining the layout of its incoming and outgoing streams. To do this, right-click the step and choose either the "Show input fields" or the "Show output fields" option from the context menu. This opens a dialog showing the name, data type, format, originating step, and much more for all the fields that together make up the stream. For an example of what this looks like, see Figure 4-3.

Figure 4-3: Examining the layout of incoming and outgoing streams

Running Jobs and Transformations

If you need to run a job or transformation, find the "Run this transformation or job" button (hereafter: the Run button) on the toolbar. It's the button with the green arrowhead pointing to the right, which appears as the first button on the toolbar right above the workspace. Clicking it opens a dialog that you can use to configure all kinds of run-time settings. For now, however, you can simply confirm the dialog by pressing OK.

As soon as your transformation is running, you should see the Execution Results pane appear at the bottom of Spoon's workspace. You can see what it looks like in the bottom right of Figure 3-2 in the preceding chapter. In this pane, use the Step Metrics tab page to get an overview of which step is doing what, and how fast it is doing it. This is most useful for troubleshooting performance problems. Use the Logging tab for text-based log output, which is most useful for debugging and troubleshooting errors. These and other tabs are discussed in depth in Chapters 12 and 15.

While the transformation is running, the Run button appears to be disabled. Instead, the Pause and Stop buttons located immediately to the right of the Run button are enabled, allowing you to suspend or abort the running transformation. For now, ignore these buttons; simply use them as an indicator that the transformation is still running and hasn't finished yet.

The Sample ETL Solution

Now that we have described the respective database schemas, we can examine how the ETL solution manages to load the data from the sakila sample database into the rental star schema. In the remainder of this section, we describe each job and transformation used to load the rental star schema.

Static, Generated Dimensions

The dim_date and dim_time dimension tables are static: they are initially loaded with a generated dataset and do not need to be periodically reloaded from the sakila sample database (although at some point you might need to generate more data for the dim_date table to account for dates in the far future).

In Kimball's 34 ETL subsystems framework, covered in Chapter 5, dimensions such as time and date belong to the domain of the Special Dimensions Manager. Although that term may suggest a specialized device that "manages the special dimensions," it is better to think of it as a conceptual bucket that classifies all dimension types that cannot be neatly derived from the source system. In the case of Kettle, there is nothing special about loading date and time dimensions: you simply create a transformation that generates whatever data you feel fits the requirements and loads it into the respective dimension table(s).

Loading the dim_date Dimension Table

The transformation file for loading the dim_date dimension table is called load_dim_date.ktr. Figure 4-4 shows what it looks like in Spoon. A brief description of the steps in Figure 4-4 appears in the next few paragraphs.

Figure 4-4: The load_dim_date transformation

Generate 10 years

The transformation works by first generating about ten years' worth of rows (10 × 366 = 3,660) using a step of the Generate Rows type, which is labeled "Generate 10 years" in Figure 4-4. The generated rows have a few fields with constant values; one of these constants is the initial date, which was set to 2000-01-01. Other constants include the language and country code.

Day Sequence

The generated rows are fed into the Day Sequence step in Figure 4-4. This step is of the Sequence type, and its purpose is to generate an incrementing number for each row passed by the incoming stream. In the subsequent step, this sequence number is added to the initial date in order to generate a series of consecutive calendar dates.

Calculate Dimension Attributes

The Calculate Dimension Attributes step is the heart of the transformation shown in Figure 4-4. It is a step of the Modified Java Script Value type. In this step, the field values from the stream coming in from the Day Sequence step are bound to JavaScript variables, which are then used to compute various representations of the date, in different formats.

One of the first computations performed in the Calculate Dimension Attributes step is the addition of the sequence number and the initial date, yielding the calendar day. Subsequent JavaScript expressions are then applied to the calendar day to yield different formats that represent the date or a part thereof. In addition to generating all kinds of human-friendly date formats, a JavaScript expression is also used to generate the smart key used to identify the rows in the dim_date table. Some of these JavaScript expressions use the language and country code (originally specified in the initial "Generate 10 years" step) to achieve locale-specific formats. For example, this allows the date 2009-03-09 to be formatted as Monday, March 9, 2009 if the country code is gb and the language code is en, whereas that same date would be formatted as lundi 9 mars 2009 if the country code is ca, with fr as the language code.

The Calculate Dimension Attributes step illustrates two more points of interest. You may have noticed that this step is adorned with a little "x4" label, which appears at the top left of the square icon. This label is not part of the step itself; it is the result of spawning multiple copies of the step. You can specify the number of copies by right-clicking on a step and choosing "Change number of copies to start" from the context menu. The benefits of doing this are explained in detail in Chapters 15 and 16.

Load dim_date

The final step of the transformation is a "Table output" step labeled "Load dim_date." This step accepts the rows from the stream coming in from the Calculate Dimension Attributes step, and generates and executes the appropriate SQL command(s) to insert the rows into the dim_date dimension table.

Running the Transformation

If your database is running and you have set up the sakila_dwh schema and corresponding user account, you should be able to simply run the transformation and load the dim_date table. The transformation should complete in a matter of seconds on a reasonably idle modern laptop or desktop computer.

Loading the dim_time Dimension Table

The transformation file for loading the dim_time dimension table is called load_dim_time.ktr. The transformation is shown in Figure 4-5.

Figure 4-5: The load_dim_time transformation

The load_dim_time transformation introduces one new type of step: in addition to steps of the types Generate Rows, Sequence, Modified Java Script Value, and "Table output," it also uses a step of the type "Join rows (cartesian product)." Again, we briefly discuss the key components of the transformation in the paragraphs that follow.

Generating Hours, Minutes, and Seconds

Just like the load_dim_date transformation shown in Figure 4-4, the load_dim_time transformation initially uses the Generate Rows step type to generate the data. However, in this case, three instances of the Generate Rows step type run in parallel:

- The step labeled "Generate hours" generates 24 rows to represent the hours of the day.
- The step labeled "Generate minutes" generates 60 rows, representing the minutes per hour.
- The step labeled "Generate seconds" also generates 60 rows, but these represent the seconds per minute.

Each of the Generate Rows steps is immediately followed by its own step of the Sequence type to actually obtain a series of consecutive integer values representing 0–23 hours, 0–59 minutes, and 0–59 seconds, respectively. In the hours branch, the Sequence step is followed by the Modified Java Script Value step labeled "Calculate hours12" to obtain a 2 × 12-hour representation of the hour. In addition, this step also calculates the AM/PM indicator to complement the 12-hour notation.
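Expressed in SQL instead of JavaScript, the calculation performed in the hours branch looks roughly like the following sketch; the hours table and its hour24 column are hypothetical stand-ins for the stream produced by the Generate hours and Sequence steps.

  -- Hypothetical SQL equivalent of the "Calculate hours12" step:
  SELECT hour24,
         CASE WHEN hour24 % 12 = 0 THEN 12 ELSE hour24 % 12 END AS hour12,
         CASE WHEN hour24 < 12 THEN 'AM' ELSE 'PM' END           AS am_pm
  FROM   hours;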

Cartesian Product

The next point of interest in the load_dim_time transformation is the "Cartesian product" step. This step is an instance of the "Join rows (cartesian product)" step type, which joins all incoming streams and creates one outgoing stream consisting of the Cartesian product of the incoming streams. That is, it generates all possible combinations of the rows found in the incoming streams and, for each combination, emits one record containing all the fields found in the incoming streams.

It is possible to restrict steps of the "Join rows (cartesian product)" type so that they generate only specific combinations of rows, but in this case generating all combinations is intentional, leading to a total of 24 × 60 × 60 = 86,400 rows, each representing one second in a 24-hour day. (This is not entirely correct because it does not take leap seconds into account. For now we will ignore this minor issue and move on.)
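For readers more familiar with SQL than with Kettle steps, the same operation corresponds to a plain cross join; the hours, minutes, and seconds tables below are hypothetical stand-ins for the three incoming streams.

  -- Hypothetical SQL equivalent of the "Join rows (cartesian product)" step:
  SELECT     h.hour24, m.minute, s.second
  FROM       hours   AS h
  CROSS JOIN minutes AS m
  CROSS JOIN seconds AS s;   -- 24 * 60 * 60 = 86,400 combinations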

Calculate Time and Table out dim_time

The "Cartesian product" step is followed by another Modified Java Script Value step and a "Table output" step. Functionally, these steps are analogous to the last two steps of the load_dim_date transformation: the values in the incoming stream are modified using JavaScript expressions to generate all desired representations of the time of day, including the smart key of the dim_time dimension table, and then this data is inserted into the dimension table, finalizing the transformation.

Running the Transformation

The same instructions apply here as for the load_dim_date transformation. Assuming your database is running and you have properly set up the sakila_dwh schema and database account, you should be able to run this transformation and load the dim_time dimension table within a couple of seconds.

Recurring Load

In the previous section, we discussed the construction and loading of the special dim_time and dim_date dimension tables. Because loading these tables is virtually a one-shot operation, you shouldn't consider it to be truly part of the recurring ETL process.

In this section, we cover the process of extracting any data changes that occurred in the sakila schema since the last load of the rental star schema (if any), and then transforming that data to load the dimension tables and fact table of the rental star schema.

The load_rentals Job

The entire ETL procedure for the rental star schema is consolidated into one single Kettle job called load_rentals.kjb. This job does all the work of updating and maintaining the dimension tables and loading the fact table. You can open jobs in Spoon via the main menu, just as you can transformations. Figure 4-6 shows what the job looks like in Spoon.

Figure 4-6: The load_rentals job

We discuss the key elements of this job in the remainder of this section.

Start

The first element in the load_rentals job shown in Figure 4-6 is the job entry labeled START. Every valid job has exactly one START job entry, which forms the entry point of the entire job: this is where job execution begins.

Various Transformation Job Entries

The START job entry is connected to a sequence of transformation job entries. The hop between the START job entry and the first transformation job entry is adorned with a little lock: this symbol indicates an unconditional hop, which means that the execution path of the job as a whole will always follow this path, even if the preceding job entry did not execute successfully. Later in this section, you learn about other job hop types.

The transformation job entries that follow the START job entry each refer to a particular transformation that performs a distinct part of the process. For example, the load_dim_staff job entry executes the load_dim_staff.ktr transformation (which loads the dim_staff dimension table), the subsequent load_dim_customer job entry executes the load_dim_customer.ktr transformation (which loads the dim_customer dimension table), and so on.

To see exactly which transformation file is associated with a particular transformation job entry, double-click the job entry. A dialog titled "Job entry details for this transformation" will pop up and allow you to configure the job entry. For example, double-click the load_dim_store job entry (the third transformation job entry shown in Figure 4-6), and you'll get a configuration dialog like the one shown in Figure 4-7.

Figure 4-7: The configuration dialog for the load_dim_store transformation job entry

Note that in Figure 4-7, the field "Name of job entry" contains the value load_dim_store. The value of this field corresponds directly to the label shown for that job entry on the canvas in Figure 4-6. Also note that in Figure 4-7, the "Transformation filename" field has the value:

  {Internal.Job.Filename.Directory}/load_dim_store.ktr

The first part, {Internal.Job.Filename.Directory}, is a built-in Kettle variable that evaluates at runtime to the directory in which the current job resides. Together with the remainder of the value, /load_dim_store.ktr, this forms the file name of an existing transformation file.

NOTE Variables are an important tool to make your transformations and jobs more independent of particular resources such as files and servers. The example shown in Figure 4-7 illustrates this well: instead of hard-wiring the exact path on the file system in the job entry, the location can be specified relative to the location of the current job itself using the {Internal.Job.Filename.Directory} built-in variable. Variables were briefly discussed in Chapter 3 and will be discussed in more detail throughout this book.

You should realize that the occurrence of load_dim_store in the "Transformation filename" field is completely unrelated to the value used for the "Name of job entry" field. That is to say, you are free to choose whatever name you like to label the job entry. For the load_rentals job example, the choice was made to let the job entry names correspond directly and unambiguously with the names of the associated transformation files. To open the transformation itself, right-click the transformation job entry and choose the "Open transformation" option from the context menu. We recommend that you keep the load_rentals job open while working through this chapter, as this is an easy and convenient way to browse through all the relevant transformations.

Different Types of Hops and Flow of Execution

In the load_rentals job, each transformation job entry has a success hop: this is the hop that is adorned with a green checkmark and points straight ahead to the following job entry. Each transformation job entry also has a failure hop, which is adorned with a red "Do not proceed" icon and points down to a Mail Failure job entry.

When you run the job, the job entries are executed in a serial fashion: first, the load_dim_staff job entry is started and executes its transformation. If that finishes successfully, the job follows the success hop and proceeds to the load_dim_customer job entry, which in turn executes its associated transformation, and so on, until the last transformation is executed successfully and the job proceeds with the final Mail Success job entry. Because there are no job entries beyond the Mail Success job entry, the job finishes there. Of course, it is also possible that a particular job entry does not execute successfully. In that case, the job follows the failure hop and proceeds to the Mail Failure job entry, thereby finishing the job prematurely.

Serial Execution

Note that the serial execution of job entries is quite different from what happens when a transformation is executed: in a transformation, all steps are started at once and execute simultaneously, processing rows as they arrive on the input side of the step and emitting them on the output side.

The serial execution of job entries is an excellent way of synchronizing work that must be done according to a particular flow. For example, in this particular job, the START job entry is followed by a sequence of job entries (load_dim_staff through load_dim_film) that each load a dimension table. The very last transformation job entry (load_fact_rental) is responsible for loading the fact_rental fact table. This ensures that the dimension tables are always loaded with whatever new dimension rows are required before the fact table is loaded with any rows that reference these new dimension rows.

Mail Job Entries

The Mail job entries were briefly mentioned when we discussed the different types of hops. The significance of the Mail job entries in this job is to always provide some sort of notification to a system administrator about the final status of the job: either all transformation job entries succeed and a success message is sent, or one of them fails, ending the job prematurely and sending a failure notice. (Sending notifications only in the event of failure would be a bad idea, because it would then be impossible to distinguish between a successful execution and a broken mail server.)

To witness the effect of the Mail job entries, you need to configure them so they know which mail server to use and how to authenticate against that mail server (if required). To configure the Mail job entries, simply double-click them and fill out the appropriate fields on the Addresses and Server tab pages. If you can't figure out exactly what to fill in, don't worry at this point—you will still be able to run the job and load the rental star schema without properly configuring these job entries; you just won't receive a notification e-mail informing you about the status of the job.

Running the Job

Running jobs is done in exactly the same way as running transformations. Running the entire job may take a little while, but it will typically finish in less than a minute. You do not need to worry about restarting the job, or about running the individual transformations called by this job afterward: if the rental star schema is up-to-date, the job simply detects that there aren't any changes to process.

The load_dim_staff Transformation

The first transformation to be executed by the load_rentals job is called load_dim_staff. You can either open the transformation from the context menu of the load_dim_staff job entry within the load_rentals job, or open the load_dim_staff.ktr file directly using the Spoon main menu. Figure 4-8 shows the design of the load_dim_staff transformation.

Figure 4-8: The load_dim_staff transformation

The purpose of this transformation is to load the dim_staff dimension table. We discuss exactly how this is done in the remainder of this section.

Database Connections

All transformations included in the load_rentals job define two distinct database connections: one is called sakila, which points to the sakila sample database (the source); the other is called sakila_dwh and points to the rental star schema database (the target). Note that these connections correspond directly to the two database accounts that were set up in the "Prerequisites and Some Basic Spoon Skills" section earlier in this chapter. If you haven't set up these two database accounts yet, now is a good time to revisit the "Creating Database Accounts" subsection before you proceed.

You can see these database connections in the left-pane tree view if you switch to the View mode and expand the "Database connections" folder. Figure 4-9 shows an example of what this may look like.

Figure 4-9: The database connections in the sidebar

To examine or modify the definition of a database connection, double-click it in the tree view. The Database Connection dialog will pop up. For example, double-clicking the sakila connection should pop up a dialog like the one shown in Figure 4-10.

Figure 4-10: The Database Connection dialog for the sakila connection

As you can see in Figure 4-10, quite a lot is going on in this dialog. Because this is one of the most frequently used dialogs in Kettle, we briefly discuss some of its elements here.

On the far left side of the dialog, you can select from among the categories of configuration tasks. Typically, you can do almost everything you need from the General category (the category selected in the dialog shown in Figure 4-10), but in some cases you need to visit the others, too. For example, in Chapter 17 you learn how to use the Clustering page to define a cluster of database servers.

On the right side of the dialog, you first see a "Connection name" field. This is used to provide a unique name for the connection. This name is then used throughout the job or transformation to refer to the connection. Immediately beneath the "Connection name" field are two columns. The right column contains a Connection Type list and an Access list. Use the Connection Type list to select the RDBMS product that you would like to connect to. In the unlikely event that none of the 30-something list entries match your database server, you can use the "Generic database" option to specify an arbitrary JDBC connection. From the Access list you can select which interface should be used to connect to the database. In the vast majority of cases, you should be able to use Native (JDBC) access. In the rare case that you cannot obtain a JDBC driver for your database, you can try to establish an ODBC connection instead; ODBC connections are implemented through the Sun JDBC/ODBC bridge.

NOTE In addition to JDBC and ODBC, you can also use JNDI connections. JNDI connections are technically also JDBC connections, but defined using the Java Naming and Directory Interface. Configuring JNDI connections is explained in more detail in Chapter 3.

Finally, the dialog contains a Settings pane, which displays a number of fields used to specify the actual connection details. It is hard to say in general which fields will be listed here, as this depends entirely on the selections made in the Connection Type and Access lists.

Changed Data Capture and Extraction

The first two steps of the load_dim_staff transformation are of the "Table input" type. Steps of this type execute a SQL statement against a preconfigured database connection and turn the results retrieved from the RDBMS into an outgoing record stream. The max_dim_staff_last_update step uses the sakila_dwh connection to execute a query like this:

  SELECT COALESCE(
           MAX(staff_last_update),
           '1970-01-01'
         ) AS max_dim_staff_last_update
  FROM   dim_staff

The intention of this query is to obtain a single row from the dim_staff dimension table that returns the most recent date/time at which either an update or an insert was performed on the staff table in the sakila source schema. This works because the dim_staff.staff_last_update column is filled with values coming from the staff.last_update column, which are automatically generated whenever an insert or an update occurs on the respective row.

The "Table input" step labeled Staff is used to extract the staff member data. It executes a query like this against the sakila schema:

  SELECT *
  FROM   staff
  WHERE  last_update > ?

The question mark (?) in this query denotes a placeholder for a value. It is used in the WHERE clause to select only those rows from the staff table for which the value of the last_update column is more recent than the value of the placeholder. The value for the placeholder is supplied by the preceding max_dim_staff_last_update step, which, as we know, yields the most recent date/time of the staff rows that were already loaded into the dim_staff dimension table. Effectively, this setup ensures that the outgoing stream of the Staff step contains only rows that are newer than the already loaded rows, thus capturing only the changed data. Capturing changed data is a major topic in any ETL solution, and is recognized as ETL subsystem 2. It is discussed in more detail in Chapters 5 and 6.

Converting and Recoding the Active Flag

The next two steps in the transformation are "Select values" and "Map active." Together, these serve to convert a flag of the Boolean data type, which indicates whether the staff member currently works for the Sakila rental store chain, into a textual Yes/No value that is more suitable for presentation in reports and analyses. Steps of the "Select values" type can be used to perform general manipulations on the incoming stream, such as removing fields, renaming fields, converting field values to another data type, and optionally applying a particular output format. The "Map active" step is of the "Value mapper" type. As the name implies, steps of this type look up a particular output value depending on the value of a particular field in the incoming stream, thus mapping one value to another.

The process of converting and recoding data is one example of conforming data. This is recognized as ETL subsystem 8 and is described in more detail in Chapters 5 and 9. Both the "Select values" step and the "Value mapper" step offer functionality that is very useful for cleansing and conforming activities.
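For comparison, the combined effect of these two steps on the active flag corresponds roughly to the following SQL expression; the actual recoding is of course performed by the Kettle steps, not by the database.

  -- Hypothetical SQL equivalent of recoding the Boolean active flag:
  SELECT staff_id,
         first_name,
         last_name,
         CASE WHEN active THEN 'Yes' ELSE 'No' END AS active_yn
  FROM   staff;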

Loading the dim_staff Type 2 Slowly Changing Dimension Table

The final step in the load_dim_staff transformation loads the dim_staff table. This table is a type 2 slowly changing dimension: any change to the original row in the source database results in the addition of another version of the dimension row in the target database. All rows that represent different versions of the same staff member can be grouped through the original staff_id key value. In addition, the dimension table has a pair of staff_valid_from and staff_valid_through columns, which indicate the period in time to which that particular version of the staff row applies. As a bonus, there is also a staff_version_number column, which maintains the version number of the row.

Managing slowly changing dimensions is ETL subsystem 9 (Slowly Changing Dimension Processor), which is further explained in Chapter 5. Kettle offers a specialized "Dimension lookup / update" step, which makes maintaining a type 2 slowly changing dimension a relatively easy task. The exact usage of this particular step is beyond the scope of this chapter, but it is fully explained in Chapter 8.
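To give an impression of what maintaining a type 2 dimension amounts to, the following sketch shows the kind of SQL that corresponds to processing a changed staff row: the current version is closed and a new version is inserted. This is a conceptual illustration only; the open-ended '9999-12-31' end date and the exact column list are assumptions, and in the sample solution this work is done by the "Dimension lookup / update" step rather than by hand-written SQL.

  -- Conceptual sketch only; column names as described above, end-date
  -- convention ('9999-12-31') is an assumption.
  UPDATE dim_staff
  SET    staff_valid_through = NOW()
  WHERE  staff_id = 1
  AND    staff_valid_through = '9999-12-31';

  INSERT INTO dim_staff
         (staff_id, staff_version_number, staff_valid_from, staff_valid_through,
          staff_first_name, staff_last_name, staff_active)
  VALUES (1, 2, NOW(), '9999-12-31', 'Mike', 'Hillyer', 'Yes');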

The load_dim_customer Transformation

The load_dim_customer transformation is associated with the second transformation job entry in the load_rentals job. Its purpose is to load the dim_customer dimension table. The transformation is shown in Figure 4-11.

Figure 4-11: The load_dim_customer transformation

The load_dim_customer transformation has virtually the same structure as the load_dim_staff transformation shown previously in Figure 4-8:

- The transformation starts with two steps of the "Table input" type to capture the changed data and extract it.
- The transformation ends with a "Dimension lookup / update" step, which serves to maintain the type 2 slowly changing customer dimension and load the data into the dim_customer dimension table.
- Before the data is loaded into the dim_customer dimension table, the values from the active column of the customer table are recoded into an explicit Yes/No field.

The Fetch Customer Address Subtransformation

The load_dim_customer transformation contains one extra kind of step in addition to the steps it has in common with the load_dim_staff transformation shown in Figure 4-8. The extra step is labeled Fetch Customer Address and is of the “Mapping (subtransformation)” type. Steps of the “Mapping (subtransformation)” type allow the reuse of an existing transformation. To examine to which reusable transformation a step of the “Mapping (subtransformation)” type refers, simply double-click it to open its configuration dialog. Figure 4-12 shows an example of the configuration of the Fetch Customer Address step in the load_dim_customer transformation. To load the referenced transformation itself, simply right-click the step and choose the “Open mapping (subtransformation)” option from the context menu.

Figure 4-12: The configuration of the Fetch Customer Address step

As you can see in Figure 4-12, the Fetch Customer Address step refers to the fetch_address.ktr file. This transformation is described later in this chapter. Steps of the “Mapping (subtransformation)” type allow variables to be used to refer to a particular transformation file, just like transformation job entries (refer to Figure 4-7).

NOTE Note that in this case, the variable is slightly different from the one used in transformation job entries. Instead of ${Internal.Job.Filename.Directory}, the Fetch Customer Address step uses ${Internal.Transformation.Filename.Directory}. At runtime, this evaluates to the directory of the current transformation—that is to say, the directory wherein the load_dim_customer.ktr file resides.

Possible confusion may arise from the fact that the variable that was used earlier as shown in Figure 4-7, ${Internal.Job.Filename.Directory}, could also have been used for the Fetch Customer Address step. However, this would have evaluated to the location of the load_rentals.kjb job file because this is the context job that is calling the load_dim_customer transformation. In our setup of the sample ETL solution, it would not have made a difference as long as the load_dim_customer transformation was called from the load_rentals job, because the job file resides in the same directory as the transformation files. However, it is good to be aware of the possibility that the variables ${Internal.Job.Filename.Directory} and ${Internal.Transformation.Filename.Directory} do not necessarily refer to the same location. You can experience this first hand when running the load_dim_customer transformation directly (that is, not calling it from a job): in this case, the variable ${Internal.Job.Filename.Directory} would not have been initialized because it is not applicable, whereas the ${Internal.Transformation.Filename.Directory} variable would still refer to the directory of the current transformation file.

Functionally, the purpose of the Fetch Customer Address step is to look up the customer’s address. This is necessary because of the denormalized design of the dim_customer dimension table, which contains the customer’s address data coming from the sakila.address table as well as its related sakila.city and sakila.country tables. The dim_store dimension table follows the same denormalized design, and because the same tables are involved, this is an excellent opportunity for reuse.

NOTE The effect of a step of the “Mapping (subtransformation)” type is similar to the way some programming languages allow another source file to be included in the main source code file, with similar benefits—a piece of logic that is required multiple times needs to be built and tested only once, and any maintenance due to changing requirements or bug fixes can be applied in just one place. Often, reuse barely needs justification: it can mean enormous savings in development and maintenance time as compared to the most likely alternative, which is duplication of transformation logic. But to be fair, reuse also has its price, as you have a greater responsibility to design a stable interface for the reusable component, and modifying the reusable component could negatively impact the total ETL solution in many places instead of just one.

In the particular case of the rental star schema, you could argue that needing to look up addresses twice was wrong in the first place: if the star schema design had been a little more relaxed and you allowed for a central conformed location dimension table, the dim_customer and dim_store dimension tables could have been snowflaked and could use what Kimball refers to as a location outrigger, essentially normalizing those dimensions that need location data by referring to a single conformed location dimension table. Basically, this is another form of reuse, but at the level of the data warehouse instead of in the ETL procedure. The question of whether to normalize or to denormalize is excellent tinder to spark a religious war between various BI and data warehousing experts, and we do not want to side with either approach at this point. Rather, we just want to use this opportunity to mention that Kettle has the option of reusing transformation logic, and that this can be powerful in fighting duplication of logic. Whether using this tool to solve this particular problem in the rental star schema is appropriate is a question we leave for the reader to answer.

The load_dim_store Transformation

The load_dim_store transformation has a structure that is very similar to that of the load_dim_customer transformation. The purpose of the load_dim_store transformation is to load the dim_store type 2 slowly changing dimension table. The load_dim_store transformation is shown in Figure 4-13.

Figure 4-13: The load_dim_store transformation

The mechanism for capturing changed data is exactly the same as the one used in the load_dim_staff and load_dim_customer transformations, as is the usage of the “Dimension lookup / update” step type for the task of actually loading the table and maintaining the dimension history. The obvious difference is that in this case, the source table is store and the target table is dim_store. And just like the load_dim_customer transformation, the load_dim_store transformation uses a “Mapping (subtransformation)” step to call upon the fetch_address transformation to fetch the store’s address data. The load_dim_store transformation also introduces one step type we haven’t encountered before: the Store Manager Lookup step, which is of the “Database lookup” type. This step type is used to look up values in a database, based on a SQL query. The SQL query is parameterized using values from the incoming stream, and typically these parameters map to a primary key or unique constraint defined at the database level. The outgoing stream contains all the fields present in the incoming stream, plus the fields that are added by the lookup. This can be used for a number of purposes:

- Cleaning and conforming data: Similar to a Value Mapper step, a “Database lookup” step can be used to clean and conform data, the difference being that the lookup values for the Value Mapper step have to be entered as literals, whereas the lookup values for the “Database lookup” step are sourced from a relational database.
- Denormalization: Similar to how the join operator in a SQL statement pulls together columns from multiple tables, the “Database lookup” step combines the fields from the incoming record stream with the columns fetched by the database query.
- Dimension key lookups: When loading fact tables, the “Database lookup” step can be quite useful to find keys in a dimension table based on the natural keys present in the main stream.
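To make this concrete, the Store Manager Lookup step is conceptually equivalent to the following parameterized query against the sakila staff table; the output field names are assumptions, and the actual field list in the sample solution may differ:

    SELECT first_name AS manager_first_name,
           last_name  AS manager_last_name
    FROM   staff
    WHERE  staff_id = ?;   -- bound to the manager_staff_id field of the incoming store row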

NOTE Kettle offers other lookup steps, which offer functionality beyond the simple “Database lookup” step, such as the “Dimension lookup / update” step and the “Combination lookup / update” step, which can be used for lookups of a more advanced nature such as for a type 2 slowly changing dimension or maintaining a junk dimension. This subject is covered in detail in Chapter 8.

Thus, the “Database lookup” step is a useful general-purpose tool that can be used in several of the ETL subsystems described in Chapter 5: subsystem 8 (Data Conformer), subsystem 14 (Surrogate Key Pipeline), and subsystem 17 (Dimension Manager System). Functionally, the Store Manager Lookup in the load_dim_store transformation serves to denormalize the data sourced from the store table in the source database, which is necessary before the data can be loaded into the dim_store dimension. So in this case, the “Database lookup” step is used as a dimension handler (subsystem 17).

The fetch_address Subtransformation

We just mentioned how the load_dim_customer and load_dim_store transformations use a “Mapping (subtransformation)” step to call the fetch_address transformation file as part of the main transformation. You can load the transformation either directly from the main menu by opening the fetch_address.ktr file, or you can choose the “Open mapping (subtransformation)” option in the context menu of either the “Fetch Customer Address” or “Fetch Store Address” step in the load_dim_customer or load_dim_store transformation, respectively. The fetch_address transformation is shown in Figure 4-14.

Figure 4-14: The fetch_address transformation

The remainder of this section describes the key elements in this transformation.

Address Data Lookup Cascade

The fetch_address transformation consists mainly of a series of steps of the “Database lookup” type. In Figure 4-14, these are the steps labeled Lookup Address, Lookup City, and Lookup Country. In a sense, these steps form a lookup cascade: the Lookup Address step fetches a row from the address table based on the value of an address_id field in the incoming stream, and as a result of the lookup, a city_id field (among others) is added to the outgoing stream. In turn, the Lookup City step uses the value of the city_id field to fetch the corresponding row from the city table, and the city fields, including the country_id, are added to the outgoing stream. Finally, this country_id is used to fetch a row from the country table. By this time, the outgoing stream has gained all address data corresponding to the value of the address_id field in the incoming stream of the Lookup Address step.
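The net result of the cascade is the same as that of the following join; the transformation performs three separate keyed lookups instead of one query, but the comparison may help to see what ends up in the stream (the selected column list is illustrative):

    SELECT a.address,
           a.address2,
           a.postal_code,
           a.phone,
           c.city,
           co.country
    FROM   address a
           JOIN city    c  ON c.city_id     = a.city_id
           JOIN country co ON co.country_id = c.country_id
    WHERE  a.address_id = ?;   -- the address_id field from the incoming stream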

More Denormalization: Concatenating Address Lines

In the sakila database, the address table contains two columns for storing multiple-line addresses: address and address2. The dim_customer and dim_store dimension tables only support a single address column, so something must be done when dealing with a multi-line address. The fetch_address transformation contains a step of the Filter type labeled “Is there a second address line?” that splits the incoming stream of addresses into two outgoing streams: one stream having multi-line addresses, and one stream having only single-line addresses. The multi-line addresses follow the “true” outgoing stream, which is the top stream in Figure 4-14. You can recognize it because the corresponding hop is adorned with a check mark. These addresses are then further processed using JavaScript, concatenating both address lines into a single one. Both the processed multi-line addresses and the single-line addresses are then led into the “Select values” step. In this transformation, the “Select values” step has the task of selecting only those fields that should be exposed to the calling transformation. For example, this step discards all the intermediate key fields such as city_id and country_id. While it is possible to simply expose all the fields that have accumulated in the course of the fetch_address transformation, it would be a bad idea. It would make things harder for anybody trying to call this as a subtransformation because you would have to figure out which fields are really useful and which aren’t. In addition, a calling transformation may start to accidentally or intentionally rely on these fields, and when maintaining the fetch_address transformation you would have to be committed to ensuring that these fields remain supported, too.
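Returning to the address-line concatenation: written as SQL for brevity, the filter plus JavaScript combination boils down to a conditional concatenation along these lines; the actual step uses a JavaScript expression, and the separator character is an assumption:

    SELECT CASE
             WHEN address2 IS NOT NULL AND address2 <> ''
               THEN CONCAT(address, ' ', address2)   -- multi-line: merge both lines
             ELSE address                            -- single-line: keep as is
           END AS address
    FROM   address;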

Subtransformation Interface

The initial and final steps of the fetch_address transformation are of the “Mapping input specification” type (labeled “Input address_id”) and the “Mapping output specification” type (labeled “Output Address”), respectively. These step types are typical for any transformation that is used as a subtransformation, and they define the interface of the subtransformation. Steps of the “Mapping input specification” type can be seen as special input steps that define the point(s) where the incoming stream from the calling transformation is to be injected into the subtransformation, and determine which fields to draw from the incoming stream. For example, the “Input address_id” step specifies that it expects to receive an incoming stream containing an input_id field from the calling transformation. Steps of the “Mapping output specification” type can be seen as special output steps that define at which point the stream of the local transformation is to be exposed to the calling transformation as an outgoing stream.

The load_dim_actor Transformation

The load_dim_actor transformation is responsible for loading the dim_actor dimension table. Figure 4-15 shows what it looks like.

Figure 4-15: The load_dim_actor transformation

This is by far the simplest transformation yet. It uses the same changed data capture device as all prior transformations. It introduces one new step type: Insert / Update.

The Insert / Update Step Type

The load_dim_actor transformation uses an Insert / Update step to load the data into the dim_actor table. Steps of this type work by checking a number of fields in the incoming stream against the key of a database table. If the database already contains a row matching the key, the step performs an update, assigning the values from a number of specified fields to the corresponding database columns. If the row does not already exist, it inserts those values and creates a new row instead. For the dim_actor table, this is exactly what we want: the dim_actor dimension table is not a slowly changing dimension (at least, not type 2). We do require an update, for example, when a typo in an actor’s name is corrected in the sakila source database.
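In database terms, the behavior of the Insert / Update step corresponds roughly to the following pair of statements, issued per row; the dim_actor column names used here are assumptions:

    UPDATE dim_actor
    SET    actor_first_name  = ?,
           actor_last_name   = ?,
           actor_last_update = ?
    WHERE  actor_id = ?;              -- the key field(s) configured in the step

    -- only when the UPDATE matched no row:
    INSERT INTO dim_actor (actor_id, actor_first_name, actor_last_name, actor_last_update)
    VALUES (?, ?, ?, ?);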

The load_dim_film Transformation

The load_dim_film transformation has the responsibility of loading both the dim_film dimension table and the dim_film_actor_bridge table. The transformation is shown in Figure 4-16.

Figure 4-16: The load_dim_film transformation

The load_dim_film transformation is the most complex transformation we’ve shown so far. In part, this is because it loads two tables. If you look at Figure 4-16, you can see the top half, which runs from the initial max_film_last_update step way down to the load_dim_film step. This part is responsible for loading the dim_film dimension table. The remainder, starting from the “Get film_actor” step up to and including the final dim_film_actor_bridge step, is responsible for loading the dim_film_actor_bridge table. Another reason for the size and complexity is that the rental fact star schema solves several multi-valued dimension problems, which all happen to surround the dim_film table. Finally, the dim_film table itself has its share of regular denormalization issues as well, namely the references to both the film’s language and the film’s original language. First, let’s take a look at the elements that the load_dim_film transformation has in common with the transformations you’ve seen so far:

- The usual changed data capture feature is found in the initial two “Table input” steps, max_film_last_update and Film.
- Two “Database lookup” steps, “Lookup Language” and “Lookup Original Language,” are used to denormalize the film’s language data.
- The codes that appear in the rating column of the original film table are recoded using a Value Mapper step labeled Rating Text.

We won’t discuss this part of the transformation any further as it should be rather clear how this is achieved with the techniques described prior to this section. As for the yet undescribed techniques:

- The special_features field in the original film table is a non-atomic list of values (based on the MySQL SET data type), which is transformed into a set of Yes/No flags (the film_has_ columns) in the dim_film table.
- The possible multiple categories stored in the film_category and category tables in the original sakila schema are flattened and denormalized to yet another collection of Yes/No flags (the film_is_ columns) in the dim_film table.

This concludes the summary of transformations required to load the dim_film table. As for the dim_film_actor_bridge table, there are two related but distinct things going on:

- For each row added to the dim_film dimension table, the list of actors has to be retrieved, and for each actor, the value of the actor_key must be looked up in the previously loaded dim_actor table in order to add rows to the dim_film_actor_bridge table.
- For each row that is added to the dim_film_actor_bridge table, a weighting factor has to be calculated and assigned to achieve the desired allocation of the dim_actor dimension to the metrics in the fact_rental fact table.

The following sections discuss some of these transformation techniques in more detail.

Splitting the special_features List

In order to create the flags for the special features, the comma-separated list of values stored in the original film table is first normalized to a set of rows. This allows for much more opportunity to process the values for individual special features.

The Normalize Special Features step is responsible for parsing and splitting the list of special features. This step is of the “Split field to rows” type. Steps of this type have the potential of seemingly duplicating a single row in the input stream into just as many rows in the output stream as there are distinct items in the list. For example, if one row from the incoming stream has a special_features list such as 'Deleted Scenes,Trailers', then the Normalize Special Features step will emit two rows to the output stream: one having only 'Deleted Scenes' and one having only 'Trailers'. Apart from the special_feature field, the remaining fields will be exactly identical across these two rows (thus seemingly duplicated).

Flattening Individual Special Features to Yes/No Flags

Although normalizing the list of special features to multiple rows of individual special feature values increases your opportunities for processing the individual special feature value, it also introduces a de-duplication problem. Somehow we would like to re-flatten (and thus deduplicate) the multiple rows for each film, but now storing each individual special feature value in its own column instead of a list in a single column. This is achieved with the Denormalize Special Features to ‘Yes’ Columns step. The Denormalize Special Features to ‘Yes’ Columns step is of the “Row denormaliser” type. It is called this because it pivots the values of a field across a collection of rows into a repeating group of columns. This is best explained by examining the configuration of the Denormalize Special Features to ‘Yes’ Columns step (see Figure 4-17).

Figure 4-17: Configuration of the Denormalize Special Features to ‘Yes’ Columns step

Three items in the configuration dialog shown in Figure 4-17 are of major importance:

- The key field: This is the field that you wish to denormalize into a repeating group.
- The fields that make up the grouping: In this grid you identify all the fields that define how the multiple rows are to be flattened. Basically, these are all the fields from the incoming stream that you want to retain.
- Target fields: This grid defines which fields are to be added to the outgoing stream. It is used to specify how the occurrence of unique values in the key field should be mapped to a repeating group. For example, the first row in this grid specifies that whenever the key field happens to have the value Trailers, the value Yes should be stored in the newly created field called tmp_has_trailers.

NOTE The explanation of the “Target fields” in the “Row denormaliser” configuration dialog is slightly incorrect. The value that is to be stored in the new field is not simply Yes; it is actually whatever happens to be the value of the field called Yes. Figure 4-16 includes a step labeled Add Yes/No (immediately preceding the Denormalize Special Features to ‘Yes’ Columns step), which actually defines two constant fields, called Yes and No, having the respective values Yes and No.
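For readers who think in SQL, the combined effect of the normalize/denormalize pair on the special features can be mimicked directly on the film table with MySQL's FIND_IN_SET function; the flag column names are assumptions, and the transformation itself uses the Kettle steps rather than this shortcut:

    SELECT film_id,
           CASE WHEN FIND_IN_SET('Trailers', special_features) > 0
                THEN 'Yes' ELSE 'No' END AS has_trailers,
           CASE WHEN FIND_IN_SET('Deleted Scenes', special_features) > 0
                THEN 'Yes' ELSE 'No' END AS has_deleted_scenes
    FROM   film;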

Creating Flags for the Film Categories

In the sakila database, a film can belong to multiple categories. This information is stored in the film_category table, which has foreign keys to both the film table and the category table. In the rental star schema, the category data is stored in the dim_film dimension table by creating a Yes/No flag for each category. The approach to loading this data is in part akin to the solution used for the special features. In Figure 4-16, the “Get film_categories” step is responsible for obtaining the list of categories per film. This is a step of the “Database join” type. The “Database join” step type is a close relative of the “Database lookup” step type, which you first encountered in our discussion of the load_dim_store transformation earlier in this chapter. Like the “Database lookup” step, the “Database join” step type executes a SQL query against a database that is parameterized using values from the incoming stream. But whereas the parameters for the “Database lookup” step type are typically mapped to the columns of a primary key or unique constraint in order to look up a single row from the database, the parameters for the “Database join” step type typically map to the columns of a foreign key in order to retrieve a collection of related detail rows. Just like the “Database lookup” step, the “Database join” step type adds a number of specified fields from the database to the outgoing stream, but whereas the “Database lookup” step can at most emit one row to the outgoing stream for each row in the incoming stream, the “Database join” step type typically emits one row to the outgoing stream for each row returned from the database. Another way of putting it is to say that the “Database join” step mirrors the behavior of a “Database lookup” step, and performs a “database look-down.”

The effect is quite similar to the situation that you saw when denormalizing the special features in the previous subsection: the “Get film_categories” step has the potential of seemingly duplicating the incoming rows while adding the category. And just as we described for the special features flags, a “Row denormaliser” step labeled Denormalize Categories to ‘Yes’ Columns is used to re-flatten the rows corresponding to one film and create a repeating group of Yes/No fields.
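The query behind the “Get film_categories” step is conceptually something like the following, parameterized with the film_id of each incoming row; unlike a “Database lookup,” it may return several rows per input row:

    SELECT c.name AS category
    FROM   film_category fc
           JOIN category c ON c.category_id = fc.category_id
    WHERE  fc.film_id = ?;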

Loading the dim_film Table

The actual loading of the dim_film dimension table is achieved by the load_dim_film step. This step is of the “Combination lookup” type, which bears some similarity to the “Dimension lookup / update” step type, which you first encountered in our discussion of the load_dim_staff transformation earlier in this chapter. We discuss the “Combination lookup” step type in more detail in Chapter 8, “Handling Dimension Tables.” For now, just think of this step as similar to the Insert / Update step that was used in the load_dim_actor transformation: it will insert a row into the dim_film dimension if it doesn’t exist already, and update the dimension row if it does. One difference with the Insert / Update step type is that the “Combination lookup” step type will also return the key of the dimension row. This is exactly why we chose this particular step type here: we need the keys of the dim_film rows so we can use them later in the transformation to load the dim_film_actor_bridge table.

Loading the dim_film_actor_bridge Table

The first step in loading the dim_film_actor_bridge table is to fetch the film’s actors from the film_actor table in the sakila database. This is achieved by another step of the “Database join” type, labeled “Get film_actor.” You need to obtain two more pieces of data to actually load the bridge table, and this is why the “Get film_actor” step has two outgoing hops. First, you need to look up the value of the actor_key from the dim_actor table. This is a straightforward task for the “Database lookup” step, and in the load_dim_film transformation this is achieved by the “Lookup dim_actor” step. Second, you need to calculate a weighting factor and store that together with each film_key/actor_key combination in the dim_film_actor_bridge table. Because the sakila database doesn’t store any data that could help you determine the actual contribution of each actor to a film, you simply assume that each actor has an equal weight. So, to calculate the weighting for any actor of a given film, you can simply do the following:

Weighting factor = 1 / <actors in this film>

This calculation is achieved by two steps in the load_dim_film transformation:

- The “Calculate actor count” step is of the “Group by” type. Like the GROUP BY clause in SQL, it groups the rows coming out of the “Get film_actor” step per film, and it counts the number of actors in each group.
- The “Calculate Actor weighting factor” step then computes the actor_weighting_factor field by dividing 1 by that actor count.

We just obtained the actor_key and the actor_weighting_factor, but these data are still confined to their own streams. The “Lookup actor weighting factor” step does the job of rejoining them. This step is of the “Stream lookup” type, and uses the film_key in the stream coming out of the “Lookup dim_actor” step to find a matching row in the stream coming out of the “Calculate Actor weighting factor” step, thereby conveying the actor_weighting_factor field from the “Calculate Actor weighting factor” step to the outgoing stream of the “Lookup actor weighting factor” step. After the “Lookup actor weighting factor” step, we have a single stream containing the film_key, actor_key, and actor_weighting_factor. This stream can be conveniently loaded into the dim_film_actor_bridge table using a simple Insert / Update step.
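Expressed as a set-based query, the actor count and weighting factor calculation amount to the following; in the transformation the counting is done with a “Group by” step on the stream rather than in the database:

    SELECT film_id,
           1 / COUNT(*) AS actor_weighting_factor
    FROM   film_actor
    GROUP  BY film_id;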

The load_fact_rental Transformation

The load_fact_rental transformation is the final transformation that is called from the load_rentals job. All prior transformations in the load_rentals job have the purpose of loading a dimension table, and this transformation concludes the ETL solution by loading the fact_rental fact table. Figure 4-18 shows what the transformation looks like.

Figure 4-18: The load_fact_rental transformation

The structure of the load_fact_rental transformation is quite unlike what we encountered in any of the other transformations that are called from the load_rentals job. The other transformations are mostly occupied with various operations to denormalize data in order to load a dimension table, whereas the essence of this transformation lies in calculating metrics and looking up the corresponding dimension keys. So although the load_fact_rental transformation hardly introduces any new step types, it’s still a good idea to describe its key elements.

Changed Data Capture

The transformation starts with the familiar pattern of two “Table input” steps to capture changed data. This is achieved by the max_fact_rental_last_update and “Get New Rentals to load” steps. A minor difference in comparison to the other transformations is that instead of the last_update column, the rental_id column is used to detect the changes. This works just as well as the last_update column because the values for the rental_id are generated and automatically incrementing. The only drawback of using the rental_id is that we won’t detect any updates to the rental table. For now, we will assume that this is not a problem.
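In SQL terms, and assuming the fact table stores the source rental_id (as the text implies), the two steps amount to:

    -- max_fact_rental_last_update: the highest rental_id already loaded
    SELECT COALESCE(MAX(rental_id), 0) AS max_rental_id
    FROM   fact_rental;

    -- Get New Rentals to load
    SELECT *
    FROM   rental
    WHERE  rental_id > ?;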

Obtaining Date and Time Smart Keys

In our discussion of the load_dim_date and load_dim_time transformations, we mentioned that the dim_time and dim_date dimension tables use a smart key. The values of these smart keys are obtained simply by converting the various date and time values to a particular string format. This is achieved by the “Get date and time keys” step. This allows us to retrieve the keys without actually querying the dimension tables.
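Assuming the usual yyyyMMdd and HHmmss smart-key formats (the exact formats used by the sample solution are not shown here), the conversion is equivalent to:

    SELECT CAST(DATE_FORMAT(rental_date, '%Y%m%d') AS UNSIGNED) AS rental_date_key,
           CAST(DATE_FORMAT(rental_date, '%H%i%s') AS UNSIGNED) AS rental_time_key
    FROM   rental;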

Calculating Metrics

The fact_rental table has three metrics:

- count_rentals: This is a so-called factless fact. By definition, the existence of a row in the original rental table, and thus in the fact_rental table, automatically counts as one rental. This is implemented at the database level by giving the column a default value of 1.
- count_returns: Depending on whether the rental is known to have been followed up by a return, this column is either NULL (which means that so far, the rental has not been followed up by a return) or 1 (which means that the rental is known to have been followed up by a return).
- rental_duration: Depending on whether the rental is known to have been followed up by a return, this column is either NULL (which means that so far, the rental has not been followed up by a return) or the total duration between the rental and return date, measured in seconds.

Because the fact_rental table keeps track of both rentals and returns, the status of the metrics for one particular fact row may change over time. This is an extremely simple but complete example of an accumulating snapshot fact table. The Filter step labeled “return date?” is responsible for checking whether the return date is available. The “true” path leads to a Calculator step labeled “calculate rental duration,” which does the job of setting the count_returns metric to 1 and subtracting the rental date from the return date to calculate the value of the rental_duration metric. The “false” path leads to the “rental duration null” step, which assigns NULL to the count_returns and rental_duration metrics.
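A SQL sketch of the three metrics; in the transformation the Filter and Calculator steps do this row by row:

    SELECT 1 AS count_rentals,                    -- factless: each rental row counts as one rental
           CASE WHEN return_date IS NOT NULL
                THEN 1
           END AS count_returns,                  -- stays NULL until the rental is returned
           CASE WHEN return_date IS NOT NULL
                THEN TIMESTAMPDIFF(SECOND, rental_date, return_date)
           END AS rental_duration                 -- duration in seconds, NULL until returned
    FROM   rental;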

Looking Up Keys of Regular Dimension Tables

The rental table in the sakila database has an inventory_id column to reference a particular inventory record, rather than referring directly to a row in the store or film table. So to look up the dimension key values, we first need to look up the inventory. This is achieved by the “Lookup film_id and store_id (from inventory)” step. The store_id and film_id that are thus obtained can now be used to look up the actual dimension keys.

Looking Up Keys of Type 2 Slowly Changing Dimension Tables

For the type 2 slowly changing dimensions, you cannot simply use the original ID column to look up a dimension row, because a type 2 slowly changing dimension can have multiple rows for one such key. Whenever you have to look up a row from a type 2 slowly changing dimension table, you need to know the ID as well as the context period. Only then can you try to find the correct version of the dimension row. In Figure 4-18, this is achieved by the “Lookup dim customer key,” “Lookup dim store key,” and “Lookup dim staff key” steps, which are all of the “Dimension lookup / update” type. We already encountered this step type when loading the respective dimension tables in the load_dim_customer, load_dim_store, and load_dim_staff transformations. But this time, this type of step is used to look up dimension keys in the dimension table, instead of loading new rows into the dimension tables. This is explained in further detail in Chapter 8.
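Conceptually, looking up the right version of, say, a customer combines the natural key with the rental date; the column names below are assumptions, modeled after the naming used for dim_staff:

    SELECT customer_key
    FROM   dim_customer
    WHERE  customer_id = ?                 -- natural key from the rental row
      AND  ? >= customer_valid_from        -- the rental date must fall within the
      AND  ? <  customer_valid_through;    -- validity period of this version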

Loading the Fact Table

The final step in the transformation is an Insert / Update step. Because this is an accumulating snapshot fact table, we need to be able to update the fact table in case the status of the rental changes (that is, when the rental is followed up by a return).

Summary

In this chapter, we covered a lot of ground. Using a fairly simple example source database, we explained key features in Kettle required to load a target star schema. This chapter covered:

- The business process of the fictitious Sakila DVD rental business and the structure and purpose of the sakila sample database
- The Sakila rental star schema, and how it relates to the original sakila sample database
- Basic Spoon skills, including how to open and run jobs and transformations, several ways for setting up database connections, how to edit and review the configuration of job entries and transformation steps, and how to examine the hops between them
- Using jobs to organize the execution flow of transformations and send notifications of success and failure via e-mail
- Obtaining database data using steps of the “Table input” type and parameterizing this step to set up a simple changed data capture mechanism
- Recoding values with a Value Mapper type step
- Loading a type 2 slowly changing dimension and looking up dimension keys
- How to reuse a transformation by calling it as a subtransformation
- Loading a type 1 slowly changing dimension using an Insert / Update step
- How to normalize a single-valued list into rows, and then denormalize rows into a set of columns
- Calculating aggregates and metrics

While all these things were discussed in this chapter, the focus was mainly on “the bigger picture,” and subsequent chapters will discuss the underlying concepts of the example and provide detailed practical information about the many Kettle features.

Part II: ETL

In This Part

Chapter 5: ETL Subsystems
Chapter 6: Data Extraction
Chapter 7: Cleansing and Conforming
Chapter 8: Handling Dimension Tables
Chapter 9: Loading Fact Tables
Chapter 10: Working with OLAP Data

Chapter 5

ETL Subsystems

As surprising as it may sound, until a few years ago there was no book available that was solely dedicated to the challenges involved with ETL. Sure, ETL was covered as part of delivering a BI solution, but many people needed more in-depth guidance to help them successfully implement an ETL solution, independent of the tools used. The book The Data Warehouse ETL Toolkit by Ralph Kimball and Joe Caserta (Wiley Publishing, 2004) filled that gap. A bit later, the ideas of that book found their way into an article, “The 38 Subsystems of ETL,” which added more structure to the various tasks that are part of an ETL project.

NOTE The original article can still be found online at http://intelligent-enterprise.informationweek.com/showArticle.jhtml?articleID=54200319. The most recent version can be found in The Kimball Group Reader, article 11.2, “The 34 Subsystems of ETL,” pp. 430–434 (Wiley, 2010). The names of the subsystems in this book are taken from the latter reference since the names have been altered slightly compared to earlier publications.

In 2008, Wiley published the second edition of one of the best-selling BI books ever: The Data Warehouse Lifecycle Toolkit, also by Ralph Kimball and his colleagues in the Kimball Group. In that book, the subsystems were restructured a second time, resulting in a slightly condensed list consisting of 34 ETL subsystems. We were fortunate to get Ralph’s permission to use this list as the foundation for Part II of this book, in which we show you many practical ways that Kettle can help to implement those 34 subsystems. This chapter serves as an introduction and points to the chapters in the book where the specific subsystems are covered. In this way, this chapter serves as a cross-reference that will help you translate the theoretical foundation found in the books written by Kimball et al. to the practical implementations using Kettle. This chapter assumes you’re familiar with the concepts behind data warehousing. If you’re not, Part II of our first book, Pentaho Solutions (Wiley, 2009), offers an excellent introduction. The previously mentioned books written by Ralph Kimball and company are also highly recommended, as is his classic introductory work, The Data Warehouse Toolkit, Second Edition (Wiley, 2002).

Introduction to the 34 Subsystems

As explained in The Data Warehouse Lifecycle Toolkit, the 34 subsystems provide a framework that helps us understand and categorize the implementation and management of an ETL solution. Before you can start such an implementation, you need a clear understanding of the requirements, existing systems, and available skills and technology to know what is expected and what the enablers and constraints are when delivering the solution. Many of the 34 subsystems are about managing the solution, primarily because the lifetime of a system only begins when the project result is delivered. Management is one of the four main components that comprise the subsystems. These components are:

- Extraction: In Chapter 1, the ETL introduction, we introduced data extraction as a major challenge in any ETL effort. Subsystems 1 through 3 cover this component.
- Cleaning and conforming: No matter what data warehouse architecture is used, at some point the data needs to be cleansed and conformed to business requirements. In a Kimball-style data warehouse, these steps are executed before the data enters the data warehouse (“single version of the truth”). When applying the Data Vault architecture, the data is added to the data warehouse “as is” (“single version of the facts”) and cleansing/conforming takes place in subsequent steps. Subsystems 4 through 8 cover this part of the process.
- Delivering: Thirteen of the 34 subsystems are devoted to delivering the data to the target database. Delivering means more than just writing the data to a certain destination; it also covers all transformations needed to get the data into the required dimension and fact tables.
- Managing: As mentioned earlier, a system must be managed and monitored when it is part of the basic information infrastructure of an organization. An ETL solution is no exception to this rule, so subsystems 22 to 34 cover these management activities.

Extraction

The first step in any ETL solution is invariably the extraction of data from various sources, as we explained in Chapter 1. Of the many challenges involved with getting access to source data, the political issues are often the hardest ones to tackle. The “owner” of a system is usually charged with preventing unauthorized users from accessing the system because of the potential for decreased performance when another process is accessing the system. Another concern is the restrictions enforced in the system’s license: some ERP vendors (such as SAP or Oracle) prohibit direct access to the underlying databases on penalty of refusing further support.

Subsystems 1–3: Data Profiling, Change Data Capture, and Extraction

As Chapter 6 is entirely devoted to the subjects of profiling, change data capture (CDC), and extraction, we won’t go into much detail on these three subsystems here. In short, they provide the following:

- Subsystem 1: Data Profiling System—This subsystem is aimed at gaining insight into the content and structure of the various data sources. A profile provides simple statistics such as row counts and the number of NULL values in a column, but also more sophisticated analyses such as word patterns. At the time of this writing, the latest development version of Kettle has only limited data profiling capabilities. Until more advanced built-in tools become available, you can use Kettle in conjunction with other open source tools to gain insight into the data you’re accessing in various source systems, as you’ll see in the next chapter.
- Subsystem 2: Change Data Capture System—The goal of this subsystem is to detect changes in source systems that were made since the last time the data warehouse was loaded. Currently, no special tools are available within Kettle to help you get a change data capture solution up and running, but a couple of techniques such as time-stamped CDC or taking snapshots can be applied without the use of any additional tools.
- Subsystem 3: Extraction System—The third subsystem, extraction, is meant for getting data from different sources into the ETL process. Kimball makes a clear distinction between file- and stream-based extraction, the latter not necessarily meaning a real-time stream. From a Kettle perspective, this distinction is a bit artificial because the moment the data is accessed, whether from a database, real-time feed, file, Web service, or any other source, the data flows through the transformations as a stream. The only distinction that really matters is whether the source data can change during the time that the Kettle job runs. So it’s not a matter of file versus stream, but more static versus dynamic. This distinction becomes very important in case of failure. As we explained in Chapter 1, any ETL solution should be designed for failure. When your source is static (as most files will be), it’s easy to restart the job from the start. In case of a dynamic source such as a transaction database, it could be that the data has already changed since you started the failed job. Think, for instance, about a job that loads sales data. All dimensions have loaded correctly, but then halfway through processing the sales order lines, there’s a power failure. It could well be that since the load started, a new customer with new sales transactions has been added to the source system.

Unless you fully reprocess the dimension information first, this would result in loading transaction records with unknown customer references. Depending on the kind of CDC solution implemented, recovery from these kinds of failures can be very tricky.

Cleaning and Conforming Data

The rationale behind adding steps to clean the data and conform it to business requirements before adding the content to the data warehouse is the simple fact that there isn’t a single organization in the world that doesn’t have a data quality problem. Organizations that have all their critical information stored in a single system are pretty rare, too; usually there are one or more systems in place to support the primary business process, with possibly separate solutions for business processes such as finance, HR, purchasing, or customer relationship management. Each of these systems stores the data in a way that is unique to that particular system. This means that in system A, the customer gender could be stored as F (for female), M (for male), and U (for unknown), whereas system B would have the same information coded as 0, 1, and NULL. All these sources need to be aligned to whatever rules apply for the data warehouse.
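For example, conforming the gender coding of the two hypothetical systems above to a single convention could look as follows; in Kettle, a Value Mapper step would typically perform this mapping, and the table and column names here are made up for the illustration:

    -- system A: F / M / U
    SELECT CASE gender WHEN 'F' THEN 'Female'
                       WHEN 'M' THEN 'Male'
                       ELSE 'Unknown' END AS gender
    FROM   system_a_customer;

    -- system B: 0 / 1 / NULL
    SELECT CASE gender WHEN 0 THEN 'Female'
                       WHEN 1 THEN 'Male'
                       ELSE 'Unknown' END AS gender
    FROM   system_b_customer;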

Subsystem 4: Data Cleaning and Quality Screen Handler System

Cleansing or cleaning data refers to fixing or tidying up dirty data that comes into the ETL process. We cannot stress enough that the most sensible location to clean this data is in the source systems where the data originates. However, there’s usually a sharp contrast between the available time to develop the data warehouse on the one hand, and the time needed to execute the required data quality projects on the other. So for better or worse, you’ll have to find a way to get a cleaned set of information to the end user. That doesn’t mean that all efforts to improve data quality at the source are doomed to fail; on the contrary, an ETL project is preeminently suited to support data quality improvement projects. First, the profiling phase shows clearly what’s wrong with the data. Second, the business rules that apply to the cleansing transformations in the ETL processes are not much different from what is needed in the source systems. And last but not least, involvement of the business users is mandatory to define those business rules because only the business users can tell you what the correct values should be. Ideally, business users/data owners, source system developers/managers, and ETL developers can start an ongoing data quality process. In many cases, however, the biggest source of incorrect data is the users that enter data into the system. For one of our clients, we recently developed a solution that consisted of an ETL process to read and transform source data from an operational system. The target is an inspection system where all required data are visually inspected and flagged if incorrect. The ETL transformation automatically flags common errors such as empty required fields and incorrectly formatted or missing phone numbers. On a weekly basis, the results of these quality checks are reported back to the people responsible for data entry in the source system. Although the business requirement is a 100 percent error-free data entry process, the average rate of correct data entries was below 50 percent before the improvement project started. By making the errors visible to everyone involved, the percentage of correctly entered data went up to almost 90 percent in the first year. This example shows how a simple ETL process, combined with a few standard reports, can help to educate users and thus help in improving overall data quality. In Chapter 7, you can read how Kettle can be used to achieve the results described in the case study, along with many other scenarios for cleansing and conforming data.

Subsystem 5: Error Event Handler

The error event handler is also covered in Chapter 7. The purpose of such a handler is to record every error event that occurs during an ETL load. This enables an organization to constantly monitor and analyze errors and their causes, whether the errors are due to data quality issues, system errors, or other causes. Kimball mentions the need for a separate error event schema to capture those errors, but in the case of Kettle a separate schema isn’t necessary. As you’ll see in Chapter 7, Kettle already contains many features to handle error event logging out of the box.

Subsystem 6: Audit Dimension Assembler

While an error event schema is separated from all other data in the data warehouse, an audit dimension is an intrinsic part of the data warehouse tables. The audit dimension is a special table that is linked to each fact table in the data warehouse and contains metadata about the changes made to the fact table, including information about the actual event, such as the exact load date and time and the quality indication for the loaded record(s). In fact, the addition of an audit dimension augments the dimensional model in such a way that many of the advantages of using the Data Vault architecture (see Chapter 19) apply to the dimensional data warehouse as well. Details of the audit dimension are also covered in Chapter 7.

Subsystem 7: Deduplication System

Deduplication is probably the hardest problem to tackle in an ETL project, even more so because most ETL tools don’t offer a facility to automate this process in an easy way. In almost all cases, deduplication is about eliminating double entries of customer data, or trying to unify conflicting customer data from different source systems. Although customer data represents the biggest source of data to be deduplicated, other kinds of data can have the same problems. Any entity that can be classified as “reference data” and can contain a fairly large number of entries is prone to duplicates. Examples of these are product or supplier data. If you’re familiar with Microsoft Access, you might think that deduplication is a trivial problem that is easily tackled with the Find Duplicate Query Wizard. Well, for exact matches in key information such as a phone number or street address, the problem is indeed trivial. Unfortunately, life usually doesn’t get that simple. Names and addresses are misspelled or abbreviated in different ways, phone numbers are entered incorrectly, new addresses are entered in one system but not in others, and the list goes on and on. In many cases, fuzzy logic, pattern matching, soundex functions, and other data mining techniques must be applied to get a grip on this problem. Determining that JCJM VanDongen is the same person as Dongen, van Jos or that Bouman RP equals Roland Bouman can be a daunting task. Several advanced (and expensive) tools are available that do an excellent job at this, but unfortunately Kettle isn’t one of them. Nevertheless, Chapter 7 includes some pointers on how to handle deduplication from within Kettle, and of course you can always revert to a solution like DQ Guru by SQLPower, a specialized and powerful tool that can help you deduplicate reference data. DQ Guru is open source; you can find more info at http://www.sqlpower.ca/page/dqguru.

Subsystem 8: Data Conformer

A conforming system builds on the functionality delivered by the deduplication subsystem and the previous data quality steps. Its purpose is to conform all incoming fact records from various source systems to the same conformed dimension records. Think, for example, about an organization that has a complaints management system in place, which is likely to have its own customer database. In order to link the complaints records to sales information, the customer data from the sales system and the complaints management system need to be integrated into a single customer dimension. When loading the facts from both the sales and the complaints management system, the data needs to be linked to these unique customer records in the customer dimension table. A common approach to this problem is to keep all the natural keys from the different source systems so that during fact loads, lookups can be performed against the correct key column in the dimension table. Chapter 9 covers these fact loads and lookup functions in Kettle in more detail.

Data Delivery

Delivery of new data includes a lot more than just appending new records to a target database. First, there are many ways to update dimension tables, as reflected in the different slowly changing dimension techniques. You need to generate surrogate keys, look up the correct dimension key values, make sure the dimension records are loaded before the fact records arrive, and prepare the fact records for loading. Loading facts can also be challenging because of the large data volumes, a requirement to update fact records, or both. There are also special tables and data storage options such as OLAP databases that require special attention. This is why a large number of the 34 subsystems are part of the data delivery category.

Subsystem 9: Slowly Changing Dimension Processor

Slowly Changing Dimensions (SCDs) are the cornerstone of the Dimensional Data Warehouse or Bus Architecture. Remember that a dimension is a table that contains the information needed to analyze or group facts. A customer dimension might, for instance, contain the field city, making it possible to summarize and compare customer sales by city. Whenever a customer moves to a different city, a change to the customer table in the source system is made. The slowly changing dimension processor takes care of handling those changes according to the defined rules for each column in the dimension table. Basically, there are three SCD types:

- Overwrite: Replaces the old value with the new one.
- Create new rows: Marks the current row as “old” and sets an end timestamp, while creating a new record at the same time, which is flagged as “current” with a new start timestamp.
- Add new column: Adds a new column to the table to store the updated value while keeping the current value in its original column.

In Chapter 7 of Pentaho Solutions, we extended this list with the following three SCD types:

- Add a mini-dimension: Table attributes that change more frequently are separated from the main dimension table and stored in their own table.
- Separate history table: Each change is stored as a historical record in a separate table, together with the change type and change timestamp. A table like that would be useful to answer questions such as “How many customers moved from Florida to California last year?”
- Hybrid: A combination of types 1, 2, and 3 (1 + 2 + 3 = 6).

Chapter 8 covers the first three SCD types and explains how Kettle can be used to support those different approaches.

Subsystem 10: Surrogate Key Creation System

The ETL system needs to be able to generate surrogate keys. Within Kettle this is relatively easy because it contains a special Add Sequence step that can generate artificial keys within Kettle or call a database sequence generator. This is great for an initial load or when creating a time dimension, but not for updating dimension tables. That’s where the “Dimension lookup / update” and “Combination lookup / update” steps come in. Both these steps have three ways of generating a new surrogate key value:

- Use the table (actually: column) maximum + 1.
- Use a database sequence.
- Use an auto-increment field.

The latter is also supported by the Table Output step, as the sketch below illustrates.
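The three options map to familiar SQL patterns; dim_customer and customer_key are assumed names used only for the illustration:

    -- 1. Table (column) maximum + 1
    SELECT COALESCE(MAX(customer_key), 0) + 1 FROM dim_customer;

    -- 2. A database sequence (PostgreSQL-style syntax shown)
    SELECT NEXTVAL('dim_customer_seq');

    -- 3. An auto-increment column: the database assigns the key during the insert
    INSERT INTO dim_customer (customer_id, first_name, last_name)
    VALUES (?, ?, ?);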

Subsystem 11: Hierarchy Dimension Builder

Of special consideration is building and managing the hierarchies in the data warehouse. In fact, the complete name for this subsystem is hierarchy dimension builder for fixed, variable, and ragged hierarchies. Hierarchies are the means by which users analyze data on different levels of aggregation. A simple example of a hierarchy can be found in the time dimension. In reality, most time dimensions contain more than one hierarchy: one for Year-Quarter-Month-Day, and one for Year-Week-Day. The time dimension is also a perfect example of a balanced hierarchy, where all levels have an equal depth and the same number of members. More complex hierarchies can be found in organizational structures, which often are unbalanced or variable (subtrees of variable depth) or ragged (equal depth but some levels have no data). For an example of the latter, think of geographic information where a hierarchy might look like Country-Region-State-City. Some countries do not have regions, states, or both, causing a ragged hierarchy. Both unbalanced and ragged hierarchies are often implemented as a recursive relation in a source system. Kettle offers some (but not all!) capabilities to flatten these hierarchies, as you will see in Chapter 8.

Subsystem 12: Special Dimension Builder

In addition to the slowly changing category of dimensions, a dimensionally modeled data warehouse usually contains at least one special dimension, the time dimension. Also considered special are the following types of dimensions:

- Junk dimensions (also called garbage dimensions): Contain “leftover” attributes that need to be available for analysis but don’t fit in other dimension tables. Items such as status flags, yes/no indicators, and other low-cardinality fields are good candidates for putting in a junk dimension.
- Mini-dimensions: Used to split off fast-changing attributes from a big or monster dimension, but also useful in other cases. This is why we list mini-dimensions as SCD type 4.
- Shrunken or rolled dimensions: Subsets of regular dimension tables that are created and updated from their base dimension to avoid inconsistencies. Shrunken dimensions are needed when data is aggregated to accommodate a lower level of granularity, such as when details are stored at the day level but an aggregate is rolled up to a monthly level.
- Static dimensions: Usually small translation or lookup tables that have no origin in a source system, such as descriptions for status codes and gender.
- User-maintained dimensions: Custom descriptions, groupings, and hierarchies not available in source systems but required for reporting. These dimensions can be of any kind; what distinguishes them is the fact that the content is maintained by users, not the data warehouse team (although this is often initially the case) or an ETL process.

Most of the dimension-loading techniques are covered in Chapter 8. User-maintained dimensions are a bit different; they also require an application to maintain the user data, a topic that is beyond the scope of this book. We do have a tip, however: take a look at Wavemaker, an open source rapid application development tool that lets you build a maintenance screen in minutes. You can find more information at http://dev.wavemaker.com/.

Subsystem 13: Fact Table Loader

Before fact records can enter the data warehouse, the data must be prepared for loading. Fact table building is not a process per se; it’s just listed as a separate subsystem to raise awareness of the three different kinds of fact tables:

- Transaction grain fact table: Each transaction or event, such as a point of sale transaction or a call made, is recorded as a separate fact row and, as such, loaded into the data warehouse.
- Periodic snapshot fact table: Not every recorded action is stored in the data warehouse; rather, a “picture” of the data taken at regular intervals is stored, such as daily or monthly inventory levels or monthly account balances.
- Accumulating snapshot fact table: The content of the record stored in the fact table is constantly updated when new information becomes available; the data warehouse record always reflects the latest available data. Think of an order process: this is a process with several separate dates such as an order date, expected ship date, actual ship date, expected delivery date, actual delivery date, and payment date. As the process progresses, the order record is updated with new expected or actual dates. Each time the data warehouse is loaded, this new information updates the existing fact record as well.

(Please note that in the last version of the 34 subsystems documentation, the full name of this subsystem is “Fact table loader for transaction, periodic snapshot, and accumulating snapshots grains.”)

Fact loads usually consist of many, sometimes millions, of rows. In order to handle these loads quickly, most database systems have some kind of bulk loader, which circumvents the regular transaction engine and loads the data into the destination table directly. Sometimes, in order to further speed up the process, all the indexes on the fact table are dropped before the load starts and re-created again after the load finishes. Chapter 9 covers the different fact table load techniques in depth.

Subsystem 14: Surrogate Key Pipeline

This subsystem takes care of retrieving the correct surrogate key to use with the fact record that’s being loaded. The term “pipeline” is used because the fact load looks like a process where each subsequent step performs a lookup using the column’s natural key to find the corresponding surrogate key. To make this process as efficient as possible, the current distinct set of natural and surrogate keys from the dimension table can be preloaded into memory. Chapter 9 explains how this is achieved in Kettle using the “Database lookup” and “Stream lookup” steps.
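A minimal Java sketch of this in-memory lookup idea follows; the key names and values are made up, and this is only an illustration of the principle, not Kettle’s actual implementation of the steps described in Chapter 9.

    import java.util.HashMap;
    import java.util.Map;

    public class SurrogateKeyPipeline {
        public static void main(String[] args) {
            // Preload the current natural key -> surrogate key pairs of the customer dimension.
            Map<String, Long> customerKeys = new HashMap<>();
            customerKeys.put("CUST-0001", 1L);
            customerKeys.put("CUST-0002", 2L);

            // Incoming fact rows carry the natural key; replace it with the surrogate key.
            String[] factNaturalKeys = {"CUST-0002", "CUST-0001", "CUST-9999"};
            for (String naturalKey : factNaturalKeys) {
                Long surrogateKey = customerKeys.get(naturalKey);
                if (surrogateKey == null) {
                    // In a real pipeline this would trigger error handling or the
                    // late-arriving dimension strategy of subsystem 16.
                    System.out.println(naturalKey + " -> unknown, needs special handling");
                } else {
                    System.out.println(naturalKey + " -> " + surrogateKey);
                }
            }
        }
    }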

Subsystem 15: Multi-Valued Dimension Bridge Table Builder

Bridge tables are needed to allow for variable depth hierarchies, such as a customer with subsidiaries and sub-subsidiaries where the organizations at each level can purchase goods. If you want to be able to roll up the data to the mother company level, you need a way to accomplish this. A bridge table can do that task. You can also use a bridge table when there are multiple dimension entries that have a relation to a single fact or other dimension table. Think of movie ticket sales and movie actors; if you want to summarize the revenue by actor, you need a bridge table with weighting factors to distribute the revenue over all the individual actors that performed in the movie (or those that are listed on the movie poster). Kettle doesn’t provide specific functionality for building and maintaining bridge tables, but in Chapter 4 we showed a small example in the Sakila transformations.

Subsystem 16: Late-Arriving Data Handler

Until now our discussion assumed that all data arrived or was extracted at the same time. Unfortunately this isn’t always the case: both fact and dimension records can arrive late. For fact records this doesn’t have to be a big problem; the only extra step to be taken is looking up the surrogate key that was valid at the time of the transaction. It suffices to add extra conditions for the valid_from and valid_to dates. Those fields are available in the “Dimension lookup / update” step by default. Late-arriving dimension data is a more serious problem: in this case the facts have already been loaded but the dimension information wasn’t up-to-date at the time of loading. When the dimension updates finally arrive and result in a new record in case of a type 2 SCD, the newly created surrogate key should be used to update an existing fact row that contains the previous version. A variation to this is when a new fact arrives that contains a natural customer key that’s not yet known in the dimension table. In that case, you need to create a new dimension record first, with all fields set to a default or dummy value, and use the surrogate key from this record. Later, when the correct customer data arrives from the source system, the dummy values can be updated. Late-arriving dimension data is covered in Chapter 8, late-arriving facts in Chapter 9.
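As an illustration of the date-range lookup for late-arriving facts, the following JDBC sketch retrieves the surrogate key that was valid at the transaction date. The table and column names (dim_customer, customer_id, customer_key, valid_from, valid_to) are assumptions for the example and do not refer to the schemas used elsewhere in this book.

    import java.sql.*;

    public class LateArrivingFactLookup {
        // Returns the surrogate key that was valid at the given transaction date,
        // assuming a type 2 dimension with valid_from / valid_to columns (hypothetical names).
        static Long lookupSurrogateKey(Connection conn, String naturalKey, Date transactionDate)
                throws SQLException {
            String sql = "SELECT customer_key FROM dim_customer "
                       + "WHERE customer_id = ? AND ? >= valid_from AND ? < valid_to";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setString(1, naturalKey);
                ps.setDate(2, transactionDate);
                ps.setDate(3, transactionDate);
                try (ResultSet rs = ps.executeQuery()) {
                    return rs.next() ? rs.getLong(1) : null; // null: dimension row not there yet
                }
            }
        }
    }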

Subsystem 17: Dimension Manager System

The 34 subsystems describe the dimension manager as “the centralized authority who prepares and publishes conformed dimensions to the data warehouse community.” This centralized authority is responsible for all tasks related to dimension management, and as such is more a way of organizing things. In Chapter 8, we cover how to best organize the management of dimension tables using Kettle.

Subsystem 18: Fact Table Provider System

This subsystem is another organizational approach and handles the activities involved with creating, managing, and using fact tables within the data marts. Note that subsystems 17 and 18 work together as a pair: the fact table provider subscribes to the dimensions managed by the dimension manager and attaches them to their fact tables. Chapter 9 provides more detail about this subsystem.

Subsystem 19: Aggregate Builder

For as long as databases have been used for analytical purposes, there has been an unquenchable thirst for performance. This “need for speed” led to several solutions; among these, the use of aggregate tables is the solution with the most dramatic impact. Decreasing the average response time from 30 minutes to a couple of milliseconds has always been a great way to make customers happy, and that’s what aggregate tables can do. Unfortunately, having aggregate tables in place is not enough; they need to be maintained (still something Kettle can do for you), and your database needs to be aware of the available aggregates to take advantage of them. This is where there is still a sharp distinction between closed source products such as Oracle, SQL Server, and DB/2 (which all have automatic aggregate navigation capabilities), and open source databases such as MySQL, PostgreSQL, and Ingres. The only open source product that is aggregate table–aware is Mondrian, but these aggregate tables are better created and maintained using Mondrian’s own Aggregation Designer. Alternatively, you can use one of the special analytical databases around such as LucidDB, InfoBright, MonetDB, InfiniDB, or Ingres/Vectorwise, or, in case of LucidDB, a combination of a fast database and the Pentaho Aggregation Designer. Generation and population of aggregation tables are one-time–only activities, however; neither LucidDB nor PAD maintains the aggregates after the data warehouse is refreshed. In Chapter 9, we discuss how you can use Kettle to do this for you.

Subsystem 20: Multidimensional (OLAP) Cube Builder

OLAP databases have a special (storage) structure that enables these cubes to pre-aggregate data when it is loaded. Some OLAP databases can only be written and not updated, so in those cases the data needs to be flushed before an updated set can be loaded again. Other OLAP databases (for example, Microsoft Analysis Services) allow fact updates but have their own loading mechanism that’s not available within Kettle. Chapter 10 is entirely devoted to handling OLAP data and shows how the Palo plugin can be used to populate a Palo cube.

Subsystem 21: Data Integration Manager

This subsystem is used to get data out of the data warehouse and send it to other environments, usually for offline data analysis or other special purposes such as sending an order overview to a specific customer. In Chapter 22, we show you how Kettle can be used to make this data available on a regular basis.

Managing the ETL Environment

The final section of this overview contains the 14 ETL subsystems required for managing the environment. Because Part III of this book covers all these subsystems in depth, we provide just a brief overview here with pointers to the respective chapters.

- Subsystem 22: Job Scheduler—The Community Edition of Kettle doesn’t have its own scheduler but relies on external schedulers such as the Pentaho BI Scheduler or cron. Chapter 12 contains everything related to scheduling and logging of jobs.
- Subsystem 23: Backup System—Backing up the intermediate data obtained and created during ETL processing should be part of your ETL solution. Ralph Kimball recommends staging (backing up) the data in three places in the ETL pipeline: 1) immediately after extracting, before any modifications are made to the data; 2) after cleaning, deduplicating, and conforming while possibly still in flat file or normalized data formats; and 3) after final preparation of the BI-accessible data sets. Backing up the data warehouse itself is usually not the responsibility of the ETL team but you should work closely together with the DBAs to create a failure-proof solution.
- Subsystem 24: Recovery and Restart System—An important part of ETL design is being able to restart a job when it fails somewhere during the process. Missing or duplicate entries need to be avoided at all cost so this subsystem is quite important. Following the strategy described in the previous subsystem makes restarting a failed job a lot easier. Chapter 11 covers this aspect of the ETL design as part of the broader testing and debugging topic.
- Subsystem 25: Version Control System, and Subsystem 26: Version Migration System from development to test to production—There are several ways to implement a version control system; Chapter 13 covers these topics in depth. Kettle has built-in versioning capabilities, but only in the Enterprise Edition. This doesn’t mean that you can’t use a separate version control system such as SVN or CVS. It also doesn’t mean that version management should be considered an afterthought; we’d like to repeat the note written by Ralph Kimball on this subject:

  “You do have a master version number for each part of your ETL system as well as one for the system as a whole, don’t you? And, you can restore yesterday’s complete ETL metadata context if it turns out that there is a big mistake in the current release? Thank you for reassuring us.”

The Data Warehouse Lifecycle Toolkit, 2nd Edition, by Ralph Kimball et al., Wiley, 2008

- Subsystem 27: Workflow Monitor—Ever tried to bake a cake without using a timer or an oven thermostat? Pretty hard, isn’t it? It’s similar to running an ETL job without having the means to monitor the process in detail to show you exactly what’s going on. How many rows have been processed and how fast were they processed? How much memory is consumed? Which records are being rejected and why? All these questions are answered by the workflow monitor, or, in Kettle terminology, the logging architecture. You can learn more about logging and the Kettle logging architecture in Chapters 12 and 14.
- Subsystem 28: Sort System—For some operations (such as the Kettle “Group by” and Sorted Merge steps), the data needs to be sorted first. Of course Kettle has a “Sort rows” step for this that operates in memory and pages to disk if the dataset becomes too large, but for extremely large files a separate sort tool might be necessary. We don’t cover these specialized tools but simply rely on the “Sort rows” step to do our sorting.
- Subsystem 29: Lineage and Dependency Analyzer—ETL systems should provide both lineage and impact analysis capabilities. Lineage works backward from a target data element and shows where it originated and what operations are performed on the data. Dependency or impact analysis works the other way around: From the viewpoint of a source column all following steps and transformations are displayed, showing the impact a change on that particular field or table will have on the rest of the system. At the time of this writing, Kettle has some capabilities to show the impact of a transformation step on a database, but full lineage and dependency analysis is still planned for a future version of Kettle Enterprise Edition. Chapter 14 covers what can be done using the current version by reading the Kettle metadata.
- Subsystem 30: Problem Escalation System—In case something goes wrong (and believe us, it will!), you need to be notified as soon as possible. Chapter 7 covers error handling and notification.
- Subsystem 31: Parallelizing/Pipelining System—To be able to process large amounts of data in a short timeframe, tasks should be able to run in parallel, maybe even with the workload spread over multiple machines. Chapters 15 and 16 cover these topics. Kettle’s easy-to-use clustering capabilities are especially worth mentioning; these enable an organization to dynamically add capacity (for example, during a nightly batch load) and turn it off again when it’s no longer needed (when the job has completed). Running these operations in a cloud environment such as Amazon’s Elastic Compute Cloud (EC2) avoids large hardware investments while at the same time offering a large on-demand capacity at very low operational costs.
- Subsystem 32: Security System—Security and compliance are hot topics in modern IT environments. The data warehouse, where all of an organization’s information is available in an integrated fashion, is an especially strong target for data theft. And because in a lot of cases the ETL process has direct access to the source systems, the ETL solution itself is an attractive target as well.
- Subsystem 33: Compliance Reporter—Most of the measures that are needed for full compliance are already covered by other subsystems. Compliance means that there should be a complete audit trail of where the data came from and what operations have been performed on it (lineage), what the data looked like when it was received into the data warehouse (timestamped backups), what the value was at each particular point in time (audit table; SCD type 2), and who had access to the data (logging). A good data modeling technique that has compliance written all over it is the Data Vault, covered in Chapter 19.
- Subsystem 34: Metadata Repository Manager—The goal of this final subsystem is to capture all business, process, and technical metadata related to the ETL system. An important part of this metadata is to document the system, which is covered in Chapter 11, and of course the entire Kettle architecture is metadata-driven, as discussed in Chapter 2.

Summary

This chapter introduced and explained the 34 ETL subsystems, as defined by Ralph Kimball, and linked all these subsystems to available Kettle components and relevant chapters in this book. The list of subsystems can also be viewed as the definition of ETL architecture in general: it prescribes what each subsystem should cover, not exactly how it should be implemented or exactly what the tool should do. As such, it is a magnificent list of requirements to validate any ETL solution available, not just Kettle. The four main areas that make up the 34 subsystems are:

- Extraction: Getting the data from the various source systems
- Cleaning and conforming data: Transforming and integrating the data to prepare it for the data warehouse
- Delivering data: Loading and updating the data in the data warehouse
- Managing the environment: Controlling and monitoring the correct processing of all components of the ETL solution

As this chapter also showed, there are a few tasks that are not yet fully covered by Kettle (most notably data lineage and impact analysis), but in general, Kettle is an excellent tool that can handle even the most challenging data integration task.

Chapter 6: Data Extraction

The first step in an ETL process is getting data from one or more data sources. As we discussed in Chapters 1 and 5, this is a demanding task because of the complexity and variety of these different data sources. In a traditional data warehouse environment, data is usually extracted from an organization’s transaction systems, such as financial applications or ERP systems. Most of these systems store their data in a relational database such as MySQL, Oracle, or SQL Server. As challenging as this may be from a functional point of view (we’ll take a closer look at ERP systems later in this chapter), technically it’s pretty straightforward to connect to a MySQL database using a JDBC driver and extract data from it. It gets more interesting when the database isn’t relational and there’s also no driver available to connect to it. In those situations you often end up having the data delivered in a flat file format such as a comma-separated ASCII file. An even trickier variation to this topic is data that is owned by someone else and is stored outside the corporate firewall, perhaps by a client or vendor company. In that case, a direct connection is usually not feasible so getting flat files might be the only option. In the case of data stored on the Internet, even flat files are not an option. (Imagine yourself calling Google, asking the company to FTP you some data set on a regular basis.) As you’ll see later in this chapter, Kettle provides several ways to read data from the Internet, ranging from a simple RSS reader to a Salesforce.com connector or a web services reader.

The first section of this chapter is an overview of the various components available in Kettle for extracting data. The next section is about data profiling, an important but often undervalued task in an ETL project. Data profiling shows you the structure of the data and, by calculating a set of useful statistics, gives you insight into the content and quality of the extracted data. Kettle already contains some profiling functionality, but we’ll cover a more complete tool for this task as well—eObjects.org DataCleaner. The third section in the chapter covers Change Data Capture (CDC) and the way in which Kettle can be used to support the various CDC techniques. The last section covers the techniques that can be used to hand off the data to the next step in the ETL process.

Kettle Data Extraction Overview

The first thing a novice Kettle user will discover when starting Spoon for the first time is the fact that extraction steps are actually called input steps. This makes perfect sense, as these steps input data into the Kettle data stream. The second thing that strikes most people is the large number of available input steps. Although these inputs cover most of Kettle’s functionality to extract data, it’s not the complete set of available data handling steps. Generally the steps needed to prepare the data for reading (especially with files) are available at the job level, while the steps for the actual reading of the data are available at the transformation level. To clarify this, the following sections use a categorization of the options for handling data and data extraction that’s based on the type of data extracted, not necessarily the distinction Kettle uses in the Spoon interface. Note that this is just a general overview of what’s available; for detailed job and transformation step documentation and samples, use the Kettle documentation available online. Just select Help ➪ Show the welcome screen from the menu bar to open the documentation home page.

File-Based Extraction

When using files in an ETL process, the distinction made in Kettle is pretty straightforward: Basic read and write operations are all available as transformation steps, whereas anything that has to do with file management (moving, copying, creating, deleting, comparing, compressing, and uncompressing) can be found in the “File management” job steps.

NOTE It is not necessary to create a file before you can use a “Text file output” step. This step will create the file automatically if it’s not found.

Working with Text Files

Text files are probably the easiest ones to handle with an ETL tool. There’s little magic involved in reading from or writing to them—they are easily transportable, can be compressed efficiently, and any plain editor can be used to view them (unless, of course, the file is several gigabytes in size and you only have Windows Notepad at your disposal). Basically, there are two flavors of text files:

- Delimited: Each field or column is separated by a character or tab. Common terms used for these files are CSV (for Comma Separated Values) and tab-delimited files.
- Fixed width: Each field or column has a designated width and length. Although fixed-width file formats are very reliable, they require more work to set up the file definition. Kettle offers some visual aids with the Get Fields option in the “Fixed file input” step, but if you can choose between a delimited and a fixed width format, the delimited version is the preferred solution.

For both types of files, the file encoding (the character set used for the file) can be selected. UTF-8 is more or less the standard these days, but other encodings such as US-ASCII or Windows-1252 are still widely used. In order to get the correct translation of the data in your files, it is important to set the right file encoding. Unfortunately, in many cases you cannot tell what encoding is right by just looking at the file or even at the content, so it’s always a good idea to have both sender and receiver of the file conform to the same standard.
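To show why the encoding setting matters, here is a minimal Java sketch that reads a delimited file with an explicitly chosen character set. The file name, encoding, and delimiter are assumptions made for the example and are not taken from the book.

    import java.io.BufferedReader;
    import java.nio.charset.Charset;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class ReadWithEncoding {
        public static void main(String[] args) throws Exception {
            // Reading a Windows-1252 file as UTF-8 (or vice versa) silently corrupts
            // accented characters, which is why the encoding must be set explicitly.
            Charset encoding = Charset.forName("windows-1252");
            try (BufferedReader reader = Files.newBufferedReader(Paths.get("customers.csv"), encoding)) {
                String line;
                while ((line = reader.readLine()) != null) {
                    String[] fields = line.split("\\|");   // assume a pipe-delimited layout
                    System.out.println(fields[0]);
                }
            }
        }
    }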

TIP There’s a simple trick to reveal the file encoding: Just open the file in your favorite browser and select View ➪ Character Encoding (Firefox) or View ➪ Encoding (Internet Explorer).

The most basic text file input available is the “CSV file input” step, which lets you use a single delimited file as input. Before you can process a file using this step (or even show the content of the file), the delimiter and fields need to be specified. If you’re not sure what delimiter or enclosure is being used, you’ll have to revert to a text editor to first visually inspect the contents of the file. Processing multiple files in a single load is also quite cumbersome with the “CSV file input” step and its sibling, the “Fixed file input” step. Both these input steps are basically simplified versions of the “Text file input” step, which we think is the more powerful and preferred solution for handling text data. This step is capable of a lot more than the previously mentioned input steps, including:

- Reading file names from a previous step.
- Reading multiple files in a single run.
- Reading files from compressed .zip or .gzip archives.
- Showing the content of the data file without specifying the structure. Note that you must specify the Format (DOS, Unix, Mixed) before you can view a file because Kettle needs to know what line delimiter is used.
- Specifying an escape character. This can be used to read fields containing commas in a comma-separated file. A common escape character is a backslash, enabling the value “wait\, then continue” to be read correctly as “wait, then continue” without reading the comma as a field separator.
- Error handling.
- Filtering.
- Date format locale specification.

All this power comes at a price, however; the “Text file input” step will take up more memory and processing power than both the “CSV file input” step and the “Fixed file input” step.

EXAMPLE: PROCESSING MULTIPLE TEXT FILES

In this case study, we’ll show you how a common scenario can be translated into a Kettle solution. The scenario is as follows:

- Get the directory name from a parameter table.
- Specify a subset of files to be read based on a search string.
- Read the files.

The example files are custfile1.txt and custfile2.txt, each containing 25,000 rows of customer data. You can obtain these files from the book’s companion website at www.wiley.com/go/kettlesolutions or have them generated by the Fake Name Generator on http://www.fakenamegenerator.com. To follow along with this example, download these files and put them in a separate directory. You can name this directory and the files anything you like, but to follow along make sure the file names start with cust.

The first task is to create a new transformation and add a “Text file input” step. Before you can accept file names from a previous step, you must define the file layout, so open the step, browse to one of the files and add it to the Selected files list. Before you can show the content of the file, make sure that the correct Format (DOS, Unix, Mixed) is set in the Content tab. Now you can open the file using the “Show file content” button, which will reveal the fact that there is a single header line, and the fields are delimited with the pipe ( | ) symbol. Use this information to set the correct values for Header, Number of header lines, Separator, and Enclosure in the Content tab.

After defining the file layout, select the Fields tab and click Get Fields. Kettle will now try to determine what type of data is stored in the respective fields. Although Kettle does a pretty good job of guessing the field info, it’s not without its flaws. Fields that start with a number (such as an address field containing values in the form ‘2342 Wilwood Street’, or a field with values like ‘23;44;33’) will be identified as integer data, which is incorrect. In this example, the field StreetAddress is read as an Integer field with length 4, which needs to be changed into String with length 35. Later in this chapter we look at using data profiling as another means to determine what type of data each field contains. To verify that your data looks correct, click Preview rows. If Kettle appears to have captured your file layout and data types correctly, you can close this step.

We continue by adding the steps needed to dynamically provide the file name information to the “Text file input” step you just created. These additional steps will precede the one you just created. First, add a “Get file names” step. Note that you can add multiple files or directories in the “Selected files” list, just like in the “Text file input” step. For now just enter the directory name and the wildcard for matching only the customer files. This last option requires a regular expression in the form ^cust.*, which will search for all files starting with the string cust. Figure 6-1 shows what the step looks like after entering this information.


Figure 6-1: Get customer files

You can now preview the result and see whether the correct files will be found by clicking Preview rows; this will display a result similar to the one shown in Figure 6-2.

Figure 6-2: Preview of Get File Names

Next, connect the step Get File Names to the “Text file input” step. Open the latter step, mark the “Accept file names from previous step” checkbox and select Get File Names as the step to read filenames from. The field to use as a file name is, of course, filename. Note that you cannot use the “Preview rows” option anymore, but you can run a Preview on the step to check that it works. Also note that you could have used the same path and regular expression directly in the Text file input step, but there’s a good reason for using an extra step. The Get File Names step enables you to read the directory name and file mask from a previous step, and when this information is stored in an external parameter table, you can create a very flexible solution. We don’t have such a table available but you can mimic an external table by using a Data Grid step. Add a Data Grid step to the canvas and create two string fields (source_dir and source_pattern) on the Meta tab and enter the same directory and regex information used before. The Data tab should now look similar to the one in Figure 6-3.


Figure 6-3: Data Grid with directory name

The “Get file names” step can now be adjusted by selecting the “Filename is defined in a field” checkbox and selecting the source_dir and source_pattern fields, as shown in Figure 6-4.

Figure 6-4: Get file names

In production environments, the Data Grid step should be replaced by a “Table input” step. Another approach could be to split the transformation in two and use a separate step to assign a variable based on a value retrieved from a parameter table. The assigned variable can then be used in a subsequent step that only needs to contain the “Text file input” step where the file name is replaced by the variable. The regex needs to be entered manually in that case. As you can see from these examples, there are many approaches to achieve the same result. It all depends on how flexible your solution needs to be; a more flexible solution could take longer to design and build, but can be parameterized in such a way that it can run unaltered on development, test, and production environments. More on solutions design can be found in Chapter 11.
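Outside Kettle, the file-selection idea from this example (a directory plus a regular expression such as ^cust.*) boils down to the following Java sketch; the directory path is a placeholder and not one used in the book.

    import java.io.File;
    import java.util.regex.Pattern;

    public class MatchCustomerFiles {
        public static void main(String[] args) {
            // Hypothetical source directory; the pattern mirrors the ^cust.* regex from the example.
            File sourceDir = new File("/data/incoming");
            Pattern pattern = Pattern.compile("^cust.*");

            File[] files = sourceDir.listFiles();
            if (files == null) {
                System.out.println("Directory not found: " + sourceDir);
                return;
            }
            for (File f : files) {
                if (f.isFile() && pattern.matcher(f.getName()).find()) {
                    System.out.println("Would read: " + f.getAbsolutePath());
                }
            }
        }
    }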

Working with XML Files

XML is short for eXtensible Markup Language and is an open standard for defining and describing both structure and content of data in a plain text format. Although an XML extension might reveal the fact that you are dealing with an XML file, that’s only part of the story. XML is a meta language that forms the basis for more specific implementations such as the OpenOffice OpenDocument format or formats like RSS or Atom. Many other systems are capable of exchanging information in an XML format, which makes XML a kind of lingua franca for data exchange. An XML file is basically just text and can be opened with any text editor such as Notepad or vi. This means that all available file management and file transfer operations can be used for XML files as well. XML files are not only basic text files, but need to adhere to strict specifications as well. Kettle has four validation options to check whether and how an XML file can be processed:

- Check if XML file is well formed: Basic check of whether all opening/closing tags are complete and the nesting structure is well balanced.
- DTD Validator: Checks the content of the XML file based on a Document Type Definition, which can be either internal (contained inside the XML file) or external (a separate DTD file).
- XSD Validator (Job): Checks the content of the XML file based on an XML Schema Definition file.
- XSD Validator (Transformation): Same as the previous, but can also check valid XML inside a specific input field such as a database column that contains XML data.

After checking the validity of the structure, the XML can be read using the “Get data from XML” input step. The main challenge when working with XML is to decipher the nesting structure of the file. The result of the step is a flattened and un-nested data structure that can be used to store the data in a relational database. Conversely, the “Add XML column” step is used to convert flat data to XML data. If you need to transform the XML file into anything else—another XML file with a different structure, plain text, or an HTML file—you’ll need the “XSL transformation” job step. XSL is short for eXtensible Stylesheet Language, and XSLT is the abbreviation for XSL Transformations, an XML language for transforming XML documents. As with the XSD validator, there is a step with the same name at both the job and the transformation level. The latter lets you apply an XSL transformation to a stream field containing XML data; the former reads in an entire XML file. An excellent resource with an in-depth explanation of XSLT including examples is the Wikipedia page http://en.wikipedia.org/wiki/XSL_Transformations. In order to try the examples on the web page, it suffices to save the example files to your local machine and use these in your own Kettle XSL transformation step. Chapter 21 covers everything you need to know about XML formats and the way to read data from and write data to XML files.
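For readers who want to see what the XSD validation options listed above amount to outside Kettle, here is a minimal sketch using the standard Java XML APIs; the file names orders.xsd and orders.xml are placeholders.

    import java.io.File;
    import javax.xml.XMLConstants;
    import javax.xml.transform.stream.StreamSource;
    import javax.xml.validation.Schema;
    import javax.xml.validation.SchemaFactory;
    import javax.xml.validation.Validator;

    public class ValidateAgainstXsd {
        public static void main(String[] args) {
            try {
                // Load the XML Schema Definition and validate the document against it.
                SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
                Schema schema = factory.newSchema(new File("orders.xsd"));
                Validator validator = schema.newValidator();
                validator.validate(new StreamSource(new File("orders.xml")));
                System.out.println("orders.xml is valid");
            } catch (Exception e) {
                System.out.println("Validation failed: " + e.getMessage());
            }
        }
    }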

Special File Types

Somewhere in between files and real databases are file formats that look like a database but really aren’t. This doesn’t need to cause any problems: The problems arise when people start using these file systems as if they were real databases. Kettle contains a collection of steps to deal with these file-based storage types, which we’ll list here for completeness:

- Access Input: Many organizations have Access databases that are essential to running their operations. They typically started out as lightweight prototypes or single-user solutions, and then got “discovered” by other users and put on a network drive, where they suddenly became mission-critical solutions. Kettle contains an Access Input step that lets you retrieve data from these files, but note that the database cannot be secured (there’s no way to pass username and password). Older versions such as Access 2000 are also not supported. If you need to access older or secured databases, use an ODBC connection from the “Table input” step.
- XBase Input: Back in the 1980s and early 1990s, dBase III, IV, and V were popular PC-type “databases.” Kettle offers the XBase Input step to read data from these files, which can still be useful if you need to obtain data from, for example, scientific research. Many of these data sets are still available as DBF files.
- Excel Input: We cannot stress enough that you should do everything in your power to avoid having to use Excel as input. If you must, however, the Kettle step for reading those files works great and looks and operates a lot like the “Text file input” step discussed earlier.

For other, special, file types such as LDAP, LDIF, ESRI Shapefiles, and Property files, please refer to the online steps documentation.

Database-Based Extraction

With the term database, people usually refer to an RDBMS (Relational Database Management System) such as Oracle, SQL Server, or MySQL. In the context of data extraction, this is still a valid way to look at the database world, but things are shifting toward other approaches as well. Most notable are the “no-sql” or “not only SQL” databases such as Hadoop, Hypertable, CouchDB, or Amazon SimpleDB. An extensive overview of this category of data stores is available at http://nosql-database.org/. For our discussion, we’ll initially stick to the more classic view of the database world, starting with the data extraction workhorse, the “Table input” step. Although this step is well documented in the online step guide, a working example showing how to work with parameterization and variable substitution is the subject of the next case study.

EXAMPLE: CREATING PARAMETERIZED QUERIES

For these examples, we’ll use the Sakila database that was used in Chapter 4. In fact, one example of using parameterized queries is already described in “The load_dim_staff Transformation” section of that chapter. Basically, there are two ways to parameterize a query: using variable substitution and using parameters. The latter technique is used in Chapter 4, and boils down to the following: The step prior to the “Table input” step retrieves one or more values that are passed on to the “Table input” step, where these values are placed inside the query where question marks are located. A small but completely working solution illustrating this principle is shown in Figure 6-5.

Figure 6-5: Parameterized query

The Data Grid step is used here for illustrative purposes and can be replaced by any other step that can get these parameters to the “Table input” step. You’ll see another example in the CDC discussion later in this chapter. The second technique is based on using variables. These variables have to be set in a separate transformation that must be executed prior to the execution of the transformation containing the “Table input” step; otherwise, it cannot be guaranteed that the variables get their value assigned prior to the start of the “Table input” step. Figure 6-6 shows this Set Vars transformation, which can be used as the first transformation in a job.


Figure 6-6: Set Variables transformation

The transformation using these set variables now contains a single “Table input” step where the variables are being replaced by the assigned values, as shown in Figure 6-7.

Figure 6-7: Table input with variable replacement

For completeness, the screenshot in Figure 6-8 shows the job with the respective transformations to be executed to set the variables and read the data from the database based on the parameterized query.

Figure 6-8: Variable-based “Table input” job
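The same two ideas can be expressed outside Kettle. The JDBC sketch below queries the Sakila database: the first query uses positional question-mark parameters (like the “Table input” step), the second mimics variable substitution by resolving the value before the SQL is sent. The connection URL and credentials are placeholders.

    import java.sql.*;

    public class ParameterizedQueries {
        public static void main(String[] args) throws Exception {
            // Placeholder connection settings; adjust to your own Sakila instance.
            String url = "jdbc:mysql://localhost:3306/sakila";
            try (Connection conn = DriverManager.getConnection(url, "user", "password")) {

                // 1) Question-mark parameters, filled in at execution time.
                String sql = "SELECT staff_id, first_name, last_name FROM staff WHERE store_id = ?";
                try (PreparedStatement ps = conn.prepareStatement(sql)) {
                    ps.setInt(1, 1);
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) {
                            System.out.println(rs.getInt(1) + " " + rs.getString(2) + " " + rs.getString(3));
                        }
                    }
                }

                // 2) Variable substitution: the "variable" is resolved before the query is sent,
                //    comparable to replacing a variable inside the "Table input" query text.
                int storeId = 1;
                String substituted = "SELECT COUNT(*) FROM staff WHERE store_id = " + storeId;
                try (Statement st = conn.createStatement(); ResultSet rs = st.executeQuery(substituted)) {
                    rs.next();
                    System.out.println("Staff count: " + rs.getInt(1));
                }
            }
        }
    }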

Web-Based Extraction

Getting data from the Web poses its own challenges but can sometimes be remarkably simple, too. Kettle offers various options to get data from the Web, but none of these options is a so-called “HTML scraper.” This means that the data needs to be exposed in a structured manner in order to be readable by a Kettle transformation. Because Chapter 21 covers using web services and web APIs in depth, we only touch on the various options in this section.

Text-Based Web Extraction

A file can simply be put on a website and be made accessible via HTTP. If this is the case, getting the data into Kettle is a piece of cake because you simply use the URL as the file name in a “Text file input” step that uses the Apache VFS and thus handles HTTP as a file location automatically. The following URL with ISO 3166 country names and codes provides an example of such a file: http://www.ip2location.com/download/iso3166.txt.

HTTP Client

A second way to retrieve structured data from the Internet is to use the “HTTP client” step. This is a simple way of retrieving data because it can only make a call to a URL and the result is returned as a string. This result can be regular text that can be saved as a delimited file, or in an XML format that needs to be processed using the “Get data from XML” step. If you want to experiment with the different options for retrieving data using the HTTP client, you can use the GeoNames web services on http://ws.geonames.org. As an example, let’s retrieve country info for the Netherlands using an HTTP client step. First, create a new transformation and add a Generate Rows step. You need this because the HTTP client is a lookup step and needs input to be triggered; otherwise it doesn’t do anything. Set the Limit to 1 and add a single field called countryurl of type String. The value of the field will be the URL that will be called, and can be used as follows:

- http://ws.geonames.org/countryInfo retrieves all country info in XML format.
- http://ws.geonames.org/countryInfoCSV retrieves all country info in CSV format.
- Adding a specific country code by using the country parameter limits the result to one specified country. For instance, the URL http://ws.geonames.org/countryInfoCSV?country=NL will return only the information for the Netherlands in CSV format.

Next, add an “HTTP client” step and connect the two steps. For the “HTTP client” step, select the “Accept URL from field?” checkbox and select countryurl as the URL field name. Now you can do a preview, which will output the data similar to what you can see in Figure 6-9.

Figure 6-9: HTTP client preview
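Outside Kettle, the same call is a plain HTTP GET. The sketch below fetches the CSV country info for the Netherlands with standard Java networking classes; the URL is the one from the example above, and note that the public GeoNames services may nowadays require registration or an extra username parameter, which is omitted here.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;

    public class GeoNamesHttpClient {
        public static void main(String[] args) throws Exception {
            // Same URL as used in the Kettle example above.
            URL url = new URL("http://ws.geonames.org/countryInfoCSV?country=NL");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("GET");

            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = in.readLine()) != null) {
                    // The delimited result is simply printed here; a real transformation
                    // would split it into fields.
                    System.out.println(line);
                }
            } finally {
                conn.disconnect();
            }
        }
    }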

Using SOAP

SOAP, an acronym for Simple Object Access Protocol, is not as simple as the name implies. Rather than reviewing the ongoing discussion between advocates and opponents of the SOAP protocol, we’ll limit ourselves to covering the basics needed to use a SOAP call for retrieving Internet data. You can find background information about SOAP and web services at http://en.wikipedia.org/wiki/Web_service. Chapter 21 takes a close look at web data extraction, including the use of SOAP, so if you want to dig right in just move on to that chapter.

Stream-Based and Real-Time Extraction

The data integration world is becoming more and more real-time, meaning that there should be minimal delay between the occurrence of an event and the data showing up in a report or analysis. Although most people will look at Kettle as a batch-oriented ETL tool, it is in fact process-type agnostic. In Chapter 18, we show you how to retrieve streaming data from Twitter using Kettle, and in Chapter 22 you see that Kettle can also be used to deliver an extraction result as real-time data to a Pentaho Report.

Working with ERP and CRM Systems

Enterprise Resource Planning, or ERP, systems are meant to support all of an organization’s business processes, from HR to procurement to production, shipping, and invoicing. The history of ERP systems started some 30 years ago when the first Material Requirements Planning (MRP) systems were designed to support manufacturing processes. Later, these first-generation MRP systems with a narrow focus were succeeded by Manufacturing Resource Planning (MRP II) solutions that covered a broader range of business processes but were still somewhat limited in scope and still mainly targeted at supporting manufacturing processes. As these systems matured and started to cover the full range of business processes, including HR, finance, and customer management, the M in MRP seemed a bit outdated so it was replaced by the E for Enterprise. This highlighted the fact that any organization could now benefit from this class of software solutions, not only manufacturing companies.

Over the past decades, many MRP and ERP vendors emerged (and some also went out of business again). The most well-known commercial solutions today with the biggest market shares and revenues are SAP/R3, Oracle E-Business Suite, and Microsoft Dynamics AX, with the former two targeting large multinational corporations while the latter is more focused on the Small and Medium Business (SMB) market. For many companies, however, ERP might have offered the breadth of solutions, but not the depth needed for specific purposes. Organizations such as Siebel and PeopleSoft (now both Oracle), Salesforce.com, and Amdocs thrived on the increased interest in specialized solutions for HR, CRM, and Sales support. As a result of high costs and complexity of the proprietary solutions, open source alternatives soon followed; currently ERP applications such as OpenBravo, OpenERP, Adempiere, and TinyERP can in many cases compete against their high-priced commercial counterparts. One of the biggest open source success stories in the sales force automation and CRM space is SugarCRM, which offers both a community and professional/enterprise edition. This section first explains what’s special about ERP data and then shows you how the new SAP Input step can be used to obtain data from the market-leading SAP/R3 ERP system.

ERP Challenges

ERP systems such as SAP, J/D Edwards, and Lawson have two things in common: a huge collection of tables in the database and an extreme form of normalization and data encoding. This makes it hard, if not impossible, to access the underlying database directly. More often than not, the tables and columns have cryptic names such as f4211 instead of Sales Order. SAP in particular is a notoriously complex system, and the license terms of the software even prohibit direct access through a database connection. To date, a standard SAP installation consists of over 70,000 tables and most of the logic and metadata is programmed in the application, not in the database. The only way to get meaningful information out of the database is to use the interfaces that SAP offers. Almost all ERP vendors offer a solution like the one depicted in Figure 6-10, where an abstraction layer is used to access the underlying data. These abstraction layers have several advantages:

- No need to know the underlying database structure.
- Easy navigation of available components.
- The vendor can change the implementation of the database without “breaking” the extraction logic.

There could be a cost involved as well: performance. The process of calling an API that needs to do translations by looking up metadata and then accessing the database, translating the data based on the metadata, and sending it back to the calling process takes time. Nevertheless, it is usually the only way to get to the data so it’s a good idea to first check how much time the extraction process takes and whether it fits in available batch windows.


Figure 6-10: Business data layer

Kettle ERP Plugins

Because of the plugin architecture of Kettle, several commercial third-party plugins have been available for some time to get data from systems such as SAP or SugarCRM. The Pentaho community developed several new open source plugins for Kettle 4, so for some of these systems, data access support is now available in the standard installation as well. The following options are available at the time of this writing:

- SAP Input step: Allows Kettle to retrieve data from an SAP installation using SAP interface calls.
- SalesForce.com input step: Allows Kettle to retrieve data from SalesForce using the standard web service calls.
- SalesForce.com output steps: Using the same published set of web services, it is possible to insert, update, delete, and upsert data in a SalesForce database.

Working with SAP Data

There are many ways to get data from an SAP system, most notably by invoking Remote Function Calls (RFCs) or calling a Business Application Programming Interface (BAPI). Austrian Pentaho Partner Aschauer EDV developed the open source Kettle plugin that can work with both RFCs and BAPIs. The following workflow demonstrates how the SAP plugin can be used to retrieve Billing Document information from the Financial module of SAP/R3.

NOTE The authors wish to thank Bernd Aschauer and Robert Wintner, both from Aschauer EDV (http://www.aschauer-edv.at/en), for providing the example screenshots used in this section.

In order to be able to use the SAP Input step, the SAP Java Connector library (sapjco3.jar or similar) needs to be downloaded from the SAP support site and copied to the Kettle libext directory. After this, you can drag an SAP Input step to the canvas and connect to an SAP system. An SAP connection requires more than a server address, username, and a password; Figure 6-11 shows an example connection with the correct values entered.

Figure 6-11: SAP/R3 connection

As shown in Figure 6-11, each SAP instance has a system number, and for connecting you’ll need to provide the client ID and language used as well. This is exactly the same information as required to connect to SAP using the standard SAP client. After you enter the required information, the Test button on the connection screen will show whether the connection can be made successfully. After making the connection, you can use the function browser by clicking the “Find it” button in the SAP Input step screen. This will open a dialog box where you can search for functions with a specific name using wildcards, as shown in Figure 6-12.

Figure 6-12: SAP Function Browser

Each selected function can have multiple input and output fields, which will be displayed by the SAP Input step after you select the required function, as shown in Figure 6-13.

Figure 6-13: SAP Input step

The values for the input fields can be set in Kettle prior to using the SAP Input step by using a Data Grid or Generate Rows step, or by retrieving the data from another source such as an external parameter table. In this case, it suffices to define four string columns in a Generate Rows step and limit the row generation to a single row, as displayed in Figure 6-14.

Figure 6-14: Generate Rows for SAP Input step parameters

A complete transformation to test whether this will give you the required data looks remarkably simple then. Figure 6-15 shows the required steps to set input values, retrieve data from SAP, and send the stream to a Dummy step.

Figure 6-15: Kettle transformation reading SAP data

The preview issued on the Dummy step will show that the input step extracts the required data, as displayed in Figure 6-16.

Figure 6-16: SAP Billingdoc data

This first example showed you how to use a standard function that returned the data in a directly usable format. One of the more generic RFCs is the RFC_READ_TABLE function, which can be used to read data from any table. The drawback of this step is twofold: You need to know something of the structure of the data that needs to be retrieved (at least the field names), and the return value is a single column that you need to split manually. Nevertheless, it’s a flexible way of getting any kind of data from an SAP system. Figure 6-17 shows what the “SAP Input” step looks like when using the RFC_READ_TABLE function to retrieve data from the G/L Account Master table called SKAT, which contains the account descriptions. Other important master tables in an SAP/R3 FI (finance) installation are KNB1 (Customer Master) and LFA1 (Vendor Master), names that still show the German heritage of the product: KN is short for Kunde (Customer), and LF for Lieferant (Vendor).

Figure 6-17: SAP Input step using RFC_READ_TABLE

Figure 6-17 also shows that the structure of the input and output is entirely different from the first example transformation. Now you need to specify the delimiter, the table name, the number of rows to read and skip, and the fieldnames and filter to use. As you can see in the screenshot, the fieldname and filter are of type Table, meaning that multiple values can be provided. The Generate Rows step used to create this input data is displayed in Figure 6-18 and shows how this step can be used to pass the correct values to the SAP Input step.

Figure 6-18: SAP RFC_READ_TABLE input specification Chapter 6 N Data Extraction 145

Before the data can be used, it has to be split, now using the same delimiter specified in the Generate Rows step. The Kettle “Field splitter” step type makes this a very straightforward process, as shown in Figure 6-19.

Figure 6-19: Using the “Field splitter” to divide output fields into columns

If you take a careful look at Figures 6-18 and 6-19, you’ll notice that different field names are being used. That’s because in order to access the fields in an SAP system, you’ll need to exactly specify the field names used. The Field splitter lets you define your own names for the fields; the only thing that’s important is the ordinal position and data type of the field, not the name. Figure 6-20 then shows the final output of the RFC_READ_TABLE function call that retrieved 100 rows as specified.

Figure 6-20: RFC_READ_TABLE output showing G/L Accounts
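The splitting that the “Field splitter” step performs boils down to what the following Java sketch does. The delimiter, sample row, and field names are assumptions chosen for the example, not values prescribed by SAP.

    public class SplitRfcReadTableRow {
        public static void main(String[] args) {
            // RFC_READ_TABLE returns each row as one delimited string; assume a semicolon delimiter.
            String delimitedRow = "0001;0000400000;Sales revenue domestic";
            String[] fieldNames = {"chart_of_accounts", "account_number", "description"};

            String[] values = delimitedRow.split(";", -1); // -1 keeps trailing empty fields
            for (int i = 0; i < fieldNames.length && i < values.length; i++) {
                System.out.println(fieldNames[i] + " = " + values[i].trim());
            }
        }
    }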

ERP and CDC Issues

The SAP/R3 examples in the previous section showed an easy way to obtain data from this ERP system, and based on user demand more ERP input steps will undoubtedly appear over time. There is one major issue with this way of working, however: there is no way of retrieving only changed, inserted, or deleted data. To work around this, you can take one of the following three approaches:

- File delivery: The SAP team can be asked to develop a custom program that delivers only changed data in a text file. In many projects this is the default way of operating, since access to the SAP system is prohibited for the ETL process.
- Work with snapshots: In the Snapshot-Based CDC section later in this chapter we’ll show you how to use snapshots and the Kettle Merge Join step to detect changes in data sets (a minimal sketch of the idea follows this list).
- Augment the RFCs: For CDC purposes, it could make sense to add additional parameters to an RFC to enable filtering changed data, but that needs to be done on the SAP side.
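To give a flavor of snapshot-based change detection before the full treatment later in the chapter, the following Java sketch compares yesterday's and today's key/value snapshots and classifies each key as new, changed, deleted, or identical. The two maps stand in for extracted snapshots; Kettle's own approach works on sorted row streams rather than in-memory maps.

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    public class SnapshotDiff {
        public static void main(String[] args) {
            // Yesterday's and today's snapshots: natural key -> a value (or row hash).
            Map<String, String> previous = new HashMap<>();
            previous.put("CUST-1", "Amsterdam");
            previous.put("CUST-2", "Utrecht");

            Map<String, String> current = new HashMap<>();
            current.put("CUST-1", "Amsterdam");   // identical
            current.put("CUST-2", "Rotterdam");   // changed
            current.put("CUST-3", "Groningen");   // new

            Set<String> allKeys = new HashSet<>(previous.keySet());
            allKeys.addAll(current.keySet());

            for (String key : allKeys) {
                String before = previous.get(key);
                String after = current.get(key);
                if (before == null) {
                    System.out.println(key + ": new");
                } else if (after == null) {
                    System.out.println(key + ": deleted");
                } else if (before.equals(after)) {
                    System.out.println(key + ": identical");
                } else {
                    System.out.println(key + ": changed");
                }
            }
        }
    }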

Data Profiling

Data profiling is the process of collecting statistics and other information about the data available in different source systems. The obtained information is invaluable for the further design of your data warehouse and ETL processes. Data profiling is also an important part of any data quality initiative; before quality can be improved, a baseline must be established indicating what the current state of the data is. Profiling can be performed at three different levels:

- Column profile: Collects statistics about the data in a single column.
- Dependency profile: Checks for dependencies within a table between different columns.
- Join profile: Checks for dependencies between different tables.

The starting point for profiling is always the column-level profile, which generates useful information about the data in a column, including but not limited to:

- Number of distinct values: How many unique entries does the column contain?
- Number of NULL and empty values: How many records have no value or an empty value?
- Highest and lowest values: Not only for numeric but also for textual data.
- Numeric sum, median, average, and standard deviation: Various calculations on the numeric values and value distribution.
- String patterns and length: Are the values correctly stored? (For example, German postal codes should contain five digits.)
- Number of words, number of upper and lowercase characters: What’s the total number of words in the column and are the words all upper, lower, or mixed case?
- Frequency counts: What are the top and bottom N items in a column?

Most data-profiling tools can deliver this information and sometimes even more. It gets trickier when you look at the profiling within a single table to identify correlations and interdependencies. Examples of this are combinations of postal code-to-city, city-to-region, and region-to-country. Obviously, a city name is dependent on a postal code, the region name on the city, and the country on the region. These dependencies violate the third normal form so when finding these relationships in a third normal form source system, you should take extra care, especially regarding the address information. Sometimes the relations are not very clear or are even confusing, which makes it hard to distinguish correct from incorrect entries. This is exactly the reason that so many expensive address matching and cleansing solutions exist. Take, for instance, the city-region combination: There are more than ten states in the United States with a city named Hillsboro. Without additional knowledge of the country or ZIP codes, it’s hard to tell whether a record contains erroneous information or not. For these cases, you’ll need external information to validate the data against.

Inter-table relationships are easier to profile; it is simply a matter of evaluating whether a relationship is correctly enforced. In an order entry system, it shouldn’t be possible to find a customer number in the order table that does not exist in the customer table. The same relationship test can be used to find out how many customers are in the customer table but not (yet) in the order table. The same applies to products and order details, inventory and suppliers, and so on.
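A handful of the column measures listed above are easy to compute by hand, as the following Java sketch shows for a small in-memory sample; a real profiler of course works against the database or a full extract rather than a hard-coded list.

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    public class ColumnProfileSketch {
        public static void main(String[] args) {
            // Sample column values; null and "" count as missing.
            List<String> values = Arrays.asList("97124", "97006", null, "", "97124", "2713 HV");

            int rowCount = values.size();
            int nullOrEmpty = 0;
            Set<String> distinct = new HashSet<>();
            String lowest = null, highest = null;

            for (String v : values) {
                if (v == null || v.isEmpty()) {
                    nullOrEmpty++;
                    continue;
                }
                distinct.add(v);
                if (lowest == null || v.compareTo(lowest) < 0) lowest = v;
                if (highest == null || v.compareTo(highest) > 0) highest = v;
            }

            System.out.println("Row count        : " + rowCount);
            System.out.println("NULL/empty values: " + nullOrEmpty);
            System.out.println("Distinct values  : " + distinct.size());
            System.out.println("Lowest value     : " + lowest);
            System.out.println("Highest value    : " + highest);
        }
    }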

Using eobjects.org DataCleaner

Currently, the community edition of the Pentaho BI suite does not contain data profiling capabilities, so we will use a tool named DataCleaner developed by the open source community eobjects.org. (Kettle 4.1 will contain a Data Profile feature as one of the Database Explorer functions.) The software can be obtained from http://datacleaner.eobjects.org/ and is very easy to install. On Windows, you simply unzip the package and start datacleaner.exe. On a Linux machine, after unpacking the tar.gz file you first need to make the datacleaner.sh shell script executable to start the program. If you are using the GNOME desktop environment, this is very easy: Just right-click the file and open the properties. Then go to the Permissions tab and select the Execute option “Allow executing file as program.” Now you can double-click on the datacleaner.sh file and the program will start. If you want a more convenient way to start the program next time, you can create a shortcut (in Windows) or a launcher (in GNOME). DataCleaner performs three main tasks:

N Profile: All the column profiling tasks described earlier. The idea here is to gain insight into the state of your data. You can thus use the profiling task whenever you want to explore and take the temperature of your database. 148 Part II N ETL

- Validate: To create and test validation rules against the data. These validation rules can then later be translated (by hand) into Pentaho Data Integration validation steps. The validator is useful for enforcing rules onto your data and monitoring the data that does not conform to these rules.
- Compare: To compare data from different tables and schemas and check for consistency between them.

From these descriptions, it's immediately clear that DataCleaner does not provide intra-table profiling capabilities as a direct option, but there are other ways to accomplish this with the tool, as we'll show later. The first thing you need, of course, is a connection to the database you want to profile. For each type of database, DataCleaner needs the corresponding driver, which enables the communication between the tool and the database. Before we explain how to add drivers and connections, let's take a first look at the available functions. DataCleaner starts with the New task panel open, which allows you to choose one of the three main options: Profile, Validate, and Compare. Click on Profile to start a new profiling task. You'll see an almost empty two-pane screen with some options and the "No data selected" indication in the left pane (see Figure 6-21).

Figure 6-21: Profiling task

Now select "Open database," select the DataCleaner sampledata entry, and click Connect to Database. All other fields have been set already. When you open the PUBLIC tree node on the left by clicking on the + sign, the list with tables appears. Each table can be opened individually, which displays the available columns. To add a column to the data selection, just double-click it. You'll notice that the table name is added to the Table(s) field, and the column to the Column(s) field. To remove a column from the selection, double-click it again or use Clear Selection to completely remove the selected tables and columns. The Preview option shows a sample of the selected data; the number of rows to be retrieved can be adjusted after clicking the button. The default value often suffices to get a first impression of the content of the data. Each table gets its selected columns displayed in a separate window.

Next to the “Data selection” tab is the Metadata tab. When you click this tab, the technical metadata of the selected columns is displayed. The field type, field length, and especially the Nullable indication give you a first impression of the kind of data to be expected.

Adding Profile Tasks

After selecting some columns to profile, you can add different profiles. DataCleaner contains the following standard profile options:

- Standard measures: Row count, number of NULL values, empty values, highest and lowest value.
- String analysis: Percentage of upper and lowercase characters, percentage of non-letter characters, minimum and maximum number of words, and the total number of words and characters in the column.
- Time analysis: Lowest and highest date value, plus number of records per year.
- Number analysis: Highest, lowest, sum, mean, geometric mean, standard deviation, and variance.
- Pattern finder: Finds and counts all patterns in a character column. Mostly used for phone numbers, postal codes, or other fields that should conform to a specific alpha-numeric pattern. Pattern examples are 9999 aa (4 digits, space, 2 characters) and aaa-999 (3 characters, hyphen, 3 digits).
- Dictionary matcher: Matches the selected columns against the content of an external file or another database column (a "dictionary").
- Regex matcher: Matches columns against a regular expression.
- Date mask matcher: Matches text columns against date patterns; this cannot be used with date fields, only with text fields containing date and/or time information.
- Value distribution: Calculates the top and bottom N values in a column based on their frequency, or ranks the number of occurrences and calculates the frequency percentage for each value. The value for N can be any number between 0 and 50; the default is 5.

The collection of profiles in a task is very flexible; it's possible to add profiles of the same type to a single task. Each task can be saved as well, but this will only save the connection and task profiles, not the profiler results. This last option is a separate function and saves the results in an XML file, which is unfortunately for the moment a one-way street; DataCleaner cannot read these files back. Persisting profile results is part of the roadmap for future releases.
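Several of these profiles are easy to approximate in plain SQL if you want to cross-check DataCleaner's output. As a minimal sketch of a value distribution, assuming a hypothetical customer table with a city column (not one of the sakila tables):

    -- Top 5 values by frequency, with their share of the total row count
    SELECT city,
           COUNT(*) AS occurrences,
           100.0 * COUNT(*) / (SELECT COUNT(*) FROM customer) AS frequency_pct
    FROM customer
    GROUP BY city
    ORDER BY occurrences DESC
    LIMIT 5;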

Adding Database Connections

One of the first things to do when setting up the data profiling environment is to add the correct database drivers and store the connections to your own databases for easy

selection. The first task is pretty straightforward; in the main DataCleaner screen, select File -> Register database driver. There are two ways to add a new driver. The first is to automatically download and install the driver. This option is available for MySQL, PostgreSQL, SQL Server/Sybase, Derby, and SQLite. The second way of doing this is to manually register a .jar file with the drivers. To help you find the drivers, DataCleaner contains the option to visit the driver website for the most common database drivers, such as those for Oracle or IBM DB2. After downloading a driver, you'll need to reference it by selecting the file and the correct driver class. For MySQL, we will use the Automatic download and install option.

NOTE If you already installed the MySQL JDBC driver, there's no need to download it again; just register your existing .jar file.

Adding the connection so you can select it from the drop-down list in the Open Database dialog box is a bit more complicated. For that you need to alter the DataCleaner configuration file, which can be found in the DataCleaner folder and is called datacleaner-config.xml. To edit XML files, it's best to use a plain-text editor that understands the XML syntax. For the Windows platform, the open source Notepad++ can be used; on a Linux machine, just right-click the file and open with Text editor. Look for the part in the file that says:

1""CVbZYXdccZXi^dch#6YYndjgdlcXdccZXi^dch]ZgZ#""3#

Below this line there's an empty entry for the drop-down list; just leave that where it is. The second entry is the connection to the sample data. Copy the sample data part that starts with <bean and ends with </bean>, including the start and end bean tags. Paste it right below the closing tag of the sample data entry and adjust the information to reflect your own settings. The following code shows the entry as it should look for the connection to the sakila database on your local machine:

    <bean class="dk.eobjects.datacleaner.gui.model.NamedConnection">
      <property name="name" value="sakila database"/>
      <property name="connectionString" value="jdbc:mysql://localhost:3306"/>
      <property name="username" value="sakila"/>
      <property name="password" value="sakila"/>
      <property name="tableTypes">
        <list>
          <value>TABLE</value>
        </list>
      </property>
    </bean>

To have DataCleaner also connect to the correct catalog, for instance the sakila catalog, an extra line should be added below the password property line, like this:

    <property name="catalog" value="sakila"/>

We don't recommend storing passwords in plain-text files; in fact, we strongly oppose doing so, and in this case you can leave the password field empty as well. In that case, you'll need to provide the password each time you create a new profiling task. To use DataCleaner with sources other than the sakila database, you can find examples of the XML bean element for other popular databases in the online DataCleaner documentation.

Doing an Initial Profile

The DataCleaner profiler has been optimized to allow you to do a rather quick and at the same time insightful profile with little effort. To get started with profiling, you can add the Standard Measures, String Analysis, Number Analysis, and Time Analysis profiles by repeatedly clicking the Add Profile button in the top-right corner of the Profile task window. You can apply these profiles to all the columns of your database to get the initial insight.

Working with Regular Expressions

Regular expressions, or regexes, are a way of masking and describing data, mainly for validation purposes but also to find certain patterns in a text. Several books have been written about working with regular expressions, so we refer to that existing information here. DataCleaner contains both a regex matcher as one of the profiles and a regex validation as part of the validator. Before you can use regular expressions, you'll need to add them to the Regex catalog in the main DataCleaner screen. Initially this catalog is empty, but it's easy to add regexes. When you click "New regex," three options appear. The first one is to create a new regex manually and the last one is to get a regex from the .properties file. The second option is the most interesting: When you select Import from the RegexSwap, an online library is opened with a large collection of existing regexes to pick from. It is also possible to contribute your own regexes to the RegexSwap at http://datacleaner.eobjects.org/regexswap for others to (re)use. After importing a regex from the RegexSwap, you can open it to change its name and the expression itself, and there's an option to test the expression by inputting strings you want to validate.

If the RegexSwap doesn't fulfill your needs, a vast number of regular expressions are available on other Internet websites as well. The site http://regexlib.com, for example, contains regexes for U.S. phone numbers and ZIP codes. Another great site, especially if you want to learn the regular expression syntax, is www.regular-expressions.info.

For a very simple example of using a regex validation, create a new Profiler task, connect to the sakila database, and select the address table in the "Data selection" tab. Now add a Regex matcher with the "Add profile" option and keep only the Integer regex to validate the phone field in the address table, as shown in Figure 6-22.

Figure 6-22: Regex match of phone number
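The same check is easy to reproduce directly against the source database if you prefer SQL. A minimal sketch using MySQL's REGEXP operator on the sakila address table, mirroring the Integer regex used above:

    -- Phone numbers that are not purely numeric (candidates for cleansing)
    SELECT address_id, phone
    FROM address
    WHERE phone NOT REGEXP '^[0-9]+$';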

Regular expressions are also a powerful component in the Kettle toolkit for handling all kinds of things. You already saw one example when we used a regular expression to find all .txt files in a directory. Chapters 7 and 20 also contain several examples of how regular expressions can be used in your transformations.

Profiling and Exploring Results

When the definition of a profile is complete, the "Run profiling" option starts the process. DataCleaner will display a status screen where you can also monitor the progress of the profile process. When the profiler is finished, a results tab is added to the screen, one for each table that contained profiled columns. Figure 6-23 shows the output of the profile task from the preceding section.

Figure 6-23: Profiler results

As is clearly visible, there are two phone numbers that do not match the regular expression, which enables us to show another nice DataCleaner feature: drilling down to the details. Clicking on the green arrow next to the two found exceptions opens the screen shown in Figure 6-24.

Figure 6-24: Profiler results

This isn’t the end, however. When you right-click, you’ll get two export options: one for selected cells and one for the entire table. The latter option will also add the column headers to the clipboard; the first one just copies the selected data. Selected cells don’t have to be adjacent. By using the Ctrl key, you can, for instance, select only the address ID and phone number and copy those columns to the clipboard. After that, you can easily paste the data into a spreadsheet or other file for further exploration.

Validating and Comparing Data

Validation works in a similar fashion as the profiling task but adds some capabilities. You can check for null values or do a value range check to find out whether entries in a column fall between a lower and upper value bound. The most advanced feature is the JavaScript evaluation, which lets you use any JavaScript expression for evaluating data. The difference is the output: The validation task will display only the entries that do not pass the tests, with a count of the records. The DataCleaner roadmap includes future plans to integrate the profile and validation tasks and offer a single integrated interface for both tasks.

Data comparison enables you to compare data from different databases or schemas, or compare data against a file. This task therefore can be used to check whether all customers in the rental table also exist in the customer table and to perform similar comparison tasks.

Using a Dictionary for Column Dependency Checks

DataCleaner does not provide an out-of-the-box solution to verify combinations of columns or whether one dependent column contains an invalid entry based on the information in another column. There is, however, a way to do these analyses by using a dictionary combined with database views. A DataCleaner dictionary is a text file containing values that can be used to validate data in a database table. For example, you can download the ISO country table, store the values in a text file, and use this text file as a catalog to verify the entries in a country column. If you take this one step further, it's also possible to store multiple concatenated fields per row and create a view in the database, which concatenates the columns to be validated in the same way. Now the view

can be profiled using the dictionary with the concatenated entries; each row that does not correspond to the correct values in the text file will be recognized by DataCleaner. As an alternative to using text files, it's also possible to use a "real" database dictionary. This database dictionary needs to be added to the DataCleaner configuration file as explained in the section "Adding Database Connections."
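As a sketch of the view-plus-dictionary idea (the view and column names below are our own, not something DataCleaner prescribes), concatenate the columns that must be validated together in exactly the same format as the rows in the dictionary file, and then run a dictionary matcher against the view:

    -- Combine country code and country name into a single checkable string,
    -- e.g. 'NL|Netherlands'; the dictionary file must use the same code|name format.
    CREATE VIEW v_country_check AS
    SELECT CONCAT(country_code, '|', country_name) AS country_pair
    FROM customer_address;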

Alternative Solutions

Very few open source alternatives exist for standalone or embedded data profiling. The data modeling tool from SQLPower, which we introduce shortly, has some basic profiling capabilities, and Talend offers a Data Profiler as well. If any of these tools work for you, just use them. Another frequently used alternative is creating custom SQL scripts for data-profiling purposes. We would recommend this only if you have very specialized requirements that are not provided out-of-the-box by DataCleaner. Although it's outside the scope of this book, it is possible to extend DataCleaner's functionality with your own custom profiling tasks, which gives you a faster, more reliable, and more flexible solution than starting completely from scratch.
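If you do go the custom SQL route, a profiling script usually boils down to a handful of aggregates per column. A minimal sketch for a single column, here the phone column of the sakila address table (repeat or generate this per column you want to profile):

    SELECT COUNT(*)                                    AS row_count,
           COUNT(*) - COUNT(phone)                     AS null_values,
           SUM(CASE WHEN phone = '' THEN 1 ELSE 0 END) AS empty_values,
           COUNT(DISTINCT phone)                       AS distinct_values,
           MIN(LENGTH(phone))                          AS min_length,
           MAX(LENGTH(phone))                          AS max_length
    FROM address;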

Text Profiling with Kettle

As stated before, Kettle doesn't offer profiling capabilities similar to DataCleaner, with one notable exception: text file profiling. If you profile a text file with DataCleaner by using "Open file," you'll notice that all fields are of type Varchar. Another limitation is that DataCleaner recognizes only commas, semicolons, and tabs as field delimiters, which limits the value of the tool when working with text files. In this case, Kettle might be a better alternative. When using the "Text file input" step, as shown earlier, the Get Fields function does a pretty good job of recognizing the field types and lengths. It's not a perfect solution, however; when a single field contains multiple separated integer values (for instance 21;33;56) or a time value in the form 21:34, Kettle sees this as an integer of length 2, not a string of length 8 or 5. This is a known issue which should be resolved in Kettle 4.1.

CDC: Change Data Capture

The first step in an ETL process is the extraction of data from various source systems and storing this data in staging tables. This seems like a trivial task, and in the case of initially loading a data warehouse it usually is, apart from challenges incurred from data volumes and slow network connections. But after the initial load, you don't want to repeat the process of completely extracting all data again (which wouldn't be of much use anyway because you already have an almost complete set of data that only needs to be refreshed to reflect the current status). All you're interested in is what has changed since the last data load, so you need to identify which records have been inserted, modified, or even deleted. The process of identifying these changes and only retrieving records that are different from what you already loaded in the data warehouse is called Change Data Capture, or CDC.

NOTE Parts of this section were published earlier in Chapter 6 of Pentaho Solutions.

Basically there are two main categories of CDC processes, intrusive and non-intrusive. By intrusive, we mean that a CDC operation has a possible performance impact on the system the data is retrieved from. It is fair to say that any operation that requires executing SQL statements in one form or another is an intrusive technique. The bad news is that three of the four ways to capture changed data are intrusive, leaving only one non-intrusive option. The following sections offer descriptions of each solution and identify their pros and cons.

Source Data–Based CDC

Source data-based CDC relies on the fact that there are attributes available in the source system that enable the ETL process to make a selection of changed records. There are two alternatives:

- Direct read based on timestamps (date-time values): At least one update timestamp is needed for this alternative, but preferably two are created: an insert timestamp (when was the record created) and an update timestamp (when was the record last changed).
- Using database sequences: Most databases have some sort of auto-increment option for numeric values in a table. When such a sequence number is used, it's easy to identify which records have been inserted since the last time you looked at the table (a minimal sketch of this approach follows the list below).

Both of these options require extra tables in the data warehouse to store the information regarding the last time the data was loaded or the last retrieved sequence number. A common practice is to create these parameter tables either in a separate schema or in the staging area, but never in the central data warehouse and most certainly not in one of the data marts. A timestamp or sequence-based solution is arguably the simplest to implement and for this reason also one of the more common methods for capturing change data. The penalty for this simplicity is the absence of a few essential capabilities that can be found in more advanced options:

- Distinction between inserts and updates: Only when the source system contains both an insert and an update timestamp can this difference be detected.
- Deleted record detection: This is not possible, unless the source system only logically deletes a record, i.e., it has an end or deleted date but is not physically deleted from the table.
- Multiple update detection: When a record is updated multiple times during the period between the previous and the current load date, these intermediate updates get lost in the process.
- Real-time capabilities: Timestamp or sequence-based data extraction is always a batch operation and therefore unsuitable for real-time data loads.
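As promised, here is a minimal sketch of the sequence-based variant; the parameter table and the orders table used here are our own example names, not part of any sample schema:

    -- Parameter table that remembers the highest sequence value loaded so far
    CREATE TABLE cdc_sequence (
      table_name    VARCHAR(64) NOT NULL,
      last_sequence BIGINT      NOT NULL
    );

    -- Extract only the rows inserted since the previous load
    SELECT o.*
    FROM orders o
    JOIN cdc_sequence s ON s.table_name = 'orders'
    WHERE o.order_id > s.last_sequence;

    -- After a successful load, move the bookmark forward
    UPDATE cdc_sequence
    SET last_sequence = (SELECT MAX(order_id) FROM orders)
    WHERE table_name = 'orders';

Note that this only detects inserts; as the list above shows, updates and deletes remain invisible to a pure sequence-based approach.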

USING TIMESTAMPED CDC IN KETTLE: AN EXAMPLE

The sakila sample database used in Chapter 4 offers a good source for demonstrating the use of timestamps to capture changed data because all tables contain an update timestamp, and the customer table even contains an insert timestamp as well. For our purposes, the customer table is best because it can distinguish between inserts and updates. To use the timestamps, you need to store the last load date somewhere in a Kettle property or a parameter table. For this simple example, you'll just create a cdc_time table in the sakila_dwh catalog that consists of two fields, a last_load timestamp and a current_load timestamp. Initially, both the last_load and current_load timestamps are set to a very early date-time value (the latter will be set to the current time when we start loading). First, create the table to hold the timestamps:

    CREATE TABLE cdc_time (
      last_load    datetime,
      current_load datetime
    );

Then set the default values:

    INSERT INTO cdc_time VALUES ('1971-01-01 00:00:01', '1971-01-01 00:00:01');

The logic is as follows:

1. When the load job starts, the value of the field current_load is set to the actual time of the job start. To do this, use a Get System Info step and create a field sysdate of type "system date (fixed)". Then add an Insert / Update step and create a hop between the Get System Info step and the Insert / Update step. In the "Update fields" section, map the table field current_load to the stream field sysdate. Set Update to Y and make sure there's a condition for the lookup section as well. Setting current_load as the Table field and IS NOT NULL as the Comparator will do the job.

2. The query that retrieves data from the customer table needs to be restricted by using the begin and end dates you just created. Logically this would look like the following:

    create_date >= last_load AND create_date < current_load
    OR last_update >= last_load AND last_update < current_load

In order for this code to use the values from the cdc_time table, you need two Table input steps: one to read the values from the cdc_time table and one to select the customers. If you look at the restriction, you'll notice that both last_load and current_load appear twice in the condition. This means you need to retrieve them twice as well by using the following query in the first input step:


    SELECT last_load     last1
         , current_load  cur1
         , last_load     last2
         , current_load  cur2
    FROM cdc_time

In the next step, where the customers are selected, check the "Replace variables in script?" box and select the previous input step in the "Insert data from step" drop-down list (you need to create a hop first). The SELECT statement can now be extended with the following WHERE condition:

    WHERE create_date >= ? AND create_date < ?
       OR last_update >= ? AND last_update < ?

The question marks will be replaced sequentially by the values returned from the first Table input step, so the first question mark will get the value of last1, the second one the value of cur1, and so on.

3. A distinction between inserts and updates is easily made by checking whether the create_date and last_update values are equal:

    CASE WHEN create_date = last_update THEN 'new'
         ELSE 'changed'
    END AS flagfield

4. After the load is completed without errors, the value of the field current_load is copied to the field last_load. If an error occurs during the load, the timestamps remain unchanged, enabling a reload with the same timestamps for last and current load. This can easily be accomplished by using an "Execute SQL script" step with the following query:

    UPDATE cdc_time SET last_load = current_load

The reason for using two fields (especially the current_load field) is that during a load, new records can be inserted or altered. In order to avoid dirty reads or deadlock situations, it's best to set an upper limit for the create and update timestamps.

Trigger-Based CDC

Database triggers can be used to fire actions when you use any data manipulation statement such as INSERT, UPDATE, or DELETE. This means that triggers can also be used to capture those changes and place these changed records in intermediate change tables in the source systems to extract data from later, or to put the data directly into the staging tables of the data warehouse environment. Because adding triggers to a database will be prohibited in most cases (it requires modifications to the source database, which is often not covered by service agreements or not permitted by database administrators)

and can severely slow down a transaction system, this solution, although functionally appealing at first, is not implemented very often.

An alternative to using the triggers directly in the source system would be to set up a replication solution where all changes to selected tables will be replicated to the receiving tables at the data warehouse side. These replicated tables can then be extended with the required triggers to support the CDC process. Although this solution seems to involve a lot of overhead processing and requires extra storage space, it's actually quite efficient and non-intrusive since replication is based on reading changes from the database log files. Replication is also a standard feature of most database management systems, including MySQL, PostgreSQL, and Ingres. Trigger-based CDC is probably the most intrusive alternative described here but has the advantage of detecting all data changes and enables near real-time data loading. The drawbacks are the need for a DBA (the source system is modified) and the database-specific nature of the trigger statements.
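To make the idea concrete, here is a minimal sketch of what such a trigger-plus-change-table setup could look like in MySQL; the table layout and names are our own, not a prescribed structure, and similar triggers would be needed for INSERT and DELETE:

    -- Change table that records the key, the type of operation, and when it happened
    CREATE TABLE customer_changes (
      customer_id INT      NOT NULL,
      change_type CHAR(1)  NOT NULL,   -- 'I', 'U' or 'D'
      changed_at  DATETIME NOT NULL
    );

    -- Capture every update on the source table
    CREATE TRIGGER customer_after_update
    AFTER UPDATE ON customer
    FOR EACH ROW
      INSERT INTO customer_changes (customer_id, change_type, changed_at)
      VALUES (NEW.customer_id, 'U', NOW());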

Snapshot-Based CDC

When no timestamps are available and triggers or replication are not an option, the last resort is to use snapshot tables, which can be compared for changes. A snapshot is simply a full extract of a source table that is placed in the data warehouse staging area. The next time data needs to be loaded, a second version (snapshot) of the same table is placed next to the original one and the two versions compared for changes. Take, for instance, a simple example of a table with two columns, ID and COLOR. Figure 6-25 shows two versions of this table, snapshot 1 and snapshot 2.

Snapshot_1            Snapshot_2
ID   COLOR            ID   COLOR
1    Black            1    Grey
2    Green            2    Green
3    Red              4    Blue
4    Blue             5    Yellow

Figure 6-25: Snapshot versions

There are several ways to extract the differences between those two versions. The first is to use a full outer join on the key column ID and tag the result rows according to their status (I for Insert, U for Update, D for Delete, and N for None) where the unchanged rows are filtered in the outer query:

    SELECT *
    FROM (
      SELECT
        CASE
          WHEN t2.id IS NULL THEN 'D'
          WHEN t1.id IS NULL THEN 'I'
          WHEN t1.color <> t2.color THEN 'U'
          ELSE 'N'
        END AS flag
      , CASE WHEN t2.id IS NULL THEN t1.id ELSE t2.id END AS id
      , t2.color
      FROM snapshot_1 t1
      FULL OUTER JOIN snapshot_2 t2 ON t1.id = t2.id
    ) a
    WHERE flag <> 'N'

That is, of course, when the database supports full outer joins, which is not the case with MySQL. If you need to build a similar construction with MySQL there are a few options, such as the following:

    SELECT 'U' AS flag, t2.id AS id, t2.color AS color
    FROM snapshot_1 t1 INNER JOIN snapshot_2 t2 ON t1.id = t2.id
    WHERE t1.color <> t2.color
    UNION ALL
    SELECT 'D' AS flag, t1.id AS id, t1.color AS color
    FROM snapshot_1 t1 LEFT JOIN snapshot_2 t2 ON t1.id = t2.id
    WHERE t2.id IS NULL
    UNION ALL
    SELECT 'I' AS flag, t2.id AS id, t2.color AS color
    FROM snapshot_2 t2 LEFT JOIN snapshot_1 t1 ON t2.id = t1.id
    WHERE t1.id IS NULL

In both cases the result set is the same, as shown in Figure 6-26.

FLAG   ID   COLOR
U      1    Grey
D      3    NULL
I      5    Yellow

Figure 6-26: Snapshot compare result

Most ETL tools nowadays contain standard functionality to compare two tables and flag the rows as I, U, and D accordingly, so you will most likely use these standard functions instead of writing SQL. Kettle is no exception to this rule and contains the "Merge rows" step. This step takes two sorted input sets and compares them on the specified keys. The columns to be compared can be selected and an output flag field name needs to be specified. To demonstrate this, let's first extract the Customer table from the sakila database you created in Chapter 4, and then make a couple of changes

to the source table and create a snapshot-based CDC transformation using Kettle to find the differences.

1. The first step is to store a second version of the customer table, either in a separate database (the sakila_dwh catalog from Chapter 4 might do just fine), or in the sakila database itself. In a real-life scenario, it's highly unlikely that you'll be allowed to create additional tables in a source system, and the data warehouse is probably out as well, so a separate staging area like a sakila_stg database would be best.

NOTE Don't forget to enable Boolean support for the database connections in this example; otherwise Kettle will create the field valid as a character string with length 1. Boolean support is enabled by selecting the first option in the Advanced tab of the Database Connection dialog box.

For this example, we used the sakila_dwh database you already created in Chapter 4, where a customer_2 table is created and loaded with the original customer data from sakila.

2. The next step is to make some alterations to the data, for instance change a last_name, make a customer inactive, and insert an extra row (a sketch of such statements follows this list). These alterations need to be made in the source system, in this case the sakila database. To show that Kettle is capable of detecting deletions as well, you can insert an extra row in the staging table. This row won't exist in the source system and thus a row deletion is mimicked.

3. Next you need to create the CDC transformation itself. Create two "Table input" steps, one for sakila and one for sakila_dwh. Select all fields and make sure to sort the data on the key field(s) in ascending order because the Merge Rows step requires sorted input. After the two input steps are created, add a "Merge rows (diff)" step and connect both inputs to this step. Open the Merge step and select the reference and comparison origins, the name for the flag field (that will contain the values identical, changed, new, and deleted), and the comparison and match keys. Figure 6-27 shows what the step values should look like in this case.

4. You can now issue a Preview on the Merge Rows step to check whether the solution is working. Because all rows will be passed by this step, you'll probably want to filter out the unchanged rows by using a "Filter rows" step. With the condition flagfield = identical, you can send all unchanged rows to a dummy output and only keep the new, changed, or deleted rows for further processing. One way of doing this is to pass the data to a subsequent transformation in the job by using a "Copy rows to result" step. Another way is to add a "Synchronize after merge" step that automatically handles the inserts and updates based on the flag field returned from the Merge Rows step. Figure 6-28 shows what the Synchronize step looks like when used in this example.
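The alterations from step 2 are arbitrary; as a minimal sketch (the exact rows and values are our own choice), they could look like this:

    -- In the source (sakila): an update and a deactivation
    UPDATE customer SET last_name = 'SMITHSON' WHERE customer_id = 1;
    UPDATE customer SET active = 0 WHERE customer_id = 2;

    -- In the staging copy (sakila_dwh.customer_2): an extra row that is absent
    -- from the source, so the comparison will report it as deleted
    INSERT INTO sakila_dwh.customer_2
      (customer_id, store_id, first_name, last_name, email, address_id, active, create_date)
    VALUES
      (9999, 1, 'JOHN', 'DOE', 'john.doe@example.com', 1, 1, NOW());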

Figure 6-27: Merge Rows settings

Figure 6-28: Kettle "Synchronize after merge" step

The completed transformation is displayed in Figure 6-29 and is also available from the book's companion site (sakila_custdiff.ktr).

Figure 6-29: Kettle snapshot CDC solution

The final proof that the solution works is shown in Figure 6-30, where a Preview is executed for a CDC_Rows step (the name given to the “Copy rows to result” step).

Figure 6-30: CDC results Preview

As is shown in the preceding example, snapshot-based CDC can detect inserts, updates, and deletes, which is an advantage over using timestamps, at the cost of extra storage for the different snapshots. There can also be a severe performance issue when the tables to be compared are extremely large. For this reason we added the SQL illustration, because for this kind of heavy lifting the database engine is often better suited than an engine-based ETL tool such as Kettle. Chapter 19 includes an example of CDC detection for a Data Vault data model using Kettle.

Log-Based CDC

The most advanced and least intrusive form of change data capture is to use a log-based solution. Every insert, update, and delete operation run in a database can be logged. In the case of a MySQL database, the binary log has to be enabled explicitly in the Administrator tool (Startup Variables -> Log Files). From that moment on, all changes can be read in near-real time from the database log and used for updating the data in the data warehouse. The catch here is that this sounds simpler than it actually is. A binary log file needs to be transformed first into an understandable form before the entries can be read into a subsequent process.

The MySQL installation contains a special tool for this purpose, mysqlbinlog. This tool can read the binary format and translate it into a somewhat human-readable format, and can output the read results to a text file or directly into a database client (in case of a restore operation). Mysqlbinlog has several other options, the most important one for our purposes being that it can accept a start and/or end timestamp to read only part of the log file. Each entry also has a sequence number that can be used as an offset, so there are two ways to prevent duplicates or missing values when reading from these files. After the mysqlbinlog output is written to a text file, this file can be parsed and read, for instance by a Kettle input step that reads the data and executes the statements on the corresponding staging tables.

For other databases there are similar solutions, and some offer a complete CDC framework as part of their data warehouse solution. The drawback of using a database-specific set of tools is obvious: it only works with a single database. For cases where you need to use a log-based solution in a heterogeneous environment, several commercial offerings such as Oracle GoldenGate and Attunity Stream are available. A possible disadvantage of these offerings might be the price tag attached to them. An interesting alternative, which is relatively new on the market, is the open source Tungsten Replicator, which offers advanced master-slave replication options. Tungsten Replicator supports both statement- and row-based replication and is thus a more advanced solution than can be accomplished with the standard binlog reader. Future releases will also support Oracle and PostgreSQL databases, making it a very compelling product. More information is available on the website at http://www.continuent.com/community/tungsten-replicator.
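Before wiring anything into an ETL process, you can check from an ordinary SQL session whether binary logging is active and what the log contains. A minimal sketch (the log file name will differ on your system):

    -- Is the binary log enabled?
    SHOW VARIABLES LIKE 'log_bin';

    -- Which binary log files exist on the server?
    SHOW BINARY LOGS;

    -- Peek at the first events of a specific log file
    SHOW BINLOG EVENTS IN 'mysql-bin.000001' LIMIT 10;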

Which CDC Alternative Should You Choose?

As you've seen in the previous sections, each of the described options for identifying and selecting changed data has strengths and weaknesses. Some alternatives require adaptations to the source database by a database administrator (DBA), some can support real-time loading of data, and others support only a partial discovery of changes. Table 6-1 summarizes these points to help you decide which option is most applicable in your situation.

Table 6-1: CDC Option Comparison

                              TIMESTAMP   SNAPSHOT   TRIGGERS   LOG
Insert/update distinction?    N           Y          Y          Y
Multiple updates detected?    N           N          Y          Y
Deletes identified?           N           Y          Y          Y
Non-intrusive?                N           N          N          Y
Real-time support?            N           N          Y          Y
DBA required?                 N           N          Y          Y
DBMS independent?             Y           Y          N          N

Delivering Data

After getting changed data using one of the scenarios described, the data needs to be stored somewhere for further processing. Kettle offers many ways to accomplish this; one way was already used in the snapshot-based CDC example. The following list is a selection of the most common options:

- Use a "Table output" step: The data can be stored in a separate staging table. The advantage of this is that regular SQL operations can be used in subsequent steps, but at the cost of extra overhead and latency. Using a database table is not always the fastest or most efficient solution.
- Text file output: In many situations, this will be a preferred solution; data can be written as fast as the disks can store it without the overhead incurred when using a database, and the resulting file is portable to any other system as well.
- XML output: Only a viable option if the data needs to be processed by an external system that requires XML input; otherwise use one of the other solutions because of the large overhead incurred when using an XML format.
- Serialize to file: This type of output uses a special file format and must be used in conjunction with the accompanying "De-serialize from file" step. It can offer some speed advantages over regular flat files because Kettle doesn't need to parse the file again when it's read, at the cost of being a purely proprietary format.
- Copy rows to result: This will keep the data in memory to be pushed to a subsequent transformation that can read the data using a "Get rows from result" step. Probably the fastest option available to push data from one transformation to the next, but limited by the memory requirements for your data.

Summary

This chapter covered the extraction process from various source systems using Kettle. The first section explained the four main options for retrieving source data:

- File-based extraction using flat files and XML formatted files
- Database-based data extraction using parameterized "Table input" steps
- Web-based data extraction using Web service calls
- Streaming data extraction

Special attention was given to extracting data from ERP and CRM systems, which usually boils down to using a business data layer or API since accessing the database directly is either prohibited or technically cumbersome. We showed how the new SAP Input step can be used to retrieve data from the widely used SAP/R3 ERP system.

The next section covered data profiling, the important process of gaining an understanding of the structure and quality of the data that needs to be extracted. The section introduced eobjects.org DataCleaner, an open source data-profiling tool that can be used independently because the community edition of Kettle does not provide data-profiling capabilities.

The last part of the chapter explained the different scenarios available for Change Data Capture and showed how Kettle can be used for the following types of CDC:

- Timestamp-based CDC
- Snapshot-based CDC
- Trigger-based CDC
- Log-based CDC

CHAPTER 7

Cleansing and Conforming

For many people, the core value of ETL is hidden behind the T, which denotes the Transform capabilities. Many people are still not very comfortable with a single T for all the work that takes place in this phase. Ralph Kimball has referred to ETL as ECCD, short for Extract, Cleanse, Conform, and Deliver, and Matt Casters used the recursive acronym Kettle as a name for the ETL tool, where the double T stands for Transportation and Transformation. In any case, this chapter addresses four of the 34 subsystems, the ones that cover cleansing and conforming data, or more specifically:

- Subsystem 4: Data-Cleansing
- Subsystem 5: Error Event Schema
- Subsystem 6: Audit Dimension Assembler
- Subsystem 7: Deduplication

The examples in this chapter are all based on the Sakila customer and address tables, but here we have messed up the data a bit in order to make the cleansing steps actually do something. We created duplicate records, misspelled some names, and added a few extra rows with our own information. The script needed to make these modifications, sakilamods.sql, can be downloaded from the book's companion site at www.wiley.com/go/kettlesolutions, but you can, of course, make your own modifications as well. The requirements for having the example transformations perform the designated task are described for each example to make it easy to follow along.

A large part of this chapter is devoted to data validation because before you can cleanse data, it must be clear which data must be cleansed and what conditions the data must meet to be considered valid. This area is closely related to data profiling,

N Subsystem 4: Data-Cleansing N Subsystem 5: Error Event Schema N Subsystem 6: Audit Dimension Assembler N Subsystem 7: Deduplication The examples in this chapter are all based on the Sakila customer and address tables, but here we have messed up the data a bit in order to make the cleansing steps actually do something. We created duplicate records, misspelled some names, and added a few extra rows with our own information. The script needed to make these modifications, hV`^aVbdYh#hfa , can be downloaded from the book’s companion site at lll#l^aZn#Xdb$ \d$`ZiiaZhdaji^dch, but you can, of course, make your own modifications as well. The requirements for having the example transformations perform the designated task are described for each example to make it easy to follow along. A large part of this chapter is devoted to data validation because before you can cleanse data, it must be clear which data must be cleansed and what conditions the data must meet to be considered valid. This area is closely related to data profiling,


already covered in Chapter 6. Profiling is a first step to gain an understanding of the quality and composition of the data. The results of the data-profiling initiative can then be used to build data-cleansing and validation steps. The importance of the topics covered in this chapter cannot be overstated; in 2003, The Data Warehouse Institute (TDWI) estimated that data quality problems cost U.S. businesses $600 billion each year. Things have probably gotten even worse since then.

Data Cleansing

Cleansing data is one of the core reasons for using an ETL tool in the first place. On the other hand, data cleansing is also a much debated function because it is questionable whether the ETL process is the right place to cleanse your data, or at least whether the data warehouse would be the right place to have cleansed data in. Before we move on to what cleansing is all about, let's briefly look at the main issue with data cleansing.

Cleansing data is a part of a much broader topic, data quality, which in turn is a part of the encompassing subject of data governance. Arguably, data quality issues should be resolved at the root: the transactional and reference data systems. If this data is not clean in the source systems, cleaning it up before loading it into the data warehouse would cause a disconnected state between the information in the source systems and the data in the data warehouse. Data Vault advocates (see Chapter 19) take a different stand in this discussion: Data should be loaded into the data warehouse "as is" and only be cleansed (on request) when moved into a data mart. In this way, there's always a factual representation of how the data was at the time of loading, and thus you have a historical logbook of an organization's data.

The amount of work for an ETL developer is not reduced by moving the cleansing process more downstream; there's even a risk that there will be more work because data marts are becoming disposable entities. Designing reusable data-cleansing transformations is therefore an important part of the ETL design process. Kettle offers a myriad of steps to help you with cleansing incoming data, whether it is extracted from the data warehouse or directly from the source systems.

NOTE According to Data Quality guru Arkady Maydanchik, there are five categories of data quality rules:

- Attribute domain constraints: Basic rules that restrict allowed values of individual data attributes.
- Relational integrity rules: Enforce identity and referential integrity of the data and can usually be derived from relational data models.
- Rules for historical data: Include timeline constraints and value patterns for time-dependent value stacks and event histories.
- Rules for state-dependent objects: Place constraints on the lifecycle of objects described by so-called state-transition models.
- General dependency rules: Describe complex attribute relationships, including constraints on redundant, derived, partially dependent, and correlated attributes.

See Arkady Maydanchik, Data Quality Assessment, Technics Publications, LLC, 2007.

If we translate the five categories as defined by Maydanchik to the world of Kettle, the first one is obviously the easiest to handle and will be covered in depth in the following sections. Referential integrity rules can also easily be checked using the available lookup steps, but identity integrity is a bit more complex than that. It means that each record in the database points to a single, unique, real-world entity, and that no two records point to the same entity such as a person or a product. We'll look at how this rule can be checked in the "Deduplicating Data" section of this chapter.

Beyond the first two basic categories it becomes more complex because there are no standard steps available to accomplish these tasks. Still, it is entirely possible to handle those constraints as well using Kettle. A state-dependent constraint for the sakila database, for instance, is that there cannot be a return without a rental, and that each rental should (at some point in time) have a corresponding return, or a payment for lost property. The allowed period between rentals and returns is a related constraint that can easily be checked. When an accumulating snapshot fact table is in place, this check will be pretty straightforward because each date will usually be equal to or greater than the previous date. Think about a shipment that cannot take place before an order is placed, and a delivery that cannot be done prior to a shipment.

In general, enforcing or checking dependency rules in Kettle will require multiple steps, as you'll see when we cover address validation. An address consists of multiple columns, where multiple validations are needed to check whether a specific address exists in a city, whether the postal code matches the available options for city names, and whether the city name exists in the list of allowable names for a state or country.
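A rule such as "every rental should be returned within the allowed period" can be checked directly against the sakila schema. A minimal sketch follows; the seven-day limit is our own assumption, not something sakila defines:

    -- Rentals that were never returned, or returned after the (assumed) 7-day limit
    SELECT r.rental_id, r.rental_date, r.return_date
    FROM rental r
    WHERE r.return_date IS NULL
       OR r.return_date > r.rental_date + INTERVAL 7 DAY;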

NOTE For more in-depth information about the different data quality rules and how to assess data quality, we recommend the book Data Quality Assessment, Technics Publications, LLC, 2007, by the already mentioned Arkady Maydanchik. Another good resource for anything data quality-related is the website www.dataqualitypro.com, which also contains a series of articles that explain the data quality categories we introduced in this section.

Data-Cleansing Steps

There is not a single Data Cleanse step, but rather, there are many steps and other places within a Kettle transformation where data can be cleansed. The data cleansing process starts when extracting data: Many of the input steps contain basic facilities to read the data in a specified format, especially when working with dates and numbers.

WARNING The "Table input" step lends itself perfectly to modified SQL to clean up the data as much as possible before entering the rest of the process. Be careful here, however, because this makes for very hard-to-maintain solutions. It's usually better to read the data "as is" and use Kettle steps to perform the modifications.

There is another reason to avoid SQL modifications to the extracted data: It's impossible to audit data that has been pre-cleansed before it enters the Kettle transformation. Most of the Kettle steps allow you to define an error stream or conditional processing to redirect data that doesn't satisfy certain requirements, as you will see in the "Error Handling" section of this chapter.

The steps in the Transform folder offer many different options for cleansing data. A powerful and feature-rich example is the Calculator step, which you used in some of the previous chapters. It's nearly impossible to list all the available options, simply because the list of calculations keeps growing at a steady pace. Some of the more useful calculations for cleansing data are the following:

- ISO8601 week and year numbers: Most countries outside the United States use a different week numbering based on the ISO8601 standard. Not all databases are capable of deriving the correct week and year number from a date, so this is a very helpful calculation.
- Casing calculations: The First letter, UpperCase, and LowerCase calculations allow for uniformity in string casing.
- Return/remove digits: These can be used to split strings containing both characters and digits into their respective text and numeric parts. A nice example is the fine-grained Dutch postal code, which consists of four digits and two characters (for example, 1234AB).

Another step in the list that deserves a special mention is the "Replace in string" step. This seems like a very simple step to replace parts of a string with some other value, but the option to use regular expressions here makes it an extremely powerful solution for many cases. The standard Kettle examples contain a "Replace in string" transformation that illustrates all the available options in this step. Other useful Transform steps are the ones for splitting fields or fields to rows based on a separator value, a "String cut (substring)" step, and the Value Mapper step. This last one is a simple step to replace certain values of the input data with other, conformed values. For instance, the load_dim_staff transformation in Chapter 4 uses a Value Mapper step to replace the source values Y and N with Yes and No respectively.

With the steps described, you can probably tackle a fair amount of the data-cleansing tasks that need to be performed. There will, however, always be the tough cases that require more programming power. We cover these scripting steps later in this chapter. There are also several related steps that can be used for data-cleansing purposes, such as the special validation steps to verify a credit card number or e-mail address. We look at those as well in the following sections.

One lookup step needs a special mention here, however, because it doesn't return a True/False decision based on some rule, but is able to use "fuzzy" matching algorithms. It is the "Fuzzy match" step, which is covered in depth in the "Deduplicating Data" section. We mention it here because it contains a collection of string-matching algorithms. Those algorithms are also available in the Calculator step; the related sidebar explains the purpose of each of the algorithms and the differences among them.

STRING MATCHING ALGORITHMS IN KETTLE

Kettle 4 contains two places where you can find similar algorithms: the Calculator step and the "Fuzzy match" step. The list of available algorithms is almost the same, but the way they work is different; the Calculator step can compare two fields in the same row, but the "Fuzzy match" step can use a lookup table to look at all the records. Before starting to use the steps and algorithms, it's probably a good idea to understand what they actually do. When you open the "Fuzzy match" step and click on the Algorithm drop-down list, you'll see a long list of possible choices with exotic names such as Damerau-Levenshtein, Jaro Winkler, and Double Metaphone. All these algorithms have one thing in common: They are aimed at matching strings. The way they do that, however, varies, which makes some algorithms more suited for, for example, deduplication efforts. The following list briefly explains the various options.

- Levenshtein and Damerau-Levenshtein: Calculate the distance between two strings by looking at how many edit steps are needed to get from one string to another. The former only looks at inserts, deletes, and replacements, while the latter adds transposition as well. The score indicates the minimum number of changes needed; for instance, the distance between CASTERS and CASTRO is only 2. (Step 1: Delete the E; Step 2: Replace S with O.)
- Needleman-Wunsch: Also an algorithm used to calculate the similarity of two sequences, and mainly used in bioinformatics. The algorithm calculates a gap penalty; hence it will give a score of -2 for the CASTERS to CASTRO example above.
- Jaro and Jaro-Winkler: Calculate a similarity index between two strings. The result is a fraction between 0 (no similarity) and 1 (identical match). When calculating the Levenshtein distance between CASTERS and POOH you'll get 7, while the Jaro and Jaro-Winkler distance will be 0 because there is no similarity between the two strings. Levenshtein finds a distance because you can always get from one string to the other by replacing, inserting, and deleting characters.
- Pair letters similarity: Only available in the "Fuzzy match" step, this algorithm chops both strings into pairs and compares the sets of pairs. In this case, CASTERS and CASTRO will be transformed into the following two arrays: {CA, AS, ST, TE, ER, RS} and {CA, AS, ST, TR, RO}. Now the similarity is calculated by dividing the number of common pairs (multiplied by two) by the sum of the pairs from both strings. In this example there are three common pairs (CA, AS, ST) and eleven pairs in total, resulting in a similarity score of (2*3)/11 = 0.545 (which is, in fact, pretty good).
- Metaphone, Double Metaphone, Soundex, and RefinedSoundEx: These algorithms all try to match strings based on how they would "sound" and are also called phonetic algorithms. The weakness of all these phonetic algorithms is that they are based on the English language and won't be of much use in a French, Spanish, or Dutch setting.



The $64,000 question is, of course, which one to pick, and the answer must be that it all depends on what problem you need to solve. For information retrieval (for instance, retrieving all book titles on Amazon that include the word Pentaho, with a relevance score attached to them), a pair letters similarity is very useful, but for the purpose of finding matching duplicates possibly containing misspellings, Jaro-Winkler is an excellent choice. Caution is needed, however; "Jos van Dongen" and "Davidson" have a higher Jaro-Winkler similarity score than "Jos van Dongen" and "Dongen, J van", but no human being would pick the former over the latter as a possible duplicate candidate.

Using Reference Tables

In many cases, especially when correcting address information, you will need to access external master or reference data. Data entry applications might already use this kind of data, for instance to display a conformed country or state list. In other cases, it's not so straightforward. Suppose you have a customer complaints system that only has a free-form text field for entering the product information. In order to clean this up, some intelligent text preprocessing is needed to match the entered product names to the master tables, which contain the complete product reference data. The most sought after external reference data is correct address information. If this data is available for free, you're lucky; typically a costly subscription is needed to get a monthly or quarterly update from the various companies that maintain and sell this information.

TIP The site www.geonames.org contains a broad collection of web services for country, city, and postal code lookup and verification. For the United States, it even contains address-level information.

Conforming Data Using Lookup Tables

There are a couple of ways to work with reference data. In its most simple form, it's a matter of doing a lookup based on an incoming field and flagging the field as erroneous when no direct match is found. This has limited value, as it will indicate only that the data doesn't conform to a reference value. In order to increase the success rate of the lookup, it is better to first try to conform the input stream value as much as possible before the lookup is executed, by using any of the data-cleansing steps described earlier.

As an example, let's look at city and postal code lookup. Almost every country uses a postal code, which links to a region, a city or, in the case of the Netherlands, a specific part of a street. Dutch postal codes are the most fine-grained in the world, which makes it possible to uniquely identify an address by knowing the postal code and the house number. Many times, however, people make mistakes when entering a postal code, so in order to avoid getting incorrect data, only the digits from a postal code can be extracted and used to look up the correct city. The following example shows how a combination of calculation and lookup steps can help you determine whether a combination of postal code and city is correct. First, you need some input data to cleanse; in this example, you use a Data Grid step and add some (almost) dummy data, as depicted in Figure 7-1.

Figure 7-1: Source data to cleanse

The first cleansing step is to get only the digits from the field PostalCode using a Calculator step. The calculation used is, of course, “Return only digits from string A.” The new field should get a meaningful name like PC4. Then you’ll need a reference table. Because this is still a small example using dummy data, you can again use a Data Grid step to mimic a real reference table. The data is displayed in Figure 7-2.

Figure 7-2: Reference data

To retrieve the city name from the lookup table using the four-digit PC4 field, you can use a "Stream lookup" step. Note that in this case, each lookup will succeed, but in real-life scenarios it is very well possible that some of the codes cannot be found in the reference table. In order to be able to handle those records later in the process, it is advisable to use an easily recognizable default value that is filled in when the lookup fails. Figure 7-3 shows the completed Stream lookup step with the value unknown for the default field.

Figure 7-3: Stream lookup

The reason we’re using something with a prefix and suffix is twofold. First of all, it makes a visual inspection of the data easier because the exceptions clearly stand out. Second, none of the string-matching algorithms will find any similarity between the source city value and the lookup value. This second reason allows you to further process the data and look for typing errors or complete mistakes. To illustrate what we mean by that, add a second Calculator step and use City and RefCity as Field A and Field B, respectively. The new field can be named something like cityscore because you’ll be using a Jaro-Winkler similarity match algorithm. When you now run a Preview on the transformation you just created, it should output the results displayed in Figure 7-4.

Figure 7-4: Preview source and reference data

In the first record, you'll notice a very high similarity score between the source and reference city name, so someone probably made a data entry error. The second row is a perfect match (score 1) and can be safely processed further as well. The third row, however, causes a problem: The city names are very different, as shown by the low similarity score (anything below 0.8 could be considered at least suspicious). The problem is that you cannot tell from this data where the mistake was made: Is the postal code correctly entered but the city name wrong? Or did someone enter the correct city name but make a mistake when entering the postal code? To verify this, a second, reversed validation is necessary. By "reversed" we mean that the lookup to the reference table is now based on the city, and the returned postal code can be compared with the postal code that was entered (for example, 5556 here). If it then turns out that the city names match and the postal codes are very similar, it is safe to conclude that someone made an error when typing in the postal code. Of course, many modern applications automatically check whether a combination of address, postal code, and city is correct when the data is entered. A surprisingly large number of organizations, however, use these validations only for their internal applications, while their customer-facing website allows all kinds of data inconsistencies, which makes it very hard to consolidate the data from these different sources.
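The decision logic just described can be made explicit with a small script. The sketch below assumes two illustrative similarity fields, cityscore (city looked up by postal code) and pcscore (postal code looked up by city, the reversed check), and uses the 0.8 threshold mentioned above; it is not part of the example transformation:

// Classify a row based on the two similarity scores
var verdict;
if (cityscore >= 0.8) {
  // City found via the postal code is (almost) the same: probably a typo in the city name
  verdict = "city name suspect";
} else if (pcscore >= 0.8) {
  // Reversed lookup: city matches and the postal codes are close, so the postal code is suspect
  verdict = "postal code suspect";
} else {
  // Neither direction gives a convincing match; leave this one for manual review
  verdict = "manual review";
}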

Conforming Data Using Reference Tables

A special class of reference tables is the data conformation master. A well-known example is the coding used for gender. Some systems use the letters M, F, and U for Male, Female, and Unknown; other systems allow a NULL value for unknown gender; another system might store the full values Male and Female; and encodings like 0 and 1, or 0 (unknown), 1 (male), and 2 (female) occur as well. To complicate things further, there are also applications that have a separate value for child (C) and might even use F for Father and M for Mother. Variations and combinations are also possible, so this list can go on and on. When the data from all these different systems must be consolidated, a translation needs to be made from all these various encodings to one single coding scheme. A single master table to support this is preferred over different lookup tables per system because there will probably be other conversions needed as well, and a single lookup table is easier to maintain. Two basic requirements must be fulfilled:

N Every possible value from the source system needs a translation. N The translation must lead to a single set of values. Based on the gender example explained earlier, a master table could look like the one displayed in Table 7-1. The fields ref_code and ref_name are the conformed return values you want to obtain, src_system is the source system where the data is read from, and src_code contains the possible values that can occur in the input stream.

Table 7-1: Gender Code Reference Table

ID   REF_CODE   REF_NAME   SRC_SYSTEM   SRC_CODE
1    M          Male       Sales        1
2    F          Female     Sales        2
3    M          Male       Web          male
4    F          Female     Web          female
5    M          Male       CRM          F
6    F          Female     CRM          M
7    U          Unknown    CRM          C

Note that this is only an example and the data is displayed in a denormalized fashion for easy reference. It does, however, adhere to the requirements because it leads to only three values in the final destination system: M, F, and U. Now it is fairly easy to translate incoming values from different source systems to a single value for the data warehouse. To show how this works, we created the preceding table in the same sakila_dwh catalog we already used in previous chapters.
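To make the translation idea concrete, the following JavaScript sketch mirrors the contents of Table 7-1 as a lookup keyed on source system and source code, with U as the fallback value; in the actual solution this lookup is of course performed by Kettle steps, as described next:

// Conformation master, keyed on "<source system>:<source code>" (mirrors Table 7-1)
var genderMaster = {
  "Sales:1": "M", "Sales:2": "F",
  "Web:male": "M", "Web:female": "F",
  "CRM:F": "M", "CRM:M": "F", "CRM:C": "U"
};

function conformGender(srcSystem, srcCode) {
  var key = srcSystem + ":" + srcCode;
  // Anything that has no translation, including NULL input, becomes "U" (unknown)
  return genderMaster.hasOwnProperty(key) ? genderMaster[key] : "U";
}

// conformGender("CRM", "F") returns "M": in this CRM system, F stands for Father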

NOTE Because the sakila database doesn't contain a code for gender, we used one of the customer data files obtained from www.fakenamegenerator.com for Chapter 6 as the source for the following examples.

Conforming data like this can be set up as a separate reusable transformation; the only thing needed to make it generic is to have the source_system value passed as a variable to the mapping that takes care of the lookup. There are a couple of ways to go about this: you can first build the complete transformation including the gender lookup step, take the lookup part out, and add the required mappings, or you can directly start by building the reusable step first before building the outer transformation that will use the mapping. Either way, there is one thing you'll find out soon enough, and that is that you can't just use a regular database lookup step here if you choose to pass a variable. The "Database lookup" step only allows filtering based on input row values, and because the source system name is not in one of the columns, you first need to add it to the input stream as well. With an "Add constants" step this isn't much of a problem, but adding this value as a constant to the transformation that calls the mapping somehow doesn't feel right. Let's have a look at what will happen then (see Figure 7-5). We've added an extra value for every row that's being transformed, just to be able to set a filter condition. Besides being an inefficient solution to the problem, the lookup table is also cached completely while only a small part of it is needed. This is not a big problem with the few rows in this example, but it could be an issue with more voluminous lookup tables. A better solution is to use a variable to filter the data, in combination with a "Table input" step and a stream lookup. Figure 7-6 shows what the main transformation looks like.

Figure 7-5: System name as constant

Figure 7-6: Passing a variable

First, a variable is defined, which you can give a value in the Parameters tab of the Mapping step. In this example, the value Web is passed to the gender lookup transformation. Figure 7-7 in turn shows this reusable transformation, which picks up the variable in the "Table input" step.

Figure 7-7: Using the passed variable

The stream lookup in this transformation is pretty straightforward; just add the condition src_code (input) = src_code (lookup table) and specify the return field. Please note that a default value is mandatory here to be able to handle NULL and unknown values. The completed "Stream lookup" step is displayed in Figure 7-8.

Figure 7-8: Generic "Stream lookup"

NOTE The source data can contain NULL values, but NULL isn't a real value in a database. A comparison like NULL = NULL will therefore always fail. This is the reason why NULL is not listed as a separate value in the lookup table and why a default value in the lookup steps is mandatory.

Data Validation

In many situations, data must adhere to certain rules. We've already covered working with reference tables as an example of this, where the rule is, of course: "value must be present in reference table." This is an easy form of data validation because the possible values that can occur are known beforehand. Most data validation rules are, however, a bit more complex than that. Consider the following examples:

N E-mail addresses must be in a valid format. N Input values must be in upper/lowercase. N Dates must be in the format dd-mm-yyyy. N Phone numbers must comply with the format xxxx-xxxx-xxxx. N Amounts cannot exceed the value X. N Subscribers must be at least 18 years old. N IBAN (International Bank Account Number) numbers must be valid. This list can easily be extended with hundreds of other examples, but the idea behind data validation is probably clear by now: Check whether the data conforms to predefined (business) rules and flag or reject any record that doesn't meet these requirements. Note, however, that these examples all fall into the domain and attribute constraints category. The workhorse in Kettle that's capable of handling all these validations is the Data Validator step. When this step is added to the canvas and opened for the first time, it is completely empty, but as soon as the first validation rule is added by clicking the "New validation" button, you might be overwhelmed by all the different options that are available. It's not that complicated, however; the different options are a result of being able to handle various data types, so not all fields make sense in all cases. Before putting the Data Validator step into action, let's have a look at some of its key characteristics.

N Apply multiple constraints to a single column: Some columns might require multiple checks; there is no restriction in the number of validations that can be applied to the same column. N Validate data type: Especially valuable when text file inputs are used. Conversion masks for dates and numbers make it easy to check for invalid entries. N Concatenate errors: Ability to group all found errors on a row into a single, character-separated field. N Parameterize values: Almost all constraints can be parameterized allowing for one central rule base, which can be maintained outside the ETL tool. Changes to the rule base will then be automatically propagated to the validation steps.

N Regular expression matching: Regexes enable very flexible and powerful matching structures and allow for any pattern to be matched, not only the available default options such as "starts with" or "ends with." N Lookup values: Allowed values can also be read from another step; using multiple steps for multiple validations in a single validation step is also allowed. The validator itself works much like a highly configurable filter, comparable to the "Filter rows" step. All the data that satisfies the various validation rules is sent to the main output stream; all the data that doesn't is diverted to the error stream. Unlike the "Filter rows" step, a secondary output is not required. Nevertheless, we highly recommend using one, because only filtering out incorrect data without at least storing these rows somewhere for later reference destroys valuable insights, such as why exactly the validation failed or how many rows were rejected.
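To make the kinds of constraints listed earlier a bit more tangible, here is a rough JavaScript equivalent of two of them: an e-mail format check and a maximum-amount check. The field names email and amount and the error texts are made up for the illustration; the Data Validator step performs these checks without any code:

// Collect the validation errors for the current row
var errors = [];

// E-mail addresses must be in a valid format (deliberately simplified pattern)
if (!/^[^@\s]+@[^@\s]+\.[^@\s]+$/.test("" + email)) {
  errors.push("EML: invalid e-mail format");
}

// Amounts cannot exceed a maximum (1000 is just an illustrative value for X)
if (amount > 1000) {
  errors.push("AMT: amount exceeds maximum");
}

// Concatenate the errors with a separator, much like the step's own option
var error_description = errors.join("; ");
var error_count = errors.length;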

Applying Validation Rules

To explain what the validation step can do, we’ve created a small sample set of data using a Data Grid step, which is displayed in Figure 7-9.

Figure 7-9: Data to validate

This data has to comply with the following rules:

N None of the fields may contain a NULL value. N Dates cannot be before January 1, 2000. N Names outside the official names list are not allowed. N Items must be between 1 and 10. N Price per item cannot exceed 1,000. The last rule is actually a trick question because it requires a calculation, which is one of the things the Data Validator step cannot do directly. There is no option to add on-the-fly calculations or derived fields and validate their outcome, so the only way to accommodate this is to prepare the data for validation before pushing it into the validation step. Figure 7-10 shows the transformation, including the amount/items calculation, the allowed product list, and the valid and rejected row output streams.

Figure 7-10: Data validation transform

You can now start working on the required validation rules. The first one (no NULL values) already forces you to add a validation for every field in the input. As you do this, you’ll notice that the “Null allowed?” checkbox in the Data section is checked by default and needs to be unchecked to enforce the no NULL constraint. The second rule could be added to the already existing validation for the date field, but this will limit the error analysis options later. It is better to create separate validation options for every data error that needs to be trapped in order to process the erroneous data correctly later. After creating separate validations for trapping NULL values and the other validations, the screen will now look like the one displayed in Figure 7-11.

Figure 7-11: Date rule data validation

Figure 7-11 shows a couple of noteworthy items. First, on the left side, all the created validations are visible. The details of the selected validation rule are displayed on the right.

WARNING For date validations, it's important to specify not only the data type, but also the conversion mask (the date format). If this format is not specified, the default system date format will be used to read the dates. If your system date format is yyyy/mm/dd but the input date is in the format dd-MM-yyyy, Kettle will throw an error.

The second constraint, stating that dates must be at least January 1, 2000, is now correctly enforced. The third one looks pretty straightforward; just mark the "Read allowed values from another step?" checkbox and select ProductList as the step and Product as the field to read from. In order to have Kettle handle the ProductList as a data reference, the hop between ProductList and the validation needs to be created using the mouse-over menu of the validation step. If you move your mouse over the Data Validator step and click on the left (input) icon, and then move the mouse to the ProductList step and click on that, a list is displayed with an "optional reference data for validation <validation_name>" entry for each of the validations. This is where the option for the prod_val validation must be selected. Kettle will now create a special kind of hop that can be recognized by the circled "i" information symbol on the hop, as you can see in Figure 7-10. The fourth rule, items must be between 1 and 10, is an easy one, too: Enter 1 as the minimum and 10 as the maximum value and you're done. The last rule is a no-brainer as well and only requires a maximum value of 1,000 to be entered. To check whether the validation works, you can now run a preview on both the ValidRows and ErrorRows Dummy steps, as shown in Figure 7-12.

Figure 7-12: Data validation results
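Coming back to the conversion-mask warning: you can reproduce the same behavior in a Modified Java Script Value step with Kettle's built-in str2date function. The literal values below are only an illustration of why the mask has to match the input:

// Parses correctly because the mask matches the way the date was entered
var d = str2date("13-05-2010", "dd-MM-yyyy");

// Using a mask that does not match the input, such as "yyyy/MM/dd", would make the
// conversion fail; in the Data Validator step the same mismatch sends the row to the
// error stream.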

Data validation can also be done at the metadata or property level, not only on the data itself. For instance, think about the date a file was created or modified, and a rule that specifies that if there's a file older than 10 days, the transformation should abort. In order to build such a transformation, let's consider one of the hidden features of Kettle, namely the "Get file names" step. This step does a lot more than just retrieve file names for processing. If you place a "Get file names" step on the canvas, click the right mouse button, and select "Show output fields," you'll notice that all the file attributes are displayed. One of those attributes is the lastmodifiedtime field, which can be used to calculate the maximum age of all the files in a folder. How this is done exactly is described in the "Scripting" section at the end of this chapter.

NOTE The lastmodifiedtime field only returns usable data if the underlying file system supports tracking this time.

Validating Dependency Constraints

There are two forms of dependency constraints, very similar to the dependency profiles you'd run as part of a data profiling effort. The first form is the dependency between two or more columns of the same table, also known as intra-table dependencies. The second form occurs when one or more columns in a table have a dependency on columns in other tables, also known as inter-table dependencies. We already showed several examples of this because it is basically a lookup validation problem and not very hard to do with the various steps that Kettle offers for this. Intra-table dependencies are a bit trickier because they usually require extra preparation steps in a transformation. The deduplication section later in this chapter will show examples of address validation, which is actually a case where inter-table and intra-table dependencies coincide, at least when external address reference tables are being used. Another challenging example of intra-table dependencies is the one between name and gender; if the data contains both first name and gender, it is potentially possible to validate the gender based on the name, and vice versa. This can never be foolproof, however, because many names can be used for both men and women.

Error Handling

The purpose of error handling is obvious: You want your ETL jobs and transformations to gracefully handle any errors that may occur during processing. There are different classes of errors, however:

N Process errors: These occur whenever the process cannot continue because of technical reasons. A file might not be available, a server could be switched off (or crashes during a job), or a database password might have been changed without informing the ETL team. And no, this last example isn't fictitious, unfortunately. N Data (validation) errors: Some of the data doesn't pass validation steps. Depending on the impact of the error, the transformation process could still complete, which isn't the case with a process error.

N Filter errors: These are not actually errors, but a "Filter rows" step requires two output steps: one for the rows that pass the filter, and one for the rows that don't. In many cases, it suffices to use a Dummy step for the latter category. N Generic step errors: Most of the data transformation and validation steps available in Kettle allow you to define error handling to redirect any rows that cannot be handled to a separate output step, or to trigger another action such as sending an e-mail to an operator. The following sections cover examples of each of these error classes.

Handling Process Errors

Process errors are, at first sight, the most severe of the error classes described in the section introduction because they break the data transformation process entirely. This doesn't mean that there are no other errors that might turn out to be even more serious. An ETL team's worst nightmare is probably the unspotted error: a process that runs flawlessly for several months, until it turns out that the data warehouse data is out of sync with the source data. Early cross-validation of, for instance, record counts and column totals can prevent this from happening; that's why it's so important to use these techniques. In Chapter 4, you used process error handling in the load_rentals job. Each transformation in this job had an error hop pointing to a Mail step, but inside the transformations there's nothing specific that would cause a transformation failure that would trigger the error. How does this work then? It's actually rather simple. When a transformation runs as designed and all the steps complete successfully, the transformation returns an implicit "Success" signal. If anything goes wrong inside a transformation and causes an error, an implicit "Failure" signal is returned. You get this out of the box, without any special settings or configuration; it's the way Kettle has been designed to work. Figure 7-15 shows an example of a job with multiple transformations where the Success and Failure notifications are easily recognizable. The execution of the job can also be followed in a visual way; successfully completed parts of a job are identified with a green checkmark, running tasks have a blue dual-arrow indicator, and failures are indicated with a red stop sign. Figure 7-13 shows what a running job looks like, while Figure 7-14 displays the screen after an error has occurred. Note that there are two locations where the visual indicators are displayed. The success/fail indicators on the hops are always visible, while the indicators for a started job are placed in the top right corner of the steps that are executed. Although the job shown in Figure 7-14 might look like a failed job, this isn't actually the case, which might be somewhat surprising. The job itself completed successfully, as is clearly visualized in the job metrics displayed in Figure 7-15.

Figure 7-13: A running job

Figure 7-14: A failed job

Figure 7-15: Failed job metrics

Only when the Mail Failure step fails does the job itself return a failed result. What this tells you is that the last executed part of the job determines whether it succeeds or fails. In this case, the Mail step is the last one and if that succeeds, the job returns a successful status. If this weren’t the main job but one of the subprocesses, it would look like everything was all right at first glance. To make it clear that this job failed, it’s better to have it return an explicit error after sending the error mail message. The “Abort job” utility does just that and can simply be added after the Mail Failure step in the job, as shown in Figure 7-16.

Figure 7-16: Forcing a Failure return

The important thing to remember here is to use an unconditional hop from the Mail Failure to the “Abort job” step because it’s always possible that the mail step itself fails. In this particular case, you would still get the correct outcome of a failed job but that doesn’t have to be the case in other circumstances.

Transformation Errors

Jobs execute their constituent parts in sequential order, so it's easy to abort a job at a specific location when an error occurs. That's not the case with transformations, because all the steps are started simultaneously. Suppose you need a transformation to abort when a certain condition in the data is not met; consider what this would look like. Figure 7-17 shows a solution most novice Kettle users would create.

Figure 7-17: Forcing a transformation failure

If the source data volume of the transformation in Figure 7-17 is small enough or the condition is met in one of the first rows, this would actually (almost) work because the output file will be created as soon as the transformation starts. Given enough records in the data source, however, chances are that a considerable number of records will already have been written to the output file before the Abort step kicks in. In these cases, where some condition must be met before the data can be processed, the transformation shouldn’t write any data at all. The same applies to validations: if meeting all required validations is mandatory, two transformations need to be created and combined in a job, as shown in Figure 7-18.

Figure 7-18: Aborting a job

The job in Figure 7-18 does two things: First, the validation transformation (displayed in the lower right corner) is started, and only after it returns a success indicator does the job continue. If the transformation fails, the job is aborted without any data being transformed at all. Similar checks could be created for the availability of a server, file, database, table, or column using one of the Conditions entries in a job.

Handling Data (Validation) Errors

So far, we've only covered the use of the Data Validator step as an enhanced type of filter, but it can do a lot more than that. This step can also tell you the exact reasons for rejecting a certain row, enabling you to take appropriate action. First, let's explain the first two checkboxes of the Data Validator step (refer back to Figure 7-11):

N Report all errors, not only the first: It is possible that multiple validations fail for a record. If this box is checked, Kettle will perform all validations and output all errors that were found. N Output one row, concatenate errors with separator: When this option is selected, only one row is written to the error stream, with all errors in a single field, much like the group_concat function in MySQL. If the "Report all errors" option is checked and this one is not, every failed validation will result in a separate error row. The next two options to note are the "Error code" and "Error description" fields, as these values determine, for each of the validation rules, how the error is reported. If you need to process the data further after a failed validation, using a conformed set of error codes will help in executing the right cleansing transforms. Before error handling actually works, you need to enable it. Defining the error codes and defining an error handling hop doesn't do much, other than passing the error rows to a separate step. In order to add the error codes and descriptions to these rows, you need to right-click the Data Validator step and select "Define Error Handling." This opens the "Step error handling settings" dialog shown in Figure 7-19.

Figure 7-19: Define error handling

In this screen, you can instruct Kettle as to how errors should be handled. Some of the entries here have a close relationship to the error settings in the Data Validator step:

N Nr of errors fieldname: The name of the error row column that will store the number of errors in that particular row. If the option "Output one row" is unchecked, this field will always contain 1. N Error descriptions fieldname: The name of the field that will contain the descriptions as entered in the "Error description" field of the validation step. N Error fields fieldname: The name of the field that will contain the name of the field that caused the error. N Error codes fieldname: The name of the field that will contain the code entered in the "Error code" field of the validation step. N Max nr errors allowed: If this value is filled in, the transformation will throw an exception (failure code) when the number of errors exceeds this threshold. N Max % errors allowed: Similar to the previous entry, but with a relative instead of an absolute value. N Min nr of rows to read before doing % evaluation: Holds off the % calculation until a certain number of rows has been read. If this value is omitted and the max % is set at 10, Kettle will stop the transformation if there's an error in any of the first nine rows.

TIP Here are a couple of tips for handling validation errors:

N If you don't explicitly specify an error code and description in the validations, Kettle will create them for you. Codes are in the form KV followed by a number. Although an automatically generated description like "During validation of field 'items' we found that its value is null in row <[13-05-2010], [Octopus], [null], [10.0], [null]> when this is not allowed." is very descriptive, you might want to specify your own descriptions for brevity and clarity. N It is tempting to use as many checks as possible in a single validation, but it's not always clear what caused the error if you have multiple checks. A best practice here is to use multiple validations per field, for instance one that checks for NULL values and one that checks whether the values are within a certain range.

Armed with this knowledge, you can now start to create data cleansing flows to handle the data that didn't pass the validation. First, let's look at the result of the validation so far. The results of the error output stream are displayed in Figure 7-20. As you can see, there seems to be a lot wrong with this data. Look at the bottom two rows: Both items and itemPrice (a derived value) have a NULL value, and this is correctly listed in the error output. In order to get this result, the NULL validations were added separately, while the "Null allowed" checkbox was checked for both existing value validations. If you don't follow this best practice and try to validate this data using just a single validation, you'll get the results in Figure 7-21, which are a lot more confusing.

Figure 7-20: Validation errors

Figure 7-21: Confusing error descriptions

Figure 7-20 also illustrates why unique and brief error codes are such a good idea: They make it very easy to split the data for further processing using a Switch / Case step. One word of caution here: Because we created error rows for every possible error condition, the error stream now contains duplicate entries. The following example provides one possible method of correcting the date and items fields and merging the result back into the main output stream. For practical reasons, we'll ignore the other errors for this row here. A Switch / Case step can be added to the transformation to send the rows to different steps according to their error code. Because this step is never an end point, subsequent steps must be created for handling the various codes. Remember that a Switch / Case step can also have multiple entries send the rows to the same output step. The rows with error codes for items and item price (ITM and PR, respectively) can therefore be directed to the same step. Validating data and making corrections to it in the same transformation like this is entirely possible, of course, as shown in the example in Figure 7-22. The transformation in Figure 7-22 shows how the output of a validation can be directed to a Switch / Case step, which in turn redirects the rows to other steps based on the error code of the row. After the corrections to the adate and items fields have been made, the rows are merged with the rows that were valid in the first place.

Figure 7-22: Validation and correction

One thing to be aware of is that a Switch / Case step only works with a list of fixed values, or with parts of a string when the "Use string contains comparison" option is checked. In programming languages or SQL, a statement like CASE WHEN somevalue < 5000 THEN 'Pass' ELSE 'Fail' is fairly common, but the Kettle Switch / Case step cannot be used in this way. If you want this step to handle a case statement like the one in the previous sentence, you first need to translate the formula into the corresponding fixed values and add them to the stream. For numeric values, the "Number ranges" step makes this a very straightforward task. Figure 7-23 shows an example transformation where the aforementioned case statement is translated into a Kettle solution.

Figure 7-23: Preparing case statements
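If you prefer a scripting solution over the extra "Number ranges" step, the same CASE logic can also be written in a Modified Java Script Value step; a minimal sketch, with somevalue being the hypothetical field from the CASE expression above:

// Equivalent of: CASE WHEN somevalue < 5000 THEN 'Pass' ELSE 'Fail' END
var result = (somevalue < 5000) ? "Pass" : "Fail";

The "Number ranges" step is usually still preferable, because the boundaries are then visible as step metadata rather than buried in code.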

Auditing Data and Process Quality

Regularly auditing the quality of the transformed data is, in fact, the third step in a data quality improvement initiative. First there's profiling, then there's validation and error handling, and when the results of these activities are stored in separate audit or log tables, this information can be used for reporting and analysis. The final step, which we won't cover here, is the correction of the data based on the findings of the validation and auditing activities. Figure 7-24 shows the data quality lifecycle, which makes it clear that this is a never-ending process.

Profile → Validate → Audit → Correct

Figure 7-24: Data quality lifecycle

There’s a second kind of auditing as well: the process itself. If you read Chapter 4, “Cleaning and Conforming” in Ralph Kimball’s The Data Warehouse ETL Toolkit, there are two cleaning deliverables defined: an error event table and an audit dimension. The former is modeled as a star schema with a date dimension and a snowflaked “screen” dimension. Translated into Kettle terminology, a “screen” is a transformation, and more specifically, a data validation transformation. Creating an error event schema like this using Kettle is relatively straightforward. We’ve already shown you how to validate data and create error records for each failed validation. This data can be augmented with system information about the current batch and transformation using the “Get System Info” step, as shown in Figure 7-25.

Figure 7-25: Creating error event data

This will store the detailed error data in a separate table. As you'll see in Chapter 14, Kettle provides extensive logging and auditing capabilities, with several options for storing job, transformation, and step-level log information. The key fields in these logging tables are the same (unique) job and transformation batch IDs used in the example in Figure 7-25. Together, these tables make for a very powerful tool to analyze the quality of the processed data. The second cleaning deliverable defined by Kimball is the audit dimension. The purpose of this table is twofold:

N Provide a fact audit trail: Where did the data come from? When was it loaded, by which job/transformation, and on which server? N Provide basic quality indicators and statistics about the data: How many records were read, rejected, and inserted, whether the fact row is complete (i.e., all dimension foreign keys reference a known entity, not an N/A or unknown record), and whether the facts are within certain bounds. The observant reader has probably figured out from this definition that an audit dimension comes very close to the already mentioned Kettle log tables. In fact, most of the required data is already inserted automatically when logging is enabled, so the only thing left to do is to add the batch_id (the audit key) to the fact table load. The changes needed are displayed in Figure 7-26.

Figure 7-26: Adding batch ID columns

You might wonder why the batch ID is listed twice here. This is because it's an updatable fact table and the original batch ID should not be overwritten. To accomplish this, one version is updatable (the current batch ID, called batchid_cur) and the other is a static version (called batchid_ini in this case) that keeps the original batch ID of the batch that inserted the record. Adding the job and transformation batch IDs doesn't have to be limited to the fact records either; every dimension table can be extended in a similar way. By doing this, all the records in the data warehouse can be linked to both the log and the error event tables. The result is a very robust and auditable solution where every piece of data can always be traced back to the process that created it.

Deduplicating Data

Deduplication is in many ways a challenging task. Many large customer tables contain duplicate records; think, for instance, about CRM (Customer Relationship Management) systems. Data in these applications is entered by call center agents who need to work fast and don't always have the time to check whether a calling customer is already in the system. In case of doubt, they just create a new customer record and possibly misspell name or address information. This causes problems when the CRM data is used some time later to create mailing lists for marketing campaigns. As a result, many CRM initiatives contain projects to clean up customer data to make sure every customer is listed in the system only once. The problem, of course, is: How do you detect duplicate records? And even more challenging: How do you detect duplicate records when you know there are no shared keys (like postal codes), or the shared keys run the risk of being misspelled as well? The disappointing answer to these questions is that there is no 100 percent foolproof method or piece of software that completely solves these problems. However, by using "fuzzy" matching logic in Kettle, you can come a long way toward getting your customer data in order.

Handling Exact Duplicates

Kettle contains a very simple method to remove duplicate records, which is implemented in two similar steps: the "Unique rows" step and its direct sibling, the "Unique rows (HashSet)" step. They both work in a similar way and are easy to use, but the former requires a sorted input while the latter tracks duplicates in memory. The steps can recognize only exact duplicates and allow the duplicate detection to be restricted to certain fields. You can put this step to good use, for example, when an organization is preparing a direct mailing but only wants to send one mail package per address. The input data would have to contain customer ID and address information and needs to be sorted on address.

NOTE Example prerequisite: At least two customer records with the same address ID.

The ReadSource step contains the following query:

SELECT customer_id, last_name, address_id
FROM customer
ORDER BY 3

The last_name field isn't strictly necessary but gives a little more information about the duplicate entries found. Figure 7-27 shows an example transformation where the duplicate records are redirected to the error stream of the step.

Figure 7-27: Unique rows

The "Unique rows" step works in a pretty straightforward way, as shown in Figure 7-28. The duplicate rows redirection option is selected in order to push the duplicate rows to the error stream of the step, and the only comparison is on the field address_id.

Figure 7-28: “Unique rows” step

The two dummy steps are for easy testing; when a preview is issued on the dummy step with the name Duplicates, the duplicate rows will be displayed. For this example, you get the output shown in Figure 7-29.

Figure 7-29: Unique rows preview

The Problem of Non-Exact Duplicates

As explained in the introduction to this section, most data quality problems arise in customer and product data sets. For the discussion and examples that follow, we'll limit our scope to customer data because it is easy to understand. Let's introduce some typical examples of duplicate records based on names. Suppose the CRM system only stores first and last name, e-mail address, city, and country, as displayed in Table 7-2. Anyone who can read will immediately see that these two records probably point to the same person, but what if there are 3, 30, or even 100 million records in the table? It's impossible to correct very large data sets by just browsing through the data.

Table 7-2: Duplicate Entries Example

ID    FIRSTNAME   LASTNAME   E-MAIL                    CITY     COUNTRY
3     Roland      Bouman     roland.bouman@gmail.com   Leiden   NL
433   Rolan       Bowman     rolan.bowman@gmail.com    Leiden   NL

The first thing needed is a check at the data entry point; things such as City, Postal code, and Country don’t have to be typed in and most modern applications don’t allow you to enter invalid address/postal code/city combinations. Whatever the checks are, you’ll need some columns to contain accurate data; otherwise, there will be nothing that links the records together. In Table 7-2, it’s the city and country that you know are correct (or at least cannot be misspelled). If there is no way potential duplicates can be tied together, deduplication is not possible at all. The next thing needed to solve duplication errors is ample computing power. Finding possible duplicate records means searching the complete table for every available record. With 1 million records, that means a million searches for each of the million records (1,000,000 * 1,000,000). When doing this multiple times for multiple fields, it gets worse. This is one of the reasons why the specialized tools on the market that can do this very efficiently are still very expensive. The other reason is the built-in “knowledge” of these tools that enables them to match and correct customer and address information automatically. Finally, you need algorithms to identify potential duplicate entries in your data. Some fields will contain exactly duplicated information, as in Table 7-2, and some fields will contain misspelled or differently spelled entries. Using plain SQL to identify those entries won’t cut it: Equality checks will fail, and using “like” operators won’t work either because you’ll need to specify the search string in advance. The only way to find possible duplicates is to use so-called fuzzy logic that is able to calculate a similarity index for two strings, as we will show in the following sections.
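To give an impression of what such a similarity index looks like, the sketch below implements the Levenshtein distance, one of the algorithms offered by the "Fuzzy match" step, and scales it to a 0 to 1 score. It is purely illustrative; the step does this work for you:

// Classic dynamic-programming Levenshtein distance between two strings
function levenshtein(a, b) {
  var i, j;
  var d = [];
  for (i = 0; i <= a.length; i++) { d[i] = [i]; }
  for (j = 0; j <= b.length; j++) { d[0][j] = j; }
  for (i = 1; i <= a.length; i++) {
    for (j = 1; j <= b.length; j++) {
      var cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
      d[i][j] = Math.min(d[i - 1][j] + 1,         // deletion
                         d[i][j - 1] + 1,         // insertion
                         d[i - 1][j - 1] + cost); // substitution
    }
  }
  return d[a.length][b.length];
}

// Scale the distance to a 0..1 similarity index, comparable to the step's output
function similarity(a, b) {
  var maxLen = Math.max(a.length, b.length);
  return maxLen == 0 ? 1 : 1 - levenshtein(a, b) / maxLen;
}

// similarity("Bouman", "Bowman") yields 0.83, high enough to treat the pair as a likely duplicate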

Building Deduplication Transforms

Based on the information in the previous sections, it shouldn't be too hard to figure out a deduplication approach. We'll propose an approach consisting of four steps here, but you are of course free to add more steps or augment the demo solution. The step that will be doing most of the magic is the already mentioned "Fuzzy match" step. This step works as follows: 1. Read an input field from a stream. 2. Perform a lookup on a field in a second stream using one of the match algorithms. 3. Return matches.

Figure 7-30 shows the available options on the main tab of the step. As you can see, there is a main input step (ReadSource), of which the stream field src_lastname is used, and a second "Table input" step that provides the lookup data. The available options in the Settings part of the step properties depend on the chosen algorithm. If you select one of the phonetic algorithms (Soundex, Metaphone, or their improved versions), none of the options can be set, and the "Case sensitive" checkbox is only available for the Levenshtein algorithms.

Figure 7-30: Fuzzy match basics

The other options can be used to fine-tune your matching, and also enable you to retrieve one or more possible matches. The “Get closer value” setting is an important one here as this will determine what data is returned and whether the result can be used for deduplication. If the “Get closer value” box is not checked, you’ll notice that the Values separator can be set. As a result, all matches that satisfy the Minimal and Maximal value setting will be returned as a separated list. Figure 7-31 shows the result of the preview of a fuzzy match on the last name column in the sakila database with the settings as shown in Figure 7-30 and the “Get closer value” box unchecked.

Figure 7-31: Preview multiple match data

The name of the match column in the preview can be set on the Fields tab of the step. These results are not very useful for further processing. The main problem is not the number of values, but the lack of a link to the lookup table in the form of a key or other reference. For deduplicating data, it’s not enough to have the possible matches—you also need to know exactly which records provide those matches. Hence it is better to check the “Get closer value” box. This will only return a single result (the one with the highest similarity score), but at the same time enables you to retrieve additional column values from the lookup table, as shown in Figure 7-32. The Match and Value field names speak for themselves, and in the Fields section you can specify which additional fields you want to read from the lookup stream.

Figure 7-32: “Fuzzy match” return fields

Based on the knowledge you've obtained so far, you can start building a deduplication transform using the "Fuzzy match" step. In order to get some results, you need to modify some data. The simple example we're going to build uses the last name field to search for duplicate entries and the e-mail field as a "reference." Therefore, we need at least two records with the same e-mail address and last names that are somewhat (or exactly) similar.

NOTE In this example, it would of course be a lot easier to simply count e-mail addresses and filter duplicates, but that’s not the point here.

Let’s start by exploring the four-step process.

Step 1: Fuzzy Match

In this example, fuzzy matching is indeed the first step. In real-life scenarios, it's probably preceded by some other steps to get the data as clean as possible using regular expressions or one of the other techniques covered in this chapter. For instance, if a source data file contains only a single name field, which contains the full person name, a regular expression could be used to split the field into first name, middle initial, and last name for U.S. names, and possibly other splits for names from other countries.

One of the authors of this book, for instance, has a name that’s quite common in the Netherlands but is usually misspelled in the United States because it includes “van” (with a lowercase v) which is mistakenly treated as a middle name. The initial fuzzy match is the same one that you saw in Figure 7-30 and uses a table input step that retrieves the following data:

SELECT customer_id AS src_custid
, last_name AS src_lastname
, email AS src_email
FROM customer

For matching the incoming data with reference data, a second table input step is needed to ‘feed’ the “Fuzzy match” step. The following data is retrieved:

SELECT customer_id AS match_custid
, last_name AS match_lastname
FROM customer

The details of the "Fuzzy match" step are the same as the ones displayed in Figure 7-30.

Step 2: Select Suspects

Because we are using a lower boundary of 0.8 for the Jaro-Winkler algorithm, not all records will get a possible match from the fuzzy lookup. These must be filtered out before proceeding, which involves a very straightforward use of the "Filter rows" step with the condition match_custid IS NOT NULL. All records that comply with this condition are passed to the next step of the process; the other records are discarded.

Step 3: Lookup Validation Value

In this example, you use e-mail addresses as an extra reference, but this could of course be any other address field as well. What happens here is that the match_custid found in the fuzzy lookup step is used to retrieve the e-mail address of the possible duplicate record. As a result, you now have records that contain two customer IDs, two names, and two e-mail addresses. Figure 7-33 shows the partial results.

Figure 7-33: Possible duplicates

Step 4: Filter Duplicates

Selecting the final duplicates is now pretty simple. You again use a "Filter rows" step, this time with the condition src_email = lkp_email, to get your possible duplicate records, which are displayed in Figure 7-34.

Figure 7-34: Possible duplicates

We refer to the records in Figure 7-34 as possible duplicates for a reason. Although you can be pretty sure in this case that you're talking about the same human being based on the information at hand, it is still entirely possible that you're looking at a married couple sharing one e-mail address. It can get much worse, too: for example, two records pointing to the same person, both with valid addresses, where one is a street address and the other is a post office box. Also consider the possibility that most people have more than one e-mail address. The list of possibilities goes on and on. Another problem with deduplication is the question of validity: Even if duplicates are detected with different addresses or phone numbers, how can you tell which address or phone number is correct? Not all source systems keep a detailed change log at the field level, and even if they did, there's still the possibility of human error because someone must enter and update the information. Merging information from two or more records into one should be handled with care. First, there's the issue of which data should overwrite the other. Second, addresses should be treated as a single block: merging the city name of one record into a second record with an empty city column usually doesn't make a lot of sense. With all these cautions in mind, and the basic deduplication transform described here, you can start tackling those tricky deduplication efforts. As a reference, the completed transformation is displayed in Figure 7-35.

Figure 7-35: Possible duplicates

Scripting

Most ETL developers have a love-hate relationship with scripting. You strive to develop easy-to-maintain solutions where you don’t need to revert to custom coding, but as a last resort, you can always solve even the most daunting problems with a script. Historically, the Java Script step has been the “duct tape” of Kettle, where a lot of the complex transformation work took place. Over the years, however, more and more standard steps have been introduced to replace most of the functions that previously needed a Java Script step. As a general rule of thumb, you should try to avoid using scripting altogether, but there will always be some requirements that simply cannot be met with one of the regular steps. At the time of this writing, the Scripting transform category contains seven different scripting steps and there is at least one other in the making. Because there are very few use cases for the SQL script steps, let’s concentrate on the other available steps: Formula, Modified Java Script Value, User Defined Java Expression, User Defined Java Class, and Regex Evaluation. Before we cover each of them in more depth, we’ll provide a brief overview of what they have to offer:

N Formula: This isn't a real scripting step, but it does allow for more flexible formulas than the predefined calculations in the Calculator step. The formula language is the same as the one used in OpenOffice, so if you're familiar with formula expressions in Calc you'll feel right at home. N Modified Java Script Value: This step offers the full breadth of JavaScript to be used in a transformation. JavaScript enables you to read files, connect to databases, output information to pop-up screens, and so on; it has many more capabilities than you'll generally need. N User Defined Java Expression: This step lets you use Java expressions directly, with the advantage that these are translated into compiled code when the transformation is started. Of the available scripting steps, this is the highest performing one. N User Defined Java Class: Instead of just a single expression, this step lets you define a complete class, which basically allows you to write a Kettle plugin as a step. For an example of how to use this step, see Matt Casters' original blog post at http://www.ibridge.be/?p=180. N Regex Evaluation: Designed to parse regular expressions ranging from short and simple to very large and complex, and also capable of creating new fields from capture groups (parts of the regular expression). The choice of which step to use for which purpose is always a tradeoff between ease of use, speed of development, and speed of execution. An excellent overview of the advantages of one step over another is available in a blog post by one of the authors, which can be found online at http://rpbouman.blogspot.com/2009/11/pentaho-data-integration-javascript.html.

Formula

The Formula step is the Calculator step's direct sibling, and together the two of them can cover most of the things that previously required JavaScript. The Formula step is especially useful when conditional logic must be applied to the data, or when calculations are needed that are not (yet) available in the Calculator step. Date arithmetic is a good example: the Calculator step lets you calculate the number of days between two dates, but not the number of months or years. With a Formula step this can be done, but there's a catch: As soon as a date is in a different month, even if there's only one day between the first and the second date, the DATEDIF function will count it as one month. A formula in a Formula step is also different from a formula in the Calculator step in another sense: It cannot work with the fields defined in the step itself. In order to create a condition based on a calculated field, two Formula steps are needed. Finally, the fields that need to be referenced in a Formula step must be typed in manually; there is no drop-down list to choose from, as there is in the Calculator step. As an example, look at the file age calculation we promised earlier in the chapter. The first three steps get the file names, add the current date to the stream, and select the file name, system date, and last modified date. You can use any set of files to work with; this example uses the files with a .conf extension in the /etc folder on a Linux system. First you create a Formula step that creates the new field age_in_months; then you add a second Formula step to determine whether the file is too old. The transformation is displayed in Figure 7-36.

Figure 7-36: Determine file age using Formula steps

The first formula in this transformation is DATEDIF([lastmodifiedtime];[today];"m"); the second one is IF([age_in_months]>24;"TooOld";"OK"). The output of a transformation like this is displayed in Figure 7-37.

Figure 7-37: File age output

JavaScript

With all the powerful new transformation steps added to Kettle in version 4, the need for JavaScript has declined considerably. Nevertheless, there are various use cases where you still need JavaScript. Chapter 10 contains an example where a loop is performed over the data stream, something that cannot be done with another step. One other example that we'll provide here is based on the rental table in the sakila database. This table contains a rental and a return date, and we want to flag customers who returned films more than a week after the rental date. The most obvious strategy would be to look at the Calculator or Formula step for this, but neither of these is capable of calculating the number of weeks between two dates. With a Java Script step, however, this is easy. Let's start with a "Table input" step that uses the following SQL:

SELECT rental_id
, customer_id
, rental_date
, return_date
FROM rental

After adding the Java Script step, you need to determine the number of weeks between the rental date and the return date using the dateDiff function. Adding something like var age = dateDiff(rental_date, return_date, "w"); will do the trick.

NOTE Unlike SQL, JavaScript is case-sensitive, so make sure to use the correct casing for functions; dateDiff is something entirely different than datediff.

This gives you the week number, but not yet the Late flag for records that have a "weeks" value of 1 or more. There are several approaches you could take now: Use a Value Mapper step, a Formula step, or add a little code to the Java Script step. If you want to use as little JavaScript as possible, go for the Value Mapper. The reason is very simple: While a Formula would work in this case, the conditional logic will get messy when the solution needs to be refined further, for example to distinguish between early returns (< 1 week), late returns (> 1 week but < 2 weeks), and very late returns (> 2 weeks). In this example, we've augmented the JavaScript code to calculate the requested output directly, as shown in Figure 7-38. As the JavaScript in Figure 7-38 shows, only the status field is added to the output stream, while the age field is only used to determine the status.
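The script behind Figure 7-38 probably looks similar to the sketch below; dateDiff is one of the built-in functions of the Java Script step, and the status values Late and OK are illustrative:

// Number of whole weeks between rental and return ("w" selects weeks)
var age = dateDiff(rental_date, return_date, "w");

// Only status needs to be added to the output fields; age is just an intermediate value
var status = (age >= 1) ? "Late" : "OK";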

User-Defined Java Expressions

The Java Expression step looks very similar to the Formula step covered earlier; the only difference is that instead of formulas, you write Java expressions. The Pentaho wiki has a good introduction at http://wiki.pentaho.com/display/EAI/User+Defined+Java+Expression.

Figure 7-38: JavaScript for week calculation

One small example of using Java Expressions in a cleansing process is from the Mozilla Corporation, which makes extensive use of Kettle for processing weblogs. Because the amount of data transformed is already a stunning several gigabytes per hour (and still growing), they want to squeeze every bit of performance possible out of Kettle. Figure 7-39 shows one of their data-cleansing transformations in which three Java Expression steps are used.

Figure 7-39: Java Expressions in data cleansing

The first Java Expression cleansing step uses a url.replaceAll(regex, replacement) expression, with a regular expression matching the appID and appVersion URL parameters, to replace all regex matches inside the first string with the replacement string specified. The two "unescape" expressions use the java.net.URLDecoder.decode(<urlpart>) function to decode (part of) an encoded URL and will, for instance, turn an input like Roland, Jos %26 Matt into Roland, Jos & Matt. This transformation is available from the book's website as a reference.

Regular Expressions

Throughout Kettle there are many steps where regular expressions, or regexes, can be used. They are commonly used in places where wildcards can be specified, such as the "Text file input" or "Get File Names" steps, or to search for a string using a "Replace in string" step. JavaScript can also be used to evaluate regular expressions using the str2RegExp function, but the real regex power is delivered by the Regex Evaluation step.

NOTE The original version of this step was only capable of returning a Boolean indicating whether the regular expression matched the target field. Later, a Pentaho community member enhanced the step to support capture groups for creating new fields using values matched in the target field. This enhancement was contributed back to the open source project with an extensive sample transformation that ships with Kettle. This sample is the transformation Regex Eval - parse NCSA access log records.ktr, which can be found in the /samples/transformations directory under the main Kettle install directory. The Kettle wiki contains an extensive explanation of this step (http://wiki.pentaho.com/display/EAI/Regex+Evaluation) and also has pointers to regex tutorials and reference sites to help you on your way with regular expressions.

The extensive sample might be a bit overwhelming so let’s use a very simple example here. The first thing you’ll need is input data. For this example, you’ll use the Sakila address table and a Regex Evaluation step to split the street name and house number into two separate fields. The “Input table” step gets the data from the Sakila database using the following query:

H:A:8IVYYgZhhT^Y !VYYgZhh ;GDBVYYgZhh

Then the Regex Evaluation step will use VYYgZhh as the “Field to evaluate,” as shown in Figure 7-40.

Figure 7-40: Regex Evaluation step Chapter 7 N Cleansing and Conforming 205

To enable the use of whitespace and comments, the corresponding option for this has to be checked on the Content tab, and to be able to create fields for capture groups the checkbox with that name has to be marked on the Settings tab. A capture group is a part of the regular expression between parentheses so in this case QY  captures the digits that make up the house number, and PSQYR  captures all non-numeric characters from the string. Figure 7-41 shows the partial result of the step.

Figure 7-41: Regex Evaluation result

Summary

This chapter looked at the various tools and technologies available within Kettle to validate, cleanse, and conform your data. We discussed various types of rules and con- straints to which data needs to comply. More to the point, in the chapter you learned:

N The various steps available for data cleansing and several use cases for them. N How the rich collection of string matching algorithms Kettle contains works, and how they can be applied. N The use of reference tables for cleansing and conforming data. N How to apply the Data Validator step and what the options of validating data can be used. N How to handle data, process and process errors using the error handling capa- bilities of Kettle. N How to use the Fuzzy Match step to deduplicate data. N The various scripting options available and the way they can be applied. Examples for the Formula, Java Script, Java Expressions and Regex Evaluation steps were included.

CHAP T E R 8

Handling Dimension Tables

In this chapter, we take a closer look at how you can use Kettle to manage dimension tables. In particular, we’ll look at those features in Kettle that are particularly useful to transform and/or generate data to fit the format of typical dimension tables, as well as the actual loading of the data into the tables. Before we discuss the details, let’s consider which of the ETL subsystems discussed in Chapter 5 are involved in the management of dimension tables:

N Change Data Capture (subsystem 2): There are a few specific issues with regard to change data capture when loading typical denormalized star schema dimen- sion tables, so this is briefly discussed in this chapter. N Extraction (subsystem 3), Data Cleaning and Quality Screen Handler System (subsystem 4), and Error Event Handler (subsystem 5): These three are generic subsystems that apply to both dimension and fact tables. These subsystems were covered in Chapters 6 and 7, and won’t be covered here in depth. N Audit Dimension Assembler (subsystem 6): Functionally, the audit dimension is a special dimension that provides data about the ETL process itself, as opposed to information that has a business context. Loading the audit dimension was covered in Chapter 7. N Slowly Changing Dimension Processor (subsystem 9): In many cases, data stored in dimension tables is not fixed and static: it typically changes over time, albeit typically at a much slower pace than any of the fact tables. This chapter describes in depth how to use Kettle to implement type 1, type 2, and type 3 slowly changing dimensions, and covers a few additional and complementary techniques as well. 207 208 Part II N ETL

N Surrogate Key Generator (subsystem 10): Loading dimension tables implies gen- erating surrogate keys; this topic is discussed in detail in this chapter. N Hierarchy Dimension Builder (subsystem 11): Although it is not so easy to pinpoint exactly which part of a transformation for loading a dimension con- stitutes the hierarchy manager, dealing with hierarchies certainly has a lot to do with loading dimension tables. Several aspects of dealing with hierarchies are discussed in detail in this chapter, such as top-down level-wise loading of snowflaked dimensions and implementing solutions to solve hierarchies of inde- terminate depth (recursive hierarchies). N Special Dimension Builder (subsystem 12): This is not a really clear-cut sub- system—rather it is a bucket of various types of dimensions. Some of these are discussed in this chapter, such as generated dimensions and junk dimensions. N Dimension Manager System (subsystem 17): This is the actual subject of this chapter.

Managing Keys

A big part of loading dimension tables is about managing keys. There are two types of keys to take into account:

N Business keys: These keys originate from the source system, and are used to identify its business entities. N Dimension table surrogate keys: These keys identify rows in the dimension tables of the data warehouse.

NOTE Arguably, there is another type of key, namely the foreign keys to dimension tables. There may be some confusion about the term foreign key. In particular, the term is often confused with a foreign key constraint. In this chapter, the term foreign key is used to convey the idea that a table contains a column that has the express purpose of storing the key values of another table in order to relate the rows from the different tables. In this sense, foreign key just means the column stores key values, but from a key of another (foreign) table. Foreign keys typically occur in the fact tables where they are used to relate the fact table rows to the context of the facts described in the dimension tables. These are discussed in Chapter 9, which is all about fact tables. But as you will see later in this chapter in our discussion of snowflaked dimensions, foreign keys to dimension tables also occur in dimension tables and bridge tables.

Here’s a list of common tasks related to managing keys when handling data ware- house dimensions:

N Matching business keys from the source system to the dimension tables to deter- mine whether to update or insert dimension rows Chapter 8 N Handling Dimension Tables 209

N Generating a surrogate key value for new dimension rows N Looking up surrogate keys when loading snowflaked dimension tables (as well as fact tables) N Deriving smart keys from attribute data for generated dimensions such as date and time dimensions Kettle offers at least one and sometimes several transformation steps that can be used to perform each of the listed tasks regarding key management. In this section, we describe these key management tasks in general terms. Later in this chapter, we present concrete, detailed examples of these tasks.

Managing Business Keys Data warehouses are loaded with data coming from one or multiple source systems. Understanding which pieces of data can be used to identify business entities is para- mount, in both the source system and the data warehouse. Without a means of identi- fication, you would be unable to keep records pertaining to different real-world objects separate from one another, rendering the entire collection of data structureless, and thereby useless.

Keys in the Source System

In the source systems, the pieces of information that identify the business entities are keys. For any given table, a key is each column (or group of columns) such that logi- cally, there cannot be more than one row in that table having a particular value for that column (or a combination of values in case of a group of columns). Typically the source system enforces keys at the database level by either a primary key or unique constraint. This actively prevents any changes to the database that would result in a situation where the table would store multiple rows having the same value(s) for the column(s) of one particular key.

Keys in the Data Warehouse

In the data warehouse, business entities such as products, customers, and so on end up in the dimension tables. These each have their own key, separate from any key coming from the source system. Typically, keys of dimension tables are surrogate keys, consist- ing of a single integer column for which the values do not bear any relationship with the descriptive columns of the dimension table, and they are typically generated in the course of the ETL process.

Business Keys

In order to properly load the dimension tables, there has to be some way to keep track of how the rows in the dimension tables relate to the business entities coming from the source system(s). This is achieved by storing the surrogate key of the dimension table 210 Part II N ETL

together with at least one of the keys from the source system. Typically, the primary key from the source system is most convenient for this purpose. In this context, the key coming from the source system is referred to as the business key. Business keys can originate either from natural keys or surrogate keys present in the source system, but in the data warehouse we simply refer to them as business keys.

NOTE A business key is typically not a key of the dimension table. One of the main purposes of the data warehouse is to maintain a history of the changes in the source systems, and the only way to truly do that is to allow for storing duplicates in the business keys. A typical example is discussed later in this chapter in our discussion of type 2 slowly changing dimensions. For rea- sons of performance, it is usually still a good idea to index the business keys.

Storing Business Keys

The business keys are often stored directly in the dimension table. This is a very straight- forward approach that allows for relatively simple ETL procedures. For example, to check whether data changes coming from the source system should result in an update of a existing dimension row or in an insert of an additional dimension row, a simple and efficient check based on the business key is sufficient to see if a relevant dimension row already exists. A similar search based on the business key can be done to obtain the corresponding surrogate key of the dimension table. Alternatively, the business-key/dimension-key mapping can be kept outside of the dimension table and maintained in the staging area instead. In this case, the design of the dimension tables may be cleaner, at the expense of more complicated ETL procedures.

Looking Up Keys with Kettle

Kettle offers a number of steps that can be used for looking up data. These steps can be used for looking up keys (typically, looking up a surrogate key based on the value of a business key), or for looking up attribute data (which can be used to denormalize data). In addition to pure lookup steps, there are several steps that are capable of inserting or updating rows in a database, based on a matching key, as well as looking up data. You’ll see plenty examples of looking up data by key in this and other chapters.

Generating Surrogate Keys Data warehousing best practices dictate that dimension tables should, in principle, use an automatically generated meaningless integer type key: a surrogate key. There are cases where it makes sense to deviate from this rule, and some of these cases are discussed later on in this chapter, but in this section, we focus on the genuine surrogate keys. Chapter 8 N Handling Dimension Tables 211

Kettle offers features to generate surrogate key values directly from within the trans- formation, as well as functionality to work with key values that are generated at the database level. In this section, we’ll take a look at the steps that offer these features.

The “Add sequence” Step

In Spoon, the “Add sequence” step resides beneath the Transform category. This step is designed especially to generate incrementing sequences of integer keys. You first encountered this step in the adVYTY^bTYViZ and adVYTY^bTi^bZ transformations in Chapter 4 (shown in Figure 4-4 and Figure 4-5 respectively). The sequence step works by passing through rows from the incoming stream to the outgoing stream, thereby adding an extra integer field containing generated integer values. You can configure the step to draw values from two alternative sources:

N A local counter maintained at runtime within the transformation. N A database sequence. Databases like Oracle and PostgreSQL offer sequence schema objects, allowing tables to draw unique values from a centrally managed incrementing number generator. In the remainder of this section, we describe in detail how to use the “Add sequence” step to generate surrogate keys for dimension tables.

Using Internal Counters for the “Add sequence” Step Figure 8-1 shows the configuration dialog of the “Add sequence” step that is set up to generate values at runtime within the transformation from an internal counter variable.

Figure 8-1: Configuring the “Add sequence” step to generate values within the transformation 212 Part II N ETL

In the header of the dialog, the “Step name” field is used to give the step a unique name within the transformation. The “Name of value” field is the name of the field that will be added to the outgoing stream to convey the sequence of integer values. The bottom section of the dialog is labeled “Use a transformation counter to generate the sequence.” Within that section, the checkbox “Use a counter to calculate sequence?” is selected, which disables the top section (“Use a database to generate the sequence”) and specifies that the values are to be generated by a counter at runtime from within the transformation. The “Counter name” field is left blank in Figure 8-1. When you have multiple “Add sequence” steps in one transformation, all such steps that specify the same name in this field will draw values from one and the same internal counter variable, ensuring the values drawn from these different steps are unique during one transformation run. The “Start at value” field is used to specify the offset of the sequence of values. Here, you can either specify an integer literal or a variable reference that evaluates to an inte- ger literal. In Figure 8-1, this is set to 1 but later in this section we discuss an example where we use a variable to initialize the offset of the sequence.

NOTE The value for the “Start at value” is not remembered across different runs of the same transformation. The “Add sequence” step also does not pro- vide a way to automatically load the offset value based on a database query. So, there is no built-in way to have the sequence pick up where it left off the last time the transformation was run. The sample transformations discussed in the remainder of this section illustrate how to cope with this issue.

The “Increment by” field also takes either an integer literal or a variable that evalu- ates to an integer literal. This is used to specify the interval between the generated values. Normally you don’t need to modify the default value of 1, but allowing this to be parameterized allows for some extra flexibility, which can be useful in some cases.

NOTE As an example of using the “Increment by” field, consider a case where you need unique numbers to be drawn across multiple transformations, 1 through N. This can be achieved by giving each transformation its own “Add sequence” step, each having its own offset 1 through N, and giving them all an “Increment by” value of N.

The “Maximum value” field is used to specify the maximum value of the sequence. If the internal counter exceeds this value, the sequence wraps around and starts again at the value configured for the “Start at value” field.

Generating Surrogate Keys Based on a Counter The iZhiThZfjZcXZ&#`ig sample transformation uses the “Add sequence” step exactly as it is configured in Figure 8-1 to generate a surrogate key for a sample table called iZhiThZfjZcXZ. Each time the transformation is run, it generates another 100 rows, each getting its own sequential integer surrogate key. The transformation is shown in Figure 8-2. Chapter 8 N Handling Dimension Tables 213

Figure 8-2: Generating surrogate key values using the “Add Sequence” step and internal counters

NOTE You can download the iZhiThZfjZcXZ&#`ig transformation file from this book’s website at lll#l^aZn#Xdb$\d$`ZiiaZhdaji^dch in the folder for Chapter 8. To run it, be sure to check the database connection properties to match your database. The transformation was built for a MySQL database having the schema called iZhi. It connects to the database with username iZhi, using the password iZhi.

The transformation shown in Figure 8-2 works by first running the step “CREATE TABLE test_sequence” in its initialization phase. This step is of the Execute SQL type, and sets up a sample table called iZhiThZfjZcXZ, which represents the dimension table, by running this SQL statement:

8G:6I:I67A:>;CDI:M>HIHiZhiThZfjZcXZ ^Y>CI:<:GCDICJAAEG>B6GN@:N 0

After the initialization phase, the transformation generates 100 empty rows using the “Generate rows” step. These rows represent the data set that is to be loaded into the iZhiThZfjZcXZ table. Then the “Add sequence” step, having a configuration exactly as shown in Figure 8-1, adds a field called hZfjZcXZTkVajZ to the stream, which contains an incrementing series of integers. The hZfjZcXZTkVajZ field is, of course, a crucial element in generating values for the surrogate key. Unfortunately, it cannot be used as-is because each time the transfor- mation is run, it starts again at 1, and would thus generate duplicate keys immediately after the initial run. To overcome this problem, we add the maximum key value present in the target table prior to running the transformation. To achieve this, we have a “Table output” step labeled “MAX(id) FROM test_sequence,” which uses the following SQL statement to retrieve a single row with one column, holding the maximum key value found in the iZhiThZfjZcXZ table:

H:A:8IB6M^Y ;GDBiZhiThZfjZcXZ

In order to add this value to the generated sequence values, the streams coming out of the “MAX(id) FROM test_sequence” and “Add sequence” step must first be joined. 214 Part II N ETL

This is achieved by the “Join Rows (cartesian product)” step. You first encountered this step when we discussed the adVYTY^bTi^bZ#`ig transformation in Chapter 4 (shown in Figure 4-5). In the iZhiThZfjZcXZ transformation, it computes a Cartesian product just like in the adVYTi^bZTY^bZch^dc#`ig transformation, but because the “MAX(id) FROM test_sequence” step always returns exactly one row, the step does not lead to a duplication of records: it simply combines the sequence value along with the maximum key value in a single record. The next step in the iZhiThZfjZcXZ is labeled “MAX(id) + sequence_value,” which is of the Calculator type. It is used to add the maximum value of the key coming from the “MAX(id) FROM test_sequence” step to the sequence value generated by the “Add sequence” step. This finally yields the value for the surrogate key column in the target table. The configuration of the Calculator step from the iZhiThZfjZcXZ transforma- tion is shown in Figure 8-3.

Figure 8-3: Using the Calculator step to actually compute the new key values

Figure 8-3 reveals another calculation to deal with the (rather exceptional) case in which the target table is empty. In this case, the step “MAX(id) FROM test_sequence” returns a CJAA value, and the second calculation ensures that this value is replaced with the numerical 0 (zero) in order to continue with the addition to generate the actual surrogate key value. The final step is a “Table output” step to load the iZhiThZfjZcXZ table, which rep- resents the actual loading of the dimension table.

Dynamically Configuring the Sequence Offset The iZhiThZfjZcXZ transformation shown in Figure 8-2enables you to work around the problem of generating only new unique key values by taking the maximum value of the existing keys prior to running the transformation and adding that maximum to the newly generated sequence values. Although this method works nicely in a sample transformation, real-world transfor- mations run the risk of becoming cluttered because they involve too many steps that do not directly contribute to the logic of the actual loading of the dimension table. Another potential disadvantage is the fact that the Join Rows and “MAX(id) + sequence_value” steps are executed for each row of the changed data set, slowing down the transforma- tion as a whole. Chapter 8 N Handling Dimension Tables 215

There is an alternative to ensure the “Add sequence” step will generate only new unique key values. Instead of calculating the key value again and again for each row, it makes more sense to generate only the right key values in the first place. The only thing that is required to achieve this is the correct initialization of the “Start at” field of the configuration of the “Add sequence” step. As it turns out, it is perfectly possible to dynamically configure the “Start at” prop- erty of the “Add sequence” step. The process is not entirely straightforward, and it does require some effort, but it has the advantage of resulting in a cleaner main trans- formation, which is typically just a little bit faster, too. Figure 8-4 shows the job called iZhiThZfjZcXZ'#`_W.

Figure 8-4: Dynamically configuring the offset of the “Add sequence” step

The job shown in Figure 8-4 contains two transformation job entries, iZhiT hZfjZcXZ'"&#`ig and iZhiThZfjZcXZ'"'#`ig. These transformations are also shown in Figure 8-4. The first transformation, iZhiThZfjZcXZ'"&#`ig, does the job of setting up the iZhiThZfjZcXZ table and querying for the maximum value of the key. It also contains a Calculator step to supply a default value instead of the maximum key value in case the table is still empty, as discussed in the preceding section. Transformation iZhiThZfjZcXZ'"&#`ig contains one step that you haven’t encoun- tered yet: the Set Variables step. Steps of this type accept one row from the incoming stream and expose the values of specified fields as environment variables, which are accessible outside the transformation that sets the variables. The scope of variable acces- sibility can be configured per variable. The configuration for the Set Variables step used in transformation iZhiThZfjZcXZ'"&#`ig is shown in Figure 8-5. In Figure 8-5, you can see that the ^Y field of the incoming stream is exposed as a variable called >9. The variable has root job scope. Variables with root job scope are accessible inside the root job itself (which is the outermost job executed by Kettle) along with all jobs and transformations that are directly or indirectly called from this job. 216 Part II N ETL

Figure 8-5: The configuration of the Set Variables step

The second transformation in the iZhiThZfjZcXZ' job, iZhiThZfjZcXZ'"'#`ig, does the actual loading of the table. Again you see the Generate Rows step for creat- ing dummy rows, the “Add sequence” step to generate the surrogate key values, and the “Table output” step to actually load the table. In this case, the configuration of the “Add sequence” step is almost identical to the one shown in Figure 8-2, but in this case, the “Start at value” field is set to p>9r, which denotes a reference to the >9 variable that was set in the iZhiThZfjZcXZ'"&#`ig transformation by the “Set Variables” step shown in Figure 8-4. The configuration of this “Add sequence” step is shown in Figure 8-6.

Figure 8-6: Using a variable reference to parameterize the sequence of the “Add sequence” step Chapter 8 N Handling Dimension Tables 217

Surrogate Keys Based on a Database Sequence The “Add sequence” step can also be used to draw values from a database sequence. To configure this, select the “Use DB to get sequence?” checkbox in the top section of the “Add sequence” dialog. To specify the database sequence itself, you need to select a valid connection, which must connect to an RDBMS that supports database sequences. If the database sequence does not reside in the default schema of the connection, you can specify the name of the schema in the “Schema name” field, where you should also specify the name of the database sequence. That’s all there is to it. In contrast, with a sequence based on an internal counter, no special measures have to be taken to initialize the offset of the sequence. The database can be relied upon to keep track of the current value of the sequence.

Working with auto_increment or IDENTITY Columns

Some databases offer sequence support at the column level. Using DDL, one (or some- times more) table columns are created using a special column attribute or data type that causes the database to automatically supply a unique incrementing integer value for the column whenever a new row is added to the table. The purpose of such a feature is to simplify implementing surrogate keys. For example, MySQL supports the VjidT^cXgZbZci column attribute, which may be applied to one column of an integer type, provided that column is the primary key (or under certain circumstances, part of a composite primary key). Microsoft SQL Server has a similar feature but implements this using a special >9:CI>IN data type. On the one hand, these database features alleviate some of the burden of generating surrogate keys in the transformation. On the other hand, they can also be a complicating factor in case the remainder of the transformation needs access to the database- generated value. This problem does not occur when using database sequences via the “Add sequence” step. In that case, the act of drawing a value always has to occur before adding the row to the table, so by definition the value is always available to the transformation. The Kettle steps that are designed for adding rows to dimension tables come with configuration options to add the automatically generated column value to the outgoing stream. This is discussed in greater detail later in this chapter.

Keys for Slowly Changing Dimensions

The “Slowly Changing Dimensions” section in this chapter contains a detailed descrip- tion of the “Dimension lookup / update” step. This step offers built-in functionality to automatically generate surrogate key values for various types of slowly changing dimensions. Although this is built-in functionality, it is still flexible enough to work with database-generated surrogate keys drawn from either a database sequence or an VjidT^cXgZbZci or >9:CI>IN column. If you’re designing a transformation to load type 1 or type 2 slowly changing dimen- sions, we recommend using the “Dimension lookup / update” step. When using this step, you do not need to add a separate “Add sequence” step to generate the values. 218 Part II N ETL

Loading Dimension Tables

In this section, we look at how you can use Kettle to load dimension tables based on data stored in one or more operational source systems or a staging area. We look at two different but typical scenarios:

N Loading snowflaked dimension tables N Loading denormalized star schema dimension tables

Snowflaked Dimension Tables In snowflake schemas, a single dimension is implemented as a series of related dimen- sion tables. The dimension tables are typically in the third normal form (3NF) or Boyce- Codd normal form (BCNF) and are thus said to be normalized. Figure 8-7 shows an example dimensional model having one fact table and snowflaked date and product dimensions.

date dimension

Y-W-D hierarchy Y-Q-M-D hierarchy

dim_date_year year_key: INTEGER(10) [PK] year_number: INTEGER(10)

dim_date_quarter quarter_key: INTEGER(10) [PK] brand hierarchy year_key: INTEGER(10) [FK] quarter_number: INTEGER(10) brand hierarchy category hierarchy dim_product_brand dim_product_category brand_key: INTEGER(10) [PK] category_key: INTEGER(10) [PK] dim_date_week dim_date_month brand_name: VARCHAR(10) category_name: VARCHAR(10) week_key: INTEGER(10) [PK] month_key: INTEGER(10) [PK] year_key: INTEGER(10) [FK] quarter_key: INTEGER(10) [FK] week_number: INTEGER(10) month_number: INTEGER(10)

dim_product product_key: INTEGER(10) [PK] dim_date_day product_name: VARCHAR(10) day_key: INTEGER(10) [PK] brand_key: INTEGER(10) [FK} month_key: INTEGER(10) [FK] category_key: INTEGER(10) [FK] day_in_month: INTEGER(10) week_key: INTEGER(10) [FK] day_in_week: INTEGER(10)

act day_key: INTEGER(10) [PK] product_key: INTEGER(10) [PFK] amount: INTEGER(10) Figure 8-7: Snowflaked dimensions Chapter 8 N Handling Dimension Tables 219

Each dimension table in a snowflaked dimension represents a level in one of the dimension’s hierarchies. The dimension tables are inter-related via one-to-many relation- ships, where the table that implements the higher aggregation level is at the “one” side, and the table that implements the nearest lower aggregation level is at the “many” side. For example, in Figure 8-7, the date dimension has two hierarchies: Y-W-D (Year, Week, Day) and Y-Q-M-D (Year, Quarter, Month, Day). The Y^bTYViZTnZVg table implements the Year level, which is the highest aggregation level in both hierarchies. This table has a one-to-many relationship with both the Y^bTYViZTlZZ` and Y^bTYViZTfjVgiZg tables, each of which implements the next lower aggregation level in the Y-W-D and Y-Q-M-D hierarchies. There’s always one dimension table within the dimension that represents the lowest aggregation level: the primary dimension table. The primary dimension table is usually related to the fact table, again via a one-to-many relationship. In Figure 8-7, the tables Y^bTegdYjXi and Y^bTYViZTYVn are the primary dimension tables for their respective product and date dimensions.

Top-Down Level-Wise Loading

In a snowflaked dimension, loading dimension tables at any but the highest level implies doing a lookup in the dimension table at the next higher level. The lookup is required to obtain the key value, which needs to be stored in the dimension table at the lower level to maintain the hierarchical relationship between the two consecutive levels. Because of the dependency of lower-level dimension tables upon those at the higher level, it makes the most sense to load hierarchies in a top-down fashion, first loading all dimension tables at the highest level, then loading the dimension tables at the level immediately below the highest level, then at the next lower level, and so on, finally loading the primary dimension tables. We refer to this as top-down level-wise loading.

NOTE We already encountered another example where the table loading order was controlled by lookup dependencies: In Chapter 4, we discussed the adVYTgZciVah job, which used a series of transformation job entries to sequentially load the dimension tables before loading the fact table.

A good way to manage top-down level-wise loading of snowflaked dimensions is to build a single transformation to load each individual dimension table. At a minimum, such a transformation should contain the logic to extract the changed data at the level appropriate for the dimension table. In case the dimension table is not at the top level of its hierarchy, one or more lookups have to be performed to fetch the keys from the dimension table(s) at the nearest higher level.

Sakila Snowflake Example

To illustrate the process, we created the hV`^aVThcdl[aV`Z schema. This is another dimensional model based on the sakila sample database schema introduced in Chapter 4. The hV`^aVThcdl[aV`Z schema is identical to the sakila rental star schema, except for 220 Part II N ETL

the customer and store dimensions. The respective dimension tables Y^bTXjhidbZg and Y^bThidgZ have been modified by replacing all columns related to address data with a foreign key to the primary dimension table of a snowflaked location dimension. A diagram of the (partial) hV`^aVThcdl[aV`Z schema is shown in Figure 8-8.

"'*),! 0',+!,1+0.3 ),! 0',+!,1+0.3(#3    ),! 0',+!,1.+0.3'"   ),! 0',+!,1.+0.3) /01-" 0#    ),! 0',+!,1.+0.3+ *#    

"'*),! 0',+!'03 ),! 0',+!'03(#3    ),! 0',+!,1+0.3(#3    ),! 0',+!'03+ *#     ),! 0',+!'03'"   ),! 0',+!'03) /01-" 0#   

"'*),! 0',+ "".#// ),! 0',+ "".#//(#3    ),! 0',+!'03(#3    ),! 0',+ "".#//'"    ),! 0',+ "".#//) /01-" 0#    ),! 0',+ "".#//    7 ),! 0',+ "".#//-,/0 )!,"#     "'*/0,.# "'*!1/0,*#. /0,.#(#3   !1/0,*#.(#3   ),! 0',+ "".#//(#3    ),! 0',+ "".#//(#3    /0,.#) /01-" 0#   !1/0,*#.) /01-" 0#   /0,.#'"   !1/0,*#.'"   /0,.#* + %#./0 4'"   !1/0,*#.5./0+ *#     /0,.#* + %#.5./0+ *#     !1/0,*#.) /0+ *#     /0,.#* + %#.) /0+ *#     !1/0,*#.#* ')     /0,.#2#./',++1* #.    !1/0,*#. !0'2#   6 /0,.#2 )'"$.,*   !1/0,*#.!.# 0#"   /0,.#2 )'"0&.,1%&   $ !0.#+0 ) /0,.#2#./',++1* #.    .#+0 )'"   8 /0,.#2 )'"$.,*   .#+0 )) /01-" 0#    /0,.#2 )'"0&.,1%&   !1/0,*#.(#3   /0 4(#3   5)*(#3   /0,.#(#3   .#+0 )" 0#(#3    .#01.+0'*#(#3    !,1+0.#01.+/    !,1+0.#+0 )/   .#+0 )"1. 0',+    Figure 8-8: The (partial) sakila_snowflake schema

The snowflaked location dimension consists of three tables: Y^bTadXVi^dcTVYYgZhh, Y^bTadXVi^dcTX^in, and Y^bTadXVi^dcTXdjcign. The Y^bTadXVi^dcTVYYgZhh table is the primary dimension table, which represents the lowest level of the location dimension.

NOTE The setup process for the hV`^aVThcdl[aV`Z example is similar to that for setting up the rental star schema, described in Chapter 4. Download the hV`^aVThcdl[aV`ZThX]ZbV#hfa and XgZViZThV`^aVThcdl[aV`ZT VXXdjcih#hfa files from this book’s website and run them. To examine the load process, download the adVYTY^bTadXVi^dcT job and transformation files. Chapter 8 N Handling Dimension Tables 221

In the hV`^aVThcdl[aV`Z example, the primary dimension table of the location dimension is not related to the fact table, but to other dimension tables. Because the location data has been pulled out of the existing store and customer dimension tables, these become themselves snowflaked. In the Data Warehouse Toolkit, Second Edition by Ralph Kimball and Margy Ross (Wiley, 2002), even Kimball deems this a valid case of dimension normalization for a location dimension, and he refers to this construct as a location outrigger (although he does not recommend further snowflaking of the location dimension itself). Because the schema itself is still largely a star schema, this can also be called a starflake.

Sample Transformation

By way of example, the adVYTY^bTadXVi^dcTVYYgZhh#`ig transformation is shown in Figure 8-9.

Figure 8-9: Loading an individual dimension table of a snowflaked dimension

In this particular case, the transformation is designed to load the Y^bTadXVi^dcT VYYgZhh table of the location dimension shown in Figure 8-8. The first two table input steps of the transformation are responsible for capturing changed data in a similar way as for the sample transformations described in Chapter 4: You first query the maximum value of the aVhiTjeYViZ column from the dimension table and then use that value to retrieve all rows from the source system that changed since that date. After the changed data capture, the rows are then fed into the “Lookup dim_ location_ city” step to find the corresponding key of the Y^bTadXVi^dcTX^in table. We discuss the configuration of this step in detail in the next subsection. Finally, the address data plus the foreign key to the Y^bTadXVi^dcTX^in table are loaded into the Y^bTadXVi^dcTVYYgZhh table using an Insert / Update step. We discuss this step later in this chapter in greater detail in the section “Type 1 Slowly Changing Dimensions.” 222 Part II N ETL

Database Lookup Configuration

In the adVYTY^bTadXVi^dcTVYYgZhh#`ig transformation shown in Figure 8-9, a step of the “Database lookup” type is used to find the value of the surrogate key used in the data warehouse (adXVi^dcTX^inT`Zn) that corresponds to the business key (X^inT^Y) coming in from the source system. As you will see later in this chapter, the “Database lookup” step can also be used to retrieve attribute values from the source system, which is useful in preparing the denormalized data required to load star schema dimension tables. In Chapter 9, you also learn how this type of step is useful in obtaining the keys of the dimension tables when loading fact tables. The configuration of the “Lookup dim_location_city” step from the adVYTY^bT adXVi^dcTVYYgZhh transformation in Figure 8-9 is shown in Figure 8-10.

Figure 8-10: The configuration of the “Database Lookup” step

The elements of this dialog are discussed in detail in the remainder of this section.

Connection, Lookup Schema, and Lookup Table The configuration shown in Figure 8-10 is pretty straightforward: Because you’re look- ing up the surrogate key of the Y^bTadXVi^dcTX^in table in the hV`^aVThcdl[aV`Z schema, the “Lookup table” and “Connection” fields are filled in correspondingly. For the Connection field, you can pick an existing connection from the drop-down list. You can also use the Edit button to review or modify the definition of the connection. Chapter 8 N Handling Dimension Tables 223

In this case, the “Lookup schema” field is left blank because the hV`^aVThcdl[aV`Z schema is already the default schema of the connection, but you could use this field to specify another schema if required. You can either type the name of the schema, or use the Browse … button to pop up a dialog from which you can choose the appropri- ate schema. Unsurprisingly, the “Lookup table” field must be used to specify which table should be used to perform the lookup. As with the “Table schema” field, you can either type the name of the table, or use the Browse … button to open a dialog from which you can pick the table.

Lookup Key At the bottom of the dialog shown in Figure 8-10 are two grids. The upper grid is labeled “The key(s) to lookup the value(s)” and is used to specify how the fields from the incoming stream should be matched against columns of the lookup table. To that end, each row in the grid defines a comparison operation between a field value (some- times two field values) from the incoming record stream against a column value of the lookup table. The first column in the grid is “Table field.” Here, you can specify the names of the columns from the lookup table. You can either type the column name or use the drop-down control to pick a column from the list. Typically, the set of columns speci- fied for “Table field” make up a key of the lookup table, so that each row from the incoming stream matches at most one row. For example, the adXVi^dcTX^inT^Y col- umn specified in the grid shown in Figure 8-10 constitutes the business key of the Y^bTadXVi^dcTX^inTiVWaZ. The Field1 column in the grid is used to specify which fields from the incoming stream should be used to look for a record in the lookup table. The Comparator column specifies a comparison operator that determines exactly how the database column and the stream field are to be matched. In the vast majority of cases, the database columns and stream values are tested for equality using the 2 as comparator. Sometimes, it’s useful to look up a record based on a value range. In this case, you can use the 7:IL::C comparator. When using this comparator, a stream field must be filled in for both the Field1 as well as the Field2 columns in the grid to specify the lower and higher bounds of the value range respectively. To quickly fill the grid, you can press the Get Fields button at the bottom of the dia- log. This populates the grid with an equality comparison for each field of the incoming stream. Both the “Table field” and the Field1 columns will be filled with field names coming from the stream. This means that when the column names are not identical to the field names, you will need to manually adjust the “Table field” values.

Lookup Value You can use the grid labeled “Values to return from the lookup table” to specify which columns of the lookup table are to be retrieved. For each of these columns, a field is added to the outgoing stream. Use the Field column to specify the columns from the lookup table. You can either type the column names or select them from the drop-down list, just as you can in the grid that specifies the lookup key. The “New name” column in the grid can be used 224 Part II N ETL

to control the names of the fields that are added to the outgoing stream. If you don’t specify a field name here, then the column names will be used. The Default column can be used to specify a custom literal value that is to be returned if no matching row is found in the lookup table. When specifying a Default, you should use the Type column to specify the data type of the value. To quickly populate the grid, use the Get Lookup Fields button. This will add a row to the grid for each column in the lookup table. By deleting the entries you don’t need, or alternatively, making a selection and pressing Ctrl+K (for keep) to retain only those entries that you do need, you can save yourself a lot of time configuring “Database lookup” steps.

Cache Configuration The “Database lookup” step can be configured to cache any lookup results. When cach- ing is enabled, Kettle will store records retrieved from the database in an in-memory cache, and then try to use the cache to serve subsequent lookup requests. This can result in an enormous increase in speed if the same lookup has to be performed repeatedly, which is typically the case. If there is little chance the same key has to be looked up repeatedly, enabling the cache will result in some slowdown incurred by the extra work required to search the cache before searching the lookup table. In Figure 8-10, the “Enable cache?” checkbox is checked to enable caching. The “Cache size in rows” field may be used to specify the size of the cache as a number of rows. If the cache contains the maximum specified number of entries, any subsequent lookup request that is not in the cache will cause one of the older cached entries to be removed. As the label indicates, you can specify that all lookups are to be cached by setting the property to zero. In Figure 8-10, the “Cache size in rows” field is disabled, indicating that the setting is not used. This is because the “Load all data from table” checkbox is checked, which forces upfront caching of all of the rows in the Y^bTadXVi^dcTX^in table. When “Load all data from table” is not checked, caching occurs in an “as-you-go” fashion: an attempt is made to retrieve a row from the cache, and if it’s not in the cache, an attempt is made to retrieve it from the table. When that attempt is successful, the row is stored in the cache, speeding up subsequent lookup requests by the same key. When “Load all data from table” is checked, the entire lookup table is scanned only once in the initialization phase of a transformation run, storing all rows in the cache. During the run itself, all lookup requests are done directly against the cache, and no attempt is ever made to do any lookups against the database table. In this case, no time is ever lost to maintain the cache and issue any further database queries, so typically this option offers the best performance. The only reason not to use this option is when the lookup table is very large and it would cost too much in terms of memory to store the lookup table in memory.

Lookup Failure When using the “Database lookup” step, there is typically an assumption that the lookup will always succeed—that is, that the lookup key is found in the table, and that it matches exactly one row. This may not always be the case, however: The lookup table Chapter 8 N Handling Dimension Tables 225 may not contain any rows that match the lookup key, or there may be multiple rows that match the lookup key. Normally, if a row is not found in the lookup table, then all lookup fields will have get the value of the Default column defined in the “Values to return from the lookup table” grid, and CJAA in case no default value is configured. In some cases, this may be exactly what you want. In other cases, you can actively block those rows from flowing through the step by checking the “Do not pass the row if the lookup fails” checkbox.

NOTE In most cases, CJAA values due to a failed lookup will cause a prob- lem sooner or later, typically resulting in failure of the entire transformation. For example, the adVYTY^bTadXVi^dcTVYYgZhh transformation shown in Figure 8-9 would fail because the row would be rejected by the Y^bTadXVi^dcT VYYgZhh table as the adXVi^dcTX^inT`Zn column is declared to be CDICJAA. Failure of the transformation may actually be a good thing: as we explained before, our strategy of level-wise top-down loading should have ensured that the Y^bTadXVi^dcTX^in table was already completely loaded before attempt- ing to load the Y^bTadXVi^dcTVYYgZhh table so a failure of the lookup prob- ably indicates some logic flaw in either the design of the application, or some data integrity issue in either the source system or the target dimension tables.

Without any explicit configuration of the “Database lookup” step, Kettle will simply pick the first of the available rows if the lookup key matches multiple rows. You can influence which one of the rows is picked by explicitly specifying the order in which the rows are returned. You can do this by filling in the “Order by” field. The value for this field should constitute a valid SQL DG9:G7N clause (but without the actual keywords DG9:G7N) for whatever database the connection definition points to. In many cases, the fact that a key lookup returns multiple records indicates a logical error. In this case, you should select the “Fail on multiple results” checkbox. This will cause the step, and thereby the entire transformation, to fail as soon as a key matches multiple rows.

Sample Job

After building the transformations, a job can be used to organize the transformations sequentially, ensuring the top-down loading order of dimension tables from higher to lower levels. An example is shown in Figure 8-11.

Figure 8-11: The load_dim_location.kjb job ensures top-down level-wise loading of the snowflaked location dimension tables

The job shown in Figure 8-11 is quite straightforward. It simply ensures that the transformations that load the location dimension tables are run sequentially by order of the levels of the dimension. 226 Part II N ETL

Star Schema Dimension Tables In star schemas, each dimension is typically implemented as a single dimension table. (For an example of a star schema, see the rental star schema shown in Figure 4-2.) The hierarchies in which the dimension is organized take the form of a collection of columns, where each column in such a collection represents a distinct level of the hierarchy.

Denormalization

Typically, star schema dimension tables are quite wide, having many columns that are typically, as Kimball puts it, “highly correlated.” This is just another way of saying that the values in these columns are functionally dependent on one another. Because the columns are dependent on one another (and thus, not only dependent on the key of the dimension table), these dimension tables are not in the third or even second normal form (3NF and 2NF, respectively). Star schema dimension tables may even have multi-valued columns, containing a list of values representing a multitude of entities in the source system, thus violating even the first normal form (1NF). Regardless of the exact classification, star schema tables are typically characterized as being “not normalized.” In many cases, the data for the dimension tables originates from OLTP source systems of which the underlying database is typically normalized up to the third or Boyce- Codd normal form (BCNF). So, in order to load the star schema dimension tables, a transformation for loading a dimension table has to extract data from multiple related source tables, combine the records, thereby denormalizing the data, and then deliver the data to the target dimension table.

Denormalizing to 1NF with the “Database lookup” Step

In the discussion on loading snowflaked dimensions earlier in this chapter, we described in detail how the “Database lookup” step can be used to look up the surrogate key of a dimension table, based on the business key of a table in the source system. But the “Database lookup” step can be used equally well to denormalize data by combining related rows from the source system, thus preparing the data for loading star schema dimension tables. The “Database lookup” step simply looks up field values based on a key, and does not care about whether you’re doing the lookup to fetch a single surrogate key value or an entire record. In Chapter 4, you witnessed several instances of how the “Database lookup” step type was used in this way to load star schema tables. It was first discussed in the example of the [ZiX]TVYYgZhh subtransformation (see Figure 4-14), which featured a lookup-cascade to retrieve address data from the VYYgZhh, X^in, and Xdjcign tables in the sakila database, providing a denormalized set of columns with address data for the Y^bTXjhidbZg and Y^bThidgZ dimension tables in the rental star schema. We won’t repeat the detailed discussion of the “Database lookup” step here, but for completeness and comparison, it is a good idea to take some time to examine how the “Database lookup” steps are used in various transformations discussed in Chapter 4 and compare them to the steps in the snowflaked example earlier in this chapter. Chapter 8 N Handling Dimension Tables 227

Change Data Capture

In the rental star schema discussed in Chapter 4 and shown in Figure 4-2, each dimen- sion table is based directly on a table in the sakila sample database. For example, the Y^bThidgZ dimension table is based on the hidgZ table, the Y^bTXjhidbZg dimension table is based on the XjhidbZg table, and so on. To load the dimension tables, change data was captured for these tables in the sakila schema and then denormalized using the “Database lookup” step until it could finally be loaded into the dimension table. If you consider the way data changes are captured in the transformations described in Chapter 4 and compare that with the process for the level-wise loading of the snow- flaked location dimension, an important difference emerges. The transformations in Chapter 4 only detect changes for those tables that fill the lowest level of each dimension, and look up related rows from there to fetch the data that makes up the higher levels. But what if any changes occur at the higher level? Those changes will not be picked up, resulting in an inconsistency between the source and target system. There are a number of solutions to ensure the changes are captured at all levels of the dimension. If the dimension is not too large (they often aren’t), it may be acceptable not to bother with any sophisticated method to capture the changes. Simply scanning all rows in the table from the source system that corresponds to the lowest level of the dimension and then doing a lookup cascade to obtain the denormalized resultset will automati- cally pick up any changes in the tables that correspond with the higher levels of the dimension. If it is a requirement to load only the changes from the source system, loading star schema dimensions can become tricky. If the requirement is in place to reduce the load on the source system, there is a workaround that still allows for a fairly simple load- ing process of the dimension table—by utilizing a staging area. In this scenario, you would extract only the changes from the source system, as per requirement. You can then store the changes in a normalized staging area that is part of the ETL system. The staging area still allows you to do the brute force approach because the staging area is not part of the source system, and can be burdened as much as necessary to load the dimension tables.

NOTE It is still possible to load only the changes while avoiding a staging area. However, the transformation to load the dimension tables will rapidly increase in complexity. Loading a denormalized dimension with these strin- gent requirements concerning change data capture is an advanced topic, and is beyond the scope of this chapter. However, we should point out that Kettle does in fact offer all the building blocks to do it, should it be necessary. For example, loading a denormalized location dimension containing data at the address, city, and country level presumes capturing changes at all those levels, too, just as we discussed when loading the snowflaked location dimen- sion. But for a denormalized dimension, you now have to work your way down from the higher levels to see which rows at the lower levels these changes correspond to. Once you have that set, you can use the previously described lookup cascade to prepare the denormalized result set. 228 Part II N ETL

For example, the processes of “looking down” from the captured changes at the higher levels to see which rows are affected at the lower levels can be implemented using the “Database join” step, which you first encountered in the adVYTY^bT[^ab transformation in Chapter 4 (see Figure 4-15 and the sections on creating the flags for film categories and filling the Y^bT [^abTVXidgTWg^Y\Z table). This may prove to be quite expensive, however, because it will result in many separate queries on the lower level as the rows at the higher level flow through the step. It may be more efficient to spend some time writing a more advanced SQL ?D>C query in the “Table input” step while capturing the data. You already encountered yet another approach to fight the loading com- plexity: By using a normalized or snowflaked dimension, the problem is virtually absent. Another approach to the modeling problem is discussed in Chapter 19, which explores the Data Vault architecture.

Slowly Changing Dimensions

In this section, we examine in detail how to implement various types of slowly changing dimensions (SCDs) with Kettle. For each type of slowly changing dimension, we briefly describe its characteristics and then examine which Kettle steps you could use to load it.

NOTE This book does not offer a shortcut to data warehousing theory. If you want to know the purpose of slowly changing dimensions, or if you’re not aware of the typical techniques that are used to implement them on the database level, you should read up on the subject in a specialized data ware- housing book such as The Data Warehouse Toolkit, Second Edition, by Ralph Kimball and Margy Ross. Chapter 7 of Pentaho Solutions by Roland Bouman and Jos van Dongen (Wiley 2009) also contains a basic explanation of the fea- tures and purpose of different types of slowly changing dimensions.

Types of Slowly Changing Dimensions Following Kimball, we distinguish three main types of slowly changing dimensions: type 1, type 2, and type 3:

N Type 1: Updates in the source system result in corresponding updates in the target dimension. N Type 2: Updates in the source system result in inserts in the target dimension by maintaining multiple timestamped versions of dimension rows. This allows you to find whichever version of the dimension row was applicable at any given point in time. N Type 3: Updates in the source system are stored in different columns in the same row. Chapter 8 N Handling Dimension Tables 229

We presume the reader is already aware of these distinctions, their purpose, and their applicability.

Type 1 Slowly Changing Dimensions In a type 1 slowly changing dimension, the dimension data is loaded to always reflect the current situation, overwriting the current record with the changed one. There are a couple of Kettle steps that can be used for type 1 dimensions. In the section “Type 2 Slowly Changing Dimensions,” which follows, we discuss the “Dimension lookup / update” step, which can be used to load a variety of slowly changing dimension types, including type 1. Later in this chapter, we discuss the “Combination lookup / update” step in the “Junk Dimensions” section. The remain- der of this section focuses on the Insert / Update step as an example of loading a type 1 slowly changing dimension.

The Insert / Update Step

The Insert / Update step resides beneath the Output category in the left pane tree view in Spoon. As its name implies, this step either inserts or updates a row. Sometimes, this is referred to as an upsert. You first encountered the Insert / Update step in the adVYTY^bTVXidg transformation in Chapter 4. Let’s take a closer look at the configura- tion of the adVYTY^bTVXidg transformation (see Figure 8-12).

Figure 8-12: The configuration of the Insert / Update step 230 Part II N ETL

If you compare Figure 8-12 with the configuration of the “Database lookup” step, shown in Figure 8-10, you will see a number of similarities. Just like the “Database lookup” step, Insert / Update matches fields from the incoming stream against a column or collection of columns of a database table. These columns act as the lookup key. If a row is found, another set of table columns is updated with the values of the fields from the incoming stream. If the row is not found, it is added to the table.

NOTE Don’t use the Insert / Update step if all you really want to do is insert new rows without being bothered with errors resulting from unique or primary key constraint violations. If that’s what you want to do, it is more explicit and often faster to simply use a “Table output” step. You can define error handling for the “Table output” step to catch any rows that are rejected by the table. This approach is faster than using the Insert / Update step because it doesn’t waste time attempting to look up the row first.

We will discuss the elements shown in the dialog in Figure 8-12 in the remainder of this section.

Connection, Target Schema and Table, and Commit Size

In the top section of the dialog shown in Figure 8-12, you see a number of fields to configure the database connection, table schema, and table name, as with the “Database lookup” step. The only difference is that this step refers to “Target schema” and “Target table,” whereas the “Database lookup” step refers to “Lookup schema” and “Lookup table.” The “Commit size” field is used to specify the size of the commit batch. The size is specified as the number of row operations (in this case, the number of UPDATE and INSERT statements) that are to be performed before sending a COMMIT command to the database. A higher value will result in fewer COMMIT commands (and larger batches of changes), which is usually good for performance. However, the value cannot be too large: a large pending uncommitted transaction may consume a lot of resources from the database, which may negatively impact database performance. It is impossible to provide generic advice on the ideal value for the commit size because it depends on many factors, such as the type of database, the amount of available memory and, of course, the overall activity of the database at dimension loading time. However, the default commit size of 100 is quite conservative, and will usually lead to unsatisfactory performance. In many cases, it shouldn’t be a problem to use a commit size of 1,000 or even 10,000. More information about tuning configuration options such as the “Commit size” can be found in Chapter 15. An additional checkbox labeled “Don’t perform any updates” is available in the top section of the dialog. By default, it is not checked, allowing the step to do both updates and inserts. If it is checked, no updates will be performed, only inserts.

Target Key

Just like the “Database lookup” step, the Insert / Update step needs a key—that is, a way to match the field values from the incoming stream to the columns of the target table. In this case, if the match is made, the key is used again in a WHERE clause to perform an UPDATE.

In both the “Database lookup” step and the Insert / Update step, the key is defined in a grid labeled “The key(s) to look up the value(s).” For the Insert / Update step, the grid to specify the key is almost identical to the one that appeared in the “Database lookup” step. The only difference is in the columns used to specify the fields from the stream: instead of Field1 and Field2, the analogous columns are called Stream Field and Stream Field2 in the Insert / Update step. You can use the Get Fields button at the right of the grid to quickly populate it. As you saw in the “Database lookup” example shown in Figure 8-10, this step uses the business key to match the rows from the source system against the rows already present in the dimension table. So in this case, the actor_id column of the dim_actor dimension table is matched against the actor_id field that originates from the source system.
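Conceptually, the step performs something close to the following SQL for each incoming row (a minimal sketch only; the non-key dim_actor columns are illustrative assumptions, and the statements Kettle actually generates may differ):

-- Look up the row by its business key
SELECT actor_key FROM dim_actor WHERE actor_id = ?;

-- If a matching row exists, update the non-key columns ...
UPDATE dim_actor
SET actor_first_name = ?, actor_last_name = ?
WHERE actor_id = ?;

-- ... otherwise, insert a new row
INSERT INTO dim_actor (actor_id, actor_first_name, actor_last_name)
VALUES (?, ?, ?);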

Update Fields

The lower section of the configuration dialog shown in Figure 8-12 has a grid labeled “Update fields.” This is where you specify which fields from the stream should be used to perform the insert or update. At the right side of the grid, there are two buttons. The “Get update fields” button can be used to quickly populate the grid with fields from the incoming stream. The “Edit mapping” button opens a simple wizard that can simplify the task of assigning the stream fields to the table columns. The mapping dialog is shown in Figure 8-13.

Figure 8-13: The mapping dialog offers a simple wizard to specify from which fields values should be assigned to the database columns.

The dialog has three lists: on the left, the “Source fields” list of field names from the incoming stream; in the middle, the “Target fields” list of column names from the target table; on the right, the Mappings list, which contains the field/column mapping, indicating from which field the value will be assigned to the table column. Selecting the “Auto target selection” checkbox causes automatic selection of a matching column when selecting a field. Similarly, selecting a column name will automatically also select a matching field if the “Auto source selection” checkbox is selected. Once you select a field name and a column name, you can move that combination to the Mappings list by pressing the Add button. You can quickly populate the Mappings list by pressing the Guess button.

If the “Hide assigned source fields” and “Hide assigned target fields” checkboxes are checked, the field or column is removed from its respective list after a combination that contains it is added to the Mappings list. Pressing the Delete button removes the selected mappings, and the field and column become available again in their respective lists.

Type 2 Slowly Changing Dimensions

The characteristic of a type 2 slowly changing dimension is that it tracks changes in a dimension over time. Whereas a type 1 slowly changing dimension destructively overwrites the old data when a change occurs, the type 2 slowly changing dimension preserves its history and adds a new row to the dimension table to reflect the current situation. In this manner, a type 2 slowly changing dimension maintains a collection of different versions of the dimension row, which are all tied together by the same business key. In Chapter 4, you witnessed several instances of type 2 slowly changing dimensions: dim_customer, dim_staff, and dim_store. In several of the transformations discussed in Chapter 4, we used a Kettle step that is especially designed to work with this type of dimension: the “Dimension lookup / update” step. We encountered this step in the transformations load_dim_customer (see Figure 4-11), load_dim_staff (see Figure 4-8), and load_dim_store (see Figure 4-13), where it was used to load the dimension tables. We also used it in the load_fact_rental transformation (see Figure 4-18) to look up the keys of these dimension tables. In this section, we describe in detail how to use this step to load dimensions. In Chapter 9, we describe how to use this step to look up dimension keys when loading fact tables.

The “Dimension lookup / update” Step

In Spoon, the “Dimension lookup / update” step type resides beneath the “Data warehouse” category in the left pane tree view. The configuration dialog for this step type is shown in Figure 8-14. In this case, we show the configuration used in the load_dim_customer transformation (see Figure 4-11). The “Dimension lookup / update” step can operate in two distinct modes:

- It can be used to add and/or update data in a dimension table. This functionality may be used to maintain type 1 and type 2 slowly changing dimensions. This mode is referred to as the update mode.
- It can be used as a lookup step to retrieve the surrogate key of a type 2 slowly changing dimension. This functionality is especially useful when loading fact tables, and is referred to as the lookup mode.

Because the “Dimension lookup / update” step unites these different functionalities, it superficially resembles a hybrid of the Insert / Update and “Database lookup” steps, each of which implements one of these functionalities on its own. However, it is more complicated than that. Whereas the Insert / Update and “Database lookup” steps perform only and exactly the task implied by their name, the “Dimension lookup / update” step is aware of the history-preserving characteristics of type 2 slowly changing dimensions. In the update mode, this allows steps of this type to decide whether it is appropriate to add to the history when loading the dimension table, and to automatically provide appropriate values for those columns that do the history bookkeeping in the dimension table. In the lookup mode, the step is capable of automatically selecting the correct version of the dimension row based on date or datetime fields in the incoming stream, without having to explicitly implement the details of the logic to do this. It is possible to build a transformation that provides the same functionality as the “Dimension lookup / update” step based on other Kettle steps. However, this would be quite an elaborate transformation, which would quickly become impractical because it would have to be rebuilt for each type 2 slowly changing dimension table.

Figure 8-14: The configuration of the “Load dim_customer SCD” step of the load_dim_customer transformation

Specifying the Mode of Operation

To specify the mode of operation, use the “Update the dimension” checkbox. When this is checked, the step can be used to load a dimension table. Unchecking this option enables the lookup functionality. In the remainder of this section, we primarily describe the dimension maintenance features. Chapter 9 explores how to use the lookup functionality when loading fact tables.

General Configuration

The “Dimension lookup / update” step has a number of configuration properties similar to the ones already described in our discussion of the “Database lookup” step (see Figure 8-10) and the Insert / Update step (see Figure 8-12):

- The Connection, “Target schema,” “Target table,” and “Commit size” properties have similar meanings to those discussed for the Insert / Update step. Note that the “Commit size” is only applicable in the update mode.
- The “Enable the cache,” “Pre-load the cache,” and “Cache size in rows” properties have the same meaning as the “Enable cache,” “Load all data from table,” and “Cache size in rows” properties of the “Database lookup” step, respectively. The “Pre-load the cache” option is only available in the lookup mode.

Keys Tab Page

The “Key fields (to lookup row in dimension)” grid on the Keys tab page is used to map the business key of the dimension table to the stream. This resembles the “The key(s) to lookup the value(s)” grids in the “Database lookup” and Insert / Update steps. The “Dimension field” column is used to specify which columns constitute the business key in the dimension table. The “Stream field” column is used to specify the fields from the incoming stream to which these columns are to be matched. The fields are always compared based on equality. For this reason, the grid does not supply a way to specify which comparison operator should be used. There is an important difference with the “Database lookup” and Insert / Update steps in regard to the way the fields and columns are actually matched. The grid only defines how to match the business key. But in a type 2 slowly changing dimension, the business key does not identify a single row in the dimension table! Because type 2 slowly changing dimension tables keep track of history, there may be multiple rows for one particular business key, each representing a version that was valid at some point in time. There has to be some policy that determines how to pick just one row out of the collection of rows for the same business key. We explain how this works in the subsection “History Maintenance.”

Surrogate Key

The “Dimension lookup / update” step offers a number of properties pertaining to the surrogate key of the dimension table. In the dialog shown in Figure 8-14, these can all be found beneath the tabbed pages. The “Technical key field” should be used to specify the name of the primary key column of the dimension table. For example, in Figure 8-14, this is set to customer_key. The “Creation of technical key” fieldset offers options to control the automatic generation of surrogate key values. These options apply only to the update mode because they are used only when adding new dimension rows. You can choose one of the following methods to generate surrogate key values:

- Use table maximum + 1: As the name implies, this option will automatically query the dimension table for the maximum value of the column specified in the “Technical key field” property, add 1 to the value, and use the result as the initial value of the surrogate key for newly added rows.

- Use sequence: Here you can specify the name of a database sequence that should be used to draw the values from. You can use this option if your database supports sequences. (See also the section “Surrogate Keys Based on a Database Sequence” earlier in this chapter.) The name entered here should be an existing database sequence that resides in the default schema of the specified connection, or otherwise, the schema specified in the “Target schema” property.
- Use auto increment field: You can choose this option if your database supports auto_increment or IDENTITY columns and the column specified by the “Technical key field” property is defined using that feature. (See also the section “Surrogate Keys Based on a Database Sequence” earlier in this chapter.)
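In SQL terms, these three strategies correspond roughly to the following statements (a sketch only; the dim_customer table and the seq_dim_customer sequence name are illustrative assumptions, and the exact syntax differs per database):

-- Use table maximum + 1
SELECT COALESCE(MAX(customer_key), 0) + 1 FROM dim_customer;

-- Use sequence (PostgreSQL-style syntax; Oracle would use seq_dim_customer.NEXTVAL)
SELECT nextval('seq_dim_customer');

-- Use auto increment field: the database fills in customer_key itself,
-- so the INSERT simply omits that column
INSERT INTO dim_customer (customer_id) VALUES (42);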

History Maintenance

The “Dimension lookup / update” step has a number of properties to configure how to deal with the history maintained by type 2 slowly changing dimensions. These can all be found in the bottom section of the dialog shown in Figure 8-14, below the properties to define the surrogate key. The “Version field” property can be used to specify the name of the column in the dimension table that stores the version number of the row. The combination of the business key and this version number can be used to uniquely identify a row in the dimension table. In the update mode, the “Dimension lookup / update” step will automatically store the appropriate version number whenever the step adds a new row to the dimension table. The Stream Datefield property can be used to specify the Date field from the incoming stream that provides the chronological context for the change in dimension data. For example, in Figure 8-14, this property is set to last_update. You might recall that in the load_dim_customer transformation, this field originates from the last_update column of the customer table. This column contains the timestamp value indicating when the row was last changed in the source system.

NOTE Although the last_update column of the customer table seems like a reasonable choice for the Stream Datefield, it is actually not entirely correct. The reason is that the customer dimension is denormalized and thus contains data from multiple tables. This was mentioned earlier in our discussion on change data capture for star schema dimensions. Although the customer table in the sakila database represents the lowest level of the customer dimension, it is not the only table that contributes to the denormalized row in the dim_customer dimension table. So a change at a higher level, say the customer’s address, will in fact count as a new version of the customer dimension record. The chronological context for a change in the address table should be taken from the last_update column of the address table. So in order to obtain the correct chronological context for the entire denormalized row, you would need to take the values of all last_update columns of all tables denormalized into the dim_customer dimension table, and then select one of them. For example, it seems reasonable to pick the most recent of these last_update values, because you can be sure that the resulting denormalized dimension row existed by that time.

In many real-world scenarios, source systems do not record when a record was last changed. In these cases, the chronology must be stipulated. In this case, it makes most sense to use the current timestamp. The rationale behind this is that as far as the data warehouse is concerned, the actual change may have happened at any moment in the past. The only thing you can be sure of is that, at the latest, the change took place before it was detected and loaded into the dimension. For these cases, you do not need to specify a stream field: not specifying a value for this property automatically causes the system date to be used as the chronological context for the dimension change. Type 2 slowly changing dimensions should have a pair of columns that specify the period in time to which the dimension record applies. The columns that store the start and end of this period should be specified using the “Date range start field” and “Table daterange end” properties. In Figure 8-14, these properties are set to customer_valid_from and customer_valid_through, respectively. The date range is required to pick the correct version of the dimension row for a particular business key. This is how the matching process of the “Dimension lookup / update” step differs from that implemented by the “Database lookup” and Insert / Update steps: the “Dimension lookup / update” step compares against the business key, but also requires that the value that is used as chronological context lies within the range specified by the “Date range start field” and “Table daterange end” properties. (As we just mentioned, the chronology can be specified explicitly by configuring the Stream Datefield property; otherwise, the system date will be used.) When a dimension row is first inserted for a particular business key, the range is initialized using a generated minimum and maximum date based on the values in the “Min. year” and “Max. year” fields. By default, the values in these fields are 1900 and 2199, so the first dimension row for any new business key will get 1900-01-01 and 2199-12-31 for the columns specified by the “Date range start field” and “Table daterange end” properties, respectively. The assumption is that this date range is wide enough to fit all history of the dimension table. If the dimension table already contains an existing dimension row for the business key, its column values are compared with the fields from the incoming stream. If a change is detected, the column specified by the “Table daterange end” property is updated and set to the value of the field in the incoming stream configured in the “Stream Datefield” property. In addition, a new row is inserted into the dimension table to store the change. For the new row, the date range starts with the value of the “Stream Datefield” field, and ends with the end of the range of the existing row. The final result is that the existing row and the new row have consecutive date ranges, making it easy to track the chronology of the changes for each distinct business key. It may be desirable to use another, more realistic value for the start of the date range used for the first dimension record added for a particular business key. To control the way the start of the range is recorded, check the “Use an alternative start date?” checkbox, and use the list box to the right to pick another sort of date. The list offers a number of predefined date values, such as the start of the transformation, the system date, or the NULL value.
It seems reasonable to start the date range based on whatever was configured as “Stream Datefield,” but unfortunately this is not a supported option. You can, however, set the type to “A Column value” and specify a column of the dimension table instead.
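To make the bookkeeping concrete, a type 2 customer dimension managed by this step could be defined along the following lines (a minimal DDL sketch; the column names and data types are assumptions that merely mirror the properties discussed above):

CREATE TABLE dim_customer (
  customer_key           INTEGER  NOT NULL,  -- surrogate key ("Technical key field")
  customer_id            INTEGER  NOT NULL,  -- business key from the source system
  customer_version       INTEGER  NOT NULL,  -- "Version field"
  customer_valid_from    DATETIME NOT NULL,  -- "Date range start field"
  customer_valid_through DATETIME NOT NULL,  -- "Table daterange end"
  customer_first_name    VARCHAR(45),        -- tracked attributes
  customer_last_name     VARCHAR(45),
  PRIMARY KEY (customer_key)
);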

Lookup / Update Fields

In the Fields tab page, you can specify the mapping between the columns of the dimension table and the fields from the incoming stream. This is somewhat like the “Values to return from the lookup table” and “Update fields” grids of the “Database lookup” and Insert / Update steps, respectively. The Fields tab page of the “Load dim_customer SCD” step shown in Figure 8-14 is shown in Figure 8-15.

Figure 8-15: The configuration of the Fields tab page of the “Load dim_customer SCD” step

In both the lookup and the update modes, the values of the columns specified in the “Dimension field” column will be retrieved from the dimension table. The “Stream field to compare with” column is used to map the dimension table columns to the fields from the incoming stream. The mapping is relevant in the update mode because the values of these fields are then compared to the values retrieved from the corresponding database columns. If the values in the dimension table column and the stream field are not identical, the option specified in the “Type of dimension update” column determines the behavior. The nice thing about this is that it enables you to choose a change history policy on a per-column basis, allowing you to manage hybrid slowly changing dimensions, having some columns that conform to a type 1 slowly changing dimension while other columns use a typical type 2 policy or yet another form of history management. For a typical type 2 slowly changing dimension, you should use the Insert option, which inserts a new dimension row for each change.

Other Types of Slowly Changing Dimensions

The type 1 and type 2 slowly changing dimensions discussed in the previous sections are by far the most useful and abundant. But other slowly changing dimension types do exist. In fact, in many cases a dimension is not purely type 1 or type 2. Instead, it may be desirable to choose a particular policy depending on the attribute.

Type 3 Slowly Changing Dimensions

Kettle does not provide a specialized step to maintain type 3 slowly changing dimensions. However, you can implement them yourself without too much trouble.

For example, if you’re only interested in the current and the immediately previous situation, you can use a “Database lookup” step to look up what is currently stored as the current value, and then store that as the previous value along with the new current value using an Update step. If you want to add columns dynamically to maintain more than one version of the past, you can write a job and use the “Columns exist in a table” step to check whether you need to modify the table design, and a SQL scripting step to execute the required DDL to add new columns.
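For the simple current/previous variant, the SQL issued against the dimension table boils down to something like this (a sketch only; the dim_customer table and its current/previous columns are illustrative assumptions):

-- Type 3: keep the previous value in a separate column of the same row.
-- The old current value is copied into the "previous" column before the
-- new value overwrites it.
UPDATE dim_customer
SET previous_last_name = current_last_name,
    current_last_name  = 'Smith-Jones'
WHERE customer_id = 42;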

Hybrid Slowly Changing Dimensions

It should be stressed that the policies for managing change do not have to act on a dimension table as a whole. Because changes occur at the column level, you could apply a different policy for any group of columns. Dimensions that employ multiple policies for tracking changes are referred to as “hybrid” slowly changing dimensions. When thinking about which policy to choose for a particular column, you may discover that it is not so much the column itself that determines which policy is appropriate; rather, the cause of the change is what’s really interesting, and this is what should be considered when choosing when to apply which policy. For example, consider a customer’s birth date. There can be little argument as to what it means when we detect a change in the customer’s birth date in the source system: it can be interpreted only as a correction of a previously inaccurate value. But what about a customer’s last name? A change in last name can occur to correct a spelling error, but could also indicate that the customer got married and adopted the name of the spouse. The “Dimension lookup / update” step, which was described in detail in the “Type 2 Slowly Changing Dimensions” section, supports a per-column policy for handling change. You can configure the policy by choosing one of the predefined policies per dimension column / stream field mapping in the Fields tab page of the “Dimension lookup / update” step. This is shown in Figure 8-15. The available options are:

- Insert: This option implements a type 2 slowly changing dimension policy. If a difference is detected for one or more mappings that have the Insert option, a new row is added to the dimension table.
- Update: This option simply updates the matched row. It can be used to implement a type 1 slowly changing dimension policy.
- Punch through: The punch through option also performs an update, but instead of updating only the matched dimension row, it updates all versions of the row in a type 2 slowly changing dimension.
- Date of last insert or update (without stream field as source): Use this option to let the step automatically maintain a date field that records the date of the insert or update, using the system date as source.
- Date of last insert (without stream field as source): Use this option to let the step automatically maintain a date field that records the date of the last insert, using the system date as source.

- Date of last update (without stream field as source): Use this option to let the step automatically maintain a date field that records the date of the last update, using the system date as source.
- Last version (without stream field as source): Use this option to let the step automatically maintain a flag that indicates if the row is the last version.

More Dimensions

There are still more types of dimensions to consider. Some examples are discussed in the subsections that follow.

Generated Dimensions

For certain types of dimensions, such as date and time dimensions, the data can be generated in advance. Other examples of generated dimensions include typical mini-dimensions such as a demography dimension.

Date and Time Dimensions

In Chapter 4, we described the load_dim_date.ktr (see Figure 4-4) and load_dim_time.ktr (see Figure 4-5) transformations, which load the dim_date and dim_time dimension tables in the sakila rental star schema. The construction of these dimensions was already discussed in some detail there, and will not be repeated here. You can find more details about generating the data for the localized date dimension at http://rpbouman.blogspot.com/2007/04/kettle-tip-using-java-locales-for-date.html. Kettle also ships with a few example transformations for constructing a date dimension. To review these examples, look in the samples/transformations directory right beneath the Kettle home directory for the following transformations:

-

Generated Mini-Dimensions

In many cases, data for mini-dimensions such as a demography dimension can be generated in advance. The typical pattern for a transformation like this resembles the one used in the load_dim_time transformation shown in Figure 4-5 in Chapter 4. The pattern is to set up multiple Generate Rows steps, one for each independent type of data stored in the dimension table. By adding an “Add sequence” step, you can then generate a number to uniquely identify the individual rows in a stream. For example, see the Generate Hours and “Hours sequence” steps in the load_dim_time transformation.

Often, the next step is to use the sequence value as input for some function that generates descriptive labels based on the sequence value input. By function we mean some step that generates meaningful output derived from the sequence number input. Many steps can be used for this purpose, depending on what output data you need, such as the Calculator, Number Range, and “Value mapper” steps, and of course any flavor of the scripting steps such as Formula, Modified Javascript Value, or User Defined Java Expression. An example of this is the “Calculate hours12” step in the load_dim_time transformation. The next step is to make combinations of all the streams using a “Join rows (cartesian product)” step. The result is one stream that contains all combinations of the input data. The only thing required now is the generation of a key. Depending on the nature of the dimension table, you might feel that it is appropriate to generate a smart key. This is what is done in the Calculate Time step of the load_dim_time transformation. Alternatively, you can use an “Add sequence” step to supply a surrogate key, or leave the generation of the key to an auto_increment or IDENTITY column in the database. The final step is to store the generated data in the dimension table using a simple “Table output” step.

NOTE For an example of generating a mini-dimension, take a look at the dim_demography.ktr transformation described in Chapter 10 of Pentaho Solutions. The transformation itself can be freely obtained from the download area at that book’s website at http://www.wiley.com/WileyCDA/WileyTitle/productCd-0470484322.html. Use the tabs to navigate to the download area, and download the archive containing all materials for Chapter 10. For your convenience, the transformation is shown in Figure 8-16.

Figure 8-16: Generating a demography mini-dimension

Junk Dimensions

Mini-dimensions are typically junk dimensions: they consist of many different and unrelated attributes that can still be useful for analysis, but cannot (yet) be classified with another dimension. Often, junk dimensions are pulled out of a monster dimension. For example, a customer profile dimension may contain several demographic attributes along with some attributes that describe payment or ordering behavior. Whereas a mini-dimension like demography can often be generated, this may not be possible or convenient in all cases. Either the total number of possible combinations of the junk attributes may be inconveniently large, or the value space for one of the attributes may not be known beforehand. The “Combination lookup/update” step is an excellent choice to implement these types of junk dimensions. In Spoon, the “Combination lookup/update” step resides beneath the Data Warehouse category in the left pane of the tree view.

NOTE You encountered the “Combination lookup/update” step type in Chapter 4, where it was used in the load_dim_film transformation (see Figure 4-16) to load the dim_film dimension table. However, the dim_film dimension table is not a junk dimension, which is why we don’t discuss that particular example here.

When you open the configuration dialog of the “Combination lookup/update” step, you will recognize many options that are also present in the “Database lookup,” Insert / Update, and “Dimension lookup / update” steps:

- Connection, table, and schema properties
- A grid for mapping the dimension table columns to fields from the incoming stream
- Options to specify the dimension surrogate key and control the behavior for generating surrogate key values

The “Combination lookup/update” step can best be compared to the Insert / Update step: it looks up rows according to the mapping of dimension table columns to stream fields. If a row is found, it is updated; if no row is found, it is added. The surrogate key column is added to the outgoing stream, so any surrogate key value that is generated in the process can be used in the transformation downstream of the step. The main difference between the “Combination lookup/update” and Insert / Update steps is that the former does not distinguish between the key fields and the lookup fields: junk dimensions are characterized by having all kinds of combinations of unrelated attributes, and almost by definition, there is no real natural key for junk dimension tables. Rather, the natural key consists of the combination of all attributes. For this reason, the lookup and update fields are merged into one: the “Combination lookup/update” step is typically used to match all dimension columns (except its surrogate key) to the fields in the stream.
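In other words, the lookup issued by the step matches on the full combination of attributes, roughly like this (a sketch; the dim_demography table and its columns are illustrative assumptions):

-- Look up the surrogate key for an exact combination of junk attributes;
-- if no row is found, the combination is inserted and a new key is generated.
SELECT demography_key
FROM dim_demography
WHERE age_group = '25-34'
  AND income_group = 'medium'
  AND marital_status = 'married';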

Recursive Hierarchies

Some hierarchies are recursive by nature. For example, the relationship between an employee and his boss, or the departmental organization of a company, is recursive: the boss of an employee is also an employee, and a department is itself a component of an organizational unit at a higher level. Typically, relationships like this are implemented by adding a foreign key to the dimension table that refers to the primary key of the dimension table itself (a so-called adjacency list). The foreign key column can be referred to as the parent key because it points to the row that is directly hierarchically related. Similarly, we can refer to the original key of the table as the child key. ROLAP servers such as Mondrian can use a dimension design like this to drill down from higher levels to lower levels. However, they can do so only one level at a time: after discovering the children of a particular level, each of these children may itself have children, and another query is required to retrieve those. The problem with that is that there is no convenient way to roll up a higher level and calculate an aggregate of all descendant levels, because all levels would have to be queried one by one until no more children are found. There is a solution to this problem: you can define a so-called transitive closure table, or simply closure table. Figure 8-17 shows a simplified entity-relationship diagram with an employee table with a recursive relationship along with its closure table.

dim_employee: employee_key INTEGER(10) [PK], boss_employee_key INTEGER(10) [FK]

dim_employee_closure: boss_employee_key INTEGER(10) [PFK], employee_key INTEGER(10) [PFK], distance INTEGER(10)

Figure 8-17: A table with a recursive relationship and its closure table

The closure table has at least two columns that derive from the table with the recursive relationship: one for the original key, and one for the self-referencing foreign key. The closure table is filled by querying all descendants for any given row from the original table, and storing the descendant keys along with the key of the original ancestor. This allows the entire set of descendants to be retrieved based on a single ancestor key, which in turn allows SQL queries to aggregate over them. Kettle provides a Closure Generator step to populate these transitive closure tables. This step type resides in the Transformation category in the left pane of the tree view. The configuration of this step is shown in Figure 8-18.

Figure 8-18: The configuration of the Closure Generator step

The Closure Generator step operates only on the incoming stream and generates the closure records to the outgoing stream. Configuring it requires only that you specify the field that acts as the parent key and the field that acts as the child key. The step will also calculate the “distance” between the related rows. The distance is simply the number of levels between the ancestor and descendant pair that is generated by the step. You can specify the name of the field that will be used to store the distance in the “Distance field name” property.
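Once the closure table is populated, rolling up an entire subtree no longer requires recursive querying. A query along the following lines (a sketch; the fct_sales table and its measure column are illustrative assumptions) aggregates over an employee and all descendants in a single join:

-- Total sales for the subtree under employee 7
SELECT SUM(f.sales_amount) AS total_sales
FROM dim_employee_closure c
JOIN fct_sales f ON f.employee_key = c.employee_key
WHERE c.boss_employee_key = 7;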

Summary

In this chapter, we discussed in detail how to use Kettle to manage data warehouse dimension tables. The main points covered include:

- How Kimball’s theoretical framework of ETL subsystems ties into the act of maintaining and loading dimension tables in practice
- Business keys and surrogate keys, and what role they play in the data warehouse
- Generating values for surrogate keys with Kettle’s “Add sequence” step, and how this relates to using database features such as sequence objects, and auto_increment or IDENTITY columns
- Using Kettle’s “Database lookup” step for looking up keys and for loading denormalized dimension tables
- Using Kettle for level-wise top-down loading of snowflaked dimensions
- Loading denormalized dimension tables for star schemas, and the sometimes complicated implications for proper change data capture
- Different types of slowly changing dimensions, most notably type 1 and type 2
- Using Kettle’s Insert/Update step to load type 1 slowly changing dimensions

- Using Kettle’s “Dimension lookup/update” step, both for loading type 2 slowly changing dimension tables and for looking up the key in a type 2 slowly changing dimension
- Techniques for generating data for special dimensions like the date and time dimensions, and mini-dimensions
- Using the “Combination lookup/update” step to load junk dimension tables
- Using the Closure Generator to generate a transitive closure table to flatten recursive hierarchies

Chapter 9

Loading Fact Tables

Fact tables are the storage-hungry tables where the detailed data for analysis is stored. In a typical data warehouse environment, it’s the fact tables that take up the most space, measured in gigabytes, terabytes, and sometimes even petabytes. In order to get this data from a source system into the data warehouse, you need a fast loading mechanism consisting of several components, the basics of which are covered in this chapter. First we discuss bulk loaders, special Kettle plugins that allow you to take advantage of the bulk-loading capabilities of the various types of databases.

NOTE Although the subject of this chapter is loading fact tables, the Kettle bulk loading features can be used for other bulk load tasks as well, for instance to load flat files into staging tables.

Before you can load the data into its final destination, it needs to undergo other operations as well, such as looking up the correct dimension surrogate keys. We cover the techniques available in Kettle for doing this in this chapter; in Chapter 19 we’ll show another technique that’s purely based on using SQL. In the final section of this chapter, we cover three different types of fact tables, as identified by Ralph Kimball. We also introduce a fourth type of fact table, the state-oriented fact table, as described by Dutch mathematician and data modeling guru Dr. Harm van der Lek. The chapter ends with a short introduction to the use of aggregate tables and how to load them using Kettle. Before we start, let’s take a step back and look at where the topics in this chapter fit in with respect to the 34 subsystems. This chapter actually completes the previous one in covering the Data Delivery section of the subsystems. More specifically, the following subsystems are covered here:

- Fact table loader (subsystem 13) is used to generate the load for the various kinds of fact tables.
- The surrogate key pipeline (subsystem 14) is one of the key concepts in a dimensional data warehouse and will be covered in depth, including the various mechanisms Kettle can apply for handling dimension key lookups.
- Multi-valued dimension bridge table builder (subsystem 15) is used for building so-called bridge tables. Chapter 4 contains an example of this type of table for bridging films to actors.
- Late arriving data handler (subsystem 16) covers loading late arriving dimension and fact data. The latter needs some special consideration but doesn’t pose a big problem, because looking up the dimension key that was valid at the transaction time of the fact is pretty straightforward. It’s the early-arriving facts (or late arriving dimension data) that cause the headaches. We’ll describe possible solutions for these issues.
- The aggregate builder (subsystem 19) is needed to precalculate summarized data to speed up analysis performance. To date, none of the open source databases offer an automated aggregate navigation mechanism as Oracle and other commercial databases do with materialized views. Still, you can benefit from aggregate tables because they can be used within a Mondrian schema or when you need to develop standard reports that require high-level overviews and speedy performance. The trick with aggregate tables is to have them automatically refreshed when new facts are added. With Kettle this is actually pretty straightforward, as we demonstrate at the end of this chapter.

Remember that loading facts is usually the final stage in an ETL process. First, data is extracted and written to a staging file or table. Then all dimension tables need to be updated, and finally the updated dimension information can be used to update the fact tables. Although this is the preferred way of processing data, it’s not always possible to do so. This could result in either dimension or fact data arriving “late,” and the late arriving data handler subsystem should take care of that.

Loading in Bulk

Loading data into a database table using standard Data Manipulation Language (DML) operations works fine if your data size is limited to a few thousand or even a few hundred thousand rows. When loading millions or even billions of rows, however, insert statements just don’t cut it anymore. Why is that? Simple: all DML operations such as insert, update, and delete are logged by the database management system. This means that in addition to the operation to insert a row into a table, there’s a second operation that writes the fact that the record is inserted into a database transaction log. This is not the only thing that slows down standard insert statements: all constraints on the table columns are also checked as rows are inserted into the table. Both logging and constraint checking can cause considerable delays when large datasets need to be loaded, so for optimal performance the process of bulk loading was invented. Basically, there are two ways to load data in bulk:

- File-based: Data is available in a flat file and must be read from disk. Usually the database requires a bulk loader definition (the control file) telling what fields and delimiters are being used. There are two possible scenarios in which flat files play a role. The first one is when the file is already available and can directly be loaded into the database. The second scenario is that you want to bulk load data that results from a transformation. In the latter case, there are both advantages and disadvantages to using a file-based bulk loader. The major disadvantage, of course, is that an intermediate file needs to be created, which is then read by the bulk-loading process. This adds two extra I/O streams to the data processing workflow, which increases the data volume threshold for choosing between regular inserts and bulk loading. There are advantages as well: an intermediate file adds an extra layer of reliability to a solution. If the bulk load fails, it’s fairly easy to resolve possible issues and restart the load. And sometimes a ready-to-load file is delivered by another system that cannot be accessed directly.
- API-based: There is no intermediate file storage. The data is loaded directly into the database from the ETL tool by using a bulk loader API or STDIN as the data source. In theory, this is the easier and faster way of bulk loading because there’s less disk I/O and no need for specifying and creating control files. Unfortunately, Kettle currently doesn’t contain bulk loaders that can call, for example, the Oracle or Microsoft SQL Server bulk load APIs.

Each database has its own way of handling bulk loading, and the methods and requirements vary wildly, from the simple LOAD DATA INFILE command used in MySQL to the elaborate SQL*Loader specification of the Oracle database. Not every database, however, has a bulk loader that can read directly from the input of another program. To benefit from the direct path loading capabilities of Oracle’s SQL*Loader, calls to the Oracle Call Interface (OCI) must be sent from the external program issuing the bulk load.

STDIN and FIFO

Some databases, most notably PostgreSQL and MonetDB, allow for STDIN to be used as the source for the bulk load. STDIN is shorthand for standard input and simply refers to the data stream going into a program. When you type text using a keyboard, the keystrokes are the STDIN for the word processor that receives this input. In a similar way, the data output of a Kettle step can be used as the STDIN for the receiving bulk loader process. For the transfer of the data, Kettle can use a FIFO. FIFO is shorthand for first in, first out, to indicate that the order of the data going in is the same as the one coming out. An alternative name for a FIFO is a named pipe. To most people in the Linux and UNIX world, a pipe is a very familiar concept.

Take for instance the command ls -l | grep 06. This will direct the output of the ls -l command (a detailed listing of all files) to the grep 06 command (showing only the lines that contain the character combination 06) using a pipe ( | ). The data that’s exchanged between the two commands only exists in memory for the duration of the chained command and cannot be accessed by any other means. This changes when you use a named pipe to exchange this data. A named pipe is a special type of file that is also displayed by the ls command but actually exists only in memory. To create a named pipe, use the mkfifo command, after which the FIFO can be used to direct output to and read data from. As a simple example, open two terminal screens and type the following commands in one terminal:

mkfifo mypipe
ls -l > mypipe

Then go to the second terminal and type

cat < mypipe

As you will discover, the command in the first terminal appears to do nothing until the command in the second terminal is executed. This is the expected behavior because both sides of the pipe need to be connected before the flow of data can start. If you run the ls and cat commands again but in reverse order (first cat < mypipe on terminal 2 and then ls -l > mypipe on terminal 1), you’ll notice that it doesn’t matter which end of the pipe is opened first. The reason we explain this in such detail is the fact that named pipes are the default mechanism for the MySQL Bulk Loader transformation step, and for the majority of people starting with an open source database, MySQL is the default choice.

Kettle Bulk Loaders

As mentioned before, every database has its own way of offering bulk loading capabilities. This means that for every database a different bulk loader component needs to be created. Kettle offers bulk loader steps for the most common databases such as Oracle, SQL Server, MySQL, and PostgreSQL, but also for more exotic products such as Greenplum (experimental), MonetDB, and LucidDB. While we can’t cover each one in detail in this chapter, we offer a quick summary of the main considerations to keep in mind for each database and point you to other references to find more in-depth coverage of the specific database loader functionality. You may notice that DB2 is missing, so if you’re a DB2 shop, you need to either create a CSV file first and load it using the DB2 load utility, or write your own plugin for it. You might already have noticed that both the Job and the Transformation step collections contain bulk loaders. There’s a very good reason for this: all the file management operations belong in a job, as these are distinct operations executed sequentially. The bulk loaders in the “Bulk loading” section of the Job design tab all require a physical file to be present for the bulk load, whereas the bulk loader steps that can be used in a transformation are capable of reading data directly from the stream. The next subsections highlight some of these transformation steps.

MySQL Bulk Loading

MySQL might be the most widely used database around, but it’s probably not the best solution for large data warehouses. Nevertheless, the bulk loader support within Kettle is extensive. MySQL is also the only database for which bulk loading from the database to a file is available, which is actually a bulk extractor instead of a loader. And there are two options for bulk loading into MySQL: one as part of the job steps, which uses files as input, and one as a transformation step. We’ve already explained the difference. The job version expects a file to process; the transformation version can handle data directly from a stream and load it into a MySQL table using a FIFO file. You can find more information on the load facilities of MySQL at http://dev.mysql.com/doc/refman/5.1/en/load-data.html.
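The statement that this loading ultimately relies on is MySQL’s LOAD DATA INFILE. A minimal sketch, assuming an illustrative stg_rental staging table and a semicolon-delimited file (or FIFO), looks like this:

-- Bulk load a delimited file (or a FIFO) into a MySQL table
LOAD DATA INFILE '/tmp/rental.txt'
INTO TABLE stg_rental
FIELDS TERMINATED BY ';'
LINES TERMINATED BY '\n'
(rental_id, rental_date, customer_id, inventory_id);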

LucidDB Bulk Loader

Previous versions of Kettle had a FIFO-based stream loader for LucidDB, but since Kettle 4.0 it has been replaced by a new version that can push data directly from Kettle into the LucidDB column store. Since the complete documentation including examples is available at http://pub.eigenbase.org/wiki/LucidDbPDIStreamingLoader, there’s no need to repeat that information here. Three things are important to note, however:

- The LucidDB bulk loader is smarter than most loaders; it allows for intelligent updating via MERGE/UPDATE in addition to the traditional bulk INSERT.
- LucidDB works well with all components of the Pentaho BI suite, not only with Kettle. For instance, the integration with Mondrian is more than excellent.
- As a column-based data warehouse database, it offers at least a 10-fold performance increase over classic row-based databases such as MySQL or PostgreSQL.

For more information about LucidDB, please visit http://luciddb.org/.

Oracle Bulk Loader

The Oracle loader step is one of the more overwhelming components within Kettle because of the enormous number of options that can be set and the number of file locations to enter. Oracle’s SQL*Loader (the utility that provides the loading capabilities) has been around for several decades and is still one of the most widely used loading utilities in existence. If you’ve never worked with SQL*Loader before, it’s best to start with the Wiki entry at http://wiki.pentaho.com/display/EAI/Oracle+Bulk+Loader. For more background information about SQL*Loader, you can go to the Oracle documentation at http://www.oracle.com/pls/db111/homepage and search for “SQL*Loader Concepts.” There are good and bad things about this loader. The bad is easy to see: there are many options and it requires a lot of preparation and setup before you can start loading data. The good news, however, is the fact that the tool is extremely robust and lets you specify exactly how your data should be handled, including possible errors and discarded records. Kettle also saves you the trouble of having to create a control file because that’s being done on-the-fly based on the metadata of the incoming data stream.

PostgreSQL Bulk Loader

The PostgreSQL bulk loader is still an experimental step that lets you copy data from a Kettle stream directly into the database using PostgreSQL’s standard COPY command. The manual offers the following as the main command, with several other options available for specifying a delimiter, escape character, and so on:

COPY tablename [ ( column [, ...] ) ] FROM { 'filename' | STDIN }

As you can see, both filename and STDIN are available as input options, but Kettle uses only the latter. The Wiki entry at http://wiki.pentaho.com/display/EAI/PostgreSQL+Bulk+Loader is a good starting point for the explanation of the PostgreSQL bulk loader, but beware that there are still many issues with this loader—hence its “experimental” status.

NOTE In order for the PostgreSQL bulk loader to work, you’ll have to define a trusted connection to your server. See the Pentaho Wiki entry at http://wiki.pentaho.com/display/EAI/PostgreSQL+Bulk+Loader for detailed instructions.
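A concrete form of the command, as the step effectively uses it, could look like the following (a sketch, assuming an illustrative stg_rental table and a semicolon delimiter; the row data itself is then written to STDIN by Kettle):

-- Copy delimited rows from standard input into the target table
COPY stg_rental (rental_id, rental_date, customer_id, inventory_id)
FROM STDIN WITH DELIMITER ';';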

Table Output Step

This might be a surprising entry here, since the Table Output step is not a bulk loader at all but uses standard SQL insert calls to the database. As we explained in the introduction of this section, loading several thousand or even more rows using a Table Output step might work perfectly. You can even increase the throughput for this step by running several copies of it in parallel, as we show in Chapters 15 and 16. There are a few compelling reasons for sticking with the Table Output step, too: it is easy to use, has good field mapping options compared to some of the bulk loaders, and it will work with any database.

General Bulk Load Considerations

You should take a couple of things into account when using bulk loading, especially when intermediate files are used. You should, of course, make sure that there is ample space in the location specified for the files. And as stated before, the additional file I/O might be costly, performance-wise. Luckily, general advancements in memory and storage technology and the constantly dropping price points can help you to overcome this potential bottleneck. You can consider the following techniques if performance is an issue:

- Use a fast array of SSD drives to write and read the files needed for bulk loading operations.
- Create a RAM disk if you have enough available memory.
- Use one of the parallelization and clustering techniques covered in Chapter 16.

As always, there’s a speed-versus-reliability tradeoff: data on a RAM disk isn’t stored physically, and the fastest RAID option is RAID 0, which is also the least secure RAID alternative. Another thing to be aware of is that, in general, there are two types of bulk loading: Insert and Truncate. The Insert operation appends rows to an existing table, leaving all other data unchanged, whereas the Truncate operation removes all existing data from the table before loading. These options also allow you to optimize your solution. A staging table should, in general, be truncated before data is loaded, but you probably don’t want to truncate and completely reload the data warehouse fact tables. For those, several strategies can be used, especially when the tables are very large. These strategies, however, depend on the type of fact table. Different fact table types are covered later in this chapter. We’ll discuss the appropriate bulk loading strategy for each type there as well.

Dimension Lookups

One of the key operations to perform during a fact table load is the lookup of the correct surrogate keys of the dimension tables. In the previous chapter, we covered the generation of surrogate keys and explained why you need surrogate keys, especially when preserving history (type 2 SCD) is important. Another reason for using single-column integer keys is that they allow for very compact fact tables that take up considerably less space than the original key values from the source system. When storing hundreds of millions of rows, each byte saved in a fact table row helps to cut down storage requirements and improve query speed.

Maintaining Referential Integrity

For a dimensional data warehouse, each foreign key in a fact table should correspond to a primary key in the corresponding dimension table. In a regular transactional database, this referential integrity is enforced by using foreign key constraints. These constraints prevent the following related problems from happening:

- Deleting a dimension record, leaving all the related fact table records with an unknown dimension key
- Inserting a foreign key in the fact table for which there is no corresponding dimension primary key

It is not always feasible to use foreign key constraints in a data warehouse, especially while loading large amounts of data. The database must check the constraints for every inserted row, slowing down the overall load process. Not having foreign key constraints isn’t a big problem, however: as you may recall, dimension records are never deleted, so a dimension key will always exist for a loaded fact record. And by using a dimension key lookup to prepare the fact records for loading, the existence of a foreign key is guaranteed as well. At least, that’s the theory, and for many people the daily practice too.

There are other approaches with respect to foreign key constraints as well: an alternative to not using foreign keys is to enforce them, but only after the fact table has been loaded. This means using an “Execute SQL script” step before the load starts to drop all existing foreign key constraints, doing the fact load, and after that executing a second “Execute SQL script” step to create them again. When this last step fails, you’ll immediately know that there’s a referential integrity problem in your data warehouse, which should cause a massive alarm. As with many things in data warehousing and ETL, it’s hard to tell which approach is best. In 99 percent of the cases, using a well-designed surrogate key pipeline together with proper CDC handling will work perfectly. Combined with the test techniques presented in Chapter 11, you can guarantee a correct load of the data warehouse and avoid the extra development and load time required for dropping and re-enabling the foreign key constraints. Still, there can be very good reasons to use them. We already mentioned one (detecting faulty loads), but query performance is another: most databases will use the foreign key constraints to optimize the query execution plan and thus yield better response times. The next section covers the techniques available for looking up the correct key values.
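The SQL executed by those two “Execute SQL script” steps would be along these lines (a sketch only; the fct_rental table, constraint name, and columns are illustrative assumptions, and the exact DDL differs per database; MySQL, for example, uses DROP FOREIGN KEY):

-- Before the fact load: drop the foreign key constraint
ALTER TABLE fct_rental DROP CONSTRAINT fk_fct_rental_customer;

-- After the fact load: re-create it; this fails if orphaned keys were loaded
ALTER TABLE fct_rental
  ADD CONSTRAINT fk_fct_rental_customer
  FOREIGN KEY (customer_key) REFERENCES dim_customer (customer_key);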

The Surrogate Key Pipeline

The general principle for finding the correct dimension keys is fairly straightforward and is often pictured as a pipeline, as you can see in the simple example in Figure 9-1 where three “Database lookup” steps are used. For reasons of simplicity and compatibility we’ve used a “Table output” step instead of one of the bulk loaders.

Figure 9-1: The Surrogate Key Pipeline

Chapter 4 already contained a more elaborate example (load rentals) using the Sakila example database where the correct dimension keys were retrieved based on the rental date. If you translate this lookup function into a more generic example in a kind of pseudo SQL, you get the following:

SELECT dimkey
FROM dimtable
WHERE fact_businesskey = dimtable.businesskey
AND valid_from <= fact_date
AND valid_to > fact_date

What happens in the preceding example is that the dimension key that was valid on the day the incoming transaction took place is retrieved. Because there are multiple conditions that need to be evaluated, this is not the fastest way of retrieving dimension keys. If there is a regular load process for the data warehouse and there haven’t been any delays in delivering both dimension and fact data, it is sufficient to retrieve the latest version of the dimension keys.

Using In-Memory Lookups

The fastest way to retrieve information is from the computer’s RAM. So the fastest way to find a surrogate key for a certain business key is to store all the dimension information needed for the lookup in memory. This is achieved by selecting both “Enable cache” and “Load all data from table” in the “Database lookup” step. Be careful, however, because you can quickly run out of memory with large dimension tables, especially if you chain all key lookups together in a single transformation. Also, make sure to limit the columns selected to the ones that are actually needed for doing the lookup. It doesn’t make sense to load an entire dimension table into memory when all you need are the dimension and business key, and optionally the valid from and to dates.

TIP The maximum amount of memory available to Kettle is limited by the -Xmx setting in the startup scripts (Spoon, Carte, Kitchen, and Pan), or, if using Kettle.exe on Windows, in the Kettle ini file kettle.l4j.ini.

Stream Lookups

In most cases, the “Database lookup” step will work just fine. It doesn’t solve every lookup problem, however. What if the lookup values are coming from an external source, or aren’t stored in a database at all? That’s where the “Stream lookup” step can help. This step doesn’t have all the options available in the “Database lookup” step and only allows for basic key lookups based on a single lookup value. It does, however, allow you to use any available step as input to get the lookup values from. For the typical case where dimension and fact data arrive at the same time, using a table input step to retrieve the latest version of the dimension records and using that as the input step for the stream lookup is a very efficient way of doing lookups on large dimension tables. The efficiency is established by the fact that only the actual records are available in the lookup set, so if a customer dimension contains 3 million rows with all history preserved but the current records are a third of that (which is not an uncommon assumption), only 1 million rows have to be searched in the lookup step using the stream lookup. Even with the same number of rows, the “Stream lookup” step is a little faster, as Figures 9-2 and 9-3 will show. For this exercise, we used a TPC-H dataset of 1GB (see http://www.tpc.org for details about the TPC-H benchmark database), which gives us some data volume to play with. A 1GB database has 150,000 customer records, 1.5 million orders, and roughly 6 million lineitem rows. The customer dimension created from this data is an almost 1 to 1 copy, except for an added customer_id column using the Add Sequence step. Figure 9-2 shows a stream lookup where the lookup data is pre-fetched using a “Table input” step with the following query in it:

select customer_id, c_custkey from dim_customer where current_flag = 1

Figure 9-2: Using a “Stream lookup” step

NOTE Note that the example in Figure 9-2 does use a “Table input” step as the data source for the “Stream lookup” step, which might be confusing. It is, however, just an example: any step that provides data can be used to feed the data into a stream lookup.

The stream lookup itself is displayed in Figure 9-3. As you can see, it’s a fairly straightforward step where the input step for the lookup data is selected in the Lookup step drop-down list.

Figure 9-3: Inside the “Stream lookup” step

It’s also possible to use multiple keys, but only with an implicit “equals” operator. In this case, the incoming customer key from the orders table (o_custkey) is compared with the lookup field c_custkey from the dimension table. The field to retrieve (again from the dimension table) is customer_id. Don’t forget to specify the type of the field; the default is a hyphen (-), which causes errors within subsequent steps. The three checkboxes at the bottom serve to tweak the way Kettle compresses and sorts the data.

NOTE For an excellent explanation/discussion about the memory options available in the “Stream lookup” step, go to http://forums.pentaho.org/showthread.php?t=68058.

As a comparison, you can run the same transformation but then with a regular “Database lookup” step; the results are displayed in Figure 9-4.

Figure 9-4: “Database lookup” step

If you look closely, you can see that the stream lookup is a little bit faster, even with the same amount of data in the lookup set. Whether to use the “Stream lookup” in every case is hard to tell; as you can see from the examples, it always takes at least one extra step, and there’s only a simple key comparison available for the lookup. In all cases where more advanced lookup conditions are required, for instance when late-arriving facts need to be handled, a database lookup step is required. The next section explains how to handle late-arriving data.

Late-Arriving Data

Under normal circumstances, data is extracted from source systems at regular intervals and processed with Kettle. As we explained in the chapter introduction, the standard order is to first handle the dimensions, and then the fact tables. This order is important because you need the surrogate keys in the fact tables, and these keys have to be generated first (if necessary) using the dimension loading process. Now what happens when the data delivery is out of sync and fact or dimension data arrives with a delay of several days or even weeks? This can cause irregularities in the data warehouse; it’s one

thing to have customers without any orders (which could be true of course), but orders shipped to an unknown customer make reporting on this data quite hard. Still, there are resolutions for these problems, as we describe in the following sections.

Late-Arriving Facts

Facts arrive late when transaction data arrives at the ETL process long after the transaction took place. During the delay, the related dimension data might have gotten new versions, so a simple lookup on the current dimension records might return false results. The problem is easily resolved by using the valid_from and valid_to timestamps to find the correct version of the record and the accompanying surrogate dimension key. But easily doesn’t necessarily mean fast. Just look at Figure 9-5 where we added the date conditions to the “Database lookup” step we used earlier. Now compare the process counters with the values in Figure 9-4, and you’ll see a dramatic difference.

Figure 9-5: Late-arriving facts performance

The message here is that when late-arriving facts occur, it’s best to isolate these records from regular loads because it might take a long time to process all records using the valid_from and valid_to conditions, even when indexes are created for these columns.
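If late-arriving facts do have to be processed this way, at least make sure the two date conditions are supported by an index on the dimension table. A minimal sketch (the dimension and column names are illustrative; adapt them to your own schema):

-- Composite index covering the business key and the validity range
CREATE INDEX idx_dim_customer_validity
  ON dim_customer (c_custkey, valid_from, valid_to);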

Late-Arriving Dimensions

The late arrival of dimension data is the reverse situation where the fact records have been processed (or need to be processed), but the dimension records are not yet available. Unfortunately, this is a lot harder to solve than the late-arriving fact issue because you need to be prepared for this to happen. The problem is that a key exists in the incoming transaction (let’s assume it’s a new customer), but there’s no dimension record yet. As a result, the dimension lookup will fail. A simple solution would be to add an unknown record to the dimension table with an ID that’s reserved for these situations. Usually the ID 0 is used, but some people prefer -999 or another special value. In the “Database lookup” step, this is achieved by adding a Default value, as shown in Figure 9-6. This value gets assigned when no corresponding ID can be found based on the lookup keys and criteria.

Figure 9-6: Default lookup return value

Now when a report is run, there is an unknown category in the data showing all transactions with an unknown customer or unknown product. Does this solve the problem? Well, not really. There’s no way of ever getting the correct dimension key assigned to the fact row after assigning the unknown value. So instead of talking about “unknown,” it’s better to speak of “not yet known,” which leads to a better solution for this problem. First let’s explain the workflow and then show how the solution works in Kettle:
1. A fact row comes into the ETL process and one of the lookups fails. Let’s assume it’s a missing customer with a source system customer key ABC123.
2. Instead of assigning an unknown ID, send the row to a sub-process.
3. The sub-process uses the unknown customer key ABC123 to create a new dimension record. All values in this record (name, address, and so on) are set to N/A or unknown because the only information we have is the customer source system key. (A SQL sketch of this step follows below.)
4. Submit the failed fact rows to the original process again; the customer lookup will now find the newly created dimension ID.
5. As soon as the real customer information is available, the dimension is automatically updated with the new information.
The last step of this process will probably create a new version of the customer record with another dimension ID than the one that was used to load the fact record. This means that this fact record still points to the dimension record with all columns set to N/A or unknown. It’s a business decision whether this situation should be maintained or not. Keeping the original reference shows that something was wrong with the data at the time of loading, but it would also make sense to do only an update of the dimension record when the correct customer record is received by the dimension load process, and not create a new version.
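In plain SQL terms, the sub-process in step 3 boils down to something like the following sketch. The column names are illustrative, and the surrogate customer_id is assumed to be generated automatically by the database or by the “Dimension lookup / update” step:

-- Create a "not yet known" customer version for the late-arriving key ABC123
INSERT INTO dim_customer
  (cust_key, name, address, valid_from, valid_to, current_flag)
VALUES
  ('ABC123', 'N/A', 'N/A', CURRENT_DATE, '9999-12-31', 1);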

EXAMPLE: USING KETTLE TO HANDLE LATE-ARRIVING DIMENSION DATA

To illustrate the process using Kettle we’ve created a very simple example consisting of a single customer dimension with only a few fields, and a single fact table with customer_id, sale_date, and sale_amount as the only available columns. There are six customers, added by using a Data Grid step, as shown in Figure 9-7.

Figure 9-7: Sample customer data

Note that there are six customers in this table, which are all loaded into the dim_customer table. Next we need to create a few sales records, of which one will get a customer key not present in the dimension table. Figure 9-8 shows the sales data; note that cust_key B does not exist in the dimension table yet.

Figure 9-8: Sample sales data

This data is read and passed to a “Database lookup” step to find the correct customer_id values for the customer keys of the source records. The lookup will find the customer IDs for the first four fact rows, but not the record for the cust_key with value B. In order to handle failed lookup data, we need to add an error output step for the lookup step to catch all the fact rows that cannot be processed. This data can be stored in a file, a database, or even in a temporary space using the “Copy rows to result” step. In this example, we’ll use a simple text file so the complete workflow looks like the one in Figure 9-9.


Figure 9-9: Output error fact rows

Now that we’ve isolated the unknown customer keys, we can process them separately and create new customer dimension records. This is a very straightforward process, which can consist of the steps displayed in Figure 9-10.

Figure 9-10: Add unknown customers

First the unknown records are read; a “Get System info” step is then added to retrieve today’s date. Using an “Add constants” step, we can add the missing values to the stream, as shown in Figure 9-11.

Figure 9-11: Adding N/A values

Finally a “Dimension lookup / update” step takes care of generating the new customer_id and adding the record to the customer dimension table. (See the previous chapter for an elaborate explanation of the “Dimension lookup / update” step.) Figure 9-12 shows the resulting records in the dimension table.


Figure 9-12: New customer added

The final step is to process the rejected fact records again, which will now be correctly loaded with the newly found customer_id.

Fact Table Handling

If you recall the ETL subsystems from Chapter 5, you already know that there are three types of fact tables according to the Kimball Group. The first one is the standard transactional fact table where each row represents a fact such as a product sale, a banner click, or an e-mail coming in. Each of these events occurs at one point in time and a single timestamp can be linked to the transaction. The nature of this data makes it fairly easy to summarize it by some dimension hierarchy level such as year, product group, or country. So far, all the examples in this book have been based on transactional fact tables.

Periodic and Accumulating Snapshots

The second type of fact table is the periodic snapshot table, which is used to store the status of certain information periodically. One of the reasons for using snapshots is that storing each individual transaction in a fact table might be overkill. Think, for example, of a large distribution warehouse where thousands of items are stored and constantly moved. For reporting purposes, it wouldn’t make a lot of sense to store each individual movement of each inventory item. On the other hand, a warehouse manager might be interested in tracking total product inventory by day or by month, and this is exactly where a periodic snapshot table is useful. Good examples of periodic snapshot tables are inventory counts and bank account or insurance policy status summaries. One thing to keep in mind with periodic snapshots is that the measures in these tables are semi-additive. In a regular fact table, you can sum sales amount by period, customer group, and product line. In a periodic snapshot table, you cannot summarize on all dimensions, especially not on the time dimension. Remember that a periodic snapshot shows a total by period. Calculating the sum of all bank accounts for all periods is therefore meaningless; calculations that can be used are minimum, maximum, or average totals, hence the word semi-additive. There is also non-additive data, meaning that you cannot summarize the data over any dimension. A good example for this is room temperature. The temperature in different rooms varies over time, and you can take a periodic snapshot of this measure, but summarizing the temperature for all rooms is meaningless, just like summarizing the temperatures for each hour.
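The difference between additive and semi-additive behavior is easy to see in SQL. A sketch, assuming a monthly snapshot table fact_account_snapshot with the illustrative columns monthnr, customer_key, and balance:

-- Valid: within one period, balances can be summed across customers
SELECT monthnr, SUM(balance) AS total_balance
FROM fact_account_snapshot
GROUP BY monthnr;

-- Meaningless: summing the same balance over all periods double-counts it;
-- use MIN, MAX, or AVG over the time dimension instead
SELECT customer_key, AVG(balance) AS avg_month_end_balance
FROM fact_account_snapshot
GROUP BY customer_key;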

WARNING A word of caution about periodic snapshots is in order here. Suppose you need to store the daily account status for a large bank with millions of customers. This means that each day, a record is added to the snapshot table for every customer the bank has, even though there hasn’t been a change in the account status. These tables can therefore become extremely large.

The third type of fact table is the accumulating snapshot table, which is actually a special case of a regular transactional fact table. The term “accumulating” denotes the fact that this type of table gets updates over time, usually when steps in a process have been completed. The term “snapshot” might be a bit misleading and shouldn’t be confused with the periodic snapshot. Calling this type of table an accumulating fact table would probably have made more sense, but the industry convention is snapshot, so we’ll use that as well. What is actually meant here with the term “snapshot” is that each record shows the stage of completion in a process. A record might therefore have only an order date filled in, meaning that the status is currently “ordered.” A good example is a typical sales process that consists of an order, picking, shipping, invoicing, and payment step, each with its own date. In fact, the fact_rental table introduced in Chapter 4 is an accumulating snapshot table with a rental date and return date.

Introducing State-Oriented Fact Tables

When you look closely at the three types of fact tables covered previously, and perhaps are familiar with insurance policies or other data that change very infrequently, you’ll see that something is missing. Let’s say I’ve insured my house and bought a fire policy for it. There are a few relevant points in time to consider:

N The creation of the policy
N Changes in yearly due amounts (typically these amounts increase over the years)
N Changes in the insured value
N Execution of the insurance (in case of the house burning down)
N Expiration of the policy

Basically, these are the dates that something changes in our policy. It is quite usual for these kinds of policies to not change at all for several years in a row. Yet the only way to use this kind of data in a data warehouse environment is to create a periodic snapshot to store the monthly or yearly status of every policy. This seems a bit of a waste, and also doesn’t reflect the changes made in between the period ends when the snapshots are taken. Think, for instance, of a policy that changes on the 20th of the month while snapshots are taken on the 1st day of the next month; the fact that the change was made on the 20th is lost. It would be better to store the state of the policy and only change the data when an actual change in the source system takes place, much like

a transaction fact table. This is why we need a fourth type of fact table, called a state-oriented fact table. The concept of state orientation was first recognized by Dutch data warehouse specialist Dr. Harm van der Lek. The state-oriented fact table material and examples here are reused with kind permission from Dr. van der Lek. A state-oriented fact table is a table where each row represents the state of an object during a span of time in which this state didn’t change. Changes in higher-level objects are to be considered changes in the low-level object at hand. State-oriented fact tables might fit the bill if a combination of the following business requirements exists:
1. There are low-level objects relevant to the business, which we need to analyze.
2. The objects change in an unpredictable way.
3. These changes have to be captured on a very detailed level, including changes in parent objects.
Here’s a simple example: The lowest-level objects are savings accounts. We already have one slowly changing dimension of type 2, the customer dimension. In Table 9-1, you see that a customer apparently moved from New York to Boston on July 21, 2009.

Table 9-1: A Slowly Changing Dimension

CUSTOMER VERSION_KEY   CUSTOMER CODE   FROM DATE    TO DATE      CITY
5                      CG067           1900-01-01   2009-07-21   New York
8                      CG067           2009-07-21   9999-12-31   Boston

Table 9-2 shows an example of a state-oriented fact table referring to the customer dimension (among others) and containing the semi-additive measure balance.

Table 9-2: A State-Oriented Fact Table

ACCOUNTNR   FROM DATE    TO DATE      CUSTOMER VERSION_KEY   BALANCE
3200354     1900-01-01   2009-07-10   5                      $ 100
3200354     2009-07-10   2009-07-21   5                      $ 200
3200354     2009-07-21   9999-12-31   8                      $ 200

On July 10, 2009, the balance changed from $100 to $200, so a new state was inserted (the second row). The move of the customer has to be reflected in this table as well, so we again add a new row pointing to the new version of the customer and repeating the balance because that was not changed. This technique now allows us to retrieve the total balance per city at a particular moment in time, for example at the end of July 2009. It should be clear from this example that the balance of this CG067 customer should be added to the Boston total because that was the valid state at that moment in time.
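In SQL, the “total balance per city at the end of July 2009” question translates into something like the following sketch. The table and column names follow Tables 9-1 and 9-2; the name of the customer dimension table (dim_customer) is an assumption:

-- Pick the state that was valid on 2009-07-31 and roll it up by city
SELECT c.City, SUM(f.Balance) AS TotalBalance
FROM SOFactTable f
INNER JOIN dim_customer c
  ON c.Customer_version_key = f.Customer_version_key
WHERE f.From_date <= '2009-07-31'
  AND f.To_Date > '2009-07-31'
GROUP BY c.City;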

The really neat thing about storing states in this way is that you can still define periodic snapshots based on a state-oriented table. Because we have the complete history available, a snapshot can be derived fairly easily. First we need to create an additional table with all the points in time for which a snapshot is required. As an example, let’s assume this is a monthly snapshot so we’ll store our periods in a Month table, as displayed in Table 9-3.

Table 9-3: Month Table

MONTHNR   LASTDAY
200906    2009-06-30
200907    2009-07-31

We can now use this information to create the periodic snapshot, either as a view or (as we’ll do later in the chapter) in a physical table using the following SQL statement:

SELECT Month.MonthNr
, SOFactTable.AccountNr
, SOFactTable.Customer_version_key
, SOFactTable.Balance
FROM SOFactTable
, Month
WHERE Month.LastDay >= SOFactTable.From_date
AND Month.LastDay < SOFactTable.To_Date

The result of this query, based on the example data, is displayed in Table 9-4.

Table 9-4: Periodic Snapshot View

MONTHNR   ACCOUNTNR   CUSTOMER VERSION_KEY   BALANCE
200906    3200354     5                      $ 100
200907    3200354     8                      $ 200
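If the snapshot is kept in a physical table rather than a view, the same statement can simply be used to append the derived rows; a sketch (the periodic_snapshot table name is illustrative):

-- Append the derived month-end states to a physical snapshot table
INSERT INTO periodic_snapshot (MonthNr, AccountNr, Customer_version_key, Balance)
SELECT m.MonthNr, f.AccountNr, f.Customer_version_key, f.Balance
FROM SOFactTable f, Month m
WHERE m.LastDay >= f.From_date
  AND m.LastDay < f.To_Date;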

Loading Periodic Snapshots

There is nothing inherently complex about loading periodic snapshots, but there are some issues involved with preparing and getting the data. First, a periodic snapshot is like a photo taken at a specific point in time. If you translate this to the ETL and data warehouse world, this means that you need to get the state of the data at an exact point in time, and this point in time should be exactly the same for each period. In situations where periodic snapshots are the rule instead of the exception, this is usually not a big

problem. For instance, in a banking environment, periodic closings are already a standard procedure and the data won’t change between two closing periods. As a result, it doesn’t matter if you load the data at exactly 00:00 hr of the last day of the month, or a couple of hours or even days later. Something else is in play here as well: The calculation of the monthly account balances can be an extremely complex process and is better left to the operational systems that are designed to support this. Trying to reproduce these calculations in an ETL process is a daunting task that will likely fail or at least cost a huge amount of time and money. Yes, even with Kettle! The loading of the periodic snapshot itself can be done by using a simple bulk load using the append option because all you do is add an additional set of rows to an existing table. The only thing to take care of is a snapshot period ID to include in the fact rows.
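If the source system delivers the calculated month-end figures, the load itself is little more than such an append with the period ID added; a sketch (MySQL-style syntax, all names illustrative):

-- Append the month-end balances for period 200907 to the snapshot table
INSERT INTO fact_account_snapshot (monthnr, account_nr, customer_key, balance)
SELECT 200907, s.account_nr, s.customer_key, s.month_end_balance
FROM stg_account_closing s;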

Loading Accumulating Snapshots

Accumulating snapshot loading means using updates. For this reason, you might think that bulk loading is out of the question. But is it? Not necessarily, as we’ll explain shortly. Chapter 4 showed you how to use an Insert / Update step for loading an accumulating fact table so we won’t cover that here. The problem with using inserts and updates is that when the fact table gets huge, and there are many updates, processing the data might take a very long time. There is a very elegant way to avoid having to do insert/update statements on a large fact table, but it requires partitioning support in the database. However, you can achieve something similar with two tables and a view to access them as a single table. Logically the solution looks like the diagram in Figure 9-13.

Figure 9-13: Partitioned Accumulated Snapshot Table (the Active partition and the Read-only partition together make up the fact table)

This works as follows:

N All the inserts and updates take place in the Active partition, which is relatively small.
N As soon as the accumulation is complete (all steps in the process have a final date and all other fact columns have their final values inserted as well), the record is deleted from the Active partition and moved to the Read-only part of the table.
N The latter can be set up as a periodic batch process to clean the Active partition of completed records. These records can then be appended to the read-only partition using a bulk loader.

Another solution that completely eliminates the use of insert/update on the database uses the same logical division of active and read-only records and works as follows:

N Read all active records into a staging area and truncate/delete the Active partition.
N Retrieve new and changed fact records from the source system and merge these with the records from the previous step.
N Load the incomplete records into the Active partition and the completed records into the Read-only partition.

This second approach eliminates the need for a periodic cleaning process because the records are already assigned to the correct partition by the load process (see the SQL sketch below). It also eliminates the use of DML statements, which means that this approach can also be used with a product like the community edition of InfoBright, which doesn’t allow for DML operations at all.
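In SQL terms, the separation of completed and open records that both approaches rely on comes down to statements like the following sketch. It assumes the two partitions are implemented as the tables fact_order_active and fact_order_readonly (combined by a view), and that a record counts as complete once its payment_date is filled in; all names are illustrative:

-- Move completed accumulating-snapshot rows to the read-only partition ...
INSERT INTO fact_order_readonly
SELECT *
FROM fact_order_active
WHERE payment_date IS NOT NULL;

-- ... and remove them from the active partition
DELETE FROM fact_order_active
WHERE payment_date IS NOT NULL;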

NOTE Not all databases support truncating a partition; some require you to delete and recreate the partition to delete all the data from it.

Loading State-Oriented Fact Tables

You might wonder whether a state-oriented fact table (SOF) is actually a real and different type of fact table because it has many similarities with both dimension and periodic/accumulating snapshot tables. This might be the first impression, but there are some notable differences:

N A SOF table contains measures, and dimension tables don’t.
N A SOF table contains foreign keys to dimension tables, so in a pure star schema model, it has to be a fact table; otherwise it would be a snowflake.
N The number of states in a SOF table is unknown beforehand; in an accumulating snapshot, all states are hard-wired in the table.

There are some similarities as well: The measures are semi-additive (cannot be summarized over time), just like in a periodic snapshot table. The records have a valid_from and valid_to date, just like dimension tables. And, just like in a dimension table, a current_flag could be added to indicate the current active record. Furthermore, the

technical design of the table could benefit from a surrogate key because this enables you to avoid having to define a multi-column primary key. This leads us to the conclusion that we have all the ingredients available to efficiently load SOF tables using Kettle! A sample of the workflow needed is displayed in Figure 9-14.

Figure 9-14: Example SOF loading workflow

The only difference with other fact table workflows is that we use a “Dimension lookup/update” step for the final SOF insert/update operation. And unfortunately, there’s no way around this because there’s generally no way of telling that the “end state” is reached. Exceptions are, of course, the already mentioned fire insurance policy, which could expire or be terminated, or a bank account that’s closed. The larger part of the data will, however, be apt to change because we’re dealing with long-running contracts and infrequent changes to them.
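What the “Dimension lookup/update” step does for the SOF table can be written as a close-and-insert pair in SQL; a sketch using the names and values from Table 9-2 (the change date of 2009-07-21 and the new customer version key 8 come from that example):

-- Close the state that is currently open for account 3200354 ...
UPDATE SOFactTable
SET To_Date = '2009-07-21'
WHERE AccountNr = 3200354
  AND To_Date = '9999-12-31';

-- ... and insert the new state, pointing to the new customer version
INSERT INTO SOFactTable
  (AccountNr, From_date, To_Date, Customer_version_key, Balance)
VALUES
  (3200354, '2009-07-21', '9999-12-31', 8, 200);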

Loading Aggregate Tables

Aggregate tables add an extra layer of complexity and overhead to an ETL solution and the data warehouse but can offer big performance advantages. An aggregate table contains summary data at a higher grouping level. When a query tool or OLAP engine is capable of using these aggregate tables, it doesn’t have to go through all details of a fact table but can retrieve the summarized data directly. Consider the following: the fact table in the example data warehouse in Chapter 4, fact_rental, contains 16,044 records. The lowest level of detail in this table is the individual movie in a rental, which means that when an overview of the rentals per customer country per year is required, all 16,044 records need to be scanned. Indexing can, of course, speed up this process. However, when an aggregate table is added with only the number of rentals per year per customer country, this table contains only 169 rows, which is an almost 100-fold decrease in number of records to evaluate. Basically there are three ways to load aggregate tables:

N Extend the fact load transformation with extra steps to look up the required summary fields customer_country and year4, and then use a “Memory Group by” step to create the aggregate records and a “Table output” step or one of the bulk loaders to write the data to the aggregate table.
N Use a “Table input” step with the following SQL to retrieve the data and load it directly into an aggregate table:

SELECT c.customer_country, d.year4
, SUM(count_rentals) as count_rentals
FROM fact_rental f
INNER JOIN dim_customer c ON c.customer_key = f.customer_key
INNER JOIN dim_date d ON d.date_key = f.rental_date_key
GROUP BY c.customer_country, d.year4

N Use an “Execute SQL script” step with an extended version of the preceding SQL. The first line of the script will then read something like insert into agg_table (customer_country, year4, count_rentals) (in case the table already exists), or create table agg_table as (in case you want to create it on the fly, but make sure to drop it first if it’s already there). The syntax of these statements will of course depend on the SQL dialect of your database; the statements used here are for MySQL. The advantage of using this approach is that the statement is completely executed inside the database without the need to process it within Kettle. (A complete sketch of such a statement follows at the end of this section.)

Within a Pentaho solution, the most obvious use for aggregate tables is within a Mondrian schema. It is beyond the scope of this book to elaborate on this, but our book Pentaho Solutions explains how to use the Mondrian Aggregate Designer to create aggregate tables and how to augment the Mondrian schema to use them.
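To round off the third approach, the complete statement could look like the following sketch (MySQL syntax; it assumes the agg_table with the three columns shown above already exists):

INSERT INTO agg_table (customer_country, year4, count_rentals)
SELECT c.customer_country, d.year4
, SUM(count_rentals) AS count_rentals
FROM fact_rental f
INNER JOIN dim_customer c ON c.customer_key = f.customer_key
INNER JOIN dim_date d ON d.date_key = f.rental_date_key
GROUP BY c.customer_country, d.year4;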

Summary

This chapter covered the various load options and fact table variations using Kettle. The first section covered the bulk-loading options available in Kettle and explained the difference between file- and API-based bulk loading. In between those two options is the use of named pipes. We illustrated this concept using a simple example with basic Linux commands. The first part of the chapter also covered the bulk loaders for MySQL, LucidDB, Oracle, and PostgreSQL. One of the most important aspects of preparing fact data for loading is performing the dimension lookups, also described as the surrogate key pipeline. We covered both database and stream lookups, and provided an in-depth explanation of how to handle late-arriving dimension and fact data. The last part of the chapter explained the different types of fact tables and the issues involved with loading and updating the data. The following well-known fact table types were covered, including an explanation of how to load data for each of them using Kettle:

N Transactional fact tables
N Periodic snapshot tables
N Accumulating snapshot tables

Next, we introduced a new type of fact table first described by Dr. Harm van der Lek, the state-oriented fact table, and explained how Kettle supports this type of fact table out-of-the-box. We concluded the chapter with a brief coverage of the different approaches that are available to load aggregate tables. What we didn’t cover in this chapter are all the performance options available in Kettle to handle large data volumes and speed up the transformation and load process. These options are covered in Part IV of this book where performance tuning, parallelization, clustering, and partitioning are described in depth.

CHAPTER 10

Working with OLAP Data

OLAP, or Online Analytical Processing, has a special position as subsystem 20 in the 34 ETL subsystems. Although handling OLAP data stores is only one of the 34 ETL subsystems, the topic is so important that we have devoted this entire chapter to it. The term OLAP was introduced in 1993 by database legend E. F. (Ted) Codd, who came up with 12 rules for defining OLAP. The rules are nicely summarized at http://www.olap.com/w/index.php/Codd's_Paper. The most important notion in Codd’s definition is the multi-dimensional nature of OLAP data—the two terms, OLAP and multi-dimensional, are now used almost as synonyms. It’s not that Codd invented OLAP, but he did give it a name, and since then a multi-billion-dollar industry has emerged around this concept. The first OLAP products already existed when Codd came up with his rules, with Cognos Powerplay and Arbor Essbase probably being the most familiar. These products still exist today, Powerplay as part of the IBM-Cognos BI offering, and Essbase as part of the Oracle BI offering. The biggest player in the OLAP market, however, started its life in Israel under the wings of a small software company called Panorama. In 1996, the company sold its OLAP server technology to Microsoft and the rest is history, as Microsoft Analysis Services is now the OLAP solution with the biggest overall market share and the largest number of production deployments. Microsoft did some other good things for the OLAP market as well: it created the multi-dimensional query language called Multi Dimensional eXpressions, or MDX for short. MDX is now the de-facto standard multi-dimensional query language and a very powerful way to express analytical queries. This chapter will give you enough background information on OLAP technology, configuration, and challenges, but it is not an introductory course in MDX. If you’re not


familiar with MDX you can still benefit from this chapter because we’ll be covering an OLAP database that lets you read and write data without using MDX queries.

NOTE If you need a quick introduction to OLAP and MDX, you might want to start with a look at Chapter 15 of our earlier book, Pentaho Solutions. For in-depth coverage of the MDX query language, several books and online resources are available, three of which we’d like to mention here:

N The “MDX Essentials” series on Database Journal, by William Pearson: http://www.databasejournal.com/features/mssql/article.php/10894_1495511_1/MDX-at-First-Glance-Introduction-to-SQL-Server-MDX-Essentials.htm
N MDX Solutions: With Microsoft SQL Server Analysis Services 2005 and Hyperion Essbase, by G. Spofford, S. Harinath, C. Webb, D. Hai Huang, and F. Civardi (Wiley, 2006).
N Professional Microsoft SQL Server Analysis Services 2008 with MDX, by S. Harinath, M. Carroll, S. Meenakshisundaram, R. Zare, and D. Lee (Wrox, 2009).

OLAP Benefits and Challenges

What’s so special about OLAP? Remember that usually we’re dealing with data in rows and columns, whether that data is in a flat file, an Excel sheet, or a database table. In an OLAP cube, there is no such thing as a row or a column, only dimensions, hierarchies, and cells. In order to read data from or write data to a cube, you need to know where the data is located. This location is indicated by the intersection of the dimensions involved. First, let’s take a look at a typical cube, such as the one on the left in Figure 10-1.

Figure 10-1: 3-dimensional cube and time dimension (left: a cube with the Customer, Time, and Product dimensions; right: the Time dimension with its All Dates, Year, Quarter, Month, and Day levels)

Figure 10-1 shows a typical example of an OLAP cube consisting of the dimensions Time, Product, and Customer. The darker colored cell is the intersection of all days in the year 2007, the product category “Thriller,” and all customers in the state of Illinois. So if the value in this cell is called Revenue, we’d have the revenue for the Thriller category for all customers from Illinois in the year 2007. This is a highly aggregated value, however, as you can tell from the time dimension diagram on the right in Figure 10-1. A year consists of quarters, quarters consist of months, and months consist of days. This is called a hierarchy, and a dimension can contain multiple hierarchies, or ways to roll up values. The same breakdown shown for the time dimension is probably also available for customer (state → city → area → customer) and product (category → group → product). The value of using an OLAP cube for analysis is not necessarily delivered by this multi-dimensionality per se, but if you add the notion of pre-aggregation to it, you’ll get an idea of what’s happening. Revenue in an OLAP cube is not only available at the lower levels in the cube cells (e.g., revenue for a specific customer and a specific day), but is also pre-calculated at the higher levels as well. This means that an OLAP cube is ideally suited for interactive, ad-hoc analysis of data. The real drivers for the success of OLAP tools were the following:

N Unprecedented analysis performance: The precalculated and aggregated summary cells provide immediate response times.
N Ability to drill down or drill up: Move up and down through the hierarchies to go to a more detailed or more grouped level.
N Ability to slice and dice: Select other dimensions, select a specific value on the fly, or switch row and column entries.
N The fact that multi-dimensional models are intuitive: Most people easily grasp the concept of analyzing data using dimensions with hierarchies, which is also why the dimensional model for relational data warehouses became so popular.

These benefits, plus the very disruptive pricing model of Microsoft’s Analysis Services OLAP server, pushed a broad adoption of OLAP tools and technologies. Several OLAP servers also allow users to write back data or create different versions of specific slices (e.g., budget A, budget B), which makes these databases perfect for planning, budgeting, or forecasting. So far so good; but what about integrating data from an OLAP cube? That’s where the challenges of OLAP start. Unlike analysis tools, ETL tools aren’t designed to work with multi-dimensional, hierarchically stored data. Instead, we need to transform the data structure into something that’s row- and column-based in order to work with it. We need a way to “flatten” the cube. Another challenge is getting the data at the aggregation level needed. MDX lets you do all those things as we’ll show later, but just getting the data out of the cube is probably only the start of the process. In most cases, the data needs to be pivoted or unpivoted to conform the structure to other data sources we’re processing. Last but not least, every OLAP data provider returns data in its own specific way, so having solved a problem for Mondrian doesn’t necessarily mean that you’ve solved the same problem for Microsoft Analysis Services as well.

OLAP Storage Types

In order to work with OLAP data stores, it’s important to know the different types of OLAP storage that exist. Basically there are two distinct types and a third somewhere in between the first two:

N MOLAP: Multi-dimensional OLAP, where all the data resides in the OLAP solution, both for storage and for analysis. Examples are Jedox PALO, IBM-Cognos Powerplay or TM1, and (to a large extent) Microsoft Analysis Services.
N ROLAP: Relational OLAP, where the data resides in the underlying relational database and the OLAP engine serves only as a caching and MDX-to-SQL translation mechanism. Examples are Pentaho Mondrian and (also to a large extent) Microstrategy.
N HOLAP: Hybrid OLAP, where the detail records reside in a relational database and the summary information in the OLAP database. Examples are actually all of the above-mentioned products. Mondrian has the ability to cache data in memory, effectively acting like a HOLAP solution then. Other products such as Microsoft Analysis Services let you decide what amount of data should be pre-aggregated in the cube and what data can remain in the underlying detail database.

These differences in approach also have an effect on the available ETL options. For MOLAP stores, you’ll have to use MDX queries, unless there is another way of getting at the data. This is slightly different for ROLAP stores. Because the data for a ROLAP solution is also available in a relational database, you could in theory extract the data directly from there using SQL instead of using MDX statements fired against the OLAP store. This isn’t always practical or achievable; ROLAP is a technology used for very large data warehouses, and accessing the data directly by circumventing the ROLAP engine might cause severe performance issues. Furthermore, the database that acts as a ROLAP foundation usually contains multiple predefined aggregate tables to help speed up the OLAP performance. If the ETL developer is not aware of these aggregate tables, the ETL process could place an undesirable heavy burden on the data warehouse. The conclusion here must be that for extracting data from an OLAP cube, whether ROLAP, MOLAP, or HOLAP, the safest and fastest way is to query the cube using MDX and not try to get the data from a relational data store. Writing data is a different story; not all OLAP engines allow for write-back operations or are capable of handling something like an insert statement. For Mondrian, this means that if there is new data, it must be inserted or updated in the underlying database. After doing that, the cache must be rebuilt in order to actually see the new data in an analysis front end.

Positioning OLAP

In most situations, an OLAP database will be used as a data mart; sometimes organizations use the OLAP solution as the corporate data warehouse but this is the exception, not the rule. A typical layered architecture using OLAP technology is displayed in Figure 10-2.

Figure 10-2: Positioning OLAP

Figure 10-2 shows, from bottom to top:

N Various data sources that are loaded and transformed using an ETL process
N The central data warehouse
N The OLAP solution (visualized as a single cube but usually there are more than one)
N Front-end tools for analyzing, visualizing, and modifying the data in the cube

In the diagram, you’ll immediately spot the fact that there’s also a cube displayed as one of the possible data sources. There’s nothing wrong with that; as we’ve mentioned before, OLAP technology is an ideal solution for planning and budgeting. Many organizations rely on Excel for those purposes, but then lack the centralized management and security options offered by OLAP tools such as Palo. So in these cases, the OLAP cube serves two purposes: as a centrally managed datastore that can be used to enter or manipulate information, and as a fast analysis database. Both roles are depicted in Figure 10-2, including the different front ends needed to work with the cubes in these different roles. Products like Palo offer add-ins for Microsoft Excel and Open Office Calc to enable users to work with the multi-dimensional database using a familiar front end.

Kettle OLAP Options

In addition to being able to access nearly every relational database on the planet, Kettle can now also extract information from nearly every available OLAP database and even write data to some of them. The following three options are available:

N Mondrian Input step: Allows Kettle to retrieve data from a Mondrian cube using MDX statements. There is no special connection needed because this step uses 274 Part II N ETL

the same connection as the Mondrian schema itself. In addition to the connection, the Mondrian schema is required to be able to tell Kettle how the cube is structured.
N OLAP Input step: A generic plugin that allows Kettle to communicate with all XML/A-compliant OLAP servers. Nearly all OLAP products support this standard, making this a very powerful option. The magical stick that enables this has two components: XML/A and Olap4J. The latter is a generic open source library developed to communicate with OLAP servers from a Java environment using MDX statements. XML/A allows a client to communicate with an OLAP server over HTTP using SOAP services and is not limited to reading data, but can be used to write data or fire processing tasks as well.
N Palo add-in: Perhaps the most versatile and easy-to-use option for working with OLAP data. The add-in contains four steps to read or write data, update or refresh dimension information, and create new dimensions in a cube.

The following sections cover these three options using the sample databases that are installed with the different products. This means that there’s very little setup work to follow along with the examples if you have a running installation of Mondrian, Microsoft SQL Server Analysis Services, or Palo.

Working with Mondrian

Mondrian is the default OLAP (or actually, ROLAP) server in any open source BI solution because Pentaho, Jaspersoft, and SpagoBI all base their analytic solution on Mondrian. As a result, the chances that you’ll run into a Mondrian cube are increasing every day. To date, however, Mondrian does not have writeback capabilities, so updating data in a Mondrian database is still a matter of updating the data warehouse itself and refreshing the Mondrian cache. Getting data out of Mondrian is something else; it’s extremely easy, as we’ll show here. In order to run the following examples on your own computer, we assume you have a Mondrian server up and running. The Foodmart sample database is used because that is part of a sample Mondrian installation. If you don’t have Mondrian or the Foodmart database set up but would like to do so, just follow the instructions at http://mondrian.pentaho.org/documentation/installation.php. The first step in a transformation using Mondrian data is, of course, the Mondrian Input step. As you can see in the online documentation (http://wiki.pentaho.com/display/EAI/Mondrian+Input), the step comes with an example query included. The example query uses the Sales cube from the Foodmart schema and we’ll do the same here. There are three essential elements for the Mondrian Input step to work:

N The Connection, which is a normal database connection to the relational database that Mondrian uses to get its data from
N The Catalog location, which is the schema file created with the Schema workbench
N The MDX Query to define the result set and retrieve the data

Figure 10-3 shows the completed input step for this example.

Figure 10-3: Mondrian Input step

You can immediately view the resulting output by selecting the Preview option of the step, which will show the output as displayed in Figure 10-4.

Figure 10-4: Mondrian output preview

You can probably see right away that this output is not really useful for reporting purposes. The column names are the least of your problems here because a few other things need further processing. You should:

N Clean up all the brackets.
N Split the Year and Quarter fields.
N Extract the Store Type.

As with many things in Kettle, you can accomplish this in several ways. One option, shown in Figure 10-5, doesn’t require any coding and consists of a series of steps to clean up the result set.

Figure 10-5: Completed Mondrian transformation

The first step is just to rename the columns in order to ease working with them downstream. The two “Split field” steps separate the values in a single field based on a split character. New field names can then be defined for the results, as shown in Figure 10-6 where the Split Time step is displayed.

Figure 10-6: Split Time step

You split the Store field with the same step type. You can, of course, use other options to get the same result. With a “Strings cut” step, you could take the fields and just get a substring from the field values. This is, however, quite risky: It might work now, but if someone decides to rename the All Store Types level to just All Types, you’re in trouble. The Split Fields step is therefore a more robust solution here. The subsequent “Select values” step enables you to get rid of the superfluous columns, and the two “Replace in string” steps take care of removing the two brackets from the field value. For completeness, we added a “Filter rows” step because there is no revenue information for the store type HeadQuarters. You can find the completed transformation in the Chapter 10 examples on the book’s companion site at www.wiley.com/go/kettlesolutions. Alternatively, you can solve everything in a single Java Script step. In addition to the fact that you don’t really need to write code here, the JavaScript option would require constructs such as var year = substr(time, indexOf(time, ".") + 2, 4);, just to get the year value from the time string. Getting the Quarter value (Q1, Q2, etc.) from the Time field needs a nested indexOf to find the second dot in the string, which makes matters even worse. Besides, why should you? The results obtained using the regular steps are displayed in Figure 10-7, and this is exactly the output we wanted.

Figure 10-7: Cleaned Mondrian output

For Mondrian, there is another option as well. As you may know, Mondrian is also an XML/A-compliant server, meaning that you could access it with the OLAP Input step, which is the subject of the next section.

Working with XML/A Servers

XML/A is short for XML for Analysis, and is an industry-standard protocol to communicate with OLAP servers over HTTP. It defines a SOAP web service that allows clients to obtain metadata and to execute MDX (multi-dimensional expressions) queries. XML is used as the data exchange format. The SOAP (Simple Object Access Protocol) envelope still contains the actual query that is usually written in MDX (although the standard also allows for DMX and SQL statements). You can find more information about XML/A at www.xmla.org. In order to work with XML/A servers, Kettle offers an OLAP Input step that uses Olap4J under the hood. The OLAP Input step, however, takes care of all the XML/A and Olap4J intricacies, so the only thing left to get the solution working is knowing how to enter the correct connection string and formulate a working MDX query. This opens up a world of opportunities because almost every OLAP analysis database, whether open or closed source, is accessible using XML/A. Examples are SAP BW, Oracle-Hyperion Essbase, IBM-Cognos TM/1, Mondrian, and Microsoft Analysis Services. To use XML/A, you need the OLAP database enabled to accept requests over HTTP. For most products this is not something that’s available out-of-the-box. The examples in this section are based on Microsoft SQL Server 2008 Analysis Services (MSAS) and the server used to access the OLAP database using XML/A needs some setup before it

can be used. It is beyond the scope of this book to explain in-depth how to configure XML/A access to MSAS, but here are a few steps and tips:

N For configuring HTTP access to SQL Server 2005 Analysis Services on Microsoft Windows Server 2003, go to ]iie/$$iZX]cZi#b^Xgdhd[i#Xdb$Zc"jh$ a^WgVgn$XX.&,,&&#Vhem. N For the 2008 versions of both Analysis Services and Windows Server, an updated version is available on ]iie/$$Wad\\^c\VWdji#cZi$Wad\h$b\aVhZg$ VgX]^kZ$'%%-$%-$&*$Xdc[^\jg^c\"]iie"VXXZhh"id"hfa"hZgkZg"'%%-" VcVanh^h"hZgk^XZh"dc"b^Xgdhd[i"l^cYdlh"hZgkZg"'%%-#Vhem. N For SQL Server 2008 running on Windows 2003, just use the first option. N You can test whether you can connect to the MSAS instance over HTTP simply by typing the URL into the new application you just created. If you followed one of the previous examples, try typing ]iie/$$1hZgkZgVYYgZhh3$daVe$bhbYejbe#Yaa where 1hZgkZgVYYgZhh3 needs to be replaced by the actual name or IP address. A response like MBA6cVanh^h:ggdg#%mX&%Z%%%'EVghZg/I]ZhnciVm[dg »<:I¼^h^cXdggZXi means that the HTTP access is working. N One of the trickiest parts of implementing a solution like this is setting up the correct access rights for the catalogs and cubes; a useful guide for this is available at ]iie/$$lll#VXi^kZ^ciZg[VXZ#Xdb$W'%%-T&'T'.#]iba. N If you just want to play around to check out how XML/A works in combination with MSAS and you don’t require a secure setup, you can enable Anonymous Access in the IIS security settings for the OLAP web application. Next, make sure that you add a role to the MSAS Database you want to access and assign the proper authorizations for the role. The role needs to have at least one (Windows) user added to it; if you just want to be able to access the cube, add the user that is used for the anonymous access to the role. Usually this will be the >JHGT1bVX]^cZ cVbZ3account, where 1bVX]^cZcVbZ3 needs to be replaced with your own server name. You can now try to access the OLAP server from Kettle. Start a new transformation and select the OLAP Input step from the toolbox. Enter the XML/A URL you created for the database; if you’ve used Anonymous Access, you can leave Username and Password empty. The last item needed for accessing the correct database is the Catalog name. This is the MSAS Database name as shown in SQL Server Management Studio, including spaces. Then you need to decide which data is needed from the OLAP cube; we used a fairly simple data set for this example with Year and Product Category on the rows, and Sales Amount and Gross Profit on the columns. Figure 10-8 shows the complete step to retrieve the data It should now be possible to do a Preview of the result set, as shown in Figure 10-9.

WARNING At the time of this writing, the OLAP Input step automatically triggers a process command forcing the cube to rebuild first before returning results, which takes several minutes each time.

Figure 10-8: OLAP Input step

Figure 10-9: XML/A query result set

If you run the same query directly in the Analysis Services query browser, you’ll notice among other things that the periods haven’t been deduplicated, as shown in Figure 10-10.

Figure 10-10: MSAS query result set

This would make further processing in Kettle a lot easier, so in order to use this data you need a few extra steps of pre-processing. As you can see, the repeating values from the cross join have been deleted from the result set. Because there is no standard step available that takes care of this for you, you need to use an alternative resolution, which is relatively simple to create. Before we begin the explanation, take a look at Figure 10-9 again and decide what you need to do to clean up the data for further processing:

N Rename the fields to get more meaningful descriptions than Column0 and Column1.
N Remove the brackets from the field names.
N Remove the $ sign.
N Replace the opening parentheses with a minus sign.
N Remove the closing parentheses.
N Fill out the empty year values.

This looks like a complicated transformation, but you actually need only two steps to accomplish this. (We could have used only one, but this solution makes the code easier to read and maintain.) The first two bullet points are the easy ones: Just add a “Rename fields” step and rename all the fields. The remainder of the list requires the following little bit of JavaScript (part of the MSAS_Example.ktr sample file from the downloads for this chapter):

var prevyear;

if (gross_profit != null)
  var gross_profit = gross_profit.replace("(", "-")
    .replace(")", "").replace("$", "");

if (sales_amount != null)
  var sales_amount = sales_amount.replace("(", "-")
    .replace(")", "").replace("$", "");

if (year != null) var year = year.replace("CY", "");

if (year != null) {
  prevyear = year;
}
else {
  year = prevyear;
}

The code starts by providing a placeholder variable (prevyear) to store the previous year value. The next three lines replace parts of the field values using the string.replace(search_string, replace_string) function. As you can see, you can add multiple replace calls to a single field, making this a very flexible way of transforming string values. The last part might need some more explanation. The thing to remember here is that the JavaScript step works row by row. This means that you can set a variable and as long as the condition to change it doesn’t change, the value of the variable doesn’t need to change either. Be careful, however: In this case, we knew what our input data looked like, and also knew that the first row contained a value for year. Based on that knowledge of the incoming data and the sort order, we can assign the value of the year field to the variable prevyear because the condition year != null will be satisfied at the first row. When the second row is processed, the condition year != null fails (the field is empty) so the else branch is executed. Because the prevyear variable didn’t change and still holds the value 2001, this value is assigned to the field year. The same occurs with the two following rows. Row number 5 again contains a year value, which is then assigned to the variable prevyear, and so on until all rows have been processed. If you retrieve more columns from the OLAP Input step, resulting in more fields with empty values, you can still use this approach to fill out all the field values. Figure 10-11 shows the cleaned output, ready to be used in the data warehouse.

Figure 10-11: MSAS cleaned results

Using XML/A and the OLAP Input step is probably the most flexible solution for this example and also the one to standardize on. Almost every OLAP database can be accessed this way, which means that you can use one standard way of transforming result sets obtained from these servers. As always, standardization is a good thing because it helps you build ETL solutions that are easier to maintain.

Working with Palo

Palo is the open source multi-dimensional in-memory database developed by German software firm Jedox (www.jedox.com). (If you're wondering about the origins of the product's name, just reverse the name and you'll know.) We assume here that you have a Palo server running; to get access to it from another machine, read the following tip. For installation and configuration instructions, just visit the Jedox website and download the open source version of Palo.

TIP To make the Palo server accessible from another machine, you'll need to add an http entry to the palo.ini file located in the Palo Data directory. The default is http "127.0.0.1" 7777, which means the server listens for local connections only. You can change this entry to the machine's IP address or simply add another entry. If you use multiple entries, make sure to use different port numbers. For instance, the palo.ini file on the server we use has the following entries:

http "127.0.0.1" 7777
http "10.0.0.71" 7778

The samples in this chapter have been created using Palo version 3.1 and the accompanying Demo database that was released in April 2010. There are several options to browse the Palo cubes. The default Palo for Excel installation for Windows contains the Excel add-in that's needed to work with Palo cubes. If you're a Windows or Linux user working with OpenOffice, you can use the PalOOCa add-in provided by Tensegrity Software (http://www.jpalo.com/en/products/palo_open_office.html). Figure 10-12 shows the Palo Modeler that's part of the OpenOffice add-in. In the figure, the Demo database with the Products dimension is opened, which also shows that this dimension has a hierarchy with three levels: All Products, Product Group, and Product. This information is needed later when working from Kettle because there is no explore functionality for Palo cubes in Kettle.

Figure 10-12: Palo Modeler with Demo database

Setting Up the Palo Connection

The first step is getting the Palo client connection software, which is developed by jPalo (www.jpalo.com). The client is called Palo Java API and can be obtained either by going to the download page, filling in your e-mail address, and getting a download link, or by downloading the file directly from http://www.jpalo.com/en/products/start-download.php?product=API&lang=en. The jPalo software can also be obtained from SourceForge (http://sourceforge.net/projects/jpalo/) but that repository usually contains a slightly older version. And, of course, you can always download the latest source code version from trunk and compile it yourself.

NOTE The term trunk refers to the latest development version of a product. For more information see http://en.wikipedia.org/wiki/Trunk_(software).

After the download, unpack the .zip file and copy the jpalo.jar file from the lib directory to the Kettle libext directory. The Palo steps can now be used to connect to a Palo cube. The following steps are available, two in the Input and two in the Output step folder:

- Palo Cells Input
- Palo Dimension Input
- Palo Cells Output
- Palo Dimension Output

Before moving on, it's a good idea to check whether you can make a connection to the Palo server. If no job or transformation is opened yet, just create a new one or open one of the existing jobs/transformations. Open the View tab of Explorer and right-click on Database connections to create a new one. You can either use the wizard (also available from the Wizard menu options) or jump right to the Database Connection screen. The host name, database name, port number, user name, and password need to be filled in. If you used the default install on a local machine, the values for these settings are as follows:

Host name: localhost
Database Name: Demo
Port Number: 7777
User Name: admin
Password: admin

When you click on Test, Kettle should display a message telling you that the connection succeeded. If not, check your connection, or your firewall settings if you're running the server on a different machine.

Palo Architecture

Before starting to read data from or write data to a Palo cube, you need a little background about the inner workings of the database. Although Palo is an in-memory OLAP database, this doesn't mean that there is no persistency layer available. The data is stored on disk, just like with any other database. There is some similarity between Palo and Mondrian in the sense that both need to read the data from disk into their memory cache when they start up, but that's about where the similarity ends. The Mondrian cache is partial, whereas the Palo database is fully loaded into memory. The Palo data on disk is stored in plain-text files, as shown in Figure 10-13. The figure shows a default Linux install where the data is stored in a subdirectory of the Palo server (ps), and displays the first part of the database definition file.

Figure 10-13: Palo Data files

The fact that the database is just a plain folder with text files also means that making backups or database copies is very simple. Stop the Palo server, compress or save the directory, and move it to wherever you want to store it. If you want to move a database to another Palo server, just stop the Palo service on the destination machine, place the complete directory in the Data folder, and start the service again. The copied database is then immediately available from the Modeler.

There are a few features that are specific to Palo and might help in understanding the best way to use Kettle to read or write data. First, there are no predefined dimensions or dimension types. Most OLAP servers make a distinction between a regular and a time dimension, and also treat measures differently. Palo doesn't make this distinction: Every dimension has the same options. Palo also has only two data types for an element: String and Number (an element is the basic building block of the cubes and forms the lowest detail level in a dimension). When retrieving data from a Palo cube using Kettle, you'll have to specify whether the content of the elements you're reading or writing is of type String or Number, but be careful here: In Palo, you'll see that, for instance, the Month elements are of type Number, but when you specify Number as the data type in the Palo Dimension Input step in Kettle, you'll get a conversion error. Again, there are no dates, so a date hierarchy can be created in every imaginable way. If you take a look at the date dimension, you'll see that there is a separate Year and Month dimension, but it could have been a single dimension as well. Using a separate Year and Month dimension enables you to create crosstabs with Year on the X axis and Month on the Y axis, which you cannot do with a single time dimension.

A Palo database can also contain attributes. An attribute can be used to store an extra piece of information for a certain element. The Demo database, for instance, contains the attribute Price per Unit in the Product dimension. This means that for each product, the price can be stored. To date, there is no way for Kettle to get to this information directly. If you need to retrieve this information, you can use the Palo add-in to paste the values in a spreadsheet and subsequently read the data using a Kettle transformation.

The last feature to be aware of is Palo's fine-grained authorization. Each user belongs to a group, and each group can have one or more roles attached to it. Roles determine what a user can do on the server (create other users, delete cubes, define rules, and so on); groups are used to determine the access level for the data itself. If you can connect to the database using the admin account, this won't be a problem, but in a production situation this is rarely the case, so you need to be aware of possible limitations of the account you use for ETL access.

Reading Palo Data

The two input steps for Palo data in Kettle are Palo Cells Input and Palo Dimension Input. The former will get you the detailed data at the lowest levels of the Palo cube (the facts); the latter retrieves the corresponding dimension information, but only for one dimension at a time. If you want to retrieve the complete cube and all the dimensions and hierarchies, you'll need to create a Dimension input step for each dimension in the cube, plus a Cells input step to retrieve the cube content. Let's start by retrieving the Product information using a Palo Dimension Input step so we can show what it does.

Create a new transformation and drag a Palo Dimension Input step to the canvas. Select the connection you created earlier and in the Dimension drop-down list, select Products. Now you can retrieve the dimension level information using the Get Levels option. This will create three rows showing the Level Names Level 0, Level 1, and Level 2, with the corresponding level numbers. This is all the metadata Kettle can get from the Palo database. The Field values are the names you want to give to the columns in the Kettle transformation and need to be entered manually. The Type values need to be selected using the drop-down list. There are only two options: Number and String. Although these values correspond to the types found in the Palo cube, these are actually the types that Kettle uses internally to cast the data to. In this case, the Palo metadata lists all the product elements as type Number (see the gray "123" indicators in Figure 10-12), but in Kettle, they need to be defined as String because the descriptions contain alphanumeric data. Figure 10-14 shows the completed input step.

Figure 10-14: Palo Dimension Input configuration

Figure 10-15 displays the data preview, which is, in essence, a three-level dimension table now, but without any additional keys and dates yet.

Figure 10-15: Palo Dimension Input preview

Given that the product_name column is the dimension key, the only data that can change here is the content of the all_products or product_group columns, and a Dimension lookup step can be used to recognize those changes. To transform this Palo Dimension Input step result into a relational product dimension, you could add a surrogate key using an "Add sequence" step, and other fields such as date_from, date_to, and is_current using an "Add constants" step.

The next task is loading the actual fact data from a Palo database. In order to do this, you need to drag a "Palo Cells Input" step to the canvas and specify which cube you want to extract data from. If you're using the admin user account, the Cube drop-down list will show all the system cubes as well, but we're only interested in the first one, called Products. The next options in the screen need some explanation. As we explained in the chapter introduction, a cube contains values at various intersection points of dimension and dimension hierarchy levels. In Palo, the data is entered at the lowest cell level and aggregated at run-time. There is no way to retrieve data at an aggregated level using Kettle. Ultimately, a numeric value will be available. In the case of the Sales cube in the Demo database, these are the values stored at the intersection of the six dimensions at the lowest detail level. These numeric values should be given a name because they don't have a specific name in Palo; that's what the Measure Name field is for. The Measure Type will be a Number, and with the Get Dimensions button you can retrieve the dimension information. You'll notice that the Field names are automatically copied from the Dimension names, but you can alter them if needed. As with the Dimension Input, you'll need to define the Dimension Type. String will always work, but if you're sure all the values in a dimension are numeric (such as Year in this case) you can use Number, too. Figure 10-16 shows the filled out Input step with the preview results next to it.

Figure 10-16: Palo Cells Input definition and preview

As you can see from the figure, only the lowest level dimension entries are retrieved (for instance, product Desktop L). In Figure 10-16, you can also see that this data cannot be used easily for reporting as it is. The Measures column containing the values Units,

Turnover, and Cost of Sales needs to be pivoted first so that for each of these measures a separate column is created. In Kettle terms, you need a “Row denormaliser” step, which will do the magic for you. A Denormaliser requires a key field (the pivot field), which in this case is Measures. The Group fields are all columns except the Amount (that will be used as the Value field) and the Measures because that’s the key field. The Target fields need to be defined manually and you can name them whatever you want. The Value fieldname and Key value need to be typed in as well, so be careful to spell the names exactly as they are returned by the Palo Cells Input step. In Figure 10-17, you can see the completed Denormaliser step.

Figure 10-17: Palo data denormalised
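If it helps to think of the pivot in relational terms, the Row denormaliser configured above does roughly what the following SQL sketch does over a staged copy of the Palo Cells Input output. The table and field names here (palo_cells_staging and the lowercased dimension columns) are assumptions for illustration only, not names Kettle generates.

-- Conceptual SQL equivalent of the Row denormaliser: one output column per
-- value of the Measures key field, with Amount supplying the values.
SELECT products, regions, months, years, datatypes,
       SUM(CASE WHEN measures = 'Units'         THEN amount END) AS units,
       SUM(CASE WHEN measures = 'Turnover'      THEN amount END) AS turnover,
       SUM(CASE WHEN measures = 'Cost of Sales' THEN amount END) AS cost_of_sales
FROM   palo_cells_staging
GROUP  BY products, regions, months, years, datatypes;

The SUM here only collapses the three measure rows per dimension combination into a single row; no real aggregation takes place because each combination contains exactly one row per measure.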

Because all the data is already at the lowest level, you don’t need to use aggregation here. To test whether this will get you the required results, you can do a preview for which the output is shown in Figure 10-18.

Figure 10-18: Denormalised result set

This data can now be further processed using dimension lookups and loaded into a fact table. As explained before, Palo is an excellent tool for budgeting, planning, and analysis. Using Kettle, you now have the ability to retrieve this data and load it into the data warehouse where you can report on it in any way, using any tool you like.

Writing Palo Data

The two output steps to write data to a Palo database using Kettle are, not surprisingly, the Palo Dimension Output step and the Palo Cells Output step. These steps are not only used for inserting data into existing cubes; the Dimension Output step can also be used to create new dimensions, or add new elements to existing ones. The Cells Output step is a bit more strict in the sense that the structure of the data should adhere to the existing cube definition. What we'll do here is first add a new product to the product dimension, and then load data for this new product. You can't just upload new data using the Cells Output step, unless all the dimension values used already exist. This is why adding new data is always a two-step approach: First insert the dimension data, then the cell data. If you just want to update existing data, this problem doesn't exist.

WARNING Back up the Palo database before you start updating or loading new data into it; trying to write measures to unknown dimension values could corrupt your database.

To give you an idea of the possibilities of the Palo output steps, we'll end this chapter with a couple of examples. First we'll update an existing cell in the database. We need to know the correct values of the dimension elements, so having Calc or Excel open with the Palo add-in is very helpful. Then it's simply a matter of creating a data source containing the new values. In Figure 10-19, the input consists of a Data Grid step with one row of data that will update the Budget Units value for the product Desktop L in January 2002 in Germany. If you look at the Demo database you'll notice that the value for this cell is 1135, and we'll change it to 99 using the Palo Cells Output step.

Figure 10-19: Updating a Palo database

The proof that this actually works can be seen in Figure 10-20, which shows a screenshot with the output. The updated row is clearly visible.

Figure 10-20: Update results in Open Office Calc

TIP If new dimension entries don’t show up immediately in OO Calc, just disconnect from and reconnect to the database.

The next example involves adding a new product category and a few products to the database using a Palo Dimension Output step. Note that this step allows you to create new dimensions, as well as clearing the dimension before you load it. Be careful with this last option in an existing database because all the values that reference the dimension data will become obsolete. We can again use a Data Grid step to create the data and feed this into the Palo step, as shown in Figure 10-21.

Figure 10-21: Adding a new Product category and Products

In Figure 10-21, you’re not only looking at the data feed in the grid, but also at the resulting updated product dimension with the new product group and products. After adding another value (2010) to the Year dimension, we can upload new budget values for our new product line, as shown in Figure 10-22. Chapter 10 N Working with OLAP Data 291

Figure 10-22: Adding budget values for new products

Figure 10-22 shows both the Data Grid we used for the data input, including the data added to the cube, and the results in Open Office Calc. As you have seen in this section, working with Palo databases is very easy with the four steps available in Kettle. Palo can be a great asset for your organization’s BI efforts.

Summary

This chapter covered everything related to working with multi-dimensional data, also known as OLAP cubes. The first part of this chapter was a general introduction to the benefits of using OLAP technology and how it fits in a business intelligence or data warehouse solution. Next, we provided three sets of examples, illustrating how the different OLAP steps in Kettle can be used. We covered the following products and techniques:

- Reading data from a Mondrian cube using the Mondrian Input step with MDX: The data retrieved was cleaned up for further processing with the additional steps "Split fields," Select Values, "Replace in string," and "Filter rows."
- Reading data from a Microsoft SQL Server 2008 Analysis Services cube using the OLAP Input step: The retrieved data was cleaned up for further processing using a single JavaScript step that showed how to loop through a data set to enter missing values.
- Reading data from and updating data in a Palo database using the four available Palo steps: We showed how to de-normalize data extracted from a Palo cube using the "Row denormaliser" step, and showed how to update data in an existing cube, extend a cube with new dimension information, and add new detail data for the newly created dimension values.

As we've demonstrated in this chapter, Kettle is one of the very few available ETL products that can work with nearly every OLAP database on the planet, including Hyperion Essbase, SAP BW, Microsoft SQL Server Analysis Services, IBM Cognos TM/1, Mondrian, and Palo.

Part III

Management and Deployment

In This Part

Chapter 11: ETL Development Lifecycle
Chapter 12: Scheduling and Monitoring
Chapter 13: Versioning and Migration
Chapter 14: Lineage and Auditing

Chapter 11

ETL Development Lifecycle

In previous chapters, we introduced several parts of Kettle and showed how they fit into the 34 ETL subsystems identified by Ralph Kimball. This chapter is broader in scope in that we cover the total development lifecycle, not just individual pieces. Of course, we dive into some specific topics in detail as well. Developing ETL solutions is an important part of building and maintaining a data warehouse and, as such, should be considered as part of a process, not an individual project. A project has one or more pre-defined goals and deliverables, and has a clearly defined start and end point. A process is an ongoing effort with periodically repeating activities to be performed. Creating ETL solutions is usually conducted as part of a project; monitoring, maintaining, and adapting solutions is part of an ongoing process. Adapting existing ETL jobs can, of course, be done in a more project-oriented setting.

This chapter focuses on the initial part of the lifecycle where a new solution is being built. ETL solutions go through analysis, design, build, test, documentation, and delivery stages, just like any other piece of software. The challenge, of course, is to go through these phases as quickly as possible at the lowest possible cost and with preferably zero rework due to errors. As you will see in the subsequent sections, Pentaho's Agile BI tools are ideally suited to support this.

Solution Design

Any solution should be based on a design; jumping into coding without a plan is a common mistake made by developers, not just ETL developers. Even if you just start out by sketching the transformation steps needed on the back of an envelope, you're already improving the quality of the solution. At the other extreme, you could go totally overboard by spending countless hours specifying and designing solutions using workflow design tools and whatnot. We think the proper way of doing this lies somewhere in the middle between these two extremes.

Best and Bad Practices

An ETL design effort can only be started when the initial data warehouse design is done. Sometimes developers work their way toward a data warehouse from the source systems using their ETL tools. In fact, as you can see in Figure 11-1, Kettle supports such a way of working: whenever you augment a transformation that connects to a destination table, one click on the SQL button (highlighted in the figure) will show the change script to adapt the table to the new output specification.

Figure 11-1: Table output alter script

Although this is a great piece of functionality for prototyping, we strongly advise you against using it for building production data warehouses. A good data warehouse should be based on a solid and well thought-out design. That design should be based on business requirements, not on what data is available in a source system. Since this is a book about building ETL solutions using Kettle and not about data warehouse design, we won't go into the details of data warehouse design. An excellent book on this topic is Mastering Data Warehouse Design by Claudia Imhoff et al. (Wiley, 2003); our previous book, Pentaho Solutions (Wiley, 2009), can also serve as a starting point for your work with data warehouse design. Even if we won't cover the actual design of the data warehouse and data marts, the first best practice you should adhere to is to take the time to create a proper design for these essential parts of your business intelligence solution. Other best practices that help improve the quality of your solution include:

- Peer/expert reviews: During the project you should plan for regular peer/expert review sessions to make sure that the solution will meet all criteria set out at the beginning. At a minimum, you should go over the overall ETL design before the actual implementation starts. A customer or manager might complain that this will increase the total project cost, but actually it does the opposite by minimizing the risk of rework and ensuring the best balance between development effort and the quality of the final result.
- Standardization: Developing standards and templates at the beginning of a project makes the effort of building the ETL solution a lot more predictable. Initially it takes time to set up those standards, but that effort will pay off many times over during the rest of the project lifecycle.

The next two subsections will cover two other best practices in more depth: data mapping, and naming conventions (which is a part of the general standardization best practice mentioned above). We conclude with a list of "bad practices" or common pitfalls that you need to avoid.

Data Mapping

Assuming you already have the data warehouse design in place, the initial challenge will be to find out where the data should come from. Chapter 6 covered the data profiling and extraction work involved. But as soon as your sources and target are known, the fun begins. How do you map the data from source to target? Which transformations are needed and how do you split up the workload into logical and manageable chunks? The latter is the subject of the next section; we'll cover the data mapping part here.

First, you should try to find the source system meta documentation, or at least get access to the metadata. If there are a lot of tables and columns involved, you don't want to enter them by hand. As a best practice, create a spreadsheet with all the target columns on one side, grouped by the table they belong to. Then, define the source for each of these columns. If fields need to be combined into a single target column, add additional rows to the spreadsheet to accommodate this. Each column on both the source and target side should also have the data type specified. Then you can use one or more columns in the spreadsheet between the source and target fields to specify the transformation. When no source column is available (for example, for a dimension version number or surrogate key), specify the process, function, or Kettle step that will deliver the correct value for the field.

TIP Almost any database system will let you extract the metadata for the tables and columns. With Kettle, you can easily read this data and write it to a spreadsheet—no typing required and no chance of forgetting anything. Just add a Table Input step and an Excel Output step to a new transformation. Add the required metadata extraction query to the Table Input step and write the wanted columns to the spreadsheet. In the example in Figure 11-2, you can see the query for the sakila_dwh tables and columns, and the resulting XLS file with the metadata information opened in Open Office Calc.

Figure 11-2: Sakila_dwh metadata
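As a minimal sketch of such a metadata extraction query (the actual query used for Figure 11-2 may differ), the following statement against MySQL's information_schema returns one row per column of the sakila_dwh schema, ready to be written to a spreadsheet by the Excel Output step:

-- List every table and column of the sakila_dwh schema with its data type.
SELECT table_name,
       column_name,
       data_type,
       character_maximum_length,
       is_nullable
FROM   information_schema.columns
WHERE  table_schema = 'sakila_dwh'
ORDER  BY table_name, ordinal_position;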

Naming and Commentary Conventions

Many people fall into the trap of using too many naming conventions, while others use none at all. Especially if your team consists of more than one person, or your work needs to be transferable to a third party, naming standards can greatly improve the maintainability of the solution. Each object in Kettle can be given a specific name and, when combined with the icons used for jobs, transformations, and steps, this makes for easy and understandable solutions. One important caveat is to avoid using the same names for jobs and transformations in one solution. Using load_data as a job name where one of the transformations is also called load_data is unnecessarily confusing. It's not a bad idea to use the prefix jb_ for all your jobs and tr_ for the transformations. Furthermore, using names for the transformations based on the type of work they are performing or the table they are loading is a good convention as well. An extraction transformation for the customer table will then be called something like tr_e_customer, or when the extraction and staging takes place in a single transformation, it might be tr_stg_customer. Loading the corresponding dimension table can then be performed with a transformation called tr_dim_customer, and so forth. Some organizations also use the source system code in the prefixes, so if you're reading some customer data from an SAP system and some from a Siebel CRM application, you'll get a tr_stg_sap_customer transformation and a tr_stg_crm_customer transformation. With steps it's a different story; they should at least get a meaningful name that reflects what action the step is performing.

Last but not least are the notes. Notes allow you to add inline comments to your jobs and transformations, which is a great help to anyone trying to understand your work. To further improve the transparency of your solution, you should also use the description field of jobs, transformations, job entries, and steps. In contrast to notes, these description fields are "bound" to the corresponding object, so even if you copy and paste a step, the documentation gets copied as well.

An excellent example of a well-documented transformation can be found in the standard Kettle samples and is called

Common Pitfalls

Other than avoiding the “just start coding” trap, there are a few other pitfalls or bad practices to consider:

- Not talking to end users: As an ETL developer, you might think that dealing with end users is for the people involved with project management or developing front-end tools. Nothing could be further from the truth! It's the business user who can tell you exactly if the data you're delivering is correct or not, and who can tell you what the meaning of the data in the source system actually is (which might be different from what the manual or the metadata tells you). The "Agile Development" section later in this chapter will show you how you can work jointly with your end users to develop great solutions, fast.
- Making assumptions based on other projects: Maybe your last project was for a different client that had the expensive online backup tool and where there was a 12-hour batch window to run ETL jobs. That doesn't mean that this project will be the same.
- Ignoring changing requirements: In many cases people start building a solution based on open or unsettled requirements and don't deal with changing requirements during the project life-span. The recommendations in the "Agile Development" section later in this chapter can help you avoid this pitfall.
- Ignoring production requirements: It is quite common to build your ETL solution in a development environment using a limited set of data. Make sure you have the ability to run your solution against a full production-size data set using production hardware in an early stage of the project. Usually your production environment will be more powerful, but we've seen situations where a process ran within 30 minutes on a development system but took several hours on the designated production machine.
- Over-engineering the solution: Kettle will save you a lot of time compared to other ETL tools, and there are always ways to make a solution even better than it is. Don't fall into this trap, however; when the jobs run within the designated batch windows, do what they need to do, and handle errors correctly, they're good enough. Maybe not perfect, but good enough will do in most cases. At some point, the extra work stops adding value.

- Skipping testing: If the job runs on your development environment, it can be moved right into production, can't it? Wrong. A technically error-free process doesn't mean that the outcomes are correct. More on this later in this chapter.
- Failing to back up your data: Lest we forget: make backups regularly! Although it is rare, your repository can become corrupt and thus unusable.

ETL Flow Design

An ETL solution is actually a process, which also means it can be designed as one. The process can be kicked off automatically at a designated interval, or can be waiting for a system event such as the availability of a file or system. Then there are inputs for the process: files, records, and messages; and transformations that operate on these inputs. Roughly speaking, it's not that different from a business process such as the ones you'll find in a claims department or a car factory. Based on this similarity, you could be tempted to work with the various business process diagramming tools out there to design the ETL solution, but we have a better idea: Why not just use Kettle for this as well? Consider the example in Figure 11-3.

Figure 11-3: Job design

Figure 11-3 shows a typical main job that kicks off other jobs or transformations. This is a diagram that can be used to give a high-level overview of the ETL process to a business user, and it takes only a couple of minutes to create. If that seems quick, think about this: The job and the transformations shown don’t do anything yet. They are basically just placeholders for the things to come, and serve as a good way to check with a domain expert about whether you’re forgetting anything. Also don’t forget to add items like the details put in the note—the plan to start the job each night at 2 a.m. It could be that the backup of the source systems is still running at that time, or that the required files won’t be available until 6 a.m. By showing and discussing the design early in the development process, you’ll avoid unpleasant discussions later on.

Reusability and Maintainability

One of the most important principles of structured programming is dividing your code into generic, reusable blocks. With Kettle, you can adhere to this same principle. The available building blocks, jobs, and transformations already allow for easy division of the steps in an ETL process. The capabilities to nest jobs in jobs and transformations in jobs, and to reuse transformations as sub-transformations via a mapping step, offer an endless list of possibilities. A good example of reusing functionality appeared in Chapter 4, where the fetch_address transformation was used in both the dim_customer and dim_store transformations via a mapping.

Reusability can also be accomplished by consistently using variables, especially for repeating elements such as error e-mail addresses or database connections. Remember that there are two ways of reusing elements within Kettle: copy/paste and real reuse. Whenever you find yourself using copy/paste, consider whether you'd be better off by placing the copied objects in a separate job, transformation, or mapping that can be called from other parts of the process. Using the Mail job entry poses another challenge: Many fields need to be filled in (sender, recipient, message, SMTP server, and so on) so it's tempting to just copy it to other locations after having done the data entry once. You could indeed do this, but only if you've used variables for all the fields. When you use variables, it doesn't matter how many copies are being made; the value of the variables needs to be changed only once—unless, of course, someone decides to alter the variable names. The basic question you'll always have to ask yourself is this: If I copy this step/transformation/job and I need to make a change, in how many spots do I need to make this change? If the number in your answer is greater than 1, try to think of a better solution.

Back to the Mail job entry mentioned earlier. Remember that there are two Mail components in Kettle: one for jobs and one for transformations. Here we are discussing the Mail job entry. When you want a single Mail job entry to send an e-mail in case an error occurs during processing, create a new job called jb_ErrorMail or something similar, with just a single Mail step in it. This jb_ErrorMail job can be used as many times as you like, and whether a variable or anything else changes, the change needs to be applied in only one place. Another good example of reuse is using variables for all database connection information. This enables you to migrate jobs and transformations from development to test to acceptance to production without the need to change anything inside your transformations. We cover migration in depth in Chapter 13.

Agile Development

The hype du jour in the BI world seems to be about "agile development." We mentioned this briefly in Chapter 1 and want to elaborate a bit more on the subject here. There's a very good reason for doing that, too: Pentaho in general, and specifically Kettle, is not only positioned as an agile solution but also equipped with many features that support an agile way of working. There are many agile development methodologies, with Scrum and XP (Extreme Programming) probably the most well known. With Scrum, artifacts (deliveries such as code blocks or transformations) are posted in a backlog, and each development phase (or sprint) aims at clearing part of the backlog. Both the backlog and the final product are subject to constant changes. This doesn't mean that projects will run endlessly (there are time and financial constraints as well), but does mean that agile projects have a much more dynamic nature. These dynamics are reflected in the Kettle ETL design tool, Spoon, as well.

Starting with Kettle 4, Spoon will be the cornerstone of the Pentaho BI suite for developing BI solutions. Originally, Spoon was purely an ETL development tool. Now, it also allows you to work with so-called perspectives, with the modeling and visualization perspectives already available. These new additions to the Spoon IDE make it possible to design a multidimensional model and visualize the data in the solution in one easy-to-use workflow. Spoon has adopted the perspective idea from Eclipse so that you can work on different types of solutions, all from within the same familiar interface. The Community Edition contains the Data Integration, Model, and Visualize perspectives, while the Enterprise Edition adds a fourth perspective called Scheduling. The workflow for which the Agile BI initiative offers the foundation is depicted in Figure 11-4, where you can see the three activities: Design ETL, Build Model, and Visualize.

Figure 11-4: Agile BI initiative

The data integration capabilities should be very familiar already, but modeling and visualization are newly added capabilities to Kettle 4. With these related activities combined in a single toolset, it becomes very easy to sit down with an end user and interactively design the initial prototype for a data mart by showing the data as it will look to a user in the final solution. Be careful, though; everything we mentioned earlier in this chapter about creating a proper design for your data warehouse, data marts, and ETL solution still holds. The Agile BI tools are great for prototyping, not (yet) for designing an enterprise data warehouse!

NOTE At the time of this writing, the Model and Visualize steps in Kettle were restricted to a single output table. The single output table might seem like a limitation but for quick prototyping it’s not a real handicap.

EXAMPLE: WORKING WITH THE AGILE BI TOOLS IN KETTLE

In order to quickly analyze the data in your data warehouse or data mart from Spoon, you'll need two things:

- A data set that can be put into a single output table
- A multidimensional model built on top of this output table

At first it might seem awkward to put everything in a single table, but as you'll discover shortly, this doesn't have to be a problem. In fact, it makes the modeling very simple and straightforward. If the data is already available in a star schema database, it's very easy to create a flat model out of it that can be used with the modeler available from the Modeling perspective. The database used for this example is the same sakila_dwh database that was created in Chapter 4. It contains a single star schema with movies, customers, stores, staff, films, and a fact table with movie rental information. You need to do an extra denormalization step to create a single view of the dimensions and facts you want to analyze initially. There's even a name for such a flat data model: the One Attribute Set Interface (OASI) model, as described by Dutch data warehouse expert Dr. Harm van der Lek in his Dutch book Sterren en Dimensies (Stars and Dimensions), ISBN 90-74562-07-8, only available online at http://array.nl/Boeken. An OASI model contains all the non-key attributes available in a star schema. The following query will serve as the denormalization step you need to create an OASI model that you can use to build a multidimensional model:

SELECT
  c.customer_country AS customer_country,
  c.customer_city AS customer_city,
  d.year4 AS year4,
  d.month_number AS month_number,
  d.year_month_number AS year_month_number,
  f.film_title AS film_title,
  f.film_release_year AS film_release_year,
  f.film_in_category_action AS film_in_category_action,
  f.film_in_category_animation AS film_in_category_animation,
  f.film_in_category_comedy AS film_in_category_comedy,
  f.film_in_category_drama AS film_in_category_drama,
  f.film_in_category_family AS film_in_category_family,
  SUM(r.count_rentals) AS rentals,
  SUM(a.amount) AS rental_amount
FROM dim_film f
INNER JOIN fact_rental r ON f.film_key = r.film_key
INNER JOIN dim_customer c ON c.customer_key = r.customer_key
INNER JOIN dim_date d ON d.date_key = r.rental_date_key
INNER JOIN amounts a ON a.rental_id = r.rental_id
GROUP BY
  c.customer_country, c.customer_city,
  d.year4, d.month_number, d.year_month_number,
  f.film_title, f.film_release_year,
  f.film_in_category_action, f.film_in_category_animation,
  f.film_in_category_comedy, f.film_in_category_drama,
  f.film_in_category_family

Note that this is not the complete set of non-key attributes, but a subset that will demonstrate the principles behind the Model/Visualize concepts. Because the purpose of data analysis in this phase of a project is to get a good grasp of what a user can expect, it doesn’t make a lot of sense to load all the data in the target output table. A nice way to randomly select rows from a large data set is to use the Reservoir Sampling step and set the number of rows to be sampled at a fixed number. In this case, you’ll read 1,000 rows into the target table, as displayed in Figure 11-5.

Figure 11-5: Reservoir Sampling
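If you prefer to sample in the database rather than in the stream, a quick alternative (fine for prototyping, though ORDER BY RAND() does not scale to large tables) is a statement like the following sketch; rentals_flat, the table holding the result of the OASI query, is an assumed name:

-- Load a random sample of 1,000 rows into the output table used by the Modeler.
CREATE TABLE rentals_flat_sample AS
SELECT *
FROM   rentals_flat
ORDER  BY RAND()
LIMIT  1000;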

For the Modeler to work on the output table, it needs to be physically available in the database and loaded with the input data, so this transformation needs to run before you can start the Modeler. When you right-click on the output table, you'll see the Model and Visualize options at the bottom of the available option list. You can also click the Model button on the shortcut bar. The Modeler screen that is displayed contains three panels; from left to right you can see the available data items, the model itself, and the properties of the selected model element. Figure 11-6 shows an example model created on top of the data selected in the previous step.


Figure 11-6: Example Analyzer model

Creating this model is pretty straightforward, and the modeler will automatically apply some basic rules to generate names from the query results. Figure 11-6 shows the data elements on the left, and as you will notice the return columns from the query have either been used as is or, in case the names contain underscores, the underscores have been replaced by spaces. This is all done automatically without the need to create a separate metadata model. There is one thing to be aware of, though; it's a multidimensional model. In Figure 11-6, you can see what this means for single attributes such as "Film in category drama" or "Film in category comedy." These attributes must be put in a separate hierarchy because there is no dependency between those fields. This is different for Year-Month or Country-City, which are real hierarchies and can be defined as such. For an easy overview of the available objects within their respective hierarchies, you can select View ➪ by Category, as displayed in Figure 11-7. What you can also see in this figure is why this tool is called Pentaho Analyzer: With a few clicks of the mouse you can filter the top 10 countries based on revenue, and do an analysis on the comedy category to see which countries score best in this category.


Figure 11-7: Example Analyzer report

A second option, which we won't show here, is the Report Wizard, which can be used to create a standard report. The magic behind the scenes is performed by the metadata layer (the Model), which can also be published to the Pentaho server and be used in the web portal.

It’s not so much the technical wizardry that can be accomplished using the Agile BI toolset; it’s the communication with end users that adds the real value. Working in an iterative way with immediate visualization options makes it possible to get answers right away when you’re not sure what exactly is meant by things like “cost,” “price,” or “date.” Being able to quickly replace “order date” with “delivery date” saves a lot of frustration afterwards.

Testing and Debugging

Although testing and debugging are often considered two sides of the same coin, there are some notable differences between the two. Debugging is an activity that follows testing in case an unpredicted or unforeseen error occurs, so if the testing phase runs smoothly, debugging isn't even necessary. Historically there was a distinction between black box testing, where no knowledge of the internal working of the tested object was presumed, and white box testing, where this knowledge was required. Nowadays, the terms functional or behavioral testing are more widely used for black box testing, although behavioral testing does allow some prior knowledge of the internals of the tested object. Common terms for "white box testing" are structural or clear-box testing. The problem with all these test methodologies is that they have been designed to test software, and what's more, software that has a user interface and predefined use cases.

For an order entry system, for instance, a functional test could be that each customer record should have a phone number and, if not, the user should get a warning. The test is then conducted by trying to save a customer record without a phone number. The correct outcome should be that an error message is displayed and the record isn't saved. If that's the case, the test succeeded. If not, it failed. How exactly the system verifies the empty field and handles the exception is irrelevant. An ETL process is different; it doesn't have a user interface and it's also not triggered by a user but is usually scheduled automatically. The functional test therefore has different characteristics than when testing a typical software package. The structural test is also different and even simpler: You'll use the same tools for both developing and testing the solution. As you'll discover later, Spoon offers many ways for easy testing of your ETL solution.

Test Activities

There are complete libraries available about software testing so we won't go into too much detail here, but we do want to explain some of the terms you might not be familiar with. We already mentioned functional/behavioral testing and structural/clear-box testing. Static testing is about reviewing documentation, code inspection, and walkthroughs, and is done without using the software. In the case of a Kettle solution, static testing can be done by going over the different parts of the solution based on a design document and checking whether the solution makes sense. Typically this is the kind of work a senior developer does when parts of a solution have been built by a less experienced ETL developer to get a general impression of the quality of the solution. Dynamic testing involves actually using the software to find out whether it adheres to the required functionality. The activities we'll cover during the remainder of this chapter are all part of dynamic testing.

Three other familiar terms in testing are unit tests, integration tests, and regression tests. The names of these activities speak for themselves: with a unit test, an individual component (a single transformation or job) is the object of inspection. An integration test is performed on the completed solution, usually consisting of many jobs and transformations. Regression testing is the activity where a change is made to one component and is aimed at proving that the total solution still works as expected after committing the change. The ultimate test is the UA, or User Acceptance test. UA testing might seem like a strange activity for an ETL solution because end users won't be working with the system directly. They do, however, work with the data that is loaded into the data warehouse, and should be involved in testing the completeness and reliability of the data. Usually it is the end users who immediately spot an error made in a calculation, or notice missing data before the ETL team does. Data warehouse UA testing also involves working with the complete BI solution, including the front-end reporting and analysis tools. Testing the data warehouse this way is beyond the scope of this book so we'll have to restrict ourselves to the ETL testing itself.

When working with Kettle, unit testing should be part of the regular development work. The next section highlights the types of tests that should be conducted and, as you’ll see, Kettle performs some of these tests automatically.

ETL Testing

Organizing the test activities of ETL jobs and transformations is a challenging task, if only for the single fact that the number of possible use cases is virtually unlimited. Testing requirements will also differ based on the kind of environment the solution is meant for. In organizations where Sarbanes-Oxley (SOX) compliance is mandatory you need to have a fully auditable process. Section 302 of the Sarbanes-Oxley Act in particular has consequences for the way your ETL process is designed and tested (see http://en.wikipedia.org/wiki/Sarbanes-Oxley_Act for more information). These harsh legal requirements could tempt you to mandate 100 percent true and tested ETL solutions, but if you try to accomplish this you'll find out soon that it's impossible to test every exception that might occur during data transformation. What can be done, however, is to prove that under normal circumstances the system completely and correctly transforms a given set of input data. Note the terms normal circumstances and a given set. By the latter we don't mean any given set of input data!

Test Data Requirements

In order to test ETL processes, you need test data. This data should be as close to the actual production data as possible, in both volume and content. It’s sometimes hard to get to production-like data, and in some cases it’s not even permitted due to legal constraints. Some factors to consider when preparing data for testing or getting access to systems include:

- Volume: The sheer volume of data might prohibit conducting unit tests early on in the project. No matter what, at some point you do need to run a full integration/stress test using a full set of production data. For first stage unit tests, try to get a reasonable size of test data, which is a true random subset of the production data.
- Privacy: Production data containing financial account details, medical records, or other sensitive private information is not usually available for testing. Sometimes it is possible to obtain an obfuscated set from these systems, which is better than getting a limited and often incomplete data set from the source system's test environment. In any case, make sure that the obfuscated data contains the same data types and data lengths as the data you'll ultimately be transforming.
- Relevance: The data should represent the business cases you have to deal with in your production environment. This means that if you need to generate test data manually, you need to make sure it covers real scenarios that represent the data in the production system as closely as possible.

It can be hard to create and maintain a good set of test data but it's an invaluable asset to an ETL project. The problem with using "regular" sources of data is that they constantly change. Target data changes constantly, too, as a result of running development and test jobs—not an ideal situation for testing purposes as you need a controlled and isolated environment. Next to having test data at hand, it's also important to have a separate test environment. Reverting a database to a previous state after a failed or even successful run is a tedious task and something you'd rather automate. There are several ways of doing this: creating backups with different versions of the database is one option, and there are several tools available for database testing that can be used for this purpose too. One solution you might want to have a look at is DbUnit (http://www.dbunit.org), an open source extension for the JUnit testing framework.

Testing for Completeness

One of the first tests to be conducted is the completeness of the data. You need to make sure that all the data that goes into the process also comes out. There are a couple of easy-to-use techniques for this:

- Record counting: Count row totals for rows read, transformed, written, updated, and rejected. This is a no-brainer when using Kettle because these values are directly visible in the step metrics when running a transformation.
- Hash totals: Calculate totals or hash totals for critical columns in both the source and destination system. Use this to determine whether the total sales amount in the source system matches the total sales amount in the data warehouse.
- Checksums: Kettle can calculate checksums based on one, some, or all columns in a transformation. After processing the records, this value can be compared to the checksum in the destination table records.

These numbers can be calculated on-the-fly and visually inspected, but a better way is to have them written to a log table in a database. This enables easy reporting against these tables and also provides a system of record of the ETL activities. We'll use the load_rentals transformation from Chapter 4 as an example. The transformation has a couple of challenges when testing for data completeness:

- Does the initial load work correctly and load all records from the source database?
- Does the CDC option work correctly, meaning that no records are skipped?
- Are all records read during the extraction loaded into the target table?

In order to have these tests run automatically, you need to capture the record counts in a logging table each time the transformation is run. We cover logging and auditing in depth in Chapter 14, but we cover some basics here to help you start testing.

1. First, open the transformation load_fact_rentals and bring up the Transformation properties screen with CTRL+T or Edit ➪ Settings. In the third tab of the screen, Logging, you can define several logging options, but we'll keep it simple and just specify the transformation log.

2. To write to a log table, you need a connection to a database. For this example, we used the sakila_dwh connection. (In a real-life scenario, you might want to use a separate schema or database for the logging tables.)

3. If you enter a table name in the field "Log table name" and click the SQL button, the log table create script will be generated, ready for execution. Before you do this, make sure all the required log fields have been checked.

NOTE After checking additional log fields, make sure to alter the table with the SQL option; otherwise, your job/transformation will fail.

4. In the "Fields to log" section of the screen, the Stepname column enables you to specify for which step the calculations need to be stored in the log table. Because you are only interested in totals for the transformation, it suffices to look at the first and last steps in the transformation. Figure 11-8 shows the completed screen.

Figure 11-8: Transformation log settings
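The script generated in step 3 will look roughly like the following sketch, here using kettlelog as the table name (the same name referred to later in this section); the exact column list depends on which log fields you checked, so treat this only as an indication of what to expect:

-- Indicative layout of a Kettle transformation log table; the generated
-- script may contain more or fewer columns.
CREATE TABLE kettlelog (
  ID_BATCH       INT,
  TRANSNAME      VARCHAR(255),
  STATUS         VARCHAR(15),
  LINES_READ     BIGINT,
  LINES_WRITTEN  BIGINT,
  LINES_UPDATED  BIGINT,
  LINES_INPUT    BIGINT,
  LINES_OUTPUT   BIGINT,
  LINES_REJECTED BIGINT,
  ERRORS         BIGINT,
  STARTDATE      DATETIME,
  ENDDATE        DATETIME,
  LOGDATE        DATETIME
);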

The first completeness tests can then verify the following test cases:

1. All data from the source system is correctly loaded during an initial load.
2. The CDC mechanism correctly identifies new records in the source system that were inserted since the last data warehouse load.
3. The CDC mechanism correctly identifies records in the source system that were updated since the last data warehouse load.
4. New records from the source system are correctly inserted in the fact table.
5. Updated records lead to an update of the correct rows in the target table.

Executing these tests is simple: For the first case, do an initial load on an empty fact_rental table and compare the counts in the kettlelog table (you'll of course see them while running the transformation as well). The second test requires at least one newly inserted record in the rentals table, while the third test requires at least one update. Cases 4 and 5 are for testing the correct delivery of the inserts and updates to the fact table.
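As a sketch of how these checks can be scripted against the databases used in this example (the sakila source schema, the sakila_dwh target, and the kettlelog table created earlier; the rental_amount column name is an assumption), you could run queries like these after each load:

-- Row counts: source versus target, which should match after an initial load.
SELECT (SELECT COUNT(*) FROM sakila.rental)          AS source_rows,
       (SELECT COUNT(*) FROM sakila_dwh.fact_rental) AS target_rows;

-- Hash total on a critical column: total amount in source and target.
SELECT (SELECT SUM(amount) FROM sakila.payment)                AS source_amount,
       (SELECT SUM(rental_amount) FROM sakila_dwh.fact_rental) AS target_amount;

-- Line counts recorded by Kettle for the most recent run.
SELECT TRANSNAME, STATUS, LINES_READ, LINES_WRITTEN, LINES_UPDATED, ERRORS
FROM   sakila_dwh.kettlelog
ORDER  BY ID_BATCH DESC
LIMIT  1;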

Testing Data Transformations

As the data is read from various source systems and flows through the solution you’ve built, data is transformed in one way or another. Regardless of the logic and processing involved, there’s one thing that can (and should) always be tested: expected outcomes given a certain input. Ideally you’ll have a (complete) list of different scenarios at hand and a set of input and expected output values for each scenario. Try to avoid making this a manual effort; even in a simple solution with one or two source systems there are already many different possible cases. The most commonly used tool for listing scenarios and input plus expected output values is the omnipresent spreadsheet. Kettle has no problem reading data from XLS files but for initial testing of newly developed transformations there’s an even better way: use the Data Grid. The Data Grid allows you to easily specify and insert input data and is a great way to quickly test whether, for instance, a date will be correctly formatted by a transformation.

WARNING A Data Grid step is a very convenient way to check whether your transformation behaves as expected, but it's only a first step, usually performed during development. For the actual test, the normal input steps should be in place.

Again you can use the load_rentals transformation because it offers many possible tests. For instance: What if a store ID is loaded that's not available in the store dimension? Is it specified how Kettle should handle this? A closer inspection of all the lookup steps shows that they all have the option "Do not pass if the lookup fails" selected. There's also a calculation step that calculates the rental_duration. Does this step generate the expected result? What happens when the rental return date is empty, and what is the effect if the return date is before the rental date? All these scenarios should have been taken care of already by adding data validation and error handling routines, but these have to be tested, too.

Test Automation and Continuous Integration

Running tests should optimally be an automated task so that once the proper tests are set up, the routines will run at a regular interval. This can be accomplished very easily by adding a schedule for the main job or jobs on the test environment. Whenever a new or altered part of the ETL solution is added to these main jobs, it will automatically become part of the continuous integration test.
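For example, on a UNIX-like test server a single crontab entry could turn the main job into a nightly integration test; the paths and job name below are only an illustration, and scheduling itself is discussed in detail in Chapter 12:

# run the main test job every night at 02:00 and keep a log of each run
0 2 * * * cd /opt/pentaho/pdi && ./kitchen.sh /file:/home/etl/test/main_job.kjb /logfile:/var/log/etl/main_job_test.log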

Upgrade Tests

Kettle is still being actively developed, meaning that every few months a new release becomes available. New releases not only contain bug fixes and new features but improvements and optimizations as well. For these reasons it's a good idea to keep up with the release schedule and upgrade your Kettle environment regularly. Of course, you cannot just upgrade your existing production environment and assume everything will keep working. There's always a chance that a new version contains a bug that wasn't there before. In order to prevent breaking the production system by carelessly performing an upgrade, you need to set up a separate test environment and test repository for performing upgrade tests. A separate repository is needed because new features and functionalities usually require a repository upgrade. The repository upgrade is also an important part of the upgrade test. The correct order to do this is the following:

1. Create a new test repository using your current Kettle version.
2. Export your current production repository and import it into the test repository you just created.
3. Install the new Kettle version.
4. Upgrade the test repository using the new Kettle version.

When this process can be executed without problems, you have your base upgrade test environment ready. Now you can perform all other tests using this environment to find out whether an upgrade would cause problems. Note that you need to repeat the installation and upgrade process for each new Kettle release.

Debugging

Debugging is the process of finding and fixing errors in a software program. In Kettle, debugging is in large part already enforced by the program itself. Many errors will have been solved during the development process itself, simply because having Kettle run a complex job from start to finish means that all the connections are working, data can be read, exceptions and errors during transformations are handled, and data is delivered at the designated target. If one or more of these parts of the process fail, Kettle will simply throw an error telling you where you made a mistake. This is, in fact, already a form of implicit debugging, although the resolutions are not always obvious. The explicit form of debugging in Kettle is provided by the Preview and Debug options, which allow you to set arbitrary breakpoints in a transformation. Figure 11-9 shows the location of the two starting points for Preview and Debug.

Figure 11-9: Preview and Debug buttons

There's a slight difference between the two, although they both open the same screen. When you have opened the Transformation debug dialog, you'll see all the steps listed on the left of the screen; the active step (the one for which a Preview or Debug will be issued) is selected or highlighted. The right side of the screen shows a large condition panel (similar to the condition builder in the "Filter rows" step) and two checkboxes in the upper right. These checkboxes determine how the Debug dialog will behave when Quick Launch is selected. Selecting "Retrieve first rows (preview)" will cause Kettle to pause the transformation after showing the number of rows entered in the "Number of rows to retrieve" box. Selecting "Pause transformation on condition" will trigger Kettle to pause the transformation based on the condition(s) entered. Selecting both will just do a preview. The breakpoint also handles the "Number of rows to retrieve" option, but in reverse: by setting a breakpoint and entering 10 as the number of rows, Kettle will pause based on the condition but also display the ten rows that entered the step prior to the one that triggered the condition. Figure 11-10 shows a debug screen with ten rows to retrieve where the transformation is paused when a customer_id of 500 is encountered.

Figure 11-10: Using breakpoints

In the figure, Circle 1 identifies the customer_id we used for the breakpoint condition. Circle 2 shows the available options to proceed with the transformation: Close, Stop, or "Get more rows." Circle 3 shows that the transformation had already transformed

several rows before the breakpoint condition was met. That means that some caution is needed here: Even selecting Stop won’t roll back the transformation to the original starting point because Kettle already might have loaded records into the target table. The Close button will only close the debug screen and keep the transformation in a paused condition. This is visualized by means of the green (=selectable) Pause and Stop icons to the left of the Preview and Debug icons, as shown in Figure 11-11.

Figure 11-11: Paused transformation

Selecting Stop does indeed terminate the running transformation, but again, the rows that have already been pushed through the process before the breakpoint condition was met are now loaded into the target table or file. If this is not your intention, you can keep Kettle from loading records immediately by adding a Blocking step before the final destination step. In the case of the load_fact_rentals transformation, a Blocking Step can be placed directly before the Insert / Update step.

WARNING The remarks about the stopping issue do not refer only to debugging but to any Preview. As soon as you select Preview, the entire transformation is started in the background and rows are processed. If there is a subsequent step in the transformation deleting records, the ultimate effect of this is that data could get deleted even if you only do a preview of a previous step!

If you wonder whether Kettle supports row-by-row debugging: yes it does, although it does so a bit differently than some other tools. Just start with a regular debug and, when a preview screen is shown, click "Get more rows." This will cause the preview screen to close, but when you click the Pause button in the transformation quick launch bar, you'll see that it is opened again, now with one additional row. You can repeat this process for all other rows if needed. Kettle 4 introduced many new features, but for debugging purposes the new drill-down and sniffing features stand out. When you're developing transformations and jobs, you have probably noticed that you can drill into jobs, transformations in a job, and mappings in a transformation by right-clicking on an object and selecting Open job, Open transformation, or Open mapping, respectively. This works not only during development, but during execution as well. What's more, you can also "sniff" the data during execution by right-clicking on a step and selecting "Sniff test during execution." From there you can select to sniff the input, output, or error rows of a step. It is also possible to open more than one sniffer simultaneously, as is shown in Figure 11-12. The simple example in Figure 11-12 shows a transformation that generates rows and uses a Modified Java Script Value step to concatenate the two string fields. Both the output rows of the Generate Rows step and the input rows of the "Text file output" step are shown.

Figure 11-12: Sniffing a running transformation

A live example of the sniffing option is also available on YouTube (http://www.youtube.com/watch?v=imvpQ8FFo-A), where Matt Casters demonstrates this feature.

Documenting the Solution

One of the most important and at the same time most ignored aspects of ETL development (or any software development, really) is creating proper documentation. Documentation is an essential deliverable of any IT solution, and ETL is no exception. There are many purposes for documentation. Here are just a few benefits of good documentation:

N By recording the way the ETL design and its requirements are implemented, documentation allows the solution to be validated.
N If done right, documentation can help to train new developers more efficiently, as they can simply read the documentation to help them understand the implementation of the ETL solution.
N In the specific case of ETL, documentation could in principle be used to understand data lineage and so facilitate audits.

Why Isn't There Any Documentation?

Although the benefits of documentation may seem obvious, the reality is that in many cases, documentation efforts are very poor or even completely absent. Despite the advantages, actually producing and maintaining documentation is highly unpopular. Developers don't like to write it, they like maintaining it even less, and because there are typically no short-term benefits to having documentation, it's usually the project manager's first victim when a software project starts lagging behind schedule. (That is, if time was allotted for documentation in the first place!) As a testament to its lack of popularity, there are a number of often-heard arguments against documentation:

N Because Kettle jobs and transformations are highly graphical by nature, some ETL developers may argue that they are self-explanatory, and thus need no additional documentation.
N Some developers argue that documentation is actually a bad thing, as it tends to become outdated as the solution develops over time, leaving outsiders with false or incomplete information about the solution.
N Another often-heard argument against documentation is that it's not worth the effort, because it won't be read anyway.

These are classic excuses not to document anything at all, and you may hear them from developers of any kind, be they C/C++ programmers, database developers, or ETL developers. These arguments are easily countered, however.

Myth 1: My Software Is Self-Explanatory

Typically, those who claim that a piece of software is self-explanatory are the ones who actually developed it. But what may seem self-explanatory to whoever built and designed the solution may not be so clear-cut to others who have to maintain and augment the solution in the future. Also, a job or transformation (or any piece of software, really) can only be self-explanatory to the extent that it is clear what it does. Although this is certainly a desirable trait, and something you should strive for in your own designs, it does not explain why it is done this way. Ideally, documentation should not have to explain the what so much as focus on the why.

Myth 2: Documentation Is Always Outdated

As for the argument that documentation quickly becomes outdated: this is certainly a valid concern, but one can hardly blame the documentation itself. If the documentation gets outdated too fast, it just means that the act of documenting is not considered part of the development process. If developers and project managers treat documenting as a part of the development process in its own right, and if the documentation is properly tested (just like the software solution itself should be), there is no reason it should become outdated. That said, it certainly helps if the development environment allows documentation to be supplied as part of the development process, to ensure that it is as easy as possible to update the documentation while you're updating the actual software solution.

Myth 3: Who Reads Documentation Anyway?

Reading documentation is underrated, just like writing documentation, and often for the same reasons: it is not considered part of the job, and is therefore deemed unnecessary. But to be fair, one of the main reasons nobody reads documentation is that there literally is nothing to read; it simply wasn't produced in the first place. And if there actually is documentation, it may have been written by someone who didn't believe it was ever going to be read anyway (with the predictable impact on documentation quality). These factors all build up to a self-fulfilling prophecy: if there is documentation, it wasn't written very well, because whoever wrote it thought nobody was going to read it anyway. Because the documentation is no good, it isn't going to be read. And because nobody reads it anyway, it is not going to get written at all the next time around.

Kettle Documentation Features

It is always possible to document your ETL solution using some external tool like a word processor, a wiki website, or a set of static web pages. The advantage of these documentation solutions is that they can be easily distributed, read on a computer screen, or printed out to hard copies. The disadvantage of maintaining the documentation text in an external solution is that it increases the distance between the documentation and the ETL solution itself. We just argued that ideally, the act of documenting should be part of the development process. This should be taken quite literally; whenever the solution is changed for maintenance, the change should ideally be recorded and added to the documentation immediately. This ensures the documentation stays up to date and is as accurate as possible. Therefore, it should be as easy as possible to document your ETL solution. While this is no guarantee that documentation will stay up to date, making it harder for developers to document things by forcing them to open an external application to record the change will certainly not help. Kettle offers a few features to embed human-readable information in jobs and transformations. You can use these features to at least record the intent and design considerations of your jobs and transformations. Currently, Kettle does not offer anything out of the box to turn this information into proper documentation, but in the last subsection of this chapter, we will demonstrate how you can extract this information to generate proper documentation. The Kettle features that help you document your solution are:

N Descriptive identifiers: Essentially there are no practical restrictions to choosing identifiers for jobs, transformations, steps, job entries, fields, and so on. Of course, identifiers must be unique within their scope; for example, within one transformation all steps must have a distinct name, but other than that, pretty much anything goes.

N Notes: Notes are little panels that can be placed anywhere on the canvas of a job or transformation. They contain arbitrary text and can be used to highlight some feature of the job or transformation. In Spoon, you can add a new note by right-clicking on the canvas and then choosing the "New note" menu item in the context menu. This will open a dialog where you can enter the text and set a few properties such as font style and color.

N Descriptive fields in the job and transformation settings: In Spoon, you can right-click on the canvas and choose the "Job settings" or "Transformation settings" item. This opens a dialog with a number of tabs. The Job or Transformation page offers a number of properties that can be used for documentation. The Description and Extended Description fields are useful for entering a textual description of the job or transformation. There is a Status field where you can mark the job or transformation as either Draft or Production. The version field may be used to enter a version number. The "Job properties" dialog is shown in Figure 11-13.

N Parameters: Parameters can also be edited in the job or transformation settings dialog on the Parameters tab. Parameters support a description field, which you can use to record useful information about the parameter, such as its purpose, the expected data type and format, and maybe some example values.

Figure 11-13: Using descriptive fields to document a job

While these features are quite limited, they at least allow you to record essential information about your jobs and transformations as you are developing them. Because the descriptive data is stored inside the job or transformation, it is always available when you're developing or maintaining your ETL solution. There are two important things that are not provided by these built-in Kettle features. First, you need to have the discipline to actually use them. Second, merely recording descriptive data is a prerequisite for documentation, but it is not the documentation itself. In the final subsection of this chapter, you will see how you can extract the descriptive data you entered into your jobs and transformations to generate the actual human-readable documentation.

Generating Documentation

Currently, Kettle does not offer any built-in functionality to generate documentation. At the time of writing, Pentaho is developing an auto-documentation feature, but this is slated to appear in a future version of the Kettle Enterprise Edition. Fortunately, there is a community project started by Roland Bouman, one of the authors of this book, that allows you to generate HTML documentation from a collection of transformation and job files. The project is called kettle-cookbook, and is currently hosted on Google Code at http://code.google.com/p/kettle-cookbook. The kettle-cookbook solution provides a couple of Kettle jobs and transformations that scan a directory and its subdirectories for .ktr and .kjb files. The .kjb and .ktr files store the definition of Kettle jobs and transformations in an XML format, and kettle-cookbook applies an XSLT stylesheet to the files to generate HTML documents that can be read using a standard web browser. Finally, kettle-cookbook generates a table of contents that allows you to navigate through the jobs and transformations. To generate documentation using kettle-cookbook, follow these directions:

N Collect all job and transformation files you want to document and place them in a directory. This is what is referred to as the input directory. A simple way to collect job and transformation files in such a way that any dependencies between files are preserved is to use Kettle's export feature. You can find this feature in Main menu ➪ Export ➪ Linked Resources to XML…. This feature does all the hard work of tracking the dependencies, and then places all files it finds in a single .zip archive. You can then unzip this archive to the input directory. The export method also works for jobs and transformations stored in the repository.

N Create an empty directory where the documentation is to be generated. This is referred to as the output directory.

N Run the document-all job. This is located in the pdi directory beneath the kettle-cookbook main directory. This job takes two parameters, INPUT_DIR and OUTPUT_DIR, which have to be set to the location of the input directory and output directory respectively.

When the document-all job is finished, the output directory will contain the generated HTML documentation. You can start reading it by opening the index.html

document in your web browser. The kettle-cookbook project website shows an example of the generated output.
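As an illustration, the document-all job can be launched with Kitchen from the command line. The directory locations below are assumptions, and the /param option requires a Kettle version that supports named parameters:

cd /opt/pentaho/pdi
./kitchen.sh \
  /file:/opt/kettle-cookbook/pdi/document-all.kjb \
  "/param:INPUT_DIR=/home/etl/solution-export" \
  "/param:OUTPUT_DIR=/home/etl/solution-docs"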

Summary

The previous part of this book was all about the basics of building ETL solutions. This chapter was the opening for Part III, “Management and Deployment,” and provided a broad overview of ETL solution development. We covered the following topics that together form the foundation for a successful ETL project:

N Solution design: You learned the best and some of the worst practices in designing ETL solutions. We addressed the importance of proper data mapping and naming conventions, and we listed several common pitfalls in ETL projects. Finally, we covered flow design and reusability.
N Agile development: You learned about the available tools and techniques within Kettle to support Agile BI projects. A large part of this section was devoted to a sample case of working with the Model and Visualize tools.
N Testing and debugging: You discovered what testing in an ETL setting means and the challenges involved. The last part of the "Testing and Debugging" section showed you how the debug tools in Kettle can be used for inspecting data in detail at each stage of a transformation.
N Documentation: You learned about the benefits of documentation and about a number of features that allow you to embed descriptive information inside your jobs and transformations. Finally, you learned how to generate human-readable HTML documentation from an ETL solution using kettle-cookbook.

Chapter 12: Scheduling and Monitoring

In this chapter, we take a closer look at the tools and techniques to run Kettle jobs and transformations in a production environment. In virtually any realistic production environment, ETL and data integration tasks are run repeatedly at a fixed interval in time to maintain a steady feed of data to a data warehouse or application. The process of automating periodical execution of tasks is referred to as scheduling. Programs that are used to define and manage scheduling tasks are called schedulers. Scheduling and schedulers are the subject of the first part of this chapter. Scheduling is just one aspect of running ETL tasks in a production environment. Additional measures must be taken to allow system administrators to quickly verify and, if necessary, diagnose and repair the data integration process. For example, there must be some form of notification to confirm whether automated execution has taken place. In addition, data must be gathered to measure how well the processes are executed. We refer to these activities as monitoring. We discuss different ways to monitor Kettle job and transformation execution in the second part of this chapter.

Scheduling

In this chapter, we examine two different types of schedulers for scheduling Kettle jobs and transformations:

N Operating system–level schedulers: Scheduling is not unique to ETL. It is such a general requirement that operating systems provide standard schedulers, such


as cron on UNIX-like systems, and the Windows Task Scheduler on Microsoft Windows. These schedulers are designed for scheduling arbitrary programs, and can thus also be used for scheduling the Kettle command-line programs for running jobs and transformations.
N The Quartz scheduler built into the Pentaho BI Server: Kettle is part of the Pentaho BI stack, and many Kettle users are likely to also use or be familiar with the Pentaho BI Server. The scheduler built into the Pentaho BI Server can be used to run an action sequence for executing Kettle transformations and jobs.

Operating System–Level Scheduling

All major operating systems provide built-in features for scheduling tasks. The tasks that can be scheduled by operating system–level schedulers take the form of a command line: a shell command to execute a particular program, along with any parameters to control the execution of that program. So in order for you to use an operating system–level scheduler for scheduling Kettle jobs and transformations, we first need to explain how you can run Kettle jobs and transformations using a command line. This is described in detail in the following section.

Executing Kettle Jobs and Transformations from the Command Line

Kettle jobs and transformations can be launched using the command-line tools Kitchen and Pan, respectively. Pan and Kitchen are lightweight wrappers around the data integration engine. They do little more than interpret command-line parameters and invoke the Kettle engine to launch a transformation or job. Kitchen and Pan are started using shell scripts, which reside in the Kettle home directory. For Windows, the scripts are called Kitchen.bat and Pan.bat respectively. For UNIX-based systems, the scripts are called kitchen.sh and pan.sh. As with all the scripts that ship with Kettle, you need to change the working directory to the Kettle home directory before you can execute these scripts. You may find that you need to modify the Kitchen or Pan scripts. There may be several reasons for this: perhaps you need to include extra classes in the Java classpath because your transformation contains a plugin, a user-defined Java expression, or a user-defined Java class that requires it. Or maybe you're experiencing out-of-memory errors when running your job and you want to adjust the amount of memory available to the Java Virtual Machine. Chapter 3 includes a section called "The Kettle Shell Scripts" that explains the structure and content of the Kettle scripts. You may find that section useful if you need to modify the Kitchen or Pan scripts.

NOTE Note that the scripts for UNIX-based operating systems are not executable by default; they must be made executable using the chmod command.
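For example, assuming Kettle was unpacked in /opt/pentaho/pdi (an arbitrary location), the scripts can be made executable as follows:

cd /opt/pentaho/pdi
chmod +x kitchen.sh pan.sh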

Command-Line Parameters

The Kitchen and Pan user interface consists of a number of command-line parameters. Running Kitchen and Pan without any parameters outputs a list of all available parameters. The syntax for specifying parameters is:

[/-]name[[:=]value]

Basically, the syntax consists of a forward slash (/) or dash (-) character, immediately followed by the parameter name. Most parameters accept a value. The parameter value is specified directly after the parameter name by either a colon (:) or an equals (=) character, followed by the actual value. The value may optionally be enclosed in single (') or double (") quote characters. This is mandatory in case the parameter value itself contains white space characters.

NOTE Using the dash and equals characters to specify parameters can lead to issues on Windows platforms. Stick to the forward slash and colon to avoid problems.
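To illustrate, the following invocations pass the same options in different but equivalent ways; the file name daily_load.kjb is just an example:

./kitchen.sh /file:daily_load.kjb /level:Basic
./kitchen.sh -file=daily_load.kjb -level=Basic

# quote values that contain spaces
./kitchen.sh /file:"/home/foo/my daily load.kjb" /level:Basic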

Although jobs and transformations are functionally very different kinds of things, there is virtually no difference in launching them from the command line. Therefore, Kitchen and Pan share most of their command-line parameters. The generic command-line parameters can be categorized as follows:

N Specify a job or transformation.
N Control logging.
N Specify a repository.
N List available repositories and their contents.

The common command-line parameters for both Pan and Kitchen are listed in Table 12-1.

Table 12-1: Generic Command-Line Parameters for Kitchen and Pan

NAME      VALUE                       PURPOSE
norep                                 Don't connect to a repository. Useful to bypass automatic login.
rep       Repository name             Connect to the repository with the specified name.
user      Repository user name        Connect to the repository with the specified user name.
pass      Repository user password    Connect to the repository with the specified password.
listrep                               Show a list of available repositories.
dir       Path                        Specify the repository directory.
listdir                               List the available repository directories.
file      Filename                    Specify a job or transformation stored in a file.
level     Error | Nothing | Basic |   Specify how much information should be logged.
          Detailed | Debug | Rowlevel
logfile   Filename for logging        Specify to which file you want to log. By default, the tools log to the standard output.
version                               Show the version, revision number, and build date of the tool.

Although the parameter names are common to both Kitchen and Pan, the semantics of the dir and listdir parameters are dependent upon the tool. For Kitchen, these parameters refer to the repository's job directories, or to transformation directories in the case of Pan.
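For example, the following commands list the defined repositories and then the directories of one repository; the repository name and credentials are made up:

# list the repositories defined in repositories.xml
./kitchen.sh /listrep

# list the job directories in the repository named pdirepo
./kitchen.sh /rep:pdirepo /user:admin /pass:admin /listdir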

Running Jobs with Kitchen

In addition to the generic command-line parameters, Kitchen supports a couple of specific ones. These are shown in Table 12-2.

Table 12-2: Command-Line Parameters Specific to Kitchen

NAME      VALUE      PURPOSE
job       Job name   Specify the name of a job stored in the repository.
listjobs             List the available jobs in the repository directory specified by the dir parameter.

The following code provides a few examples of typical Kitchen command lines.

# list all available parameters
kettle-home> ./kitchen.sh

# run the job stored in /home/foo/daily_load.kjb
kettle-home> ./kitchen.sh \
> /file:/home/foo/daily_load.kjb

# run the daily_load job from the repository named pdirepo
kettle-home> ./kitchen.sh /rep:pdirepo \
> /user:admin \
> /pass:admin \
> /dir:/ /job:daily_load.kjb

Running Transformations with Pan

The Pan-specific command-line parameters are completely equivalent to the Kitchen-specific ones. They are shown in Table 12-3.

Table 12-3: Command-Line Parameters Specific to Pan

NAME        VALUE                 PURPOSE
trans       Transformation name   Specify the name of a transformation stored in the repository.
listtrans                         List the available transformations in the repository directory specified by the dir parameter.

Using Custom Command-Line Parameters

When using command-line tools to execute jobs and transformations, you may find it useful to use command-line parameters to convey configuration data. For the command-line tools Kitchen and Pan you can use Java Virtual Machine properties to achieve the effect of custom command-line parameters. The syntax for passing these "custom" parameters is:

-D<name>=<value>

The following example illustrates how such a parameter might appear in a Kitchen command line.

kettle-home> kitchen.sh /file: -Dlanguage=en

In transformations, you can use the Get System Info step to obtain the value of the command-line parameters. You can find this step in the Input category. The Get System Info step generates one output row containing one or more fields having a system-generated value. The Get System Info step is configured by creating fields and picking a particular type of system value from a predefined list, as shown in Figure 12-1.

Figure 12-1: Capturing command-line parameters with the Get System Info step

Among the predefined system values, you'll find "command line argument 1" up to "command line argument 10". Fields with one of these types automatically take on the value of the corresponding -D<name>=<value> command-line parameter.

UNIX-Based Systems: cron

The cron utility is a well-known job scheduler for UNIX-like systems. You shouldn't have to install anything to get it to work. Scheduling tasks with cron is done by simply adding entries for cron jobs to a special file called the crontab (for cron table). The crontab file is usually located in /etc/crontab, but there may be differences depending upon the UNIX flavor. In many cases, a crontab utility program is available, which facilitates maintaining cron entries. The actual cron entries consist of the cron string, which denotes the schedule and recurrence, followed by the operating system command that is to be executed. For executing Kettle jobs or transformations, the command line is simply a command-line invocation of kitchen (for Kettle jobs) or pan (for Kettle transformations), or a shell script containing such a kitchen or pan invocation. The cron string defines five fields of a date/time value, separated by white space. From left to right, the date and time fields are:

N minutes: 0–59
N hours: 0–23
N day of month: 1–31
N month: 1–12
N day of week: 0–6, where 0 denotes Sunday, 1 Monday, and so on

This is best illustrated with an example:

0 1 * * 5 run_kettle_weekly_invoices.sh

In the preceding example, the cron string reads from left to right: 0 means at zero minutes and 1 means at one o'clock a.m. The first * means regardless of the day of month, the second * means every month, and 5 means Friday. In English, this reads "every Friday at 1:00 AM."
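A complete crontab entry for a nightly Kettle job could look like the following sketch; the installation path, job file, and log location are assumptions:

# m  h  dom mon dow  command
30 23 *   *   *      cd /opt/pentaho/pdi && ./kitchen.sh /file:/home/foo/daily_load.kjb /level:Basic /logfile:/var/log/etl/daily_load.log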

NOTE For more options on using cron and crontab, refer to your operating system documentation. man crontab is usually a good start. There are also many online resources that offer good examples and explanations concerning cron.

Windows: The at utility and the Task Scheduler

Windows users can use the at utility or the Task Scheduler. The at utility is available from the command line. Here's a simple example that illustrates how to schedule execution of a batch job to run each day at midnight:

at 00:00 /every:M,T,W,Th,F,S,Su "D:\pentaho\pdi\daily_job.bat"

Instead of providing a long command directly on the at command line, it is usually better to write a batch file (.bat file) and have at execute that. (This technique can, of course, also be applied on UNIX-like systems, where you would write a bash or sh script.) Windows also offers a graphical interface for scheduling. You can find the Windows Task Scheduler in the Control Panel or in the Start menu by navigating to Start ➪ Programs ➪ Accessories ➪ System Tools ➪ Task Scheduler.
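A minimal sketch of such a batch file is shown below; the D:\pentaho\pdi install location and the file paths are assumptions and should be adjusted to your environment:

@echo off
rem D:\pentaho\pdi\daily_job.bat -- run the daily load job and write a log file
cd /d D:\pentaho\pdi
call Kitchen.bat /file:D:\etl\daily_load.kjb /level:Basic /logfile:D:\etl\logs\daily_load.log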

NOTE For more information on the at command or the Task Scheduler, go to http://support.microsoft.com/ and search for "at command" or "Task Scheduler."

Using Pentaho's Built-in Scheduler

If you are using the Pentaho Business Intelligence suite, you can use its built-in scheduler for scheduling Kettle transformations (but not jobs) as an alternative to operating system–level scheduling.

NOTE A detailed discussion of the Pentaho BI Server and its development tools is outside the scope of this book. For readers interested in installing, configuring, and using the Pentaho BI Server (and the other components of the Pentaho BI stack) we recommend the book Pentaho Solutions by Roland Bouman and Jos van Dongen.

The Pentaho BI Server provides scheduling services through the Quartz Enterprise Job Scheduler, which is part of the Open Symphony project. For more information on Quartz and the Open Symphony project, visit the Open Symphony website at http://www.opensymphony.com/quartz/.

Using Pentaho’s built-in scheduler for running Kettle jobs or transformations requires two things:

N An action sequence, which is a container for one or more tasks (such as a Kettle job or transformation) that can be executed by the BI Server
N A schedule that determines when and how often some action sequence should be executed by the Pentaho BI Server

In this context, the actual scheduling is done by attaching the action sequence to an existing schedule so it will be automatically executed at the right moment(s) in time.

Creating an Action Sequence to Run Kettle Jobs and Transformations

Before you can schedule Kettle transformations, they must first be made executable by the Pentaho BI Server. This is done by creating an action sequence containing a process action of the Pentaho Data Integration type. With this process action, you can execute Kettle transformations either from a .ktr file or from the Kettle repository. If you want to use a Kettle repository other than the default repository, you must edit the settings.xml file located in the kettle folder that resides in the system Pentaho solution. The contents of this file are as follows:

<kettle-repository>

  <!-- The values within <properties> are passed directly to the
       Kettle Pentaho components. -->

  <!-- This is the location of the Kettle repositories.xml file,
       leave empty if the default is used:
       $HOME/.kettle/repositories.xml -->

  <repositories.xml.file></repositories.xml.file>
  <repository.type>files</repository.type>
  <!-- The name of the repository to use -->
  <repository.name></repository.name>

  <!-- The name of the repository user -->
  <repository.userid>admin</repository.userid>

  <!-- The password -->
  <repository.password>admin</repository.password>

</kettle-repository>

To configure the repository to use, you'll need to provide a value between the <repository.name> and </repository.name> tags, and this must match the name of a repository defined in the repositories.xml file, which was discussed in Chapter 3.

If you like, you can use a non-default repositories.xml file by specifying its location as a value for the repositories.xml.file tag. The values for repository.userid and repository.password must be valid credentials for that repository, and will be used when accessing its contents. You can create and modify action sequences using the Pentaho Design Studio or the action sequence plugin for Eclipse. You can download these tools from the Pentaho project page on sourceforge.net, or from the Hudson server at ci.pentaho.com. Again, a detailed discussion of these tools is outside the scope of this book, but you can find more information on these subjects in Pentaho Solutions.

Kettle Transformations in Action Sequences

You can incorporate a Kettle transformation into an action sequence using a Pentaho Data Integration process action. In Pentaho Design Studio, you can find this process action beneath the Get Data From category. The Pentaho Data Integration process action can accept input parameters from the action sequence, and returns a result set to be used later on in the action sequence. In addition to the result set, the action sequence can also receive diagnostic information, such as the transformation log. The Pentaho BI Server comes with a number of examples that illustrate this process action. You can find these examples in the etl folder of the bi-developers solution. The PDI_Inputs.xaction example is most illustrative, and is shown in Figure 12-2.

Figure 12-2: The PDI_Inputs.xaction sample action sequence

The following list covers a few of the most important configuration options for the Pentaho Data Integration process action.

N To run a transformation stored in the Kettle repository, select the Use Kettle Repository checkbox. Also, see the note on configuring the Kettle repository for usage within the Pentaho BI Server at the start of this section. When this box is selected, the action sequence editor will allow you to enter a directory name and a transformation name to identify the transformation.

N In the Transformation File text box, enter the name of the transformation. You can use the Browse button to point to a .ktr file located somewhere on the machine where the Pentaho server resides. In this case, the value for Transformation File takes the form of a regular file:// URL. You can also use a transformation stored in the Pentaho Solution Repository using the special prefix solution:. This is the option shown in Figure 12-2, where SimpleTest.ktr is a transformation file that resides in the same solution folder as the PDI_Inputs.xaction action sequence itself.

N In the Transformation Step text box, enter the name of the step inside your transformation that is to return the result set to the action sequence. Note that although the transformation could yield multiple result sets, you can only receive one of them in the action sequence.

N You can use the Transformation Inputs list to feed parameters from the action sequence into your transformation. These parameters can be read by the Kettle transformation using a Get System Info step configured to read command-line arguments. You encountered this step earlier in this chapter (see Figure 12-1) when we discussed passing command-line parameters to transformations.

N The Kettle Logging Level list box lets you choose at what level Kettle should log the execution of the transformation. Logging levels are explained in detail later in this chapter.

N In the bottom section of the action sequence editor, you can map diagnostics such as the log and the number of returned rows to yet more action sequence parameters.

Creating and Maintaining Schedules with the Administration Console

You can create and maintain schedules using the Pentaho Administration Console. This is included in the Pentaho BI Server. To work with schedules, navigate your web browser to the Administration Console web page. Enable the Administration page and click on the Scheduler tab. You can see a list of all public and private schedules maintained by the Pentaho BI Server, as shown in Figure 12-3. By default, there is one schedule called PentahoSystemVersionCheck. This is a private schedule that is used to periodically check if there is a newer version of Pentaho.

Figure 12-3: The Scheduler tab in the Pentaho BI Server Administration Console

NOTE A public schedule ensures that all users can see the schedule and enables them to use this schedule for subscriptions. A private schedule is not available to end users, and is intended for scheduling system maintenance operations by the Pentaho BI Server administrator.

On the Scheduler tab, click the first toolbar button to create a new schedule. The Schedule Creator dialog, shown in Figure 12-4, appears.

Figure 12-4: The Schedule Creator dialog

The options in the Schedule Creator dialog are described in the following list:

N To create a public schedule, select the Public Schedule checkbox.

N Use the Name field to enter a unique name for the schedule. Note that for public schedules, this name will be presented to end users whenever they have to pick a schedule. Use names that are both clear and concise.

N Use the Group field to specify a name that describes what kind of schedule this is. For example, you can create groups according to department (for example, warehouse, human resources), location (Edmonton, Los Angeles), subject area (Sales, Marketing), or timing (daily, weekly). The toolbar on the Scheduler tab includes a list box that enables the administrator to filter for all schedules belonging to the same group, which makes it easy to quickly work with related schedules.

NOTE Even if you feel you don't need to organize schedules into groups, you are still required to enter one. The same goes for the Description field: although it is entirely descriptive, it is still required.

N The Description field allows you to enter descriptive text. You should use this field to briefly document the purpose of the schedule. Rather than summing up properties that have to do directly with the schedule itself (such as timing), this field should describe the kinds of reports that are run according to this schedule and the audience for which the schedule is intended.

N The Recurrence list box contains a number of options to specify how often the schedule is triggered. You can choose from a broad range of intervals, all the way from Seconds up to Yearly. There's also a Run Once option to schedule one-shot actions. If the flexibility offered by these options is still not sufficient, you can choose the Cron option to specify recurrence in the form of a cron string. The syntax of these strings is similar to that of the standard cron strings; see the examples following this list. You can find more information on cron expressions in Quartz at http://www.opensymphony.com/quartz/wikidocs/CronTriggersTutorial.html.

For all the available choices of the Recurrence option (except Cron), you can specify a Start Time by choosing the appropriate value in the hour, minute, and second list boxes. For all Recurrence options except Run Once and Cron, you can specify how often the schedule should be triggered. The appropriate widget to specify the value appears onscreen as soon as you select an option in the Recurrence list box. For example, for a Recurrence of seconds, minutes, hours, or days you can enter a number to indicate how many seconds, minutes, and so on lie between subsequent executions of the schedule.

The Weekly and Monthly recurrence options support more advanced possibilities for specifying the interval. For example, the Monthly and Daily Recurrence options pop up a widget that allows you to define the schedule for each Monday of every month, and the Weekly Recurrence option allows you to specify to which weekdays the schedule applies. Finally, you can specify a start date or a start and end date range. These dates delimit the period of time in which the schedule recurs. After specifying the schedule, press OK to save it and close the dialog.
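Note that Quartz cron expressions are not identical to operating system cron strings: they start with a seconds field and use a question mark for an unspecified day field. A few examples for illustration:

0 0 1 ? * FRI        every Friday at 01:00:00
0 30 23 * * ?        every day at 23:30:00
0 0/15 * * * ?       every 15 minutes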

Attaching an Action Sequence to a Schedule

To actually schedule the action sequence, you have to attach it to an existing schedule. This can be done in the Pentaho Administration Console or from the Pentaho User Console:

N In the Schedule Creator dialog shown in Figure 12-4, activate the Selected Files tab. From there, browse the solution repository and select the action sequence you want to schedule.

N In the Pentaho User Console, right-click the action sequence you want to schedule. In the context menu, choose Properties and the Properties dialog will open. In the Properties dialog, activate the Schedule tab. There, you can choose the appropriate schedule.

Monitoring

Two principal monitoring activities are logging and e-mail notification, as detailed in the following sections.

Logging

Kettle features a logging framework that is used to provide feedback during transformation and job runs. Logging is useful for monitoring progress during job and transformation execution, but is also useful for debugging purposes. Logging is discussed in detail in Chapter 14. In this chapter, we discuss a few aspects of logging that should help you understand how logging can help you to monitor your Kettle jobs and transformations.

Inspecting the Log

In Spoon, the log can be inspected in the Logging tab in the execution pane below the workspace. The Logging tab is shown in Figure 12-5.

Figure 12-5: Inspecting the log from within the Spoon program

The Logging tab consists mainly of a text area where the actual log lines are shown, and also provides a few toolbar buttons to work with the log. Log lines that indicate an error are colored red. For a quick inspection of only those log lines that indicate an error, click the toolbar button with the stop sign icon. This will pop up a dialog showing only those lines from the log that contain the word ERROR. Next to the stop sign icon is a button to clear the log, and finally a button to pop up a dialog for setting some options regarding logging. In this dialog you can set the logging level (discussed in the next subsection) and whether to include a date/time at the start of each log line. The log shown in the execution pane is mainly useful when developing transformations and jobs. It does not apply to scheduled jobs and transformations, because these are launched using the command-line tools (Kitchen and Pan), which simply do not have a graphical user interface in which to display a log. For Kitchen and Pan, you can use the logfile parameter to specify that the log should be stored in a particular file. If you do not specify a log file on the command line, the log will be written to the standard output. One thing to keep in mind is that Kitchen will only log information pertaining to the root job. If you also want to store the log from the jobs and transformations that are contained within the job, you have to configure that at the level of the individual job entries that make up the outer job. In Spoon, you can configure the logging behavior on the job entry level by right-clicking the job entry and then choosing "Edit job entry." This opens a dialog that contains a tab called "Logging settings" where you can specify the location of the log file for that particular job entry. This is shown in Figure 12-6. Note that in Figure 12-6, we're using variables to construct the name of the log file. In this case, we create the file in the directory where the outer job resides using the ${Internal.Job.Filename.Directory} variable. For the actual file name, we use the ${Internal.Step.Name} variable, and this will result in one separate log file per job entry, each having the name of the job entry that created it. Sometimes, it is more convenient to have one big log file for a job and its parts. To achieve that, check the "Append logfile?" checkbox. This ensures the logs for all job entries end up in the file you specified.

Figure 12-6: Configuring the log file location per job entry

Logging Levels

Kettle supports a number of logging levels to control the verbosity of the log. In order of increasing verbosity, the levels are:

N Nothing: Don't show any output.
N Error: Show errors only.
N Minimal: Use only minimal logging.
N Basic: This is the default basic logging level.
N Detailed: Give detailed logging output.
N Debug: For debugging purposes, very detailed output.
N Rowlevel: Logging at a row level; this can generate a lot of data.

As you can see, there are quite a number of logging levels. In practice, the most useful ones are:

N Error: Useful and appropriate in production environments for short-running jobs and transformations.

N Basic: Useful and appropriate in production environments for transformations and jobs that take some time to run. In addition to errors (which are also reported by the Error level and higher levels), this level also logs some information on the progress of the process, as some steps like the Text Input step use it to periodically log the number of rows read.

N Rowlevel: This is the most detailed log level. This level is only suitable for development and debugging purposes.

You should be aware that logging is not free: As the logging becomes more verbose, performance decreases. So you can’t just select the most verbose level to be on the safe side; you should consider in advance which level of logging is required for the task at hand. For jobs as well as transformations, the log level can be controlled by choosing the appropriate log level in a list box in the execution dialog. A screenshot of the execution dialog with the log level list box is shown in Figure 12-7.

Figure 12-7: Controlling the log level in Spoon

For the command-line tools Pan and Kitchen, you can use the level command-line parameter to set the desired logging level.
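For example, the following commands (with made-up file and log locations) run a job and a transformation with different logging levels:

./kitchen.sh /file:/home/foo/daily_load.kjb /level:Basic /logfile:/var/log/etl/daily_load.log
./pan.sh /file:/home/foo/load_fact_rentals.ktr /level:Rowlevel /logfile:/tmp/load_fact_rentals_trace.log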

Writing Custom Messages to the Log

You can write directly to the log from within a transformation or job using the “Write to log” transformation step or job entry respectively.

E-mail Notifications

A very simple method to get basic information about execution, completion, and errors is to build e-mail notifications into your Kettle jobs using job entries of the Mail type. You can find an example of basic usage of Mail job entries in the load_rentals.kjb job discussed in Chapter 4 and shown in Figure 4-6. This load_rentals job contains two job entries of the Mail type: Mail Success, which is executed upon successful execution of the final load_fact_rental transformation, and Mail Failure, which is executed in case any of the transformations that make up the job fails.

NOTE You may wonder why the job includes two distinct e-mail entries, and not just one that is sent only in case of failure. The reason is very simple: If no e-mail notification is sent in case of success, it would be unclear what it means exactly when one does not receive any e-mail. It could be that the process completed successfully, or it could mean the job failed but no mail was sent because the mail server was down, or there was a network error of some kind that prevented mail from being sent. By ensuring the job will always send mail, a missing e-mail automatically means something went wrong (perhaps the job didn't run at all, or perhaps the mail system is down). You could perhaps get by with one Mail job entry, but because the properties and content of the e-mails are very likely to differ depending on whether the job failed or not, it is more convenient to configure a separate Mail job entry for each case.

Configuring the Mail Job Entry

Configuration of the Mail job entry is not particularly difficult, although the number of configuration options may be a bit daunting at first. The configuration dialog contains four tabs.

Addresses Tab

In the Addresses tab you must specify at least one valid e-mail address in the "Destination address" property. Optionally, you can also configure CC and BCC addresses. In addition to the destination address, you must specify the "Sender name" and "Sender address" properties; these data are required by the SMTP protocol. You can optionally specify a "Reply to" address and some additional contact data such as the name and phone number of the contact person. For typical success/failure notifications, you would send notifications to the IT support staff, and specify details of a member of the data integration team as the sender. Figure 12-8 shows the Addresses tab page.

Server Tab

You must specify the details of the SMTP server in the Server tab page, shown in Figure 12-9. You are required to provide at least the host name or IP address of your SMTP server. Optionally, you can provide the port to use. By default, port 25 (the default for SMTP) is used. In most cases, SMTP servers require user authentication. To enable authentication, select the "Use authentication?" checkbox and provide the user name and password in the "Authentication user" and "Authentication password" properties, respectively. More and more often, SMTP servers require secure authentication using a protocol such as SSL (Secure Sockets Layer) or TLS (Transport Layer Security). You can specify secure authentication by selecting the "Use secure authentication?" checkbox and choosing the appropriate protocol in the "Secure connection type" list box. Note that network communication for a secure authentication protocol generally employs another port.

For SSL, the default port is 465. Contact your local network or system administrator to obtain this information.

Figure 12-8: The Addresses tab page in the configuration dialog of the Mail job entry

Figure 12-9: The Server tab page in the configuration dialog of the Mail job entry

EMail Message Tab

You can specify the actual message content on the EMail Message tab page, shown in Figure 12-10.

Figure 12-10: The EMail Message tab page in the configuration dialog of the Mail job entry

The message subject and body are specified in the Subject and Comment properties, respectively. You can use free text and include variable references in these properties. By default, Kettle includes a brief status report of the transformation in the message body, right after the content provided in the Comment property. To prevent this status report from being included, select the "Only send comment in mail body" checkbox. Optionally, you can check the "Use HTML in mail body" checkbox to send HTML-formatted e-mail. Some e-mail clients use message priority headers. If you like, you can select the "Manage priority" checkbox to enable this. When enabled, you can set the Priority and Importance properties.

Attached Files Tab

In the Attached Files tab, you can control whether files should be attached to the e-mail message and, if so, which ones. A screenshot of this tab is shown in Figure 12-11. To enable attachments, check the checkbox labeled "Attach file(s) to message." You can then select one or more of the predefined items in the list labeled "Select file type." To select a particular type of file, simply click it. To select multiple file types, hold the Ctrl key and click all items you want to include. Note that if you choose Log file, you need to configure logging for the individual job entries (as shown in Figure 12-6).

Figure 12-11: The Attached Files tab in the configuration dialog of the Mail job entry

Check the "Zip files to single archive?" checkbox to have Kettle store all files of the selected types into a single archive. If you choose this option, you must provide a filename for the resulting zip file in the "Name of zip archive" field.

Summary

In this chapter, we focused on two important aspects of running ETL jobs and transformations in a production environment, namely scheduling and monitoring. You learned about:

■ Kitchen and Pan, the command-line tools for running Kettle jobs and transformations
■ Operating system schedulers, such as cron, at, and the Windows Task Scheduler
■ Scheduling Kettle transformations and jobs using the Quartz scheduler built into the Pentaho BI Server
■ Controlling log files and verbosity in Spoon, Kitchen, and Pan
■ Using e-mail job entries to send notifications
■ Including the Kettle log in e-mail notifications

Chapter 13

Versioning and Migration

In earlier chapters we covered designing, building, and deploying ETL solutions and showed how to use Kettle as a single user development tool. In reality, there are usually multiple developers working on a project, which calls for means to manage the different deliverables using a version control system. Another requirement in most projects is a separation of the development, test, acceptance, and production environments. The following ETL subsystems cover these requirements:

N Subsystem 25: Version Control System N Subsystem 26: Version Migration System from development to test to production In this chapter, we discuss the various reasons behind the use of version control systems, and take a close look at a few popular open source version control systems. After that, we discuss what Kettle metadata actually is and in what formats it can be expressed. Then we explain how you can do versioning and migration with Kettle metadata.

Version Control Systems

When you are developing a data integration solution by yourself, or maybe with a small team, it's easy to keep track of what's going on. It's also fairly easy to find out what changed and who did it. However, things change drastically when the stakes go up and a data integration solution goes into production. Things change even more when you are working with a larger team or with a team that is geographically distributed. In those situations, you really want to keep your data integration services running smoothly and stably. For that to happen, it's important to know when even a tiny change is made to any part of the solution. Whenever a system runs into a problem (and sooner or later all systems do), the question is asked, "What did you change recently?" The answer is invariably "Nothing! Honest!" Changes occur in a solution in all shapes and forms, and can lead to problems in the system. It's possible that this happens because people are malicious or incompetent, but you have to believe that this is the exception. Usually changes occur in response to requests from peers, users, project management, or the IT infrastructure in general. The answer to the question, however, is extremely important. If you know the answer, you can fix the problem with great speed and efficiency. If you don't know the origin of the problem, you find yourself debugging an unknown situation. To prevent such problems, you need a system that can automatically keep track of all the changes in a project. In general, these systems are called version control systems (VCS) or revision control systems. In the past, VCS software was only available to the data integration development teams in large enterprises because the cost, thousands of dollars per seat, usually excluded usage in other situations. However, VCS is another area of software development where open source has made a big difference. Nowadays, you can find many varieties of version control systems. In the sections that follow, we list a few of the most popular systems. You can group them into two main types: file-based version control systems and content management systems (CMS).

File-Based Version Control Systems

File-based version control systems operate on individual files that are organized in a directory structure on a central server. You can access these files by checking out the whole directory. As files on the server are updated, you need to execute update operations on your local copies to get the latest versions. Each update of an individual file leads to a new revision number for that file.

Organization

It's important to be able to label a whole project or directory structure and not just individual files. Because of this, it's common for a VCS to allow users to create branches or tags of a main development directory tree. This observation leads to the typical tree-style directory organization that you see in many projects:

■ trunk/: The main development tree of the project.
■ branches/: A branch is a set of files that belong to a single version of the project.
■ tags/: A tag is a snapshot of the project files. It's simply a label applied for convenience.

If you look at the Kettle project source tree itself (http://source.pentaho.org/svnkettleroot/Kettle/), you will notice that the files there are organized in exactly the same way. At the time of this writing, trunk/ contains the recent development version of Kettle (the first milestone version of 4.1.0) while branches/ contains all the stable versions that were ever released as open source, from version 2.2.2 all the way to 4.0.0. The folders under tags/ are usually temporary snapshots for development tests or quality assurance purposes.

Leading File-Based VCSs

The following list describes a few of the most popular open source file-based version control systems:

■ CVS—One of the first open source version control systems to become popular was the Concurrent Versions System, or CVS. CVS, licensed under the GNU Public License (GPL), started as just a bunch of scripts almost 25 years ago. For a long time, it was the default choice for projects that needed a version control system. It allowed large groups of developers to work together efficiently. The main drawbacks of CVS include the lack of atomicity, limited Unicode support, limited binary file support, and expensive branching and tagging operations. It is also not possible to rename a file without losing its full history. In particular, the lack of atomicity could lead to a corrupted CVS repository on occasion. The more users work on a repository, the higher the chance that something goes wrong and the higher the impact of a corruption. Despite its shortcomings, using CVS beats not using any versioning system at all. Because it is known by many developers across the world, it is still maintained and available on almost all operating systems. As such, CVS is still available as a choice on popular free hosting platforms such as SourceForge.net.
■ Subversion—The shortcomings of CVS led to the development of a few new versioning systems. One of the most popular is currently Apache Subversion (http://subversion.apache.org/). When the project started in 2000, the main project aim was to create a mostly compatible version of CVS that does not have the limitations of CVS. The developers achieved this for the most part, and that has led to a fast rise in popularity. At the time of this writing, Subversion is probably the most popular version control system, with broad support of client tools, people, and operating systems. The main shortcomings of Subversion include a less than perfect implementation of the file rename operation. Like most file-based version control systems, Subversion uses the name of a file to handle most of the basic VCS operations. The Kettle project itself uses Subversion to handle the source code. Installing a Subversion server is well-documented and straightforward, so we recommend its use when you want to deploy your own VCS in your organization or for your data integration project. At the time of this writing, there is no integrated support for Subversion in the Kettle tools. However, because transformations and jobs can be saved as XML files, you can simply check your changes into a Subversion repository when you're done editing them in Spoon (see the sketch after this list).

NOTE When working with a Kettle file-based repository, saving as XML is the default option. With a database repository, you need to export to XML explicitly by selecting File ➪ Export ➪ To XML. Note that jobs get a .kjb extension and transformations get a .ktr extension, not .xml.

■ Distributed version control systems—The last few years saw the appearance of a number of new so-called distributed version control systems (DVCS). The main difference from client-server systems such as CVS or Apache Subversion is that with DVCSs there can be many repositories. Usually tools are available for the project lead developers to merge code from the various repositories. Open source projects that implement a DVCS include Git, Mercurial, Bazaar, and Fossil. Examples of large projects that use Git include the Linux kernel and Android; Mercurial is used by Mozilla and OpenOffice.org. Examples of projects that use Bazaar are Ubuntu and MySQL. As you can see, this list includes projects with a large developer group and codebase.
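Checking Kettle XML files into Subversion requires nothing Kettle-specific. The following commands are only a sketch (the repository URL and file names are made up) of how transformation and job files could be added to a working copy:

# Check out the project trunk, add new Kettle files, and commit them.
svn checkout http://svn.example.com/etl-project/trunk etl-project
cd etl-project
svn add load_customers.ktr daily_load.kjb
svn commit -m "Add customer load transformation and daily load job"

After this, every subsequent save in Spoon can be followed by a commit, giving you a full change history of the solution.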

Content Management Systems

Content management systems (CMSs) were primarily designed for server-based systems. Examples of these systems are Documentum, Sharepoint, Alfresco, Magnolia, Joomla, and Drupal. While it's certainly possible to download content from the server, the primary use is to allow people to work together. That usually includes workflow or business process management functionality. A lot of Business Process Management (BPM) and Workflow Management Systems (WMS) also have a CMS on board to manage the underlying content. Many CMSs make use of internal identifiers instead of filenames to operate. This allows for renaming and even the reorganization of complete directory structures without any problem. It's usually also possible to add extra information to stored documents such as descriptions, icons, file type information, and so on. All this makes most content management systems highly suitable for handling Kettle metadata.

Kettle Metadata

Before we dive into the available options for version control and migration when working with Kettle, it's helpful to explain what Kettle metadata is and why it is so well suited to the tools that will be discussed later. As mentioned earlier in this book, Kettle was designed to execute jobs and transformations described by ETL metadata using an engine. Let's take a closer look at what we mean by ETL metadata. Metadata is a very broad term that in general means descriptive data, or data about data. In the case of ETL metadata, we're describing tasks that an ETL tool needs to perform. Here are a few examples of things we describe in Kettle:

■ The detailed layout of an input file
■ The username to connect to a relational database
■ The URL of a web service

■ The name of a relational table where we want to store data
■ The operations needed to read, transform, and write the data
■ The flow of execution of the various operations

As you can tell from the preceding list, all functionality you can perform is described in the form of metadata. The transformation and job engines inside Kettle interpret this ETL metadata at runtime and execute all the defined tasks. ETL metadata can take many forms in Kettle. It usually starts to take shape when you enter the required information in Spoon, the graphical user interface. At that point, the metadata is graphical in nature and easy to interpret for ETL developers. The metadata is also present inside the Kettle software in the form of information stored in the Kettle Java API (see also Chapter 22). When you are done designing, your transformation or job can then be saved as XML or in another form in a metadata repository.

Kettle XML Metadata

XML was chosen as a format for Kettle metadata because it is an excellent interchange format that supports Unicode. Because XML is, in general, not easily readable by human beings, it is not a good programming language. On the other hand, Kettle XML files can be understood by those familiar with XML and the Kettle components because they directly reflect the various options and elements that are present in the user interface. Even so, manually editing an XML file is not something that people find enjoyable because of the often cryptic nature of the format. The good news is that the Spoon user interface is capable of defining 100 percent of the possible parts of the Kettle ETL metadata, so it's quite safe to leave Kettle XML generation in the realm of the software. The structure of transformations and jobs is fairly simple. By default, Kettle uses no attributes, only elements. The two main metadata file types in a Kettle solution are job files and transformation files. These are easily recognizable by their extensions, which are .kjb and .ktr, respectively. Both are XML files, meaning that they have a nested collection of elements that make up the job or transformation. Since you'll usually construct transformations first and later build the jobs that stitch them together into a process, we'll first cover the transformation XML, followed by the job XML.

Transformation XML

An XML file always has a single root element at the highest level of the nested hierarchy. The root element in a .ktr file is always called <transformation>, and it contains the following sub-elements:

■ info: Contains the transformation details such as the name and the description.
■ trans-log-table: Contains the transformation log table settings.
■ perf-log-table: Contains the performance log table settings.

■ channel-log-table: Contains the channel log table settings.
■ step-log-table: Contains the step log table settings.
■ notepads: Contains all the notes that are shown in the user interface.
■ connection, slaveservers, clusterschema, partitionschema: Describe (in order) the database connections, the slave servers, the cluster schemas, and the partition schemas. More than one element of each can be present; for instance, a transformation can contain multiple connections and slave servers.
■ order: Contains the metadata about how the hops connect the steps in a certain order.
■ hop: Contains the connections between the different steps. Hops are a sub-element of order.
■ step: Contains step-specific metadata. More than one step element can be present. While the other elements in the transformation XML are always the same, this element is always going to have a different structure. That is because all steps are different and because steps can be plugged into Kettle. However, for any given step type (identified by the type tag) the format is always the same.
■ step_error_handling: Contains the error sub-elements that specify the source and target step for the error handling, and the other error attributes such as the maximum number of allowed errors.

Note that most of these elements are optional. For instance, if there is no transformation log table defined, the trans-log-table element won't be available in the .ktr file.

Job XML

The root element of the XML file that defines a Kettle job is always <job>. The structure of a job file is slightly different from the structure of a transformation, but many sub-elements are very similar:

■ name, description: Unlike a transformation, a job file doesn't contain an info element but lists the name and description as first-level elements.
■ job-log-table: Contains the job log table settings.
■ channel-log-table: Contains the channel log table settings.
■ jobentry-log-table: Contains the job entry log table settings.
■ notepads: Contains all the notes that are shown in the user interface.
■ connection and slaveservers: Describe the database connections and the slave servers. More than one element can be present.
■ hops: Contains the metadata about how the job entry hops connect the job entry copies in a certain order.
■ entries: Contains job entry–specific metadata. The entry elements listed here are always going to have a different structure. That is again because all job entries are different and can be plugged into Kettle. However, for any given job entry type (identified by the type tag) the format is always the same.

For the average Kettle user, there are not a lot of reasons why you might want to look at the Kettle XML, let alone modify something in the XML document. However, there are a few situations where you might want to modify or generate the XML yourself. The following sections provide a few examples that might give you some ideas.

Global Replace

Suppose you have a few hundred Kettle transformations and jobs stored in the form of XML. The extensions of these files are .ktr and .kjb respectively. Changing a single parameter in all these files by opening them up in Spoon is bound to be a highly repetitive task. It will also be easy to overlook occurrences of the parameter, making the task error prone. For example, suppose you found out after development that the production system is using a different database schema. Unfortunately, this value was hardcoded as dwh in all transformations. In that situation, you could write a shell script on UNIX, Linux, OS X, or even on Windows (using Cygwin for example) that replaces all occurrences of the corresponding <schema> tag in all transformations:

# Prevent loss of information: stop after an error
set -e

# Loop over all the transformation XML files:
for file in *.ktr
do
  # Filenames can contain spaces: quote them
  < "$file" \
  sed 's/<schema>dwh<\/schema>/<schema>${DWH_SCHEMA}<\/schema>/g' \
  > TEMPFILE

  rm "$file"
  mv TEMPFILE "$file"
done

The preceding script (globalreplace.sh, available in the download files on the book's website) replaces all occurrences of <schema>dwh</schema> with <schema>${DWH_SCHEMA}</schema>. The result of the execution of the script will be that the schema will be configurable with a variable in all transformations. For someone who knows a scripting language such as shell (bash), awk, Perl, or Ruby, this can be a simple and quick way to correct small problems in your Kettle metadata. The command that does all the magic here is sed, short for stream editor. Sed is one of the most powerful utilities for modifying text files in a *nix environment. An excellent introduction and tutorial can be found at http://www.grymoire.com/Unix/Sed.html.

Kettle Repository Metadata

Since version 4 of Kettle, repository types can be plugins. As a direct result of that, Kettle metadata can take any possible form. The Repository interface contains all the required methods to serialize Kettle transformations and jobs as well as shared objects like database connections and slave servers. To allow ETL developers to categorize their objects, Kettle repositories also contain directories. Let's take a look at the characteristics of the current repository type implementations.

The Kettle Database Repository Type

The Kettle database repository type serializes Kettle metadata to a relational database schema. This schema uses tables to contain metadata related to the various components in transformations and jobs. For example, there is a table called R_TRANSFORMATION that contains the name, description, and extended description of a Kettle transformation. The steps are stored in a table called R_STEP, and so on. Each table contains a unique identifier that allows all objects to be linked together. The database repository allows ETL users to work together. It does so by allowing a relational database to be used as a central Kettle metadata source. The creation of the repository is done by Kettle in Spoon, and this process even supports upgrades. Select Tools ➪ Repository ➪ Connect to begin creating a repository, as shown in Figure 13-1.

Figure 13-1: Connecting to a repository

This will present the Repository Connection dialog, shown in Figure 13-2, which you can also see when Spoon starts up.

Figure 13-2: Managing repositories

Clicking the + icon will display a list of available repository types that allows you to define new repository connections. For this example, you want to define a Kettle database repository, so select that option. This brings up the dialog shown in Figure 13-3.

Figure 13-3: The Kettle database repository dialog

In this dialog, you can define the connection to the database schema that contains the Kettle repository. You can create a new database repository on the selected database connection. This will create all the required database tables and indexes to make the repository work. While the Kettle database repository does the job, it has a few drawbacks:

■ It can't store multiple revisions of a transformation or job.
■ It relies heavily on the database to properly lock and unlock tables to prevent work getting lost.
■ There is no notion of team development, and you can't lock transformations or files for exclusive private use.
■ The security system is proprietary and simple.

The Kettle File Repository Type

The Kettle file repository type simply uses the existing Kettle XML serialization to store documents. It uses the Apache VFS driver (see Chapter 2) to access Kettle metadata. That means you are not limited to a local disk. You can, in fact, also use .zip archives, FTP, and HTTP locations. Creating a file repository is easy; simply follow the instructions in the previous section and select the Kettle file repository type when asked. You can point your file repository to an existing folder with Kettle transformations and jobs to see how it works. Because the serialization is to XML, this repository type has the same drawbacks as XML as well:

■ You don't know if objects (transformations, jobs, databases, and so on) are referenced in another job or transformation. This means you can't safely delete or rename anything.
■ There is no version history.
■ Team work is hard, even if you use shared folders, because there is no locking possibility at all.
■ There is no security layer except for the file system security of the operating system.

The Kettle Enterprise Repository Type

When the Pentaho team went looking for a repository for its enterprise edition version of Kettle (Pentaho Data Integration EE), they wanted to solve the various drawbacks present in both XML serialization and the existing database repository. The drawbacks of the popular file-based VCSs also had to be taken into account. More specifically, the drawbacks with respect to file renaming or file moving make it hard to organize your enterprise metadata repository to your liking. It's also harder to make a decent security layer on top of a VCS. To address those problems, Pentaho looked in the direction of a content management system for use as an enterprise repository. Currently, the enterprise repository runs on the Pentaho Data Integration Server, which includes an assortment of resources: an Apache Tomcat server with a scheduler on board, a series of web services to act as a Carte replacement, a security layer, and Apache Jackrabbit. Jackrabbit (http://jackrabbit.apache.org/) is an open source CMS implementation for Java implementing the Java Content Repository (JCR) specification. This functionality includes version management, security, locking, referencing by ID, metadata, querying, renaming, and reorganization of your documents. This functionality of Jackrabbit is exposed on the Data Integration Server by a set of web services that will allow Pentaho to easily upgrade Jackrabbit itself and possibly add new functionality in the future without any required changes for the clients that talk to the server. There are plans to have the Pentaho BI Server and other tools in the suite support the repository in the near future. However, at the time of this writing, Kettle (Pentaho Data Integration EE) is the only client for the enterprise repository.

Managing Repositories

As explained earlier, a Kettle repository stores all the metadata that makes up an ETL solution. In most cases there will be more than one repository: one for development, one for test, one for acceptance, and one for production. Usually each developer has his or her own development repository, and a centralized repository is created to assemble the complete solution. Since the Kettle repository plays such a central role during the project lifecycle, you need to make regular backups using the tools for exporting repositories described in the following sections. It’s also possible to import a complete repository, which is helpful for migrating solutions from test to acceptance to production.

Exporting and Importing Repositories

The primary use of the export and import functionality, available under the Tools ➪ Repository ➪ Export menu in Spoon, is to back up a repository. This functionality works by serializing the contents to an XML file. This means you get a single XML file with the root element <repository>. A repository export can also be performed from within a job, using a dedicated job entry, as shown in Figure 13-4.

Figure 13-4: Exporting a repository in a job

With this job entry, you can automate the regular backup of your repository yourself. It has the distinct advantage of being able to export individual folders. Finally, you can also use the Pan utility to export a repository. For example, you can use the following command:

sh pan.sh -rep="Production" -user="admin" -password="admin" -exprep="/tmp/export.xml"
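If you want to schedule such an export as a nightly backup, you could wrap the command in a small shell script that adds a date stamp to the export file. This is only a sketch that reuses the connection options shown above; the backup directory is a placeholder:

#!/bin/sh
# Nightly export of the "Production" repository to a date-stamped XML file.
STAMP=`date +%Y%m%d`
sh pan.sh -rep="Production" -user="admin" -password="admin" \
  -exprep="/backup/kettle/repository_export_${STAMP}.xml"

Such a script can then be scheduled with cron or the Windows Task Scheduler, as described in Chapter 12.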

Upgrading Your Repository

When there is more metadata that needs to be stored in a new version of Kettle, you will need to upgrade your repository. The golden rule for doing this is the same for all software upgrades: Make a backup first! Preferably you do this with both of the following two methods:

■ Get a complete export of the repository to an XML file (see above).
■ Back up the repository database itself. In the case of a Kettle database repository or the Pentaho enterprise repository, you should take a complete dump or export of the relational database where the metadata is stored (a sketch follows this list).
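As an illustration of the second method, a database repository hosted on MySQL could be backed up with mysqldump. The schema name and user are placeholders; adjust them to your own setup:

# Dump the schema that holds the Kettle repository tables
# (R_TRANSFORMATION, R_STEP, and so on) to a plain SQL file.
mysqldump --user=kettle_repo --password --single-transaction \
  kettle_repo > kettle_repo_backup.sql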

Once that is done, you can try to run the upgrade functionality of the Kettle database repository dialog or follow the specific upgrade guide for the enterprise repository. The alternative is to create a completely new repository and perform an import. This has the advantage of leaving the existing repository untouched. See the “Upgrade Tests” section of Chapter 11 for more details.

Version Migration System

One of the critical aspects of the lifecycle of a software project is migrating (all or parts of) a solution from a developer machine to a test, acceptance, and finally, production system. The first question that arises is: how do you move transformations and jobs from development to production? As you can probably tell by now, this depends on the way you are persisting your Kettle metadata. In the sections that follow, we cover possible ways to work with XML files and repositories for managing your project's lifecycle.

Managing XML Files

As mentioned earlier in this chapter, your best bet for keeping track of your transformation and job files is to check them into your version control system (VCS) of choice. That way, you can nicely do your development on one or more workstations with your team. Put all your development work in the trunk/ folder as described earlier.

WARNING If you are doing heavy development with a lot of changes, instruct your team to do frequent updates of the local copies to keep conflicts to a minimum later. By conflict we mean that you are trying to check in a file that was already updated earlier by someone else. Since overwriting changes made by someone else is not something you want to do, it’s usually something the VCS is going to complain about. These conflicts are unfortunately a direct consequence of having multiple copies of the same file.

When it comes time to test your data integration solution, you can tag the trunk (copy it to the tags/ folder). This tag can then be checked out on a test server. When you are done fixing things and you are ready to do user acceptance, you tag it again and check the tag out on the user acceptance server. Finally, when you are ready to go into production, you can branch the user-accepted solution and set it in production. The scenario sketched here is pretty straightforward for moving solutions up the chain. One of the main benefits of using a VCS, however, is the possibility to roll back changes as well. When moving from acceptance to production, there's always a risk that some unforeseen incident happens. Having a previous stable version at hand will save the day in such a case.
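In Subversion, tags and branches are cheap server-side copies, so the promotion steps described above boil down to a few svn copy commands. The following is only a sketch; the repository URLs and names are made up:

# Tag the current trunk for a test cycle ...
svn copy http://svn.example.com/etl-project/trunk \
    http://svn.example.com/etl-project/tags/test-20100801 \
    -m "Snapshot for the August test cycle"

# ... check the tag out on the test server ...
svn checkout http://svn.example.com/etl-project/tags/test-20100801 /opt/etl

# ... and branch the user-accepted version when it goes into production.
svn copy http://svn.example.com/etl-project/tags/test-20100801 \
    http://svn.example.com/etl-project/branches/production-1.0 \
    -m "Branch the user-accepted solution for production"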

Managing Repositories

The easiest way to have development, testing, acceptance, and production environments in place when you are using the Kettle database repository is to create multiple repositories, one for each environment.

WARNING Unless you use the Enterprise Edition, the Kettle database repository doesn't have locking and versioning capabilities and is not suited for team development. Each developer should use a separate repository.

There are several ways to advance changes from one repository to another. The first is to simply do an export and an import. That is probably the quickest way to work initially. The second method of passing changes is to open a transformation or job in Spoon, disconnect from the current repository, and open a connection to another. Then you can save your file in that repository. If you're working with multiple developers but want to merge different parts of a solution into a single test repository, a good approach would be to appoint one of the team members as the gatekeeper to do all the check-ins for the test repository.

Parameterizing Your Solution

The trick to keeping the work of advancing transformations and jobs to production to a minimum is to parameterize your solution. This means that you have to be vigilant about using variables and parameters for all things that are different in another environment or could be different in the future. For example, watch out with database connections: Always make sure to use variables for the values you enter in a database connection. Try to avoid the temptation to quickly hardcode a hostname, a database, or a username because before you know it a transformation with that connection in it is going to end up on a test or production system and seriously foul things up. Also be careful with file locations. It is tempting and convenient to use relative file paths such as ${Internal.Transformation.Filename.Directory}. Keep in mind, however, that when you are executing a transformation or job remotely or use a repository, this variable has no meaning anymore. Make sure to create separate variables for all meaningful file locations that you might have. Once your complete solution is using variables, simply define a separate set of values for each environment and you're done. You can use the kettle.properties file described elsewhere in this book. You can also create your own file and set the variables in the first job entry of the main jobs of your solution. There are other approaches you can use as well, such as using a parameter table. Working with a separate database table can add extra flexibility, such as having the option to use validfrom and validto dates for your parameter values. A simple example of a parameter table is shown in Table 13-1. Knowing how to pass the values from a database parameter table to a Kettle transformation has another benefit as well. Organizations might already have another ETL tool in place and are looking for a replacement. Many of the existing solutions already have a parameter table because not all ETL tools have good support for environment variables and parameters. In this case it might be useful to keep the existing tables and parameters and reuse them in a Kettle solution.

Table 13-1: ETL Parameter Table

ID  ENVIRONMENT  PRM_NAME  PRM_VALUE  VALIDFROM   VALIDTO
1   dev          dbhost    sagitta    01/01/1990  12/31/2999
2   tst          dbhost    virgo      01/01/1990  12/31/2999
3   acc          dbhost    scorpio    01/01/1990  12/31/2999
4   prd          dbhost    aquarius   01/01/1990  06/30/2010
5   prd          dbhost    capricorn  07/01/2010  12/31/2999
6   dev          uname     usr123     01/01/2009  12/31/2999
7   prd          uname     usr456     01/01/1990  12/31/2999

So how do you get these values into Kettle variables? That's pretty straightforward, actually. Take a look at the example in Table 13-1 again. The only environment variable you need is environment. The value for environment, combined with the sysdate, suffices to retrieve all parameters with their values for a particular environment. In fact, this is a very common scenario where so-called key/value pairs are used.

NOTE Dealing with key/value pairs is covered in depth in Chapter 20.

For a production run, the Kettle environment variable gets the value prd, and the "Table input" step needed to retrieve the values uses the following SQL query:

SELECT prm_name
     , prm_value
FROM   kettle_param
WHERE  environment = '${environment}'
AND    validto >= sysdate

This will return the two rows of data from the example parameter table that are displayed in Table 13-2.

Table 13-2: Parameter Query Results

PRM_NAME  PRM_VALUE
dbhost    capricorn
uname     usr456

Now you need to find a way to feed this information into the Set Variables step. This step can read the value from a field delivered by a preceding step and put it into a Kettle variable. The problem, however, is that the data must be in the form displayed in Table 13-3.

Table 13-3: Required Environment Variables Data Structure

DBHOST     UNAME
capricorn  usr456

To accomplish this, you need one extra step: the “Row denormalizer.” This will let you transform the data as displayed in Table 13-2 into the data as displayed in Table 13-3. Figure 13-5 shows what the “Row denormalizer” settings should be in this case.

Figure 13-5: Row denormalizer step

As Figure 13-5 clearly shows, there's no Group field required to unpivot the data. Figure 13-6 shows the completed transformation with the three steps needed to load parameter values from a database table into Kettle variables.

Figure 13-6: Completed set variables transformation

The table_params transformation is also available from the book's companion website. Remember that if you want to use a solution like this, you need to keep it as a separate transformation that needs to be executed prior to other jobs or transformations.
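To tie this together, each environment typically gets its own set of values for the same variable names. The following kettle.properties fragment is only a sketch for a test server; the variable names are made up and the values echo Table 13-1:

# $HOME/.kettle/kettle.properties on the test server
DB_HOST=virgo
DB_NAME=dwh
DB_USER=usr123
INPUT_DIR=/data/test/input

The production server gets a file with the same variable names but production values, so the transformations and jobs themselves never change when they are promoted.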

Summary

In this chapter we covered two important ETL subsystems: subsystem 25, the version control system, and subsystem 26, the version migration system. The following topics were discussed:

■ What version control systems are and how to turn them to your advantage for your own data integration projects
■ The Kettle metadata for jobs and transformations
■ The different repository types that Kettle can work with
■ How to set up and upgrade a Kettle repository
■ The two methods of promoting solutions from development to test to acceptance to production
■ How to parameterize a Kettle solution using properties files and parameter tables

Chapter 14

Lineage and Auditing

When you create complex data integration solutions with a lot of Kettle jobs and transformations, you may find it challenging to keep track of the results. At the same time, it is extremely important to keep an audit trail to identify and diagnose problems after they occur. It's important to know what exactly was executed, where errors occurred, and how long it takes to execute a job. In this chapter, we show you how to perform all these tasks and more on the topics of lineage, impact analysis, and auditing. As you may recall from the ETL subsystem overview in Chapter 5, subsystem 29 covers the lineage and dependency analyzer. Lineage looks "backward" to the process and transformation steps that created the result data set you are analyzing, whereas impact analysis is executed from the start of the process. Roughly speaking, impact analysis is done from the source, and lineage analysis is done from the target. For both impact and lineage analysis you need metadata, and this chapter begins by showing how you can use a transformation to read the Kettle metadata. This allows you to automate the extraction of lineage information so you can make it part of your nightly batches or even share this lineage information with third-party software. Next, you'll learn more about the various kinds of lineage information. You will see where field-level lineage information and database impact analysis can be obtained in Spoon, and learn how to write a transformation to get the results of a database impact analysis so you can make it part of your batch processes. Finally, you'll get information about Kettle logging and operational metadata, which actually touches upon more than one ETL subsystem. The subsystems that are partly covered are:

■ Subsystem 6, the audit dimension assembler
■ Subsystem 27, the workflow monitor
■ Subsystem 33, the compliance reporter

Batch-Level Lineage Extraction

Keeping track of all the data streams in an organization becomes more difficult as more systems are added to the infrastructure. With the proliferation of applications backed by relational databases, you automatically get an increase in data transfers taking place. In part, data integration tools such as Kettle help solve this problem by making it easy to track down sources and targets in the user interface. That being said, a data integration tool in a large organization is typically only responsible for a small part of the data transfers. Because of this, it's important that the data integration tool itself can report to third-party systems that keep an inventory of all the data flows. If it can't do that, then it is part of the problem, not part of the solution. To solve this challenge with Kettle on a transformation level, we offer a small example that searches for all the Table Output steps in a set of transformations in a directory, and then reports the name of the transformation and the step as well as the used database and table name. Figure 14-1 shows a transformation that does exactly that and some sample output (the extract-transformation-metadata.ktr file is included in the download folder for this chapter).

Figure 14-1: Extracting lineage information from a transformation

This transformation works by reading a list of transformation filenames from a directory. You can use the wildcard *.ktr for this. Then you parse the content of the transformation XML by looping over the /transformation/step elements. For more information on the structure of transformation and job XML, see Chapter 13. The information you need is stored in XML elements (nodes) and can be retrieved with the following absolute and relative XPath expressions:

■ Transformation name: /transformation/info/name
■ Step name: ./name

■ Step type: ./type
■ Database connection name: ./connection
■ Name of the target database table: ./table

All that is then left to do for the exercise is to filter out the TableOutput step type and remove those fields you don't want. This simple exercise demonstrates that it's actually quite easy to extract all sorts of lineage information and impact analyses simply by looking at the Kettle metadata in XML format.
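If you prefer the command line over a transformation, the same XPath expressions can be evaluated with any XPath-capable tool. The following sketch uses xmllint (part of libxml2); the step type string TableOutput is taken from the list above, and everything else is an assumption about your local setup:

# Print the transformation name and the Table Output target tables
# for every .ktr file in the current directory.
for f in *.ktr
do
  name=`xmllint --xpath "string(/transformation/info/name)" "$f"`
  tables=`xmllint --xpath "//step[type='TableOutput']/table/text()" "$f" 2>/dev/null`
  echo "$f ($name): $tables"
done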

NOTE If you are using repositories, note that there are several ways to export repositories to XML. See Chapter 13 for more information on this topic.

The act of exposing the metadata in a generic fashion allows you to make it available for all sorts of purposes. For example, the extracted metadata can be stored in a relational database for further analysis with other tools in the Pentaho suite, such as reporting and analysis. You could also monitor the quality of the ETL solution itself simply by looking at the various metadata settings. For example, you can verify if a transformation or job has the appropriate logging configuration, if descriptions are available in the transformation, or if there are any notes available. On a job level, it is also possible to extract interesting metadata. Here are a few suggestions:

■ List all the source FTP systems by looking at the FTP job entries. This gives you a list of all the systems that files are retrieved from (a sketch follows this list).
■ Take a look at the settings in the Mail job entries and see if any hardcoded settings such as e-mail addresses are being used.
■ List all the jobs where files are being copied or deleted.

All these challenges are answered in almost the same way as demonstrated in the example extract-transformation-metadata.ktr. The only differences are that you are reading job XML files and that the information is stored in different nodes.
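As a sketch of the first suggestion, the following loop scans job files for FTP entries and prints the servers they connect to. The element names used here (type and servername) are assumptions; verify them against one of your own .kjb files before relying on the output:

for f in *.kjb
do
  if grep -q "<type>FTP</type>" "$f"
  then
    echo "$f:"
    grep -o "<servername>[^<]*</servername>" "$f"
  fi
done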

Lineage

In the context of a transformation, lineage means that you want to learn where information is coming from, in which steps it is being added or modified, or in which database table it ends up. This section covers the various lineage features in Kettle.

Lineage Information

In a Kettle transformation, new fields are added to the input of a step in a way that is designed to minimize the mapping effort. The rule of thumb is that if a field is not changed or used, it doesn't need to be specified in a step. This minimizes the maintenance cost of adding or renaming fields. The row metadata architecture that the developers put in place not only allows you to see which fields are entering a step and what the output looks like, but it can also show you where a field was last modified or created. Take a look back at Figure 14-1. Now open the transformation (extract-transformation-metadata.ktr), right-click on the "Get data from XML" step and select the "Show input fields" option. This will list all the fields that provide input for this step. In this case, the output will list all sorts of information regarding the file names that are retrieved. The Step origin column will in each case indicate the Get File Names step as its source. When you select the "Show output fields" option from the same menu you will get a number of extra fields that originate from the "Get data from XML" step, as indicated in Figure 14-2.

NOTE You can mouse over a step and press the spacebar to see the output fields for that step.

Figure 14-2: Retrieving the output fields of a step

Because no fields are being removed and no data is modified in the "TableOutput only" step, the output fields are exactly the same as for the previous step. The file names from the very first step are in fact passed along until they are explicitly removed by the "Select values" step at the very end. This simple principle of passing fields along makes it easy to change field names or add steps at the front of the transformation with minimal impact on the other steps. Obviously that does not mean that you don't have to explain to Kettle how fields are used. In the steps themselves, you still have to specify which fields to use as input if this is required. For example, in the "Get data from XML" step you have to indicate that you want to read file names from the filename field. In most steps, lists, drop-down buttons, or helpers are available to make this process painless.

Impact Analysis Information

Sometimes it can be interesting to know all the places where a transformation is using a database. For example, you might be wondering if the same database table is used in two different steps in the same transformation. The database impact analysis offered in Spoon on a transformation level lists all that and more. Figure 14-3 shows output of the impact analysis of a transformation that loads data into a fact table.

Figure 14-3: Displaying database impact information

The information listed is extracted from the individual steps in a transformation. Each step that uses a database connection can add rows of database impact information. When possible, the steps will include information down to the field level. If you want to make this information available outside of the context of the user interface, you can do so because this functionality is implemented on an API level.

That opens up a lot of possibilities. To keep things simple, you can use a few lines of JavaScript (see also the db-impact.ktr transformation in the download folder):

var transMeta = new Packages.org.pentaho.di.trans.TransMeta(filename);
var transname = transMeta.getName();
var impact = new Packages.java.util.ArrayList();
var error = "";

try {
  transMeta.analyseImpact(impact, null);
} catch(e) {
  error = e.toString();
}

for (i = 0; i < impact.size(); i++) {
  var dbi = impact.get(i);
  var newRow = createRowCopy(getOutputRowMeta().size());
  var rowIndex = getInputRowMeta().size();

  newRow[rowIndex++] = transname;
  newRow[rowIndex++] = dbi.getStepName();
  newRow[rowIndex++] = dbi.getTypeDesc();
  newRow[rowIndex++] = dbi.getDatabaseName();
  newRow[rowIndex++] = dbi.getTable();
  newRow[rowIndex++] = "N";
  newRow[rowIndex++] = error;

  putRow(newRow);
}

var ignore = "Y";

This script loads the transformation metadata and uses the Kettle Java API to extract all sorts of information such as the type of database impact and the names of the transformation, step, database, and table. The script is even capable of handling any errors that might occur when the impact is analyzed by the transformation. The script then generates N rows for every file name it receives on input. To make that happen, it uses the putRow function that exposes the method with the same name of the underlying step. You can find more information on putRow in Chapter 23. The information that is obtained can be put on a report or stored in a relational database for further analysis. That would, for example, make it possible to list all the transformations that make use of a certain database table in one way or another. That information can then be used when changes are made to that table. Figure 14-4 shows a few lines of output from the example.

Figure 14-4: Externalizing database impact analysis

Logging and Operational Metadata

The Kettle developers have always put a lot of effort into making sure that the ETL developers and operators get good logging capabilities. In this section, you can read all about the Kettle logging architecture and how to configure it.

Logging Basics

Most components in Kettle can output logging information in the form of lines of text. For example, when a step finishes, a line is generated to indicate this event:

2010/06/18 10:36:29 - Stepname.0 - Finished processing (I=0, O=0, R=0, W=25, U=0, E=0)

You can recognize three main parts in the log lines:

■ The date and time
■ The name of the step followed by a period and the step copy number
■ The logging text

These parts are separated by a space, a dash, and a space. When you execute a transformation or a job, you can choose the logging level at which you want to run. Depending on the level you pick, more or fewer log lines will be generated. Here are the available logging levels in Kettle:

■ Rowlevel: Prints all the available logging information in Kettle, including individual rows in a number of more complex steps.

■ Debugging: Generates a lot of logging information as well, but not on the row level.
■ Detailed: Allows the user to see a bit more compared to the basic logging level. Examples of extra information generated include SQL queries and DDL in general.
■ Basic: The default logging level; prints only those messages that reflect execution on a step or job-entry level.
■ Minimal: Informs you of information on only a job or transformation level.
■ Error logging only: Shows the error message if there is an error; otherwise, nothing is displayed.
■ Nothing at all: Does not generate any log lines at all, not even when there is an error.

The logging level can be set in the job and transformation execution dialogs in Spoon. It can also be specified on the command line when you use Pan or Kitchen, with the -level option. Finally, you can also change the default logging level by going to the Execution Results pane of the transformation or job you are editing in Spoon. From the Logging tab, select the toolbar icon (the crossed wrench and screwdriver) to display the dialog shown in Figure 14-5.

Figure 14-5: Setting the logging parameters

The interesting feature of this dialog is that you can specify the text to use as a filter. If you do so, only those log lines that contain the specified text will be retained. You can use this if you are looking for specific values in a detailed logging level. For example, if you specify the name of a step, only the lines that contain the name of the step will be included in the log.
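As mentioned above, the logging level can also be passed on the command line. The following is a sketch of running a job with Kitchen at the Detailed level; the file name is made up:

sh kitchen.sh -file="/opt/etl/daily_load.kjb" -level=Detailed

The same -level option works for Pan when you run a single transformation.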

Logging Architecture

Since version 4 of Kettle, all log lines are kept in a central buffer. This buffer does not simply store the text of the log lines as described in the preceding section. These are the pieces that are stored:

■ The date and time: This component allows the date-time to be colored blue in the logging windows in Spoon.

■ The logging level: This allows Spoon to show error lines in red in the logging windows.
■ An incremental unique number: Kettle uses this for incremental updates of the various logging windows or log retrieval during remote execution.
■ The logging text: The actual textual information that is generated to help the developer see what is going on.
■ A log channel ID: This is a randomly generated (quasi) unique string that identifies the Kettle component where the log line originated.

Keeping logging lines in memory has traditionally been one of the most common causes of running out of memory. When you execute a job or transformation with a high logging level, a lot of text is generated. If you then ask Kettle to store this information in a logging table somewhere, you are asking it to keep all this in memory, and you can run out of memory as a result. At the same time, it is unlikely that any ETL developer is going to look at hundreds of thousands of lines of debugging information. Usually you are only interested in the last thousand rows before an error occurred. Because of this, and also for the reasons described in Chapter 18 on the topic of real-time data integration, you can limit the number of rows kept in the central log buffer by using the parameters described in the following sections.

Setting a Maximum Buffer Size

The first limit you can put in place is simply to set the number of rows kept in the log buffer. You can do this in the Options dialog in Spoon. Use the option "Maximum nr of lines in the logging windows". Please note that it will affect logging in Spoon only! Another option is to set the KETTLE_MAX_LOG_SIZE_IN_LINES environment variable. You can simply include this variable in the kettle.properties file to activate this feature. (See also the "Logging" section in Chapter 18 to read about the real-time data integration implications.) Finally, you can also specify this limitation in the Carte configuration file. For example, you can include the following lines in the slave configuration XML file to limit the size of the central log buffer to 10,000 lines:

<!-- Prevent out of memory by only keeping 10,000
     rows in the central log buffer -->
<max_log_lines>10000</max_log_lines>

Setting a Maximum Log Line Age

A more intelligent way to solve the problem of running out of memory is to specify the maximum age of a log line. Because you know the time when the logging record was added to the buffer, you can calculate the age as well. Similar to setting the buffer size, you can specify this parameter in the Spoon Options dialog under the option "Central log line store timeout in minutes." Again, remember that this option affects execution in Spoon only. The environment variable that will define this age for all Kettle tools is called KETTLE_MAX_LOG_TIMEOUT_IN_MINUTES. To set this variable in the kettle.properties file, select "Edit the kettle.properties file" from the Edit menu in Spoon version 4 or later. You can also add these lines to a Carte slave configuration XML file to limit the age of a log line in the central buffer:

<!-- Discard log lines if they are in the
     buffer for more than 2 days -->
<max_log_timeout_minutes>2880</max_log_timeout_minutes>
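If you prefer to configure both limits in one place for all Kettle tools, the two environment variables mentioned above can simply be added to kettle.properties; a minimal sketch:

# Keep at most 10,000 log lines in the central buffer,
# and discard lines that are older than two days (2,880 minutes).
KETTLE_MAX_LOG_SIZE_IN_LINES=10000
KETTLE_MAX_LOG_TIMEOUT_IN_MINUTES=2880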

Log Channels

Before Kettle version 4, all the logging text was simply written to the logging back end (read about Apache Log4J at http://logging.apache.org/log4j). This caused all sorts of interesting problems, ranging from the mixture of logging text when you run transformations or jobs in parallel on the same machine to excessive memory consumption when individual components kept logging text in memory for storage in logging tables (see below). Once the conversion to text was made, it was also not always simple to filter out lines pertaining to a certain step or database. To counter these problems, a single logging buffer was created in Kettle version 4, as described in the earlier section "Logging Architecture." When you now execute a job or transformation, each component (job entry, step, transformation, and database connection) creates a separate log channel with a unique ID that allows you to identify where the logging line originated. Because Kettle also keeps track of the parent log channel ID of the components, an execution hierarchy is kept internally. Because this hierarchy is kept separately, it is possible for Kettle to retrieve log lines pertaining to each individual component as well as its children. For example, suppose you execute a job in Spoon and a long-running transformation is executing. You'll see a small blue icon drawn over the active job entry. You can then open the associated transformation by using the right-click menu and selecting the "Open transformation" option. Note that this works for sub-jobs as well. Doing this will show the transformation (or job) as if you had started it separately, with all the metrics and logging information available to you. You can then see that the logging information shown comes only from the selected transformation, not from the parent job or anything else. This is made possible by the central logging buffer and the hierarchical logging channel architecture.

Log Text Capturing in a Job

Since version 4 of Kettle, you can find the logging text of a job entry in the result of its execution. With a JavaScript job entry you can extract this log text. While the main goal of this job entry is to simply evaluate a Boolean expression to true or false, you can also retrieve various aspects of the previous job entry result with it. The following script defines a variable that will contain the logging text of the execution of the previous job entry, for example a transformation or a job:

var logText = previous_result.getLogText();
parent_job.setVariable("LOG_TEXT", logText);

true;

Our example makes use of the predefined JavaScript object previous_result. You can find a complete list of all the information you can extract from it in the "Result" section of Chapter 22. The logging text stored in the variable can then be used by the job entries that are executed after this script.

Logging Tables

Because it is not always convenient to start searching for problems in a bunch of logging files on a remote server, the Kettle developers created the option to write logging information to database tables. In the following sections we explain how you can write all sorts of interesting information to the Kettle log tables from the viewpoint of a transformation and a job. We want to emphasize that the use of log tables is recommended good practice that will allow you to easily track and debug problems.

Transformation Logging Tables

At the transformation level, there are four log tables that can be updated:

- The transformation log table
- The step log table
- The performance log table
- The logging channels log table

The default behavior of a transformation is to write to a configured logging table at the end of the transformation. The exception is the transformation log table, where you will also find a record being written at the start of the transformation. At that time, Kettle also determines the batch ID number. This number is unique for each execution of the transformation and can be used to group information in the transformation logging tables together. In the executing transformation itself, you can use a Get System Info step to retrieve this batch ID in case you want to correlate other information with the logging tables. If a logging table is configured, you can check the execution results in the Execution History tab in the Execution Results pane, as shown in Figure 14-6. Use the icon immediately to the left of the zoom percentage in the transformation toolbar to show the results pane if it's not visible.

Figure 14-6: The transformation execution history

The transformation log tables can be configured in the Transformation Settings dialog. This dialog is accessible from the Edit ➪ Settings menu in Spoon or when you right-click on the background of a transformation and select Transformation settings. Using the SQL button of that dialog, you can generate the SQL needed to make the layout of the target logging tables correspond to the configured log tables. Note that at the time of this writing, indexes are not generated. Read on for advice on which indexes to add.

The Transformation Log Table

The transformation log table was created to allow the most important metrics of a transformation to be written to a relational database table. Figure 14-7 shows all the options that you can set for the transformation log table.

Figure 14-7: The transformation log table settings

On the left side of the tab, you’ll see the various log tables that you can define for the current transformation. Here is a description of the parameters that you can use to configure it:

- Log Connection: The database that contains the log table.
- Log table schema: The schema that contains the log table.
- Log table name: The name of the log table.
- Logging interval (seconds): This optional parameter periodically writes information to the logging table during the execution of the transformation. If you don't specify a value for this parameter, the log table is only updated at the start of the transformation and when it finishes.
- Log record timeout (in days): This optional parameter removes old log records from the table after a new record has been inserted. It uses the log date field of the table to do this.
- Log size limit in lines: This limits the size of the logging text for databases that don't support very large character fields.

You can also use the variables KETTLE_TRANS_LOG_DB, KETTLE_TRANS_LOG_SCHEMA, and KETTLE_TRANS_LOG_TABLE (described in Appendix C) to configure the transformation log table for all transformations at once. In the lower part of the dialog, you can select which fields to include in the log table and, for the line metrics, specify the step to take the values from: for example, Step A next to the LINES_INPUT field, Step B next to the LINES_OUTPUT. That way, you get a good idea of how many rows were processed by the transformation. If you do not specify any steps, all values will simply be zero in the transformation logging table.

The Field description column describes the purpose of each field because some of the field names can be confusing. For example, the STARTDATE column does not contain the start date of the transformation. Rather, it is the start of the date range that you can use for incremental data processing. For more information on this topic, see the "Change Data Capture" section in Chapter 6, where you will get an overview of the various ways to incrementally capture changes in a source system. Read on in this chapter to see how you can use the information in the transformation log table instead of the cdc_time table suggested there. The STARTDATE-ENDDATE timeframe is a period of time that covers all the new and updated records in a source system from the last time that the transformation ran

correctly until the current time. These two values are calculated and logged at the start of the transformation, and they are also available in the Get System Info step. If you expect the size of the log table to become rather big, you should create a few indexes on it. This speeds up the lookups that Kettle performs in the log table when the transformation starts, as well as the updates to the log record. The first index should be created on the ID_BATCH column and the second one on the ERRORS, STATUS, and TRANSNAME columns.
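For example, assuming the transformation log table generated with the SQL button is called LOG_TRANS (the table and index names here are placeholders; use whatever names you configured), the indexes could be created along these lines:

-- Speeds up the update of the log record for the current batch
CREATE INDEX IDX_LOG_TRANS_BATCH ON LOG_TRANS (ID_BATCH);

-- Speeds up the lookup of the last successful run for the date range calculation
CREATE INDEX IDX_LOG_TRANS_STATUS ON LOG_TRANS (ERRORS, STATUS, TRANSNAME);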

The Step Log Table

Because the step metrics in the transformation log table are very high-level, you might occasionally need hard data on the number of rows processed at the step level. To configure the step logging table, revisit the Transformation properties dialog and go to the Logging tab, shown in Figure 14-8.

Figure 14-8: The step logging table

With the exception that there is no interval logging option available for this log table, the options are similar to the ones available for the transformation log table. Alternatively, you can use the variables KETTLE_STEP_LOG_DB, KETTLE_STEP_LOG_SCHEMA, and KETTLE_STEP_LOG_TABLE (described in Appendix C) to configure the step log table for all transformations at once.

If you are interested in the logging text at the step level, you can choose to include the LOG_FIELD column. Because this is only exceptionally the case, this column is not included in the logging table by default.

The Performance Log Table

Chapter 15 discusses how enabling step performance monitoring can help you visualize the performance of individual steps. Performance monitoring works by periodically capturing all the step metrics. This information is then displayed in an updating chart while the transformation is running and is interesting from an analysis standpoint. Performance problems as described in Chapter 15 do not only occur during development when you run in Spoon, so it makes sense to be able to analyze this data after the transformation has finished a batch run. This is why Kettle has the option to store the information in a log table. Figure 14-9 shows all the options for the performance log table.

Figure 14-9: Configuring the performance log table

The parameters listed are the same as those described earlier for the transformation log table. This table can also be configured globally for all transformations with the KETTLE_TRANS_PERFORMANCE_LOG_DB, KETTLE_TRANS_PERFORMANCE_LOG_SCHEMA, and KETTLE_TRANS_PERFORMANCE_LOG_TABLE variables described in Appendix C.

The Logging Channels Log Table

As described earlier in this chapter, Kettle keeps track of the complete hierarchy of executed components such as jobs, job entries, transformations, steps, and database queries in the logging architecture. This information has uses beyond the internal bookkeeping of Kettle, and that is why the logging channels log table came to life. Figure 14-10 shows all the options that are available to configure the logging channels log table.

Figure 14-10: Configuring the logging channels log table

Again, the database connection, schema, and table specification are the same as with the other logging tables. This log table can be configured in one go for all executed transformations with the KETTLE_CHANNEL_LOG_DB, KETTLE_CHANNEL_LOG_SCHEMA, and KETTLE_CHANNEL_LOG_TABLE variables described in Appendix C. Using the information in this table you can, for example:

- List the transformations that were executed in a job (see the example query after this list).
- Check the exact revision of an executed transformation after it failed.
- Find out which transformations used a mapping since it was last changed (for example, if someone made an error in a calculation in a mapping).
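As a sketch of the first use case, and assuming a channel log table named LOG_CHANNEL with the default column names that the SQL button generates (verify the exact names against your own generated DDL), such a query could look like this:

-- List the transformations executed as part of a single job run,
-- identified by the root logging channel ID of that job execution
SELECT OBJECT_NAME, OBJECT_REVISION, FILENAME
FROM   LOG_CHANNEL
WHERE  LOGGING_OBJECT_TYPE = 'TRANS'
AND    ROOT_CHANNEL_ID     = '<root channel id of the job run>';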

Job Logging Tables

Because you don’t gather performance-monitoring data in a job, you can only configure three log tables on that level:

- The job log table
- The job entry log table
- The logging channels log table

Because the logging channels log table is identical to the one described for transformations, we will not cover it again. To configure the job log tables, open the Job Properties dialog by right-clicking on the background of a job and selecting "Job settings" from the menu.

The Job Log Table

As with transformations, a record is written into the job log table at the start of the job. Kettle does this to indicate that processing of the job has started. By default, that record is also updated when the job finishes.

NOTE Even though the transformation and job log tables are almost identical, you are advised to use separate tables because jobs and transformations have separately calculated IDs.

The job ID that is calculated (column ID_JOB) is a unique number that reflects a batch run of the job. Because it is present in all the job log tables, it can be used to link them together. Note that Kettle does not expose the relationship between the job ID and a transformation batch ID because this information is available in the logging channel log table. The line metrics found in the job log table are different from zero only when one or more transformations are executed in the job. Even then, you need to make sure that you specified the steps to take the metrics from, as described earlier. You can use the KETTLE_JOB_LOG_DB, KETTLE_JOB_LOG_SCHEMA, and KETTLE_JOB_LOG_TABLE variables (described in Appendix C) to configure the job log table for all jobs at once. As with the transformation log table, if you expect this table to grow large, create an index on the ID_JOB column. Then create a second one on the ERRORS, STATUS, and JOBNAME columns.
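Again assuming a job log table named LOG_JOB (a placeholder name), the corresponding index statements might look like this:

CREATE INDEX IDX_LOG_JOB_BATCH  ON LOG_JOB (ID_JOB);
CREATE INDEX IDX_LOG_JOB_STATUS ON LOG_JOB (ERRORS, STATUS, JOBNAME);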

The Job Entry Log Table

In the job entry log table, you will again find that a lot of the parameters are the same as for transformations. The main difference is that this table includes information regarding the result of a job entry. For example, the Boolean result flag is included, as well as an indication of the number of result rows and result files that were found.

To configure this log table centrally for all executed jobs, define the following variables as described in Appendix C: KETTLE_JOBENTRY_LOG_DB, KETTLE_JOBENTRY_LOG_SCHEMA, and KETTLE_JOBENTRY_LOG_TABLE.

Summary

Finding out what goes on in complex Kettle jobs or transformations is not always easy. This chapter showed you how to get better insights into problem solving and analysis of the data flows in Kettle. More specifically, you learned how to perform the following tasks:

- Extract high-level lineage information for batch-level usage.
- Work with the design-time lineage information available in the Spoon user interface.
- Extract database impact information.
- Set up auditing in the form of logging tables for transformations and jobs.

Part IV

Performance and Scalability

In This Part

Chapter 15: Performance Tuning
Chapter 16: Parallelization, Clustering, and Partitioning
Chapter 17: Dynamic Clustering in the Cloud
Chapter 18: Real-Time Data Integration

Chapter 15

Performance Tuning

This chapter provides an in-depth look at the art of performance tuning Kettle. We primarily focus on tuning transformations and briefly look at what can go wrong with the performance in a job. For readers who are interested in the internals of the transformation engine, the first part of this chapter offers many details with a number of examples. Once you have learned how the transformation engine works, we focus on how to identify performance bottlenecks. Then we offer advice on how to improve the performance of your transformations and jobs.

NOTE Readers who are new to Kettle may prefer to skip this chapter until they encounter a performance problem. At that point, you can simply turn to this chapter to learn how to identify and solve the problems you’re encountering.

Transformation Performance: Finding the Weakest Link

Performance tuning of a transformation is conceptually quite simple. As in any other network, you search for the weakest link. In the case of a transformation, you search for the step that is causing the performance of the transformation to be sub-optimal. To better understand why this is important, take a look at a simple example. The following transformation reads customer data from one database and writes it into another, as shown in Figure 15-1. The figure also shows the step performance metrics during execution at the bottom.

Figure 15-1: Reading and writing customer data

Here is what happens in this example: The Customers step writes rows into a Row Set buffer between the two steps. The “Load customers” step reads them from the buffer. In this example, one of two things can happen to affect the overall speed of the transformation:

- The "Load customers" step is slow: This typically occurs because writing to a database is slower than reading, or because the step needs to write over a slow network. Because of this, the row buffer between the two steps will fill up. Once it has reached its maximum capacity, it won't accept any more rows from the Customers step. This means that the Customers step will have to wait a bit until more room is available. Consequently, the step slows down to match the speed of the "Load customers" step.
- The Customers step is slow: This may occur if the data is read over a slow network or if the source database is slow. In that case, rows are not written very fast into the row buffer between the two steps. If the "Load customers" step wants to read a row from the input buffer but can't, it will simply wait until a row is available. Again, as a consequence, the two steps will proceed at the same speed.

The same principle that applies to these two steps applies to a transformation with any number of steps; the slowest step affects the overall speed of the transformation. Because of that, it’s important to figure out what the weakest link is before you can even hope to improve the performance of your transformation. Finding the weakest link or the slowest step in a transformation can be done in two principal ways: simplifying or measuring. The following sections examine each of these in turn.

Finding Bottlenecks by Simplifying

The first method that is commonly used is the simplification of the problem. Simply put: remove steps from a transformation and see where the performance increase is the largest. In the preceding example, you could try to run the transformation as shown in Figure 15-2.

Figure 15-2: Taking a step out of the equation

If you run this transformation, either of two things can happen:

- If the performance increases, you know that the "Load customers" step is the slower step of the transformation.
- If the performance stays about the same, you know that the Customers step is the slower one.

Detecting a bottleneck using this technique is typically done by disabling hops in the transformation. Only steps that are connected with at least one hop are included in the execution of a transformation, so by disabling hops you can exclude parts of your transformation. That in turn allows you to focus on the performance of individual steps. Usually you start by disabling steps at the end of the transformation and then see which step degrades the performance the most. Compared to the performance metrics shown in Figure 15-1, performance increased 16-fold, indicating a slow "Load customers" step, as shown in Figure 15-3.

Figure 15-3: The execution results showing improved performance

Finding Bottlenecks by Measuring

Another way to figure out which step is holding back the performance of a transformation is to look at the buffer size during execution. If you execute the first example of this chapter, you get the performance metrics shown in the bottom half of the screen in Figure 15-1.

NOTE When your transformation runs on a remote system, on a Carte instance, or on a Pentaho Data Integration server, you can see the performance metrics by consulting the Spoon slave monitor. You can do this by right-clicking on a defined slave server and selecting Monitor from the pop-up menu.

The last column, labeled input/output, represents the total number of rows in the input and output Row Set buffers relative to the step being monitored. In our sample, the maximum buffer size is 10,000 and we have 9,821 rows in it. This means that the buffer is at 98 percent of its capacity, which indicates that the Customers step is fast enough to continuously fill up the buffer. Consequently, the "Load customers" step is the slowest link in the network. You can also see a difference in performance between the two steps in the performance metrics. This is caused by the buffer between the two steps. For a short while, it allows the two steps to operate at different speeds. Once the buffer is full, the two steps will operate at virtually the same speed. If you wait a little bit longer, you see the metrics shown in Figure 15-4.

Figure 15-4: The step metrics 22 seconds into the execution of the transformation

Despite the benefit of having these metrics, it can still be daunting to find a performance problem in complex transformations, precisely because of the dynamic nature of the performance metrics. In those situations, it can be useful to enable the step performance monitoring feature of a transformation.

NOTE Step performance monitoring can be enabled on the Monitoring tab of the "Transformation settings" dialog. It allows you to look at various performance graphs per step. The metrics are gathered during execution of a transformation.

Once you have the step performance metrics, you can take a look at the graph of a step metric, as shown in Figure 15-5. The interesting metrics to look at are the number of rows read or written by a step and the buffer sizes. In Figure 15-5, you can clearly see the fast start of the Customers step followed by the slowdown to arrive at the speed of the second step.

Figure 15-5: The evolution of the speed of a step as a graph

Figure 15-6 shows the evolution of the buffer size during that period of time. As you can see, the buffer fills up very quickly and stays almost filled until the transformation is finished. This indicates a big speed difference. (The first step is 16 times faster, as mentioned previously.) If the speed difference is smaller, the number of rows in the buffer will increase more slowly over time.

NOTE Make sure to use as few programs on your computer as possible during performance measurements. Even though it is not always easy to see this, mail and Twitter clients, virus checkers, spam filters, firewalls, and browsers can consume considerable amounts of resources on your system. To make sure you don't have any uninvited processes skewing your measurements, run the same transformation more than once to let the data average out. When you compare performance measurements, keep in mind that you should compare values that originate from systems with the exact same characteristics. Even small differences in the speed of processors, disks, or network can lead to completely different performance measurements.

Figure 15-6: The evolution of the output buffer size of a step

Copying Rows of Data

A special performance tuning case is reserved for situations where rows of data are copied to multiple targets. In the sample shown in Figure 15-7, rows of data are copied to a very fast "Dummy (do nothing)" step and to the "Load customers" step as well.

Figure 15-7: Copying rows of data to multiple target steps

Notice that the performance of the "Load customers" step is also slowing down the "Dummy (do nothing)" step, despite the fact that its input buffer is almost always empty. The reason for this is that the Customers step will wait a bit whenever one of its output buffers is full.

A rare but important performance tuning case is one where a step can stop execution and the transformation will completely stall. An example of this case is shown in Figure 15-8.

Figure 15-8: A transformation that can stall

In this example, an attempt is made to add the total customer sales per segment to the individual customer records, perhaps to calculate a percentage of the sales per segment per customer. The problem is that data from the Customers step is being used to supply both the main data and the lookup data of the Stream Lookup step. In this situation, the Stream Lookup step only starts reading its main input once all lookup data has been read. However, after a short while the Customers step is blocked because its output buffer to the Stream Lookup step is full. That in turn blocks the output to the Dummy step, and the whole transformation ends up in a so-called deadlock state, where two or more steps wait for each other. You can obviously mitigate this problem by increasing the buffer size. However, because you can never be sure of the maximum number of input rows, it's usually better to make sure that the circular reference goes away. You can do this either by splitting up the transformation into two parts or, as shown in Figure 15-9, by reading the same source data twice. You can also solve this problem by cutting your transformation up into two or more transformations that write intermediate information to a temporary file or database table.

Figure 15-9: Solving a deadlock by reading data twice

Improving Transformation Performance

Once you know which steps are slowing down the transformation, you can try to do something about it. The following sections describe various ways to make your transformation run faster. First you will learn how to improve the performance of reading and writing text files. Then you will see how you can get information in and out of a database faster. Because the performance of the network has a big effect on databases, we cover that topic as well. At the end of this section, you will find ways to improve performance in a number of specific steps.

Improving Performance in Reading Text Files

Every time you read data from a text file, a number of events happen that can cause a performance bottleneck. There are a number of reasons why reading a text file might be slower than you expected. To go deeper into those reasons, we'll take a closer look at what happens while reading.

First, a block of data is read from the disk on which the file resides. If you are reading from a slow disk, the performance of the step that reads the file will also be slow. However, the latency (or seek time) of the disk also plays its role. If you instruct a step such as CSV Input to read in relatively small blocks, it's going to take a tiny bit of time to reposition the heads of the disk to read the next block.

After you read a block of data from the file, it is translated from a series of bytes into a continuous series of characters. These characters form a block of text, also referred to as a string. The encoding scheme used to process the bytes and turn them into characters must match the original encoding of the file. Depending on the encoding of the file, a different encoding scheme is used. This process is obviously quite important because it is responsible for achieving the correct end result. However, parsing files in various encodings can be quite costly in terms of CPU usage. For certain double-byte encoding schemes like UTF-16, almost half of the CPU cycles are used to convert bytes into characters during the execution of a step.

Once you have individual characters to deal with, you need to split the data up into different fields. In particular, steps such as CSV File Input and Text File Input extract lines of text by looking for carriage return and linefeed characters. Then the lines of text are split up into fields. This is done either by searching for a separator or by counting a certain number of characters if the file is using lines with a fixed width.

Finally, the fields that were extracted from the lines of text are optionally converted into other data types such as Date, Number, BigNumber, or Boolean. Date and numeric data conversion is especially costly in terms of burning CPU cycles. In the case of date conversion, all sorts of information is taken into consideration, such as the format, leap years, leap seconds, and time zones. For numeric data, the parsers take into account the format, grouping, and decimal symbols as well as exponential notations. All this logic comes at a price in the form of using up a lot of processing power. Thus, it would not be unusual to see a lot of CPU usage while reading a complex text file with a lot of fields that need to be converted to data types other than String.

In addition to those basic operations, a field might be trimmed or a default value might be used when there was no original data. Now that you know the common problems, take a look at a few simple techniques to speed up the reading of your text files.

Using Lazy Conversion for Reading Text Files

At first glance, it might look like there isn't very much you can do about the CPU cost incurred by data decoding and conversion. However, in certain cases it might be worth it to enable the Lazy Conversion option in the CSV File Input or Fixed File Input steps. When this option is used, all data conversion for the data being read is postponed as long as possible. This includes character decoding, data conversion, trimming, and so on. The only tasks that the steps perform initially are reading the file in binary form (bytes) and splitting it up into separate fields. Obviously, if the fields of data are actually needed in other steps, the data conversion will still take place. When that occurs, it might actually slow things down. However, you will get better performance using Lazy Conversion in a few specific cases:

- If most fields are simply written to another text file in the same format.
- If the data needs to be sorted and the data doesn't fit in memory. In that case, serialization to disk is faster because you postpone encoding and type conversions.
- When you bulk load data into a database and there is no data conversion needed. That is because bulk-loading utilities typically read text directly, and the generation of this text is faster. This is in contrast with the use of regular steps like Table Output, in which the data needs to be converted anyway.

For cases where only a few fields are used in a transformation, note that individual fields can be converted to Normal storage with the aid of the Select Values step on the Metadata tab.

Single-File Parallel Reading

Reading a text file in parallel is possible simply by enabling the "Running in parallel?" option in the "CSV file input" or "Fixed file input" steps. This option will assign a part of the file to each individual copy of a running step. The number of copies to execute of any step is set by right-clicking on the step and selecting the "Change number of copies to start…" option. Another option is to run the steps on multiple slave servers in a clustered run. See Chapter 16 for more information on how to run steps in a clustered environment.

NOTE Because the parallel reading algorithm in "CSV file input" synchronizes on carriage return or linefeed, it is only possible to read files in parallel if they don't have fields with carriage returns or linefeeds in them.

Note that this technique will quickly shift the bottleneck from processing power to the read capacity of the disk used. This is not helped by the fact that reading the same file with multiple threads incurs a lot of extra overhead because the disk heads are going to be repositioning all the time. That being said, the amount of overhead depends highly on the disk subsystem used and the amount of caching the operating system performs.

Multi-File Parallel Reading

Because of the overhead incurred by reading the same file with multiple threads or processes, it’s usually better to read a single file with a single step copy if you can. In situations in which you have a lot of similar files, it’s therefore better to simply give multiple reader steps different files, as shown in Figure 15-10.

Figure 15-10: Distributing files to multiple step copies

A round-robin system is used to distribute the rows with file names evenly over the four “CSV file input” copies. Each copy will receive a different file name and will process a group of files in parallel.

Configuring the NIO Block Size

The “NIO buffer size” parameter has an influence on the performance and can be set in the “CSV file input” and “Text file input” steps. It determines the amount of data read at once from a text file. It is not possible to give exact guidelines on how to set this parameter. If you make the block size too big, it might take a relatively long time to fetch the data. This could reduce the parallelism in the transformation. However, if you have a really fast disk subsystem, a large buffer size might be speeding things up instead. If you have a disk subsystem with a lot of caching or with a very low latency (seek time), it might be faster to lower the buffer size. In general it will pay off to look at the buffer size, see how much CPU capacity you have left, and try a few values to see what works best.

Changing Disks and Reading Text Files

In those cases where a slow disk is at the root of a performance problem, it usually pays to investigate whether it's possible to change the location of a source file. Especially when data is being extracted from remote databases, you can see text files end up on temporary disks with less than optimal performance. In particular, slow disks that are mounted over a network come to mind. While network-attached storage can be used without a problem for archival of input files, be careful when using it for high-performance data reading.

Improving Performance in Writing Text Files

All the character encoding and data conversion performance issues that are valid for the reading of text files are also valid for the writing of text files. In writing, the conversion from a date or a numeric data type to text happens first. Then the conversion of the text into the correct binary format of the text file in the correct encoding occurs. The same methods that you can use to speed up the process of reading text files can be used to speed up text file writing.

Using Lazy Conversion for Writing Text Files

For a transformation that involves reading and writing a text file back to another file, the character encoding and data conversion cost is being paid twice. For such cases, the Lazy Conversion option described earlier will help performance.

Parallel Files Writing

Writing to a single text file in parallel is not possible. For example, you can't use multiple step copies of the "Text file output" step to write to the same output file. If you try it, the result is a blended file with lines and fields ending up in the wrong location. This problem can only be countered with advanced locking algorithms that in turn reduce the degree of parallelism back to a single thread. However, the simple solution is to write to multiple output files. If needed, these files can reside on multiple disks to provide further scaling options. The easy way to do this is to use the "Include stepnr in filename?" option or to use the internal variable ${Internal.Step.Unique.Number} in the filename. This last option allows you to divert the writing load to a different disk by creating symbolic links to the right disk location, for example /disk1, /disk2, and so on. Note that the internal variable ${Internal.Step.Unique.Number} is also valid when used in clustering and that the total number of step copies is available in the variable ${Internal.Step.Unique.Count}.

Changing Disks and Writing Text Files

The same advice that applies to reading a text file goes double for writing one because writing is typically up to twice as slow as reading. If you can't change the location of the file and you see a significant slowdown due to a slow disk, consider compressing the text file. This can reduce the size of the output file by a considerable amount; you may use as little as one-tenth the space you would otherwise. Because of this the write performance of the target disk becomes much less of a problem.

Improving Database Performance

In any typical data warehouse, you read and write from and to all sorts of relational and non-relational databases. In most cases, you use a database driver to make this possible. So again, there are abstraction layers involved in the reading and writing. Let's take a look at what's going on.

If you specify a database connection in a step, you usually do so by specifying the name of the database connection. This allows the step to find the details, the metadata, of the connection. What typically happens then is that the connection is opened during the initialization phase of the steps. If all steps are initialized correctly, they start to process rows. It's at this time that rows of data are read from or written to the database. It's important to note that, as a general rule, each step copy creates a separate connection to the database used. This means that if you have ten step copies in your transformation that use a database, you have ten connections to the database. This behavior can be changed by enabling the "Make the transformation database transactional" option. This opens a single connection per defined and used database and runs everything in a single transaction: a commit is executed at the successful end of the transformation, and a rollback is performed in case there is an error anywhere in the transformation.

Avoiding Dynamic SQL

Once the connection is established, as much of the work that can be done up front is done up front. For most steps, this means the compilation of SQL statements on the databases, also known as the preparation of statements. It is a time-consuming process because the database needs to validate whether tables and columns exist, choose indexes, set join conditions, and much more. You want to do this only once if the statements are going to be the same for all rows that are processed. The notable exceptions where the preparation of SQL doesn’t happen are the “Execute SQL Script,” “Execute row SQL script,” and “Dynamic SQL row” steps. Therefore, these steps will always cause the performance of your transformation to be lower compared to single-purpose steps such as Table Output, Table Input, Database Lookup, and so on.
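For example, a single-purpose step like Table Output prepares one parameterized statement during initialization and then only binds new values to the placeholders for every row, while the dynamic SQL steps have to build and prepare a new statement per row. The statement below is only an illustration of the idea; the table and columns are made up:

-- Prepared once when the step initializes; only the ? parameters
-- are bound with new values for each row that passes through
INSERT INTO customers (customer_id, name, city)
VALUES (?, ?, ?);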

Handling Roundtrips

Once the statements are prepared, values are set on the parameters, usually based on input parameters that correspond to values in the input rows. These parameter values are transmitted to the database, the statement is executed on the database, and a result is retrieved. This means that for most database operations, a roundtrip to the database is made over the network. Therefore, performance of most steps that use a database depends heavily on the speed of the roundtrip, which is a combination of the speed of the network, the latency of the network, and the performance of the database.

Reducing Network Latency

It's always useful to be aware of whether a database is running on the same machine, on the local network, or in some remote location accessed over a VPN connection. The difference in latency for these three cases will be noticeable because the network characteristics play a big role in the performance. You can get an idea of the latency of a network by using the ping command. Table 15-1 shows the effect of latency on the maximum throughput, calculated under several different network conditions.

Table 15-1: Network Condition Effects on Latency

DESCRIPTION                       LATENCY     MAXIMUM THROUGHPUT
Same machine                      0.035 ms    28,571 rows/second
Local network                     0.800 ms    1,250 rows/second
Server in the same country        10.5 ms     95 rows/second
Server in a different country     150 ms      400 rows/minute
Over a satellite in a train       1500 ms     40 rows/minute
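The throughput figures in the table follow directly from the latency under the simplifying assumption that every row requires one synchronous roundtrip (batching and multiple connections, discussed below, change this picture):

\[ \text{maximum throughput} \approx \frac{1}{\text{latency}} \qquad\text{for example}\qquad \frac{1}{0.8\,\text{ms}} = 1{,}250 \text{ rows/second} \]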

There is usually little you can do about the latency of the network itself. You thus have to look at other possible solutions. A first solution for high database latency is the reduction of the number of roundtrips to the database. This is usually accomplished by loading lookup data in memory (a single roundtrip for all rows) or by using caching (a single roundtrip per unique key you look up). Most steps that perform lookups on a database have options to cache data, and a Stream Lookup step is available to load any set of rows in memory. This step allows you to look up rows very quickly. It is also possible to use the "Merge join" step to join large data sets in a streaming fashion with very little memory consumption and good performance. "Merge join" requires that the input data for the step be sorted. However, databases can usually sort very quickly compared to a streaming engine (see the following paragraph), especially if the ORDER BY clause involves an indexed column.

A second solution against high latency is the execution of multiple copies of a step at the same time. This is usually easy to do. In theory, you can slash the average latency in half and double the performance by doubling the number of step copies you use. However, in practice we see increases in performance on the order of 30 to 50 percent because of limits ranging from the way that databases handle the connection requests to saturation in the networking layers.

Finally, batching requests together usually makes a big difference in performance. For example, the "Use batch updates for inserts" option in the Table Output step reduces the number of roundtrips by sending the rows over in large batches. In certain cases where you have a fast network and a high latency, the use of multiple step copies

can dramatically increase the performance of your transformation. However, typically you can gain between 20 and 40 percent in throughput.

Network Speed

Obviously it's not just the latency of the network that's important when looking at the performance of a transformation. The network speed, also known as bandwidth, is critical, too. This is especially relevant when large amounts of data are being moved over the network. For example, if you are loading data with the Table Output or Insert/Update steps, a lot of information is going over the network to and from the database. Again, there is usually little you can do about the network performance itself. However, it might be worth trying to move the data as close to the database as possible before loading. In that respect, the saying "nothing beats the bandwidth of a truck full of hard disks" comes to mind. In other words, it's sometimes faster to avoid the use of a slow network by unconventional means, like putting the data on an external hard disk and using overnight shipping to the other side of the world.

NOTE An extreme example of moving the data close to the database prior to loading is bulk loading. This technique is often used by databases to load large amounts of data directly from text files. Kettle has several bulk loaders available that perform this task for you in a transparent way.

Handling Relational Databases

As explained earlier, the database itself obviously also plays a big role in the performance of a step. Tuning a database is usually highly dependent on the underlying technology. However, there are some general rules of thumb that can be applied to all relational databases.

Commit Size

Depending on the underlying database, it might be a good idea to try to make the transactions bigger in the steps that support it. For example, if you change the "Commit size" parameter in the Table Output step, the performance will change because the relational database has to manage the transactions. Typically, you get better performance if you increase the commit size. Too much of a good thing can turn bad, though, so make sure not to exaggerate. You'll notice that there is usually no benefit in increasing the value beyond ten to twenty thousand rows.

Indexes

It's critical whenever lookups or join operations are being performed that appropriate indexes are applied to the database tables. For example, Figure 15-11 shows the performance graph of a "Dimension lookup / update" step when there is no index defined on the natural keys to help the lookup of the dimension row.

Figure 15-11: Performance degradation caused by a missing index

As you can see, despite the best efforts of the database to cache as much data as possible, the performance keeps going down because the database needs to perform what is called a full table scan to determine the location of a row of data. If there are N rows, it takes on average N/2 compares to find the row. Because rows are continually being added to the table, the N/2 figure keeps growing, too. It doesn't really matter if the rows are being looked up in memory or not. Without an index, the number of compares keeps growing. As a direct consequence, lookups get slower and slower. The lookup of a row using an index is much less dependent on the number of rows that are being queried. Figure 15-12 shows a performance graph of the same task with an appropriate index defined on the natural keys of the table. As you can see, performance is pretty much constant for the step after the creation of the index.

Figure 15-12: The proper index ensures consistent performance
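Put differently, the cost of each lookup grows linearly with the table size when a full table scan is needed, while a typical B-tree index keeps it roughly logarithmic (a general database result, not something specific to Kettle):

\[ \text{compares per lookup} \approx \frac{N}{2} \ \text{(full table scan)} \qquad\text{versus}\qquad O(\log N) \ \text{(B-tree index)} \]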

At the same time, you should be aware of the fact that adding an index to a database table slows down any changes that occur to the table, including inserts, updates, and deletes. The more indexes you add to a relational table, the slower these operations become. For this reason, it is often beneficial to drop the index before making large changes to a table, such as initial loads or massive amounts of updates. The re-creation of the indexes after the changes is bound to take up less time because all the rows can be considered at once, making that operation more efficient.
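A typical load script therefore brackets the heavy changes with index maintenance, along these lines (the table and index names are placeholders, and the exact DDL depends on your database):

-- Drop the index before the initial or bulk load
DROP INDEX IDX_FACT_SALES_CUSTOMER;

-- ... run the large insert/update job here, for example a Kettle
-- transformation that uses Table Output or a bulk loader ...

-- Re-create the index in a single pass afterwards
CREATE INDEX IDX_FACT_SALES_CUSTOMER ON FACT_SALES (CUSTOMER_ID);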

Table Partitioning

Table partitioning is usually done by splitting up a large table into smaller parts using a simple partitioning rule such as a range. For example, each table partition can contain the data for a single day, month, or year. Another rule can be a simple hash key. For example, a single partition can contain all the rows for customers who have the letter A as the first character of their last name. Whatever the partitioning logic used, it typically allows the database to quickly find out to which partition a certain row belongs. Before the database even has to use an index, it can eliminate most of the possible locations for a record. This in turn increases the performance of lookups. Another advantage that partitioning offers is that it can increase parallelism in the loading and lookup of data. Some relational databases allow multiple partitions to be loaded simultaneously, increasing the total throughput. This in turn increases overall performance.
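As a sketch, range partitioning by year could look like this in MySQL-style syntax; other databases use different DDL, and the table definition here is made up:

CREATE TABLE fact_sales (
  sale_date   DATE NOT NULL,
  customer_id INT,
  amount      DECIMAL(12,2)
)
PARTITION BY RANGE (YEAR(sale_date)) (
  PARTITION p2009 VALUES LESS THAN (2010),
  PARTITION p2010 VALUES LESS THAN (2011),
  PARTITION pmax  VALUES LESS THAN MAXVALUE
);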

Constraints

The presence of constraints on a database table can considerably slow down updates to that table. That is because for every insert, update, or delete, the constraints are validated. Because of this, it is often recommended that you disable constraints prior to loading large amounts of data. In a multi-dimensional (analytical) data warehouse, it's common practice to deploy constraints only during development. That is because all foreign keys are looked up in the ETL process anyway, and it's usually not sensible to have the database enforce a rule that is already enforced by the ETL processes.
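For example, in Oracle-style syntax a foreign key constraint can be switched off for the duration of the load and switched back on afterwards (the table and constraint names are placeholders):

-- Before the load
ALTER TABLE fact_sales DISABLE CONSTRAINT fk_fact_sales_customer;

-- ... load the data ...

-- After the load
ALTER TABLE fact_sales ENABLE CONSTRAINT fk_fact_sales_customer;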

Triggers

Another source of considerable performance problems is the presence of one or more triggers on a database table. Triggers are database procedures that are executed whenever there is an insert, update, or delete in a database table. Depending on the configuration, triggers can considerably slow down large amounts of updates to a database table. As a best practice, it's recommended that you not have triggers on a table in a data warehouse.
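If a trigger cannot be removed entirely, most databases at least allow it to be disabled during the load; in Oracle-style syntax, with a placeholder trigger name:

ALTER TRIGGER trg_fact_sales_audit DISABLE;
-- ... load the data ...
ALTER TRIGGER trg_fact_sales_audit ENABLE;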

Sorting Data

Sorting large amounts of data can be an interesting performance problem because of the streaming nature of a Kettle transformation. The main issue is that you can never know the maximum number of rows that might need sorting. If you assume that all rows always fit in memory for sorting, then sorting can be done reasonably fast. Suppose you need to sort a million records and you have enough memory to sort all these rows in memory. In that case, sorting will take only about 10 seconds. However, in cases where you can store only half the number of rows in memory, you need to perform what is called an external sort. This means that the first half a million rows are sorted in memory and placed in a temporary file on disk. Then the next batch of rows is sorted and placed on disk. When all rows are sorted, the rows are read back from disk one by one in order. Obviously, the external sort will be slower than the sort in memory because of the file serialization performance penalty. However, because the alternative is running out of memory, an external sort is preferable to having no sorted data at all. That by itself can be considered a significant performance boost.

The first performance-tuning tip is to give the Kettle transformation engine a lot of memory to work with. To do this, specify a large value for the "Sort size (rows in memory)" parameter in the "Sort rows" step dialog. Also avoid having large numbers of small temporary files. Don't set a sort size of 5,000 if you are expecting to sort 2 billion rows because that will produce 400,000 files. The operating system will have a lot of trouble dealing with this. Because the external sort algorithm reads from all files, performance will drop significantly: the operating system will have trouble caching efficiently, and this will result in large disk latency penalties.

Sorting on the Database

It's important to note that both the in-memory sort and the external sort need to read and persist the complete input data set. This is in contrast with how a relational database works. Once data is stored in a relational database, only the sort keys need to be considered, along with a reference to the location of the row of data. Because the memory consumption is smaller in this case, the database can sort more rows in memory without having to use an external sort algorithm. Even better, if an index exists on the columns on which you want to perform a sort, the database doesn't need to perform a sort at all. It can simply traverse the index in this case and return the rows in the correct order with great ease. For these reasons, whenever you need to have data sorted and the data already resides in a database table, it's faster to have the database sort the data for you. Note that it's typically not faster to first load the data into a database and then perform the sort there. If you don't have the data in a database, perform the sort in Kettle.
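In practice, this simply means putting the ORDER BY in the Table Input query instead of adding a "Sort rows" step afterwards; for example, with made-up table and column names:

-- Let the database return the rows already sorted,
-- ideally using an index on (last_name, first_name)
SELECT customer_id, last_name, first_name
FROM   customers
ORDER  BY last_name, first_name;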

Sorting in Parallel

As described previously, the sorting exercise involves processing power and in most cases also incurs disk I/O. Because of this, it can be interesting to sort a large number of rows in parallel. This can be done by splitting the data set into parts in the ETL. Consider the transformation shown in Figure 15-13.

Figure 15-13: Sorting rows in parallel

In this simple case, four copies of the "Sort rows" step are sorting the rows simultaneously. Rows that are read by the "CSV file input" step are simply distributed over the four copies in a round-robin fashion. The output of the four copies consists of four sorted data sets. To keep the data sorted during the merging of the four streams of data, we use a Sorted Merge step that uses the same sort condition. The same rules apply when we perform a sort in a clustered fashion, as in Figure 15-14.

Figure 15-14: Clustered sorting of data

In this case, the rows of data are distributed over four different slave servers on four different hosts. It’s easy to imagine that four times the processing power is available as well as four times the disk I/O throughput needed to sort the complete data set. If the overhead of passing the data back and forth over the network is smaller than the gain in sort speed, we have a winner. In tests on big iron and cloud computing settings such as Amazon EC2 we found that there is indeed a huge benefit to sorting data in parallel.

Reducing CPU Usage

The transformation engine streams data from one step to another in a multithreaded fashion. The engine tries to avoid disk I/O at all costs unless it is absolutely needed. While this is obviously a good choice, it does increase the likelihood of a transformation becoming limited in performance by the power of the CPUs of the machine(s) it runs on. This section examines a number of the causes of high CPU consumption.

Optimizing the Use of JavaScript

JavaScript (ECMAScript) is a popular way of solving complex problems in Kettle because the language is popular and well known by a lot of developers. It's also quite easy to develop solutions with it. The drawback, however, is that it typically consumes a lot of CPU power and doesn't perform on par with most of the other steps. The reason for this simply is that JavaScript is a scripting language. Even though the Script Values Step has seen a lot of performance fixes over time and JavaScript itself has advanced a lot

N Turn off compatibility mode. The compatibility mode was introduced to allow easy migration from older versions of Kettle. The cost of having it enabled is high because a compatibility layer must be maintained for every row that passes through the JavaScript engine. For more information on how to migrate your older Kettle 2.x scripts to the cleaner format introduced in Kettle version 3.0, see ]iie/$$l^`^#eZciV]d#Xdb$Y^heaVn$:6>$B^\gVi^c\ ?VkVHXg^ei [gdb '#*#m id (#%#%. N Avoid JavaScript. A lot of use cases that required the use of JavaScript in the past are no longer needed. Most of these use cases have since led to the cre- ation of special purpose steps that are easier to maintain and function at optimal performance. N Data conversion. Use the Select Values step. The Metadata tab allows you to convert from one data type to another. N Create a copy of a field. Again, the Select Values step will do the trick nicely. Alternatively, use the “Create a copy of field A” function in the Calculator step. N Get information from a previous row. You can use the Analytic Query to solve this issue. This step allows you to get information from previous or next rows in a stream. For more information about this step, see Chapter 10. N Split a single field to multiple rows. The “Split field to rows” step was specifi- cally created to eliminate this complex use case. N Number range. Avoid slow maintenance and error-prone if-then constructs in JavaScript and use the Number Range step. N Random values and GUID. The “Generate random value” step allows you to create all sorts of random values at optimal speed. It also allows you to generate Globally Unique identifiers (GUID). N Checksums. The “Add a checksum” step allows you to calculate all sorts of checksums like CRC-32 on rows of data. N Write a step plugin. If you find you need to solve the same problem over and over again in JavaScript it might be useful to write your own step plugin. This will allow you to encapsulate the functionality with a nice user interface while at the same time getting optimal performance. One drawback of this is that you need Java experience. Another is that you need to set up a separate project outside of Kettle. For more information on the development of plugins, see Chapter 23. N Use the User Defined Java Class step. This step allows you to write your own step plugin in the form of Java code that is entered in the dialog of the step itself. This code is compiled at run-time and executed at optimal performance. For more information on this step, see Chapter 23. N Combine JavaScript steps. If you feel like you still have no other option but to use the JavaScript step, make sure you try to avoid having several separate JavaScript steps running when you can have a single one. There is a substantial overhead in the exposure of field values as variables to the JavaScript context. 396 Part IV N Performance and Scalability

- Variable creation. If you have variables in your JavaScript code that can be declared once at the start of the transformation, make sure to put them in a separate script and to mark that script as a "Startup script" by right-clicking on the script name in the tab of the JavaScript step dialog. JavaScript object creation is very easy to do but will also consume a lot of CPU cycles, so try to avoid allocating a new object for every input row.
- Add static values. You can use steps such as "Add constants" or "Get variables" instead. These steps offer better performance and are easier to maintain.

Launching Multiple Copies of a Step

If you have a step that consumes a lot of CPU power, for example a Modified Java Script Values or a Regex Evaluation step, you might consider running the step in multiple copies. Because most computer systems are equipped with multiple CPU cores these days, it makes sense to use them all. Launching a step in multiple copies will spread the load over the available cores, thereby increasing performance. Using multiple step copies for a slow step should be a last resort, though; the step itself needs to be optimized first. Make sure to optimize your JavaScript code or your regular expressions because those too can be a cause of performance problems. A specific case of running in parallel is reserved for partitioning. In Kettle, each partition of a step is handled by a copy of that step. That means that you can drive rows with identical keys to the same copy of a step. This in turn allows you to make workloads run simultaneously that are otherwise not easy to parallelize, such as a Memory Group By operation. Consider the example in Figure 15-15.

Figure 15-15: Partitioning a Memory Group By step to spread the workload over multiple processors

In this example, we partitioned the data on the grouping key. That guarantees that the result of the grouping is correct for each step copy. That same principle can be applied to caching strategies. If you want to avoid each step copy having a complete copy of the data cache, consider partitioning the data in such a way that the same rows always end up on the same step copy. This in turn will guarantee cache hits and improve performance even more.

Selecting and Removing Values

The Select Values step allows you to conveniently select and remove fields. Doing so, however, is usually an expensive operation with respect to CPU usage. Both operations force the system to re-create every single input row. Then these new rows need to be populated with the correct values in the correct order. In older versions of Kettle, you were required to select the output values. Steps such as Table Output didn't support field selection, so you had no other choice. In recent versions of Kettle, you no longer need to do this. The most common reason for selecting only certain fields is when you send rows from multiple steps to the same target steps. In that case, there is an absolute requirement that the layouts of the rows be all the same. When you encounter that situation, consider that adding constant values to rows is faster than creating a whole new row layout. There are a few cases where selecting as few fields as possible is beneficial to performance and memory consumption. This includes all situations where you keep rows in memory or write rows to disk, for example when you sort rows, cache data in a database lookup step, or pass data from one slave server to another in a clustered execution.

Managing Thread Priorities

Since version 3, Kettle has included a "Manage thread priorities?" option in the Transformation Settings dialog that allows you to enable or disable the management of step threads. This option can be used to address a limitation in the implementation of the Java Virtual Machine on which Kettle runs. This limitation causes excessive locking on row sets with very few rows in them or when the row sets are full. In both these situations, the option will slow down the appropriate step to avoid the disadvantageous locking situation. If you are migrating transformations from an older version or you disabled this option by accident, it makes sense to re-enable this option in recent versions. In most, if not all, cases you will see a performance gain.

Adding Static Data to Rows of Data

In situations where you are producing a single row of data in multiple steps, it makes sense not to perform these operations in the main stream of the transformation. Consider the transformation shown in Figure 15-16. In this transformation, the first three displayed steps retrieve two variables, concatenate them, and finally convert them into a Date value. For all the input values, this date is going to be the same. Thus, it makes sense to calculate this information only once. You can do this by using the “Join Rows (cartesian product)” step. For example, consider the revised transformation shown in Figure 15-17.

Figure 15-16: Calculating a constant value many times

Figure 15-17: Calculating a constant value only once

In this case, it’s important to specify “Main step to read from” in the Join Rows step. This will ensure that the single row coming from the “Select values” step is always kept in memory and not the bulk of data passing through to the target. Once the Join Rows step is configured, it will easily outperform the previous solution by a fair margin.

Limiting the Number of Step Copies

A transformation works by creating a single thread for every step. A thread is a concurrently running task. Since a computer essentially can only process instructions one at a time, concurrency or multi-tasking is simulated by the processor: it switches from task to task, spending only a tiny amount of time on each. The only exception to that rule is when threads run on different processors and no switching is needed. Task switching, the direct consequence of multi-tasking or parallel processing, causes a performance penalty because the processor needs to keep careful track of what it is doing before it can switch to another task, and it needs to restore that state when it comes back to the task. While the task switching overhead is small, it can add up if you are running a lot of threads on a system. Since every step copy in an executing transformation represents a separate thread, you will see performance degradation when you add a lot of steps. Because the efficiency of task switching varies from one computer system to another, it’s hard to come up with a specific rule. However, as a rule of thumb, make sure to keep the number of steps lower than three or four times the number of processing cores in your system when performance is important.
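
As a quick sanity check when you design a transformation, you can compare the number of steps against the processor cores of the machine that will run it. The sketch below only illustrates the rule of thumb from the previous paragraph; the factor of 4 is the upper end of that guideline, not a Kettle setting.

public class StepCountGuideline {
    public static void main(String[] args) {
        // Number of processing cores visible to the JVM that runs the transformation.
        int cores = Runtime.getRuntime().availableProcessors();

        // Rule of thumb from this section: keep the number of steps (threads)
        // below roughly three to four times the number of cores.
        int suggestedUpperBound = 4 * cores;

        System.out.println("Cores available: " + cores);
        System.out.println("Suggested upper bound on the number of steps: " + suggestedUpperBound);
    }
}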

Avoiding Excessive Logging

It might make sense to run your transformation with a high logging level like Debug when you are looking for a specific problem. However, keep in mind that when a lot of logging text is generated, a lot of processing power and memory is consumed. Because of this, it is usually not advisable to run your transformations and jobs at a logging level higher than Detailed outside of Spoon. Before you run your nightly production job with a high logging level, make sure that it doesn’t include any transformations that could potentially log a lot of information and as a result possibly exceed the agreed batch window. For more information on the topic of logging, consult Chapter 14.

Improving Job Performance

The bulk of the performance issues you will encounter are caused by transformations. However, there are a few things to look out for while you write a job. In this section we explain how you can improve the performance of loops in jobs and how you can speed up short-lived transformations that connect to a database.

Loops in Jobs

A lot of beginning Kettle users who need a loop in a job create the loop in a simple and straightforward way, as shown in Figure 15-18.

Figure 15-18: A simple but slow job loop

Typically, a single file is handled in each pass of the Job job entry. JavaScript or some other job entry is then used to evaluate whether there are more files to process; if there are, the job continues. If all you need to do is handle a few iterations, you will probably never get into trouble. However, if you need to perform a large number of iterations, you will notice that the solution is slow. This performance problem is caused by the involvement of two extra steps as well as the loading of the metadata for the job that performs the actual work. All this extra work slows down the job. In addition to being slow, the job also runs the risk of running out of heap space because of the back-tracking algorithm used. See Chapter 2 for more information. The better performing solution consists of a transformation that retrieves a list of files to process. The transformation, executed in the “Get filename” job entry in Figure 15-19, uses the “Copy rows to result” step to pass the list of files to the Job job entry.

Figure 15-19: The proper way to loop in a job

With this setup, all you need to do is enable the “Execute for every input row?” option to make the Job job entry loop over the result rows. The advantages of this approach are that no extra heap space is used and that performance is much better, because only the Job job entry is executed and the job metadata is loaded only once.

Database Connection Pools

In situations where small amounts of data need to be processed repeatedly in a job, for example as shown earlier in a loop, you will notice that connecting and disconnecting from a relational database can cause performance problems. Oracle and certain clustered databases are particularly slow to connect to. If the amount of data processed is large and it takes a while to process the data, then it might be fine to spend a few seconds to connect to a database. However, if you have tens of thousands of small files that need to be processed individually, the connection delay itself limits the performance of the job. If you experience slow connection times, you might want to check the Enable Connection Pooling option in the Database Connection dialog as shown in Figure 15-20. Connection pooling makes use of the Apache DBCP project (http://commons.apache.org/dbcp/). It essentially creates a pool of open database connections to make sure that these don’t have to be closed and re-opened each time for every tiny transformation or job that is executed.
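
If you define database connections from the Kettle Java API rather than in Spoon, the same option can be switched on in code. The following is a minimal sketch; the connection details are made up for illustration, and the setUsingConnectionPool()/isUsingConnectionPool() methods are assumed to match the DatabaseMeta API of the Kettle version you run, so verify them against your installation.

import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.core.database.DatabaseMeta;

public class PooledConnectionExample {
    public static void main(String[] args) throws Exception {
        // Initialize the Kettle environment (plugins, logging, and so on).
        KettleEnvironment.init();

        // Hypothetical connection details; replace them with your own database.
        DatabaseMeta databaseMeta = new DatabaseMeta(
            "lookup_db", "MySQL", "Native", "dbserver", "etl", "3306", "etl_user", "etl_password");

        // Equivalent of checking "Enable Connection Pooling" in the Database Connection dialog.
        databaseMeta.setUsingConnectionPool(true);

        System.out.println("Connection pooling enabled: " + databaseMeta.isUsingConnectionPool());
    }
}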

Figure 15-20: Enabling database connection pooling

Summary

In this chapter, you learned how the transformation engine works and how to detect performance bottlenecks in your transformations by simplification and by reading measurements. You also learned how to improve the performance of your transformations on the following topics:

■ Reading and writing text files
■ Handling relational databases
■ Reducing CPU usage with advice on how to tune your transformations

Finally, you learned how to create proper loops in jobs to increase performance and how to deal with slow connecting databases by turning on database connection pooling.

CHAPTER 16

Parallelization, Clustering, and Partitioning

When you have a lot of data to process it’s important to be able to use all the computing resources available to you. Whether you have a single personal computer or hundreds of large servers at your disposal, you want to make Kettle use all available resources to get results in an acceptable timeframe. In this chapter, we unravel the secrets behind making your transformations and jobs scale up and out. Scaling up is making the most of a single server with multiple CPU cores. Scaling out is using the resources of multiple machines and having them operate in parallel. Both these approaches are part of ETL subsystem #31, the Parallelizing/Pipelining System. The first part of this chapter deals with the parallelism inside a transformation and the various ways to make use of it to make it scale up. Then we explain how to make your transformations scale out on a cluster of slave servers. Finally, we cover the finer points of Kettle partitioning and how it can help you parallelize your work even further.

Multi-Threading

In Chapter 2, we explained that the basic building block of a transformation is the step. We also explained that each step is executed in parallel. Now we’ll go a bit deeper into this subject by explaining how the Kettle multi-threading capabilities allow you to take full advantage of all the processing resources in your machine to scale up a transformation.


By default, each step in a transformation is executed in parallel in a single separate thread. You can, however, increase the number of threads, also known as copies, for any single step. As explained in Chapter 15, this can increase the performance of your transformation for those steps that are consuming a lot of CPU time. Take a look at the simple example in Figure 16-1, where rows of data are processed by a User Defined Java Class step.

Figure 16-1: A simple transformation

You can right-click on the User Defined Java Class step and select the menu option “Change number of copies to start…” If you then specify 4, you will see that the graphi- cal representation of the transformation looks like the example shown in Figure 16-2.

Figure 16-2: Running a step in multiple copies

The “4x” notation indicates that four copies will be started up at runtime. Note that only one copy of the step’s description (its metadata) exists; it is shared by all step copies. Terminology is important, so here are the definitions you need to understand the rest of this chapter:

■ Step: The definition or metadata that describes the work that needs to be done
■ Step copy: One parallel worker thread that executes the work defined in a step

In other words, a step is just the definition of a task, whereas a step copy represents an actual executing task.
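
The number of copies is a property of the step definition, so it can also be set from the Kettle Java API when you build or run transformations in code. The snippet below is a minimal sketch: the file name udjc_example.ktr and the step name "User Defined Java Class" are assumptions chosen for this illustration.

import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.trans.Trans;
import org.pentaho.di.trans.TransMeta;
import org.pentaho.di.trans.step.StepMeta;

public class SetStepCopies {
    public static void main(String[] args) throws Exception {
        KettleEnvironment.init();

        // Load the transformation metadata (the step definitions).
        TransMeta transMeta = new TransMeta("udjc_example.ktr");

        // Same effect as "Change number of copies to start..." in Spoon:
        // four copies (threads) of this step will be started at runtime.
        StepMeta stepMeta = transMeta.findStep("User Defined Java Class");
        stepMeta.setCopies(4);

        // Execute the transformation and wait until all step copies have finished.
        Trans trans = new Trans(transMeta);
        trans.execute(null);
        trans.waitUntilFinished();
    }
}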

Row Distribution

In the example, you have one step copy that sends rows to four copies. So how are the rows being distributed to the target step copies? By default, this is done in a round-robin fashion. That means that if there are N copies, the first copy gets the first row, the second copy gets the second row, and the Nth copy receives the Nth row. Row N+1 goes to the first copy again, and so on until there are no more rows to distribute.

There is another option for the rare case in which you want to send all rows to all copies; you can enable the “Copy data to all steps” option in the step context menu. This option sends the rows to several target steps, as for example when you write data to a database table as well as to a text file. In this case, you’ll get the warning dialog shown in Figure 16-3, asking which option you prefer.

Figure 16-3: A warning dialog

Because you want to copy all the rows of data to both the database and the text file, you select Copy. The resulting transformation will look like the example in Figure 16-4.

Figure 16-4: Copying data to multiple target steps

Because this is the exception and you usually want to process each row only once, the examples in the rest of this chapter use row distribution, not row copying.

Row Merging

Row merging occurs when several steps or step copies send rows to a single copy. Figure 16-5 shows two examples of this. From the standpoint of the “Text file output” and the “Add sequence” steps, rows are not read one at a time from each source step copy. That could lead to serious performance problems in situations where one step copy is sending few rows of data at a slow pace and another copy is producing rows at a fast pace. Instead, rows of data are being read in batches from the source step copies.

WARNING The order in which rows are being read from previous step copies is never guaranteed!

Figure 16-5: Merging rows of data

Row Redistribution

In row redistribution, you have X step copies that send rows to Y target step copies. Consider the example in Figure 16-6.

Figure 16-6: Row redistribution

The same rule applies as before with the distribution of rows: in the example, each of the three source step copies of User Defined Java Class distributes the rows over the two target step copies. The result of this is equivalent to the transformation shown in Figure 16-7. The main advantage of the redistribution algorithm is that rows are equally distributed across the step copies. That prevents a situation in which certain step copies have a lot of work and others have very little to do. As you can see in Figure 16-7, there are X times Y row buffers being allocated between the UDJC and the Formula step. In our example, there are six buffers (arrows) being allocated for three source and two target steps. Keep this in mind if you are designing transformations. In particular, if you have slow steps at the end of the transformations, these buffers can fill up to their maximum “Row set size.” That in turn can increase the memory consumption of your transformation. For example, take a look at the example in Figure 16-8.

Figure 16-7: Row redistribution expanded

Figure 16-8: Redistribution can allocate a lot of buffers

Here you have five source and four target step copies and 20 buffers are allocated. That is despite the fact you only see a single arrow in the transformation! Because the default maximum row set size is 10,000, the total number of rows that could be kept in memory is 200,000.

Data Pipelining

Data pipelining is a special case of redistribution where the number of source and target step copies is the same (X==Y). In this case, rows are never being redistributed over all the step copies. Instead, the rows that are produced by source step copy 1 are being sent to the target step copy with the same number. Figure 16-9 offers an example of such a transformation.

Figure 16-9: Data pipelining

It is technically the equivalent of the transformation shown in Figure 16-10.

Figure 16-10: Data pipelining expanded

The process of distributing and merging rows has a small but measurable overhead. It is often better to keep the number of copies of consecutive steps the same to prevent this overhead. The process of reducing the overhead of the data communication between the step copies is also referred to as putting the data into swim-lanes.

Consequences of Multi-Threading

In the previous section you learned that a transformation is multi-threaded and that all steps run in parallel. The topics that follow will show you the possible consequences of this execution model and how to deal with those consequences.

Database Connections

The recommended approach to dealing with database connections in multi-threaded software is to create a single connection per thread during the execution of a transformation. As such, each step copy opens its own separate database connection during execution. Each database connection uses a separate transaction or set of transactions. A potential consequence of this is that race conditions can, and often will, occur in situations where you are using the same database resource, such as a table or view, in the same transformation. A common situation where things go wrong is when you write data to a relational database table and read it back in a subsequent step. Because the two steps run in different connections with different transaction contexts, you can’t be sure that the data being written in the first step will be visible to the other step doing the reading. One common, straightforward solution to this challenge is to split up the transformation into two different transformations and keep data in a temporary table or file. Another solution is to force all steps to use a single database connection with a single transaction. This is possible with the help of the “Make the transformation database transactional” option in the transformation settings dialog shown in Figure 16-11.

Figure 16-11: Making a transformation transactional

This option means that Kettle will use only a single connection per named database and will refrain from performing a commit or rollback until the transformation has finished. At that point, a commit will be performed if the transformation ran without errors and a rollback if any errors occurred. Note that any error that is being handled by step error handling will not cause a rollback. The drawback of using this option is that it typically reduces the performance of the transformation. There are numerous reasons for this, ranging from the fact that all database communication now passes over a single synchronized connection to the fact that there is often only a single server-side process handling the requests.

Order of Execution

Because all steps in a transformation are executed in parallel, there is no order of execution in a transformation. However, there are still things in data integration that are required to be executed in a certain order. In most situations, the answer to these problems is the creation of a job that will execute tasks in a specific order. There are also ways of forcing things to be executed in a certain order in a Kettle transformation. A few tips follow.

The Execute SQL Step

If you need to execute SQL before everything else in the transformation, you can use the Execute SQL step. In normal operational mode, this step will execute the specified SQL during the initialization phase of the steps. That means it is executed before the steps start to run. You can also make the step operate during the execution of the steps by enabling the “Execute for each row?” option, as shown in Figure 16-12.

Figure 16-12: Execute SQL statements dialog

The Blocking Step

Another common use case is that you want to perform an operation after all the rows have passed a certain step. To do this, you can use the Blocking Step, as shown in Figure 16-13.

Figure 16-13: The Blocking Step

The Blocking Step simply eats all rows in the default configuration. When all rows have been eaten, it passes the last row to the next steps. This row will then trigger the subsequent steps to perform an operation. In this way, you are certain that all the other rows have been processed. In the example shown in the figure, a SQL statement is performed after all the rows have been populated in the database table.

Parallel Execution in a Job

Job entries in a job execute one after the other. This is the default behavior because you usually want to wait for the completion of one job entry before starting the other. However, as mentioned in Chapter 2, it is possible to execute job entries in parallel in a job. In the case of parallel execution of job entries, different threads are started for all the job entries that are found after the job entry that executes in parallel. For example, if you want to update multiple dimension tables in parallel you can do so as shown in Figure 16-14.

Figure 16-14: Updating dimensions in parallel

Using Carte as a Slave Server

Slave servers are handy building blocks that allow you to execute transformations and jobs on a remote server. Carte, a lightweight server process, allows for remote monitoring and enables the transformation clustering capabilities described in the next section of this chapter. A slave server is the smallest building block of a cluster. It is a small HTTP server that accepts commands from remote clients. These commands control the deployment, management, and monitoring of jobs and transformations on the slave server. As described in Chapter 3, the Carte program is available to perform the function of a slave server. Carte also allows you to perform remote execution of transformations and jobs. The easiest way to start a slave server is by specifying the hostname or IP address to run on as well as the port the Web server should run on. For example, the following command will start up a slave server on port 8181 on server1:

sh carte.sh server1 8181

The Configuration File

In earlier versions of Kettle, you specified configuration options on the command line. Because the number of options has increased, the latest version of Kettle relies on an XML format for the configuration of the slave server. If you have a configuration file, you can also execute the slave server like this:

sh carte.sh slave-simple.xml

The configuration file in our example, slave-simple.xml, is a block of XML that describes all the attributes of a slave server. Here is a simple example:

<slave_config>
  <!-- A simple slave server configuration -->

  <max_log_lines>0</max_log_lines>
  <max_log_timeout_minutes>0</max_log_timeout_minutes>
  <object_timeout_minutes>5</object_timeout_minutes>

  <slaveserver>
    <name>server1</name>
    <hostname>server1</hostname>
    <port>8181</port>
  </slaveserver>
</slave_config>

The <slaveserver> XML block describes the hostname and port that the slave server should listen to and enables you to configure various aspects of the slave server. In general, these options enable you to fine-tune the memory usage of a long-running server process such as Carte:

■ max_log_lines: Set this option to configure the maximum number of log lines that the slave server logging system should keep in memory. See Chapter 14 for more information about the logging system.
■ max_log_timeout_minutes: This parameter describes the maximum time in minutes that a log line should be kept in memory. This is especially useful for long-lived transformations and jobs to prevent the slave server from running out of memory. For more information about this topic, see Chapter 18.
■ object_timeout_minutes: By default, all transformations and jobs stay visible in the slave server status report indefinitely. This parameter allows you to automatically clean out old jobs from the status lists.

Defining Slave Servers

To define a slave server in your transformation or job, simply go to the View section on the left side of Spoon, right-click on the “Slave server” tree item and select New. You can then fill in the details for the slave server, as in the example in Figure 16-15.

Figure 16-15: Defining a slave server

Remote Execution

A transformation or a job can be executed remotely by specifying the slave server to run on in the Spoon “Execute a transformation” dialog. When called from a job, a transformation or job can be executed remotely by setting a remote slave server value in the Job or Transformation job entry dialog.

Monitoring Slave Servers

A slave server can be monitored remotely in a few different ways:

■ Spoon: Right-click on a slave server in the View tree of Spoon and select the Monitor option. This will present a monitoring interface in a separate tab in the Spoon user interface that contains a list of all the transformations and jobs that are running on the slave server.
■ A Web browser: Open a browser window and type in the address of the slave server. In our example, that would be http://server1:8181/. The browser will show you a basic but functional slave server menu that provides the ability to control and monitor your slave server.
■ PDI Enterprise Console: Part of the Pentaho Data Integration Enterprise Edition, the enterprise console is capable of monitoring and controlling slave servers.
■ Your custom application: Each of the services exposed by a slave server gives back data in the form of XML. These simple web services allow you to communicate with the slave server in a convenient and standard way. If you are using the Kettle Java libraries, you can also take advantage of the fact that the parsing of XML is already handled by Kettle Java classes (a minimal sketch follows this list).
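
For that last option, the Kettle libraries already wrap these web services for you. The sketch below uses the org.pentaho.di.cluster.SlaveServer class to call the status service of the example slave server; the constructor arguments mirror the slave server definition and the default cluster/cluster credentials used in this chapter, and the method names (getStatus(), getStatusDescription()) are assumed from the Kettle 4.x API, so check them against your version.

import org.pentaho.di.cluster.SlaveServer;
import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.www.SlaveServerStatus;

public class MonitorSlaveServer {
    public static void main(String[] args) throws Exception {
        KettleEnvironment.init();

        // name, hostname, port, username, password -- matching the example slave server.
        SlaveServer slaveServer = new SlaveServer("server1", "server1", "8181", "cluster", "cluster");

        // Calls the status service over HTTP and parses the returned XML for us.
        SlaveServerStatus status = slaveServer.getStatus();
        System.out.println("Slave server status: " + status.getStatusDescription());
    }
}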

Carte Security

Carte uses simple HTTP authentication by default. The usernames and passwords are defined in the file pwd/kettle.pwd. The default username/password that Kettle ships with is cluster. The password in this file can be obfuscated with the Encr tool that ships with Kettle. To generate a password for a Carte password file, use the option -carte, as in this example:

sh encr.sh -carte Password4Carte
OBF:1ox61v8s1yf41v2p1pyr1lfe1vgt1vg11lc41pvv1v1p1yf21v9u1oyc

The returned string can then be placed in the password file after the username, using a text editor of your choice:

someuser:OBF:1ox61v8s1yf41v2p1pyr1lfe1vgt1vg11lc41pvv1v1p1yf21v9u1oyc

The OBF: prefix tells Carte that the string is obfuscated. If you don’t want to obfuscate the passwords in this file, you can simply specify the password in clear text like this:

someuser:Password4Carte

Note that the passwords can be obfuscated, not encrypted. The algorithms used make it harder to read the passwords, but certainly not impossible. If a piece of software is capable of reading the password, you must assume that someone else could do it, too. For this reason, you should always put appropriate permissions on the password file. If you prevent unauthorized access to the file, you reduce the risk of someone being able to decipher the password in the first place. It is also possible to use JAAS, short for the Java Authentication and Authorization Service, to configure the security of Carte. In that case, you have to define two system properties (for example in the kettle.properties file):

■ loginmodulename: The name of the login module to use.
■ java.security.auth.login.config: This points to the JAAS configuration file that needs to be used.

The name of the JAAS user realm is Kettle. The details of configuring JAAS are beyond the scope of this book. More information can be found on the JAAS home page at http://java.sun.com/products/jaas/.

Services

A slave server provides a bunch of services to the outside world. Table 16-1 lists the defined services that are provided. The services live in the /kettle/ URI on the embedded web server. In our example server, that would be http://server1:8181/kettle/. All services accept the xml=Y option to make them return XML that can be parsed by a Kettle Java class. The class used, from package org.pentaho.di.www, is also mentioned in Table 16-1.

Table 16-1: Slave Server Services

■ status — Gives back a summary status containing all transformations and jobs. Java class: SlaveServerStatus.
■ transStatus — Retrieves the status of a single transformation and lists the status of steps. Parameters: name (name of the transformation); from (start logging line for incremental logging). Java class: SlaveServerTransStatus.
■ prepareExecution — Prepares a transformation for execution, performs the initialization of steps. Parameters: name (name of the transformation). Java class: WebResult.
■ startExec — Starts the execution of the steps. Parameters: name (the name of the transformation). Java class: WebResult.
■ startTrans — Performs initialization and execution of a transformation in one go. While convenient, this is not used for clustered execution since initialization needs to be performed simultaneously across the cluster. Parameters: name (the name of the transformation). Java class: WebResult.
■ pauseTrans — Pauses or resumes a transformation. Parameters: name (the name of the transformation). Java class: WebResult.
■ stopTrans — Terminates the execution of a transformation. Parameters: name (the name of the transformation). Java class: WebResult.
■ addTrans — Adds a transformation to the slave server. This requires the client to post the XML of the transformation to Carte. Parameters: TransConfiguration. Java class: WebResult.
■ allocateSocket — Allocates a server socket on the slave server. Please see the “Clustering Transformations” section later in this chapter for more information.
■ sniffStep — Retrieves the rows that are passing through a running transformation step. Parameters: trans (name of the transformation); step (name of the step); copy (copy number of the step); lines (number of lines to retrieve); type (input or output hop of a step). Returns: <step-sniff> XML containing a RowMeta object as well as serialized row data.
■ startJob — Starts the execution of a job. Parameters: name (the name of the job). Java class: WebResult.
■ stopJob — Terminates the execution of a job. Parameters: name (the name of the job). Java class: WebResult.
■ addJob — Adds a job to the slave server. This requires the client to post the XML of the job to Carte. Parameters: JobConfiguration. Java class: WebResult.
■ jobStatus — Retrieves the status of a single job and lists the status of job entries. Parameters: name (name of the job); from (start logging line for incremental logging). Java class: SlaveServerJobStatus.
■ registerSlave — Registers a slave with a master (see the “Clustering Transformations” section). This requires the client to post the XML of the slave server to the slave server. Parameters: SlaveServerDetection. Java class: WebResult (reply).
■ getSlaves — Gives back a list of all the slave servers that are known to this master slave server. Returns: a <SlaveServerDetections> XML block containing SlaveServerDetection items.
■ addExport — Allows you to transport an exported job or transformation over to the slave server as a .zip archive. It ends up in a temporary file. The client needs to post the content of the zip file to the Carte server. This method always returns XML because it has no use in any other capacity. Java class: WebResult containing the URL of the created temporary file.

Clustering Transformations

Clustering is a technique that can be used to scale out transformations to make them run on multiple servers at once, in parallel. It spreads the transformation workload over different servers. In this section, we cover how you can configure and execute a transformation to run across multiple machines. A cluster schema consists of one master server that is being used as a controller for the cluster, and a number of non-master slave servers. In short, we refer to the controlling Carte server as the master and the other Carte servers as slaves. A cluster schema also contains metadata on how master and slaves pass data back and forth. Data is passed between Carte servers over TCP/IP sockets. TCP/IP was chosen as the data exchange protocol because passing through Web services would be too slow and cause unnecessary overhead.

NOTE The concepts master and slave are only important when dealing with cluster schemas. To make a slave server a master, simply check the “Is the master?” check-box in the slave server dialog. You do not need to pass any specific option to Carte.

Defining a Cluster Schema

Before you define a cluster schema, you need to define a number of slave servers. (See the previous section in this chapter for instructions on defining a slave server.) Once that is done, you can right-click on the “Kettle cluster schemas” tree item and select the New option, as shown in Figure 16-16. You can then specify all the details for your cluster schema. Make sure to select at least one master to control the cluster and one or more slaves (see Figure 16-17).

Figure 16-16: Creating a new cluster schema

Figure 16-17: The “Clustering schema” dialog

Here are the notable options:

■ Port: The lowest TCP/IP socket port number that will be used to transport data from one slave to another. It is only a starting point. If your cluster transformation requires 50 ports, all ports between Port and Port+50 will be used.
■ Sockets buffer size: The buffer size used to smooth communications between slave servers. Make sure that you don’t make this value too high because it can cause undesirable oscillations in the data passing process.
■ Sockets flush interval (rows): This is the number of rows after which the transformation engine will perform a flush on the data sockets to make sure the data is forced to the remote slave server. The performance implications of setting this parameter and the value to select are heavily dependent on the speed and latency of the network between the slave servers.
■ Sockets data compressed? Determines whether the data will be compressed as it is passed between slave servers. While this is great for relatively slow networks (10Mbps, for example), setting this to “Yes” causes the clustered transformation to slow down a fair amount because additional CPU time is used for compression and inflation of the data streams. As such, it’s usually best to disable this option unless the performance of the network is a limitation.
■ Dynamic cluster: When enabled, this option will make Kettle look at the master to determine the list of slaves for the cluster schema. For more information about dynamic clusters, see Chapter 17.

Designing Clustered Transformations

To design a clustered transformation, start by building a regular transformation. To transition to clustering, create a cluster schema as described previously and then select the steps that you want to execute on slave servers. Right-click on the step to select the cluster you want the step to execute on. For example, you might want to read data from a large file that is stored on a shared network drive, sort the data, and write the data back to another file. You want to read and sort the data in parallel on your three slaves. Figure 16-18 illustrates how you would start.

Figure 16-18: A regular transformation

Next, select the steps you want to execute on the slaves: the “CSV file input” and “Sort rows” steps. Select Clustering... from the step’s context menu. After selecting the cluster schema this step is to run on, you end up with the transformation shown in Figure 16-19.

Figure 16-19: A clustered transformation

When you execute this transformation, all steps that are defined to run clustered (those with the Cx3 notation in Figure 16-19) will be run on the slaves. Those steps that don’t have the cluster indication will run on the master.

NOTE In Figure 16-19 rows are sorted in parallel using three “Sort rows” steps on three different slave servers. This results in the same number of groups of sorted rows that are sent back to the master. Because Kettle reads rows in blocks from previous steps you have to take action to keep the rows in a sorted order. This task is performed by the Sorted Merge step that reads rows one by one from all input steps and keeps them in a sorted order. Without this step, the parallel sort would not lead to correct results.

A transformation is considered to be a clustered transformation if at least one step in the transformation is assigned to run on a cluster. Clustered transformations can be executed in a non-clustered manner for testing and development, using the Execution dialog in Spoon.

NOTE It’s important to remember that only one cluster can be used in any single transformation!

Execution and Monitoring

You have a couple of choices for running a clustered transformation. One option is to run it in Spoon by selecting the “Execute clustered” option in the execution dialog (see Figure 16-20).

Figure 16-20: Executing a clustered transformation from the “Execute a transformation” dialog

For debugging purposes, you can use the following clustering options:

■ Post transformation: When selected, this will post the generated transformations to the slaves and master.
■ Prepare execution: This option will run the initialization of the generated transformations on the slaves and the master.
■ Start execution: When this option is enabled, the clustered transformation will be started on the master and the slaves.
■ Show transformations: This option will open the master and slave transformations in Spoon so you can see what kind of transformations are generated. More information on the slave and master transformations is provided in the following section.

Please note that the first three options must be enabled to completely execute a clustered transformation. The fourth option does not require running the transformation but simply enables you to see the generated transformations. Another option to run a clustered transformation is to run it as part of a job with a Transformation job entry. In that job entry, you can enable the “Run this transformation in a clustered mode?” option to make the transformation run on a cluster (see Figure 16-21).

Figure 16-21: Execute clustered transformations with a job entry

Metadata Transformations

It is not sufficient to simply run the same transformation on the master and the slaves. That would not typically lead to a correct implementation for your parallel data processing requirements. The transformations that get executed on the master and the slaves are generated after a translation process called metadata transformations. The ETL metadata of the original transformation, the one you designed in Spoon, is chopped up in pieces, reassembled, enriched with extra information and sent over to the target slave. Following the metadata transformation, you have three types of transformations:

■ Original transformation: A clustered transformation as designed in Spoon by the user.
■ A slave transformation: A transformation that was derived from the original transformation to run on a particular slave. There will be one slave transformation for each slave in the cluster.
■ A master transformation: A transformation that was derived from the original transformation to run on the master.

In the case of the clustering example in Figure 16-19, three slave transformations and one master transformation will be generated. Figure 16-22 shows what the master transformation looks like in our example.

Figure 16-22: A master transformation

Figure 16-23 illustrates what the slave transformations look like.

Figure 16-23: A slave transformation

The light-gray numbered areas of the transformations indicate that those steps have remote input or output connections called Remote Steps. In our example, there are three slaves. Each slave sends data from the “Sort rows” to the Sorted Merge step. This means that each of the three “Sort rows” steps has one remote output step and that the one Sorted Merge step has three remote input steps. If you hover the mouse over the light-gray rectangle, you will get more information on the remote steps and the port numbers that got allocated, as shown in Figure 16-24.

Figure 16-24: Tooltip information showing remote steps

Rules

As you can imagine, there are a lot of possibilities to consider when doing these metadata transformations. Let’s take a look at a few of the general rules that Kettle uses during generation to ensure correct logical operations that match the original transformation:

■ A step is copied to a slave transformation if it is configured to run clustered.
■ A step is copied to a master transformation if it is not configured to run clustered.
■ Remote output steps (sending data over TCP/IP sockets) are defined for steps that send data to a clustered step.
■ Remote input steps (accepting data from TCP/IP sockets) are defined for steps that accept data from a clustered step.

The following rules are more complicated because they deal with some of the more complex aspects of clustering:

■ Running steps in multiple copies is supported. In such cases the remote input and output steps will be distributed over the number of copies. It makes no sense to launch more copies as these are remote steps.
■ In general, Kettle clustering requires you to keep transformations simple in nature to make the generation of transformations more predictable.
■ When a step reads data from specific steps (info-steps), Socket Reader and Socket Writer steps will be introduced to make the transformation work, as is the case in the transformation shown in Figure 16-25. Figure 16-26 illustrates what the resulting slave transformations look like. Figure 16-27 shows what the master transformation will look like.

Figure 16-25: Delivering data to a clustered step

Figure 16-26: A slave transformation with a reader

Figure 16-27: A master distributing data to slaves

As a careful eye will notice, the “Table input” step is distributing the rows over the different socket writers that pass the data to the slave servers. This is not really what we intended to happen. Make sure to copy the data to the multiple copies running on the remote slaves in these situations (see Figure 16-28).

Figure 16-28: Copying data to the slaves

Again, it might be more prudent to simply read the data three times on the slaves, as shown in Figure 16-29.

Figure 16-29: Acquiring data on the slaves

Data Pipelining

Recall the observation made earlier in this chapter regarding data pipelining or data in swim-lanes: the more data that is communicated between Carte servers, the slower the transformation will get. Ideally, your data is structured in such a way that you can parallelize everything from source to target. In that respect, it is easier to have 100 small XML files to process compared to having a single large one because the data can be read in parallel with the multiple files. As a general rule, this is the key to getting good performance out of your clustered transformations: Keep it simple, do as much as possible in swim-lanes on the same slave, and reduce the amount of data passed between servers.

Partitioning

Partitioning is a very general term that, in a broad sense, simply means splitting something up into parts. In terms of data integration and databases, partitioning refers to the splitting of database tables or entire databases (sharding). Tables can be partitioned into table partitions and entire databases can be partitioned into shards. In addition, it is quite possible to have text or XML files partitioned, for example per store or region. Because a data integration tool needs to support all sorts of technologies, partitioning in Kettle was designed to be source- and target-agnostic in nature.

Defining a Partitioning Schema

Partitioning is baked into the core of the Kettle transformation engine. Whenever you distribute rows over a number of target steps you are partitioning the data. The partitioning rule in this case is round robin. Because this rule is, in fact, not much better than random distribution, it is not usually referred to as a partitioning method.

When we talk about partitioning in Kettle, we refer to the capability Kettle has to direct rows of data to a certain step copy based on a partitioning rule. In Kettle, a given set of partitions is called a partitioning schema. The rule itself is called the partitioning method. A partitioning schema either can contain a list of named partitions or can simply contain a number of partitions. The partitioning method is not part of the partitioning schema. Figure 16-30 offers a simple example where we define a partitioning schema with two partitions, A and B.

Figure 16-30: The “Partitioning schema” dialog

Once the partitioning schema is defined, you can use it in a transformation by applying it to a step along with a partitioning method. When you select the Partitioning... option from the step context menu, you are presented with a dialog to choose which partitioning method to use (see Figure 16-31). The partitioning method can be one of the following:

■ None: Does not use partitions. The standard “Distribute rows” or “Copy rows” rule is applied.
■ Mirror to all partitions: A special case that we describe later in the “Database Partitions” section.
■ Remainder of division: The standard partitioning method in Kettle. Kettle divides the partitioning field as an integer (or a checksum if it is another data type) by the number of partitions. The remainder, or modulo, is used to determine which partition the row will be sent to. For example, if you have user identification number “73” in a row and there are 3 partitions defined, then the row belongs to partition 1. Number 30 belongs to partition 0 and 14 to partition 2. (A small sketch following Figure 16-33 works through these numbers.)
■ Partition methods implemented by plugins: This option is not available from the partitioning method dialog. Refer to Chapter 23 for a partitioning plugin example and more information on how to write plugins.

You then need to specify the partitioning schema to use. In the example, you would select AB, as shown in Figure 16-32. At that point, a plugin-specific dialog will be displayed to allow you to specify the arguments for the partitioning method. In our case, we need to specify the field to partition on (see Figure 16-33).

Figure 16-31: Selecting the partitioning method

Figure 16-32: Selecting a partitioning schema

Figure 16-33: Specifying the field to partition on
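
To make the “Remainder of division” arithmetic from the list above concrete, here is a small illustrative sketch (not Kettle's actual implementation) that reproduces the numbers used in the example: with three partitions, key 73 lands in partition 1, key 30 in partition 0, and key 14 in partition 2.

public class RemainderOfDivision {
    // Partition number = integer key modulo the number of partitions.
    static int partitionOf(long key, int nrPartitions) {
        return (int) (Math.abs(key) % nrPartitions);
    }

    public static void main(String[] args) {
        int nrPartitions = 3;
        long[] keys = {73, 30, 14};
        for (long key : keys) {
            // Prints: 73 -> partition 1, 30 -> partition 0, 14 -> partition 2
            System.out.println(key + " -> partition " + partitionOf(key, nrPartitions));
        }
    }
}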

Objectives of Partitioning

The objective of using partitioning is to increase the parallelism in a transformation. In many cases, this simply can’t be done because the problem is not parallelizable by just distributing the load across multiple copies, or servers.

Take, for example, a Group By step; for simplicity, let’s use the Memory Group By step. If you were to run multiple copies of this step with the standard transformation row distribution, you would almost certainly not end up with correct results because records belonging to a certain group could end up at any given step copy. The aggregate totals would not be correct because the steps may not have seen all the rows that are part of a group.

Let’s consider a simple example. You have a text file containing customer data and you want to calculate the number of distinct ZIP codes per state in the file (see Figure 16-34).

Figure 16-34: A partitioning sample

The file is read and the partitioning method is applied to the state field. Using this partitioning method, you can ensure that you always send rows of data with the same state to the same step copy. This in effect allows you to run the Memory Group By step in multiple copies and produce correct results, something that would not be possible otherwise. Another reason to partition would be if you would like to run Database Lookup and Dimension Lookup/Update steps in parallel. If you apply partitioning to these steps, you would likely increase the positive cache hits. That is because you guarantee that a row of data with the same key would wind up at the same step copy and increase the likelihood that the value would already be in memory.
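
The difference between plain row distribution and key-based partitioning is easy to demonstrate outside of Kettle. The toy sketch below (purely illustrative, not Kettle code) counts rows per state for two step copies: with round-robin distribution each copy sees only a fragment of every group, while partitioning on the state key gives each copy complete groups that it can aggregate correctly.

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class GroupByDistributionDemo {
    public static void main(String[] args) {
        String[] states = {"CA", "NY", "CA", "TX", "NY", "CA"};
        int copies = 2;

        Map<String, Integer>[] roundRobin = newCounters(copies);
        Map<String, Integer>[] partitioned = newCounters(copies);

        for (int row = 0; row < states.length; row++) {
            // Round-robin: rows of the same state end up on different copies,
            // so each copy only sees part of each group.
            add(roundRobin[row % copies], states[row]);

            // Partitioned on the state: every row of a given state goes to the
            // same copy, so that copy's count for the state is complete.
            add(partitioned[Math.abs(states[row].hashCode()) % copies], states[row]);
        }

        System.out.println("Round-robin partial counts per copy: " + Arrays.toString(roundRobin));
        System.out.println("Partitioned counts per copy:         " + Arrays.toString(partitioned));
    }

    @SuppressWarnings("unchecked")
    static Map<String, Integer>[] newCounters(int copies) {
        Map<String, Integer>[] counters = new HashMap[copies];
        for (int i = 0; i < copies; i++) {
            counters[i] = new HashMap<String, Integer>();
        }
        return counters;
    }

    static void add(Map<String, Integer> counter, String state) {
        Integer current = counter.get(state);
        counter.put(state, current == null ? 1 : current + 1);
    }
}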

Implementing Partitioning

The implementation of partitioning in Kettle is simple: for each partition that is defined, a separate step copy is started for any step that has a partitioning method defined. That means that if you define five partitions, you have five step copies to do the work. The step prior to the partitioned step, in Figure 16-34 the “CSV file input” step, is the one that is doing the re-partitioning. Re-partitioning is performed when data is not partitioned and needs to be sent to a partitioned step. It is also performed when data is partitioned using one partition schema and is sent on a hop to a step that is using a different partition schema.

Internal Variables

To facilitate data handling that is already in a partitioned format, Kettle defines a number of internal variables to help you:

■ ${Internal.Step.Partition.ID}: This variable describes the ID or name of the partition to which the step copy belongs. It can be used to read or write data external to Kettle that is in a partitioned format.
■ ${Internal.Step.Partition.Number}: This variable describes the partition number from 0 to the number of partitions minus one.

For example, if you have data that is already partitioned in N text files (file-0 to file-N, with N being the number of partitions minus one), you could create a “CSV file input” step that reads from filename file-${Internal.Step.Partition.Number}.csv. Each step copy would read data only from the file that contains its partitioned data.

Database Partitions

Database partitions, or shards, can be defined in Kettle in the database connection dialog. You can do so on the Clustering tab when configuring a database connection. Kettle assumes that all database partitions are of the same database and connection type (see Figure 16-35).

Figure 16-35: Defining database partitions

The goal of defining partitions is to read and write data pertaining to a certain partition to a certain physical database. Once you define the database partitions in the database connection, you create a partitioning schema based on this information. To do this, you use the “Import partitions” button in the “Partitioning schema” dialog (refer back to Figure 16-30). Now you can apply the partitioning schema to any step that uses this partitioned database connection. One step copy will be launched for each database partition, and each copy will connect to the physical database defined by the database partition with the same name as the step copy’s partition. Figure 16-36 shows an example of a query that is being executed in parallel against two different database partitions. The data is pipelined to the next two copies of the step for further calculations.

Figure 16-36: Reading database partitions

The same principle applies to all database steps and can be used to maintain a number of databases in parallel. The “Mirror to all partitions” partitioning method was designed specifically to allow you to write the same data to multiple database partitions in parallel. This is useful to populate lookup tables that need to be replicated on multiple database partitions without the need to define multiple connections.

Partitioning in a Clustered Transformation

In a situation in which many partitions are being defined, the number of step copies can grow prohibitively large in a transformation. Solving this involves spreading the partitions over a number of slaves in a clustered transformation.

During transformation execution, the available partitions are allocated equally among the available slaves. If you have defined a partitioning schema with a static list of partitions, those partitions will be divided over the number of slave transformations at run-time. The limitation in Kettle here is that the number of partitions has to be equal to or larger than the number of slaves and is usually a multiple of the number of slaves for even distribution (slaves × 2, slaves × 3). A simple way to work around this limitation is to specify the number of partitions you want to run per slave server dynamically, without preconfiguring a fixed set of partitions. This is possible as well, as shown previously in Figure 16-30.

Keep in mind that if you use a partitioned step in a clustered transformation, the data needs to be re-partitioned across the slaves. This can cause quite a bit of data communication between slave servers. For example, if you have 10 slaves in a cluster schema with 10 copies of step A and you partition the next step B to run with 3 partitions per slave, 10 × 30 data paths will need to be created, similar to the sample in Figure 16-7. 10 × 30 – 30 = 270 of these data paths will consist of remote steps, causing a lot of network traffic as well as CPU and memory consumption. Please consider this effect when designing your clustered and partitioned transformations.
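
The data-path arithmetic from the previous paragraph can be spelled out as follows; this is just an illustrative calculation for the 10-slave example, not Kettle code.

public class ClusterPartitionPaths {
    public static void main(String[] args) {
        int slaves = 10;
        int copiesOfStepAPerSlave = 1;      // step A runs in one copy on each slave
        int partitionsOfStepBPerSlave = 3;  // step B runs in three partitioned copies per slave

        int sourceCopies = slaves * copiesOfStepAPerSlave;      // 10
        int targetCopies = slaves * partitionsOfStepBPerSlave;  // 30

        // Re-partitioning means every source copy may have to reach every target copy.
        int totalDataPaths = sourceCopies * targetCopies;       // 10 x 30 = 300

        // Only the paths that stay on the same slave are local.
        int localDataPaths = slaves * copiesOfStepAPerSlave * partitionsOfStepBPerSlave; // 30

        int remoteDataPaths = totalDataPaths - localDataPaths;  // 300 - 30 = 270 remote steps
        System.out.println("Remote data paths: " + remoteDataPaths);
    }
}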

Summary

In this chapter, you took a look at multi-threading in transformations, clustering, and partitioning. Here are some of the main points covered:

■ You learned how a transformation executes steps in parallel and how rows are distributed when steps are executed with multiple step copies. We described how data is distributed and merged back together, and we covered a few typical problems that can arise from this.
■ We showed you how a slave server can be deployed to execute, manage, and monitor transformations and jobs on remote servers.
■ You took an in-depth look at how multiple slave servers can be leveraged to form a cluster and how transformations can be made to utilize the resources of these slave servers.
■ Finally, you learned how Kettle partitioning can help you parallelize steps that operate on groups of data and how partitioning can improve cache hits. You also saw how this can be applied to text file and database partitioning by using partition variables and partitioned database schemas.

CHAPTER 17

Dynamic Clustering in the Cloud

This chapter continues where Chapter 16 left off and explains how clusters don’t have to consist of a fixed number of slaves but rather can be dynamic in nature. Once you’ve learned all about dynamic clustering, we introduce you to a dynamic set of resources called cloud computing. We then move on to a practical implementation of one cloud computing service: the Amazon Elastic Compute Cloud (EC2). We finish off the chapter by explaining how you can configure your own set of servers on Amazon EC2 for use as a cluster.

Dynamic Clustering

While it is still standard practice at most organizations for ETL developers to have only one or two servers to work with, it’s becoming more common to have a whole set of machines available as general compute resources. This section describes how Kettle clustering can enable you to take advantage of a dynamic pool of computer resources. Even before terms such as cloud computing and virtual machines became popular, initiatives like SETI@Home were already utilizing computer resources dynamically. SETI@Home was one of the very first popular distributed dynamic clusters; people all over the world contributed processing power to help the Search for Extraterrestrial Intelligence. The SETI@Home cluster is dynamic in configuration because the number of participating nodes is constantly changing. In fact, SETI@Home is implemented as a screensaver, so it’s impossible to say up-front how many machines participate in the cluster at any given time.


With the advent of cloud computing and virtual machines, computing resources have become available at very low cost. As such, it makes sense to have support for dynamically changing cluster configurations in Kettle. In a dynamic Kettle cluster schema you will define only a master. Slaves are not registered in the cluster schema as described in the previous chapter. Rather, the slaves are configured to register themselves with the master server. This way the master knows about the configuration of the cluster at all times. The master will also check the availability of slaves and will remove them from the cluster configuration if they are not available. Creating a dynamic cluster schema is much like creating a normal cluster schema (see the section “Defining a Cluster Schema” in Chapter 16). You can make a cluster schema dynamic simply by checking the “Dynamic cluster” option in the “Clustering Schema dialog,” as shown in Figure 17-1.

Figure 17-1: Creating a dynamic cluster schema

Setting Up a Dynamic Cluster

To set up a dynamic cluster, start by launching a single instance of Carte that you will use as the master. Ideally, the master should be run on a system that is known in the network. Typically, a machine with a fixed IP address or hostname is used. Because slaves need to be able to contact this server, it is important that the location of the master is known before any slaves are started. By way of example, a minimal configuration file for the master is available in your Kettle 4.x distribution in the pwd/ folder. The file is called carte-config-master-8080.xml and its content is shown following:

<slave_config>
  <slaveserver>
    <name>master1</name>
    <hostname>localhost</hostname>
    <port>8080</port>
    <master>Y</master>
  </slaveserver>
</slave_config>

This configuration can be supplemented with additional logging parameters, as described in the previous chapter. From the viewpoint of Carte, this is a slave server like any other and you start it up as one. Execute the following command on a new UNIX terminal in the Kettle distribution folder:

sh carte.sh pwd/carte-config-master-8080.xml

This is the command on a Windows system:

Carte.bat pwd/carte-config-master-8080.xml

The difference in dynamic clustering is specified in the slaves, not the master. Take a look at how you can make a slave register with your master. First, you must specify in the slave configuration file that the slave needs to report to the master, how it should connect to it, and what the username and password are:

<slave_config>
  <masters>
    <slaveserver>
      <name>master1</name>
      <hostname>localhost</hostname>
      <port>8080</port>
      <username>cluster</username>
      <password>cluster</password>
      <master>Y</master>
    </slaveserver>
  </masters>

  <report_to_masters>Y</report_to_masters>

  <slaveserver>
    <name>slave1-8081</name>
    <hostname>localhost</hostname>
    <port>8081</port>
    <username>cluster</username>
    <password>cluster</password>
    <master>N</master>
  </slaveserver>
</slave_config>

In the <masters> section of the configuration file, you can specify one or more Carte servers to which the slave needs to report. While multiple masters are allowed, failover and load balancing are not yet supported in version 4.0 of Kettle. Note that you

need to specify the username and password of both the master and the slave server. That is because the slave server needs to register with the master and the master needs to be able to see if the slave is still functional. Finally, the <report_to_masters> option specifies that the slave should report to the master. Because this configuration file is also available in your Kettle download, you can start it up right away:

sh carte.sh pwd/carte-config-8081.xml

This time around, you will not only see the regular web server messages on the console, but also the following log entry:

Registered this slave server to master slave server [master1] on address [localhost:8080]

At the same time, you can now open the following web page in a web browser: http://carteserverhostname:8080/kettle/getSlaves/. After logging in, your browser will show you the result of the request in XML:

<SlaveServerDetections>
  <SlaveServerDetection>
    <slaveserver>
      <name>Dynamic slave [localhost:8081]</name>
      <hostname>localhost</hostname>
      <port>8081</port>
      <webAppName/>
      <username>cluster</username>
      <password>Encrypted 2be98afc86aa7f2e4cb1aa265cd86aac8</password>
      <proxy_hostname/>
      <proxy_port/>
      <non_proxy_hosts/>
      <master>N</master>
    </slaveserver>
    <active>Y</active>
    <last_active_date>2010/03/02 22:22:47.156</last_active_date>
    <last_inactive_date/>
  </SlaveServerDetection>
</SlaveServerDetections>

With the getSlaves request, you can get a list of all the active and inactive slaves for a master at any time.
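
If you prefer to watch the cluster from code rather than from a browser, you can call the same service programmatically. The following is a minimal sketch (not part of the Kettle API) that polls the getSlaves service over HTTP using only standard Java classes. The host name, port, and the cluster/cluster credentials are the defaults used in this chapter and should be replaced with your own values; Java 8 or later is assumed for java.util.Base64.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Base64;

public class GetSlavesClient {
  public static void main(String[] args) throws Exception {
    // Default Carte credentials used in this chapter; replace with your own.
    String credentials = Base64.getEncoder().encodeToString("cluster:cluster".getBytes("UTF-8"));
    URL url = new URL("http://localhost:8080/kettle/getSlaves/");

    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestProperty("Authorization", "Basic " + credentials);

    // Print the XML list of registered slave servers to the console.
    BufferedReader reader = new BufferedReader(new InputStreamReader(conn.getInputStream(), "UTF-8"));
    String line;
    while ((line = reader.readLine()) != null) {
      System.out.println(line);
    }
    reader.close();
    conn.disconnect();
  }
}

The XML that comes back is the same SlaveServerDetections document shown above, so you can parse it with any XML library you like.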

Using the Dynamic Cluster Once you have started a number of slaves that are registered to a master and you have created a dynamic cluster schema with the master in it, you can design your transformation to use it. Because Kettle doesn’t know at design time how many slaves are involved, the step is annotated with CxN, where N stands in for the actual number of slave servers, as shown in Figure 17-2.

Figure 17-2: Steps to be executed by a dynamic cluster are marked as CxN

It is only at run-time that the list of slaves is retrieved and used. If new slaves are added during the execution of the transformation, they are not added dynamically to the running transformation. If slaves are turned off during the execution of the transformation, the slave transformation stops and the complete transformation will fail.

Cloud Computing

In the last couple of years, we’ve seen the introduction of a number of cloud computing services on the Internet. It all started with the experimental launch of Salesforce.com in 1999. Salesforce.com was one of the first companies to offer enterprise grade applications in the form of a website. Because of its success, many other companies followed suit. The resulting web applications have been labeled Software as a Service or SaaS. A few years later, in 2002, Amazon felt that it could make more money out of its vast array of computing hardware and started its Amazon Web Services. Initially, it offered a number of web services to provide storage (S3), computation (EC2), and even a Mechanical Turk (https://www.mturk.com/). It was, however, the launch of the Elastic Compute Cloud (EC2) initiative in 2006 that really helped create a new market, also called Infrastructure as a Service (IaaS). This made it possible to run and manage your own servers via a web service at a very low hourly rate. Before EC2 and comparable services, you had all sorts of grid-based services, usually from very large companies like Oracle, Sun and IBM, that promised computing resources at a low hourly rate as well. The main difference between these grid computing platforms and cloud computing is that cloud computing uses virtualized resources to deliver computing power whereas grid computing uses actual machines. The advantages to cloud computing are very important: A virtual server can be managed very tightly and the software you can run on it is without limits. Once you run a machine with a cloud provider like Amazon EC2, you use it exclusively and you can do whatever you want with it. Then you can create an image out of it and launch 20 copies to do more of the same. Because it can all be managed remotely, it can be scripted and managed automatically. The combination of these advantages, together with the low pricing (starting at USD $0.085 per hour) quickly made Amazon EC2 a very popular choice to run all kinds of services. The fact that you can create more virtual machine instances based on an image on the fly is what allows the creation of dynamic clusters. It is also a challenge, because it means software must be created in such a way that it can take advantage of these instances without knowing typical identifying characteristics such as IP address in advance. This is what necessitates the concept of dynamic clustering: the ability to define a cluster without knowing in advance which machines will take part in it.

EC2

EC2 is an array of real, physical servers offered to you by Amazon in the form of virtual machines (VM). The management of these virtual machines is performed by the users. The responsibility of Amazon is to keep the VM running, and the responsibility of the user of EC2 is to use the VM. Management is done using a set of web services that includes not only EC2 but also infrastructure services such as the Simple Storage Service (S3) and the Elastic Block Service (EBS), and middleware such as SimpleDB and the Relational Database Service (RDS). You can find information about these web services at http://aws.amazon.com/.

Getting Started with EC2 To get started with EC2, you first need to set up an account with Amazon AWS. Go to http://aws.amazon.com/ to register and be sure to sign up for the EC2 service. Next you need to set up the EC2 command-line tools. The examples that follow use a Linux operating system. Note that the commands described here are all written in Java and work on Linux, UNIX, and Mac OS X as well as Windows operating systems. For more information on how to install the command-line tools, refer to the following site:

http://docs.amazonwebservices.com/AmazonEC2/gsg/2006-06-26/setting-up-your-tools.html

Costs Because real hardware is being used to run the virtual machines, you must pay to use it. (For pricing information, see http://aws.amazon.com/ec2/pricing.) At the time of this writing, a small server instance costs $0.085 per hour while on the other end of the spectrum, a Quadruple Extra Large High Memory instance costs $2.40 per hour.

WARNING Note that while these prices are very low and quite suitable to use for demo purposes as part of this book, the monthly and yearly costs are not negligible. Thus, it’s important to shut down or terminate your virtual machine instances when they have served their purpose. Also note that usage is measured per hour with a minimum of one hour. That means that even if you only started an instance for 5 minutes, you will still be charged for a full hour. Not only do computing resources cost money, but if you store bundles or large amounts of data on Amazon’s S3 you will have to pay between $0.055 and $0.15 per GB depending on the storage usage (see http://aws.amazon.com/s3/pricing). An Amazon EBS volume costs $0.10 per GB per month of provisioned storage. Finally, data transferred from an EC2 instance to the Internet (outbound volume) costs between $0.15 per GB and $0.08 per GB depending on the total traffic volume. Data transferred to an EC2 instance is charged at $0.10 per GB.

Customizing an AMI AMI is short for Amazon Machine Image. This image contains all the software that a virtual machine needs to run. It can contain a wide range of operating systems in both 32- and 64-bit variations. So the first task for this example is to select an operating system for the virtual machine. This example uses Ubuntu Server Edition—specifically, the official Ubuntu AMI for version 9.10 (Karmic Koala). Ubuntu has an excellent tutorial on how to get started with EC2 at https://help.ubuntu.com/community/EC2StartersGuide. If you browse by Operating System, you can find other operating systems. Before you start this image, make sure that you can access the machine remotely once it is started. To achieve this, you will need to generate a new key pair. This key pair will give you secure shell (ssh) access to the instance once it is launched. Run the following commands to generate a new key pair:

ec2-add-keypair pentaho-keypair > pentaho-keypair.pem
chmod 600 pentaho-keypair.pem

Now start the Ubuntu server instance:

ZX'"gjc"^chiVcXZhVb^"WW,%.YY'·`eZciV]d"`ZneV^g

To monitor the state of the machine while it’s booting, execute the following command:

ZX'"YZhXg^WZ"^chiVcXZh

After a few minutes, you will see a line similar to the following:

>CHI6C8:^"Z,-**V-XVb^"WW,%.YY' ZX'",'"))"*+"&.)#XdbejiZ"&#VbVodcVlh#Xdb^e"&%"')*"(%"&)+#ZX'#^ciZgcVa gjcc^c\eZciV]d"`ZneV^g%b&#hbVaa'%&%"%("%-I&'/'&/'* %%%% jh"ZVhi"&YV`^"*[&*[+(+Vg^"Y*,%.YWX

The gjcc^c\ state indicates that the machine is up and running and ready to accept input. Now you can use secure shell to log in to the instance. You’ll need to use the generated key pair to authenticate: hh]·^eZciV]d"`ZneV^g#eZbjWjcij5ZX'",'"))"*+"&.)#XdbejiZ"&#VbVodcVlh#Xdb

Now that you are on the remote machine, you can customize it to your liking. First, add a few software repositories:

sudo vi /etc/apt/sources.list

Add the following lines to the file:

deb http://us.archive.ubuntu.com/ubuntu/ karmic multiverse
deb-src http://us.archive.ubuntu.com/ubuntu/ karmic multiverse
deb http://us.archive.ubuntu.com/ubuntu/ karmic-updates multiverse
deb-src http://us.archive.ubuntu.com/ubuntu/ karmic-updates multiverse

Now perform the following command to update the software inventory:

sudo apt-get update

You need to install the following additional software on this host:

N Java: Use the Sun JRE version 6 to run Carte.
N Unzip: To unpack a PDI software bundle.
N ec2-ami-tools: For AMI bundling purposes later on.

The following command will install these packages:

sudo apt-get install sun-java6-jre unzip ec2-ami-tools

Once that is done, you can download a recent version of Pentaho Data Integration 4.0 or later. For the community edition, you can find the location of the download archive at the Pentaho project page on sourceforge.net. If you’re a Pentaho Data Integration Enterprise Edition customer, you can obtain the location via the customer support portal.

wget http://sourceforge.net/projects/pentaho/files/Data Integration/4.0.0-stable/pdi-ce-4.0.0-stable.zip/download

It is also possible to secure copy the file onto the box using scp:

scp -i pentaho-keypair.pem pdi-ce-version.zip ubuntu@server...amazonaws.com:/home/ubuntu

Then unpack the Kettle software as usual:

mkdir pdi
cd pdi
unzip ../pdi-ce-4.x.x.zip

This will give you access to the complete Kettle software package on the EC2 instance.

NOTE If you are using Kettle plugins, make sure to add them in the appropriate subdirectory of the plugins directory beneath the Kettle home directory. Adding them here for the master will make sure that they are picked up on all slaves later on.

Next, make sure that the instance is used as a Carte instance. Because you want to pass in a different Carte configuration file for the master and the slaves, you will pass in an XML data file with the -f option of the ec2-run-instances command. The maximum size of this data file is 16KB, but that’s plenty for this example. On the instance itself, you can retrieve the data file using a web service call:

wget http://169.254.169.254/1.0/user-data -O /tmp/carte.xml

This command needs to be run as part of the boot sequence of the instance. That means that you need to create a boot file called /etc/init.d/carte containing the following lines:

#!/bin/sh

su ubuntu -c "/home/ubuntu/runCarte.sh"

This executes a shell script to run Carte at boot time. It doesn’t execute this script as root but as the stock ubuntu user. The script that gets executed, /home/ubuntu/runCarte.sh, contains the following content:

#!/bin/sh

LOGFILE=/tmp/carte.log
exec 2>$LOGFILE

cd /home/ubuntu/pdi

# Now start carte ...
sh carte.sh /tmp/carte.xml >> $LOGFILE

This script will log to file /tmp/carte.log. In case anything goes wrong, you can always ssh into the instance and look at the log to see what happened. If required, you can change the script to your liking. Finally, make sure that the system launches the Carte service at startup:

sudo chmod +x /etc/init.d/carte
sudo chmod +x /home/ubuntu/runCarte.sh
sudo update-rc.d carte defaults

The first two commands make the script executable while the third adds the appropriate configuration to the Ubuntu server to start the Carte service at boot time.

Packaging a New AMI Now that the stock Ubuntu Server AMI is modified to your liking, it’s time to create an image of your own for further re-use. This process is called bundling and is done with the ec2-bundle-vol command. For detailed information on the bundling process, please consult the Amazon EC2 documentation at http://docs.amazonwebservices.com/AmazonEC2/gsg/2006-06-26/. First you need to secure copy your private key and certificate over to the EC2 instance. You do this using the secure copy or scp command:

scp -i pentaho-keypair.pem ~/.ec2/*.pem ubuntu@ec2-72-44-56-194.compute-1.amazonaws.com:/tmp/

You can then bundle up your own image:

sudo bash
ec2-bundle-vol -d /mnt/ -k /tmp/pk-your-pkfilename.pem -u your-accountnr -s 1536 --cert /tmp/cert-your-cert-file.pem

Then upload this AMI to Amazon S3 for further re-use:

ZX'"jeadVY"WjcYaZ"W`ZiiaZ"Wdd`"b$bci$^bV\Z#bVc^[Zhi#mba "Vndjg"VXXZhh"`Zn"hndjg"hZXgZi"`Zn

Finally, register it on our local machine (not the EC2 host):

ZX'"gZ\^hiZg`ZiiaZ"Wdd`$^bV\Z#bVc^[Zhi#mba >B6<:Vb^"WW[W&)Y'

The AMI number you get is the one you can use to launch your Carte servers.

Terminating an AMI You no longer need the instance you customized so you can terminate it now. To terminate an instance, simply run the following command (replace your-instance-nr with the name and number of your instance):

ec2-terminate-instances i-your-instance-nr

The instance number can be obtained with the ec2-describe-instances command.

Running a Master To run a master, you first create an XML file to use as a parameter for your master instance. This is the same slave configuration file as described previously. Store it in a file called carte-master.xml:

<slave_config>

  <max_log_lines>10000</max_log_lines>
  <max_log_timeout_minutes>600</max_log_timeout_minutes>
  <object_timeout_minutes>60</object_timeout_minutes>

  <slaveserver>
    <name>Master</name>
    <network_interface>eth0</network_interface>
    <port>8080</port>
    <username>cluster</username>
    <password>cluster</password>
    <master>Y</master>
  </slaveserver>

</slave_config>

As you can see, you don’t specify a hostname for the Carte web server to listen to. Instead, it looks at the IP address of the specified network interface and uses that. On an Ubuntu server on EC2, eth0 happens to be the internal network interface. Now you can run a single instance of your new AMI and pass this as a parameter:

ec2-run-instances ami-bbfb14d2 -f carte-master.xml -k pentaho-keypair

Again, you can monitor the boot process by running ec2-describe-instances. After a few minutes you will notice that the instance has started. To test if all went well, open a web browser to the EC2 instance on port 8080:

http://ec2-ip-address-etc.amazon.com:8080/

Please note that you first need to authorize this port (analogous to opening a port in a firewall); until you do, it remains protected from access from the outside world. Use the ec2-authorize command:

ec2-authorize default -p 8080

Running the Slaves Now that you have the master running, you can simply start up any number of slaves to report to it. The slave configuration file carte-slave.xml looks like this:

<slave_config>

  <max_log_lines>10000</max_log_lines>
  <max_log_timeout_minutes>600</max_log_timeout_minutes>
  <object_timeout_minutes>60</object_timeout_minutes>

  <masters>
    <slaveserver>
      <name>master1</name>
      <hostname>internal-ip-address-of-the-master</hostname>
      <port>8080</port>
      <username>cluster</username>
      <password>cluster</password>
      <master>Y</master>
    </slaveserver>
  </masters>

  <slaveserver>
    <name>Master</name>
    <network_interface>eth0</network_interface>
    <port>8080</port>
    <username>cluster</username>
    <password>cluster</password>
    <master>Y</master>
  </slaveserver>

</slave_config>

The internal IP address of the master is the only thing you’ll need to configure. You can obtain it by looking at the result of the ec2-describe-instances command. The address starting with ip- and ending with .ec2.internal is the one you want. The IP address is the part in between. So if you have an internal server name of ip-10-245-203-207.ec2.internal, the internal IP address is 10.245.203.207. This is what we put in the carte-slave.xml file. Now you’re ready to start a number of slaves. In this example you’ll start three:

ec2-run-instances ami-bbfb14d2 -f carte-slave.xml -k pentaho-keypair -n 3

The option -n 3 specifies that you want to start three instances of your AMI. After a few minutes, you can browse to your master again using the getSlaves service this time:

http://ec2...amazon.com:8080/kettle/getSlaves

You’ll notice the slaves appear in the list one by one. Note that not all EC2 nodes start up at equal speed.

Using the EC2 Cluster If you look in the slave server detection list using the /kettle/getSlaves service, you’ll notice that all the slaves registered using their internal IP address. It’s best to use the internal IP address of EC2 because traffic on the internal backbone network is not charged for and is in general a lot faster. This can make a huge difference, not only in performance but also in cost savings. However, this does present a unique problem. If the client starting the clustered transformation can’t reach the slave servers, how can you post, run, and monitor anything on it? To tackle this problem and others like it, Kettle contains what is called a resource exporter. This feature is capable of exporting all the resources that a certain job or transformation uses to a .zip file. As such, to run the clustered transformation on EC2 you need to execute the transformation from within a job. You can then use the resource exporter to pass a .zip file containing the job as well as the clustered transformation over to the master server—the master will distribute the relevant pieces to the slaves since all are reachable from the master. When the clustered transformation is executed on the master, the slaves are visible and the execution will proceed as planned. Figure 17-3 illustrates how you can enable the resource exporter from within the “Execute a job” dialog in Spoon.

Figure 17-3: Passing a job export to a slave

You can enable the same option in the Job job entry, as shown in Figure 17-4.

Figure 17-4: Pass the export of a job to a slave.

Monitoring Even though the slaves report to the master over the internal EC2 backbone network, you can still reach them on the Internet. Simply browse to their public IP address on the specified port (8080 for example) to see how they are doing. The transformations that are executed as part of a dynamic cluster are renamed: the new name contains the term Dynamic Slave as well as the URL of the slave server to which they belong. If you set up the transformation to perform logging, keep this in mind as you will be able to extract valuable information from the logging tables.

If you enter the details of one of the slaves in a slave server definition in Spoon, you should have a representative view of the cluster as a whole. In that case, make sure to try out remote row sniffing. This is shown in Figure 17-5. It’s a good way to see what’s going on in the various steps on the other server. For additional information on row sniffing and real-time monitoring, see Chapter 18.

Figure 17-5: Remote step row sniffing

The Lightweight Principle and Persistence Options In the previous chapter and in this one, we covered a lot of ground with respect to Kettle clusters. However, we didn’t cover the persistence options for a slave server. The reason for that is simple: There aren’t any! A slave server is designed to be a lightweight piece of software that is totally controlled and monitored from the outside world. It doesn’t store transformations or jobs locally. If you want to reference transformations from jobs, or sub-transformations from transformations in a clustered run, you need to make sure that the required objects are available on the master and slaves. When a slave server is terminated, all the execution results are gone. This sort of behavior was specifically designed for use in grid and cloud environments where you typically start the Carte slave server or a complete node when data integration work needs to be done. When the work is completed, the slave server (or the complete server in the case of IaaS) is shut down. This is typical because every hour of usage is paid for and because the typical grid- or cloud-based workload is occasional in nature. Like Carte, an EC2 server doesn’t persist anything. Governed by the same lightweight principle as a Carte instance, an EC2 instance only keeps information on local disks until it is shut down or terminated. The only thing that is persisted is the data that is captured in the AMI. Fortunately, Amazon has made available the Elastic Block Service (EBS). You can create a file system on the EBS that can be shared by the master and slaves alike. Information on the EBS is indeed persisted separately from the EC2 instances and will be kept around until the EBS volume is specifically destroyed. If you want to share data to all slave servers or read large files in parallel, this is currently the best option to use. There is no doubt in the case of both a Carte server and an EC2 instance that not having a persistence option can pose a few problems here and there. However, the upside is that if you configure things correctly, you can always tear down clusters and complete sets of servers without consequences. At the first sign of trouble, you can simply decide to pull the plug and restart.

Summary

In this chapter, we dug a little bit deeper into the realm of clustering to show you how Kettle clusters can be configured automatically and dynamically. In this chapter you learned:

N How cloud computing is bringing vast computing resources to your doorstep at very low prices N How to customize your own Kettle slave server AMI on Amazon EC2 N How to run your clustered transformations on EC2

CHAPTER 18

Real-Time Data Integration

In this chapter, we offer a closer look at how real-time data integration can be performed with Kettle. You’ll start by exploring the main challenges and requirements to figure out when it will be useful for you to deploy this type of data integration. After that, we explain why the transformation engine is a good match for streaming real-time BI solutions, and discuss the main pitfalls and considerations you need to keep in mind. As an example, we include the full code for a (near) real-time Twitter client that continuously updates a database table for further consumption by a dashboard. Finally, we cover third-party software such as database log readers and we provide guidelines on how to implement your own Java Message Service (JMS) solutions in Kettle.

Introduction to Real-Time ETL

In a typical data integration setting, jobs and transformations are run at specific times. For example, it’s quite typical to have nightly, weekly, or monthly batch runs in place. The term batch run comes from the fact that a whole batch of data or group of work is executed in sequence in one go. It is also referred to as batch processing. Usually the batch is scheduled to run at a time when computing resources are readily available. For example, most data warehouses are updated with batch processing during the night when there are few users on the operational systems. Usually it’s sufficient to have nightly jobs in place to satisfy your requirements. In fact, the vast majority of all Kettle jobs and transformations are nightly batch jobs.


However, there are exceptions for those types of jobs that need to get source data in the hands of users quicker. When you make the interval between batches smaller, usually between a minute and an hour, the jobs are referred to as micro-batches or small periodic batches. If you make the interval between batches even smaller, you can speak of near real-time data integration. Finally, when you need to see changes in your source system reflected in your business intelligence solution as fast as possible or near instantaneously you need to use real-time, continuous, or streaming data integration. The total delay for information to travel from source to target is then typically measured in seconds and sometimes milliseconds. Examples of real-time data integration include alerting systems that report on the state of industrial processes. When the pressure in a tank reaches critical values, the operators need to know this as fast as possible and preferably not a few minutes later. In such cases, the read-out values from a pressure valve need to be sent to the person who is monitoring the values using a real-time data integration process. The pressure values can be represented with a simple dial readout or progress chart called a real-time business intelligence solution.

Real-Time Challenges Using real-time data poses a variety of challenges. Most commonly, the challenges of real-time processing involve working with the data itself; typical examples include comparing data, performing lookups, calculating averages and sums, and performing fraud detection and all sorts of pattern matching. Executed in batches, these operations can be time consuming. In real-time scenarios, you face an extra challenge to make them perform optimally. In addition to these actual data challenges, whenever you bridge any data gap between a source and a target system, you have to deal with both the sourcing and the delivery of the data. As such, challenges for real-time, near real-time, or continuous data integration can arise in either the sourcing or the delivery. For the sourcing of the data, it’s obviously very important to get your hands on data as soon as possible after data has changed or an event of significance has occurred. You need to know immediately if new data arrives, or if data has been updated or deleted so you can propagate these changes to the target system. Chapter 6 covers the various options for retrieving these changes. The following list shows a few of the typical options that are available for change data capturing (CDC):

N Read timestamps to determine changed records in a source system.
N Compare snapshots of the data.
N Deploy database triggers to figure out changes in a source database.
N Tap directly into the transactional logging information of a relational database to figure out what changes are occurring.

Obviously, you want to pick a method of extraction that is both non-intrusive and low-latency. Usually this means that you want to deal with the internals of a relational database and use one of the last two options in our list. The first two are more

passive methods, and tend to create a heavier load on the source systems in addition to increasing the latency for detecting changes. As you can imagine, the changes you receive from either a set of database triggers or the transaction log of a relational database are very low level. For example, you might get informed of an update to a certain column in a certain table for a subset of rows. At that moment, you need to convert that change to a change in the target system. This requires rather complex operations at times because you have only the change of a record, not the complete record. In the end, what it comes down to is this: The lower the level you get with a database, the higher the complexity. At times, you might even have to re-create some of the application logic to determine the meaning of a certain change and how to handle it in the system. It is also important to remember that there are no standards whatsoever as far as handling the output of the change log of a relational database is concerned. Most relational databases have their own system for this sort of change log processing available as part of their product. The methods and output are usually highly proprietary and closed in nature. As such, another main challenge is the high cost of log-reading software. Combine the expensive software needed to read a transaction log with the high level of customization you need to go through to translate and process the changes, and you end up with a lengthy and costly process just to read the data from the source systems you are interested in. For the output side of the equation, you need systems that are capable of handling the never-ending flow of change data. For example, OLAP becomes much more complex if you can’t trust the caches you built up to speed up querying. Incremental aggregation can become very complex as well. Real-time charting and trending can also pose problems because you need to feed the changes from the source system directly into the chart.

Requirements As you can infer from the challenges, the implementation of real-time data integration systems can be prohibitively time consuming and expensive, so it’s important to take a look at what the requirements are before the start of the actual implementation. A very important requirement that drives many implementations of real-time business intelligence systems is the need to get information changes in the hands of end users as quickly as possible so that they can take appropriate action. To continue with the tank pressure example, when the end user is an operator in a chemical plant, a computer might report on the pressure in a certain tank. We want to display the current pressure in the tank and not the pressure of the night before. We also want to have the operator perform an action in case the pressure exceeds a certain threshold. If we take this example apart we end up with three main components in the real-time system:

N A change in a source system is being captured with a certain delay, represented as S (the refresh rate of the pressure sensor).

N The changes are presented to the end user after another delay, represented as P (refresh rate of the meter).
N An action (release of pressure in the tank) is taken after a certain delay, represented as A, if required. A is thus the time between reading the value of the meter and taking the action.

From this simple requirements analysis you can see that it would make little sense to implement a real-time data integration system where changes are captured (S) and presented (P) in real-time, but where action is only taken after a long delay (A). For a successful outcome, a real-time system must be equipped to act quickly based on the real-time data. The most extreme case would be a real-time alerting system where there is no operator (human or machine) in place to take appropriate action. The whole expenditure of time and effort to put the real-time data integration system in place would be a waste; the tank would have exploded! Because of the high costs associated with real-time data integration and business intelligence, many organizations are settling on the notion of implementing right-time business intelligence solutions. In those setups, information is delivered when it is needed, not when it becomes available. The right-time strategy forces an organization to focus on the delivery time requirements of information and away from a costly real-time setup.

Transformation Streaming

As explained in Chapter 2, a Kettle transformation involves streaming data from step to step. This is implemented with the help of buffers (hops) between the steps that have a maximum capacity, which in turn forces rows of data through the steps in a streaming fashion. Not only does it stream data through the steps, but it allows the steps to run in parallel to improve performance on machines with multiple CPU cores. In a batch transformation, a fixed number of rows are read, say from a database table. The rows are read one at a time and handed over to the next step. That step hands them over to the next step until all rows are processed. When all steps finish processing, the transformation is considered finished as well. While that is the typical case, there is nothing in the architecture of a transformation that prevents a transformation from running non-stop for weeks, months, or even years on end. For example, consider a source system that continuously generates data but only has a small buffer. In that situation, you either read out the data in time or lose it forever. Because of this, the transformation that reads the data from the source system should ideally keep running indefinitely. A transformation has three phases:

N The initialization, when memory is allocated, files opened, database connections made, and so on
N The execution of all the steps processing rows
N The cleanup, when memory is cleaned up, files closed, connections severed, and so on

This three-phased approach is typical for most, if not all, ETL tools on the market. For long-running real-time transformations, you need to consider issues such as time-outs, memory consumption, and logging:

N Time-outs: Many databases have connection time-outs in place. As such, if you have a transformation running for weeks on end with calm periods over the weekend where no rows of data are being updated, you risk time-out situations. Make sure to plan for this situation up-front by modifying the appropriate settings in both the client database connection in Kettle and the server. See Chapter 2 for more information on how to set database options.
N Memory Consumption: There are only a few situations in which you can consume memory indefinitely during the execution of a transformation. One such situation is data caching, which helps speed up steps such as “Database lookup,” “Dimension lookup / update,” and others by preventing roundtrips to the database table. In situations where you keep receiving new rows indefinitely, make sure to at least set a maximum cache size or confirm that the cache memory usage is going to remain flat. Pay attention also to data structures you allocate in any scripts that you write. Make sure you don’t accidentally use a construct that will continue to consume memory. This is usually a Java map in a User Defined Java Class step or an associative array in JavaScript. Another situation in which extra memory is consumed is when it is done explicitly. The “Sort rows” step, for example, needs to consider all input rows before the data can be sorted. In a real-time data integration situation you will always continue to receive more rows. Because of this you cannot use the sort step, as you would run out of memory eventually. If you try to use a continuous stream of rows as lookup for a “Stream lookup” step, you would also run out of memory eventually, because that step too will continue to read rows until it has a complete set of data to look up with. Because of the all-consuming nature of these steps, they can only be used in real-time data integration if they do not feed on a never-ending stream of data.
N Logging: In previous versions of Kettle, it was standard practice to write a log record into a database table when a transformation (or job) started. Kettle then performed an update of the row when the task was finished. When a transformation is never-ending, this is not going to be very useful because you will never be able to see the actual logging data appear in the log table. To address this, the interval logging feature, shown in Figure 18-1, was added to Kettle in version 4.0. Interval logging will periodically update the aforementioned log record, allowing you to keep track of what a long-running transformation is doing by looking at the log table, by using the history view in Spoon, or by using Pentaho Enterprise Console. Another change in Kettle 4.0 is a limitation on the number of log lines that are being written. This helps to reduce the memory consumption and improves the readability of the log field in the log table. However, that alone is not enough. Beginning with version 4, Kettle uses a central log buffer for all transformations and jobs that run on the same Java Virtual Machine. This could be a Carte, Spoon,

Pan, or Kitchen instance, but it might be the Pentaho BI server as well. In all those cases, you can set the following environment variables (for example, in the kettle.properties file in your KETTLE_HOME directory):

N KETTLE_MAX_LOG_SIZE_IN_LINES: The maximum number of lines that the logging system back-end will keep in memory at any given time.
N KETTLE_MAX_LOG_TIMEOUT_IN_MINUTES: The maximum age of any log line (in minutes) before it is discarded.

Setting both options will ensure that you won’t run out of memory because of excessive logging, even if this is done by another transformation or job that runs on the same server. Because of the memory consumption of the logging back-end, it is essential that you set these parameters for long-running transformations (or jobs).

Figure 18-1: Interval logging option

A Practical Example of Transformation Streaming In this trendy example, we want to list all the recent messages from Twitter on the front page of a business intelligence dashboard. Doing so will alert the users to chatter that concerns the company in question. For simplicity, we want the dashboard to retrieve this data from a local database table, so the data in that local database table always needs to be up-to-date. The first thing we need is a new step that allows us to search messages on Twitter. Fortunately, a number of Twitter Java libraries can be found on the Internet. One of the libraries called jtwitter (http://www.winterwell.com/software/jtwitter.php) is very small (160kb) and carries an LGPL license so it can be shipped with Kettle to accommodate this example. This example can be found in the samples/transformations folder of your Pentaho Data Integration distribution and is called User Defined Java Class - Real-time search on Twitter.ktr. Figure 18-2 shows what the sample transformation looks like after we added a step to update the messages in a database table.

Figure 18-2: Continuously storing tweets in a database

The following example shows how the code is conceived in the User Defined Java Class step called “Get Twitter Search results.”

import winterwell.jtwitter.*;

private Twitter twitter;
private long lastId;

public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws KettleException
{
  java.util.List timeLine = twitter.search(getParameter("SEARCH"));
  logBasic("Received " + timeLine.size() + " tweets");

  // First see if we need to look at any of the status reports
  // Determine the maximum ID and compare it to last time's status ID
  //
  long maxId = -1L;
  for (int i=0; i<timeLine.size(); i++) {
    Twitter.Status status = (Twitter.Status) timeLine.get(i);
    if (maxId < status.getId()) {
      maxId = status.getId();
    }
  }

  // Do we have anything to do with this batch of statuses?
  //
  if (maxId > lastId) {
    // Process all the status reports...
    //
    for (int i=0; i<timeLine.size(); i++) {
      Twitter.Status status = (Twitter.Status) timeLine.get(i);

      // If the id is recent, process it
      //
      if (status.getId() > lastId) {
        // New things to report...
        //
        Object[] rowData = RowDataUtil.allocateRowData(data.outputRowMeta.size());
        int index = 0;

        rowData[index++] = status.createdAt;
        rowData[index++] = Long.valueOf(status.getId());
        rowData[index++] = status.inReplyToStatusId;
        rowData[index++] = status.getText();
        rowData[index++] = status.getUser().toString();
        rowData[index++] = Boolean.valueOf(status.isFavorite());

        putRow(data.outputRowMeta, rowData);
      }
    }
  } else {
    // Wait for 30 seconds before retrying
    //
    int delay = Integer.parseInt(getParameter("DELAY"));
    for (int s=0; s<1000 && !isStopped(); s++) {
      try {
        Thread.sleep(delay);
      } catch(Exception e) {
        // Ignore
      }
    }
  }

  lastId = maxId;

  // Never end, keep running indefinitely, always return true
  //
  return true;
}

public boolean init(StepMetaInterface stepMetaInterface, StepDataInterface stepDataInterface) {
  if (super.init(stepMetaInterface, stepDataInterface)) {
    twitter = new Twitter(getParameter("USER"), getParameter("PASSWD"));
    lastId = -1;
    return true;
  }
  return false;
}

As you can read in the code, this step will only finish when it is forced to do so by a user. This is because the processRow() method always returns true, which simply means “continue to run this step and continue to call this method.” Note that the Twitter client we use here is simplistic and doesn’t do a lot of optimization, for example, by keeping HTTP connections open for performance reasons. However, for this sample and for most use cases it will do just fine. A delay after each search can be specified in the step by using a parameter, as shown in Figure 18-3.

Figure 18-3: Delay parameter

Earlier in this chapter, we defined this delay to be the sourcing delay. If we use a very low number, we will continuously query the Twitter Web service for updates to our query. If a new message is found, the database table will be updated shortly after that message is received. If, in our example, the dashboard is automatically refreshed every 60 seconds the presentation delay is 60 seconds. If you add up the sourcing delay and the presentation delay, you have the total delay after which a user can take action.

NOTE We can’t recommend continuously querying the Web service because it would put an unnecessarily high load on the free Twitter services. Make sure to use a delay value of at least sixty seconds while testing the Twitter example shown.

Debugging While a real-time transformation is running, you need new ways of examining the data. A preview operation would work, but you would only be able to see the first rows. If you want to see, after a few days, what rows are passing through a transformation, you need different tools. To accommodate this, Kettle version 4 sports two ways of viewing rows as they pass through steps, called row sniffing. The first method is available in Spoon when you right-click a running step. From the context menu you can select the “Sniff test during execution” option. You can then further select to view the input or output of the step. Once these choices have been made, you are presented with a Preview Rows dialog where rows will be added while they pass through the step. This way of viewing rows is also available when you execute a transformation on a remote slave (Carte) server. From the Slave Browser tab in Spoon, you can access the Sniff Test button, as shown in Figure 18-4, to see the rows that pass through a selected step:

Figure 18-4: Sniff rows on a slave server

Third-Party Software and Real-Time Integration As mentioned earlier, if you want to get your hands on certain real-time information streams, you need specialized third-party software. For example, if you want to be informed of changes in an Oracle database, you could purchase software such as LogMiner or Oracle Streams. In fact, a lot of relational databases have all sorts of transaction log–reading software tailored for various real-time purposes. While a full list is beyond the scope of this chapter, we can group these tools into two main categories:

N Software that allows you to pick up the changes: This software exposes the changes over standard or non-standard interfaces like SQL. One example is SQLStream (http://www.sqlstream.com), co-founded by Julian Hyde, lead architect of Pentaho Analysis (Mondrian). In this case, you can connect to SQLStream via JDBC and execute a SQL statement in a Table Input step against one of the real-time data streams. This SQL query will never end and will continue to run until either the transformation or the SQLStream server is stopped. Conversely, you can use a Table Output step to write data to the SQLStream server and the data will be picked up by the server as a continuous stream of data.
N Software that will pass the data over to the third-party tool (Kettle in our case): In this case, the log reading tool will repeatedly call the transformation every time there is new data. Based on the actual payload you will need to take appropriate action in the transformation. The following section, “Java Message Service,” presents an example of this kind of software.

Because these tools are usually tailored for a specific database, you will need a separate transaction log reading tool for every database vendor you encounter in your enterprise.

Java Message Service One popular example of real-time technology that is often used in data integration and application integration is Java Message Service, or JMS. JMS is a messaging standard that allows Java application components to pass messages around. These messages are passed in a distributed and asynchronous fashion. JMS is usually deployed on a Java 2 Enterprise Edition (J2EE) server to facilitate its continuous operation as a service. JMS is a complete API that allows a wide range of possibilities for sending and receiving messages and supports various operational models. The publish and subscribe model is probably the most interesting because it allows multiple consumers to be registered for the same topic. The model also supports durable subscriptions that are capable of retaining unread messages even when a subscriber is momentarily not connected. The publish and subscribe model includes two main scenarios: producing messages and consuming them. Producing messages means that Kettle will be sending messages to another service, while consuming messages means that Kettle will be receiving messages from another service. In the following JMS Java code examples we first explain how you can define a JMS connection and create sessions, and then show how to consume and produce messages. For those less familiar with Java, note that the Pentaho Data Integration Enterprise Edition (PDI EE) version 4.0 has both a consumer (JMS Input) and a producer (JMS Output) step available and does not require Java expertise.

Creating a JMS Connection and Session

The first thing you need to do is to create the connection, a session object, and a message queue. This is usually dependent on the JMS implementation you are using. For example, the Apache ActiveMQ library allows you to do this:

ConnectionFactory connFactory = new ActiveMQConnectionFactory(url);
Connection connection = connFactory.createConnection(username, password);
Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
Destination destination = session.createQueue(queueName);

For other libraries, it’s usually only the ConnectionFactory that changes. For example, with OpenMQ you will use the following very similar code; the differences are indicated in bold.

ConnectionFactory connFactory = new com.sun.messaging.QueueConnectionFactory(url);
Connection connection = connFactory.createConnection(username, password);
Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
Destination destination = session.createQueue(queueName);

With appropriate exception handling, you can place this code straight into the init() method of the User Defined Java Class example shown earlier.
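
To illustrate what that might look like, here is a sketch of such an init() method in a User Defined Java Class step, following the pattern of the Twitter example. This is not the implementation behind the PDI EE JMS steps; the JMS_URL, JMS_USER, JMS_PASSWORD, and JMS_QUEUE parameter names are hypothetical and simply stand in for whatever configuration you pass to the step:

private ConnectionFactory connFactory;
private Connection connection;
private Session session;
private Destination destination;

public boolean init(StepMetaInterface stepMetaInterface, StepDataInterface stepDataInterface) {
  if (super.init(stepMetaInterface, stepDataInterface)) {
    try {
      // Build the factory, connection, session, and queue as shown above (ActiveMQ variant).
      connFactory = new ActiveMQConnectionFactory(getParameter("JMS_URL"));
      connection = connFactory.createConnection(getParameter("JMS_USER"), getParameter("JMS_PASSWORD"));
      // start() is required before any messages can be received on the connection.
      connection.start();
      session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
      destination = session.createQueue(getParameter("JMS_QUEUE"));
      return true;
    } catch (JMSException e) {
      logError("Unable to set up the JMS connection", e);
      return false;
    }
  }
  return false;
}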

Consuming Messages

To consume a message, you need to create a message consumer object like this:

MessageConsumer consumer = session.createConsumer(destination);

Again you can add this code to the init() method of your plugin or User Defined Java Class step. Once that is done, you want to start reading messages from the queueName queue. This is done using the following line of code:

Message message = consumer.receive();

You can put this call at the start of the processRow() method in the earlier sample. This gives you a message object to work with. If the method returns null, you know that the queue is closed and you have no choice but to abort; return false in that case. In all other cases, as in the Twitter example, return true. Once you have the message, you can treat it as simple text or as XML (two popular formats). You can pass the message along to the next steps or parse it on the spot in your code. That choice is entirely up to you. Suffice it to say that at this point the hard part is over.
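
As an illustration, the following sketch shows how such a consumer could look in the processRow() method of a User Defined Java Class step. It assumes a single String output field defined in the step and a consumer object created in init() as described above; treat it as a starting point, not as the code behind the PDI EE JMS Input step:

public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws KettleException
{
  Message message;
  try {
    // Blocks until a message arrives or the queue is closed.
    message = consumer.receive();
  } catch (JMSException e) {
    throw new KettleException("Unable to receive a message from the queue", e);
  }

  if (message == null) {
    // The queue was closed: signal that this step is done.
    setOutputDone();
    return false;
  }

  if (message instanceof TextMessage) {
    try {
      Object[] rowData = RowDataUtil.allocateRowData(data.outputRowMeta.size());
      int index = 0;
      rowData[index++] = ((TextMessage) message).getText();
      putRow(data.outputRowMeta, rowData);
    } catch (JMSException e) {
      throw new KettleException("Unable to read the text of the message", e);
    }
  }

  // Keep running until the transformation is stopped.
  return true;
}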

Producing Messages

To produce a message, you need to modify the consumer code only slightly:

MessageProducer producer = session.createProducer(destination);

Then you can create a text or XML message by creating a string messageContent that contains the correct text:

TextMessage message = session.createTextMessage();
message.setText(messageContent);

Note that you can also set a series of key-value pairs on the message this way:

message.setStringProperty(keyString, valueString);

To send the message to the consumers, you can use the following line of code:

producer.send(message);
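
Putting the pieces together, a minimal producer inside processRow() could look like the following sketch. The input field name message_body is a hypothetical example, and the session and producer objects are assumed to have been created in init():

public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws KettleException
{
  Object[] r = getRow();
  if (r == null) {
    // No more input rows: we are done producing messages.
    setOutputDone();
    return false;
  }

  try {
    // Take the message text from an input field and send it to the queue.
    String messageContent = get(Fields.In, "message_body").getString(r);
    TextMessage message = session.createTextMessage();
    message.setText(messageContent);
    producer.send(message);
  } catch (JMSException e) {
    throw new KettleException("Unable to send the JMS message", e);
  }

  // Pass the row on unchanged so additional steps can, for example, log what was sent.
  putRow(data.outputRowMeta, r);
  return true;
}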

Closing Shop

If the transformation or the message queue is stopped for some reason, you need to properly close and stop all the message consumers and producers that you used in your plugin as well as the session and connection. This is done in the dispose() block of your plugin:

public void dispose(StepMetaInterface smi, StepDataInterface sdi) {

  try {
    messageProducer.close();
    session.close();
    connection.stop();
    connection.close();
  } catch(JMSException e) {
    logError("Unable to close JMS objects", e);
  }

  super.dispose(smi, sdi);
}

While some Java knowledge is involved in the creation of JMS steps in PDI, it is far from an insurmountable task to create consumer and producer steps for various purposes. Anyone familiar with JMS code should be able to integrate their code into Kettle in either a plugin or with the User Defined Java Class step.

Summary

This chapter offered a look at real-time data integration and explained the challenges and requirements for real-time ETL. The chapter discussed:

N How a transformation is ideally suited for handling streams of data, and the attention points with respect to database time-outs, memory consumption, and logging N A practical example in the form of a step that continuously reads Twitter messages N How to debug a long-running transformation N Integration possibilities with third-party real-time integration software N How to send and receive messages using Java Message Services

Part V

Advanced Topics

In This Part

Chapter 19: Data Vault Management Chapter 20: Handling Complex Data Formats Chapter 21: Web Services Chapter 22: Kettle Integration Chapter 23: Extending Kettle

CHAPTER 19

Data Vault Management

This chapter was written by Kasper de Graaf of DIKW-Academy, a well-known expert in the field of Data Vault modeling.

Data warehousing started somewhere in the 1990s when Bill Inmon and Ralph Kimball started publishing their data warehousing ideas. Both approaches can be used to create an environment that supports analysis and reporting. However, Inmon and Kimball have some differences of opinion, sometimes referred to as “The Big Debate.” Before we dive into the differences, let us start with the similarities. Inmon and Kimball do not disagree about the usage of data marts. A data mart is a database that is aimed at end user usage. It is usually modeled using star schemas (Chapter 4 contains an example of a star schema—The Rental Star Schema) and optimized for analysis and reporting. The biggest difference between the two architectures is about the need for an enterprise data warehouse (EDW). Inmon says you need one, Kimball says you don’t. An EDW is basically a large database that contains integrated, historical data from several other databases. The EDW is not used for querying by end users. It is used solely for complete, transparent, and auditable storage of all data that is considered relevant for reporting. In the vision of Bill Inmon, an EDW sits between the source databases and the data marts and thus acts as the single source for the data marts.

NOTE For a more elaborate explanation about data warehousing please see Chapter 6 of the book Pentaho Solutions, Wiley, 2009.

The chapter begins with an introduction to Data Vault modeling, followed by an example implementation using the sakila database.


Introduction to Data Vault Modeling

Data Vault (DV) modeling is a methodology specifically designed for the enterprise data warehouse. It was developed by Dan Linstedt (http://www.danlinstedt.com) in the late 90s. During the last couple of years, Data Vault modeling has gained a lot of attention and an increasing number of followers in the BI community. Dan Linstedt defines a Data Vault as follows: The Data Vault is a detail oriented, historical tracking and uniquely linked set of normalized tables that support one or more functional areas of business. It is a hybrid approach encompassing the best of breed between 3rd normal form (3NF) and star schema. The design is flexible, scalable, consistent and adaptable to the needs of the enterprise. It is a data model that is architected specifically to meet the needs of today’s enterprise data warehouses. http://www.danlinstedt.com/about

From this definition, you can conclude that the Data Vault is both a data modeling methodology and an approach or architecture to enterprise data warehousing. Data Vault modeling is based on three basic building blocks: hubs, links, and satellites. The modeling methodology defines the components a Data Vault is made of and also defines how these components interact with one another. The Data Vault approach consists of good practices one should follow when building an enterprise data warehouse. It states, for example, that business rules should be implemented downstream. This means that the Data Vault stores the data that come from the source systems as is, without any interpretation, filtering, cleansing, or conforming. Even when the data from the different sources are contradictory (as with different addresses for the same customer), the Data Vault will not apply a business rule such as “always use the address info from source system A.” It will simply store both versions of the truth; the interpretation of the data is postponed to a later stage in the architecture (the data marts). This chapter discusses the Data Vault modeling methodology. The Data Vault archi- tecture is beyond the scope of this book.

Do You Need a Data Vault?

Although we cannot answer the question of whether you need a Data Vault solution, we can try to help you choose. We think the decision comes down to the following question: Do you need (or can you benefit from) an enterprise data warehouse? In the next paragraphs, we try to explain some of the advantages of having an enterprise data warehouse and why adding an additional component and a large amount of code to your environment might be a good idea. An enterprise data warehouse enables a clear separation of responsibilities: The data warehouse is responsible for storing all the data, including history, as source independent as possible and without making any changes (application of business rules or quality improvement) to the data. This gives you a solid foundation to build your reporting environment. The sole responsibility of the reporting environment is to present the data to the end user in a form suitable to assist in decision-making. This means that the data is subjected to today’s set of business rules, cleansing routines, and interpretations. Thus, the data marts or star schemas in this layer interpret the facts that are stored in the warehouse and create today’s truth. Because the enterprise data warehouse contains all data without interpretation and with a complete history trace, it renders the reporting environment disposable and therefore allows the truth to be redefined (because of new insights or a changed set of business rules for instance). Last but not least, we want to mention the traceability of data. Traceability means that every single piece of data can be traced back to the source(s). This implies that every report can be explained to the business, and every interpretation can be explained, argued, and undone if needed.

Data Vault Building Blocks

The following sections describe the roles of the major constructs of Data Vault modeling (hubs, links, and satellites) and discuss the interactions among them. They also outline some of the general characteristics and processes of Data Vault modeling.

Hubs

A hub is a table that contains business keys for an identifiable entity in your organization. Examples of potential hubs are customer, employee, order, product, building, resource, and vacation. One highly important aspect of Data Vault modeling is the aforementioned business key. A business key uniquely identifies a single instance of an entity to the organization. This means that given a business key for a certain entity, everybody in your organization knows which one is meant. Think of it as the key to your house; it will open the door to your house, only your house, and nothing but your house. An important aspect of a business key is that it has business meaning and is used in everyday operation, for example, a customer number, an order number, or a product code. In other words, business keys are chosen from a business perspective (meaning), not from a technical perspective (like technical IDs and primary keys). The hub tables (each separate entity gets its own hub table) thus house the unique list of business keys for each entity in an organization. Except for some metadata, the business key is all that a hub contains. One of the essential aspects of Data Vault modeling that is particularly appealing is that hub tables are source independent. When the same business key is used in more than one system, it is recorded only once. All other Data Vault building blocks are connected to this one business key. This automatically implies that the data is integrated across the enterprise. Each hub contains the fields listed in Table 19-1 (all required, no other attributes allowed).

Table 19-1: Hub Attributes

ATTRIBUTE       DESCRIPTION
Primary key     Surrogate key, system generated, for internal use
Business key    Uniquely identifiable business element, used in the source systems, known to the business
Load DTS        The timestamp the data is first loaded in the EDW, system generated
Record source   Defines the origin of the data (for example, source system or table)

Table 19-2 shows an example hub table, hub_customer.

Table 19-2: Example hub_customer

ID  BUSINESS KEY   LOAD_DTS             RECORD_SOURCE
1   cst_24125670   12/17/2009 03:05:04  sales.cust
2   cst_24125894   12/17/2009 03:05:04  sales.cust
3   cst_67904567   12/17/2009 03:00:00  mkt.customer
4   cst_67904568   12/17/2009 03:00:00  mkt.customer

Note that the sample data in this table apparently resided in two different source systems: sales and mkt. The attribute Record_Source indicates that the customers cst_24125670 and cst_24125894 were first encountered in the source system sales, and the customers cst_67904567 and cst_67904568 were first encountered in the source system mkt. It is possible that these customers reside in the other system as well; because only unique business keys are loaded into the hub, we cannot tell from this example.
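To make the structure concrete, the DDL for a hub like the one in Table 19-2 could look roughly as follows. This is only a sketch in MySQL syntax; the data types, lengths, and index names are assumptions for illustration and are not taken from the book's scripts.

-- Sketch of a hub table; the column roles follow Table 19-1, everything else is illustrative.
CREATE TABLE hub_customer (
  hub_customer_id INTEGER      NOT NULL AUTO_INCREMENT,            -- surrogate key, system generated
  customer_id     VARCHAR(30)  NOT NULL,                           -- business key, e.g. cst_24125670
  load_dts        TIMESTAMP    NOT NULL DEFAULT CURRENT_TIMESTAMP, -- first load into the EDW
  record_source   VARCHAR(100) NOT NULL,                           -- e.g. sales.cust
  PRIMARY KEY (hub_customer_id),
  UNIQUE KEY ak_hub_customer (customer_id)                         -- a business key appears only once
);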

Links

A second Data Vault construct is the link structure. A link is the intersection of business keys (hubs). This means that a link indicates that two (or more!) hubs have a relationship. A link is based on an identifiable business relationship; this is usually a foreign key, a business event, or a transaction between business keys. Each link contains the fields listed in Table 19-3 (all required, no other attributes allowed). Table 19-4 shows an example link table.

Table 19-3: Link Attributes

ATTRIBUTE                 DESCRIPTION
Primary key               Surrogate key, system generated, for internal use
Hub surrogate keys {1..n} The surrogate keys (foreign keys) of the hubs whose relationship is defined by this link table
Load DTS                  The timestamp the data is first loaded in the EDW, system generated
Record source             Defines the origin of the data (for example, source system or table)

Table 19-4: Example Link Table lnk_customer_store

ID  HUB_CUSTOMER_ID  HUB_STORE_ID  LOAD_DTS             RECORD_SOURCE
1   1                1             12/17/2009 03:05:04  sales.cust
2   2                1             12/17/2009 03:05:04  sales.cust
3   3                2             12/17/2009 03:00:00  sales.cust
4   4                3             12/17/2009 03:00:00  sales.cust

Unlike a data model in third normal form, a Data Vault model completely ignores the cardinality of relations (1:N, N:M, and so on). Every relationship in a Data Vault model is stored as if the cardinality were N:M (many-to-many). This gives the model the most flexibility; no matter what the data looks like in the different sources, it can always be stored in the Data Vault. The composite of the hub surrogate keys forms an alternate key, which is sometimes referred to as the business key of the link table, though technically this is not 100 percent correct.
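A link table for Table 19-4 could be sketched in DDL like this (again an illustration in MySQL syntax; data types and constraint names are assumptions):

-- Sketch of a link table; the unique key on the hub surrogate keys is the
-- "business key" of the link mentioned above.
CREATE TABLE lnk_customer_store (
  link_customer_store_id INTEGER      NOT NULL AUTO_INCREMENT,            -- surrogate key
  hub_customer_id        INTEGER      NOT NULL,                           -- FK to hub_customer
  hub_store_id           INTEGER      NOT NULL,                           -- FK to hub_store
  load_dts               TIMESTAMP    NOT NULL DEFAULT CURRENT_TIMESTAMP,
  record_source          VARCHAR(100) NOT NULL,
  PRIMARY KEY (link_customer_store_id),
  UNIQUE KEY ak_lnk_customer_store (hub_customer_id, hub_store_id),
  FOREIGN KEY (hub_customer_id) REFERENCES hub_customer (hub_customer_id),
  FOREIGN KEY (hub_store_id)    REFERENCES hub_store (hub_store_id)
);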

Satellites

So far we have covered hubs and links. The hubs correspond to the different entities that exist in an organization: customers, stores, products, employees, orders, and so on. The links add the dynamics, the transactions, and the relationships within the data. With hubs and links, we created a skeleton model; all we need to do now is add the "flesh" to it in the form of satellites. Data Vault uses satellite tables to store the attributes of hubs and links, including all historical changes. A satellite table always has one (and only one) foreign key that refers to the single hub or link it belongs to. Each satellite contains the fields listed in Table 19-5 (all required, no other attributes allowed). Table 19-6 shows an example satellite table.

Table 19-5: Satellite Attributes

ATTRIBUTE          DESCRIPTION
Primary key        Surrogate key, system generated, for internal use
Foreign key        The foreign key of the hub or link the satellite row describes
Load DTS           The timestamp the data is loaded in the EDW, system generated
Load End DTS       The timestamp the data is no longer valid because it was changed
Record source      Defines the origin of the data (for example, source system, table)
Attributes {1..N}  The attributes themselves

NOTE The Data Vault standard states that the primary key of a satellite should consist of the foreign key to the hub or the link table together with the field Load DTS. Although that is correct, we added a surrogate primary key. This deviation from the standard is based on technical reasons. A SQL UPDATE statement is easier (and probably faster) when a single surrogate primary key is present. However, the creation of a unique index on the foreign key and the Load_DTS is advisable.
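A satellite built along these lines could be sketched as follows (MySQL syntax; the attribute columns, data types, and index names are assumptions used only to illustrate the note above):

-- Sketch of a satellite table with the surrogate primary key described in the note,
-- plus the advised unique index on (foreign key, load_dts).
CREATE TABLE sat_customer (
  sat_customer_id INTEGER      NOT NULL AUTO_INCREMENT,            -- surrogate key (deviation from the pure standard)
  hub_customer_id INTEGER      NOT NULL,                           -- foreign key to the hub being described
  load_dts        TIMESTAMP    NOT NULL DEFAULT CURRENT_TIMESTAMP, -- start of validity
  load_end_dts    TIMESTAMP    NULL,                               -- NULL while this is the current version
  record_source   VARCHAR(100) NOT NULL,
  city            VARCHAR(50),                                     -- descriptive attributes {1..N}
  birthdate       DATE,
  PRIMARY KEY (sat_customer_id),
  UNIQUE KEY ak_sat_customer (hub_customer_id, load_dts),
  FOREIGN KEY (hub_customer_id) REFERENCES hub_customer (hub_customer_id)
);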

The data in Table 19-6 contains two attributes from a customer: City and Birthdate. If one of these changes in the source system (sales.cust), a new row is inserted in the satellite table and the old row is end-dated. The rows with ID 4 and 56 both describe the same customer with cust_id 41, for different periods in time.

Table 19-6: Example Satellite Table sat_customer

ID  HUB_CUST_ID  LOAD_DTS             LOAD_END_DTS         RECORD_SOURCE  CITY       BIRTHDATE
1   14           12/17/2009 03:05:04  NULL                 sales.cust     Paris      05/09/2006
2   26           12/17/2009 03:05:04  NULL                 sales.cust     New York   06/24/1998
3   37           12/18/2009 03:00:00  NULL                 sales.cust     London     07/20/1963
4   41           12/18/2009 03:00:00  02/30/2010 03:00:00  sales.cust     Amsterdam  04/03/1941
…   …            …                    …                    …              …          …
56  41           02/30/2010 03:00:00  NULL                 sales.cust     Utrecht    04/03/1941

Any hub or link can have more than one satellite; in fact, this is recommended. Imagine two source systems contain customer data: sales and marketing. The same business key is present in both systems (if you're lucky). It is good practice to create at least two satellites, one for each source system. Other reasons (besides source system) for splitting up satellites are rate of change (slowly and rapidly changing attributes) or type of data. Note that a satellite table, like a link table, does not contain a real business key, but the alternate key, consisting of the composite of the Foreign Key and Load DTS, is often referred to as such.

NOTE A pure DV model will use NULL for the Load End DTS, as displayed in Table 19-6, but this might not be an optimal solution for querying the model. Sometimes a value such as 12/31/2999 or another future date is used instead. Purists might argue that this is in fact a lie; if the date is unknown, it shouldn't have a value at all. When building DV solutions that need to be portable and adhere to the standard, stick to NULL. Otherwise, be pragmatic but consistent. In all cases, you should have the end users sign off on any default values you use.

Data Vault Characteristics

The following list is not meant to be a complete and/or formal specification of the Data Vault standard. It is used to show some of the unique qualities of a Data Vault model and to increase the general knowledge and understanding of both the architecture and the modeling technique. A well-designed Data Vault model has the following characteristics:

- Temporal storage of all relevant data, even data that is of "low" quality; you do not want disruptions during the ETL process.
- As few dependencies as possible.
- As source-independent as possible.
- Designed for change.
- Agile to changes in the source tables.
- Extensible without changing the current model (incremental approach).
- ETL jobs are completely (and always) restartable.
- Full traceability of data.

Building a Data Vault

The general steps for building a Data Vault are as follows:
1. Model the hubs, strongly focused on business keys.

2. Model the links; look for transactions and relationships (hint: foreign keys). Hubs and links together should give you a good understanding of the way the organization operates today.
3. Model the satellites; provide the context. This completes the Data Vault.

NOTE There can be a fourth step, "Model the Point-In-Time tables." Point-In-Time (PIT) tables are redundant helper tables, based on satellites, that can be created to make querying the Data Vault easier. We will skip this step because PIT tables are beyond the scope of this book.

Transforming Sakila to the Data Vault Model

In Chapter 4, the sakila database was directly transformed to a dimensional model. In this chapter, the model is first transformed to a data vault and then to the same star schema as in Chapter 4.

Sakila Hubs

For this example, the following entities will be converted to hubs: actor, category, customer, film, staff, and store. These are the easy ones; some others are less obvious. The inventory, payment, and rental tables might also be links (because of their transactional nature). However, when we convert these tables to links, we will end up with link tables that link to other link tables, so-called link-to-link tables. The architecture does not recommend using such tables, so we'll make them hubs also. The final interesting set of tables to consider when looking for hubs is the geographical collection: address, city, and country, and the language table. The developers of Sakila decided to model these tables as separate entities. For your purposes, these will be reference data. For your business (DVD rentals) these tables currently do not qualify as hub entities (which means the data will end up in satellite tables as we move on). The final list of hubs is presented in Table 19-7.

NOTE The selection of hubs, links, and satellites is not as definitive as you might think. If you decide after a while that, for instance, address, city, and country should be hubs, just convert them. You may find a Data Vault model to be surprisingly forgiving.

Table 19-7: Sakila Hubs

ENTITY         BUSINESS KEY
hub_actor      actor_id
hub_category   category_name
hub_customer   customer_id
hub_film       film_id
hub_inventory  inventory_id
hub_payment    payment_id
hub_rental     rental_id
hub_staff      staff_id
hub_store      store_id

Sakila Links

After the selection of hub tables, it is time to focus on the links. This means you have to look for transactions and relationships. Foreign keys in the source model(s) can be very helpful with this step. The transactional tables are pretty easy to locate: payment and rental. These are the transactions that keep Sakila in business, after all. So even though payment and rental were selected as hubs in the previous section, they must be links as well, given their transactional nature. The other link tables are relationships. A customer usually visits only one store (probably the one closest to his or her home). This is modeled in sakila with a foreign key in the customer table, store_id. This is a typical link that shows the difference between a Data Vault model and a normalized data model. The source data model (sakila) defines the cardinality of this relationship as one-to-many (a customer can have one and only one store). Data Vault models define every relationship as many-to-many, therefore a separate table is required: link_customer_store. The complete list of link tables can be found in Table 19-8.

Table 19-8: Sakila Links

LINK                      HUBS THAT ARE LINKED
link_film_actor           hub_film, hub_actor
link_film_category        hub_film, hub_category
link_customer_store       hub_customer, hub_store
link_inventory            hub_inventory, hub_film, hub_store
link_payment              hub_payment, hub_customer, hub_staff
link_payment_rental       hub_payment, hub_rental
link_rental               hub_customer, hub_inventory, hub_staff
link_staff_worksin_store  hub_staff, hub_store
link_store_manager        hub_staff, hub_store

NOTE In the sakila database, the store table contains a foreign key to the staff table (manager_staff_id), which defines the store manager. The staff table also contains a foreign key to the store table (store_id), which probably defines in which store a member of the staff works. These foreign keys are each converted to a link table between hub_staff and hub_store; they're both converted to a many-to-many relationship and therefore look very much alike.

Sakila Satellites

The hubs and links together form the data skeleton, which is shown in Figure 19-1. The satellite tables complete the Data Vault model and provide the attributes of both hubs and link tables. Every attribute in the source tables that is used in the Data Vault model should eventually be housed in the Data Vault structure. The definition of the satellite tables is fairly straightforward, so take a look at Table 19-9 for the list of satellite tables.

Table 19-9: Sakila Satellites

SATELLITE     DESCRIBES
sat_actor     hub_actor
sat_film      hub_film
sat_customer  hub_customer
sat_payment   hub_payment
sat_rental    hub_rental
sat_staff     hub_staff
sat_store     hub_store

NOTE When selecting the hub tables, we chose to create separate hubs for rental, payment, and inventory. This is why all our satellites link to hub tables. One might question whether these hubs are really necessary. A rental, for example, is a transaction, which can be modeled using just a link table (link_rental). There is no real business key. The most important reason to add the hub_rental table as well is to prevent link-to-link relationships. A link-to-link is allowed in Data Vault modeling but it is not recommended, mainly because of issues when querying the Data Vault.

Figure 19-2 shows the complete Data Vault for Sakila.

NOTE For your convenience, the diagram is available as a Power*Architect file (Sakila-data-vault.architect) on the book's companion website at www.wiley.com/go/kettlesolutions.

Figure 19-1: The Sakila data skeleton

Figure 19-2: The complete Sakila data vault

Loading the Data Vault: A Sample ETL Solution

We are about to dive into the details of our sample ETL solution, and you are encouraged to follow along and examine its nuts and bolts directly—“live” so to speak on your own computer, running your own copy of Kettle. So before we actually look at the individual transformations and jobs of the sample ETL solution, it is a good idea to first obtain the files that make up the sample solution, and verify that you can open them.

Installing the Sakila Data Vault

You can download the SQL script files for the Sakila Data Vault from the book's companion website. As with the Sakila sample database, the script file is archived and available as a .zip and a .tar.gz archive. The script is named sakila_data_vault_schema.sql. So, the installation procedure for the Data Vault schema is the same as for the original Sakila sample database: Simply unpack the archive and use the SOURCE command in the MySQL command-line utility to execute the script. Unlike the Sakila sample database, we do not provide the data in a script; we want you to load the data into the Data Vault using the ETL packages.
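For example, after unpacking the archive you could create the schema from the mysql command-line client like this (a minimal illustration; it assumes the client was started in the directory that contains the unpacked script):

-- Executed inside the mysql command-line client; the file name is taken from the text above.
SOURCE sakila_data_vault_schema.sql;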

Setting Up the ETL Solution

You can obtain all these files from this book's website in the folder for Chapter 19. All files are available in .zip and .tar.gz archives called ch19_ktr_and_kjb_files. Simply download the archive, and extract it to some location on your hard disk.

Creating a Database Account

The transformations for the ETL solution use two specific database accounts to access the sakila and Data Vault schemas: a sakila account, which is used to read from the sakila sample database (which you probably already created in Chapter 4), and a sakila_dv account (sakila_data_vault exceeds the maximum length for a user name in MySQL), which is used to read from and write to the Data Vault schema. To create the account, use the mysql command-line client to log into MySQL as a user with SUPER privileges, such as the built-in root account. Then, enter the following commands:

CREATE USER sakila_dv IDENTIFIED BY 'sakila_dv';
GRANT ALL PRIVILEGES ON sakila_data_vault.* TO sakila_dv;

For your convenience, these commands are also available as a SQL script file, named create_sakila_dv_account.sql, which is included in the ch19_ktr_and_kjb_files archive along with the Kettle transformation and job files.

The Sample ETL Data Vault Solution

Now that we have described the database schemas, we can examine how a Data Vault model can be loaded from a source system. Please keep in mind that this is a very simple example with just a single source system.

NOTE In a real world implementation you would definitely use a staging area. For reasons of brevity and clarity we’ve decided to leave this out in the following examples.

One of the nice aspects of Data Vault modeling is the repeatability of the design and load processes. Because of the rather strict modeling rules, every hub has similar attributes. The same can be said about links and satellites. This means that the loading routines will be very similar as well, and that automatic generation of the loading processes, or even of the Data Vault model itself, might be worth exploring. Because of these similarities we will describe only three transformation files: a sample hub, a sample link, and a sample satellite.

Sample Hub: hub_actor

The transformation file for loading the hub_actor table is called hub_actor.ktr. The transformation is shown in Figure 19-3.

Figure 19-3: The hub_actor transformation Chapter 19 N Data Vault Management 479

The transformation hub_actor is responsible for inserting all actor entities (business keys) from the sakila actor table that are not yet present in the Data Vault table hub_actor. The following list provides a brief description of the steps shown in Figure 19-3:

- input source: The "input source" step executes the following SQL code against the Sakila source database:

SELECT actor_id AS id
, 'sakila-db.actor' AS record_source
FROM actor
ORDER BY 1

Basically it selects every actor_id (business key) from Sakila. For your convenience, it also defines the record source as sakila-db.actor. The ORDER BY clause sorts the data by the first field, actor_id.
- input vault: The "input vault" step selects the actor_ids from your target schema, the Data Vault, using the following SQL statement:

SELECT actor_id AS id_existing
FROM hub_actor
ORDER BY 1

Again, this results in a sorted list of actor_ids, similar to the one from the previous step.
- Merge Join: The Merge Join step that follows performs a LEFT OUTER JOIN on the two sets of actor_ids. The result is a list that might look like the following:

id  record_source    id_existing
1   sakila-db.actor  1
2   sakila-db.actor  2
3   sakila-db.actor  3
4   sakila-db.actor  4
5   sakila-db.actor  NULL
6   sakila-db.actor  NULL

This list tells you that your source system, sakila, contains six actor_ids and that the Data Vault table hub_actor contains only actor_ids 1-4. Apparently actor_ids 5 and 6 have been inserted since the last time the Data Vault was loaded and hence need to be added in this load.

NOTE You probably noticed the two icons with the letter i in blue circles on the hops between the "input source," "input vault," and Merge Join steps. These are warnings that the Merge Join step expects the incoming streams to be sorted. This can be achieved with the Sort Rows step. In our case, we let the database perform this task (using the ORDER BY clauses in the SQL queries).

- id exists?: The next step is a Switch / Case type step. All rows with a NULL value in the id_existing column are sent to the "Remove unused fields" step; the other rows are sent to the "Dummy (do nothing)" step where they will be ignored.
- Remove unused fields: This step strips the fields that are no longer needed from the data stream, leaving you with two fields: actor_id (previously id) and record_source.
- Dummy (do nothing): This step does exactly what you would expect based on its name.
- insert new ids: This final step is a "Table output" type step and inserts the remaining rows into the Data Vault table hub_actor.

NOTE In addition to actor_id and record_source, the table hub_actor contains two additional fields: hub_actor_id and load_dts. These fields are automatically filled by the database. Hub_actor_id is of type auto_increment, and load_dts has a default value of CURRENT_TIMESTAMP, which ensures that the field always contains the timestamp of insertion into the Data Vault.

If your database is running and you set up the sakila_data_vault schema and corresponding user account, you should be able to simply run the transformation and load the hub_actor table. The transformation should complete in a matter of seconds on any modern laptop or desktop computer.
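If you prefer to see the set logic of this hub load in a single statement, it is roughly equivalent to the following SQL sketch. This is an illustration only, not part of the book's solution: the Kettle transformation is the actual implementation, and the sketch assumes the sakila source tables are reachable from the Data Vault connection as sakila.actor.

-- Rough SQL equivalent of the hub_actor transformation: insert only the business
-- keys that are not yet present in the hub; hub_actor_id and load_dts are filled
-- automatically by the database, as described in the note above.
INSERT INTO hub_actor (actor_id, record_source)
SELECT src.actor_id
, 'sakila-db.actor'
FROM sakila.actor AS src
LEFT JOIN hub_actor AS hub ON hub.actor_id = src.actor_id
WHERE hub.actor_id IS NULL;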

Sample Link: link_customer_store

The transformation file for loading the link_customer_store table is called link_customer_store.ktr. The transformation is shown in Figure 19-4. The transformation link_customer_store is responsible for inserting all relationships between the customer and store entities from the customer table in the sakila database that are not yet present in the Data Vault link_customer_store table. This relationship describes the store a customer is registered for. A brief description of the steps shown in Figure 19-4 follows:

- input: The "input" step executes the following SQL code against the sakila source database:

SELECT customer_id
, store_id
, 'sakila-db.customer' AS record_source
FROM customer
ORDER BY 1, 2

Basically it selects every combination of customer_id and store_id (business keys) from sakila.customer.

Figure 19-4: The link_customer_store transformation

- input vault: Similar to the "input vault" step from the sample hub transformation, another list is created from the Data Vault, containing the same information. Because we have to join on the business keys in the next step, we need to join the table link_customer_store with hub_customer and hub_store:

SELECT customer_id AS customer_id_vault
, store_id AS store_id_vault
FROM hub_customer
INNER JOIN link_customer_store USING(hub_customer_id)
INNER JOIN hub_store USING(hub_store_id)
ORDER BY 1, 2

The columns customer_id and store_id are renamed in the query (_vault is added to the name) to distinguish them from the columns of the "input" step.
- Merge Join: Again, similar to the hub example, the Merge Join step that follows performs a LEFT OUTER JOIN on the two sets of business keys.
- id exists?: The "id exists?" step filters the rows that are new and redirects the other rows to the dummy step.
- Lookup hub_customer_id / Lookup hub_store_id: At this point in the transformation, you have created a list of new customer_ids and store_ids that need to be inserted in your link table. However, a link table does not contain business keys, only surrogate keys from hubs or links.

The "Lookup hub_customer_id" step looks up the business key (customer_id) in the hub and converts it to the corresponding surrogate key (hub_customer_id). Naturally, the next step, "Lookup hub_store_id," performs a similar action for store_id.
- Remove unused fields: This step leaves you with the fields you need for insertion: hub_customer_id, hub_store_id, and record_source.
- insert new ids: Finally, the data is inserted in the table link_customer_store.
If your database is running and you set up the sakila_data_vault schema and corresponding user account, you may have tried to run the transformation, only to find out that it failed on execution. This is because it needs at least the tables hub_customer and hub_store to contain all their data. For your convenience we added a job to load all hub tables, named loadhubs.kjb. You can find the jobs loadlinks.kjb and loadsatellites.kjb as well. Figure 19-5 shows the job to load all hubs. As you can see in Figure 19-5, all hub tables can be loaded in parallel. There is absolutely no relationship between hubs (and there shouldn't be one, hence the rule that hubs may not contain foreign keys).

Figure 19-5: Load all hubs in parallel

After loading the hubs, you can start the link_customer_store transformation. This transformation should complete in a matter of seconds on any modern laptop or desktop computer.
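As with the hub, the set logic of this link load can be summarized in one SQL sketch (an illustration only, under the same assumption that the sakila schema is reachable from the Data Vault connection):

-- Rough SQL equivalent of the link_customer_store transformation: convert the
-- business keys to hub surrogate keys and insert only the relationships that are new.
INSERT INTO link_customer_store (hub_customer_id, hub_store_id, record_source)
SELECT hc.hub_customer_id
, hs.hub_store_id
, 'sakila-db.customer'
FROM sakila.customer AS c
INNER JOIN hub_customer AS hc ON hc.customer_id = c.customer_id
INNER JOIN hub_store    AS hs ON hs.store_id    = c.store_id
LEFT JOIN link_customer_store AS l
       ON  l.hub_customer_id = hc.hub_customer_id
       AND l.hub_store_id    = hs.hub_store_id
WHERE l.hub_customer_id IS NULL;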

Sample Satellite: sat_actor

The transformation file for loading the sat_actor table is called sat_actor.ktr. The transformation is shown in Figure 19-6.

Figure 19-6: The sat_actor transformation

The transformation sat_actor is responsible for capturing all actor attributes and the historical changes from the sakila actor table that are not yet present in the Data Vault table sat_actor. A brief description of the (relevant) steps shown in Figure 19-6 follows:

- input source: The "input source" step executes the following SQL code against the Sakila source database:

SELECT actor_id
, IFNULL(first_name, '') AS first_name
, IFNULL(last_name, '') AS last_name
, IFNULL(last_update, '') AS last_update
, 'sakila-db.actor' AS record_source
, NOW() AS load_end_dts
FROM actor
ORDER BY 1

The SQL code selects the attributes from the source database that need to be inserted in the satellite table. In the next step, you create a Cyclic Redundancy Check (CRC) from these attributes to detect possible changes. CRC functions usually do not like NULL values; therefore, we used the IFNULL function from MySQL. Every common database management system contains a similar function. The load_end_dts field is used later in this transformation. Right now it stores the current date/time stamp using the MySQL function NOW().
- input vault: Again, similar to the sample hub and the sample link, a similar list is built based on the data in the Data Vault, ready for joining in the Merge Join step.

SELECT actor_id
, sat_actor_id
, hub_actor_id
, IFNULL(sa.first_name, '') AS first_name_vault
, IFNULL(sa.last_name, '') AS last_name_vault
, IFNULL(last_update, '') AS last_update_vault
FROM sat_actor sa
INNER JOIN hub_actor USING(hub_actor_id)
WHERE load_end_dts IS NULL
ORDER BY 1

Because you need the business key (actor_id), you have to join the hub_actor table in this query. The WHERE clause ensures that you're selecting only the latest version of the satellite data. The historical attributes are not relevant for this situation.

NOTE When you use a CRC technique (or a similar method; Kettle also supports ADLER 32, MD5, and SHA-1) to detect changes, make sure that you use exactly the same functions (such as IFNULL) on both sides of the calculation.

- add crc_src and add crc_vault: The two steps "add crc_src" and "add crc_vault" each add an additional field to the stream. These are calculated fields, based on a standardized calculation that is guaranteed to generate the same result when the same data is fed to the algorithm. In this case, you feed three fields to the CRC32 calculations: first_name, last_name, and last_update. All NULL values are replaced with empty string values. As long as the data in the source system has not changed, the CRC calculations will generate the same results. When the data in the source system is changed, you will detect this change because the results of the calculations will no longer be the same.

NOTE Theoretically, CRC calculations with different inputs can generate the same output. So while there is a very small chance that a change in the source system remains undetected, we decided to accept that risk. To avoid it completely, you would have to do a field-by-field comparison, which would slow down the process considerably.

- Merge Join: Again, similar to the hub example, the Merge Join step that follows performs a LEFT OUTER JOIN on the two sets of business keys.
- new or existing record: The next step determines whether you are dealing with a new or an existing record. When sat_actor_id contains the value NULL, it must be a new record; otherwise it is an existing one. New records are sent to "Lookup hub_actor_id" while existing records follow "Filter rows."
- Lookup hub_actor_id: Similar to the earlier link example, you have to convert a business key (actor_id) to the corresponding surrogate key (hub_actor_id). This is what happens in the step "Lookup hub_actor_id" before the row gets stripped of unnecessary fields in the Select Fields step.
- Insert new sat row: A new row is inserted using the following fields: first_name, last_name, hub_actor_id, record_source, and last_update. There is another path you need to examine, however: The "Filter rows" step is the start for every existing row.
- Filter rows: The "Filter rows" step compares the two CRC32 calculations you added a couple of steps earlier. When these fields are the same, the row has not been altered and can be ignored. When the two stream fields are different, you need to do two things: end-date the current satellite row, and insert a new satellite row (that will be current from now on). This is exactly what the next two steps do (see the SQL sketch after this list).
- Update: The Update step sets the field load_end_dts in the satellite table to the value of the stream field load_end_dts (which is calculated in the step "input source").
- Insert updated sat row: Finally, a new satellite row is inserted, based on the stream fields first_name, last_name, hub_actor_id, record_source, and last_update.
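The two actions for a changed row boil down to the following SQL sketch (illustrative only; the hub key and attribute values are hypothetical, and in the transformation the Update and "Table output" steps perform this work row by row):

-- 1. End-date the currently valid satellite row for this hub key.
UPDATE sat_actor
SET load_end_dts = NOW()
WHERE hub_actor_id = 5          -- hypothetical surrogate key
AND load_end_dts IS NULL;

-- 2. Insert the new version, which becomes the current row
--    (load_dts and load_end_dts are left to their defaults).
INSERT INTO sat_actor (hub_actor_id, first_name, last_name, last_update, record_source)
VALUES (5, 'NEW_FIRST_NAME', 'NEW_LAST_NAME', NOW(), 'sakila-db.actor');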

Loading the Data Vault Tables

Now that we have described the three different transformations for the Data Vault model, it's time to load the actual tables. For this purpose, we added three Kettle jobs that load the hubs, the links, and the satellites: LoadHubs.kjb, LoadLinks.kjb, and LoadSatellites.kjb. Refer back to Figure 19-5 to see LoadHubs.kjb.

The jobs need to be executed in this order: hubs, then links, then satellites. However, note that all hubs can be loaded in parallel, all links can be loaded in parallel, and all satellites can be loaded in parallel.

Updating a Data Mart from a Data Vault

By now you realize that a Data Vault model is very well equipped for integrating, storing, and safeguarding your valuable data. However, it is less suitable for intensive querying or reporting. This is why a data warehousing architecture that uses Data Vault modeling for the enterprise data warehouse usually contains a data mart layer where one or more star schemas exist for end-user access to the data. In this section, we show you how you can populate a star schema based on a Data Vault enterprise data warehouse. We will use the exact same star schema that was first introduced in Chapter 4. For a brief explanation and installation instructions please see the section “The Rental Star Schema” in Chapter 4.

The Sample ETL Solution

This sample ETL solution is very similar to the one described in Chapter 4 (some of the code is even identical). However, we wanted to provide you with a complete solution that enables you to gain some experience with Data Vault models and their behavior. The next subsections describe the transformations we created to load the star schema from the Data Vault. Most of the transformations are described individually, except for dim_staff and dim_store because of their resemblance to dim_customer.

NOTE Dim_date and dim_time are static dimensions that are initially loaded and do not need to be reloaded or refreshed periodically. The transformation files for loading dim_date and dim_time are called dim_date.ktr and dim_time.ktr and are identical to the transformations used in Chapter 4. They are added to this chapter for your convenience only and will not be discussed in detail here.

The dim_actor Transformation

The dim_actor table is populated by the transformation dim_actor.ktr and is displayed in Figure 19-7. Because dim_actor does not use slowly changing dimension logic (only the most recent attributes of actors are stored in the star schema; history is ignored), the transformation that loads the dim_actor table is very straightforward.

Figure 19-7: The dim_actor transformation

A "Table input" type step is followed by a step that inserts new records and updates existing records where appropriate. The steps from the dim_actor transformation are:

- Input Data Vault: In this step, the following SQL query loads the relevant data from the Data Vault:

SELECT sa.last_update AS actor_last_update
, ha.actor_id AS actor_id
, sa.first_name AS actor_first_name
, sa.last_name AS actor_last_name
FROM sat_actor sa
INNER JOIN hub_actor ha USING(hub_actor_id)
WHERE sa.load_end_dts IS NULL

Basically this step reads the attributes from sat_actor. The table hub_actor is joined because you need the attribute actor_id. Because you are currently not interested in the history of actors, the clause WHERE sa.load_end_dts IS NULL ensures that you receive only the most recent attributes.

NOTE Although right now you may not be interested in the historical attributes of actors, this may very well change in the future. If it does, you can rely on the Data Vault model, as it still contains a full history trace of actors.

- Insert / Update: Thanks to the functionality of the Insert / Update step, this transformation is easy. There is no need to track changes yourself. The properties of the Insert / Update step are shown in Figure 19-8. The correct record is matched using actor_id and the attributes are updated if they have changed.

Figure 19-8: The Insert / Update step

The dim_customer Transformation

The dim_customer transformation looks rather complex (see Figure 19-9) compared to dim_actor. When you look back to Chapter 4, it is also more complex than the load_dim_customer transformation that is discussed there. The reason for this complexity lies in the type 2 slowly changing dimension logic. As discussed before, the Data Vault model contains full history; this means that it can contain numerous versions of the same customer and that your SQL query can result in multiple versions of the same customer. These versions need to be stored in the type 2 slowly changing dimension table dim_customer accordingly. The component "Dimension lookup/update" currently does not support this behavior and therefore cannot be used. This means you have to do it yourself. The transformation consists of two streams (one for the Data Vault data and one with existing data in the star schema). The two streams are joined together to detect new customers and changed customers. New customers are simply inserted; changed customers need to be updated (field customer_valid_through).

Figure 19-9: The dim_customer transformation

The steps from the dim_customer transformation are:

- input data vault: The first step fires the following SQL query to the Data Vault:

SELECT DATE(sc.load_dts) AS customer_valid_from_tmp
, IFNULL(DATE(sc.load_end_dts), DATE('2199/12/31')) AS customer_valid_through
, sc.last_update AS customer_last_update
, DATE(sc.create_date) AS customer_created
, hc.customer_id AS customer_id
, sc.first_name AS customer_first_name
, sc.last_name AS customer_last_name
, sc.email AS customer_email
, sc.active AS customer_active
, sc.address AS customer_address
, sc.district AS customer_district
, sc.postal_code AS customer_postal_code
, sc.phone AS customer_phone_number
, sc.city AS customer_city
, sc.country AS customer_country
FROM sat_customer sc
INNER JOIN hub_customer hc USING(hub_customer_id)
ORDER BY hc.customer_id, sc.load_dts

This is not a very complex query. It selects the relevant fields from sat_customer, joined with hub_customer because you need the customer_id. The two fields, customer_valid_from and customer_valid_through, are based on the Data Vault fields load_dts and load_end_dts, but this needs some additional attention. The Data Vault does not know when the first version of a customer started its validity. It is fair to say that it was probably created before it was loaded in the Data Vault, so you need to trust the source system and use the field create_date from Sakila. If this field is not available (which is the case for dim_store and dim_staff) you will use the value 1970/01/01. In Chapter 4, the convention is introduced to set the customer_valid_through to 2199/12/31 when a customer is still valid. You will use the same convention; the preceding IFNULL expression implements it.
- input Star: The next step, "input star," executes the following SQL query:

SELECT customer_key
, customer_id AS star_customer_id
, customer_valid_from AS star_customer_valid_from
FROM dim_customer
ORDER BY customer_id, customer_valid_from

It simply selects the existing customers in dim_customer and needs no further explanation.
- calc version_number: Before the two streams are joined, the Data Vault stream uses a very smart step known as "Add fields changing sequence." This step generates a sequence that resets whenever a certain field changes (see Figure 19-10). This is exactly what you need to create a version number for customers. When the stream is sorted by customer_id and load_dts, the version number increases until customer_id changes, which resets the version number back to 1.

Figure 19-10: Calculate the version number

The transformation continues with a Merge Join step that joins the two streams. Based on the customer_key field, you can determine whether a row is new or not. Existing rows may need to be updated.
- Newer version available?: When a customer changes (for example, a new address), a new row is added to the satellite table. This means that a new row will be added to the dim_customer table in the star schema. However, the current row needs to be updated (the field customer_valid_through). The step "Newer version available?" detects this. The next step, Update, changes the value of customer_valid_through.
- Version 1.0?: As discussed previously, the first version of a customer needs a different value for the field customer_valid_from. It must be set to the value of create_date instead of load_dts. The step "Version 1.0?" uses the stream field customer_version_number to redirect the row to the correct calculation. Figure 19-11 shows the details of this step.

Figure 19-11: customer_version_number

The dim_film Transformation

The transformation for dim_film would have been very easy, similar to dim_actor, if the designers hadn't created the sets of [film_has_...] and [field_in_category_...] fields. The set [film_has_...] is a flattened version of the field special_features and the set [field_in_category_...] is based on the many-to-many relationship between film and category. Figure 19-12 shows the transformation.

Figure 19-12: The dim_film transformation

Although the transformation dim_film looks rather complex, we will not discuss it here. The transformation is very similar to load_dim_film, which is described in Chapter 4.

The dim_film_actor_bridge Transformation

The relationship between film and actor is many-to-many (one actor can play in many films and in one film many actors can play). Each rental (the granularity of the fact table) captures the rental of a single film. This means that the dimension table dim_actor cannot be related to the fact table fact_rental. Because you want to be able to report about films and actors, a bridge table is created. This bridge table models the many-to-many relationship between dim_film and dim_actor.

The transformation load_dim_film, as described in Chapter 4, contains the logic to load dim_film_actor_bridge. In this chapter, you make this a separate transformation, as shown in Figure 19-13.

Figure 19-13: The dim_film_actor_bridge transformation

The transformation starts with an "Input table" type step, which loads the key values from dim_film. The next step is a Database Join type step that selects the actor_ids for each film_id. The resulting stream is split into two separate streams. The left stream uses a "Database lookup" type step to match actor_id to its surrogate key actor_key. The stream on the right uses a "Group by" type step to count the number of actors per film. This number is required to calculate the weighting factor in the following step. Finally, the two streams come together again in the step "Lookup actor weighting factor" and the table dim_film_actor_bridge is updated.

The fact_rental Transformation

The final transformation that we will describe in this chapter loads the fact table fact_rental. The transformation is called fact_rental.ktr. Again, this transformation is very similar to the corresponding load_fact_rentals transformation described in Chapter 4, except for the first two table input steps, which are described next. The transformation is displayed in Figure 19-14.

Figure 19-14: The fact_rental transformation

Fact_rental starts with two "Table input" type steps that capture changed data. The first step selects the latest value for the field rental_last_update and the second step, "Get New Rentals to load," selects the required data from the Data Vault using the following SQL query:

SELECT hr.rental_id
, sr.rental_date
, sr.last_update
, sr.return_date
, hc.customer_id
, hs.staff_id
, hstore.store_id
, hf.film_id
FROM link_rental lr
INNER JOIN hub_rental hr USING(hub_rental_id)
INNER JOIN sat_rental sr USING(hub_rental_id)
INNER JOIN hub_inventory hi USING(hub_inventory_id)
INNER JOIN hub_customer hc USING(hub_customer_id)
INNER JOIN hub_staff hs USING(hub_staff_id)
INNER JOIN link_inventory li USING(hub_inventory_id)
INNER JOIN hub_store hstore USING(hub_store_id)
INNER JOIN hub_film hf USING(hub_film_id)
WHERE sr.last_update > ?

Although this looks complex, it really is not. The data you need to insert into fact_rental consists of data from several tables in the Data Vault. You need data about the rental, the customer, the store, the staff, and the film. This query shows the essence of a Data Vault model; every distinct piece of data gets assigned its own table, thus creating optimal flexibility when data is added or changed. This query simply joins the tables that contain the required data.

Loading the Star Schema Tables

To load the star schema tables, we added two Kettle jobs, one that loads the date and time dimensions, Loaddatetime.kjb, and one that loads the dimensions and fact tables, Loaddimensionsfacts.kjb. The job that populates the dim_date and dim_time tables needs to be executed only once; it is not really part of the ETL solution.

Summary

In this chapter, we introduced the concept of Data Vault modeling as a technique to build an enterprise data warehouse. The Data Vault has many advantages over other data modeling techniques, with the following as the most notable ones:

- Auditability/traceability: Every change to the data in the source system is captured without alteration, making a DV well suited for environments where compliance is an issue.
- Repeatability/simplicity: The DV model has only a limited set of entity types (hubs, links, satellites), with easy, repeatable structures, making a DV very suitable for automatic generation of models and ETL code.
- Adaptability: Changes in the structure of the source systems can easily be added to the DV without having to change the existing tables.
As an example, we created a Data Vault database based on sakila, the same sample database used throughout the book. The steps and transformations needed to load the Data Vault from the source system and the data marts from the Data Vault were explained in detail. This chapter also showed that Kettle is a general-purpose ETL tool that can be used in any data warehouse scenario.

CHAPTER 20

Handling Complex Data Formats

In typical data warehousing scenarios, the ETL process extracts data from operational systems to load it into the data warehouse. In most cases, both the operational system and the data warehouse use an RDBMS to store the data. For the ETL process, this means that the data is at least available in a comprehensible relational format. As should be clear from Chapters 6 through 9, a relational format does not suddenly mean that the transformations required to get the data from source to target database are trivial or even simple. But at least data is clearly separated from metadata, and there is no question with regard to fundamental relationships between the individual pieces of data: Rows tie attribute values together in a uniform format, and foreign keys establish relationships between rows in a clear and unambiguous manner. Perhaps a more important feature is that the consistent use of relational data simplifies transformations. No matter how drastically data is transformed, input and output still share the fundamental characteristics of a table, and can still be seen as a collection of rows that are segmented into columns. So what if the source data is not in a tabular format? In Chapter 21, you will read about XML and JSON, which are non-tabular because they allow arbitrary nesting of data structures. But still, although non-tabular, these data formats have a clear grammar and thus a clear separation of data and metadata. In this chapter, we take a look at some non-relational or even non-tabular data formats that, in contrast to XML and JSON, do not comply with a clear grammar.


Non-Relational and Non-Tabular Data Formats

We are not always so lucky as to have relational data at our disposal. While the following is not an exhaustive list of data formats, we identify a number of categories of non-tabular data:

- Structured non-tabular data formats: Here, the data is highly structured, and there is a relatively small set of metacharacters and rules that define the formal characteristics of the data format. XML and JSON are formats that match this category. These formats are discussed in detail in Chapter 21.
- Non-relational but tabular: In this case, the data consists of a collection of rows, and each row is clearly segmented in fields. However, the table may still contain multi-valued attributes and/or repeating groups. In this case, the data is not relational.
- Semi-structured or unstructured data formats: In these cases, it may not be clear what the rules are that control the shape and form of the data, or the set of rules may be very complex and hard to formulate.
- Key/value pairs: Often, data is line-oriented: one record is kept on one line of text, which is then segmented into fields. But sometimes, records appear as a sequence of lines, where each line contains a key and a value part.
In the remainder of this chapter, we show examples of the last three of these data formats and briefly highlight which Kettle features you can use to handle them. We also discuss some sample transformations to explain these Kettle features in detail.

Non-Relational Tabular Formats

Tabular data is non-relational when it contains either or both:

- Multi-valued attributes
- Repeating groups
If the source data contains either or both of these structures, then it will usually be desirable to use the ETL process to restructure the data to a regular relational format. This does not mean that the data must be stored in a relational format in the data warehouse. But at the least, one would want to transform the data to a relational format to do efficient cleansing and conforming.

Handling Multi-Valued Attributes

A multi-valued attribute is an attribute that may have a collection of values rather than a single scalar value. You encountered an example of a multi-valued attribute in Chapter 4: The special_features column of the film table of the Sakila sample database has the MySQL-specific SET data type (see Figure 4-1). Columns with this data type contain values that appear as a comma-separated list of string values, which is called a set.

Using the Split Field to Rows Step

In the load_dim_film transformation shown in Figure 4-16, we used a "Split field to rows" step to split the list into individual rows, each having a single value for the attribute. In the load_dim_film transformation, the "Split field to rows" step is labeled "Normalize special features." Chapter 4 does not discuss a separate transformation to illustrate this step so we discuss it in detail here. Figure 20-1 shows the configuration of that particular step.

Figure 20-1: The “Split field to rows” step

The top section of the dialog in Figure 20-1 is used to define a number of properties that are essential for transforming the multi-valued field to a list of rows. In the "Field to split" property, you can use the drop-down list to pick the field from the incoming stream that contains the multi-valued attribute. In this case, it is the field that originates from the film.special_features column. In the definition of the film table, the field is declared as:

special_features SET('Trailers','Commentaries','Deleted Scenes','Behind the Scenes')

This definition will allow the field to contain a comma-separated list composed from any combination of any of the values named in the comma-separated list within the parentheses following the SET keyword. The Delimiter field is used to specify a string that separates the individual values in the multi-valued field. Because values for MySQL SET columns are comma-separated lists, we specify a comma (,) here. This causes the step to generate a new row for each distinct value in the comma-separated list. For example, the film titled ACADEMY DINOSAUR, which has 'Deleted Scenes,Behind the Scenes' as special features, will yield two output rows because the comma separates two distinct values. Once the multi-valued attribute is split, it becomes a normalized scalar attribute. The values for this attribute are added to the output stream in a new field, and you can use the "New field name" property to specify the name of the new field. The "Additional fields" section of the dialog shown in Figure 20-1 can be used to add row numbering to the newly generated rows. To generate row numbers, select

the “Include rownum in output?” checkbox. You can specify the name for the row number field in the “Rownum fieldname” property. Finally, you can select the “Reset Rownum at each input row?” checkbox to restart numbering from 1 for each row in the incoming stream.
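For comparison, the effect of the "Split field to rows" step on the special_features column can also be expressed in SQL. The following MySQL sketch is only an illustration of the result (the derived numbers table and aliases are assumptions); in the ETL solution the Kettle step does this work:

-- One output row per comma-separated value in film.special_features.
SELECT f.film_id
, SUBSTRING_INDEX(SUBSTRING_INDEX(f.special_features, ',', n.n), ',', -1) AS special_feature
FROM film AS f
INNER JOIN (SELECT 1 AS n UNION ALL SELECT 2
            UNION ALL SELECT 3 UNION ALL SELECT 4) AS n  -- a SET holds at most four values here
       ON n.n <= 1 + LENGTH(f.special_features)
                   - LENGTH(REPLACE(f.special_features, ',', ''))
WHERE f.special_features IS NOT NULL;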

Handling Repeating Groups

A repeating group occurs whenever a particular type of column (or group of columns) occurs multiple times in a table. For example, the film table in the sakila database has two columns that refer to the language table: language_id and original_language_id, and this may be construed as a repeating group.

NOTE Although the sakila example where the film table has two foreign keys to the language table can be explained as a repeating group, many information analysts will be reluctant to do so. The mere fact that in both cases the column serves to identify a language is not enough to deem it a repeating group of languages, because the two references to the language table each have their own well-defined semantics. So although this particular example is far from perfect, we stick with it because it's the only one that's readily available in the sakila database.

Using the Row Normaliser Step

With Kettle, you can normalize repeating groups of columns in tables using the Row Normaliser step. You can find this step beneath the Transform folder in the left side pane tree view. The sakila-normalize-repeating-groups transformation provides an illustration of the Row Normaliser step. The transformation and the configuration dialog of the Row Normaliser step are shown in Figure 20-2.

Figure 20-2: Using the Row Normaliser step to normalize repeating groups

For each repeating group, a distinct value is entered in the Type column. In this case, the group occurs twice, and so there are two distinct values for Type. For all columns in a group defined by a distinct Type value, a row is entered in the grid. In this case, the repeating group consists of just one column. The Type values are added to a new field in the outgoing stream, and the “Type field” property is used to specify the name for the field that receives the Type values. For each incoming row, the Row Normaliser step generates just as many rows as there are distinct values in the Type column of the grid. The generated rows have a new field for the Type value, but the repeating instances of the columns in the repeating group are removed, leaving only one instance of the columns in the repeating group.

Semi- and Unstructured Data

In many cases, the data is available in files that are much less rigorously structured. Whereas humans are often capable of identifying interesting bits of data and understanding how the pieces of data are related, automated data extraction requires the definition of a pattern that describes the format of the data. Identifying and defining such a pattern is a prerequisite to data extraction: Without it, you have no way of automating the process.

Regular expressions are an old and proven method to deal with searching and matching text patterns. Regular expressions are widely supported in many languages, including Perl, JavaScript, and Java.

NOTE A detailed discussion of regular expressions is outside the scope of this book. However, there are many good books and online tutorials on regular expressions, and you shouldn't have much trouble learning more about them. See, for example, the Wikipedia article at http://en.wikipedia.org/wiki/Regular_expression or the site http://www.regular-expressions.info/.

As it turns out, you can identify a pattern and write a regular expression for it in most cases, although it may be hard, and sometimes very hard, to do so. It may even be possible that developing a catch-all pattern is not worth the cost. In these cases, you can only aim for a “good enough” solution, and have to accept that you are currently not capable of accessing all the data you would like. In any case, you should always put something in place to register any data that does not match your regular expression, and inspect the unmatched data regularly in order to improve your regular expression. Sometimes you may find it easier to develop a separate processing path to handle these cases.

NOTE As an example of a good-enough solution, consider e-mail addresses. An extensive discussion of the considerations and trade-offs for this particular problem is included in the regular-expressions.info site at http://www.regular-expressions.info/email.html. Defining a regular expression

that matches the large majority (99 percent) of the valid e-mail addresses and that correctly detects most invalid e-mail addresses is a simple task and can be done in just 40 characters:

^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$

The actual syntax for e-mail addresses is defined in RFC 2822, at http://tools.ietf.org/html/rfc2822#section-3.4.1. The regular expression pattern that corresponds to this specification is quite convoluted, and is 426 characters long, more than 10 times the length of the naïve approach.
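Because the Regex Evaluation step discussed later in this chapter builds on Java regular expressions, you can try such a pattern directly with java.util.regex. The following sketch only illustrates the good-enough pattern; it is not taken from any of the book's transformations:

import java.util.regex.Pattern;

public class EmailRegexSketch {
    // The "good enough" pattern from regular-expressions.info, matched case-insensitively.
    private static final Pattern EMAIL =
            Pattern.compile("^[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,4}$",
                            Pattern.CASE_INSENSITIVE);

    public static void main(String[] args) {
        System.out.println(EMAIL.matcher("jos@example.com").matches());   // true
        System.out.println(EMAIL.matcher("not an address").matches());    // false
    }
}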

For an example of seemingly unstructured data, consider the following snippet from the movies.list file provided by the Internet Movie Database, which hosts a large collection of information about movies and television series at http://www.imdb.com/:

#1 (2005)                                          2005
#1 Fan: A Darkomentary (2005) (V)                  2005
$100,000 Pyramid (DVD)                             1901
'Columbia' Winning the Cup (1901/II)               1901
1 Second Film, The (2008) {{SUSPENDED}}            2008
21 Days (1940)                                     1940 (shot 1937)
Andrey Rublyov (1966)                              1966 (shot 1964-1965)
"#1 College Sports Show, The" (2004)               2004-????
"#1 Single" (2006) {Cats and Dogs (#1.4)}          ????
"$1,000,000 Chance of a Lifetime" (1986)           1986-1987
"$10,000 Pyramid, The" (1973)                      1973-1988, 1991-1992
"$10,000 Pyramid, The" (1973) {(1973-03-26)}       1973
"10 Years Younger" (2004/I)                        2004
"10 Years Younger" (2004/I) {(#2.8)}               2005

You can download this and many other lists as a gzipped archive from one of the FTP servers listed at the IMDB website at http://www.imdb.com/interfaces#plain. Although there definitely is a pattern in the film list, it is not straightforward. At the very least, it is not regular enough to expect anything else from the standard text file input steps than reading individual lines. You can't use these steps for parsing individual fields. Even if you could, the construction of the initial field is so complex that you would still need another pass to parse it. To give you an idea of the complexity, here are a few facts that are known about the format:

N Each line represents one "movie," which may be a cinematic film, TV movie, TV series, video release, mini series, or videogame.
N The line starts with a complex key that identifies the movie within the file.
N The key is followed by a variable number of tabs, an optional parentheses-enclosed code indicating the movie type (TV for TV film, V for video release, VG for video game, mini for mini-series), another sequence of one or more tab characters, and

the four-digit release year. Optionally, the release year is followed by yet another collection of tabs and a parenthesized bit of extra information, such as shot 1937 to indicate it was filmed in a particular year, or unreleased to indicate the work was never officially released.
N If there is no explicit type indicator, the item is either a cinematic movie or a TV series. TV series titles are enclosed in double quotes but cinematic movie titles are not.

Kettle Regular Expression Example

In order to extract data from the IMDB movies list, we prepared the imdb-movies transformation. You can download this transformation from this book's website in the download folder for Chapter 20. To run the transformation, you need to download the movies.list.gz file from one of the IMDB FTP sites and place it in the same directory as the imdb-movies.ktr transformation file. The transformation works directly on the gzipped archive, so you shouldn't uncompress it. The transformation is shown in Figure 20-3.

Figure 20-3: Loading data from the IMDB movies list using regular expressions

The top stream of the transformation is the focus for this example. The remainder of the transformation is mostly about determining the type of the movie. In the final step of the transformation, the entries are written to the text file imdb-movies-output.txt. By then, the information hidden in the movie keys is nicely separated into all kinds of

fields, such as the title of the entry, type, any part or episode number (for series), and so on. For example, the entry with the key

"10 Years Younger" (2004/I) {(#2.8)}

is processed to a record such as:

line_number: 566296
year: 2004
part: I
secondary_title: 2.8
real_type: Series
title: 10 Years Younger

Note that the preceding snippet is just an enumeration of some of the extracted fields and their values. The actual imdb-movies-output.txt file uses a separated values format having each record on one line, and has far more fields than the ones shown.

Configuring the Regex Evaluation Step

In Kettle, you can use the Regex Evaluation step for working with Java regular expressions. This step resides beneath the Scripting category in the left side pane tree view. For the syntax supported by the Kettle Regex Evaluation step, see the API documentation of the Pattern class in the java.util.regex package. You can find the relevant documentation at http://java.sun.com/j2se/1.5.0/docs/api/java/util/regex/Pattern.html. The configuration dialog for the Regex Evaluation step contains two tab pages and a field grid for the output fields. We describe each of them in this subsection.

Settings Tab Page

In the top of Figure 20-4, you can see the Settings tab page in the configuration dialog of the Regex Evaluation step from the imdb-movies transformation. In the "Regular expression" text area shown in the middle of Figure 20-4, you must enter the regular expression pattern. This pattern will be matched against the entire field value; it will not search for an occurrence of the pattern inside the field value. The regular expression text may contain Kettle variable references, and if you rely on those, you need to select the "Use variable substitution" checkbox to allow Kettle to replace them with their value. For the imdb-movies transformation, the regular expression is quite complex, which is why it is spread over multiple lines. Almost every line is commented using a single-line comment initiated by the hash sign (#). Multiple lines, white space, and comments are entirely optional and do not influence how the text will be matched, but it is advisable to use comments anyway for complex regular expressions like this.

Figure 20-4: The Settings tab page and Fields grid of the Regex Evaluation step

In the Regex Evaluation step of the imdb-movies transformation, some comments explain what kind of text it is supposed to match; others have only an integer to identify the capturing group. This is invaluable for keeping track of capturing groups that are spread out over many lines. For those readers who are not familiar with regular expressions or with the regular expression comments, let's consider this snippet from the top part of the regular expression shown in Figure 20-4.

(                              # 2
("([^"]+)")                    # 3, 4: series title
|([^"\t][^\t]+)                # 5 movie title
)                              # /2

On the first line of the snippet, the opening parenthesis starts a capturing group. Capturing groups are the mechanism used by Kettle to extract data from a matched pattern and transfer it to a field. The snippet begins at the second capturing group in

the entire expression, so that's why the comment reads 2. The final line of the snippet closes capturing group 2, which is why we use the comment /2. On the next line is a pattern that is supposed to match a series title. We explained that series titles are enclosed in double quotes, and this is what the pattern attempts to match: a double quote followed by [^"]+, which denotes one or more characters that are not a double quote, again followed by a double quote. The pattern to match inside the quotes constitutes a new capturing group, which is why the comments mention 3, 4. On the next line is a pipe (|), which denotes an alternative pattern. The alternative pattern is again a capturing group (5), which tries to first match [^"\t]. This expression matches any character that is not (hence the initial ^ inside the square brackets) a double quote or a horizontal tab character (which is denoted with a so-called escape sequence, \t). After the initial character, [^\t]+ matches any characters that are not a tab. The rationale for this pattern is that a movie title cannot start with a double quote (because it would be a series title in that case) or a tab (because the movie key is separated by a variable number of tabs from whatever more data appears on the line).

In the "Step settings" fieldset, you must specify the field from the incoming stream that will be matched against the regular expression in the "Field to evaluate" property. The Regex Evaluation step will always return the result of matching the value of that field against the regular expression in a new field in the outgoing stream. This field is a Boolean and its value indicates whether the value of the "Field to evaluate" matches the entire regular expression. The name for the result field can be configured in the "Result Fieldname" property. We just mentioned that Kettle can transfer the pieces of text matched by the capturing groups in fields of the output stream. To enable this behavior, you must check the "Create fields for capture groups" checkbox.
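If you want to experiment with capturing groups outside of Kettle, the following Java sketch uses a drastically simplified version of the title fragment shown above (it is not the transformation's actual pattern) and shows how matched groups are retrieved by number, which is what the step does when it fills the fields defined in the Fields grid:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CaptureGroupSketch {
    public static void main(String[] args) {
        // Group 1 = quoted series title (quotes included), group 2 = the title text itself,
        // group 3 = an unquoted movie title. The real pattern has many more groups.
        Pattern p = Pattern.compile(
                "(\"([^\"]+)\")"      // quoted series title
              + "|([^\"\\t][^\\t]+)"  // movie title (no leading quote, no tabs)
        );

        for (String key : new String[] {"\"10 Years Younger\"", "The Fugitive"}) {
            Matcher m = p.matcher(key);
            if (m.matches()) {
                String series = m.group(2);  // null when the other alternative matched
                String movie  = m.group(3);
                System.out.println("series=" + series + ", movie=" + movie);
            }
        }
    }
}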

Fields Grid The Fields grid is shown in the bottom half of Figure 20-4. If the “Create fields for cap- ture groups” checkbox is checked, you must enter a field definition for each capturing group defined in the regular expression.

Content Tab Page The Content tab page is shown in Figure 20-5.

Figure 20-5: The Content tab page of the "Regex evaluation" step

On the Content tab, you can check or uncheck a number of flags that influence the behavior of the matching process. The options are listed here:

N Ignore differences in Unicode encodings: Select this option when you are matching ASCII text and you want to optimize performance. Typically, you should not select this option unless you have a very good reason to.
N Enables case-insensitive matching: By default, selecting this option enables case-insensitive matching for ASCII text. To enable case-insensitive matching also for Unicode text, be sure to also select the "Enable Unicode-aware case folding" option.
N Permit whitespace and comments in pattern: Use this option if you want to spread out the regular expression over multiple lines and document it with comments. In this mode, the whitespace character is not considered part of the regular expression, so to match whitespace, you must use an escape like \s or \x20 to encode whitespace characters. If the option is disabled, then whitespace characters in the regular expression are considered literal, and denote a required match with a whitespace character in the text. Because it can be tedious and cause errors if you switch this option on or off afterward, you had better decide in advance whether you need it, and then stick by your choice.
N Enable dotall mode: Typically, the dot character denotes a match with any character except the line terminator. If this option is checked, the dot matches any character, including the line terminator.
N Enable multiline mode: Typically, the anchors ^ and $ match the start and end of the input, respectively. If this option is checked, the ^ anchor also matches each line start, and the $ anchor also matches each line end.
N Enable Unicode-aware case folding: See the previously described "Enables case-insensitive matching" option.
N Enable Unix lines mode: Typically, both the carriage return/line feed sequence (\x0D followed by \x0A) and the line feed character are recognized as line terminators. If this option is enabled, only the \x0A character counts as a line terminator, which influences the matching behavior of the dot (.), caret (^), and dollar sign ($) metacharacters.

By default, all these options are disabled.
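These options appear to correspond to the standard flags of the java.util.regex.Pattern class. The following sketch lists the assumed mapping in its comments and demonstrates two of the flags on a made-up laserdisc-style line; treat the mapping as an illustration rather than a statement about Kettle's internals:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PatternFlagsSketch {
    public static void main(String[] args) {
        // Constants that appear to correspond to the checkboxes described above:
        //   Pattern.CASE_INSENSITIVE -> "Enables case-insensitive matching"
        //   Pattern.UNICODE_CASE     -> "Enable Unicode-aware case folding"
        //   Pattern.COMMENTS         -> "Permit whitespace and comments in pattern"
        //   Pattern.DOTALL           -> "Enable dotall mode"
        //   Pattern.MULTILINE        -> "Enable multiline mode"
        //   Pattern.UNIX_LINES       -> "Enable Unix lines mode"

        // COMMENTS lets us document the pattern; MULTILINE makes ^ and $ work per line.
        Pattern p = Pattern.compile(
                "^ OT: \\x20? (.*) $     # an 'OT:' line; note \\x20 instead of a literal space",
                Pattern.COMMENTS | Pattern.MULTILINE);

        Matcher m = p.matcher("OT: \"Absolutely Fabulous\" (1992)\nLN: 3");
        while (m.find()) {
            System.out.println("original title = " + m.group(1));
        }
    }
}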

Verifying the Match

The Regex Evaluation step always passes on the input rows to the output, regardless of whether the regular expression matches the input. Typically you should check whether it, in fact, matches your input. In the imdb-movies transformation, the check is performed by the Filter Rows step labeled "Match?" which appears immediately after the Regex Evaluation step. This step simply checks the value of the field that conveys the result of the Regex Evaluation step. The name of this field is configured by the "Result Fieldname" property in the

Settings tab page of the configuration dialog of the Regex Evaluation step shown in Figure 20-4. If the match is true, the row flows to the remainder of the transformation; otherwise, it is led to the "Text output" step labeled "excluded," where it is logged in a file called imdb-movies.excluded.txt. In a production environment, these files should be monitored at least on a daily basis, and a decision should be taken as to what to do with this data. A possible outcome is that you discover that the regular expression isn't quite good enough yet and needs some extra logic to accommodate the hitherto excluded records.

Key/Value Pairs

You can find many examples of key/value structured data on the Internet. For example, an online product catalog may offer a page with detailed information per product, and these pages may enumerate all relevant properties of the product, such as size, color, and price in a list. For an example of data that uses key/value pairs, see the following code:

LASER DISC LIST
===============
----------------------------------------
OT:
LN: 3
LB: Philips
CN: 21317
LT: Stephen King's Nightmare Collection
PC: USA
CF: 16
CA: Movie
RD: 1992
ST: Available
PR: DM69.00
VS: PAL
CO: Color
SE: Digital
AL: -
AR: -
MF: Film
SZ: 12
SI: 2
DF: CLV
CC: -
QP: -
----------------------------------------
OT: "Absolutely Fabulous" (1992)

... more data here ...

The listing is a snippet taken from the laserdisc.list file provided by the Internet Movie Database. (See the earlier movies.list example for instructions to obtain this file.) The laserdisc list is a list of records, each describing a particular laserdisc edition of a movie, documentary, television series, and the like. Records are separated from one another by a separator line consisting of dashes:

----------------------------------------

The record separator is followed by a variable number of lines, each consisting of two segments:

N A code consisting of two capital letters, indicating some property that applies to this particular laserdisc. This is the key. N A text that describes the property for the relevant laserdisc record. This is the value. In the laserdisc example, the key and value segments are separated by a colon, imme- diately followed by a space character. In addition to the key/value pairs, the laser disc records may also contain empty lines, which can be ignored.

Kettle Key/Value Pairs Example

To illustrate how to deal with key/value pairs in Kettle, we created the imdb-laserdisc transformation. You can download this transformation from the book's website in the download folder for Chapter 20. The transformation reads the gzipped laserdisc list file from the IMDB archives, and transforms the key/value pairs to proper columns. So, in order to run it, you must also download the laserdisc.list.gz file from one of the IMDB FTP sites and place it in the same directory as the imdb-laserdisc.ktr transformation file. The transformation is shown in Figure 20-6. The following sections walk you through the main steps of this transformation.

Text File Input

The imdb-laserdisc transformation initially uses the same strategy as the imdb-movies example. The content of the list is simply read on a line-by-line basis, and the "Text input" step does not attempt to divide the file into fields. Instead, it outputs the entire line in a single field called line_text.

Figure 20-6: Processing key/value pairs from IMDB’s laserdisc list

The step also adds an extra line_number field to the output stream. As the name suggests, this field contains the line number, which we use later on in the transformation as a key to tie all lines belonging to one laserdisc record together. To add a line number to the output stream of the "Text input" step, configure it and select the "Rownum in output" checkbox and specify the name for the field in the "Rownum fieldname" property. Both of these options are located on the Content tab page in the configuration dialog.

Regex Evaluation

Then, the lines are fed into a Regex Evaluation step labeled “Split key from value.” The snippet that follows shows a fragment of the regular expression:

^
(-+)                      # 1: record header
|(                        # 2: key/value pair
(                         # 3: key
LN                        # <Laser Disc Number>
|LB                       # <Label>

... many, many more lines of regex pattern code here ...

|AQ                       # <Audio Quality>
)                         # /3
:\x20?
(.+)?                     # 4: value
)                         # /2

At the top level, the regular expression matches either the record header (capturing group #1, a line of dashes, which is stored in a field called header), or it matches a key/value pair (capturing group #2). The key/value pair itself is divided into a key (capturing group #3, which is stored in a field called key) and a value (capturing group #4, stored in the field value). After this initial regular expression matching, we again use a "Filter rows" step to confirm the match.

Grouping Lines into Records

The next step of interest is the “Modified Java Script Value” step labeled “Mark Attribute rows with Id of header row.” The JavaScript code for this step is shown in Figure 20-7.

Figure 20-7: The “Mark Attribute rows with Id of header row” step

The JavaScript snippet detects whether the current row is a header row or a key/value row. It does this with a simple if statement that checks the value of the header field of the current row from the incoming stream. For the "Modified Java Script Value" step, the fields of the current row are automatically available as variables, so we can simply write:

if (header) {
  ...
}

If it is a header row, the header field will contain a string of dashes, as per the pattern for capturing group #1 in our "Regex evaluation" step. When used as a condition for an if statement, JavaScript will treat the non-empty string as true, and the code between the curly braces will be executed. If this is a key/value row and not a header row, the header field will be null, which is not considered true, and the code between the curly braces will be skipped. So if we have a header row, the code between the curly braces is executed:

laserdisc_id = row[1];

The row variable on the right side of the assignment operator is automatically available and represents the current row from the incoming stream as an array of field values. The set of square brackets in the code is the JavaScript syntax for array access, so the expression row[1] actually gets the field value at index 1. Because the fields are numbered starting at 0, row[1] refers to the second field of the current row, which is the line_number field added by the "Text input" step. In the Fields grid in Figure 20-7, you can see how the laserdisc_id variable is added as a field to the outgoing stream. If you run the transformation in preview mode to see the data coming out of the "Modified JavaScript Value" step, you can see how the laserdisc_id field changes value at each header, giving all the subsequent key/value rows that belong to one record the same value. As you will see in the remainder of this section, this is essential for making proper columns out of the key/value rows.

Denormaliser: Turning Rows into Columns

The step labeled "Make one row of multiple attribute rows" is of the "Row denormaliser" type. You can find this step beneath the Transform category in the left side pane tree view. This step does the actual job of turning the rows with key/value pairs into proper columns. The configuration for the step is shown in Figure 20-8. The "Row denormaliser" step groups the rows from the incoming stream according to the fields listed in the grid labeled "The fields that make up the grouping" at the top of the configuration dialog. This is where we use the laserdisc_id field that we created with the "Modified JavaScript Value" step. By grouping, all the rows from the incoming stream having the same value for the laserdisc_id field end up in one row that is emitted to the outgoing stream.

NOTE The "Row denormaliser" step groups only consecutive rows from the incoming stream that have the same value in the laserdisc_id field. In the case of the imdb-laserdisc transformation, the key/value rows that belong to the same record are consecutive, which guarantees the grouping works out. If the rows that need to be grouped are not consecutive, you have to explicitly sort the rows, for example by using a "Sort rows" step.

Figure 20-8: The “Row denormaliser” step

The “Target fields” grid in the lower half of the dialog shown in Figure 20-8 defines the fields that are to be added to the outgoing stream. The name for the field is specified in the “Target fieldname” column of the “Target fields” grid. To fill these fields for the outgoing rows, Kettle uses the following procedure:

N For each row in the group of rows defined by the grouping fields, Kettle takes the value of the key field. The key field is configured by choosing a field name for the property labeled “The key field” in the top of the configuration dialog. N Kettle finds a row in the “Target fields” grid for which the value in the “Key value” grid column equals the value of the key field from the incoming stream. N Once the row in the grid is determined, Kettle gets the field name from its “Value fieldname” column. This fieldname is used to get a value from the incoming row. N The value is stored in a field of the outgoing stream using the name set in the “Target fieldname” column.
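To make the procedure more concrete, the following plain Java sketch mimics what the step does: it walks sorted key/value rows, collects the rows of one group into a single record, and emits that record when the grouping value changes. The group and key values are made up for illustration, and the code is not Kettle API:

import java.util.LinkedHashMap;
import java.util.Map;

public class RowDenormaliserSketch {
    public static void main(String[] args) {
        // Incoming rows: {laserdisc_id, key, value}, already sorted on the grouping field.
        String[][] rows = {
            {"1", "LB", "Philips"}, {"1", "PC", "USA"}, {"1", "RD", "1992"},
            {"2", "LB", "Image"},   {"2", "PC", "USA"}, {"2", "RD", "1994"},
        };

        Map<String, String> record = new LinkedHashMap<String, String>();
        String currentGroup = null;
        for (String[] row : rows) {
            if (currentGroup != null && !currentGroup.equals(row[0])) {
                System.out.println(currentGroup + " -> " + record);  // emit one denormalized row
                record = new LinkedHashMap<String, String>();
            }
            currentGroup = row[0];
            record.put(row[1], row[2]);  // the key value selects a target column
        }
        if (currentGroup != null) {
            System.out.println(currentGroup + " -> " + record);
        }
    }
}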

Summary

In this chapter, we looked at some examples of data that does not fit a relational format, and how to work with such data in Kettle. The chapter covered:

N A categorization of several types of non-relational data: non-relational tabular data is tabular data but with multi-valued attributes and/or repeating groups;

semi-structured and unstructured formats that adhere to complex patterns; and key/value pairs
N Using the Split Field to Rows step to normalize multi-valued attributes
N Using the Row Normaliser step to normalize repeating groups
N Using the Regex Evaluation step to match data against a regular expression pattern and to extract subpatterns of interest using capturing groups
N How to keep key/value pairs of the same record together
N Using the Row Denormaliser step to collapse a collection of key/value pairs into a tabular row

CHAPTER 21

Web Services

In Chapter 6, we discussed a number of Kettle features for Web-based extraction. In this chapter, we take a closer look at the Kettle features for working with websites and web services, and how to deal with the typical data formats they use to exchange data. Before we dive into the details, we first need to explain how web services fit in the context of ETL and data integration, and discuss which concepts and techniques are involved. We also provide an overview of a few data formats that are commonly used to exchange data over the Web. In the remainder of this chapter, we illustrate a few typical scenarios for using Kettle with web services.

Web Pages and Web Services

The majority of what we typically refer to as "The Web" consists of web pages. Web pages are essentially documents, primarily intended for a human audience, that can be retrieved using Hypertext Transfer Protocol (HTTP) and are typically coded in HTML. Via a web browser application, users are connected to a server that is part of a network (the Internet), allowing them to retrieve web pages from different computers in the network (sites). The web pages themselves are human-readable documents that are coded in some form of hypertext, which simply means they contain conveniently navigable links to access other, related web pages.

HTTP defines how a request from a client (such as a web browser) is transferred over the Internet to finally reach a host that is able to send a response containing a resource. Despite its name, HTTP is not confined to the transfer of hypertext. As long as data is properly encoded, HTTP can carry any kind of data. One might think of resources carried over HTTP as static documents or files. This needn't be the case, and often isn't: data can be dynamically generated in response to the HTTP request. Similarly, Uniform Resource Locators (URLs) are used as an addressing scheme to identify particular resources, as well as their location on the Internet. Although URLs can be thought of as addresses of static resources, their format is flexible enough to carry information in their own right. As such, they can be used to send a command to a computer over a network to perform a certain task (rather than just retrieving a particular resource). In these cases, the resource returned typically indicates status information about the execution of the command.

Web services differ from web pages in that they deliver a resource that is intended for machine processing: They deliver data that is not necessarily suitable for direct consumption by humans. Instead, web services are used as a mechanism to support data exchange between applications. Thus, a web service resembles an application programming interface (API). For examples of such web services, take a look at the Netflix web services at http://developer.netflix.com/docs, or the web services listed at http://www.webservicex.net/WS/wscatlist.aspx. Another very comprehensive overview that lists many web services can be found at http://www.programmableweb.com/. Web services typically do not use HTML, but a format that is better suited for data exchange, such as XML or JSON. We will discuss the popular data formats for the Web in more detail later in this chapter.

Kettle Web Features Kettle offers a number of features that enable you to work with the Web. We briefly discuss these features in the remainder of this section. We describe some of them in detail later on in this chapter as we discuss a few concrete examples.

General HTTP Steps

Kettle offers two general-purpose HTTP steps: "HTTP client" and HTTP Post. Both of these steps reside beneath the Lookup category in the left pane tree view. Both these steps perform an HTTP request and add the retrieved resource in a field of the outgoing stream. Both steps can either specify a URL directly, or accept one from the incoming stream, and for both steps, you can map the fields from the incoming stream to query parameters that are added to the URL. The difference between these steps is that the "HTTP client" step performs an HTTP GET request, whereas the HTTP Post step performs an HTTP POST request. The HTTP Post step also allows you to control the message body that is being sent with the request. For example, in addition to sending field values along in the query part of the URL, you can also send them in the message body with the HTTP Post step, or send an entire file as the message body. You will see the details of these steps later in this chapter in various examples.
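Under the covers, an HTTP GET with query parameters amounts to nothing more than the following Java sketch. The host and parameter name are made up; the point is only to show what the "HTTP client" step does with the URL and the response body:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

public class HttpGetSketch {
    public static void main(String[] args) throws Exception {
        // Build a URL with a query parameter, as the "HTTP client" step does with
        // fields mapped to parameters.
        String url = "http://www.example.com/service?q="
                + URLEncoder.encode("kettle web services", "UTF-8");

        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("GET");

        BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"));
        StringBuilder body = new StringBuilder();
        String line;
        while ((line = reader.readLine()) != null) {
            body.append(line).append('\n');
        }
        reader.close();

        // The step would place this response body in a new field of the outgoing stream.
        System.out.println(body);
    }
}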

Simple Object Access Protocol

Kettle offers support for Simple Object Access Protocol (SOAP) web services through the "Web services lookup" step. Like the general HTTP steps, this step also resides in the Lookup category in the left pane tree view. This step is used to invoke SOAP-based web services. The "Web services lookup" step can use the WSDL document exposed by a SOAP service to discover which operations are supported by the service, and it uses that information to facilitate the mapping of fields from the incoming stream to parameters, and fields from the outgoing stream to the return value. We discuss the "Web services lookup" step in more detail later in this chapter in a dedicated SOAP example.

Really Simple Syndication

Kettle offers support for Really Simple Syndication (RSS) input as well as RSS output. RSS is increasingly used to deliver periodically updated data streams. It is very popular for news and blogs, but is also used for publishing stock quotes and currency rates. RSS is discussed in a later section of this chapter.

Apache Virtual File System Integration

Kettle offers integration with the Apache Virtual File System (VFS). This allows you to use URLs in almost any place where you would typically enter a file name. This includes the file name property you enter for the various input steps, but also the name of a transformation file in a Transformation job entry or a job file in a Job job entry. In Spoon, you can also open a job or transformation from a URL (main menu ➪ File ➪ Open from URL) or save to VFS (main menu ➪ File ➪ Save As (VFS)). Because URLs contain a scheme part that identifies the protocol to use, this feature allows you to work not only with local files and resources available on the Web via HTTP, but also with files on a remote FTP server, files stored inside a .tar, .jar, .zip, or .gzip archive, and more.

NOTE For more information on Apache VFS, see http://commons.apache.org/vfs/. For specific information on the supported file formats and the syntax for corresponding URIs, see http://commons.apache.org/vfs/filesystems.html.
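If you want to see what such a layered URL resolves to outside of Kettle, you can call the Commons VFS API directly. The host name in this sketch is made up:

import org.apache.commons.vfs.FileObject;
import org.apache.commons.vfs.FileSystemManager;
import org.apache.commons.vfs.VFS;

public class VfsSketch {
    public static void main(String[] args) throws Exception {
        FileSystemManager fsManager = VFS.getManager();

        // A layered URL: a gzipped file fetched over HTTP.
        FileObject file = fsManager.resolveFile(
                "gz:http://www.example.com/downloads/movies.list.gz");

        System.out.println("exists: " + file.exists());
        file.close();
    }
}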

Data Formats

Although the HTTP protocol requires data encoding, it does not impose any particular type or format for any data transferred by it. However, some formats are certainly more popular than others. The formats we focus on in particular are XML, HTML, and JSON.

XML XML is by far the most ubiquitous data format used by web services, both over the Internet and as exchange format for applications. We assume the reader is already familiar with XML and related technologies such as XPath and XML Schema.

NOTE For a good resource on XML, try Beginning XML by David Hunter et al., Wrox Press, 2007.

Kettle offers a number of steps and job entries for working with XML data in general. We demonstrate a few of these in the sample transformations later in this section, but for the benefit of those readers who are not yet familiar with XML and related technolo- gies, we provide a quick tour that will give you an idea of the level of XML support offered by Kettle.

NOTE In addition to features for working with XML in general, Kettle also has a number of features for working with specific XML applications, such as RSS (Really Simple Syndication) and SOAP (Simple Object Access Protocol).

Kettle Steps for Working with XML

Kettle supports the following generic transformation steps for working with XML:

N The "Get data from XML" step resides beneath the Input category in the left side pane tree view. This step can read XML from a resource such as a file, a URL, or from a field in the incoming stream. Using XPath expressions, it extracts collections of elements and attributes from XML documents, and then turns them into fields and records, which are then pushed into the outgoing record stream. We discuss this step in more detail later in this chapter when we show a transformation for extracting data from XML.
N The XML Output step resides beneath the Output category in the left side pane tree view. It accepts an incoming record stream, and turns it into one or more XML documents, which are then stored as a file.
N The Add XML step resides beneath the Transform category and can be used to generate XML fragments from the incoming record stream. Typically, this is used to build complex XML documents with multiple levels of nested XML element structures. This step is discussed in detail in the transformation for generating XML documents later in this chapter.
N The XSL Transformation step also resides beneath the Transform category. This step applies an XSLT stylesheet on XML documents conveyed by the incoming stream, adding the transformation result to a field in the outgoing stream. A

detailed description of this step, as well as the XSLT language, is outside the scope of this book. Basically, XSLT is a very powerful method for processing XML data, which can be used both for extraction as well as for generating complex XML documents. It should certainly be considered to meet any requirements that are not easily achieved with Kettle's other XML features.
N The XML Join step resides beneath the Joins category. This step has the ability to merge an XML document with a stream of XML fragments, and can thus be used to generate complex, nested XML structures. This step is described in detail in the example transformation for generating XML data later in this chapter.
N The XSD Validator step resides beneath the Validation category. This step can be used to check the validity of an XML document against an XML Schema document. XML Schema is a popular way for defining data types and structures in XML documents. It is often used for defining input and output requirements of web service protocols. Although a discussion of the XML Schema language itself is beyond the scope of this book, we describe this step in more detail later in this chapter when we discuss the sample transformation to extract data from XML documents.

NOTE In addition to the transformation steps described in the preceding list, there are a couple of deprecated XML input steps: the Streaming XML Input step and the XML Input step, which are currently both located in the deprecated folder in the left side pane tree view. These steps are likely to be removed from a future Kettle version, and should not be used when building new transformations.

Kettle Job Entries for XML

Kettle also offers a number of job entries for working with XML: these all reside beneath the XML category:

N The "Check if XML file is well formed" job entry can be used to bulk-check a folder to see if XML files are well-formed.
N The DTD Validator step can be used to validate an XML document against a document type definition (DTD).
N The XSD Validator step is used for validation against an XML Schema, and is the job-entry analogue of the transformation step with the same name. We describe this step in more detail later in this chapter when we present a sample transformation to generate XML documents.
N The XSL Transformation job entry is analogous to the transformation step of the same name.

ECMASCRIPT FOR XML SUPPORT

In addition to the specialized steps and job entries for working with XML, Kettle's Modified Java Script Value step offers an extension to the plain JavaScript language called ECMAScript for XML, more commonly referred to as E4X. The E4X standard is maintained by the ECMA and can be downloaded from http://www.ecma-international.org/publications/standards/Ecma-357.htm.

HTML HTML is an acronym for Hypertext Markup Language. Like XML, HTML has its roots in SGML and, as such, it uses the same basic syntax constructs as XML: elements, attri- butes, and text. However, the similarities stop there. Whereas XML is a general-purpose data exchange format, HTML was designed with the express purpose of creating documents that can be rendered with a web browser and read by a human audience. For that purpose, HTML defines a limited, fixed set of elements and attributes that serve either to structure text documents (using items such as headings, paragraphs, and sections) or style text content (by specifying characteristics such as font and colors). Even though HTML was not designed for general data exchange, a lot of interesting data is available in HTML format. Although more and more organizations are offering proper web services to make some of their data accessible, many still don’t or have no intention of doing so. In those cases, scraping pages from the website may be the only option. Unfortunately, Kettle does not provide any specific features for extracting data from HTML. However, you can use Kettle to download web pages and use tools such as JavaScript, Formulas, and Java user-defined expressions and classes, to extract data using general string manipulation features.

JavaScript Object Notation JavaScript Object Notation (JSON, pronounced Jason) is an emerging standard for data exchange that is increasingly favored over XML for communicating with web services. The JSON format was originally specified by Douglas Crockford in ]iie/$$lll#^Zi[ #dg\$g[X$g[X)+',#imi. Alternatively, see ]iie/$$lll#_hdc#dg\$ for a more acces- sible formulation of the format. A detailed comparison between XML and JSON and their respective benefits and dis- advantages is outside the scope of this book. However, a useful distinction can be made by characterizing XML as a document language and JSON as an object-serialization format. Although both are easily machine-readable, JSON is more easily mapped to data type systems of popular programming languages. Chapter 21 N Web Services 521

Syntax

Syntactically, JSON is a proper subset of the JavaScript language. This means that JSON data structures can be accessed and traversed natively from within JavaScript. The fol- lowing is an example JSON representation of a collection of books:

[
  {
    "title": "Pentaho Solutions",
    "publisher": "Wiley",
    "isbn": "978-0-470-48432-6",
    "price": "50.00 USD",
    "pages": 658,
    "inPrint": true,
    "authors": [
      { "firstName": "Roland", "lastName": "Bouman" },
      { "firstName": "Jos", "lastName": "van Dongen" }
    ]
  },
  {
    "title": "Pentaho Kettle Solutions",
    "publisher": "Wiley",
    "isbn": "978-0-470-63517-9",
    "price": "50.00 USD",
    "pages": 744,
    "inPrint": false,
    "authors": [
      { "firstName": "Matt", "lastName": "Casters" },
      { "firstName": "Roland", "lastName": "Bouman" },
      { "firstName": "Jos", "lastName": "van Dongen" }
    ]
  }
]

The example illustrates the key features of JSON:

N The outermost structure is an array of multiple books. Arrays are denoted using a matching pair of square brackets:P and R. Array entries are separated using a comma. N Books are denoted as object literals. Object literals are denoted using a matching pair of curly braces: p and r. N Object members (properties) appear between the curly braces of object literals. A member is denoted as a key/value pair, which are separated from one another by a colon. Multiple key/value pairs are separated from one another using a comma. N In an object member, the key appears before the colon and must be enclosed in double quotes. The value appears after the colon. N The member value can be any valid JSON value. There are only five types of JSON values: string, number, object, array, or Boolean. N A string literal is denoted between double quotes. In the example listing, the value of the i^iaZ member of the first book is the literal string value EZciV]d Hdaji^dch. N A number can be denoted as a simple integer or floating point number, or using exponent notation. In the example listing, the eV\Zh member of the second book is the integer number ,)). N A Boolean value has one of the literals igjZ or [VahZ as a value. In the example listing, the ^cEg^ci member of the book objects has a Boolean type.

NOTE Despite its ties with the JavaScript programming language, JSON enjoys a wide uptake in other programming environments as well, and free/open source JSON implementations are available for many languages, including C/C++, C#.NET, Java, and many more. You can find links to JSON libraries for many programming languages on the json.org site.
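As a small illustration of how natural JSON access is from Java, the following sketch reads a trimmed-down version of the book collection shown earlier with the org.json library (one of the Java libraries listed on json.org; it is not bundled with Kettle):

import org.json.JSONArray;
import org.json.JSONObject;

public class JsonAccessSketch {
    public static void main(String[] args) throws Exception {
        // A trimmed-down version of the book collection shown above.
        String json = "[{\"title\":\"Pentaho Solutions\",\"pages\":658,\"inPrint\":true},"
                    + " {\"title\":\"Pentaho Kettle Solutions\",\"pages\":744,\"inPrint\":false}]";

        JSONArray books = new JSONArray(json);
        for (int i = 0; i < books.length(); i++) {
            JSONObject book = books.getJSONObject(i);
            System.out.println(book.getString("title")
                    + ", pages=" + book.getInt("pages")
                    + ", inPrint=" + book.getBoolean("inPrint"));
        }
    }
}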

It is often claimed that JSON is much simpler than XML. If the size of the specifica- tion is any indication, this is certainly true. Whereas the XML syntax is defined in 89 grammar rules, JSON is defined in only 15. The essence of JSON can be summarized in just 6 grammar rules. The remaining 9 rules have to do with the lexical definition of strings (3 rules) and numbers (6 rules).

JSON, Kettle, and ETL/DI

Currently, Kettle does not offer any features for consuming or generating JSON. However, because JSON is JavaScript, you can at least use Kettle's Modified Java Script Value step to access JSON data programmatically. We discuss this method in detail in the "JSON Example" section later in this chapter, where we extract data from the Freebase database. Another possibility for working with JSON is to write a plugin yourself.

Apart from the initial method for extracting and accessing data, transformations for working with JSON face many of the problems encountered when working with XML. Both formats are often used for marshalling objects, and a good deal of the transfor- mation logic has to do with disentangling the nested data structures that are used to represent these objects and the relationships between them.

XML Examples

In this section, we take a closer look at Kettle features that enable you to work with XML. It is impossible to provide a generic example because XML is an extremely flexible format and can be used to express a large variety of high-level data structures. However, at a lower level, XML consists of just a few fundamental constructs, and all higher level structures are built in one way or another based on these fundamental types. For this section, we will focus on one particular XML document containing video and actor data. In one example, we describe how to extract data from this XML docu- ment, and store it in the sakila sample database introduced in Chapter 4. In addition, we describe a transformation that works the other way around, which exports data from the sakila sample database to an XML document.

Example XML Document The k^YZdh#mba document is provided as a free sample with Stylus Studio, an XML development environment. You can download the file from ]iie/$$lll#hinajhhijY^d #Xdb$ZmVbeaZh$k^YZdh#mba. In addition to the XML document itself, there’s also an XML Schema file that describes the structure of the XML document. This can be downloaded from ]iie/$$lll#hinajhhijY^d#Xdb$ZmVbeaZh$k^YZdh#mhY.

XML Document Structure

The following snippet illustrates the structure of the videos.xml document:

<?xml version="1.0" encoding="UTF-8"?>
<result>

  <actors>
    <actor id="00000015">Anderson, Jeff</actor>
    ...more actor elements here...
    <actor id="916503210">Sharif, Omar</actor>
  </actors>

  <videos>
    <video id="id1235AA0">
      <title>The Fugitive</title>
      <genre>action</genre>
      <rating>PG-13</rating>
      <summary>...text...</summary>
      <details>...more detailed text...</details>
      <year>1997</year>
      <director>Andrew Davis</director>
      <studio>Warner</studio>
      <user_rating>4</user_rating>
      <runtime>110</runtime>
      <actorRef>00000003</actorRef>
      <actorRef>00000006</actorRef>
      <vhs>13.99</vhs>
      <vhs_stock>206</vhs_stock>
      <dvd>14.99</dvd>
      <dvd_stock>125</dvd_stock>
      <beta>1.03</beta>
      <beta_stock>12</beta_stock>
      <LaserDisk>12.00</LaserDisk>
      <LaserDisk_stock>10</LaserDisk_stock>
    </video>
    ...more video elements here...
    <video>
      ...
    </video>
  </videos>

</result>

The <result> document element contains an <actors> and a <videos> element, containing a collection of <actor> and <video> elements, respectively. The <actor> elements have an id attribute that uniquely identifies actor elements throughout the document and contain a piece of text consisting of the actor's last name, followed by a comma and then the actor's first name. Each <video> element contains a collection of elements that describe some property or characteristic of the containing <video> element, such as <title>, <genre>, and <rating>. A <video> element also contains one or more <actorRef> elements, each of which contains the value of the id attribute of one of the <actor> elements, thereby conveying the fact that the corresponding actor appears in that particular video.
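Outside of Kettle, the same extraction can be sketched with the standard Java XPath API. The expressions below use the element names from the snippet above; the local file path is an assumption, and the code is only meant to illustrate the kind of loop and field expressions you will configure in the "Get data from XML" step later in this chapter:

import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class XPathSketch {
    public static void main(String[] args) throws Exception {
        // Parse a local copy of the document (assumed to be in the working directory).
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new File("videos.xml"));

        XPath xpath = XPathFactory.newInstance().newXPath();

        // One node set selects the repeating <video> elements; relative expressions
        // then pick the fields of each element.
        NodeList videos = (NodeList) xpath.evaluate(
                "/result/videos/video", doc, XPathConstants.NODESET);

        for (int i = 0; i < videos.getLength(); i++) {
            String id    = xpath.evaluate("@id",   videos.item(i));
            String title = xpath.evaluate("title", videos.item(i));
            System.out.println(id + " - " + title);
        }
    }
}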

Mapping to the Sakila Sample Database

For this example, we assume a fairly straightforward mapping between the database tables and the elements in the videos.xml file:

N The <actors> element corresponds to the actor table, and each <actor> element corresponds to a row in the actor table.
N The <videos> element corresponds to the film table, and each <video> element corresponds to a row in the film table.

N The <genre> element inside a <video> element corresponds to a row in the film_category table.
N An <actorRef> element inside a <video> element corresponds to a row in the film_actor table.

The mapping between the XML structure and the sakila database is illustrated in Figure 21-1.


Figure 21-1: The mapping between the videos.xml file and the sakila sample database

Extracting Data from XML

To load the data from the videos.xml file and store it in the sakila sample database, we prepared the import_xml_into_db transformation. The transformation is shown in Figure 21-2.

NOTE The transformation file import_xml_into_db.ktr is available on the book's website (www.wiley.com/go/kettlesolutions) in the folder for this chapter. Successfully running the transformation assumes an existing setup of the sakila sample database and sakila user account. You can find instructions for this setup in Chapter 4.

The transformation also relies on a local copy of the videos.xsd XML Schema, and this is expected to be in the same directory as the transformation. You can download this file from the Stylus Studio website mentioned earlier. Finally, the transformation will attempt to download the videos.xml XML document from the Stylus Studio website, so a working connection to the Internet is required.

Figure 21-2: The import_xml_into_db transformation

Overall Design: The import_xml_into_db Transformation

Before we dig into the gritty details of the import_xml_into_db transformation, it is useful to consider the high-level data flow. At the top of Figure 21-2, you can see the notes that explain what is going on in the steps immediately beneath the notes.

The first step to execute will be the "Execute SQL script" step labeled Reset. In Figure 21-2, this step is located at the top-left corner. This step runs in the initialization phase and does not take part in the actual transformation. Its purpose is to remove any data added to the sakila database on top of the standard data set. This is just a precaution to start with a clean slate before adding new data with this transformation.

After the initialization phase, the transformation proper starts to run by generating a row with the Generate Rows step labeled Generate Row with URLs, which appears at the top-left side of the transformation shown in Figure 21-2. The generated row defines a constant string field that specifies the URL to retrieve the videos.xml file from the Stylus Studio website. This generated row is first fed into the "HTTP client" step labeled "GET videos.xml" to actually retrieve the videos.xml file. The content of the videos.xml file is then added to the stream in its own field. Configuration of the "HTTP client" step is discussed in detail later in this chapter.

After downloading the XML document, it is validated against an XML Schema by the "Validate against videos.xsd" step to ensure it meets the expectations of the transformation. This validation step is the first real XML-related step, and is covered later in this chapter.

After validation, the data is extracted with three "Get data from XML" steps running in parallel. The configuration of this type of step is discussed in detail later in this chapter, but for now, let's focus only on the data flow. The data in the XML document will lead to inserts into all tables shown in Figure 21-2. The transformation needs to maintain data integrity, and therefore it must load the tables in the right order: rows in the film_category and film_actor tables reference the rows in the category, film, and actor tables, so we must ensure the referenced rows are present before loading the referencing rows.

An additional complication is that the sakila database uses surrogate keys, which are automatically generated in the database. The XML document also uses surrogate keys to identify videos and actors, which are used to establish the relationship between videos and actors via the actorRef elements that appear inside the video elements. You can treat the key from the XML document as a business key (see Chapter 8 for a definition of the concept of a business key). In Chapter 8, we mentioned that in data warehouses, business keys can be stored in dimension tables to allow a database lookup of the dimension table's surrogate key based on the business key. However, the sakila database was not designed to be loaded from XML documents like this, and does not have columns for storing the business keys from the XML document. This means that you cannot simply first load the film, actor, and category tables in their entirety before loading the film_actor and film_category tables. If you do that, you have no way of looking up the surrogate keys of the referenced tables when loading the referencing tables. To correctly handle these issues, the transformation uses "Combination lookup / update" steps for loading the actor, category, and film tables.
In addition to adding the rows to the appropriate tables, these steps also add the database-generated keys to the stream. Because the video elements contain a single genre element, film and category are loaded in the same flow, and this means the stream right after these steps already contains both a film_id and a category_id, which can then be added to the film_category table using a simple "Table output" step. (This is the top flow in the import_xml_into_db transformation.) For film_actor, the situation is less straightforward. (This is the middle flow of the transformation.) The film_actor table is loaded from the extracted actorRef elements, and the extract contains the business keys of the videos and their actors. The corresponding surrogate keys for the film_id and actor_id keys in the database are looked up from the film stream (top flow) and actor stream (bottom flow) using two "Stream lookup" steps, after which the film_actor table is loaded using a simple "Table output" step.

Using the XSD Validator Step

If you have access to the XML Schema definition, it is usually a good idea to use it to check XML documents before extracting data. The risk of omitting validation is that the remainder of the transformation could extract bits and pieces of data from it, and even process that data, store it in a database, and so on, only to find out later on that the XML document actually didn’t have the expected structure and/or content. If you have access to the XML Schema, you have the fortunate opportunity to build your transformation with exact prior knowledge of what to expect from the XML document, and in this case a simple check for validity allows you to prevent a lot of potential problems up front. In Kettle transformations, XML Schema validation is done using the XSD Validator step (labeled “Validate against videos.xsd” in Figure 21-3).

NOTE You can also use XSD validation in Kettle jobs using a similar step— there is no generic way to determine which one is more appropriate.

The XSD Validator step resides beneath the Validation folder in the left side pane tree view. The configuration for the “Validate against videos.xsd” step is shown in Figure 21-3.

Figure 21-3: Validating the videos.xml document against the videos.xsd XML Schema

The configuration dialog contains three sections, which we discuss in the paragraphs that follow.

XML Source The first section, “XML source,” is used to specify the properties of the XML document that is to be validated. The XML document is specified through a field in the incoming stream, which can be specified by choosing the appropriate field with the “XML field” combo box. If the “XML source is a file” checkbox is checked, the value of the specified field will be interpreted as a VFS file name. In Figure 21-3, the checkbox is unchecked because the text that makes up the XML document is present as a string, which was added to the stream by the “Get videos .xml” step.

Output Fields In the second section, Output Fields, you can control features of the validation result. The result of validating the XML document is always added as a field to the outgoing stream, and you can use the Result Fieldname property to specify the name of that field. By default, the field is called gZhjai and has a Boolean data type, which will take on one of the Boolean truth values (in Kettle, N and C) to indicate whether the XML docu- ment is valid or invalid, respectively. If you like, you can return the validation result as a custom string value. In this case, you should check the Output String Field checkbox and provide a custom string value in the “Value when XML is valid” and “Value when XML is invalid” properties. In addition to the field for the result, you can also obtain any messages describing the validation result. (This is especially useful when validation fails because it allows you to troubleshoot the problem with the XML document.) Check the “Add validation msg in output” checkbox if you want to obtain the messages and use the “Validate msg field” property to specify the name of the field that should be added to the outgoing stream to convey the validation messages.

XML Schema Definition

The final section, XML Schema Definition, is used to specify the document that contains the XML Schema definition to validate against. Use the XSD Source combo box to specify the origin of the XML Schema. You can choose one of the following options:

- Is a file, let me specify filename: This option means that the XML Schema is contained in a file specified by the XSD Filename property.
- Is a file, filename is defined in a field: This allows you to specify the file name through the value of a field in the incoming stream. If you use this option, you should pick the field using the "XSD Filename field" combo box.
- Is defined inside XML: In this case, the XSD Validator step expects that the document element of the XML document itself has an xsi:schemaLocation attribute that contains the URI for the XML Schema definition document (see the sketch after this list).
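As an illustration of the last option, the document element would carry the schema reference itself, roughly along these lines (the namespace and schema location shown here are hypothetical; for documents without a target namespace, the xsi:noNamespaceSchemaLocation attribute is used instead):

    <result xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
            xsi:schemaLocation="http://www.example.com/videos videos.xsd">
      ...
    </result>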

In Figure 21-3, we used the first option, and used a reference to the built-in ${Internal.Transformation.Filename.Directory} variable to point to the videos.xsd file located in the same directory as the transformation file.

NOTE Although it would have been possible to dynamically retrieve the videos.xsd file from the Stylus Studio website, just like the videos.xml document, that would not be such a good idea. We used the schema as a guideline to construct our transformation. If, for some reason, someone decided to change the XML schema (and the videos.xml document), then the validation could still check out positive, even though the contents of the videos.xml document could be entirely different from what our transformation expects. If you're going to use XML Schema validation, be sure to use it in such a way that validation actually fails if the XML document has different contents from what you expect.

Error Handling

The XSD Validator step supports error handling. In Figure 21-2, you can see that any errors that occur when validating the document are logged to the "validation.error" step, which is of the "Text output file" type and located beneath the XSD Validator step. Error handling would take effect if there is a problem in validating the document—for example, the XML Schema definition itself could be malformed. Don't confuse error handling with failure to validate the XML document. It is not an error if the document cannot be validated because it contains content or data structures that are not compatible with the XML Schema definition. Rather, the validation procedure is successful; it's just that the result is negative because the document is invalid. We discuss this situation in the next subsection.

Checking the Validation Result

Whenever you use the XSD Validator step, it is essential to check its result. The XSD Validator step passes on both valid and invalid documents in its outgoing stream, along with the validation result, and, if you configured the step to do so, any validation messages. It is up to the ETL designer to act upon the validation result and do something meaningful with it. In the import_xml_into_db transformation, the validation result is checked using the "Filter rows" step labeled "Is valid?" This step simply checks the value of the result field added by the "Validate against videos.xsd" step, passing on those rows for which the result is true, and storing the invalid rows to file using the "Text output" step labeled "XML invalid."

Using the “Get Data from XML” Step

After validating the incoming XML document, you can finally start with the actual data extraction. In Kettle this is done using the "Get data from XML" step, which resides beneath the Input category in the left side pane tree view. The import_xml_into_db transformation contains three steps of the "Get data from XML" type:

- /result/videos/video: This extracts the actual elements that represent a video from the XML document, and each of these will eventually end up as a row in the film table in the Sakila database.
- /result/videos/video/actorRef: Each video element contains a variable number of actorRef elements indicating which actors play a part in that particular movie. These will eventually be stored in the film_actor table.
- /result/actors/actor: This extracts the actor elements, which will result in rows for the actor table.

These steps all get handed the same XML document retrieved by the "HTTP client" step, reading it in parallel. We could discuss any one of these instances of the "Get data from XML" step, but both for simplicity and completeness (as will become apparent later on in this section) we settle for the middle one labeled /result/videos/video/actorRef. The configuration dialog for the "Get data from XML" step has three tabs. We will discuss these in detail in the remainder of this section.

The File Tab

In the File tab (see Figure 21-4), you can define the source of your XML document.

Figure 21-4: The File tab of the “Get data from XML” step

The source of the XML document can be specified either using a field in the incoming stream or using the standard file and directory interface similar to that used by the various file input steps. To use a field, check the "XML source is defined in a field" checkbox at the top of the "XML source from field" fieldset in the upper half of the File tab of the configuration dialog. Clear the checkbox to specify one or more directories and regular expressions for finding file names in the lower half of the configuration dialog.

NOTE When developing your transformation, it is usually a good idea to clear the “XML source is defined in a field” checkbox and use the file and directory interface to work with a local copy of a representative instance of the XML documents you plan to work with. Using a file greatly simplifies con- figuring other properties of this step, because in this case, Kettle can always locate the XML document, parse it, and offer suggestions for XPath selectors. When a field is specified as source, the contents of the field are known only at runtime when the transformation is being executed, not at design time when you’re configuring the step. You can always switch back and forth between using a field and a file.

If the “XML source is defined in a field?” checkbox is checked in order to use a field, you must specify the field name using the “get XML source from a field” combo box, which is the last item in the “XML source from field” fieldset. Finally, you need to specify how Kettle should interpret the contents of the field:

- Checking the "XML source is a filename?" checkbox specifies that the value of the field specified by the "get XML source from a field" property is a file name.
- Checking the "Read source as Url" checkbox specifies that the value of the field specified by the "get XML source from a field" property is a URL rather than a filename.
- If neither the "XML source is a filename" nor the "Read source as Url" checkbox is checked, Kettle expects the field to contain the XML document as text.

In the import_xml_into_db transformation, we used the last option because this allows us to perform only a single HTTP request (which is slow) to retrieve the XML document, passing it on to be read in parallel by the three "Get data from XML" steps. Any of the other configuration options for the XML source would have required another HTTP request.

The Content Tab

In the Content tab, you can specify how to extract rows from the XML document. The Content tab of the /result/videos/video/actorRef step from the import_xml_into_db transformation is shown in Figure 21-5. The most important property on this tab is Loop XPath. Here, you can specify an XPath expression. This XPath expression will be applied to the XML document and evaluates to a node set, which is typically a collection of XML elements. The "Get data from XML" step will loop over the nodes in this set, and generate one row in the outgoing stream for each iteration. For this step, the Loop XPath property is configured to be /result/videos/video/actorRef[text()].
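To picture what this Loop XPath selects, consider a simplified excerpt of the videos document (abridged, with element content shortened for readability):

    <result>
      <videos>
        <video id="2">
          ...
          <actorRef>19</actorRef>
          <actorRef>85</actorRef>
        </video>
      </videos>
      <actors>
        <actor id="19">...</actor>
      </actors>
    </result>

Applied to this excerpt, the expression selects each actorRef element that has text content, so the step would emit two rows, one per actorRef.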

Figure 21-5: The Content tab of the “Get data from XML” step

If you specified a file as source in the File tab, you can use the "Get XPath nodes" button to help fill in the Loop XPath property. This will scan the XML document and generate a list of XPath expressions for all levels of element node sets (see Figure 21-6). Other options on the Content tab include:

- The Encoding list box: Can be used to explicitly set the character encoding used for the XML document. This option is useful in case the XML document does not specify its own encoding. Typically, XML documents start with an XML declaration that specifies the encoding. For example:

  <?xml version="1.0" encoding="UTF-8"?>

- Namespace aware?: Check this if the document uses namespaces.
- Ignore comments?: Typically, XML comments count as nodes. If you want to ignore comments, check this option.
- Validate XML?: Check this if you want this step to use DTD validation prior to extracting data.
- Use tokens: This option applies to settings configured on the Fields tab, and we discuss it in the next subsection.

Figure 21-6: The picklist invoked by the Get XPath Nodes button

- Ignore empty file: Check this if you do not consider it an error if the file specified as XML source is empty.
- Do not raise an error if no files: Check this if you're using files as source and it is not an error if no files were found.
- Limit: Use this to limit the number of rows that are to be generated. The default is zero, which means one row should be generated for each node in the node set retrieved by the Loop XPath expression.
- Prune Path to handle large files: Typically, the XML document is read in one go, and the Loop XPath expression is applied to the entire document. But if the XML document is very large, the resulting node set may be too large to fit into memory. In these cases, you can specify an XPath expression for this property that allows Kettle to apply the Loop XPath expression to chunks of the XML document (see the sketch after this list). This property does not support the full XPath syntax: You can only specify element names separated by slashes. Predicates and namespaces are not allowed. In addition, the Prune Path must select chunks that are at the same or a higher level as the node set defined by the Loop XPath expression.
- Include filename in result and Filename fieldname: If using a file as source for the XML document, you can check the checkbox to include the filename as a field in the outgoing stream and specify the name of the new field.
- Rownum in output and Rownum fieldname: Check the checkbox in case you want to add a field that contains a sequence number for the generated rows, and specify the name of the new field.
- Add to result filename: If using files as source, then the files will be added to the list of file results, which allows you to process these files in the parent job (if any).
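As a sketch of the pruning option: for a very large videos document, a Prune Path such as the following (an illustrative value based on this document's structure, not taken from the sample transformation) would let Kettle apply the Loop XPath expression to one video chunk at a time instead of materializing the entire node set in memory:

    /result/videos/video

Each chunk carved out by the Prune Path is processed independently, which keeps memory usage bounded.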

The Fields Tab

The Fields tab (see Figure 21-7) is used to specify how Kettle should extract fields from the node set retrieved by the Loop XPath expression that you specified in the Content tab.

Figure 21-7: The Fields tab of the “Get data from XML” step

As you can see in Figure 21-7, fields are also defined using the XPath syntax. In the Name column of the grid, you specify the name of the field that is to be added to the outgoing stream. In the XPath column, you specify where the field gets its value from using an XPath expression. This XPath expression is executed relative to the current node from the node set retrieved by the Loop XPath expression. For regular evaluation of the XPath expression, the Element column should be set to Node (as shown in Figure 21-7). You can set the Element column to Attribute instead, in which case the expression in the XPath column is evaluated differently to match only attributes. For instance, the second line in the grid shown in Figure 21-7 uses the XPath expression ../@id, which means: get the value of the id attribute of the parent node of the current node. The same result could have been achieved by setting the Element column to Attribute, and writing ../id instead. Note that with the Element column left at Node, ../id would be evaluated as regular XPath and would mean: get the value of the id element beneath the parent node of the current node, which is something entirely different.

In Figure 21-7, the actor_id field gets its value from the text contents of the actorRef elements. The video_id field uses the expression ../@id to look up from the current actorRef element to the video element that contains the current actorRef element, from which it extracts the value of the id attribute. The outgoing stream now has two fields containing the values that are used internally in the XML document to identify the video and actor elements. These are used downstream in the transformation to look up the corresponding database key values so they can be used to add new rows to the film_actor table.
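To see how these two expressions resolve, consider one actorRef in context (a fragment consistent with the generated example shown later in this chapter):

    <video id="2">
      ...
      <actorRef>19</actorRef>
      ...
    </video>

With the current node positioned on the actorRef element, the expression that selects its text content yields 19 for actor_id, while ../@id climbs to the enclosing video element and yields 2 for video_id.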

Using Tokens

The XPath expressions in the Fields tab support a non-standard extension called tokens. Tokens are a way to bind values of fields to the XPath expression in order to parameterize it. This is best explained with an example. Let's continue with our /result/videos/video/actorRef step from the import_xml_into_db transformation, and suppose that in addition to the actor_id and video_id, you would also like to retrieve the name of the actor. Once you extract the text contents of the actorRef element, you have the value of the id attribute of the actor element, so it would seem that you should be able to select the element to extract the name. Or could you? Provided you have a specific value for the actor's id, you can certainly write an XPath expression that does the job. For example, if the actor id equals 1234, then something like /result/actors/actor[@id=1234]/text() does the job. But the problem is that you don't have literally 1234. Instead, all you have is the field actor_id, which gets its value from the text() expression that is evaluated in context of the current actorRef element. You can't just substitute the 1234 in the previous expression with text() and expect it to work. The expression /result/actors/actor[@id=text()]/text() simply means: Get the text content inside the actor nodes for which the value of the id attribute equals the value of its text content. It doesn't work because the expression text() would be evaluated in the context of the actor element, not in the context of the actorRef element from which you extracted the value for the actor_id field. What you need is some mechanism to plug in the value of the actor_id field itself. This is exactly what tokens are meant for. To use the actor_id field in the XPath expression for looking up the actor's name, you have to write the expression like this: /result/actors/actor[@id=@_actor_id-]/text(). The expression @_actor_id- is the token, and it consists of an at sign (@), followed by an underscore (_), followed by the field name actor_id, and then followed by a dash (-). In addition, the "Use tokens" checkbox on the Content tab must be checked to use a token like this. There are a number of limitations to using tokens:

- The field upon which the token is based (actor_id in the example) must appear before the field that uses it as token.
- In Kettle versions prior to 4.0, you can use only one token in a field. This limitation was lifted for Kettle 4.0.
- Token syntax is available only for XPath expressions in the Fields tab—not in the XPath expression in the Content tab.
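To make the token mechanism concrete: for a row where the actor_id field holds the value 19, the step resolves the tokenized expression to the equivalent of the following before evaluating it (shown purely for illustration):

    /result/actors/actor[@id=19]/text()

In other words, the token is replaced by the current row's actor_id value each time a new row is processed.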

NOTE The transformation file export_xml_from_db.ktr is available on the book's website in the folder for this chapter. Successfully running the transformation assumes an existing setup of the sakila sample database and sakila user account. You can find instructions for this setup in Chapter 4.

Generating XML Documents

To export data from the sakila database into a format like that of the videos.xml file discussed earlier, we prepared the transformation export_xml_from_db shown in Figure 21-8.

Figure 21-8: Generating XML

The transformation uses multiple instances of two particular types of Kettle steps that are especially designed for generating XML: the "Add XML" step and the "XML join" step. Before you look into the details of configuring these steps, let's first consider the main data flow of the export_xml_from_db transformation.

Overall Design: The export_xml_from_db Transformation

The best way to understand the export_xml_from_db transformation is to focus on the bottom stream in Figure 21-8. This is the main stream of the transformation that is responsible for XML document construction. The bottom stream starts with a Generate Rows step, which generates one row having three empty string fields called actors, videos, and video_template. This row is sent to the step labeled "Document template XML," which is of the Add XML type. We discuss the exact configuration and operation of this step later in this chapter, but for now it is enough to know that this step adds a new string field to the stream containing an empty but complete XML document:

<?xml version="1.0" encoding="UTF-8"?>
<result>
  <video_template/>
  <actors/>
  <videos/>
</result>

If you compare this listing to the one of the videos.xml provided in the beginning of this section, you will notice that this document already has the correct top-level structure; it is missing only the contents of the video_template, actors, and videos elements. The export_xml_from_db transformation actually does nothing to fill in the video_template element. Even in the original videos.xml file, this is an essentially static element that is not particularly interesting. Instead, we will focus on generating the XML content of the actors and videos elements. All but the bottom stream of the transformation start with a "Table input" step, resulting in parallel extraction of data from the Sakila database. You will notice all tables mentioned in the mapping shown previously in Figure 21-1 are present, and in addition, an extra step to extract the stock count of each film. For real-world transformations, one might not find it particularly desirable to have too many of these extractions going on in parallel, but we'll gloss over this detail for this particular sample. For now, it is enough to mention that the parallel extraction is of no consequence for XML document construction: Although the actual XML generation may occur in a parallel fashion, XML document construction is always a serialized operation because there is only one document at all times.

The streams coming from the "Table input" steps labeled film_category and "inventory (stock count)" are merged into the stream coming from the film table via some denormalizing operations such as look up (see the steps "Lookup stock count" and "Lookup genre") and aggregation (see the step "Group by film_id, concat categories"). The result is a stream that still has the granularity of the original film stream, but with additional fields from detail tables, which are then flattened to become attributes of individual films. The original streams originating from the "Table input" steps eventually result in three main data streams, which are then led into the Add XML steps labeled Actor XML, Video XML, and "actorRef XML." This takes care of the actual XML generation, and results in a collection of the appropriate XML fragments. These streams are then led down to the bottom stream to participate in document construction. This is achieved using several "XML join" steps. We discuss the configuration and operation of the "XML join" step in detail later in this chapter, but for now it is enough to note that this step merges the collection of generated fragments into the XML document coming in from the main stream. So, after the Join Actor XML step, the actors element from the skeleton document will be filled in and contain actor elements, and after the Join Video XML step, video elements will be added into the document's videos element. The final XML join step, "Join actorRef XML," concludes document construction by merging actorRef XML elements into their respective video element. The last step of the transformation simply writes the resulting XML document to file using an ordinary File Output step. The result is one XML document containing all film-related data from the Sakila database.

Generating XML with the Add XML Step

The Add XML step is used for the actual generation of XML elements. It resides beneath the Transform category in the left side pane tree view. For each row in the incoming stream, the Add XML step emits one row to the outgoing stream, adding a new string field that contains an XML element. Through the configuration dialog, you can control the name of the generated element, its attributes, and its content. The configuration dialog has two tabs: Content and Fields.

The Content Tab

In Figure 21-9, you can see the Content tab of the "Video XML" step from the export_xml_from_db transformation shown in Figure 21-8.

Figure 21-9: The Content tab of the Video XML step

The name "Content tab" is somewhat confusing, as you use it to control the properties of the generated XML element, not its contents. You can use the Output Value property to configure the name for the field that contains the element. The "Root XML element" is where you specify the element name. Note that currently, the element name is a string constant—you cannot specify a field to set the element name dynamically. You can use the Encoding list box to choose a character encoding (default is UTF-8). The "Omit XML declaration" checkbox can be checked to generate only the XML for the element. This is typically what you need for generating pieces of XML that are later merged into a document. For the outermost top-level element, you typically want to clear the checkbox to generate an XML document with the XML declaration. For example, in the export_xml_from_db transformation, this checkbox is checked in all steps of the Add XML type, except for Document Template XML, because this step generates the document element. The "Omit null values from XML result" checkbox can be used to control the representation of NULL data. It is a bit confusing in the context of this tab because it exerts control over the content of the generated element, not the generated element itself. Therefore, this option is explained in more detail in the following section.

The Fields Tab

You can configure the Fields tab to control how the fields from the incoming stream are used to create the content and/or attributes of the generated elements. The Fields tab for the Video XML step is shown in Figure 21-10.

Figure 21-10: The Fields tab of the Video XML step

Each row entered into the grid represents a field from the incoming stream that will be used to generate XML. Use the "Field name" column to specify which fields you want to use. There are four different ways for the fields from the incoming stream to contribute to the generated XML:

- Generate child elements of the "Root XML element" using the field value as text content. You can use the "Element name" property to control the name of the generated elements. For example, in Figure 21-10, the description field is listed twice and will generate a child element called summary and another child element called details. If you omit the "Element name," then the field name will be used as element name. For example, the title, genre, and rating fields all end up as child elements with the name equal to the field name.
- Generate attributes of the "Root XML element" using the field value as the attribute value. For this setting, you must set the Attribute property to Y, and the Attribute Parent Name must be left blank. In this case, the "Element name" property will be used as the attribute name. If no "Element name" is specified, the "Field name" will be used as attribute name instead. For example, in Figure 21-10, the film_id field is used to generate an id attribute of the "Root XML element."
- Use the field value as text content of the "Root XML element." As far as the configuration is concerned, this is almost like the first option, except that you must use whatever name you specified as "Root XML element" for the "Element name" property in the field grid. Although the configuration change is small, the difference in effect is quite large: No child element is generated, and the value of the field is dumped as text inside the root element instead. To see an example of this configuration, take a look at the "actorRef XML" step. This lists only the actor_id field in the Fields tab, using actorRef as "Element name." The identifier actorRef is also used as the "Root XML element" in the Content tab, resulting in generation of elements such as <actorRef>1234</actorRef>, where 1234 is the value of the actor_id field.
- Generate attributes of the child elements of the "Root XML element." Configuring this option is almost like the second option (generating attributes of the "Root XML element"). The only difference is that in this case, you should enter the name of the element that is to receive the attribute in the "Attribute Parent name" property in the grid.

If the field has a NULL value, the default behavior is to generate an empty element or attribute value. You can also choose to omit such elements and attributes by checking the "Omit null values from XML result" checkbox on the Content tab. The following is an example of an XML fragment generated by the Video XML step for one particular input row:

<video id="2">
  <title>ACE GOLDFINGER</title>
  <genre>Horror</genre>
  <rating>G</rating>
  <summary>
    A Astounding Epistle of a Database Administrator
    And a Explorer who must Find a Car in Ancient China
  </summary>
  <details>
    A Astounding Epistle of a Database Administrator
    And a Explorer who must Find a Car in Ancient China
  </details>
  <year>2006</year>
  <dvd>12.99</dvd>
  <dvd_stock>3</dvd_stock>
  <actorRef>19</actorRef>
  <actorRef>85</actorRef>
  <actorRef>90</actorRef>
  <actorRef>160</actorRef>
</video>

Using the XML Join Step

Although the Add XML step can be used to generate XML elements with element content, the level of nesting is limited to only one level. Another limitation is that you can only generate a fixed set of child elements, which must be known in advance. The XML Join step provides an answer to these limitations. The XML Join step can be used to merge a collection of XML elements in a stream (the source) into another single XML fragment (the target). So although this step in fact joins two streams of XML in the sense of uniting two distinct things, one should not compare this to a database join operation. The best way to explain this is to look at an example. Let's consider the Join Actor XML step from the export_xml_from_db transformation. The configuration for this step is shown in Figure 21-11.

Figure 21-11: The configuration of the Join Actor XML step

Target

In this case, the target is the empty template document, which enters the Join Actor XML step via the Document Template XML step in the bottom flow of the transformation. The target is configured by setting the "Target XML step" property to the name of the step from where the target stream originates, and the "Target XML field" is the field in the outgoing stream of that step that contains the XML fragment that will receive the XML elements from the source step.

Source

The source is the Actor XML step, which delivers a collection of actor XML elements (namely, one for each row in the incoming stream of the Actor XML step). This is configured by setting the "Source XML step." The "Source XML field" has to be set to the field in the source step's outgoing stream that contains the XML fragments that are to be merged into the target.

Join Condition

In the "Join condition properties" fieldset, you can control where in the target XML fragment the source is to be inserted. This is done by specifying an XPath expression in the "XPath statement" property. In Figure 21-11, the XPath expression is /result/actors, which means that the actor elements coming in from the source stream are to end up as content of the actors element beneath the result element in the target fragment.

Result Stream

In the "Result stream properties" fieldset, you can set the name of the field in the outgoing stream that is to contain the resulting target fragment. In addition, you can set things such as the Encoding, whether or not to omit the XML declaration, and how to handle NULL values. These properties are similar to the properties of the same name we discussed for the Add XML step. Here's an example of what the target document looks like after the "Join actor XML" step:

<?xml version="1.0" encoding="UTF-8"?>
<result>
  <video_template/>
  <actors>
    <actor id="1">

Complex Join Conditions

For both the Join Actor XML and Join Video XML steps, the task of merging the source stream with the target document is quite simple: The XML elements coming in from the source stream are simply dumped in a single container element. For the "Join actorRef XML," it's not so simple. The task of the "Join actorRef XML" step is to merge the actorRef elements coming in from the "Add actorRef XML" step into the target document. As we mentioned several times before, actorRef elements are child elements of the video elements to indicate which actors appear in which video. So the difference as compared to the other XML Join steps is that, in this case, there are multiple video elements in the target that are to receive actorRef elements, and we must be sure to add a particular actorRef element only to those video elements if the respective actor does, in fact, appear in that video. In other words, we need some way to correlate the source and target elements. You can correlate source and target elements by using a so-called complex join condition. The "Join actorRef XML" step uses such a condition (see Figure 21-12). To enable complex join conditions, the "Complex join?" checkbox has to be checked. The XPath expression used for the "Join actorRef XML" step uses a predicate with a ? placeholder to seek out a specific video target element. In addition, the "Join comparison field" is set to the name of the film_id field from the source stream. For each row in the source stream, the placeholder is bound to the value of the "Join comparison field," and the XPath expression is evaluated to find the appropriate target element. This way, the actorRef elements are merged one at a time into whatever video element is indicated by the value of the film_id field in the source stream. Like the tokens supported by the "Get data from XML" step, the ? placeholder is not standard XPath.
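Although Figure 21-12 shows the actual configuration, an XPath statement for such a complex join would take roughly the following form (a sketch based on the document structure discussed in this chapter, not copied from the figure):

    /result/videos/video[@id=?]

For each incoming actorRef fragment, the ? is bound to that row's film_id value, and the fragment is merged into whichever video element the resulting expression selects.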

Figure 21-12: Correlating source and target elements using complex joins in the “Join actorRef XML” step

SOAP Examples

In this section, we take a closer look at using the “Web services lookup” step to call SOAP-based web services.

Using the "Web services lookup" Step

The step within Kettle that does all the SOAP magic is "Web services lookup." This step resides beneath the Lookup category in the left side pane tree view. Unlike the "HTTP client" step, the "Web services lookup" step doesn't have to be kick-started by feeding it a row with the Generate Rows step (or any other step that can output at least one row to the next step). The "Web services lookup" step uses the Web Services Description Language (WSDL) to retrieve web services. See http://en.wikipedia.org/wiki/Web_Services_Description_Language for more information on WSDL.

Configuring the “Web services lookup” Step

Kettle ships with one working example of the "Web services lookup" step. This sample resides in the samples/transformations directory beneath the Kettle installation directory and is called "Web services lookup - convert degrees Celsius to Fahrenheit." We will now discuss the configuration options of the "Web services lookup" step using this sample.

The Web Services Tab

The configuration dialog of the "Web services lookup" step always has one tab labeled "Web Services" (see Figure 21-13).

Figure 21-13: The Web Services configuration dialog

The most important item in this tab is the URL. Here, you must enter the URL that returns the WSDL document that describes this web service. If the web service requires authentication, you also need to enter your credentials in the Http Login and Http Password fields in the "Http Authentication" fieldset in the middle section of the configuration dialog. If you do not have direct access to the Internet and use a proxy instead, you should enter appropriate values in the Proxy Host and Proxy Port properties in the "Proxy to use" section at the bottom of the configuration dialog. After entering the URL, press the Load button. This will perform a request for the WSDL description, and automatically populate the Operation combo box. In our example, the box will contain four entries: CelsiusToFahrenheit, FahrenheitToCelsius, WindChillInCelsius, and WindChillInFahrenheit. These are the methods implemented by this web service, and you must choose which one you want to use. After choosing a value for the operation, Kettle creates two more tabs: one tab to specify the parameters for the web service, and one tab to control how to process the result returned by the web service.

The "in" Tab

The "in" tab is automatically added after you choose an operation that requires input parameters. It features a grid for entering the input parameters for the specified operation of the web service. The "in" tab of the temperature conversion example transformation is shown in Figure 21-14.

Figure 21-14: The “in” tab

The WS Name and WS Type columns of the grid are automatically populated with parameter names and types respectively after pressing the Load button. You still need to map the fields from the incoming stream to these parameters in the Name column.

The Result Tab

The Result tab can be used to map the result of the web service call to the output stream. The configuration for the temperature conversion example is shown in Figure 21-15.

Figure 21-15: The results tab

Initially, the grid in the result tab is empty, but you can use the Fields button to populate it. You can use the Name column to map the fields from the response to the output stream.

Accessing SOAP Services Directly

WSDL leaves room for different implementations and dialects, of which only a part is supported by Kettle at the time of this writing. If the "Web services lookup" step doesn't work as expected, returns an empty set, or returns an error message, you can check the "Return the complete reply from the service as a String" checkbox to receive the raw XML response from the service, and parse that using a "Get data from XML" step. Alternatively, you can try to access the service using an "HTTP client" or HTTP Post step and parse the response using a "Get data from XML" step. A common use case for web services is to retrieve master data such as country codes and names. There are many publicly available web services that let you retrieve this data. One example is the collection of country info web services available at http://www.oorsprong.org/websamples.countryinfo/CountryInfoService.wso. The website lists a large collection of different operations that can be called, including ListOfCountryNamesByName, which we'll use in the following example.

TIP An excellent way to test and play with web services outside Kettle is to use the open source tool soapUI. You can find it at http://www.soapui.org.

The challenge with using web services in Kettle is that you need to find out what the output looks like in order to parse the resulting XML correctly. This is where a tool like soapUI might come in handy. It’s also possible to use the preview options in Kettle for this. Just create a new transformation and add a Generate Rows step with the complete URL, as shown in Figure 21-16.

Figure 21-16: Generate Rows with URL

Then, add an “HTTP client” step to call the web service, get the data from this URL, and pass it on as an XML file, as shown in Figure 21-17.

Figure 21-17: Call web service

NOTE The “HTTP client” step issues an HTTP GET request. Some SOAP web services may not be designed to respond to GET requests and require a POST request instead. In those cases, try using the HTTP Post step instead of the “HTTP client” step. The HTTP Post step also resides beneath the Lookup cat- egory in the left side bar tree view.

In order to enter the correct values in the next "Get data from XML" step, you first need to find out what the actual result string looks like. You can do this by issuing a preview on the "HTTP client" step. Figure 21-18 shows that the result is a single XML row with the root element name ArrayOftCountryCodeAndName and repeating element tCountryCodeAndName. The elements themselves are called sISOCode and sName.

Figure 21-18: Examine preview data
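Based on those element names, the response body has roughly the following shape (a simplified sketch with example values; the actual service response may contain additional attributes):

    <ArrayOftCountryCodeAndName>
      <tCountryCodeAndName>
        <sISOCode>AF</sISOCode>
        <sName>Afghanistan</sName>
      </tCountryCodeAndName>
      <tCountryCodeAndName>
        <sISOCode>AL</sISOCode>
        <sName>Albania</sName>
      </tCountryCodeAndName>
      ...
    </ArrayOftCountryCodeAndName>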

This is all the information you need to correctly parse the XML. First you mark the first checkbox in the "Get data from XML" step ("XML source is defined in a field?") and specify CountryXML as the source field. Then, in the Content tab of the step, the value of Loop XPath is set to the value ArrayOftCountryCodeAndName/tCountryCodeAndName. The final step is to list the XPath values of the elements we want to retrieve and the output column names we want to use for them, as shown in Figure 21-19.

Figure 21-19: Get XML data

The final step in this example is "Select values" from the Transform category, which gets rid of the XML data, passing on only the ISO code and country names as stream fields. The preview of this step, as displayed in Figure 21-20, shows that we've accomplished our goal of getting a list with country codes and names.

Figure 21-20: Country code and name preview

JSON Example

We already mentioned that Kettle does not have any special features for dealing with JSON but that you can use the Modified Java Script Value step to evaluate the JSON to a JavaScript data structure. What we didn’t discuss yet is how to access the data within a JavaScript data structure and turn that data into rows and fields for the outgoing stream. We will discuss in detail how to do this using an example to extract data from the Freebase open database project. In order to understand this example, it is necessary to provide a little background information about the Freebase project, its web services, and the way these web services work with JSON.

The Freebase Project

We will base our example on data and services offered by the Freebase project. The Freebase project is a large database containing information on all kinds of topics. Freebase is sponsored by a company called Metaweb, which owns the (proprietary) database software on which Freebase runs. The Freebase data, however, is accessible to everybody through a Creative Commons license.

NOTE For more information about Freebase, read the Freebase FAQ at http://wiki.freebase.com/wiki/FAQ. For more information about Metaweb, see the Metaweb FAQ at http://www.metaweb.com/faq.

Freebase Versus Wikipedia

Freebase resembles Wikipedia in that it is an open and collaboratively collected and maintained encyclopedia. What distinguishes it from Wikipedia is that its data is quite rigorously structured in entities and relationships and that these entities and relationships can be accessed and manipulated through a query language that is available through web services. Whereas Wikipedia is intended primarily as a human-readable knowledge base application, Freebase is essentially a knowledge base infrastructure, allowing developers to build all kinds of applications, including but not limited to Wikipedia-like encyclopedia applications. To source its data, Freebase uses an open collaboration model with a moderation system similar to that of Wikipedia. In addition, many topics contained in Freebase are directly sourced from Wikipedia itself.

Freebase Web Services

The principal means to access and manipulate data in Freebase is to use web services. The web services offered by Freebase fall into the category of REST APIs. Although libraries are available for various programming languages, the API exposed by these libraries simply wraps around the web services, hiding the details of issuing an HTTP request, receiving the response, and processing the result.

NOTE For an overview of the web services offered by Freebase, see http://www.freebase.com/docs/web_services.

Freebase offers a number of web services for a number of operations, including authentication, reading and writing data, search, and more. All services operate through HTTP GET and POST requests, and use JSON exclusively as a data format, both for specifying parameters in the request as well as encoding the data in the response. To make it easy to integrate Freebase into modern web applications, all web services support a callback parameter. Specifying the callback parameter causes the JSON response to be returned as an argument to a JavaScript function call, where the callback parameter specifies the name of the function. This technique is known as JSONP and is used predominantly in web browser applications to get an automatic notification once the service request returns a response. When evaluating the response as JavaScript, the callback function is called and passed the actual JSON response so it can be processed.
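To illustrate the mechanism (the function name here is hypothetical): a request that adds callback=handleResult to the service URL returns the JSON envelope wrapped in a function call, so that evaluating the response as JavaScript immediately invokes the handler:

    handleResult({
      "code": "/api/status/ok",
      "result": [ ... ]
    });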

The Freebase Read Service

For our example, we will focus on the Freebase read web service, which is used to retrieve data from Freebase. The read web service can be called with an HTTP GET or POST request using the following URL: http://api.freebase.com/api/service/mqlread. You can enter this URL directly in the browser address window, and you will receive the following response:

{
  "code": "/api/status/error",
  "messages": [
    {
      "code": "/api/status/error/input/invalid",
      "info": {
        "value": null
      },
      "message": "one of query=, or queries= must be provided"
    }
  ],
  "status": "400 Bad Request",
  "transaction_id": "cache;cache01.p01.sjc1:8101;2010-04-06T17:36:16Z;0050"
}

The response is a human-readable (well sort of, anyway) JSON object, and most people will recognize that the service is trying to explain that the request is not con- sidered valid. This is because the read service requires a query that specifies exactly which data you want to retrieve.

The Metaweb Query Language

Freebase queries must be specified in the Metaweb Query Language (MQL) and passed to the service via a parameter called query. So the URL would become:

http://api.freebase.com/api/service/mqlread?query=...mql query here...

The MQL query itself is a small piece of JSON that provides a kind of query by example to describe what kind of data you want to retrieve. A detailed explanation of MQL is outside the scope of this book, but we can illustrate its essence with a simple query example. The following snippet is a valid MQL query to retrieve all film directors stored in Freebase:

[{
  "type": "/film/director",
  "name": null
}]

As you can see, the query is a JSON array containing one object literal with two members, type and name. This object acts like an example or template for the object instances that you want to find. To execute this query, you need to wrap it inside another JSON structure called a query envelope and append it to the Freebase service URL like this:

http://api.freebase.com/api/service/mqlread?query=
{"query":[{"type":"/film/director","name":null}]}

In the snippet, the envelope and MQL query appear on a separate line. In reality, this should be entered as a single line. When you execute this in the browser, the result looks like this:

{
  "code": "/api/status/ok",
  "result": [
    {
      "name": "Blake Edwards",
      "type": "/film/director"
    },
    ...many more directors go here...
    {
      "name": "Andrew Stanton",
      "type": "/film/director"
    }
  ],
  "status": "200 OK",
  "transaction_id": "cache;cache03.p01.sjc1:8101;2010-04-06T18:51:11Z;0023"
}

NOTE By default, Freebase limits the number of returned results to 100 items for reasons of performance and scalability. It is possible to retrieve more results by issuing multiple requests. More information on this topic can be found in the section about retrieving large resultsets in the MQL Reference Guide at http://www.freebase.com/docs/mql/ch04.html#cursors.

As you can see, the response is also a JSON object, which is called the result envelope. The result envelope contains a few properties such as code, status, and transaction_id, which convey information about executing the query. You already encountered these properties when you first invoked the service. Just as the MQL query was wrapped in the query envelope as the query member, the actual query result is contained in a result member of the result envelope. If you compare this result with the original MQL query, you should notice that the JSON structure of the query is mirrored in the result: Both the MQL query and its result have the form of a JavaScript array, and the result array entries have the same properties as the single object entry contained in the array specified by the query. This is shown in Figure 21-21.


Figure 21-21: The MQL query and result envelopes

In the query, the type property denotes which kinds of entities to retrieve, and as such it is analogous to the FROM clause in an SQL query. This property is repeated as-is in the result entries. The query specified a null for the name property, and in MQL this means that the values for the name property should be filled in in the query result. This can be thought of as an expression in the SELECT list of an SQL query.

NOTE Although it is possible to some extent to point out analogies between MQL and SQL queries, it’s important to realize that Freebase is not a relational database. The analogies are very superficial and serve only to give you a quick idea how to get things done with MQL.

Now that we mentioned a few analogies to the SQL language, you might wonder what MQL construct is analogous to the SQL WHERE clause. MQL does not offer condition-based filtering in the same way as SQL, but instead supports a kind of querying by example. To find instances having a particular property, you simply specify that property in the query. Executing the query then automatically filters for only those objects that resemble the example provided by the query while filling in the blanks (that is, the properties with null values); the query sketch following this paragraph illustrates the idea. The MQL basics we just discussed should be just enough to ensure that you can follow the actual Kettle example.
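For instance, using only the film type and properties that appear in this chapter's example, the following illustrative query (not one used by the book's transformation) would return the names of all films that carry the genre Thriller; the filled-in genre value acts as the filter, while the null asks for the name to be returned:

    [{
      "type": "/film/film",
      "name": null,
      "genre": "Thriller"
    }]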

NOTE There is much more to the MQL query language than the simple constructs we just discussed: MQL is a full-fledged database query language with features resembling joins and subqueries, comparison operators, aggregation, built-in history, and much more. For more information, visit the Freebase developer zone, in particular the MQL reference guide at http://www.freebase.com/docs/mql. The developer zone also features useful applications for working with Freebase and MQL, such as the schema explorer, which is useful for discovering what types of entities are stored in Freebase, which properties they have, and which relationships they have with other entities: http://schemas.freebaseapps.com/. For writing MQL queries, we highly recommend the MQL query editor (http://www.freebase.com/app/queryeditor/), which offers a much more user-friendly interface to writing MQL queries than the browser address bar.

Extracting Freebase Data with Kettle

The transformation we prepared extracts films stored in the Freebase database and is simply called freebase-films.ktr. It can be obtained from the download area of this book's website, in the folder for Chapter 21. Figure 21-22 shows what the transformation looks like.

Figure 21-22: The freebase-films transformation

Generate Rows

The transformation works by first generating one row with the Generate Rows step. The configuration for the step is shown in Figure 21-23.

Figure 21-23: Generate Rows

This row contains a few constant string fields:

- url: The URL of the Freebase read service:

  http://api.freebase.com/api/service/mqlread

- mql: The MQL query envelope with a query to retrieve the Freebase films, along with any genres:

  {"query":[{"type":"/film/film","name":null,"genre":[]}]}

- name and genre: Both are empty but will be filled later on in the transformation after we have processed the result envelope.

Issuing a Freebase Read Request

The generated row is fed into the “HTTP client” step labeled “Freebase read service.” The configuration for this step is shown in Figure 21-24.

Figure 21-24: Invoke the Freebase read service

The "HTTP client" step is configured to accept the URL from the url field of the incoming stream. The response will be placed in the result field of the outgoing stream. To pass the MQL query to the Freebase read web service, the parameters grid maps the mql field from the incoming stream to a query parameter. This will result in responses like the following:

{
  "code": "/api/status/ok",
  "result": [
    {
      "genre": [
        "Black comedy",
        "Thriller",
        "Psychological thriller"
      ],
      "name": ".45",
      "type": "/film/film"
    },
    ...many more films and genres...
    {
      "genre": [
        "Disaster",
        "Thriller",
        "Drama",
        "Natural disaster"
      ],
      "name": "10.5",
      "type": "/film/film"
    }
  ],
  "status": "200 OK",
  "transaction_id": "cache;cache04.p01.sjc1:8101;2010-04-06T22:23:15Z;0017"
}
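Put differently, the step in effect issues an HTTP GET along the following lines, combining the url and mql fields from the Generate Rows step (shown unencoded and split over two lines for readability; on the wire the query parameter value is URL-encoded and the request is a single line):

    http://api.freebase.com/api/service/mqlread?query=
    {"query":[{"type":"/film/film","name":null,"genre":[]}]}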

Processing the Freebase Result Envelope

The response is then processed using a Modified Java Script Value step labeled “Parse JSON.” This step does not have any particular configuration, except for the script itself. The JavaScript code is as follows:

var o, r, i;

// Turn response envelope text into a JavaScript
// object and assign it to the local variable o.
eval("o = " + result);

// loop through the array entries of the result
for (i = 0, rows = o.result; i < rows.length; i++) {

  // current entry of the result array
  r = rows[i];

  //
  // Use putRow method of the built-in _step_
  // object to push a row to the output stream
  _step_.putRow(
    rowMeta,              // use built-in row metadata
    [
      null,               // field: url
      null,               // field: mql
      r.name,             // field: name
      r.genre.join(","),  // field: genre
      null                // field: result
    ]
  );
}

First, the script uses a call to the JavaScript built-in function eval. This function takes its string argument and attempts to dynamically evaluate the text as JavaScript. The expression passed into the eval function effectively assigns the JSON response contained as text in the result field of the incoming stream to the local variable o. So after executing the eval function, you can access the result envelope as a JavaScript object through the o variable.

The second part of the script initializes a loop to iterate over the outermost array stored in the result member of the result envelope. Each entry in this array corresponds to one film retrieved from the Freebase read service. Inside the loop, the putRow method of the built-in _step_ object is used to generate new rows for the outgoing stream. This _step_ object is actually the instance of the Modified Java Script Value step itself, exposed as a JavaScript object. The putRow method of this object maps to the underlying Java method that actually emits rows to the output stream. The putRow method takes two arguments: rowMeta, which describes the layout (metadata) of the generated row, and an array that contains the actual field values for the outgoing row. The rowMeta object is built-in, just like the _step_ object, and automatically reflects the layout of the incoming rows. To save the hassle of changing the row layout, we added fields for the film name and genre up front in the Generate Rows step, which is why we had to define two empty fields there.

NOTE Always make sure the row metadata matches the metadata of the val- ues that you assign to the row; otherwise the script won’t work, or may lead to unexpected results.

As you can see, the new row sets all fields that were filled in the incoming stream to null. The only fields that are assigned a value are the name and genre fields. The value for the name field is taken immediately from the name property of the current entry of the query result. For the genre field, things are slightly more complicated because the corresponding genre property in the query result can be an array of genre values. For this reason, the script uses the JavaScript built-in join method of the array object. This method simply serializes the array to string by concatenating its entries, separating them with a comma. So in the outgoing stream, the genre field will contain a comma-separated list of genre values, which can be further processed downstream. This can, for example, be done using a "Split field to rows" step. This step is discussed in more detail in Chapter 20, in the section on multi-valued attributes.

Filtering Out the Original Row

Apart from outputting the rows you generate by calling the putRow method, the Modified Java Script Value step also emits the original row coming in from the input stream. Because this row does not contain any data from the query result, it is useless to you, and you have to discard it. This is done using a simple Filter Rows step. It is easy to spot the original row: Because the JavaScript code fills the name and genre fields, which you know are empty upstream of the Modified Java Script Value step, you can simply check whether the name field IS NOT NULL. If it is not null, then you're apparently dealing with a row having a film name, so you have to retain that. If, on the other hand, the name field is NULL, then this must be the original row, which is then discarded by leading it to the Dummy step.

Storing to File

The final step in the transformation is storing the data to file. Here’s an example of the output:

mql;name;genres;result
;.45;Black comedy,Thriller,Psychological thriller;
...many more rows...
;13 Ghosts;Horror;
;10.5;Disaster,Thriller,Drama,Natural disaster;

As you can see, our transformation got rid of the nested JSON completely, transform- ing the data neatly into columns.

RSS

RSS stands for Really Simple Syndication and refers to a collection of XML-based data formats designed to exchange information about updates in some information source. It is in widespread use in blogs and news websites to allow visitors to subscribe to updates to the site—a usage commonly referred to as a feed (or XML-feed, RSS-feed or webfeed).

RSS Structure

RSS is a relatively simple XML vocabulary consisting of two fundamental elements: the channel and its items. We discuss both in the remainder of this section.

Channel

At the top level, an RSS document is an rss element, with a mandatory version attribute that specifies the version of RSS that the document conforms to. In the following document, the version attribute is 2.0.

<?xml version="1.0" encoding="ISO-8859-1"?>
<rss version="2.0">
  ...
</rss>

The rss element has a single channel element, which contains information about the source that provides the feed and its contents. The channel represents the metadata of the feed:

<?xml version="1.0" encoding="ISO-8859-1"?>
<rss version="2.0">
  <channel>
    <title>Press Releases</title>
    <link>
      http://www.pentaho.com/news/index.php?tab=news
    </link>
    <description>
      The latest Pentaho News
      and Press Releases from
      Pentaho.com.
    </description>
  </channel>
</rss>

The following elements are required for channel:

- title: The name of the channel
- link: The URL of the website corresponding to the channel
- description: A phrase describing the channel

A channel can have optional elements (for example the language the channel is written in, copyright notice for content in the channel, and so on) and some optional sub-elements.

Item

A channel may contain many item elements. Each item element represents a particular update or change to the site and typically contains a description element that serves as a summary of the update, and a link element that contains a URL pointing to the actual changed part of the site. For example, in the RSS feed of a news site, the items typically represent new articles, the description would provide a summary or first paragraph of the article, and the link would point to the actual article itself. All elements of an item element are optional; however, a title or description element should be present. An item can have the following elements:

- title: The title of the item
- description: The item synopsis
- link: The URL of the item
- pubDate: Indicates when the item was published
- comments: The URL of a page for comments relating to the item
- guid (Globally Unique Identifier): A string that uniquely identifies the item

The following document is an example of a channel with two items:

<?xml version="1.0" encoding="ISO-8859-1"?>
<rss version="2.0">
  <channel>
    <title>Press Releases</title>
    <link>
      http://www.pentaho.com/news/index.php?tab=news
    </link>
    <description>
      The latest Pentaho News and Press Releases from Pentaho.com.
    </description>
    <item>
      <title>
        Pentaho Releases Analyzer as First Deliverable in Agile BI Initiative
      </title>
      <link>
        http://www.pentaho.com/news/releases/20091104.php
      </link>
      <pubDate>Wed, 04 Nov 2009 9:00:00 EST</pubDate>
    </item>
    <item>
      <title>
        Pentaho Announces Strategic Technology Acquisition
      </title>
      <link>
        http://www.pentaho.com/news/releases/20091005.php
      </link>
      <pubDate>Mon, 05 Oct 2009 9:00:00 EST</pubDate>
    </item>
  </channel>
</rss>

You can extend feeds with geographic information. At this time, Pentaho Data Integration only supports points (longitude and latitude) on basic geometries (GeoRSS-Simple) and GeoRSS GML, which supports a great range of features. The following illustrates a Simple geographic item:

<georss:point>45.256 -71.92</georss:point>

And here’s a GML item:

<georss:where>
  <gml:Point>
    <gml:pos>45.256 -71.92</gml:pos>
  </gml:Point>
</georss:where>

RSS Support in Kettle

Now that you know how RSS works, let's see how Kettle deals with syndication. Kettle can both read and write RSS feeds using the RSS Input and RSS Output steps. We examine these steps in more detail in the following sections.

RSS Input

The RSS Input step is available in the Input category from the left side pane tree view. It allows you to extract data from most RSS and Atom syndication formats, including:

- RSS 0.90
- RSS 0.91 Netscape
- RSS 0.91 Userland
- RSS 0.92
- RSS 0.93
- RSS 0.94
- RSS 1.0
- RSS 2.0
- Atom 0.3
- Atom 1.0

Here we present this step through an example that consists of reading all articles from Pentaho.com published since the first of January 2010. First, you need to specify the URL from which you read feeds in a URL list, as shown in Figure 21-25.

Figure 21-25: General tab of RSS Input

As you can see, the General tab specifies the URL list, which you can enter in a static way or dynamically by using an environment variable or the field of an incoming stream (check "URL is defined in a field" and select the field in the "URL Field" box). Once you have the source URL, you can specify on the Content tab that you need only feeds published since the first of January 2010. This is shown in Figure 21-26.

Figure 21-26: Content tab of RSS Input

The Content tab has the following options:

- Read articles from: Enter the date from which you will retrieve articles. You must enter the date in the following format: yyyy-MM-dd HH:mm:ss.
- Max number of articles: Limits the number of articles to the top N (0 means extract all articles).
- Include url in output: If you have specified the URL in a static way, you can add it to the outgoing stream by checking this option. A string field named after the value you entered in "URL fieldname" will be added to the outgoing stream.
- Include rownum in output: This option, available in many input steps, adds a field that contains a row number (the first row has number 1, and so on) to the outgoing stream. Simply select this option and enter a name for the field in the "Rownum fieldname" box.

The final step is to get the fields. On the Fields tab, shown in Figure 21-27, use the "Get fields" button to return all available fields. Simply remove and/or rename outgoing fields at your convenience. You can check feeds using the "Preview rows" button on the General tab. As you can see, it is very easy to retrieve feeds from RSS thanks to Kettle, and you can then process your data using other steps.

RSS Output

The RSS Output step is the functional complement of the RSS Input step: it takes data from an input stream and writes it in RSS format to a file. These files can then be served through a web server to implement a feed for a website or web application. The RSS Output step lets you output a standard RSS feed (with elements explained previously), but you can also configure it to deliver a customized RSS feed.

Figure 21-27: Select fields

The following example demonstrates a standard RSS feed with one item. The output RSS feed of this example is shown following:

<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>My Channel</title>
    <link>http://www.mychannel.com</link>
    <description>Example of RSS feed</description>
    <item>
      <title>Test item</title>
      <link>http://www.mychannel.com/feed0</link>
      <description>An item in the channel</description>
      <guid>http://www.mychannel.com/feed0</guid>
    </item>
  </channel>
</rss>

The transformation that generates this output is shown in Figure 21-28.

Figure 21-28: Transformation demonstrating RSS output

Source Data

For this example, you use two "Generate rows" steps:

1. For the child elements of the channel element (channelTitle, channelDescription, and channelLink), simply drop a "Generate rows" step to the workbench and enter the values as shown in Figure 21-29.

Figure 21-29: RSS Output source channel

2. For the child elements of the item element (itemTitle, itemDescription, and itemLink), enter the values shown in Figure 21-30.

Figure 21-30: RSS Output source item

3. Next, link these two steps with a "Cartesian join" step.

Now that you have your source data, the next sections walk you through creating a standard feed and a custom feed.

Standard Feed

A standard feed has predefined mandatory or optional elements in both channel and item, as mentioned earlier. To produce a standard feed, make sure that "Create custom RSS" is unchecked as in Figure 21-31, and remove the channel and item elements from incoming fields.

Figure 21-31: RSS Output Channel tab

The Channel tab (shown in Figure 21-31) is dedicated to channel items; mandatory fields are marked by an asterisk (*), all other fields are optional, and PDI will not complain if you leave them blank. At the bottom of the page, you can set the character encoding and specify the RSS version. Typically, you should use UTF-8 here (which is also the default encoding for XML in general). Once you have mapped the channel elements with incoming fields, you now need to do the same work for the items. There is no mandatory item element, as you can see in Figure 21-32. Set only those elements that you need and leave all others blank.

Figure 21-32: RSS Output Item tab

The final step is to configure the output file that will contain the feed. The Output File tab is shown in Figure 21-33.

Figure 21-33: RSS Output File tab

The top section of the tab offers the following settings:

- Filename: Specify the full filename that will contain the feed. Use the "Browse…" button if necessary.
- Create Parent folder: Select this option if you need to create the parent folder at run time; otherwise, if you select an output filename with a nonexistent parent folder, PDI will fail.
- Filename defined in a field: You can supply the filename dynamically. PDI will read the value from the "Filename field."
- Filename field: Name of the field that contains the filename (see the previous bullet). The value will be extracted from the first received row.
- Extension: Output filename extension that will be added to the filename.
- Include stepnr in filename: Adds the copy number of the step to the filename (the copy number begins with 0).
- Include date in filename: Adds the date to the filename.
- Include time in filename: Adds the time to the filename.
- Show filename(s): Click this button to see what the filename looks like after applying all the previous settings.

The lower section, Result Filename, includes one option:

- Add File to result: Adds the filename to the result filenames.

That's it for a standard feed. The sample transformation RSSOutputStandard.ktr will create a file containing the feed.

Custom Feed

You saw that in a standard feed, the channel and item element tags are predefined. This is not the case for a custom feed. In fact, you can define any element tag that you need by checking the "Create custom RSS" box on the Channel tab. After checking this, the standard fields for channel and item you just configured are not used and are in fact disabled in the configuration dialog. To control the output of a custom feed, you need to activate the custom output tab. There, you'll find two grids: one for the channel and one for the items. In the grids, you can define element names and assign the field from the incoming stream that is to provide the content for these elements.

Summary

In this chapter, we took a closer look at the Kettle features that allow you to work with the Web and web services. We covered a lot of ground:

- A review of Kettle features for web access, such as the general HTTP steps, SOAP, RSS, and the Apache virtual file system.
- Data formats that are widely in use on the Web, such as XML, HTML, and JSON.
- A detailed example for importing XML data into a database, demonstrating the use of the XSD Validator step to validate an XML document against an XML Schema, and the "Get data from XML" step to extract data from an XML document using XPath syntax.
- A detailed example for exporting data from a database to an XML format, demonstrating the Add XML step for generating simple XML structures such as elements and attributes and the XML Join step for generating nested XML elements.
- How to access SOAP web services using the "Web services lookup" step.
- How to access SOAP web services directly using either an HTTP Client or an HTTP Post step, and then parsing the XML result.
- A detailed example for extracting data from Freebase using the JSON-based MQL language, and how to use the Modified Java Script Value step to extract data from a JSON result.
- An example that shows how to extract data from an RSS web feed.
- How to generate a standard RSS feed from source data.

Chapter 22: Kettle Integration

Kettle contains a rich set of data integration functionality that is exposed in a set of data integration tools. However, you can also use Kettle as a library in your own software and solutions. The re-use of other software is typical for open source software. It allows you to stop re-inventing the same wheel time and again. In this chapter we explain the license that makes this possible. Then we introduce you to the finer points of integrat- ing Kettle in your own Java software. We start at a high level with a few examples from Kettle usage in the rest of the Pentaho software stack. For each example, we explain how you can perform a similar integration with the Kettle Java API. We also explain how you can parameterize and customize the Spoon user interface. We finish off with a few words about forking.

The Kettle API

In this section, we take a look at what makes Kettle popular as an API. We explore the consequences of the LGPL license that comes with the Kettle software and provide a number of examples of the Kettle API.

The LGPL License

When the development team agreed to make Kettle an open source product, they decided to use the GNU Lesser General Public License, or LGPL (http://www.gnu.org/licenses/old-licenses/lgpl-2.1.html). This license fell somewhat out of the good graces of the GNU and Free Software Foundation (FSF) folks for the same reason that it was selected in the first place: the LGPL license allows you to link proprietary (closed source) code without the need to open source that code. While the LGPL license requires you to make changes to the Kettle source code public if you redistribute these changes, it does not require you to release code that simply uses the Kettle libraries. This is in contrast to the GPL license. Software that carries the GPL license can be used or embedded, too; however, the software that embeds it then needs to release its own source code to remain compliant with the license.

As you can imagine, the capability to not only re-use the rich Kettle API but also to be able to sell or distribute your new software is valuable to software vendors specifically and to corporations in general. It is because of this that we can say that the LGPL is a business-friendly license.

The LGPL also offers another advantage for Kettle. Kettle has had the ability to integrate plugins since version 2.0. The LGPL license makes this particularly interesting because the source code of a plugin does not have to be released. It allows any corporation to extend Kettle for its own needs with minimal effort. For more information on this topic, see Chapter 23. Over the years, many step and job-entry plugins have been created at corporations all over the world (see http://wiki.pentaho.com/display/EAI/List+of+Available+Pentaho+Data+Integration+Plug-Ins). For some of these, it was absolutely not possible or it made no sense to release the source code. However, over the years a number of plugins have been released as source code, and some of them even made it into the main Kettle source tree as standard steps and job entries. That makes the LGPL a dependable and popular choice for the Kettle project.

The Kettle Java API

In this part of the chapter, we explain how you can obtain and build Kettle from source code. We also detail how you can generate the Java API documentation. Finally, we explain what's in the four Kettle libraries and list the libraries you need to include in the class path of your Java software.

Source Code

If you want access to the source code, you can either download it from SourceForge.net at http://sourceforge.net/projects/pentaho/ or get it directly from the Subversion source code repository. The Kettle source code repository is open to everyone in read-only mode at http://source.pentaho.org/svnkettleroot/Kettle. The stable and patch releases are located under the branches/ subfolder, while the latest new developments are provided under the trunk/ folder. As such, if you want to get the very latest (possibly unstable!) source code, you can do a Subversion checkout of the trunk folder with the following command (note that the UNIX command-line svn command needs to be installed for this to run; users of a Windows-compatible operating system can install TortoiseSVN, see http://tortoisesvn.tigris.org/ for more information):

svn co http://source.pentaho.org/svnkettleroot/Kettle/trunk/

Building Kettle

Once you have the source code extracted, you can build Kettle from source. You do this with the Apache Ant build tool (http://ant.apache.org/). To build a distribution of Kettle in the distrib/ subfolder, simply type ant on the command line in the directory that contains the Kettle source code.

Building javadoc

If you want to browse the Java documentation of the source code, you can invoke the following command:

ant javadoc

This will generate the javadoc in the docs/api folder, which will allow you to see the comments written to document the various classes and methods in the Kettle codebase. You can use an Internet browser to start reading the file docs/api/index.html.

Libraries and the Class Path

Pentaho Data Integration (Kettle) is not just a collection of data integration and business intelligence tools; it can also be used as a Java API. The API is split up into four main parts:

- Core: Contains the core classes of Kettle, stored in the .jar file kettle-core.jar.
- Database: Contains the database-related classes of Kettle, stored in the .jar file kettle-db.jar.
- Engine: The Kettle runtime classes, from kettle-engine.jar.
- GUI: All classes related to graphical user interfaces such as Spoon, and that depend on the Eclipse SWT classes, are stored in the .jar file kettle-ui-swt.jar.

For the samples that follow, you need to put the first three .jar files in your class path, along with the .jar files you find in the libext/ folder of your Kettle distribution. You might not need all of them, but in general it's better to be safe than sorry because you usually have no idea up-front what sort of transformation you'll be executing.

Executing Existing Transformations and Jobs

The following section explains how you can run your existing Kettle transformations and jobs with the Java API.

Executing a Transformation

Examples of software that executes a Kettle transformation are, for obvious reasons, mainly found in the Pentaho software stack. For example, Spoon, Pan, Carte, the Pentaho BI server, and the Pentaho Data Integration server are all capable of executing a transformation. Executing a transformation with the aforementioned tools is easy. However, it is also fairly easy to do this with the Kettle 4.0 Java API. The following code, ExecuteTrans.java, demonstrates this in only a few lines.

package example.ch22;

import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.trans.Trans;
import org.pentaho.di.trans.TransMeta;

public class ExecuteTrans {
  public static void main(String[] args) throws Exception {
    String filename = args[0];

    KettleEnvironment.init();

    TransMeta transMeta = new TransMeta(filename);
    Trans trans = new Trans(transMeta);
    trans.prepareExecution(null);
    trans.startThreads();
    trans.waitUntilFinished();

    if (trans.getErrors() != 0) {
      System.out.println("Error encountered");
    }
  }
}

As you can see, the listing is quite small and easy to understand. Let’s start with this line:

KettleEnvironment.init();

This line of code initializes the complete Kettle environment. It loads all the plugins it can find, initializes the logging environment, sets up and reads the variables system, and even creates a Kettle home directory if it's not already present. If anything goes seriously wrong in there, an exception will be thrown for your convenience. The following line creates a new TransMeta (transformation metadata) object by reading it from a file:

TransMeta transMeta = new TransMeta(filename);

It is also possible to read this metadata from a repository, which is easy to do with the Repository.loadTransformation() method. Getting your hands on a Repository interface is a bit more complex. You need to have the name (ID) of the repository to reference. Then you have to connect to the repository with a username and password. The metadata for the repositories is stored in a repositories.xml file. This can be read with the RepositoriesMeta.readData() method. That allows you to look up the RepositoryMeta object that describes the repository. The Repository interface is then obtained by asking the PluginRegistry to load it for you:

RepositoriesMeta repositoriesMeta = new RepositoriesMeta();
repositoriesMeta.readData();
RepositoryMeta repositoryMeta = repositoriesMeta.findRepository(repositoryName);
PluginRegistry registry = PluginRegistry.getInstance();
Repository repository = registry.loadClass(
    RepositoryPluginType.class,
    repositoryMeta,
    Repository.class
);
repository.connect(username, password);

Once you have loaded the transformation metadata, you can execute it as follows:

Trans trans = new Trans(transMeta);
trans.prepareExecution(null);
trans.startThreads();
trans.waitUntilFinished();

This block creates a new transformation engine object, prepares (initializes) the execution, and starts the transformation threads. Finally, because the whole transfor- mation runs multi-threaded, it waits until all processing has completed. Last but not least, you see if there were errors during the execution:

if (trans.getErrors() != 0) {
  System.out.println("Error encountered");
}

Any errors encountered during the execution of the transformation are printed to the console.
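If you embed this pattern in a larger program, you may want to hand the outcome back to the caller instead of only printing it. The following is a minimal sketch of such a wrapper; the method name runTransformation() is ours and is not part of the Kettle API or the book's download files:

    // Hypothetical helper that runs a transformation file and reports success.
    public static boolean runTransformation(String filename) throws Exception {
      KettleEnvironment.init();
      TransMeta transMeta = new TransMeta(filename);
      Trans trans = new Trans(transMeta);
      trans.prepareExecution(null);
      trans.startThreads();
      trans.waitUntilFinished();
      // true when no step reported an error
      return trans.getErrors() == 0;
    }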

Executing a Job

As with transformations, examples of software that executes a Kettle job are also mainly found in the Pentaho software stack. Spoon, Kitchen, Carte, the Pentaho BI server, and the Pentaho Data Integration server are all capable of executing a job. The execution of a job is similar to the execution of a transformation, as shown in the following code (ExecuteJob.java):

package example.ch22;

import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.job.Job;
import org.pentaho.di.job.JobMeta;

public class ExecuteJob {
  public static void main(String[] args) throws Exception {
    String filename = args[0];

    KettleEnvironment.init();

    JobMeta jobMeta = new JobMeta(filename, null);
    Job job = new Job(null, jobMeta);
    job.start();
    job.waitUntilFinished();

    if (job.getErrors() != 0) {
      System.out.println("Error encountered");
    }
  }
}

As you can see, loading the job metadata from a file is very similar to what you saw in the previous example. The only difference is that you pass a null repository reference to the JobMeta constructor (because there isn't one in our example).
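Because JobMeta supports the same variable and parameter handling as TransMeta (see the section "Setting Parameters and Variables" later in this chapter), you can also configure the job before starting it and inspect its Result afterward. The following sketch assumes the job defines a named parameter called INPUT_DIR; both the parameter name and the values are illustrative:

    JobMeta jobMeta = new JobMeta(filename, null);
    jobMeta.setVariable("ENVIRONMENT", "test");           // plain variable
    jobMeta.setParameterValue("INPUT_DIR", "/tmp/input"); // throws an exception if the parameter is unknown
    Job job = new Job(null, jobMeta);
    job.start();
    job.waitUntilFinished();
    System.out.println("Errors: " + job.getResult().getNrErrors());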

Embedding Kettle

The Pentaho software stack is a collection of over a hundred small to large components. Many components are small core libraries while others are complete software collec- tions, like the BI Server, hundreds of mega-bytes in size. The reason Pentaho has core libraries is to maximize code re-use to minimize development efforts. Kettle is some- where in the middle of the pack: it uses Pentaho core libraries but is used itself in other components. As a result, integrating Kettle for various purposes has become easier over time. In the following sections we’ll show you usage examples and the source code to demonstrate how you can perform the same integration yourself.

Pentaho Reporting

Pentaho Reporting has the capability to obtain data directly from any step in a transformation. It can do this because it embeds the Kettle libraries. Pentaho Report Designer (PRD) allows you to create a new data source of the type Pentaho Data Integration under the Data menu. That in turn presents a dialog where you can select the transformation to execute and the step to read from. Figure 22-1 shows the Pentaho Report Designer.

NOTE Because Kettle and Pentaho Reporting are usually on different release schedules and because software changes over time, make sure that you use the appropriate version of Pentaho Reporting to read data from a transformation. Use version 3.5 of Pentaho Reporting to access transforma- tions created with Pentaho Data Integration (PDI) 3.2 and version 3.6 (or later) to access transformations created with PDI 4.0. Chapter 22 N Kettle Integration 575

Figure 22-1: Pentaho Reporting using Kettle

Reading information from a step in a transformation is quite easy to do. Now that you know how to execute a transformation, we can extend our example that executes a transformation by reading the data that is produced by a certain step, as shown in the following code (ReadFromStep.java):

package example.ch22;

import java.util.ArrayList;
import java.util.List;

import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.core.RowMetaAndData;
import org.pentaho.di.core.exception.KettleStepException;
import org.pentaho.di.core.row.RowMetaInterface;
import org.pentaho.di.trans.Trans;
import org.pentaho.di.trans.TransMeta;
import org.pentaho.di.trans.step.RowAdapter;
import org.pentaho.di.trans.step.RowListener;
import org.pentaho.di.trans.step.StepInterface;

public class ReadFromStep {
  public static void main(String[] args) throws Exception {
    String filename = args[0];
    String stepname = args[1];

    KettleEnvironment.init();

    TransMeta transMeta = new TransMeta(filename);
    Trans trans = new Trans(transMeta);
    trans.prepareExecution(null);

    final List<RowMetaAndData> rows = new ArrayList<RowMetaAndData>();
    RowListener rowListener = new RowAdapter() {
      public void rowWrittenEvent(RowMetaInterface rowMeta, Object[] row)
          throws KettleStepException {
        rows.add(new RowMetaAndData(rowMeta, row));
      }
    };
    StepInterface stepInterface = trans.findRunThread(stepname);
    stepInterface.addRowListener(rowListener);

    trans.startThreads();
    trans.waitUntilFinished();

    if (trans.getErrors() != 0) {
      System.out.println("Error");
    } else {
      System.out.println("We read " + rows.size() + " rows from step " + stepname);
    }
  }
}

The only thing that changed from the ExecuteTrans.java example is that you now attach a RowListener object to a step copy. For this example, you assume that there is only one copy. If you want to obtain data from a specific copy, use the getRunThread(stepname, copy) method. When you attach a RowListener, you get notified when a row is read or written, or when a row is written to an error handling step. This notification happens in sync with the execution of the transformation. This allows you to handle the records in a streaming fashion, as is the case in the transformation engine. In the sample, we simply add all the output rows to a Java list for processing after the transformation finishes executing.
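If the step does run in multiple copies, the same pattern works per copy. The following small sketch is our own variation on ReadFromStep.java: it listens to copy 0 of the step only and also reacts to rows being read:

    final List<RowMetaAndData> written = new ArrayList<RowMetaAndData>();
    StepInterface copy0 = trans.getRunThread(stepname, 0); // copy number 0
    copy0.addRowListener(new RowAdapter() {
      public void rowReadEvent(RowMetaInterface rowMeta, Object[] row)
          throws KettleStepException {
        // called when this step copy reads a row from a previous step
      }
      public void rowWrittenEvent(RowMetaInterface rowMeta, Object[] row)
          throws KettleStepException {
        written.add(new RowMetaAndData(rowMeta, row)); // rows produced by this copy
      }
    });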

Putting Data into a Transformation

So now that you've learned how to read data from a step in a streaming fashion using the RowListener interface, the next task is to put data into a transformation. That is obviously possible, too.

Passing Data with a Result Object

You can use the "Copy rows to result" and the "Get rows from result" steps to pass data from one transformation to the next in a job. You do this with a simple in-memory buffer, and it doesn't stream data at all. This means that using a Result object is not suited for large amounts of data. In this case, you can simply code the "Copy rows to result" step using the Java API. In this example, you change the previous example to include the following three lines of code as well as a method to create a list of rows (see PassDataToTrans.java in the download files):

...
Result result = new Result();
result.setRows(createRows());
transMeta.setPreviousResult(result);
...

eg^kViZhiVi^XA^hi1GdlBZiV6cY9ViV3XgZViZGdlhp A^hi1GdlBZiV6cY9ViV3a^hi2cZl6ggVnA^hi1GdlBZiV6cY9ViV30

GdlBZiV6cY9ViVdcZ2cZlGdlBZiV6cY9ViV0 dcZ#VYYKVajZ¹hig^c\º!KVajZBZiV>ciZg[VXZ#INE:THIG>CciZg[VXZ#INE:T96I:!cZl9ViZ0 dcZ#VYYKVajZ¹cjbWZgº!KVajZBZiV>ciZg[VXZ#INE:TCJB7:G! 9djWaZ#kVajZD[&'(#)*+0 dcZ#VYYKVajZ¹^ciZ\Zgº!KVajZBZiV>ciZg[VXZ#INE:T>CI:<:G! Adc\#kVajZD[&'()*+A0 dcZ#VYYKVajZ¹W^\TcjbWZgº!KVajZBZiV>ciZg[VXZ#INE:T7>ciZg[VXZ#INE:T7DDA:6C! 7ddaZVc#IGJ:0 dcZ#VYYKVajZ¹W^cVgnº!KVajZBZiV>ciZg[VXZ#INE:T7>C6GN! cZlWniZPRp%m))!%m*%!%m).!r0

a^hi#VYYdcZ0

gZijgca^hi0 r

As you can see, you simply pass a list of rows to a Result object and set that as the result of a previous transformation on the TransMeta object. The sample shows you a value example for every data type that is used in Kettle. For more information on Kettle rows and values, see Chapter 23. All you need to do now is include a "Get rows from result" step in your transformation to pass the data when the transformation is executed.
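The createRows() method referenced in the snippet simply builds a list of RowMetaAndData objects, one addValue() call per field. The following is a minimal sketch of such a method; the field names and values are illustrative only and are not the ones used in PassDataToTrans.java:

    private static List<RowMetaAndData> createRows() {
      List<RowMetaAndData> list = new ArrayList<RowMetaAndData>();
      RowMetaAndData one = new RowMetaAndData();
      // addValue(name, type, value) appends both the field metadata and its data
      one.addValue("name", ValueMetaInterface.TYPE_STRING, "Some name");
      one.addValue("id", ValueMetaInterface.TYPE_INTEGER, Long.valueOf(1L));
      one.addValue("created", ValueMetaInterface.TYPE_DATE, new java.util.Date());
      list.add(one);
      return list;
    }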

Passing Data in a Streaming Fashion

If you have a lot of data to pass and you want to keep memory requirements as low as possible, you would want to stream the data into a transformation. That is possible, too, using the Injector step. This step is simply a placeholder for data that will arrive at runtime. To pass data to this step, you add a RowProducer to the Trans class object. Consider the example transformation shown in Figure 22-2.

Figure 22-2: Using the Injector step

In this transformation, you don’t read from any particular data source or write to any target. The transformation receives information from an outside source through the Injector step, performs some complex regular expression evaluation, and executes a JavaScript program to then return it to use over the Dummy step, as we demonstrated in the previous example. The following code (>c_ZXi9ViV>cidIgVch[dgbVi^dc#_VkV) shows how you would pass your rows of data to the Injector step:

eVX`V\ZZmVbeaZ#X]''0

^bedgi_VkV#bVi]#7^\9ZX^bVa0 ^bedgi_VkV#ji^a#6ggVnA^hi0 ^bedgi_VkV#ji^a#9ViZ0 ^bedgi_VkV#ji^a#A^hi0

^bedgidg\#eZciV]d#Y^#XdgZ#@ZiiaZ:ck^gdcbZci0 ^bedgidg\#eZciV]d#Y^#XdgZ#GZhjai0 ^bedgidg\#eZciV]d#Y^#XdgZ#GdlBZiV6cY9ViV0 ^bedgidg\#eZciV]d#Y^#XdgZ#ZmXZei^dc#@ZiiaZHiZe:mXZei^dc0 ^bedgidg\#eZciV]d#Y^#XdgZ#gdl#GdlBZiV>ciZg[VXZ0 ^bedgidg\#eZciV]d#Y^#XdgZ#gdl#KVajZBZiV>ciZg[VXZ0 ^bedgidg\#eZciV]d#Y^#igVch#GdlEgdYjXZg0 ^bedgidg\#eZciV]d#Y^#igVch#IgVch0 ^bedgidg\#eZciV]d#Y^#igVch#IgVchBZiV0 ^bedgidg\#eZciV]d#Y^#igVch#hiZe#Gdl6YVeiZg0 ^bedgidg\#eZciV]d#Y^#igVch#hiZe#GdlA^hiZcZg0 ^bedgidg\#eZciV]d#Y^#igVch#hiZe#HiZe>ciZg[VXZ0

ejWa^XXaVhh>c_ZXi9ViV>cidIgVch[dgbVi^dcp ejWa^XhiVi^Xkd^YbV^cHig^c\PRVg\hi]gdlh:mXZei^dcp Hig^c\[^aZcVbZ2Vg\hP%R0

@ZiiaZ:ck^gdcbZci#^c^i0

IgVchBZiVigVchBZiV2cZlIgVchBZiV[^aZcVbZ0

GZhjaigZhjai2cZlGZhjai0 gZhjai#hZiGdlhXgZViZGdlh0 igVchBZiV#hZiEgZk^djhGZhjaigZhjai0

IgVchigVch2cZlIgVchigVchBZiV0 igVch#egZeVgZ:mZXji^dccjaa0

[^cVaA^hi1GdlBZiV6cY9ViV3gdlh2cZl6ggVnA^hi1GdlBZiV6cY9ViV30 Chapter 22 N Kettle Integration 579

GdlA^hiZcZggdlA^hiZcZg2cZlGdl6YVeiZgp ejWa^Xkd^YgdlLg^iiZc:kZciGdlBZiV>ciZg[VXZgdlBZiV!DW_ZXiPRgdl i]gdlh@ZiiaZHiZe:mXZei^dcp gdlh#VYYcZlGdlBZiV6cY9ViVgdlBZiV!gdl0 r r0 HiZe>ciZg[VXZhiZe>ciZg[VXZ2igVch#[^cYGjcI]gZVY¹9jbbnº0 hiZe>ciZg[VXZ#VYYGdlA^hiZcZggdlA^hiZcZg0

GdlEgdYjXZggdlEgdYjXZg2igVch#VYYGdlEgdYjXZg¹>c_ZXidgº!%0

igVch#hiVgiI]gZVYh0

[dgGdlBZiV6cY9ViVgdl/XgZViZGdlhp gdlEgdYjXZg#ejiGdlgdl#\ZiGdlBZiV!gdl#\Zi9ViV0 r gdlEgdYjXZg#[^c^h]ZY0

igVch#lV^iJci^a;^c^h]ZY0

^[igVch#\Zi:ggdgh2%p HnhiZb#dji#eg^ciac¹:ggdgº0 rZahZp HnhiZb#dji#eg^ciac¹LZ\diWVX`¹ gdlh#h^oZ ºgdlhº0 r r

eg^kViZhiVi^XA^hi1GdlBZiV6cY9ViV3XgZViZGdlhp $$hZZi]ZegZk^djhZmVbeaZ r r

It’s important to let the transformation create a GdlEgdYjXZg for you before the threads are started. Once the threads are started, the Injector step will not start to run until you pass rows into it with the ejiGdl method. Please note that this method will pause execution of a step when the associated GdlHZi buffer is full. The buffer size defaults to 10,000 rows. (See Chapter 15 for more information.) A step will momentarily pause execution naturally when there is a lot of data to process and when the rows are injected faster than they can be processed. When you are done, you simply call the [^c^h]ZY method to indicate that no more data is to be expected from the Injector step. Then you simply have to wait until the other steps have completed.

Setting Parameters and Variables

One way to pass information to a transformation or a job is parameterization. This can be done in a transformation in the form of variables or parameters. Setting a variable is easy with the Java API; simply set the variable on the transMeta object before you create the trans object:

transMeta.setVariable("VARIABLE_NAME", "value");

Setting a parameter value of a transformation happens in a similar fashion:

transMeta.setParameterValue("PARAMETER_NAME", "value");

Please note that this method will throw an exception if the parameter name is not known. To obtain a list of all the defined parameters in a transformation, you can use the transMeta.listParameters() method. To find out what the default value is for a parameter, you can use the transMeta.getParameterDefault() method. The exact same methods for setting variables and handling parameters are available for the JobMeta class as well.
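For example, a defensive way to set a parameter is to enumerate the defined parameters first; this small sketch uses only the methods mentioned above (java.util.Arrays is assumed to be imported):

    String[] parameters = transMeta.listParameters();
    for (String name : parameters) {
      System.out.println("Parameter " + name
          + ", default = " + transMeta.getParameterDefault(name));
    }
    // Only set the value if the parameter is actually defined in the transformation
    if (Arrays.asList(parameters).contains("PARAMETER_NAME")) {
      transMeta.setParameterValue("PARAMETER_NAME", "value");
    }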

Dynamic Transformations

Spoon itself creates a transformation on-the-fly whenever you hit a Preview button in one of the step dialogs. That transformation consists of the step being previewed and a Dummy step. It is executed and the results are displayed in a dialog. This temporary transformation is never shown afterward. You already know how to execute the transformation. Creating it dynamically is not that hard either. Let's start with a simple use case. You know the name and layout of a CSV file and you simply want Kettle to read it. The separator is always a comma and the enclosure is a double quote character. Then you catch the output from a subsequent Dummy step. Figure 22-3 shows the transformation you create dynamically using the Kettle API.

Figure 22-3: A dynamically generated preview transformation

The name and the layout of the CSV file are the dynamic part of the problem. This information comes from somewhere else. Usually this so-called metadata is stored in another format or entered by the user in a third-party application. We will not focus on this aspect of the problem because it's different every time. Instead, we'll just allow this information to be passed into the CsvFileReader class that follows (see CsvFileReader.java in the Chapter 22 download files):

ejWa^XXaVhh8hk;^aZGZVYZgp

eg^kViZhiVi^XHig^c\HI:ETG:69T6T;>A:2¹GZVYV[^aZº0 eg^kViZhiVi^XHig^c\HI:ET9JBBN2¹9jbbnº0

eg^kViZHig^c\[^aZcVbZ0 eg^kViZIZmi;^aZ>ceji;^ZaYPR^ceji;^ZaYh0 eg^kViZA^hi1GdlBZiV6cY9ViV3gdlh0

ejWa^X8hk;^aZGZVYZgHig^c\[^aZcVbZ! IZmi;^aZ>ceji;^ZaYPR^ceji;^ZaYhp Chapter 22 N Kettle Integration 581

i]^h#[^aZcVbZ2[^aZcVbZ0 i]^h#^ceji;^ZaYh2^ceji;^ZaYh0 r

ejWa^Xkd^YgZVYi]gdlh:mXZei^dcp

@ZiiaZ:ck^gdcbZci#^c^i0

$$8gZViZVcZligVch[dgbVi^dc### $$ IgVchBZiVigVchBZiV2cZlIgVchBZiV0 igVchBZiV#hZiCVbZ¹hVbeaZ%)º0

$$8gZViZi]ZhiZeidgZVYi]Z[^aZ $$ 8hk>cejiBZiV^cejiBZiV2cZl8hk>cejiBZiV0 ^cejiBZiV#hZi9Z[Vjai0$$XdbbVhZeVgViZY!¹ZcXadhZYl^i]]ZVYZg# ^cejiBZiV#hZi;^aZcVbZ[^aZcVbZ0 ^cejiBZiV#hZi>ceji;^ZaYh^ci;^ZaYh0

HiZeBZiV^cejiHiZe2cZlHiZeBZiVHI:ETG:69T6T;>A:!^cejiBZiV0 ^cejiHiZe#hZiAdXVi^dc*%!*%0 ^cejiHiZe#hZi9gVligjZ0 igVchBZiV#VYYHiZe^cejiHiZe0

$$8gZViZi]ZYjbbneaVXZ"]daYZghiZe $$ 9jbbnIgVchBZiVYjbbnBZiV2cZl9jbbnIgVchBZiV0 HiZeBZiVYjbbnHiZe2cZlHiZeBZiVHI:ET9JBBN!YjbbnBZiV0 YjbbnHiZe#hZiAdXVi^dc&*%!*%0 YjbbnHiZe#hZi9gVligjZ0 igVchBZiV#VYYHiZeYjbbnHiZe0

$$8gZViZV]deWZilZZci]Z'hiZeh $$ IgVch=deBZiV]de2cZlIgVch=deBZiV^cejiHiZe!YjbbnHiZe0 igVchBZiV#VYYIgVch=de]de0

$$CdllZXVcZmZXjiZi]ZigVch[dgbVi^dc $$ IgVchigVch2cZlIgVchigVchBZiV0 igVch#egZeVgZ:mZXji^dccjaa0

gdlh2cZl6ggVnA^hi1GdlBZiV6cY9ViV30 GdlA^hiZcZggdlA^hiZcZg2cZlGdl6YVeiZgp ejWa^Xkd^YgdlLg^iiZc:kZciGdlBZiV>ciZg[VXZgdlBZiV! DW_ZXiPRgdl i]gdlh@ZiiaZHiZe:mXZei^dcp gdlh#VYYcZlGdlBZiV6cY9ViVgdlBZiV!gdl0 r r0 582 Part V N Advanced Topics

HiZe>ciZg[VXZhiZe>ciZg[VXZ2igVch#[^cYGjcI]gZVYHI:ET9JBBN0 hiZe>ciZg[VXZ#VYYGdlA^hiZcZggdlA^hiZcZg0

igVch#hiVgiI]gZVYh0 igVch#lV^iJci^a;^c^h]ZY0

^[igVch#\Zi:ggdgh2%p HnhiZb#dji#eg^ciac¹:ggdgº0 rZahZp HnhiZb#dji#eg^ciac¹LZgZVY¹ gdlh#h^oZ ºgdlh[gdbhiZe¹  HI:ET9JBBN0 r r

ejWa^XA^hi1GdlBZiV6cY9ViV3\ZiGdlhp gZijgcgdlh0 r

As you can see in the preceding code, you're no longer loading a transformation from a file or from a repository; you're creating it from scratch. The TransMeta class allows you to add all sorts of metadata to it, like steps, hops, and database connections. The resolution of that metadata happens at runtime, so you don't have to worry about that. In the example, we're adding two steps and a hop:

TransMeta transMeta = new TransMeta();
...
transMeta.addStep(inputStep);
...
transMeta.addStep(dummyStep);
...
transMeta.addTransHop(hop);

This is equivalent to dragging the two steps on the canvas and then creating a hop between them. Here is a block of code that uses the CsvFileReader class:

IZmi;^aZ>ceji;^ZaY^Y2cZlIZmi;^aZ>ceji;^ZaY¹^Yº!"&!-0 ^Y#hZiIg^bIneZKVajZBZiV>ciZg[VXZ#IG>BTINE:T7DI=0 ^Y#hZi;dgbVi¹º0 ^Y#hZiIneZKVajZBZiV>ciZg[VXZ#INE:T>CI:<:G0

IZmi;^aZ>ceji;^ZaY[^ghicVbZ2cZlIZmi;^aZ>ceji;^ZaY¹[^ghicVbZº! "&!*%0 [^ghicVbZ#hZiIneZKVajZBZiV>ciZg[VXZ#INE:THIG>C<0

IZmi;^aZ>ceji;^ZaYcVbZ2cZlIZmi;^aZ>ceji;^ZaY¹cVbZº!"&!*%0 cVbZ#hZiIneZKVajZBZiV>ciZg[VXZ#INE:THIG>C<0

IZmi;^aZ>ceji;^ZaYo^e2cZlIZmi;^aZ>ceji;^ZaY¹o^eº!"&!&*0 o^e#hZiIg^bIneZKVajZBZiV>ciZg[VXZ#IG>BTINE:TA:;I0 Chapter 22 N Kettle Integration 583

o^e#hZiIneZKVajZBZiV>ciZg[VXZ#INE:THIG>C<0

IZmi;^aZ>ceji;^ZaYX^in2cZlIZmi;^aZ>ceji;^ZaY¹X^inº!"&!*%0 X^in#hZiIneZKVajZBZiV>ciZg[VXZ#INE:THIG>C<0

IZmi;^aZ>ceji;^ZaYW^gi]YViZ2cZlIZmi;^aZ>ceji;^ZaY¹W^gi]YViZº! "&!"&0 W^gi]YViZ#hZi;dgbVi¹nnnn$BB$YYº0 W^gi]YViZ#hZiIneZKVajZBZiV>ciZg[VXZ#INE:T96I:0

IZmi;^aZ>ceji;^ZaYhigZZi2cZlIZmi;^aZ>ceji;^ZaY¹higZZiº!"&! *%0 higZZi#hZiIneZKVajZBZiV>ciZg[VXZ#INE:THIG>C<0

IZmi;^aZ>ceji;^ZaY]djhZcg2cZlIZmi;^aZ>ceji;^ZaY¹]djhZcgº!"&! &*0 ]djhZcg#hZiIg^bIneZKVajZBZiV>ciZg[VXZ#IG>BTINE:TA:;I0 ]djhZcg#hZiIneZKVajZBZiV>ciZg[VXZ#INE:THIG>C<0

IZmi;^aZ>ceji;^ZaYhiViZXdYZ2cZlIZmi;^aZ>ceji;^ZaY¹hiViZXdYZº! "&!'0 hiViZXdYZ#hZiIneZKVajZBZiV>ciZg[VXZ#INE:THIG>C<0

IZmi;^aZ>ceji;^ZaYhiViZ2cZlIZmi;^aZ>ceji;^ZaY¹hiViZº!"&!*%0 hiViZ#hZiIneZKVajZBZiV>ciZg[VXZ#INE:THIG>C<0

IZmi;^aZ>ceji;^ZaYPR^ceji;^ZaYh2cZlIZmi;^aZ>ceji;^ZaYPRp ^Y![^ghicVbZ!cVbZ!o^e!X^in!W^gi]YViZ!higZZi!]djhZcg! hiViZXdYZ!hiViZ! r0

Hig^c\[^aZcVbZ2¹hVbeaZh$igVch[dgbVi^dch$[^aZh$XjhidbZgh"&%%#imiº0 8hk;^aZGZVYZggZVYZg2cZl8hk;^aZGZVYZg[^aZcVbZ!^ceji;^ZaYh0 gZVYZg#gZVY0

Dynamic Template

By setting the user interface location and the drawn attribute, you make sure that the transformation you generated is also perfectly valid and editable in Spoon after storing it in a file. The XML representation of any object in the Kettle API can be easily obtained simply by calling the getXML() method. It's thus easy to save the transformation metadata to a file:

String xml = XMLHandler.getXMLHeader() + transMeta.getXML();
DataOutputStream dos = new DataOutputStream(
    KettleVFS.getOutputStream("csv-reader.ktr", false));
dos.write(xml.getBytes(Const.XML_ENCODING));
dos.close();

Suppose you have hundreds of different types of files to read and you need to create a transformation for each type with a "CSV file input" step, a Table Output step, and a few error-handling steps. It would be perfectly valid to create a program that does this if you already have the metadata somewhere. The CsvFileReader class then becomes a dynamic template for your work. In a matter of seconds, you can then generate hundreds of transformations, all with exactly the right properties and tuned the way you want them. As in most (if not all) ETL tools, the output of a step needs to be known in advance and predictable. As such, in the traditional sense, the only option you have is to create a few hundred transformations in Spoon and manually configure all steps in each one. That problem can be solved by using a dynamic template, as shown in the CsvFileReader example. Because you can already execute these transformations on-the-fly, it's totally up to you whether or not you should store them on disk for integration into your existing ETL framework. By wrapping up the dynamic generation and execution in a step or job entry plugin, it can even be a combination of both. For more information on that topic, see Chapter 23. The settings in such a plugin would allow you to configure the details of the transformation generation itself.
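To illustrate the dynamic template idea, the following sketch (our own, combining only the TransMeta, XMLHandler, KettleVFS, and Const calls shown earlier in this section) would generate one .ktr file per entry in a list of source files; the file names are purely illustrative:

    String[] sourceFiles = { "customers.csv", "products.csv" }; // illustrative
    for (String sourceFile : sourceFiles) {
      TransMeta transMeta = new TransMeta();
      transMeta.setName("read " + sourceFile);
      // ... add a "CSV file input" step, a Table Output step, and a hop here,
      //     exactly as shown in the CsvFileReader example above ...
      String xml = XMLHandler.getXMLHeader() + transMeta.getXML();
      DataOutputStream dos = new DataOutputStream(
          KettleVFS.getOutputStream("read_" + sourceFile + ".ktr", false));
      dos.write(xml.getBytes(Const.XML_ENCODING));
      dos.close();
    }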

Dynamic Jobs

One example of jobs that get created on-the-fly is the "Copy tables" wizard in Spoon. That wizard allows you to generate a job that copies a selection of database tables from one database to another. The resulting job contains an "Execute SQL script" job entry for every selected table as well as a Transformation job entry to populate the target table. Because the question about dynamically generating and executing SQL comes back regularly on the Kettle forum, the following code (DynamicJob.java) will take a transformation filename as a parameter to generate a job that will execute the SQL required to execute the transformation. Because zero, one, or more databases can be involved, you can have zero, one, or more SQL Script job entries in the resulting job.

eVX`V\ZZmVbeaZ#X]''0

^bedgi_VkV#ji^a#=Vh]HZi0 ^bedgi_VkV#ji^a#A^hi0

^bedgidg\#eZciV]d#Y^#XdgZ#DW_ZXiAdXVi^dcHeZX^[^XVi^dcBZi]dY0 ^bedgidg\#eZciV]d#Y^#XdgZ#HFAHiViZbZci0 ^bedgidg\#eZciV]d#Y^#XdgZ#YViVWVhZ#9ViVWVhZBZiV0 ^bedgidg\#eZciV]d#Y^#_dW#?dW=deBZiV0 ^bedgidg\#eZciV]d#Y^#_dW#?dWBZiV0 ^bedgidg\#eZciV]d#Y^#_dW#Zcig^Zh#hfa#?dW:cignHFA0 ^bedgidg\#eZciV]d#Y^#_dW#Zcig^Zh#igVch#?dW:cignIgVch0 ^bedgidg\#eZciV]d#Y^#_dW#Zcign#?dW:cign8den0 ^bedgidg\#eZciV]d#Y^#igVch#IgVchBZiV0

ejWa^XXaVhh9ncVb^X?dWp

ejWa^XhiVi^X?dWBZiV\ZcZgViZ?dWBZiVHig^c\igVch;^aZcVbZ i]gdlh:mXZei^dcp ?dWBZiV_dWBZiV2cZl?dWBZiV0 Chapter 22 N Kettle Integration 585

_dWBZiV#hZiCVbZ¹hVbeaZ%*º0

^cim2*%0 ^cin2*%0

$$6YYi]ZhiVgiZcign### $$ ?dW:cign8denhiVgi8den2?dWBZiV#XgZViZHiVgi:cign0 hiVgi8den#hZiAdXVi^dcm!n0 hiVgi8den#hZi9gVlc0 _dWBZiV#VYY?dW:cignhiVgi8den0 ?dW:cign8denaVhi8den2hiVgi8den0

$$9ZiZgb^cZi]ZHFAVcYYViVWVhZhcZZYZYid $$ZmZXjiZi]ZigVch[dgbVi^dc $$ IgVchBZiVigVchBZiV2cZlIgVchBZiVigVch;^aZcVbZ0 =Vh]HZi19ViVWVhZBZiV3YViVWVhZh2cZl=Vh]HZi19ViVWVhZBZiV30 A^hi1HFAHiViZbZci3hfaHiViZbZcih2igVchBZiV#\ZiHFAHiViZbZcih0 [dgHFAHiViZbZcihiVi/hfaHiViZbZcihp YViVWVhZh#VYYhiVi#\Zi9ViVWVhZ0 r

$$6YY¹:mZXjiZHFAhXg^eiº[dgZkZgnjhZYYViVWVhZ### $$ [dg9ViVWVhZBZiVYViVWVhZBZiV/YViVWVhZhp ?dW:cignHFA_dW:cignHfa2cZl?dW:cignHFA0 _dW:cignHfa#hZi9ViVWVhZYViVWVhZBZiV0

Hig^c\hfa2¹º0 [dgHFAHiViZbZcihfaHiViZbZci/hfaHiViZbZcihp ^[hfaHiViZbZci#\Zi9ViVWVhZ#ZfjVahYViVWVhZBZiVp ^[hfaHiViZbZci#]Vh:ggdghfaHiViZbZci#]VhHFA p hfa 2hfaHiViZbZci#\ZiHFA0 r r r _dW:cignHfa#hZiHFAhfa0

?dW:cign8denhfa8den2cZl?dW:cign8den_dW:cignHfa0 hfa8den#hZiCVbZ¹HFA[dg¹ YViVWVhZBZiV#\ZiCVbZ0 m 2&%%0 hfa8den#hZiAdXVi^dcm!n0 hfa8den#hZi9gVlc0 _dWBZiV#VYY?dW:cignhfa8den0

?dW=deBZiVhfa=de2cZl?dW=deBZiVaVhi8den!hfa8den0 _dWBZiV#VYY?dW=dehfa=de0 aVhi8den2hfa8den0 586 Part V N Advanced Topics

r

$$CdlZmZXjiZi]ZigVch[dgbVi^dcVhlZaa### $$ ?dW:cignIgVch_dW:cignIgVch2cZl?dW:cignIgVch0 _dW:cignIgVch#hZiHeZX^[^XVi^dcBZi]dY DW_ZXiAdXVi^dcHeZX^[^XVi^dcBZi]dY#;>A:C6B:0 _dW:cignIgVch#hZi;^aZCVbZigVch;^aZcVbZ0

?dW:cign8denigVch8den2cZl?dW:cign8den_dW:cignIgVch0 igVch8den#hZiCVbZ¹:mZXjiZ¹ igVch;^aZcVbZ0 m 2&%%0 igVch8den#hZiAdXVi^dcm!n0 igVch8den#hZi9gVlc0 _dWBZiV#VYY?dW:cignigVch8den0

?dW=deBZiVigVch=de2cZl?dW=deBZiVaVhi8den!igVch8den0 _dWBZiV#VYY?dW=deigVch=de0 aVhi8den2igVch8den0

gZijgc_dWBZiV0 r r

As you can see from the TransMeta.getSQLStatements() method, it is quite easy to have a transformation generate all the SQL statements it needs to function properly. Note that this is only the case if the transformation itself is not too dynamic. If you, for example, define a table name in a Table Output step with a variable that is dynamically defined, it is not possible to know what SQL needs to be generated. The JobMeta object that is the result of the generateJobMeta() method shown here can be executed or saved to disk as described earlier. Again, it's up to you to save this metadata to an XML file (or a repository), to execute this right away, or to make this part of a step or job entry plugin.
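A trimmed-down sketch of that idea, generating and simply printing the SQL for one transformation, could look as follows (it uses only methods that appear in the DynamicJob listing above):

    TransMeta transMeta = new TransMeta(transFilename);
    List<SQLStatement> sqlStatements = transMeta.getSQLStatements();
    for (SQLStatement sqlStatement : sqlStatements) {
      if (!sqlStatement.hasError() && sqlStatement.hasSQL()) {
        System.out.println("-- SQL for connection " + sqlStatement.getDatabase().getName());
        System.out.println(sqlStatement.getSQL());
      }
    }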

Executing Dynamic ETL in Kettle

Of course, it's also possible to execute the code you just saw in a User Defined Java Class or Modified Java Script Value step. For this example, you simply extend the DynamicJob example with one method:

public static Result executeTransformation(String transFilename)
    throws Exception {
  JobMeta jobMeta = generateJobMeta(transFilename);
  Job job = new Job(null, jobMeta);
  job.start();
  job.waitUntilFinished();
  return job.getResult();
}

This method will execute the specified transformation. Now all you need to do is put the class in a .jar file (see the jar command in the Java Development Kit) and put the .jar file somewhere in the libext/ folder of your Kettle distribution. This makes the class available to Kettle. Next you can use a bit of JavaScript to execute a list of transformations (.ktr files). You can obtain the list of files from a Get File Names step. If you bring these filenames to a Modified Java Script Value step, the script is then simply:

var result = Packages.example.ch22.DynamicJob
    .executeTransformation(filename);
var errors = result.getNrErrors();

With a single line of JavaScript, the specified transformation is dynamically executed in a job after all the required SQL is executed.

Result

In the previous example, a Result object is returned; it contains not only the number of errors but also a lot of other useful information you might want to use to pass data between transformations or jobs. Table 22-1 shows the various types of information you can retrieve from a Result object. The method to use is listed alongside the data type and a description of the information returned.

Table 22-1: The Result Object

- getResult() (boolean): Contains true if the object (transformation or job) was executed successfully, false if there was some error.
- getExitStatus() (int): Exit status of the Shell Script job entry.
- getEntryNr() (int): The entry number; it is increased every time a job entry is executed in a job.
- getNrErrors() (long): The number of errors.
- getNrLinesInput() (long): The number of rows read from a file or database. (a)
- getNrLinesOutput() (long): The number of rows written to a file or database. (a)
- getNrLinesRead() (long): The number of rows read from previous steps. (a)
- getNrLinesUpdated() (long): The number of rows updated in a file or database. (a)
- getNrLinesWritten() (long): The number of rows written to the next step. (a)
- getNrLinesDeleted() (long): The number of deleted rows. (a)
- getNrLinesRejected() (long): The number of rows rejected and passed to another step via error handling. (a)
- getRows() (List<RowMetaAndData>): The result rows.
- isStopped() (boolean): Flag to indicate if the execution was manually stopped or not.
- getResultFilesList() (List<ResultFile>): The list of all the files used in the executed object(s).
- getNrFilesRetrieved() (int): The number of files retrieved from FTP, SFTP, and so on.
- getLogText() (String): The log text of the execution of the executed object and its children.
- getLogChannelId() (String): The ID of the log channel of the executed object. You can use this to look up information on the execution lineage in the Log Channel log table.

(See also http://wiki.pentaho.com/display/EAI/Evaluating+conditions+in+The+JavaScript+job+entry.)

(a) To make this metric work properly, you need to select which step is representative for the metric on the Logging tab of the Transformation Settings dialog.
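For example, after the executeTransformation() method shown earlier returns, you might report a few of these metrics; the path in this sketch is illustrative:

    Result result = DynamicJob.executeTransformation("/tmp/sample.ktr"); // illustrative path
    System.out.println("Errors        : " + result.getNrErrors());
    System.out.println("Lines read    : " + result.getNrLinesRead());
    System.out.println("Lines written : " + result.getNrLinesWritten());
    System.out.println("Stopped?      : " + result.isStopped());
    System.out.println(result.getLogText());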

Replacing Metadata

On occasion, you might find yourself in a situation in which you would like to load an existing transformation or job only to replace a database or step at runtime. While a lot of parameterization is possible through the clever use of variables and named parameters (as you've already seen), you might need to insert completely new objects. The use case discussed here is the replacement of a database connection in an existing transformation. You want to do this because you'll not only change the hostname and database name in the connection, but also the database type itself. This database type can't be specified using a variable or parameter. In the example, the database connection in the transformation is called DB and is used in various steps. The first step is to create a new DatabaseMeta object for the sample. Please remember that this task is quite similar for steps, job entries, slave servers, partition schemas, and cluster schemas:

DatabaseMeta databaseMeta = new DatabaseMeta(
    "DB", "MySQL", "JDBC", "localhost",
    "test", "3306", "user", "password");

Direct Changes with the API

To replace the database connection called DB in the transformation or job, use the addOrReplaceDatabase() method:

transMeta.addOrReplaceDatabase(databaseMeta);

This method will make sure that the existing database connection is modified because references to this object are being held by various steps in the transformation.
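Putting the pieces together, a minimal sketch of swapping the connection just before execution (the transformation path is illustrative):

    TransMeta transMeta = new TransMeta("/tmp/existing.ktr"); // illustrative path
    DatabaseMeta databaseMeta = new DatabaseMeta(
        "DB", "MySQL", "JDBC", "localhost",
        "test", "3306", "user", "password");
    transMeta.addOrReplaceDatabase(databaseMeta); // steps referring to "DB" now use the new settings
    Trans trans = new Trans(transMeta);
    trans.prepareExecution(null);
    trans.startThreads();
    trans.waitUntilFinished();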

Using a Shared Objects File

The shared objects file contains databases, steps, and other objects that are considered reusable by the user. You can make an object reusable simply by right-clicking it in the left-hand tree in Spoon and selecting Share. The object in question will then end up in the shared objects file of the loaded transformation or job. By default this is KETTLE_HOME/.kettle/shared.xml. The objects in the shared file will take precedence over any object with the same name that is defined in your transformation or job. As such, you can use a shared objects file to dynamically modify the settings of your database connection. To do so, you first need to generate the shared objects file:

SharedObjects sharedObjects = new SharedObjects();
sharedObjects.storeObject(databaseMeta);
sharedObjects.setFilename("/tmp/shared.xml");
sharedObjects.saveToFile();

Now you need to make TransMeta load the shared objects when the transformation is executed:

transMeta.setSharedObjectsFile("/tmp/shared.xml");
transMeta.readSharedObjects();

If you need to set a default shared objects file location for a complete job, you can also do this by setting a system variable called KETTLE_SHARED_OBJECTS:

System.setProperty(Const.KETTLE_SHARED_OBJECTS, "/tmp/shared.xml");

In order for this to work, you need to make sure you don’t have any other shared object file location specified in your transformation or job.

OEM Versions and Forks

In this section you will learn how to customize Kettle and the Spoon user interface to make them look and behave exactly like you want. We will also explain what forking Kettle entails, which forks exist, and what the consequences are.

Creating an OEM Version of PDI

Kettle has made the extension of functionality easy with all sorts of plugin systems and parameterizations. However, it might be interesting for a company or organization to create a rebranded or OEM version for the purpose of being deployed as part of a larger suite. As long as you don't change anything in the original Kettle source code, it's totally fine according to the LGPL license to create such versions. This is why we have made it easy for people to change the look and feel of Kettle, and Spoon in particular. Most of the Kettle code base supports internationalization, also known as i18n (i-18 characters-n). Each piece of text in the Spoon user interface, for example, is translated into a number of different languages. Each piece of text is stored separately in a set of text files, each with a different unique key. For example, the name of the Spoon application is stored under the key Spoon.Application.Name in this file:

src-ui/org/pentaho/di/ui/spoon/messages/messages_en_US.properties

It would be possible to change the name of Spoon in that file for the English translation. However, that would, in effect, create a fork of Kettle because the source code is changed. This was not deemed sufficient for OEM purposes because it would require OEM cus- tomers of Pentaho to maintain their own code base of Kettle. Thus, the Look and Feel (LAF) manager was created, which allows you to define i18n keys, images, and property files, separate from the main Kettle source tree. You do this by creating an alternative properties file such as this one:

com/example/book/ui/spoon/messages/messages_en_US.properties

In that file, you can then put all the Spoon keys you want to change—for example, the name:

Spoon.Application.Name=Data Integration Designer

You then put the file, together with the complete path, in a .jar file with the following command:

jar -cvf customizations.jar com/

The file customizations.jar is then placed somewhere in the libext/ folder to put it in the Kettle class path. Now all you need to do is explain to Kettle where to find these customizations. You do this by setting the LAFpackage property in the ui/laf.properties file. In this case, we set it as follows:

LAFpackage=com.example.book

If you then start Spoon, you'll notice that the title of the tool changed and that the term "Spoon" is replaced in various places. The ui/laf.properties file also contains the paths to many properties that will allow you to change many aspects of the way Spoon looks and behaves. These include, for example, the icons used and the name of the Kettle home directory.

Forking Kettle

For customized versions of Kettle or OEM versions, nothing is actually changed in the Kettle source code. For most people and organizations, the offered customization and parameterization options are more than sufficient. However, you might encounter a scenario where extensive changes need to be made to the standard functionality of Kettle. In that case you may be tempted to create a fork. According to the terms of the LGPL as described earlier, a fork is a legal copy of the Kettle source code base on which to start independent development. It becomes a distinct piece of software. While most companies, groups, or organizations usually try to avoid the overhead of maintaining a complete fork of a code base as large as Kettle's, it does happen.

Perhaps the most well-known active fork is GeoKettle. GeoKettle is an open source project of the Spatialytics.org community (http://www.spatialytics.com/). The founders of that project, Dr. Thierry Badard and Luc Vaillencourt, have a long history in geospatial software development. The GeoKettle project adds GIS (Geographic Information Systems) capabilities to Kettle, such as the capability to read from Oracle Spatial, PostGIS (http://postgis.refractions.net/), and MySQL with ESRI shape files. The project also added spatial reference systems management and coordinate transformations. Because it was not possible to add these features through the standard plug-in system, the developers forked the Kettle codebase. GeoKettle is also LGPL-licensed. Traditionally, it keeps up with the release schedule of mainstream Kettle with only a few weeks' delay and in general keeps the software compatible with Kettle. Because the needs for spatial data warehousing are very specific, this can be considered a friendly and non-competing fork.

There have been a few isolated occasions in the past where an individual or company felt the need to fork Kettle. On some occasions, the source code was released alongside the binaries; on other occasions, this was not the case. A lot of companies use a forked, self-maintained version of Pentaho Data Integration in-house. This is perfectly fine according to the LGPL license. However, it's important to know that if you do come across a forked version of Kettle, you have the right to ask for the source code. It's also important to know that it takes considerable effort to maintain a successful fork of Pentaho Data Integration given the size of the development and translation teams. Thousands of changes, large and small, are performed every year, and keeping up with that pace is no easy task. That is why most of the known forks have become stale or extinct. We suspect that because of this, the ones that are still around at this time are little more than rebranded versions of the mainstream Kettle distribution, possibly with a few extra plugins. Because these are unknown, unsupported forks of Kettle, it's impossible to say which changes or which patches were installed or not.

Summary

This chapter explained the various ways you can integrate Kettle in your own Java software. We did this with a few easy to understand examples. Here are the most important things that you learned:

- The basics of the LGPL license and how LGPL applies to source code development with the Kettle API
- How to obtain and compile the Kettle codebase
- How to execute transformations and jobs with the Java API, including more advanced features like extraction of data from a step and injection of data into a transformation
- How to create your own OEM version of Kettle
- About GeoKettle and the other forks of Kettle

Chapter 23: Extending Kettle

The final chapter of this book will teach you how to get more out of your ETL solution and extend Kettle by developing your own plugins. None of the 34 ETL subsystems covers this, as it doesn't directly belong to the ETL solution. However, any plugin you develop will belong to one of the subsystems covered in this book, whether it is an extraction component for a proprietary data source or a plugin that generates documentation. As you must know by now, Kettle contains a rich set of building blocks like steps and job entries to help you solve complex problems. However, even with all the available functionality at your fingertips, at times you may find yourself in a situation that requires you to extend Kettle. Usually such an extension is required to integrate Kettle with third-party or newly emerging technology. We start by looking at the plugin architecture, the various types of plugins, and what makes it possible for Kettle to load plugins. Next, you learn how to set up your own development environment before we explain how you can develop new steps, job entries, partitioning methods, and so on. You get detailed explanations about the important classes and methods that are required in the various plugin types.

Plugin Architecture Overview

We start with an overview of the plugin capabilities of Kettle. You will learn what kind of plugins can be created and how they are loaded at run-time.


If you had to name the defining characteristics common and essential to all software projects and products that are categorized as open source and/or free software, regard- less of license, programming language, or application domain, then these characteristics have to be:

- The unrestricted availability of its source code
- The right to modify that source code to build an alternative version of the executable program
- The right to distribute the software or its source code, or a modified version

You could argue over what all this means exactly, and you can browse the mailing lists and user forums of popular open source products to discover that many people do, in fact, argue about this and many other things. However, in practice it means the following:

- The program's source code can be freely downloaded from the Internet. Usually this means you can obtain the code free of charge. However, a nominal fee may be charged to account for any expenses resulting from making the source code available.
- The code can be edited using a common text editor or integrated development environment. You are not required to buy any specific tools to change the source code.
- Assuming proper knowledge and skills, the modified source can be used to construct a working version of the program, and both the modified source and/or any programs built from the (modified or original) source code may be passed on to other people.

As described in Chapter 22, all open source and free software, including Kettle, is extensible by default. As per Kettle's LGPL license, everybody can obtain the Kettle source code and change it to their liking to build another working version of the Kettle program and distribute it. This was described in Chapter 22 as forking. However, in addition to the de facto extensibility common to all open source and free software projects, Kettle features a plugin architecture. This eliminates most needs to fork the project. It allows extensibility by design; it makes it possible to write essentially isolated pieces of software called plugins. Those can be loaded dynamically into an otherwise unmodified version of Kettle.

Plugin Types

The following Kettle plugin types affect run-time behavior:

- Transformation step plugins: Implement steps that can be used to process rows flowing through a Kettle transformation.
- Job entry plugins: Implement a task that can be executed as part of a Kettle job.
- Partitioning method plugins: Allow you to specify your own partitioning rules based on the values of input fields.

- Database type plugins: Implement a database connection type.
- Repository type plugins: Allow you to handle Kettle metadata persisted in alternate locations or in alternate formats.

NOTE In addition to these types, it is also possible to inject user interface elements into the Spoon application in the form of Spoon plugins. However, those are not covered in this book.

Architecture

From a functional viewpoint, there is no difference between an internal Kettle object and a plugin. In other words, there is nothing an internal step or job entry can do that a plugin cannot, because the API is the same for both. The only difference is the way that these objects are loaded at run-time. Since version 4, Kettle has used a common plugin system called the Plug-in Registry that is responsible for loading both internal classes and plugins from various locations. There are two things that uniquely identify a plugin:

- The plugin type: Represented by the interface PluginTypeInterface. Examples include StepPluginType, JobEntryPluginType, PartitionerPluginType, and RepositoryPluginType.
- The IDs of the plugin: This is an array of strings that uniquely identifies a plugin. Because an old plugin can be replaced by a new plugin, a plugin can have multiple IDs. In most cases, however, a plugin has a single ID string. Examples include TableInput for the Table Input step and MYSQL for the MySQL database type.

When the Kettle environment is initialized, the Plug-in Registry first loads all the internal objects. It reads these from XML files that are located in the Kettle .jar files:

- kettle-steps.xml: The internal transformation steps
- kettle-job-entries.xml: The internal job entries
- kettle-partition-plugins.xml: The internal partitioning types
- kettle-database-types.xml: The internal database types
- kettle-repositories.xml: The internal repository types

Once the Plug-in Registry loads the internal objects, it goes on to search for possible plugins. It does this by scanning the .jar files that are present in the various subfolders of the plugins/ directory. It will look for specific Kettle annotations to determine whether or not a class is a plugin. Specific details about the loading process are covered later in this chapter. An important consequence of plugins being loaded after internal objects, combined with the use of IDs, is that you can replace internal functionality with plugins. For example, if you create and deploy a step plugin with the ID TableInput, then you'll be replacing the standard Table Input step. This can be interesting because it allows you to extend standard functionality with a plugin. Extending is easy, in turn, because you can use Java subclassing to enhance only those parts that you want, as the following sketch illustrates.
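The following is a minimal sketch of that idea. It is not part of the book's download package, and the package, class, i18n keys, and image path are made up for illustration: a plugin class that carries the ID of the internal Table Input step and subclasses its metadata class, inheriting everything it does not override.

package org.kettlesolutions.plugin.step.mytableinput;

import org.pentaho.di.core.annotations.Step;
import org.pentaho.di.trans.steps.tableinput.TableInputMeta;

@Step(
    id = "TableInput", // same ID as the internal step, so this plugin replaces it
    name = "MyTableInput.Name",
    description = "MyTableInput.Description",
    categoryDescription = "MyTableInput.Category",
    image = "org/kettlesolutions/plugin/step/mytableinput/MyTableInput.png",
    i18nPackageName = "org.kettlesolutions.plugin.step.mytableinput"
)
public class MyTableInputMeta extends TableInputMeta {
  // Override only the methods whose behavior you want to change;
  // all other functionality is inherited from the internal step.
}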

Prerequisites

In this section, we take a quick look at all the things you need to start developing a Kettle plugin.

Kettle API Documentation

Each plugin implements an application programming interface (API) that is specific to its plugin type. The API comprises a collection of Java interfaces that must be implemented and a number of base classes providing a basic implementation of these interfaces from which you can derive your own classes. In addition, the methods of these interfaces and classes exchange data with the Kettle program proper via a number of helper classes that appear in method parameter lists and/or return types. The classes and interfaces that make up the plugin API are documented using javadoc source code documentation. The compiled, human-readable javadoc documentation is a great resource when writing plugins. It can be generated from the Kettle source as described in the previous chapter, or browsed online at http://javadoc.pentaho.com/kettle.

Libraries

Besides the basic set of required libraries that was described in the previous chapter, you also need a few libraries from the Eclipse SWT (Standard Widget Toolkit) project. Because users need to interact with the plugins you create, you will need to develop a user interface for them, and thus you also require these third-party libraries. These are:

- commands.jar
- common.jar
- jface.jar
- runtime.jar
- swt.jar

You can find these libraries in the libswt/ subdirectory of the Kettle home directory. Note that there is a specific swt.jar file for each different target operating system; for each target platform, you will find a respective subdirectory in the libswt/ directory (such as linux, osx, and win32). While developing, you'll need the swt.jar file that corresponds to your development platform.

Integrated Development Environment

Programming tasks are greatly simplified by using an integrated development environment (IDE). IDEs help you organize source code into projects, navigate to related source code, offer wizards for generating code, and allow you to run and debug, all from within the same environment. There are many different IDEs, but for this chapter we assume the usage of the popular Eclipse project. Eclipse is an open source IDE that is especially suitable for Java development. You can download a copy of Eclipse at http://www.eclipse.org/downloads/. The Eclipse download site offers a number of different versions, optimized for particular programming languages. For Kettle plugin development, we recommend that you use the download labeled Eclipse IDE for Java Developers. If you follow the instructions on the Eclipse download page, you will end up downloading a 90–100MB zip file. Installation is just a matter of copying it to the desired location and unzipping it. Linux users may also be able to install Eclipse using their package management system. Please refer to the documentation of your Linux distribution to find out how to install Eclipse. If you don't succeed with a pre-packaged Eclipse for your distro, you can always download Eclipse from the Eclipse downloads site and install it manually.

Eclipse Project Setup

After creating a new project in Eclipse, we recommend that you create a package under the default source folder (usually src/). To create a new package, simply right-click on the source folder and select New > Package. For example, the package could be org.kettlesolutions.plugin, as in Figure 23-1.

Figure 23-1: Creating a new Java package

Then you can create sub-packages that reflect the type and name of the plugin, for example: org.kettlesolutions.plugin.step.helloworld. To make the setup of your project quick and easy, create folders libext/ and libswt/ in your project's root directory. Copy the four Kettle libraries from the lib/ folder in your Kettle distribution into the libext/ project folder. Also copy the required libraries from the distribution's libswt/ folder. Take only the swt.jar file that corresponds with your system. You can then configure the project's build path to include all libraries under the libext/ folder and the needed libraries from the libswt/ folder. Figure 23-2 shows the configured setup.

Figure 23-2: Selecting the .jar files for the project

Once that is done, you can start to develop your plugins. Please note that looking up classes and interfaces is easy in Eclipse with the Ctrl+Shift+T keyboard shortcut (or use the menu Navigate > Open Type).

Examples

Please remember that Kettle is open source and that essentially all existing predefined steps, job entries, partitioning methods, and repository types can serve as examples. That is because there is structurally no difference between an internal Kettle object (such as a step) and a plugin. The only recognizable difference is the way that they are loaded. Internal objects are loaded from the Kettle .jar files, whereas external plugins are loaded from the plugins/ folder.

You can find the source code for all the steps in the Kettle codebase in the org.pentaho.di.trans.steps package. The Kettle job entries can be found in the org.pentaho.di.job.entries package.

Transformation Step Plugins

A transformation step plugin is made up of at least four Java classes that implement four interfaces:

- StepMetaInterface: Describes the step to the outside world and handles serialization.
- StepInterface: Handles the actual execution of the functionality described in the metadata.
- StepDataInterface: A convenience interface that gives you a place to store temporary data, file handles, and so on.
- StepDialogInterface: The interface that allows you to edit the metadata using a dialog in the Spoon user interface.

In this section, we cover the basics of these interfaces. For each, we'll show you the corresponding class of a simple "Hello World" example. This example adds the string "Hello, world" in a field with a user-defined name to any input it receives. To finish, we'll explain how you can deploy your newly created step plugin.

StepMetaInterface Interface

org.pentaho.di.trans.step.StepMetaInterface has the responsibility to take care of all tasks that are related to the metadata of a step. Those responsibilities include the following tasks (shown with the associated methods):

- Serialize and de-serialize metadata to XML or a repository
  - getXML and loadXML
  - saveRep and readRep
- Describe the output fields to the outside world
  - getFields
- Verify the metadata for correctness
  - check
- Calculate the SQL required to make a step work correctly
  - getSQLStatements
- Set default values on the step's metadata
  - setDefault
- Perform a database impact analysis
  - analyseImpact
- Describe any possible input or output streams
  - getStepIOMeta
  - searchInfoAndTargetSteps
  - handleStreamSelection
  - getOptionalStreams
  - resetStepIoMeta
- Export metadata resources
  - exportResources
  - getResourceDependencies
- Describe used libraries
  - getUsedLibraries
- Describe used database connections
  - getUsedDatabaseConnections
- Describe the fields required to make this step function properly (usually targeting a database table)
  - getRequiredFields
- Express capabilities to the user interface and transformation engine
  - supportsErrorHandling
  - excludeFromRowLayoutVerification
  - excludeFromCopyDistributeVerification

There are also methods that describe how the four interfaces in the step are glued together, such as the following:

- String getDialogClassName(): Describes the name of the dialog class that implements the StepDialogInterface. If this method returns null, the dialog class is automatically determined based on the name and package of the class implementing the StepMetaInterface. More information about the step dialog class can be found later in this chapter.
- StepInterface getStep(): Creates a new instance of the class that implements the StepInterface interface.
- StepDataInterface getStepData(): Creates a new instance of the class that implements the StepDataInterface interface.

Let's now take a look at what the "Hello World" sample step looks like with respect to the metadata interface (this Java code can also be found in the download file HelloworldStepMeta.java):

package org.kettlesolutions.plugin.step.helloworld;

/* imports removed for brevity */

@Step(
    id = "Helloworld",
    name = "name",
    description = "description",
    categoryDescription = "categoryDescription",
    image = "org/kettlesolutions/plugin/step/helloworld/HelloWorld.png",
    i18nPackageName = "org.kettlesolutions.plugin.step.helloworld"
)

The @Step annotation indicates to the Plug-in Registry loading code that this class needs to be considered as a step plugin. The annotation allows you to specify an ID for the step, an icon, a localized name, description, and Spoon step category. The values of the last three are determined by looking up the specified key (name, description, categoryDescription) in an accompanying properties file called a message bundle. The i18nPackageName option describes the location of the internationalization (i18n) bundle. In the example, the bundle is located in the org/kettlesolutions/plugin/step/helloworld/messages folder. For the en_US (English, United States) locale, the file would be called messages_en_US.properties. In our example, the values in that properties file would include the following items:

name=Hello world
description=A very simple step that adds "Hello world" to the incoming stream
categoryDescription=Plugin samples

Note that by specifying a non-standard category, one that is not yet used by the internal steps, you are creating a separate category that will show up at the top of the list in the Spoon step category listing. Finally, the image tag of the annotation specifies the icon for the plugin. You can use a non-interlaced PNG file of 32 pixels wide by 32 pixels high. Transparency can be used in the icon file. As the following code line shows, this class implements the StepMetaInterface. All the defaults are conveniently implemented for you in the BaseStepMeta base class. Subclassing this class allows you to focus on those methods that you want to implement and ignore all others. For example, SQL generation is not relevant for this step.

public class HelloworldStepMeta extends BaseStepMeta implements StepMetaInterface {

The PKG variable marks the location of the messages package in which the internationalization (i18n) bundles are located. The BaseMessages.getString() occurrences that you'll notice in the code examples in the rest of this chapter look up translations in those bundles depending on the locale that is currently selected by the user. Also see the KETTLE_HOME/.kettle/.languageChoice file for the actual language setting on your system, or use the options dialog in Spoon. The PKG variable is typically located at the top of the class for use with the Translator developer GUI tool, which

allows translators to enter i18n strings for their locale. That is the reason why you’ll see this construct come back in a lot of classes of the Kettle source code base.

private static Class<?> PKG = HelloworldStep.class; // for i18n

The field fieldName simply keeps track of the single parameter that the user can define: the name of the field that will contain the "Hello, world" string.

private String fieldName;

/**
 * @return the fieldName
 */
public String getFieldName() {
  return fieldName;
}

/**
 * @param fieldName the fieldName to set
 */
public void setFieldName(String fieldName) {
  this.fieldName = fieldName;
}

Next you verify that the user indeed specified the name of a field to use. Any needed remarks are added to the remarks list. (Note that it is common practice to give feedback on all the things that are verified and not just the things that go wrong. That way, the user has a better view of what was verified and what was ignored.)

$  X]ZX`heVgVbZiZgh!VYYhgZhjaiidA^hi18]ZX`GZhjai>ciZg[VXZ3  jhZY^c6Xi^dc3KZg^[nigVch[dgbVi^dc  $ ejWa^Xkd^YX]ZX`A^hi18]ZX`GZhjai>ciZg[VXZ3gZbVg`h! IgVchBZiVigVchBZiV!HiZeBZiVhiZeBZiV! GdlBZiV>ciZg[VXZegZk!Hig^c\^cejiPR!Hig^c\djiejiPR! GdlBZiV>ciZg[VXZ^c[dp

^[8dchi#^h:bein[^ZaYCVbZp 8]ZX`GZhjai>ciZg[VXZZggdg2cZl8]ZX`GZhjai 8]ZX`GZhjai#INE:TG:HJAIT:GGDG! 7VhZBZhhV\Zh#\ZiHig^c\E@:A9º! hiZeBZiV 0 gZbVg`h#VYYZggdg0 rZahZp 8]ZX`GZhjai>ciZg[VXZd`2cZl8]ZX`GZhjai 8]ZX`GZhjai#INE:TG:HJAITD@! 7VhZBZhhV\Zh#\ZiHig^c\E@:A9º! hiZeBZiV Chapter 23 N Extending Kettle 603

0 gZbVg`h#VYYd`0 r r

As described previously, the getStep(), getStepData(), and getDialogClassName() methods simply provide the bridge to the other three interface classes of the step:

/*
 * creates a new instance of the step (factory)
 */
public StepInterface getStep(StepMeta stepMeta,
    StepDataInterface stepDataInterface,
    int copyNr, TransMeta transMeta, Trans trans) {
  return new HelloworldStep(stepMeta, stepDataInterface,
      copyNr, transMeta, trans);
}

/*
 * creates a new instance of the step data (factory)
 */
public StepDataInterface getStepData() {
  return new HelloworldStepData();
}

public String getDialogClassName() {
  return HelloworldStepDialog.class.getName();
}

The next four methods, loadXML, getXML, readRep, and saveRep, simply allow the step to serialize the metadata as XML or as a set of attributes in a Kettle repository. It's perfectly fine to use whatever XML serialization technology makes things easier for you, such as XStream (http://xstream.codehaus.org/).

public enum Tag {
  field_name,
};

/*
 * deserialize from xml
 * databases = list of available connections
 * counters = list of sequence steps
 */
public void loadXML(Node stepDomNode, List<DatabaseMeta> databases,
    Map<String, Counter> sequenceCounters)
    throws KettleXMLException {
  fieldName = XMLHandler.getTagValue(stepDomNode, Tag.field_name.name());
}

public String getXML() throws KettleException {
  StringBuilder xml = new StringBuilder();
  xml.append(XMLHandler.addTagValue(Tag.field_name.name(), fieldName));
  return xml.toString();
}

/*
 * De-serialize from repository (see loadXML)
 */
public void readRep(Repository repository, ObjectId stepIdInRepository,
    List<DatabaseMeta> databases,
    Map<String, Counter> sequenceCounters)
    throws KettleException {

  fieldName = repository.getStepAttributeString(
      stepIdInRepository,
      Tag.field_name.name()
  );
}

/*
 * serialize to repository
 */
public void saveRep(Repository repository, ObjectId idOfTransformation,
    ObjectId idOfStep) throws KettleException {
  repository.saveStepAttribute(idOfTransformation, idOfStep,
      Tag.field_name.name(), fieldName);
}

The setDefault() method simply sets default values on the various step parameters:

/*
 * initialize parameters to defaults
 */
public void setDefault() {
  fieldName = "helloField";
}

The getFields() method is very important because it describes exactly what each output row of the step is going to look like. It needs to modify the inputRowMeta object to make it match the desired output. This description of the output row is also used to explain to Spoon and subsequent steps what values are to be expected from this step. In the most common scenario, you'll be adding ValueMetaInterface objects to the output RowMetaInterface. In general, it's important to add as much information about these values as possible, including type, length and precision, format mask, and so on. The more metadata you add, the more accurate things such as SQL generation will be in the user interface.

public void getFields(RowMetaInterface inputRowMeta, String name,
    RowMetaInterface[] info, StepMeta nextStep,
    VariableSpace space) throws KettleStepException {
  String realFieldName = fieldName;
  ValueMetaInterface field =
      new ValueMeta(realFieldName, ValueMetaInterface.TYPE_STRING);
  field.setOrigin(name);
  inputRowMeta.addValueMeta(field);
}
}

Value Metadata

Value metadata describes a single field value in a row of data that is handled by a step in a transformation. This description takes the form of the ValueMetaInterface interface. This interface can tell you what the name of a value is, what data type it has, what its length and precision are, and so on. The following example creates a date value:

ValueMetaInterface dateMeta = new ValueMeta(
    "birthdate",
    ValueMetaInterface.TYPE_DATE
);
dateMeta.setConversionMask("yyyy/MM/dd");

The interface is also capable of data conversion to another value metadata description. In fact, we recommend that all your data conversion be done using the ValueMetaInterface interface. For example, if you have a birth date and you want to convert it to a string using the format specified in the dateMeta object from the example, you can use the following code:

// java.util.Date birthDate

String birthDateString = dateMeta.getString(birthDate);

The ValueMeta class will take care of the conversion. As such, the safest thing you can do is not make any decisions yourself about the type of data you're getting from previous steps, because data conversion will take place if needed through the value metadata interface. One specific example of the use of the ValueMetaInterface interface is determining whether or not a value is null. In a previous step, you received an object and a ValueMetaInterface that describes this object. While it is tempting to simply check whether or not the object is null, that would be wrong in a lot of cases. For example:

- The object is a String that has ten spaces in it, and the value metadata requires the trimming of the string.
- The value metadata describes the object as being lazily loaded from a text file. As such, the object consists of a byte array (raw data) that first needs to be converted to a character sequence and then to the appropriate data type before it can be evaluated. More information on the details behind lazy loading and lazy conversion can be found in Chapter 15.

To check for null, you should use the following expression:

boolean n = valueMeta.isNull(valueData);

Important: Make sure that the data you pass into the ValueMetaInterface corresponds to the data type described in the metadata. Table 23-1 lists metadata types and their corresponding Java data types.

Table 23-1: Java Classes That Correspond with Metadata Types

VALUE META TYPE                        JAVA CLASS
ValueMetaInterface.TYPE_STRING         java.lang.String
ValueMetaInterface.TYPE_DATE           java.util.Date
ValueMetaInterface.TYPE_BOOLEAN        java.lang.Boolean
ValueMetaInterface.TYPE_NUMBER         java.lang.Double
ValueMetaInterface.TYPE_INTEGER        java.lang.Long
ValueMetaInterface.TYPE_BIGNUMBER      java.math.BigDecimal
ValueMetaInterface.TYPE_BINARY         byte[]

Row Metadata

Row metadata, describing not just a single value but complete rows of values, is represented by the RowMetaInterface interface. In essence, a RowMetaInterface class simply contains a list of ValueMetaInterface classes. It contains a series of methods that allow you to manipulate the row metadata, look up values, check for the existence of values, replace value metadata entries, and so on. The only rule is that the name of a value needs to be unique in a row. Contrary to what was the case in version 2 of Kettle, this rule is enforced in versions 3 and 4. When you add a value with the same name twice to a row, the second occurrence will be automatically renamed: the name will have the text _2 appended to it. Adding it once more will append _3 to the name of the value, and so on. Because we're usually working with rows of data in steps, it's more convenient not to deal with individual values. You can use a number of methods such as getNumber() and getString() to reach values directly in a row. If the sales figure is stored as the fourth value in a row in a step, you could, for example, write the following expression to obtain it:

Double sales = getInputRowMeta().getNumber(rowData, 3);

Obtaining a value by index is always the fastest way of getting values from rows. You can use the indexOfValue() method to look up the index of a field value in a row. This method scans the list of values one by one and is not very fast. Because of this, we recommend that you do the lookup only once for all rows that you need to process in your step plugin. As described earlier in the step plugin section, you can handle this in the "first" block right after you receive the first row from a previous step, as the following sketch shows.
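The following is a minimal sketch of that pattern. It is not part of the book's download package; the field name "sales" and the data class member salesIndex are made up for illustration. The index is looked up once in the "first" block and then reused for every row.

public boolean processRow(StepMetaInterface smi, StepDataInterface sdi)
    throws KettleException {
  HelloworldStepData data = (HelloworldStepData) sdi;

  Object[] row = getRow();
  if (row == null) {
    setOutputDone();
    return false;
  }

  if (first) {
    first = false;
    // Expensive scan of the row layout: do it only once.
    data.salesIndex = getInputRowMeta().indexOfValue("sales");
    if (data.salesIndex < 0) {
      throw new KettleException("Field 'sales' could not be found in the input row");
    }
  }

  // Cheap indexed access for every subsequent row.
  Double sales = getInputRowMeta().getNumber(row, data.salesIndex);
  logBasic("sales = " + sales);

  putRow(getInputRowMeta(), row);
  return true;
}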

StepDataInterface

The class that implements org.pentaho.di.trans.step.StepDataInterface takes care of maintaining the execution status of the step. You can also use this class to store your temporary objects. Typically, you store the output row metadata in it as a public member, as well as any open database connections, input or output streams, and so on. A minimal sketch of such a data class is shown below.
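This sketch assumes the members used in the examples of this chapter (outputRowMeta, plus the hypothetical salesIndex from the previous sketch); the actual HelloworldStepData.java in the download package may look slightly different.

package org.kettlesolutions.plugin.step.helloworld;

import org.pentaho.di.core.row.RowMetaInterface;
import org.pentaho.di.trans.step.BaseStepData;
import org.pentaho.di.trans.step.StepDataInterface;

public class HelloworldStepData extends BaseStepData implements StepDataInterface {
  // Calculated once in the "first" block of processRow() and reused afterwards.
  public RowMetaInterface outputRowMeta;
  public int salesIndex = -1;

  public HelloworldStepData() {
    super();
  }
}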

StepDialogInterface

The class that implements org.pentaho.di.trans.step.StepDialogInterface makes it possible for users to enter the metadata (parameters) of a step with a graphical user interface. The form of the user interface element is a dialog. The dialog is presented when the user edits a step in Spoon. The interface is not that complex because it contains only the open() method and setRepository() for convenience.

Eclipse SWT

Because the Eclipse SWT toolkit was selected to develop user interfaces within Kettle, you have to use it to program your dialogs. SWT provides an abstraction layer around the native operating system toolkits for the various Windows, OS X, Linux, and UNIX variants. Because of this, SWT applications usually have a look that blends in nicely with the rest of the operating system. To get started with SWT programming, we recommend that you take a look around the SWT home page at http://www.eclipse.org/swt/ to get acquainted. Here are a few interesting pages to give you a good idea of what can be done with SWT programming:

- The SWT widgets page, http://www.eclipse.org/swt/widgets/, gives a nice overview of all the available widgets you can use.
- The SWT snippets page, http://www.eclipse.org/swt/snippets/, provides common samples with Java code.

The best source of information, however, is the source code of the more than 150 internal step dialogs in the Kettle source code.

Form Layout

If you take a look at the dialog classes in Kettle, note that they usually contain a lot of verbose code. This is to ensure that all the dialogs present themselves as nicely as possible and render correctly on all possible operating systems with all possible screen and font sizes. Because of this, most of the code you'll find has to do with the layout and position of the widgets in the dialogs.

The layout system you'll encounter most often in the Kettle codebase is the FormLayout system. FormLayout works by allowing the programmer to specify relative positions of widgets and work with percentages and offsets. Take a look at a simple example (you can also find this Java code in the download file HelloworldStepDialog.java):

Label label = new Label(shell, SWT.RIGHT);
label.setText("Hello");
props.setLook(label);
FormData fdLabel = new FormData();
fdLabel.left = new FormAttachment(0, 0);
fdLabel.right = new FormAttachment(50, -10);
fdLabel.top = new FormAttachment(0, 25);
label.setLayoutData(fdLabel);

This is code you’ll see often so we’ll explain it in detail. The first line creates a new Label widget inside of a shell (dialog). The alignment of the text on the widget is right:

Label label = new Label(shell, SWT.RIGHT);

Next we put the string "Hello" in the label:

label.setText("Hello");

The next line puts the user-selected background color and font on the widget:

props.setLook(label);

The next few lines position the left side of the label at the far left of the dialog and the right side of the label 10 pixels to the left of the middle of the dialog (50%). The top of the label is positioned 25 pixels below the top of the shell.

FormData fdLabel = new FormData();
fdLabel.left = new FormAttachment(0, 0);
fdLabel.right = new FormAttachment(50, -10);
fdLabel.top = new FormAttachment(0, 25);
label.setLayoutData(fdLabel);

Instead of providing a percentage as the first argument in the FormAttachment constructor, you can also specify another widget (called a control in SWT). This allows you to specify the location of one widget relative to another. If you don't specify the bottom of the widget, SWT will pick the natural height of the widget, depending on the selected font. That in turn means that if you want to place a Text widget right below the label with a 10-pixel gap, you can do it like this:

fdLabel2.top = new FormAttachment(label, 10);

In short: don't panic; while good user interface code is usually long and verbose, it tends to be rather simplistic in nature.

Kettle UI Elements

Beyond the standard SWT widgets, you can also use one of the specific Kettle widgets that were created to make the life of the ETL developer a bit easier. Here is a short overview of the possibilities:

- TableView: This convenient data grid widget has support for sorting, selecting, keyboard shortcuts, and undo/redo, and has a right-click menu on top of it.
- TextVar: This textbox supports variables and is decorated with the $ symbol in the upper-right corner. The user can enter variables with Ctrl+Spacebar. If these keys are used inside the textbox, you'll be able to pick a variable from a pop-up window. This variable will then be inserted at the cursor location. Other than that, it's compatible with the standard Text widget.
- ComboVar: A standard combo box with support for variables.
- ConditionEditor: A widget to enter conditions as found in the Filter Rows step.

There is also a series of additional dialogs at your disposal to help you with various tasks:

- EnterListDialog: Allows you to select one or more items from a list of strings. It shows the list on the left and the selected items on the right. Buttons then allow you to move items in or out of the selection.
- EnterNumberDialog: Allows the user to enter a number.
- EnterPasswordDialog: Asks the user for a password.
- EnterSelectionDialog: Selects items in a list by highlighting them.
- EnterMappingDialog: Enters the mapping between two sets of strings.
- PreviewRowsDialog: Previews a list of rows in a dialog.
- SQLEditor: Shows a simple SQL editor that allows you to enter queries and DDL.
- ErrorDialog: The constructor displays an exception with a message and can list the stack trace details.

A short example of the TextVar widget in use is shown below.
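This is a minimal sketch (not from the book's download package) of creating a TextVar inside a step dialog, assuming the Kettle 4 constructor TextVar(VariableSpace, Composite, int); the widget names wFieldname and wlFieldname and the members shell, props, lsMod, and middle follow the conventions of the Hello World dialog shown in the next section.

// Inside the dialog's open() method:
wFieldname = new TextVar(transMeta, shell, SWT.SINGLE | SWT.LEFT | SWT.BORDER);
props.setLook(wFieldname);
wFieldname.addModifyListener(lsMod);

FormData fdFieldname = new FormData();
fdFieldname.left = new FormAttachment(middle, 0);
fdFieldname.right = new FormAttachment(100, 0);
fdFieldname.top = new FormAttachment(wlFieldname, 0, SWT.CENTER);
wFieldname.setLayoutData(fdFieldname);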

Hello World Example Dialog

Now that you've had a general look at SWT and how it handles the layout of a dialog, it's time to take a closer look at a practical example. The following Java code can also be found in the download file HelloworldStepDialog.java. The first part of the code merely makes the metadata class available to the rest of the dialog class and initializes the base step dialog:

public class HelloworldStepDialog extends BaseStepDialog
    implements StepDialogInterface {

  private static Class<?> PKG = HelloworldStepMeta.class;

  private HelloworldStepMeta input;

  private TextVar wFieldname;

  public HelloworldStepDialog(Shell parent, Object baseStepMeta,
      TransMeta transMeta, String stepname) {
    super(parent, (BaseStepMeta) baseStepMeta, transMeta, stepname);
    input = (HelloworldStepMeta) baseStepMeta;
  }

The open() method that comes next is where all the widgets for the dialog are created and assembled. SWT uses the listener design pattern a lot. You can add all kinds of listeners to widgets to see if the content has changed or if the user performed an action. In this situation, you want to know if the user changed anything in one of the widgets so that you can mark the step and transformation as changed and in need of saving.

public String open() {
  Shell parent = getParent();
  Display display = parent.getDisplay();

  shell = new Shell(parent, SWT.DIALOG_TRIM | SWT.RESIZE | SWT.MIN | SWT.MAX);
  props.setLook(shell);
  setShellImage(shell, input);

  ModifyListener lsMod = new ModifyListener() {
    public void modifyText(ModifyEvent e) {
      input.setChanged();
    }
  };
  changed = input.hasChanged();

The next part of the code indicates that you want to position the widgets in the dialog shell with the form layout method:

FormLayout formLayout = new FormLayout();
formLayout.marginWidth = Const.FORM_MARGIN;
formLayout.marginHeight = Const.FORM_MARGIN;

shell.setLayout(formLayout);

The right side of all the labels is placed at a user-defined percentage of the dialog width: props.getMiddlePct(). The margin between widgets is a constant of 4 pixels at the time of this writing.

h]Zaa#hZiIZmi 7VhZBZhhV\Zh#\ZiHig^c\E@

int middle = props.getMiddlePct();
int margin = Const.MARGIN;

The next block of code simply adds a label and a text input field on one line at the top of the dialog shell:

$$HiZecVbZa^cZ laHiZecVbZ2cZlAVWZah]Zaa!HLI#G><=I0 laHiZecVbZ#hZiIZmi 7VhZBZhhV\Zh#\ZiHig^c\E@C

Next you add the widgets that will allow you to enter a name for the output field of the “Hello World” step:

$$;^ZaYcVbZa^cZ AVWZala;^ZaYcVbZ2cZlAVWZah]Zaa!HLI#G><=I0 la;^ZaYcVbZ#hZiIZmi 7VhZBZhhV\Zh#\ZiHig^c\E@C

Next, you create two push buttons for OK and Cancel, add listeners to them, and position them at the bottom of the dialog after the last input field:

$$HdbZWjiidch lD@2cZl7jiidch]Zaa!HLI#EJH=0 lD@#hZiIZmi7VhZBZhhV\Zh#\ZiHig^c\E@

setButtonPositions(new Button[] { wOK, wCancel }, margin, lastControl);

// Add listeners
lsCancel = new Listener() { public void handleEvent(Event e) { cancel(); } };
lsOK = new Listener() { public void handleEvent(Event e) { ok(); } };

wCancel.addListener(SWT.Selection, lsCancel);
wOK.addListener(SWT.Selection, lsOK);

The next lines of code make sure that special cases are handled. First you make sure that the dialog is closed affirmatively when the user presses Enter in one of the text input fields. Then you make sure to cancel the dialog input when the shell is closed without using either the Cancel or OK buttons:

lsDef = new SelectionAdapter() {
  public void widgetDefaultSelected(SelectionEvent e) { ok(); }
};

wStepname.addSelectionListener(lsDef);
wFieldname.addSelectionListener(lsDef);

// Detect X or ALT-F4 or something that kills this window...
shell.addShellListener(new ShellAdapter() {
  public void shellClosed(ShellEvent e) {
    cancel();
  }
});

You copy the data from the input metadata to the widgets in the dialog:

// Populate the data of the controls
//
getData();

The size and position of the dialog are automatically determined based on the natural size of the dialog (minimum), the previous location and size of the dialog (where the user left it), and the size of the display(s):

// Set the shell size, based upon previous time...
setSize();

input.setChanged(changed);

shell.open();
while (!shell.isDisposed()) {
  if (!display.readAndDispatch()) display.sleep();
}
return stepname;
}

When you put data into the widgets, make sure never to put null values in them because they don't support it. The static method Const.NVL() will help you take care of that situation:

/*
 * Copy information from the meta-data input to the dialog fields.
 */
public void getData() {
  wFieldname.setText(Const.NVL(input.getFieldName(), ""));
  wStepname.selectAll();
}

Finally, the cancel() and ok() methods close the dialog; when the user selects OK, the information from the dialog is copied back into the input metadata class:

private void cancel() {
  stepname = null;
  input.setChanged(changed);
  dispose();
}

private void ok() {
  if (Const.isEmpty(wStepname.getText())) return;

  stepname = wStepname.getText(); // return value

  input.setFieldName(wFieldname.getText());

  dispose();
}
}

StepInterface

The class that implements the interface org.pentaho.di.trans.step.StepInterface is responsible for processing rows of data in any shape or form, using the metadata parameters described in the associated StepMetaInterface described previously. It will read data from zero, one, or more input steps and pass data to zero, one, or more output steps. A number of the methods that are described in the interface are used by the transformation engine to do housekeeping. However, there are only a few methods you would likely override from the BaseStep class, where all methods are implemented:

- init(): To initialize a step, you can use an init() method. The result of the initialization is a Boolean true or false. If everything you wanted to do during the initialization went okay, you return true, otherwise false. If you don't have anything to initialize, you can opt not to override this method.
- dispose(): If there are things that need to be cleaned up, you can do it in the dispose() method. For example, in this method you can close database connections and files, clear out memory buffers, and so on. This method is always called at the end of the transformation. If you don't have anything to dispose of, you can opt not to override this method.
- processRow(): This method is where the actual work takes place. The transformation engine will call this method repeatedly. As long as the method returns true, it will continue to be called in a loop.

Here is how the Hello World step uses the StepInterface (you can also find this Java code in the download file HelloworldStep.java):

package org.kettlesolutions.plugin.step.helloworld;

/* imports removed for brevity */

public class HelloworldStep extends BaseStep implements StepInterface {

As explained earlier, you use the BaseStep class to allow you to focus on the important methods. It contains a lot of useful methods that can be used inside of the step. The constructor usually simply passes its arguments to the BaseStep class, where they are handled appropriately. Because of this, you can conveniently use objects like transMeta in all steps.

public HelloworldStep(StepMeta stepMeta,
    StepDataInterface stepDataInterface,
    int copyNr, TransMeta transMeta, Trans trans) {
  super(stepMeta, stepDataInterface, copyNr, transMeta, trans);
}

The getRow() method tries to obtain a row from a previous input step. If there are no more rows to receive, the method will return null. If the previous steps can't keep up, the method will block until a row is made available. This will effectively throttle down this step to match the speed of the other steps in the transformation. See Chapter 15 for more information on this subject.

public boolean processRow(StepMetaInterface smi, StepDataInterface sdi)
    throws KettleException {

  HelloworldStepMeta meta = (HelloworldStepMeta) smi;
  HelloworldStepData data = (HelloworldStepData) sdi;

  Object[] row = getRow();
  if (row == null) {
    setOutputDone();
    return false;
  }

  if (first) {
    first = false;
    data.outputRowMeta = getInputRowMeta().clone();
    meta.getFields(data.outputRowMeta,
        getStepname(), null, null, this);
  }

  String value = "Hello, world";

  Object[] outputRow = RowDataUtil.addValueData(row,
      getInputRowMeta().size(), value);

  putRow(data.outputRowMeta, outputRow);

  return true;
}
}

For performance reasons, the getRow() call does not give you the row metadata, only the data associated with the output of the previous step. That metadata is available using the getInputRowMeta() method, but only after the first row has been received from the getRow() method. The setOutputDone() call takes care of informing any of the following steps that there is no more output to be expected from this step. That, in turn, allows the getRow() method of the next steps to return null when the buffer between the steps is empty. By returning false, the transformation knows not to call the processRow() method of your step again.

Object[] row = getRow();
if (row == null) {
  setOutputDone();
  return false;
}

The method that passes rows to the next steps is called putRow(). You need to pass the RowMetaInterface of the output along with the data it describes to this method. If you remember from the previous section, you already have a method that calculates the output row metadata:

data.outputRowMeta = getInputRowMeta().clone();
meta.getFields(data.outputRowMeta, getStepname(), null, null, this);

Note that you make a copy of the input row metadata with clone() to make sure that you are absolutely not influencing any data structure in a previous step, as that could lead to incorrect results during execution. The calculation of the output row metadata is something that needs to happen once and only once because the layout of all the output rows needs to be the same. Calculations that only need to happen once are usually placed in a "first" block. A member called first is conveniently set to true before execution in the BaseStep class, so you can use that. The last part of the code adds a string value to the input row and passes it on to the next steps:

String value = "Hello, world";

Object[] outputRow = RowDataUtil.addValueData(row,
    getInputRowMeta().size(), value);

putRow(data.outputRowMeta, outputRow);

For pure performance, rows of data are simple Java arrays. For your convenience, you can use the static methods from the RowDataUtil class to manipulate the values in them. Note that there is usually excess space in the rows that are passed, to prevent copying of data. Again, this is done for performance reasons. If you use the static RowDataUtil methods, you will have no problems with that at all.

Reading Rows from Specific Steps

If you need to read data from specific (info) steps, as is done in the Stream Lookup step, you can use the getRowFrom() method in a step:

RowSet rowSet = findInputRowSet(sourceStepName);
Object[] rowData = getRowFrom(rowSet);

The row metadata for those rows is available in the rowSet object:

RowMetaInterface rowMeta = rowSet.getRowMeta();

Writing Rows to Specific Steps

If you want to write data to specific (target) steps, as in the Filter Rows step, you can do this with the putRowTo() method:

RowSet rowSet = findOutputRowSet(targetStepName);
...
putRowTo(outputRowMeta, rowData, rowSet);

Obviously, it's most efficient to obtain a handle on the input or output RowSet objects just once.

Writing Rows to Error Handling

If you want to allow your step to support error handling, and if your metadata class returns true for the supportsErrorHandling() method, you can write rows to the error handling channel. The following is a simple example of how to do that with the putError() method:

Object[] rowData = getRow();
...
try {
  ...
  putRow(...);
} catch (Exception e) {
  if (getStepMeta().isDoingErrorHandling()) {
    putError(getInputRowMeta(), rowData, 1, e.getMessage(),
        errorFieldname, errorCode);
  } else {
    throw e;
  }
}

As shown in the example, you can pass an error count, field, message, and code to the step that will handle the errors of this step. Everything else regarding the error handling is handled automatically.

Identifying a Step Copy

Because a step can be executed in multiple copies, it can be useful to identify which step copy you’re using. Here are a few interesting methods you can use:

- getCopy(): Returns the copy number of the step. Copy numbers uniquely identify a particular step copy within the collection of copies of a step, and are in the range 0...N, where N is equal to getStepMeta().getCopies() - 1.
- getUniqueStepNrAcrossSlaves(): The step copy number in a clustered execution (also works for local execution).
- getUniqueStepCountAcrossSlaves(): The total number of step copies running on a cluster.

These methods can then be used to divide the work among the various step copies, as in the sketch below. For an example of this, see the source code for the "CSV file input" or "Fixed file input" steps. Both have the option to read text files in parallel by dividing the work among the step copies locally or across a cluster.
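This is a minimal sketch (not taken from those steps) of the idea, assuming a hypothetical list of file names that a step wants to spread over its running copies:

List<String> filenames = Arrays.asList("a.txt", "b.txt", "c.txt", "d.txt");

int copyNr = getUniqueStepNrAcrossSlaves();
int nrCopies = getUniqueStepCountAcrossSlaves();

for (int i = 0; i < filenames.size(); i++) {
  // Each copy only picks up the files assigned to it.
  if (i % nrCopies == copyNr) {
    logBasic("This step copy handles file " + filenames.get(i));
  }
}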

Result Feedback

To signal in the user interface or in the logging how many rows were processed, there are two metrics that are automatically incremented by the getRow() and putRow() methods: the number of rows read and written, respectively. The following are a few methods to influence the metrics of a step:

- incrementLinesRead(): Increase the number of lines read from previous steps.
- incrementLinesWritten(): Increase the number of lines written to subsequent steps.
- incrementLinesInput(): Increase the number of lines read from files, databases, the network, and so on.
- incrementLinesOutput(): Increase the number of lines written to files, databases, the network, and so on.
- incrementLinesUpdated(): Increase the number of lines updated.
- incrementLinesSkipped(): Increase the number of lines skipped.
- incrementLinesRejected(): Increase the number of lines rejected.

These metrics are used to see how much work a step has done. They are visible in the transformation metrics panel in Spoon and can be logged to a database logging table as well. See Chapter 14 for more information on this topic. It is also possible to keep track of the files that have been used by this step. You can use the method addResultFile() for that purpose. The list of result files can then be used by other transformations and job entries in a job. For example, take a look at this code from the "CSV file input" step:

ResultFile resultFile = new ResultFile(
    ResultFile.FILE_TYPE_GENERAL,
    fileObject,
    getTransMeta().getName(),
    getStepname()
);
resultFile.setComment("File was read by a Csv input step");
addResultFile(resultFile);

Variable Substitution

When you have an input field that supports variable substitution (see Chapter 2 for more information), you can obtain the actual value of a variable at run-time with the environmentSubstitute() method. This method is available in all objects in Kettle that implement the VariableSpace interface. Those objects include steps, job entries, transformations, jobs, databases, and so on. For example, if you would like to support the use of variables to specify the name of your output field in the "Hello World" example, you could change your example like this in the getFields() method of the StepMetaInterface class:

String realFieldName = space.environmentSubstitute(fieldName);

Because a step itself is a VariableSpace object, you can also call the variable substitution method directly, like this:

String value = environmentSubstitute(meta.getStringWithVariables());

Apache VFS

The goal of all steps in Kettle that deal with files in some form or another is to use the Apache VFS system. Apache VFS, as described in more detail in Chapter 2, allows you to read data not just from files, as is the case with the java.io.File object, but from a lot of different locations, including web and FTP servers, .zip archives, and many more. Apache VFS introduces the FileObject class that provides the abstraction layer for you. Kettle in turn provides a series of static methods in the KettleVFS class to make it easy for you to get your hands on FileObject instances for your file names. Take a look at an example:

FileObject fileObject = KettleVFS.getFileObject(
    "zip:http://www.example.com/archive.zip!file.txt"
);

It's important that you use the KettleVFS wrapper as much as possible because it contains fixes and workarounds for certain issues in the Apache VFS layer. It also contains enhancements for the SFTP protocol, as explained in Appendix C.

Step Plugin Deployment

To deploy your step plugin, you have to compile the four Java classes and any other classes they depend on. Then you should place the compiled classes in a .jar file. You can use your IDE to do that for you (see the export functionality in Eclipse, for example), or you can write an Apache Ant build file to do it automatically. The .jar file itself should be placed in the plugins/steps directory of your binary Kettle distribution. If you like, you can use a subdirectory. It is also possible to add all dependent files in a lib/ folder where the plugin .jar file is located. There is no need to place them all in the common Kettle class path (the libext/ directory), as these extra libraries will be picked up automatically by Kettle. You can place multiple step plugins in a single .jar file. If you want to debug your plugin in your IDE, you can set the name of your metadata class in the variable KETTLE_PLUGIN_CLASSES (a comma-separated list), as sketched below. For more information on this topic, consult the Pentaho wiki at http://wiki.pentaho.com/display/EAI/How+to+debug+a+Kettle+4+plugin.
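As a hedged sketch of that debugging setup (the exact mechanism may differ per Kettle version, so treat this as an assumption and check the wiki page): one common approach is to register the annotated metadata class as a Java system property in a small launcher run from the IDE, before the Kettle environment is initialized, so the Plug-in Registry picks the plugin up from the project's class path instead of a deployed .jar file.

// Hypothetical launcher code used only for debugging from the IDE.
System.setProperty("KETTLE_PLUGIN_CLASSES",
    "org.kettlesolutions.plugin.step.helloworld.HelloworldStepMeta");
KettleEnvironment.init();
// ... now start Spoon or run a test transformation with the plugin classes on the class path.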

The User-Defined Java Class Step

The User Defined Java Class step, UDJC for short, is a new step in version 4 of Kettle. It allows you to write the Java code of what is essentially a step plugin directly inside the UDJC dialog. The code you enter is then compiled and executed at run-time. This offers the flexibility and power of a scripting language while maintaining the optimal speed of native Java code execution. If you've read the previous section of the chapter regarding the development of step plugins, writing a UDJC is easy. The init(), dispose(), and processRow() methods covered there can be used in exactly the same fashion in a UDJC. In fact, almost everything you've learned about writing a plugin can be applied. The main differences are the lack of a metadata class and the lack of a fancy dialog to enter metadata.

Passing Metadata

Without any method of parameterization, the Java code written in a UDJC step would be inflexible and hard to maintain. That is why you can enter parameter tags and values in the Parameters tab at the bottom of the dialog. You can obtain the value for a tag in your UDJC class like this:

String value = getParameter("TAG");

Accessing Input and Output Fields

The UDJC includes a convenient get() method to make it easy to access input and output fields. Take a look at the following example that reads a String value from a row of data:

String name = get(Fields.In, "last_name").getString(rowData);

The same is possible for setting values in output fields that were specified in the Fields section at the bottom of the dialog:

get(Fields.Out, "book_name").setValue(r, "Kettle Solutions");

Snippets

The UDJC dialog contains a Code Snippets section in the tree on the left. You can double-click on a snippet to insert it into your code. The snippets are basic examples showing you the various things you can do in a step.

Example

The following code performs the same function as the four classes described earlier: it simply adds a single field that contains the string "Hello, World". You can also find this code as part of the transformation UDJC sample.ktr in the download package for this chapter.

int outputRowSize;

public boolean processRow(StepMetaInterface smi, StepDataInterface sdi)
    throws KettleException {
  Object[] r = getRow();
  if (r == null) {
    setOutputDone();
    return false;
  }

  if (first) {
    first = false;
    outputRowSize = data.outputRowMeta.size();
  }

  r = createOutputRow(r, outputRowSize);
  get(Fields.Out, "hello").setValue(r, "Hello, World");

  // Send the row on to the next step.
  putRow(data.outputRowMeta, r);

  return true;
}

As you can see, the code looks similar to the code you wrote in the StepInterface class of the "Hello World" step plugin shown previously. There is only one output field, called "hello", specified in the Fields tab at the bottom of the UDJC step. Compared to a dialog designed for a specific set of parameters, editing the metadata is obviously less fancy inside the Java code or in the UDJC dialog in general. However, compared to writing a plugin, it takes a lot less code and time to get a solution in place.

Job Entry Plugins

Writing a job entry plugin is very similar to writing a step plugin. The main difference is that you only have to implement two interfaces instead of four:

- JobEntryInterface: Takes care of the metadata and execution of the job entry.
- JobEntryDialogInterface: Almost exactly like the step dialog in the way that it allows you to edit the metadata of the job entry.

Steps run in parallel and job entries run in sequence, so there is no separation between the metadata and the execution classes in a job entry. This simplifies things a bit.

JobEntryInterface

The class that implements the org.pentaho.di.job.entry.JobEntryInterface interface needs to implement one important method to get the job done:

- execute(): Executes the job entry and returns the Result object. (See Chapter 22 for more information on the Result object.)

Quite a few other methods were covered earlier in the step plugin development section. They work in exactly the same way for the JobEntryInterface:

- getXML and loadXML to serialize to XML
- saveRep and readRep to serialize to a repository
- getSQLStatements to generate SQL
- getUsedDatabaseConnections to see which connections are used by the job entry
- check to validate the metadata
- getResourceDependencies to list dependencies
- exportResources to allow linked resources (jobs and transformations, for example) to be exported
- getDialogClassName to know which dialog to use to edit the metadata

In addition, two other methods are occasionally used:

- boolean evaluates(): Determines whether or not this job entry is capable of making an evaluation. Except for the rare exception (such as the Start job entry), this should always return true.
- boolean isUnconditional(): Tells the system whether or not the job entry allows unconditional continuation after execution. Except for the rare exception, this should also return true.

Let's take a look at a simple "Hello World" job entry. This job entry allows the user to specify a Boolean value: true or false. If the user specifies true, then the job will continue to follow the green (success) job entry hop; if not, the red hop. It mimics the success or failure of a job entry. The @JobEntry annotation works exactly like the @Step annotation described previously. It allows the plugin registry to recognize this class as a job entry plugin (you can find this code also in the download file HelloworldJobEntry.java):

package org.kettlesolutions.plugin.jobentry.helloworld;

/* imports removed for brevity */

@JobEntry(
    id = "Helloworld",
    name = "HelloworldJobEntry.name",
    description = "HelloworldJobEntry.description",
    categoryDescription = "HelloworldJobEntry.category",
    i18nPackageName = "org.kettlesolutions.plugin.jobentry.helloworld",
    image = "org/kettlesolutions/plugin/step/helloworld/HelloWorld.png"
)

For your convenience, you can subclass the JobEntryBase class, which allows you to implement only those methods you actually need:

public class HelloworldJobEntry extends JobEntryBase
    implements JobEntryInterface {

Again, much like the step plugin, the class contains the actual metadata (parameters) and you have getters and setters to handle those:

private boolean success;

/**
 * @return the success flag
 */
public boolean isSuccess() {
  return success;
}

/**
 * @param success the success flag to set
 */
public void setSuccess(boolean success) {
  this.success = success;
}

The result of any type of execution is always a Result object. This Result object is passed on to the next job entry:

public Result execute(Result prevResult, int nr)
    throws KettleException {
  prevResult.setResult(success);
  return prevResult;
}

The only thing to keep in mind in the execute() method is to always use the same Result object: modify it as shown in the example instead of creating a new one.

public enum Tag {
  success,
};

public void loadXML(Node entrynode, List<DatabaseMeta> databases,
    List<SlaveServer> slaveServers, Repository rep)
    throws KettleXMLException {
  super.loadXML(entrynode, databases, slaveServers);

  success = "Y".equalsIgnoreCase(
      XMLHandler.getTagValue(entrynode, Tag.success.name()));
}

public String getXML() {
  StringBuilder xml = new StringBuilder();
  xml.append(super.getXML());
  xml.append(XMLHandler.addTagValue(Tag.success.name(), success));
  return xml.toString();
}

public void loadRep(Repository rep, ObjectId idJobentry,
    List<DatabaseMeta> databases, List<SlaveServer> slaveServers)
    throws KettleException {

  success = rep.getJobEntryAttributeBoolean(idJobentry, Tag.success.name());
}

public void saveRep(Repository rep, ObjectId idJob)
    throws KettleException {
  rep.saveJobEntryAttribute(idJob, getObjectId(),
      Tag.success.name(), success);
}

}

JobEntryDialogInterface

The job entry dialog is identical in many ways to the StepDialogInterface we described extensively in the step plugin section. As such, we'll only add here that you need to implement the org.pentaho.di.job.entry.JobEntryDialogInterface interface. The only difference from the earlier interface is that the user edits a JobEntryInterface class and not a StepMetaInterface class. You can extend the JobEntryDialog class to get a good starting point for your job entry dialogs. As usual, don't forget to look at the many samples in the Kettle source code for ideas on how to construct your dialogs. Deploying or debugging a job entry plugin happens in the exact same way as you saw with the step plugins. The only difference is that you should place the .jar file containing the plugins in the plugins/jobentries directory. All the other instructions are identical to those provided for the earlier dialog interface.

Partitioning Method Plugins

As described in Chapter 16, a partitioning method determines the partition to which a row of data belongs based on a set of rules. With a partition plugin, you can write your own set of rules for the grouping of rows.

To implement a partitioning method, you need to implement two interfaces: Partitioner and StepDialogInterface. (The latter is used for historical reasons and because the dialog interfaces usually contain only the open() method.) The only difference between this dialog and those covered earlier is that it is passed the class that implements the Partitioner interface instead of step or job entry metadata. Everything else about the dialog is the same as with the previous plugin types. For this reason, we won't cover the dialog here. For an example, look at the file HourPartitionerDialog.java in the download package of this chapter or the ModPartitionerDialog class in the Kettle source code.

Partitioner
In addition to the usual four methods to serialize the metadata as XML or to a repository and getDialogClassName(), there is one more main method that you need to implement to get a partitioning method to work: getPartition(). In the following code sample, you calculate the partition based on the last two digits before the extension of a file name. These digits represent the hour for which the data in the file was generated. For example, if the file is called data-17.txt, it contains data from 5 to 6 p.m. (the 18th hour in a 24-hour day). The getPartition() method should return partition 17 for this file; since partition numbers start at zero, this is the 18th partition. The metadata of the partitioning method is simply the name of the field that contains the file name. The init() method, which comes from the BasePartitioner class that you subclass, initializes the number of available partitions in the variable nrPartitions if this has not already been done. You can find the following code samples in the file HourPartitioner.java in the download package of this chapter.

public int getPartition(RowMetaInterface rowMeta, Object[] row)
    throws KettleException {
  init(rowMeta);

For performance reasons, you cache the index of the field to partition on; in this case, that field contains the file name. If a field with the specified name can't be found, you throw an exception with an appropriate error message.

  if (partitionColumnIndex < 0) {
    partitionColumnIndex = rowMeta.indexOfValue(fieldName);
    if (partitionColumnIndex < 0) {
      throw new KettleStepException(
          BaseMessages.getString(PKG, ...

Next you obtain the file name itself. Because data can never be trusted, you check that the input is indeed a String value. If this is not the case, you terminate the transformation by throwing an exception with a message that describes the problem:

  ValueMetaInterface valueMeta =
      rowMeta.getValueMeta(partitionColumnIndex);
  Object valueData = row[partitionColumnIndex];

  if (!valueMeta.isString()) {
    throw new KettleException(BaseMessages.getString(PKG, ...

The last lines are responsible for the actual calculation of the partition number. It’s important to return a number that is between 0 and the number of partitions minus one. To make sure that this is the case, it’s convenient to use the modulo (remainder of division) calculation:

  String filename = valueMeta.getString(valueData);
  String hourString =
      filename.substring(filename.length() - 6, filename.length() - 4);
  int intValue = Integer.parseInt(hourString);
  int targetLocation = intValue % nrPartitions;

  return targetLocation;
}
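The metadata side of this partitioner (the name of the field that holds the file name) is handled by the serialization methods mentioned above. As a rough sketch only (the tag name fieldname is an assumption of ours, not taken from the download package), two of those methods might look like this:

public String getXML() {
  return XMLHandler.addTagValue("fieldname", fieldName);
}

public String getDialogClassName() {
  return HourPartitionerDialog.class.getName();
}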

Again, because of the presence of the uniform plugin registry, deploying a partitioning method is similar to what you've seen with the other plugins. For historical reasons, you have to deploy the .jar file in the same directory as the step plugins: plugins/steps.

Repository Type Plugins

Over the course of the last decade, new software configuration management (SCM) and content management systems (CMSs) have become available to a wide audience in the form of low-cost and open source software. Examples of popular open source SCM software titles include Concurrent Versioning System (CVS), Subversion, and Git. In the area of CMS, we can list systems such as JCR (Java Content Repository), CMIS (Content Management Interoperability Services), or even the new Google Wave protocol. Because all the listed systems would be suitable for storage and retrieval of Kettle metadata, the Kettle development team deemed it appropriate to create an abstraction layer that allows you to create your own implementation of a repository on any of these SCM or CMS systems.
Repository is the name of the main interface to implement for this plugin type. It covers everything ranging from serialization of transformations to the specification of security providers. With such a wide range of topics to cover, this plugin type goes far beyond the interest range of most ETL developers. However, if you are interested in writing a repository type plugin for Kettle, there are again samples to look at. First take a look at the Kettle database repository (see also the KettleDatabaseRepository class). That repository can serialize Kettle metadata to a relational database schema. Then you can look at the file repository (see also the KettleFileRepository class). It provides a repository interface layer around an Apache VFS file location such as a folder on disk or a zip file. Because of the simplicity of the file repository, we advise you to take this code as a starting point for your studies on the subject.
Because the repository type plugin system is in use for the Pentaho Enterprise Repository, you know that the capabilities of these types of plugins include advanced version management, file locking, and security. However, considerable effort is likely needed to implement all these features. To implement advanced functionality such as file locking without this being explicitly present in the Repository interface, Pentaho allows the registration of services that can extend the standard capabilities. For more information, see this page on the Pentaho wiki: http://wiki.pentaho.com/display/EAI/Registering+Services+to+the+Repository+in+Kettle. Finally, repository type plugins can be deployed to the plugins/repositories folder and will be recognized when they are annotated with the @RepositoryPlugin annotation.

Database Type Plugins

Relational and column store databases are in a constant state of flux. New contenders to the crown and new versions are appearing all the time, making the maintenance of the database abstraction layer not an easy task. The goal of that layer is to make it easy and convenient for the ETL developer to get connected to a database with as little hassle as possible. In the past, this has worked well for Kettle in the sense that it's easy to get access to 99 percent of the most common databases. However, for the odd case, the buggy drivers, and the brand new versions, it is convenient to have a way of tweaking existing functionality. That is what the database type plugin system allows you to do. The main interface that describes a database type is org.pentaho.di.core.database.DatabaseInterface. This interface contains a wide range of methods that describe the behavior of the database. It is possible for a database type plugin to define or override any behavior that might be interesting to you. As an example, let's create a new database type for the MySQL 5.1 JDBC driver. As explained at the beginning of this chapter, it is possible to replace the existing MySQL driver simply by using the same existing "MYSQL" database type ID. In this example, we're creating another entry in the list of available databases in the database dialog. (You can also find this code in the MySQL51DatabaseMeta.java file in the download package of this chapter.)

package org.kettlesolutions.plugin.database.mysql51db;

import org.pentaho.di.core.database.DatabaseInterface;
import org.pentaho.di.core.database.MySQLDatabaseMeta;
import org.pentaho.di.core.plugins.DatabaseMetaPlugin;

@DatabaseMetaPlugin(

    type = "MYSQL51",
    typeDescription = "MySQL 5.1"
)

The 5.1 driver uses a different JDBC driver class by default and you also want it to support transactions:

public class MySQL51DatabaseMeta extends MySQLDatabaseMeta
    implements DatabaseInterface {

  public String getDriverClass() {
    return "com.mysql.jdbc.Driver";
  }

  public boolean supportsTransactions() {
    return true;
  }
}

To make this simple database type plugin work, you simply have to put it in a .jar file and deploy it in the plugins/databases directory in your binary Kettle distribution. The user interface aspect of the database type plugin system is not completed at the time of this writing. Because the database dialog is a core (or commons) project and uses XUL, you will have to specify a value for the method getXulOverlayFile in the database interface. In this example, we would return mysql. Combined with the access method, this would result in the file mysql_native.xul being read. The Kettle developers expect that the specification of a plugin-defined XUL overlay file will be possible soon. For the time being, it is possible to hard-code the various interface values or to read them from a file somewhere (defined perhaps by an environment variable).
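For example, that override could look roughly like the following sketch (assuming, as stated above, that getXulOverlayFile() is one of the methods on the database interface; it would be added to the MySQL51DatabaseMeta class):

  public String getXulOverlayFile() {
    // Combined with the native (JDBC) access method, this makes the
    // database dialog read the file mysql_native.xul.
    return "mysql";
  }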

Summary

In this chapter, you learned the basics of writing a plugin for Kettle. You took a look at the plugin architecture and the setup of your development environment. Specifically, the chapter discussed:

■ Implementing the four core interfaces that make up a step plugin
■ Handling advanced step plugin problems like sending rows to specific steps and variable substitution
■ Writing code for the User Defined Java Class step
■ The basics of SWT user interface development
■ Creating a job entry plugin
■ Designing a custom partitioning method
■ Building a repository type plugin
■ Creating your own database type as a plugin

APPENDIX A

The Kettle Ecosystem

By now you have probably discovered that developing ETL solutions using Kettle is not only easy but also a lot of fun. To make it even easier and more fun than it already is, you probably want to get in touch with other Kettle fans. In some circumstances you might also run into something that doesn't work as expected, and you need to find out whether it's just you having this problem or whether it's a known problem already. Finally, you want to keep up-to-date on the latest developments, try out new features, or even contribute to the Kettle community. This appendix is meant to introduce you to all the sites and tools that make up the Kettle ecosystem.

Kettle Development and Versions

The code base of the Kettle project is under heavy development and changes almost daily. Bugs are fixed, new features are added, and optimizations are implemented constantly. As a consequence, five versions are available for use, each in its own stage of the product lifecycle. The following list explains the various options for obtaining the software and their maturity, from GA (General Availability) to Trunk (source code you need to build and compile yourself):

■ GA (General Availability) version: The most stable version of Kettle and probably the one you will be using in a production environment. In fact, we don't recommend any other version than GA for use in production, period. The GA version has undergone heavy unit and integration testing and is used by thousands


of people all over the world. If you go to the Pentaho Community download site and select Latest Stable Builds, the URL http://wiki.pentaho.com/display/COM/Latest+Stable+Builds is opened. You might notice that this overview doesn't always contain the latest versions of the product, so in order to get the actual latest GA version, it's better to go to the Pentaho SourceForge files directly, which are located at http://sourceforge.net/projects/pentaho/files.
■ RC (Release Candidate): Prior to becoming GA, several Release Candidates will be published for download. RCs serve several purposes: First, they enable organizations to perform upgrade tests and make sure that existing functionality won't break. Second, they allow ETL developers to find out which new features are available. Finally, they allow Pentaho to get early feedback from customers on new features and any possible remaining issues. RC versions are code complete, meaning that there won't be a feature difference between a release candidate and the final GA release. RC versions can also be obtained from the previously mentioned SourceForge site.
■ Milestone: During the development of a new release, several milestone releases are created, each adding more new features to the final product. The Pentaho developers have embraced the Scrum development methodology, which works from an issue backlog, where an "issue" can refer to a new feature, a product improvement, or a bug. This issue backlog is divided into so-called sprints, and each finalized sprint results in a milestone. Sometimes these milestones are available for download as a kind of snapshot, but sometimes not. Milestone releases can be found on the wiki page http://wiki.pentaho.com/display/COM/Latest+Milestones and are usually kept for download on the CI site (see the next bullet point) for some time as well.
■ CI (Continuous Integration): Pentaho uses Hudson, a tool that is used for continuously building and testing software projects. The Pentaho CI environment can be accessed at http://ci.pentaho.com, where you can find the latest builds of all Pentaho projects, including Kettle. The CI site contains the binary versions of the different Pentaho projects, but there are no installers, so you need to know how to install and deploy the software yourself. In the case of Kettle, this is very straightforward; just download the latest zip file (the one with the version and build number in the file name, not the pdi-ce-TRUNK-SNAPSHOT.zip file), unpack it in a directory, open a terminal screen, and issue either spoon.sh (Linux) or spoon.bat (Windows). On a Mac, just click the Data Integration 64-bit.app symbol.
■ Trunk version: If you're a developer and would like to contribute to Kettle, or would like to work with the source code directly, you can get the code from the trunk directory of Subversion, the version management system used by Pentaho. The Kettle code can be accessed from http://source.pentaho.org/svnkettleroot; the web page http://community.pentaho.com/getthecode offers more general information about accessing and working with the source code of the various Pentaho projects.

The Pentaho Community Wiki

Probably the first choice of documentation for anyone starting out in the Pentaho world is the Community Wiki, which you can access at http://wiki.pentaho.com. The wiki is divided into separate sections, so each project has a separate set of pages where you can find documentation, frequently asked questions, screenshots, and recorded demos. Although it seems like this is the documentation nirvana where you can find everything you need, you'll soon find out that reality is a bit less shiny. Many links point to almost empty pages or describe older versions of the products, and in the case of Kettle, some of the steps are not documented (yet) at all. Remember that the content of the wiki is entirely created and maintained by the Pentaho Community. That is, by the way, also the good news: As soon as you register and sign in, you can start adding or updating content and helping others become more productive using Kettle.

Using the Forums

Unlike the wiki, which is of a more static nature, the Pentaho forums can be used to interact with other Kettle fans, and they are a gold mine for getting help or looking for a solution to your specific problem. The Pentaho forum is a public website that can be found at http://forums.pentaho.org and is open to anyone. There's only one golden rule to remember, however, not only for this forum but for any open source community: You're never on your own, but you need to give to get. A bit of patience and politeness don't hurt either. The forums can be used to search for existing posts with a subject you're interested in, you can post new threads or reply to existing ones, and many announcements about the Pentaho community can also be found on the forums first. If you don't register and sign in, you can still use the forum to search for solutions, but if you want to post a new thread or reply to someone else's, you need to be signed in. At your first visit to the aforementioned site, you'll immediately notice that Kettle is one of the more popular topics on the forums. The Kettle forum alone now contains over 40,000 messages, so it's likely your question was asked before by someone else. So before jumping in and posting a new thread, make sure that there isn't an existing thread with a similar question or discussion. In fact, there are some other ground rules as well, and Matt Casters, chief architect of Kettle, was kind enough to list them in the first post you should read. It's aptly called "Read me first" and can be found at http://forums.pentaho.org/showthread.php?t=71633. A good way to keep track of what's going on in the forums is to subscribe to the RSS feed using a news reader like http://reader.google.com or any other one you prefer. Subscribing to the feed is easy. Just go to the main Kettle forum page and click the RSS button right below the New Thread button. The screenshot in Figure A-1 shows exactly where it is located. Alternatively, you can add the feed URL directly to your reader of choice. The URL is http://forums.pentaho.org/external.php?type=RSS&forumids=135.

Figure A-1: Forum feed subscription

Jira

Jira is a product of the Australian company Atlassian (http://www.atlassian.com) and is arguably the most widely used issue tracking and project management tool in the world. Major companies such as Oracle, Cisco, and Boeing use the Atlassian tools, and even Microsoft is on their customer list. Jira is therefore not a Pentaho product, but a product used by Pentaho. It has become so well known in the community, however, that it has become a name of its own. So when someone says to "look it up in Jira" or asks you whether you have "posted it in Jira," what they actually mean is the Pentaho issue tracking site that can be found at http://jira.pentaho.org. At first glance, it might seem that Jira is a bug tracker, and although this might be correct for many of the entries to be found on the site, it's not the complete story. Jira is actually the environment where you can find the Pentaho roadmaps and the planning and work in progress for the upcoming releases—and yes, bugs are posted there as well. Unlike proprietary software products, you can actually see what's going on with your favorite tool, who got assigned the open issues, and when you can expect them to be solved. Figure A-2 shows an example of the Pentaho Data Integration Task board, which shows 51 open issues on the left, 3 closed issues on the right, and no issues currently in progress.

Figure A-2: Jira Task Board

##pentaho

This somewhat cryptic title is the name of the unofficial Pentaho channel on IRC (Internet Relay Chat). IRC is one of the oldest communication channels on the Internet and has been around for over 20 years now. This predates the WWW and HTML era, so communication on an IRC channel is still text-based, although you can use URLs nowadays. IRC is a client/server type of application, so you need a client to connect to the various servers that are available worldwide. Each server hosts one or more channels you can join, and most channels have a single hash sign (#) before their name. Trying to find a #pentaho channel won't result in anything because the channel uses two hash tags: it's ##pentaho, and can be found on Freenode.net, one of the available open IRC servers on the Internet.

NOTE A common rule on IRC is that channels having a name starting with a single hash tag (#) are general chat channels, whereas the double hash tag (##) indicates help channels.

Getting on the channel is very straightforward: just download one of the available chat clients, connect to the Freenode server, and join the ##pentaho channel. Two of the more popular clients are XChat (http://xchat.org) for Linux and Windows, and mIRC (http://www.mirc.com) for Windows only. If you're a Mac user, you might want to take a look at XChat Aqua (http://xchataqua.sourceforge.net) or Babbel (http://www.babbelirc.com). All of these clients work in a similar way; once you connect to the channel(s) of your choice, you can simply type in what you want to say or ask in the text box, which is usually at the bottom of the screen. Figure A-3 shows a screenshot of XChat connected to the ##pentaho channel.

Figure A-3: Chatting with the community

As you can tell from just looking at Figure A-3, there's not much to learn here. You'll see that there are 29 users connected (this was on a Sunday), and there's an announcement about the upcoming community meeting at the top. If you engage in a private chat, the name of the user you're chatting with is displayed in the channel list on the left as well. Usually you'll find anywhere between 25 and 40 people on the channel. Many of the Pentaho developers are active on IRC, and Doug Moran, Pentaho's community manager, is also a regular visitor. If you look at the text in the screenshot displayed in Figure A-3, you'll notice multiple people interacting and trying to tackle a problem, and usually (we must stress that again: usually) people are very helpful.

WARNING Don't expect the IRC channel to be a free support channel where all your questions are answered immediately by a group of highly skilled Pentaho experts. In many cases, you'll be directed to either the forums or to Jira … or it may take hours before someone notices your question (if at all). In the screenshot, you'll also notice the names of the users on the channel. Most people use an alias, not a real name. Most of these users can be found on the Pentaho forum as well, and if you hang around long enough in the community you'll eventually find out who's who.

APPENDIX B

Kettle Enterprise Edition Features

As you probably know already, Pentaho offers two versions of Kettle, an open source Community Edition (CE) and an Enterprise Edition (EE) that contains proprietary pieces of software. On the Pentaho website you can learn about the differences between CE and EE in general (http://www.pentaho.com/products/enterprise/enterprise_comparison.php), and as you can see from the site there's a lot to be gained from taking an Enterprise subscription.

NOTE One thing to keep in mind about the Enterprise Edition is that you sign up for a yearly subscription, which is different from a software license. Most software licenses include the right to use the software for an infinite amount of time, and even if you stop paying for support, you can still keep using the software. A subscription works differently. Each year the subscription needs to be renewed, and a new license key with an expiration date set at the current date plus one year will be issued. This means that when you decide to terminate the subscription, the software will stop working as soon as the license key expires. However, with Kettle there could be a smooth downgrade path from the Enterprise to the Community Edition; as long as you don't use any of the EE-only steps (see the following list), all jobs and transformations built using the Enterprise Edition will run unchanged on the Community Edition as well.

Until version 4 of Kettle, there weren’t any significant differences in functionality between Kettle CE and Kettle EE. The only difference was the extended Enterprise Console, which let you schedule and monitor jobs. This has changed considerably with


the arrival of version 4. Kettle CE version 4 is still a very powerful ETL tool with many features that can only be found in very expensive proprietary tools, but it lacks the following components that are only available via an EE subscription:

■ Installer: Although installation of Kettle CE is a very straightforward process, the EE further eases this effort by packaging the solution in an easy-to-use installer.
■ Agile BI: Kettle EE contains the plugin for Agile BI development (see also Chapter 11) in the installation files. CE users need to download, unpack, and deploy this component separately.
■ Integrated Scheduling: Kettle EE contains a built-in scheduling tool, eliminating the need to use either the Enterprise Console or some third-party tool for this task.
■ Data Integration Server: Kettle EE comes with a dedicated data integration server that can be managed independently from other Pentaho components. This also makes Kettle EE easier to deploy and manage in a corporate environment.
■ Management Console: The Pentaho Enterprise Console can be used as a dedicated management console for Pentaho Data Integration.
■ Enterprise Repository: Arguably the biggest differentiator between the CE and EE versions is the new repository that works closely with the new data integration server. This repository enables features that were never available in Kettle before, and are mostly targeted at supporting multi-developer teams:
  ■ Versioning: Enables storing multiple versions of a job or transformation and the ability to roll back to previous versions.
  ■ Locking: Prevents developers from overwriting someone else's work by allowing check-out/check-in of components.
  ■ Security: Supports fine-grained authorization schemes for users, roles, and permissions. By default, the data integration repository security integrates with Pentaho security, but it can also leverage existing authentication and authorization providers such as LDAP and Active Directory.
■ Additional steps: There are extra transformation steps and job entries added to the Enterprise Edition. Examples are the Google Analytics Input step, Google Docs Input step, JMS step, and many more.
As you can read on the edition comparison site, an Enterprise Edition subscription comes with toll-free telephone and e-mail support, including unlimited support cases. Even if you don't want to buy an EE subscription, you can probably still get paid support; the Kettle Community Edition is one of the products for which paid developer support was available at the time of this writing.

APPENDIX C

Built-in Variables and Properties Reference

This appendix starts with a description of all the internal variables that are set automatically by Kettle. That is followed by a list of all the variables that you can set to influence the way that Kettle operates at run-time. Next, we discuss how you can define variables to help secure FTP authentication using VFS. Finally, we mention a few noteworthy variables from the Java Runtime Environment. For more information on how to use variables, see Chapter 2.

Internal Variables

Table C-1 shows all the variables that are defined at run-time by the Kettle transformation or job engine.

Table C-1: Internal Variables

Internal.Kettle.Version: This contains the version string of Kettle, for example, 4.0.0.
Internal.Kettle.Build.Version: This contains the Subversion revision of the Kettle source of the build you're using.
Internal.Kettle.Build.Date: The build date.
Internal.Job.Filename.Directory: If you are running a job from a file (.kjb), this variable will contain the folder in which that file is located. This variable allows you to specify other jobs, transformations, and files in a relative fashion.
Internal.Job.Filename.Name: If you are running a job from a file (.kjb), this variable will contain the name of that file.
Internal.Job.Name: The name of the currently executing parent job.
Internal.Job.Repository.Directory: If you are running a job from a repository, this variable will contain the path to the repository directory from which the job was loaded.
Internal.Transformation.Filename.Directory: If you are running a transformation from a file (.ktr), this variable will contain the folder in which that file is located. This variable allows you to specify mappings and files in a relative fashion.
Internal.Transformation.Filename.Name: If you are running a transformation from a file (.ktr), this variable will contain the name of the file.
Internal.Transformation.Name: The name of the currently executing transformation.
Internal.Transformation.Repository.Directory: If you are running a transformation from a repository, this variable will contain the path to the repository directory from which the transformation was loaded.
Internal.Step.Partition.ID: If a step is configured to run as partitioned, multiple copies will be executed, one for each partition. This variable will contain the partition ID to which that step copy belongs.
Internal.Step.Partition.Number: If a step is configured to run as partitioned, multiple copies will be executed, one for each partition. This variable will contain the partition number to which that step copy belongs, in the range between zero and the number of partitions minus one.
Internal.Slave.Transformation.Number: If a transformation is running as clustered on a slave server (not the master), this variable will contain the slave number. This value will be in the range between 0 and the number of slaves minus one.
Internal.Slave.Server.Name: If a transformation is running as clustered on a slave server (not the master), this variable will contain the name of the slave server on which it runs.
Internal.Cluster.Size: If a transformation runs as clustered, this variable will contain the number of slaves in the cluster.
Internal.Step.Unique.Number: This variable contains the unique number of step copies for a given step in a transformation. This also works when the step is executed with clustering and/or partitioning. This value will be in the range between zero and the number of step copies minus one.
Internal.Cluster.Master: If a transformation runs as clustered, this variable will contain Y if it is running on the master and N if it is running on a slave.
Internal.Step.Unique.Count: The number of unique step copies that are executed. This also works when the step is executed with clustering and/or partitioning.
Internal.Step.Name: The name of the executing step.
Internal.Step.CopyNr: The copy number in the local transformation (does not take into account clustering).
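To give a quick, hedged example of how these internal variables are typically used (the path and file name here are made up for illustration): any step field that supports variable substitution can reference them with the ${} syntax, for instance to read an input file that is stored next to the transformation itself:

  ${Internal.Transformation.Filename.Directory}/input/customers.csv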

Kettle Variables

Table C-2 provides a list of variables that can be set to configure various Kettle-specific options.

Table C-2: Kettle Variables

KETTLE_SHARED_OBJECTS: The location of the shared objects file for transformations and jobs. The default shared objects file is called shared.xml, located in the Kettle home directory. Setting this variable overwrites this default.
KETTLE_EMPTY_STRING_DIFFERS_FROM_NULL: When set to Y, an empty string and null are considered different. Otherwise, they are not (the default).
KETTLE_MAX_LOG_SIZE_IN_LINES: The maximum number of log lines that are kept internally by Kettle. Set to 0 to keep all rows (the default).
KETTLE_MAX_LOG_TIMEOUT_IN_MINUTES: The maximum age (in minutes) of a log line while being kept internally by Kettle. Set to 0 to keep all rows indefinitely (the default).
KETTLE_STEP_PERFORMANCE_SNAPSHOT_LIMIT: The maximum number of step performance snapshots to keep in memory. Set to 0 to keep all snapshots indefinitely (the default).
KETTLE_PLUGIN_CLASSES: A comma-delimited list of classes to scan for plugin annotations. See http://wiki.pentaho.com/display/EAI/How+to+debug+a+Kettle+4+plugin for more information.
KETTLE_LOG_SIZE_LIMIT: The log size limit for all transformations and jobs that don't have the "log size limit" property set in their respective log table properties.
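As a small, hedged illustration (our assumption here is that these options are resolved as Java system properties, which is how the kettle.properties file in the Kettle home directory typically feeds them in), a program that embeds Kettle could set one of them before initializing the environment:

import org.pentaho.di.core.KettleEnvironment;

public class KettleOptionsExample {
  public static void main(String[] args) throws Exception {
    // Keep at most 5000 log lines in memory (KETTLE_MAX_LOG_SIZE_IN_LINES).
    System.setProperty("KETTLE_MAX_LOG_SIZE_IN_LINES", "5000");
    KettleEnvironment.init();
  }
}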

There are also a number of variables, listed in Table C-3, that allow you to automatically configure the various logging tables for all jobs and transformations. For the variables in that table, you have to replace ... with one of the following values:

■ TRANS: For the transformation log table
■ TRANS_PERFORMANCE: For the performance log table
■ STEP: For the step log table
■ JOB: For the job log table

■ JOBENTRY: For the job entry log table
■ CHANNEL: For the channel log table
For more information on log tables, see Chapter 14.

Table C-3: Kettle Log Table Variables

Variables for Configuring VFS

As explained in Chapter 2, it is possible to use URIs to specify various locations of files in Kettle. The underlying technology, Apache VFS, supports many different file systems such as file://, zip://, ftp://, http://, and so on. The standard way to authenticate against all of these file systems is to place username:password@ before the hostname. For example:

ftp://john:pwd4john@example.com/pub/customer/file.txt

For most file systems this works fine. However, in the case of secure FTP (sftp://) you may be required to specify additional information such as the location of a private key file. This is also supported through the system of variables. If you want to configure these options, you have to follow these variable naming conventions:

■ The variable must start with vfs.
■ The file system scheme comes next in the variable name: sftp.
■ The parameter you want to set is next:
  ■ StrictHostKeyChecking: If no, the certificate of any remote host will be accepted. If yes, the remote host must exist in the known hosts file (~/.ssh/known_hosts).
  ■ authkeypassphrase: An optional passphrase that may be required to use the identity key.
  ■ identity: The fully qualified path to the private key used for SFTP authentication.
■ Finally, the variable name is followed by the host name or IP address for which the variable is created.

Here are a few examples:

■ vfs.sftp.StrictHostKeyChecking.sftp.myhost.net
■ vfs.sftp.authkeypassphrase.sftp.example.com
■ vfs.sftp.identity.192.168.1.5

NOTE Refer to http://wiki.pentaho.com/display/COM/Configuring+Kettle+VFS for further information on configuring Kettle VFS.

Noteworthy JRE Variables

For a complete listing of variables that are set by the Java Runtime Environment, please see the documentation of the getProperties method of the System class at http://java.sun.com/j2se/1.5.0/docs/api/java/lang/System.html#getProperties. Table C-4 shows a few notable variables that might be of use in your transformations and jobs.

Table C-4: Noteworthy JRE Variables

java.version: The version of the Java Runtime Environment.
java.io.tmpdir: The default temporary file directory path. This is the default for all temporary files and the Spoon logging file.
os.name: The name of the operating system.
os.arch: The architecture of the operating system.
os.version: The version of the operating system.
user.name: The user's account name.
user.home: The user's home directory.
user.dir: The current working directory.
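These are plain Java system properties, so a quick sketch outside of Kettle can show where the values come from; inside a transformation or job you would simply reference them as variables:

public class JreVariables {
  public static void main(String[] args) {
    // The same values that Kettle exposes as the variables listed above.
    System.out.println(System.getProperty("java.version"));
    System.out.println(System.getProperty("java.io.tmpdir"));
    System.out.println(System.getProperty("user.dir"));
  }
}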


GIS. See Graphical Information Systems Host Name, Database Connection, 38 Globally Unique identifiers (GUID), HTML JavaScript, 395 attributes, 520 \adWVagZeaVXZ#h], 347 data extraction, 520 GNOME, launchers, 62–63 elements, 520 GNU Public License (GPL), 343 e-mail, 339 GoldenGate, Oracle, 163 JavaScript, 520 good-enough solutions, regular expressions, web pages, 520 501–502 web services, 520 Goodman, Nicholas, 19 HTTP. See Hypertext Transfer Protocol Google Wave, 626 Http Authentication, Web services lookup GPL. See GNU Public License step, 545 Graphical Information Systems (GIS), 591 HTTP client step, 516 graphical user interface (GUI), 24 extraction, 137–138 Java API, 571 Freebase, 555 grid-based services, 437 ^bedgiTmbaT^cidTYW#`ig, 526–527 Guess button, Insert / Update step, 232 SOAP, 547–548 GUI. See graphical user interface =IIE<:I, 548 GUID. See Globally Unique identifiers Freebase, 550 \j^Y, ^iZb, 559 HTTP Post step, 516, 548 #\o^e, 517 =jWHjggd\ViZ@Znh, 469 hubs H attributes, 468 ]VcYaZHigZVbHZaZXi^dc, DV, 467–468 HiZeBZiV>ciZg[VXZ, 600 Sakila, 472–473 hard disks, 386–387 surrogate keys, 469 heaps, maximum size, 71 tables, 467 =ZaadlldgaYHiZe9^Vad\#_VkV, 609–613 Hudson, 60 hierarchy Hunter, David, 518 dimension builder, 119–120 Hybrid, SCD, 119 dimensions, 270 Hybrid OLAP (HOLAP), 272 flattener, 20 Hyde, Julian, 458 ragged, 120 Hypertext Transfer Protocol (HTTP), 515–517 recursion, 120, 242–243 data formats, 517 variables, 120 XML/A, 277, 278 Hillyer, Mike, 74 history I data quality, 168 IaaS. See Infrastructure as a Service Dimension lookup / update step, 235–236 ##$^Y, XPath, 535 transformation log tables, 367–368 ##$5^Y, XPath, 535 HOLAP. See Hybrid OLAP IDE. See integrated development $]dbZ$jWjcij$gjc8VgiZ#h], 441 environment ]de, 346 identifiers, 317 hops, 7 >9:CI>IN, 217 failure, 90 ^YZci^in, 641 jobs, 31–32 ^YTZm^hi^c\, 480 loops, 27 Ignore comments?, Get data from XML step, rows, 26, 27 533 success, 90 Ignore empty file, Get data from XML step, transformations, 25, 26–27 534 unconditional, 88 il8n, 590 ]deh, 346 ^bV\Z, 601 656 Index N I

Imhoff, Claudia, 296 Mondrian, 274–275 impact analysis, 22 OLAP, 274, 278–279, 281 date lineage, 361–363 egdXZhh, 278 ETL, 125 Input Table step, 297 HiZeBZiV>ciZg[VXZ, 599 Input vault step, 479 import, repositories, 350–351 ^cejiT^Y, 101 Import partitions button, Partitioning ^ceji$djieji, 380 schema, 429 , 67 ^bedgiTmbaT^cidTYW, Get data from XML ^cejiGdlBZiV, HiZeBZiV>ciZg[VXZ, 604 step, 532 Insert, Dimension lookup / update step, 238 ^bedgiTmbaT^cidTYW#`ig, 525–527 >CH:GI, 157 Execute SQL script step, 526 >chZgi, bulk loading, 251 “in” tab, Web services lookup step, 545–546 Insert / Update step, 101–102 Include date in filename, RSS Output step, 567 accumulating snapshot fact tables, 264 Include filename in result and Filename CDC, 155, 163 fieldname, Get data from XML step, 534 Combination lookup / update step, 241 Include rownum in output, RSS Input step, fact table, 109 561–562 keys, 230–231 Include rownum in output?, Split field to SCD, 229–230 rows step, 500 Update fields, 231–232 Include stepnr in filename, RSS Output step, installation, 58–63 567 rental star schema, 81 Include time in filename, RSS Output step, 567 Sakila database, 77 Include url in ouput, RSS Input step, 561–562 Installer, Enterprise Edition, 636 incoming hops, 26 >ciZ\Zg, 27 Increment by field, Step sequence step, 212 9ViZ, conversion, 30 ^cXgZbZciA^cZh>ceji, 618 integrated development environment (IDE) ^cXgZbZciA^cZhDjieji, 618 plugins, 596–597 ^cXgZbZciA^cZhGZVY, 618 Spoon, 55–57 ^cXgZbZciA^cZhGZ_ZXiZY, 618 Integrated Scheduling, Enterprise Edition, ^cXgZbZciA^cZhH`^eeZY, 618 636 ^cXgZbZciA^cZhJeYViZY, 618 integration. See also data integration ^cXgZbZciA^cZhLg^iiZc, 618 testing, 307 indexes internal counters, Add sequence step, performance, 390–392 211–213 tables, 392 internal variables, 428–429 >cYZmD[KVajZ, 606–607 >ciZgcVa#8ajhiZg#BVhiZg, 639 InfiniDB, 123 >ciZgcVa#8ajhiZg#H^oZ, 639 ^c[d, 345 p>ciZgcVa#?dW#;^aZcVbZ#9^gZXidgnr, InfoBright, 123 96–97 Informatica, 9 >ciZgcVa#?dW#;^aZcVbZ#9^gZXidgn, 638 Infrastructure as a Service (IaaS), 437 >ciZgcVa#?dW#CVbZ, 638 ^c^i >ciZgcVa#?dW#GZedh^idgn#9^gZXidgn, HiZe>ciZg[VXZ, 614 638 User Defined Java Class step, 459 >ciZgcVa#@ZiiaZ#7j^aY#9ViZ, 638 >c_ZXi9ViV>cidIgVch[dgbVi^dc#_VkV, >ciZgcVa#@ZiiaZ#7j^aY#KZgh^dc, 637 578–579 >ciZgcVa#@ZiiaZ#KZgh^dc, 637 Injector step, 578 >ciZgcVa#HaVkZ#HZgkZg#CVbZ, 639 Inmon, Bill, 465 >ciZgcVa#HaVkZ#IgVch[dgbVi^dc# Input source step, 479 CjbWZg, 639 SQL, 483 >ciZgcVa#HiZe#8denCg, 639 Input step >ciZgcVa#HiZe#CVbZ, 639 extraction, 128 >ciZgcVa#HiZe#EVgi^i^dc#>9, 429 Index N I–J 657

>ciZgcVa#HiZe#EVgi^i^dc#>9, 638 expressions, 70–71 >ciZgcVa#HiZe#EVgi^i^dc#CjbWZg, installation, 58–59 429 user-defined expressions and classes, 520 >ciZgcVa#HiZe#EVgi^i^dc#CjbWZg, 639 Java 2 Enterprise Edition (J2EE), 459 >ciZgcVa#HiZe#Jc^fjZ#8djci, 639 Java Authentication and Authorization >ciZgcVa#HiZe#Jc^fjZ#CjbWZg, 639 Service (JAAS), 414 p>ciZgcVa#IgVch[dgbVi^dc#;^aZcVbZ# Java Content Repository (JCR), 350, 626 9^gZXidgnr, 96–97 Java Development Kit (JDK), 58 >ciZgcVa#IgVch[dgbVi^dc#;^aZcVbZ# _Vg, 587 9^gZXidgn, 530 Java Message Service (JMS), 449, 459–461 >ciZgcVa#IgVch[dgbVi^dc#;^aZcVbZ# Java Naming and Directory Interface (JNDI), 9^gZXidgn, 638 64–65, 93 >ciZgcVa#IgVch[dgbVi^dc#CVbZ, 638 Java Runtime Environment (JRE), 58 >ciZgcVa#IgVch[dgbVi^dc#GZedh^idgn# variables, 642 9^gZXidgn, 638 Java Virtual Machine (JVM), 19, 58 internationalization, 590 logging, 453 Internet Relay Chat (IRC), 633 rows, 397 inter-table dependencies, 183 variables, 43 interval logging, 453 _VkV#^d#ibeY^g, 642 intra-table dependencies, 183 JavaScript, 20, 202 intrusive CDC, 16, 155 DataCleaner, 153 IRC. See Internet Relay Chat ZkVa, 556 Is a file HTML, 520 filename is defined in a field, XSD job entries, 35–36 Validator step, 529–530 logging, 366 let me specify filename, XSD Validator Mondrian, 276–277 step, 529–530 performance, 394–396 Is defined inside XML, XSD Validator step, variables, 43 529–530 XML/A, 281 >HCJAA, 484 JavaScript Object Notation (JSON) ISO8601, 170 example, 549–558 ^hHideeZY, GZhjai, 588 Modified Java Script Value step, 522 ^iZb, RSS, 559–560 plugins, 522 Item tab, RSS Ouput step, 565–566 syntax, 521–522 transformations, 523 J web services, 520–523 J2EE. See Java 2 Enterprise Edition XML, 520 JAAS. See Java Authentication and _VkV#hZXjg^in#Vji]#ad\^c#Xdc[^\, Authorization Service `ZiiaZ#egdeZgi^Zh, 414 Jackrabbit, 350 _VkV#kZgh^dc, 642 _Vg, 587 JCR. See Java Content Repository #_Vg, 70, 517 JDBC a^WZmi$, 587 Database Connection, 93 Jaro and Jaro-Winkler algorithm, 171 drivers, 72 Jaro-Winkler algorithm, 198 MySQL, 127 Java _YWX#egdeZgi^Zh, 64–65 AMI, 440 `ZiiaZ#egdeZgi^Zh, 67 API, 570–574 JD/Edwards, 18, 139 jobs, 573–574 JDK. See Java Development Kit parameters, 579–580 _ZYdm#Xdb, 282 transformations, 572–573 _[VXZ#_Vg, 596 variables, 579–580 Jira, 632 JMS. See Java Message Service 658 Index N J–K

JNDI. See Java Naming and Directory Join condition properties, XML Join step, 542 Interface join profile, 146 ?D7, 640 Join Rows step, 214 _dW, 324 Main step to read from, 398 job(s), 7 _eVad#Xdb, 283 canvas, 56 JRE. See Java Runtime Environment command line, 322–326 JSON. See JavaScript Object Notation Database Connection, 37 JSONP, 550 debugging, 56 jtwitter, 454 DWH, 31 junk dimensions, 241 dynamic, 584–586 special dimension builder, 120 ETL, 12, 30–36 JVM. See Java Virtual Machine hops, 31–32 Java API, 573–574 K Kitchen, 57 Kalido, 6 log tables, 373–374 #`ZiiaZ, gZedh^idg^Zh#mba, 68 loops, 399–400 Kettle Logging Level, 330 metadata, 36–37, 574 @ZiiaZ9ViVWVhZGZedh^idgn, 627 Pan, 57 `ZiiaZ"YViVWVhZ"ineZh#mba, 595 parallelism, 33–34, 411 @:IIA:T:BEINTHIG>C;;:GHT;GDBT performance, 399–400 CJAA, 640 Run button, 83–84 CJAA, 28 shared objects, 69 @ZiiaZ#ZmZ, 55 slaves, 445 $`ZiiaZ$\ZiHaVkZh, 444 Spoon, 82 @:IIA:T=DB:, 64 variables, 89 @:IIA:T=DB:$#`ZiiaZ$# XML, metadata, 346–347 aVc\jV\Z8]d^XZ, 601 job entries, 31 @:IIA:T=DB:$#`ZiiaZ$h]VgZY#mba, 589 backtracking, 32–33 `ZiiaZ"_dW"Zcig^Zh#mba, 595 7ddaZVc, 366 @:IIA:T###TADO:TA>B>I, 640 JavaScript, 35–36 @:IIA:T###TADO:T>CTA>C:H, 365, 454, log table, 373–374 640 Mail, 90, 301, 337–340 @:IIA:TB6MTADB:DJIT>CTB>CJI:H, plugins, 570, 621–624 454, 640 results, 34–36 `ZiiaZ"eVgi^i^dc"eaj\^ch#mba, 595 serial execution, 90 @:IIA:TE6HHLDG9, 67 START, 88 @:IIA:TEAJ<>CT8A6HH:H, 619, 640 transformations, 88 `ZiiaZ#egdeZgi^Zh, 58, 66–67, 414 XML, 519–520 logging, 365 Job Scheduler, ETL, 124 variables, 43 ?D7:CIGN, 641 `ZiiaZ#elY, 58, 67 5?dW:cign, 622 `ZiiaZ"gZedh^idg^Zh#mba, 595 ?dW:cign9^Vad\>ciZg[VXZ, 622–624 @:IIA:TG:EDH>IDGN, 67 ?dW:cign>ciZg[VXZ, 622–624 @:IIA:TH=6G:9TD7?:8IH, 589, 640 _dWZcign"ad\"iVWaZ, 346 @:IIA:THI:ETE:G;DGB6C8:THC6EH=DIT _dW"ad\"iVWaZ, 346 A>B>I, 640 ?dWBZiV, 574, 586 `ZiiaZ"hiZeh#mba, 595 _dWHiVijh, 416 @:IIA:TJH:G, 67 Join comparison field, XML Join step, 543 @:IIA:K;H, 619 Index N K–L 659

The key field, Denormalize Special Features, libraries 104 embedding, 574 keys. See also specific key types plugins, 596 Calculator step, 214 hVe_Xd(#_Vg, 141 dimension table, CDC, 80–81 HiZeBZiV>ciZg[VXZ, 600 dimension tables, 109, 208–217 a^Whli$, 598 Insert / Update step, 230–231 lightweight principle, 446–447 JSON, 522 Limit, Get data from XML step, 534 SCD, 217 Lindstedt, Dan, 9 source systems, 209 lineage. See data lineage Keys tab page, Dimension lookup / update Lineage and Dependency Analyzer, 125 step, 234 a^c` key/value pairs, 508–513 X]VccZa, 559 object members, 522 ^iZb, 559 Regex Evaluation step, 510–511 links text files, 509–510 DV, 468–469 Kimball, Ralph, 11, 113, 167, 191, 221, 228, 295, Sakila, 473–474 465 a^c`"id"a^c`, 472, 474 Kitchen, 41, 44, 54, 322–326 Linstedt, Dan, 466 jobs, 57 a^hiY^g, 324 aZkZa, 336 a^hi_dWh, 324 ad\[^aZ, 334 a^higZe, 324 logging, 364 a^hiigVch, 325 transformations, 57 #ac`, 62 @^iX]Zc#WVi, 57 AdVYVaaYViZ[gdbiVWaZ, 253 `^iX]Zc#h], 57 AdVY9IH, 468, 469, 470 #`ig, 345 AdVY:cY9IH, 470 adVYTYViV, 298 L loading. See also bulk loading LAF. See Look and Feel lazy, 605 A6;eVX`V\Z, 591 adVYMBA, 599, 603 Last version (without stream field as source), location outriggers, 97 Dimension lookup / update step, 239 Locking, Enterprise Edition, 636 aVhibdY^[^ZYi^bZ, 183 logging, 333–336, 363–374 aVhiTjeYViZ, 80 architecture, 364–367 late-arriving data, 255–260 avoiding, 398 dimensions, 256–260 buffers, 364, 365–366 ETL, 122 CDC, 162–163 facts, 256 channels, 366 launchers, GNOME, 62–63 debugging, 364 Lazy Conversion, 385, 387 error messages, 364 lazy loading, 605 ETL, 22 Lesser GNU Public License (LGPL), 569–570 interval, 453 forks, 591 JavaScript job entries, 366 aZkZa, 324 JVM, 453 Kitchen, 336 `ZiiaZ#egdeZgi^Zh, 365 Pan, 336 Kitchen, 364 Levenshtein algorithm, 171 levels, 335–336 LGPL. See Lesser GNU Public License memory, 365 a^WZmi, 70 Pan, 364 a^WZmi$, 591, 598 parameters, 364 #_Vg, 587 rows, 363 660 Index N L–M

Spoon, 57, 333–334, 364, 365 Management Console, Enterprise Edition, transformations, 453–454 636 variables, 367 Manufacturing Requirements Planning log data change processing, 451 (MRP), 138 log tables, 367–374 mapping. See data mapping channels, 372 Mark Attribute rows with id of header row, job, 373–374 Modified Java Script Value step, 511 job entries, 373–374 master, 417 performance, 371 AMI, 442–443 step log tables, 370–371 Carte, 441 transformation, 367–370 dynamic clustering, 434 Log4J, Apache, 366 transformations, 421–422 ad\[^aZ, 324, 334 1bVhiZg3, 435 ad\^cbdYjaZcVbZ, 414 Mastering Data Warehouse (Imhoff), 296 Look and Feel (LAF), 590 Max % errors allowed, Data Validator step, lookup cascade, 100 188 Lookup Language, 103 Max nr errors allowed, Data Validator step, lookup mode, 232 188 Lookup Original Language, 103 Max number of articles, RSS Input step, Lookup schema field, Database lookup step, 561–562 222–223 B6M^Y;GDBiZhiThZfjZcXZ, 213–214 lookup tables, 172–175 maximum heap size, 71 lookup values, Validator step, 180 Maximum nr of lines in logging windows, Loop Xpath, Get data from XML step, Spoon, 365 532–533, 535 bVmTad\Ta^cZh, 412 loops bVmTad\Ti^bZdjiTb^cjiZh, 412 Freebase, 557 Maydanchik, Arkady, 168–169 hops, 27 MD5, 484 jobs, 399–400 MDA. See Model Driven Architecture ah, 248 MDX. See Multi Dimensional eXpressions LucidDB, 10, 123 measures bulk loading, 249 Palo, 289 EII, 10 performance, 380–382 SQL, 10 SOF, 265 wrappers, 10 Mechanical Turk, 437 memory M logging, 365 Mail, 301, 336–340 lookups, 253 Addresses tab, 337 performance, 393 Attached Files tab, 339–340 Sort rows step, 453 Email Message tab, 339 Stream lookup step, 453 job entries, 90, 301, 337–340 streams, 577 Server tab, 337–338 transformations, 452, 453 Mail Failure step, 90, 185–186, 336–337 Merge join step, 479 Mail Success step, 90, 336–337 CRC, 485 Main step to read from, Join Rows step, 398 Merge Rows step, 160–161 maintainability, ETL, 300–301 B:G<:$JE96I:, 249 Make transformation database transactional, message bundles, 601 Database Connection, 40 metadata bVcXgdciVW, 327 data extraction, 359 Manage thread priorities?, Transformation data profiling, 17 Settings, 397 data validation, 182 database, 588–590 Index N M–N 661

description, 37 Split-field step, 276 directory, 36 Strings cut step, 276 ERP, 14, 139 MonetDB, 123 ETL, 21, 344–350 monitoring, ETL, 333–340 graphical user interface, 24 Monitoring tab, Transformations settings, XML, 24 381 export, HiZeBZiV>ciZg[VXZ, 600 MQL. See Metaweb Query Language extended description, 37 MRP. See Manufacturing Requirements filename, 36 Planning jobs, 36–37, 574 MSAS. See Microsoft SQL Server 2008 names, 36 Analysis Services replacing, 588–590 Multi Dimensional eXpressions (MDX), repositories, 348–350 269–270 rows, 557, 606–607 Query, Mondrian Input step, 274 steps, 28 Multi-dimensional OLAP (MOLAP), 123, spreadsheets, 297 269, 272 HiZeBZiV>ciZg[VXZ, 599 multiline mode, 507 transformations, 36–37, 421–425, 572–573 multi-paths, backtracking, 32–33 User Defined Java Class step, 620 multiple updates, CDC, 155 values, 605–606 multi-threading, 403–411 XML, 345–347 Blocking Step, 410 jobs, 346–347 data pipelining, 407–408 transformations, 345–346 database connections, 408–409 Metadata Repository Manager, 125 Execute SQL step, 409–410 Metaphone algorithm, 171 order of execution, 409–410 Metaweb Query Language (MQL), 551–553 row distribution, 404–407 methods row merging, 405–406 partitioning, 425 multi-valued attributes, 498–500 plugins, partitioning, 624–626 multi-valued dimension bridge table builder, micro-batches, 450 ETL, 121–122 Microsoft SQL Server 2008 Analysis Services "Bmm, 60 (MSAS), 271, 277–280 MySQL, 73–74 Milestone, 630 Bulk Loader, 248, 249 Min nr of rows to read before doing % CDC, 162–163 evaluation, Data Validator step, 188 JDBC, 127 mini-dimensions, 239–240 CDL, 484 special dimension builder, 120 RDBMS, 77, 134 b^gX#Xdb, 633 H:I, 103, 498 Model, Spoon, 302–303 HJE:G, 82 Model Driven Architecture (MDA), 6 bnhfaW^cad\, 163 Modified Java Script Value step, 314 bnhfaTcVi^kZ#mja, 628 9ncVb^X?dW, 586 Freebase, 556–557 N JSON, 522, 549 cVbZ, 346 Mark Attribute rows with id of header row, names 511 ETL, 24, 298–299 MOLAP. See Multi-dimensional OLAP job entry results, 34 Mondrian, 242 metadata, 36 Aggregation Designer, 123, 267 parameters, 44 Input step, 274–275 pipes, 248 JavaScript, 276–277 Namespace aware?, Get data from XML step, OLAP, 271, 273–277 533 Split Time step, 276 662 Index N N–O

natural keys OLTP. See OnLine Transaction Processing business keys, 210 Omit null values from XML result, Add dimension tables, 99 XML step, 539 junk dimensions, 241 Omit XML, Add XML step, 539 near real-time data integration, 450 One Attribute Set interface (OASI), 303 Needleman-Wunsch algorithm, 171 online analytical processing (OLAP), 123 network latency, 369–370 aggregate tables, 266 network speed, 390 cubes, 270–271 New validation button, 179 Input step, 274, 278–279, 281 NIO buffer size, 386 egdXZhh, 278 non-intrusive CDC, 16, 155 Mondrian, 271, 273–277 non-relational data formats, 498 multidimensional, 269 non-relational tabular formats, 498–501 Palo, 282–291 non-tabular data formats, 498 positioning, 272–273 cdgZe, 323 storage types, 272 normalization, 218 XML/A, 277–282 Normalize Special Features, 104 OnLine Transaction Processing (OLTP), 2, cdiZeVYh, 346 269–291 notes, 318 database, 75 transformations, 25 dimension tables, 226 CDL, 484 Open Office Calc, 297 Nr of errors fieldname, Data Validator step, OLAP, 273 188 Open Symphony, 327 CJAA OpenERP, 15 Add XML step, 539, 541 operating systems, scheduling, 322 CRC, 484 operational data store (ODS), 4, 10 data profiling, 146 Oracle data validation, 17, 180–181 Attunity Stream, 163 Database lookup step, 225 connect by prior, 20 DV, 471 E-Business Suite, 18 @:IIA:T:BEINTHIG>C;;:GHT;GDBT GoldenGate, 163 CJAA, 28 RDBMS, 134 source data, 179 Spatial, 591 Hig^c\, 28 SQL*Loader, 247, 249 CjbWZg, 27 Warehouse Builder, 6 Palo, 285 ETLT, 9 Hig^c\, 29–30 Oracle Call Interface (OCI), 247 Number analysis, DataCleaner, 149 dgYZg, 346 numbers, JSON, 522 DG9:G7N, 225, 389, 479 dg\#eZciV]d#Y^#XdgZ#YViVWVhZ O #9ViVWVhZ>ciZg[VXZ, 627 OASI. See One Attribute Set interface dg\#eZciV]d#Y^#igVch#hiZe D7;, 414 #HiZe>ciZg[VXZ, 614 obfuscated passwords, 67–68, 414 original transformations, 421 object literals, JSON, 522 dh#VgX], 642 object members, JSON, 522 dh#cVbZ, 642 dW_ZXiTi^bZdjiTb^cjiZh, 412 dh#kZgh^dc, 642 OCI. See Oracle Call Interface DJEJIT9>G, 319 ODBC, 93 Out of Memory, 71 ODS. See operational data store outgoing hops, 26 OEM version, PDI, 590–591 output directory, 319 OLAP. See online analytical processing Output Fields, XSD Validator step, 529 Index N O–P 663

Output one row, concatenate errors with methods, 425 separator, Data Validator step, 187 plugins, 624–626 Output String Field, XSD Validator step, 529 plugins, 426 Output Value, Add XML step, 539 methods, 624–626 Overwrite, SCD, 119 round robin, 425 schema, 425–427 P tables, performance, 392 PAD. See Pentaho Aggregation Designer Partitioning schema, Import partitions pagila, 74 button, 429 Pair letters similarity algorithm, 171 eVhh, 323 Palo, 123, 273, 274, 282–291 eVhhldgY, _YWX#egdeZgi^Zh, 65 Palo Cell Output step, 289–291 passwords, 58 Palo Cells Input step, 285–289 obfuscated, 67–68, 414 Palo Dimension Input step, 285–289 UI, 609 Palo Dimension Output step, 289–291 Pattern finder, DataCleaner, 149 Pan, 41, 44, 54, 322–326 patterns, 501 jobs, 57 eVjhZIgVch, 415 aZkZa, 336 PDI. See Pentaho Data Integration ad\[^aZ, 334 Pearson, William, 270 logging, 364 peer/expert reviews, 297 transformations, 57 ##pentaho, 633–634 EVc#WVi, 57 Pentaho Aggregation Designer (PAD), 123 eVc#h], 57 Pentaho BI, Quartz scheduler, 322, 327–333 parallelism, 18–19 Pentaho Data Integration (PDI), 60, 328–330 data extraction, 538 AMI, 440 jobs, 33–34, 411 DataCleaner, 148 performance, 385–386 enterprise repository, 350 sorting, 393–394 Java API, 571 text files, 385–386, 387 OEM version, 590–591 transformations, 27, 404 slave servers, 413 Parallelizing/Pipelining System, ETL, 125 Pentaho Report Designer (PRD), 574–575 parameters, 318 Pentaho repository, 41 command-line, 323–324, 325–326 Pentaho Solutions (Bouman and van Dongen), delays, 457 228, 327 Java API, 579–580 EZciV]dHnhiZbKZgh^dc8]ZX`, 330–331 logging, 364 Peoplesoft, 15 named, 44 eZg["ad\"iVWaZ, 345 queries, 135–136 performance SQL, 99 buffers, 380 transformation log tables, 369 constraints, 392 transformations, 579–580 CPU, 394–398 Validator step, 179 data sorting, 392–394 Version Migration System, 353–355 database, 388–392 Parameters tab, 44 Freebase, 552 parent key, foreign keys, 242 hard disks, 386–387 EVgi^i^dcZg, 625–626 indexes, 390–392 partitioning, 18–19, 425–430 JavaScript, 394–396 accumulating snapshot fact tables, 264–265 jobs, 399–400 checksums, 426 log table, 371 clustered transformations, 430 measures, 380–382 CSV File Input step, 428 memory, 393 database, 40, 429–430 parallelism, 385–386 relational databases, 390 664 Index N P–R

rows, 382–383 Eg^bVgn`Zn, 468, 469, 470 SQL, 388 primary keys table partitioning, 392 satellites, 470 text files, 384–387 source system, 210 transformations, 377–398 surrogate primary keys triggers, 392 dimension tables, 80 tuning, 377–401 tables, 77 periodic snapshot fact tables, 260–261 JE96I:, 470 fact table loader, 121 Prism, 6 loading, 263–264 privacy, 308 perspective, Spoon, 302 private schedule, 331 pipes, named, 248 Problem Escalation System, 125 PIT. See Point-In-Time process, 295 pivot fields, Palo, 288 error handling, 184–187 platform independence, ETL, 18 egdXZhh, OLAP Input step, 278 Plug-in Registry, 595 egdXZhhGdl, 614, 615 @Step, 601 egdXZhhGdlh, 456, 460 plugins, 20 profiling architecture, 593–599 column profiling, 17, 146 database, 627–628 data profiling, 16–17, 127–128, 146–154 ERP, 140 metadata, 17 IDE, 596–597 dependency profiling, 146 JavaScript, 395 join profile, 146 job entries, 570, 621–624 properties 5?dW:cign, 622 built-in, 637–642 JSON, 522 JSON, 522 LGPL, 570 Proxy Host, Web services lookup step, 545 libraries, 596 Proxy Port, Web services lookup step, 545 methods, partitioning, 624–626 Prune Path to handle large files, Get data partitioning, 426 from XML step, 534 repositories, 626–627 ejW9ViZ, ^iZb, 559 steps, 570, 619 public schedules, 331 transformation step, 599–619 Punch through, Dimension lookup / update types of, 594–595 step, 238 eaj\^ch, 440 eji:ggdg, 617 eaj\^ch$, 595 ejiGdl, 362, 557, 579, 616, 618 eaj\^ch$hiZeh, 619 ejiGdlId, 616 Point-In-Time (PIT), 472 elY$, 434 Port Number, Database Connection, 38 EDHI Q Freebase, 550 Quartz scheduler, Pentaho BI, 322, 327–333 SOAP, 548 queries PostGIS, 591 aggregate tables, 266 PostgresSQL, 74 parameters, 135–136 bulk loading, 250 SQL, H:A:8I, 553 Power*Architect, 79 Query, MDX, Mondrian Input step, 274 Powerplay, Cognos, 269 Quote all in database, Database Connection, 38 PRD. See Pentaho Report Designer egY, 354 R preparation of statements, 388 ragged hierarchy, 120 egZeVgZ:mZXji^dc, 415 RC. See Release Candidate Preview option, 312–315 "G8mm, 60 EgZk^ZlGdlh9^Vad\, 609 Index N R 665

RDBMS. See Relational Database performance, 390 Management System transformations, 497 RDS. See Relational Database Service Relational OLAP (ROLAP), 242, 272, 274 Read articles from, RSS Input step, 561–562 Release Candidate (RC), 60, 630 read service, Freebase, 550–551 remote execution, slave servers, 413 Read source as Url, Get data from XML Remote Function Calls (RFCs), 140, 146 step, 532 Remote Steps, 422 gZVYGZe, HiZeBZiV>ciZg[VXZ, 599, 603 Rename fields step, XML/A, 280 Really Simple Syndication (RSS), 18, 558–567 rental star schema X]VccZa, 558–559 dimension tables, 79–80 ^iZb, 559–560 fact table, 79 transformations, 563 installation, 81 web services, 517 Sakila, 78–81 real-time business intelligence, 450 gZe, 323 real-time data integration, 449–461 repeating groups, 500–501 CDC, 450–451 Replace in string step, 170, 203 source system, 451 Report all errors, not only the first, Data transformation streaming, 452–461 Validator step, 187 real-time extraction, 138 1gZedgiTidTbVhiZgh3, 436 CDC, 155, 163 repositories, 41–42 real-time transformation streaming, database, 348–349 debugging, 457–478 export, 350–351 GZXdgYhdjgXZ, 468, 469 files, 349 satellites, 470 import, 350–351 Recovery and Restart System, ETL, 124 managing, 350–352 Recurrence, 332 metadata, 348–350 recursion, hierarchies, 120, 242–243 plugins, 626–627 reference tables upgrade, 351–352 data cleansing, 172–179 Version Migration System, 352–353 data conformation, 175–179 XML, 344 Referencing, 42 GZedh^idg^ZhBZiV#gZVY9ViV, 573 referential integrity, 42 gZedh^idg^Zh#mba, 68, 573 data quality, 168 GZedh^idgn, 626–627 foreign keys, 251–252 GZedh^idgn#adVYIgVch[dgbVi^dc, 572 RefinedSoundEx algorithm, 171 GZedh^idgnBZiV, 573 Regex Evaluation step, 204, 504–508 gZhZiHiZe>DBZiV, HiZeBZiV>ciZg[VXZ, key/value pairs, 510–511 600 Regex matcher, DataCleaner, 149 resource exporter, 444 gZ\^hiZgHaVkZ, 416 response time, DWH, 4 regression tests, 307 GZhjai, 576–577, 587–588 regular expressions, 503–508 Result Fieldname, XSD Validator step, 529 capture groups, 200, 205 Result stream properties, XML Join step, 543 data cleansing, 203–205 results tab, Web services lookup step, 546 DataCleaner, 151–152 Return/remove digits, data cleansing, 170 good-enough solutions, 501–502 reuse Validator step, 180 ETL, 19, 300–301 Relational Database Management System shared objects, 589 (RDBMS), 134 Revision management, 42 ETL, 497 G;8TG:69TI67A:, 143–144 MySQL, 77 RFCs. See Remote Function Calls Relational Database Service (RDS), 438 ROLAP. See Relational OLAP relational databases, 39, 127 gddi, 82 CDC, 450 666 Index N R–S

Root XML element, Add XML step, 539, S 540–541 SaaS. See Software as a Service Ross, Margy, 221, 228 Sakila round robin, 386 business keys, 527 partitioning, 425 CDC, 108 sorting, 394 data mapping, 524–525 roundtrips, 388 database, 73–110 row(s) installation, 77 Add sequence step, 405 subject areas, 75–76 attributes, 497 Database Connection, 90–95 CSV File Input step, 394 DV, 472–486 debugging, 314 ETL, 73–110, 81–84 dimension tables, 90 foreign keys, 105 fields, 27 hubs, 472–473 hops, 26, 27 links, 473–474 JavaScript, 395 rental star schema, 78–81 job entry results, 34 satellites, 474 JVM, 397 snowflakes, 219–221 logging, 363 Spoon, 81–84 metadata, 557, 606–607 surrogate keys, 527 steps, 28 XML, 523–544 multi-threading, 404–407 SalesForce.com input step, 140 performance, 382–383 SalesForce.com output steps, 140 Sort rows step, 419 SAP static data, 397–398 data, 140–145 Table input step, 424 Function Browser, 141 Text File Output step, 405 SAP Input step, 140 UI, 609 Data Grid step, 142 User Defined Java Class step, 404 Generate Rows step, 142 Row denormaliser step, 511–512 hVe_Xd(#_Vg, 141 Palo, 288 SAP Java Connector library (hVe_Xd(#_Vg), Row normaliser step, 500–501 SAP Input step, 141 Gdl9ViVJi^a, 616 hVe_Xd(#_Vg. See SAP Java Connector GdlA^hiZcZg, 576 library GdlBZiV>ciZg[VXZ, 604, 606–607 SAP/R3, 14, 18, 141 Rownum fieldname, Split field to rows step, Sarbanes-Oxley Act, 308 500 satellites Rownum in output and Rownum fieldname, DV, 469–471 Get data from XML step, 535 primary keys, 470 GdlEgdYjXZg, 577, 579 Sakila, 474 GdlHZi, 579, 617 L=:G:, 484 RSS. See Really Simple Syndication hVkZGZe ghh, 558–559 ?dW:cign>ciZg[VXZ, 622 RSS Input step, 561–562 HiZeBZiV>ciZg[VXZ, 599, 603 RSS Output step, 562–567 scalability GTHI:E, 348 ETL, 18–19 GTIG6CH;DGB6I>DC, 348 Freebase, 552 Run button, 83–84 SCD. See Slowly Changing Dimension Gjcegd[^a^c\, 152 Schedule Creator, 331–332 gjcc^c\, 439 Scheduling, Spoon, 302 gjci^bZ#_Vg, 596 scheduling action sequence, 333 Index N S 667

ETL, 321–333 jobs, 69 operating systems, 322 Spoon, 69 schema. See also XML Schema transformations, 69 clustering, 417–418 h]VgZY#mba, 68–69 Database Connection, 39 shortcuts, Spoon, 62 DataCleaner, 148 shrunken or rolled dimensions, special dynamic clustering, 434 dimension builder, 120 partitioning, 425–427 Simple Object Access Protocol (SOAP) Schema name field, Add sequence step, 217 accessing services directly, 546–549 SCM. See software configuration examples, 544–549 management extraction, 138 screens, 191 OLAP, 274 Script Values step, 394–395 WDSL, 517 scripts, 20. See also JavaScript web services, 517 ETL, 5, 200–205 Web services lookup step, 544–546 startup, 70 XML/A, 277 Scrum, 13, 301 slave(s) hZVgX]>c[d6cYIVg\ZiHiZeh, AMI, 443–444 HiZeBZiV>ciZg[VXZ, 600 jobs, 445 Secure Sockets Layer (SSL), 337 transformations, 421–422 Security, Enterprise Edition, 636 Slave Browser tab, Spoon, 457 Security repository, 42 slave servers Security System, 125 Carte, 411–416, 435 hZY, 347 configuration, 411–412 H:A:8I, 553 PDI, 413 Select values step, 94, 100, 397 remote execution, 413 semi-additive, 260 services, 414–416 SOF, 265 Sort rows step, 419 semi-structured data, 501–508 Spoon, 413 Separate history table, SCD, 119 Table input step, 424 hZfjZcXZTkVajZ, 213 XML, 413 serial execution, job entries, 90 slices, 271 Serialize to file step, CDC, 164 Slowly Changing Dimension (SCD), 20, Server tab, Mail, 337–338 228–239 services. See also web services Bus Architecture, 118–119 grid-based, 437 Dimension lookup / update step, 232–237 slave servers, 414–416 dimension tables, 118 H:I, 499 Dimensional Data Warehouse, 118–119 MySQL, 103, 498 ETL, 118–119 Set Environment Variables step, 354–355 hybrid, 238–239 Set Variables step, 216 Insert / Update step, 229–230 hZi9Z[Vjai, 599, 604 keys, 217 SETI@Home, 433 type 1, 229–232 hZiDjeji9dcZ, 615 type 2, 232–237 sets, CSV, 498 type 3, 237–238 Settings tab page, Regex Evaluation step, Small and Medium Business (SMB), 139 504–506 small periodic batches, 450 #h], 58 smart keys, 80, 108 SHA-1, 484 SMB. See Small and Medium Business shadow copies, 31 SMTP, 337 sharding, database, 40 snapshots shared objects, 68–69 CDC, 146, 158–162 database, 589 fact tables, 121, 260–261, 263–264 668 Index N S

Sniff test during execution, Spoon, 457–478 Split-field step, Mondrian, 276 sniffing, 314–315 Spoon, 41, 54 hc^[[HiZe, 416 Add sequence step, 211–217 snippets, User Defined Java Class step, 620 agile development, 301–302 snowflakes canvas, 318 dimension tables, 97, 218–225 Combination lookup / update step, 241 Sakila, 219–221 Copy tables wizard, 584 SOAP. See Simple Object Access Protocol dynamic transformations, 580–583 hdVeJ>#dg\, 547 ETL, 81–84 SOF. See state-oriented fact tables Execute a transformation, 413 Software as a Service (SaaS), 437 extraction, 128 software configuration management (SCM), IDE, 55–57 626 jobs, 82 sorting logging, 57, 333–334, 364, 365 clustering, 394 perspective, 302 data, performance, 392–394 Sakila, 81–84 database, 393 shared objects, 69 parallelism, 393–394 shortcuts, 62 round robin, 394 Slave Browser tab, 457 Sort rows step, 479 slave servers, 413 memory, 453 Sniff test during execution, 457–478 rows, 419 transformations, 57, 82 slave servers, 419 variables, 44 Hdgih^oZgdlh^cbZbdgn, 393 Heddc#WVi, 55, 62 Hdgih^oZgdlh^cbZbdgn, 393 #heddcgX, 64 Sort System, 124–125 heddc#h], 55 Sorted Merge step, 419 spreadsheets Soundex algorithm, 171 data acquisition, 15 source code metadata, 297 Java API, 570 testing, 311 plugins, 594 SQL source data attributes, 484 CDC, 155–157 Business Objects, 9 data cleansing, 173 dynamic jobs, 584 CJAA, 179 ELT, 9 PRD, 574 Informatica, 9 RSS Ouput step, 564 Input source step, 483 tabular format, 497 LucidDB, 10 source system DG9:G7N, 225, 479 Database lookup step, 222 parameters, 99 keys, 209 performance, 388 primary keys, 210 query, H:A:8I, 553 real-time data integration, 451 HiZeBZiV>ciZg[VXZ, 599 Source XML field, XML Join step, 542 streams, 99 hdjgXZ[dg\Z#cZi, 59–60, 570 L=:G:, 553 hdjgXZThnhiZb, 178 SQL Server Spatial, Oracle, 591 RDBMS, 134 special dimension builder XML/A, 278 dimensions, 120 SQL statements to execute after connecting, ETL, 120 Database Connection, 39 heZX^VaT[ZVijgZh, 103–104 HFA:Y^idg, 609 Split field to rows step, 104, 499–500 SQL*Loader, Oracle, 247, 249 Split Time step, Mondrian, 276 SQLPower, 118, 154 Index N S 669

SQLStream, 458 shared objects, 589 hgX$, 597 transformations, 26 SSL. See Secure Sockets Layer VPLs, 47–49 "hiVWaZ, 60 hide?dW, 416 staging area, 8 hideIgVch, 415 ODS, 10 stream(s), 83 standard input (HI9>C, 247–248, 250 Add XML step, 538 Standard measures, DataCleaner, 149 data, 577 standardization, 297 data integration, 450 star schema, 78–81. See also rental star editor, 347 schema extraction, 138 CDC, 227–228 memory, 577 denormalization, 226 SQL, 99 dimension tables, 226–228 HiZeBZiV>ciZg[VXZ, 600 tables, 495 Table output step, 538 START, job entries, 88 transformations, 452–461, 577 Start at value field, Step sequence step, 212, Web services lookup step, 517 216 XML Join step, 541 HI6GI96I:, 369 Stream Datefield, Dimension lookup / HI6GI96I:":C996I:, 369–370 update step, 235–236 hiVgi:mZX, 415 Stream lookup step, 173, 178, 253–255, 383 hiVgi?dW, 416 ^bedgiTmbaT^cidTYW#`ig, 527 hiVgiIgVch, 415 memory, 453 startup scripts, 70 Hig^Xi=dhi@Zn8]ZX`^c\, 641 state-dependent objects, data quality, 168 Hig^c\, 27 state-oriented fact tables (SOF), 261–263 7ddaZVc, 30 loading, 265–266 9ViZ, 29 static data, rows, 397–398 CJAA, 28 static dimensions CjbWZg, 29–30 special dimension builder, 120 Palo, 285 tables, 84–87 string(s), 384 static testing, 307 JSON, 522 static values, JavaScript, 396 UI, 609 hiVijh, 415 String analysis, DataCleaner, 149 HI9>C. See standard input Hig^c\\Zi9^Vad\8aVhhCVbZ, HI:E, 640 HiZeBZiV>ciZg[VXZ, 600 hiZe, 346 string literals, JSON, 522 @Step, Plug-in Registry, 601 Strings cut step, Mondrian, 276 ThiZeT, 557 structural testing, 21 HiZecVbZ, transformation log tables, 369 Stylus Studio, 523 Step name field, Step sequence step, 212 subscription, 635 HiZe9ViV>ciZg[VXZ\ZiHiZe9ViV, 600 subsystems, ETL, 113–126 HiZe9^Vad\>ciZg[VXZ, 607–613 subtansformation interface, 101 hiZeTZggdgT]VcYa^c\, 346 Subversion, Apache, 343, 570 HiZe>ciZg[VXZ, 614–619 success hops, 90 HiZe>ciZg[VXZ\ZiHiZe, 600 SugarCRM, 15 hiZe"ad\"iVWaZ, 346 HJE:G, 82 hiZeBZiV>ciZg[VXZ, 599–607 hjeedgih:ggdg=VcYa^c\, 600 HiZeBZiV>ciZg[VXZX]ZX`, 599 surrogate key(s), 118 steps, 7. See also specific steps Add sequence step, 211–217 outgoing hops, 26 business keys, 210 plugins, 570, 619 creation system, 119 row metadata, 28 database sequence, 217 670 Index N S–T

Dimension lookup / update step, 234–235 CDC, 164 dimension tables, 209, 251–260 Xdbb^ih^oZ, 390 DWH, 210 data lineage, 358 generating, 210–217 dynamic templates, 584 hubs, 469 ZmedgiTmbaT[gdbTYW, 538 ^bedgiTmbaT^cidTYW#`ig, 527 ^bedgiTmbaT^cidTYW#`ig, 527 pipeline, 121, 252–255 streams, 538 Sakila, 527 Use batch updates for inserts, 389 SOF, 266 IVWaZ>ceji, 595 XML, 527 iVWaZTeVgVbh, 355 surrogate primary keys IVWaZK^Zl, 609 dimension tables, 80 tabular format tables, 77 non-relational, 498–501 JE96I:, 470 source data, 497 Switch/Case step, 189–190 iV\h$, 342, 352 SWT, Eclipse, 607 Talend, 6 hli#_Vg, 596 Data Profiler, 154 synchronization, data, 9 #iVg, 517 Synchronize after merge step, 160–161 Target fields hnhYViZ, 354 Denormalize Special Features, 105 Insert / Update step, 230 T Target XML field, XML Join step, 542 tab-delimited files, 128 #iVg#\o, 60 table(s). See also specific table types Task Scheduler, 327 DataCleaner, 148 TCP/IP DV, 485–486 Carte, 57, 417 foreign keys, 77, 208 clustering, 423 hubs, 467 templates, dynamic, 583–584 indexes, 392 testing a^c`"id"a^c`, 472, 474 automation, 311 logging, 367–374 CI, 311 channels log tables, 372 Data Grid step, 311 job entries log table, 373–374 dynamic, 307 job log table, 373–374 ETL, 21, 306–312 performance log tables, 371 integration, 307 step log tables, 370–371 spreadsheets, 311 transformation log tables, 367–370 static, 307 partitioning, performance, 392 transformations, 311 star schema, 495 upgrade, 312 static dimensions, 84–87 iZhiThZfjZcXZ#`ig, 212–213, 215 surrogate primary keys, 77 text file(s) Table daterange end, Dimension lookup / extraction, 128–132 update step, 236 Web, 137 Table input step, 103, 596 fields, 384 aggregate tables, 266 key/value pairs, 509–510 CDC, 160 parallelism, 385–386, 387 Data Grid step, 132 performance, 384–387 rows, 424 reading, 384–387 slave servers, 424 writing, 387 Stream lookup step, 254 Text file input step, 203, 384 Table output step, 216, 397 Text file output step bulk loading, 250 CDC, 164 rows, 405 Index N T 671

IZmiKVg, 609 hops, 25, 26–27 third normal form (3NF), 218 Java DV, 469 API, 572–573 threads, 397. See also multi-threading expressions, 70–71 GdlEgdYjXZg, 579 job entries, 88 3NF. See third normal form JSON, 523 time analysis, DataCleaner, 149 Kitchen, 57 time dimensions, 239 logging, 453–454 time-outs, databases, 453 master, 421–422 I>B:HI6BE, 80 memory, 452, 453 timestamps metadata, 36–37, 421–425, 572–573 CDC, 155–157, 163, 450 notes, 25 DV, 480 Pan, 57 i^iaZ, 559 parallelism, 27, 404 TLS. See Transport Layer Security parameters, 579–580 $ibe$XVgiZ#ad\, 441 performance, 377–398 tokens, Get data from XML step, 536 phases, 452 tools, 41 relational databases, 497 ETL, 6 RSS, 563 requirements, 17–22 Run button, 83–84 top-down level-wise loading, 219 shared objects, 69 Tortoise SVN, 570 slave, 421–422 TPC-H, 253 Spoon, 57, 82 traceability steps, 26 of data, 467 streams, 452–461, 577 DV, 471 testing, 311 IG6CH, 640 variables, 89, 579–580 IgVch, 577 VPLs, 46 igVch, 325 XML, metadata, 345–346 transaction grain fact tables, 121 Transformation File, 330 1igVch[dgbVi^dc3, 345 Transformation Inputs, 330 transformation(s), 7 transformation log tables, 367–370 action sequence, 328–330 Get System Info step, 367 architecture, 452 history, 367–368 bottlenecks, 379–382 parameters, 369 buffers, 406–407 Transformation Settings, Manage thread Calculate Dimension Attributes, 85–86 priorities?, 397 canvas, 56 Transformation Step, 330 clustering, 417–425 transformation step plugins, 599–619 partitioning, 430 Transformations Settings, 368 command line, 322–326 Monitoring tab, 381 data, 576–580 transitive closure table, 242 data conversion, 29–30 igVch"ad\"iVWaZ, 345 Database Connection, 37, 90–95 IgVchBZiV, 572, 577 debugging, 56 IgVchBZiV#\ZiHFAHiViZbZcih, 566 deduplication, 195–199 transparency, ETL, 24 dynamic IG6CHTE:G;DGB6C8:, 640 CSV, 580–583 Transport Layer Security (TLS), 337 Spoon, 580–583 igVchHiVijh, 415 error handling, 186–187 triggers ETL, 12, 25–30 CDC, 163, 450 challenges, 20 database, 157–158 Get data from XML step, 532 performance, 392 672 Index N T–V

IgjcXViZ, bulk loading, 251 URLs. See Uniform Resource Locators igjc`$, 342 Use batch updates for inserts, Table output Trunk version, 630 step, 389 trunks, 283 Use Kettle Repository, 330 ihi, 354 Use tokens, Get data from XML step, 533, 535 Tungsten Replicator, 163 jhZg, 323 Twitter, 454–457 _YWX#egdeZgi^Zh, 65 ineZ, 65 User Acceptance test (UA), 307 INE:T7>C6GN, 606 User Defined Java Class step, 620–624 INE:T7DDA:6C, 606 Change number of copies to start, 404 INE:T96I:, 606 9nVc^X?dW, 586 INE:T>CI:<:G, 606 \Zi, 620 INE:TCJB7:G, 606 ^c^i, 459 INE:THIG>C<, 606 JavaScript, 395 metadata, 620 U rows, 404 UA. See User Acceptance test snippets, 620 Ubuntu, 439 variables, 43 AMI, 442 User Defined Java Expressions step, 70–71, UI. See user interface 202–205 j^$aV[#egdeZgi^Zh, 591 data cleansing, 202–203 jcVbZ, 354 user interface, 24. See also graphical user unbalanced hierarchy, 120 interface unconditional hops, 88 elements, 609 unconditional job hop, 31 HiZeBZiV>ciZg[VXZ, 600 UniCode, 15, 507 user maintained dimensions, 120 Uniform Resource Locators (URLs), 516 User Name and Password, Database Web services lookup step, 545 Connection, 38 Unique rows step, 193–194 user-defined expressions and classes, Java, unit tests, 307 520 UNIX, 12, 507 jhZg#Y^g, 642 X]bdY, 322 jhZg#]dbZ, 642 Xgdc, 326–327 jhZg#cVbZ, 642 XgdciVW, 326–327 UTF-8, 129 Kitchen, 57 Add XML step, 539 Pan, 57 RSS Ouput step, 565 running programs, 62 Jc`cdlc, 17 V unstructured data, 501–508 Vaillencourt, Luc, 591 Unzip, AMI, 440 kVa^Y, 160 JE96I:, 157, 230 Validate msg field, XSD Validator step, 529 surrogate primary keys, 470 Validate XML?, Get data from XML step, 533 Update fields, Insert / Update step, 231–232 validation. See data validation update mode, 232 Validator step, 179–180 Update step kVa^YT[gdb, 265 CRC, 485 kVa^YTid, 265 Dimension lookup / update step, 238 value(s) upgrade JSON, 522 repositories, 351–352 metadata, 605–606 testing, 312 static, 396 jga, _YWX#egdeZgi^Zh, 65 Value distribution, DataCleaner, 149 Index N V–W 673

Value mapper step, 94, 99, 170 W Value when XML is invalid, XSD Validator Warehouse Builder, 6, 9 step, 529 warnings, 405 Value when XML is valid, XSD Validator waterfall model, 12 step, 529 Wavemaker, 120 KVajZBZiV>ciZg[VXZ, 605–606 WDSL, SOAP, 517 van der Lek, Harm, 303 web van Dongen, Jos, 228, 327 browsers, slave servers, 413 K6G8=6G, 29 extraction, 137–138 variables, 43 pages Apache VFS, 641–642 HTML, 520 built-in, 637–642 web services, 515–517 hierarchy, 120 text files extraction, 137 internal, 428–429 web services, 515–568 Java API, 579–580 Apache VFS, 517 JavaScript, 396 API, 516 jobs, 89 data formats, 517–523 JRE, 642 Freebase, 550 `ZiiaZ#egdeZgi^Zh, 66 HTML, 520 logging, 367 JSON, 520–523 Spoon, 44 RSS, 517 HiZe>ciZg[VXZ, 618–619 SOAP, 517 transformations, 89, 579–580 web pages, 515–517 using, 44–45 XML, 518–520 KVg^VWaZHeVXZ, 618–619 Web Services Description Language VCS. See Version Control System (WSDL), 544 kZgh^dc, 324 Web services lookup step, 517 Version Control System (VCS), 341–344 SOAP, 544–546 ETL, 124 streams, 517 XML, 352 Web services tab, Web services lookup step, Version field, Dimension lookup / update 545 step, 235 l\Zi, 60 Version Migration System, 352–355 L=:G:, 230 ETL, 124 satellites, 484 parameters, 353–355 SQL, 553 repositories, 352–353 white box testing, 306 XML, 352 whitespace, 507 Versioning, Enterprise Edition, 636 widgets, 607–608 VFS. See Virtual File System wiki, 631 Virtual File System (VFS), 41, 42 Wikipedia, 549–550 Apache, 42, 349, 517, 619 Windows, 61–62 variables, 641–642 Wintner, Robert, 140 virtual machines (VM), 438 WMS. See Workflow Management Systems visual programming languages (VPLs), Workflow Management Systems (WMS), 344 45–51 Workflow Monitor, 124 steps, 47–49 wrappers, LucidDB, 10 transformations, 46 write back, 271 Visualize, Spoon, 302–303 WSDL. See Web Services Description VM. See virtual machines Language VPLs. See visual programming languages 674 Index N X–Z

X mba2N, 414 XBase, 134 "Mbm, memory, 253 XChat, 633 XP. See Extreme Programming mX]ViVfjV#hdjgZ[dg\Z#cZi, 633 XPath, 518, 532 XML. See eXtensible Markup Language Get data from XML step, 535 XML Join step, 519, 541–544 XSD Filename, XSD Validator step, 529–530 streams, 541 XSD Source, XSD Validator step, 529–530 XML output step, 518 XSD Validator step, 519, 528–530 CDC, 164 data validation, 530 XML Schema, 518, 528 error handling, 530 data validation, 530 job entries, 519 XSD Validator step, 519 XML, 133 XML Schema Definition, XSD Validator step, mh^/hX]ZbVAdXVi^dc, 529 529–530 XSL. See eXtensible Stylesheet Language XML source, XSD Validator step, 529 XSL Transformation job entry, 519 XML source from field, Get data from XML XSL Transformation step, 518–519 step, 532 XSL Transformations (XSLT), 133 XML source is a filename?, Get data from XSLT. See XSL Transformations XML step, 532 mhigZVb#XdYZ]Vjh#dg\, 603 XML source is defined in field, Get data XUL, 628 from XML step, 532 XML source is defined in field?, Get data Y from XML step, 548 Yourdon, Ed, 12 XML/A YouTube, 315 JavaScript, 281 MSAS, 279–280 Z OLAP, 277–282 #o^e, 60, 517 Rename fields step, 280