Data Mining with Microsoft SQL Server 2008 / Jamie Maclennan, Bogdan Crivat, Zhaohui Tang
Total Page:16
File Type:pdf, Size:1020Kb
Maclennan ffirs.tex V3 - 10/04/2008 3:27am Page ii Maclennan ffirs.tex V3 - 10/04/2008 3:27am Page i Data Mining with Microsoft SQL Server2008 Maclennan ffirs.tex V3 - 10/04/2008 3:27am Page ii Maclennan ffirs.tex V3 - 10/04/2008 3:27am Page iii Data Mining with Microsoft SQL Server2008 Jamie MacLennan ZhaoHui Tang Bogdan Crivat Wiley Publishing, Inc. Maclennan ffirs.tex V3 - 10/04/2008 3:27am Page iv Data Mining with MicrosoftSQL Server2008 Published by Wiley Publishing, Inc. 10475 Crosspoint Boulevard Indianapolis, IN 46256 www.wiley.com Copyright 2009 by Wiley Publishing, Inc., Indianapolis, Indiana Published by Wiley Publishing, Inc., Indianapolis, Indiana Published simultaneously in Canada ISBN: 978-0-470-27774-4 Manufactured in the United States of America 10987654321 No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 646-8600. Requests to the Publisher for permission should be addressed to the Legal Department, Wiley Publishing, Inc., 10475 Crosspoint Blvd., Indianapolis, IN 46256, (317) 572-3447, fax (317) 572-4355, or online at www.wiley.com/go/permissions. Limit of Liability/Disclaimer of Warranty: The publisher and the author make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation warranties of fitness for a particular purpose. No warranty may be created or extended by sales or promotional materials. The advice and strategies contained herein may not be suitable for every situation. This work is sold with the understanding that the publisher is not engaged in rendering legal, accounting, or other professional services. If professional assistance is required, the services of a competent professional person should be sought. Neither the publisher nor the author shall be liable for damages arising herefrom. The fact that an organization or Web site is referred to in this work as a citation and/or a potential source of further information does not mean that the author or the publisher endorses the information the organization or Web site may provide or recommendations it may make. Further, readers should be aware that Internet Web sites listed in this work may have changed or disappeared between when this work was written and when it is read. For general information on our other products and services please contact our Customer Care Department within the U.S. at (800) 762-2974, outside the United States at (317) 572-3993, or fax (317) 572-4002. Library of Congress Cataloging-in-Publication Data MacLennan, Jamie. Data mining with Microsoft SQL server 2008 / Jamie MacLennan, Bogdan Crivat, ZhaoHui Tang. p. cm. Includes index. ISBN 978-0-470-27774-4 (paper/website) 1. SQL server. 2. Data mining. I. Crivat, Bogdan. II. Tang, Zhaohui. III. Title. QA76.9.D343M335 2008 005.75 85 — dc22 2008035467 Trademarks: Wileyand the Wiley logo are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or its affiliates, in the United States and other countries, and may not be used without written permission. Microsoft and SQL Server are registered trademarks of Microsoft Corporation in the United States and/or other countries. All other trademarks are the property of their respective owners. Wiley Publishing, Inc. is not associated with any product or vendor mentioned in this book. Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books. Maclennan ffirs.tex V3 - 10/04/2008 3:27am Page v To Logan, because he needs it the most. — Jamie MacLennan This book is for Cosmin, with great hope that he will someday find math (and data mining) to be fun and interesting. — Bogdan Crivat Maclennan ffirs.tex V3 - 10/04/2008 3:27am Page vi Maclennan f01.tex V2 - 10/04/2008 3:30am Page vii About the Authors Jamie MacLennan is the principal development manager of SQL Server Analy- sis Services at Microsoft. In addition to being responsible for the development and delivery of the Data Mining and OLAP technologies for SQL Server, MacLennan is a proud husband and father of four. He has more than 25 patents and patents pending for his work on SQL Server Data Mining. MacLennan has written extensively on the data mining technology in SQL Server, includ- ing many articles in MSDN Magazine, SQL Server Magazine, and postings on SQLServerDataMining.com and his blog at http://blogs.msdn.com/jamiemac. This is his second edition of Data Mining with SQL Server. MacLennan has been a featured and invited speaker at conferences worldwide, including Microsoft TechEd, Microsoft TechEd Europe, SQL PASS, the Knowledge Discovery and Data Mining (KDD) conference, the Americas Conference on Information Systems (AMCIS), and the Data Mining Cup conference. ZhaoHui Tang is a group program manager at Microsoft adCenter Labs, where he manages a number of research projects related to paid search and content ads. He is the inventor of Microsoft Keyword Services Platform. Prior to adCenter, he spent six years as a lead program manager in the SQL Server Business Intelligence (BI) group, mainly focusing on data mining develop- ment. He has written numerous articles for both academic and industrial publications, such as The VLDB Journal and SQL Server Magazine.Heisa frequent speaker at business intelligence conferences. He was also a co-author of the previous edition of this book, Data Mining with SQL Server 2005. Bogdan Crivat is a senior software design engineer in SQL Server Analy- sis Services at Microsoft, working primarily on the Data Mining platform. vii Maclennan f01.tex V2 - 10/04/2008 3:30am Page viii viii About the Authors Crivat has written various articles on data mining for MSDN Magazine and Access/VB/SQL Advisor Magazine,aswellasnumerouspostingsonthe SQLServerDataMining.com website and on the MSDN Forums. He presented at various Microsoft and data mining professional conferences. Crivat also blogs about SQL Server Data Mining at www.bogdancrivat.net/dm. Maclennan f02.tex V2 - 10/04/2008 3:31am Page ix Credits Executive Editor Vice President and Executive Robert Elliott Group Publisher Richard Swadley Development Editor Kevin Shafer Vice President and Executive Publisher Technical Editors Joseph B. Wikert Raman Iyer; Shuvro Mitra Project Coordinator, Cover Production Editor Lynsey Stanford Dassi Zeidel Proofreader Copy Editor Publication Services, Inc. Kathryn Duggan Indexer Editorial Manager Ted Laux Mary Beth Wakefield Cover Image Production Manager Darren Greenwood/Design Pics/ Tim Tate Corbis ix Maclennan f02.tex V2 - 10/04/2008 3:31am Page x Maclennan f03.tex V2 - 10/04/2008 3:31am Page xi Acknowledgments First of all we would like to acknowledge the help from our data mining team members and other colleagues in the Microsoft SQL Server Business Intelligence (BI) organization. In addition to creating the best data mining package on the planet, most of them gave up some of their free time to review the text and sample code. Direct thanks go to Shuvro Mitra, Raman Iyer, Dana Cristofor, Jeanine Nelson-Takaki, and Niketan Pansare for helping review our text to ensure that it makes sense and that our samples work. Thanks also to the rest of the data mining team, including Donald Farmer, Tatyana Yakushev, Yimin Wu, Fernando Godinez Delgado, Gang Xiao, Liu Tang, and Bo Simmons for building such a great product. In addition, we would like to thank the SQL BI management of Kamal Hathi and Tom Casey for supporting data mining in SQL Server. SQL Server 2008 Data Mining (including the Data Mining Add-Ins) is a product jointly developed by the SQL Server Analysis Services team and other teams inside Microsoft. We would like to thank colleagues from Excel — notably Rob Collie, Howie Dickerman, and Dan Battagin, whose valuable input into the design of the Data Mining Add-Ins guaranteed their success. Also thanks to those in the Machine Learning and Applied Statistics (MLAS) Group, headed by Research Manager David Heckerman, who continue to advise us on deep algorithmic issues in our product. We would like to thank David Heckerman, Jesper Lind, Alexei Bocharov, Chris Meek, Bo Thiesson, and Max Chickering for their contributions. We would like to give special thanks to Kevin Shafer for his close editing of our text, which has greatly improved the quality of this manuscript. Also thanks to Wiley Publications acquisitions editor Bob Elliot for his support and patience. xi Maclennan f03.tex V2 - 10/04/2008 3:31am Page xii xii Acknowledgments Special thanks from Jamie to his wife, April, who yet again supported him through the ups and downs of authoring a book, particularly during painful rewrites and recaptures of screen shots, while taking care of our kids and the world around me. Elalu, honey. Bogdan would like to thank his wife, Irinel, for supporting him, reviewing his chapters, and some really helpful hints for capturing screen shots. Maclennan cag.tex V2 - 10/04/2008 3:54am Page xiii Contents at a Glance Foreword xxix Introduction xxxi Chapter 1 Introduction to Data Mining in SQL Server 2008 1 Chapter 2 Applied Data Mining