Algorithms of the Intelligent Web
Haralambos Marmanis
Dmitry Babenko

MANNING
Greenwich (74° w. long.)

Licensed to Deborah Christiansen <[email protected]>

For online information and ordering of this and other Manning books, please visit www.manning.com. The publisher offers discounts on this book when ordered in quantity. For more information, please contact

Special Sales Department
Manning Publications Co.
Sound View Court 3B
Greenwich, CT 06830
fax: (609) 877-8256
email: [email protected]

©2009 by Manning Publications Co. All rights reserved.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in the book, and Manning Publications was aware of a trademark claim, the designations have been printed in initial caps or all caps.

Recognizing the importance of preserving what has been written, it is Manning’s policy to have the books we publish printed on acid-free paper, and we exert our best efforts to that end. Recognizing also our responsibility to conserve the resources of our planet, Manning books are printed on paper that is at least 15% recycled and processed without the use of elemental chlorine.

Manning Publications Co.
Sound View Court 3B
Greenwich, CT 06830

Development Editor: Jeff Bleiel
Copyeditor: Benjamin Berg
Typesetter: Gordan Salinovic
Cover designer: Leslie Haimes

ISBN 978-1-933988-66-5
Printed in the United States of America
1 2 3 4 5 6 7 8 9 10 – MAL – 14 13 12 11 10 09

brief contents

1 ■ What is the intelligent web?
2 ■ Searching
3 ■ Creating suggestions and recommendations
4 ■ Clustering: grouping things together
5 ■ Classification: placing things where they belong
6 ■ Combining classifiers
7 ■ Putting it all together: an intelligent news portal

Appendix A Introduction to BeanShell
Appendix B Web crawling
Appendix C Mathematical refresher
Appendix D Natural language processing
Appendix E Neural networks

contents

preface
acknowledgments
about this book

1 What is the intelligent web?
  1.1 Examples of intelligent web applications
  1.2 Basic elements of intelligent applications
  1.3 What applications can benefit from intelligence?
      Social networking sites ■ Mashups ■ Portals ■ Wikis ■ Media-sharing sites ■ Online gaming
  1.4 How can I build intelligence in my own application?
      Examine your functionality and your data ■ Get more data from the web
  1.5 Machine learning, data mining, and all that
  1.6 Eight fallacies of intelligent applications
      Fallacy #1: Your data is reliable ■ Fallacy #2: Inference happens instantaneously ■ Fallacy #3: The size of data doesn’t matter ■ Fallacy #4: Scalability of the solution isn’t an issue ■ Fallacy #5: Apply the same good library everywhere ■ Fallacy #6: The computation time is known ■ Fallacy #7: Complicated models are better ■ Fallacy #8: There are models without bias
  1.7 Summary
  1.8 References

2 Searching
  2.1 Searching with Lucene
      Understanding the Lucene code ■ Understanding the basic stages of search
  2.2 Why search beyond indexing?
  2.3 Improving search results based on link analysis
      An introduction to PageRank ■ Calculating the PageRank vector ■ alpha: The effect of teleportation between web pages ■ Understanding the power method ■ Combining the index scores and the PageRank scores
  2.4 Improving search results based on user clicks
      A first look at user clicks ■ Using the NaiveBayes classifier ■ Combining Lucene indexing, PageRank, and user clicks
  2.5 Ranking Word, PDF, and other documents without links
      An introduction to DocRank ■ The inner workings of DocRank
  2.6 Large-scale implementation issues
  2.7 Is what you got what you want? Precision and recall
  2.8 Summary
  2.9 To do
  2.10 References

3 Creating suggestions and recommendations
  3.1 An online music store: the basic concepts
      The concepts of distance and similarity ■ A closer look at the calculation of similarity ■ Which is the best similarity formula?
  3.2 How do recommendation engines work?
      Recommendations based on similar users ■ Recommendations based on similar items ■ Recommendations based on content
  3.3 Recommending friends, articles, and news stories
      Introducing MyDiggSpace.com ■ Finding friends ■ The inner workings of DiggDelphi
  3.4 Recommending movies on a site such as Netflix.com
      An introduction of movie datasets and recommenders ■ Data normalization and correlation coefficients
  3.5 Large-scale implementation and evaluation issues
  3.6 Summary
  3.7 To Do
  3.8 References

4 Clustering: grouping things together
  4.1 The need for clustering
      User groups on a website: a case study ■ Finding groups with a SQL order by clause ■ Finding groups with array sorting
  4.2 An overview of clustering algorithms
      Clustering algorithms based on cluster structure ■ Clustering algorithms based on data type and structure ■ Clustering algorithms based on data size
  4.3 Link-based algorithms
      The dendrogram: a basic clustering data structure ■ A first look at link-based algorithms ■ The single-link algorithm ■ The average-link algorithm ■ The minimum-spanning-tree algorithm
  4.4 The k-means algorithm
      A first look at the k-means algorithm ■ The inner workings of k-means
  4.5 Robust Clustering Using Links (ROCK)
      Introducing ROCK ■ Why does ROCK rock?
  4.6 DBSCAN
      A first look at density-based algorithms ■ The inner workings of DBSCAN
  4.7 Clustering issues in very large datasets
      Computational complexity ■ High dimensionality
  4.8 Summary
  4.9 To Do
  4.10 References

5 Classification: placing things where they belong
  5.1 The need for classification
  5.2 An overview of classifiers
      Structural classification algorithms ■ Statistical classification algorithms ■ The lifecycle of a classifier
  5.3 Automatic categorization of emails and spam filtering
      NaïveBayes classification ■ Rule-based classification
  5.4 Fraud detection with neural networks
      A use case of fraud detection in transactional data ■ Neural networks overview ■ A neural network fraud detector at work ■ The anatomy of the fraud detector neural network ■ A base class for building general neural networks
  5.5 Are your results credible?
  5.6 Classification with very large datasets
  5.7 Summary
  5.8 To do
  5.9 References
      Classification schemes ■ Books and articles

6 Combining classifiers
  6.1 Credit worthiness: a case study for combining classifiers
      A brief description of the data ■ Generating artificial data for real problems
  6.2 Credit evaluation with a single classifier
      The naïve Bayes baseline ■ The decision tree baseline ■ The neural network baseline
  6.3 Comparing multiple classifiers on the same data
      McNemar’s test ■ The difference of proportions test ■ Cochran’s Q test and the F test
  6.4 Bagging: bootstrap aggregating
      The bagging classifier at work ■ A look under the hood of the bagging classifier ■ Classifier ensembles
  6.5 Boosting: an iterative improvement approach
      The boosting classifier at work ■ A look under the hood of the boosting classifier
  6.6 Summary
  6.7 To Do
  6.8 References

7 Putting it all together: an intelligent news portal
  7.1 An overview of the functionality
  7.2 Getting and cleansing content
      Get set. Get ready. Crawl the Web! ■ Review of the search prerequisites ■ A default set of retrieved and processed news stories
  7.3 Searching for news stories
  7.4 Assigning news categories
      Order matters! ■ Classifying with the NewsProcessor class ■ Meet the classifier ■ Classification strategy: going beyond low-level assignments
  7.5 Building news groups with the NewsProcessor class
      Clustering general news stories ■ Clustering news stories within a news category
  7.6 Dynamic content based on the user’s ratings
  7.7 Summary
  7.8 To do
  7.9 References

appendix A Introduction to BeanShell
appendix B Web crawling
appendix C Mathematical refresher
appendix D Natural language processing
appendix E Neural networks
index

preface

During my graduate school years I became acquainted with the field of machine learning, and in particular the field of pattern recognition. The focus of my work was on mathematical modeling and numerical simulations, but the ability to recognize patterns in a large volume of data had obvious applications in many fields. The years that followed brought me closer to the subject of machine learning than I ever imagined.