Meta-Algorithmics Patterns for Robust, Low Cost, High Quality Systems
Total Page:16
File Type:pdf, Size:1020Kb
STEVEN J. SIMSKE META-ALGORITHMICS PATTERNS FOR ROBUST, LOW COST, HIGH QUALITY SYSTEMS META-ALGORITHMICS META-ALGORITHMICS PATTERNS FOR ROBUST, LOW-COST, HIGH-QUALITY SYSTEMS Steven J. Simske HP Labs, Colorado, USA C 2013 John Wiley & Sons, Ltd Registered office John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, United Kingdom For details of our global editorial offices, for customer services and for information about how to apply for permission to reuse the copyright material in this book please see our website at www.wiley.com. The right of the author to be identified as the author of this work has been asserted in accordance with the Copyright, Designs and Patents Act 1988. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by the UK Copyright, Designs and Patents Act 1988, without the prior permission of the publisher. Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books. Designations used by companies to distinguish their products are often claimed as trademarks. All brand names and product names used in this book are trade names, service marks, trademarks or registered trademarks of their respective owners. The publisher is not associated with any product or vendor mentioned in this book. Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. It is sold on the understanding that the publisher is not engaged in rendering professional services and neither the publisher nor the author shall be liable for damages arising herefrom. If professional advice or other expert assistance is required, the services of a competent professional should be sought. Library of Congress Cataloging-in-Publication Data Simske, Steven J. Meta-algorithmics : patterns for robust, low-cost, high-quality systems / Dr. Steven J. Simske, Hewlett-Packard Labs. pages cm ISBN 978-1-118-34336-4 (hardback) 1. Computer algorithms. 2. Parallel algorithms. 3. Heuristic programming. 4. Computer systems–Costs. 5. Computer systems–Quality control. I. Title. QA76.9.A43S543 2013 005.1–dc23 2013004488 A catalogue record for this book is available from the British Library. ISBN: 9781118343364 Typeset in 10/12pt Times by Aptara Inc., New Delhi, India Contents Acknowledgments xi 1 Introduction and Overview 1 1.1 Introduction 1 1.2 Why Is This Book Important? 2 1.3 Organization of the Book 3 1.4 Informatics 4 1.5 Ensemble Learning 6 1.6 Machine Learning/Intelligence 7 1.6.1 Regression and Entropy 8 1.6.2 SVMs and Kernels 9 1.6.3 Probability 15 1.6.4 Unsupervised Learning 17 1.6.5 Dimensionality Reduction 18 1.6.6 Optimization and Search 20 1.7 Artificial Intelligence 22 1.7.1 Neural Networks 22 1.7.2 Genetic Algorithms 25 1.7.3 Markov Models 28 1.8 Data Mining/Knowledge Discovery 31 1.9 Classification 32 1.10 Recognition 38 1.11 System-Based Analysis 39 1.12 Summary 39 References 40 2 Parallel Forms of Parallelism 42 2.1 Introduction 42 2.2 Parallelism by Task 43 2.2.1 Definition 43 2.2.2 Application to Algorithms and Architectures 46 2.2.3 Application to Scheduling 51 2.3 Parallelism by Component 52 2.3.1 Definition and Extension to Parallel-Conditional Processing 52 vi Contents 2.3.2 Application to Data Mining, Search, and Other Algorithms 55 2.3.3 Application to Software Development 59 2.4 Parallelism by Meta-algorithm 64 2.4.1 Meta-algorithmics and Algorithms 66 2.4.2 Meta-algorithmics and Systems 67 2.4.3 Meta-algorithmics and Parallel Processing 68 2.4.4 Meta-algorithmics and Data Collection 69 2.4.5 Meta-algorithmics and Software Development 70 2.5 Summary 71 References 72 3 Domain Areas: Where Are These Relevant? 73 3.1 Introduction 73 3.2 Overview of the Domains 74 3.3 Primary Domains 75 3.3.1 Document Understanding 75 3.3.2 Image Understanding 77 3.3.3 Biometrics 78 3.3.4 Security Printing 79 3.4 Secondary Domains 86 3.4.1 Image Segmentation 86 3.4.2 Speech Recognition 90 3.4.3 Medical Signal Processing 90 3.4.4 Medical Imaging 92 3.4.5 Natural Language Processing 95 3.4.6 Surveillance 97 3.4.7 Optical Character Recognition 98 3.4.8 Security Analytics 101 3.5 Summary 101 References 102 4 Applications of Parallelism by Task 104 4.1 Introduction 104 4.2 Primary Domains 105 4.2.1 Document Understanding 112 4.2.2 Image Understanding 118 4.2.3 Biometrics 126 4.2.4 Security Printing 131 4.3 Summary 135 References 136 5 Application of Parallelism by Component 137 5.1 Introduction 137 5.2 Primary Domains 138 5.2.1 Document Understanding 138 5.2.2 Image Understanding 152 Contents vii 5.2.3 Biometrics 162 5.2.4 Security Printing 170 5.3 Summary 172 References 173 6 Introduction to Meta-algorithmics 175 6.1 Introduction 175 6.2 First-Order Meta-algorithmics 178 6.2.1 Sequential Try 178 6.2.2 Constrained Substitute 181 6.2.3 Voting and Weighted Voting 184 6.2.4 Predictive Selection 189 6.2.5 Tessellation and Recombination 192 6.3 Second-Order Meta-algorithmics 195 6.3.1 Confusion Matrix and Weighted Confusion Matrix 195 6.3.2 Confusion Matrix with Output Space Transformation (Probability Space Transformation) 199 6.3.3 Tessellation and Recombination with Expert Decisioner 203 6.3.4 Predictive Selection with Secondary Engines 206 6.3.5 Single Engine with Required Precision 208 6.3.6 Majority Voting or Weighted Confusion Matrix 209 6.3.7 Majority Voting or Best Engine 210 6.3.8 Best Engine with Differential Confidence or Second Best Engine 212 6.3.9 Best Engine with Absolute Confidence or Weighted Confusion Matrix 217 6.4 Third-Order Meta-algorithmics 218 6.4.1 Feedback 219 6.4.2 Proof by Task Completion 221 6.4.3 Confusion Matrix for Feedback 224 6.4.4 Expert Feedback 228 6.4.5 Sensitivity Analysis 232 6.4.6 Regional Optimization (Extended Predictive Selection) 236 6.4.7 Generalized Hybridization 239 6.5 Summary 240 References 240 7 First-Order Meta-algorithmics and Their Applications 241 7.1 Introduction 241 7.2 First-Order Meta-algorithmics and the “Black Box” 241 7.3 Primary Domains 242 7.3.1 Document Understanding 242 7.3.2 Image Understanding 246 7.3.3 Biometrics 252 7.3.4 Security Printing 256 7.4 Secondary Domains 257 7.4.1 Medical Signal Processing 258 viii Contents 7.4.2 Medical Imaging 264 7.4.3 Natural Language Processing 268 7.5 Summary 271 References 271 8 Second-Order Meta-algorithmics and Their Applications 272 8.1 Introduction 272 8.2 Second-Order Meta-algorithmics and Targeting the “Fringes” 273 8.3 Primary Domains 279 8.3.1 Document Understanding 280 8.3.2 Image Understanding 293 8.3.3 Biometrics 297 8.3.4 Security Printing 299 8.4 Secondary Domains 304 8.4.1 Image Segmentation 305 8.4.2 Speech Recognition 307 8.5 Summary 308 References 308 9 Third-Order Meta-algorithmics and Their Applications 310 9.1 Introduction 310 9.2 Third-Order Meta-algorithmic Patterns 311 9.2.1 Examples Covered 311 9.2.2 Training-Gap-Targeted Feedback 311 9.3 Primary Domains 313 9.3.1 Document Understanding 313 9.3.2 Image Understanding 315 9.3.3 Biometrics 318 9.3.4 Security Printing 323 9.4 Secondary Domains 328 9.4.1 Surveillance 328 9.4.2 Optical Character Recognition 334 9.4.3 Security Analytics 337 9.5 Summary 340 References 341 10 Building More Robust Systems 342 10.1 Introduction 342 10.2 Summarization 342 10.2.1 Ground Truthing for Meta-algorithmics 342 10.2.2 Meta-algorithmics for Keyword Generation 347 10.3 Cloud Systems 350 10.4 Mobile Systems 353 10.5 Scheduling 355 10.6 Classification 356 10.7 Summary 358 Reference 359 Contents ix 11 The Future 360 11.1 Recapitulation 360 11.2 The Pattern of All Patience 362 11.3 Beyond the Pale 365 11.4 Coming Soon 367 11.5 Summary 368 References 368 Index 369 Acknowledgments The goals of this book were ambitious—perhaps too ambitious—both in breadth (domains ad- dressed) and depth (number and variety of parallel processing and meta-algorithmic patterns). The book represents, or at least builds on, the work of many previous engineers, scientists and knowledge workers. Undoubtedly most, if not all, of the approaches in this book have been elaborated elsewhere, either overtly or in some disguise. One contribution of this book is to bring these disparate approaches together in one place, systematizing the design of in- telligent parallel systems. In progressing from the design of parallel systems using traditional by-component and by-task approaches to meta-algorithmic parallelism, the advantages of hybridization for system accuracy, robustness and cost were shown. I have a lot of people to thank for making this book a reality. First and foremost, I’d like to thank my wonderful family—Tess, Kieran and Dallen—for putting up with a year’s worth of weekends and late nights spent designing and running the many “throw away” experiments necessary to illustrate the application of meta-algorithmics (not to mention that little thing of actually writing).