PRINCIPLES OF BIG DATA
Preparing, Sharing, and Analyzing Complex Information
JULES J. BERMAN, Ph.D., M.D.

AMSTERDAM • BOSTON • HEIDELBERG • LONDON • NEW YORK • OXFORD • PARIS • SAN DIEGO • SAN FRANCISCO • SINGAPORE • SYDNEY • TOKYO

Morgan Kaufmann is an imprint of Elsevier

Acquiring Editor: Andrea Dierna
Editorial Project Manager: Heather Scherer
Project Manager: Punithavathy Govindaradjane
Designer: Russell Purdy

Morgan Kaufmann is an imprint of Elsevier
225 Wyman Street, Waltham, MA 02451, USA

Copyright © 2013 Elsevier Inc. All rights reserved

No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher’s permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions.

This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).

Notices

Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods or professional practices, may become necessary. Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information or methods described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.
To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.

Library of Congress Cataloging-in-Publication Data
Berman, Jules J.
Principles of big data : preparing, sharing, and analyzing complex information / Jules J Berman.
pages cm
ISBN 978-0-12-404576-7
1. Big data. 2. Database management. I. Title.
QA76.9.D32B47 2013
005.74–dc23
2013006421

British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library

Printed and bound in the United States of America
13 14 15 16 17 10 9 8 7 6 5 4 3 2 1

For information on all MK publications visit our website at www.mkp.com

Dedication

To my father, Benjamin

Contents

Acknowledgments
Author Biography
Preface
Introduction

1. Providing Structure to Unstructured Data
    Background
    Machine Translation
    Autocoding
    Indexing
    Term Extraction

2. Identification, Deidentification, and Reidentification
    Background
    Features of an Identifier System
    Registered Unique Object Identifiers
    Really Bad Identifier Methods
    Embedding Information in an Identifier: Not Recommended
    One-Way Hashes
    Use Case: Hospital Registration
    Deidentification
    Data Scrubbing
    Reidentification
    Lessons Learned

3. Ontologies and Semantics
    Background
    Classifications, the Simplest of Ontologies
    Ontologies, Classes with Multiple Parents
    Choosing a Class Model
    Introduction to Resource Description Framework Schema
    Common Pitfalls in Ontology Development

4. Introspection
    Background
    Knowledge of Self
    eXtensible Markup Language
    Introduction to Meaning
    Namespaces and the Aggregation of Meaningful Assertions
    Resource Description Framework Triples
    Reflection
    Use Case: Trusted Time Stamp
    Summary

5. Data Integration and Software Interoperability
    Background
    The Committee to Survey Standards
    Standard Trajectory
    Specifications and Standards
    Versioning
    Compliance Issues
    Interfaces to Big Data Resources

6. Immutability and Immortality
    Background
    Immutability and Identifiers
    Data Objects
    Legacy Data
    Data Born from Data
    Reconciling Identifiers across Institutions
    Zero-Knowledge Reconciliation
    The Curator’s Burden

7. Measurement
    Background
    Counting
    Gene Counting
    Dealing with Negations
    Understanding Your Control
    Practical Significance of Measurements
    Obsessive-Compulsive Disorder: The Mark of a Great Data Manager

8. Simple but Powerful Big Data Techniques
    Background
    Look at the Data
    Data Range
    Denominator
    Frequency Distributions
    Mean and Standard Deviation
    Estimation-Only Analyses
    Use Case: Watching Data Trends with Google Ngrams
    Use Case: Estimating Movie Preferences

9. Analysis
    Background
    Analytic Tasks
    Clustering, Classifying, Recommending, and Modeling
    Data Reduction
    Normalizing and Adjusting Data
    Big Data Software: Speed and Scalability
    Find Relationships, Not Similarities

10. Special Considerations in Big Data Analysis
    Background
    Theory in Search of Data
    Data in Search of a Theory
    Overfitting
    Bigness Bias
    Too Much Data
    Fixing Data
    Data Subsets in Big Data: Neither Additive nor Transitive
    Additional Big Data Pitfalls

11. Stepwise Approach to Big Data Analysis
    Background
    Step 1. A Question Is Formulated
    Step 2. Resource Evaluation
    Step 3. A Question Is Reformulated
    Step 4. Query Output Adequacy
    Step 5. Data Description
    Step 6. Data Reduction
    Step 7. Algorithms Are Selected, If Absolutely Necessary
    Step 8. Results Are Reviewed and Conclusions Are Asserted
    Step 9. Conclusions Are Examined and Subjected to Validation

12. Failure
    Background
    Failure Is Common
    Failed Standards
    Complexity
    When Does Complexity Help?
    When Redundancy Fails
    Save Money; Don’t Protect Harmless Information
    After Failure
    Use Case: Cancer Biomedical Informatics Grid, a Bridge Too Far

13. Legalities
    Background
    Responsibility for the Accuracy and Legitimacy of Contained Data
    Rights to Create, Use, and Share the Resource
    Copyright and Patent Infringements Incurred by Using Standards
    Protections for Individuals
    Consent
    Unconsented Data
    Good Policies Are a Good Policy
    Use Case: The Havasupai Story

14. Societal Issues
    Background
    How Big Data Is Perceived
    The Necessity of Data Sharing, Even When It Seems Irrelevant
    Reducing Costs and Increasing Productivity with Big Data
    Public Mistrust
    Saving Us from Ourselves
    Hubris and Hyperbole

15. The Future
    Background
    Last Words

Glossary
References
Index

Acknowledgments

I thank Roger Day and Paul Lewis, who resolutely pored through the entire manuscript, placing insightful and useful comments in every chapter. I thank Stuart Kramer, whose valuable suggestions for the content and organization of the text came when the project was in its formative stage. Special thanks go to Denise Penrose, who worked on her very last day at Elsevier to find this title a suitable home at Elsevier’s Morgan Kaufmann imprint. I thank Andrea Dierna, Heather Scherer, and all the staff at Morgan Kaufmann who shepherded this book through the publication and marketing processes.
Author Biography

Jules Berman holds two Bachelor of Science degrees from MIT (Mathematics, and Earth and Planetary Sciences), a Ph.D. from Temple University, and an M.D. from the University of Miami. He was a graduate researcher in the Fels Cancer Research Institute at Temple University and at the American Health Foundation in Valhalla, New York. His postdoctoral studies were completed at the U.S. National Institutes of Health, and his residency was completed at the George Washington University Medical Center in Washington, DC. Dr. Berman served as Chief of Anatomic Pathology, Surgical Pathology and Cytopathology at the Veterans Administration Medical Center in Baltimore, Maryland, where he held joint appointments at the University of Maryland Medical Center and at the Johns Hopkins Medical Institutions. In 1998, he became the Program Director for Pathology Informatics in the Cancer Diagnosis Program at the U.S. National Cancer Institute, where he worked and consulted on Big Data projects. In 2006, Dr. Berman was President of the Association for Pathology Informatics. In 2011, he received the Lifetime Achievement Award from the Association for Pathology Informatics. He is a coauthor on hundreds of scientific publications. Today, Dr. Berman is a freelance author, writing extensively in his three areas of expertise: informatics, computer programming, and pathology.

Preface

We can’t solve problems by using the same kind of thinking we used when we created them.
Albert Einstein

Data pours into millions of computers every moment of every day. It is estimated that the total accumulated data stored on computers worldwide is about 300 exabytes (that’s 300 billion gigabytes). Data storage increases

value. The primary purpose of this book is to explain the principles upon which serious Big Data resources are built. All of the data held in Big Data resources must have a form that supports search, retrieval, and analysis. The analytic methods must be available for review, and the analytic results must be available for validation.

Perhaps