Real-Time Analytics

Techniques to Analyze and Visualize Streaming Data

Byron Ellis


Published by John Wiley & Sons, Inc. 10475 Crosspoint Boulevard Indianapolis, IN 46256 www.wiley.com

Copyright © 2014 by John Wiley & Sons, Inc., Indianapolis, Indiana Published simultaneously in Canada

ISBN: 978-1-118-83791-7 ISBN: 978-1-118-83793-1 (ebk) ISBN: 978-1-118-83802-0 (ebk)

Manufactured in the United States of America

10 9 8 7 6 5 4 3 2 1

No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 646-8600. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.

Limit of Liability/Disclaimer of Warranty: The publisher and the author make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation warranties of fitness for a particular purpose. No warranty may be created or extended by sales or promotional materials. The advice and strategies contained herein may not be suitable for every situation. This work is sold with the understanding that the publisher is not engaged in rendering legal, accounting, or other professional services. If professional assistance is required, the services of a competent professional person should be sought. Neither the publisher nor the author shall be liable for damages arising herefrom. The fact that an organization or Web site is referred to in this work as a citation and/or a potential source of further information does not mean that the author or the publisher endorses the information the organization or website may provide or recommendations it may make. Further, readers should be aware that Internet websites listed in this work may have changed or disappeared between when this work was written and when it is read.

For general information on our other products and services please contact our Customer Care Department within the United States at (877) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.

Wiley publishes in a variety of print and electronic formats and by print-on-demand. Some material included with standard print versions of this book may not be included in e-books or in print-on-demand. If this book refers to media such as a CD or DVD that is not included in the version you purchased, you may download this material at http://booksupport.wiley.com. For more information about Wiley products, visit www.wiley.com.

Library of Congress Control Number: 2014935749

Trademarks: Wiley and the Wiley logo are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or its affiliates, in the United States and other countries, and may not be used without written permission. All other trademarks are the property of their respective owners. John Wiley & Sons, Inc. is not associated with any product or vendor mentioned in this book.

As always, for Natasha.

Credits

Executive Editor: Robert Elliott
Project Editor: Kelly Talbot
Technical Editors: Luke Hornof, Ben Peirce, Jose Quinteiro
Production Editors: Christine Mugnolo, Daniel Scribner
Copy Editor: Charlotte Kugen
Manager of Content Development and Assembly: Mary Beth Wakefield
Director of Community Marketing: David Mayhew
Marketing Manager: Carrie Sherrill
Business Manager: Amy Knies
Vice President and Executive Group Publisher: Richard Swadley
Associate Publisher: Jim Minatel
Project Coordinator, Cover: Todd Klemme
Proofreader: Nancy Carrasco
Indexer: John Sleeva
Cover Designer: Wiley


About the Author

Byron Ellis is the CTO of Spongecell, an advertising technology firm based in New York, with offices in San Francisco, Chicago, and London. He is responsible for research and development as well as the maintenance of Spongecell's computing infrastructure. Prior to joining Spongecell, he was Chief Data Scientist for Liveperson, a leading provider of online engagement technology. He also held a variety of positions at adBrite, Inc, one of the world's largest advertising exchanges at the time. Additionally, he has a PhD in Statistics from Harvard where he studied methods for learning the structure of networks from experimental data obtained from high throughput biology experiments.

About the Technical Editors

With 20 years of technology experience, Jose Quinteiro has been an integral part of the design and development of a significant number of end-user, enterprise, and Web software systems and applications. He has extensive experience with the full stack of Web technologies, including both front-end and back-end design and implementation. Jose earned a B.S. in Chemistry from The College of William & Mary.

Luke Hornof has a Ph.D. in Computer Science and has been part of several successful high-tech startups. His research in programming languages has resulted in more than a dozen peer-reviewed publications. He has also developed commercial software for the microprocessor, advertising, and music industries. His current interests include using analytics to improve web and mobile applications.

Ben Peirce manages research and infrastructure at Spongecell, an advertising technology company. Prior to joining Spongecell, he worked in a variety of roles in healthcare technology startups, and he co-founded SET Media, an ad tech company focusing on video. He holds a PhD from Harvard University's School of Engineering and Applied Sciences, where he studied control systems and robotics.



Acknowledgments

Before writing a book, whenever I would see "there are too many people to thank" in the acknowledgments section it would seem cliché. It turns out that it is not so much cliché as a simple statement of fact. There really are more people to thank than could reasonably be put into print. If nothing else, including them all would make the book really hard to hold. However, there are a few people I would like to specifically thank for their contributions, knowing and unknowing, to the book.

The first, of course, is Robert Elliott at Wiley, who seemed to think that a presentation that he had liked could possibly be a book of nearly 400 pages. Without him, this book simply wouldn't exist. I would also like to thank Justin Langseth, who was not able to join me in writing this book but was my co-presenter at the talk that started the ball rolling. Hopefully, we will get a chance to reprise that experience. I would also like to thank my editors Charlotte, Rick, Jose, Luke, and Ben, led by Kelly Talbot, who helped find and correct my many mistakes and kept the project on the straight and narrow. Any mistakes that may be left, you can be assured, are my fault.

For less obvious contributions, I would like to thank all of the DDG regulars. At least half, probably more like 80%, of the software in this book is here as a direct result of a conversation I had with someone over a beer. Not too shabby for an informal, loose-knit gathering. Thanks to Mike for first inviting me along and to Matt and Zack for hosting so many events over the years.

Finally, I'd like to thank my colleagues over the years. You all deserve some sort of medal for putting up with my various harebrained schemes and tinkering. An especially big shout-out goes to the adBrite crew. We built a lot of cool stuff that I know for a fact is still cutting edge. Caroline Moon gets a big thank you for not being too concerned when her analytics folks wandered off, started playing with this newfangled "Hadoop" thing, and started collecting more servers than she knew we had. I'd also especially like to thank Daniel Issen and Vadim Geshel. We didn't always see eye-to-eye (and probably still don't), but what you see in this book is here in large part due to arguments I've had with those two.

Contents

Introduction

Chapter 1  Introduction to Streaming Data
    Sources of Streaming Data
        Operational Monitoring
        Web Analytics
        Online Advertising
        Social Media
        Mobile Data and the Internet of Things
    Why Streaming Data Is Different
        Always On, Always Flowing
        Loosely Structured
        High-Cardinality Storage
    Infrastructures and Algorithms
    Conclusion

Part I  Streaming Analytics Architecture

Chapter 2  Designing Real-Time Streaming Architectures
    Real-Time Architecture Components
        Collection
        Data Flow
        Processing
        Storage
        Delivery
    Features of a Real-Time Architecture
        High Availability
        Low Latency
        Horizontal Scalability
    Languages for Real-Time Programming
        Java
        Scala and Clojure
        JavaScript
        The Go Language
    A Real-Time Architecture Checklist
        Collection
        Data Flow
        Processing
        Storage
        Delivery
    Conclusion

Chapter 3  Service Configuration and Coordination
    Motivation for Configuration and Coordination Systems
        Maintaining Distributed State
        Unreliable Network Connections
        Clock Synchronization
        Consensus in an Unreliable World
    Apache ZooKeeper
        The znode
        Watches and Notifications
        Maintaining Consistency
        Creating a ZooKeeper Cluster
        ZooKeeper's Native Java Client
        The Curator Client
        Curator Recipes
    Conclusion

Chapter 4  Data-Flow Management in Streaming Analysis
    Distributed Data Flows
        At Least Once Delivery
        The "n+1" Problem
    Apache Kafka: High-Throughput Distributed Messaging
        Design and Implementation
        Configuring a Kafka Environment
        Interacting with Kafka Brokers
    Apache Flume: Distributed Log Collection
        The Flume Agent
        Configuring the Agent
        The Flume Data Model
        Channel Selectors
        Flume Sources
        Flume Sinks
        Sink Processors
        Flume Channels
        Flume Interceptors
        Integrating Custom Flume Components
        Running Flume Agents
    Conclusion

Chapter 5  Processing Streaming Data
    Distributed Streaming Data Processing
        Coordination
        Partitions and Merges
        Transactions
    Processing Data with Storm
        Components of a Storm Cluster
        Configuring a Storm Cluster
        Distributed Clusters
        Local Clusters
        Storm Topologies
        Implementing Bolts
        Implementing and Using Spouts
        Distributed Remote Procedure Calls
        Trident: The Storm DSL
    Processing Data with Samza
        Apache YARN
        Getting Started with YARN and Samza
        Integrating Samza into the Data Flow
        Samza Jobs
    Conclusion

Chapter 6  Storing Streaming Data
    Consistent Hashing
    "NoSQL" Storage Systems
        Redis
        MongoDB
        Cassandra
    Other Storage Technologies
        Relational Databases
        Distributed In-Memory Data Grids
    Choosing a Technology
        Key-Value Stores
        Document Stores
        Distributed Hash Table Stores
        In-Memory Grids
        Relational Databases
    Warehousing
        Hadoop as ETL and Warehouse
        Lambda Architectures
    Conclusion

Part II  Analysis and Visualization

Chapter 7  Delivering Streaming Metrics
    Streaming Web Applications
        Working with Node
        Managing a Node Project with NPM
        Developing Node Web Applications
        A Basic Streaming Dashboard
        Adding Streaming to Web Applications
    Visualizing Data
        HTML5 Canvas and Inline SVG
        Data-Driven Documents: D3.js
        High-Level Tools
    Mobile Streaming Applications
    Conclusion

Chapter 8  Exact Aggregation and Delivery
    Timed Counting and Summation
        Counting in Bolts
        Counting with Trident
        Counting in Samza
    Multi-Resolution Time-Series Aggregation
        Quantization Framework
        Stochastic Optimization
    Delivering Time-Series Data
        Strip Charts with D3.js
        High-Speed Canvas Charts
        Horizon Charts
    Conclusion

Chapter 9  Statistical Approximation of Streaming Data
    Numerical Libraries
    Probabilities and Distributions
        Expectation and Variance
        Statistical Distributions
        Discrete Distributions
        Continuous Distributions
        Joint Distributions
    Working with Distributions
        Inferring Parameters
        The Delta Method
        Distribution Inequalities
    Random Number Generation
        Generating Specific Distributions
    Sampling Procedures
        Sampling from a Fixed Population
        Sampling from a Streaming Population
        Biased Streaming Sampling
    Conclusion

Chapter 10  Approximating Streaming Data with Sketching
    Registers and Hash Functions
        Registers
        Hash Functions
        Working with Sets
    The Bloom Filter
        The Algorithm
        Choosing a Filter Size
        Unions and Intersections
        Cardinality Estimation
        Interesting Variations
    Distinct Value Sketches
        The Min-Count Algorithm
        The HyperLogLog Algorithm
    The Count-Min Sketch
        Point Queries
        Count-Min Sketch Implementation
        Top-K and "Heavy Hitters"
        Range and Quantile Queries
        Other Applications
    Conclusion

Chapter 11  Beyond Aggregation
    Models for Real-Time Data
        Simple Time-Series Models
        Linear Models
        Logistic Regression
        Neural Network Models
    Forecasting with Models
        Exponential Smoothing Methods
        Regression Methods
        Neural Network Methods
    Monitoring
        Outlier Detection
        Change Detection
    Real-Time Optimization
    Conclusion

Index

Introduction

Overview and Organization of This Book

Dealing with streaming data involves a lot of moving parts and drawing from many different aspects of software development and engineering. On the one hand, streaming data requires a resilient infrastructure that can move data quickly and easily. On the other, the need to have processing "keep up" with data collection and scale to accommodate larger and larger data streams imposes some restrictions that favor the use of certain types of exotic data structures. Finally, once the data has been collected and processed, what do you do with it? There are several immediate applications that most organizations have, and more are being considered all the time. This book tries to bring together all of these aspects of streaming data in a way that can serve as an introduction to a broad audience while still providing some use to more advanced readers. The hope is that the reader of this book would feel confident taking a proof-of-concept streaming data project in their organization from start to finish with the intent to release it into a production environment. Since that requires the implementation of both infrastructure and algorithms, this book is divided into two distinct parts.

Part I, "Streaming Analytics Architecture," is focused on the architecture of the streaming data system itself and the operational aspects of the system. If the data is streaming but is still processed in a batch mode, it is no longer streaming data. It is batch data that happens to be collected continuously, which is perfectly sufficient for many use cases. However, the assumption of this book is that some benefit is realized by the fact that the data is available to the end user shortly after it has been generated, and so the book covers the tools and techniques needed to accomplish the task.



To begin, the concepts and features underlying a streaming framework are introduced. This includes the various components of the system. Although not all projects will use all components at first, they are eventually present in all mature streaming infrastructures. These components are then discussed in the context of the key features of a streaming architecture: availability, scalability, and latency.

The remainder of Part I focuses on the nuts and bolts of implementing or configuring each component. The widespread availability of frameworks for each component has mostly removed the need to write code to implement each component. Instead, it is a matter of installation, configuration, and, possibly, customization. Chapters 3 and 4 introduce the tools needed to construct and coordinate a data motion system. Depending on the environment, software might be developed to integrate directly with this system, or existing software might be adapted to it. Both approaches are discussed with their relevant pros and cons. Once the data is moving, it must be processed and, eventually, stored. This is covered in Chapters 5 and 6, which introduce popular stream processing software and options for storing the data.

Part II of the book focuses on the application of this infrastructure to various problems. Dashboards and alerting systems were the original applications of streaming data collection and are the first applications covered in Part II. Chapter 7 covers the delivery of data from the streaming environment to the end user. This is the core mechanism used in the construction of dashboards and other monitoring applications. Once delivered, the data must be presented to the user, and this chapter also includes a section on building dashboard visualizations in a web-based setting.

Of course, the data to be delivered must often be aggregated by the processing system. Chapter 8 covers the aggregation of data for the streaming environment. In particular, it covers the aggregation of multi-resolution time-series data for eventual delivery to the dashboards discussed in Chapter 7.

After aggregating the data, questions begin to arise about patterns in the data. Is there a trend over time? Is this behavior truly different than previously observed behavior? To answer these questions, you need some knowledge of statistics and the behavior of random processes (which generally includes anything with large-scale data collection). Chapter 9 provides a brief introduction to the basics of statistics and probability. Along with this introduction comes the concept of statistical sampling, which can be used to compute more complicated metrics than simple aggregates.

Though sampling is the traditional mechanism for approximating complicated metrics, certain metrics can be better calculated through other mechanisms. These probabilistic data structures, called sketches, are discussed in Chapter 10, making heavy use of the introduction to probability in Chapter 9. These data structures also generally have fast updates and low memory footprints, making them especially useful in streaming settings.

Finally, Chapter 11 discusses some further topics beyond aggregation that can be applied to streaming data. A number of topics are covered in this chapter, providing an introduction to topics that often fill entire books on their own. The first topic is models for streaming data, taken from both the statistics and machine learning communities. These models provide the basis for a number of applications, such as forecasting. In forecasting, the model is used to provide an estimate of future values. Since the data is streaming, these predictions can be compared to reality and used to further refine the models. Forecasting has a number of uses, such as anomaly detection, which are also discussed in Chapter 11.

Chapter 11 also briefly touches on the topic of optimization and A/B testing. If you can forecast the expected response of, for example, two different website designs, it makes sense to use that information to show the design with the better response. Of course, forecasts aren't perfect and could provide bad estimates for the performance of each design. The only way to improve a forecast for a particular design is to gather more data. Then you need to determine how often each design should be shown such that the forecast can be improved without sacrificing performance due to showing a less popular design. Chapter 11 provides a simple mechanism called the multi-armed bandit that is often used to solve exactly these problems. This offers a jumping-off point for further explorations of the optimization problem.
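As a concrete illustration of the tradeoff described above, the following is a minimal, simulated sketch of epsilon-greedy selection, one of the simplest multi-armed bandit strategies: most of the time it shows the design with the best observed conversion rate, and occasionally it explores the other design so the estimates keep improving. The two designs, their conversion rates, and the epsilon value are invented for the example; Chapter 11 develops the approaches the book actually recommends.

import java.util.Random;

// Minimal epsilon-greedy bandit sketch for a hypothetical two-design A/B test.
// With probability epsilon a random design is shown (explore); otherwise the
// design with the best observed conversion rate so far is shown (exploit).
public class EpsilonGreedySketch {
    public static void main(String[] args) {
        double epsilon = 0.1;                      // exploration rate (assumed)
        double[] trueRates = {0.05, 0.07};         // simulated "true" conversion rates
        double[] shows = new double[2];            // times each design was shown
        double[] conversions = new double[2];      // conversions observed per design
        Random rng = new Random(42);

        for (int visitor = 0; visitor < 100_000; visitor++) {
            int design;
            if (rng.nextDouble() < epsilon || shows[0] == 0 || shows[1] == 0) {
                design = rng.nextInt(2);           // explore
            } else {
                double rate0 = conversions[0] / shows[0];
                double rate1 = conversions[1] / shows[1];
                design = rate0 >= rate1 ? 0 : 1;   // exploit the current best estimate
            }
            shows[design]++;
            if (rng.nextDouble() < trueRates[design]) {
                conversions[design]++;             // simulated conversion event
            }
        }
        System.out.printf("Design A: %.4f over %.0f shows%n", conversions[0] / shows[0], shows[0]);
        System.out.printf("Design B: %.4f over %.0f shows%n", conversions[1] / shows[1], shows[1]);
    }
}

Run over many visitors, the better design ends up shown far more often while the weaker one still receives enough traffic to keep its estimate honest, which is the balance the text describes.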

Who Should Read This Book

As mentioned at the beginning of this "Introduction," the intent of this book is to appeal to a fairly broad audience within the software community. The book is designed for an audience that is interested in getting started in streaming data and its management. As such, the book is intended to be read linearly, with the end goal being a comprehensive grasp of the basics of streaming data analysis.

That said, this book can also be of interest to experts in one part of the field but not the other. For example, a data analyst or data scientist likely has a strong background in the statistical approaches discussed in Chapter 9 and Chapter 11. They are also likely to have some experience with dashboard applications like those discussed in Chapter 7 and the aggregation techniques in Chapter 8. However, they may not have much knowledge of the probabilistic data structures in Chapter 10. The first six chapters may also be of interest, if not to actually implement the infrastructure, then to understand the design tradeoffs that affect the delivery of the data they analyze.


Similarly, people more focused on the operational and infrastructural pieces likely already know quite a bit about the topics discussed in Chapters 1 through 6. They may not have dealt with the specific software, but they have certainly tackled many of the same problems. The second part of the book, Chapters 7 through 11, is likely to be more interesting to them. Systems monitoring was one of the first applications of streaming data, and tools like anomaly detection can, for example, be put to good use in developing robust fault detection mechanisms.

Tools You Will Need

Like it or not, large swaths of the data infrastructure world are built on the Java Virtual Machine. There are a variety of reasons for this, but ultimately it is a required tool for this book. The software and examples used in this book were developed against Java 7, though they should generally work against Java 6 or Java 8. Readers should ensure that an appropriate Java Development Kit is installed on their systems.

Since Java is used, it is useful to have an editor installed. The software in this book was written using Eclipse, and the projects are also structured using the Maven build system. Installing both of these will help you build the examples included in this book. Other packages are also used throughout the book, and their installation is covered in the appropriate chapters.

This book uses some basic mathematical terms and formulas. If your math skills are rusty and you find these concepts a little challenging, a helpful resource is A First Course in Probability by Sheldon Ross.

What’s on the Website

The website includes code packages for all of the examples included in each chapter. The code for each chapter is divided into separate modules. Some code is used in multiple chapters. In this case, the code is copied to each module so that they are self-contained. The website also contains a copy of the Samza source code. Samza is a fairly new project, and the codebase has been moving fast enough that the examples in this book actually broke between the writing and editing of the relevant chapters. To avoid this being a problem for readers, we have opted to include a version of the project that is known to work with the code in this book. Please go to www.wiley.com/go/realtimeanalyticsstreamingdata.


Time to Dive In

It's now time to dive into the actual building of a streaming data system. This is not the only way of doing it, but it's the one I have arrived at after several different attempts over more years than I really care to think about. I think it works pretty well, but there are always new things on the horizon. It's an exciting time. I hope this book will help you avoid at least some of the mistakes I have made along the way. Go find some data, and let's start building something amazing.

CHAPTER 1

Introduction to Streaming Data

It seems like the world moves at a faster pace every day. People and places become more connected, and people and organizations try to react at an ever-increasing pace. Reaching the limits of a human's ability to respond, tools are built to process the vast amounts of data available to decision makers, analyze it, present it, and, in some cases, respond to events as they happen.

The collection and processing of this data has a number of application areas, some of which are discussed in the next section. These applications, which are discussed later in this chapter, require an infrastructure and method of analysis specific to streaming data. Fortunately, like batch processing before it, the state of the art of streaming infrastructure is focused on using commodity hardware and software to build its systems rather than the specialized systems required for real-time analysis prior to the Internet era. This, combined with flexible cloud-based environments, puts the implementation of a real-time system within the reach of nearly any organization. These commodity systems allow organizations to analyze their data in real time and scale that infrastructure to meet future needs as the organization grows and changes over time.

The goal of this book is to allow a fairly broad range of potential users and implementers in an organization to gain comfort with the complete stack of applications. When real-time projects reach a certain point, they should be agile and adaptable systems that can be easily modified, which requires that the users have a fair understanding of the stack as a whole in addition to their own areas of focus. "Real time" applies as much to the development of new analyses as it does to the data itself. Any number of well-meaning projects have failed because they took so long to implement that the people who requested the project have either moved on to other things or simply forgotten why they wanted the data in the first place. By making the projects agile and incremental, this can be avoided as much as possible.

This chapter is divided into sections that cover three topics. The first section, "Sources of Streaming Data," covers some of the common sources and applications of streaming data. They are arranged more or less chronologically and provide some background on the origin of streaming data infrastructures. Although this is historically interesting, many of the tools and frameworks presented were developed to solve problems in these spaces, and their design reflects some of the challenges unique to the space in which they were born. Kafka, a data motion tool covered in Chapter 4, "Data-Flow Management in Streaming Analysis," for example, was developed as a web applications tool, whereas Storm, a processing framework covered in Chapter 5, "Processing Streaming Data," was developed primarily at Twitter for handling social media data.

The second section, "Why Streaming Data Is Different," covers three of the important aspects of streaming data: continuous data delivery, loosely structured data, and high-cardinality datasets. The first, of course, defines a system to be a real-time streaming data environment in the first place. The other two, though not entirely unique, present a unique challenge to the designer of a streaming data application. All three combine to form the essential streaming data environment.

The third section, "Infrastructures and Algorithms," briefly touches on the significance of how infrastructures and algorithms are used with streaming data.

Sources of Streaming Data

There are a variety of sources of streaming data. This section introduces some of the major categories of data. Although there are always more and more data sources being made available, as well as many proprietary data sources, the categories discussed in this section are some of the application areas that have made streaming data interesting. The ordering of the application areas is primarily chronological, and much of the software discussed in this book derives from solving problems in each of these specific application areas.

The data motion systems presented in this book got their start handling data for website analytics and online advertising at places like LinkedIn, Yahoo!, and Facebook. The processing systems were designed to meet the challenges of processing social media data from Twitter and social networks like LinkedIn. Google, whose business is largely related to online advertising, makes heavy use of the advanced algorithmic approaches similar to those presented in Chapter 11. Google seems to be especially interested in a technique called deep learning, which makes use of very large-scale neural networks to learn complicated patterns. These systems are even enabling entirely new areas of data collection and analysis by making the Internet of Things and other highly distributed data collection efforts economically feasible. It is hoped that outlining some of the previous application areas provides some inspiration for as-yet-unforeseen applications of these technologies.

Operational Monitoring

Operational monitoring of physical systems was the original application of streaming data. Originally, this would have been implemented using specialized hardware and software (or even analog and mechanical systems in the pre-computer era). The most common use case today of operational monitoring is tracking the performance of the physical systems that power the Internet.

These datacenters house thousands—possibly even tens of thousands—of discrete computer systems. All of these systems continuously record data about their physical state, from the temperature of the processor to the speed of the fan and the voltage draw of their power supplies. They also record information about the state of their disk drives and fundamental metrics of their operation, such as processor load, network activity, and storage access times.

To make the monitoring of all of these systems possible and to identify problems, this data is collected and aggregated in real time through a variety of mechanisms. The first systems tended to be specialized ad hoc mechanisms, but as these sorts of techniques were applied to other areas, they came to use the same collection systems as other data collection mechanisms.

Web Analytics

The introduction of the commercial web, through e-commerce and online advertising, led to the need to track activity on a website. Like the circulation numbers of a newspaper, the number of unique visitors who see a website in a day is important information. For e-commerce sites, the data is less about the number of visitors than about the various products they browse and the correlations between them.

To analyze this data, a number of specialized log-processing tools were introduced and marketed. With the rise of Big Data and tools like Hadoop, much of the web analytics infrastructure shifted to these large batch-based systems. They were used to implement recommendation systems and other analysis. It also became clear that it was possible to conduct experiments on the structure of websites to see how they affected various metrics of interest. This is called A/B testing because—in the same way an optometrist tries to determine the best prescription—two choices are pitted against each other to determine which is best. These tests were mostly conducted sequentially, but this has a number of problems, not the least of which is the amount of time needed to conduct the study.

As more and more organizations mined their website data, the need to reduce the time in the feedback loop and to collect data on a more continual basis became more important. Using the tools of the system-monitoring community, it became possible to also collect this data in real time and perform things like A/B tests in parallel rather than in sequence. As the number of dimensions being measured and the need for appropriate auditing (due to the metrics being used for billing) increased, the analytics community developed much of the streaming infrastructure found in this book to safely move data from their web servers spread around the world to processing and billing systems.

This sort of data still accounts for a vast source of information from a variety of sources, although it is usually contained within an organization rather than made publicly available. Applications range from simple aggregation for billing to the real-time optimization of product recommendations based on current browsing history (or viewing history, in the case of a company like Netflix).

Online Advertising

A major user and generator of real-time data is online advertising. The original forms of online advertising were similar to their print counterparts, with "buys" set up months in advance. With the rise of the advertising exchange and real-time bidding infrastructure, the advertising market has become much more fluid for a large and growing portion of traffic. For these applications, the money being spent in different environments and on different sites is being managed on a per-minute basis in much the same way as the stock market. In addition, these buys are often being managed to some sort of metric, such as the number of purchases (called a conversion) or even the simpler metric of the number of clicks on an ad.

When a visitor arrives at a website via a modern advertising exchange, a call is made to a number of bidding agencies (perhaps 30 or 40 at a time), who place bids on the page view in real time. An auction is run, and the advertisement from the winning party is displayed. This usually happens while the rest of the page is loading; the elapsed time is less than about 100 milliseconds. If the page contains several advertisements, as many of them do, an auction is often being run for all of them, sometimes on several different exchanges.

All the parties in this process—the exchange, the bidding agent, the advertiser, and the publisher—are collecting this data in real time for various purposes. For the exchange, this data is a key part of the billing process as well as important for real-time feedback mechanisms that are used for a variety of purposes. Examples include monitoring the exchange for fraudulent traffic and other risk management activities, such as throttling the access to impressions to various parties.

Advertisers, publishers, and bidding agents on both sides of the exchange are also collecting the data in real time. Their goal is the management and optimization of the campaigns they are currently running. From selecting the bid (in the case of the advertiser) or the "reserve" price (in the case of the publisher), to deciding which exchange offers the best prices for a particular type of traffic, the data is all being managed on a moment-to-moment basis.

A good-sized advertising campaign or a large website could easily see page views in the tens or hundreds of millions. Including other events such as clicks and conversions could easily double that. A bidding agent is usually acting on the part of many different advertisers or publishers at once and will often be collecting on the order of hundreds of millions to billions of events per day. Even a medium-sized exchange, sitting as a central party, can have billions of events per day. All of this data is being collected, moved, analyzed, and stored as it happens.

Social Media

Another newer source of massive collections of data is social media, especially public sources such as Twitter. As of the middle of 2013, Twitter reported that it collected around 500 million tweets per day, with spikes up to around 150,000 tweets per second. That number has surely grown since then. This data is collected and disseminated in real time, making it an important source of information for news outlets and other consumers around the world. In 2011, Twitter users in New York City received information about an earthquake outside of Washington, D.C. about 30 seconds before the tremors struck New York itself. Combined with other sources like Facebook, Foursquare, and upcoming communications platforms, this data is extremely large and varied.

The data from applications like web analytics and online advertising, although highly dimensional, is usually fairly well structured. The dimensions, such as money spent or physical location, are fairly well understood quantitative values. In social media, however, the data is usually highly unstructured, at least as data analysts understand the term. It is usually some form of "natural language" data that must be parsed, processed, and somehow understood by automated systems. This makes social media data incredibly rich, but incredibly challenging for real-time data systems to process.

Mobile Data and the Internet of Things

One of the most exciting new sources of data was introduced to the world in 2007 in the form of Apple's iPhone. Cellular data-enabled computers had been available for at least a decade, and devices like Blackberries had put data in the hands of business users, but these devices were still essentially specialist tools and were managed as such. The iPhone, Android phones, and other smartphones that followed made cellular data a consumer technology with the accompanying economies of scale that go hand in hand with a massive increase in user base. They also put a general-purpose computer in the pocket of a large population.

Smartphones have the ability not only to report back to an online service, but also to communicate with other nearby objects using technologies like Bluetooth LE. Technologies like so-called "wearables," which make it possible to measure the physical world the same way the virtual world has been measured for the last two decades, have taken advantage of this new infrastructure. The applications range from the silly to the useful to the creepy. For example, a wristband that measures sleep activity could trigger an automated coffee maker when the user gets a poor night's sleep and needs to be alert the next day. The smell of coffee brewing could even be used as an alarm. The communication between these systems no longer needs to be direct or specialized, as envisioned in the various "smart home" demonstration projects during the past 50 years. These tools are possible today using tools like If This Then That (IFTTT) and other publicly available systems built on infrastructure similar to those in this book.

On a more serious note, important biometric data can be measured in real time by remote facilities in a way that has previously been available only when using highly specialized and expensive equipment, which has limited its application to high-stress environments like space exploration. Now this data can be collected for an individual over a long period of time (this is known in statistics as longitudinal data) and pooled with other users' data to provide a more complete picture of human biometric norms. Instead of taking a blood pressure test once a year in a cold room wearing a paper dress, a person's blood pressure might be tracked over time with the goal of "heading off problems at the pass."

Outside of health, there has long been the idea of "smart dust"—large collections of inexpensive sensors that can be distributed into an area of the world and remotely queried to collect interesting data. The limitation of these devices has largely been the expense required to manufacture relatively specialized pieces of equipment. This has been solved by the commodification of data collection hardware and software (such as the smartphone) and is now known as the Internet of Things. Not only will people continually monitor themselves, objects will continually monitor themselves as well. This has a variety of potential applications, ranging from traffic management within cities to making agriculture more efficient through better monitoring of soil conditions.

The important piece is that this information can be streamed through commodity systems rather than hardware and software specialized for collection. These commodity systems already exist, and the software required to analyze the data is already available. All that remains to be developed are the novel applications for collecting the data.

Why Streaming Data Is Different

There are a number of aspects to streaming data that set it apart from other kinds of data. The three most important, covered in this section, are the "always-on" nature of the data, the loose and changing data structure, and the challenges presented by high-cardinality dimensions. All three play a major role in decisions made in the design and implementation of the various streaming frameworks presented in this book. These features of streaming data particularly influence the data processing frameworks presented in Chapter 5. They are also reflected in the design decisions of the data motion tools, which consciously choose not to impose a data format on information passing through their system to allow maximum flexibility. The remainder of this section covers each of these in more depth to provide some context before diving into Chapter 2, which covers the components and requirements of a streaming architecture.

Always On, Always Flowing

The first aspect is somewhat obvious: streaming data streams. The data is always available, and new data is always being generated. This has a few effects on the design of any collection and analysis system. First, the collection itself needs to be very robust. Downtime for the primary collection system means that data is permanently lost. This is an important thing to remember when designing an edge collector, and it is discussed in more detail in Chapter 2.

Second, the fact that the data is always flowing means that the system needs to be able to keep up with the data. If 2 minutes are required to process 1 minute of data, the system will not be real time for very long. Eventually, the problem will be so bad that some data will have to be dropped to allow the system to catch up. In practice it is not enough to have a system that can merely "keep up" with data in real time. It needs to be able to process data far more quickly than real time. For reasons that are either intentional, such as a planned downtime, or due to catastrophic failures, such as network outages, the system, either whole or in part, will go down. Failing to plan for this inevitability and having a system that can only process at the same speed as events happen means that the system is now delayed by the amount of data stored at the collectors while the system was down. A system that can process 1 hour of data in 1 minute, on the other hand, can catch up fairly quickly with little need for intervention. A mature environment that has good horizontal scalability—a concept also discussed in Chapter 2—can even implement auto-scaling. In this setting, as the delay increases, more processing power is temporarily added to bring the delay back into acceptable limits.

On the algorithmic side, this always-flowing feature of streaming data is a bit of a double-edged sword. On the positive side, there is rarely a situation where there is not enough data. If more data is required for an analysis, simply wait for enough data to become available. It may require a long wait, but other analyses can be conducted in the meantime that can provide early indicators of how the later analysis might proceed.

On the downside, much of the statistical tooling that has been developed over the last 80 or so years is focused on the discrete experiment. Many of the standard approaches to analysis are not necessarily well suited to the data when it is streaming. For example, the concept of "statistical significance" becomes an odd sort of concept when used in a streaming context. Many see it as some sort of "stopping rule" for collecting data, but it does not actually work like that. The p-value statistic used to make the significance call is itself a random value and may dip below the critical value (usually 0.05) even though, when the next value is observed, it would result in a p-value above 0.05. This does not mean that statistical techniques cannot and should not be used—quite the opposite. They still represent the best tools available for the analysis of noisy data. It is simply that care should be taken when performing the analysis, as the prevailing dogma is mostly focused on discrete experiments.
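To make the caveat about "statistical significance" concrete, the following simulated sketch streams pure noise (no real effect at all) and tracks a running z-statistic, printing a message whenever it crosses the conventional |z| > 1.96 threshold (roughly p < 0.05, two-sided). Because the statistic is itself random, it may wander across the threshold and back during a long run, which is exactly why a single crossing makes a poor stopping rule. The data, sample sizes, and thresholds here are illustrative only, not taken from the book.

import java.util.Random;

// Simulated stream with no real effect (true mean 0): the running z-statistic
// is itself random and can drift across the conventional "significant"
// threshold and back again over the course of a long run.
public class SignificanceDriftSketch {
    public static void main(String[] args) {
        Random rng = new Random();
        double sum = 0.0, sumSq = 0.0, z = 0.0;
        boolean significant = false;

        for (long n = 1; n <= 1_000_000; n++) {
            double x = rng.nextGaussian();          // pure noise, no effect
            sum += x;
            sumSq += x * x;
            if (n < 30) continue;                   // wait for a minimal sample

            double mean = sum / n;
            double var = (sumSq - n * mean * mean) / (n - 1);
            z = mean / Math.sqrt(var / n);

            boolean nowSignificant = Math.abs(z) > 1.96;   // roughly p < 0.05, two-sided
            if (nowSignificant != significant) {
                System.out.printf("n=%d: |z|=%.2f %s%n", n, Math.abs(z),
                        nowSignificant ? "crossed above the threshold" : "dropped back below it");
                significant = nowSignificant;
            }
        }
        System.out.printf("Final |z| after the run: %.2f%n", Math.abs(z));
    }
}

Depending on the random seed, the threshold may be crossed several times or not at all in a given run; the point is simply that continuously re-checking significance on an ever-growing stream behaves very differently from a fixed-size experiment.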

Loosely Structured

Streaming data is often loosely structured compared to many other datasets. There are several reasons this happens, and although this loose structure is not unique to streaming data, it seems to be more common in the streaming setting than in other situations.

Part of the reason seems to be the type of data that is interesting in the streaming setting. Streaming data comes from a variety of sources. Although some of these sources are rigidly structured, many of them are carrying an arbitrary data payload. Social media streams, in particular, will be carrying data about everything from world events to the best slice of pizza to be found in Brooklyn on a Sunday night. To make things more difficult, the data is encoded as human language.

Another reason is that there is a "kitchen sink" mentality to streaming data projects. Most of the projects are fairly young and exploring unknown territory, so it makes sense to toss as many different dimensions into the data as possible. This is likely to change over time, so the decision is also made to use a format that can be easily modified, such as JavaScript Object Notation (JSON). The general paradigm is to collect as much data as possible in the event that it is actually interesting.

Finally, the real-time nature of the data collection also means that various dimensions may or may not be available at any given time. For example, a service that converts IP addresses to a geographical location may be temporarily unavailable. For a batch system this does not present a problem; the analysis can always be redone later when the service is once more available. The streaming system, on the other hand, must be able to deal with changes in the available dimensions and do the best it can.
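As a small illustration of coping with dimensions that may or may not be present, the sketch below uses the Jackson library to read a hypothetical JSON event and fall back to defaults when optional fields, such as a geo lookup, are missing. The event structure and field names are invented for this example; they are not a format defined by this book.

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

// Tolerant handling of a loosely structured JSON event: optional fields such
// as the geo lookup may be absent, so defaults are used instead of failing.
// The event format here is hypothetical.
public class LooseEventSketch {
    public static void main(String[] args) throws Exception {
        String event = "{\"ts\":1402000000,\"user\":\"abc123\","
                + "\"payload\":{\"action\":\"click\"}}";   // note: no "geo" field at all

        ObjectMapper mapper = new ObjectMapper();
        JsonNode root = mapper.readTree(event);

        long ts = root.path("ts").asLong(0L);
        String user = root.path("user").asText("unknown");
        String city = root.path("geo").path("city").asText("unknown"); // missing -> default
        String action = root.path("payload").path("action").asText("none");

        System.out.printf("ts=%d user=%s city=%s action=%s%n", ts, user, city, action);
    }
}

The path() calls return a placeholder node rather than throwing when a field is absent, so the processing step degrades gracefully instead of rejecting the event outright.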

High-Cardinality Storage

Cardinality refers to the number of unique values a piece of data can take on. Formally, cardinality refers to the size of a set and can be applied to the various dimensions of a dataset as well as the entire dataset itself. This high cardinality often manifests itself in a "long tail" feature of the data. For a given dimension (or combination of dimensions) there is a small set of different states that are quite common, usually accounting for the majority of the observed data, and then a "long tail" of other data states that comprise a fairly small fraction.

This feature is common to both streaming and batch systems, but it is much harder to deal with high cardinality in the streaming setting. In the batch setting it is usually possible to perform multiple passes over the dataset. A first pass over the data is often used to identify dimensions with high cardinality and compute the states that make up most of the data. These common states can be treated individually, and the remaining states are combined into a single "other" state that can usually be ignored.

In the streaming setting, the data can usually be processed only a single time. If the common cases are known ahead of time, this can be included in the processing step. The long tail can also be combined into the "other" state, and the analysis can proceed as it does in the batch case. If a batch study has already been performed on an earlier dataset, it can be used to inform the streaming analysis. However, it is often not known whether the common states for the current data will be the common states for future data. In fact, changes in the mix of states might actually be the metric of interest. More commonly, there is no previous data to perform an analysis upon. In this case, the streaming system must attempt to deal with the data at its natural cardinality.

This is difficult both in terms of processing and in terms of storage. Processing anything that involves a large number of different states necessarily takes time. It also requires a linear amount of space to store information about each different state, and, unlike batch processing, storage space is much more restricted than in the batch setting because it must generally use very fast main memory storage instead of the much slower storage of hard drives. This has been relaxed somewhat with the introduction of high-performance Solid State Drives (SSDs), but they are still orders of magnitude slower than memory access. As a result, a major topic of research in streaming data is how to deal with high-cardinality data. This book discusses some of the approaches to dealing with the problem. As an active area of research, more solutions are being developed and improved every day.
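A minimal sketch of the "known common states plus an other bucket" approach described above follows. Exact counts are kept only for a whitelist of states assumed to be common (for example, identified by an earlier batch study), and the long tail is folded into a single bucket so that memory stays bounded. The states used here are hypothetical; sketch-based alternatives for the cases where no whitelist exists are covered in Chapter 10.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Bounded-memory counting for a high-cardinality dimension: exact counts are
// kept only for states assumed to be common; everything else is folded into
// a single "other" bucket.
public class CommonStateCounter {
    private final Set<String> commonStates;
    private final Map<String, Long> counts = new HashMap<>();

    public CommonStateCounter(Set<String> commonStates) {
        this.commonStates = commonStates;
    }

    public void observe(String state) {
        String key = commonStates.contains(state) ? state : "other";
        Long current = counts.get(key);
        counts.put(key, current == null ? 1L : current + 1L);
    }

    public Map<String, Long> snapshot() {
        return new HashMap<>(counts);
    }

    public static void main(String[] args) {
        // Hypothetical common states, e.g. taken from an earlier batch analysis.
        Set<String> common = new HashSet<>();
        common.add("US");
        common.add("GB");
        common.add("DE");

        CommonStateCounter counter = new CommonStateCounter(common);
        for (String country : new String[]{"US", "US", "GB", "TV", "NR", "DE", "US"}) {
            counter.observe(country);
        }
        System.out.println(counter.snapshot()); // e.g. {US=3, GB=1, DE=1, other=2}
    }
}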

Infrastructures and Algorithms

The intent of this book is to provide the reader with the ability to implement a streaming data project from start to finish. An algorithm without an infrastructure is, perhaps, an interesting research paper, but not a finished system. An infrastructure without an application is mostly just a waste of resources. The approach of "build it and they will come" really isn't going to work if you focus solely on an algorithm or an infrastructure. Instead, a tangible system must be built implementing both the algorithm and the infrastructure required to support it. With an example in place, other people will be able to see how the pieces fit together and add their own areas of interest to the infrastructure. One important thing to remember when building the infrastructure (and it bears repeating) is that the goal is to make the infrastructure and algorithms accessible to a variety of users in an organization (or the world). A successful project is one that people use, enhance, and extend.

Conclusion

Ultimately, the rise of web-scale data collection has been about connecting "sensor platforms" to a real-time processing framework. Initially, these sensor platforms were entirely virtual things, such as system monitoring agents or the connections between websites and a processing framework for the purposes of advertising. With the rise of ubiquitous Internet connectivity, this has transferred to the physical world to allow collection of data across a wide range of industries at massive scale.

Once data becomes available in real time, it is inevitable that processing should be undertaken in real time as well. Otherwise, what would be the point of real-time collection when bulk collection is probably still less expensive on a per-observation basis? This brings with it a number of unique challenges. Using the mix of infrastructural decisions and computational approaches covered in this book, these challenges can be largely overcome to allow for the real-time processing of today's data as well as the ever-expanding data collection of future systems.


Part I

Streaming Analytics Architecture

In This Part

Chapter 2: Designing Real-Time Streaming Architectures
Chapter 3: Service Configuration and Coordination
Chapter 4: Data-Flow Management in Streaming Analysis
Chapter 5: Processing Streaming Data
Chapter 6: Storing Streaming Data

CHAPTER 2

Designing Real-Time Streaming Architectures

By their nature, real-time architectures are layered systems that rely on several loosely coupled systems to achieve their goals. There are a variety of reasons for this structure, from maintaining the high availability of the system in an unreliable world to service requirements and managing the cost structure of the architecture itself. The remainder of the first section of this book introduces software, frameworks, and methods for dealing with the various elements of these architectures.

This chapter serves as the blueprint and foundation for the architecture. First, the various components of the architecture are introduced. These usually, but not always, correspond to separate machine instances (physical or virtual). After these components have been introduced, they can be discussed in the context of the primary features of a real-time architecture: high availability, low latency, and horizontal scalability.

This chapter also spends some time discussing the languages used to build real-time architectures. It is assumed that the reader will have some familiarity with the languages used for the examples in this book: Java and JavaScript. Many of the software packages in this book are implemented using the Java Virtual Machine, though not necessarily Java itself. JavaScript, of course, is the lingua franca of the web and is used extensively in the later sections of this book to implement the interfaces to the data.



Finally, this chapter offers some advice to someone planning a streaming architecture pilot project. Many of the software packages in this book can be considered interchangeable in the sense that they can all be made to function in any environment, but they all have their strengths and weaknesses. Based on real-world experience, this naturally means that some packages are easier to use in some environments than others. The final section of this chapter provides some insight into solutions that have worked in the past and can serve as a starting point for the future.

Real-Time Architecture Components

This section describes the components often found in a modern real-time architecture. Not every system will have every component, especially at first. The goal should be to build a system appropriate to an initial application, with the ability to expand to incorporate any missing components in the future.

After the first application is established, other applications of the infrastructure will inevitably follow with their own requirements. In fact, it is quite likely that some components will have multiple implementations. It happens most often in the storage component, where there is a wide range of performance characteristics among the different products available.

This section covers each of the components and some aspects of their development and future path. In general, the specifics of each component will be covered in more depth in their relevant chapters in the remainder of Part I.

Collection The most common environment for real-time applications discussed in this book is handled over TCP/IP-based networks. As a result, the collection of data happens through connections over a TCP/IP network, probably using a common protocol such as HTTP. The most obvious example of this form of collection is a large latency-sensitive website that geo-distributes its servers (known as edge servers) to improve end-user latency. With websites being one of the original use cases for large-scale data collec- tion, it is not surprising that the log formats used by web servers dominated the formatting of log fi les. Unfortunately, most web servers adhered to the National Center for Supercomputing Applications (NCSA) Common Log Format and the later World Wide Web Consortium (W3C) Extended Log File Format, which was never intended to support the rich data payloads (and does so poorly). Newer systems now log in a variety of formats, with JavaScript Object Notation (JSON) being one of the most popular. It has gained popularity by being able to represent rich data structures in a way that enables it to be easily extended.


It is also relatively easy to parse, and libraries for it are widely available. It is used to transfer data between client-side web applications and the server side.

JSON's flexibility also leads to a significant downside: processing-time validation. The lack of defined schemas and the tendency of different packages to produce structurally different, but semantically identical, structures means that processing applications must often include large amounts of validation code to ensure that their inputs are reasonable. The fact that JSON essentially sends its schema along with every message also leads to data streams that require relatively large bandwidth. Compression of the data stream helps, often achieving compression rates well over 80 percent, but this suggests that something could be done to improve things further.

If the data is well structured and the problem fairly well understood, one of the structured wire formats is a possibility. The two most popular are Thrift and Protocol Buffers (usually called Protobuf). Both formats, the former developed at Facebook and the latter at Google, are very similar in their design (not surprising given that they also share developers). They use an Interface Definition Language (IDL) to describe a data structure that is translated into code usable in a variety of output languages. This code is used to encode and decode messages coming over the wire. Both formats provide a mechanism to extend the original message so that new information can be added.

Another, less popular option is the Apache Avro format. In concept it is quite similar to Protocol Buffers or Thrift. Schemas are defined using an IDL (which, in Avro's case, happens to be JSON), much like Thrift or Protobuf. Rather than using code generation, Avro tends to use dynamic encoders and decoders, but the binary format is quite similar to Thrift's. The big difference is that, in addition to the binary format, Avro can also read and write a JSON representation of its data. This allows for a transition path between an existing JSON representation, whose informal schema can often be stated as an explicit Avro schema, and the more compact and well-defined binary representation (a brief sketch appears at the end of this section).

For the bulk of applications, the collection process is directly integrated into the edge servers themselves. For new servers, this integrated collection mechanism likely communicates directly with the data-flow mechanisms described in the next section. Older servers may or may not integrate directly with the data-flow mechanism, with options available for both. These servers are usually application specific, so this book does not spend much time on this part of the environment except to describe the mechanisms for writing directly to the data-flow component.
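To make the Avro transition path mentioned above concrete, the following minimal sketch uses the Apache Avro Java API to declare a schema as JSON and write a single record in the compact binary encoding. The PageView record and its fields are hypothetical and stand in for whatever events a collection system actually emits.

    import java.io.ByteArrayOutputStream;

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.BinaryEncoder;
    import org.apache.avro.io.EncoderFactory;

    public class AvroEncodingSketch {
        public static void main(String[] args) throws Exception {
            // The schema is itself JSON; an informal JSON log format can often
            // be restated as an explicit schema like this (hypothetical) one.
            String schemaJson = "{\"type\":\"record\",\"name\":\"PageView\","
                + "\"fields\":[{\"name\":\"url\",\"type\":\"string\"},"
                + "{\"name\":\"timestamp\",\"type\":\"long\"}]}";
            Schema schema = new Schema.Parser().parse(schemaJson);

            // Build a record dynamically; Avro does not require generated classes
            GenericRecord event = new GenericData.Record(schema);
            event.put("url", "/index.html");
            event.put("timestamp", System.currentTimeMillis());

            // Encode it with the compact, well-defined binary representation
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
            new GenericDatumWriter<GenericRecord>(schema).write(event, encoder);
            encoder.flush();
            System.out.println("Encoded " + out.size() + " bytes");
        }
    }

The same schema can also drive Avro's JSON encoder, which is what makes a gradual migration from an existing JSON stream possible.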

Data Flow

Collection, analysis, and reporting systems, with few exceptions, scale and grow at different rates within an organization. For example, if incoming traffic


remains stable, but the depth of analysis grows, then the analysis infrastructure needs more resources despite the fact that the amount of data collected stays the same. To allow for this, the infrastructure is separated into tiers of collection, processing, and so on. Many times, the communication between these tiers is conducted on an ad hoc basis, with each application in the environment using its own communication method to integrate with its other tiers.

One of the aims of a real-time architecture is to unify the environment, at least to some extent, to allow for the more modular construction of applications and their analysis. A key part of this is the data-flow system (also called a data motion system in this book). These systems replace the ad hoc, application-specific communication frameworks with a single, unified software system. The replacement software systems are usually distributed systems, allowing them to expand and handle complicated situations such as multi-datacenter deployment, but they expose a common interface to both producers and consumers of the data.

The systems discussed in this book are primarily what might be considered third-generation systems. The "zero-th generation" systems are the closely coupled ad hoc communication systems used to separate applications into application-specific tiers. The first-generation systems break this coupling, usually using some sort of log-file system to collect application-specific data into files. These files are then generically collected to a central processing location. Custom processors then consume these files to implement the other tiers. This has been, by far, the most popular approach because it can be made reliable by implementing "at least once" delivery semantics and because it is fast enough for batch processing applications. The original Hadoop environments were essentially optimized for this use case.

The primary drawback of the log-based systems is that they are fairly slow: data must be collected in batch form when a log file is "rolled" and then processed en masse. Second-generation data-flow systems recognized that reliable transport was not always a priority and began to implement remote procedure call (RPC) systems for moving data between systems. Although they may have some buffering to improve reliability, the second-generation systems, such as Scribe and Avro, generally accept that speed is acquired at the expense of reliability. For many applications this tradeoff is wholly acceptable and has been made for decades in systems-monitoring software such as Syslog, which uses a similar model.

Third-generation systems combine the reliability of the first-generation log models with the speed of the second-generation RPC models. In these systems, there is a real-time interface to the data layer for both producers and consumers of the data that delivers data as discrete messages, rather than the bulk delivery found in first-generation log systems. In practice, this is usually accomplished as a "mini batch" on the order of a thousand messages to improve performance.


However, these environments also implement an intermediate storage layer that allows them to make the same "at least once" delivery guarantees as log-based delivery systems. To maintain the requisite performance, this storage layer is horizontally scaled across several different physical systems, with coordination handled by the client software on both the producer and consumer sides of the system.

The first of these systems were queuing systems designed to handle large data loads; ActiveMQ is an example. By providing a queuing paradigm, these systems allow for the development of message "buses" that loosely couple the different components of the architecture and free developers from the communication task. The drawback of queuing systems has been the desire to maintain queue semantics, where the order of delivery to consumers matches the order of submission. Generally, this behavior is unnecessary in distributed systems and, if needed, is usually better handled by the client software.

Recognition of the fact that queuing semantics are mostly unneeded has led the latest entrants in the third generation of data-flow systems, Kafka and Flume, to largely abandon the ordering semantics while still maintaining the distributed operation and reliability guarantees. This has allowed them to boost performance for nearly all applications. Kafka is also notable in that it was explicitly designed to handle data flow in a large multi-datacenter installation.
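As a sketch of what writing to such a system looks like, the following uses the Kafka 0.8 producer API (the version discussed later in this book); the broker list, topic name, and message payload are hypothetical.

    import java.util.Properties;

    import kafka.javaapi.producer.Producer;
    import kafka.producer.KeyedMessage;
    import kafka.producer.ProducerConfig;

    public class EventProducerSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            // Hypothetical brokers; the client discovers the rest of the cluster from them
            props.put("metadata.broker.list", "broker1:9092,broker2:9092");
            props.put("serializer.class", "kafka.serializer.StringEncoder");
            // Ask for an acknowledgement from the partition leader, the "safety"
            // response added in 0.8 and discussed later in this chapter
            props.put("request.required.acks", "1");

            Producer<String, String> producer =
                new Producer<String, String>(new ProducerConfig(props));

            // The key ("user-42") determines the partition; the value is the message body
            producer.send(new KeyedMessage<String, String>(
                "events", "user-42", "{\"type\":\"click\",\"url\":\"/index.html\"}"));
            producer.close();
        }
    }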

Processing

Map-Reduce processing (Hadoop being the most popular open source implementation) did not become popular because it was a game-changing paradigm. After all, the Hollerith tabulation machines used for the 1890 census used a tabulating mechanism very similar to the map-reduce procedure implemented by Hadoop and others.

NOTE Hollerith later founded the corporations that would eventually become IBM.

Map-reduce processing also did not introduce the idea of distributed computation or the importance of keeping computation local to the data for performance. The high-performance and large-scale computing communities have long known that it is often more efficient to move software close to the data, and there is no shortage of software available from that community for distributed computing. What the map-reduce paper did do was use this paradigm to efficiently arrange the computation in a way that did not require work on the part of the application developer, provided the developer could state the problem in this peculiar way. Large amounts of data had long been tabulated, but it was up to the specific application to decide how to divide that data among processing


units and move it around. The first-generation log collection systems made it possible to collect large amounts of data in a generic way. Map-reduce, Hadoop in particular, made it easy to distribute that processing so that it did not require application-specific code or expensive hardware and software solutions.

The same thing happened in the real-time space with the introduction of the second generation of data-flow frameworks. Data that had formerly been locked in log files became available in a low-latency setting. Like the original batch systems, the first systems tended to have task-specific implementations. For example, Twitter's Rainbird project implements a hierarchical streaming counting system using a database called Cassandra.

The current second-generation systems have moved beyond task-specific implementations to providing a general service for arranging streaming computation. These systems, such as Storm (also a Twitter project, but originally developed by a company called BackType), typically provide a directed acyclic graph (DAG) mechanism for arranging the computation and moving messages between the different parts of the computation (a minimal sketch of such a topology appears at the end of this section).

These frameworks, which are relatively young as software projects, are still often under heavy development. They also lack the guidance of well-understood computational paradigms, which means that their developers have to learn what works well and what does not as they go along. As their development progresses, there will probably be an erosion of the difference between batch-based systems and streaming systems. The map-reduce model is already gaining the ability to implement more complicated computation to support the demand for interactive querying as part of projects like Cloudera's Impala and the Stinger project led by Hortonworks.
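Returning to the DAG model mentioned above, the following minimal sketch shows what arranging a computation looks like in Storm's Java API. TestWordSpout ships with Storm's testing utilities and emits a stream of random words; the CountBolt here is a hypothetical terminal bolt that simply counts the words it receives.

    import java.util.HashMap;
    import java.util.Map;

    import backtype.storm.Config;
    import backtype.storm.LocalCluster;
    import backtype.storm.testing.TestWordSpout;
    import backtype.storm.topology.BasicOutputCollector;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.TopologyBuilder;
    import backtype.storm.topology.base.BaseBasicBolt;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Tuple;

    public class CountingTopologySketch {

        // A minimal bolt that keeps a running count per word and prints it
        public static class CountBolt extends BaseBasicBolt {
            private final Map<String, Long> counts = new HashMap<String, Long>();

            public void execute(Tuple tuple, BasicOutputCollector collector) {
                String word = tuple.getString(0);
                Long current = counts.get(word);
                counts.put(word, current == null ? 1L : current + 1L);
                System.out.println(word + " -> " + counts.get(word));
            }

            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                // Terminal bolt: nothing is emitted downstream
            }
        }

        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();
            // The spout and bolt form a two-node directed acyclic graph
            builder.setSpout("words", new TestWordSpout(), 2);
            builder.setBolt("counts", new CountBolt(), 4)
                   .fieldsGrouping("words", new Fields("word"));

            // Run locally; a production deployment would use StormSubmitter instead
            LocalCluster cluster = new LocalCluster();
            cluster.submitTopology("counting-topology", new Config(), builder.createTopology());
            Thread.sleep(10000);
            cluster.shutdown();
        }
    }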

Storage

Storage options for real-time processing and analysis are plentiful, almost to the point of being overwhelming. Whereas traditional relational databases can be and are used in streaming systems, the preference is generally for so-called "NoSQL" databases. This preference has a number of different drivers, but the largest has generally been the need to prioritize performance over the ACID (Atomicity, Consistency, Isolation, Durability) requirements met by the traditional relational database. Although these requirements are often met to some degree, these databases rarely support all of them.

The other thing that these databases usually lack when compared to a relational database is richness of schema and query language. The NoSQL moniker is a reference to the fact that this class of database usually has a much more restricted query language than that allowed by the SQL (Structured Query Language) standard. This simplified and/or restricted query language is usually coupled with a restricted schema, if any formal schema mechanism exists at all.


The most common styles of NoSQL databases are the various forms of persistent key-value stores. They range from single-machine master-slave data stores, such as Redis, to fully distributed, eventually consistent stores, such as Cassandra. Their fundamental data structure is the key with an arbitrary byte array value, but most have built some level of abstraction on top of this core entity. Some, such as Cassandra, even extend the abstraction to offer a SQL-like language that includes schemas and familiar-looking statements, although they do not support many features of the full language.

The NoSQL database world also includes a variety of hybrid data stores, such as MongoDB. Rather than being a key-value store, MongoDB is a form of indexed document store. In many ways it is closer to a search engine like Google than it is to a relational database. Like the key-value stores, it has a very limited query language. It does not have a schema; instead it uses an optimized JSON representation for documents that allows for rich structure. Unlike most of the key-value stores, it also offers abstractions for references between documents.

Along with the simplification of the query languages and schemas comes a relaxation of the aforementioned ACID requirements. Atomicity and consistency, in particular, are often sacrificed in these data stores. By relaxing the consistency constraint to one of "eventual consistency," these stores gain some performance through reduced bookkeeping and a much simpler model for distributing themselves across a large number of machines that may be quite distant physically. In practice, for streaming applications, it is generally not necessary that each client have the same view of the data at the same time. The principal problem arises when two physically separate copies of the database attempt to modify the same piece of state. Resolving this problem is tricky, but it is possible and is discussed in detail in Chapter 3, "Service Configuration and Coordination."

Relaxing the atomicity requirement also usually results in a performance gain for the data store, and maximum performance is the ultimate goal of all of these data stores. Most of them maintain atomicity in some lightweight way, usually for the special case of counter types. Maintaining atomic counters, as it happens, is not too difficult and also happens to be a common use case, leading most data stores to implement some form of counter (a brief sketch follows below).

Each of the stores discussed in this book has different strengths and weaknesses. There is even a place in real-time applications for traditional relational databases. Different applications will perform better with one data store or another depending on the goal of the implementation, and the notion that a single data store is sufficient for all purposes is unrealistic. The aim of this book is to allow for an infrastructure that uses the real-time store as a good approximation of the system of record, which is likely to be based on a batch system and a relational database. Data stores are then chosen based on their performance characteristics for the problem at hand, with the expectation that the infrastructure can ensure that they all hold essentially the same data, optimized for their respective use cases.
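Returning to the counters mentioned above, the following brief sketch uses the Jedis client for Redis to bump one atomically; the host and key name are hypothetical.

    import redis.clients.jedis.Jedis;

    public class PageViewCounterSketch {
        public static void main(String[] args) {
            // Hypothetical Redis host; INCR executes atomically on the server,
            // so concurrent writers never lose an update
            Jedis jedis = new Jedis("redis-host", 6379);
            long views = jedis.incr("pageviews:/index.html");
            System.out.println("Total views: " + views);
            jedis.disconnect();
        }
    }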


The acceptance that practical realities require data to be stored in different stores extends to integration with batch processing. Streaming architectures are well suited to tasks that can accept some error in their results. Depending on the calculations required, this error might even be intentional, as discussed in Chapter 9, "Statistical Approximation of Streaming Data," and Chapter 10, "Approximating Streaming Data with Sketching." However, it is often the case that the data calculated must eventually be resolved and audited, which is a task much better suited to batch computation. Ideally, the streaming architecture makes the data available quickly, and that data is then replaced by data from the batch system as it becomes available. Nathan Marz, the primary creator of Storm, coined the term for this: the Lambda Architecture. In a Lambda Architecture system, the user interface or another piece of middleware is responsible for retrieving the appropriate data from the appropriate store. As the convergence of the batch and real-time systems continues, through infrastructure improvements like the Apache YARN project, this should become the common case.

Delivery

The delivery mechanism for most real-time applications is some sort of web-based interface. Web interfaces have a number of advantages: they are relatively easy to implement, they are lightweight to deploy, and there are dozens of frameworks available to aid the process. Originally, these dashboards were websites with a refresh META tag that would reload a static representation of the process every few minutes. Many systems-monitoring tools, such as Graphite, still work that way. This works well enough for applications whose data is usually fairly slow to change. It also works well when a variety of environments need to be supported, as the technology is nearly as old as the web itself.

Web Communication Methods

Modern applications have a number of other options for delivering data to a web browser for display. A slightly faster version of the refreshing web page is to use the XMLHttpRequest (XHR) feature of most web browsers to load the data itself into the browser rather than an image of the data. This collection of technologies is also known by the name AJAX, which stands for "Asynchronous JavaScript and XML." It offers some improvements over the old model by separating the data layer from the presentation layer. This allows the application to tailor the rendering of the data to the environment, and it usually reduces the size of the data transferred. Transferring data via XMLHttpRequest, like the page refresh, is inherently built around polling. Data can only update as quickly as the polling frequency,

which limits how quickly new information can be delivered to the front end. If events are relatively rare, such as with a notification system, a large number of calls are also wasted to ensure that "no news is good news."

As the need for more real-time communication has grown, a number of "hacks" have been developed to make interactions between the browser and the server seem more immediate, using techniques such as long polling. Collectively known as Comet, these methods now include true streaming to the browser, allowing for the implementation of truly real-time interfaces. There are two standards currently in use: Server-Sent Events (SSE) and Web Sockets.

SSE implements an HTTP-compatible protocol for unidirectional streaming of data from the server to the client. This is ideal for applications, such as dashboards, that mostly take in data. Communication in the other direction is handled by normal AJAX calls. Web Sockets, on the other hand, implements a more complex protocol designed for bidirectional interaction between the client and the server. This protocol is not HTTP compatible, but it does have support from the three major desktop browsers (Internet Explorer, Chrome, and Firefox) as well as most of the mobile browsers. SSE is not available in Internet Explorer, but its HTTP compatibility means that it can fall back to polling for browsers that do not support the interface natively without extra server-side code.
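Because SSE is just HTTP with a specific content type and framing, a streaming endpoint can be sketched with nothing more than a plain servlet; the class name and the metric payload below are hypothetical.

    import java.io.IOException;
    import java.io.PrintWriter;

    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    public class MetricStreamServlet extends HttpServlet {
        @Override
        protected void doGet(HttpServletRequest request, HttpServletResponse response)
                throws IOException {
            // The text/event-stream content type is what marks this response as SSE
            response.setContentType("text/event-stream");
            response.setCharacterEncoding("UTF-8");
            PrintWriter out = response.getWriter();

            for (int i = 0; i < 10; i++) {
                // Each event is a "data:" line followed by a blank line
                out.print("data: {\"metric\":\"requests\",\"value\":" + i + "}\n\n");
                out.flush();
                try {
                    Thread.sleep(1000);
                } catch (InterruptedException e) {
                    break;
                }
            }
        }
    }

On the client side, the browser's EventSource object (or a polyfill on older browsers) consumes the stream and fires a callback for each data line.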

Web-Rendering Methods

By sending the data to the client side, applications can also take advantage of the rendering components built into modern web browsers: Scalable Vector Graphics (SVG) and Canvas drawing. Other drawing technologies have also been available over the years, such as Microsoft's Vector Markup Language (VML), which was supported only by Internet Explorer, and the WebGL standard supported by Google's Chrome browser, which attempts to bring standardized 3D rendering to the web browser.

Introduced in the late 1990s, but not well supported until the current generation of web browsers, the SVG format offers web pages the ability to render resolution-independent graphics using a model very similar to the PostScript and Portable Document Format (PDF) drawing models. Like HTML, graphical elements are specified using a markup language that can display rectangles, circles, and arbitrary paths. These can be styled using Cascading Style Sheets (CSS) and animated using Document Object Model (DOM) manipulation. Originally supported only through a clunky plug-in interface, native SVG was added to Firefox 2.0 and early versions of the WebKit rendering engine used in both the Safari and Chrome browsers. Today it is supported by the current version of all major browsers and most mobile browser implementations. This broad support and resolution independence makes it an ideal choice for graphical rendering across varying platforms.


Originally introduced as part of Apple's Safari browser and subsequently included in the WHAT-WG proposal that serves as the basis for the HTML5 standard, the Canvas element also enjoys widespread support across browsers. The Canvas element provides a drawing surface rather than a Document Object Model for rendering an image. The drawing model itself is very similar to SVG because it also draws its inspiration from PostScript and PDF. However, it only maintains the actual pixels drawn rather than the objects used to render the pixels. This allows the Canvas element to act more like the traditional HTML image element as well as conferring performance benefits relative to SVG in certain situations. In graphics circles, Canvas would be known as an "immediate mode" implementation while SVG is a "retained mode" implementation.

Features of a Real-Time Architecture

All of the components of real-time systems discussed in the last section share three key features that allow them to operate in a real-time streaming environment: high availability, low latency, and horizontal scalability. Without these features, the real-time system encounters fundamental limitations that prevent it from scaling properly. This section discusses what each of these features means in the context of real-time applications.

High Availability

The high-availability requirement for the entire system is probably the key difference between the real-time streaming application and the more common batch-processing or business-intelligence application. If those systems become unavailable for minutes or even hours, it is unlikely to affect operations. (Often, users of the system do not even notice.) Real-time systems, on the other hand, are sensitive to these sorts of outages and may even be sensitive to scheduled maintenance windows. This may not extend to all parts of the stack, such as the delivery mechanism, but it is usually important for the collection, flow, and processing systems.

To ensure high availability, most of the systems in this stack resort to two things: distribution and replication. Distribution means the use of multiple physical servers to distribute the load across multiple endpoints. If one of the machines in, say, the collection system is lost, then the others can pick up the slack until it can be restored or replaced.

Of course, high availability does not refer only to the availability of the service; it also refers to the availability of the data being processed. The collection system generally does not maintain much local state, so it can be replaced immediately with no attempt to recover. Failures in the data-flow system, on the other hand,

cause some subset of the data itself to become unavailable. To overcome this problem, most systems employ some form of replication.

The basic idea behind replication is that, rather than writing a piece of data to a single machine, a system writes it to several machines in the hope that at least one of them survives. Relational databases, for example, often implement replication that allows edits made to a master machine to be replicated to a number of slave machines, with various guarantees about how many slaves, if any, must have the data before it is made available to clients reading from the database. If the master machine becomes unavailable for some reason, some databases can fail over to one of the slaves automatically, allowing it to become the master. This failover usually promotes the new master on a permanent basis because the previous master will be missing interim edits when and if it is restored.

This same approach is also used in some of the software stacks presented in this book. The Kafka data motion system uses a master-slave style of replication to ensure that data written to its queues remains available even in the event of a failure. MongoDB, a data store, also uses a similar system for its replication implementation. In both cases, automatic failover is offered, although a client may need to disconnect and reconnect to successfully identify the new master. It is also the client's responsibility to handle any in-flight edits that had not yet been acknowledged by the master.

The other approach, often found in NoSQL data stores, is to attempt a masterless form of high availability. Like the master-slave configuration, any edits are written to multiple servers in a distributed pool of machines. Typically, a machine is chosen as the "primary" write machine according to some feature of the data, such as the value of the primary key being written. The primary then writes the same value to a number of other machines in a manner that can be determined from the key value. In this way, the primary is always tried first and, if it is unavailable, the other machines are tried in order for both writing and reading. This basic procedure is implemented by the Cassandra data store discussed in this book. It is also common to have the client software implement this multiple-writing mechanism, a technique commonly used in distributed hash tables to ensure high availability. The drawback of this approach is that, unlike the master-slave architecture, recovery of a server that has been out of service can be complicated.
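The key-based placement just described can be sketched in a few lines of Java. The node names and replication factor are hypothetical, and a real store such as Cassandra uses consistent hashing and a richer placement strategy rather than this simple modulus.

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    public class ReplicaPlacementSketch {
        // Pick the primary and its fallback replicas for a key by hashing the key onto the node list
        static List<String> replicasFor(String key, List<String> nodes, int replicationFactor) {
            int start = (key.hashCode() & 0x7fffffff) % nodes.size();
            List<String> replicas = new ArrayList<String>();
            for (int i = 0; i < replicationFactor; i++) {
                // The first node is tried as the "primary"; the rest are tried in order if it is down
                replicas.add(nodes.get((start + i) % nodes.size()));
            }
            return replicas;
        }

        public static void main(String[] args) {
            List<String> nodes = Arrays.asList("node-a", "node-b", "node-c", "node-d");
            System.out.println(replicasFor("user:42", nodes, 3));
        }
    }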

Low Latency

For most developers, "low latency" refers to the time it takes to service a given connection. Low latency is generally desirable because there are only so many seconds in a day (86,400 as it happens), and the less time it takes to handle a single request, the more requests a single machine can service.


For the real-time streaming application, low latency means something a bit different. Rather than referring to the return time for a single request, it refers to the amount of time between an event occurring somewhere at the "edge" of the system and that event being made available to the processing and delivery frameworks. Although not explicitly stated, it also implies that the variation in latency between events is fairly stable.

For example, in a batch system operating on the order of minutes, the latency of some events is very low; specifically, those events that entered the batch right before processing started. Events that entered the batch just after the start of a processing cycle will have a very high latency because they need to wait for the next batch to be processed. For practical reasons, many streaming systems also work with what are effectively batches, but the batches are so small that the time difference between the first and last element of the mini-batch is small, usually on the order of milliseconds, relative to the time scale of the phenomena being monitored or processed.

The collection components of the system are usually most concerned with the first definition of low latency. The more connections that can be handled by a single server, the fewer servers are required to handle the load. More than one is usually required to meet the high-availability requirements of the previous section, but being able to reduce the number of edge servers is usually a good source of cost savings when it comes to real-time systems.

The data-flow and processing components are more concerned with the second definition of latency. Here the problem is minimizing the amount of time between an event arriving at the process and it being made available to a consumer. The key is to avoid as much synchronization as possible, though some is sometimes unavoidable to maintain high availability. For example, Kafka added replication between version 0.7, the first publicly available version, and version 0.8, which is the version discussed in this book. It also added a response to the Producer API that informs a client that a message has indeed reached the host; prior to this, it was simply assumed. Unsurprisingly, Kafka's latency increased somewhat between the two releases because data must now be part of an "in-sync replica" before it can be made available to consumers of the data.

In most practical situations, the addition of a second or two of latency is not critical, but the tradeoff between speed and safety is one often made in real-time streaming applications. If data can safely be lost, latency can be made very small. If not, there is an unavoidable lower limit on latency, barring advances in physics.

Horizontal Scalability

Horizontal scaling refers to the practice of adding more physically separate servers to a cluster of servers handling a specific problem to improve

performance. The opposite is vertical scaling, which is adding more resources to a single server to improve performance. For many years, vertical scaling was the only real option short of resorting to exotic hardware solutions, but the arrival of low-cost, high-performance networks as well as software advances has allowed processes to scale almost linearly. Performance can then be doubled by essentially doubling the amount of hardware.

The key to successfully scaling a system horizontally is essentially an exercise in limiting the amount of coordination required between systems. This applies both to communication between different instances and to processes that may share the same instance. One of the chief mechanisms used to accomplish this feat is the use of partitioning techniques to divide the work between machines. This ensures that a particular class of work goes to a well-known set of machines without the need for anyone to ask a machine what it is doing.

When coordination is required, the problem can become fairly complicated. A number of algorithms have been developed over the years that are designed to allow loosely coupled machines to coordinate and handle events such as network outages. In practice, they have proven devilishly difficult to implement. Fortunately, these algorithms are now offered "as a service" through coordination servers. The most popular is the ZooKeeper system, developed by Yahoo! and discussed in great detail in Chapter 3. Used by many of the other frameworks and software packages in this book, ZooKeeper greatly simplifies the task of efficiently coordinating a group of machines when required.

Languages for Real-Time Programming

There are a number of programming languages in use today, but only a few of them are widely used when implementing real-time streaming applications. Part of this is due to the fact that the core packages are implemented in a specific language, and the focus for the client libraries for that package is, of course, on its native language. In other cases, the speed or ease of use of the language makes it well suited for implementing real-time applications. This section briefly discusses some of the languages that are often encountered in real-time applications.

Java

The Java language was made public in the mid-1990s with the promise of "Write Once, Run Anywhere" software. Initially, this was used to embed Java software into web pages in the form of applets. The plan was to let the web server act as the distribution mechanism for "fat" client-server applications. For a variety of


reasons, this never really caught on, and Java is now disabled by default in many web browsers. What functionality it originally sought to provide was mostly taken over by Adobe's Flash product and, more recently, by web pages themselves as the JavaScript engine built into every web browser has become sufficiently powerful to implement rich applications.

Although it lost on the client side, Java did find a niche on the server side of web applications because it is easier to deploy than its primary competitor, C++. It was particularly easy to deploy when database connections were involved, having introduced the Java Database Connectivity (JDBC) standard fairly early in its life. Most web applications are, essentially, attractive connections to a database, and Java became a common choice of server implementation language as a result.

The development of a staggering number of software libraries and the slow but steady improvements in performance made Java a natural choice when many of the first Big Data applications, especially Hadoop, were being developed. (Google's internal Map-Reduce application, however, was written in C++.) It made sense to develop the rest of the stack in Java because it was already on the front end and, in the form of Hadoop, on the back end.

Because most of the packages in this book were written in Java, or at least written to run on the Java Virtual Machine (JVM), Java is used as the example language for most of this book. The reasons for this decision are largely the same as those used to implement the packages in the first place. The Java language has a variety of software libraries that are fairly easy to obtain thanks to its similarly wide array of build management systems. In particular, this book uses a build system called Maven to handle the management of packages and their dependencies. Maven is nice because it provides the ability for distributors of software packages to make them available either through Maven's own central repository or through repositories that library authors maintain on their own. These repositories maintain the software itself and also describe the dependencies of that software package. Maven can then obtain all the appropriate software required to build the package automatically.

Scala and Clojure

In the last section it was mentioned that most of the software was built in Java or at least to "run on the Java Virtual Machine." This is because some of the software was written in languages that compile to run on the JVM and can interact with Java software packages but are not actually Java. Although there are a number of non-Java JVM languages, the two most popular ones used in real-time application development are Scala and Clojure. Of the software in this book, Scala was used to implement the Kafka package used for data motion

and the Samza framework used in conjunction with Kafka to process data. Scala has spent most of its life as an academic language, and it is still largely developed at universities, but it has a rich standard library that has made it appealing to developers of high-performance server applications. Like Java, Scala is a strongly typed object-oriented language, but it also includes many features from functional programming languages that are not included in the standard Java language. Interestingly, Java 8 seems to incorporate several of the more useful features of Scala and other functional languages.

Whereas Scala does not have a direct analog in the world of programming languages, Clojure is a JVM language based on the venerable Lisp language. There have been a number of Lisp interpreters written for Java over the years, but Clojure is compiled and can make direct use of Java libraries within the language itself. Clojure was used to implement the Storm streaming data processing framework discussed in this book. Lisp-style languages are ideally suited to the implementation of domain-specific languages, which are small languages used to implement specific tasks. This appears to have influenced the development of Storm's Trident Domain-Specific Language, which is used to define the flow of data through the various steps of processing in Storm.

JavaScript

JavaScript, of course, is the lingua franca of the web. It is built into every web browser and, thanks to an arms race between browser developers, is now even fast enough to serve as a language for implementing web applications themselves. The only relationship between JavaScript and Java, aside from the name (which was essentially a marketing gimmick), is a superficial similarity in their syntaxes. Both languages inherit much of their syntax from the C/C++ family, but JavaScript is much more closely related to Clojure and Scala.

Like Scala and Clojure, JavaScript is a functional language. In Java, an object is an entity that contains some data and some functions, called methods. These two entities are separate and treated quite differently by the Java compiler. When a method is compiled by Java, it is converted to instructions called byte code and then largely disappears from the Java environment, except when it is called by other methods. In a functional language, functions are treated the same way as data. They can be stored in objects the same way as integers or strings, returned from functions, and passed to other functions.

This feature is heavily used in many client-side JavaScript frameworks, including the D3 framework used in this book. The D3 framework, which stands for Data-Driven Documents, uses functions to represent the equations it uses to manipulate web pages. For example, a function can be passed to the method of an object that tells a


rectangle where to appear on the page. This allows the framework to respond to data coming from outside sources without a lot of overhead.

JavaScript is also increasingly found in the implementation of the delivery server itself, primarily in the form of the node.js project found at http://nodejs.org. This project combines the JavaScript engine from the Chrome web browser, known as V8, with an event loop library. This allows the inherently single-threaded server to implement event-driven network services that can handle large numbers of simultaneous connections. Although not nearly as efficient as the modern Java frameworks, it is generally sufficient for the implementation of the server that handles delivery of data to the front end. It also has an advantage over a non-JavaScript-based server in that the fallback support for browsers that do not support modern features, such as SVG and Canvas, can be implemented on the server side using the same codebase as the client side. In particular, node.js has packages that allow for the creation of a DOM and a rendering service that allows for the rendering of SVG visualizations on the server side rather than the client side.

The Go Language

Although it is not used in this book, the Go language is an interesting new entry in the realm of high-performance application development, including real-time applications. Developed at Google by a fairly small team, the Go language combines the simplicity of the C language with a framework for developing highly concurrent applications. It shares core team members with the original UNIX team and the developers of the C language, which makes the similarity to C unsurprising.

The Go language is still very much under development, and its compiler is not as good as many C compilers (or even necessarily as good as Java's compilers at this point). That said, it has been shown to perform well on real-world benchmarks for web server development and already has clients for several of the packages discussed in this book, particularly Kafka and Cassandra. This makes it a potential candidate for the development of edge collection servers, if not for the processing or delivery components of the real-time application.

A Real-Time Architecture Checklist

The remainder of the first half of this book discusses the frameworks and software packages used to build real-time streaming systems. Every piece of software has been used in some real-world application or another, and the packages should not be considered to be in competition with one another. Each has its purpose, and, many times, attempting to compare them in some general way is

missing the point. Rather than comparing various software packages, this section offers a simple opinion about which packages, frameworks, or languages work best in a given setting.

Collection

In many cases, there will not be a choice. The edge server will be a pre-existing system, and the goal will be to retrofit or integrate that server with a real-time framework. In the happy event that the edge server is being custom built to collect data from the outside world, a Java-based server is probably the best choice. Java is in no way a "sexy" language. In fact, in many cases it can be an unpleasant language to use. But it has been around for nearly 20 years, its shortcomings are well understood, and people have been trying to squeeze performance out of its frameworks for a long time. For the same amount of tuning effort in a real-world application, Java will almost always be faster. Other compiled JVM languages, such as Scala or Clojure, yield performance similar to Java itself, particularly if none of the advanced features of the language, such as Scala's collection library, are used.

That said, the goals of a dedicated edge server are to be as simple as possible and to deliver the data to the data-flow framework quickly and safely. The same cannot be said of the various complicated Java frameworks, which are designed for building websites, not high-performance data collection systems. Instead, you should use a lightweight framework rather than a heavyweight solution designed for the implementation of relatively low-volume interactive websites.

As mentioned earlier in the section on programming languages, the Go language often performs well at this task. Although it does not yet have the outright performance of the well-tuned Java web servers, it still manages to perform better than nearly everything else in practical benchmarks. Additionally, it seems to have a generally lighter memory footprint, and there are some indications that it should perform better than a Java application when there are a truly enormous number of connections being made to the server. Anecdotal evidence suggests that being able to handle in excess of 50,000 concurrent connections is not outside the realm of "normal" for Go. The downside is that almost nobody knows how to program in Go, so it can be difficult to locate resources.

Data Flow

If there is a pre-existing system to be integrated, choose Flume. Its built-in suite of interfaces to other environments makes it an ideal choice for adding a real-time processing system to a legacy environment. It is fairly easy to configure and maintain and will get the project up and running with minimal friction.


If retrofitting a system or building something new from scratch, there is no reason not to use Kafka. The only possible reason to avoid it would be the need to use a language that does not have an appropriate Kafka client; in that case, the community would certainly welcome even a partial client (probably the Producer portion) for that language. Even with the new safety features, Kafka's performance is still very good (to the point of being a counterexample to the argument against Scala in the previous section), and it essentially does exactly what is necessary.

Processing

Although Samza shows great promise, it is unfortunately still too immature for most first attempts at a real-time processing system. This is primarily due to its reliance on the Apache YARN framework for its distributed processing. The claim is that YARN can support many different types of computation, but the reality is that nearly all the documentation considers only the map-reduce case. This makes it difficult for an inexperienced user to gain much traction when getting set up. A user already familiar with the maintenance of a YARN cluster will have much more success with Samza and should consider it for implementing a real-time application. In many ways it is a cleaner design than Storm, and its native support for Kafka makes it easy to integrate into a Kafka-based environment.

For first-time users, Storm is the more successful framework. Although it too can be hosted in a YARN cluster, that is not the typical deployment. The non-YARN deployment discussed in this book is much easier for new users to understand, and it is relatively easy to manage. The only disadvantage is that it does not include native support for either Kafka or Flume. Chapter 5 addresses the integration, but the development cycle of the plug-ins is quite different from that of the mainline Storm code, which can cause incompatibilities around the time of a release.

Storage

For relatively small build-outs, such as a proof of concept or a fairly low-volume application, Redis is the obvious choice for data storage. It has a rich set of abstractions beyond the simple key-value store that allow for sophisticated storage. It is easy to configure and install, and it requires almost no real maintenance (it is even available as a turnkey option from various cloud-based providers). It also has a wide array of available clients.

The two drawbacks of Redis are that it has not really addressed the problem of horizontal scalability for writes and that it is limited to the available random access memory (RAM) of the master server. Like many of the tools in this book, it

offers a master-slave style of replication with the ability to fail over to one of the replicas using Redis Sentinel. However, this still requires that all writes go to a single server. There are several client-side projects, such as Twitter's Twemproxy, that attempt to work around this limitation, but there is no native Redis solution as of yet. There is an ongoing clustering project, but there is no timeline for its stable release.

If very large amounts of infrequently accessed data are going to be needed, Cassandra is an excellent choice. Early versions of Cassandra suffered from a "query language" that was essentially just an RPC layer on the internal data structures of the Cassandra implementation. This issue, combined with a somewhat unorthodox data model (for the time), made it very hard to get started with Cassandra. With the newest versions of Cassandra and version three of the Cassandra Query Language (CQL), these problems have mostly been corrected (a brief sketch appears at the end of this section). Combined with other changes around the distribution of data in the cluster, Cassandra has become a much more pleasant environment to use and maintain. It is also quite fast, especially when coupled with Solid State Drive (SSD)-equipped servers. It is not quite as fast as Redis because it may have to retrieve data from disk, but it is not that much slower.

MongoDB is really a first choice only when the data to be stored has a rich structure, for example, when streaming geographical data. MongoDB offers utilities and indexing that support geographical information system (GIS) data more efficiently than either Redis or Cassandra. Other options in this space tend to be based on traditional relational stores with added geographical indexing, so MongoDB can provide a higher-performance alternative. Foursquare, which deals almost exclusively in geographical data, is an early and well-known user of MongoDB for this reason.
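Returning to Cassandra, the following sketch shows what CQL version three looks like in practice via the DataStax Java driver; the contact point, keyspace, and table are hypothetical, and the counter column echoes the atomic counters discussed earlier in the chapter.

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.ResultSet;
    import com.datastax.driver.core.Row;
    import com.datastax.driver.core.Session;

    public class CqlCounterSketch {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder().addContactPoint("cassandra-host").build();
            Session session = cluster.connect();

            // Schema definitions now look much like SQL
            session.execute("CREATE KEYSPACE IF NOT EXISTS metrics WITH replication = "
                + "{'class':'SimpleStrategy','replication_factor':3}");
            session.execute("CREATE TABLE IF NOT EXISTS metrics.pageviews ("
                + "url text, day text, views counter, PRIMARY KEY (url, day))");

            // Counter columns provide cheap atomic increments
            session.execute("UPDATE metrics.pageviews SET views = views + 1 "
                + "WHERE url = '/index.html' AND day = '2014-06-12'");

            ResultSet rs = session.execute("SELECT views FROM metrics.pageviews "
                + "WHERE url = '/index.html' AND day = '2014-06-12'");
            for (Row row : rs) {
                System.out.println("views = " + row.getLong("views"));
            }
            cluster.close();
        }
    }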

Delivery

Delivery of almost every application is going to be some sort of web application, at least for its first iteration. A native application, especially for mobile devices, certainly delivers a much richer and more satisfying experience for the user, but it also requires a skillset that is not necessarily immediately available to most organizations. If mobile developers are available, by all means, use them. A good tablet or phone interface to a streaming application is much more compelling than a web interface, thanks to the native experience.

That said, the native mobile experience is still going to need a server that can deliver the data. The best way to do this is to implement application programming interfaces (APIs) supporting SSE that can be used by both web applications and native applications. The reason to choose SSE over Web Sockets, despite the fact that Web Sockets has slightly better support across different browsers, is its HTTP basis compared to Web Sockets' non-HTTP interface. SSE is simply


an HTTP connection, which allows for the use of "polyfills" that can simulate SSE on older browsers. This is much more difficult, and sometimes impossible, with Web Sockets. By the same token, it is also easier to integrate SSE into a native experience than it is to integrate Web Sockets' more complicated protocol. Because Web Sockets is intended for use within web browsers, it is not necessarily available as a library on a native platform, whereas the standard HTTP library can usually be easily modified to support SSE by changing its keep-alive and timeout behavior.

Conclusion

This chapter has provided a broad overview of the development of real-time applications. The various components, also known as tiers, of a real-time system have been described, and the names of some potential frameworks used in each tier have been introduced. In addition, the low-level tools of the real-time application, the programming languages, have also been briefly introduced. As mentioned at the beginning of this chapter, it is assumed that there is some knowledge of the languages discussed in this chapter to get the most out of the rest of this book.

The rest of this book discusses each component described in this chapter in more depth. Chapter 3 begins by discussing the coordination server ZooKeeper, which was mentioned in the section on horizontal scalability. This is a critical piece of software used by several other packages in this book, making it deserving of its own chapter. These coordination servers are used to manage the data flow from the edge servers to the processing environment. Two packages used to manage data flows, Kafka and Flume, are covered in Chapter 4, "Flow Management for Streaming Analysis," and should be used, as mentioned earlier, according to need. Next, the data is processed using either Storm or Samza, both of which are covered in Chapter 5, "Processing Streaming Data." The results of this processing need to be stored, and there are a variety of options available. Some of the more popular options are laid out in Chapter 6, "Storing Streaming Data," so that they can be accessed by the delivery mechanisms discussed in Chapter 7, "Delivering Streaming Metrics."

Chapters 8 through 11 are more focused on building applications on top of this foundation. They introduce aggregation methods used in the processing section as well as statistical techniques applied to process data to accomplish some goal. Chapter 11, "Beyond Aggregation," in particular, is reserved for more "advanced topics," such as machine learning from streaming data.

CHAPTER 3

Service Configuration and Coordination

The early development of high-performance computing was primarily focused on scale-up architectures. These systems had multiple processors with a shared memory bus. In fact, modern multicore server hardware is architecturally quite similar to the supercomputer hardware of the 1980s (although modern hardware usually forgoes the Cray's bench seating). As system interlinks and network hardware improved, these systems began to become more distributed, eventually evolving into the Network of Workstations (NOW) model used in most present-day computing environments. However, these interlinks usually do not have the bandwidth required to simulate the shared-resource environment of a scale-up architecture, nor do they have the reliability of the interlinks used in a scale-up environment. Distributed systems also introduce new failure modes that are not generally found in a scale-up environment.

This chapter discusses systems and techniques used to overcome these problems in a distributed environment, in particular the management of shared state in distributed applications. Configuration management and coordination are first discussed in general. The most popular system is ZooKeeper, which was developed by Yahoo! to help manage its Hadoop infrastructure; it is discussed extensively because it is used by most of the other distributed software systems in this book.



Motivation for Configuration and Coordination Systems

Most distributed systems need to share some system metadata as well as some state concerning the distributed system itself. The metadata is typically some sort of configuration information. For example, it might need to store the location of the various database shards for a distributed server. System state is usually data used to coordinate the application. For example, a master-slave system needs to keep track of the server currently acting as the master. Furthermore, this happens in a relatively unreliable environment, so a server must also somehow track the correctness of the current configuration or the validity of its coordination efforts.

Managing these requirements in a distributed environment is a notoriously difficult problem, often leading to incorrect server behavior. Alternatively, if the problems are not addressed, they can lead to single points of failure in the distributed system. For "offline" processing systems, this is only a minor concern, as the single point of failure can be re-established manually. In real-time systems, this is more of a problem, as recovery introduces potentially unacceptable delays in processing or, more commonly, missed processing entirely.

This leads directly to the motivation behind configuration and coordination systems: providing a system-wide service that correctly and reliably implements distributed configuration and coordination primitives. These primitives, similar to the coordination primitives provided for multithreaded development, are then used to implement distributed versions of high-level algorithms.
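As a preview of the kind of primitive such a service provides, the following sketch uses the ZooKeeper Java client (covered in detail later in this chapter) to publish and read back a small piece of shared configuration; the connection string, znode path, and payload are hypothetical.

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class SharedConfigSketch {
        public static void main(String[] args) throws Exception {
            // Connect to a (hypothetical) ZooKeeper ensemble
            ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 3000, new Watcher() {
                public void process(WatchedEvent event) {
                    // Session and watch events are delivered here
                }
            });

            String path = "/shard-locations";
            if (zk.exists(path, false) == null) {
                // Publish the shared configuration as the data of a persistent znode
                zk.create(path, "shard1=db-a,shard2=db-b".getBytes("UTF-8"),
                          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            }

            // Read it back and leave a watch so this process is notified when it changes
            byte[] data = zk.getData(path, true, null);
            System.out.println(new String(data, "UTF-8"));
            zk.close();
        }
    }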

Maintaining Distributed State

Writing concurrent code that shares state within a single application is hard. Even with operating systems and development environments providing support for concurrency primitives, the interaction between threads and processes is still a common source of errors. The problems are further compounded when the concurrency spans multiple machines. Now, in addition to the usual problems of concurrency (deadlocks, race conditions, and so on), there are a host of new problems to address.

Unreliable Network Connections

Even in the most well-controlled datacenter, networks are unreliable relative to a single machine. Latency can vary widely from moment to moment, bandwidth can change over time, and connections can be lost. In a wide area network, a


"Backhoe Event" can sever connections between previously unified networks. For concurrent applications, this last event (which can happen even within a single datacenter) is the worst problem.

In concurrent programming, the loss of connectivity between two groups of systems is known as the "Split Brain Problem." When this happens, a distributed system must decide how to handle the fact that some amount of state is suddenly inaccessible. There are a variety of strategies in practice, though the most common is to enter a degraded state. For example, when connectivity loss is detected, many systems disallow changes to the distributed state until connectivity has been restored.

Another strategy is to allow one of the partitions to remain fully functional while degrading the capabilities of the other partition. This is usually accomplished using a quorum algorithm, which requires that a certain number of servers be present in a partition before it is considered fully functional. A common strategy is to run an odd number of servers and require that a fully functional partition contain a majority of them. If an odd number of servers is split into two groups, exactly one of the groups can contain a majority of the servers. The group holding the majority remains functional, and the other group becomes read-only or has otherwise degraded functionality.
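A minimal sketch of the majority check at the heart of such a quorum rule, using a hypothetical five-server ensemble:

    public class QuorumCheck {
        // A partition keeps full read-write service only if it can see a strict majority
        static boolean hasQuorum(int visibleServers, int ensembleSize) {
            return visibleServers > ensembleSize / 2;
        }

        public static void main(String[] args) {
            // A five-server ensemble split 3/2: only the group of three keeps accepting writes
            System.out.println(hasQuorum(3, 5)); // true
            System.out.println(hasQuorum(2, 5)); // false
        }
    }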

Clock Synchronization

It may seem like a simple thing, but distributed systems often require some sort of time synchronization. Depending on the application, this synchronization may need to be fairly precise. Unfortunately, the hardware clocks found in servers are not perfect and tend to drift over time. If they drift far enough, one server can observe an event whose timestamp lies in its own future, resulting in some interesting processing artifacts. For example, an analysis system interested in the difference between the timestamps of two types of events might start to see negative durations.

In most circumstances, servers are synchronized using the Network Time Protocol (NTP). Although this still allows sometimes-significant drift between machines, it is usually "close enough." Problems can arise, however, when it is not possible to synchronize machines to the same set of NTP servers, which sometimes happens in environments that have a secure internal domain communicating through a very limited gateway with an external-facing domain. In that case, an internal NTP server can drift away from the NTP servers used by external users. In some situations, such as application programming interfaces (APIs) that use time to manage authorization, the drift can cause profound failures of the service.


Unfortunately, other than “don’t do that” there is no easy solution to that particular problem.

Consensus in an Unreliable World

In addition to unreliable network connections and poorly synchronized clocks, there is the possibility that the processes themselves might contain faults that cause them to generate incorrect results at times. In some academic work on the subject, these processes may even be actively malicious, though many real-world systems do not consider this scenario. In the face of all the things that can go wrong, the distributed state maintained across machines should be consistent. The most famous (though not the first) algorithm for doing this, known as Paxos, was first published in the late 1980s.

In the Paxos algorithm, a process that wants to mutate the shared state first submits a proposal to the cluster. This proposal is identified by a monotonically increasing integer, which must be larger than any identifier previously used by the Proposer. The proposal is then transmitted to a quorum of other processes, called the Acceptors. If a given Acceptor receives a proposal identifier smaller than the largest it has seen from that Proposer, it simply discards the proposal as being stale (alternatively, it can explicitly deny the request to the Proposer). If the identifier is the largest the Acceptor has ever seen from that Proposer, it updates the largest identifier seen for the Proposer. At the same time, it returns the previous identifier and the associated value it had received from that Proposer. This is called the promise phase because the Acceptor has now promised to reject any request with an identifier smaller than the new value.

After the Proposer has received enough positive responses from the promise phase, it may consider its request to mutate the state to be accepted. The Proposer then sends the actual change to the Acceptors, who propagate the information to other nodes in the cluster, which are called Learners. This process also identifies the Proposer whose proposal has been accepted as the Leader of the cluster. In practice, this is often used to implement Leader Election for a cluster of servers (for example, a clustered database application).

This basic form of Paxos is concerned with updating, essentially, the entire state. It can be used to update individual elements of a distributed state representation, but this tends to introduce a lot of overhead, which led to the development of Multi-Paxos. Multi-Paxos is often the algorithm that is actually implemented in the real world. It depends on the fact that processes are relatively stable and that a change in the Leader of a cluster is going to be rare. This allows the algorithm to dispense with the Prepare-Promise phase of the communication protocol after a process has been declared a Leader. When a new Leader is elected, it

must repeat the Prepare-Promise cycle again, so the worst-case scenario is the basic Paxos algorithm.

Although these algorithms appear to be relatively straightforward, they have proven notoriously difficult to implement properly. Thus, it is recommended to use systems like those described in this chapter rather than attempting an implementation. This chapter introduces two of these systems: ZooKeeper and etcd. The ZooKeeper project is a distributed configuration and coordination tool used to coordinate many of the other distributed systems found in this book. Etcd is a newer system that provides many of the same features as ZooKeeper through an HTTP-based interface, using a new consensus algorithm called Raft, which was designed explicitly to avoid the complexities of Paxos.

Apache ZooKeeper

The ZooKeeper project was developed at Yahoo! with the mission of providing exactly the primitives described in the last section. Originally developed to tame the menagerie of services being implemented in Yahoo!'s internal Hadoop environments, it has been open sourced and contributed to the Apache Foundation. It has since proven itself to be a valuable infrastructural component for the development of distributed systems. In fact, it is used to coordinate many of the applications used later in this book. The Kafka data motion system uses it to manage metadata for both its servers and clients. It is also a key component of the Storm data processing framework's server management features and plays a similar role for the Samza data processing framework.

Rather than imposing a particular set of coordination or configuration features, ZooKeeper provides a low-level API for implementing coordination and configuration using a system loosely based on a hierarchical file system. These features, often called recipes in ZooKeeper jargon, are left as an exercise for the application author. This has the advantage of simplifying the ZooKeeper codebase, but it does force the application developer to deal with some potentially difficult implementations. Fortunately, there are client libraries, such as the Curator library discussed later in this chapter, that have carefully implemented the more complicated recipes. This section covers the usage of the Curator library, but first it covers the basics of ZooKeeper itself.

The znode

The core of ZooKeeper’s data structures is the znode. These nodes can be arranged hierarchically into tree structures as well as hold data. Because ZooKeeper is not intended as a bulk storage facility, the amount of data a single znode can


hold is limited to about 1 MB. It also means that ZooKeeper does not support partial reads, writes, or appends. When reading and writing data, the entire byte array is transmitted in a single call.

The znode API supports six fundamental operations:

■ The create operation, which takes a path and an optional data element. This, naturally, creates a new znode at the specified path, with data, if it does not already exist.
■ The delete operation, which removes a znode from the hierarchy.
■ The exists operation, which takes a path and allows applications to check for the presence of a znode without reading its data.
■ The setData operation, which takes the same parameters as the create operation. It overwrites the data in an existing znode, but it does not create a new one if it does not already exist.
■ The getData operation, which retrieves the data block for a znode.
■ The getChildren operation, which retrieves a list of the children of the znode at the specified path. This operation is a key part of many of the coordination recipes developed for ZooKeeper.

All of these operations are subject to ZooKeeper's access control policies. These operate much like file system permissions and dictate which clients may write to which part of a hierarchy. This allows for the implementation of multi-tenant ZooKeeper clusters.
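To make the API concrete, the following minimal sketch exercises each of the six operations with the native Java client. It assumes an already connected ZooKeeper handle named zk, and the /app paths are illustrative rather than taken from the original text; connecting the client is covered later in this chapter.

import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ZnodeBasics {
    public static void demo(ZooKeeper zk) throws Exception {
        // create: persistent znodes with a small payload
        zk.create("/app", new byte[0],
            ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        zk.create("/app/config", "limit=10".getBytes(),
            ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // exists: returns a Stat object, or null if the path is absent
        Stat stat = zk.exists("/app/config", false);

        // getData and setData: the entire byte array is read or written
        byte[] data = zk.getData("/app/config", false, stat);
        zk.setData("/app/config", "limit=20".getBytes(), stat.getVersion());

        // getChildren: the names (not full paths) of the child znodes
        List<String> children = zk.getChildren("/app", false);

        // delete: removes a znode; -1 means "any version"
        zk.delete("/app/config", -1);
    }
}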

Ephemeral Nodes

A znode may be either persistent or ephemeral. A persistent znode is only destroyed when the delete operation is used to explicitly remove it from ZooKeeper. Ephemeral nodes, on the other hand, are also destroyed when the client session that created the node loses contact with the ZooKeeper cluster or otherwise ends its session. In the former case, the time required for the loss of contact to cause the destruction of an ephemeral node is controlled by a heartbeat timeout. Because they may be destroyed at any time, ephemeral nodes may not contain children. This may change in future releases of ZooKeeper, but as of the 3.4 series of releases, ephemeral nodes may be files, but they may not be directories.

Sequential Nodes

A znode may also be declared as sequential. When a sequential znode is created, it is assigned a monotonically increasing integer that is appended to the node's path.


This serves two useful purposes in many algorithms. First, it provides a mechanism for creating unique nodes: the counter ensures that a node name will never be repeated. Second, it provides a sorting mechanism when requesting the children of a parent node. This is used to implement, among other things, leader elections.
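As a brief illustration, the fragment below creates an ephemeral, sequential znode with the native client. It assumes a connected ZooKeeper handle zk and an illustrative /members parent path that already exists; neither is part of the original text.

// The returned path carries the appended counter, for example
// "/members/m-0000000012"; the znode is removed automatically when
// this client's session ends because it is ephemeral.
String member = zk.create("/members/m-", new byte[0],
    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
System.out.println("Registered as " + member);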

Versions

All znodes, when they are created, are given a version number. This version number is incremented whenever the data associated with a node changes. This version number can be optionally passed to the delete and setData operations, which allows clients to ensure that they do not accidentally overwrite changes to a node.
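For example, a client can implement an optimistic check-and-set update by reading the current version and passing it back to setData. This is a minimal sketch assuming a connected zk handle; the /config/limits path is illustrative.

Stat stat = new Stat();
byte[] current = zk.getData("/config/limits", false, stat);

// The write succeeds only if no other client has modified the znode since the
// read; otherwise ZooKeeper rejects it with KeeperException.BadVersionException.
zk.setData("/config/limits", "limit=20".getBytes(), stat.getVersion());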

Watches and Notifications

The primary use case for ZooKeeper involves waiting for events, which are changes to the data associated with a znode or to the children of a znode. Rather than requiring polling, ZooKeeper uses a notification system to inform clients of changes. This notification, called a watch, is established when the application registers itself with ZooKeeper to be informed of changes on a particular znode path. This is a one-time operation, so when the notification fires, the client must reregister the watch to receive future notifications about changes.

When attempting to continuously monitor a path, it is possible to lose a notification. This can occur when a change is made to a path in the time after the client receives the notification of a change but before it sets a watch for the next notification. Fortunately, ZooKeeper's designers anticipated this scenario, and setting a watch also reads the data from the znode. This effectively allows clients to coalesce notifications.
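The read-and-rewatch pattern described above looks roughly like the following sketch with the native client; the ConfigWatcher class name and its path are illustrative and not part of the original text.

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class ConfigWatcher implements Watcher {
    private final ZooKeeper zk;
    private final String path;

    public ConfigWatcher(ZooKeeper zk, String path) {
        this.zk = zk;
        this.path = path;
    }

    // Read the data and set the watch in a single call, coalescing any
    // changes made since the previous notification.
    public void readAndWatch() throws Exception {
        byte[] data = zk.getData(path, this, null);
        System.out.println("current value: " + new String(data));
    }

    public void process(WatchedEvent event) {
        if (event.getType() == Watcher.Event.EventType.NodeDataChanged) {
            try {
                readAndWatch(); // watches are one-shot, so re-register
            } catch (Exception e) {
                // connection-loss handling omitted in this sketch
            }
        }
    }
}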

Maintaining Consistency

ZooKeeper does not employ the Paxos consensus algorithm directly. Instead, it uses an alternative algorithm called Zab, which was intended to offer performance improvements over Paxos. The Zab algorithm operates on the order of changes to the state, as opposed to the Paxos algorithm, which updates the entire state when changes are made. That said, it shares many of the same features as Paxos: a Leader goes through a proposal process, accepting acknowledgments from a quorum of recipients before committing the proposal. The algorithm also uses a counter system, called an epoch number, similar to the one employed by Paxos.


Creating a ZooKeeper Cluster

ZooKeeper is envisioned as a shared service, acting as a common resource for a number of different applications. It uses a quorum of machines to maintain high availability. One of these machines is declared the Leader using the Zab algorithm described in the previous section, and all changes are replicated to the other servers in the quorum. This section describes installing the ZooKeeper server for use in either a standalone development environment or in a multiserver cluster.

Installing the ZooKeeper Server

ZooKeeper can often be found in the installation packages of Hadoop distributions. These distributions often have repositories compatible with various server flavors. It is also possible to install ZooKeeper from the binary archives available on the Apache ZooKeeper website (http://zookeeper.apache.org). The most current version of the server at the time of writing is version 3.4.5. Generally speaking, this latest version works with most applications, but it introduces some differences relative to the 3.3 series of servers. For software that requires the 3.3 series of ZooKeeper, 3.3.6 is the latest stable version. After unpacking the ZooKeeper tarball, the directory structure should look something like this:

$ cd zookeeper-3.4.5/
$ ls
CHANGES.txt              docs
LICENSE.txt              ivy.xml
NOTICE.txt               ivysettings.xml
README.txt               lib
README_packaging.txt     recipes
bin                      src
build.xml                zookeeper-3.4.5.jar
conf                     zookeeper-3.4.5.jar.asc
contrib                  zookeeper-3.4.5.jar.md5
dist-maven               zookeeper-3.4.5.jar.sha1

Before the server can be started, it must be configured. There is a sample configuration in the conf directory that provides a good starting point for configuring a ZooKeeper server. Start by copying that file to zoo.cfg, which is what the server startup script expects the configuration file to be called. The first few options in this file usually do not require modification, as they have to do with the time between heartbeat packets:

# The number of milliseconds of each tick
tickTime=2000
# The number of ticks that the initial


# synchronization phase can take
initLimit=10
# The number of ticks that can pass between
# sending a request and getting an acknowledgement
syncLimit=5

The next parameter is the location of the ZooKeeper data directory. It defaults to /tmp/zookeeper, but you should change it immediately because most systems discard data in the /tmp directory after some time. In this example it has been set to /data/zookeeper. In all cases, the directory and all of the files within it should be owned and writable by the user that will be running the ZooKeeper server itself.

# the directory where the snapshot is stored.
# do not use /tmp for storage, /tmp here is just
# example sakes.
dataDir=/data/zookeeper

Despite the fact that ZooKeeper can only have a database as large as available RAM, make sure that there is plenty of space available in the data directory. ZooKeeper takes snapshots of the database, along with logs of the changes between snapshots, and stores them in this directory. Finally, the port used by clients and other servers to communicate with the server is specified:

# the port at which the clients will connect
clientPort=2181

There is no reason to change this port unless it happens to conflict with another pre-existing service.

When the configuration is finished, create the data directory in the appropriate location (such as /data/zookeeper). In that directory, create a file called myid that contains an integer that is unique to each ZooKeeper server. This file identifies the server to the rest of the cluster, and ZooKeeper will not function without it. It should be set to a number between 1 and 255 on each node in the quorum:

echo 1 > /data/zookeeper/myid
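For a multiserver quorum, each server also needs to know about its peers. As a hedged illustration (the hostnames are placeholders, not from the original text), zoo.cfg on every node would list each member with its quorum and leader-election ports, and each server's myid must match its server.N entry:

# server.<myid>=<hostname>:<quorum port>:<leader election port>
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888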

When all this is done, the ZooKeeper daemon can be started using the zkServer.sh script, found in ZooKeeper's bin subdirectory. The zkServer.sh script is set up like an init.d script and provides the usual stop, start, and restart commands. It also has a start-foreground option that is useful for development. This has the daemon log to the console rather than to a log file:

$ bin/zkServer.sh start-foreground
JMX enabled by default
Using config: /Users/bellis/Projects/zookeeper-3.4.5/bin/../conf/zoo.cfg


[myid:] - INFO [main:QuorumPeerConfig@101] - Reading configuration from:
  /Users/bellis/Projects/zookeeper-3.4.5/bin/../conf/zoo.cfg
[myid:] - INFO [main:DatadirCleanupManager@78] - autopurge.snapRetainCount set to 3
[myid:] - INFO [main:DatadirCleanupManager@79] - autopurge.purgeInterval set to 0
[myid:] - INFO [main:DatadirCleanupManager@101] - Purge task is not scheduled.
[myid:] - WARN [main:QuorumPeerMain@113] - Either no config or no quorum defined
  in config, running in standalone mode
[myid:] - INFO [main:QuorumPeerConfig@101] - Reading configuration from:
  /Users/bellis/Projects/zookeeper-3.4.5/bin/../conf/zoo.cfg
[myid:] - INFO [main:ZooKeeperServerMain@95] - Starting server
[myid:] - INFO [main:Environment@100] - Server
  environment:zookeeper.version=3.4.5-1392090, built on 09/30/2012 17:52 GMT
[myid:] - INFO [main:Environment@100] - Server environment:host.name=byrons-air
[myid:] - INFO [main:Environment@100] - Server environment:java.version=1.7.0_45
[myid:] - INFO [main:Environment@100] - Server environment:java.vendor=Oracle Corporation
[myid:] - INFO [main:Environment@100] - Server
  environment:java.home=/Library/Java/JavaVirtualMachines/jdk1.7.0_45.jdk/Contents/Home/jre
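With the server running, a quick way to confirm it is reachable is the command-line client bundled in the bin directory. The session below is a sketch; the exact prompt and output may differ slightly by version:

$ bin/zkCli.sh -server 127.0.0.1:2181
[zk: 127.0.0.1:2181(CONNECTED) 0] ls /
[zookeeper]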

Choosing a Quorum Size

During development, a single ZooKeeper server is usually sufficient. However, for production systems, a single ZooKeeper server would introduce a single point of failure. To overcome this, a cluster of servers, called a quorum, is deployed to provide fault tolerance against the loss of a machine.

An important, often overlooked, aspect of ZooKeeper is that the size of the cluster can have a large effect on performance. Due to the need to maintain consensus, as the cluster grows, so does the time required for ZooKeeper to make changes to its state. As a result, there is an inverse relationship between fault tolerance, in the form of more servers, and performance. Generally speaking, there is no real reason to have a ZooKeeper cluster larger than five nodes and, because an even number of servers does not increase fault tolerance, no reason to have fewer than three nodes. (A three-node ensemble tolerates the loss of one server; a five-node ensemble tolerates the loss of two.)

Choosing either five or three nodes for a cluster is mostly a matter of load on the cluster. If the cluster is only lightly utilized for applications such as leader elections, or it is mostly used for reading configuration, then five nodes provide more fault tolerance. If the cluster will be heavily utilized, as can happen when it is used for applications like Kafka (which is discussed in Chapter 4, "Flow Management for Streaming Analysis"), then the added performance of only having three nodes is more appropriate.


Monitoring the Servers

ZooKeeper provides two mechanisms for monitoring the server components of the quorum. The first is a simple set of monitoring keywords that can be accessed over ZooKeeper's client port. The other mechanism is the Java Management Extensions (JMX) interface. Although these two mechanisms overlap in some places, they provide some complementary data that makes both of them useful in a single environment.

The native ZooKeeper monitoring keywords—colloquially known as the "four-letter words"—are a series of commands that return state information about the ZooKeeper cluster. As the nickname suggests, all of these commands consist of four letters: dump, envi, reqs, ruok, srst, and stat. The commands are transmitted in the clear to ZooKeeper on its client port and return plaintext output. For example, to send the ruok command from a shell, simply use echo and the netcat utility:

echo ruok | nc 127.0.0.1 2181

If the ZooKeeper server is operating normally, it returns an imok response. If the server is in a bad state, it does not send a response at all. By sending this command periodically, monitoring software can ensure that all nodes are up and running.

The dump command returns data about currently connected sessions and the ephemeral nodes being held by those sessions. It must be run against the current Leader of the ZooKeeper cluster. In this particular example, the distributed queue producer example found later in this chapter is running, and the command returns the following information:

$ echo dump | nc 127.0.0.1 2181
SessionTracker dump:
Session Sets (2):
0 expire at Wed Jan 15 22:13:03 PST 2014:
1 expire at Wed Jan 15 22:13:06 PST 2014:

	0x143999dfb5c0004
ephemeral nodes dump:
Sessions with Ephemerals (0):

In this case, there are no ephemeral nodes in use, and two sessions have been opened.

The envi command dumps information from the server's environment variables. For example, this ZooKeeper instance is the one included with the Kafka 0.8.0 distribution (covered in Chapter 4) and is running under Oracle Java 7 on OS X:

$ echo envi | nc 127.0.0.1 2181
Environment:
zookeeper.version=3.3.3-1203054, built on 11/17/2011 05:47 GMT
host.name=byrons-air


java.version=1.7.0_45
java.vendor=Oracle Corporation
java.home=/Library/Java/JavaVirtualMachines/jdk1.7.0_45.jdk/Contents/Home/jre
java.class.path=:
  ../core/target/scala-2.8.0/*.jar:
  ../perf/target/scala-2.8.0/kafka*.jar:
  ../libs/jopt-simple-3.2.jar:
  ../libs/log4j-1.2.15.jar:
  ../libs/metrics-annotation-2.2.0.jar:
  ../libs/metrics-core-2.2.0.jar:
  ../libs/scala-compiler.jar:
  ../libs/scala-library.jar:
  ../libs/slf4j-api-1.7.2.jar:
  ../libs/slf4j-simple-1.6.4.jar:
  ../libs/snappy-java-1.0.4.1.jar:
  ../libs/zkclient-0.3.jar:
  ../libs/zookeeper-3.3.4.jar:
  ../kafka_2.8.0-0.8.0.jar
java.library.path=/usr/local/mysql/lib::
  /Users/bellis/Library/Java/Extensions:
  /Library/Java/Extensions:
  /Network/Library/Java/Extensions:
  /System/Library/Java/Extensions:/usr/lib/java:.
java.io.tmpdir=/var/folders/5x/0h_qw77n4q73ncv4l16697l80000gn/T/
java.compiler=
os.name=Mac OS X
os.arch=x86_64
os.version=10.9.1
user.name=bellis
user.home=/Users/bellis
user.dir=/Users/bellis/Projects/kafka_2.8.0-0.8.0

The stat command returns information about the performance of the ZooKeeper server:

$ echo stat | nc 127.0.0.1 2181
Zookeeper version: 3.3.3-1203054, built on 11/17/2011 05:47 GMT
Clients:
 /127.0.0.1:58503[1](queued=0,recved=731,sent=731)
 /127.0.0.1:58583[0](queued=0,recved=1,sent=0)

Latency min/avg/max: 0/2/285
Received: 762
Sent: 761
Outstanding: 0
Zxid: 0x2f9
Mode: standalone
Node count: 752


As shown in the preceding output, the stat command returns information about the ZooKeeper server's response latency, the number of commands received, and the number of responses sent. It also reports the mode of the server, in this case a standalone server used along with Kafka for development purposes.

Sending the srst command resets the latency and command statistics. In this example, the srst command was sent after the previous stat command while a queue example was running. The counters have been reset, but the queue example has continued adding elements, increasing the node count:

$ echo srst | nc 127.0.0.1 2181
Server stats reset.
$ echo stat | nc 127.0.0.1 2181
Zookeeper version: 3.3.3-1203054, built on 11/17/2011 05:47 GMT
Clients:
 /127.0.0.1:58725[0](queued=0,recved=1,sent=0)
 /127.0.0.1:58503[1](queued=0,recved=1933,sent=1933)

Latency min/avg/max: 2/2/3
Received: 15
Sent: 15
Outstanding: 0
Zxid: 0x7ab
Mode: standalone
Node count: 1954

Finally, the reqs command returns the currently outstanding requests, which is useful for determining whether or not clients are blocked.

ZooKeeper's Native Java Client

The ZooKeeper library ships with a Java client that you can use in projects. This section describes setting up a Maven project to use the ZooKeeper library. It also covers the basic usage of the client.

Adding ZooKeeper to a Maven Project

The basic ZooKeeper client library, version 3.4.5 at the time of writing, is hosted on Maven Central and can be included in a project by adding the following dependencies to the project's pom.xml:

<dependency>
  <groupId>org.apache.zookeeper</groupId>
  <artifactId>zookeeper</artifactId>
  <version>3.4.5</version>
  <exclusions>
    <exclusion>
      <groupId>log4j</groupId>
      <artifactId>log4j</artifactId>
    </exclusion>
  </exclusions>
</dependency>


<dependency>
  <groupId>log4j</groupId>
  <artifactId>log4j</artifactId>
  <version>1.2.16</version>
</dependency>

Note that the log4j artifact has been excluded from the ZooKeeper artifact and then manually included at a newer version. This addresses a problem specific to the version of Log4j referenced by ZooKeeper (1.2.15), whose published dependencies include libraries that are no longer accessible. If a newer version of ZooKeeper has been released, this may have been addressed, eliminating the need for this workaround.

Connecting to ZooKeeper

A ZooKeeper session is established by creating a ZooKeeper object using its standard constructors. The most complicated version of this constructor has the following signature:

public ZooKeeper(
    String connectString,
    int sessionTimeout,
    Watcher watcher,
    long sessionId,
    byte[] sessionPasswd,
    boolean canBeReadOnly
) throws IOException

The shortest version of the constructor only needs the connectString and sessionTimeout arguments. The connectString is a comma-separated list of hostname and port pairs. In version 3.4 and later, an optional path can be appended to the connection string. This acts as a sort of chroot when connecting to the server, instructing the client to treat that path as the root of the hierarchy.

When the client connects to ZooKeeper, it first shuffles the hosts listed in the connection string into a random order. It then tries the first host in the shuffled list. If connecting to that host fails, it tries the second host, and so on through the list.


For example, to connect to three servers named zookeeper1, zookeeper2, and zookeeper3, treating the path /other as the root, the ZooKeeper client would be initialized as follows:

new ZooKeeper(
    "zookeeper1,zookeeper2,zookeeper3/other",
    3000,
    new Watcher() {
        public void process(WatchedEvent event) {
        }
    }
);

The Watcher interface implemented in the preceding code is an empty implementation for this example, but it is normally used to respond to notifications sent by ZooKeeper. Sometimes the type field of the event object is None, in which case the notification pertains to the connection state. At other times, the Watcher is used to receive watch events for requests that set a watch without specifying a specific Watcher object.

The commands used by the ZooKeeper client are the same as the operations described in the previous section: create, delete, exists, getData, setData, and getChildren. Each of these commands returns results immediately. The commands can optionally set a watch on the specified path, delivering the event either to the client-level Watcher specified earlier or to another Watcher object that is passed into the command. The easiest way to see this in action is through an example.

LEADER ELECTION USING ZOOKEEPER

One of the most common uses for ZooKeeper is to manage a quorum of servers in much the same way it manages itself. In these quorums, one of the machines is considered to be the Leader, while the other machines operate as either replicas or as hot-standby servers, depending on the application. This example uses the simplest possible algorithm for implementing leader election:

■ Create an ephemeral sequential node in the target path.
■ Check the list of children. If this server's node is the smallest in the list, it is the leader.
■ Otherwise, watch the list of children and check again when it changes.

In this case, the LeaderElection class implements a latching approach to leader election. First, the class is initialized with a path and a ZooKeeper client:

public class LeaderElection implements Watcher, ChildrenCallback {
    String path;
    ZooKeeper client;
    String node;




    public Integer lock = new Integer(0);
    public boolean leader = false;
    public boolean connected = false;

    public LeaderElection(String connect, String path) throws IOException {
        this.path = path;
        client = new ZooKeeper(connect, 3000, this);
    }

The class registers itself as the Watcher for events on this client connection. When the client connects to ZooKeeper, it should request a position in the leadership queue. If the connection fails, it should relinquish leadership (if it had it) and inform the server that the connection has been lost. This is handled by the process method:

    public void process(WatchedEvent event) {
        if(event.getType() == Watcher.Event.EventType.None) {
            switch(event.getState()) {
            case SyncConnected:
                connected = true;
                try {
                    requestLeadership();
                } catch(Exception e) {
                    //Quit
                    synchronized(lock) {
                        leader = false;
                        lock.notify();
                    }
                }
                break;
            case Expired:
            case Disconnected:
                synchronized(lock) {
                    connected = false;
                    leader = false;
                    lock.notify();
                }
                break;
            default:
                break;
            }
        } else {
            if(path.equals(event.getPath()))
                client.getChildren(path, true, this, null);
        }
    }


In the Watcher method, the event is first checked to see whether it is a ZooKeeper status event (EventType.None). If it is a connection event, a leadership position is requested. If it is not a connection event, the path is checked. If it is the same path as the one registered when creating the object, the list of children is requested. In this case, the callback version of the call is used so that there only needs to be a single location where leadership status is checked. The watch is also reset when getChildren is called.

When leadership is requested, the main path is first created if it does not exist. This call can produce errors because each client tries to create the path. If the path already exists by the time the client calls the create method, the error can be safely ignored. Next, an ephemeral and sequential node is created in this path. This node is the server's node and is stored for all future transactions. Before the node is created, the object registers itself for changes to the main path so that the act of creating a new node triggers a notification even when only one server is started. This is all implemented in requestLeadership, shown here:

    public void requestLeadership() throws Exception {
        Stat stat = client.exists(path, false);
        if(stat == null) {
            try {
                client.create(path, new byte[0],
                    ZooDefs.Ids.OPEN_ACL_UNSAFE,
                    CreateMode.PERSISTENT);
            } catch(KeeperException.NodeExistsException e) { }
        }
        client.getChildren(path, true, this, null);
        node = client.create(path+"/n_", new byte[0],
            ZooDefs.Ids.OPEN_ACL_UNSAFE,
            CreateMode.EPHEMERAL_SEQUENTIAL);
        System.out.println("My node is: "+node);
    }

When the getChildren calls return their results, they use a callback method, processResult, defined in the ChildrenCallback interface. In this example, the method calls isLeader with the list of child nodes to check whether this server's node is first in the list. If that is the case, it updates its leader status and notifies any threads that might be waiting for leadership that the leader status might have changed:

    public void processResult(int rc, String path, Object ctx,
        List<String> children) {
        synchronized(lock) {
            leader = isLeader(children);
            System.out.println(node+" leader? "+leader);
            lock.notify();
        }
    }




When ZooKeeper returns a list of children, it does not guarantee any sort of ordering for the elements. To find out whether the node is the leader, which is defined as having the smallest node sequence, the list first has to be sorted. Then the first element can be checked against the node name:

    protected boolean isLeader(List<String> children) {
        if(children.size() == 0) return false;
        Collections.sort(children);
        System.out.println(path+"/"+children.get(0)
            +" should become leader. I am "+node);
        return (node.equals(path+"/"+children.get(0)));
    }

Notice that the getChildren response does not include the full path. That has to be added during the comparison because the create method does return the full path.

A server waiting to become leader calls the isLeader method without any arguments in a loop. If the node is currently the leader, the method quickly returns true and the server can continue processing. If it is not the leader, the calling thread waits on the lock object until it is notified by processResult or because the server has lost its connection to ZooKeeper:

    public boolean isLeader() {
        if(leader && connected) {
            System.out.println(node+" is the leader.");
            return leader;
        }
        System.out.println("Waiting for a change in the lock.");
        synchronized(lock) {
            try {
                lock.wait();
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }
        return leader && connected;
    }


To see the example in action, a server is needed. This server does not do anything except announce that it is "processing" something in a loop while it has leadership. However, it is slightly unreliable, and on each pass through its work loop there is a 30 percent chance that it crashes, relinquishing its leadership position. If it loses leadership without crashing, it goes back to waiting for leadership. If it has lost its connection, it exits entirely:

public class UnreliableServer implements Runnable {

    String name;
    LeaderElection election;

    public UnreliableServer(String name, String connect, String path)
        throws IOException {
        this.name = name;
        election = new LeaderElection(connect, path);
    }

    public void run() {
        System.out.println("Starting "+name);
        Random rng = new Random();
        do {
            if(election.isLeader()) {
                System.out.println(name+": I am becoming the leader.");
                do {
                    if(rng.nextDouble() < 0.30) {
                        System.out.println(name+": I'm about to fail!");
                        try {
                            election.close();
                        } catch (InterruptedException e) { }
                        return;
                    }
                    System.out.println(name
                        +": I'm leader and handling a work element");
                    try {
                        Thread.sleep(rng.nextInt(1000));




                    } catch (InterruptedException e) { }
                } while(election.isLeader());
                //If we lose leadership but are still connected, try again.
                if(election.connected) {
                    try {
                        election.requestLeadership();
                    } catch (Exception e) {
                        return;
                    }
                }
            }
        } while(election.connected);
    }

}

To see it in action, build a little test harness with three different servers that all take leadership until they crash:

public class LeaderElectionTest {

    @Test
    public void test() throws Exception {
        Thread one = new Thread(new UnreliableServer("huey",
            "localhost", "/ducks"));
        Thread two = new Thread(new UnreliableServer("dewey",
            "localhost", "/ducks"));
        Thread three = new Thread(new UnreliableServer("louis",
            "localhost", "/ducks"));

        one.start();
        two.start();
        three.start();

        //Wait for all the servers to finally die
        one.join();
        two.join();
        three.join();
    }
}


This is an example of the output from this process. Each time the child list changes, the servers all check to see whether they are the leader. The one that is the leader starts processing until it "crashes." This continues until all three have crashed:

Starting huey
Starting dewey
Waiting for a change in the lock.
Waiting for a change in the lock.
Starting louis
Waiting for a change in the lock.
My node is: /ducks/n_0000000009
My node is: /ducks/n_0000000010
My node is: /ducks/n_0000000008
/ducks/n_0000000009 leader? false
/ducks/n_0000000010 leader? false
Waiting for a change in the lock.
Waiting for a change in the lock.
/ducks/n_0000000008 leader? false
Waiting for a change in the lock.
/ducks/n_0000000008 should become leader. I am /ducks/n_0000000008
/ducks/n_0000000008 leader? true
/ducks/n_0000000008 should become leader. I am /ducks/n_0000000009
louis: I am becoming the leader.
louis: I'm leader and handling a work element
/ducks/n_0000000008 should become leader. I am /ducks/n_0000000010
/ducks/n_0000000010 leader? false
/ducks/n_0000000009 leader? false
Waiting for a change in the lock.
Waiting for a change in the lock.
/ducks/n_0000000008 is the leader.
louis: I'm leader and handling a work element
/ducks/n_0000000008 is the leader.
louis: I'm leader and handling a work element
/ducks/n_0000000008 is the leader.
louis: I'm leader and handling a work element
/ducks/n_0000000008 is the leader.
louis: I'm about to fail!




/ducks/n_0000000009 should become leader. I am /ducks/n_0000000010
/ducks/n_0000000010 leader? false
/ducks/n_0000000009 should become leader. I am /ducks/n_0000000009
/ducks/n_0000000009 leader? true
Waiting for a change in the lock.
dewey: I am becoming the leader.
dewey: I'm leader and handling a work element
/ducks/n_0000000009 is the leader.
dewey: I'm leader and handling a work element
/ducks/n_0000000009 is the leader.
dewey: I'm about to fail!
/ducks/n_0000000010 should become leader. I am /ducks/n_0000000010
/ducks/n_0000000010 leader? true
huey: I am becoming the leader.
huey: I'm leader and handling a work element
/ducks/n_0000000010 is the leader.
huey: I'm leader and handling a work element
/ducks/n_0000000010 is the leader.
huey: I'm about to fail!

There are optimizations of this process that result in fewer messages being sent. In particular, a server awaiting leadership only really needs to watch the next smallest node in the list, not the entire set of children. When that node is deleted, the check to find the smallest node is repeated.
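As a rough sketch of that optimization (reusing the client, path, and node fields from the LeaderElection class above; this fragment is not part of the original example), the server can sort the children and place an exists watch only on its immediate predecessor:

List<String> children = client.getChildren(path, false);
Collections.sort(children);

String myName = node.substring(node.lastIndexOf('/') + 1);
int index = children.indexOf(myName);

if(index == 0) {
    // This server owns the smallest node and is therefore the leader.
} else {
    // Watch only the node just ahead of ours; the watch fires when it is
    // deleted, at which point the smallest-node check is repeated.
    String predecessor = path + "/" + children.get(index - 1);
    if(client.exists(predecessor, true) == null) {
        // The predecessor vanished between the calls; re-run the check.
    }
}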

The Curator Client

The popular Curator client is a wrapper around the native ZooKeeper Java API. It is commonly used in lieu of the ZooKeeper API directly because it handles some of the more complicated corner cases associated with ZooKeeper. In particular, it addresses the complicated issues around managing connections in the ZooKeeper client. The Curator client was originally developed by Netflix but has since been contributed to the Apache Foundation, where it entered the incubation process in March 2013 (http://curator.apache.org).

Adding Curator to a Maven Project

The Curator library has been broken into several different components, all of which are available from Maven Central. Adding the curator-recipes artifact to the project includes all of the relevant components:

<dependency>
  <groupId>org.apache.curator</groupId>


  <artifactId>curator-recipes</artifactId>
  <version>2.3.0</version>
</dependency>

Most users will only ever need this artifact because it also includes robust implementations of the recipes discussed later in this chapter. If the recipes should not be included, the framework can be imported by itself via the curator-framework artifact:

<dependency>
  <groupId>org.apache.curator</groupId>
  <artifactId>curator-framework</artifactId>
  <version>2.3.0</version>
</dependency>

Connecting to ZooKeeper

The Curator ZooKeeper client class is CuratorFramework, which is created by calling the newClient static method of the CuratorFrameworkFactory class. For example, connecting to a local ZooKeeper instance would look like this:

CuratorFramework client = CuratorFrameworkFactory.newClient(
    "localhost:2181",
    new ExponentialBackoffRetry(1000, 3)
);

The first argument is the usual ZooKeeper connection string. As usual, it can contain a comma-separated list of available hosts, one of which is selected for connection. The second argument is a retry policy, which is unique to Curator. In this case, the ExponentialBackoffRetry policy is used with a base sleep time of 1,000 milliseconds and a maximum of three retries. This policy retries connection attempts with increasingly longer waits between attempts.

Curator has several other retry policies available. The RetryNTimes policy is like the ExponentialBackoffRetry policy, except that it waits the same amount of time between attempts instead of increasing the wait. The RetryUntilElapsed policy also waits a fixed amount of time between retries, like the RetryNTimes policy. The difference is that it continues trying until a certain amount of time has elapsed instead of trying a fixed number of times.

The CuratorFrameworkFactory class can also return a connection Builder class instead of immediately creating the client. This is useful when using some of Curator's advanced features. The previously described connection looks like this with the Builder method:


CuratorFrameworkFactory.Builder builder = CuratorFrameworkFactory.builder()
    .connectString("localhost:2181")
    .retryPolicy(new ExponentialBackoffRetry(1000, 3));
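Any of the retry policies mentioned earlier can be plugged into the same retryPolicy call. The following is a brief sketch; the constructor arguments shown are illustrative values rather than recommendations:

// 3 retries with a fixed 1,000 ms pause between attempts
RetryPolicy nTimes = new RetryNTimes(3, 1000);

// keep retrying every 1,000 ms until 60 seconds have elapsed
RetryPolicy untilElapsed = new RetryUntilElapsed(60000, 1000);

builder.retryPolicy(untilElapsed);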

In addition to the basic connection string and retry policy, the connection can be allowed to enter a read-only state in the case of a network partition with the canBeReadOnly method:

builder.canBeReadOnly(true);

A namespace, which is prepended to all paths, can be set using the namespace method:

builder.namespace("subspace");

Finally, to create the CuratorFramework object, call the build method:

CuratorFramework client = builder.build();

After the CuratorFramework object has been created, the connection must be started using the start method. Most of the CuratorFramework methods will not work if the connection has not been started. The explicit start step gives applications the chance to bind Listener objects to the client before events are sent. This is particularly important when listening for connection state events. For example, the following code snippet, found in wiley.streaming.curator.ConnectTest, causes state changes to be printed to the console:

CuratorFramework client = CuratorFrameworkFactory.builder()
    .connectString("localhost:2181")
    .retryPolicy(new ExponentialBackoffRetry(1000, 3))
    .namespace("subspace")
    .build();
client.getConnectionStateListenable().addListener(
    new ConnectionStateListener() {
        public void stateChanged(
            CuratorFramework arg0, ConnectionState arg1
        ) {
            System.out.println("State Changed: "+arg1);
        }
    });
client.start();
Thread.sleep(1000);


In this example, the client is first built using the Builder interface with a namespace associated with it. Then a Listener is implemented to print out changes in state as they happen. Finally, the client is started using start. Running this code should print a CONNECTED state change, provided a short Thread.sleep call is added after start to give the client some time to connect:

log4j:WARN No appenders could be found for logger
(org.apache.curator.framework.imps.CuratorFrameworkImpl).
log4j:WARN Please initialize the log4j system properly.
State Changed: CONNECTED

Working with znodes

The Curator framework uses a "fluent" style for most of its commands. In this style of framework, most methods return an object that allows calls to be chained together. The Builder object in the last section was an example of this interface, and it is continued in all of Curator's znode manipulation routines.

Using the fluent style, Curator directly supports all of the basic znode operations as a series of chained method calls terminated by a call to forPath. The forPath method takes a complete path and an optional byte array containing any payload that should be associated with the znode. This example creates a new path in a synchronous fashion, creating parent znodes as needed:

client.create()
    .creatingParentsIfNeeded()
    .forPath("/a/test/path");

These methods can also be executed asynchronously by adding the inBackground call to the method chain:

client.create()
    .creatingParentsIfNeeded()
    .inBackground(new BackgroundCallback() {
        public void processResult(CuratorFramework arg0, CuratorEvent arg1)
            throws Exception {
            System.out.println("Path Created: "+arg1);
        }
    })
    .forPath("/a/test/path");

Note that the inBackground method returns a PathAndBytesable object rather than the original builder. This object only supports the forPath method,


so it must be last in the chain of calls. When executed, as in the example provided by wiley.streaming.curator.PathTest, this command should produce output something like this:

Path Created: CuratorEventImpl{ type=CREATE, resultCode=0, path='/a/test/path', name='/a/test/path', children=null, context=null, stat=null, data=null, watchedEvent=null, aclList=null }

To create paths with different modes, use the withMode method in the create command. For example, a queuing system might want to create znodes that are both persistent and sequential:

client.create()
    .withMode(CreateMode.PERSISTENT_SEQUENTIAL)
    .forPath("/queue/job");

Valid modes are PERSISTENT, EPHEMERAL, PERSISTENT_SEQUENTIAL, and EPHEMERAL_SEQUENTIAL. The mode can also be determined using the CreateMode.fromFlag method, which converts the flags used by the native ZooKeeper client to the appropriate CreateMode option, as demonstrated in the wiley.streaming.curator.ModeTest example.

The checkExists command has two uses. First, whether or not the command returns null provides the existence check implied by its name. Second, if the given path exists, a Stat object is returned with a variety of useful information about the path, including the number of children the path has, its version number, and its creation and modification times.

The data element or children of a znode are retrieved using the getData and getChildren commands, respectively. Like the create command, they can be executed in the background and always end with a forPath method. Unlike create, they do not take an optional byte array in their path. When used immediately, the getData command returns a byte array, which can be deserialized into whatever format the application desires. For example, if the data were a simple text string, it can be set and then retrieved as follows (assuming the znode already exists):

client.setData()
    .forPath("/test/string", "A Test String".getBytes());

System.out.println(


    new String(
        client.getData()
            .forPath("/test/string")
    )
);

Like the create command, getData can be executed in the background. In the background case, the forPath method returns null and the data is returned in the background callback. For example, the same code as the previous example implemented as a background callback would look like this:

client.getData().inBackground(new BackgroundCallback() {
    public void processResult(CuratorFramework arg0, CuratorEvent arg1)
        throws Exception {
        System.out.println(new String(arg1.getData()));
    }
}).forPath("/test/string");

The previous two examples are implemented in wiley.streaming.curator.DataTest.

The getChildren command works essentially the same way as the getData command. Instead of returning a byte array, it returns a list of String objects giving the names of the child znodes of the parent znode. The following example shows the command in action by first adding several jobs to an imaginary queue znode and then calling getChildren on the node to retrieve the list of children, as demonstrated in wiley.streaming.curator.QueueTest:

if(client.checkExists().forPath("/queue") == null)
    client.create().forPath("/queue");

for(int i = 0; i < 10; i++) {
    client.create()
        .withMode(CreateMode.PERSISTENT_SEQUENTIAL)
        .forPath("/queue/job-");
}

for(String job : client.getChildren().forPath("/queue")) {
    System.out.println("Job: "+job);
}

When this is run, this piece of code returns output like this:

Job: job-0000000003
Job: job-0000000002
Job: job-0000000001
Job: job-0000000000
Job: job-0000000007


Job: job-0000000006
Job: job-0000000005
Job: job-0000000004
Job: job-0000000009
Job: job-0000000008

Note that Curator does not do anything special with the getChildren call, so it has the same property of returning child nodes in an arbitrary order as the native ZooKeeper client. To implement a FIFO (first in, first out) queue, the returned List must be sorted before processing.

The Stat object can also be used in conjunction with other commands that read data from ZooKeeper, such as getData and getChildren, by using the storingStatIn method. This method takes a Stat object whose fields are populated by the command when it runs. For example, when retrieving the list of children from the /queue path as before:

Stat stat = new Stat();
List<String> jobs = client.getChildren()
    .storingStatIn(stat)
    .forPath("/queue");
for(String job : jobs)
    System.out.println("Job: "+job);
System.out.println("Number of children: "
    +stat.getNumChildren());

To remove a path, use the delete command. Like the native ZooKeeper client, this command can be made to act on a specific version with the withVersion method. This example checks to see whether the path exists and then deletes that specific version if it does. It also deletes any children the znode may have:

Stat stat;
if((stat = client.checkExists().forPath("/a/test/path")) != null)
    client.delete()
        .deletingChildrenIfNeeded()
        .withVersion(stat.getVersion())
        .forPath("/a/test/path");

Using Watches with Curator

The checkExists, getData, and getChildren commands support having watches set on them via the usingWatcher method. This method takes, as an argument, an object implementing either the native ZooKeeper Watcher interface or the Curator-specific CuratorWatcher interface. The Watcher interface has been described previously, and the CuratorWatcher is essentially identical. For example, the

following code uses a CuratorWatcher to observe changes to the existence of a specific path:

client.checkExists().usingWatcher(new CuratorWatcher() {

    public void process(WatchedEvent arg0) throws Exception {
        if (arg0.getType() == Watcher.Event.EventType.None) {
            //State change
        } else {
            System.out.println(arg0.getType());
            System.out.println(arg0.getPath());
        }
    }

}).forPath("/a/test/path");

Curator Recipes

In addition to implementing a robust and easy-to-use ZooKeeper client, Curator goes the extra mile by providing implementations of many of the recipes found on the ZooKeeper site. The recipes as given there are essentially skeletons, primarily provided to give clear examples of working with ZooKeeper in common situations. The Curator recipes take these basic skeletons and make them robust, and often more feature-rich, for use in production environments. They provide an excellent starting point for implementing high-level logic based on these common algorithmic structures.

This section covers the use of some of the various recipes, which include everything from the recipes section of the ZooKeeper website except the Multi-Version Concurrency Control implementation. In some cases, such as the Distributed Queue implementation, extensions to the basic algorithm not found on the ZooKeeper site are added.

Distributed Queues

The Curator website has a document called "Technical Note 4" that states that ZooKeeper is a terrible system for implementing queues. The reasons given include ZooKeeper's relatively small node payload, the need to keep the entire dataset in memory, and performance issues when a znode has a very large number of children.

In one sense, this is entirely correct. ZooKeeper would be a very poor choice for handling queues in an environment with a high message volume. A dedicated


queue management system such as ActiveMQ or RabbitMQ would be more appropriate in that situation. In a very high-volume environment, a data motion system like those discussed in the next chapter is even more appropriate. However, for something like a job queue, where relatively few items are parceled out to worker nodes, ZooKeeper can be a good and fairly lightweight solution.

In this latter situation, Curator provides several different distributed queue options, accessible via the QueueBuilder object. As an example, consider a simple distributed queue that needs to handle WorkUnit objects. This simple class serves as an example of an object that contains the information needed to perform some task. In this case, it only contains an identifier and a payload string:

public class WorkUnit implements Serializable {

    private static final long serialVersionUID = -2481441654256813101L;
    long workId;
    String payload;

    public WorkUnit() { }
    public WorkUnit(long workId, String payload) {
        this.workId = workId;
        this.payload = payload;
    }

    public String toString() {
        return "WorkUnit<"+workId+","+payload+">";
    }

}

To define a producer of WorkUnit objects, begin by defining a DistributedQueue object and creating the Curator client. In this case, the client is defined using a connection string and a single-shot retry policy. The client is then started so that it can begin processing ZooKeeper messages:

public class DistributedQueueProducer implements Runnable {

    DistributedQueue<WorkUnit> queue;

    public DistributedQueueProducer(String connectString) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
            connectString, new RetryOneTime(10)
        );
        client.start();

Next, the queue itself is created. The QueueBuilder’s builder method always takes four arguments: the Curator client, a QueueConsumer object,

a QueueSerializer object, and a path. In this case, the thread acts only as a Producer, so the QueueConsumer object can be left as null. Then, a DistributedQueue object is built and started. Like the client itself, the queue must be started before it can be used:

        queue = QueueBuilder.builder(client, null,
                new SimpleSerializer<WorkUnit>(), "/queues/work-unit")
            .buildQueue();
        queue.start();
    }

Curator does not ship with any built-in QueueSerializer implementations. It is left entirely to the application to implement an appropriate serializer for each queue. In this case, WorkUnit implements the Serializable interface, so simple Java serialization is used to convert the WorkUnit object to and from a byte array. The SimpleSerializer class implements a serializer that can be used for any class that implements Serializable:

public class SimpleSerializer<T extends Serializable>
    implements QueueSerializer<T> {

    public byte[] serialize(T item) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        byte[] value = null;
        try {
            ObjectOutput out = new ObjectOutputStream(bos);
            out.writeObject(item);
            value = bos.toByteArray();
            out.close();
        } catch(IOException e) { }
        try {
            bos.close();
        } catch(IOException e) { }
        return value;
    }

    @SuppressWarnings("unchecked")
    public T deserialize(byte[] bytes) {
        ByteArrayInputStream bin = new ByteArrayInputStream(bytes);
        Object object = null;
        try {
            ObjectInput in = new ObjectInputStream(bin);
            object = in.readObject();
            in.close();
        } catch(IOException e) {
        } catch (ClassNotFoundException e) { }
        try {
            bin.close();


        } catch(IOException e2) { }
        return (T)object;
    }
}

Now, all that remains is to start adding items to the queue. This simple example implements Runnable with a run loop that adds an item to the queue after a random wait of up to one second:

    Random rng = new Random();

    public void run() {
        System.out.println("Starting Producer");
        boolean cont = true;
        long id = 0;
        while(cont) {
            try {
                queue.put(new WorkUnit(id++, "Next Work Unit "+id));
                System.out.println("Added item");
            } catch (Exception e1) {
                e1.printStackTrace();
            }
            try {
                Thread.sleep(rng.nextInt(1000));
            } catch (InterruptedException e) {
                cont = false;
            }
        }
    }

Next is the QueueConsumer implementation. In this case, to maintain some symmetry with the producer, the QueueConsumer implementation also handles the creation of the DistributedQueue. It starts very much like the producer implementation:

public class DistributedQueueConsumer implements QueueConsumer<WorkUnit> {
    DistributedQueue<WorkUnit> queue;
    String name;

    public DistributedQueueConsumer(
        String connectString, String name) throws Exception {

        this.name = name;
        CuratorFramework client = CuratorFrameworkFactory.newClient(
            connectString, new RetryOneTime(10)
        );
        client.start();


queue = QueueBuilder.builder( client, this, new SimpleSerializer(), "/queues/work-unit" ) .buildQueue(); queue.start(); }

The only difference is that, rather than passing null as the QueueConsumer, the consumer passes itself. It also implements the QueueConsumer interface, which requires the implementation of two methods. In this case, the “processing” of a queue element simply emits the information found in the WorkUnit object:

    public void stateChanged(CuratorFramework arg0, ConnectionState arg1) {
        System.out.println(arg1);
    }

    public void consumeMessage(WorkUnit message) throws Exception {
        System.out.println(name + " consumed "
            + message.workId + "/" + message.payload);
    }

Finally, here is a simple process that starts two queue consumers to handle queue items and a queue producer to create new items:

DistributedQueueConsumer q1 = new DistributedQueueConsumer(
    "localhost", "queue 1"
);
DistributedQueueConsumer q2 = new DistributedQueueConsumer(
    "localhost", "queue 2"
);
// run() executes the producer loop on the calling (main) thread.
new Thread(new DistributedQueueProducer("localhost")).run();

When this process starts running, it first consumes any items left over in the queue from previous executions. It then starts producing and consuming items. Note that the items are assigned to arbitrary queue consumers depending on the watch that happens to get triggered by the ZooKeeper server:

Starting Producer
Added item
queue 2 consumed 0/Next Work Unit 1
Added item
queue 2 consumed 1/Next Work Unit 2
Added item
queue 2 consumed 2/Next Work Unit 3
Added item
queue 1 consumed 3/Next Work Unit 4


Added item
queue 2 consumed 4/Next Work Unit 5
Added item
queue 2 consumed 5/Next Work Unit 6
Added item
queue 1 consumed 6/Next Work Unit 7
Added item
queue 1 consumed 7/Next Work Unit 8
Added item
queue 2 consumed 8/Next Work Unit 9
Added item
queue 2 consumed 9/Next Work Unit 10
Added item
queue 2 consumed 10/Next Work Unit 11
Added item
queue 1 consumed 11/Next Work Unit 12

Leader Elections

Curator provides two recipes for implementing Leader Elections: a LeaderSelector class and a LeaderLatch class. The LeaderSelector class uses a callback mechanism to implement its functionality, whereas the LeaderLatch class uses a process similar to the earlier Leader Election example using the native Java API.

To use the LeaderSelector, the server trying to obtain leadership of the process must implement the LeaderSelectorListener interface. This interface specifies the takeLeadership method, which serves as the main method for the server. Returning from this method relinquishes the leadership position. For example, a dummy unreliable server might look like this:

public class UnreliableLeader implements LeaderSelectorListener {

    String name = "";
    boolean leader = false;
    String connect;

    public UnreliableLeader(String connect, String name) {
        this.connect = connect;
        this.name = name;
    }

    public void stateChanged(CuratorFramework arg0, ConnectionState arg1) {
        if(arg1 != ConnectionState.CONNECTED) leader = false;
    }

    public void takeLeadership(CuratorFramework client) throws Exception {
        leader = true;
        Random rng = new Random();
        while(leader) {
            if(rng.nextDouble() < 0.30) {
                System.out.println(name + ": crashing");
                return;
            }
            System.out.println(name + ": processing event");
            Thread.sleep(rng.nextInt(1000));
        }
    }
}
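The listener by itself does nothing until it is registered with a LeaderSelector. The following is a minimal sketch of that wiring (not part of the book's example code), assuming a local ZooKeeper at localhost:2181 and the same /leader_queue election path used in the LeaderLatch example below:

CuratorFramework client = CuratorFrameworkFactory.newClient(
    "localhost:2181", new RetryOneTime(10));
client.start();

// Register the listener; takeLeadership() is invoked when this client wins the election.
LeaderSelector selector = new LeaderSelector(
    client, "/leader_queue", new UnreliableLeader("localhost:2181", "server0"));
selector.autoRequeue();   // re-enter the election whenever takeLeadership() returns
selector.start();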

The LeaderLatch client takes a different approach, offering a hasLeadership method to check whether the server still has leadership and an await method to wait for leadership. Using the LeaderLatch looks something like this:

LeaderLatch latch = new LeaderLatch(client, "/leader_queue");
latch.start();
latch.await();
while(latch.hasLeadership()) {
    //Do work here
}

The wiley.streaming.curator.LeaderSelectionTest class uses the UnreliableLeader class to start three different leaders that then take over processing events from each other when one crashes:

Thread[] thread = new Thread[3];
for(int i = 0; i < thread.length; i++) {
    // Each pass creates an UnreliableLeader named "server" + i, registers it
    // with a LeaderSelector (as in the sketch above), and starts it on its
    // own thread so the three leaders compete in the election.
}


The output from this example looks something like this:

server0 is about to start waiting for leadership
server0: processing event
server0: processing event
server0: processing event
server0: processing event
server0: crashing
server1 is about to start waiting for leadership
server1: crashing
server2 is about to start waiting for leadership
server2: processing event
server2: processing event
server2: processing event
server2: crashing

Conclusion

This chapter has introduced the concepts behind maintaining shared state and coordinating between distributed processes. These tasks have been handled in an ad hoc manner by many distributed systems over the years, most typically by using a relational database. The relational database option generally works, but it introduces the potential for error when trying to implement a do-it-yourself distributed lock manager or other system.

To combat that problem, Yahoo! and other companies began to develop coordination tools that act as a service rather than as a component of a piece of software. This resulted in the development of ZooKeeper and its eventual contribution to the Apache project. The success of this approach is evident: most of the distributed software used in this book uses ZooKeeper for configuration and coordination. Other systems, such as etcd, are now being developed, but ZooKeeper remains the most popular by far.

The next few chapters of this book put ZooKeeper to work by discussing distributed systems. The heaviest user is the Kafka project, a data motion system discussed in Chapter 4. ZooKeeper is also heavily used to coordinate stream-processing applications such as Storm and Samza.

ZooKeeper is useful in end-user applications as well. It can help to manage distributed state such as data schemas. Additionally, it can provide controlled access to limited shared resources as needed. Its versatility and usefulness are why it was presented here in such detail.

CHAPTER 4

Data-Flow Management in Streaming Analysis

Chapter 3, "Service Configuration and Coordination," introduces the concept and difficulties of maintaining a distributed state. One of the most common reasons to require this distributed state is the collection and processing of data in a scalable way.

Distributed data flows, which include processing and collection, have been around a long time. Generally, the systems designed to handle this task have been bespoke applications developed either in-house or through consulting agreements. More recently, the technologies used to implement these data flow systems have reached the point of common infrastructure. Data flow systems can be split into a separate service in much the same way that coordination and configuration can. They are now general enough in their interfaces and their assumptions that they can be used outside of their originally intended applications.

The earliest of these systems were arguably the queuing systems, such as ActiveMQ, which started to come onto the scene in the early 2000s. However, they were not really designed for high-throughput volumes (although many of them can now achieve fairly good performance) and tended to be very Java centric.

The next systems on the scene were those open-sourced by the large Internet companies such as Facebook. One of the most well-known systems of this generation was a tool called Scribe, which was released in 2008. It used an RPC-like mechanism to concentrate data from edge servers into a processing framework like Hadoop. Scribe has many of the same features as the current generation, including the ability to spool data to disk, but it can only account for intermittent connectivity failures.

Flume, developed by Cloudera, and Kafka are the current generation of distributed data collection systems, and they represent two entirely separate philosophies. This chapter discusses the care and feeding of both of these data motion systems. In addition, there is some discussion of their underlying philosophies to help make the decision about which system to use in which situation. However, these two data motion systems should not be considered mutually exclusive. There is no reason that the two cannot both be used in a single environment according to need.

Distributed Data Flows

A distributed data flow system has two fundamental properties that should be addressed. The first is an "at least once" delivery semantic. The second is solving the "n+1" delivery problem. Without these, a distributed data flow will have difficulty successfully scaling. This section covers these two components and why they are so important to a distributed data flow.

At Least Once Delivery

There are three options for data delivery and processing in any sort of data collection framework:

■ At most once delivery
■ At least once delivery
■ Exactly once delivery

Many processing frameworks, particularly those used for system monitoring, provide "at most once" delivery and processing semantics. Largely, this is because the situations they were designed to handle do not require that all the data be transmitted, but they do require maximum performance to alert administrators to problems. In fact, many of these systems down-sample the data to further improve performance. As long as the rate of data loss is approximately known, the monitoring software can recover a usable value during processing.

In other systems—for instance, financial systems or advertising systems where logs are used to determine fees—every lost data record means lost revenue. Furthermore, audit requirements often mean that this data loss cannot be estimated through techniques used in the monitoring space. In this case, most implementations turn to "exactly once" delivery through queuing systems.


Popular examples include the Apache project's ActiveMQ queuing system, as well as RabbitMQ, along with innumerable commercial solutions. These servers usually implement their queue semantics on the server side, primarily because they are designed to support a variety of producers and consumers in an Enterprise setting. The "exactly once" delivery mechanism, of course, sacrifices the raw performance of the "at most once" delivery mechanism in an effort to provide this safety whether or not it is required.

The "at least once" delivery system tries to balance these two extremes by providing reliable message delivery while pushing the handling semantics to the consumer. By doing this, the consumer can use the delivered data stream to construct the semantics required by the application without affecting other consumers of the data stream. For example, a consumer requiring "exactly once" processing for something such as financial transactions between two accounts can implement a de-duplication procedure to ensure that messages are not reprocessed. Another consumer that cares only about tracking something like the number of unique account-transaction pairs does not need "exactly once" delivery because it uses idempotent set operations, which means it does not need to pay the management penalty. The general theory is that appropriate message handling, beyond ensuring that messages are delivered at least once, is dependent on the application logic and should be handled at that level.
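To make the consumer-side approach concrete, here is a minimal sketch (not taken from any of the systems discussed here) of a de-duplication step that turns at-least-once delivery into exactly-once processing by remembering the identifiers of messages that have already been applied. The DedupingHandler name and the in-memory set are illustrative assumptions; a production version would bound or persist this state.

import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: drops redelivered messages so downstream logic runs once per ID.
public class DedupingHandler {
    private final Set<Long> processedIds = new HashSet<Long>();

    // Returns true if the message was applied, false if it was a duplicate delivery.
    public boolean handle(long messageId, String payload) {
        if (!processedIds.add(messageId)) {
            return false; // already processed; at-least-once redelivery is ignored
        }
        applyExactlyOnce(payload);
        return true;
    }

    private void applyExactlyOnce(String payload) {
        System.out.println("Applying " + payload);
    }
}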

The "n+1" Problem

In "traditional" log processing systems, the architecture is essentially a funnel. The data from a potentially large number of edge systems is collected to a small number of central locations (usually one, but legal or physical restrictions sometimes necessitate more than one processing location). These processing locations then manipulate the data and perhaps move it along to another location, such as a data warehouse, for further analysis.

After this first funnel is in place, the inevitable happens: a second data consumer needs to be added, resulting in the construction of another funnel. After that, perhaps another front-end service is introduced that has its own data collection mechanism. Eventually, every time a new service or processing mechanism is added it must integrate with each of the other systems and, even worse, they must integrate with it. It may seem like an exaggeration, but this is actually a fairly common antipattern in the industry.

In the Enterprise space, this problem manifested itself in the form of the service-oriented architecture (SOA) and saw the development of the idea of the enterprise service bus (ESB). The idea was that the bus would handle the interaction with various services, which were implemented independently with no particular common protocol, and standardize the communication between them. In practice, these buses are most commonly used to move data between different "silos" within an organization, for example, between the sales organization's Salesforce implementation and the analytics team's in-house databases.

For the data flow tools discussed in this chapter, the opposite happens. In this case, the communication between the bus layer and each application is standardized, and the messaging system is primarily responsible for managing the physical flow between systems. This allows any number of producers and consumers to communicate through the common mechanism of the data flow protocol without having to worry about other producers and consumers.

Apache Kafka: High-Throughput Distributed Messaging

The Apache Kafka project was developed by LinkedIn to connect its website services and its internal data warehouse infrastructure. It was open-sourced in 2011 as a 0.6 release and accepted as an Apache Incubator project later that year. The current release of Kafka is 0.8, which added many new features, including a new producer message application programming interface (API) and internal replication. This section discusses the 0.8 series of releases, though the simpler 0.7.2 release is often still used by organizations because third-party clients are more plentiful for that version.

Design and Implementation

Kafka was purpose-built to solve specific problems for a specific company—in this case, LinkedIn. These problems were a high message volume and a large number of disparate services that needed to communicate. As of late 2013, and excluding messaging between datacenters that are mirroring data, LinkedIn reportedly processed approximately 60 billion messages per day. It also had a relatively large number of services that needed to be integrated, quite possibly hundreds of different producers and consumers. LinkedIn also had essentially complete control over its environment and software, allowing for the construction of a very opinionated data motion environment.

As it happens, this sort of environment is not unique to LinkedIn. Many companies that deal primarily with "Internet data" find themselves in the same situation. Additionally, many of them are engineering focused, meaning that most of their software is developed in-house rather than licensed from a third party. This allows these companies to use the Kafka model, and it is useful enough that a similar system, called Kinesis, was recently announced by Amazon.com. This product aims to make up a core part of the integration between various Amazon.com services, such as its key-value store Dynamo, its object storage service S3, its Hadoop infrastructure Elastic MapReduce, and its high-performance data warehouse Redshift.


This section covers the design of Kafka’s internals and how they integrate to solve the problems mentioned here.

Topics, Partitions, and Brokers

The organizing element of Kafka is the "topic." In the Kafka system, this is a physical partitioning of the data such that all data contained within the topic should be somehow related. Most commonly, the messages in a topic are related in that they can be parsed by the same common mechanism and not much else.

A topic is further subdivided into a number of partitions. These partitions are, effectively, the limit on the rate at which an I/O-bound consumer can retrieve data from Kafka. This is because clients often use a single consumer thread (or process) per partition. For example, with Camus, a tool for moving data from Kafka into the Hadoop Distributed File System (HDFS) using Hadoop, a Mapper can pull from multiple partitions, but multiple Mappers will not pull from the same partition. Partitions are also used to logically organize a topic. Producer implementations usually provide a mechanism to choose the Kafka partition for a given message based on the key of that message.

Partitions themselves are distributed among brokers, which are the physical processes that make up a Kafka cluster. Typically, each broker in the cluster corresponds to a separate physical server and manages all of the writes to that server's disk. The partitions are then uniformly distributed across the different brokers and, in Kafka 0.8 and later, replicas are distributed across other brokers in the cluster. If the number of brokers changes, partitions and their replicas can be reassigned to other brokers in the cluster.

When consuming from a topic, the consumer application, or consumer group if it is a distributed application, will assign a single thread or process to each partition. These independent threads then process each partition at their own pace, much like the Map phase of a Map-Reduce application. Kafka's high-level consumer implementation tracks the consumption of the various threads, allowing processing to be restarted if an individual process or thread is interrupted. While this model can be circumvented to improve consumption using Kafka's low-level interfaces, the preferred mechanism is to increase the number of partitions in a topic as necessary. To accomplish this task, Kafka provides tools to add partitions to existing topics in a live environment.
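For reference, topic and partition management is handled by shell scripts shipped in Kafka's bin directory. The script names and flags vary between releases, so the following invocation, which creates a four-partition topic named events with two replicas against a local ZooKeeper, should be treated as an illustrative sketch of the 0.8.0 tooling rather than definitive syntax:

$ ./bin/kafka-create-topic.sh --zookeeper localhost:2181 \
>     --topic events --partition 4 --replica 2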

Log-Structured Storage

Kafka is structured around an append-only log mechanism similar to the write-ahead log protocol found in database applications. The write-ahead log—which was apparently first developed by Ron Obermark in 1974 while he was working on IBM's System R—essentially says that before an object can be mutated, the log of its mutation must first have been committed to a recovery log. This forms an "undo" log of mutations that can be applied or removed from the database to reconstruct the state of the database at any particular moment in time. To accomplish this, the recovery log is structured such that every change is assigned a unique and increasing number (what mathematicians would call a strictly monotonically increasing sequence) and appended to the end of a theoretically infinite file. Often, this increasing number is the position of the record within the file because it is usually easy to obtain this information when appending to a file. Of course, in a practical system, no file can be infinitely large, so after a file reaches its maximum size, a new file is created and the offsets are reset. If the files are also named using a strictly monotonically increasing sequence, the semantics of the write-ahead log are completely maintained.

This approach is not only simple, but it also maximizes the performance of the storage media most often used to maintain these logs. Back when they were first being developed, write-ahead logs would have been written to tape storage media (each tape holding approximately 50MB of data). Tape, of course, works best with sequential writes. Even modern spinning media—hard drives—usually exhibit far superior sequential read/write performance relative to random reads and writes.

Kafka uses essentially the same system as the write-ahead log, maintaining separate logs for each partition of each topic in the system. Of course, Kafka is not a database. Rather than mutating a record in a database, Kafka does not make a message available to consumers until after it has been committed to the log, as shown in Figure 4-1. In this way, no consumer can consume a message that could potentially be lost in the event of a broker failure.

Figure 4-1: Messages appended to the log are "Available to Consumers" once committed; the last message, still "Being Written," is not.
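As an illustration of the offset scheme described above (and not of Kafka's actual storage code), the following toy sketch appends records to an in-memory segment, assigns each a strictly increasing offset, and signals when a new segment should be rolled; the LogSegmentSketch name and its fields are invented for this example.

import java.util.ArrayList;
import java.util.List;

// Toy model of log-structured storage: offsets increase monotonically and
// the segment rolls over once it reaches a maximum size.
public class LogSegmentSketch {
    private final long baseOffset;   // offset of the first record in this segment
    private final int maxRecords;    // stand-in for the segment size limit
    private final List<byte[]> records = new ArrayList<byte[]>();

    public LogSegmentSketch(long baseOffset, int maxRecords) {
        this.baseOffset = baseOffset;
        this.maxRecords = maxRecords;
    }

    // Appends a record and returns its offset, or -1 if the segment is full
    // and a new segment (named by its base offset) should be created.
    public long append(byte[] record) {
        if (records.size() >= maxRecords) {
            return -1L;
        }
        records.add(record);
        return baseOffset + records.size() - 1;
    }

    public long nextSegmentBaseOffset() {
        return baseOffset + records.size();
    }
}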

In a post to LinkedIn's developer blog, Jay Kreps, one of Kafka's developers, made an offhand comment describing Kafka as implementing "logs as a service." This could be taken quite literally, and Kafka could be used to, essentially, implement a traditional relational database management system (RDBMS) replication system. To do this, changes to a table in the database would be written to Kafka using the normal producer protocol. The database itself would be a consumer of the Kafka data (in 0.8 this can be accomplished using producer callbacks with the producer response set to "all" replicas) and would apply changes made to the tables by reading from Kafka. To improve recovery times, it would occasionally snapshot its tables to disk along with the last offset to be applied from each partition.

It would appear that Kafka's developers are also contemplating this style of application. The log compaction proposal presented at https://cwiki.apache.org/confluence/display/KAFKA/Log+Compaction describes an environment where the topic would be long-lived and allows consumers to recover the most recent value for a given key. In a database application, this would cover recovery and replication use cases. The key would represent the primary keys of tables, with the consumer application responsible for mutating a local copy into a format more suited to querying (such as a B-Tree).

Space Management

Although disk space is inexpensive, it is neither free nor infinite. This means that logs must eventually be removed from the system. Kafka, unlike many systems, uses a time-based retention mechanism for its logs: logs, usually grouped into 1GB files unless otherwise specified, are removed after a certain number of hours of retention. As a result, Kafka happily uses its entire disk and does have a failure mode in the 0.7.x series; it continues to accept messages despite being unable to successfully persist them to non-volatile storage.

At first glance, this seems like a recipe for disaster. However, it is generally much easier to manage available storage than it is to manage potential slow clients or requests for older data. Space management is generally a built-in component of core system monitoring and is easily remedied via the introduction of more disk per broker or by adding brokers and partitions.

Not a Queuing System

Kafka is very explicitly not a queuing system like ActiveMQ or RabbitMQ, which go to a lot of trouble to ensure that messages are processed in the order that they were received. Kafka's partitioning system does not maintain this structure. There is no defined order of writes and reads to partitions of a particular topic, so there is no guarantee that a client won't read from partitions in a different order than they were written. Additionally, it is not uncommon to implement producers asynchronously, so a message sent to one partition may be written after a message sent to another partition despite happening first, due to differences in latency or other nondeterministic events.

This is generally fine because, in many Internet applications, maintaining explicit ordering is overkill. The order of events is essentially arbitrary to begin with, so there is no real benefit to preserving one particular arbitrary order over another arbitrary order. Furthermore, the events where ordering is important usually occur far enough apart that they still appear to be ordered by any processing system.

Kafka also differs from many queuing systems in how it deals with message consumers. In many queuing systems, messages are removed from the system when they have been consumed. Kafka has no such mechanism for removing messages, instead relying on consumers to keep track of the offset of the last message consumed. Although the high-level consumer shipped with Kafka simplifies this somewhat by using ZooKeeper to manage the offset, the brokers themselves have no notion of consumers or their state.

Replication

With version 0.8, Kafka introduced true replication for a broker cluster. Previously, any notion of replication had to be handled via the cluster-to-cluster mirroring features discussed in the next section. However, that is merely a backup strategy rather than a replication strategy.

Kafka's replication is handled at the topic level. Each partition of a topic has a leader partition that is replicated by zero or more follower partitions (in Kafka 0.8, unreplicated topics are simply leaders without followers). These follower partitions are distributed among the different physical brokers with the intent to place each follower on a different broker. Implementations should ensure that they have sufficient broker capacity to support the number of partitions and replicas that will be used for a topic.

When a partition is created, all of these followers are considered to be "in-sync" and form the in-sync replica set (ISR). When a message arrives at the leader partition to be written, it is first appended to the leader's log. The message is then forwarded to each of the follower partitions currently in the ISR. After each of the partitions in the ISR acknowledges the message, the message is considered to be committed and can now be read by consumers. The leader also occasionally records a high watermark containing the offset of the most recently committed message and propagates it to the follower partitions in the ISR.

If the broker containing a particular replica fails, it is removed from the ISR and the leader does not wait for its response. This is handled through Kafka's use of ZooKeeper to maintain a set of "live" brokers. The leader then continues to process messages using the smaller pool of replicas in the ISR. When the replica recovers, it examines its last-known high watermark and truncates its log to that position. The replica then copies data from this position to the current committed offset of the leader. After the replica has caught up, it may be added back to the ISR, and everything proceeds as before.

The addition of replication to Kafka also introduced some changes to the Kafka Producer API. In versions of Kafka prior to 0.8 there was no acknowledgement in the Producer API. Applications wrote to the Kafka socket and hoped for the best. In Kafka 0.8, there are now three different levels of acknowledgement available: none, leader, and all.

The first option, none, is the same as in Kafka 0.7 and earlier; no response is returned to the producer. This is the least-durable situation and allows data to be lost, but it affords maximum performance, easily measured in the tens of thousands of messages per second. The second option, leader, sends an acknowledgement after the leader has received the message but before it has received acknowledgements from the ISR. This reduces performance somewhat and can still lead to data loss, but it offers a reasonable level of durability for most applications. The final option, all, sends the acknowledgement only after the leader has committed the message. In this situation, the data is not lost so long as at least one partition remains in the ISR. However, the performance reduction relative to the none case is significant, though much of this can be recovered with a large number of partitions and a highly concurrent Producer implementation.
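A minimal sketch of how this trade-off is chosen in producer configuration, assuming the 0.8 producer's request.required.acks property (which also appears in the Producers section later in this chapter) and taking 0, 1, and -1 to correspond to the none, leader, and all levels:

Properties props = new Properties();
props.put("metadata.broker.list", "localhost:9092");
props.put("serializer.class", "kafka.serializer.StringEncoder");

// 0  = none:   fire and forget, fastest, data can be lost
// 1  = leader: acknowledged once the leader has the message
// -1 = all:    acknowledged only after the ISR has committed the message
props.put("request.required.acks", "-1");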

Multiple Datacenter Deployments

Many web applications are latency sensitive, requiring them to be geo-distributed around the globe. The connections between these far-flung datacenters are, unsurprisingly, less reliable than connections within a datacenter. Kafka helps to deal with potential (and depressingly common) increased latency and complete connection loss between datacenters by providing built-in mirroring tools.

Using these tools, a Kafka cluster is established in each datacenter with a retention time designed to balance the need to cover an extended outage against the available space in the remote datacenter. If enough space is available, a longer retention time can be used as a guard against disaster. These remote clusters are then copied into the main processing cluster using a tool called MirrorMaker. This tool, which is shipped with Kafka, can read from multiple remote clusters and write the messages there into a single output cluster. Writes to the same topic in each cluster are merged into a single topic on the output cluster.

In addition to remote datacenter mirroring, the mirror facilities are also useful for development. A remote cluster can simply be mirrored into the development cluster to allow development using production data without risking production environments.


Configuring a Kafka Environment

This section describes how to start a Kafka environment. On development systems, a single broker is usually sufficient for testing, whereas a larger system should probably have a minimum of three to five brokers, depending on replication requirements. On the top end, quite large installations are possible. Chapter 5, "Processing Streaming Data," covers a tool called Samza, which uses Kafka for data processing in the same way that Hadoop Map-Reduce uses its distributed filesystem. In that instance, there could be dozens of brokers in the cluster all running Kafka.

This section covers everything needed to get Kafka up and running. There is also a small section on developing multi-broker applications on a single machine by using separate configurations.

Installing Kafka

Installing Kafka is simply a matter of unpacking the current version of Kafka, 0.8.0 at the time of writing, and configuring it. Kafka itself is largely written in a language called Scala, which has several versions. To avoid confusion between versions, the Kafka developers choose to include the build version of Scala in the archive name. This is mostly important for developers who are trying to use the libraries included in the archive for development. Currently, Kafka's binary distribution is built using Scala 2.8.0, so the archive is named kafka_2.8.0-0.8.0.tar.gz. You can download it from http://kafka.apache.org and then unpack it:

$ tar xvfz ~/Downloads/kafka_2.8.0-0.8.0.tar.gz
x kafka_2.8.0-0.8.0/
x kafka_2.8.0-0.8.0/libs/
x kafka_2.8.0-0.8.0/libs/slf4j-api-1.7.2.jar
x kafka_2.8.0-0.8.0/libs/zkclient-0.3.jar
x kafka_2.8.0-0.8.0/libs/scala-compiler.jar
x kafka_2.8.0-0.8.0/libs/snappy-java-1.0.4.1.jar
x kafka_2.8.0-0.8.0/libs/metrics-core-2.2.0.jar
x kafka_2.8.0-0.8.0/libs/metrics-annotation-2.2.0.jar
x kafka_2.8.0-0.8.0/libs/log4j-1.2.15.jar
x kafka_2.8.0-0.8.0/libs/jopt-simple-3.2.jar
x kafka_2.8.0-0.8.0/libs/slf4j-simple-1.6.4.jar
x kafka_2.8.0-0.8.0/libs/scala-library.jar
x kafka_2.8.0-0.8.0/libs/zookeeper-3.3.4.jar
x kafka_2.8.0-0.8.0/bin/
[ More Output Omitted ]
$


Most operating system configurations limit the number of open files allowed by a process to a relatively small number, such as 1024. Due to a number of factors, such as the number of topics, partitions, and the retention time, it is possible that Kafka will need to consume many more file handles than this under normal operation. Increasing this limit depends on the operating system, but on Linux it involves editing the /etc/security/limits.conf file. For example, adding the following nofile line (it stands for "number of files" rather than "no files") increases the allowable open files for the kafka user to 50,000:

kafka - nofile 50000

Kafka Prerequisites

Kafka's only external dependency is a ZooKeeper installation. In a production environment, refer to the installation guide in the previous chapter to get a ZooKeeper cluster running. For local development environments, Kafka ships with a preconfigured ZooKeeper server that can be used to host several brokers on a single machine.

To start this development version, Kafka includes a script, bin/zookeeper-server-start.sh, and a configuration file, config/zookeeper.properties. You can use these as they are to start the server in a terminal window:

$ cd kafka_2.8.0-0.8.0/
$ ./bin/zookeeper-server-start.sh config/zookeeper.properties
INFO Reading configuration from: config/zookeeper.properties (org.apache.zookeeper.server.quorum.QuorumPeerConfig)
WARN Either no config or no quorum defined in config, running in standalone mode (org.apache.zookeeper.server.quorum.QuorumPeerMain)
INFO Reading configuration from: config/zookeeper.properties (org.apache.zookeeper.server.quorum.QuorumPeerConfig)
INFO Starting server (org.apache.zookeeper.server.ZooKeeperServerMain)
[ Other Output Omitted ]

After ZooKeeper has been started, all that remains is to configure and start the Kafka brokers.

Configuring the Broker

The broker is the name of the Kafka server. Like ZooKeeper, it is configured with a simple properties file that defines the broker itself. Information about the topics and offset data is kept either with the data itself or in the ZooKeeper cluster. This section walks through this properties file and explains each of the settings and their effect on the cluster.

After some boilerplate containing the appropriate Apache licensing information, the first setting is the broker identifier. Unlike ZooKeeper, where the id is limited to 0-255, Kafka represents the broker as a unique unsigned 32-bit integer (internally, this is encoded as a Java long value, but the value must be positive, which limits the effective range to 32 bits):

# The id of the broker. This must be set to a unique integer
# for each broker.
broker.id=0

If you're using IPv4 addressing, an option for assigning unique broker ids is to use the IPv4 address converted to an unsigned 32-bit number. Many programming languages have a function called inet_aton that is used to convert a "dotted quad" IP address to an unsigned 32-bit integer, but this can also be calculated from the Linux Bash shell using the following one-liner, assuming that the bc utility is installed:

$ IP=$(ip=`hostname -I` && ip=( ${ip//\./ } ) \
>   && echo "(${ip[0]} * (2^24)) + (${ip[1]}*(2^16)) \
>   + (${ip[2]}*(2^8)) + ${ip[3]}" | bc) \
>   && echo "broker.id: $IP"

This command uses the Linux hostname command to obtain the "default" IP address using the -I option. This is broken up into its component parts by ip=( ${ip//\./ } ). Note that the trailing space after the "/" character is important and cannot be omitted. These component parts are then bit-shifted by multiplying them appropriately using the bc command.

The broker.id value should only be set during initial configuration, and the IP address is used only to ensure unique IDs during cluster setup. If the server receives a new IP address after being restarted, it should retain its current broker.id. Of course, problems can arise when new servers are added and receive an IP address that had been assigned during initial cluster configuration.

The next parameter is the port the broker uses for communication. The default is 9092, and there is little reason to change it unless multiple broker instances are to be run on the same machine (see the "A Multi-Broker Cluster on a Single Machine" section):

# The port the socket server listens on
port=9092

The host.name option is used to set the host name reported by the broker. In most circumstances, this can be left commented out because getCanonicalHostName() does the correct thing. If the compute environment is somewhat exotic and a server has multiple hostnames, it may be necessary to set this explicitly. One example of this is Amazon's EC2 environment. Each server in EC2 has at least two fully qualified domain names—an internal and an external domain—and may have other names assigned using Amazon's Route 53 DNS service. Depending on which services need to communicate with the cluster, the default name may not be appropriate to report. If this problem cannot be overcome at the operating system level, it is possible to explicitly define the name here so that metadata calls return the correct hostname.

# Hostname the broker will bind to and
# advertise to producers and consumers.
# If not set, the server will bind to all
# interfaces and advertise the value returned
# from java.net.InetAddress.getCanonicalHostName().
#host.name=localhost

These next two options control the number of threads that Kafka uses for processing network requests and managing the log files. In older versions of Kafka, there was only a single option to set the number of threads, kafka.num.threads. The new settings allow for greater flexibility in tuning the Kafka cluster. The default value for num.network.threads is 3, and for num.io.threads it is 8. The recommended 3 threads for num.network.threads is probably sufficient for most use cases. For num.io.threads, the recommendation is that the number of threads equal the number of disks used for log storage.

# The number of threads handling network requests
num.network.threads=3

# The number of threads doing disk I/O
num.io.threads=8

# The number of requests that can be queued for I/O before network
# threads stop reading
#queued.max.requests=

The next settings control the size of the socket buffers and requests. For multi-datacenter transfers, the LinkedIn reference configuration increases these values to 2MB from the 1MB configuration shown here. If these settings are commented out, the default is a relatively small 100KB buffer size, which may be too small for a high-volume production environment. It is usually not necessary to modify the maximum request size, which is set to 100MB in this case:

# The send buffer (SO_SNDBUF) used by the socket server
socket.send.buffer.bytes=1048576

# The receive buffer (SO_RCVBUF) used by the socket server
socket.receive.buffer.bytes=1048576

# The maximum size of a request that the socket server
# will accept (protection against OOM)
socket.request.max.bytes=104857600

The log.dirs property controls Kafka's log output location. These locations contain a number of partition directories of the form <topic>-<partition> and a file called replication-offset-checkpoint:

$ ls
analytics-0  audit-1  replication-offset-checkpoint

This text file contains the "high watermark" data described in the earlier replication section and should not be deleted. If multiple comma-separated directories are given to Kafka, it distributes partitions evenly across the directories. This is a tempting option for servers where JBOD (Just a Bunch of Disks) configurations are the default, such as Amazon's EC2. Unfortunately, Kafka simply distributes partitions randomly to the various disks rather than trying to, say, distribute partitions of the same topic across multiple disks. It is often the case that topics have very different data sizes, and by luck it can happen that different disks have very different usage characteristics. If possible, it is usually better to use RAID0 to combine the disks into a single large disk and use a single log directory.

# A comma separated list of directories under which to store log files
log.dirs=/tmp/kafka-logs

These next settings control the creation and management of topics.

# Disable auto creation of topics
#auto.create.topics.enable=false

# The number of logical partitions per topic per server.
# More partitions allow greater parallelism
# for consumption, but also mean more files.
num.partitions=2

The default replication factor in a Kafka cluster is 1, meaning that only the leader has the data. You can manually set the replication factor on a per-topic basis by explicitly creating the topic, but this replication factor will be used for all auto-created topics (if auto-creation is enabled).

# Set the default replication factor for a topic
#default.replication.factor=3

If replication is being used, the next few parameters control the maintenance of ISR sets. In general, there is no real reason to change most of these defaults, and you should use care if you do change them. In particular, setting the lag time or lag messages to be too small can make it impossible for replicas to remain in the ISR. The one setting that can sometimes improve performance is increasing num.replica.fetchers to improve replication throughput.

# The maximum time a leader will wait for a fetch request before
# evicting a follower from the ISR
#replica.lag.time.max.ms=10000

# The maximum number of messages a replica can lag the leader before it
# is evicted from the ISR
#replica.lag.max.messages=4000

# The socket timeout for replica requests to the leader
#replica.socket.timeout.ms=30000

# The replica socket receive buffer bytes
#replica.socket.receive.buffer.bytes=65536

# The maximum number of bytes to receive in each replica fetch request
#replica.fetch.max.bytes=1048576

# The maximum amount of time to wait for a response from the leader
# for a fetch request
#replica.fetch.wait.max.ms=500

# The minimum number of bytes to fetch from the leader during a request
#replica.fetch.min.bytes=1

# The number of threads used to process replica requests
#num.replica.fetchers=1

# The frequency with which the high watermark is flushed to disk
#replica.high.watermark.checkpoint.interval.ms=5000

The next group of settings controls the behavior of Kafka's disk input/output (I/O). These settings are used to balance the durability of the data, when replication is not in use, against Kafka's throughput for producers and consumers.

Two parameters, log.flush.interval.messages and log.flush.interval.ms, control the minimum and maximum times between flush events. The log.flush.interval.ms setting is the longest Kafka waits before flushing to disk, thereby defining an upper bound for flush events. The log.flush.interval.messages parameter causes Kafka to flush messages to disk after a number of messages have been received. If the partition collects messages faster than the interval time, this defines the lower bound on the flush rate. Without replication, the time between flush events is the amount of data that can be lost in the event of a broker failure. The default is 10,000 messages or 1,000 milliseconds (1 second), which is typically sufficient for most use cases.

Before setting these values to be larger or smaller, there are two things to consider: throughput and latency. The process of flushing to disk is the most expensive operation performed by Kafka, and increasing this rate reduces the throughput of the system as a whole. Conversely, if the flush is very large, it takes a relatively long time. Increasing the time between flushes can lead to latency spikes in responses to producers and consumers as a large bolus of data is flushed to disk.

You can control both of these parameters on a per-topic basis using the log.flush.intervals.messages.per.topic and log.flush.intervals.ms.per.topic parameters, respectively. These parameters take a comma-separated list of <topic>:<value> tuples.

# The number of messages to accept before forcing
# a flush of data to disk
log.flush.interval.messages=10000

# The maximum amount of time a message can sit in a
# log before we force a flush
log.flush.interval.ms=1000

# Per-topic overrides for log.flush.interval.ms
#log.flush.intervals.ms.per.topic=topic1:1000, topic2:3000

The next set of parameters controls the retention time for topics as well as the size of log segments. Logs are always removed by deleting an entire segment, so the size of the segment file combined with the retention time defines how quickly expired data is removed.

The basic parameter that controls retention is the log.retention.hours setting. When a log segment exceeds this age, it is deleted. This happens on a per-partition basis. There is also a log.retention.bytes setting, but it has little utility in practice. First, as discussed earlier, deleting log segments on space constraints is somewhat more difficult to manage than time. Second, log.retention.bytes uses a 32-bit integer rather than Java's 64-bit long. This limits topics to at most 2GB per partition per topic. Outside of development and testing, this is rarely large enough for a production environment.

# The minimum age of a log file to be eligible for deletion
log.retention.hours=168

# A size-based retention policy for logs. Segments are pruned from the
# log as long as the remaining segments don't drop
# below log.retention.bytes.
#log.retention.bytes=1073741824

The size of each segment is controlled by the log.segment.bytes and the log.index.size.max.bytes settings. The first parameter controls the size of the log segment, which is the file that contains the actual messages as they were sent to Kafka. The on-disk format is identical to the wire format so that messages can be quickly flushed to disk. The default size of this file is 1GB, and in the following example it has been set to 512MB for development purposes. The log.index.size.max.bytes parameter controls the size of the index file that accompanies each log segment. This file is used to store index information about the offsets for each message stored in the segment file. It is allocated 10MB of space using a sparse file method by default, but if it fills up before the segment file itself is filled, it causes a new log segment to roll over.

# The maximum size of a log segment file. When this size is
# reached a new log segment will be created.
log.segment.bytes=536870912

The rate at which segments are checked for expiration is controlled by log.cleanup.interval.mins. This defaults to checking every minute, which is usually sufficient.

# The interval at which log segments are checked to see if they can
# be deleted according to the retention policies
log.cleanup.interval.mins=1

As mentioned in the "Kafka Prerequisites" section, Kafka uses ZooKeeper to manage the status of the brokers in a given cluster and the metadata associated with them. The zookeeper.connect parameter defines the ZooKeeper cluster that the broker uses to expose its metadata. This takes a standard ZooKeeper connection string, allowing for comma-separated hosts. It also allows the brokers to take advantage of ZooKeeper's chroot-like behavior by specifying a default path in the connection string. This can be useful when hosting multiple independent Kafka clusters in a single ZooKeeper cluster.

# Zookeeper connection string (see zookeeper docs for details).
# This is a comma separated host:port pairs, each corresponding to a zk
# server. e.g. "127.0.0.1:3000,127.0.0.1:3001,127.0.0.1:3002".
# You can also append an optional chroot string to the urls to specify
# the root directory for all kafka znodes.
zookeeper.connect=localhost:2181
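As an illustration of the chroot behavior mentioned above, two independent Kafka clusters could share one ZooKeeper ensemble by giving each a distinct path suffix; the host names and /kafka/... paths below are only examples:

# Brokers in cluster A
zookeeper.connect=zk1:2181,zk2:2181,zk3:2181/kafka/cluster-a

# Brokers in cluster B
zookeeper.connect=zk1:2181,zk2:2181,zk3:2181/kafka/cluster-b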

Like other ZooKeeper clients, Kafka allows for control of the connection and session timeout values. These are controlled by the zookeeper.connection.timeout.ms and zookeeper.session.timeout.ms settings, respectively, and both are set to 6,000 milliseconds (6 seconds) by default.

# Timeout in ms for connecting to zookeeper
zookeeper.connection.timeout.ms=1000000

# Timeout in ms for the zookeeper session heartbeat.
#zookeeper.session.timeout.ms=6000


With the introduction of replication, Kafka 0.8 also introduces a controlled shutdown behavior for the broker. The setting controlled.shutdown.enabled, which is active by default, controls this behavior. When it is active and a leader receives a shutdown command, it attempts to reassign control of each of the topic partitions it leads to another broker in the ISR. It attempts to do this controlled.shutdown.max.retries times, backing off by controlled.shutdown.retry.backoff.ms milliseconds between each attempt. After the last retry it shuts down, possibly uncleanly.

This feature allows maintenance to be more easily performed on the cluster. Assuming that consumers and producers are aware of changes to the metadata, moving the leadership position to another broker can allow the entire cluster to be upgraded without any perceptible downtime. Unfortunately, most clients do not have good recovery facilities, so this is presently not possible.

#controlled.shutdown.enabled=true
#controlled.shutdown.max.retries=3
#controlled.shutdown.retry.backoff.ms=5000

You can find most of the settings shown in this section in the config/server.properties file of the Kafka installation. This properties file is mostly intended for development purposes. In production, most of the defaults can be used, except for perhaps the socket buffering options. The broker.id, log.dirs, and zookeeper.connect properties must be set in all cases.

Starting a Kafka Broker Cluster

After a proper Kafka configuration file has been created, you can start the broker using the kafka-server-start.sh script. This script takes the properties file containing the configuration:

$ ./bin/kafka-server-start.sh config/server.properties

The server normally starts in the foreground because it doesn't have a built-in daemon mode. A common pattern for starting the Kafka server in the background, such as in an init.d script, is to write the output to a log file and background the server:

$ ./bin/kafka-server-start.sh \
>     config/server.properties > kafka.log 2>&1 &

A Multi-Broker Cluster on a Single Machine

Sometimes during the course of development it is desirable to have more than one Kafka broker available to ensure that applications can appropriately deal with partitions on multiple servers, the management of reading from the ISR, and so on. For example, developing a client for a new language should always be done against a multi-broker cluster.


Starting multiple brokers on a single machine is a fairly straightforward task. Each broker must be assigned a separate broker id, port, and log directory. For example, the following three files could be used:

config/server-1.properties:
broker.id=1
port=6001
log.dirs=/tmp/kafka-logs-1
zookeeper.connect=localhost:2181

config/server-2.properties:
broker.id=2
port=6002
log.dirs=/tmp/kafka-logs-2
zookeeper.connect=localhost:2181

config/server-3.properties:
broker.id=3
port=6003
log.dirs=/tmp/kafka-logs-3
zookeeper.connect=localhost:2181

Starting each of the brokers is very similar to starting a single broker on a server. The only difference is that different Java Management Extensions (JMX) ports need to be specified. Otherwise, the second and third Kafka brokers will fail to start because the first broker has already bound the JMX port.
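In the stock start scripts the JMX port is typically read from the JMX_PORT environment variable, so one way to start the three brokers side by side (the port numbers here are arbitrary examples) is something like the following:

$ JMX_PORT=9991 ./bin/kafka-server-start.sh config/server-1.properties > broker-1.log 2>&1 &
$ JMX_PORT=9992 ./bin/kafka-server-start.sh config/server-2.properties > broker-2.log 2>&1 &
$ JMX_PORT=9993 ./bin/kafka-server-start.sh config/server-3.properties > broker-3.log 2>&1 &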

Interacting with Kafka Brokers

Kafka is designed to be accessible from a variety of platforms, but it ships with Java libraries because it natively runs on the Java Virtual Machine (JVM). In its native clients, it provides one option for applications that write to Kafka and two options for applications that read from Kafka.

To use Kafka from Java, the Kafka library and the Scala runtime must be included in the project. Unfortunately, due to the way Scala is developed, libraries are generally tied to the Scala version they are built against. For pure Java projects, this does not matter, but it can be annoying for Scala projects. These examples use only Java. The version of Kafka built against Scala 2.8.0 is used:

<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka_2.8.1</artifactId>
    <version>0.8.1-SNAPSHOT</version>
</dependency>

<dependency>
    <groupId>org.scala-lang</groupId>
    <artifactId>scala-library</artifactId>
    <version>2.8.1</version>
</dependency>

Producers

To use the Kafka Producer, a Properties object is either constructed or loaded from a .properties file. In production systems, the properties are usually loaded from a file to allow them to be changed without having to recompile the project. In this simple example, the Properties object is constructed directly. These Properties are then passed to the constructor of the ProducerConfig object:

public static void main(String[] args) throws Exception {

    Properties props = new Properties();
    props.put("metadata.broker.list",
        "localhost:6001,localhost:6002,localhost:6003");
    props.put("serializer.class", "kafka.serializer.StringEncoder");
    props.put("request.required.acks", "1");
    ProducerConfig config = new ProducerConfig(props);

This configuration object is then passed on to the Producer object, which takes a type parameter for both its Key and its Message. Although the Key is generally optional, it is important that the serializer.class defined in the configuration properties can successfully encode the objects used for the Key and the Message. In this case, Strings are being used for both the Key and the Message:

    String topic = args[0];
    Producer<String, String> producer = new Producer<String, String>(config);

This simple example reads lines from the console and sends them to Kafka. The Producer itself takes KeyedMessage objects either singly or as part of a List collection. These messages are constructed with a topic, a key, and a message. If the key is not required, only the topic and the message need to be supplied, as in this example:

    BufferedReader reader = new BufferedReader(
        new InputStreamReader(System.in));
    String toSend;
    // Read until the console input stream is closed.
    while((toSend = reader.readLine()) != null) {
        producer.send(new KeyedMessage<String, String>(topic, toSend));
    }
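If the partition should be chosen from the message key, as described earlier in the "Topics, Partitions, and Brokers" section, the three-argument KeyedMessage constructor can be used instead; the user-ID key and the message text in this sketch are only illustrative:

    // The key is handed to the producer's partitioner, so all messages that
    // share a key (here, a user ID) land in the same partition of the topic.
    producer.send(new KeyedMessage<String, String>(
        topic, "user-42", "clicked checkout button"));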


Consumers

Applications have two options for Kafka Consumer implementations. The first is a high-level implementation that uses ZooKeeper to manage and coordinate a consumer group. The other is a low-level SimpleConsumer implementation for applications that need to manage their own offset information or need access to features not found in the high-level consumer, such as the ability to reset their offsets to the beginning of the stream.

This example uses the high-level consumer API to print messages produced by the Producer on the console. Like the Producer example, it begins with the construction of the ConsumerConfig object from a Properties object:

public static void main(String[] args) {
    Properties props = new Properties();
    props.put("zookeeper.connect", "localhost:2181");
    props.put("group.id", "console");
    props.put("zookeeper.session.timeout.ms", "500");
    props.put("zookeeper.sync.timeout.ms", "500");
    props.put("auto.commit.interval.ms", "500");
    ConsumerConfig config = new ConsumerConfig(props);

Most Consumer applications are multithreaded, and many read from multiple topics. To account for this, the high-level Consumer implementation takes a map of topics with the number of message streams that should be created for each topic. These streams are usually distributed to a thread pool for processing, but in this simple example only a single message stream will be created for the topic:

    String topic = args[0];
    HashMap<String, Integer> topicMap = new HashMap<String, Integer>();
    topicMap.put(topic, 1);

    ConsumerConnector consumer =
        Consumer.createJavaConsumerConnector(config);
    Map<String, List<KafkaStream<byte[], byte[]>>> consumerMap =
        consumer.createMessageStreams(topicMap);
    KafkaStream<byte[], byte[]> stream = consumerMap.get(topic).get(0);

The KafkaStream object defines an iterator, with a class of ConsumerIterator, which is used to retrieve data from the stream. Calling the next method on the iterator returns an object that can be used to extract the Key and the Message:

  ConsumerIterator<byte[], byte[]> iter = stream.iterator();
  while(iter.hasNext()) {


    String message = new String(iter.next().message());
    System.out.println(message);
  }
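The single stream created above generalizes to the multithreaded case mentioned earlier. The following sketch, in which the class and method names are illustrative rather than part of Kafka, submits each stream returned by createMessageStreams to its own worker thread:

import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import kafka.consumer.ConsumerIterator;
import kafka.consumer.KafkaStream;

public class StreamDispatcher {
  public static void dispatch(
      Map<String, List<KafkaStream<byte[], byte[]>>> consumerMap,
      String topic, int threads) {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    for (final KafkaStream<byte[], byte[]> stream : consumerMap.get(topic)) {
      pool.submit(new Runnable() {
        public void run() {
          // Each thread drains one stream; ordering is preserved per partition.
          ConsumerIterator<byte[], byte[]> iter = stream.iterator();
          while (iter.hasNext()) {
            System.out.println(new String(iter.next().message()));
          }
        }
      });
    }
  }
}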

Apache Flume: Distributed Log Collection

Kafka was designed and built within a specific production environment. In this case, its designers essentially had complete control over the available software stack. In addition, they were replacing an existing component with similar producer/consumer semantics (though different delivery semantics), making Kafka relatively easy to integrate from an engineering perspective.

In many production environments, developers are not so lucky. There are often a number of pre-existing legacy systems and data collection or logging paths. In addition, for a variety of reasons, it is not possible to modify these systems' logging behavior. For example, the software could be produced by an outside vendor with its own goals and development roadmap. To integrate into these environments, there is the Apache Flume project. Rather than providing a single opinionated mechanism for log collection, like Kafka, Flume is a framework for integrating the data ingress and egress of a variety of data services. Originally, this was primarily the collection of logs from things like web servers with transport to HDFS in a Hadoop environment, but Flume has since expanded to cover a much larger set of use cases, sometimes blurring the line between the data motion discussed in this chapter and the processing systems discussed in Chapter 5.

This section introduces the Flume framework. The framework ships with a number of out-of-the-box components that can be used to construct immediately useful infrastructure. These basic components and their usage are covered in fair detail. It is also possible to develop custom components, which is usually done to handle business-specific data formats or processing needs. Flume, like Kafka, is developed on top of the JVM, so extending the framework is fairly straightforward.

The Flume Agent

Flume uses a network-flow model as its primary organizing feature, which is encapsulated into a single machine known as an agent. These agent machines are roughly analogous to Kafka's brokers and minimally consist of three components: a source, a channel, and a sink. An agent can have any number of these basic flows, organized as shown in Figure 4-2.


Figure 4-2: A Flume agent hosting flows, each made up of a source, a channel, and a sink

As shown in Figure 4-2, the source is the start of any flow on an agent. Sources are components that either retrieve or receive data from an outside source. In most cases, this is data from an application. For example, a web server may produce filesystem logs. To retrieve them, a source that can tail log files is used as part of an agent installed on the machine running the web server.

After the source has received data, it transfers this data to one or more channels. Channels define the persistence and transaction semantics. Depending on the channel used, the agent can store its data to disk or a database, or it can keep it in memory. This allows different performance and reliability characteristics to be selected for each agent.

A source may also choose to write its data to multiple channels, as shown in Figure 4-3. This is typically used to multiplex data for better load balancing and processing parallelism. It can also be used to replicate data or provide multiple data paths for the entire stream. To define how this happens, the source definition uses a selector. The default selector is a replication selector that writes the same data to all channels. The other built-in option is the multiplexing selector. This selector uses information from the data element to determine the destination channel of each piece of data arriving at the source.

Figure 4-3: A single source writing to multiple channels, each drained by its own sink

Sinks remove data from the channels and relay it to either a final output, such as HDFS, or on to another agent. Unlike sources, sinks are each attached


to a single channel. Sending the output from multiple sinks to a single agent is used to perform de-multiplexing. Sinks can also be coordinated to some extent using sink processors. There are three built-in sink processors: the default processor, a load-balancing processor, and a failover processor. These are described in more detail later in this chapter.

Configuring the Agent

Agents are configured, much like Kafka, using a properties file. It defines both the arrangement of sources, channels, and sinks as well as the configuration of the individual elements. The arrangement is the same for all combinations, whereas the individual configurations depend on the type of each of the elements. The configuration begins with the definition of the sources, channels, and sinks for each agent running on the machine, with the following basic form:

<agent-name>.sources= <source-name> ...
<agent-name>.channels= <channel-name> ...
<agent-name>.sinks= <sink-name> ...

The bracketed entries are replaced with unique names that are used to identify these elements elsewhere in the properties file. For example, defining a properties file with two agents might look something like this:

agent_1.sources= source-1
agent_1.channels= channel-1 channel-2
agent_1.sinks= sink-1 sink-2

agent_2.sources= source-1
agent_2.channels= channel-1
agent_2.sinks= sink-1

In this example, the two agents are configured to each have a single source. The first agent, agent_1, uses two channels and two sinks to either replicate or multiplex its input. The second agent, agent_2, is more simply defined with a single channel and sink. Note that the source, channel, and sink names do not need to be unique across different agents. The configuration is always with respect to an agent definition.

The next section binds the sources and the sinks for each agent to their appropriate channels. The source is bound with properties of the form <agent-name>.sources.<source-name>.channels, and the sink is bound to its channel with <agent-name>.sinks.<sink-name>.channel. In this example, each source is bound to all of its agent's channels, and each sink is bound to its corresponding channel:

agent_1.sources.source-1.channels= channel-1 channel-2
agent_1.sinks.sink-1.channel= channel-1
agent_1.sinks.sink-2.channel= channel-2


agent_2.sources.source-1.channels= channel-1
agent_2.sinks.sink-1.channel= channel-1

The Flume Data Model

Before continuing with the configuration of a Flume agent, some discussion of the Flume internal data model is required. This data model is defined by the Event data structure. The interface to this structure is defined by Flume as follows:

package org.apache.flume;

import java.util.Map;

/**
 * Basic representation of a data object in Flume.
 * Provides access to data as it flows through the system.
 */
public interface Event {

  public Map<String, String> getHeaders();

  public void setHeaders(Map<String, String> headers);

  public byte[] getBody();

  public void setBody(byte[] body);

}

Like the Kafka Message, the Event payload is an opaque byte array called the Body. The Event also allows for the introduction of headers, much like an HTTP call. These headers provide a place for metadata that is not part of the message itself.

Headers are usually introduced by the source to add metadata associated with the opaque event. Examples might include the host name of the machine that generated the event or a digest signature of the event data. Many interceptors modify the header metadata as well. The metadata is used by Flume to help direct events to their appropriate destination. For example, the multiplexing selector uses the header structure to determine the destination channel for an event. How each component modifies or uses the metadata is discussed in detail in the sections devoted to each selector.
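Code that needs to create events directly, such as the custom source later in this chapter, typically uses Flume's EventBuilder helper rather than implementing the interface by hand. A minimal sketch, with purely illustrative header values:

import java.nio.charset.Charset;
import java.util.HashMap;
import java.util.Map;

import org.apache.flume.Event;
import org.apache.flume.event.EventBuilder;

public class EventExample {
  public static Event build() {
    // Header keys and values here are only illustrative.
    Map<String, String> headers = new HashMap<String, String>();
    headers.put("host", "web-01.example.com");
    headers.put("timestamp", Long.toString(System.currentTimeMillis()));
    // The body is an opaque byte array; EventBuilder handles the encoding.
    return EventBuilder.withBody("a log line", Charset.defaultCharset(), headers);
  }
}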

Channel Selectors

If multiple channels are to be used by a source, as is the case with the agent_1 source in the previous example, some policy must be used to distribute data


across the different channels. Flume's default policy is to use a replicating selector to control the distribution, but it also has a built-in multiplexing selector that is used to load balance or partition inputs to multiple sinks. It is also possible to implement custom channel selection behavior.

Replicating Selector

The replicating selector simply takes all inputs and sends them to all channels for further processing. Because it is the default, no configuration is necessary, but it may be explicitly defined for a source by setting the selector.type property to replicating:

agent_1.sources.source-1.selector.type= replicating

The only additional option for the replicating selector is the ability to ignore failures to write to specific channels. To do this, the selector.optional property is set with a space-separated list of channels whose failures may be ignored. For example, this setting makes agent_1's second channel optional:

agent_1.sources.source-1.selector.optional= channel-2

Multiplexing Selector

The other built-in channel selector is the multiplexing selector. This selector is used to distribute data across different channels depending on the Event's header metadata. It is activated by changing the source's selector.type to multiplexing:

agent_1.sources.source-1.selector.type= multiplexing

By default, the channel is determined from the flume.selector.header header in the metadata, but it can be changed with the selector.header property. For example, this changes the selector to use the shorter timezone header:

agent_1.sources.source-1.selector.header= timezone

The multiplexing selector also needs to know how to distribute the values from this header to the various channels assigned to the source. The target channel(s) are specified with the selector.mapping properties. The values must be explicitly mapped for each possible header value, with a default channel specified by the selector.default property. For example, to map Pacific Daylight Time (PDT) and Eastern Daylight Time (EDT) to channel-2 and all other events to channel-1, the configuration would look like this:

agent_1.sources.source-1.selector.mapping.PDT= channel-2
agent_1.sources.source-1.selector.mapping.EDT= channel-2
agent_1.sources.source-1.selector.default= channel-1


It is arguable that the need to explicitly specify the mapping of a header to a channel significantly reduces the utility of the multiplexing selector. In fact, this makes it mostly useful as a routing tool rather than a mechanism for improving processing parallelism. In that case, a better choice would be a load-balancing sink processor, which is described later in this chapter.

Implementing Custom Selectors

Flume also supports the implementation of custom selectors. To use one in a configuration, simply specify the fully qualified class name (sometimes abbreviated as FQCN) as the selector.type. For example, this tells Flume to use the RandomSelector for agent_1's source:

agent_1.sources.source-1.selector.type= wiley.streaming.flume.RandomSelector

Depending on the selector implementation, other configuration parameters might be needed for it to operate properly. The selector itself is implemented by extending the AbstractChannelSelector class:

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

import org.apache.flume.Channel;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.channel.AbstractChannelSelector;

public class RandomChannelSelector extends AbstractChannelSelector {

  List<List<Channel>> outputs = new ArrayList<List<Channel>>();
  List<Channel> empty = new ArrayList<Channel>();
  Random rng = new Random();

This class requires the implementation of three different methods. The first is the configuration method, configure, which should extract information about the channels assigned to this source as well as any other properties that might have been set in the configuration:

public void configure(Context context) {

Interacting with the context object is quite simple. There are a number of "getters" defined on the context class that can be used to extract configuration parameters. These properties have already been specialized to the appropriate selector. For example, if the class needed to access the "optional" property, like the replicating selector, then the following code would get the string associated with the property:

String optionalList = context.getString("optional");

However, RandomSelector only needs the list of channels associated with this source. This method, getAllChannels, is actually part of the


AbstractChannelSelector class, so it is called directly. Because the RandomSelector sends each event to one channel, list elements are constructed for later use:

  for(Channel c : getAllChannels()) {
    List<Channel> x = new ArrayList<Channel>();
    x.add(c);
    outputs.add(x);
  }

The other two methods that must be implemented by a custom channel selector are getRequiredChannels and getOptionalChannels. Each method gets a single argument in the form of an Event, whose metadata can be checked. The RandomSelector does not check metadata; it simply returns a random list containing a single channel, as constructed in the configuration method:

public List<Channel> getRequiredChannels(Event event) {
  return outputs.get(rng.nextInt(outputs.size()));
}

public List<Channel> getOptionalChannels(Event event) {
  return empty;
}
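Assembled, the fragments above form the complete selector shown below. Note that the example configuration earlier referenced wiley.streaming.flume.RandomSelector; whatever name the class is actually given, the fully qualified name in selector.type must match it.

package wiley.streaming.flume;

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

import org.apache.flume.Channel;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.channel.AbstractChannelSelector;

public class RandomChannelSelector extends AbstractChannelSelector {

  List<List<Channel>> outputs = new ArrayList<List<Channel>>();
  List<Channel> empty = new ArrayList<Channel>();
  Random rng = new Random();

  public void configure(Context context) {
    // Build one single-element list per channel so a random one
    // can be returned for each event.
    for (Channel c : getAllChannels()) {
      List<Channel> x = new ArrayList<Channel>();
      x.add(c);
      outputs.add(x);
    }
  }

  public List<Channel> getRequiredChannels(Event event) {
    return outputs.get(rng.nextInt(outputs.size()));
  }

  public List<Channel> getOptionalChannels(Event event) {
    return empty;
  }
}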

Flume Sources

One of the chief draws for Flume is its ability to integrate with a variety of systems. To support this, Flume ships with a number of built-in source components. It also supports the ability to implement custom sources in much the same manner as the channel selector discussed in the previous section.

Avro Source

Avro is a structured data format that originated in the Apache Hadoop project. Philosophically similar to other data formats, such as protocol buffers from Google and Thrift from Facebook, it is the native data interchange format for Flume. All three of these systems use an Interface Definition Language (IDL) to define structures that are similar to C structs or simple Java classes. The formats use this description to encode structures in their native language into a high-performance binary wire protocol that is easily decoded by another service with access to the schema defined by the IDL. In the case of Avro, the schema language is JSON, making it easy to define data structures that can be implemented by a variety of languages.

Like Thrift, Avro also defines a remote procedure call (RPC) interface. This allows "methods" defined by Avro to be executed via a command sent over a TCP/IP connection. It is this RPC interface that Flume implements to pass events into the agent.


The Avro source in Flume is assigned to a source definition by specifying the type as avro. It also takes bind and port properties that define the network interface and port the agent should listen on for Avro RPC calls. For example, the following options bind the Avro listener to all network interfaces defined by the system on port 4141:

agent_1.sources.source-1.type=avro
agent_1.sources.source-1.bind=0.0.0.0
agent_1.sources.source-1.port=4141

The special IP address 0.0.0.0 should be used with care. If the agent is running on a server that has an external IP address, it binds to that IP address as well. In most cases, this is harmless because the server almost certainly sits behind a firewall that protects it from most mischief. However, it may also open the port to internal services that should not have access. For example, a production server could give inadvertent access to a staging server. If that staging server were to be accidentally misconfigured, it could potentially inject inappropriate data into the system.

Although the type, bind, and port properties are required, the Avro source also allows for some optional properties. The optional properties allow for compression of the Avro data stream by setting the compression.type to deflate. The stream may also be encrypted via SSL (Secure Sockets Layer) by setting the ssl property to true and pointing the keystore property to the appropriate Java KeyStore file. The keystore-password property specifies the password for the KeyStore file itself. Finally, the type of KeyStore is specified by keystore-type, defaulting to JKS.
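Applications usually send events to an Avro source through Flume's client SDK rather than hand-rolled Avro RPC. A minimal sketch, assuming the Flume SDK jar is on the classpath and that the source above is listening on localhost port 4141:

import java.nio.charset.Charset;

import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class AvroClientExample {
  public static void main(String[] args) throws EventDeliveryException {
    // Connect to the Avro source configured above.
    RpcClient client = RpcClientFactory.getDefaultInstance("localhost", 4141);
    try {
      Event event = EventBuilder.withBody("hello flume", Charset.defaultCharset());
      client.append(event);
    } finally {
      client.close();
    }
  }
}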

Thrift Source

Prior to the release of Flume and Kafka, Thrift and Scribe were (and still are) popular choices for collecting logs in new infrastructures. The Thrift source allows Flume to integrate with existing Thrift/Scribe pipelines. After you set the type to thrift, you must set the address binding and the port as well. Like the Avro source, take care when setting the address binding to 0.0.0.0:

agent_1.sources.source-1.type=thrift
agent_1.sources.source-1.bind=0.0.0.0
agent_1.sources.source-1.port=4141

Netcat Source

Even simpler than the Avro and Thrift sources is the Netcat source. For those unfamiliar with netcat, it is a utility found in most UNIX-like operating systems that can write or read to a raw TCP socket in much the same way that the cat command reads and writes to file descriptors.


The Netcat source, like the Thrift and Avro sources, binds to a specific address and port after you have set the type to netcat:

agent_1.sources.source-1.type=netcat
agent_1.sources.source-1.bind=0.0.0.0
agent_1.sources.source-1.port=4141

The Netcat source reads in newline-delimited text lines with, by default, a maximum length of 512 bytes (not characters, which are of variable length). To increase this limit, modify the max-line-length property:

agent_1.sources.source-1.max-line-length=1024

To provide a measure of backpressure, the Netcat source also positively acknowledges every event it receives with an OK response. If needed, you can disable this acknowledgement by setting the ack-every-event property to false:

agent_1.sources.source-1.ack-every-event=false

Syslog Source

Syslog is a standard for system logging that was originally developed as part of the Sendmail package in the 1980s. It has since grown into a formal standard, governed originally by RFC 3164, "The BSD Syslog Protocol," and now by RFC 5424, "The Syslog Protocol." Flume supports RFC 3164 and much of RFC 5424 through three flavors of Syslog source.

The first is a single-port TCP source called syslogtcp. Like other TCP-based sources, it takes the usual bind address and port:

agent_1.sources.source-1.type=syslogtcp
agent_1.sources.source-1.bind=0.0.0.0
agent_1.sources.source-1.port=5140

The maximum size of the events is controlled with the eventSize property, which defaults to 2,500 bytes:

agent_1.sources.source-1.eventSize=2500

Flume also supports a more modern multiport implementation of the TCP Syslog source. To use it, change the type to multiport_syslogtcp and use the ports property instead of the port property when configuring the source:

agent_1.sources.source-1.type=multiport_syslogtcp
agent_1.sources.source-1.bind=0.0.0.0
agent_1.sources.source-1.ports=10001 10002 10003

The Syslog source then monitors all of the ports for syslog events in an efficient way. In addition to the eventSize property supported by the single-port version, the multiport implementation adds the ability to control the character

set used to interpret the events on a per-port basis. The default character set, which is normally UTF-8, is controlled by the charset.default property, and individual ports are set with charset.port.<port number>. This example sets the default to UTF-16 and port 10003 to use UTF-8:

agent_1.sources.source-1.charset.default=UTF-16
agent_1.sources.source-1.charset.port.10003=UTF-8

The Syslog source is implemented using Apache Mina and exposes some of the tuning parameters available in that library. Generally speaking, these settings do not need to be changed, but in some virtual environments changing them might be necessary. The first parameters are the batch size and the read buffer, set to 100 events and 1,024 bytes, respectively, by default:

agent_1.sources.source-1.batchSize=100
agent_1.sources.source-1.readBufferSize=1024

The number of processors can also be controlled through configuration. The Syslog source spawns two reader threads for every processor and attempts to auto-detect the number of available processors. In virtual environments, this auto-detection is the most likely setting to cause problems because the apparent number of processors is not necessarily indicative of the processors actually available. To override auto-detection, set the number of processors using the numProcessors property:

agent_1.sources.source-1.numProcessors=4

Finally, there is also a UDP-based flavor of the Syslog source. For this source, only the address binding and port are required, with the type set to syslogudp:

agent_1.sources.source-1.type=syslogudp
agent_1.sources.source-1.bind=0.0.0.0
agent_1.sources.source-1.port=5140

No other options are available for this source, which, because it uses UDP, is usually only suitable for situations that do not require reliable event delivery.

HTTP Source

The HTTP source allows Flume to accept events from services that can make HTTP requests. For many potential event sources, this offers the ability to also add rich metadata headers to the event without having to implement a custom source. This source, like all other TCP-based sources, needs a binding and a port, and it sets the type property to http:

agent_1.sources.source-1.type=http
agent_1.sources.source-1.bind=0.0.0.0
agent_1.sources.source-1.port=8000


Events are transmitted to the HTTP source using the POST method, and the source returns either a 200 or a 503 response code. The 503 response code is used when a channel is full or otherwise disabled. An application, upon receiving this response, should retry the event or send it through some other mechanism.

After an event arrives at the HTTP source, it must be converted into Flume events. This is done using a pluggable Handler class, with a JSONHandler and a BlobHandler class provided by Flume. The handler, which defaults to the JSONHandler, is controlled by the handler property. This property takes the fully qualified class name of a class that implements the HTTPSourceHandler interface:

agent_1.sources.source-1.handler=org.apache.flume.source.http.JSONHandler

The JSONHandler expects an array of JSON objects. Each of these objects contains a headers object as well as a body object, which are used to populate the Flume event’s header metadata and body byte array, respectively. Even a single event must be wrapped in an array for processing, as in this example:

[{ "headers":{ "timestamp": "34746432", "source": "somehost.com" }, "body":"the body goes here." }]

Note that the values for each of the header elements must be a string, even if the real value is a number. The Flume event object only supports strings in the header metadata. The BlobHandler is much simpler, being designed to take in relatively large binary objects. No attempt is made by the handler to interpret the events.
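To illustrate the request format, the following sketch posts a single JSON-wrapped event to the HTTP source using nothing but the JDK. It assumes the source configured above is listening on port 8000; the header and body values are arbitrary.

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class HttpSourceClient {
  public static void main(String[] args) throws Exception {
    // POST a single JSON-wrapped event to the HTTP source.
    URL url = new URL("http://localhost:8000/");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("POST");
    conn.setRequestProperty("Content-Type", "application/json");
    conn.setDoOutput(true);

    String json = "[{\"headers\":{\"source\":\"test\"},\"body\":\"hello flume\"}]";
    OutputStream out = conn.getOutputStream();
    out.write(json.getBytes("UTF-8"));
    out.close();

    // 200 means the event was accepted; 503 means the channel is full.
    System.out.println("Response: " + conn.getResponseCode());
    conn.disconnect();
  }
}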

WARNING Events are buffered in memory, so very large elements could fill the Java heap allocated to the Flume process. This causes Flume to crash and, depending on the channel, could cause a loss of data. Therefore, it is important to restrict the size of objects sent to a BlobHandler.

Java Message Service (JMS) Source

The Java Message Service (JMS) source allows Flume to integrate with Java-based queuing systems. These systems are popular mechanisms for moving data around when the entire stack is Java based. The JMS source has primarily been tested with ActiveMQ, but it should be able to attach to any JMS provider.

Connecting to a JMS provider requires some familiarity with JMS. It minimally requires providing an initial context factory name, a connection factory, and a provider URL. It also requires a destination name and a destination type that

may be either a queue or a topic. For example, to connect to a local ActiveMQ server with a queue named DATA, you could use the following configuration:

agent_1.sources.source-1.type=jms
agent_1.sources.source-1.initialContextFactory= org.apache.activemq.jndi.ActiveMQInitialContextFactory
agent_1.sources.source-1.connectionFactory= GenericConnectionFactory
agent_1.sources.source-1.providerURL= tcp://localhost:61616
agent_1.sources.source-1.destinationName= DATA
agent_1.sources.source-1.destinationType= QUEUE

JMS sources support the notion of converters, which convert inputs from the JMS queue into Flume events. Although it is possible to implement a custom converter, the default converter is able to convert object, bytes, and text messages into native Flume events. The properties on the JMS message are converted to headers on the Flume events.
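From the application side, producing a message that this source picks up is ordinary JMS. The following sketch targets the ActiveMQ broker and DATA queue configured above; the property set on the message is arbitrary and simply demonstrates how message properties become Flume headers.

import javax.jms.Connection;
import javax.jms.MessageProducer;
import javax.jms.Queue;
import javax.jms.Session;
import javax.jms.TextMessage;

import org.apache.activemq.ActiveMQConnectionFactory;

public class JmsExample {
  public static void main(String[] args) throws Exception {
    // Send a text message to the DATA queue read by the JMS source above.
    ActiveMQConnectionFactory factory =
        new ActiveMQConnectionFactory("tcp://localhost:61616");
    Connection connection = factory.createConnection();
    connection.start();
    Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
    Queue queue = session.createQueue("DATA");
    MessageProducer producer = session.createProducer(queue);

    TextMessage message = session.createTextMessage("hello flume");
    message.setStringProperty("origin", "jms-example"); // becomes a Flume header
    producer.send(message);

    connection.close();
  }
}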

Exec Source

The Exec source allows the output of any command to be used as an input stream. It is often used to emulate the Tail source that was part of Flume prior to version 1.0. Setting the type to exec and providing a command property is all that is required under normal circumstances. For example, this command tails a log file for agent_1:

agent_1.sources.source-1.type= exec
agent_1.sources.source-1.command= tail -F /var/log/logfile.log

The -F switch, which tells tail to account for log rotation, should be used in place of the more common -f switch. There are also a number of optional parameters that can be set as part of this source. For commands that need it, a shell can be specified with the shell property. This allows for the use of shell features such as wildcards and pipes:

agent_1.sources.source-1.shell= /bin/sh -c

When the command exits, it can be restarted. This is disabled by default; enable it by setting the restart property to true:

agent_1.sources.source-1.restart= true

When restarting a command, the restartThrottle property controls how quickly the command is restarted (in milliseconds):

agent_1.sources.source-1.restartThrottle= 1000

Normally, the Exec source only reads the command's standard output. The command's standard error stream can also be captured by setting the logStdErr property:

agent_1.sources.source-1.logStdErr= true


To improve performance, the Exec source reads a batch of lines before sending them on to their assigned channel(s). The number of lines, which defaults to 20, is controlled with the batchSize property:

agent_1.sources.source-1.batchSize= 20

UNIDIRECTIONAL SOURCES DO NOT SUPPORT “BACKPRESSURE”

If at all possible, avoid sources like the Exec source in production environments. Although sources like the Exec source are very tempting, they render any reliability guarantees made by the Flume framework meaningless. The framework has no mechanism for communicating with the event generator. It cannot report channels that have filled or otherwise become unavailable. The end result is that the generator continues to blindly write to the source, and data is potentially lost. Fortunately, the most common use case is to emulate the Tail source (which was removed for a reason), and that can usually be better implemented using the Spool Directory source.

Spool Directory Source

The most common method of recording data for services that do not use a data motion system is logging to the file system. Early versions of Flume tried to take advantage of this by implementing a Tail source that allowed Flume to watch the logs as they were being written and move them into Flume. Unfortunately, there are a number of problems with this approach if the goal is to provide a reliable data motion pipeline.

The Spool Directory source attempts to overcome the limitations of the Tail source by providing support for the most common form of logging: logs with rotation. To use the Spool Directory source, the application must set up its log rotation such that when a log file is rolled it is moved to a spool directory. This is usually fairly easy to do with most logging systems. The Spool Directory source monitors this directory for new files and writes them to Flume. This minimally requires the definition of the spoolDir property along with setting the type to spooldir:

agent_1.sources.source-1.type=spooldir
agent_1.sources.source-1.spoolDir=/var/logs/

A common pattern for applications is to write to a file ending in .log and then to roll it to another log file ending in a timestamp. Normally, the active log file causes errors for the Spool Directory source, but it can be ignored by setting the ignorePattern property to ignore files ending in .log:

agent_1.sources.source-1.ignorePattern=^.*\.log$

When the Spool Directory source finishes processing a file, it marks the file by appending .COMPLETED to the filename, which is controlled by the fileSuffix

property. This allows log cleanup tools to identify files that can be safely deleted from the file system. The source can also delete the files when it is done processing them by setting the deletePolicy property to immediate:

agent_1.sources.source-1.fileSuffix=.DONE
agent_1.sources.source-1.deletePolicy=immediate

By default, the source assumes that the records in the files are text separated by newlines. This is the behavior of the LINE deserializer, which ships with Flume. The base Flume installation also ships with AVRO and BLOB deserializers. The AVRO deserializer reads Avro-encoded data from the files according to a schema specified by the deserializer.schemaType property. The BLOB deserializer reads a large binary object from each file, usually with just a single object per file. Like the HTTP BlobHandler, this can be dangerous because the entire blob is read into memory. The deserializer is chosen via the deserializer property:

agent_1.sources.source-1.deserializer=LINE

If needed, custom deserializers can be supplied. Any class that implements the EventDeserializer.Builder interface can be used in place of the built-in deserializers.

Implementing a Custom Source

Given the array of sources built in to Flume, it is rarely necessary, but it is possible to implement a custom source. In Flume, there are two types of sources: a Pollable source and an Event-Driven source. Pollable sources are executed by Flume in their own thread and queried repeatedly to move data onto their associated channels. An Event-Driven source is responsible for maintaining its own thread and placing events on its channels asynchronously.

In this section, an Event-Driven source connecting Flume to Redis is implemented to demonstrate the process. Redis is a key-value store discussed in depth in Chapter 6, "Storing Streaming Data." In addition to providing storage, it also provides an ephemeral publish/subscribe interface that is used to implement the source.

A custom source always extends the AbstractSource class and then implements either the Pollable or the EventDrivenSource interface. Most sources also implement the Configurable interface to process configuration parameters. In this case, the Redis source implements both the Configurable and the EventDrivenSource interfaces:

package wiley.streaming.flume;

import java.nio.charset.Charset;

import org.apache.flume.Context;


import org.apache.flume.EventDrivenSource;
import org.apache.flume.channel.ChannelProcessor;
import org.apache.flume.conf.Configurable;
import org.apache.flume.event.EventBuilder;
import org.apache.flume.source.AbstractSource;

import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPool;
import redis.clients.jedis.JedisPoolConfig;
import redis.clients.jedis.JedisPubSub;

public class RedisSource extends AbstractSource implements Configurable, EventDrivenSource {

The Configurable interface has a single configure method, which is used to obtain the Redis hostname, port, and subscription channel. In this case, all of the configuration parameters have defaults:

  String redisHost;
  int redisPort;
  String redisChannel;
  Jedis jedis;

  public void configure(Context context) {
    redisHost = context.getString("redis-host", "localhost");
    redisPort = context.getInteger("redis-port", 6379);
    redisChannel = context.getString("redis-channel", "flume");
  }

The EventDrivenSource interface does not actually require any methods to be implemented; it is only used to determine the type of the source. Any threads or event handlers are set up and torn down in the start and stop methods. In this case, most of the work is done in the start method:

@Override
public synchronized void start() {
  super.start();
  processor = getChannelProcessor();

First, the ChannelProcessor is obtained for future use. Recall that the ChannelProcessor decides which Channel will be used based on the Event. The ChannelProcessor is used by the subscription thread that is established next:

  JedisPool pool = new JedisPool(
      new JedisPoolConfig(), redisHost, redisPort);
  jedis = pool.getResource();
  new Thread(new Runnable() {
    public void run() {


      jedis.subscribe(pubSub, redisChannel);
    }
  }).start();
}

The Jedis client for Redis blocks its thread when the subscribe command is called, and all subsequent communication is sent to a JedisPubSub object. To avoid blocking the main Flume thread, the blocking operation is done in a separate thread in the start method. The JedisPubSub object has several different methods defined, but only one is necessary here:

ChannelProcessor processor;
JedisPubSub pubSub = new JedisPubSub() {

  @Override
  public void onMessage(String arg0, String arg1) {
    processor.processEvent(
        EventBuilder.withBody(arg1, Charset.defaultCharset()));
  }

This method uses the EventBuilder class to construct an event that is then forwarded to the ChannelProcessor to be placed on a channel. The stop method unsubscribes from the channel, which unblocks the Jedis thread and lets the process complete:

@Override
public synchronized void stop() {
  pubSub.unsubscribe();
  super.stop();
}

Using this source is very similar to using any other source. The type property is set to the source's fully qualified class name, and any configuration parameters are set:

agent_1.sources.source-1.type=wiley.streaming.flume.RedisSource
agent_1.sources.source-1.redis-channel=events
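To exercise the source, any Redis client can publish on the subscribed channel. For example, using the same Jedis library, with the channel name matching the events setting above:

import redis.clients.jedis.Jedis;

public class RedisSourceSmokeTest {
  public static void main(String[] args) {
    // Publish a test message on the channel the RedisSource subscribes to.
    Jedis jedis = new Jedis("localhost", 6379);
    jedis.publish("events", "hello flume");
    jedis.disconnect();
  }
}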

Flume Sinks

Flume sinks sit on the other side of a Flume channel, acting as a destination for events. Flume ships with a large number of built-in sinks, most of which are focused on writing events to some sort of persistent store. Some examples of sinks include an HDFS sink that writes data to Hadoop, a File Roll sink that persists events to disk, and an Elasticsearch sink that writes events to the Elasticsearch data store.


All of these sinks are important parts of a full-fledged data pipeline, but they are not particularly applicable to processing streaming data. As such, they are mostly left aside in this section in favor of the sinks that are more useful for stream processing.

Avro Sink

The Avro sink's primary use case is to implement multi-level topologies in Flume environments. It is also a good way of forwarding events on to a stream processing service for consumption. The Avro sink has a type property of avro and minimally requires a destination host and port:

agent_1.sinks.sink-1.type= avro
agent_1.sinks.sink-1.hostname= localhost
agent_1.sinks.sink-1.port= 4141

There are also a number of optional parameters that are used to control the behavior of the sink. The connection and reconnect timeouts can be controlled using the appropriate properties, along with the ability to specify a reconnection interval. This can be useful when the destination is a load-balanced connection, as load can often only be balanced at connection time. By occasionally reconnecting, the load can be more evenly distributed:

agent_1.sinks.sink-1.connect-timeout= 20000
agent_1.sinks.sink-1.request-timeout= 10000
agent_1.sinks.sink-1.reset-connection-interval= 600000

Connections can also be compressed by setting the compression-type property to deflate, the only supported option:

agent_1.sinks.sink-1.compression-type= deflate
agent_1.sinks.sink-1.compression-level= 5

Thrift Sink

The Thrift sink is largely identical to the Avro sink. It also requires a hostname and a port and can be controlled with the same timeout behavior. However, it does not support Avro's compression or security features:

agent_1.sinks.sink-1.type= thrift
agent_1.sinks.sink-1.hostname= localhost
agent_1.sinks.sink-1.port= 4141


Implementing Custom Sinks

When communicating with stream processing software, it may be necessary to implement a custom sink. The process for implementing a custom sink is similar to that for implementing a custom source. The custom sink class extends the AbstractSink class and optionally implements the Configurable interface.

The following example implements a Redis sink to match the Redis source from earlier in the chapter. Rather than subscribing to a publish/subscribe channel, the Redis sink publishes events to the channel. Thus, the implementation begins with the same configuration as the Redis source:

public class RedisSink extends AbstractSink implements Configurable {

  String redisHost;
  int redisPort;
  String redisChannel;
  JedisPool pool;
  Jedis jedis;

  public void configure(Context context) {
    redisHost = context.getString("redis-host", "localhost");
    redisPort = context.getInteger("redis-port", 6379);
    redisChannel = context.getString("redis-channel", "flume");
  }

The start and stop methods are simpler in the implementation of the sink because the process does not need to be asynchronous to the primary thread:

@Override
public synchronized void start() {
  super.start();
  pool = new JedisPool(new JedisPoolConfig(), redisHost, redisPort);
  jedis = pool.getResource();
}

@Override
public synchronized void stop() {
  pool.returnResource(jedis);
  pool.destroy();
  super.stop();
}

The process method retrieves events from the channel and sends them to Redis. For simplicity, only the body is transmitted in this example, but a more sophisticated sink might convert the output to another form so that it could preserve the header metadata:

public Status process() throws EventDeliveryException {
  Channel channel = getChannel();


  Transaction txn = channel.getTransaction();
  Status status = Status.READY;
  try {
    txn.begin();
    Event event = channel.take();
    if(jedis.publish(redisChannel,
        new String(event.getBody(), Charset.defaultCharset())) > 0) {
      txn.commit();
    } else {
      // No subscriber received the message, so return the event to the
      // channel and ask the sink runner to back off before retrying.
      txn.rollback();
      status = Status.BACKOFF;
    }
  } catch(Exception e) {
    txn.rollback();
    status = Status.BACKOFF;
  } finally {
    txn.close();
  }
  return status;
}

Sink Processors

Sink processors are a mechanism for controlling the parallelism of a channel. Essentially, sink processors are used to define groups of sinks that are then used for load balancing or, possibly, for failover. For streaming applications, the failover application is not as useful as the load-balancing application.

To activate load-balancing sink processing, set the processor.type property of a sink group to load_balance and use the group's sinks property to define the sinks that belong to the group:

agent_1.sinkgroups= group-1
agent_1.sinkgroups.group-1.sinks= sink-1 sink-2
agent_1.sinkgroups.group-1.processor.type= load_balance
agent_1.sinkgroups.group-1.processor.selector= random

After this definition, the channel is assigned to the group rather than the sink, and events will be randomly sent to sink-1 or sink-2. The load-balancing processor can also use a round-robin selection method by changing the selector to round_robin instead of random.

Flume Channels

Flume channels are the temporary storage mechanism between a source and a sink. There are three different types of channel used in production Flume environments: memory channels, file channels, and JDBC channels. These channels are listed in order of increasing safety and decreasing performance.


Memory Channel

The Memory channel is the fastest of the available Flume channels. It is also the least safe because it stores its events in memory until a sink retrieves them. If the process goes down for any reason, any events left in the Memory channel are lost.

This channel is selected by assigning it a type of memory. In addition, there are several optional parameters that can be set. The most useful of these are the capacity and transactionCapacity settings, which define the number of events that can be held before the channel is considered to be filled. Appropriate values depend greatly on the size of the events and the specific hardware used, so some experimentation is often required to determine optimal values for a given environment:

agent_1.channels.channel-1.type= memory
agent_1.channels.channel-1.capacity= 1000
agent_1.channels.channel-1.transactionCapacity= 100

File Channel

The File channel, which persists events to disk, is the most commonly used Flume channel. To increase durability, it writes events placed in the channel to disk using a checkpoint-and-data-log system similar to the one used by Kafka. The difference is that it enforces size constraints rather than time constraints, in the same way as the Memory channel. Most File channel definitions only need to set the type property and perhaps the checkpoint and data file locations, as in this example:

agent_1.channels.channel-1.type= file
agent_1.channels.channel-1.checkpointDir= /home/flume/checkpoint
agent_1.channels.channel-1.dataDirs= /home/flume/data

Like Kafka, the File channel can specify multiple data directories to take advantage of JBOD configurations. Also like Kafka, using a RAID0 configuration will probably yield better performance in practice due to the ability to parallelize writes at the block level.

Java Database Connectivity (JDBC) Channel

Despite the name, the JDBC channel does not actually allow connections to arbitrary databases from Flume. A better name for it would be the Derby channel, as


it only allows connections to an embedded Derby database. This channel is the slowest of the three, but it has the most robust recoverability. To use it, simply set the channel's type to jdbc. To define the path of the database files, also set the sysprop.user.home property:

agent_1.channels.channel-1.type= jdbc
agent_1.channels.channel-1.sysprop.user.home= /home/flume/db

There are other properties that can be set, but they do not generally need to be changed.

Flume Interceptors

Interceptors are where Flume begins to blur the line between data motion and data processing. The interceptor model allows Flume not only to modify the metadata headers of an event but also to filter events according to those headers. Interceptors are attached to a source and are defined and bound in much the same way as channels, using the interceptors property:

agent_1.sources.source-1.interceptors= i-1 i-2

The order in which interceptors are defined in the configuration is the order in which they are applied to the event. Because interceptors have the ability to drop an event entirely, awareness of this ordering can be important if a filtering interceptor depends on a header set by an earlier interceptor.

Flume provides a number of useful interceptors that can be used to decorate events or provide useful auditing information during stream processing. Some of the most commonly used interceptors are outlined in this section.

Timestamp Interceptor

The Timestamp interceptor adds the current timestamp as a metadata header on the event with the key timestamp. There is an optional preserveExisting property that controls whether or not the timestamp is overwritten if it already exists on the event:

agent_1.sources.source-1.interceptors.i-1.type=timestamp
agent_1.sources.source-1.interceptors.i-1.preserveExisting=true

Host Interceptor

The Host interceptor attaches either the hostname of the Flume agent or its IP address to the header metadata. The default header is called host,

but it can be changed using the hostHeader property. It also supports the preserveExisting property:

agent_1.sources.source-1.interceptors.i-1.type=host
agent_1.sources.source-1.interceptors.i-1.hostHeader=agent
agent_1.sources.source-1.interceptors.i-1.useIP=false
agent_1.sources.source-1.interceptors.i-1.preserveExisting=true

This interceptor is useful for maintaining audit trails during processing.

Static Interceptor

The Static interceptor is used to add a static header to all events. It takes two properties, key and value, which are used to set the header:

agent_1.sources.source-1.interceptors.i-1.type=static
agent_1.sources.source-1.interceptors.i-1.key=agent
agent_1.sources.source-1.interceptors.i-1.value=ninetynine

The primary use case for this interceptor is to identify events that have taken different processing paths. By having agents add the path information to the headers, the event can be later tracked through the system.

UUID Interceptor

This interceptor adds a globally unique identifier to each event it processes. By default it sets the id header, but this can of course be changed:

agent_1.sources.source-1.interceptors.i-1.type=uuid
agent_1.sources.source-1.interceptors.i-1.headerName=id
agent_1.sources.source-1.interceptors.i-1.preserveExisting=true

As discussed at the beginning of the chapter, most of these systems implement at-least-once delivery semantics, which can occasionally lead to duplicates. Depending on the application, it might be necessary to detect and remove these duplicates. By placing this interceptor at the beginning of the Flume topology, events can be uniquely identified.
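What a downstream consumer does with the id header is outside Flume itself, but the bookkeeping is simple. The following sketch, which is purely illustrative and not part of Flume, remembers a bounded set of recently seen ids so duplicates can be dropped:

import java.util.LinkedHashMap;
import java.util.Map;

// Remembers a bounded number of recently seen event ids so that
// duplicates produced by at-least-once delivery can be discarded.
public class SeenIds {
  private final Map<String, Boolean> seen;

  public SeenIds(final int capacity) {
    this.seen = new LinkedHashMap<String, Boolean>(capacity, 0.75f, false) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
        return size() > capacity;
      }
    };
  }

  // Returns true the first time an id is offered, false for duplicates.
  public synchronized boolean firstTime(String id) {
    return seen.put(id, Boolean.TRUE) == null;
  }
}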

Regular Expression Filter Interceptor

The Regular Expression Filter interceptor applies a regular expression to the body of the Event, assuming that it is a string. If the regular expression matches, the event is either preserved or dropped depending on the excludeEvents property. If the property is set to true, events matching the regular expression


are dropped. Otherwise, only events matching the regular expression are preserved. The regular expression itself is defined by the regex property:

agent_1.sources.source-1.interceptors.i-1.type=regex_filter
agent_1.sources.source-1.interceptors.i-1.regex=.*
agent_1.sources.source-1.interceptors.i-1.excludeEvents=false

This interceptor can be used to set up different paths for different types of events. After they are on different paths, they can be easily processed without having to introduce an if-else ladder in the processing code.

Integrating Custom Flume Components

As described in the previous sections, many of the components in Flume can be replaced with custom implementations. These are usually developed outside of the Flume source code, and two integration methods are supplied.

Plug-in Integration

The preferred method of integration is through Flume's plug-in facility. To use this method, create a directory called plugin.d in the Flume home directory. Each plug-in that should be included is placed in a subdirectory inside of plugin.d. For example, to include the examples from this book, the plugin.d structure would look like this:

plugin.d/
plugin.d/wiley-plugin/
plugin.d/wiley-plugin/lib/wiley-chapter-4.jar

If further dependencies are required, they can be placed in the libext subdirectory of the plug-in or built into the included jar directly. Native libraries are placed in the native subdirectory. When the Flume agent starts, it scans the plug-in directory and adds whatever jar files it finds there to the classpath.

Classpath Integration

If plug-in integration is not possible for some reason, Flume can also use a more standard classpath-based approach. To use it, set the FLUME_CLASSPATH environment variable when starting the Flume agent to add libraries to the agent's classpath.

Running Flume Agents

Starting the Flume agent is done using the flume-ng script found in the Flume binary directory. The startup script requires a name, a configuration directory,

and a configuration properties file (some configuration options may reference other files in the configuration directory):

$ ./bin/flume-ng agent --conf conf \
> -f conf/agent.properties --name agent-1

The agent command tells Flume that it is starting an agent. The --conf switch specifies a configuration directory, and the -f switch specifies the properties file containing the agent definitions. The --name switch tells Flume to start the agent named agent-1. After the agent is started, it begins processing events according to its configuration.

Conclusion

This chapter has discussed two systems for handling data flow in a distributed environment. Both Kafka and Flume are high-performance systems with sufficient scalability to support real-time streaming. Kafka is directed toward users who are building applications from scratch, giving them the freedom to directly integrate a robust data motion system. Flume can also be used by new applications, but its design makes it well suited to environments that have existing applications that need to be federated into a single processing environment.

The temptation at this point is to consider these two systems to be somehow mutually exclusive: Kafka for new things and Flume for "legacy" things. This is simply not the case. The two systems are complementary to each other, solving different usage patterns. In a complicated environment, Kafka might be used for the bulk data motion within the system, handling the bulk logging and collecting data from outlying data sources. It might also be used to directly feed the stream processing systems introduced in Chapter 5. However, Flume would be better suited to streaming that data into persistent data stores like Elasticsearch. With Flume, those components already exist and can be used directly, whereas with Kafka they would have to be written.

Now that data is flowing through the system, it is time for it to be processed and stored. At one point, this would have been a fairly complicated process. Fortunately, the systems introduced in Chapter 5 have greatly simplified these mechanisms.


CHAPTER 5

Processing Streaming Data

Now that data is flowing through a data collection system, it must be processed. The original use case for both Kafka and Flume specified Hadoop as the processing system. Hadoop is, of course, a batch system. Although it is very good at what it does, it is hard to achieve processing rates with latencies shorter than about 5 minutes. The primary source of this limit on the rate of batch processing is startup and shutdown cost.

When a Hadoop job starts, a set of input splits is first obtained from the input source (usually the Hadoop Distributed File System, known as HDFS, but potentially other locations). Input splits are parceled into separate mapper tasks by the Job Tracker, which may involve starting new virtual machine instances on the worker nodes. Then there is the shuffle, sort, and reduce phase. Although each of these steps is fairly small, they add up. A typical job start requires somewhere between 10 and 30 seconds of "wall time," depending on the nature of the cluster. Hadoop 2 actually adds more time to the total because it needs to spin up an Application Manager to manage the job.

For a batch job that is going to run for 30 minutes or an hour, this startup time is negligible and can be completely ignored for performance tuning. For a job that is running every 5 minutes, 30 seconds of start time represents a 10 percent loss of performance.

Real-time processing frameworks, the subject of this chapter, get around this setup and breakdown latency by using long-lived processes and very small batches (potentially as small as a single record, but usually larger than that). An individual



job may run for hours, days, or weeks before being restarted, which amortizes the startup costs to essentially zero. The downside of this approach is that there is some added management overhead to ensure the job runs properly for a very long period of time and to handle the communication between components of the job so that multiple machines can be used to parallelize processing.

The chapter begins with a discussion of the issues affecting the development of streaming data frameworks. The problem of analyzing streaming data is an old one, predating the rise of the current "Big Data" paradigms. Two frameworks for implementing streaming systems are then discussed: Storm and Samza. Both are complete processing frameworks and are covered in depth in this chapter. In both cases, you can construct a server infrastructure that allows for distributed processing as well as well-structured application programming interfaces (APIs) that describe how computations are moved around these servers.

Distributed Streaming Data Processing

Before delving into the specifics of implementing stream processing applications within a particular framework, it helps to spend some time understanding how distributed stream processing works. Stream processing is ideal for certain problems, but certain requirements can affect performance to the point that it would actually be faster to use a batch system. This is particularly true of processes requiring coordination among units.

This section provides an introduction to the properties of stream processing systems. At its heart, stream processing is a specialized form of parallel computing, designed to handle the case where data may be processed only one time. However, unlike the days of monolithic supercomputers, most of these parallel-computing environments are implemented on an unreliable networking layer that introduces a much higher error rate.

Coordination

The core of any streaming framework is some sort of coordination server. This coordination server stores information about the topology of the process: which components should get what data and what physical address they occupy. The coordination server also maintains a distributed lock server for handling some of the partitioning tasks. In particular, a lock server is needed for the initial load of data into the distributed processing framework. Data can often be read far faster than it can be processed, so even the internal partitioning of the data motion system might need to use more than one process to pull from a given partition. This requires that the clients coordinate among themselves.

c05.indd 05:36:24:PM 06/12/2014 Page 118 Chapter 5 ■ Processing Streaming Data 119

Both of the frameworks discussed in this chapter use ZooKeeper to handle their coordination tasks. Depending on load, it is possible to reuse the ZooKeeper cluster being used to manage the data motion framework without much risk.

Partitions and Merges

The core element of a stream processing system is some sort of scatter-gather implementation. An incoming stream is split among a number of identical processing pipelines. Depending on the performance characteristics of each step in the pipeline, the results of computation may be further subdivided to maintain performance. For example, if step two of the computation requires twice as much time as step one, it would be necessary to at least double the number of work units assigned to the second step in the pipeline compared to the first. In all likelihood, more than double the number of units would be required because communication between work units introduces some overhead into the process.

The data exchange between the various components of the processing application is usually a peer-to-peer affair. A component knows it must produce data to be consumed, and most systems establish a mechanism for transferring data between processing units and controlling the amount of parallelism at each stage of the computation.
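The split itself is commonly made by hashing a key carried by each record so that related records always reach the same work unit. An illustrative sketch of that assignment, not tied to any particular framework:

// Assign a record to one of N parallel workers for a stage by hashing its key,
// so identical keys are always processed by the same worker.
public class KeyPartitioner {
  public static int workerFor(String key, int numWorkers) {
    // Mask the sign bit so the result is always non-negative.
    return (key.hashCode() & Integer.MAX_VALUE) % numWorkers;
  }
}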

Transactions

Transactional processing is a challenge for real-time systems. Being required to maintain a transaction puts significant overhead on the system. If it is possible to operate without transactions, do so. If this is not possible, some frameworks can provide a level of transactional support when coupled with an appropriate data motion system. The basic mechanism for transaction handling in real-time frameworks is the rollback and retry. The framework begins by generating a checkpoint of the input stream. This is easier to do in queuing or data motion systems where each element has a unique identifier that increases monotonically over the relevant time range. In Kafka, this is achieved by recording the offset for each of the partitions in the topic(s) being used as input to the real-time processing system.
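As a rough illustration of the rollback-and-retry pattern, independent of any particular framework, the following sketch records per-partition offsets before a batch is processed so that a failed batch can be replayed from the checkpoint. The class and method names are hypothetical, standing in for whatever the framework provides:

import java.util.HashMap;
import java.util.Map;

public class OffsetCheckpoint {
    // Partition number mapped to the offset recorded at the start of the batch.
    private final Map<Integer, Long> checkpoint = new HashMap<Integer, Long>();

    // Record the current offsets before processing begins.
    public void beginBatch(Map<Integer, Long> currentOffsets) {
        checkpoint.clear();
        checkpoint.putAll(currentOffsets);
    }

    // On failure, return the checkpointed offsets so the consumer can rewind
    // each partition and replay the batch from the checkpoint.
    public Map<Integer, Long> rollback() {
        return new HashMap<Integer, Long>(checkpoint);
    }
}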

Processing Data with Storm

Storm is a stream-processing framework originally developed by BackType, a marketing analytics company that analyzed Twitter data.


Twitter acquired BackType in 2011, and Storm, a core piece of the BackType technology, was adopted and eventually open sourced by Twitter. Storm 0.9.0 is the current version and is the focus of this section because it adds some new features that simplify Storm's use.

Storm's focus is the development and deployment of fault-tolerant distributed stream computations. To do this, it provides a framework for hosting applications. It also provides two models for building applications.

The first is "classic" Storm, which frames the application as a directed acyclic graph (DAG) called a topology. The top of this graph takes inputs from the outside world, such as Kafka. These data sources are called spouts. The spouts then pass the data on to units of computation called bolts.

The second model for building applications, called Trident, is a higher-level abstraction on top of the topology. This model is more focused on common operations such as aggregation and persistence. It provides primitives focused on this type of computation, which are arranged according to their processing. Trident then computes a topology by combining and splitting operations into appropriate collections of bolts.

This chapter covers both models of computation. The second model, using the Trident domain-specific language, is generally preferred because it standardizes a number of common stream processing tasks. However, it is sometimes advantageous to break with this common model and use the underlying topology framework.

Components of a Storm Cluster

When the time comes to deploy a Storm topology in production, you must construct a cluster. This cluster consists of three server daemons, all of which must be present to function properly: ZooKeeper, the nimbus, and the supervisors. In this section, a Storm cluster capable of supporting production workloads is constructed using these three components, as shown in Figure 5-1.

Figure 5-1: A Storm cluster consisting of a ZooKeeper ensemble, a nimbus node, and supervisor nodes


ZooKeeper

A distributed Storm cluster relies on ZooKeeper for coordinating the cluster. Communication between nodes is peer-to-peer, so the load on the ZooKeeper cluster is not very high; it is only used to manage metadata for each of the nodes. If you are using a data motion system that also relies on ZooKeeper, such as Kafka, it is fine to use that same cluster for hosting Storm, provided it has enough capacity. ZooKeeper keeps track of all running topologies as well as the status of all supervisors.

The Nimbus

The nimbus is the ringleader of the Storm circus. It is responsible for the distribution of individual tasks for either spouts or bolts across the workers in the cluster. It is also responsible for rebalancing a Storm cluster in the event a supervisor has crashed. At first glance, the nimbus would appear to be a single point of failure, like the JobTracker or NameNode in a traditional Hadoop cluster. For Storm, this is not strictly the case. When the nimbus goes down, existing topologies continue to function unimpeded. The nimbus is not involved in the moment-to-moment processing, just the distribution of processing tasks to the supervisor nodes. When the nimbus crashes, the cluster is no longer able to manage topologies. This includes starting new topologies and rebalancing existing topologies in the event of failure. A crashed nimbus should therefore be restarted as quickly as possible, but its loss does not immediately bring down the cluster.

The Supervisors

Supervisors in Storm are similar to TaskTrackers in Hadoop. They are servers that run on each of the worker nodes and manage the execution of local tasks. Each of the local tasks is either a spout or a bolt, and there can be many local tasks running under the control of each supervisor. During normal operation, if a supervisor goes down, all of the tasks running under its control are also lost. Because Storm is designed to be distributed and fault tolerant, when this happens the nimbus restarts these tasks under the control of another supervisor. When a supervisor returns to a cluster (or more supervisors are added), running topologies are not automatically rebalanced. To rebalance the cluster after adding more nodes, run the rebalance command from the Storm command-line interface, as shown in the example that follows.
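For example, assuming a running topology named my-topology and a bolt named processing (both placeholder names), the cluster could be rebalanced onto four workers with eight executors for that bolt using something like the following; the exact flags accepted may vary slightly between Storm versions:

storm-0.9.0.1$ ./bin/storm rebalance my-topology -n 4 -e processing=8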


Configuring a Storm Cluster

This section demonstrates how to configure a Storm cluster for distributed-mode processing. On distributed hardware, this would be the mode for production processing. It is also possible to run all services on a single machine, although it is often easier to use the local cluster mode described in the next section for development and debugging.

Prerequisites

To begin, obtain a Storm binary archive from the http://storm-project.net website. The most recent version at the time of writing is 0.9.0.1, and this is the version recommended for installation. If, for some reason, it is necessary to use an older version of Storm, you need to install ZeroMQ (also written 0MQ). This requirement is removed in 0.9.0 thanks to the Netty transport, but earlier versions of Storm require this package for nodes to be able to communicate with each other. On Linux systems, ZeroMQ is often available from the package management system, but you should take care to install version 2.1.7; this is the only version that is known to work with Storm. Mac OS X users can obtain 0MQ from package libraries such as Homebrew or MacPorts, but Windows users face many more problems. It is easier to simply upgrade to Storm 0.9.0 than it is to make an older version of Storm work on Windows.

Configuring Storm

Storm keeps its base configuration in the conf/storm.yaml file. This does not contain information about topologies, but it does keep track of a few important settings: the location of the ZooKeeper servers, the location of the nimbus server, and the transport mechanism. It is assumed that a ZooKeeper cluster similar to the one configured in Chapter 3 is currently running. Like all ZooKeeper clients, Storm needs to know the name of at least one ZooKeeper server; more servers provide failover in case that server is temporarily unavailable. The servers are provided as a YAML list for the configuration parameter storm.zookeeper.servers:

storm.zookeeper.servers:
  - "zookeeper-node1.mydomain.net"
  - "zookeeper-node2.mydomain.net"

If you’re running a distributed server on a single machine, just use localhost here:

storm.zookeeper.servers:
  - "localhost"


To be able to submit topologies, Storm also needs to know the location of the nimbus server:

nimbus.host: "storm-nimbus.mydomain.net"

If you’re running everything locally, this should also be set to localhost:

nimbus.host: "localhost" If you’re using Storm 0.9.0, use the Netty transport instead of the ZeroMQ transport. Contributed by Yahoo!, the Netty transport offers better performance and easier setup and should be the default on all clusters. However, it is not the default for Storm clusters, so you need to confi gure it by adding the following statements to the confi guration fi le:

storm.messaging.transport: "backtype.storm.messaging.netty.Context"
storm.messaging.netty.server_worker_threads: 1
storm.messaging.netty.client_worker_threads: 1
storm.messaging.netty.buffer_size: 5242880
storm.messaging.netty.max_retries: 100
storm.messaging.netty.max_wait_ms: 1000
storm.messaging.netty.min_wait_ms: 100

This change was actually much larger than simply adding Netty support: Storm now supports pluggable transport mechanisms, so other transports can be added as well.

Distributed Clusters

The default mode for Storm is to run as a distributed cluster of servers. One machine serves as the host for the nimbus and, usually, the Storm user interface (UI). Other machines run the supervisors and are added as capacity is needed. This section describes the steps needed to start this type of cluster, even if all the "machines" are actually running on the same development box.

Starting the Servers

If this is a development machine, Storm includes a development ZooKeeper install that can be run. Start it before starting any of the servers:

storm-0.9.0.1$ ./bin/storm dev-zookeeper

It is useful to keep this running in its own console because it logs connections and disconnections. Now each of the servers can be started. On the machine acting as the nimbus, start the nimbus server using the bin/storm nimbus command. Additionally, start the web interface, which helps to manage the topologies:

storm-0.9.0.1$ ./bin/storm nimbus &
storm-0.9.0.1$ ./bin/storm ui &


Of course, in a production environment these should be placed into a startup script of some kind. Similarly, the Storm supervisor should be started on all of the worker machines:

storm-0.9.0.1$ ./bin/storm supervisor &

If all has gone well, the Storm UI should reflect the correct number of supervisors. For example, a cluster running on a development machine should look something like Figure 5-2.

Figure 5-2

Controlling Topologies

Submitting a topology to a Storm cluster is similar to submitting a Hadoop job. The commands are even similar: storm jar instead of hadoop jar. Topologies themselves are actually submitted in their main class using the StormSubmitter class:

Config conf = new Config();
StormSubmitter.submitTopology("my-topology", conf, topology);
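Before submission, the same Config object can also carry cluster-level tuning parameters. The following is a minimal sketch with illustrative values; setNumWorkers and setMaxSpoutPending are standard Config setters:

Config conf = new Config();
conf.setNumWorkers(4);          // worker processes allocated to the topology
conf.setMaxSpoutPending(1000);  // cap on un-acked tuples in flight per spout task
conf.setDebug(false);
StormSubmitter.submitTopology("my-topology", conf, topology);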

As shown, the Config object can be used as is or overridden to set topology-specific configuration parameters. A main class implementing this call can be executed in the proper configuration context using storm jar:

$ ./bin/storm jar \
> streaming-chapter-5-1-shaded.jar \
> wiley.streaming.storm.ExampleTopology
246 [main] INFO backtype.storm.StormSubmitter - Jar not uploaded to master yet. Submitting jar...


346 [main] INFO backtype.storm.StormSubmitter - Uploading topology jar streaming-chapter-5-1-shaded.jar to assigned location: storm-local/nimbus/inbox/stormjar-18e1fded-074d-4e51-a2b1-5c98b2e133a6.jar
1337 [main] INFO backtype.storm.StormSubmitter - Successfully uploaded topology jar to assigned location: storm-local/nimbus/inbox/stormjar-18e1fded-074d-4e51-a2b1-5c98b2e133a6.jar
1338 [main] INFO backtype.storm.StormSubmitter - Submitting topology my-topology in distributed mode with conf {}

After a topology has been submitted, it can be stopped using the Storm command-line client. The kill command along with the topology identifier stops the topology entirely. It can be paused and restarted without fully stopping using deactivate and activate, respectively. You can also pause and restart the topology from the Storm web interface by selecting the topology in the UI. The commands that follow sketch this lifecycle.
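For example, using the topology name from the submission above (kill also accepts an optional -w flag giving the number of seconds to wait before shutdown):

$ ./bin/storm deactivate my-topology
$ ./bin/storm activate my-topology
$ ./bin/storm kill my-topology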

Starting a Topology Project

The easiest way to get started developing a topology project is to use Maven to include all of the requisite dependencies. The following is a minimal pom.xml file needed to start a Java-based Storm topology. It includes the minimal dependencies, including the link to the clojars.org repository where Storm's Maven packages are hosted. It also includes the shade plug-in, which is used to build so-called "fat jars" that bundle dependencies not included with Storm itself:

<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <groupId>group</groupId>
  <artifactId>storm-project</artifactId>
  <version>1</version>

  <repositories>
    <repository>
      <id>clojars.org</id>
      <url>http://clojars.org/repo</url>
    </repository>
  </repositories>

  <dependencies>
    <dependency>
      <groupId>storm</groupId>
      <artifactId>storm</artifactId>
      <version>0.9.0.1</version>
      <scope>test</scope>
    </dependency>
  </dependencies>

  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-shade-plugin</artifactId>
        <version>2.0</version>
        <configuration>
          <!-- Element name assumed; the listing shows only the value "true". -->
          <shadedArtifactAttached>true</shadedArtifactAttached>
        </configuration>
        <executions>
          <execution>
            <phase>package</phase>
            <goals>
              <goal>shade</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</project>

This is all that is required to start developing Storm topologies in Java.

Local Clusters

When developing and testing Storm topologies, it is not necessary to have a complete cluster available, although that can be useful for tracking down packaging problems. For testing purposes, Storm makes a special type of cluster called the LocalCluster available for use. This allows the entire topology to run in a single Java Virtual Machine (JVM) instance. The cluster is a "full" cluster in the sense that it executes inside a supervisor, and there is an embedded ZooKeeper instance running, as opposed to a unit test environment where the server components are "mocked out." Using a local cluster in a test framework is quite simple: configure a LocalCluster and then submit the topology to it as usual:

Config conf = new Config();
conf.setDebug(true);
LocalCluster cluster = new LocalCluster();
cluster.submitTopology("example", conf, builder.createTopology());
Thread.sleep(60000);

The sleep command is needed because the cluster topology actually executes on background threads. Without it, the main method would finish and the process would shut down immediately without actually executing the topology. Putting a delay into the main class lets the topology execute for a little while.
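When the test is finished, the topology and the local cluster can be shut down cleanly. A short sketch (the topology name must match the one passed to submitTopology):

// Stop the topology and release the embedded cluster's resources.
cluster.killTopology("example");
cluster.shutdown();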


Storm Topologies

The topology is Storm's mechanism for computational organization. The topology is represented by something called a directed acyclic graph (DAG). A graph, of course, is a structure that contains vertices (nodes) connected together with edges. In this case, the edges are directed, which means that data only flows in a single direction along the edge. These graphs are also acyclic, which means that the arrangement of edges may not form a loop. Otherwise, the data would flow infinitely around the topology. A simple topology is shown in Figure 5-3, with an input spout and two layers of bolts. The Storm framework can increase the bolt parallelism as needed to maintain performance.

Figure 5-3: A simple topology with a layer of spouts feeding two layers of bolts

Topologies are built in Storm using the TopologyBuilder class and created by a call to createTopology:

public static void main(String[] args) {
    TopologyBuilder builder = new TopologyBuilder();
    defineTopology(builder, args);
    StormTopology topology = builder.createTopology();

The defineTopology method is where the graph itself is constructed. Each vertex in the topology consists of a unique name and the definition of either a spout or a bolt. A spout is Storm's data input mechanism and is added using the setSpout method:

public static void defineTopology(TopologyBuilder builder, String[] args) {
    builder.setSpout("input", new BasicSpout());

This creates a spout named "input" and returns a SpoutDeclarer that is used to further configure the Spout. Most topologies only contain a single IRichSpout type, but Storm often creates more than one Spout task to improve the parallelism


and durability of the input process. The maximum number of tasks and other related variables are all set using the returned SpoutDeclarer. Bolts are Storm’s basic unit of computation. Like spouts, they must have a unique name, and they are added via the setBolt method. For example, this adds a new bolt named “processing” to the topology:

builder.setBolt("processing", new BasicBolt()) .shuffleGrouping("input");

The setBolt method returns a BoltDeclarer object that is used to connect the bolt to the rest of the topology. Bolts are attached to the topology by using one of the grouping methods to define an input vertex. In this case the "input" spout vertex defined earlier is used. Storm creates several tasks for each defined Bolt to improve parallelism and durability. Because this may affect computations like aggregation, the grouping methods define how data is sent to each of the individual tasks. This is similar to how partition functions are used in map-reduce implementations, but with a bit more flexibility. Storm offers a number of grouping methods to organize the flow of data in the topology.

Shuffle Groupings

In the earlier example, the shuffleGrouping method is used to distribute data among the Bolt tasks. This method randomly shuffles each data element, called a Tuple, among the Bolt tasks such that each task gets roughly the same amount of data.

Field Groupings

Another common grouping is the fieldsGrouping method. This grouping uses one or more of the named elements of a tuple, which are defined by the input vertex, to determine the task that will receive a particular data element. This is essentially equivalent to a SQL GROUP BY clause and is often used when implementing aggregation bolts. To group data from "input" by key1 and key2, add the bolt as follows:

builder.setBolt("processing", new BasicBolt()) .fieldsGrouping("input", new Fields("key1","key2"));

All and Global Groupings

The allGrouping and globalGrouping methods are exact opposites of each other. The allGrouping method ensures that all tuples in an event stream are transmitted to all running tasks for a particular bolt. This is occasionally

useful, but it multiplies the number of events by the number of running tasks, so use it with care. The globalGrouping method does the opposite. It ensures that all tuples from a stream go to the bolt task with the lowest numbered identifier. All bolt tasks receive an identifier number, but this ensures only one of them will be used. This should also be used with care because it effectively disables Storm's parallelism.
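For example, with hypothetical ConfigUpdateBolt and GlobalCountBolt classes and upstream components named "control" and "counts", the two groupings would be attached like this:

// Broadcast every control tuple to all tasks of the configuration bolt.
builder.setBolt("config", new ConfigUpdateBolt()).allGrouping("control");
// Funnel every count tuple into the single lowest-numbered task for a final tally.
builder.setBolt("total", new GlobalCountBolt()).globalGrouping("counts");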

Direct Groupings

The directGrouping method is a special form of grouping that allows the output Bolt or Spout to decide which Bolt task receives a particular Tuple. This requires that the stream be declared direct when the bolt is created. The producer bolt must also use the emitDirect method instead of the emit method. The emitDirect method takes an extra parameter that identifies the destination Bolt task. The list of valid identifiers can be obtained from the TopologyContext when implementing a Bolt. This is shown in greater detail in the section that details implementing Bolts.
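In outline, and ahead of that fuller treatment, a direct grouping involves three pieces. The component and stream names and the ProcessingBolt class are placeholders, and context is the TopologyContext captured in the producer's prepare method:

// 1. In the producer, declare the stream as direct (second argument true).
declarer.declareStream("routed", true, new Fields("value"));

// 2. In the producer's execute method, pick a consumer task and emit directly to it.
List<Integer> tasks = context.getComponentTasks("processing");
int target = tasks.get(Math.abs(input.getString(0).hashCode()) % tasks.size());
collector.emitDirect(target, "routed", new Values(input.getString(0)));

// 3. In the topology, subscribe the consumer with a direct grouping.
builder.setBolt("processing", new ProcessingBolt())
       .directGrouping("producer", "routed");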

Custom Groupings

When all else fails, it is also possible to implement a custom grouping method. To do this, create a class that implements the CustomStreamGrouping interface. This interface contains two methods. The first method, prepare, is called when the topology is instantiated and lets the grouping method assemble any metadata it may need to perform its tasks. In particular, this method receives the targetTasks variable, which is the list of task identifiers for the bolts that subscribe to the stream being grouped. The second method is the chooseTasks method, which is the workhorse of the class. This method is called for each Tuple and returns the list of bolt tasks that should receive that Tuple.

A ROUNDROBIN CUSTOM STREAM GROUPING

This example implements a round-robin mechanism for distributing data among bolt tasks. It is no more efficient than a shuffleGrouping, but it demonstrates the technique. First, the class is defined with a serialization version number. Storm makes heavy use of Java serialization when configuring a topology, so it is important to pay attention to what will be serialized and what will not. The three class fields are all set during the prepare method, so they should be defined as transient so they will not be included in the serialization:


public class RoundRobinGrouping implements CustomStreamGrouping {

private static final long serialVersionUID = 1L;

transient List<Integer> targetTasks;
transient int nextTask;
transient int numTasks;

Next, the prepare method initializes each of the transient variables. A more complicated grouping might inspect the topology itself to assign weightings to tasks and other options, but this simple round-robin implementation just needs to hold onto the list:

public void prepare(WorkerTopologyContext context, GlobalStreamId stream,
        List<Integer> targetTasks) {
    this.targetTasks = targetTasks;
    this.numTasks = targetTasks.size();
    this.nextTask = 0;

}

Finally, the chooseTasks method is implemented to rotate between the various tasks:

public List<Integer> chooseTasks(int taskId, List<Object> values) {
    LinkedList<Integer> out = new LinkedList<Integer>();
    out.add(targetTasks.get(nextTask));
    nextTask = (nextTask + 1) % numTasks;
    return out;
}

Implementing Bolts

Now that data can be passed to bolts appropriately, it is time to understand how bolts themselves work. As shown in the previous section, bolts receive data from other bolts or spouts within the topology. They are event driven, so they cannot be used to retrieve data; that is the role of the spout. Although it is possible to implement bolts by implementing either the IBasicBolt or IRichBolt interfaces, the recommended method is to extend the BaseBasicBolt or BaseRichBolt classes. These base classes provide simple implementations for most of Storm's boilerplate methods.


Rich Bolts

To implement a RichBolt, start by extending the BaseRichBolt abstract class and implementing the three required methods. Like the custom grouping class from the previous section, the first is a prepare method that passes in information about the topology. For bolts, an extra parameter is added: the collector object. This object manages the interaction between the bolt and the topology, especially transmitting and acknowledging tuples. In a rich bolt, the collector is usually placed into an instance variable for use in the execute method. The prepare method is also a good place to create objects that cannot be serialized because they do not implement the Serializable interface. It is a good habit to declare all variables that are initialized by the prepare method as transient. While not all variables require the transient keyword, they can often be initialized and serialized unintentionally, leading to bugs that are difficult to track down. The following code implements a bolt that captures the output collector:

public class RichEmptyBolt extends BaseRichBolt {

private static final long serialVersionUID = 0L;
private transient OutputCollector collector;

public void prepare(Map stormConf, TopologyContext context,
        OutputCollector collector) {
    this.collector = collector;
}

Most bolts produce one or more streams of output. A stream is a collection of Tuple objects that Storm serializes and passes to other bolt implementations as defined within the topology. These Tuple objects do not have a formal schema; all elements of the tuple are considered to be generic Objects, but the elements are named. These names are used both in the bolt's execute method and by certain grouping classes, such as fieldsGrouping. The order of these named fields is defined in the bolt's declareOutputFields method. By default, the bolt has a single stream that is defined by calling declare on the OutputFieldsDeclarer that is passed into the method:

public void declareOutputFields(OutputFieldsDeclarer declarer) {
    declarer.declare(new Fields("first","second","third"));
}

In addition to the default stream, a bolt can define other named streams for output. Using the declareStream method, give the stream a name and define the fields of the tuple. For example, to take the default stream and split it into two parts, declare two output streams:

declarer.declareStream("car", new Fields("first")); declarer.declareStream("cdr", new Fields("second","third"));


When defining the topology, these additional streams are added to the grouping calls after the source name. For example, attaching a bolt to the cdr stream using a field grouping looks like this:

builder.setBolt("empty", new RichEmptyBolt()); builder.setBolt("process", new BasicBolt()) .fieldsGrouping("empty", "cdr", new Fields("second"));

Processing incoming tuples is handled by the execute method inside the bolt. In RichBolt implementations, it has a single Tuple argument. This class implements the Java List collection interface, but also provides higher-level access to the fields as declared by the source bolt. A bolt can do whatever it likes with the input tuple. For example, to produce the output streams declared in the earlier example, assuming the input tuple has elements of the same name, the execute method would look like this:

public void execute(Tuple input) {
    List<Object> objs = input.select(new Fields("first","second","third"));
    collector.emit(objs);
    collector.emit("car", new Values(objs.get(0)));
    collector.emit("cdr", new Values(objs.get(1), objs.get(2)));
    collector.ack(input);
}

The first line of this example uses a feature of the Tuple wrapper to extract a subset of the Tuple object into a known ordering. This ordering is the same as the declared output stream, so it can simply be emitted as is. The emit method assumes that the output is an object that implements List, which is converted into a Tuple after it arrives at the next bolt.

Next, the other two streams, car and cdr, are emitted. Because they only take a subset of the default stream, new output arrays need to be constructed. The default Java constructors for List objects are generally verbose, so Storm provides a class called Values. This class implements the List interface and provides convenience constructors to make it easy to produce new Tuple objects.

Finally, the tuple is acknowledged via ack(input). This tells Storm that the tuple has been handled, which is important when the bolt is being used in a transactional fashion. If the processing of the tuple fails for some reason, the bolt may use fail(input) instead of ack(input) to report this fact. The collector object also has a reportError method, which can be used to report non-fatal exceptions in the flow. This method takes a Throwable object, and the message from the exception is reflected in the Storm UI. For example, a bolt that handles URLs can report on parsing errors:

public void execute(Tuple input) {
    try {
        URL url = new URL(input.getString(0));
        collector.emit(new Values(url.getAuthority(), url.getHost(),
                url.getPath()));
    } catch (MalformedURLException e) {
        collector.reportError(e);
        collector.fail(input);
    }
}

A FILTER AND SPLIT BOLT

This example shows a bolt that takes an input stream and splits it into several different streams according to a regular expression on a field. It also shows how to introspect the incoming tuples to define the output streams programmatically. The filters are defined during the configuration of the topology. The bolt defines a simple internal class called FilterDefinition, which holds the output stream, the name of the field to check, and the regular expression that will be evaluated against the field's value. This class implements Serializable so it will be properly serialized when the bolts are distributed across the cluster. The filter function itself is implemented in a "chainable" style to make it easy to use in topology definitions:

public class FilterBolt extends BaseRichBolt {
    private static final long serialVersionUID = -7739856267277627178L;

public class FilterDefinition implements Serializable {
    public String stream;
    public String fieldName;
    public Pattern regexp;
    public Fields fields;
    private static final long serialVersionUID = 1L;
}

ArrayList<FilterDefinition> filters = new ArrayList<FilterDefinition>();

public FilterBolt filter(String stream, String field, String regexp,
        String... fields) {
    FilterDefinition def = new FilterDefinition();
    def.stream = stream;
    def.fieldName = field;
    def.regexp = Pattern.compile(regexp);
    def.fields = new Fields(fields);
    filters.add(def);
    return this;
}


continued This bolt’s prepare method captures the collector output as well as preparing the fi lters for further use. In this case, the unique input streams are inspected to deter- mine the fi elds present in their output tuples. If they match a particular fi lter, that fi lter is bound to that particular stream. As a performance improvement, the index of the fi lter element is retrieved as well:

public class FilterBinding {
    public FilterDefinition filter;
    public int fieldNdx;
}

transient OutputCollector collector;
transient Map<GlobalStreamId, List<FilterBinding>> bindings;

public void prepare(Map stormConf, TopologyContext context,
        OutputCollector collector) {
    this.collector = collector;

    bindings = new HashMap<GlobalStreamId, List<FilterBinding>>();
    for(GlobalStreamId id : context.getThisSources().keySet()) {
        Fields fromId = context.getComponentOutputFields(id);
        ArrayList<FilterBinding> bounds = new ArrayList<FilterBinding>();
        for(FilterDefinition def : filters) {
            if(fromId.contains(def.fieldName)) {
                FilterBinding bind = new FilterBinding();
                bind.filter = def;
                bind.fieldNdx = fromId.fieldIndex(def.fieldName);
                bounds.add(bind);
            }
        }
        if(bounds.size() > 0) bindings.put(id, bounds);
    }
}

Ideally, the analysis done during the prepare phase to determine the input fields could be used to programmatically determine the output fields. Unfortunately, the declareOutputFields method is called during topology construction and not during topology initialization. As a result, the tuple elements to include for each filter are specified when the filter is defined and simply returned by declareOutputFields:

public void declareOutputFields(OutputFieldsDeclarer declarer) {
    for(FilterDefinition def : filters)
        declarer.declareStream(def.stream, def.fields);
}


The implementation of the filter itself makes use of the filter bindings created in the prepare method. Each Tuple passed to a bolt travels with metadata about the origin of the Tuple, including the GlobalStreamId. This is obtained using getSourceGlobalStreamid and is used to look up the bound filters for this particular Tuple. If there are applicable filters, the tuple is then applied to each filter in the list and emitted to the appropriate stream if it matches, using the select method on the tuple to extract a tuple that matches the configured Fields element:

public void execute(Tuple input) {
    List<FilterBinding> bound = bindings.get(input.getSourceGlobalStreamid());
    if(bound != null && bound.size() > 0) {
        for(FilterBinding b : bound) {
            if(b.filter.regexp.matcher(
                    input.getString(b.fieldNdx)).matches()) {
                collector.emit(b.filter.stream,
                        input.select(b.filter.fields));
            }
        }
    }
    collector.ack(input);
}

Basic Bolts

If a bolt meets certain criteria, it may also be implemented as a BasicBolt, which removes some of the boilerplate involved in implementing a RichBolt. The BasicBolt implementation does not have a prepare step; it is assumed that the only non-serialized configuration required is to capture the collector object. A BasicBolt also assumes that all tuples will be acknowledged using ack. This is implemented fairly deep in the Storm code, rather than being part of the BasicOutputCollector implementation.

A LOGGING BOLT

The logging bolt is a very simple Bolt implementation that prints the contents of each input tuple to the console. This is very useful when testing topologies, and it is easy to implement with a BasicBolt. The bolt doesn't produce any data and just relies on the fact that Tuple implements a reasonable toString method:

public class LoggerBolt extends BaseBasicBolt {

private static final Logger LOG
    = Logger.getLogger(LoggerBolt.class);
private static final long serialVersionUID = 1L;

public void execute(Tuple input, BasicOutputCollector collector) {
    LOG.info(input.toString());
}

public void declareOutputFields(OutputFieldsDeclarer declarer) { }

}

Implementing and Using Spouts

The Spout is a special form of Storm topology element that is responsible for retrieving data from the outside world. Unlike bolts, there is only an IRichSpout interface to implement and no base class to derive from. When a Spout is created, the open method is called first. This allows the Spout to configure any connections to the outside world, such as connections to queuing or data motion servers. Like a bolt, the Spout should capture its SpoutOutputCollector from the open method for later use:

public class EmptySpout implements IRichSpout {

transient SpoutOutputCollector collector;

public void open(Map conf, TopologyContext context,
        SpoutOutputCollector collector) {
    this.collector = collector;
}

Like a bolt, a spout also defines output streams. There is the usual default stream, defined by calling declare. Additionally, named streams can be defined by calling declareStream, the same as for bolts:

public void declareOutputFields(OutputFieldsDeclarer declarer) {
    declarer.declare(new Fields("first","second","third"));
    declarer.declareStream("errors", new Fields("error"));
}

Spouts are, by definition, pull-based polling interfaces. The Storm infrastructure repeatedly calls the nextTuple method, which is similar to the execute method in a bolt:


public void nextTuple() {
    Utils.sleep(100);
    collector.emit(new Values("one","two","three"));
}

Because Storm uses a polling interface, it is recommended that a small sleep be added when there is no data to emit. This helps reduce the system load when the topology is idle. In the simple example here, a small delay serves much the same purpose.

In addition to being started, topologies can also be stopped, as well as paused and resumed. This mostly affects the processing of the spout, which has methods that can be used to start and stop the polling process. Additionally, there is a close method, called when the topology shuts down, to allow for the clean shutdown of consumer interfaces:

public void close() { }

public void activate() { }

public void deactivate() { }

Finally, topologies can implement transactional semantics with an appropriate spout implementation. When a bolt executes an ack or fail command on a Tuple object, this event is passed up to the spout to allow the tuple's processing to either be committed or reprocessed:

public void ack(Object msgId) { }

public void fail(Object msgId) { }

In practice, these events are a holdover from older versions of Storm. For new transactional topologies, the recommendation is to use the Trident interface available from Storm 0.8.0 onward. This alternative interface, discussed later in this chapter, provides a cleaner mechanism for implementing transactional semantics in a topology.

The only other method to implement is the getComponentConfiguration method. This is used for additional configuration of spouts; it is normally unused and can simply return null:

public Map<String, Object> getComponentConfiguration() {
    return null;
}


A “LOREM IPSUM” SPOUT

Storm includes several mechanisms for testing topologies, which are covered a bit later on, but it is often useful to have a spout that can generate random data. "Lorem Ipsum" is famous nonsense text that is often used by designers to simulate blocks of text for layout. The text itself is mangled Latin, roughly a paragraph long, but modern implementations can generate arbitrarily long pieces of text. There is a Java class, LoremIpsum, available via the Maven Central repository, that can generate strings of arbitrary length from this original paragraph.

<dependency>
  <groupId>de.sven-jacobs</groupId>
  <artifactId>loremipsum</artifactId>
  <version>1.0</version>
</dependency>

To begin, the spout is defined with field names for the tuple that will be generated:

public class LoremIpsumSpout implements IRichSpout {

private static final long serialVersionUID = 1L;
String[] fields;

public LoremIpsumSpout tuple(String... fields) {
    this.fields = fields;
    return this;
}

int maxWords = 25;
public LoremIpsumSpout maxWords(int maxWords) {
    this.maxWords = maxWords;
    return this;
}

When the spout’s open method is called, the transient LoremIpsum class is instanti- ated because it does not implement Serializable and therefore cannot be confi gured prior to starting the topology. As usual, the collector element is also captured for later use:

transient LoremIpsum ipsum;
transient Random rng;
transient SpoutOutputCollector collector;

public void open(Map conf, TopologyContext context,
        SpoutOutputCollector collector) {
    this.collector = collector;
    ipsum = new LoremIpsum();
    rng = new Random();
}


Because this Spout is completely self-contained, the transactional methods ack and fail, as well as the lifecycle management methods (activate, deactivate, and close), can be left as stub implementations. The nextTuple implementation sleeps for a moment, to reduce load during test cases, and then emits a tuple whose fields contain "lorem ipsum" text of varying length:

public void nextTuple() {
    Utils.sleep(100);
    ArrayList<Object> out = new ArrayList<Object>();
    for(String s : fields)
        out.add(ipsum.getWords(rng.nextInt(maxWords)));
    collector.emit(out);
}

To test this Spout, use a simple LocalCluster implementation to see it in action. First, construct a topology with two LoremIpsum spouts:

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("spout1", new LoremIpsumSpout()
    .tuple("first","second","third"));
builder.setSpout("spout2", new LoremIpsumSpout()
    .tuple("second","third","forth"));

Then, just to give the topology something to do, attach a FilterBolt from the last section. Have it filter on two well-known parts of the "lorem ipsum" text. Note that the filter takes in both spouts:

builder.setBolt("filter", new FilterBolt()
    .filter("ipsum", "first", ".*ipsum.*", "first")
    .filter("amet", "second", ".*amet.*", "second")
).shuffleGrouping("spout1")
 .shuffleGrouping("spout2");

To see the filtering functionality, use a LoggerBolt to print the output. It also identifies the source stream to show that the filter actually works:

builder.setBolt("print", new LoggerBolt())
    .shuffleGrouping("filter", "amet")
    .shuffleGrouping("filter", "ipsum");

This is all then submitted to a LocalCluster for execution:

Config conf = new Config();

LocalCluster cluster = new LocalCluster();
cluster.submitTopology("example", conf, builder.createTopology());


Finally, have the main class sleep for a bit. Because LocalCluster is running an embedded topology, most of the work is being done in background threads:

// Sleep for a minute
Thread.sleep(60000);

After running the topology, the output should look something like this:

6206 [Thread-22-print] INFO wiley.streaming.storm.LoggerBolt - source: filter:2, stream: ipsum, id: {}, [Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut]
6306 [Thread-22-print] INFO wiley.streaming.storm.LoggerBolt - source: filter:2, stream: amet, id: {}, [Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor]
6314 [Thread-22-print] INFO wiley.streaming.storm.LoggerBolt - source: filter:2, stream: amet, id: {}, [Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam]
6407 [Thread-22-print] INFO wiley.streaming.storm.LoggerBolt - source: filter:2, stream: ipsum, id: {}, [Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna]
6407 [Thread-22-print] INFO wiley.streaming.storm.LoggerBolt - source: filter:2, stream: amet, id: {}, [Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam]

Connecting Storm to Kafka

With the introduction of Kafka 0.8, the integration between Kafka and Storm 0.9 is in a state of flux. The storm-kafka spout implementation available in the storm-contrib library currently targets version 0.7.2 of Kafka. A version of the Kafka spout designed for use with Kafka 0.8 is available through another project called storm-kafka-0.8-plus (https://github.com/wurstmeister/storm-kafka-0.8-plus). The project is also available via clojars.org, the same repository as Storm itself, for inclusion in a Maven pom.xml file. It is included in a project using the following artifact:

<dependency>
  <groupId>net.wurstmeister.storm</groupId>
  <artifactId>storm-kafka-0.8-plus</artifactId>
  <version>0.2.0</version>
</dependency>

To use the spout, a KafkaConfig object is first configured to point to the appropriate set of brokers and the desired topic. For example, the following code obtains the list of brokers from ZooKeeper. It then attaches itself to the wikipedia-raw topic using the storm consumer group:

BrokerHosts hosts = new ZkHosts("localhost");
SpoutConfig config = new SpoutConfig(hosts, "wikipedia-raw", "", "storm");
config.scheme = new SchemeAsMultiScheme(new StringScheme());

The config.scheme entry tells the spout to interpret incoming messages as strings rather than some more exotic format. Other encodings would require a scheme implementation specific to that format. To use this in a topology, a KafkaSpout is created and then used like any other spout. This simple test topology reports on the events being produced on the wikipedia-raw topic:

TopologyBuilder builder = new TopologyBuilder(); builder.setSpout("kafka", new KafkaSpout(config), 10); builder.setBolt("print", new LoggerBolt()) .shuffleGrouping("kafka");

Config conf = new Config();
LocalCluster cluster = new LocalCluster();
cluster.submitTopology("example", conf, builder.createTopology());
// Sleep for a minute
Thread.sleep(60000);

Connecting Storm to Flume

There is a fundamental "impedance mismatch" between Storm and Flume. Storm is a polling consumer, assuming a pull model. Flume's fundamental sink model is push based. Although some attempts have been made, there is no accepted mechanism for directly connecting Flume and Storm.

There are solutions to this problem, but all of them involve adding another queuing system or data motion system into the mix. Two basic approaches are used "in the wild." The first is to use a queuing system that is compatible with the Advanced Message Queuing Protocol (AMQP). For Flume applications, the RabbitMQ queuing system is popular because there is a Flume sink developed by Kenshoo (http://github.com/kenshoo/flume-rabbitmq-sink). Storm's AMQPSpout is then used to read from RabbitMQ. The other option is to use a Kafka sink for Flume and then integrate Storm with Kafka as described in the previous section. At that point, the only real reason to use Flume is to integrate with an existing infrastructure. You can find the Flume/Kafka sink at https://github.com/baniuyao/flume-ng-kafka-sink.


Distributed Remote Procedure Calls

In addition to the development of standard processing topologies that consume data from a stream, Storm also includes facilities for implementing Distributed Remote Procedure Calls (DRPC), which make it easy to implement distributed processing services, such as performing a large number of expensive calculations. Using a special spout and bolt combination, these topologies implement a complicated procedure that can be distributed across multiple machines. This is not strictly something that is typically done in a real-time streaming analysis scenario, but it can certainly be used to scale parts of a real-time analysis pipeline. It can also be used to build component-based services that make an environment easier to manage.

The DRPC Server

DRPC requests are handled by a separate Storm server that manages the communication between the DRPC client and the Storm topology servicing those requests. In a distributed environment, this server is started with the Storm command-line utility:

$ ./bin/storm drpc

In testing environments, a local DRPC server can be started instead. This is similar to the LocalCluster option used for testing:

TopologyBuilder builder = new TopologyBuilder();
LocalDRPC drpc = new LocalDRPC();

Writing DRPC Topologies

A DRPC topology is just like any other Storm topology, relying on a special spout and bolt combination to enact the DRPC functionality. The topology takes in a single String value via the spout and returns a single String value via the ReturnResults bolt. For example, the following topology implements a "power" function that exponentiates a comma-separated pair of doubles corresponding to the base value and the exponent. It returns a string representing the result:

// Omit the drpc argument when submitting to the cluster
builder.setSpout("drpc", new DRPCSpout("power", drpc));
builder.setBolt("split", new SplitBolt()).shuffleGrouping("drpc");
builder.setBolt("power", new PowerBolt()).shuffleGrouping("split");
builder.setBolt("return", new ReturnResults()).shuffleGrouping("power");


The SplitBolt implementation splits the input comma-separated list into two Double values. It also passes along the return-info element of the tuple, which is used by ReturnResults to submit the result back to the DRPC server:

public class SplitBolt extends BaseBasicBolt {

private static final long serialVersionUID = -3681655596236684743L;

public void execute(Tuple input, BasicOutputCollector collector) {
    String[] p = input.getString(0).split(",");
    collector.emit(new Values(
        Double.parseDouble(p[0]),
        Double.parseDouble(p[1]),
        input.getValue(1)
    ));
}

public void declareOutputFields(OutputFieldsDeclarer declarer) {
    declarer.declare(new Fields("x","y","return-info"));
}
}

The PowerBolt implementation takes the x and y values and computes the power function. It returns this value as a String, as required by DRPC, and also makes sure to pass along the return-info object:

public class PowerBolt extends BaseBasicBolt {

private static final long serialVersionUID = 1L;

public void execute(Tuple input, BasicOutputCollector collector) {
    collector.emit(new Values(
        "" + Math.pow(input.getDouble(0), input.getDouble(1)),
        input.getValue(2)
    ));
}

public void declareOutputFields(OutputFieldsDeclarer declarer) {
    declarer.declare(new Fields("result","return-info"));
}
}

Most of the time, topologies like this are not built by hand. In older versions of Storm, the LinearDRPCTopologyBuilder class is used to construct these topologies.


It automatically adds the appropriate DRPCSpout spout and ReturnResults bolt to the topology and ensures that the topology itself is linear. In newer versions of Storm, this builder has been deprecated in favor of the Trident version, which is created using the newDRPCStream method instead of the usual newStream method. The Trident domain-specific language is discussed in much more detail in the next section.
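As a rough preview, the Trident form of the same call might look like the following sketch. PowerFunction is a hypothetical BaseFunction that parses the comma-separated argument string and emits the result; the string returned by the DRPC call wraps the emitted result fields:

TridentTopology topology = new TridentTopology();
LocalDRPC drpc = new LocalDRPC();
topology.newDRPCStream("power", drpc)
        .each(new Fields("args"), new PowerFunction(), new Fields("result"))
        .project(new Fields("result"));

// Invoking the call from a test once the topology is running on a LocalCluster:
String result = drpc.execute("power", "2,10");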

Trident: The Storm DSL

With version 0.8.0, Storm introduced a new domain-specific language that is intended to be the preferred method for developing Storm topologies. This domain-specific language is called Trident, and in Storm 0.9.0 it has undergone extensive development.

The goal of the Trident interface is to provide a higher-level abstraction similar to what is found in Hadoop frameworks such as Cascading. Rather than providing the very primitive spout and bolt interface, Trident operates on streams. These streams can be manipulated with high-level concepts, such as joins, groups, aggregates, and filters. It also adds some primitives for state management and persistence, particularly in the case of aggregation.

Under the hood, Trident still uses the same bolt and spout interface. The difference is that each operation defined in a Trident topology does not necessarily result in a new bolt being created. This allows the topology to achieve higher performance by combining several operations into a single physical bolt and avoiding the communication overhead between bolts. This section covers using this high-level construct to build topologies in Storm.

Trident Streams and Spouts

Trident topologies usually operate on a Stream object. The stream in Trident is conceptually very similar to the stream defined by a spout or a bolt in a normal Storm topology. The difference here is that the stream's structure is defined and modified in the topology definition rather than being an intrinsic part of a bolt.

Defining the stream structure along with the topology has a couple of advantages when writing topologies. First, by separating the tuple structure from the business logic, it is easier to write reusable components. Second, the topology becomes much easier to read: because the data flow and the data definition are in the same place, it is much easier to see the manipulations. In "traditional" topologies, determining how each stream is defined involves inspecting each bolt individually. The disadvantage is that it can be harder to express some computations in Trident's language than it would be to explicitly define the topology. Often this is simply a matter of preference.


Streams in Trident topologies are built around a Spout. These spouts can be the same IRichSpout implementations used in standard topologies, or they can be one of the more specialized Trident spouts. Using a basic spout, such as the LoremIpsumSpout, is as simple as defining a stream using newStream in the TridentTopology class:

TridentTopology topology = new TridentTopology(); topology.newStream("lorem", new LoremIpsumSpout());

Local Operations: Filters and Functions

When constructing topologies, Trident considers certain operations to be "partition local" operations. What this means is that a chain of such operations can be executed locally within a single bolt. This enables Trident to make use of more efficient data structures and to reduce the amount of communication required. The most basic local operation is the Filter, which is implemented by extending the BaseFilter class. A single method, isKeep, is implemented to decide whether or not a Tuple's processing should continue. For example, this filter implementation randomly keeps half of its input:

public class RandomFilter extends BaseFilter {
    private static final long serialVersionUID = 1L;
    Random rng = new Random();
    public boolean isKeep(TridentTuple tuple) {
        return rng.nextFloat() > 0.5;
    }
}

The next simplest operation is the Function, which is implemented by extending the BaseFunction class (some Trident tutorials incorrectly have filters extend BaseFunction as well, but this is not the case in 0.9.0). In functionality, Function objects are closest to the BasicBolt in the "traditional" Storm topology. They take in a single Tuple and can produce zero or more tuples, which are appended to the original tuple. If no tuples are produced, the original tuple is filtered from the stream. For example, this Function implementation adds a random number to each incoming Tuple:

public class RandomFunction extends BaseFunction {
    private static final long serialVersionUID = 1L;
    Random rng = new Random();
    public void execute(TridentTuple tuple, TridentCollector collector) {
        collector.emit(new Values(rng.nextDouble()));
    }
}


Both Filter and Function classes use the each method to attach themselves to a stream. For filters, the each method takes the form each(inputFields, filter):

TridentTopology topology = new TridentTopology();
topology.newStream("lorem", new LoremIpsumSpout()
        .tuple("first","second","third"))
    .each(new Fields(), new RandomFilter());

Note the empty Fields constructor. This is because the RandomFilter does not depend on any fields to evaluate the filter. It does not actually modify the Tuple passed through, which still contains the "first", "second", and "third" elements. Similarly, a function's each call takes one of two forms. The first is each(function, outputFields), the form used by the RandomFunction implementation:

TridentTopology topology = new TridentTopology();
topology.newStream("lorem", new LoremIpsumSpout()
        .tuple("first","second","third"))
    .each(new RandomFunction(), new Fields("x"));

The second form allows the function to take input fields, the same as a filter operation. For example, a SquareFunction implementation would take a single parameter:

public class SquareFunction extends BaseFunction {
    private static final long serialVersionUID = 1L;
    public void execute(TridentTuple tuple, TridentCollector collector) {
        collector.emit(new Values(tuple.getDouble(0) * tuple.getDouble(0)));
    }
}

When assembled into a topology with the RandomFunction, it would take the x field as its input and produce a y field as output that would be appended to the tuple:

TridentTopology topology = new TridentTopology();
topology.newStream("lorem", new LoremIpsumSpout()
        .tuple("first","second","third"))
    .each(new RandomFunction(), new Fields("x"))
    .each(new Fields("x"), new SquareFunction(), new Fields("y"))
    .each(new Fields("x","y"), new PrintFunction(), new Fields());

Running this topology for a bit produces output that looks like the following (only the x and y fields are pulled into the PrintFunction for brevity):

[0.8185738974522312, 0.6700632255901359]
[0.04147059259035091, 0.0017198100497948677]
[0.5122091469924017, 0.2623582102626838]
[0.032330581404519276, 0.0010452664939542477]
[0.49881777491999457, 0.24881917257613437]
[0.9140296292506043, 0.8354501631479971]
[0.807521215873791, 0.6520905140862857]
[0.03596640476063595, 0.0012935822714058964]
[0.7539011202358764, 0.5683668990929094]

Repartitioning Operations

In Trident, groupings are called repartitioning operations, but they remain essentially unchanged from their equivalent operations in standard topologies. These operations imply a network transfer of tuples, just like they do in a traditional topology:

■ The shuffle partitioning evenly distributes tuples across target partitions.
■ The broadcast partitioning sends all tuples to all target partitions.
■ The partitionBy partitioning is equivalent to the fieldsGrouping. It takes a list of fields found in each tuple and uses a hash of their values to determine a target partition.
■ The global partition sends all tuples to the same partition. Trident also adds a batchGlobal variant that sends all tuples from the same batch to a single partition. Different batches may be sent to different partitions.
■ There is also a partition function that takes an implementation of CustomStreamGrouping to implement custom partition methods.

In addition to these partition methods, Trident also implements a special type of partition operation called groupBy. This partition operation acts very much like the reducer phase in a Map-Reduce job and is probably the most commonly used partition in Trident. The groupBy operation first applies a partitionBy partition according to the fields specified in the groupBy operation. Within each partition, values with identical hash values are then grouped together for further processing in a GroupedStream. Aggregators can then be applied directly to these groups, as sketched in the example that follows.
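For instance, a minimal sketch of the common groupBy pattern, assuming a hypothetical SplitWords function that emits one "word" per tuple from the spout's text field, and using the built-in Count aggregator (aggregation itself is covered in the next section):

topology.newStream("lines", new LoremIpsumSpout().tuple("text"))
        .each(new Fields("text"), new SplitWords(), new Fields("word"))
        .groupBy(new Fields("word"))
        .aggregate(new Count(), new Fields("count"));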

Aggregation

Trident has two methods of aggregation on streams, aggregate and persistentAggregate. The aggregate method, applied to a stream with a function that implements ReducerAggregator or Aggregator, effectively performs a global partition on the data, causing all tuples to be sent to the same partition for aggregation on a per-batch basis. An aggregate method that is given a CombinerAggregator first computes an intermediate aggregate for each partition and then performs a global partition on the output of these intermediate aggregates. The built-in aggregators Count and Sum are both CombinerAggregators. Other custom aggregation methods can be implemented by extending the BaseAggregator class.


An example of this is shown in the next section, where a partition local aggregator is implemented. The other aggregation method, persistentAggregate, works like aggregate except that it records its output into a State object. These State objects often work with the Spout in a Trident topology to provide features like transactional semantics. The persistentAggregate method returns a TridentState object rather than a Stream object. This object has a method newValuesStream that emits a tuple every time a key's state changes. This is useful for retrieving the results of an aggregation that uses local memory as its backing store, such as the MemoryMapState class. At the moment, the MemoryMapState class is the only State implementation built into Trident. It is primarily intended for testing purposes, with other State objects implemented to persist to more durable stores, such as memcached.
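For reference, a CombinerAggregator only has to supply three methods: an init that extracts a value from each tuple, a combine that merges two partial results, and a zero used for empty partitions. The following sum of doubles is a minimal sketch (essentially what the built-in Sum aggregator already provides):

public class DoubleSum implements CombinerAggregator<Double> {
    private static final long serialVersionUID = 1L;

    // Per-tuple contribution to the aggregate
    public Double init(TridentTuple tuple) {
        return tuple.getDouble(0);
    }

    // Merge two partial results; called within each partition first and
    // again after the global repartitioning of the intermediate aggregates
    public Double combine(Double val1, Double val2) {
        return val1 + val2;
    }

    // Identity value used when a partition contains no tuples
    public Double zero() {
        return 0.0;
    }
}

It would be attached to a stream with a call such as aggregate(new Fields("x"), new DoubleSum(), new Fields("sum")).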

Partition Local Aggregation

Although the normal aggregation operations imply a global partitioning event, the partitionAggregate method allows an aggregate to be computed locally within each partition, without any repartitioning. For example, the following aggregator sums a double value for each key seen within a partition:

public class KeyDoubleAggregator
    extends BaseAggregator<Map<Object,Double>> {

    private static final long serialVersionUID = 1L;

    public Map<Object,Double> init(Object batchId, TridentCollector collector) {
        return new HashMap<Object,Double>();
    }

    public void aggregate(Map<Object,Double> val, TridentTuple tuple,
        TridentCollector collector) {
        Object key = tuple.get(0);
        if(val.containsKey(key))
            val.put(key, val.get(key) + tuple.getDouble(1));
        else
            val.put(key, tuple.getDouble(1));
    }

    public void complete(Map<Object,Double> val, TridentCollector collector) {
        for(Entry<Object,Double> e : val.entrySet()) {
            collector.emit(new Values(e.getKey(), e.getValue()));
        }
    }
}
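As an illustration only (the field names are reused from the earlier lorem examples and are otherwise arbitrary), this aggregator could be attached with partitionAggregate, which replaces the tuple's fields with the aggregator's output fields:

TridentTopology topology = new TridentTopology();
topology.newStream("lorem", new LoremIpsumSpout()
        .tuple("first", "second", "third"))
    .each(new RandomFunction(), new Fields("x"))
    // Aggregation happens within each partition; no network transfer implied
    .partitionAggregate(new Fields("first", "x"),
        new KeyDoubleAggregator(),
        new Fields("key", "total"));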


THE CLASSIC “WORD COUNT” EXAMPLE

The Word Count example is the "Hello World" of Big Data processing, and no discussion of real-time stream processing would be complete without it. This example uses a stream of Wikipedia edits coming from Kafka into Trident to implement word counting. This data source is actually provided by the Samza package discussed in the next part of this chapter, and it provides a handy source of data for testing. The spout is configured using the TransactionalTridentKafkaSpout class:

TridentTopology topology = new TridentTopology();

TridentKafkaConfig config = new TridentKafkaConfig(
    new ZkHosts("localhost"),
    "wikipedia-raw",
    "storm"
);
config.scheme = new SchemeAsMultiScheme(new StringScheme());

topology.newStream("kafka", new TransactionalTridentKafkaSpout(config)).shuffle()

This spout emits a string of JSON that must be parsed and split into words for further processing. For simplicity, this function implementation only looks at the title element of the raw output:

.each(new Fields("str"),new Function() {

private static final long serialVersionUID = 1L;

transient JSONParser parser;

public void prepare(Map conf, TridentOperationContext context) {
    parser = new JSONParser();
}

public void cleanup() { }

public void execute(TridentTuple tuple, TridentCollector collector) {
    if(tuple.size() == 0) return;
    try {
        Object obj = parser.parse(tuple.getString(0));
        if(obj instanceof JSONObject) {
            JSONObject json = (JSONObject)obj;
            String raw = (String)json.get("raw");
            raw = raw.substring(2, raw.indexOf("]]"));
            for(String word : raw.split("\\s+")) {
                collector.emit(new Values(word));
            }
        }
    } catch (ParseException e) {
        collector.reportError(e);
    }
}
}, new Fields("word")).groupBy(new Fields("word"))

The output of this function is a tuple for each word. By grouping on word and then applying a persistentAggregate, the count of each word over time can be obtained. Every time a word appears in an edit's title, its aggregate count is incremented by the number of times it appears:

.persistentAggregate(
    new MemoryMapState.Factory(),
    new Count(),
    new Fields("count")
)
.newValuesStream()

Finally, these values are read into the Debug function, which works very much like the LoggerBolt implemented earlier in this chapter:

.each(new Fields("word","count"), new Debug("written"));

After running the topology for a bit, there should be an output on the console that looks something like this:

DEBUG(written): [Category:Water, 1]
DEBUG(written): [Fleurimont, 1]
DEBUG(written): [for, 1]
DEBUG(written): [creation/Orin, 1]
DEBUG(written): [Wikipedia, 1]
DEBUG(written): [talk:Articles, 1]
DEBUG(written): [of, 5]
DEBUG(written): [players, 2]
DEBUG(written): [F.C., 1]
DEBUG(written): [List, 4]
DEBUG(written): [Arsenal, 1]
DEBUG(written): [Colby, 1]
DEBUG(written): [Jamie, 1]
DEBUG(written): [Special:Log/abusefilter, 1]
DEBUG(written): [Special:Log/abusefilter, 2]


Processing Data with Samza

A relative newcomer to the real-time processing space is another project from LinkedIn called Samza. Recently open sourced and added to the Apache Incubator family of projects, Samza is a real-time data processing framework built on top of the Apache YARN infrastructure. The project itself is still very young, especially compared to Storm, which has been around for a few years, but it is already possible to do useful things with it. This section describes the Samza architecture and how to get started using it. The section first gives an overview of Apache YARN, which serves as Samza's server infrastructure and takes the place of the Storm nimbus/supervisor servers (in fact, Storm can also be run on Apache YARN using Yahoo!'s Storm-YARN project from https://github.com/yahoo/storm-yarn). Next is a tour of the Samza application itself. Like Storm's Trident system, Samza provides some primitives for building common types of streaming applications and maintaining state within a processing application.

Apache YARN

Rather than implement its own server management framework, Samza off-loads much of its systems infrastructure onto Apache YARN. YARN, which stands for Yet Another Resource Negotiator, is used to manage the deployment, fault tolerance, and security of a Samza processing pipeline.

Background

The YARN project was originally born out of the limitations of the Hadoop project. The Hadoop project was built around a JobTracker server that managed the distribution of tasks (mappers and reducers) to other servers running the TaskTracker server. A client that wanted to submit a job would connect to the JobTracker and specify the input set, usually a distributed set of data blocks hosted on Hadoop's distributed file system, as well as any supporting code or data that needed to be distributed to each node. The JobTracker would then break this request into small tasks and schedule each of them on the TaskTrackers, as shown in Figure 5-4. This works well on modestly sized clusters, but it hits a practical limit in clusters of about 5,000 multicore servers. It also places practical limitations on the total number of tasks (either in a single job or spread across many jobs) because information about each of the tasks needs to be kept in memory on the JobTracker itself. As physical hardware continued to scale and modern datacenters made it possible to host very large clusters, these scaling limitations began to take their toll on Hadoop's scalability.


Additionally, new processing workloads, such as database-like applications and long-lived stream processing applications, were somewhat difficult to match to Hadoop's processing model, which assumes a long-lived but small set of reducer tasks coupled with a large number of short-lived mapper tasks.

Figure 5-4: Jobs submitted to the JobTracker are broken into tasks and scheduled across the TaskTrackers.

To address the needs of both growing clusters and changing workloads, YARN was developed and released as part of Hadoop 2.

Architecture

In the abstract, the YARN architecture is not so different from the original Hadoop infrastructure. Rather than a JobTracker and a TaskTracker, the top-level servers are now the ResourceManager and the NodeManager, respectively. The important difference is that these servers now manage applications, not individual tasks. An application in YARN consists of an ApplicationMaster and a number of containers that are hosted on each node. The ApplicationMaster, as might be guessed from the name, is in charge of coordinating a job and managing its assigned containers, which host the individual tasks. The relationship between these components is shown in Figure 5-5. In Hadoop 2's Map-Reduce implementation, the ApplicationMaster serves as the JobTracker. The difference is that each ApplicationMaster only manages the tasks for a single job instead of managing a large number of jobs. This allows each ApplicationMaster to operate independently in Map-Reduce settings. It also allows for more sophisticated resource management and security models because jobs are now essentially completely independent.


Figure 5-5: The ResourceManager coordinates NodeManagers, each of which hosts containers and, for each application, an ApplicationMaster.

Relationship to Samza

Samza is implemented as an application on top of YARN. The Samza application has the required ApplicationMaster, which is used to manage Samza TaskRunners hosted within YARN containers. The TaskRunners execute StreamTasks, which are the Samza equivalent of a Storm Bolt. All of Samza's communication is handled through Kafka brokers. Like HDFS DataNodes in a Hadoop Map-Reduce application, these brokers are usually co-located on the same machines hosting the Samza containers. Samza then uses Kafka's topics and natural partitioning to implement many of the grouping features found in stream processing applications.

Getting Started with YARN and Samza

Although Hadoop 2 has been available for some time, it is still not particularly common in production environments, though that is changing rapidly. Most importantly for many users, Hadoop 2 is now supported by Amazon's Elastic MapReduce product as a general release, making it easy to spin up a cluster. Apache YARN is also now supported by at least two of the major Hadoop distributions, with more being added. Using their respective cluster management tools to set up a YARN cluster is fairly painless. The only downside is that packaged distributions tend to have a somewhat arbitrary set of patches and versions that may lag the most recently released version of the Apache project. Additionally, it is possible to spin up a cluster using the Apache packages, either on a single node for experimentation or in a distributed fashion.

Single Node Samza

The easiest way to get started with Samza on a single node is to use the single-node YARN installation packaged with the Hello Samza project. This project includes Samza, ZooKeeper, Kafka, and YARN in a convenient package for development.


To get started, first check out the hello-samza Git repository from GitHub. This repository includes a script that will build and install a complete Samza installation:

$ git clone https://github.com/apache/incubator-samza-hello-samza.git
$ cd incubator-samza-hello-samza/
$ ./bin/grid bootstrap

JAVA_HOME NOT SET

On some systems, such as Mac OS X, when grid is run, it returns a JAVA_HOME not set error:

$ ./bin/grid bootstrap
JAVA_HOME not set. Exiting.

On OS X, the following sets the JAVA_HOME environment variable to the appropriate value:

$ export JAVA_HOME=$(/usr/libexec/java_home)

The first time it is run, the grid script builds and installs Samza into the local Maven repository. This eliminates the need to check out and build the git repository separately as indicated in the tutorial on the Samza website. It also downloads appropriate versions of ZooKeeper and Kafka for use with Samza and starts them. The build process takes 3 to 5 minutes, and installing ZooKeeper and Kafka takes a few more minutes depending on network speeds. The grid application also downloads and installs a single-node version of YARN. This download is around 100MB in size, so it is usually best to have a decent connection when first installing the Hello Samza project. If all has gone well, pointing a web browser at http://localhost:8088 should show something like Figure 5-6.

Figure 5-6


To shut it down, the grid script provides a command to stop all processes in the correct order:

$ ./bin/grid stop all

To start the grid again, simply use the start command:

$ ./bin/grid start all

Any processes that are already running as part of a different server pool, such as ZooKeeper, can be omitted. In that case, rather than using all, simply specify the names of the individual servers to start. The Hello Samza project also includes some example code that can be run. This chapter uses the first part of the example code for some of its examples. This Samza job reads the Wikipedia edit stream from the Wikipedia IRC servers and posts the edits as JSON to the wikipedia-raw topic on the local Kafka cluster. To start this job on the grid, execute the following command:

$ ./deploy/samza/bin/run-job.sh \
>   --config-factory=\
> org.apache.samza.config.factories.PropertiesConfigFactory \
>   --config-path=\
> file://$PWD/deploy/samza/config/wikipedia-feed.properties

After this job has been started, check that the edits are being posted to Kafka using the console consumer that ships with Kafka:

$ ./deploy/kafka/bin/kafka-console-consumer.sh \
>   --zookeeper localhost:2181 --topic wikipedia-raw

Multinode Samza

Getting Samza going in a multinode environment is very much like setting up a Hadoop 2 environment, except that HDFS is not required unless other applications will be using it. To begin, make sure a ZooKeeper cluster is available, referring to Chapter 3 if necessary for installation and configuration instructions. ZooKeeper is required for both Samza and the Kafka brokers. Next, install and configure Kafka brokers on the machines destined for the YARN grid, except for the machine to be used as the ResourceManager. Samza makes heavy use of Kafka for communication, so its brokers are usually co-located with the NodeManagers that make up the Samza YARN grid. For help with installing and configuring Kafka, refer to Chapter 4. Now that ZooKeeper and Kafka have been installed, set up the YARN cluster. In this section it is assumed that you are constructing this YARN cluster from the Apache build rather than one of the commercial distributions. It is also assumed that the appropriate package manager for the operating system being used does not have appropriate packages available for Hadoop 2.2.0 (many still have older Hadoop 1.0.3 packages).


First, download and unpack the Apache binary distribution, which is at version 2.2.0 at the time of writing, onto each of the machines:

$ wget http://mirrors.sonic.net/apache/hadoop/common/hadoop-2.2.0/hadoop-2.2.0.tar.gz
$ cd /usr/local
$ sudo tar xvfz ~/hadoop-2.2.0.tar.gz
$ sudo ln -s hadoop-2.2.0/ hadoop

Apache YARN relies on a number of environment variables to tell it where to find its various packages. There are a number of places this can be set, but to have it active for all users it should be set in either /etc/profile or /etc/profile.d/hadoop.sh. On most systems, /etc/profile automatically includes all scripts in /etc/profile.d. If YARN was unpacked into the /usr/local directory, the /etc/profile.d/hadoop.sh file would look like this:

export YARN_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=$YARN_HOME
export HADOOP_COMMON_HOME=$YARN_HOME
export HADOOP_HDFS_HOME=$YARN_HOME
export PATH=${PATH}:${YARN_HOME}/bin:${YARN_HOME}/sbin

To see these changes reflected in the environment, log out and log back in again or open a new terminal. After doing that, checking the YARN_HOME environment variable should give /usr/local/hadoop, and a check of the Hadoop version should return 2.2.0:

$ echo $YARN_HOME
/usr/local/hadoop
$ hadoop version
Hadoop 2.2.0
Subversion https://svn.apache.org/repos/asf/hadoop/common -r 1529768
Compiled by hortonmu on 2013-10-07T06:28Z
Compiled with protoc 2.5.0
From source with checksum 79e53ce7994d1628b240f09af91e1af4
This command was run using /usr/local/hadoop-2.2.0/share/hadoop/common/hadoop-common-2.2.0.jar

Because this grid is being used for Samza, it is not necessary to configure HDFS. If the grid is also going to be used for Map-Reduce workloads, refer to the appropriate YARN documentation for the configuration of the NameNode and DataNode servers. The only thing that needs to be configured for Samza is the yarn-site.xml file found in /usr/local/hadoop/etc/hadoop (replacing resource-manager.mydomain.net with an appropriate hostname):


<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>resource-manager.mydomain.net</value>
  </property>
</configuration>

It should now be possible to start the ResourceManager and NodeManagers on the grid. To start the ResourceManager, log in to the machine and use yarn-daemon.sh to start the server:

$ yarn-daemon.sh --config $YARN_HOME/etc/hadoop start resourcemanager

Then, on each of the nodes in the Samza grid, start the NodeManager in the same way:

$ yarn-daemon.sh --config $YARN_HOME/etc/hadoop start nodemanager

The ResourceManager starts a web server on port 8088 by default; it can be checked to ensure each of the nodes has reported to the resource manager. The most common problem at this point is an incorrect firewall setting.

Integrating Samza into the Data Flow

Integrating Samza into an existing Kafka environment is straightforward, as Samza uses Kafka for all communication. If there is an existing set of brokers already handling production load, simply use MirrorMaker as described in Chapter 4 to mirror the desired topics into the Samza Kafka grid. From there, Samza has easy access to the incoming topics. Alternatively, install the Samza grid on the same machines as the Kafka brokers used to collect data. This has some slight operational disadvantages, as it is always possible a processing job could lock up a machine and bring it down. However, it is likely more operationally efficient because Kafka brokers usually have spare processing cycles.

Samza Jobs

With a configured cluster and data being imported to the Kafka portion of the grid, it is time to implement a Samza Job. A Job in Samza parlance is roughly equivalent to a Bolt in Storm. However, rather than being assembled into a Topology, Jobs in Samza are all independent entities, and any composition is simply a matter of reading or writing to a particular Kafka topic. On the one hand, this potentially allows for significantly more efficient uses of resources and easier management of highly interconnected flows. For example, a single Samza Job can be responsible for taking an input stream and producing a "valid" input stream that can be used by any number of downstream Job implementations. New processes can also be easily added to take advantage of these streams, whereas before they might have had to reprocess the entire raw stream of data again with only a small change.


The downside of this approach is that the structure of the topology of jobs is now purely abstract. In the Storm model, a topology is a distinct thing, and all of the processing steps are controlled and monitored from a central location. In the Samza model, the topology is implied in the arrangement of topics, but never explicitly stated. This can lead to problems with managing changes in upstream Job implementations or the inadvertent introduction of cycles into the topology’s structure.

Preparing a Job Application

A Samza Job is made up of two pieces. The first is the actual code, which is an implementation of the StreamTask interface. The second piece is the configuration of the Task, including the name of the input streams and the configuration of logging, monitoring, and other ancillary facilities. Any number of Job implementations can be hosted and packaged together for ease of deployment. If not using the grid implementation previously described, it will be necessary to install the Samza Maven packages. These packages are not yet available on Maven Central or another Maven repository, so they need to be built and installed locally. Check out the Git repository for Samza and then run the Gradle build using the provided gradlew driver:

$ git clone http://git-wip-us.apache.org/repos/asf/incubator-samza.git
$ cd incubator-samza
$ ./gradlew -PscalaVersion=2.8.1 clean publishToMavenLocal

When this is complete, the project containing the Job implementation should be updated to include the samza-api dependency in its pom.xml file:

<dependency>
  <groupId>org.apache.samza</groupId>
  <artifactId>samza-api</artifactId>
  <version>0.7.0</version>
</dependency>

Configuring a Job

Samza's Job configurations are accomplished through the use of a Properties file that is passed to the Samza framework when submitting the Job to the YARN framework. The Properties file starts with a Job factory class specification and a Job name, along with a distribution package:

job.factory.class=org.apache.samza.job.yarn.YarnJobFactory
job.name=wordcount-split
yarn.package.path=file://${basedir}/target/${project.artifactId}-${pom.version}-dist.tar.gz


The factory class will generally always be YarnJobFactory, and the name is currently set to wordcount-split, which is implemented in the next section. The yarn.package.path is filled in by the build process and specifies the name of an archive that YARN transfers to each of the nodes. This contains any support JAR files that might be needed along with the code that implements the Job. Next, the task is defined. This consists, minimally, of the task.class and task.inputs properties:

task.class=wiley.streaming.samza.WordSplitTask
task.inputs=kafka.wikipedia-raw

It can also include properties for a task's checkpoint feature, which uses Kafka to record the task's state and allows Samza to recover in the event of a failure:

task.checkpoint.factory=org.apache.samza.checkpoint.kafka.KafkaCheckpointManagerFactory
task.checkpoint.system=kafka
task.checkpoint.replication.factor=1

Next up is Samza's comprehensive metric reporting system. It is entirely optional and, at the moment, has both a snapshot and a jmx flavor available. This example uses both flavors at the same time:

metrics.reporters=snapshot,jmx
metrics.reporter.snapshot.class=org.apache.samza.metrics.reporter.MetricsSnapshotReporterFactory
metrics.reporter.snapshot.stream=kafka.metrics
metrics.reporter.jmx.class=org.apache.samza.metrics.reporter.JmxReporterFactory

The Serialization section of the Job configuration allows for the definition of the various Serializers and Deserializers (these are often combined into "SerDe" in processing jargon) used in the Job. This example uses the built-in JSON SerDe as well as the Metrics SerDe used by Samza for its metric snapshots:

serializers.registry.json.class=org.apache.samza.serializers.JsonSerdeFactory
serializers.registry.metrics.class=org.apache.samza.serializers.MetricsSnapshotSerdeFactory

Finally, the Input and Output systems are defined. These implement streams in the Samza environment and are pluggable systems, though only the Kafka stream is shipped with Samza at the moment. This configuration defines the Kafka system used by task.inputs as well as the output defined within the task itself. It also defines the metrics stream system used by the earlier metrics definition:

systems.kafka.samza.factory=org.apache.samza.system.kafka.KafkaSystemFactory
systems.kafka.samza.msg.serde=json
systems.kafka.consumer.zookeeper.connect=localhost:2181/


systems.kafka.consumer.auto.offset.reset=largest
systems.kafka.producer.metadata.broker.list=localhost:9092
systems.kafka.producer.producer.type=sync
# Normally, we'd set this much higher, but we want things to
# look snappy in the demo.
systems.kafka.producer.batch.num.messages=1
systems.kafka.streams.metrics.samza.msg.serde=metrics

The systems.kafka.consumer.zookeeper.connect and systems.kafka.producer.metadata.broker.list values should be updated to reflect the deployment environment. If using the Hello Samza grid, these values are appropriate.

Task Communication

The systems and serializers Job properties in the previous section define the input and, to some extent, output mechanisms of the task to be implemented. These are made available in Samza via the SystemConsumer and SystemProducer interfaces, which define a task's input and output, respectively. The SystemConsumer is defined by the task.inputs property in the Job configuration and is responsible for polling the input partitions. It produces IncomingMessageEnvelope objects that are then passed to the task. This object consists of three parts:

■ The key, accessible via the getKey method. This is the key used to partition the data within the Kafka stream and corresponds directly to the key portion of a Kafka message.
■ The message, accessible via the getMessage method. This typically contains the payload of the stream.
■ The SystemStreamPartition object, accessible via the getSystemStreamPartition method. This contains metadata about the message, such as its originating partition and offset information. Most tasks do not need to access this information directly.

A sketch of pulling these three pieces apart inside a task appears below.
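The following is a minimal, purely illustrative sketch of a process method that touches each part of the envelope; it assumes the Samza 0.7-era API described in this section, and the variable names are arbitrary:

public void process(IncomingMessageEnvelope envelope,
    MessageCollector collector, TaskCoordinator coordinator)
    throws Exception {
    // Kafka partitioning key (may be null if the producer did not set one)
    Object key = envelope.getKey();

    // Deserialized payload; with the JSON SerDe this is a Map
    Object message = envelope.getMessage();

    // Metadata about where the message came from
    SystemStreamPartition ssp = envelope.getSystemStreamPartition();
    String topic = ssp.getStream();
    int partition = ssp.getPartition().getPartitionId();
}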

The SystemProducer is usually accessed through a SystemStream defined as a static final member of the task's class. For example, to define a Kafka output stream on the topic "output", add the following line to the class definition:

private static final SystemStream OUTPUT = new SystemStream("kafka","output");

This SystemStream object is used to initialize an OutgoingMessageEnvelope along with the message payload and an optional key payload. This object is then sent to the MessageCollector, which manages placing the outgoing message on the specified stream.
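As a sketch only (the payload and key values are illustrative, and the three-argument OutgoingMessageEnvelope constructor that accepts a key is assumed), sending a message to the OUTPUT stream from inside a process method would look something like this:

// Inside a StreamTask's process method; OUTPUT is the SystemStream
// defined above, and the key/payload values are illustrative.
Map<String, String> payload = new HashMap<String, String>();
payload.put("word", "example");
collector.send(new OutgoingMessageEnvelope(OUTPUT, "example", payload));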


Implementing a Stream Task

The task itself is defined by the StreamTask interface, which is implemented by all task classes. This interface has a single process method, which takes IncomingMessageEnvelope, MessageCollector, and TaskCoordinator as arguments. The IncomingMessageEnvelope is a message from the task.inputs streams as described in the previous section, whereas the MessageCollector is used to emit events to output streams, which are defined within the class. For example, the following task splits the input of the wikipedia-raw topic into words, the same way words were split in the Trident example earlier in this chapter. In this case the JSON SerDe is used to parse the incoming data (which is JSON), and the resulting words are sent to the wikipedia-words topic.

public class WordSplitTask implements StreamTask {
    private static final SystemStream OUTPUT_STREAM =
        new SystemStream("kafka","wikipedia-words");

    public void process(IncomingMessageEnvelope envelope,
        MessageCollector collector, TaskCoordinator coordinator)
        throws Exception {
        try {
            @SuppressWarnings("unchecked")
            Map<String,Object> json = (Map<String,Object>)envelope.getMessage();
            String raw = (String)json.get("raw");
            raw = raw.substring(2, raw.indexOf("]]"));
            for(String word : raw.split("\\s+")) {
                HashMap<String,String> val = new HashMap<String,String>();
                val.put("word", word);
                collector.send(new OutgoingMessageEnvelope(OUTPUT_STREAM,val));
            }
        } catch(Exception e) {
            System.err.println(e);
        }
    }
}

Initializing Tasks, Windows

In addition to the StreamTask interface, a task may implement the InitableTask interface. This adds an init method that is called when the task object is created. This is used to establish connections to resources at run time that cannot be statically initialized. Tasks that should periodically emit values, such as counting operations, can implement the WindowableTask interface. This interface adds a window method that is periodically called by the Samza framework.


For example, a task that consumes the output of wikipedia-words from the previous section and maintains counts of the individual words might use all three interfaces. In this case, the init method is not strictly necessary, but it is included for completeness:

public class WordCountTask implements StreamTask, InitableTask,
    WindowableTask {

    private static final SystemStream OUTPUT_STREAM =
        new SystemStream("kafka","wikipedia-counts");

    HashMap<String,Integer> counts = new HashMap<String,Integer>();
    HashSet<String> changed = new HashSet<String>();

    public void window(MessageCollector arg0, TaskCoordinator arg1)
        throws Exception {
        for(String word : changed) {
            HashMap<String,Object> val = new HashMap<String,Object>();
            val.put("word", word);
            val.put("count", counts.get(word));
            arg0.send(new OutgoingMessageEnvelope(OUTPUT_STREAM,val));
        }
        changed.clear();
    }

    public void init(Config arg0, TaskContext arg1) throws Exception {
        counts.clear();
        changed.clear();
    }

    public void process(IncomingMessageEnvelope arg0,
        MessageCollector arg1, TaskCoordinator arg2) throws Exception {
        @SuppressWarnings("unchecked")
        Map<String,Object> json = (Map<String,Object>)arg0.getMessage();
        String word = (String) json.get("word");
        counts.put(word,
            (counts.containsKey(word) ? counts.get(word) : 0) + 1);
        changed.add(word);
    }
}

The properties file that goes along with this Task specifies a 10-second windowing cycle:

# Job
job.factory.class=org.apache.samza.job.yarn.YarnJobFactory
job.name=word-count

# YARN
yarn.package.path=file://${basedir}/target/${project.artifactId}-${pom.version}-dist.tar.gz

# Task
task.class=wiley.streaming.samza.WordCountTask
task.inputs=kafka.wikipedia-words
task.window.ms=10000

# Serializers
serializers.registry.json.class=org.apache.samza.serializers.JsonSerdeFactory

# Systems
systems.kafka.samza.factory=org.apache.samza.system.kafka.KafkaSystemFactory
systems.kafka.samza.msg.serde=json
systems.kafka.consumer.zookeeper.connect=localhost:2181/
systems.kafka.consumer.auto.offset.reset=largest
systems.kafka.producer.metadata.broker.list=localhost:9092
systems.kafka.producer.producer.type=sync
systems.kafka.producer.batch.num.messages=1

Packaging a Job for YARN

To package Jobs for YARN, you must create a distribution archive. This archive contains not only the JAR file for the Job implementation, but all of the configuration files and dependencies for the project. It also includes shell scripts for starting Jobs using YARN. This archive will be distributed by YARN to each of the nodes that will run the Job. The easiest way to construct this archive is to use the Maven assembly plug-in. This plug-in is added to the plugins section of the pom.xml file:

<plugin>
  <artifactId>maven-assembly-plugin</artifactId>
  <version>2.3</version>
  <configuration>
    <descriptors>
      <descriptor>src/main/assembly/src.xml</descriptor>
    </descriptors>
  </configuration>
  <executions>
    <execution>
      <id>make-assembly</id>
      <phase>package</phase>
      <goals>
        <goal>single</goal>
      </goals>
    </execution>
  </executions>
</plugin>


This plug-in references an assembly XML file, usually located at src/main/assembly/src.xml. It defines the files that should be included as well as the dependencies to include. It begins by defining the type of assembly to build, in this case a tar.gz file:

<assembly>
  <id>dist</id>
  <formats>
    <format>tar.gz</format>
  </formats>
  <includeBaseDirectory>false</includeBaseDirectory>

Next, include various nondependency files. This includes, most importantly, the configuration files for the Jobs included in the artifact:

  <fileSets>
    <fileSet>
      <directory>${basedir}/..</directory>
      <includes>
        <include>README*</include>
        <include>LICENSE*</include>
        <include>NOTICE*</include>
      </includes>
    </fileSet>
  </fileSets>
  <files>
    <file>
      <source>${basedir}/src/main/config/word-split.properties</source>
      <outputDirectory>config</outputDirectory>
      <filtered>true</filtered>
    </file>
    <file>
      <source>${basedir}/src/main/config/word-count.properties</source>
      <outputDirectory>config</outputDirectory>
      <filtered>true</filtered>
    </file>
  </files>

The final section includes the shell scripts from the samza-shell artifact as well as specifies the dependencies that should be included in the distribution tar.gz file:


  <dependencySets>
    <dependencySet>
      <outputDirectory>bin</outputDirectory>
      <includes>
        <include>org.apache.samza:samza-shell:tgz:dist:*</include>
      </includes>
      <fileMode>0744</fileMode>
      <unpack>true</unpack>
    </dependencySet>
    <dependencySet>
      <outputDirectory>lib</outputDirectory>
      <includes>
        <include>org.apache.samza:samza-core_2.8.1</include>
        <include>org.apache.samza:samza-kafka_2.8.1</include>
        <include>org.apache.samza:samza-serializers_2.8.1</include>
        <include>org.apache.samza:samza-yarn_2.8.1</include>
        <include>org.slf4j:slf4j-log4j12</include>
        <include>org.apache.kafka:kafka_2.8.1</include>
      </includes>
      <useTransitiveFiltering>true</useTransitiveFiltering>
    </dependencySet>
  </dependencySets>
</assembly>

Executing Samza Jobs

After building the distribution archive, copy it to an appropriate location. For a multinode installation, this would be a machine with the appropriate YARN client installed. Unpack the archive into an appropriate directory, such as deploy:

$ mkdir deploy
$ cd deploy
$ tar xvfz ../target/streaming-chapter-5-1-dist.tar.gz
$ cd ..

Then, use the included run-job.sh script to submit the Jobs to YARN. For example, to submit the WordSplitTask and the WordCountTask examples for this chapter, execute the following commands:

$ ./deploy/bin/run-job.sh \
>   --config-factory=\
> org.apache.samza.config.factories.PropertiesConfigFactory \
>   --config-path=\
> file://$PWD/deploy/config/word-split.properties
$ ./deploy/bin/run-job.sh \
>   --config-factory=\
> org.apache.samza.config.factories.PropertiesConfigFactory \
>   --config-path=\
> file://$PWD/deploy/config/word-count.properties


Then check the wikipedia-counts topic for data to ensure the jobs are running properly:

$ ./deploy/kafka/bin/kafka-console-consumer.sh \
>   --zookeeper localhost:2181 --topic wikipedia-counts

Conclusion

This chapter introduced two frameworks for stream processing: the more mature Storm framework and the newer Samza framework. Both frameworks are fairly new and have a long way to go before they can be considered complete. However, both can be, and are, used to build useful streaming applications today. Both can be adapted to fill nearly any need and will become more complete as they mature. After the processing is done, the output usually needs to be saved somewhere to be useful. The next chapter covers storage options for these streaming frameworks so that the data can be used by front-end applications for visualization or other tasks.

CHAPTER 6

Storing Streaming Data

One of the primary reasons for building a streaming data system is to allow decoupled communication and access between different aspects of the system. A key component is the storage and backup mechanism for both raw data and data that has been processed by one or more of the processing environments covered in the previous chapter. Processing the data is one thing, but for it to be delivered to the end user it needs to be stored somewhere. That storage location could be the processing system itself, using something like Storm's Distributed Remote Procedure Calls (DRPC) and in-bolt memory storage. However, in a production environment this simply isn't practical. First, the data usually needs to persist for a time, which means the memory requirements become prohibitive. Second, it means that maintenance on the processing system necessitates an outage of any external interfaces, despite the fact that the two have nothing to do with each other. Finally, it is usually desirable to persist results to tertiary storage (disks or "cloud" storage devices) so that the data can be more easily analyzed for long-term trends. This chapter considers how to store data after it has been processed. There are a number of storage options available for processing systems that need to deliver their data to some sort of front-end interface, typically either an application programming interface (API) or a user interface (UI). Although there are dozens of potential options, this chapter surveys some of the more common choices.



Systems with different philosophies and constraints are intentionally chosen to highlight these differences. This allows for more informed decision-making when considering storage for specific applications. For longer term storage and analysis, a batch system often makes more sense than a streaming data system. Largely, this is because a streaming system chooses to trade off relatively expensive storage options, such as main memory, for lower random access latency. A batch system takes the opposite side of this tradeoff, choosing high-capacity storage with high random access latency, such as traditional spinning platters. Fortunately, whereas a batch system's random access performance is usually not sufficient for streaming systems, its linear read performance is often very good, making it an excellent choice for offline analysis. Hadoop has now become the gold standard for these batch environments. This chapter describes setting up and integrating Hadoop into the streaming environment. This chapter also discusses the basics of using Hadoop as a gateway to existing business intelligence infrastructures. The reason for this is that, although this book is focused on real-time streaming data analysis, no modern analytics system exists in a vacuum. A system must integrate with other pieces of an organization's environment if it hopes to gain widespread adoption.

Consistent Hashing

Several of the data stores described in the remainder of this chapter support the distribution of data across several servers. This allows them to scale more easily as the size of the data grows. It also generally improves performance for both updates and queries. A popular technique to implement the distribution of data is known as consistent hashing. Most data stores have some notion of a "key" element. In a relational database it is called a primary key, and in key-value stores it is simply called the key. The most important property of the key is that there can be only a single entry for it in the data store (relational databases that do not have a specified primary key generally create an arbitrary primary key that is hidden from the user). Each of these primary keys corresponds to a numerical value that is then assigned to a specific server. The mechanism used to perform this assignment varies, but the effect is that a specific primary key is always assigned to the same server for storage and querying. This helps to distribute the load across a variety of servers, but does so at the expense of reliability. If any of the servers crashes, then the data stored on that server becomes unavailable; and, as the number of servers increases, the probability that some server has crashed at a given time increases.


To improve the reliability of the system, the data is instead consistently hashed. In consistent hashing the servers are first placed into a specific stable ordering, called a ring. When a primary key is modified, the initial server selection proceeds as before, but the data is also modified on the next k servers in the ring (the ring exists so that the servers to modify "wrap around" to the beginning of the list if the primary server is within k servers of the end of the list). If one of the servers becomes unavailable, clients can simply update and query the remaining servers. If the server is returned to the pool, it can then copy its data from one of the other servers before considering itself up-to-date. The data store undergoes a similar process when adding a new server or permanently removing a server from the ring. When adding a server, the new server copies the data appropriate to its location in the ring. When removing a server, a pre-existing server now needs to hold the data from that position in the ring, which it copies from the existing servers. Both of these operations can happen in the background because there are other servers in the ring that can serve queries for the data the server being updated is missing; updates from a client can simply be made to the new server without checking.
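To make the ring concrete, the following is a minimal sketch of a consistent-hash ring, not a production implementation: real systems use a stronger hash function than String.hashCode and many "virtual nodes" per physical server to even out the distribution, and the class and method names here are purely illustrative.

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.TreeMap;

public class HashRing {
    private final TreeMap<Integer, String> ring = new TreeMap<Integer, String>();
    private final int k; // number of additional replicas per key

    public HashRing(int k) { this.k = k; }

    public void addServer(String server) { ring.put(hash(server), server); }
    public void removeServer(String server) { ring.remove(hash(server)); }

    // The primary server for a key plus the next k servers around the ring
    public List<String> serversFor(String key) {
        List<String> servers = new ArrayList<String>();
        if (ring.isEmpty()) return servers;
        Integer start = ring.ceilingKey(hash(key));
        if (start == null) start = ring.firstKey(); // wrap around the ring
        Iterator<String> it = ring.tailMap(start).values().iterator();
        int wanted = Math.min(k + 1, ring.size());
        while (servers.size() < wanted) {
            if (!it.hasNext()) it = ring.values().iterator(); // wrap around
            servers.add(it.next());
        }
        return servers;
    }

    private int hash(String value) { return value.hashCode(); }
}

With k set to 1, serversFor returns the primary server and one replica for each key; adding or removing a server only affects the keys adjacent to it on the ring.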

“NoSQL” Storage Systems

So-called "NoSQL" storage systems have grown in popularity over the last few years, largely in step with the rise of Big Data applications. This section covers a broad range of different approaches that are largely characterized by high-performance reads and writes, usually at the expense of the usual requirements of a transactional database, the complexity of the queries, or (most often) both. For many Big Data and, in particular, real-time streaming applications, these are considered to be acceptable tradeoffs. Especially when using the Lambda Architecture discussed later in this chapter, the requirements of a data store for real-time analysis can be quite relaxed. Although there are dozens of different storage options available for streaming applications, this section discusses just three. They have been chosen to demonstrate the range of possible choices and, in a production system, it is likely that more than one of these systems will be in use as they have different strengths and weaknesses. These systems are also generally a complement rather than a replacement for the relational database. The first of these stores is the simplest: a key-value store. There are many different key-value stores, but Redis is one of the more popular options. It has performance similar to that of Memcached, but provides native support for higher order data structures that are very useful for many streaming analysis applications. Although not natively distributed, it focuses on very high-performance single-machine applications.


The next store is MongoDB, which is a document store. It is schema-less and has proven popular for applications that maintain rich profiles or data that can be naturally ordered into documents. It supports master-slave replication as well as sharding (partitioning). It can support very high write performance, also near Memcached levels, at the expense of safety. In modern versions of the database, this tradeoff can be tuned at the client level. Finally, Cassandra is a decentralized database system that takes features from both key-value stores and tabular databases, using elements from both Amazon's DynamoDB and Google's BigTable. Early versions suffered from a difficult-to-use query language, which essentially exposed internal data structures. However, more recent versions have introduced a much more usable query language that strongly resembles SQL, making it a much more viable option. The rest of this section discusses the usage of each option in turn. The question of which one is best is left largely unanswered. This is because there is no "best" technology for all applications. Does this mean that multiple database technologies might be used in a single environment? Absolutely. This is why the last few chapters have been concerned with building a flexible data-flow system: Data needs to flow and be transformed between systems.

Redis

Redis is a simultaneously simple and complicated key-value store. It is simple in that it makes little attempt to solve two problems often tackled by key-value stores: working sets larger than main memory and distributed storage. A Redis server can only serve data that fits in its main memory. Although it has some replication facilities, it does not support features such as eventual consistency, and even though built-in clustering has been in the works for some time, sharding and consistent hashing are provided by outside services, if they are provided at all. In many ways, Redis is more closely related to caching systems like Memcached than to a database. Unlike caching systems, Redis provides server-side support for high-level data structures with atomic update capability. The basic structures are lists, sets, hash tables, and sorted sets in addition to the basic key-value functionality. It also includes the capability to expire keys and a publish-subscribe mechanism that can be used as a messaging bus between, for instance, the real-time streaming processing system and a front end. Later chapters provide more detail, but when keys are updated a front-end server can be notified of the event so that the user interface may be updated.

Getting Set Up

On most systems, Redis is available through the standard package manager. However, if Redis is out-of-date or otherwise unavailable through a package manager, it is easy to build on Unix-like systems.

After downloading the appropriate archive, the usual make && make install will build a suitable version of Redis. By default, this build is 64 bit, but it is possible to build a 32-bit version. Building a 32-bit version will save some memory, but it limits Redis' total footprint to 4GB.

NOTE For users of Amazon’s EC2 service, Redis is now included as part of the Elasticache. The version usually lags the most recent release, but it makes it really easy to get started using Redis in either a development or a production environment. Just select redis from the drop-down (it shows memcache by default).

Redis ships with a fairly good set of default configuration parameters, but there are some options that should be considered when using Redis. The configuration entries are usually held in the redis.conf file passed to the server when it starts.

$ redis-server
[... Redis ASCII-art banner: Redis 2.6.16 64 bit, running in stand alone mode, Port: 6379, PID: 23908, http://redis.io ...]

[23908] 09 Nov 14:16:16.249 # Server started, Redis version 2.6.16
[23908] 09 Nov 14:16:16.250 * The server is now ready to accept connections on port 6379

Working with Redis

The easiest way to start exploring Redis is using the redis-cli tool to connect to a redis server. Having started a local copy of the redis server, starting the redis client should look something like this:

$ redis-cli
redis 127.0.0.1:6379> info
# Server
redis_version:2.6.16


redis_git_sha1:00000000

[...]

# Memory
used_memory:1082448
used_memory_human:1.03M
used_memory_rss:319488
used_memory_peak:1133520
used_memory_peak_human:1.08M
used_memory_lua:31744
mem_fragmentation_ratio:0.30
mem_allocator:libc

[...]

# Replication
role:master
connected_slaves:0

# CPU
used_cpu_sys:69.16
used_cpu_user:35.95
used_cpu_sys_children:0.00
used_cpu_user_children:0.00

# Keyspace
db0:keys=1,expires=0,avg_ttl=0

The preceding output shows the results of the info command. Some of the output has been omitted to save space, but there are some items of interest that become useful when administering and monitoring a Redis instance. Because Redis is a key-value store, the most basic commands are for setting and getting keys. Keys in the Redis world are generally considered to be string values, even when they are being used to represent integer or floating-point values. To set or get a key, use the SET and GET commands, respectively. If there are several keys to be set or retrieved in a single call, you can use the MSET and MGET commands instead of issuing multiple calls. In all cases, if a key does not exist, the GET command and its variants return a NULL value. If only an existence check is required, the EXISTS command is also available. The existence command is usually used in concert with Redis' scripting functionality, which is discussed later. The next-most-complicated data structure in Redis is the hash table. This form of key allows for the storage of subkeys, similar to the Map collection in Java or a Dictionary type in other languages such as JavaScript.


Instead of the simple GET and SET commands, you use HSET and HGET to access elements of the hash table, with HMSET and HMGET allowing for updates or retrieval of multiple subkeys from a hash table. To retrieve all subkeys and their values, use HGETALL, with HKEYS and HVALS retrieving all subkey names or subkey values, respectively. Both normal keys and hash tables support atomic counters. For normal keys, the INCR and INCRBY commands increment the value of the key, by 1 for the former or by the value specified in the command for the latter. In both cases, the string value contained within the key is interpreted as a 64-bit signed integer for the purposes of arithmetic. Decrementing a counter is handled by the DECR and DECRBY commands. The INCRBYFLOAT command works like INCRBY, except that the string contained in the value is interpreted as a double-precision floating-point value. Atomic counters for hash tables operate on subkeys. Rather than implement the full suite of counter commands, hash tables only implement variants of INCRBY and INCRBYFLOAT, called HINCRBY and HINCRBYFLOAT, respectively. The sketch below shows how these key and hash commands look when issued from a Java client.
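This sketch is purely illustrative: it uses the third-party Jedis client (not part of Redis itself, and not otherwise introduced in this book), a local server on the default port, and made-up key names; any Redis client exposes the same commands.

import java.util.Map;
import redis.clients.jedis.Jedis;

public class RedisCounterSketch {
    public static void main(String[] args) {
        // Connect to a local Redis server on the default port
        Jedis jedis = new Jedis("localhost", 6379);

        // Plain keys with atomic counters
        jedis.set("page:home", "0");
        jedis.incr("page:home");        // INCR   -> "1"
        jedis.incrBy("page:home", 10);  // INCRBY -> "11"

        // A hash keeps related metrics under a single key
        jedis.hset("metrics:today", "views", "0");
        jedis.hincrBy("metrics:today", "views", 1);           // HINCRBY
        jedis.hincrByFloat("metrics:today", "latency", 0.25); // HINCRBYFLOAT

        // HGETALL returns every subkey and value in one call
        Map<String, String> metrics = jedis.hgetAll("metrics:today");
        System.out.println(metrics);
    }
}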

TIP At first glance, a hash with its subkeys seems unnecessary because there is already a key-value store. However, there are many cases where using a hash makes more sense than simply using keys. First, if there are a few different metrics to be stored, the hash storage mechanism can actually be more efficient than using separate keys. This is because Redis' hash table implementation uses ziplists for small hashes to improve storage performance. Second, if these metrics are all related, then the HGETALL command can be used to retrieve all of the metrics in a single command. The nature of the hash structure still lets all of the keys be updated independently, so there is no real disadvantage. Finally, it is common to expire data in the Redis database after some time. Largely, this is done to control the memory footprint of the database. Using a hash table means only having to expire a single key, which helps to maintain consistency in reporting.

The list data structure in Redis is often used to implement various queue-like structures. This is reflected in the command set for the list data type, which is focused on pushing and popping elements from the list. The basic commands are LPUSH (RPUSH) and LPOP (RPOP). Redis lists are bidirectional, so the performance is the same for both left and right inserts. Using the same "side" of command (for example, LPUSH and LPOP) results in stack, or Last-In-First-Out (LIFO), semantics. Using opposite-sided commands (for example, LPUSH and RPOP) results in queue, or First-In-First-Out (FIFO), semantics.


IMPLEMENTING A RELIABLE QUEUE

Redis provides a fairly simple mechanism for implementing a reliable queue using the RPOPLPUSH (or its blocking equivalent) and the LREM commands. As the name suggests, the RPOPLPUSH command pops an element off the tail of a list, like the RPOP command, and then pushes it onto the head of a second list, like the LPUSH command, in a single atomic action. Unlike separate RPOP and LPUSH commands, this ensures that the data cannot be lost in transit between the clients. Instead, the item to be processed remains on an "in-flight" list of elements currently being processed. The command returns the element moved between the two lists for processing. When processing is complete, LREM is executed against the processing list to remove the element. This simple example shows the basic process for implementing the queue:

redis 127.0.0.1:6379> LPUSH queue item1 item2 item3 item4 item5
(integer) 5
redis 127.0.0.1:6379> RPOPLPUSH queue processing
"item1"
redis 127.0.0.1:6379> LRANGE queue 0 100
1) "item5"
2) "item4"
3) "item3"
4) "item2"
redis 127.0.0.1:6379> LRANGE processing 0 100
1) "item1"
redis 127.0.0.1:6379> LREM processing 0 item1
(integer) 1
redis 127.0.0.1:6379> LRANGE processing 0 100
(empty list or set)
redis 127.0.0.1:6379>

The "Implementing a Reliable Queue" example demonstrates pushing several items onto the queue list. Next an element is moved from the queue list to the processing list. Finally, the element is removed from the processing list. When implementing this using a client, it is usually more common to use the BRPOPLPUSH command. This is a blocking version of the normal command that waits until an element is pushed onto the list. This avoids the need to poll on the client side. The command supports a timeout feature in case the client needs to perform some other function. Another client is usually implemented to handle elements that have been placed on the processing list but never removed. The cleanup client can periodically scan the list and, if it sees an element twice, push it back onto the queue list for reprocessing. There is no need to remove the original item from the processing queue; a successful LREM will remove all copies of the element, and the processing queue should generally remain small, so size is not a concern.


The other two data types implemented by Redis are Sets and Sorted Sets. Like the Java Set collection, these data structures maintain lists of distinct values. Most of the Set and Sorted Set operations have O(n) processing time in the size of the set. Basic sets are fairly simple constructs and useful for keeping track of unique events or objects in a real-time stream. This can become infeasible when the size of the set gets very large, mostly due to the space requirements, which are linear in the number of unique elements in the set. Adding elements to the set is accomplished through the SADD command, which takes a KEY and any number of elements to add. It returns the number of elements newly added to the set. Elements are deleted from the set using the SREM command. It has the same arguments as the SADD command. The SPOP command works similarly to SREM, but returns and removes a random element from the set. The SMOVE command combines the SADD and SREM commands into a single atomic operation to move elements between two different sets. Set membership queries are done using the SISMEMBER command. This command, unfortunately, only takes a single element as its argument. To implement "any" or "all" variants of membership queries, multiple SISMEMBER commands must be issued, which is accomplished using Redis' scripting capabilities, discussed later in this section.


be added together for each element present in both sets. This can be modifi ed by the optional AGGREGATE parameter. This parameter takes a value of SUM (the default), MIN, or MAX. When SUM is selected the scores of elements contained in two or more sets will be added together after their weight has been applied. When MIN is used, the smallest weighted value is used. For MAX, the largest weighted value is used. Sorted sets introduce a few new commands related to using the scores asso- ciated with each element. The ZINCRBY command increments the score of an element in the set. If the element is not present, it is assumed to have a score of 0.0 and is inserted into the sorted set at that time. The ZCARD command works the same way as SCARD, returning the cardinal- ity of the entire set. In addition, the ZCOUNT command returns the number of elements between two scores. For example,

ZCOUNT myzset 1 3 returns the number of elements between 1 and 3 inclusive (that is, 1 ≤ x ≤3). To return an exclusive count (that is, 1 < x < 3), a ( is prepended to the scores like so:

ZCOUNT myzset (1 (3

To retrieve the elements within a range rather than the count, the ZRANGEBYSCORE and ZREVRANGEBYSCORE commands are used. Their arguments are the same as ZCOUNT, but it is important to remember that ZREVRANGEBYSCORE expects its first argument to be larger than its second argument. For example,

ZREVRANGEBYSCORE myzset 1 3

does not return any elements, whereas

ZREVRANGEBYSCORE myzset 3 1

returns the expected results. Both commands take an optional WITHSCORES parameter that returns each element along with its associated score. They also both support a LIMIT parameter that takes an offset and a count for implementing paged interfaces. The offset requires that the sorted set be traversed, so it can be slow for very large offsets. For applications like leaderboards, the ZRANGE and ZREVRANGE commands return the elements between positions start and stop. These positions are 0-indexed, so retrieving the top 10 values by score looks like this:

ZREVRANGE myzset 0 9

Like many programming languages, the same effect can be achieved with ZRANGE and negative positions: the -1 position is the last element of the set, and so on.
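Purely as an illustration of how the sorted-set commands look from a client (the third-party Jedis library, the key, and the member names are assumptions, not anything shipped with Redis), a simple leaderboard might be maintained like this:

import redis.clients.jedis.Jedis;
import redis.clients.jedis.Tuple;

public class LeaderboardSketch {
    public static void main(String[] args) {
        Jedis jedis = new Jedis("localhost", 6379);

        // ZINCRBY creates missing members with an initial score of 0.0
        jedis.zincrby("leaderboard", 5.0, "alice");
        jedis.zincrby("leaderboard", 3.0, "bob");
        jedis.zincrby("leaderboard", 7.0, "carol");

        // Top 10 members, highest score first (ZREVRANGE 0 9 WITHSCORES)
        for (Tuple entry : jedis.zrevrangeWithScores("leaderboard", 0, 9)) {
            System.out.println(entry.getElement() + " => " + entry.getScore());
        }
    }
}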


Scripting

Redis 2.6 introduced a built-in Lua scripting engine to eventually take the place of its existing transaction facility. This functionality is accessed through the EVAL command, which has the following form:

EVAL script numkeys key [key ...] arg [arg ...]