Spark for Python Developers
www.it-ebooks.info

Spark for Python Developers

A concise guide to implementing Spark big data analytics for Python developers and building a real-time and insightful trend tracker data-intensive app

Amit Nandi

BIRMINGHAM - MUMBAI

Copyright © 2015 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: December 2015
Production reference: 1171215

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.

ISBN 978-1-78439-969-6
www.packtpub.com

Credits

Author: Amit Nandi
Reviewers: Manuel Ignacio Franco Galeano, Rahul Kavale, Daniel Lemire, Chet Mancini, Laurence Welch
Commissioning Editor: Amarabha Banerjee
Acquisition Editor: Sonali Vernekar
Content Development Editor: Merint Thomas Mathew
Technical Editor: Naveenkumar Jain
Copy Editor: Roshni Banerjee
Project Coordinator: Suzanne Coutinho
Proofreader: Safis Editing
Indexer: Priya Sane
Graphics: Kirk D'Penha
Production Coordinator: Shantanu N. Zagade
Cover Work: Shantanu N. Zagade

About the Author

Amit Nandi studied physics at the Free University of Brussels in Belgium, where he did his research on computer generated holograms. Computer generated holograms are the key components of an optical computer, which is powered by photons running at the speed of light. He then worked with the university Cray supercomputer, sending batch jobs of programs written in Fortran. This gave him a taste for computing, which kept growing. He has worked extensively on large business reengineering initiatives, using SAP as the main enabler. For the last 15 years, he has focused on start-ups in the data space, pioneering new areas of the information technology landscape. He is currently focusing on large-scale data-intensive applications as an enterprise architect, data engineer, and software developer.

He understands and speaks seven human languages. Although Python is his computer language of choice, he aims to be able to write fluently in seven computer languages too.

Acknowledgment

I want to express my profound gratitude to my parents for their unconditional love and strong support in all my endeavors.

This book arose from an initial discussion with Richard Gall, an acquisition editor at Packt Publishing. Without this initial discussion, this book would never have happened. So, I am grateful to him. The follow-up discussions and the contractual terms were agreed with Rebecca Youe. I would like to thank her for her support. I would also like to thank Merint Mathew, a content editor who helped me bring this book to the finish line. I am thankful to Merint for his subtle persistence and tactful support during the write-ups and revisions of this book.

We are standing on the shoulders of giants. I want to acknowledge some of the giants who helped me shape my thinking.
I want to recognize the beauty, elegance, and power of Python as envisioned by Guido van Rossum. My respectful gratitude goes to Matei Zaharia and the team at Berkeley AMP Lab and Databricks for developing a new approach to computing with Spark and Mesos. Travis Oliphant, Peter Wang, and the team at Continuum.io are doing a tremendous job of keeping Python relevant in a fast-changing computing landscape. Thank you all.

About the Reviewers

Manuel Ignacio Franco Galeano is a software developer from Colombia. He holds a computer science degree from the University of Quindío. At the time of publication of this book, he was studying for his MSc in computer science at University College Dublin, Ireland. He has a wide range of interests that include distributed systems, machine learning, and microservices. He is looking for a way to apply machine learning techniques to audio data in order to help people learn more about music.

Rahul Kavale works as a software developer at TinyOwl Ltd. He is interested in multiple technologies ranging from building web applications to solving big data problems. He has worked in multiple languages, including Scala, Ruby, and Java, and has worked on Apache Spark, Apache Storm, Apache Kafka, Hadoop, and Hive. He enjoys writing Scala. Functional programming and distributed computing are his areas of interest. He has been using Spark since its early stage for varying use cases. He has also helped with the review for the Pragmatic Scala book.

Daniel Lemire has a BSc and MSc in mathematics from the University of Toronto and a PhD in engineering mathematics from the Ecole Polytechnique and the Université de Montréal. He is a professor of computer science at the Université du Québec. He has also been a research officer at the National Research Council of Canada and an entrepreneur. He has written over 45 peer-reviewed publications, including more than 25 journal articles.
He has held competitive research grants for the last 15 years. He has been an expert on several committees with funding agencies (NSERC and FQRNT). He has served as a program committee member on leading computer science conferences (for example, ACM CIKM, ACM WSDM, ACM SIGIR, and ACM RecSys). His open source software has been used by major corporations such as Google and Facebook. His research interests include databases, information retrieval, and high-performance programming. He blogs regularly on computer science at http://lemire.me/blog/.

Chet Mancini is a data engineer at Intent Media, Inc. in New York, where he works with the data science team to store and process terabytes of web travel data to build predictive models of shopper behavior. He enjoys functional programming, immutable data structures, and machine learning. He writes and speaks on topics surrounding data engineering and information architecture. He is a contributor to Apache Spark and other libraries in the Spark ecosystem. Chet has a master's degree in computer science from Cornell University.

www.PacktPub.com

Support files, eBooks, discount offers, and more

For support files and downloads related to your book, please visit www.PacktPub.com.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.

https://www2.packtpub.com/books/subscription/packtlib

Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.

Why subscribe?
• Fully searchable across every book published by Packt
• Copy and paste, print, and bookmark content
• On demand and accessible via a web browser

Free access for Packt account holders

If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view 9 entirely free books. Simply use your login credentials for immediate access.

Table of Contents

Preface

Chapter 1: Setting Up a Spark Virtual Environment
  Understanding the architecture of data-intensive applications
  Infrastructure layer
  Persistence layer
  Integration layer
  Analytics layer
  Engagement layer
  Understanding Spark
  Spark libraries
  PySpark in action
  The Resilient Distributed Dataset
  Understanding Anaconda
  Setting up the Spark powered environment
  Setting up an Oracle VirtualBox with Ubuntu
  Installing Anaconda with Python 2.7
  Installing Java 8
  Installing Spark
  Enabling IPython Notebook
  Building our first app with PySpark
  Virtualizing the environment with Vagrant
  Moving to the cloud
  Deploying apps in Amazon Web Services
  Virtualizing the environment with Docker
  Summary

Chapter 2: Building Batch and Streaming Apps with Spark
  Architecting data-intensive apps
  Processing data at rest
  Processing data in motion
  Exploring data interactively
  Connecting to social networks
  Getting Twitter data
  Getting GitHub data
  Getting Meetup data
  Analyzing the data
  Discovering the anatomy of tweets
  Exploring the GitHub world
  Understanding the community through Meetup
  Previewing our app
  Summary

Chapter 3: Juggling Data with Spark
  Revisiting the data-intensive app architecture
  Serializing and deserializing data
  Harvesting and storing data
  Persisting data in CSV
  Persisting data in JSON
  Setting up MongoDB
  Installing