Apache Hadoop & Spark – What Is It ?

Total Page:16

File Type:pdf, Size:1020Kb

Apache Hadoop & Spark – What Is It ? Hadoop { Spark Overview J. Allemandou Generalities Hadoop Spark Apache Hadoop & Spark { What is it ? Demo Joseph Allemandou JoalTech 17 / 05 / 2018 1 / 23 Hello! Hadoop { Spark Overview Joseph Allemandou J. Allemandou Generalities Hadoop Spark Demo 2 / 23 Plan Hadoop { Spark Overview J. Allemandou Generalities Hadoop Spark 1 Generalities on High Performance Computing (HPC) Demo 2 Apache Hadoop and Spark { A Glimpse 3 Demonstration 3 / 23 More computation power: Scale up vs. scale out Hadoop { Spark Overview J. Allemandou Scale Up Scale out Generalities Hadoop Spark Demo 4 / 23 More computation power: Scale up vs. scale out Hadoop { Spark Overview Scale out J. Allemandou Scale Up Generalities Hadoop Spark Demo 4 / 23 Parallel computing Hadoop { Spark Overview Things to consider when doing parallel computing: J. Allemandou Partitioning (tasks, data) Generalities Hadoop Spark Communications Demo Synchronization Data dependencies Load balancing Granularity IO Livermore Computing Center - Tutorial 5 / 23 Looking back - Since 1950 Hadoop { Spark Overview J. Allemandou Generalities Hadoop Spark Demo Figure: Google Ngram Viewer 6 / 23 Looking back - Since 1950 Hadoop { Spark Overview J. Allemandou Generalities Hadoop Spark Demo Figure: Google Ngram Viewer 6 / 23 Looking back - Since 1950 Hadoop { Spark Overview J. Allemandou Generalities Hadoop Spark Demo Figure: Google Ngram Viewer 6 / 23 Looking back - Recent times Hadoop { Spark Overview J. Allemandou Generalities Hadoop Spark Demo Figure: Google Trends 7 / 23 Same problem, different tools Hadoop { Spark Overview J. Allemandou Generalities Supercomputer Big Data Hadoop Spark Demo Dedicated hardware Commodity hardware Message Passing Interface 8 / 23 MPI Hadoop { Spark Overview J. Allemandou Generalities C / C++ / Fortran / Python Hadoop Spark Demo Low-level API - Send / receive messages a lot to do manually split the data assign tasks to workers handle synchronisation handle errors 9 / 23 Hadoop + Spark Hadoop { Spark Overview J. Allemandou Generalities Java / Scala / Python / R Hadoop Spark Demo High-level API - dataflows Less performant than MPI (see this scientific paper ) Easier to code than MPI (a lot!) High-availability oriented 10 / 23 Plan Hadoop { Spark Overview J. Allemandou Generalities Hadoop Spark 1 Generalities on High Performance Computing (HPC) Demo 2 Apache Hadoop and Spark { A Glimpse 3 Demonstration 11 / 23 Distributed storage and computation platform (Java, open source) HDFS { Hadoop Distributed File System YARN { Yet Another Resource Negotiator Built for scalability Up to thousands of nodes (see this tweet ) Can be easily extended (heterogeneous hardware) Is resilient to errors (to some extent) Working with multiple computation engines Hadoop MapReduce (the good old one) Apache Spark (the new kid on the block) Apache Hadoop (v2) in (almost) 3 points Hadoop { Spark Overview J. Allemandou Generalities Hadoop Spark Demo 12 / 23 Built for scalability Up to thousands of nodes (see this tweet ) Can be easily extended (heterogeneous hardware) Is resilient to errors (to some extent) Working with multiple computation engines Hadoop MapReduce (the good old one) Apache Spark (the new kid on the block) Apache Hadoop (v2) in (almost) 3 points Hadoop { Spark Overview J. Allemandou Distributed storage and computation platform (Java, open source) Generalities HDFS { Hadoop Distributed File System Hadoop Spark YARN { Yet Another Resource Negotiator Demo 12 / 23 Working with multiple computation engines Hadoop MapReduce (the good old one) Apache Spark (the new kid on the block) Apache Hadoop (v2) in (almost) 3 points Hadoop { Spark Overview J. Allemandou Distributed storage and computation platform (Java, open source) Generalities HDFS { Hadoop Distributed File System Hadoop Spark YARN { Yet Another Resource Negotiator Demo Built for scalability Up to thousands of nodes (see this tweet ) Can be easily extended (heterogeneous hardware) Is resilient to errors (to some extent) 12 / 23 Apache Hadoop (v2) in (almost) 3 points Hadoop { Spark Overview J. Allemandou Distributed storage and computation platform (Java, open source) Generalities HDFS { Hadoop Distributed File System Hadoop Spark YARN { Yet Another Resource Negotiator Demo Built for scalability Up to thousands of nodes (see this tweet ) Can be easily extended (heterogeneous hardware) Is resilient to errors (to some extent) Working with multiple computation engines Hadoop MapReduce (the good old one) Apache Spark (the new kid on the block) 12 / 23 Apache Hadoop - A big ecosysem Hadoop { Spark Overview J. Allemandou Generalities Hadoop Spark Demo 13 / 23 HDFS Files are split in blocks { Partitioning ! Blocks are replicated { Failure tolerance ! Master - Slave architecture (NameNode & DataNodes) YARN Execution containers with defined resources Scheduler manages resource sharing Master - Slave architecture (ResourceManager & NodeManagers) Global Execution engines exploit data locality Things to know about Apache Hadoop Hadoop { Spark Overview J. Allemandou Generalities Hadoop Spark Demo 14 / 23 YARN Execution containers with defined resources Scheduler manages resource sharing Master - Slave architecture (ResourceManager & NodeManagers) Global Execution engines exploit data locality Things to know about Apache Hadoop Hadoop { Spark Overview J. Allemandou HDFS Generalities Files are split in blocks { Partitioning ! Blocks are replicated { Failure tolerance ! Hadoop Spark Master - Slave architecture (NameNode & DataNodes) Demo 14 / 23 Global Execution engines exploit data locality Things to know about Apache Hadoop Hadoop { Spark Overview J. Allemandou HDFS Generalities Files are split in blocks { Partitioning ! Blocks are replicated { Failure tolerance ! Hadoop Spark Master - Slave architecture (NameNode & DataNodes) Demo YARN Execution containers with defined resources Scheduler manages resource sharing Master - Slave architecture (ResourceManager & NodeManagers) 14 / 23 Things to know about Apache Hadoop Hadoop { Spark Overview J. Allemandou HDFS Generalities Files are split in blocks { Partitioning ! Blocks are replicated { Failure tolerance ! Hadoop Spark Master - Slave architecture (NameNode & DataNodes) Demo YARN Execution containers with defined resources Scheduler manages resource sharing Master - Slave architecture (ResourceManager & NodeManagers) Global Execution engines exploit data locality 14 / 23 Apache Hadoop - The Frame (HDFS + YARN) Hadoop { Spark Overview J. Allemandou Generalities Hadoop Spark Demo 15 / 23 Apache Hadoop - The Frame (HDFS + YARN) Hadoop { Spark Overview J. Allemandou Generalities Hadoop Spark Demo 15 / 23 Apache Hadoop - The Frame (HDFS + YARN) Hadoop { Spark Overview J. Allemandou Generalities Hadoop Spark Demo 15 / 23 Apache Hadoop - MapReduce Hadoop { Spark Overview J. Allemandou Generalities Hadoop Spark Demo 16 / 23 Apache Hadoop - MapReduce Hadoop { Spark Overview J. Allemandou Generalities Hadoop Spark Demo 16 / 23 Apache Spark Hadoop { Spark Overview J. Allemandou Generalities Hadoop Spark Demo 17 / 23 Apache Spark Hadoop { Spark Overview J. Allemandou Generalities Hadoop Spark Demo 17 / 23 Programming data-flows instead of single map-reduce steps A lot less code for complex flows Data lineage allows for error recovery Decoupling of tasks and containers { Spark executors run multiple tasks Less executor management overhead Executors can reuse RAM - Caching! Why is Spark preferred to MapReduce? Hadoop { Spark Overview J. Allemandou Generalities Hadoop Spark Demo 18 / 23 Decoupling of tasks and containers { Spark executors run multiple tasks Less executor management overhead Executors can reuse RAM - Caching! Why is Spark preferred to MapReduce? Hadoop { Spark Overview J. Allemandou Programming data-flows instead of single map-reduce Generalities steps Hadoop Spark A lot less code for complex flows Demo Data lineage allows for error recovery 18 / 23 Why is Spark preferred to MapReduce? Hadoop { Spark Overview J. Allemandou Programming data-flows instead of single map-reduce Generalities steps Hadoop Spark A lot less code for complex flows Demo Data lineage allows for error recovery Decoupling of tasks and containers { Spark executors run multiple tasks Less executor management overhead Executors can reuse RAM - Caching! 18 / 23 Spark - A big ecosysem as well Hadoop { Spark Overview J. Allemandou Generalities Hadoop Spark Demo 19 / 23 Spark - An example of (very) complex flow Hadoop { Spark Overview J. Allemandou Generalities Hadoop Spark Demo Mediawiki History Reconstruction Job 20 / 23 Plan Hadoop { Spark Overview J. Allemandou Generalities Hadoop Spark 1 Generalities on High Performance Computing (HPC) Demo 2 Apache Hadoop and Spark { A Glimpse 3 Demonstration 21 / 23 Scala Spark / Toree / Jupyter notebook Hadoop { Spark Overview J. Allemandou Generalities Hadoop Spark Demo Notebook with Spark on Docker 22 / 23 Thank you! Hadoop { Spark Overview J. Allemandou Generalities Hadoop Spark Demo 23 / 23.
Recommended publications
  • Corpora: Google Ngram Viewer and the Corpus of Historical American English
    EuroAmerican Journal of Applied Linguistics and Languages E JournALL Volume 1, Issue 1, November 2014, pages 48 68 ISSN 2376 905X DOI - - www.e journall.org- http://dx.doi.org/10.21283/2376905X.1.4 - Exploring mega-corpora: Google Ngram Viewer and the Corpus of Historical American English ERIC FRIGINALa1, MARSHA WALKERb, JANET BETH RANDALLc aDepartment of Applied Linguistics and ESL, Georgia State University bLanguage Institute, Georgia Institute of Technology cAmerican Language Institute, New York University, Tokyo Received 10 December 2013; received in revised form 17 May 2014; accepted 8 August 2014 ABSTRACT EN The creation of internet-based mega-corpora such as the Corpus of Contemporary American English (COCA), the Corpus of Historical American English (COHA) (Davies, 2011a) and the Google Ngram Viewer (Cohen, 2010) signals a new phase in corpus-based research that provides both novice and expert researchers immediate access to a variety of online texts and time-coded data. This paper explores the applications of these corpora in the analysis of academic word lists, in particular, Coxhead’s (2000) Academic Word List (AWL). Coxhead (2011) has called for further research on the AWL with larger corpora, noting that learners’ use of academic vocabulary needs to address for the AWL to be useful in various contexts. Results show that words on the AWL are declining in overall frequency from 1990 to the present. Implications about the AWL and future directions in corpus-based research utilizing mega-corpora are discussed. Keywords: GOOGLE N-GRAM VIEWER, CORPUS OF HISTORICAL AMERICAN ENGLISH, MEGA-CORPORA, TREND STUDIES. ES La creación de megacorpus basados en Internet, tales como el Corpus of Contemporary American English (COCA), el Corpus of Historical American English (COHA) (Davies, 2011a) y el Visor de Ngramas de Google (Cohen, 2010), anuncian una nueva fase en la investigación basada en corpus, pues proporcionan, tanto a investigadores noveles como a expertos, un acceso inmediato a una gran diversidad de textos online y datos codificados con time-code.
    [Show full text]
  • BEYOND JEWISH IDENTITY Rethinking Concepts and Imagining Alternatives
    This book is subject to a CC-BY-NC license. To view a copy of this license, visit https://creativecommons.org/licenses/by-nc/4.0/ BEYOND JEWISH IDENTITY Rethinking Concepts and Imagining Alternatives This book is subject to a CC-BY-NC license. To view a copy of this license, visit https://creativecommons.org/licenses/by-nc/4.0/ This book is subject to a CC-BY-NC license. To view a copy of this license, visit https://creativecommons.org/licenses/by-nc/4.0/ BEYOND JEWISH IDENTITY rethinking concepts and imagining alternatives Edited by JON A. LEVISOHN and ARI Y. KELMAN BOSTON 2019 This book is subject to a CC-BY-NC license. To view a copy of this license, visit https://creativecommons.org/licenses/by-nc/4.0/ Library of Congress Control Number:2019943604 The research for this book and its publication were made possible by the generous support of the Jack, Joseph and Morton Mandel Center for Studies in Jewish Education, a partnership between Brandeis University and the Jack, Joseph and Morton Mandel Foundation of Cleveland, Ohio. © Academic Studies Press, 2019 ISBN 978-1-644691-16-8 (Hardcover) ISBN 978-1-644691-29-8 (Paperback) ISBN 978-1-644691-17-5 (Open Access PDF) Book design by Kryon Publishing Services (P) Ltd. www.kryonpublishing.com Cover design by Ivan Grave Published by Academic Studies Press 1577 Beacon Street Brookline, MA 02446, USA [email protected] www.academicstudiespress.com Effective May 26th 2020, this book is subject to a CC-BY-NC license. To view a copy of this license, visit https://creativecommons.org/licenses/ by-nc/4.0/.
    [Show full text]
  • CS15-319 / 15-619 Cloud Computing
    CS15-319 / 15-619 Cloud Computing Recitation 13 th th November 18 and 20 , 2014 Announcements • Encounter a general bug: – Post on Piazza • Encounter a grading bug: – Post Privately on Piazza • Don’t ask if my answer is correct • Don’t post code on Piazza • Search before posting • Post feedback on OLI Last Week’s Project Reflection • Provision your own Hadoop cluster • Write a MapReduce program to construct inverted lists for the Project Gutenberg data • Run your code from the master instance • Piazza Highlights – Different versions of Hadoop API: Both old and new should be fine as long as your program is consistent Module to Read • UNIT 5: Distributed Programming and Analytics Engines for the Cloud – Module 16: Introduction to Distributed Programming for the Cloud – Module 17: Distributed Analytics Engines for the Cloud: MapReduce – Module 18: Distributed Analytics Engines for the Cloud: Pregel – Module 19: Distributed Analytics Engines for the Cloud: GraphLab Project 4 • MapReduce – Hadoop MapReduce • Input Text Predictor: NGram Generation – NGram Generation • Input Text Predictor: Language Model and User Interface – Language Model Generation Input Text Predictor • Suggest words based on letters already typed n-gram • An n-gram is a phrase with n contiguous words Google-Ngram Viewer • The result seems logical: the singular “is” becomes the dominant verb after the American Civil War. Google-Ngram Viewer • “one nation under God” and “one nation indivisible.” • “under God” was signed into law by President Eisenhower in 1954. How to Construct an Input Text Predictor? 1. Given a language corpus – Project Gutenberg (2.5 GB) – English Language Wikipedia Articles (30 GB) 2.
    [Show full text]
  • UC Riverside UCR Honors Capstones 2016-2017
    UC Riverside UCR Honors Capstones 2016-2017 Title GTWENDS Permalink https://escholarship.org/uc/item/41g0w0q9 Author Asfour, Mark Publication Date 2017-12-08 Data Availability The data associated with this publication are within the manuscript. eScholarship.org Powered by the California Digital Library University of California GTWENDS By Mark Jeffrey Asfour A capstone project submitted for Graduation with University Honors October 20, 2016 University Honors University of California, Riverside APPROVED ______________________________ Dr. Evangelos Christidis Department of Computer Science & Engineering ______________________________ Dr. Richard Cardullo, Howard H Hays Chair and Faculty Director, University Honors Associate Vice Provost, Undergraduate Education Abstract GTWENDS is an online interactive map of the United States of America that displays the locations of trending Twitter tweets, Google Search trends, and Google Hot Trends topics. States on the map are overlaid with a blue color where Twitter trends originate and a red color where Google trends originate with respective opacity levels varying based on the levels of interest from each website. Through the use of web crawling, map-reducing, and utilizing distributed processing, this project allows visitors to have an interactive geographic visual representation of current national social media topic activities. Visitors can use GTWENDS to learn about social media trends by observing where trends originate from, comparing the contrasts and similarities between trends from Twitter and Google, understanding what types of events trigger mass social media sharing, and much more. ii Acknowledgements I have worked on and developed GTWENDS with my classmates, Mehran Ghamaty (University of California, Riverside Spring 2016) and Jacob Xu (University of California, Riverside Spring 2016), during our senior design project class, CS 179G Databases Spring 2016, under the supervision of my professor and mentor, Professor Evangelos Christidis.
    [Show full text]
  • Internet and Data
    Internet and Data Internet and Data Resources and Risks and Power Kenneth W. Regan CSE199, Fall 2017 Internet and Data Outline Week 1 of 2: Data and the Internet What is data exactly? How much is there? How is it growing? Where data resides|in reality and virtuality. The Cloud. The Farm. How data may be accessed. Importance of structure and markup. Structures that help algorithms \crunch" data. Formats and protocols for enabling access to data. Protocols for controlling access and changes to data. SQL: Select. Insert. Update. Delete. Create. Drop. Dangers to privacy. Dangers of crime. (Dis-)Advantages of online data. [Week 1 Activity: Trying some SQL queries.] Internet and Data What Exactly Is \Data"? Several different aspects and definitions: 1 The entire track record of (your) online activity. Note that any \real data" put online was part of online usage. Exception could be burning CD/DVDs and other hard media onto a server, but nowadays dwarfed by uploads. So this is the most inclusive and expansive definition. Certainly what your carrier means by \data"|if you re-upload a file, it counts twice. 2 Structured information for a particular context or purpose. What most people mean by \data." Data repositories often specify the context and form. Structure embodied in formats and access protocols. 3 In-between is what's commonly called \Unstructured Information" Puts the M in Data Mining. Hottest focus of consent, rights, and privacy issues. Internet and Data How Much Data Is There? That is, How Big Is the Internet? Searchable Web Deep Web (I maintain several gigabytes of deep-web textual data.
    [Show full text]
  • Introduction to Text Analysis: a Coursebook
    Table of Contents 1. Preface 1.1 2. Acknowledgements 1.2 3. Introduction 1.3 1. For Instructors 1.3.1 2. For Students 1.3.2 3. Schedule 1.3.3 4. Issues in Digital Text Analysis 1.4 1. Why Read with a Computer? 1.4.1 2. Google NGram Viewer 1.4.2 3. Exercises 1.4.3 5. Close Reading 1.5 1. Close Reading and Sources 1.5.1 2. Prism Part One 1.5.2 3. Exercises 1.5.3 6. Crowdsourcing 1.6 1. Crowdsourcing 1.6.1 2. Prism Part Two 1.6.2 3. Exercises 1.6.3 7. Digital Archives 1.7 1. Text Encoding Initiative 1.7.1 2. NINES and Digital Archives 1.7.2 3. Exercises 1.7.3 8. Data Cleaning 1.8 1. Problems with Data 1.8.1 2. Zotero 1.8.2 3. Exercises 1.8.3 9. Cyborg Readers 1.9 1. How Computers Read Texts 1.9.1 2. Voyant Part One 1.9.2 3. Exercises 1.9.3 10. Reading at Scale 1.10 1. Distant Reading 1.10.1 2. Voyant Part Two 1.10.2 3. Exercises 1.10.3 11. Topic Modeling 1.11 1. Bags of Words 1.11.1 2. Topic Modeling Case Study 1.11.2 3. Exercises 1.11.3 12. Classifiers 1.12 1. Supervised Classifiers 1.12.1 2. Classifying Texts 1.12.2 3. Exercises 1.12.3 13. Sentiment Analysis 1.13 1. Sentiment Analysis 1.13.1 2.
    [Show full text]
  • Google Ngram Viewer Turns Snippets Into Insight Computers Have
    PART I Shredding the Great Books - or - Google Ngram Viewer Turns Snippets into Insight Computers have the ability to store enormous amounts of information. But while it may seem good to have as big a pile of information as possible, as the pile gets bigger, it becomes increasingly difficult to find any particular item. In the old days, helping people find information was the job of phone books, indexes in books, catalogs, card catalogs, and especially librarians. Goals for this Lecture 1. Investigate the history of digitizing text; 2. Understand the goal of Google Books; 3. Describe logistical, legal, and software problems of Google Books; 4. To see some of the problems that arise when old texts are used. Computers Are Reshaping Our World This class looks at ways in which computers have begun to change the way we live. Many recent discoveries and inventions can be attributed to the power of computer hardware (the machines), the development of computer programs (the instructions we give the machines), and more importantly to the ability to think about problems in new ways that make it possible to solve them with computers, something we call computational thinking. Issues in Computational Thinking What problems can computers help us with? What is different about how a computer seeks a solution? Why are computer solutions often not quite the same as what a human would have found? How do computer programmers write out instructions that tell a computer how to solve a problem, what problem to solve, and how to report the answer? How does a computer \think"? How does it receive instructions, store infor- mation in a memory, represent and manipulate the numbers and words and pictures and ideas that we refer to as we think about a problem? Computers are Part of a Complicated World We will see that making a computer carry out a task in a few seconds actually can take years of thinking, research, and experiment.
    [Show full text]
  • Finding the Face of Your Data 
    Finding the Face of Your Data There’s been an explosion in data assets Growth of the “digital universe”1 Data overload in context2 40,000 1 EB = 1 billion gigabytes x70 30,000 20,000 10,000 Exabytes 2009 2020 IDC estimates that “tagged” information accounts for only about 3% The amount of information generated by humanity during the first of the digital universe, with analyzed information at 0.5%. The value day of a baby’s life today is equivalent to 70 times the information of big data technology lies in exploring the “untapped pools.” contained in the Library of Congress. Enterprise expectations are as big as the data Big data spending forecast by component3 50 Compute 31% CAGR Storage 40 Networking 30 Infrastructure Software SQL Database Software 20 NoSQL Database Software Application Software 10 Professional Services Billion Dollars Xaas 2011 2012 2013 2014 2015 2016 2017 According to a Wikibon study, big data spend will shift from infrastructure and middleware to value-add services and software during the next five years. Infrastructure, middleware, and technical services will likely become increasingly commoditized as they mature and common standards are adopted. We also note that this study did not include the costs associated with the business and domain experts’ time – a critical element of actionable insight. Taking advantage of data requires new tools ... Traditional and non-traditional value in data Data diving Pattern finding In taking advantage of new data assets – The other side of that same coin, from internal, external, structured, and seeking and patterning for previously unstructured data – and analytics tools, unasked and unanswerable questions, the most common form of value is realized is less common, but potentially more through exploiting deeper detail for new important to the enterprise.
    [Show full text]
  • Unit 4 Lesson 1
    Unit 4 Lesson 1 What is Big Data? Resources Name(s)_______________________________________________ Period ______ Date ________________ Unit 4 Lesson 01 Activity Guide - Big Data Sleuth Card Directions: Web Sites: ● With a partner, select one of the tools in the list to the right. 1. Web archive http://www.archive.org ​ ● Determine what the tool is showing. 2. Measure of America http://www.measureofamerica.org/maps/ ​ ● Find the source of the data it allows you to explore. 3. Wind Sensor network http://earth.nullschool.net/ ​ ● Complete the table below. 4. Twitter sentiment https://www.csc.ncsu.edu/faculty/healey/tweet_viz/tweet_app/ ​ 5. Alternative Fuel Locator http://www.afdc.energy.gov/locator/stations/ ​ Website Name What is this website potentially useful for? What ​ kinds of problems could the provided information be used to solve? Is the provided visualization useful? Does it provide insight into the data? How does it help you look at a lot of information at once? How could it improve? Where is the data coming from? Check for “About”, “Download”, or “API”. You may also need to do a web search. ● Is the data from one source or many? ● Is it static or live? ● Is the source reputable? Why or why not? ● Add a link to the raw data if you can find one. Do you consider this “big” data? Explain your reasoning. 1 Unit 4 Lesson 2 Finding Trends with Visualizations Resources Unit 4 Lesson 2 Name(s)_______________________________________________ Period ______ Date ___________________ Activity Guide - Exploring Trends What’s a Trend? When you post information to a social network, watch a video online, or simply search for information on a search engine, some of that data is collected, and you reveal what topics are currently on your mind.
    [Show full text]
  • Latency: Certain Operations Have a Much Higher Latency Than Other Operations Due to Network Communication
    Gi2M+v "B; .i MHvbBb rBi? a+H M/ aT`F >2i?2` JBHH2` Distribution Distribution introduces important concerns beyond what we had to worry about when dealing with parallelism in the shared memory case: 111-- Partial failure: crash failures of a subset of the machines involved in a distributed computation . ...,. Latency: certain operations have a much higher latency than other operations due to network communication. Distribution Distribution introduces important concerns beyond what we had to worry about when dealing with parallelism in the shared memory case: 111-- Partial failure: crash failures of a subset of the machines involved in a distributed computation . ...,. Latency: certain operations have a much higher latency than other operations due to network communication. Latency cannot be masked completely; it will be an important aspect that also impacts the programming model. Important Latency Numbers L 1 cache reference 0.5ns Branch mispredict 5ns L2 cache reference 7ns Mutex lock/unlock 25ns Main memory reference l00ns Compress lK bytes with Zippy 3,000ns == 3µs Send 2K bytes over lGbps network 20,000ns == 20µs SSD random read 150,000ns == 150µs Read 1 MB sequentially from 250,000ns == 250µs Roundtrip within same datacenter 500,000ns == 0.5ms Read 1MB sequentially from SSD 1,000,000ns == lms Disk seek 10,000,000ns == l0ms Read 1MB sequentially from disk 20,000,000ns == 20ms Send packet US ---+ Europe ---+ US 150,000,000ns == 150ms Original compilation by Jeff Dean & Peter Norvig, w/ contributions by Joe Hellerstein & Erik Meijer
    [Show full text]
  • Unemployment Rate Forecasting Using Google Trends Bachelor Thesis in Econometrics & Operations Research
    Unemployment rate forecasting using Google trends Bachelor Thesis in Econometrics & Operations Research erasmus university rotterdam erasmus school of economics Author: Olivier O. Smit, 415283 Supervisor: Jochem Oorschot July 8, 2018 Abstract In order to make sound economic decisions, it is of great importance to be able to predict and interpret macro-economic variables. Researchers are therefore seeking continuously to improve the prediction performance. One of the main economic indicators is the US unemployment rate. In this paper, we empirically analyze whether, and to what extent, Google search data have additional predictive power in forecasting the US unemployment rate. This research consists of two parts. First, we look for and select Google search data with potential predicitive power. Second, we evaluate the performance of level and directional forecasts. Here, we make use of different models, based on both econometric and machine learning techniques. We find that Google trends improve the predictive accuracy in all used forecasting methods. Lastly, we discuss the limitations of our research and possible future research suggestions. 1 1 Introduction Nowadays, search engines are intensely used platforms. They serve as the gates to the Internet and at the same time help users to create order in the immense amount of websites and data available on the Internet. About a decade ago, researchers have realized that these search engines contain an enormous quantity of new data that can be of great additional value for modeling all kinds of processes. Previous researches have already proven this to be true. For example, Constant and Zimmermann (2008) have shown that including Google - the largest search engine in terms of search activity and amount of users - query data can be very useful in measuring economic processes and political activities.
    [Show full text]
  • 2016 Google Food Trends Report
    Food Trends 2016 U.S. Report [email protected] Intro With every query typed into a search bar, we are given a glimpse into user considerations or intentions. By compiling top searches, we are able to render a strong representation of the United States’ population and gain insight into this population’s behavior. In our Google Food Trends Report, we are excited to bring forth the power of data into the hands of the marketers, product developers, restaurateurs, chefs, and foodies. The goal of this report is to share useful data for planning purposes accompanied by curated styles of what we believe can make for impactful trends. We are proud to share this iteration and look forward to hearing back from you. Jon Lorenzini | Senior Analytics Lead, Food & Beverage Lisa Lovallo | Global Insights Manager, Food & Beverage Olivier Zimmer | Trends Data Scientist Yarden Horwitz | Trends Brand Strategist Methodology To compile a list of accurate trends within the food industry, we pulled top volume queries related to the food category and looked at their monthly volume from January 2014 to February 2016. We first removed any seasonal effect, and then measured the year-over-year growth, velocity, and acceleration for each search query. Based on these metrics, we were able to classify the queries into similar trend patterns. We then curated the most significant trends to illustrate interesting shifts in behavior. Query De-seasonalized Trend 2004 2006 2008 2010 2012 2014 Query 2016 Characteristics Part One Part Two Part Three top risers a spotlight on an extensive list of and decliners top trending the top volume themes food searches Trend Categories To identify top trends, we categorized past data into six different clusters based on similar behaviors.
    [Show full text]