MapReduce: A Major Step Backwards - The Database Column
Recommended publications
-
MapReduce and Beyond
MapReduce and Beyond Steve Ko Trivia Quiz: What’s Common? Data-intensive computing with MapReduce! What is MapReduce? • A system for processing large amounts of data • Introduced by Google in 2004 • Inspired by map & reduce in Lisp • Open-source implementation: Hadoop by Yahoo! • Used by many, many companies – A9.com, AOL, Facebook, The New York Times, Last.fm, Baidu.com, Joost, Veoh, etc. Background: Map & Reduce in Lisp • Sum of squares of a list (in Lisp) • (map square '(1 2 3 4)) – Output: (1 4 9 16) [processes each record individually] [Figure: f applied to each of 1, 2, 3, 4, giving 1, 4, 9, 16] Background: Map & Reduce in Lisp • Sum of squares of a list (in Lisp) • (reduce + '(1 4 9 16)) – (+ 16 (+ 9 (+ 4 1) ) ) – Output: 30 [processes set of all records in a batch] [Figure: f folds 4, 9, 16 into the initial value 1, giving 5, 14, and the returned 30] Background: Map & Reduce in Lisp • Map – processes each record individually • Reduce – processes (combines) set of all records in a batch What Google People Have Noticed • Keyword search Map – Find a keyword in each web page individually, and if it is found, return the URL of the web page Reduce – Combine all results (URLs) and return them • Count of the # of occurrences of each word Map – Count the # of occurrences in each web page individually, and return the list of <word, #> Reduce – For each word, sum up (combine) the count • Notice the similarities? What Google People Have Noticed • Lots of storage + compute cycles nearby • Opportunity – Files are distributed already! (GFS) – A machine can process its own web pages (map) [Figure: a row of CPUs] -
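The Lisp sum-of-squares example in the "MapReduce and Beyond" excerpt can be mirrored in Python as a minimal sketch. The names `square`, `numbers`, `squares`, and `total` are illustrative and do not come from the slides; `map` and `functools.reduce` stand in for the Lisp primitives.

```python
from functools import reduce


def square(x):
    """Map step: transform a single record on its own."""
    return x * x


numbers = [1, 2, 3, 4]

# Map: each element is processed individually.
squares = list(map(square, numbers))

# Reduce: all mapped values are combined into one result.
total = reduce(lambda acc, x: acc + x, squares)

print(squares)
print(total)
```

Here `map` touches each record independently, which is why that step parallelizes easily, while `reduce` needs the whole batch of mapped values.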
Uila Supported Apps
Uila Supported Applications and Protocols (updated Oct 2020)
Application/Protocol Name | Full Description
01net.com | 01net website, a French high-tech news site.
050 plus | 050 plus is a Japanese embedded smartphone application dedicated to audio-conferencing.
0zz0.com | 0zz0 is an online solution to store, send and share files.
10050.net | China Railcom group web portal.
10086.cn | This protocol plug-in classifies the http traffic to the host 10086.cn. It also classifies the ssl traffic to the Common Name 10086.cn.
104.com | Web site dedicated to job research.
1111.com.tw | Website dedicated to job research in Taiwan.
114la.com | Chinese web portal operated by YLMF Computer Technology Co.
115.com | Chinese cloud storing system of the 115 website. It is operated by YLMF Computer Technology Co.
118114.cn | Chinese booking and reservation portal.
11st.co.kr | Korean shopping website 11st. It is operated by SK Planet Co.
1337x.org | Bittorrent tracker search engine.
139mail | 139mail is a Chinese webmail powered by China Mobile.
15min.lt | Lithuanian news portal.
163.com | Chinese web portal 163. It is operated by NetEase, a company which pioneered the development of Internet in China.
17173.com | Website distributing Chinese games.
17u.com | Chinese online travel booking website.
20minutes | 20 minutes is a free, daily newspaper available in France, Spain and Switzerland. This plugin classifies websites.
24h.com.vn | Vietnamese news portal.
24ora.com | Aruban news portal.
24sata.hr | Croatian news portal.
24SevenOffice | 24SevenOffice is a web-based Enterprise resource planning (ERP) system.
24ur.com | Slovenian news portal.
2ch.net | Japanese adult videos web site.
2Shared | 2shared is an online space for sharing and storage.
-
State Finalist in Doodle 4 Google Contest Is Pine Grove 4Th Grader
Serving the Community of Orcutt, California • July 22, 2009 • www.OrcuttPioneer.com • Circulation 17,000 + State Finalist in Doodle 4 Google Contest is Pine Grove 4th Grader “At Google we believe in thinking big and dreaming big, and we can’t think of anything more important than encouraging students to do the same.” This simple statement is the philosophy behind the annual Doodle 4 Google contest in which the company invites students from all over the country to reinvent their homepage logo. This year’s theme was entitled “What I Wish For The World”. Pine Grove Elementary School teacher Kelly VanAllen thought this contest was the perfect complement to what she is always trying to instill in her students – that the world is something not […] What she heard was a phone message regarding her achievement. Once the relief at not being in trouble subsided, the excitement set in. “It was shocking!” she says, “And also important. It made me feel known.” When asked why she chose to enter the contest, Madison says, “It’s a chance for children to show their imagination. Last year I wasn’t able to enter and I never thought of myself as a good drawer, but I thought it would be fun. And I didn’t think I’d win!” Mrs. VanAllen is quick to point out Madison’s wonderful creative side and is clearly proud of her students’ global […] Bent Axles Car Show Brings Rolling History to Old Orcutt [Photo caption: Bent Axles members display their rolling works of art.] -
Integrating Crowdsourcing with MapReduce for AI-Hard Problems
Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence CrowdMR: Integrating Crowdsourcing with MapReduce for AI-Hard Problems Jun Chen, Chaokun Wang, and Yiyuan Bai School of Software, Tsinghua University, Beijing 100084, P.R. China Abstract Large-scale distributed computing has made available the resources necessary to solve “AI-hard” problems. As a result, it becomes feasible to automate the processing of such problems, but accuracy is not very high due to the conceptual difficulty of these problems. In this paper, we integrated crowdsourcing with MapReduce to provide a scalable innovative human-machine solution to AI-hard problems, which is called CrowdMR. In CrowdMR, the majority of problem instances are automatically processed by machine while the troublesome instances are redirected to human via crowdsourcing. The results returned from crowdsourcing are validated in the form of CAPTCHA (Completely Automated Public Turing test to Tell Computers and Humans Apart) before adding to the output. […] In this paper, we propose CrowdMR, an extended MapReduce model based on crowdsourcing to handle AI-hard problems effectively. Different from pure crowdsourcing solutions, CrowdMR introduces confidence into MapReduce. For a common AI-hard problem like classification and recognition, any instance of that problem will usually be assigned a confidence value by machine learning algorithms. CrowdMR only distributes the low-confidence instances which are the troublesome cases to human as HITs via crowdsourcing. This strategy dramatically reduces the number of HITs that need to be answered by human. To showcase the usability of CrowdMR, we introduce an example of gender classification using CrowdMR. For the instances whose confidence values are lower than a given -
MapReduce: Simplified Data Processing on
MapReduce: Simplified Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat Google, Inc. Abstract MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine […] given day, etc. Most such computations are conceptually straightforward. However, the input data is usually large and the computations have to be distributed across hundreds or thousands of machines in order to finish in a reasonable amount of time. The issues of how to parallelize the computation, distribute the data, and handle failures conspire to obscure the original simple computation with large amounts of complex code to deal with these issues. As a reaction to this complexity, we designed a new abstraction that allows us to express the simple computations we were trying to perform but hides the messy details of parallelization, fault-tolerance, data distribution and load balancing in a library. Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages. -
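The programming model summarized in the Dean and Ghemawat abstract (a map function that emits intermediate key/value pairs and a reduce function that merges all values sharing a key) can be sketched as a single-process word count. This is an illustration under stated assumptions, not Google's implementation: `map_fn`, `reduce_fn`, `run_mapreduce`, and the sample `documents` are hypothetical names, and the in-memory grouping stands in for the distributed shuffle that a real run-time system performs across machines.

```python
from collections import defaultdict


def map_fn(doc_name, text):
    """Map: emit an intermediate (word, 1) pair for every word in one document."""
    for word in text.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce: merge all intermediate values that share the same key."""
    return word, sum(counts)


def run_mapreduce(documents):
    """Toy driver: group intermediate pairs by key, then reduce each group."""
    groups = defaultdict(list)
    for name, text in documents.items():
        for key, value in map_fn(name, text):
            groups[key].append(value)
    return dict(reduce_fn(key, values) for key, values in groups.items())


# Hypothetical input: file name -> contents.
documents = {
    "a.txt": "the quick fox",
    "b.txt": "the lazy dog the end",
}
word_counts = run_mapreduce(documents)
print(word_counts)
```

The user writes only `map_fn` and `reduce_fn`. Everything inside `run_mapreduce` corresponds to what the abstract assigns to the library: partitioning input, moving intermediate data, and scheduling work.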
Overview of MapReduce and Spark
Overview of MapReduce and Spark Mirek Riedewald This work is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/. Key Learning Goals • How many tasks should be created for a job running on a cluster with w worker machines? • What is the main difference between Hadoop MapReduce and Spark? • For a given problem, how many Map tasks, Map function calls, Reduce tasks, and Reduce function calls does MapReduce create? • How can we tell if a MapReduce program aggregates data from different Map calls before transmitting it to the Reducers? • How can we tell if an aggregation function in Spark aggregates locally on an RDD partition before transmitting it to the next downstream operation? • Why do input and output type have to be the same for a Combiner? • What data does a single Mapper receive when a file is the input to a MapReduce job? And what data does the Mapper receive when the file is added to the distributed file cache? • Does Spark use the equivalent of a shared-memory programming model? Introduction • MapReduce was proposed by Google in a research paper. Hadoop MapReduce implements it as an open-source system. – Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI'04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December 2004 • Spark originated in academia, at UC Berkeley, and was proposed as an improvement of MapReduce. – Matei Zaharia, Mosharaf Chowdhury, Michael J. -
Finding Connected Components in Huge Graphs with MapReduce
CC-MR - Finding Connected Components in Huge Graphs with MapReduce Thomas Seidl, Brigitte Boden, and Sergej Fries Data Management and Data Exploration Group RWTH Aachen University, Germany Abstract. The detection of connected components in graphs is a well-known problem arising in a large number of applications including data mining, analysis of social networks, image analysis and a lot of other related problems. In spite of the existing very efficient serial algorithms, this problem remains a subject of research due to increasing data amounts produced by modern information systems which cannot be handled by single workstations. Only highly parallelized approaches on multi-core servers or computer clusters are able to deal with these large-scale data sets. In this work we present a solution for this problem for distributed memory architectures, and provide an implementation for the well-known MapReduce framework developed by Google. Our algorithm CC-MR significantly outperforms the existing approaches for the MapReduce framework in terms of the number of necessary iterations, communication costs and execution runtime, as we show in our experimental evaluation on synthetic and real-world data. Furthermore, we present a technique for accelerating our implementation for datasets with very heterogeneous component sizes as they often appear in real data sets. 1 Introduction Web and social graphs, chemical compounds, protein and co-author networks, XML databases - graph structures are a very natural way for representing complex data and therefore appear almost everywhere in data processing. Knowledge extraction from these data often relies (at least as a preprocessing step) on the problem of finding connected components within these graphs. -
Large-Scale Graph Mining @ Google NY
Large-scale Graph Mining @ Google NY Vahab Mirrokni Google Research New York, NY DIMACS Workshop Large-scale graph mining Many applications: Friend suggestions, Recommendation systems, Security, Advertising Benefits: Big data available, Rich structured information New challenges: Process data efficiently, Privacy limitations Google NYC Large-scale graph mining Develop a general-purpose library of graph mining tools for XXXB nodes and XT edges via MapReduce+DHT(Flume), Pregel, ASYMP Goals: • Develop scalable tools (Ranking, Pairwise Similarity, Clustering, Balanced Partitioning, Embedding, etc) • Compare different algorithms/frameworks • Help product groups use these tools across Google in a loaded cluster (clients in Search, Ads, YouTube, Maps, Social) • Fundamental Research (Algorithmic Foundations and Hybrid Algorithms/System Research) Outline Three perspectives: • Part 1: Application-inspired Problems • Algorithms for Public/Private Graphs • Part 2: Distributed Optimization for NP-Hard Problems • Distributed algorithms via composable core-sets • Part 3: Joint systems/algorithms research • MapReduce + Distributed HashTable Service Problems Inspired by Applications Part 1: Why do we need scalable graph mining? Stories: • Algorithms for Public/Private Graphs, • How to solve a problem for each node on a public graph + its own private network • with Chierchetti, Epasto, Kumar, Lattanzi, M: KDD’15 • Ego-net clustering • How to use graph structures and improve collaborative filtering • with Epasto, Lattanzi, Sebe, Taei, Verma, Ongoing • Local random walks for conductance -
Multidimensional Modeling and Analysis of Large and Complex Watercourse Data: an OLAP-Based Solution
Multidimensional modeling and analysis of large and complex watercourse data: an OLAP-based solution Kamal Boulil, Florence Le Ber, Sandro Bimonte, Corinne Grac, Flavie Cernesson To cite this version: Kamal Boulil, Florence Le Ber, Sandro Bimonte, Corinne Grac, Flavie Cernesson. Multidimensional modeling and analysis of large and complex watercourse data: an OLAP-based solution. Ecological Informatics, Elsevier, 2014, 24, pp.30. 10.1016/j.ecoinf.2014.07.001. hal-01057105 HAL Id: hal-01057105 https://hal.archives-ouvertes.fr/hal-01057105 Submitted on 20 Nov 2019 HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers. Ecological Informatics Kamal Boulil (a), Florence Le Ber (a), Sandro Bimonte (b), Corinne Grac (c), Flavie Cernesson (d) (a) Laboratoire ICube, Université de Strasbourg, ENGEES, CNRS, 300 bd Sébastien Brant, F67412 Illkirch, France (b) Equipe Copain, UR TSCF, Irstea, Centre de Clermont-Ferrand, 24 Avenue des Landais, F63170 Aubière, France (c) Laboratoire LIVE, Université de Strasbourg/ENGEES, CNRS, rue de l'Argonne, F67000 Strasbourg, France (d) AgroParisTech, TETIS, 500 rue Jean François Breton, F34090 Montpellier, France 1. -
MapReduce Indexing Strategies: Studying Scalability and Efficiency, Richard McCreadie, Craig Macdonald, Iadh Ounis
Information Processing and Management xxx (2011) xxx–xxx MapReduce indexing strategies: Studying scalability and efficiency Richard McCreadie, Craig Macdonald, Iadh Ounis Department of Computing Science, University of Glasgow, Glasgow G12 8QQ, United Kingdom Article history: Received 11 February 2010; Received in revised form 17 August 2010; Accepted 19 December 2010; Available online xxxx Keywords: MapReduce, Indexing, Efficiency, Scalability, Hadoop Abstract: In Information Retrieval (IR), the efficient indexing of terabyte-scale and larger corpora is still a difficult problem. MapReduce has been proposed as a framework for distributing data-intensive operations across multiple processing machines. In this work, we provide a detailed analysis of four MapReduce indexing strategies of varying complexity. Moreover, we evaluate these indexing strategies by implementing them in an existing IR framework, and performing experiments using the Hadoop MapReduce implementation, in combination with several large standard TREC test corpora. In particular, we examine the efficiency of the indexing strategies, and for the most efficient strategy, we examine how it scales with respect to corpus size, and processing power. Our results attest to both the importance of minimising data transfer between machines for IO intensive tasks like indexing, and the suitability of the per-posting list MapReduce indexing strategy, in particular for indexing at a terabyte-scale. Hence, we conclude that MapReduce is a suitable framework for the deployment of large-scale indexing. © 2010 Elsevier Ltd. All rights reserved. 1. Introduction The Web is the largest known document repository, and poses a major challenge for Information Retrieval (IR) systems, such as those used by Web search engines or Web IR researchers. -
Applying OLAP Pre-Aggregation Techniques to Speed up Aggregate Query Processing in Array Databases by Angélica García Gutiérr
Applying OLAP Pre-Aggregation Techniques to Speed Up Aggregate Query Processing in Array Databases by Angélica García Gutiérrez A thesis submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science Approved, Thesis Committee: Prof. Dr. Peter Baumann Prof. Dr. Vikram Unnithan Prof. Dr. Inés Fernando Vega López Date of Defense: November 12, 2010 School of Engineering and Science In memory of my grandmother, Naty. Acknowledgments I would like to express my sincere gratitude to my thesis advisor, Prof. Dr. Peter Baumann for his excellent guidance throughout the course of this dissertation. With his tremendous passion for science and his great efforts to explain things clearly and simply, he made this research to be one of the richest experiences of my life. He always suggested new ideas, and guided my research through many pitfalls. Furthermore, I learned from him to be kind and cooperative. Thank you, for every single meeting, for every single discussion that you always managed to be thought-provoking, for your continue encouragement, for believing in that I could bring this project to success. I am also grateful to Prof. Dr. Inés Fernando Vega López for his valuable suggestions. He not only provided me with technical advice but also gave me some important hints on scientific writing that I applied on this dissertation. My sincere gratitude also to Prof. Dr. Vikram Unnithan. Despite being one of Jacobs University’s most popular and busiest professors due to his genuine engagement with student life beyond academics, Prof. Unnithan took interest in this work and provided me unconditional support. -
Annual Report for 2005
In October 2005, Google piloted the Doodle 4 Google competition to help celebrate the opening of our new Googleplex office in London. Students from local London schools competed, and eleven year-old Lisa Wainaina created the winning design. Lisa’s doodle was hosted on the Google UK homepage for 24 hours, seen by millions of people – including her very proud parents, classmates, and teachers. The back cover of this Annual Report displays the ten Doodle 4 Google finalists’ designs. SITTING HERE TODAY, I cannot believe that a year has passed since Sergey last wrote to you. Our pace of change and growth has been remarkable. All of us at Google feel fortunate to be part of a phenomenon that continues to rapidly expand throughout the world. We work hard to use this amazing expansion and attention to do good and expand our business as best we can. We remain an unconventional company. We are dedicated to serving our users with the best possible experience. And launching products early, involving users with “Labs” or “beta” versions, keeps us efficient at innovating. We manage Google with a long-term focus. We’re convinced that this is the best way to run our business. We’ve been consistent in this approach. We devote extraordinary resources to finding the smartest, most creative people we can and offering them the tools they need to change the world. Googlers know they are expected to invest time and energy on risky projects that create new opportunities to serve users and build new markets. Our mission remains central to our culture.