Data-Intensive Text Processing with MapReduce

Synthesis Lectures on Human Language Technologies

Editor: Graeme Hirst, University of Toronto

Synthesis Lectures on Human Language Technologies is edited by Graeme Hirst of the University of Toronto. The series consists of 50- to 150-page monographs on topics relating to natural language processing, computational linguistics, information retrieval, and spoken language understanding. Emphasis is on important new techniques, on new applications, and on topics that combine two or more HLT subfields.

Data-Intensive Text Processing with MapReduce. Jimmy Lin and Chris Dyer. 2010
Semantic Role Labeling. Martha Palmer, Daniel Gildea, and Nianwen Xue. 2010
Spoken Dialogue Systems. Kristiina Jokinen and Michael McTear. 2009
Introduction to Chinese Natural Language Processing. Kam-Fai Wong, Wenjie Li, Ruifeng Xu, and Zheng-sheng Zhang. 2009
Introduction to Linguistic Annotation and Text Analytics. Graham Wilcock. 2009
Dependency Parsing. Sandra Kübler, Ryan McDonald, and Joakim Nivre. 2009
Statistical Language Models for Information Retrieval. ChengXiang Zhai. 2008

Copyright © 2010 by Morgan & Claypool. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means—electronic, mechanical, photocopy, recording, or any other except for brief quotations in printed reviews, without the prior permission of the publisher.
Data-Intensive Text Processing with MapReduce
Jimmy Lin and Chris Dyer, University of Maryland
www.morganclaypool.com

ISBN: 9781608453429 paperback
ISBN: 9781608453436 ebook
DOI 10.2200/S00274ED1V01Y201006HLT007

A Publication in the Morgan & Claypool Publishers series
SYNTHESIS LECTURES ON HUMAN LANGUAGE TECHNOLOGIES
Lecture #7
Series Editor: Graeme Hirst, University of Toronto
Series ISSN: Print 1947-4040, Electronic 1947-4059

ABSTRACT
Our world is being revolutionized by data-driven methods: access to large amounts of data has generated new insights and opened exciting new opportunities in commerce, science, and computing applications. Processing the enormous quantities of data necessary for these advances requires large clusters, making distributed computing paradigms more crucial than ever. MapReduce is a programming model for expressing distributed computations on massive datasets and an execution framework for large-scale data processing on clusters of commodity servers. The programming model provides an easy-to-understand abstraction for designing scalable algorithms, while the execution framework transparently handles many system-level details, ranging from scheduling to synchronization to fault tolerance. This book focuses on MapReduce algorithm design, with an emphasis on text processing algorithms common in natural language processing, information retrieval, and machine learning. We introduce the notion of MapReduce design patterns, which represent general reusable solutions to commonly occurring problems across a variety of problem domains. This book not only intends to help the reader “think in MapReduce”, but also discusses limitations of the programming model.
KEYWORDS
Hadoop, parallel and distributed programming, algorithm design, text processing, natural language processing, information retrieval, machine learning

Contents

Acknowledgments
1 Introduction
  1.1 Computing in the Clouds
  1.2 Big Ideas
  1.3 Why Is This Different?
  1.4 What This Book Is Not
2 MapReduce Basics
  2.1 Functional Programming Roots
  2.2 Mappers and Reducers
  2.3 The Execution Framework
  2.4 Partitioners and Combiners
  2.5 The Distributed File System
  2.6 Hadoop Cluster Architecture
  2.7 Summary
3 MapReduce Algorithm Design
  3.1 Local Aggregation
    3.1.1 Combiners and In-Mapper Combining
    3.1.2 Algorithmic Correctness with Local Aggregation
  3.2 Pairs and Stripes
  3.3 Computing Relative Frequencies
  3.4 Secondary Sorting
  3.5 Relational Joins
    3.5.1 Reduce-Side Join
    3.5.2 Map-Side Join
    3.5.3 Memory-Backed Join
  3.6 Summary
4 Inverted Indexing for Text Retrieval
  4.1 Web Crawling
  4.2 Inverted Indexes
  4.3 Inverted Indexing: Baseline Implementation
  4.4 Inverted Indexing: Revised Implementation
  4.5 Index Compression
    4.5.1 Byte-Aligned and Word-Aligned Codes
    4.5.2 Bit-Aligned Codes
    4.5.3 Postings Compression
  4.6 What About Retrieval?
  4.7 Summary and Additional Readings
5 Graph Algorithms
  5.1 Graph Representations
  5.2 Parallel Breadth-First Search
  5.3 PageRank
  5.4 Issues with Graph Processing
  5.5 Summary and Additional Readings
6 EM Algorithms for Text Processing
  6.1 Expectation Maximization
    6.1.1 Maximum Likelihood Estimation
    6.1.2 A Latent Variable Marble Game
    6.1.3 MLE with Latent Variables
    6.1.4 Expectation Maximization
    6.1.5 An EM Example
  6.2 Hidden Markov Models
    6.2.1 Three Questions for Hidden Markov Models
    6.2.2 The Forward Algorithm
    6.2.3 The Viterbi Algorithm
    6.2.4 Parameter Estimation for HMMs
    6.2.5 Forward-Backward Training: Summary
  6.3 EM in MapReduce
    6.3.1 HMM Training in MapReduce
  6.4 Case Study: Word Alignment for Statistical Machine Translation
    6.4.1 Statistical Phrase-Based Translation
    6.4.2 Brief Digression: Language Modeling with MapReduce
    6.4.3 Word Alignment
    6.4.4 Experiments
  6.5 EM-Like Algorithms
    6.5.1 Gradient-Based Optimization and Log-Linear Models
  6.6 Summary and Additional Readings
7 Closing Remarks
  7.1 Limitations of MapReduce
  7.2 Alternative Computing Paradigms
  7.3 MapReduce and Beyond
Bibliography
Authors’ Biographies

Acknowledgments

The first author is grateful to Esther and Kiri for their loving support. He dedicates this book to Joshua and Jacob, the new joys of his life. The second author would like to thank Herb for putting up with his disorderly living habits and Philip for being a very indulgent linguistics advisor. This work was made possible by the Google and IBM Academic Cloud Computing Initiative