Data-Intensive Text Processing with MapReduce
Synthesis Lectures on Human Language Technologies

Editor: Graeme Hirst, University of Toronto

Synthesis Lectures on Human Language Technologies is edited by Graeme Hirst of the University of Toronto. The series consists of 50- to 150-page monographs on topics relating to natural language processing, computational linguistics, information retrieval, and spoken language understanding. Emphasis is on important new techniques, on new applications, and on topics that combine two or more HLT subfields.

Data-Intensive Text Processing with MapReduce
Jimmy Lin and Chris Dyer
2010

Semantic Role Labeling
Martha Palmer, Daniel Gildea, and Nianwen Xue
2010

Spoken Dialogue Systems
Kristiina Jokinen and Michael McTear
2009

Introduction to Chinese Natural Language Processing
Kam-Fai Wong, Wenjie Li, Ruifeng Xu, and Zheng-sheng Zhang
2009

Introduction to Linguistic Annotation and Text Analytics
Graham Wilcock
2009

Dependency Parsing
Sandra Kübler, Ryan McDonald, and Joakim Nivre
2009

Statistical Language Models for Information Retrieval
ChengXiang Zhai
2008

Copyright © 2010 by Morgan & Claypool

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means (electronic, mechanical, photocopy, recording, or any other), except for brief quotations in printed reviews, without the prior permission of the publisher.
Data-Intensive Text Processing with MapReduce
Jimmy Lin and Chris Dyer
www.morganclaypool.com

ISBN: 9781608453429 (paperback)
ISBN: 9781608453436 (ebook)
DOI: 10.2200/S00274ED1V01Y201006HLT007

A publication in the Morgan & Claypool Publishers series
SYNTHESIS LECTURES ON HUMAN LANGUAGE TECHNOLOGIES
Lecture #7
Series Editor: Graeme Hirst, University of Toronto
Series ISSN: 1947-4040 (print), 1947-4059 (electronic)

Data-Intensive Text Processing with MapReduce
Jimmy Lin and Chris Dyer
University of Maryland

SYNTHESIS LECTURES ON HUMAN LANGUAGE TECHNOLOGIES #7

Morgan & Claypool Publishers

ABSTRACT
Our world is being revolutionized by data-driven methods: access to large amounts of data has generated new insights and opened exciting new opportunities in commerce, science, and computing applications. Processing the enormous quantities of data necessary for these advances requires large clusters, making distributed computing paradigms more crucial than ever. MapReduce is a programming model for expressing distributed computations on massive datasets and an execution framework for large-scale data processing on clusters of commodity servers. The programming model provides an easy-to-understand abstraction for designing scalable algorithms, while the execution framework transparently handles many system-level details, ranging from scheduling to synchronization to fault tolerance. This book focuses on MapReduce algorithm design, with an emphasis on text processing algorithms common in natural language processing, information retrieval, and machine learning. We introduce the notion of MapReduce design patterns, which represent general reusable solutions to commonly occurring problems across a variety of problem domains. This book is intended not only to help the reader "think in MapReduce", but also to discuss the limitations of the programming model.
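The division of labor the abstract describes (programmers write mappers and reducers; the framework handles grouping and synchronization) can be illustrated with the canonical word-count example. The sketch below is a minimal, single-process simulation of the model in plain Python, not the book's Hadoop-based presentation; the function names and the in-memory "shuffle" are illustrative assumptions.

```python
from collections import defaultdict

def mapper(line):
    # Mapper: for each input record (a line of text), emit (word, 1) pairs.
    for word in line.split():
        yield (word.lower(), 1)

def reducer(word, counts):
    # Reducer: receives a key and all values emitted for it; here, sums them.
    yield (word, sum(counts))

def map_reduce(lines):
    # A tiny in-process stand-in for the shuffle-and-sort phase that a real
    # execution framework would handle transparently across a cluster.
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)      # "shuffle": group values by key
    results = {}
    for key in sorted(groups):             # "sort": present keys in order
        for k, v in reducer(key, groups[key]):
            results[k] = v
    return results

print(map_reduce(["the quick brown fox", "the lazy dog"]))
# {'brown': 1, 'dog': 1, 'fox': 1, 'lazy': 1, 'quick': 1, 'the': 2}
```

Because mappers operate on records independently and reducers see each key's values in isolation, both stages parallelize naturally, which is the scalability argument the abstract makes.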
KEYWORDS
Hadoop, parallel and distributed programming, algorithm design, text processing, natural language processing, information retrieval, machine learning

CONTENTS

Acknowledgments

1 Introduction
  1.1 Computing in the Clouds
  1.2 Big Ideas
  1.3 Why Is This Different?
  1.4 What This Book Is Not

2 MapReduce Basics
  2.1 Functional Programming Roots
  2.2 Mappers and Reducers
  2.3 The Execution Framework
  2.4 Partitioners and Combiners
  2.5 The Distributed File System
  2.6 Hadoop Cluster Architecture
  2.7 Summary

3 MapReduce Algorithm Design
  3.1 Local Aggregation
    3.1.1 Combiners and In-Mapper Combining
    3.1.2 Algorithmic Correctness with Local Aggregation
  3.2 Pairs and Stripes
  3.3 Computing Relative Frequencies
  3.4 Secondary Sorting
  3.5 Relational Joins
    3.5.1 Reduce-Side Join
    3.5.2 Map-Side Join
    3.5.3 Memory-Backed Join
  3.6 Summary

4 Inverted Indexing for Text Retrieval
  4.1 Web Crawling
  4.2 Inverted Indexes
  4.3 Inverted Indexing: Baseline Implementation
  4.4 Inverted Indexing: Revised Implementation
  4.5 Index Compression
    4.5.1 Byte-Aligned and Word-Aligned Codes
    4.5.2 Bit-Aligned Codes
    4.5.3 Postings Compression
  4.6 What About Retrieval?
  4.7 Summary and Additional Readings

5 Graph Algorithms
  5.1 Graph Representations
  5.2 Parallel Breadth-First Search
  5.3 PageRank
  5.4 Issues with Graph Processing
  5.5 Summary and Additional Readings

6 EM Algorithms for Text Processing
  6.1 Expectation Maximization
    6.1.1 Maximum Likelihood Estimation
    6.1.2 A Latent Variable Marble Game
    6.1.3 MLE with Latent Variables
    6.1.4 Expectation Maximization
    6.1.5 An EM Example
  6.2 Hidden Markov Models
    6.2.1 Three Questions for Hidden Markov Models
    6.2.2 The Forward Algorithm
    6.2.3 The Viterbi Algorithm
    6.2.4 Parameter Estimation for HMMs
    6.2.5 Forward-Backward Training: Summary
  6.3 EM in MapReduce
    6.3.1 HMM Training in MapReduce
  6.4 Case Study: Word Alignment for Statistical Machine Translation
    6.4.1 Statistical Phrase-Based Translation
    6.4.2 Brief Digression: Language Modeling with MapReduce
    6.4.3 Word Alignment
    6.4.4 Experiments
  6.5 EM-Like Algorithms
    6.5.1 Gradient-Based Optimization and Log-Linear Models
  6.6 Summary and Additional Readings

7 Closing Remarks
  7.1 Limitations of MapReduce
  7.2 Alternative Computing Paradigms
  7.3 MapReduce and Beyond

Bibliography

Authors' Biographies

ACKNOWLEDGMENTS
The first author is grateful to Esther and Kiri for their loving support. He dedicates this book to Joshua and Jacob, the new joys of his life. The second author would like to thank Herb for putting up with his disorderly living habits and Philip for being a very indulgent linguistics advisor. This work was made possible by the Google and IBM Academic Cloud Computing Initiative