Discourse Models for Collaboratively Edited Corpora
by Erdong Chen

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Master of Science in Computer Science and Engineering at the

MASSACHUSETTS INSTITUTE OF TECHNOLOGY

June 2008

© Massachusetts Institute of Technology 2008. All rights reserved.

Author: Department of Electrical Engineering and Computer Science, May 23, 2008
Certified by: Regina Barzilay, Associate Professor, Thesis Supervisor
Accepted by: Arthur C. Smith, Chairman, Department Committee on Graduate Students

Abstract

This thesis focuses on computational discourse models for collaboratively edited corpora. Because collaboratively edited corpora grow at an exponential rate and vary widely in style and content, models based on professionally edited texts cannot process the new data effectively. For these methods to succeed, one challenge is to preserve local coherence as well as global consistency. We explore two corpus-based methods for processing collaboratively edited corpora, which effectively model and optimize the consistency of user-generated text. The first method addresses the task of inserting new information into existing texts. In particular, we wish to determine the best location in a text for a given piece of new information. We present an online ranking model which exploits the hierarchical structure of documents, representationally in its features and algorithmically in its learning procedure.
When tested on a corpus of Wikipedia articles, our hierarchically informed model predicts the correct insertion paragraph more accurately than baseline methods. The second method concerns inducing a common structure across multiple articles in similar domains to aid cross-document collaborative editing. A graphical model is designed to induce section topics and to learn topic clusters. Preliminary experiments show that the proposed method is comparable to baseline methods.

Thesis Supervisor: Regina Barzilay
Title: Associate Professor

Acknowledgments

I would like to gratefully acknowledge my supervisor, Professor Regina Barzilay, for her support and guidance. I have been greatly impressed by her enthusiasm for scientific research and her comprehensive knowledge, and I have benefited from her patience and care. Thanks to Professor Michael Collins and Professor Samuel Madden for their helpful inspiration. Some of the data used in this work was collected and processed by Christina Sauper, and I am also grateful for her generous help. Many thanks to Serdar Balci, S.R.K. Branavan, Eugene Charniak, Harr Chen, Michael Collins, Pawan Deshpande, Micha Elsner, Jacob Eisenstein, Dina Katabi, Igor Malioutov, Alvin Raj, Christina Sauper, Benjamin Snyder, Jenny Yuen, and Luke Zettlemoyer. Lastly, I want to give my special thanks to the people in the Natural Language Processing Group and the Spoken Language Systems Group at MIT who helped me with this thesis.

Bibliographic Notes

Portions of this thesis are based on the following paper: "Incremental Text Structuring with Online Hierarchical Ranking", by Erdong Chen, Benjamin Snyder, and Regina Barzilay, which appeared in the Joint Meeting of the Conference on Empirical Methods in Natural Language Processing and the Conference on Computational Natural Language Learning (EMNLP-CoNLL), 2007.

Contents

1 Introduction 15
1.1 Overview 15
1.2 Incremental Text Structuring 19
1.3 Common Structure Induction 21
1.4 Key Contributions 21
1.4.1 Incremental Text Structuring 22
1.4.2 Hierarchically Sensitive Training 22
1.4.3 Common Structure Induction 23
1.4.4 NLP on Collaboratively Edited Corpora 23
1.5 Scope of the Thesis 23

2 Related Work 25
2.1 Incremental Text Structuring 25
2.1.1 Symbolic Concept-to-text Generation 26
2.1.2 Statistical Methods on Sentence Ordering 27
2.1.3 Hierarchical Learning 27
2.2 Common Structure Induction 29
2.2.1 Bayesian Topic Modeling 29
2.2.2 Semantic Wikipedia 31
2.2.3 Summary 31

3 Incremental Text Structuring 33
3.1 Overview 33
3.2 The Algorithm 36
3.2.1 Problem Formulation 36
3.2.2 The Model 37
3.2.3 Training 38
3.3 Features 40
3.3.1 Lexical Features 40
3.3.2 Positional Features 41
3.3.3 Temporal Features 41
3.4 Experimental Set-Up 43
3.4.1 Corpus 43
3.4.2 Evaluation Measures 45
3.4.3 Baselines 45
3.4.4 Human Performance 47
3.5 Results 47
3.5.1 Sentence-level Evaluation 50
3.6 Conclusion 50

4 Common Structure Induction 51
4.1 Overview 51
4.2 Problem Formulation 53
4.3 Model 53
4.4 Cross-document Structure Analysis 53
4.4.1 Generative Process 54
4.4.2 Learning & Inference 56
4.5 Experiments 58
4.6 Data 58
4.6.1 Evaluation Measures 58
4.6.2 Baselines 59
4.6.3 Preliminary Results 60

5 Conclusions and Future Work 61

A Examples of Sentence Insertions on Wikipedia 63

B Examples of Document Structure and Section Titles on Wikipedia 71

List of Figures

1-1 A Wikipedia article about Barack Obama in June 2007. 16
1-2 An example of a Wikipedia insertion. 19
1-3 Screenshot of an insertion between two consecutive revisions of a Wikipedia article on Chef Tony. The inserted sentence is shown in red. 20
3-1 Screenshot of an example insertion between two consecutive revisions of a Wikipedia article on Raymond Chen. The inserted sentence is shown in red. 35
3-2 Training algorithm for the hierarchical ranking model. 38
3-3 An example of a tree with the corresponding model scores. The path surrounded by solid lines leads to the correct node ℓ1. The path surrounded by dotted lines leads to ℓ3, the predicted output based on the current model. 39
3-4 A selected part of the revision history of a Wikipedia article about Barack Obama. 44
3-5 Screenshot of an example insertion with two legitimate insertion points. The inserted sentence is shown in red. 48
4-1 Topic transitions of section titles (mutually clustered section titles). 55
4-2 Training algorithm for the graphical model. 55
4-3 Graphical model. 56
A-1 Screenshot of an insertion between two consecutive revisions of a Wikipedia article on Santiago Calatrava. The inserted sentence is shown in red. 64
A-2 Screenshot of an insertion between two consecutive revisions of a Wikipedia article on Amanda Bynes. The inserted sentence is shown in red. 65
A-3 Screenshot of an insertion between two consecutive revisions of a Wikipedia article on Chen Lu (part 1). The inserted sentence is shown in red. 66
A-4 Screenshot of an insertion between two consecutive revisions of a Wikipedia article on Chen Lu (part 2). The inserted sentence is shown in red. 67
A-5 Screenshot of an insertion between two consecutive revisions of a Wikipedia article on CM Punk (part 1). The inserted sentence is shown in red. 68
A-6 Screenshot of an insertion between two consecutive revisions of a Wikipedia article on CM Punk (part 2). The inserted sentence is shown in red. 69
A-7 Screenshot of an insertion between two consecutive revisions of a Wikipedia article on Chris Chelios. The inserted sentence is shown in red. 70

List of Tables

1.1 42 different Wikipedia section titles similar to "Early life". 22
3.1 List of sample features, given an insertion sentence sen and a paragraph p from a section sec of a document d. 42
3.2 Accuracy of human insertions compared against the gold standard from Wikipedia's update log. T1 is a subset of the data annotated by judges J1 and J2, while T2 is annotated by J3 and J4. 43
3.3 A list of sample insertion sentences from our Wikipedia dataset. 46
3.4 Accuracy of automatic insertion methods compared against the gold standard from Wikipedia's update log. The third column gives tree distance, where a lower score corresponds to better performance. The diacritic * (p < 0.01) indicates that the difference in accuracy between the given model and the Hierarchical model is significant (using a Fisher sign test). 49
4.1 Section titles in sample documents of our corpus. 54
4.2 Seven topic clusters agreed upon between judges. 59
4.3 Performance of various methods and baselines against a gold standard of human-annotated clusters. 60
B.1 Section titles in sample documents of our corpus (part 1). 71
B.2 Section titles in sample documents of our corpus (part 2). 72
B.3 Section titles in sample documents of our corpus (part 3). 73
B.4 Distinct section titles in our corpus (part 1). 74
B.5 Distinct section titles in our corpus (part 2). 75

Chapter 1

Introduction

1.1 Overview

Barack Obama is a Democratic politician from Illinois. He is currently running for the United States Senate, which would be the highest elected office he has held thus far.

Biography

Obama's father is Kenyan; his mother is from Kansas. He himself was born in Hawaii, where his mother and father met at the University of Hawaii. Obama's father left his family early on, and Obama was raised in Hawaii by his mother.

– Wikipedia article about Barack Obama on March 18, 2004

The first Wikipedia article about Barack Obama appeared on March 18, 2004. By June 2007, when he was running for U.S. president, the article had grown to more than 400 sentences over the course of more than five thousand revisions.