Neural Discourse Modeling

A Dissertation

Presented to
The Faculty of the Graduate School of Arts and Sciences
Brandeis University
Computer Science
Nianwen Xue, Advisor

In Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy

by
Attapol Thamrongrattanarit Rutherford

August, 2016

This dissertation, directed and approved by Attapol Thamrongrattanarit Rutherford's committee, has been accepted and approved by the Graduate Faculty of Brandeis University in partial fulfillment of the requirements for the degree of:

DOCTOR OF PHILOSOPHY

Eric Chasalow, Dean
Graduate School of Arts and Sciences

Dissertation Committee:
Nianwen Xue, Chair
James Pustejovsky, Dept. of Computer Science
Benjamin Wellner, Dept. of Computer Science
Vera Demberg, Universität des Saarlandes

© Copyright by

Attapol Thamrongrattanarit Rutherford

2016

Acknowledgments

This dissertation is the most difficult thing I have ever done in my life. As lonely as a dissertation might seem, I managed to get to this point thanks to the help and support from many people. Bert, obviously, for all your advice and support, and also for putting up with me all these years. I know I am not the easiest student to deal with in many ways, but you calmed me down and helped me grow up as an adult and as a researcher. I cannot thank you enough for that. John, for keeping me sane and really being there for me. Time and time again you rescued me from hitting the PhD emotional rock bottom. Thanks for being the awesome person that you are. I would have been a wreck without you. Vera, for a new perspective on the work and the world. My time at Saarland completely changed the way I think about discourse analysis and computational linguistics in general. The opportunity to meet with the brilliant colleagues in Germany and the opportunity to explore Europe from the Saarbrücken base camp are simply priceless. James and Ben, for being on the committee and touching base with me throughout the process. Yaqin, Chuan, and Yuchen, for being supportive lab mates. I appreciate both the moral and intellectual support through these years.

Abstract

Neural Discourse Modeling

A dissertation presented to the Faculty of the Graduate School of Arts and Sciences of Brandeis University, Waltham, Massachusetts

by Attapol Thamrongrattanarit Rutherford

Sentences rarely stand in isolation, and the meaning of a text is always more than the sum of its individual sentences. Sentences form a discourse and require analysis beyond the sentence level. In the analysis of discourse, we must identify the discourse relations that hold between pairs of textual spans. This is a simple task when discourse connectives such as because or therefore are there to provide cues, but the relation must be inferred from the text alone when discourse connectives are omitted, which happens very frequently in English. A discourse parser may be used as a preprocessing step that enhances downstream tasks such as automatic essay grading, machine translation, information extraction, and natural language understanding in general. In this thesis, we propose a neural discourse parser which addresses computational and linguistic issues that arise from implicit discourse relation classification in the Penn Discourse Treebank. We must build a model that captures the inter-argument interaction without making the feature space too sparse. We shift from the traditional discrete feature paradigm to the distributed continuous feature paradigm, also known as neural network modeling. Discrete features used in previous work reach their performance ceiling because of the sparsity problem. They also cannot be extended to other domains or languages due to their dependency on specific language resources and lexicons. Our approaches use language-independent and generalizable features, which reduce the feature set size by one or two orders of magnitude. Furthermore, we deal with the data sparsity problem directly through

linguistically-motivated selection criteria for eligible artificial implicit discourse relations to supplement the manually annotated training set. Learning from these insights, we design the neural network architecture from the ground up to understand each subcomponent in discourse modeling and to find the model that performs the best in the task. We found that high-dimensional word vectors trained on large corpora and compact inter-argument modeling are integral for neural discourse models. Our results suggest that a simple feedforward model with the summation of word vectors as features can be more effective and robust than the recurrent neural network variants. Lastly, we extend these methods to different label sets and to Chinese discourse parsing to show that our discourse parser is robust and language-independent.

Contents

Abstract v

1 Introduction 3 1.1 Motivation ...... 5 1.2 Dissertation Contributions ...... 6 1.3 Robust Light-weight Multilingual Neural Discourse Parser ...... 10

2 Discourse Parsing and Neural Network Modeling in NLP 11 2.1 Discourse Analysis ...... 11 2.2 Modeling Implicit Discourse Relations in the PDTB ...... 18 2.3 Neural Network Modeling ...... 20

3 Brown Cluster Pair and Features 31 3.1 Sense annotation in Penn Discourse Treebank ...... 33 3.2 Experiment setup ...... 34 3.3 Feature Description ...... 35 3.4 Results ...... 39 3.5 Feature analysis ...... 40 3.6 Related work ...... 45 3.7 Conclusions ...... 46

4 Distant Supervision from Discourse Connectives 50 4.1 Discourse Connective Classification and Discourse Relation Extraction . . . 53 4.2 Experiments ...... 60 4.3 Related work ...... 68 4.4 Conclusion and Future Directions ...... 70

5 Neural Discourse Mixture Model 71 5.1 Corpus and Task Formulation ...... 73 5.2 The Neural Discourse Mixture Model Architecture ...... 74 5.3 Experimental Setup ...... 80


5.4 Results and Discussion ...... 83 5.5 Skip-gram Discourse Vector Model ...... 85 5.6 Related work ...... 88 5.7 Conclusion ...... 89

6 Recurrent Neural Network for Discourse Analysis 91 6.1 Related Work ...... 93 6.2 Model Architectures ...... 94 6.3 Corpora and Implementation ...... 98 6.4 Experiment on the Second-level Sense in the PDTB ...... 100 6.5 Extending the results across label sets and languages ...... 104 6.6 Conclusions and future work ...... 107

7 Robust Light-weight Multilingual Neural Parser 108 7.1 Introduction ...... 108 7.2 Model description ...... 110 7.3 Experiments ...... 111 7.4 Error Analysis ...... 113 7.5 Conclusions ...... 115

8 Conclusions and Future Work 116 8.1 Distant Supervision with Neural Network ...... 117 8.2 Neural Network Model as a Cognitive Model ...... 118 8.3 Discourse Analysis in Other Languages and Domains ...... 118 8.4 Concluding Remarks ...... 119

List of Figures

2.1 Example of a structure under Rhetorical Structure Theory ...... 12 2.2 Example of discourse relations in the Penn Discourse Treebank...... 16 2.3 System pipeline for the discourse parser ...... 17 2.4 Fully connected feedforward neural network architecture with one hidden layer. 20 2.5 Skip-gram word embedding ...... 23 2.6 Recurrent neural network architecture ...... 26 2.7 Discourse-driven RNN language model ...... 30

3.1 Bipartite graphs showing important features ...... 48 3.2 Coreferential rates in the PDTB ...... 49

4.1 Omission rates across discourse connective types ...... 61 4.2 Distributions of Jensen-Shannon divergence ...... 63 4.3 Scattergram of all discourse connectives ...... 64 4.4 The efficacy of different discourse connectives classes in distant supervision . 65 4.5 Comparing distant supervision from the PDTB and Gigaword as sources . . 68

5.1 Neural Discourse Mixture Model ...... 74 5.2 Skip-gram architecture ...... 75 5.3 Mixture of Experts model ...... 78 5.4 Histogram of KL divergence values ...... 87

6.1 Feedforward and LSTM discourse parsers ...... 94 6.2 Summation pooling gives the best performance in general. The results are shown for the systems using 100-dimensional word vectors and one hidden layer. 101 6.3 Inter-argument interaction ...... 101 6.4 Comparison between feedforward and LSTM ...... 102 6.5 Comparing the accuracies across Chinese word vectors for feedforward model. 107

7.1 Light-weight robust model architecture ...... 110

List of Tables

2.1 The relation definitions in Rhetorical Structure Theory ...... 13 2.2 Sense hierarchy in the Penn Discourse Treebank 2.0 ...... 15 2.3 Discourse parser performance summary ...... 17

3.1 The distribution of senses of implicit discourse relations is imbalanced. . . . 33 3.2 Brown cluster pair feature performance ...... 37 3.3 Ablation studies of various feature sets ...... 38

4.1 Discourse connective classification ...... 62 4.2 The distribution of senses of implicit discourse relations in the PDTB . . . . 62 4.3 Performance summary of distant supervision from discourse connectives in 4-way classification ...... 66 4.4 Performance summary of distant supervision from discourse connectives in binary classification formulation ...... 67 4.5 Comparison of different discourse connective classes ...... 67 4.6 The sense distribution by connective class...... 69

5.1 Tuned hyperparameters of the models ...... 83 5.2 Performance summary of the Neural Discourse Mixture Model ...... 84 5.3 Clustering analysis of Skip-gram discourse vector model ...... 90

6.1 The distribution of the level 2 sense labels in the Penn Discourse Treebank . 98 6.2 Performance comparison across different models for second-level senses. . . . 100 6.3 Results for all experimental configuration for 11-way classification ...... 102 6.4 Chinese model summary ...... 106

7.1 Sense-wise F1 scores for English non-explicit relations ...... 113 7.2 Sense-wise F1 scores for Chinese non-explicit discourse relation...... 114 7.3 Confusion pairs made by the model ...... 114

Chapter 1

Introduction

Discourse analysis is generally thought of as the analysis of language beyond the sentence level. Discourse-level phenomena affect the interpretation of the text in many ways, as sentences rarely stand in isolation. When sentences form a discourse, the semantic interpretation of the discourse is more than the sum of the individual sentences. The ordering of sentences within the discourse affects its interpretation. Alteration of the sentence ordering may render the discourse incoherent or result in a totally different interpretation. Discourse-level phenomena and text coherence have repercussions in many Natural Language Processing tasks as the research community shifts its focus away from sentence-level analysis. Automatic discourse analysis is crucial for natural language understanding and has been shown to be useful for machine translation, information extraction, automatic essay grading, and readability assessment. A discourse parser that delineates how the sentences are connected in a piece of text will need to become a standard step in automatic linguistic analysis, just like part-of-speech tagging and syntactic parsing. Sentences must be connected together in a specific way to form a coherent discourse within the text. To compose a paragraph that “makes sense,” the sentences must form a discourse relation, which manifests the coherence between the sentences.

The discourse relation that glues together a pair of sentences can be explicitly signaled by a discourse connective or implicitly inferred from the text as long as the sentences are semantically coherent. Examples of discourse relations are demonstrated below:

(1) I want to add one more truck. (Because) I sense that the business will continue to grow.

(2) # I do not want to add one more truck. (Because) I sense that the business will continue to grow.

The discourse relation in (1) does not require the discourse connective because to signal a causal relation. The causal relation can be inferred from the text, and the discourse sounds natural regardless of the use of because. However, the pair of sentences in (2) does not form a coherent discourse relation, and the sense of the discourse relation cannot be inferred with or without the discourse connective. In this dissertation, we focus on implicit discourse relation classification, where the discourse relations are not signaled explicitly by discourse connectives. Our main goal is to develop novel approaches that determine the sense of the discourse relation that holds between a pair of textual spans. We devise various algorithms from the traditional feature engineering paradigm, the distant supervision paradigm, and neural network modeling. These algorithms address linguistic and computational challenges associated with the task of implicit discourse relation classification. This chapter will discuss the main motivation of our work and provide an overview of our contributions in addressing these issues.

1.1 Motivation

Our work focuses on implicit discourse relation classification because it is the performance bottleneck of a discourse parser. The performance of a typical discourse parser on explicit discourse relations hovers around 90% accuracy, but the performance on implicit discourse relations reaches a measly 40% accuracy (Xue et al., 2015). Arguably, implicit discourse relation classification is the last component missing from a fully functional discourse parser, which can then be used in many natural language processing applications. In other words, we are only one step away from having a discourse parser. The sense of an implicit discourse relation is much more difficult to determine than that of an explicit one due to the lack of discourse connectives, which provide a high-fidelity signal of the senses. By mapping each discourse connective to its most likely sense, we can achieve as high as 82% accuracy on the explicit discourse relations. The situation is much more complicated for implicit discourse relations, as we can no longer rely on such a signal. We must infer the sense solely from the text, its context, and other external linguistic resources. The complexity of an implicit discourse relation parser is much higher than that of its explicit counterpart. Statistical approaches to this problem have seen limited success because the task is plagued by the sparsity problem, complicated by wild linguistic variability, and limited by the size of the dataset (Pitler et al., 2009). This task requires a discourse relation representation that can capture the patterns and regularities of discourse relations despite data scarcity. The system must utilize a compact representation while using a sophisticated model formulation. We aim to remedy this problem by reducing the sparsity of the feature space directly. A traditional machine learning approach to such an NLP task employs around a million features. A lot of the features contribute more noise than signal to the system (Pitler et al., 2009; Lin et al., 2009) because the variability is high, but the size of the dataset is small.

Feature selection becomes a crucial step in improving the system (Park and Cardie, 2012). This problem motivates us to find a novel alternative to the sparse feature space and also to artificially increase the size of the training set without an extra annotated corpus. We also explore distributed representation, or neural network modeling, and move away from the manual feature engineering that causes the problem in the first place. Neural network models have revolutionized speech recognition and some branches of NLP within the past few years. This alternative representation does not rely so heavily on feature engineering and feature selection. It instead relies on non-linear transformation and on extracting semantic information from unannotated data. Under this paradigm, we do not have to handcraft our own features or conduct complicated feature selection.

1.2 Dissertation Contributions

This dissertation is concerned with assigning to a pair of sentences the discourse relation sense that holds within the pair, in the style of the Penn Discourse Treebank. We attack this problem from different angles, inspired by the weaknesses of previous approaches and the linguistic issues in discourse modeling. The major contributions of this dissertation are summarized as follows.

1.2.1 Generalizable Featurization

In this work, we create a more generalizable featurization to alleviate the sparsity problem in discourse modeling. The details of this work can be found in Chapter 3. We use the Brown clustering algorithm to create compact, generalizable features and employ coreferential patterns that exist throughout the discourse. Since the publication of this work, Brown cluster pair features have been part of a state-of-the-art system and are now considered a standard strong baseline feature set by several subsequent works in discourse parsing (Xue et al., 2015; Ji and Eisenstein, 2015; Braud and Denis, 2015; Yoshida et al., 2015; Wang et al., 2015).

This feature set has many advantages. It is very simple to generate and reproduce, as an unannotated corpus is the only ingredient. It also does not depend on specific linguistic resources, so the feature set can be applied to other languages as well. The algorithm works as follows. We first cluster all of the words based on the unannotated data and use the cluster assignments as the basis for feature engineering. The features that are known to be effective in this task derive from an interaction of features across the pair of sentences, for example, the Cartesian product of the words in the pair or the conjunction of the verb classes of the main verbs. These features often create too many features that contain little signal, and they require external resources such as a verb class lexicon. Brown clustering is our algorithm of choice as it has been shown to work well in various NLP tasks while maintaining reasonable complexity. We can then employ the Cartesian product of Brown clusters as the main feature set, as sketched below. Our experiments suggest that this feature set is effective in combination with other surface features and improved the state of the art at the time. Another related feature set in this work takes advantage of the coreference chains within the discourse. We generate a feature set that takes into account the entities that exist within and across the sentences. When paired with the Brown cluster representation, coreferential pattern features contribute a marginal improvement in performance.
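As a concrete illustration of this featurization, the sketch below generates Brown cluster pair features for a pair of arguments. The cluster bit-strings, the helper function, and the example tokens are hypothetical stand-ins, not the dissertation's actual clusters or code.

```python
from itertools import product

# Hypothetical word-to-Brown-cluster mapping; in practice this is read from
# the output of a Brown clustering run over a large unannotated corpus.
BROWN_CLUSTERS = {
    "business": "0110", "grow": "0111", "truck": "0010",
    "add": "1101", "sense": "1100", "continue": "0111",
}

def brown_cluster_pair_features(arg1_tokens, arg2_tokens):
    """Cartesian product of Brown cluster assignments across the two arguments."""
    clusters1 = {BROWN_CLUSTERS[w] for w in arg1_tokens if w in BROWN_CLUSTERS}
    clusters2 = {BROWN_CLUSTERS[w] for w in arg2_tokens if w in BROWN_CLUSTERS}
    return {f"brown_pair={c1}|{c2}" for c1, c2 in product(clusters1, clusters2)}

arg1 = ["i", "want", "to", "add", "one", "more", "truck"]
arg2 = ["the", "business", "will", "continue", "to", "grow"]
print(sorted(brown_cluster_pair_features(arg1, arg2)))
```

Because the number of clusters is far smaller than the vocabulary, the resulting pair features are one to two orders of magnitude fewer than raw word-pair features while still capturing inter-argument interaction.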

1.2.2 Distant Supervision from Discourse Connectives

In this work, we design and test the selection criteria for eligible extra artificial training data through the computational study of discourse connectives in the PDTB to further alleviate the sparsity problem. This is the first successful effort in incorporating explicit discourse

relations and unannotated data in enhancing the performance of implicit discourse relation classification, to the best of our knowledge. Discourse connectives provide a high-fidelity signal for the sense of discourse relations, hence making the text easy to understand for both humans and machines. Unfortunately, implicit discourse relations do not benefit from such a signal because the discourse connectives are omitted. It is natural to think that we can simply artificially omit all discourse connectives and use the explicit discourse relations as extra training data for an implicit discourse relation classifier. However, implicit discourse relations are not equivalent to such explicit discourse relations, as confirmed by the dramatic drop in performance (Pitler et al., 2009). Besides the use of discourse connectives, in what way do explicit and implicit discourse relations differ? We hypothesize that certain discourse connectives are disproportionately omitted. In other words, only certain explicit discourse relations are equivalent to implicit discourse relations when the discourse connectives are omitted. If we have a way to identify such relations, we can artificially increase the size of the dataset and increase our capability to train a larger model. We study the linguistic and statistical properties of the English discourse connectives and classify them into classes. We discover freely omissible discourse connectives and use them to select instances of explicit discourse relations from both annotated and unannotated data to include in the training set. Only this class of discourse connectives improves the performance. The addition of these artificial data effectively reduces the data scarcity and feature sparsity problems.
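The precise selection criteria are developed in Chapter 4; purely as an illustration of the kind of statistics involved, the sketch below computes an omission rate and a Jensen-Shannon divergence between a connective's sense distributions with and without the connective, and flags a hypothetical "freely omissible" candidate. The numbers and thresholds are invented, not taken from the dissertation.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(a, b):
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical per-connective statistics over four top-level senses.
explicit_sense_dist = [0.10, 0.05, 0.70, 0.15]   # when the connective is present
implicit_sense_dist = [0.12, 0.06, 0.68, 0.14]   # when it is omitted
omission_rate = 0.55                              # fraction of occurrences where it is omitted

# A connective could be considered "freely omissible" if it is often omitted
# and its sense distribution barely changes when it is omitted.
if omission_rate > 0.5 and js_divergence(explicit_sense_dist, implicit_sense_dist) < 0.1:
    print("candidate for distant supervision")
```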

1.2.3 Dense Featurization and Neural Network Modeling

We break away from sparse featurization and use distributed features that do not suffer from the sparsity problem in order to improve the accuracy. We explore the use of distributional

and neural network modeling in the domain of discourse relation classification. We take advantage of word embeddings induced on a large unannotated corpus and essentially construct dense feature vectors to represent discourse relations. We name our first model the Neural Discourse Mixture Model (NDMM), which combines the strengths of surface features and continuous features to capture the variability and create abstract features that we cannot otherwise discover. The NDMM further improves the classification performance over our previous work. One of the surprising findings from this work is that distributed continuous features constructed from word embeddings alone can achieve the performance once only afforded by handcrafted features. This is made possible by the non-linear transformations implemented by the network. Upon further computational analysis of the feature vectors, we found that the network indeed abstracts features such as topic similarity, topic shift, and sentiment gradient, which provides some explanation for why this approach can be quite effective in discourse analysis.

1.2.4 Sequential and Tree-structured Recurrent Neural Network for Discourse Parsing

We are the first to build Long Short-Term Memory (LSTM) models constructed from the sequential and the tree structures of the sentences in discourse relations. Through experimentation with these models, we investigate the effects of and the interaction between lexical knowledge and structural knowledge encoded in the discourse relation. Our best neural model achieves performance equivalent to the previous best-performing surface features. LSTM models are superior to the feedforward neural network in that they can capture sequential information arbitrarily far back in time (Hochreiter and Schmidhuber, 1997; Gers et al., 2000).

They also learn what to remember and what to forget from the previous time steps (Sundermeyer et al., 2012a). Moreover, they can be structured along a parse tree to encode syntactic information (Tai et al., 2015). We hypothesize that as we impose more rigid structure on the construction of feature vectors, the performance will improve. The small set of parameters and the more structured model should allow us to be more data-efficient.

1.3 Robust Light-weight Multilingual Neural Discourse Parser

Lastly, we extend what we learned from discourse modeling in the PDTB to the corpus of the Chinese Discourse Treebank (CDTB) and test how robust our approach is when the discourse parser must handle different label sets. We show that a feedforward neural architecture with the summation of individual word vectors as features is robust to different label sets and different languages. This model is easy to deploy and re-train as it does not need to generate large feature vectors or carry around many parameters. The performance of our approach on the CDTB and various adaptations of the PDTB is comparable to, if not better than, a system loaded with surface features, although our approach does not require external resources such as a semantic lexicon or a syntactic parser.
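A minimal sketch of this kind of architecture is shown below, assuming pretrained word vectors and arbitrary toy dimensions and label counts; none of the hyperparameters, the vocabulary, or the initialization reflect the dissertation's actual model.

```python
import numpy as np

rng = np.random.default_rng(0)
EMB_DIM, HIDDEN_DIM, NUM_SENSES = 100, 50, 11     # hypothetical sizes

# Pretrained word vectors; random stand-ins keyed by word for illustration.
word_vectors = {w: rng.normal(size=EMB_DIM)
                for w in ["kemper", "firm", "program", "trading", "important"]}

def argument_vector(tokens):
    """Sum of word vectors for one argument; zeros if no token is known."""
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    return np.sum(vecs, axis=0) if vecs else np.zeros(EMB_DIM)

# One hidden layer over the concatenated argument vectors, then softmax.
W1 = rng.normal(scale=0.01, size=(HIDDEN_DIM, 2 * EMB_DIM)); b1 = np.zeros(HIDDEN_DIM)
W2 = rng.normal(scale=0.01, size=(NUM_SENSES, HIDDEN_DIM)); b2 = np.zeros(NUM_SENSES)

def predict_sense(arg1_tokens, arg2_tokens):
    x = np.concatenate([argument_vector(arg1_tokens), argument_vector(arg2_tokens)])
    h = np.tanh(W1 @ x + b1)
    scores = W2 @ h + b2
    probs = np.exp(scores - scores.max()); probs /= probs.sum()
    return probs  # distribution over the sense labels

print(predict_sense(["kemper", "firm"], ["program", "trading", "important"]))
```

Because the only per-example work is summing pretrained vectors and two small matrix multiplications, such a model carries very few parameters and no language-specific resources, which is what makes it easy to port across label sets and languages.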

Chapter 2

Discourse Parsing and Neural Network Modeling in NLP

This chapter provides an overview of two dominant theories in computational discourse analysis, with emphasis on analysis in the style of the Penn Discourse Treebank. We will also survey previous approaches to discourse parsing and recent successes of neural network modeling in discourse parsing and natural language processing in general.

2.1 Discourse Analysis

If we think of syntax as the study of how words are pieced together to form a grammatical sentence, we can think of discourse analysis as the study of how sentences are pieced together to form a coherent body of text. A lot of work has been done to study discourse in the context of a dialogue or a coherent turn-taking conversation. In this dissertation, we focus on written text or monologue, where there is no explicit turn-taking between agents in the discourse.

Figure 2.1: Example of a structure under Rhetorical Structure Theory. The text spans are the elementary discourse units. An arrow originates at the nucleus and points to the satellite. (Taken from Ji & Eisenstein (2014))

However, unlike syntax, there do not exist clean theories of discourse grammar that explain or describe how a sequence of sentences makes a coherent text. We have not seen discourse features or subcategories that can fully characterize and explain discourse phenomena the same way we do in syntax. There are a few theories, along with annotated corpora, that illustrate the theories and allow for computational investigation of discourse. We will discuss two theories and their corresponding corpora, namely Rhetorical Structure Theory and the Penn Discourse Treebank.

Rhetorical Structure Theory

Rhetorical Structure Theory (RST) adopts a tree as the structure that underlies relationships within units of text (Mann and Thompson, 1988). As the name suggests, discourse analysis in this style imposes structure on the text such that all of the textual units are completely interconnected. In terms of structure, the smallest unit of text in this theory is called an elementary discourse unit (EDU). An EDU does not correspond to any syntactic constituent, as a syntactic construct might not map nicely onto a discourse construct, and there is no compelling reason to believe that discourse analysis should comply with one particular syntactic theory. The arguments of a relation are a pair of EDUs functionally named the nucleus and the satellite. Some of the relations also allow for multiple satellites. Under this schema, the set of relations forms a tree that underlies the structure of the text (Figure 2.1). In terms of functional properties and organization of discourse in RST, the original proposal defines 24 primary types of rhetorical relations (Table 2.1). However, this inventory of rhetorical senses is designed to be an open set, which one can extend or modify for a particular kind of analysis or application.

Circumstance
Solutionhood
Elaboration
Background
Enablement and Motivation: Enablement, Motivation
Evidence and Justify: Evidence, Justify
Relations of Cause: Volitional Cause, Non-volitional Cause, Volitional Result, Non-volitional Result, Purpose
Antithesis and Concession: Antithesis, Concession
Condition and Otherwise: Condition, Otherwise
Interpretation and Evaluation: Interpretation, Evaluation
Restatement and Summary: Restatement, Summary
Other Relations: Sequence

Table 2.1: The relation definitions in Rhetorical Structure Theory first proposed by Mann & Thompson (1988)

The strengths of RST for discourse modeling are the structure imposed by the analysis and the annotated corpora that are large enough for computational modeling (Carlson et al., 2003). The complete tree structure describes precisely how one EDU relates to another in the text regardless of the textual distance.


As restrictive as this analysis seems, the majority of texts can be analyzed in this fashion, with the exception of laws, contracts, reports, and some forms of creative writing. This type of description is validated through a large annotated corpus, which opens the door for computational studies of discourse and automated discourse analysis. The RST discourse parser has shown promise for downstream NLP applications such as text generation and summarization (Marcu, 1999). The research community has been actively working on improving the performance of automatic discourse parsers in the style of RST (Ji and Eisenstein, 2014; Feng and Hirst, 2014; Feng and Hirst, 2012).

The Penn Discourse Treebank

The Penn Discourse Treebank models a discourse as a list of binary discourse relations, in contrast to the tree structure imposed by RST (Prasad et al., 2008; Prasad et al., 2007). The PDTB does not have a smallest textual unit from which the analysis is built up in a bottom-up fashion. On the contrary, a textual argument can be any contiguous or non-contiguous text span and is called an abstract object, which describes a state, an event, or a proposition. A discourse relation only shows local discourse coherence, as not all text spans are connected in this kind of analysis. The structure of a PDTB analysis is less rigid than that of RST. Each discourse relation consists of the first argument Arg1, the second argument Arg2, a discourse connective, and a sense tag. Discourse analysis in the style of the PDTB is said to be lexically grounded because the annotation must be based on the discourse connectives, whether or not they are omitted in the text. Arg2 is the textual argument that is syntactically bound to the discourse connective. If the discourse connective is present, it is annotated as such. If not, the most plausible connective is artificially inserted by the annotators. The sense annotation is also lexically grounded in that it must be determined solely based on the sense of the discourse connective, disambiguated by the arguments as necessary.


Figure 2.2 shows an example of the discourse relations. Note that the relations do not form a tree structure although the arguments of the two discourse relations might overlap, and that the discourse connectives must be annotated for all relations, regardless of whether they actually appear in the text, to enforce lexical groundedness. The contrast between RST and the PDTB with respect to relation types has an implication on the modeling choice. The sense inventory in RST is mostly flat and designed to have high coverage while allowing more senses to be added to the set as we move to different domains or specific applications. On the other hand, the PDTB sense inventory is organized in a hierarchical fashion (Table 2.2).

TEMPORAL
  Asynchronous: precedence, succession
  Synchrony
COMPARISON
  Contrast: juxtaposition, opposition
  Pragmatic Contrast
  Concession: expectation, contra-expectation
  Pragmatic Concession
CONTINGENCY
  Cause: reason, result
  Pragmatic Cause: justification
  Condition: hypothetical, general, unreal present, unreal past, factual present, factual past
  Pragmatic Condition: relevance, implicit assertion
EXPANSION
  Conjunction
  Instantiation
  Restatement: specification, equivalence, generalization
  Alternative: conjunctive, disjunctive, chosen alternative
  Exception
  List

Table 2.2: Sense hierarchy in the Penn Discourse Treebank 2.0


Explicit discourse relation (Comparison.Concession) According to Lawrence Eckenfelder, a securities industry analyst at Prudential-Bache Securities Inc., “Kemper is the first firm to make a major statement with program trading.” He added that “having just one firm do this isn’t going to mean a hill of beans. But if this prompts others to consider the same thing, then it may become much more important.”

Implicit discourse relation (Comparison.Contrast) According to Lawrence Eckenfelder, a securities industry analyst at Prudential-Bache Securities Inc., “Kemper is the first firm to make a major statement with program trading.” He added that “[however] having just one firm do this isn’t going to mean a hill of beans. But if this prompts others to consider the same thing, then it may become much more important.”

Figure 2.2: Example of discourse relations in the Penn Discourse Treebank. The bold-faced text span is Arg1, the italicized text span is Arg2, and the underlined text span is the discourse connective. The connective in the square brackets is inserted by the annotators.

The sense hierarchy is beneficial in that it allows the annotation to be more reliable, as the annotators can move up and down the hierarchy as much as the information in the text allows them to. At the same time, it allows for finer-grained annotation, which can be essential for certain analyses or applications. In addition to its lexical groundedness and hierarchy of senses, the PDTB is ideal for computational modeling because the annotated corpus is the largest collection of discourse annotation. The source of the corpus is the Wall Street Journal, the same source text as the Penn Treebank (Marcus et al., 1993). In other words, the PDTB is another annotation layer over the Penn Treebank. We can use the gold standard parse trees when testing systems that utilize features derived from syntactic trees to gauge how much the performance is affected by the mistakes made by an automatic syntactic parser. As discourse phenomena have more variation and the theoretical underpinning of discourse is not as solidified as that of syntax, discourse parsing requires a large amount of data to reach an acceptable performance (Pitler et al., 2008; Pitler et al., 2009).

Figure 2.3: System pipeline for the discourse parser (Lin et al., 2014)

                        Explicit parser                 Implicit parser
                        Precision   Recall   F1         Precision   Recall   F1
Gold parses + No EP     86.77       86.77    86.77      39.63       39.63    39.63
Gold parses + EP        83.19       82.65    82.92      26.21       27.63    26.90
Automatic parses + EP   81.19       80.04    80.61      24.54       26.45    25.46

Table 2.3: Discourse parser performance summary with or without error propagation (EP) from the upstream components (Lin et al., 2014)

A substantial amount of work in discourse parsing has been done on both RST and the PDTB. The task is to parse a piece of text into the structure imposed by the corresponding theory. In the case of the PDTB, we have to first find all of the discourse connectives and then all of the abstract objects that might participate in a discourse relation. Lastly, the system classifies each relation as one of the senses. The PDTB end-to-end discourse parser was first introduced by Lin et al. (2010b) and presents a typical architecture of a discourse parser pursued by subsequent work (Xue et al., 2015; Wang and Lan, 2015; Song et al., 2015). In this architecture, the necessary components are a discourse connective detector, an argument extractor for explicit discourse relations, an argument extractor for implicit discourse relations, a discourse connective classifier, and an implicit discourse relation sense classifier (Lin et al., 2014), and the parser works in a multistep pipeline (Figure 2.3). The end output of this system is precisely a list of discourse relations as they are annotated in the PDTB. This


system is also significant in that it decomposes the task of discourse parsing into multiple fundamental components, through which we can improve piecewise.

2.2 Modeling Implicit Discourse Relations in the PDTB

The performance bottleneck of the discourse parser is on implicit discourse relations, as it has proven to be a much harder task. The F1 scores for the explicit relations hover around 80, but the F1 scores for the implicit relations fall flat at around 25, which makes the whole system unusable due to the low overall performance (Table 2.3). If we improve the implicit discourse relation classifier up to a certain point, the discourse parser will show real practical value in downstream applications. In this dissertation, we focus on the implicit discourse relation classifier in particular.

The pioneering work in implicit discourse relation classification relies heavily on specific linguistic lexicons and features derived from syntactic parses (Pitler et al., 2008; Pitler et al., 2009; Lin et al., 2009). The previous work follows the traditional paradigm of crafting linguistically motivated features and loading them into an off-the-shelf machine learning algorithm such as Naive Bayes or Maximum Entropy models. Pitler et al. (2009) built the first implicit discourse relation classifier, which employs a battery of discrete features. Most of the features are centered around turning the raw tokens into their semantic classes. The features are based on various lexicons, namely the Multi-perspective Question Answering opinion corpus for semantic polarity, Inquirer tags for the general semantic category, and Levin's verb lexicon for syntactically motivated verb classes. The Cartesian products are then taken between the features from each argument. The main motivation for using these lexicons is that discourse relation interpretation requires the semantics of the individual words and how the semantics interact across the sentences within the relation, which should be captured by the Cartesian products.

This idea achieves a sizable improvement over the baseline and presents a strong benchmark for later work. In our work, we use these two ideas of semantic representation and inter-argument interaction to inspire distributed approaches which can capture the semantics more efficiently and model the inter-argument interaction more compactly. Another effective feature set is production rules (Lin et al., 2009). The features are generated by taking the Cartesian product of all of the production rules that constitute the parse tree of Arg1 with all of the production rules that constitute the parse tree of Arg2. The same thing can be done on the dependency parses. This feature set is particularly useful when careful feature selection based on mutual information is performed (Lin et al., 2009; Ji and Eisenstein, 2015). This feature set makes sense because implicit discourse relations are sensitive to syntactic parallelism and certain fixed structures that are used to signal certain senses without the use of discourse connectives. In other words, the discourse relation can be signaled syntactically, making the use of a discourse connective unnatural or unnecessary. This work sheds light on how syntax might contribute to the inference of discourse relations and motivates us to incorporate syntactic structures in the neural network model when composing a feature vector up from the individual words. The feature sets above are at best described as a brute-force, linguistically uninspiring solution to the problem and necessitate a copious amount of feature selection. Lin et al. (2009) carry out an extensive series of feature selection and combination experiments to sharpen this feature-based approach. A modified definition of mutual information is used to rank features. The best combination consists of a few hundred features in each category, namely word pairs, production rules, and dependency rules. This method is clearly far from obtaining the optimal solution, as the problem of feature selection is NP-hard. Samples of random subsets of features have been used to improve the performance even further, presenting a brute-force solution to the original brute-force solution (Park and Cardie, 2012).



Figure 2.4: Fully connected feedforward neural network architecture with one hidden layer.

These experimental results confirm our suspicion that many features introduced into the feature space are plain garbage and almost impossible to detect and remove. This is one of the major drawbacks of the surface feature approach, especially in the task of implicit discourse relation classification. One of our later models tries to tackle this problem directly by bypassing all of the discrete features that flood the feature space to begin with and using distributed features instead.

2.3 Neural Network Modeling

Another winter of neural networks has come and passed. This time around, neural networks are back with a vengeance for many applications of machine learning. The same old model coupled with modern advances in computing has revolutionized computer vision (LeCun et al., 2010; Krizhevsky et al., 2012), speech recognition (Mohamed et al., 2012), language modeling (Sundermeyer et al., 2012b), and machine-learning paradigms in NLP in general (Socher et al., 2013a; Socher et al., 2012; Devlin et al., 2014). Neural networks have become industry-standard algorithms that achieve as low as 9% word error rate in speech recognition, abandoning the former decade-old set of algorithms that dominated the field (Senior et al., 2015). However, many NLP tasks have been rethought and revisited with neural network modeling with mixed results, and it is unclear whether neural networks are the clear winner in NLP.

This question opens the door for many research efforts in harnessing the power of neural networks in NLP.

Feedforward neural network

The main advantage of neural networks comes from the non-linearity, which is difficult to achieve with the standard discriminative models used in NLP, and from data-driven feature abstraction. In a vanilla feedforward neural network, we first create a feature vector specifically designed for the task and apply two operations alternately. We apply a linear transformation through matrix multiplication and an element-wise non-linear transformation through sigmoidal, hyperbolic tangent, or other non-linear functions. These transformations result in another feature vector that can then be used for classification or another round of transformation (Figure 2.4). More precisely, let $X \in \mathbb{R}^d$ be the feature vector we create for the task. Derived features $H_{(1)}$ are created from linear combinations of the inputs:

H_{(1)} = \sigma(W_{(1)} \cdot X + b_{(1)})

where $W_{(1)} \in \mathbb{R}^{k \times d}$ is a weight matrix, and $b_{(1)} \in \mathbb{R}^{k}$ is a bias vector. The (non-linear) activation function $\sigma(\cdot)$ is usually chosen to be the tanh or sigmoid function, but other options abound and are still an active area of research. The optional subsequent derived features or hidden layers have the same form:

H_{(i)} = \sigma(W_{(i)} \cdot H_{(i-1)} + b_{(i)})


A feedforward neural network with t hidden layers models the output posterior probability vector O as a function of the output score vector Y :

Y = W_o \cdot H_{(t)} + b_o
O = \mathrm{softmax}(Y) = \frac{\exp(Y)}{\sum_{l=1}^{L} \exp(Y_l)}

To train a neural network, we have to define the cost function that we minimize using an optimization algorithm. A popular cost function is the cross-entropy (CE) loss. If we have a categorical label variable $l_i$ and a feature vector $X_i$ for each instance $i$, then the cross-entropy is the sum of the negative label log-likelihood over all $N$ instances in the training set:

\mathcal{L}_{CE} = \sum_{i}^{N} -\log P(l_i \mid X_i) = \sum_{i}^{N} -\log O_{l_i}
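As a small sanity check of the formulas above, the sketch below implements the forward pass, the softmax output, and the cross-entropy loss in NumPy on toy data; all dimensions and values are arbitrary and not taken from the dissertation's experiments.

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, L, N = 8, 5, 3, 4          # input dim, hidden dim, number of labels, batch size

W1, b1 = rng.normal(size=(k, d)), np.zeros(k)
Wo, bo = rng.normal(size=(L, k)), np.zeros(L)

def forward(x):
    h = np.tanh(W1 @ x + b1)                 # H_(1) = sigma(W_(1) X + b_(1))
    y = Wo @ h + bo                          # output scores Y
    o = np.exp(y - y.max()); o /= o.sum()    # O = softmax(Y)
    return o

X = rng.normal(size=(N, d))                  # toy feature vectors
labels = rng.integers(0, L, size=N)          # toy gold labels

# Cross-entropy: sum of negative log-likelihood of the gold labels.
loss = sum(-np.log(forward(x)[l]) for x, l in zip(X, labels))
print(loss)
```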

The loss functions in neural networks can only be optimized with a gradient-based method, as there is no closed-form solution like the one for some probabilistic models. The gradient is tractable but expensive to compute. The gradient can be computed efficiently using the backpropagation algorithm. The algorithm emerges from the fact that we take the partial derivative with respect to each parameter in the model. We first compute the gradient with respect to the parameters in the output layer and then backpropagate some of the computation to the earlier layers. Once we compute the gradients for all parameters, we have to use variants of stochastic gradient methods and other computational “tricks” to speed up and stabilize the learning process, such as Adagrad, AdaDelta, or the momentum method (Duchi et al., 2011a; Zeiler, 2012; Polyak, 1964).
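As one concrete example of such an update rule, a bare-bones Adagrad step keeps a running sum of squared gradients per parameter and scales the learning rate accordingly; the gradients and hyperparameters below are toy values chosen only for illustration.

```python
import numpy as np

def adagrad_step(param, grad, cache, lr=0.1, eps=1e-8):
    """One Adagrad update: per-parameter learning rates from accumulated squared gradients."""
    cache += grad ** 2
    param -= lr * grad / (np.sqrt(cache) + eps)
    return param, cache

# Toy parameter vector and a few made-up gradient steps.
w = np.zeros(3)
cache = np.zeros(3)
for grad in [np.array([0.5, -1.0, 0.1]), np.array([0.4, -0.8, 0.2])]:
    w, cache = adagrad_step(w, grad, cache)
print(w)
```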



Figure 2.5: In the Skip-gram architecture, word vectors (projection) are induced such that one can use the learned weights to predict the co-occurring words (output).

Word embedding

Besides the algorithmic strength in feature abstraction through hidden layers, neural networks also lend themselves structurally to distributed representations of many linguistic units. The classical paradigms in linguistics often characterize a linguistic unit with a category (e.g. part-of-speech) or a list of categorical features (e.g. a node in a constituent parse). Instead, we can treat a linguistic unit as a vector of continuous-valued features. The success of neural networks in NLP takes root in word embeddings, or word vectors (Mikolov et al., 2013b). Word vectors induced by neural network architectures such as the Skip-gram or Continuous Bag-of-Words architectures have been shown to be superior to word vectors induced by counting and dimensionality reduction (Baroni et al., 2014). Figure 2.5 shows the Skip-gram architecture schematically. The Skip-gram architecture posits the objective function


$\mathcal{L}_{\mathrm{skipgram}}$ such that the word vector $x_t$ (projected from the one-hot vector $w_t$) predicts the words within the window of size $k$:

\mathcal{L}_{\mathrm{skipgram}} = \sum_{t}^{T} \sum_{i=-k;\, i \neq 0}^{k} \log P(w_{t+i} \mid x_t)
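To make the objective concrete, the toy sketch below scores a single window position under this objective with a full softmax over a tiny vocabulary; the vocabulary, vectors, and window size are invented, and practical implementations replace the full softmax with approximations such as negative sampling or hierarchical softmax.

```python
import numpy as np

rng = np.random.default_rng(2)
vocab = ["the", "business", "will", "continue", "to", "grow"]
V, D = len(vocab), 10
word_vecs = rng.normal(size=(V, D))      # projection vectors x_w for center words
context_vecs = rng.normal(size=(V, D))   # output vectors used to score context words

def log_prob(context_id, center_id):
    scores = context_vecs @ word_vecs[center_id]
    scores -= scores.max()
    return scores[context_id] - np.log(np.exp(scores).sum())

sentence = [vocab.index(w) for w in ["the", "business", "will", "continue", "to", "grow"]]
k, t = 2, 2                              # window size and one center position
objective = sum(log_prob(sentence[t + i], sentence[t])
                for i in range(-k, k + 1) if i != 0 and 0 <= t + i < len(sentence))
print(objective)
```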

Word vectors can be used as a lexical building block for many downstream tasks as they encode the semantics of a word better than the usual one-hot representation. For example, neural language models have improved substantially over the traditional n-gram language models due to the changes in the representation of a word and the representation of a context (Sundermeyer et al., 2012b; Mikolov et al., 2013a). The categorical representation of a word is replaced by a word vector that can encode some syntactic and semantic properties of the word, and the categorical representation of a context is replaced by a context vector that can encode a context of unbounded length. Furthermore, other linguistic units can also be represented as vectors. Constituents in phrase structure parses have been represented as vectors and used as features for constituent-level sentiment analysis (Socher et al., 2013b). This representation can also be used for the task of parsing itself (Socher et al., 2013a; Chen and Manning, 2014). With the sequence-to-sequence neural architecture, machine translation has also moved away from categorical translation tables to neural machine translation with distributed representations of words and phrases (Bahdanau et al., 2014; Cho et al., 2014). Our work uses word vectors induced in this fashion to build up the discourse representation. We come up with multiple neural architectures that utilize the labeled data in the corpus to derive the distributed discourse vector. Neural discourse parsing, along with the use of word embeddings to derive features, has gained some attention in recent years. A few efforts have been made to create distributed representations of a discourse unit in conjunction with designing a neural network architecture suitable for the task.

Ji and Eisenstein (2014) explore the idea of using word embeddings to construct representations of elementary discourse units (EDUs) in parsing the RST discourse structure. They compare bag-of-words features and an EDU vector composed by summing up the word vectors in the EDU. It turns out that bag-of-words features work better. However, they have not explored extensively the effects of different word embeddings and different ways of composing the EDU vector. These two are at the core of neural network models that work well and deserve more experimentation. In our work, we conduct a series of experiments on the PDTB labels such that we can directly compare different word vectors and discourse vector compositions. A more systematic comparison of various word representations has been conducted for the PDTB discourse relations. Braud and Denis (2015) use a Maximum Entropy model to evaluate different distributed features on the four binary classification formulations. The feature vector is formed by adding or multiplying all of the word vectors in each of the arguments in the discourse relation. They show that dense distributed features outperform sparser features on some categories. Better results can be achieved when distributed and sparse features are used together and when all of the words (as opposed to just head words) are included in the composition. This suggests to us that distributed features might be the key to further improvement. In our work, we further explore ways of composing feature vectors and exploit the strength of dense feature vectors in deeper neural network models.



Figure 2.6: Recurrent neural network architecture

Recurrent neural network

Another breakthrough in the recent development of neural network modeling comes from recurrent models. In contrast to the feedforward architecture described earlier, a recurrent neural network (RNN) models the influence of the data in a sequence (Hochreiter and Schmidhuber, 1997; Elman, 1990). RNNs can naturally be used for modeling sequences in a similar fashion to Hidden Markov Models or linear-chain Conditional Random Fields. The simplest form of RNN, also called the Elman network or “vanilla RNN”, passes on the hidden state or activation to the next time step (Figure 2.6). In other words, the hidden state of the current time step $H_t$ is a function of the input layer of the current time step $X_t$ and the hidden state of the previous time step $H_{t-1}$. The choice of non-linear activation function is similar to the feedforward net. The hidden activation $H_t$ can be written as a recurrent relation:

H_t = g(H_{t-1}, X_t) = \sigma(W \cdot H_{t-1} + U \cdot X_t + b),

where $W$ and $U$ are weight matrices, and $b$ is a bias vector. The output layer remains the same. Training a recurrent neural network involves the backpropagation-through-time (BPTT) algorithm. BPTT is in essence the same as backpropagation for a feedforward net, but the gradient calculation now involves the whole sequence of hidden states and the whole sequence of outputs.


To make the computation more efficient, we can determine after how many time steps the error from the output layer stops propagating. This approximation does not hurt the performance much when set appropriately, but it is a quick-and-dirty cure to a computational symptom of the recurrent neural network. The multiplicative operations in the vanilla RNN lead to the exploding and vanishing gradient problems, which make it converge too slowly or converge to a suboptimal point. During the training process of an RNN, the backpropagated gradient sometimes becomes too small, exerting too little influence back through time. The opposite problem can also occur: if the backpropagated error becomes just slightly large in one time step, it can become exponentially large as it is backpropagated through time and make the training process infeasible. The long short-term memory (LSTM) network was introduced to address these problems (Hochreiter and Schmidhuber, 1997). The LSTM still maintains the recurrent structure in the model, but it differs in how we compute the hidden state $H_t$.

it differs in how we compute the hidden state H_t. A standard variant of LSTM introduces three gating layers: an input gating layer i_t, a forget gating layer f_t, and an output gating layer o_t. These gating layers are typically a function of the input layer at the current time step and the hidden state at the previous time step.

i_t = sigmoid(W_i · X_t + U_i · H_{t-1} + b_i)
f_t = sigmoid(W_f · X_t + U_f · H_{t-1} + b_f)
o_t = sigmoid(W_o · X_t + U_o · H_{t-1} + b_o)

In addition, LSTM introduces the candidate memory cell c′_t and the memory cell c_t. These memory cells help the model control which parts of the previous time step should be forgotten or remembered. The memory cells and the gating layers work together to alleviate

the vanishing and exploding gradients.

c′_t = tanh(W_c · X_t + U_c · H_{t-1} + b_c)
c_t = c′_t ∗ i_t + c_{t-1} ∗ f_t
H_t = c_t ∗ o_t
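To make the gating and memory-cell updates concrete, the following is a minimal NumPy sketch of a single LSTM step following the equations above. The dimensions, random initialization, and toy input sequence are illustrative assumptions, not the configuration used later in this thesis.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step following the equations above.

    x_t    : input vector at time t (e.g. a word embedding)
    h_prev : hidden state H_{t-1}
    c_prev : memory cell c_{t-1}
    p      : dict of weight matrices W_*, U_* and bias vectors b_*
    """
    i_t = sigmoid(p['W_i'] @ x_t + p['U_i'] @ h_prev + p['b_i'])      # input gate
    f_t = sigmoid(p['W_f'] @ x_t + p['U_f'] @ h_prev + p['b_f'])      # forget gate
    o_t = sigmoid(p['W_o'] @ x_t + p['U_o'] @ h_prev + p['b_o'])      # output gate
    c_tilde = np.tanh(p['W_c'] @ x_t + p['U_c'] @ h_prev + p['b_c'])  # candidate cell c'_t
    c_t = c_tilde * i_t + c_prev * f_t                                # new memory cell
    h_t = c_t * o_t                                                   # new hidden state H_t
    return h_t, c_t

# Illustrative dimensions: 50-dimensional inputs, 100-dimensional hidden state.
rng = np.random.RandomState(0)
d_in, d_h = 50, 100
params = {}
for gate in ('i', 'f', 'o', 'c'):
    params['W_' + gate] = rng.randn(d_h, d_in) * 0.01
    params['U_' + gate] = rng.randn(d_h, d_h) * 0.01
    params['b_' + gate] = np.zeros(d_h)

h, c = np.zeros(d_h), np.zeros(d_h)
for x in rng.randn(8, d_in):        # a toy sequence of eight input vectors
    h, c = lstm_step(x, h, c, params)

Note that some formulations apply an additional tanh to c_t before the output gate; the sketch follows the formulation given above.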

The ideas of word embeddings and recurrent neural networks have changed the way we view sequence modeling and sequence tagging problems in NLP. Both the vanilla RNN and the LSTM-RNN can be used effectively as language models and outperform the Kneser-Ney and Witten-Bell language models that dominated as the state of the art for decades (Sundermeyer et al., 2012b). Each word is replaced with its corresponding word embedding, and the hidden state acts as a context vector that helps predict the next word. A deep variant of the RNN has been used to substantially improve the performance of opinion mining (Irsoy and Cardie, 2014). That work formulates opinion mining as a sequence tagging problem: RNNs are stacked up to form a deep RNN, and the features are the word embeddings. The deep model with small word embeddings performs much better than Conditional Random Fields. Recurrent neural networks, especially the LSTM-RNN, open up a new paradigm for sequence-to-sequence induction such as machine translation and parsing. Machine translation used to rely on discrete translation tables. Neural machine translation makes use of word embeddings and RNNs to create a representation of the source language and a prediction of the target language. Bahdanau et al. (2014) use a deep LSTM to encode a sequence of word embeddings in the source language into a single hidden vector. This source-language sentence vector is then passed into an LSTM decoder to generate the translation. The same kind of sequence-to-sequence induction with LSTMs as the encoder and the decoder has also been used in parsing (Vinyals et al.,


2015). The source language is the string of text, and the target language is a string encoding of the tree. These works yield at least two insights for discourse modeling. First, word embeddings can be composed through an LSTM-based encoder to create a representation that lends itself to predicting discourse relations. Secondly, the shallow sequential structure of the LSTM-RNN is capable of capturing deeper tree structure. These two insights motivate us to use word embeddings as a lexical semantic representation and to use a neural network to encode or compose a feature vector from the word embeddings.

Recurrent neural networks have also been used for discourse modeling. Notably, Ji and Eisenstein (2015) restructure the RNN around a binary syntactic tree. A tree-structured RNN is also called a recursive neural network, which has been used for parsing and sentiment analysis tasks (Socher et al., 2013a; Socher et al., 2013b; Tai et al., 2015). Each hidden state corresponds to an intermediate non-terminal in the phrase structure tree, and the root node serves as the feature vector for implicit discourse relation classification. The two feature vectors, one from each argument, are put through a bilinear tensor product, yielding all pairwise interaction terms between the feature vectors. The performance of these features alone is below that of a system using surface features, but it reaches the state-of-the-art level when the recursive neural net and the surface-feature system are combined. However, some questions remain, as the complexity of this model is not well dissected in the study. It is not clear whether the tree structure is needed, and it is also not clear whether we can better model the inter-argument interaction without the bilinear tensor product. In our work, we want to build the neural network model from the ground up and investigate the effects of multiple subcomponents and their contributions to discourse modeling.

Discourse-driven language models have been on the research agenda for speech language modeling and have been shown to be effective when integrated with RNNs (Ji et al., 2016). This particular model takes on the idea that the words in the next sentence should be predicted conditionally on the discourse relation that holds with the previous sentence (Figure 2.7).


Figure 2.7: Discourse-driven RNN language model. z denotes the discourse relation sense between the two sentences; y denotes the words in the sentences.

Chapter 3

Brown Cluster Pair and Coreferential Pattern Features

This chapter is adapted from Rutherford and Xue (2014). One of the key aspects of discourse relation modeling is the semantic interaction between the two text spans. A good key word pair should be able to signal the type of relationship that holds between the two text spans. For example, consider "Carl gets hungry. He walks swiftly to the pantry." The inter-argument word pair hungry-pantry signals the likely sense of a causal relation without considering the rest of the sentences. This simple inter-argument semantic interaction should be able to guide a machine-learning system to identify the sense of the discourse relation. Existing systems for implicit discourse relation classification, which make heavy use of word pairs, suffer from the data sparsity problem. There is no good way to filter which word pairs to include as features, which results in a large feature set. Additionally, a word pair in the training data may not appear in the test data, so generalization can be quite low. A better representation of two adjacent sentences beyond word pairs could have a

significant impact on predicting the sense of the discourse relation that holds between them. Data-driven, theory-independent word classification such as Brown clustering should be able to provide a more compact word representation (Brown et al., 1992). The Brown clustering algorithm induces a hierarchy of words from a large unannotated corpus based on word co-occurrences within a window. The induced hierarchy might give rise to features that we would otherwise miss. In this chapter, we propose to use the Cartesian product of the Brown cluster assignments of the sentence pair as an alternative abstract word representation for building an implicit discourse relation classifier. Through word-level semantic commonalities revealed by Brown clusters and entity-level relations revealed by coreference resolution, we might be able to paint a more complete picture of the discourse relation in question.

Coreference resolution unveils the patterns of entity realization within the discourse, which might provide clues for the types of discourse relations. The information about certain entities or mentions in one sentence should be carried over to the next sentence to form a coherent relation. It is possible that coreference chains and semantically related predicates in the local context show patterns that characterize types of discourse relations. We hypothesize that coreferential rates and coreference patterns created by Brown clusters should help characterize different types of discourse relations.

Here, we introduce two novel sets of features for implicit discourse relation classification. Further, we investigate the effects of using Brown clusters as an alternative word representation and analyze the impactful features that arise from Brown cluster pairs. We also study coreferential patterns in different types of discourse relations in addition to using them to boost the performance of our classifier. These two sets of features, along with previously used features, outperform the baseline systems by approximately 5% absolute across all categories and reveal many important characteristics of implicit discourse relations.


                Number of instances
                Implicit           Explicit
Comparison      2503 (15.11%)      5589 (33.73%)
Contingency     4255 (25.68%)      3741 (22.58%)
Expansion       8861 (53.48%)      72 (0.43%)
Temporal        950 (5.73%)        3684 (33.73%)
Total           16569 (100%)       13086 (100%)

Table 3.1: The distribution of senses of implicit discourse relations is imbalanced.

3.1 Sense annotation in Penn Discourse Treebank

The Penn Discourse Treebank (PDTB) is the largest corpus richly annotated with explicit and implicit discourse relations and their senses (Prasad et al., 2008). The PDTB is drawn from Wall Street Journal articles whose annotations overlap with the Penn Treebank (Marcus et al., 1993). Each discourse relation contains information about the extent of its arguments, each of which can be a sentence, a constituent, or a discontiguous span of text. Each discourse relation is also annotated with the sense of the relation that holds between the two arguments. In the case of implicit discourse relations, where the discourse connective is absent, the most appropriate connective is annotated. The senses are organized hierarchically. Our focus is on the top-level senses because they are the four fundamental discourse relations on which various discourse analytic theories seem to converge (Mann and Thompson, 1988). The top-level senses are Comparison, Contingency, Expansion, and Temporal.

The explicit and implicit discourse relations differ almost orthogonally in their distributions of senses (Table 3.1). This difference has a few implications for the study of implicit discourse relations and the use of discourse connectives (Patterson and Kehler, 2013). For example, Temporal relations constitute only 5% of the implicit relations but 33% of the explicit relations, perhaps because they are not as natural to create without discourse connectives.

On the other hand, Expansion relations might be more cleanly achieved without them, as indicated by their dominance among the implicit discourse relations. This imbalance in class distribution requires greater care in building statistical classifiers (Wang et al., 2012).

3.2 Experiment setup

We follow the setup of previous studies for a fair comparison with the two baseline systems by Pitler et al. (2009) and Park and Cardie (2012). The task is formulated as four separate one-against-all binary classification problems, one for each top-level sense of implicit discourse relations. In addition, we add one more classification task with which to test the system: we merge EntRel with Expansion relations to follow the setup used by the two baseline systems. An argument pair is annotated with EntRel in the PDTB if entity-based coherence, and no other type of relation, can be identified between the two arguments in the pair. In this study, we assume that the gold standard argument pairs are provided for each relation. Most argument pairs for implicit discourse relations are a pair of adjacent sentences or adjacent clauses separated by a semicolon and should be easy to extract.

The PDTB corpus is split into a training set, development set, and test set in the same way as in the baseline systems. Sections 2–20 are used to train classifiers, Sections 0–1 are used for developing feature sets and tuning models, and Sections 21–22 are used for testing the systems. The statistical models in the following experiments come from the MALLET implementation (McCallum, 2002) and libSVM (Chang and Lin, 2011). For all five binary classification tasks, we try Balanced Winnow (Littlestone, 1988), Maximum Entropy, Naive Bayes, and Support Vector Machine. The parameters and the hyperparameters of each classifier are set to their default values. The code for our model along with the data matrices is available at

github.com/attapol/brown_coref_implicit.
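As a rough sketch of this setup, the following shows how the section-based split and the one-against-all labels could be constructed. The relation records and field names are hypothetical stand-ins for however the PDTB is loaded, not an actual PDTB reader API.

def split_by_section(relations):
    """Standard PDTB split used here: sections 2-20 train, 0-1 dev, 21-22 test.

    Each relation is assumed to be a dict with an integer 'section' field and a
    top-level 'sense' label (a hypothetical representation of a loaded relation).
    """
    train = [r for r in relations if 2 <= r['section'] <= 20]
    dev   = [r for r in relations if r['section'] in (0, 1)]
    test  = [r for r in relations if r['section'] in (21, 22)]
    return train, dev, test

def one_vs_all_labels(relations, target_sense):
    """Binary labels for one of the one-against-all classification tasks."""
    return [1 if r['sense'] == target_sense else 0 for r in relations]

# Example: the Comparison-vs-others task on a toy relation list.
relations = [{'section': 2, 'sense': 'Comparison'}, {'section': 3, 'sense': 'Expansion'},
             {'section': 0, 'sense': 'Temporal'}, {'section': 22, 'sense': 'Contingency'}]
train, dev, test = split_by_section(relations)
labels = one_vs_all_labels(train, 'Comparison')   # -> [1, 0]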

3.3 Feature Description

Unlike the baseline systems, all of the features in the experiments use the output from automatic natural language processing tools. We use the Stanford CoreNLP suite to lemmatize and part-of-speech tag each word (Toutanova et al., 2003; Toutanova and Manning, 2000), obtain the phrase structure and dependency parses for each sentence (De Marneffe et al., 2006; Klein and Manning, 2003), identify all named entities (Finkel et al., 2005), and resolve coreference (Raghunathan et al., 2010; Lee et al., 2011; Lee et al., 2013).

3.3.1 Features used in previous work

The baseline features consist of the following: First, last, and first 3 words, numerical expressions, time expressions, average verb phrase length, modality, General Inquirer tags, polarity, Levin verb classes, and production rules. These features are described in greater detail by Pitler et al. (2009).

3.3.2 Brown cluster pair features

To generate Brown cluster assignment pair features, we replace each word with its hard Brown cluster assignment. We use the Brown word clusters provided by MetaOptimize (Turian et al., 2010), in which 3,200 clusters were induced from the RCV1 corpus, which contains about 63 million tokens of Reuters English newswire. We then take the Cartesian product of the Brown cluster assignments of the words in Arg1 and those of the words in Arg2.

For example, suppose Arg1 has two words w_{1,1}, w_{1,2}, Arg2 has three words w_{2,1}, w_{2,2}, w_{2,3}, and B(·) maps a word to its Brown cluster assignment. A word w_{i,j} is replaced by its corresponding Brown cluster assignment b_{i,j} = B(w_{i,j}). The resulting word pair features are (b_{1,1}, b_{2,1}), (b_{1,1}, b_{2,2}), (b_{1,1}, b_{2,3}), (b_{1,2}, b_{2,1}), (b_{1,2}, b_{2,2}), and (b_{1,2}, b_{2,3}). This feature set can therefore generate O(3200^2) binary features. The feature set size is orders of magnitude smaller than using the actual words, which can generate O(V^2) distinct binary features, where V is the size of the vocabulary.
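A minimal sketch of this feature construction, assuming the cluster file has already been loaded into a word-to-cluster dictionary; the toy cluster ids and lookup table below are made up for illustration.

from itertools import product

def brown_cluster_pairs(arg1_words, arg2_words, cluster_of):
    """Cartesian product of the Brown cluster assignments of Arg1 and Arg2 words.

    Words missing from the cluster lookup are skipped.
    Returns a set of binary feature names of the form 'b1|b2'.
    """
    b1 = [cluster_of[w] for w in arg1_words if w in cluster_of]
    b2 = [cluster_of[w] for w in arg2_words if w in cluster_of]
    return {'%s|%s' % (c1, c2) for c1, c2 in product(b1, b2)}

# Hypothetical hard cluster assignments (bit-string cluster ids, as in Turian et al., 2010).
cluster_of = {'hungry': '10110', 'pantry': '11100', 'walks': '01001', 'gets': '01000'}

arg1 = ['Carl', 'gets', 'hungry']
arg2 = ['He', 'walks', 'swiftly', 'to', 'the', 'pantry']
print(sorted(brown_cluster_pairs(arg1, arg2, cluster_of)))
# -> ['01000|01001', '01000|11100', '10110|01001', '10110|11100']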

3.3.3 Coreference-based features

We want to take further advantage of the semantics of the sentence pairs by considering how coreferential entities play out in them. We consider various inter-sentential coreference patterns to include as features and also to better describe each type of discourse relation with respect to its place in the coreference chain. For compactness in explaining the following features, we define similar words to be words assigned to the same Brown cluster.

Number of coreferential pairs: We count the number of inter-sentential coreferential pairs. We expect that Expansion relations should be more likely to have coreferential pairs because the detail or information about an entity mentioned in Arg1 should be expanded in Arg2; entity sharing might therefore be difficult to avoid.

Similar nouns and verbs: A binary feature indicating whether similar or coreferential nouns are the arguments of similar predicates. Predicates and arguments are identified by dependency parses. We notice that authors sometimes use synonyms while expanding on the previous predicates or entities. The words that indicate the common topics might be paraphrased, so exact string matching cannot detect whether the two arguments still focus on the same topic. This feature might be useful for identifying Contingency relations, as they usually discuss two causally related events that involve two seemingly unrelated agents and/or predicates.

Similar subject or main predicates: A binary feature indicating whether the main verbs of the two arguments have the same subjects, and another binary feature indicating whether the main verbs are similar. For our purposes, two subjects are said to be the same if they are coreferential or assigned to the same Brown cluster. We notice that Comparison relations usually have different subjects for the same main verbs and that Temporal relations usually have the same subjects but different main verbs.

                        Current                  Park and Cardie (2012)   Pitler et al. (2009)
                        P       R       F1       F1                       F1
Comparison vs others    27.34   72.41   39.70    31.32                    21.96
Contingency vs others   44.52   69.96   54.42    49.82                    47.13
Expansion vs others     59.59   85.50   70.23    -                        -
Exp+EntRel vs others    69.26   95.92   80.44    79.22                    76.42
Temporal vs others      18.52   63.64   28.69    26.57                    16.76

Table 3.2: Our classifier outperforms the previous systems across all four tasks without the use of gold-standard parses and coreference resolution.

3.3.4 Feature selection and training sample reweighting

The nature of the task and the dataset poses at least two problems for building a classifier. First, the classification task requires a large number of features, some of which are too rare to be conducive to parameter estimation. Second, the label distribution is highly imbalanced (Table 3.1), which might degrade the performance of the classifiers (Japkowicz, 2000). Recently, Park and Cardie (2012) and Wang et al. (2012) addressed these problems directly by optimally selecting a subset of features and training samples. Unlike previous work, we do not discard any of the data in the training set to balance the label distribution. Instead, we reweight the training samples in each class during parameter estimation such that the performance on the development set is maximized. In addition, the number of occurrences of each feature must exceed a cut-off, which is also tuned on the development set.
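A minimal sketch of these two steps, assuming the instances are stored as (feature-name list, binary label) pairs; the cut-off value and the positive-class weight shown are placeholders to be tuned on the development set.

from collections import Counter

def prune_rare_features(instances, cutoff):
    """Keep only features whose corpus frequency exceeds the cut-off."""
    counts = Counter(f for feats, _ in instances for f in feats)
    return [([f for f in feats if counts[f] > cutoff], label)
            for feats, label in instances]

def instance_weights(instances, positive_weight):
    """Per-instance weights for a one-vs-all task with an imbalanced label distribution."""
    return [positive_weight if label == 1 else 1.0 for _, label in instances]

# Toy usage with made-up feature names.
train = [(['b1|b2', 'polarity:neg'], 1), (['b3|b4'], 0), (['b1|b2'], 0)]
train = prune_rare_features(train, cutoff=1)            # cut-off tuned on the dev set
weights = instance_weights(train, positive_weight=3.0)  # reweighting tuned on the dev set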


Comparison
Feature set                               F1      % change
All features                              39.70   -
All excluding Brown cluster pairs         35.71   -10.05%
All excluding Production rules            37.27   -6.80%
All excluding First, last, and First 3    39.18   -1.40%
All excluding Polarity                    39.39   -0.79%

Contingency
Feature set                               F1      % change
All                                       54.42   -
All excluding Brown cluster pairs         51.50   -5.37%
All excluding First, last, and First 3    53.56   -1.58%
All excluding Polarity                    53.82   -1.10%
All excluding Coreference                 53.92   -0.92%

Expansion
Feature set                               F1      % change
All                                       70.23   -
All excluding Brown cluster pairs         67.48   -3.92%
All excluding First, last, and First 3    69.43   -1.14%
All excluding Inquirer tags               69.73   -0.71%
All excluding Polarity                    69.92   -0.44%

Temporal
Feature set                               F1      % change
All                                       28.69   -
All excluding Brown cluster pairs         24.53   -14.50%
All excluding Production rules            26.51   -7.60%
All excluding First, last, and First 3    26.56   -7.42%
All excluding Polarity                    27.42   -4.43%

Table 3.3: Ablation study: The four most impactful feature classes and their relative percentage changes are shown. Brown cluster pair features are the most impactful across all relation types.

3.4 Results

Our experiments show that the Brown cluster and coreference features, along with the features from the baseline systems, improve the performance for all discourse relations (Table 3.2). Consistent with the results from previous work, the Naive Bayes classifier outperforms MaxEnt, Balanced Winnow, and Support Vector Machine across all tasks regardless of feature pruning criteria and training sample reweighting. A possible explanation is that the small dataset size in comparison with the large number of features might favor a generative model like Naive Bayes (Jordan and Ng, 2002). We therefore only report the performance of the Naive Bayes classifiers. It is noteworthy that the baseline systems use the gold standard parses provided by the Penn Treebank, but ours does not, because we would like to see how our system performs realistically in conjunction with other pre-processing tasks such as lemmatization, parsing, and coreference resolution. Nevertheless, our system still manages to outperform the baseline systems on all relations by a sizable margin.

Our preliminary results on implicit sense classification suggest that the Brown cluster word representation and coreference patterns might be indicative of the senses of discourse relations, but we would like to know the extent of the impact of these novel feature sets when used in conjunction with other features. To this end, we conduct an ablation study, where we exclude one feature set at a time and test the resulting classifier on the test set. We then rank each feature set by the relative percentage change in F1 score when it is excluded from the classifier. The data split and experimental setup are identical to the ones described in the previous section, but only Naive Bayes classifiers are used.


The ablation study results imply that Brown cluster features are the most impactful feature set across all four types of implicit discourse relations. When ablated, Brown cluster features degrade the performance by the largest percentage compared to the other feature sets regardless of the relation type (Table 3.3). Temporal relations benefit the most from Brown cluster features: without them, the F1 score drops by 4.12 absolute or 14.50% relative to the system that uses all of the features.

3.5 Feature analysis

3.5.1 Brown cluster features

This feature set is inspired by word pair features, which are known for their effectiveness in predicting the senses of discourse relations between two arguments. Marcu and Echihabi (2002a), for instance, artificially generated implicit discourse relations and used word pair features to perform the classification tasks. Word pair features work well in that case because the artificially generated dataset is an order of magnitude larger than the PDTB. Ideally, we would want to use word pair features instead of word cluster features if we had enough data to fit the parameters. In the absence of such data, other less sparse handcrafted features prove to be more effective than word pair features for the PDTB (Pitler et al., 2009). We remedy the sparsity problem by clustering distributionally similar words together, which greatly reduces the number of features.

Since the ablation study is not fine-grained enough to spotlight the effectiveness of individual features, we quantify the predictiveness of each feature by its mutual information with the labels. Under the Naive Bayes conditional independence assumption, the mutual information between the features and the labels can be efficiently computed in a pairwise fashion. The mutual


information between a binary feature X_i and the class label Y is defined as:

I(X_i, Y) = Σ_y Σ_{x=0,1} p̂(x, y) log [ p̂(x, y) / ( p̂(x) p̂(y) ) ],

where p̂(·) denotes probability distributions whose parameters are maximum likelihood estimates from the training set. We compute mutual information for all four one-vs-all classification tasks. The computation is done as part of the training pipeline in MALLET to ensure consistency in parameter estimation and smoothing techniques. We then rank the cluster pair features by mutual information. The results are compactly summarized in the bipartite graphs shown in Figure 3.1, where each edge represents a cluster pair. Since mutual information itself does not indicate whether a feature is favored by one label or the other, we also verify the direction of the effect of each feature included in the following analysis by comparing the class-conditional parameters in the Naive Bayes model.

The most dominant features for Comparison classification are pairs whose members come from the same Brown cluster. We can see this pattern distinctly in the bipartite graph because the nodes on each side are sorted alphabetically: the graph shows many parallel short edges, which suggests that many informative pairs consist of the same clusters. Some of the clusters that participate in such pairs consist of named entities from various categories such as airlines (King, Bell, Virgin, Continental, ...) and companies (Thomson, Volkswagen, Telstra, Siemens). Some of the pairs form a broad category such as political agents (citizens, pilots, nationals, taxpayers) and industries (power, insurance, mining). These parallel patterns in the graph suggest that implicit Comparison relations might be mainly characterized by juxtaposing and explicitly contrasting two different entities in two adjacent sentences. Without the use of a named-entity recognition system, these Brown cluster pair features


effectively act as features that detect whether the two arguments in the relation contain named entities or nouns from the same categories. These more subtle named-entity-related features are cleanly discovered by replacing words with their data-driven Brown clusters, without the need for additional layers of pre-processing.

If the words in one cluster semantically relate to the words in another cluster, the two clusters are more likely to become informative features for Contingency classification. For instance, technical terms in stock and trading (weighted, Nikkei, composite, diffusion) pair up with economic terms (Trading, Interest, Demand, Production). The cluster with analysts and pundits pairs up with one that predominantly contains quantifiers (actual, exact, ultimate, aggregate). In addition to this pattern, we observe the same parallel pair pattern we found in Comparison classification. These results suggest that, in establishing a Contingency relation implicitly, the author might shape the sentences such that they have semantically related words if they do not mention named entities of the same category.

Through Brown cluster pairs, we obtain features that detect a shift between generality and specificity within the two arguments of the relation. For example, a cluster with industrial categories (Electric, Motor, Life, Chemical, Automotive) couples with specific brands or companies (GM, Ford, Barrick, Anglo). Such a pair might also simply reflect a shift in plurality, e.g. businesses-business and Analysts-analyst. Expansion relations include relations in which one argument provides a specification of the other and relations in which one argument provides a generalization of the other. Thus, these shift-detection features could help distinguish Expansion relations.

We also found a few common coreference patterns of names in written English to be useful. The first and last name are used in the first sentence to refer to a person who has just entered the discourse; that person is then referred to just by his or her title and last name in the following sentence. This pattern is found to be informative for Expansion relations. For example,

we find edges (not shown in the graph due to lack of space) from the first-name clusters to the title cluster (Mr, Mother, Judge, Dr).

Time expressions constitute the majority of the nodes in the bipartite graph for Temporal relations. More strikingly, specific dates (e.g., clusters containing positive integers smaller than 31) are more frequently found in Arg2 than in Arg1 in implicit Temporal relations. It is possible that Temporal relations are more naturally expressed without a discourse connective if a time point is clearly specified in Arg2 but not in Arg1.

Temporal relations might also be implicitly inferred by detecting a shift in quantities. We notice that clusters whose words indicate changes, e.g. increase, rose, loss, pair with number clusters. Sentences in which such pairs participate might be part of a narrative or a report where one expects a change over time. The changes conveyed by the sentences constitute a natural sequence of events that are temporally related but might not need explicit temporal expressions.

3.5.2 Coreference features

Coreference features are very effective given that they constitute a very small set compared to the other feature sets. In particular, excluding them from the model reduces F1 scores for Temporal and Contingency relations by approximately 1% relative to the system that uses all of the features. We found that the sentence pairs in these two types of relations have distinctive coreference patterns.

We count the number of pairs of arguments that are linked by a coreference chain for each type of relation. The coreference chains used in this study are detected automatically from the training set with the Stanford CoreNLP suite (Raghunathan et al., 2010; Lee et al., 2011; Lee et al., 2013). Temporal relations have a significantly higher coreferential rate than the other three relations (p < 0.05, pair-wise t-test corrected for multiple comparisons). The differences among Comparison, Contingency, and Expansion, however, are not statistically significant (Figure 3.2).

The choice to use or not to use a discourse connective is strongly motivated by linguistic features at the discourse level (Patterson and Kehler, 2013). Additionally, it is very uncommon to have temporally related sentences without explicit discourse connectives. The difference in coreference patterns might be one of the factors that influence the choice of using a discourse connective to signal a Temporal relation. If the sentences are coreferentially linked, then it might be more natural to drop the discourse connective because the temporal ordering can easily be inferred without it. For example,

(3) Her story is partly one of personal downfall. [previously] She was an unstinting teacher who won laurels and inspired students... (WSJ0044)

The coreference chain between the two temporally related sentences in (3) can easily be detected. Inserting previously, as suggested by the annotation in the PDTB corpus, does not add to the temporal coherence of the sentences and may be deemed unnecessary. However, the presence of a coreferential link alone might bias the inference toward a Temporal relation when Contingency could also be inferred.

Additionally, we count the number of pairs of arguments whose grammatical subjects are linked by a coreference chain, to reveal the syntactic-coreferential patterns in different relation types. Although this specific pattern seems rare, more than eight percent of all relations have coreferential grammatical subjects. We observe the same statistically significant differences between Temporal relations and the other three types of relations. More interestingly, the subject coreferential rate for Contingency relations is the lowest among the three

categories (p < 0.05, pair-wise t-test corrected for multiple comparisons). It is possible that coreferential subject patterns suggest temporal coherence between the two sentences without the use of an explicit discourse connective. Contingency relations, which can only indicate causal relationships when realized implicitly, impose a temporal ordering on the events in the arguments; i.e., if Arg1 is causally related to Arg2, then the event described in Arg1 must temporally precede the one in Arg2. Therefore, Contingency and Temporal can be highly confusable. To understand why this pattern might help distinguish these two types of relations, consider these examples:

(4) He also asserted that exact questions weren’t replicated. [Then] When referred to the questions that match, he said it was coincidental. (WSJ0045)

(5) He also asserted that exact questions weren’t replicated. When referred to the questions that match, she said it was coincidental.

When we switch out the coreferential subject for an arbitrary non-coreferential pronoun, as we do in (5), we are more inclined to classify the relation as Contingency.

3.6 Related work

Word pair features are known to work very well for predicting the senses of discourse relations in an artificially generated corpus (Marcu and Echihabi, 2002a). But when they are used with a realistic corpus, model parameter estimation suffers from the data sparsity problem due to the small dataset size. Biran and McKeown (2013) attempt to solve this problem by aggregating word pairs and estimating weights from an unannotated corpus, but only with limited success.

Recent efforts have focused on introducing meaning abstraction and semantic representations between the words in the sentence pair. Pitler et al. (2009) use external lexicons to replace the one-hot word representation with semantic information such as word polarity and various verb classifications based on specific theories (Stone et al., 1968; Levin, 1993). Park and Cardie (2012) select an optimal subset of these features and establish the strongest baseline to the best of our knowledge.

Brown word clusters are hierarchical clusters induced by the frequency of co-occurrence with other words (Brown et al., 1992). The strength of this word class induction method is that the words assigned to the same cluster usually form an interpretable lexical class by virtue of their distributional properties. This word representation has been used successfully to improve the performance of many NLP systems (Ritter et al., 2011; Turian et al., 2010).

In addition, Louis et al. (2010) use multiple aspects of coreference as features to classify implicit discourse relations without much success, while suggesting many aspects that are worth exploring. In a corpus study, Louis and Nenkova (2010) find that coreferential rates alone cannot explain all of the relations and that more complex coreference patterns have to be considered.

3.7 Conclusions

We present statistical classifiers for identifying the senses of implicit discourse relations and introduce novel feature sets that exploit distributional similarity and coreference information. Our classifiers outperform the classifiers from previous work on all types of implicit discourse relations. Altogether, these results present a stronger baseline for future research on implicit discourse relations.

In addition to enhancing the performance of the classifier, Brown word cluster pair features reveal new aspects of implicit discourse relations. The feature analysis confirms our hypothesis that cluster pair features work well because they encapsulate relevant word classes, which constitute more complex informative features such as named-entity pairs of the same categories, semantically related pairs, and pairs that indicate a specificity-generality shift. At the discourse level, Brown clustering is superior to a one-hot word representation for identifying inter-sentential patterns and the interactions between words.

Coreference chains that traverse the discourse shed light on different types of relations. Our preliminary analysis shows that Temporal relations have much higher inter-argument coreferential rates than the other three senses. Focusing only on subject-coreferential rates, we observe that Contingency relations show the lowest rate. The coreference patterns differ substantially and meaningfully across discourse relations and deserve further exploration.


Figure 3.1: The bipartite graphs show the top 40 non-stopword Brown cluster pair features for all four classification tasks. Each node on the left and on the right represents a word cluster from Arg1 and Arg2, respectively. We only show the clusters that appear fewer than six times in the top 3,000 pairs in order to exclude stopwords. Although the four tasks are interrelated, some of the highest mutual information features vary substantially across tasks.


Figure 3.2: The coreferential rate (all coreference and subject coreference) for Temporal relations is significantly higher than for the other three relations (p < 0.05, corrected for multiple comparisons).

Chapter 4

Distant Supervision from Discourse Connectives

This chapter is adapted from Rutherford and Xue (2015). In the view of the Penn Discourse Treebank (PDTB) (Prasad et al., 2008), the sense of a discourse relation is lexically grounded to an explicit or an omitted underlying discourse connective. Relations grounded to explicit (non-omitted) discourse connectives are called explicit discourse relations; those grounded to omitted connectives are called implicit discourse relations. Discourse connectives are a linguistic device that gives a clear signal of the sense of a discourse relation. This is the property that we would like to exploit in this chapter. For example,

(6) [The city's Campaign Finance Board has refused to pay Mr Dinkins $95,142 in matching funds]Arg1 because [his campaign records are incomplete]Arg2.

(7) [So much of the stuff poured into its Austin, Texas, offices that its mail rooms there simply stopped delivering it]Arg1. Implicit=so [Now, thousands of mailers, catalogs and sales pitches go straight into the trash]Arg2.

Determining the sense of an explicit discourse relation such as (6) is straightforward, since "because" is a strong indicator that the relation between the two arguments is Contingency.Cause. This task effectively amounts to disambiguating the sense of the discourse connective, which can be done with high accuracy (Pitler et al., 2008). However, in the absence of an explicit discourse connective, inferring the sense of a discourse relation has proved to be a very challenging task (Park and Cardie, 2012; Rutherford and Xue, 2014). The sense is no longer localized in one or two discourse connectives and must now be inferred solely from the two textual arguments. Given the limited amount of annotated data in comparison to the number of features needed, the process of building a classifier is plagued by the data sparsity problem (Li and Nenkova, 2014). As a result, the classification accuracy for implicit discourse relations remains much lower than that for explicit discourse relations (Pitler et al., 2008). Sophisticated feature aggregation and selection methods seem promising but have been shown to improve the performance only marginally (Biran and McKeown, 2013; Park and Cardie, 2012). Given that the arguments are typically large linguistic units such as clauses and sentences, we cannot realistically expect repetitions of these linguistic units in a corpus the size of the PDTB, as we can for individual words or short n-grams. The standard approach in existing work on inferring implicit discourse relations is to decompose the two arguments into word pairs and use them as features. Since the two arguments of the same type of discourse relation can vary tremendously, a large number of features is needed for a classifier to work well (Park and Cardie, 2012; Rutherford and Xue, 2014).

One potential method for reducing the data sparsity problem is a distantly supervised learning paradigm, which is the direction we take in this work. Distant supervision


approaches make use of prior knowledge or heuristics to cheaply obtain weakly labeled data, which potentially contain a small number of false labels. Weakly labeled data can be collected from unannotated data and incorporated in the model training process to supplement manually labeled data. This approach has recently seen some success in natural language processing tasks such as relation extraction (Mintz et al., 2009), event extraction (Reschke et al., 2014), and text classification (Thamrongrattanarit et al., 2013). For our task, we can collect instances of explicit discourse relations from unannotated data by some simple heuristics. After dropping the discourse connectives, we should be able to treat them as additional implicit discourse relations.

(8) [I want to go home for the holiday]Arg1. Nonetheless, [I will book a flight to Hawaii]Arg2.

If "Nonetheless" is dropped in (8), one can no longer infer the Comparison relation. Instead, one would naturally infer a Contingency relation. Dropping the connective and adding the relation as a training sample would add noise to the training set and could only hurt the performance. In addition, certain types of explicit discourse relations have no corresponding implicit discourse relations. For example, discourse relations of the type Contingency.Condition are almost always expressed with an explicit discourse connective and do not exist as implicit relations. We believe this also explains the lack of success of previous attempts to boost the performance of implicit discourse relation detection with this approach (Biran and McKeown, 2013; Pitler et al., 2009). This suggests that in order for this approach to work, we need to identify instances of explicit discourse relations that closely match the

characteristics of implicit discourse relations.

In this chapter, we propose two criteria for selecting such explicit discourse relation instances: omission rate and context differential. Our selection criteria first classify discourse connectives by their distributional properties and suggest that not all discourse connectives are truly optional and that not all implicit and explicit discourse relations are equivalent, contrary to commonly held beliefs in previous studies of discourse connectives. We show that only the freely omissible discourse connectives gather additional training instances that lead to a significant performance gain against a strong baseline. Our approach improves the performance of implicit discourse relation classification without additional feature engineering in many settings and opens the door to more sophisticated models that require more training data.

The rest of the chapter is structured as follows. In Section 4.1, we describe the discourse connective selection criteria. In Section 4.2, we present our discourse connective classification method and experimental results that demonstrate its impact on inferring implicit discourse relations. We discuss related work and conclude our findings in Sections 4.3 and 4.4, respectively.

4.1 Discourse Connective Classification and Discourse Relation Extraction

4.1.1 Datasets used for selection

We use two datasets for the purpose of extracting and selecting weakly labeled explicit discourse relation instances: the Penn Discourse Treebank 2.0 (Prasad et al., 2008) and the English Gigaword corpus version 3 (Graff et al., 2007).

The Penn Discourse Treebank (PDTB) is the largest manually annotated corpus of discourse relations, built on top of one million word tokens from the Wall Street Journal (Prasad et al., 2008; Prasad et al., 2007). Each discourse relation in the PDTB is annotated with a semantic sense in the PDTB sense hierarchy, which has three levels: Class, Type, and Subtype. In this work, we are primarily concerned with the four top-level Class senses: Expansion, Comparison, Contingency, and Temporal. The distribution of top-level senses of implicit discourse relations is shown in Table 4.2. The spans of text that participate in a discourse relation are also explicitly annotated; these are called Arg1 and Arg2, depending on their relationship to the discourse connective. The PDTB is our corpus of choice for its lexical groundedness: the existence of a discourse relation must be linked or grounded to a discourse connective. More importantly, this applies not only to explicit discourse connectives that occur naturally as part of the text but also to implicit discourse relations, where a discourse connective is added by annotators during the annotation process. This is crucial to the work reported here in that it allows us to compare the distribution of the same connective in explicit and implicit discourse relations. In the next subsection, we explain in detail how we compute the comparison measures and apply them to the selection of explicit discourse connectives that can be used for collecting good weakly labeled data.

We use the Gigaword corpus, a large unannotated newswire corpus, to extract and select instances of explicit discourse relations to supplement the manually annotated instances from the PDTB. The Gigaword corpus is used for its large size of 2.9 billion words and its similarity to the Wall Street Journal data in the PDTB. The corpus is drawn from six distinct international sources of English newswire dating from 1994 to 2006. We use this corpus to extract weakly labeled data for the experiment.


4.1.2 Discourse relation extraction pattern

We extract instances of explicit discourse relations from the Gigaword corpus that have the same patterns as the implicit discourse relations in the PDTB, using simple regular expressions. We first sentence-segment the Gigaword corpus using the NLTK sentence segmenter (Bird, 2006). We then write a set of rules to prevent common erroneous cases, such as confusing because with because of, from being included. If a discourse connective is a subordinating conjunction, then we use the following pattern:

(Clause 1) (connective) (clause 2).

Clause 1 and clause 2 (with its first word capitalized) are then used as Arg1 and Arg2, respectively.

If a discourse connective is a coordinating conjunction or discourse adverbial, we use the following pattern:

(Sentence 1). (Connective), (clause 2).

Sentence 1 and clause 2 (with its first word capitalized) are used as Arg1 and Arg2, respectively.

Although there are obviously many other syntactic patterns associated with explicit discourse connectives, we use these two patterns because these are the only patterns that are also observed in the implicit discourse relations. We want to select instances of explicit discourse relations that match the argument patterns of implicit discourse relations as much as possible. As restrictive as this may seem, these two patterns along with the set of rules allow us to extract more than 200,000 relation instances from the Gigaword corpus, so the coverage is not an issue.
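The exact rule set is not reproduced here, but the following sketch illustrates the two patterns; the connective lists and regular expressions are illustrative assumptions rather than the rules actually used.

import re

SUBORDINATING = ['because', 'although', 'while', 'since']                   # assumed example list
ADVERBIAL     = ['However', 'Nonetheless', 'In particular', 'Therefore']    # assumed example list

def extract_intra(sentence):
    """(Clause 1) (connective) (clause 2).  ->  (Arg1, connective, Arg2)"""
    for conn in SUBORDINATING:
        m = re.match(r'^(.+?),?\s+%s\s+(.+)\.$' % re.escape(conn), sentence)
        if m:
            return m.group(1), conn, m.group(2)
    return None

def extract_inter(sent1, sent2):
    """(Sentence 1). (Connective), (clause 2).  ->  (Arg1, connective, Arg2)"""
    for conn in ADVERBIAL:
        m = re.match(r'^%s\s*,\s*(.+)\.$' % re.escape(conn), sent2)
        if m:
            return sent1, conn, m.group(1)
    return None

print(extract_intra('He stayed home because he was sick.'))
# ('He stayed home', 'because', 'he was sick')
print(extract_inter('Sales fell sharply.', 'However, the company remained profitable.'))
# ('Sales fell sharply.', 'However', 'the company remained profitable')

Dropping the returned connective from such an extracted pair is what turns it into a weakly labeled "implicit" training instance.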

4.1.3 Discourse connective selection and classification criteria

We hypothesize that connectives that are omitted often, and in a way that is insensitive to the semantic context, are ideal candidates for extracting good weakly labeled data. We call this type of connective freely omissible discourse connectives. To search for this class of connectives, we need to characterize connectives by the rate at which they are omitted and by the similarity between their contexts (in this case, their arguments) in explicit and implicit discourse relations. This is possible because implicit discourse connectives are inserted during annotation in the PDTB. For each discourse connective, we can compute the omission rate and the context differential from the annotated explicit and implicit discourse relation instances in the PDTB and use those measures to classify and select discourse connectives.

Omission rate (OR)

We use omission rates (OR) to measure the level of optionality of a discourse connective. The omission rate of a type of discourse connective (DC) is defined as:

OR = (# occurrences of DC in implicit relations) / (# total occurrences of DC)

Our intuition is that discourse connectives with a high omission rate are more suitable sources of supplemental training data for inferring the sense of implicit discourse relations.
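As a purely hypothetical illustration of the definition: a connective annotated 300 times in implicit relations and appearing 700 times in explicit relations would have an omission rate of 300 / (300 + 700) = 0.3.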

Context differential

The omission of a freely omissible discourse connective should also be context-independent. If the omission of a discourse connective leads to a different interpretation of the discourse relation, this means that the explicit and implicit discourse relations bound by this discourse connective are not equivalent, and the explicit discourse relation instance cannot be used to help infer the sense of the implicit discourse relation. Conversely, if the contexts for the discourse connective in explicit and implicit discourse relations do not significantly differ, then the explicit discourse relation instance can be used as weakly labeled data.

To capture this intuition, we must quantify the context differential of explicit and implicit discourse relations for each discourse connective. We represent the semantic context of a discourse connective through a unigram distribution over the words in its two arguments, with Arg1 and Arg2 combined. We use Jensen-Shannon divergence (JSD) as the metric for measuring the difference between the contexts of a discourse connective in implicit and explicit discourse relations. Computing the context differential of a discourse connective therefore involves fitting one unigram distribution from all implicit discourse relations bound by that discourse connective and another from all explicit discourse relations bound by the same discourse connective. We choose this method because it has been shown to be exceptionally effective in capturing similarities between discourse connectives (Hutchinson, 2005) and in statistical language analysis in general (Lee, 2001; Ljubesic et al., 2008).

The Jensen-Shannon divergence between P_o, the semantic environment (the unigram distribution over the words in Arg1 and Arg2 combined) in implicit discourse relations, and P_r, the semantic environment in explicit discourse relations, is defined as:

JSD(P_o || P_r) = (1/2) D(P_o || M) + (1/2) D(P_r || M),

where M = (1/2)(P_o + P_r) is a mixture of the two distributions and D(·||·) is the Kullback-Leibler divergence for discrete probability distributions:

D(P || Q) = Σ_i P(i) ln( P(i) / Q(i) )

4.1.4 Discourse Connective Classification

Using the two metrics, we can classify discourse connectives into the following classes:

1. Freely omissible: High OR and low JSD.

2. Omissible: Low non-zero OR and low JSD.

3. Alternating I: High OR and high JSD.

4. Alternating II: Low non-zero OR and high JSD.

5. Non-omissible: Zero OR. JSD cannot be computed because the connectives are never found in any implicit discourse relations.

Classifying the connectives into these classes allows us to empirically investigate which explicit discourse relations are useful as supplemental training data for determining the sense of implicit discourse relations. We discuss each type of connective below.
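A compact sketch of how the two measures could be computed and then used to bucket a connective into the five classes. The OR and JSD thresholds shown are placeholders for illustration, not the cut-offs used in this thesis, and the toy argument tokens are hypothetical.

import math
from collections import Counter

def omission_rate(implicit_count, explicit_count):
    """OR = occurrences in implicit relations / total occurrences."""
    return implicit_count / float(implicit_count + explicit_count)

def unigram_dist(tokens):
    counts = Counter(tokens)
    total = float(sum(counts.values()))
    return {w: c / total for w, c in counts.items()}

def kl(p, q):
    """Kullback-Leibler divergence D(P||Q), summed over the support of P."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)

def jsd(p, q):
    """Jensen-Shannon divergence between two unigram distributions."""
    vocab = set(p) | set(q)
    m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in vocab}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def classify_connective(or_value, jsd_value, or_cut=0.3, jsd_cut=0.5):
    # Placeholder thresholds; high vs. low is decided empirically in practice.
    if or_value == 0.0:
        return 'non-omissible'
    high_or, high_jsd = or_value >= or_cut, jsd_value >= jsd_cut
    if high_or and not high_jsd:
        return 'freely omissible'
    if not high_or and not high_jsd:
        return 'omissible'
    return 'alternating I' if high_or else 'alternating II'

# Toy usage with hypothetical counts and argument tokens for one connective.
p_implicit = unigram_dist('the company said profit rose because demand grew'.split())
p_explicit = unigram_dist('the company said profit fell because costs grew'.split())
print(classify_connective(omission_rate(120, 280), jsd(p_implicit, p_explicit)))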

Freely omissible discourse connectives

These are connectives whose usage in implicit and explicit discourse relations is indistinguishable and that are therefore suitable as a source of supplemental training data. These connectives are defined as having a high omission rate and a low context differential. This definition implies that the omission is frequent and insensitive to the context. “Because” and “in particular” in (9) and (10) are such connectives. Dropping them has minimal impact on the understanding of the discourse relation between their two arguments, and one might even argue that the sentences sound more natural without them.

(9) We cleared up questions and inconsistencies very quickly because the people who had the

skills and perspective required to resolve them were part of the task team. (WSJ0562)

(10) Both companies are conservative marketers that rely on extensive market research. P&G,

in particular, rarely rolls out a product nationally before extensive test-marketing. (WSJ0589)

Omissible discourse connectives

They are connectives whose usage in implicit and explicit discourse relations is indistinguishable, yet they are not often omitted because the discourse relation might be hard to interpret without them. These connectives are defined as having low omission rate and low context differential. For example,

(11) Such problems will require considerable skill to resolve. However, neither Mr. Baum nor Mr.

Harper has much international experience. (WSJ0109)


One can infer from the discourse that the problems require international experience but that Mr. Baum and Mr. Harper don’t have that experience, even without the discourse connective “however”. In other words, the truth value of this inference is not affected by the presence or absence of the discourse connective. The sentence might sound a bit less natural, and the discourse relation a bit more difficult to infer, if “however” is omitted.

Alternating discourse connectives

They are connectives whose usage in implicit and explicit discourse relations is substantially different; they are defined as having a high context differential. A high context differential means that the two arguments of the explicit discourse connective differ substantially from those of the implicit one. An example of such a discourse connective is “nevertheless” in (12). If the discourse connective is dropped, one might infer an Expansion or Contingency relation instead of the Comparison relation indicated by the connective.

(12) Plant Genetic’s success in creating genetically engineered male steriles doesn’t automatically

mean it would be simple to create hybrids in all crops. Nevertheless, he said, he is negotiating

with Plant Genetic to acquire the technology to try breeding hybrid cotton. (WSJ0209)

We hypothesize that this type of explicit discourse relation would not be useful as extra training instances for inferring implicit discourse relations because it would only add noise to the training set.

Non-omissible discourse connectives

They are defined as discourse connectives whose omission rate is close to zero, as they are never found in implicit discourse relations. For example, conditionals cannot be easily expressed without the use of an explicit discourse connective like “if”. We hypothesize that instances of explicit discourse relations with such discourse connectives would not be useful as additional training data for inferring implicit discourse relations because they represent discourse relation senses that do not exist in the implicit discourse relations.

4.2 Experiments

4.2.1 Partitioning the discourse connectives

We only include the discourse connectives that appear in both explicit and implicit discourse relations in the PDTB, to make the comparison and classification possible. As a result, we only analyze 69 out of 134 connectives for the purpose of classification. We also leave out 15 connectives whose most frequent sense accounts for less than 90% of their instances. For example, since can indicate a Temporal sense or a Contingency sense with almost equal probability, so it is not readily useful for gathering weakly labeled data. Ultimately, we have 54 connectives as our candidates for freely omissible discourse connectives.

We first classify the discourse connectives based on their omission rates and context differentials as discussed in the previous section and partition all of the explicit discourse connective instances based on this classification. The distributions of omission rates and context differentials show a substantial amount of variation among different connectives. Many connectives are rarely omitted and naturally form their own class of non-omissible discourse connectives (Figure 4.1). We run the agglomerative hierarchical clustering algorithm using Euclidean distance on the rest of the connectives to divide them into two groups: high and low omission rates. The boundary between the two groups is around 0.65.

The distribution of discourse connectives with respect to the context differential suggests two distinct groups across the two corpora (Figure 4.2). The analysis includes only connectives that are omitted at least twenty times in the PDTB corpus, so that the JSD can be computed. The hierarchical clustering algorithm divides the connectives into two groups with the boundary at around 0.32, as is apparent from the histogram. The JSDs computed from the explicit discourse relations from the two corpora are highly correlated (ρ = 0.80, p < 0.05), so we can safely use the Gigaword corpus for the analysis and evaluation.

Figure 4.1: Omission rates of the discourse connective types vary drastically, suggesting that connectives vary in their optionality. Some connectives are never omitted.

The omission rate boundary and the context differential boundary together classify the discourse connectives into four classes in addition to the non-omissible connectives. When plotted against each other, omission rate and context differential group the discourse connectives nicely into clusters (Figure 4.3). For the purpose of evaluation, we combine Alternating I and II into one class because each individual class is too sparse on its own. The complete discourse connective classification result is displayed in Table 4.1.
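The resulting partition can be expressed as a simple decision rule. The sketch below uses the empirically found boundaries from this section (an omission-rate boundary of roughly 0.65 and a JSD boundary of roughly 0.32) as default arguments; the function and argument names are illustrative rather than taken from the released code.

```julia
# Classify a discourse connective from its omission rate (OR) and context differential (JSD).
# The default boundaries correspond to the cluster boundaries found in Section 4.2.1.
function classify_connective(or::Float64, jsd::Union{Float64,Missing};
                             or_boundary::Float64=0.65, jsd_boundary::Float64=0.32)
    or == 0.0 && return "Non-omissible"            # never found in implicit relations
    high_or  = or >= or_boundary
    high_jsd = !ismissing(jsd) && jsd >= jsd_boundary
    if high_or
        return high_jsd ? "Alternating I" : "Freely omissible"
    else
        return high_jsd ? "Alternating II" : "Omissible"
    end
end

# classify_connective(0.9, 0.2)      # -> "Freely omissible"
# classify_connective(0.0, missing)  # -> "Non-omissible"
```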

4.2.2 Evaluation results

We formulate the implicit relation classification task as a 4-way classification task, departing from previous practice, in which the task is usually set up as four one-vs-other binary classification tasks, so that the effect of adding distant supervision from the weakly labeled data can be studied more easily. We also believe this setup is more natural in realistic settings. Each classification instance consists of the two arguments of an implicit discourse relation, typically adjacent pairs of sentences in a text. The distribution of the sense labels is shown in Table 4.2.


Class Name        OR    JSD   Connectives
Alternating I     High  High  further, in sum, in the end, overall, similarly, whereas
Alternating II    Low   High  earlier, in turn, nevertheless, on the other hand, ultimately
Freely Omissible  High  Low   accordingly, as a result, because, by comparison, by contrast, consequently, for example, for instance, furthermore, in fact, in other words, in particular, in short, indeed, previously, rather, so, specifically, therefore
Omissible         Low   Low   also, although, and, as, but, however, in addition, instead, meanwhile, moreover, rather, since, then, thus, while
Non-omissible     Zero  NA    as long as, if, nor, now that, once, otherwise, unless, until

Table 4.1: Classification of discourse connectives based on omission rate (OR) and Jensen-Shannon Divergence context differential (JSD).

Sense         Train   Dev   Test
Comparison     1855   189    145
Contingency    3235   281    273
Expansion      6673   638    538
Temporal        582    48     55
Total         12345  1156   1011

Table 4.2: The distribution of senses of implicit discourse relations in the PDTB


Figure 4.2: The distributions of Jensen-Shannon Divergence from both corpora show two potentially distinct clusters of discourse connectives.

We follow the data split used in previous work for a consistent comparison (Rutherford and Xue, 2014). The PDTB corpus is split into a training set, a development set, and a test set. Sections 2 to 20 are used to train the classifiers. Sections 0 and 1 are used for developing feature sets and tuning models. Sections 21 and 22 are used for testing the systems.

To evaluate our method for selecting explicit discourse relation instances, we extract weakly labeled discourse relations from the Gigaword corpus for each class of discourse connective such that the discourse connectives are equally represented within the class. We train and test Maximum Entropy classifiers by adding a varying number (1,000, 2,000, ..., 20,000) of randomly selected explicit discourse relation instances to the manually annotated implicit discourse relations in the PDTB as training data. We do this for each class of discourse connectives as presented in Table 4.1. We perform 30 trials of this experiment and compute average accuracy rates to smooth out the variation from random shuffling of the weakly labeled data.

The statistical models used in this study are from the MALLET implementation with its default settings (McCallum, 2002). The features used in all experiments are taken from the state-of-the-art implicit discourse relation classification system (Rutherford and Xue, 2014). The feature set consists of combinations of various lexical features, production rules, and Brown cluster pairs. These features are described in greater detail by Pitler et al. (2009) and Rutherford and Xue (2014).

Figure 4.3: The scattergram of the discourse connectives suggests three distinct classes. Each dot represents a discourse connective.

Instance reweighting is required when using weakly labeled data because the training set no longer represents the natural distribution of the labels. We reweight each instance such that the sums of the weights of all the instances of the same label are equal. More precisely, if an instance i is from class j, then the weight for the instance w_{ij} is equal to the inverse proportion of class j:

w_{ij} = \frac{\text{number of total instances}}{\text{size of class } j \cdot \text{number of classes}} = \frac{\sum_{j'} c_{j'}}{c_j \cdot k} = \frac{n}{c_j \cdot k}

where c_j is the total number of instances from class j and k is the number of classes in the dataset of size n. It is trivial to show that the sum of the weights for all instances from class j is exactly n/k for all classes.
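A sketch of the reweighting step, assuming the training labels are available as a simple vector of class names (a stand-in for however the instances are actually stored):

```julia
# Reweight instances so that the total weight of each class equals n / k,
# i.e. w_ij = n / (c_j * k) for an instance of class j.
function instance_weights(labels::Vector{String})
    n = length(labels)
    counts = Dict{String,Int}()
    for l in labels
        counts[l] = get(counts, l, 0) + 1
    end
    k = length(counts)                       # number of distinct classes
    return [n / (counts[l] * k) for l in labels]
end

# instance_weights(["Expansion", "Expansion", "Temporal"]) gives [0.75, 0.75, 1.5];
# each class then sums to n / k = 1.5, as required.
```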

Figure 4.4: Discourse connectives with high omission rates and low context differentials lead to the highest performance boost over the state-of-the-art baseline (dotted line). Each point is an average over multiple trials. The solid lines are LOESS smoothing curves.

The impact of different classes of weakly labeled explicit discourse relations is illustrated in Figure 4.4. The results show that explicit discourse relations with freely omissible discourse connectives (high OR and low JSD) improve the performance on the standard test set and outperform the other classes of discourse connectives as well as the naive approach in which all of the discourse connectives are used. In addition, on average, the system trained with weakly labeled data from freely omissible discourse connectives continues to improve as we increase the number of samples, unlike the other classes of discourse connectives, which show the opposite trend. This suggests that a discourse connective must have both a high omission rate and a low context differential between its implicit and explicit uses in order to be helpful for inferring implicit discourse relations.

Table 4.3 presents results showing that, overall, our best performing system, the one using distant supervision from freely omissible discourse connectives, raises the accuracy rate from 0.550 to 0.571 (p < 0.05; bootstrap test) and the macro-average F1 score from 0.384 to 0.405. We achieve this performance after tuning the subset of weakly labeled data to maximize performance on the development set. Our distant supervision approach improves performance by adding more weakly labeled data and no additional features.

                      Baseline   Baseline features + extra data
Expansion
  Precision           0.608      0.614
  Recall              0.751      0.788
  F1                  0.672      0.691
Comparison
  Precision           0.398      0.449
  Recall              0.228      0.276
  F1                  0.290      0.342
Contingency
  Precision           0.465      0.493
  Recall              0.418      0.396
  F1                  0.440      0.439
Temporal
  Precision           0.263      0.385
  Recall              0.091      0.091
  F1                  0.135      0.147
Accuracy              0.550      0.571
Macro-Average F1      0.384      0.405

Table 4.3: Our current 4-way classification system outperforms the baseline overall. The difference in accuracy is statistically significant (p < 0.05; bootstrap test).

For a more direct comparison with previous results, we also replicated the state-of-the-art system described in Rutherford and Xue (2014), which follows the practice of the first work on this topic (Pitler et al., 2009) in setting up the task as four binary one-vs-other classifiers. The results are presented in Table 4.4. They show that the extra data extracted from the Gigaword Corpus are particularly helpful for minority classes such as Comparison vs. Others and Temporal vs. Others, where our current system significantly outperforms that of Rutherford and Xue (2014). Interestingly, the Expansion vs. Others classifier did not improve in the way the Expansion class in the four-way classification did (Table 4.3).


                        R&X (2014)   Baseline   Baseline + extra data
Comparison vs Others    0.397        0.410      0.380
Contingency vs Others   0.544        0.538      0.539
Expansion vs Others     0.702        0.694      0.679
Temporal vs Others      0.287        0.333      0.246

Table 4.4: The performance of our approach on the binary classification task formulation.

Class               Gigaword only   Gigaword + Implicit PDTB
Freely omissible    0.505           0.571
Omissible           0.313           0.527
Alternating I + II  0.399           0.546
Non-omissible       0.449           0.554
All of above        0.490           0.547

Table 4.5: The accuracy rates for the freely omissible class are higher than those for the other classes, both when using the Gigaword data alone and when using it in conjunction with the implicit relations in the PDTB.

4.2.3 Just how good is the weakly labeled data?

We performed additional experiments to get a sense of just how good the weakly labeled data extracted from an unlabeled corpus are. Table 4.5 presents four-way classification results using just the weakly labeled data from the Gigaword Corpus. The results show that the same trend holds when the implicit relations from the PDTB are not included in the training process. The freely omissible discourse connectives achieve an accuracy rate of 0.505, which is significantly higher than the other classes, but they are weaker than the manually labeled data, which achieve an accuracy rate of 0.550 for the same number of training instances.

Weakly labeled data are not perfectly equivalent to the true implicit discourse relations, but they do provide a strong enough additional signal. Figure 4.5 presents experimental results that compare the impact of weakly labeled data from the Gigaword Corpus with that of gold standard data from the PDTB for the freely omissible class.

The mean accuracy rates from the PDTB data are significantly higher than those from the Gigaword Corpus (p < 0.05; t-test and bootstrap test) for the same number of training instances combined with the implicit discourse relations. However, when the number of introduced weakly labeled instances exceeds a threshold of around 12,000, the performance with the Gigaword corpus rises significantly above the baseline and the explicit PDTB (Figure 4.4).

Figure 4.5: The PDTB corpus leads to more improvement for the same amount of data. However, the Gigaword corpus achieves significantly better performance (p < 0.05; bootstrap test) when both models are tuned on the development set.

The relative superiority of our approach derives precisely from the two selection criteria that we propose. The performance gain does not come from the fact that freely omissible discourse connectives have better coverage of all four senses (Table 4.6). When all classes are combined equally, the system performs worse as we add more samples although all four senses are covered.

The coverage of all four senses is not sufficient for a class of discourse connectives to boost the performance. The two selection criteria are both necessary for the success of this paradigm.

4.3 Related work

Previous work on implicit discourse relation classification has focused on supervised learning approaches (Lin et al., 2010a; Rutherford and Xue, 2014), and the distantly supervised approach using explicit discourse relations has not shown satisfactory results (Pitler et al., 2009; Park and Cardie, 2012; Wang et al., 2012; Sporleder and Lascarides, 2008).


Explicit discourse relations have been used to remedy the sparsity problem or to gain extra features, with limited success (Biran and McKeown, 2013; Pitler et al., 2009). Our heuristic for extracting discourse relations has been explored in an unsupervised setting (Marcu and Echihabi, 2002b), but it has never been evaluated on gold standard data to show its true efficacy. Our distant supervision approach chooses only certain types of discourse connectives to extract weakly labeled data and is the first of its kind to improve performance on this task when tested on manually annotated data.

Class              Comp.  Cont.  Exp.  Temp.
Freely omissible    2      6     10     1
Omissible           4      2      5     3
Alternating I       1      0      5     0
Alternating II      2      0      0     3
Non-omissible       0      3      3     2

Table 4.6: The sense distribution by connective class.

Distant supervision approaches have recently been explored in the context of natural language processing due to the recent capability to process large amounts of data. These approaches are known to be particularly useful for relation extraction tasks because the available training data do not suffice for the task and are difficult to obtain (Riloff et al., 1999; Yao et al., 2010). For example, Mintz et al. (2009) acquire a large amount of weakly labeled data based on the Freebase knowledge base and improve the performance of relation extraction. Distantly supervised learning has also recently been demonstrated to be useful for text classification problems (Speriosu et al., 2011; Marchetti-Bowick and Chambers, 2012). For example, Thamrongrattanarit et al. (2013) use simple heuristics to gather weakly labeled data to perform text classification with no manually annotated training data.

Discourse connectives have been studied and classified based on their syntactic properties, such as subordinating conjunctions, adverbials, etc. (Fraser, 2006; Fraser, 1996). While providing useful insight into how discourse connectives fit into utterances, this syntactic classification does not seem suitable for selecting useful discourse connectives for our purpose of distant supervision.

4.4 Conclusion and Future Directions

We propose two selection criteria for discourse connectives that can be used to gather weakly labeled data for implicit discourse relation classifiers and improve the performance of the state-of-the-art system without further feature engineering. As part of this goal, we classify discourse connectives based on their distributional semantic properties and find that certain classes of discourse connectives cannot be omitted in every context, a property that plagues the weakly labeled data used in previous studies. Our discourse connective classification allows for the better selection of data points for distant supervision.

More importantly, this work presents a new direction in the distantly supervised learning paradigm for implicit discourse relation classification. The dramatic increase in the effective training set size allows for more feature engineering and more sophisticated models. Implicit discourse relation classification is no longer limited to strictly supervised learning approaches.

Chapter 5

Neural Discourse Mixture Model

Sentences in the same body of text must be coherent. To compose a paragraph that “makes sense,” the sentences must form a discourse relation, which manifests the coherence between the sentences.

Adjacent sentences can form a discourse relation even in the absence of discourse connectives such as because, which signals a causal relation, or afterwards, which signals a temporal relation.

These implicit discourse relations abound in naturally occurring text, and people can infer them without much effort. Although crucial for summarization systems (Louis et al., 2010) and text quality assessment (Pitler and Nenkova, 2008), among many other applications, the task of classifying implicit discourse relations remains a challenge for automatic discourse analysis.

The main difficulty of modeling and classifying discourse relations lies in the problem of representation, which feature engineering might not be able to fully address. Traditional approaches to discourse relation classification involve loading a Naive Bayes classifier with a battery of features extracted from syntactic parses and various lexicons (Pitler et al., 2009; Park and Cardie, 2012). The best performing system in this paradigm replaces all the words with their Brown cluster assignment (Brown et al., 1992) and uses their cartesian product as features (Rutherford and Xue, 2014). This method is effective because it reduces the sparsity in the feature space while introducing latent semantic features. Sparsity reduction is one of the main motivations for our work and previous work in this domain (Biran and McKeown, 2013; Li and Nenkova, 2014).

Distributed representation has been successfully employed to enhance many natural language systems due to new modeling techniques and larger computational capacity. This representation treats a word or a feature as a vector of real-valued numbers that are derived from massive amounts of unannotated data. No longer treated as a discrete atomic symbol, a word or a feature has more expressive power, which improves the performance of many applications. Notably, word-level distributed representation has been shown to improve fact extraction (Paşca et al., 2006), query expansion (Jones et al., 2006), and automatic annotation of text (Ratinov et al., 2011).

We would like to investigate the effectiveness of distributed representation in modeling discourse relations. In this chapter, we propose a distributed discourse model composed from individual word vectors and aim to use the model to classify implicit discourse relations.

We use a neural network model architecture to combine the computational efficiency of Naive Bayes models and the expressiveness of word vectors. Naive Bayes is the most efficient and consistently best-performing model thus far for discourse relation classification, despite the use of millions of features (Pitler et al., 2009; Park and Cardie, 2012; Rutherford and Xue, 2014). Multilayer neural networks, on the other hand, cannot handle a large number of features without enormous computational resources, but their structure allows them to take advantage of word vectors and intermediate representations of a discourse vector, which might benefit classification.

Our contributions from this work are three-fold.

1. We propose the Neural Discourse Mixture Model as the new state-of-the-art implicit discourse relation classifier and the first neural discourse model for written text. The model improves the performance of the implicit discourse relation classifier by a significant margin compared to the Rutherford and Xue (2014) baseline.

2. Through a series of experiments, we establish that neural word embeddings and their intermediate abstractions provide a better representation of discourse relations than Brown cluster pairs alone. This suggests that neural network modeling is a promising direction for discourse analysis and classification.

3. We propose and use the Skip-gram Discourse Vector Model to model each discourse relation as a discourse vector composed from individual Skip-gram word vectors. Cluster analysis of the model reveals many linguistic phenomena implicated in discourse relations.

5.1 Corpus and Task Formulation

The Penn Discourse Treebank (PDTB) is a layer of annotation on top of the Wall Street Journal articles in the Penn Treebank (Prasad et al., 2008; Marcus et al., 1993). A discourse relation is defined as a local coherence relation between two arguments, each of which can be a sentence, a constituent, or an arbitrary non-contiguous span of text. Implicit discourse relations are relations whose two arguments (called Arg1 and Arg2) are coherent without the use of discourse connectives such as although, and, and when. The corpus contains roughly the same number of implicit and explicit discourse relations. The senses of the discourse relations are also annotated; the senses describe more specifically the way in which Arg1 and Arg2 are related. The sense inventory is organized hierarchically. The top-level senses are Comparison, Contingency, Expansion, and Temporal. Our work focuses on the top-level senses because they are the four discourse relations that various discourse analytic theories agree on (Mann and Thompson, 1988) and they contain relatively large numbers of tokens for training and evaluation. Consistent with previous work, the task is formulated as four separate one-vs-all classification problems. We use the same model architecture for all four binary classification tasks.


Figure 5.1: Schematic depiction of the architecture of the Neural Discourse Mixture Model. The Discourse Vector Network (600 input units plus bias) and the Naive Bayes Network (4 input units plus bias), each with its own hidden layers and intermediate output layer, are combined through a gating network into the final output layer.

5.2 The Neural Discourse Mixture Model Architecture

The Neural Discourse Mixture Model (NDMM) makes use of the strength of semantic abstraction achieved by distributed word vector representation and combines it with the binary features traditionally used in Naive Bayes implicit discourse relation classification. The main motivation of the NDMM derives from exploiting multiple levels of representation: discrete-valued Brown word cluster pairs and continuous-valued Skip-gram word embeddings. Brown word cluster pairs are shown to capture many high-level discursive phenomena but suffer from the sparsity problem (Rutherford and Xue, 2014). On the other hand, distributed word vectors are not sparse and lend themselves to semantic abstraction through neural network models. When used together, they should provide us with extra information that captures the characteristics of each type of discourse relation.

The architecture of the NDMM takes advantage of multi-level representation by integrating a component from the Mixture of Experts model (Jacobs et al., 1991). The model consists of two standard multilayer neural networks and a gating network that mixes the outputs of the two networks (Figure 5.1). The number of hidden layers has been optimized and tuned on the development set to attain the best performance. These three main components of the NDMM are detailed in the following subsections.

Figure 5.2: In the Skip-gram architecture, word vectors (the projection layer) are induced such that the learned weights can be used to predict the co-occurring words (the output layer).

5.2.1 Discourse Vector Network

The idea behind the discourse vector network is that we want to build an abstract representation of a discourse relation in a bottom-up manner, from word vectors to discourse argument vectors to a discourse vector. A discourse argument (Arg1 or Arg2) vector is formed by adding up Skip-gram word vectors. A discourse vector – the input to the network – is the concatenation of the Arg1 and Arg2 vectors. An abstract representation vector is formed in the hidden layers through non-linear transformations of linear combinations of the elements in the discourse vector.

The Skip-gram model is an efficient method for learning high-quality distributed word vector representation (Mikolov et al., 2013a; Mikolov et al., 2013b). A Skip-gram word vector is induced such that the values in the vector encode co-occurring words within a certain window (Figure 5.2). One of the attractive properties of Skip-gram word vectors is additive compositionality. For example, the vector composed by adding up the vectors for “Vietnam” and “capital” is closest to the vector for “Hanoi” among the other vectors in the vocabulary. We exploit this property to construct the input layer.

The input layer for this model is the concatenation of two vectors, which represent Arg1 and Arg2 of the discourse relation in question. Each word in the vocabulary is associated with a word vector learned by a Skip-gram model on a separate large unannotated corpus. Each argument vector is composed by adding all of the word vectors in the argument element-wise. These two argument vectors A_1 and A_2 are then concatenated to form a discourse vector. More precisely, if Arg i has n words w_{i1}, w_{i2}, ..., w_{in} and V_x \in \mathbb{R}^d is the word vector for word x learned by a Skip-gram model, then

A_i = \sum_{t=1}^{n} V_{w_{it}}.

A Skip-gram discourse vector X_s \in \mathbb{R}^{2d} is the concatenation of A_1 and A_2:

X_s = \begin{bmatrix} A_1 \\ A_2 \end{bmatrix}
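The composition step can be sketched in a few lines of Julia; the embedding lookup table and the silent handling of out-of-vocabulary words (which are ignored, as described in Section 5.3) are assumptions of this sketch rather than the exact implementation:

```julia
# Compose a Skip-gram discourse vector: sum the word vectors within each argument
# and concatenate the two argument vectors (Arg1 on top of Arg2).
function discourse_vector(arg1::Vector{String}, arg2::Vector{String},
                          embeddings::Dict{String,Vector{Float64}}, d::Int)
    compose(words) = begin
        v = zeros(d)
        for w in words
            haskey(embeddings, w) && (v .+= embeddings[w])   # out-of-vocabulary words are ignored
        end
        v
    end
    return vcat(compose(arg1), compose(arg2))                # 2d-dimensional X_s
end
```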

The network has two hidden layers. The first hidden layer is a standard non-linearly activated hidden layer connected to the input layer, which is filled with a discourse vector:

H_1 = \sigma(X_s \cdot W_1^s)

where H_1 \in \mathbb{R}^k is a hidden layer with k units, W_1^s \in \mathbb{R}^{2d \times k} is the associated weight matrix, and \sigma is a non-linear activation function such as the rectified linear unit or the sigmoid function. Similarly, the second hidden layer is a standard non-linearly activated hidden layer of the same size and feeds forward to the softmax-activated output layer:

H_2 = \sigma(H_1 \cdot W_2^s)
Y_s = \mathrm{softmax}(H_2 \cdot W_3^s)

where H_2 \in \mathbb{R}^k is the second hidden layer and W_2^s \in \mathbb{R}^{k \times k} is the associated weight matrix. W_3^s \in \mathbb{R}^{k \times 2} is the weight matrix for the softmax output layer, which outputs the probability vector Y_s \in \mathbb{R}^2 over the two binary labels.
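For concreteness, here is a forward pass through the discourse vector network following the equations above, written with column-vector conventions (each weight matrix is applied as W'x) and assuming the weights have already been trained:

```julia
sigmoid(z) = 1.0 ./ (1.0 .+ exp.(-z))

function softmax(z::Vector{Float64})
    e = exp.(z .- maximum(z))        # subtract the max for numerical stability
    return e ./ sum(e)
end

# xs : 2d-dimensional Skip-gram discourse vector
# W1 : 2d x k, W2 : k x k, W3 : k x 2 weight matrices (as in the text)
function dvn_forward(xs::Vector{Float64}, W1::Matrix{Float64},
                     W2::Matrix{Float64}, W3::Matrix{Float64})
    h1 = sigmoid(W1' * xs)           # first hidden layer, k units
    h2 = sigmoid(W2' * h1)           # second hidden layer, k units
    return softmax(W3' * h2)         # probability vector Y_s over the two labels
end
```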

5.2.2 Naive Bayes Network

We enhance the previous approaches by further processing the outputs of the four binary Naive Bayes classifiers through a hidden layer of a neural network model. One motivation for using a Naive Bayes Network with a hidden layer is to potentially downweight the influence of confusable labels. For example, we note that the Expansion and Temporal labels are easily confusable. Depending on the training process, the hidden layer might act like an XOR function over these two labels and downweight the probability output of the Naive Bayes Network.

Another motivation for a Naive Bayes Network is to cut the computational cost incurred by the large feature set. Unlike Naive Bayes classifiers, multilayer neural network models cannot support the large feature set required by the previous work because the number of parameters becomes prohibitively large compared to the dataset size when connected to a hidden layer. Instead of directly injecting this gigantic feature set into the model, we train a set of one-vs-all Naive Bayes classifiers and feed their outputs to a neural network for further processing.

Four one-vs-all Naive Bayes classifiers are trained. We use all of the features from Rutherford and Xue (2014). The combined outputs X_b \in \{0,1\}^4 from the four classifiers are used as the input to the network:

X_b = \begin{bmatrix} B_{\text{Expansion}} \\ B_{\text{Comparison}} \\ B_{\text{Contingency}} \\ B_{\text{Temporal}} \end{bmatrix},

where B_l is the output from the Naive Bayes classifier for label l against the other labels: B_l = 1 if the classifier for label l classifies that discourse relation as l, and B_l = 0 otherwise. The network has one hidden layer and one output layer. The hidden layer contains 10 units.

Figure 5.3: The general model architecture of the Mixture of Experts model. The outputs from experts 1 through n are aggregated based on the mixing proportions g_1, g_2, ..., g_n, which are calculated from the pattern of the input X.

The output layer yields a probability vector Y_b over the positive and negative labels through the softmax function:

Y_b = \mathrm{softmax}(\sigma(X_b \cdot W_1^b) \cdot W_2^b),

where W_1^b \in \mathbb{R}^{4 \times 10} and W_2^b \in \mathbb{R}^{10 \times 2} are the associated weight matrices.

5.2.3 Gating Network

The gating network acts to optimally allocate weights among the sub-modular networks based on the input vector. The gating network effectively selects the appropriate training samples for each of the sub-modules during training and selects the outputs from the appropriate sub-modules during classification. We hypothesize that the Naive Bayes Network might perform better than the discourse vector network when Arg1 and Arg2 are long, since long arguments might degrade the compositionality of the Skip-gram vectors. This gating or reweighting process and the training of the two sub-modular networks are data-driven and carried out simultaneously through the Mixture of Experts architecture (Figure 5.3).

The Mixture of Experts model is a neural network model that consists of multiple sub-modular “expert” neural networks and a gating network that assigns a weight to each expert network based on the input (Jacobs et al., 1991). During training, the gating network learns to detect which subset of input patterns is suitable for which expert and simultaneously makes each expert become specialized in a certain subset of input patterns. Therefore, the parameters of a sub-modular expert network will turn out differently when trained in a Mixture of Experts model than when trained on its own. During classification, the gating network decides which expert should receive more weight and then combines the classification results from the experts accordingly.

The NDMM can be thought of as an instance of the Mixture of Experts model. The two sub-modular networks are the discourse vector network and the Naive Bayes Network as described in the previous sections. The input for the gating network is the concatenation of the Skip-gram discourse vector X_s and the input vector for the Naive Bayes Network X_b defined above. The gating layer output Y_g is softmax-activated:

Y_g = \sigma\left( \begin{bmatrix} X_s \\ X_b \end{bmatrix} \cdot W^g \right).

Each value of Y_g corresponds to the weight that a sub-modular network receives. The final output Y is a weighted average of the outputs from the two sub-modular networks:

Z = [Y_s, Y_b]
Y = Z \cdot Y_g,

where Z \in \mathbb{R}^{2 \times 2} and Y_s, Y_b, and Y_g \in \mathbb{R}^2 are column vectors as described previously.
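The mixing step can be sketched as follows. The gating weights here are produced with a softmax so that they sum to one, which is one reasonable reading of the softmax-activated gating layer described above rather than the exact implementation; the weight matrix Wg and the variable names are illustrative.

```julia
softmax(z) = (e = exp.(z .- maximum(z)); e ./ sum(e))   # same helper as in the earlier sketch

# One reading of the NDMM mixing step: the gating network scores the concatenated
# inputs [X_s; X_b], and the expert outputs Y_s and Y_b are averaged with those weights.
# Wg is the gating weight matrix of size (2d + 4) x 2.
function ndmm_output(xs::Vector{Float64}, xb::Vector{Float64},
                     ys::Vector{Float64}, yb::Vector{Float64}, Wg::Matrix{Float64})
    yg = softmax(Wg' * vcat(xs, xb))   # mixing proportions for the two experts
    Z  = hcat(ys, yb)                  # 2 x 2 matrix whose columns are the expert outputs
    return Z * yg                      # final probability vector Y = Z * Y_g
end
```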

5.2.4 Training Procedure

The gating network, discourse vector network, and Naive Bayes network are trained simultaneously within the same network using the backpropagation algorithm. The objective function is a weighted cross-entropy loss function. Word vectors used in composing discourse vectors are trained separately on a separate corpus and fixed during the training process.

5.3 Experimental Setup

The experimental setup is consistent with the state-of-the-art baseline system for classifying implicit discourse relations (Rutherford and Xue, 2014). The PDTB corpus is split into a training set, a development set, and a test set accordingly. Sections 2 to 20 are used for training the networks (12,345 instances). Sections 0–1 are used for developing feature sets and tuning hyperparameters (1,156 instances). Sections 21–22 are used for testing the systems (1,011 instances).

Each instance of an implicit discourse relation consists of the text spans from the two arguments.

The task is formulated as a one-vs-all classification task. The label is either positive or negative, where the positive label is the label of interest, such as the Temporal relation, and the negative label is non-Temporal.

We follow the recipe set out by Rutherford and Xue (2014) to create a baseline Naive Bayes classifier. All of the linguistic pre-processing tasks are done automatically to generate some of the features that the Naive Bayes classifiers require. We use the Stanford CoreNLP suite to lemmatize and part-of-speech tag each word (Toutanova et al., 2003; Toutanova and Manning, 2000), obtain the phrase structure and dependency parses for each sentence (De Marneffe et al., 2006; Klein and Manning, 2003), identify all named entities (Finkel et al., 2005), and resolve coreference (Raghunathan et al., 2010; Lee et al., 2011; Lee et al., 2013). We use the MALLET implementation of the Naive Bayes classifier (McCallum, 2002). The features that this Naive Bayes baseline and the Naive Bayes Network have in common are taken from the project site to generate exactly identical feature sets.

The Skip-gram word vector dictionary in this experiment is induced from a Google News dataset of about 100 billion words (freely available at code.google.com/p/word2vec/). The dictionary contains 300-dimensional vectors for 3 million words and phrases. Words that are not in the dictionary are excluded and ignored when composing discourse vectors.

The mini-batch backpropagation algorithm and the AdaGrad algorithm were used to train the model (Duchi et al., 2011b). We tune all of the hyperparameters on the development set: the activation function (sigmoid or rectified linear unit), the number of hidden layers (1, 2, or 3), the number of hidden units (100, 500, 700, 1,000, 1,200, or 1,500), the meta-learning rate for AdaGrad (0.001, 0.05, or 0.01), and the mini-batch size (5, 10, 20, or 30). We train the model with and without unsupervised pre-training (Hinton et al., 2006).

The algorithms are implemented in the high-level dynamic programming language Julia (Bezanson et al., 2012). The code, along with the configuration files and all of the feature matrices, is made available at www.github.com/[anonymized]. The implementation provided on the project site is flexible enough to generalize to other datasets.

We evaluate the NDMM against four baseline systems (detailed in the following subsections) to tease apart the comparative effectiveness of individual sub-modules in the network.

5.3.1 Naive Bayes classifier with Brown cluster pairs only (NB+B)

We want to test the efficacy of Brown cluster pair features as a representation of discourse relations. So far, Brown cluster pair features are among the more effective features for this task (Rutherford and Xue, 2014). To generate the Brown cluster pair representation, we replace the words with their Brown cluster assignments induced by the Brown word clustering algorithm (Brown et al., 1992; Turian et al., 2010). Brown cluster pairs are the cartesian product of the set of Brown clusters in Arg1 and the set of Brown clusters in Arg2. For example, if Arg1 has two words w_{1,1}, w_{1,2}, and Arg2 has three words w_{2,1}, w_{2,2}, w_{2,3}, then the word pair features are (b_{1,1}, b_{2,1}), (b_{1,1}, b_{2,2}), (b_{1,1}, b_{2,3}), (b_{1,2}, b_{2,1}), (b_{1,2}, b_{2,2}), (b_{1,2}, b_{2,3}), where b_{i,j} is the Brown cluster assignment for w_{i,j}. For our purposes, the implementation of feature generation is taken from the code on the project site provided by Rutherford and Xue (2014).
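The cartesian-product feature generation can be sketched as follows; the word-to-cluster dictionary and the placeholder cluster for unknown words are assumptions of this sketch, and the actual feature strings in the released code may be formatted differently:

```julia
# Generate Brown cluster pair features for one discourse relation:
# the cartesian product of the cluster assignments of Arg1 and Arg2.
function brown_cluster_pairs(arg1::Vector{String}, arg2::Vector{String},
                             cluster_of::Dict{String,String})
    c1 = [get(cluster_of, w, "UNK") for w in arg1]   # "UNK" is a placeholder assumed here
    c2 = [get(cluster_of, w, "UNK") for w in arg2]
    return unique([(b1, b2) for b1 in c1 for b2 in c2])
end

# brown_cluster_pairs(["stocks", "fell"], ["investors", "panicked", "today"], clusters)
# yields up to 2 x 3 = 6 cluster-pair features.
```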

5.3.2 Rutherford & Xue 2014 baseline (NB+B+L)

The comparison against this strongest baseline shows us whether the NDMM works better than the simpler traditional approach or not. This baseline uses Naive Bayes classifiers loaded with Brown cluster pairs (like the baseline in 5.3.1) and other linguistic features. We hypothesize that the NDMM should outperform this – and every other – baseline because it adapts the model training process to the use of multiple levels of discourse relation representation. In addition to Brown cluster pairs, the feature set consists of combinations of the first, last, and first three words, numerical expressions, time expressions, average verb phrase length, modality, General Inquirer tags, polarity, Levin verb classes, and production rules. These features are described in greater detail in Pitler et al. (2009).

5.3.3 Naive Bayes Network baseline (NBN)

We want to compare the NDMM against this Naive Bayes Network sub-module to evaluate the margin of improvement before adding the discourse vector network and the gating network. We separate out this sub-module of the NDMM and train it without the gating layer. The network architecture is described in Section 5.2.2. The input layer consists of the outputs from the four best-performing one-vs-all classifiers. The output layer of the Naive Bayes Network makes the final classification decision.


Best discourse vector network configuration when trained in the NDMM
Task         Hidden layers     Batch size
Comparison   2 × 700 units     5
Contingency  2 × 100 units     20
Expansion    2 × 100 units     10
Temporal     2 × 1000 units    5

Best discourse vector network configuration when trained separately
Task         Hidden layers     Batch size
Comparison   2 × 500 units     30
Contingency  2 × 1000 units    5
Expansion    2 × 1200 units    20
Temporal     2 × 1500 units    10

Table 5.1: The number of hidden units and the mini-batch sizes are tuned on the development set. When trained separately, the discourse vector network requires many more hidden units to attain the best results.

5.3.4 Discourse Vector Network baseline (DVN)

We want to test the efficacy of our discourse vectors in representing discourse relations for the purpose of discourse relation classification. We hypothesize that the performance of this baseline should be better than all previous baselines because of superior representational power. To test our hypothesis, the discourse vector network is trained separately without the gating layer. The output layer of the discourse vector network makes the final classification decision.

5.4 Results and Discussion

The NDMM converges smoothly on the order of hours on a single core. The results are robust to random initialization. Unsupervised pre-training consistently hurts the performance, so we leave out the results from the experiments with unsupervised pre-training. A possible explanation is that our hidden layers are not large enough to benefit from the pre-training routine (Erhan et al., 2010). The best performing configuration for each classification task is shown in Table 5.1. The best meta-learning rate is 0.001. The sigmoid and Rectified Linear Unit (ReLU) activation functions yield similar results; only the results from ReLU are shown.


System    Section  Architecture        Representation of discourse relation
NB+B      5.3.1    Naive Bayes         Brown cluster pairs
NB+B+L    5.3.2    Naive Bayes         Brown cluster pairs + other linguistic features
NBN       5.2.2    Neural Net          4-dimensional output from NB+B+L
DVN       5.2.1    Neural Net          Distributed discourse vector built up from Skip-gram word vectors
NDMM      5.2      Mixture of Experts  Mixture of NBN and DVN

          Comparison vs others    Contingency vs others   Expansion vs others     Temporal vs others
System    P      R      F1        P      R      F1        P      R      F1        P      R      F1
NB+B      29.58  48.97  36.88     45.66  59.71  51.75     64.84  59.29  61.94     12.70  70.91  21.54
NB+B+L    27.34  72.41  39.70     44.52  69.96  54.42     59.59  85.50  70.23     18.52  63.64  28.69
NBN       31.16  49.65  38.29     47.20  68.13  55.77     58.27  90.33  70.84     19.52  60.00  29.46
DVN       32.50  63.44  42.99     46.41  71.06  56.15     63.14  65.61  64.35     26.60  52.72  35.36
NDMM      35.24  55.17  43.01     49.60  69.23  57.79     59.47  88.10  71.01     26.60  52.72  35.36

Table 5.2: The Neural Discourse Mixture Model improves the performance of all four classification tasks. The discourse vector network performs comparably for Comparison and Temporal relations.

The NDMM improves the performance on all four discourse relation types in the test set (Table 5.2). When trained separately, the discourse vector network and the Naive Bayes Network perform worse than the NDMM on all tasks. The Comparison and Temporal relations receive the highest boost in performance. The NDMM successfully combines the strengths of the two sub-modular networks and raises the bar for implicit relation classification.

The distributed discourse vector representation embedded in our new approach indeed provides a representation of discourse relations superior to Brown cluster pairs. The DVN alone outperforms the Naive Bayes model with Brown cluster pair features by 7% absolute on average. As simple as it seems, a discourse vector composed by summing Skip-gram word vectors is very effective at representing the discursive information within the relation.

More strikingly, the DVN in isolation performs well in comparison to the NDMM given its small number of features. The DVN uses only 600 continuous-valued features, which is minuscule compared to most NLP tasks. The network achieves slightly inferior yet comparable results for all categories except the Expansion relations. The strength of this network lies in the hidden layers, whose weights are learned from the data; most of the heavy lifting is done in the hidden layers. As a consequence, the best configuration for the isolated DVN uses substantially more hidden units than the DVN trained within the NDMM, probably to compensate for the absence of the information from the Naive Bayes Network (Table 5.1). These results suggest that implicit discourse relation classification does not necessarily require manual feature engineering effort to attain similar performance. If we can represent the discourse relation in a sensible or linguistically motivated fashion, we can build a model that is less reliant on expensive manually annotated resources.

5.5 Skip-gram Discourse Vector Model

Now that we have demonstrated the capacity of the NDMM and the DVN as classification models, we would like to investigate the effectiveness of the Skip-gram discourse vector space as a discourse model. Vector addition seems like a very naive way to compose individual word vectors into a sentence or a discourse, as word order is totally disregarded. However, it is evident from the previous experiment that this discourse representation is comparable to, if not better than, the largely manually crafted feature vector when used in a classification setting. If a discourse vector used by the NDMM is indeed a decent representation of the discourse relation in question, then a set of discourse relations should cluster in a linguistically interpretable manner to some degree.

We conduct a clustering analysis to test whether Skip-gram discourse vectors are valid representations of discourse relations. We convert all of the implicit discourse relations in the training set into Skip-gram discourse vectors as defined in Section 5.2.1. Then we run hierarchical clustering with respect to Euclidean distance. The distance between two discourse vectors X_1 and X_2 is simply the L_2 norm of the difference between the two vectors: ||X_1 - X_2||_2. The resulting hierarchy is cut such that 500 clusters are induced.

Close inspection of individual clusters provides some insight into how Skip-gram discourse vectors help with discourse relation classification. In general, we do not notice any clusters that share similar syntactic structures. Most clusters seem to be organized semantically. For example, the following cluster contains 27 discourse relations in which both arguments carry negative sentiment.

(13) The market also moved at early afternoon on news that Jaguar shares were being tem-

porarily suspended at 746 pence ($11.80) each. Secretary of State for Trade and

Industry Nicholas Ridley said later in the day that the government would abolish its golden

share in Jaguar, ... (WSJ0231)

(14) ”We’re trading with a very wary eye on Wall Street,” said Trevor Woodland, chief

corporate trader at Harris Trust & Savings Bank in New York. “No one is willing to place a

firm bet that the stock market won’t take another tumultuous ride.” (WSJ1931)

All of the examples of discourse relations follow this format: Arg1 is bold-faced, Arg2 is italicized, and the WSJ section numbers are shown in parentheses at the end. Previous baseline systems make use of an external lexicon to compute the polarity scores of the two arguments and use them as features for the Naive Bayes classifier (Pitler et al., 2009). It turns out that Skip-gram discourse vectors can cleanly capture the sentiment contrast or continuity that runs across the two arguments without an elaborate lexicon.

Other dominant clustering patterns are ones organized by topic or genre: the two arguments of the discourse relations in such clusters discuss the same topic. Some examples of such clusters are shown in Table 5.3. A generality-specificity shift has been found to be a helpful feature for this task and is most likely a linguistic phenomenon found in certain types of discourse relations (Rutherford and Xue, 2014). In line with this previous finding, the absence of topic shift detected by Skip-gram discourse vectors might provide some clues to the types of discourse relations.

We have shown that Skip-gram discourse vectors can capture some of the linguistic phenomena found in certain types of implicit discourse relations, but it remains an open question whether this property directly contributes to the improvement in the classification tasks. We hypothesize that the data are grouped in the Skip-gram discourse vector space in such a way that each cluster of discourse relations is easy to classify. If the distribution of relation types after clustering is very different from the distribution of relation types without clustering (the actual distribution of labels in the training set), then this suggests that the Skip-gram discourse vector provides a decent representation for classification purposes.

Figure 5.4: Most of the KL divergence values are reasonably far from zero. The label distributions of most of the one thousand clusters are substantially different from the marginal distribution.

To test this hypothesis, we run the k-means clustering algorithm with respect to Euclidean distance to partition the training set into 500 clusters and then compare the distribution within each cluster with the actual distribution of the implicit discourse relation types in the dataset. If the Skip-gram discourse vector representation is poor, then the distribution within a cluster should be as random as the distribution of labels without clustering. We quantify the difference between the two distributions by computing the Kullback-Leibler (KL) divergence:

D_{KL}(P \,\|\, Q) = \sum_{l} P(l) \log_2 \frac{P(l)}{Q(l)}

where P(·) is the empirical probability distribution function of the relation types in the cluster, Q(·) is the distribution function of the relation types of all implicit discourse relations in the training set, and l is a label. If D_{KL}(P \,\|\, Q) is high, then we can infer that the actual distribution and the distribution within the cluster are substantially different.
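A sketch of this cluster-level analysis, assuming we already have a cluster assignment and a sense label for every training instance; it uses a discrete KL divergence in base 2, matching the formula above:

```julia
# KL divergence (base 2) between an empirical cluster distribution P and the
# marginal label distribution Q, both represented as Dict(label => probability).
function kl2(p::Dict{String,Float64}, q::Dict{String,Float64})
    return sum(pl * log2(pl / q[l]) for (l, pl) in p if pl > 0.0)
end

label_dist(labels) = Dict(l => count(==(l), labels) / length(labels) for l in unique(labels))

# Mean KL divergence of the per-cluster label distributions from the marginal distribution.
function mean_cluster_kl(labels::Vector{String}, clusters::Vector{Int})
    q = label_dist(labels)
    kls = Float64[]
    for c in unique(clusters)
        members = labels[clusters .== c]     # labels of the instances assigned to cluster c
        push!(kls, kl2(label_dist(members), q))
    end
    return sum(kls) / length(kls)
end
```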

The distribution of labels within each cluster is on average significantly different from the actual distribution of labels. The mean KL divergence is 0.2131 and significantly different from zero (t-test; s.e. ±0.0179; p < 0.01). We observe the same results with 100 and 1,000 clusters induced by the k-means algorithm and with the 500 clusters induced by hierarchical clustering in the previous analysis. Confirming the statistical significance, the histogram of KL divergence values shows that the label distributions in most of the clusters differ from the overall distribution of labels in the training set (Figure 5.4). This suggests that the Skip-gram discourse vector representation might ease classification by projecting the discourse relations into a space in which most of the clusters are already partially classified.

5.6 Related work

Several efforts have been made to classify implicit discourse relations. Various external lexicons and resources have been used to capture the semantic representation of sentences (Pitler et al., 2009). Brown cluster pairs have been used to remedy the sparsity problem found in this kind of task when using a battery of conventional binary features (Rutherford and Xue, 2014). However, without additional data, the additional benefits of feature engineering start to plateau. Our new neural-network-based method breaks the trend of creating more and more features, while still encapsulating the traditional paradigm.

Unsupervised or semi-supervised approaches seem suitable for this task but turn out to be limited in their efficacy (Marcu and Echihabi, 2002b). Sporleder and Lascarides (2008) use discourse connectives to label discourse relations automatically and train supervised models on these newly labeled data. Unfortunately, these approaches fall short because explicit and implicit discourse relations differ linguistically.

Neural networks have recently seen successes in natural language and speech processing. Notably, a deep neural network architecture achieves the state of the art in acoustic modeling (Hinton et al., 2012). A recursive neural tensor network has been used to model semantic compositionality for fine-grained sentiment classification tasks (Socher et al., 2013b; Socher et al., 2012). To the best of our knowledge, the recurrent convolutional neural network proposed by Kalchbrenner and Blunsom (2013) is the only neural network model applied to discourse analysis. This class of model convolves distributed representations of words into an utterance vector and a discourse context vector, which can be used for dialogue act classification. We do not use a recurrent architecture because written text does not have the turn-taking nature usually observed in dialogue, and we are not aware of any neural discourse models for text or monologue.

5.7 Conclusion

We present the Neural Discourse Mixture Model, which mixes the continuous-valued Skip-gram representation of a discourse relation with the large array of traditional binary features used by Naive Bayes classifiers. The NDMM achieves a performance gain over the current state-of-the-art system partly because the discourse vectors provide a better representation of a discourse relation than the Brown cluster pair representation.

Upon linguistic analysis, the Skip-gram discourse vectors capture some important aspects of the discourse that help distinguish between certain types of implicit discourse relations. As a classification model, the discourse vector network alone performs competitively for some labels. These results suggest that distributed discourse relation representation is a promising direction for improving implicit discourse relation classification.


Topic: Medical (size 6)
  Using small electrical shocks applied to her feet, they were able to monitor sensory nerves. The shocks generated nerve impulses that traveled via spine to brain and showed up clearly on a brain-wave monitor, indicating no damage to the delicate spinal tissue. (WSJ0297)
  The devices’ most remarkable possibilities, though, involve the brain. Probing with the stimulators, National Institutes of Health scientists recently showed how the brain reorganizes motor-control resources after an amputation. (WSJ0297)
  It relies on the fact that certain atoms give off detectable signals when subjected to an intense magnetic field. It’s the same phenomenon used in the new MRI (magnetic resonance imaging) scanners being used in hospitals in place of X-ray scans. (WSJ1218)

Topic: Tax (size 7)
  But they also are to see that taxpayers get all allowable tax benefits and to ask if filers who sought IRS aid were satisfied with it. Courts have ruled that taxpayers must submit to TCMP audits, but the IRS will excuse from the fullscale rigors anyone who was audited without change for either 1986 or 1987. (WSJ0293)
  Many openings for mass cheating, such as questionable tax shelters and home offices, have gaped so broadly that Congress has passed stringent laws to close them. Deductions of charitable gifts of highly valued art now must be accompanied by appraisals. (WSJ1570)
  For corporations, the top tax rate on the sale of assets held for more than three years would be cut to 33% from the current top rate of 34%. That rate would gradually decline to as little as 29% for corporate assets held for 15 years. (WSJ1869)

Topic: Broadcasting (size 28)
  It isn’t clear how much those losses may widen because of the short Series. Had the contest gone a full seven games, ABC could have reaped an extra $10 million in ad sales on the seventh game alone, compared with the ad take it would have received for regular prime-time shows. (WSJ0443)
  The ads are just the latest evidence of how television advertising is getting faster on the draw. While TV commercials typically take weeks to produce, advertisers in the past couple of years have learned to turn on a dime, to crash out ads in days or even hours. (WSJ0453)
  And only a handful of small U.S. companies are engaged in high-definition software development. It’s estimated that just about 250 hours of HD programming is currently available for airing. (WSJ1386)

Table 5.3: Clustering analysis of Skip-gram discourse vector model reveals that the absence of topic shift within a discourse relation might provide some clues to the types of discourse relations. Arg1 and Arg2 are indicated by bold-faced text and italicized text respectively.

Chapter 6

Recurrent Neural Network for Discourse Analysis

The discourse structure of natural language has been analyzed and conceptualized under various frameworks (Mann and Thompson, 1988; Lascarides and Asher, 2007; Prasad et al., 2008). The

Penn Discourse TreeBank (PDTB) and the Chinese Discourse Treebank (CDTB), currently the largest corpora annotated with discourse structures in English and Chinese respectively, view the discourse structure of a text as a set of discourse relations (Prasad et al., 2008; Zhou and Xue,

2012). Each discourse relation is grounded by a discourse connective taking two text segments as arguments (Prasad et al., 2008). Implicit discourse relations are those where discourse connectives are omitted from the text and yet the discourse relations still hold.

Neural network models are an attractive alternative for this task for at least two reasons. First, they can model the arguments of an implicit discourse relation as dense vectors and suffer less from the data sparsity problem that is typical of the traditional feature engineering paradigm. Second, they should be easily extended to other languages, as they do not require human-annotated lexicons.

However, despite the many nice properties of neural network models, it is not clear how well they will fare with a small dataset, typically found in discourse annotation projects. Moreover, it is

not straightforward to construct a single vector that properly represents the “semantics” of the arguments. As a result, neural network models that use dense vectors have been shown to underperform traditional systems that use manually crafted features, unless the dense vectors are combined with hand-crafted surface features (Ji and Eisenstein, 2015).

In this work, we explore multiple neural architectures in an attempt to find the best distributed representation and neural network architecture suitable for this task in both English and Chinese.

We do this by probing different points on the spectrum of structurality, from structureless bag-of-words models to sequential and tree-structured models. We use feedforward, sequential long short-term memory (LSTM), and tree-structured LSTM models to represent these three points on the spectrum. To the best of our knowledge, there is no prior study that investigates the contribution of the different architectures in neural discourse analysis.

Our main contributions and findings from this work can be summarized as follows:

• Our neural discourse model performs comparably with or even outperforms systems with surface features across different fine-grained discourse label sets.

• We investigate the contribution of the linguistic structures in neural discourse modeling and found that high-dimensional word vectors trained on a large corpus can compensate for the lack of structures in the model, given the small amount of annotated data.

• We found that modeling the interaction across arguments via hidden layers is essential to improving the performance of an implicit discourse relation classifier.

• We present the first neural CDTB-style Chinese discourse parser, confirming that our current results and other previous findings conducted on English data also hold cross-linguistically.

6.1 Related Work

The prevailing approach for this task is to use surface features derived from various semantic lexicons (Pitler et al., 2009), reducing the number of parameters by mapping raw word tokens in the arguments of discourse relations to a limited number of entries in a semantic lexicon such as polarity and verb classes.

Along the same vein, Brown cluster assignments have also been used as a general purpose lexicon that requires no human manual annotation (Rutherford and Xue, 2014). However, these solutions still suffer from the data sparsity problem and almost always require extensive feature selection to work well (Park and Cardie, 2012; Lin et al., 2009; Ji and Eisenstein, 2015). The work we report here explores the use of the expressive power of distributed representations to overcome the data sparsity problem found in the traditional feature engineering paradigm.

Neural network modeling has attracted much attention in the NLP community recently and has been explored to some extent in the context of this task. Recently, Braud and Denis (2015) tested various word vectors as features for implicit discourse relation classification and showed that distributed features achieve the same level of accuracy as one-hot representations in some experimental settings. Ji et al. (2015; 2016) advance the state of the art for this task using recursive and recurrent neural networks. In the work we report here, we systematically explore the use of different neural network architectures and show that when high-dimensional word vectors are used as input, a simple feed-forward architecture can outperform more sophisticated architectures such as sequential and tree-based LSTM networks, given the small amount of data.

Recurrent neural networks, especially LSTM networks, have changed the paradigm of deriving distributed features from a sentence (Hochreiter and Schmidhuber, 1997), but they have not been much explored in the realm of discourse parsing. LSTM models have notably been used to encode the meaning of a source-language sentence in neural machine translation (Cho et al., 2014; Devlin et al., 2014) and recently to encode the meaning of an entire sentence to be used as features

(Kiros et al., 2015). Many neural architectures have been explored and evaluated, but there is no



single technique that is decidedly better across all tasks. The LSTM-based models such as Kiros et al. (2015) perform well across tasks but do not outperform some other strong neural baselines. Ji et al. (2016) use a joint discourse language model to improve the performance on the coarse-grained labels in the PDTB, but in our case, we would like to deduce how well LSTM fares in fine-grained implicit discourse relation classification. A joint discourse language model might not scale well to a finer-grained label set, which is more practical for applications.

Figure 6.1: (Top) Feedforward architecture. (Bottom) Sequential Long Short-Term Memory architecture.

6.2 Model Architectures

Following previous work, we assume that the two arguments of an implicit discourse relation are given so that we can focus on predicting the senses of the implicit discourse relations. The input

to our model is a pair of text segments, called Arg1 and Arg2, and the label is one of the senses defined in the Penn Discourse Treebank, as in the example below:

Input:

Arg1 Senator Pete Domenici calls this effort “the first gift of democracy”

Arg2 The Poles might do better to view it as a Trojan Horse.

Output:

Sense Comparison.Contrast

In all architectures, each word in the argument is represented as a k-dimensional word vector trained on an unannotated data set. We use various model architectures to transform the semantics represented by the word vectors into distributed continuous-valued features. In the rest of the section, we explain the details of the neural network architectures that we design for the implicit discourse relation classification task. The models are summarized schematically in Figure 6.1.

6.2.1 Bag-of-words Feedforward Model

This model does not model the structure or word order of a sentence. The features are simply obtained through element-wise pooling functions. Pooling is one of the key techniques in neural network modeling of computer vision (Krizhevsky et al., 2012; LeCun et al., 2010). Max pooling is known to be very effective in vision, but it is unclear what pooling function works well when it comes to pooling word vectors. Summation pooling and mean pooling have been claimed to perform well at composing meaning of a short phrase from individual word vectors (Le and Mikolov, 2014;

Blacoe and Lapata, 2012; Mikolov et al., 2013b; Braud and Denis, 2015). The Arg1 vector $a^1$ and Arg2 vector $a^2$ are computed by applying an element-wise pooling function $f$ to all of the $N_1$ word vectors in Arg1, $w^1_{1:N_1}$, and all of the $N_2$ word vectors in Arg2, $w^2_{1:N_2}$, respectively:

$$a^1_i = f(w^1_{1:N_1,\, i}) \qquad a^2_i = f(w^2_{1:N_2,\, i})$$


We consider three different pooling functions, namely max, summation, and mean pooling:

$$f_{\text{max}}(w_{1:N}, i) = \max_{j=1}^{N} w_{j,i} \qquad f_{\text{sum}}(w_{1:N}, i) = \sum_{j=1}^{N} w_{j,i} \qquad f_{\text{mean}}(w_{1:N}, i) = \frac{1}{N}\sum_{j=1}^{N} w_{j,i}$$
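The three pooling functions are straightforward to implement; the NumPy sketch below is illustrative only and uses hypothetical variable names rather than our actual code.

import numpy as np

def max_pool(word_vectors):
    # word_vectors: (N, k) matrix of word vectors for one argument
    return word_vectors.max(axis=0)    # element-wise max over the N words

def sum_pool(word_vectors):
    return word_vectors.sum(axis=0)    # element-wise summation

def mean_pool(word_vectors):
    return word_vectors.mean(axis=0)   # element-wise mean

# Example: pool a toy "argument" of 5 words with 4-dimensional vectors.
arg = np.random.randn(5, 4)
a_sum = sum_pool(arg)                  # 4-dimensional argument vector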

Inter-argument interaction is modeled directly by the hidden layers that take argument vectors as features. Discourse relations cannot be determined based on the two arguments individually.

Instead, the sense of the relation can only be determined when the arguments in a discourse relation are analyzed jointly. The first hidden layer h1 is the non-linear transformation of the weighted linear combination of the argument vectors:

$$h_1 = \tanh(W_1 \cdot a^1 + W_2 \cdot a^2 + b_{h_1})$$

where $W_1$ and $W_2$ are $d \times k$ weight matrices and $b_{h_1}$ is a $d$-dimensional bias vector. Further hidden layers $h_t$ and the output layer $o$ follow the standard feedforward neural network model:

$$h_t = \tanh(W_{h_t} \cdot h_{t-1} + b_{h_t}) \qquad o = \mathrm{softmax}(W_o \cdot h_T + b_o)$$

where $W_{h_t}$ is a $d \times d$ weight matrix, $b_{h_t}$ is a $d$-dimensional bias vector, and $T$ is the number of hidden layers in the network.
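For concreteness, the forward pass of this bag-of-words feedforward model can be sketched in a few lines of NumPy. This is an illustrative re-implementation under assumed sizes (k = 300, d = 300 hidden units, 11 output senses) and randomly initialized weights, not the Theano code used in our experiments.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

k, d, n_senses = 300, 300, 11                 # assumed sizes for illustration
rng = np.random.RandomState(0)
W1, W2 = rng.randn(d, k) * 0.01, rng.randn(d, k) * 0.01   # per-argument projections
b_h1 = np.zeros(d)
W_o, b_o = rng.randn(n_senses, d) * 0.01, np.zeros(n_senses)

def forward(arg1_vectors, arg2_vectors):
    # Summation pooling turns each (N_i, k) argument into a k-dimensional vector.
    a1 = arg1_vectors.sum(axis=0)
    a2 = arg2_vectors.sum(axis=0)
    # Inter-argument interaction: both argument vectors feed the same hidden layer.
    h1 = np.tanh(W1 @ a1 + W2 @ a2 + b_h1)
    return softmax(W_o @ h1 + b_o)            # distribution over sense labels

probs = forward(np.random.randn(12, k), np.random.randn(9, k))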

6.2.2 Sequential Long Short-Term Memory (LSTM)

A sequential Long Short-Term Memory Recurrent Neural Network (LSTM-RNN) models the semantics of a sequence of words through the use of hidden state vectors. Therefore, the word ordering

does affect the resulting hidden state vectors, unlike in the bag-of-words model. For each word vector $w_t$ at word position $t$, we compute the corresponding hidden state vector $s_t$ and the memory cell vector $c_t$ from the previous step:

$$i_t = \mathrm{sigmoid}(W_i \cdot w_t + U_i \cdot s_{t-1} + b_i)$$
$$f_t = \mathrm{sigmoid}(W_f \cdot w_t + U_f \cdot s_{t-1} + b_f)$$
$$o_t = \mathrm{sigmoid}(W_o \cdot w_t + U_o \cdot s_{t-1} + b_o)$$
$$c'_t = \tanh(W_c \cdot w_t + U_c \cdot s_{t-1} + b_c)$$
$$c_t = c'_t * i_t + c_{t-1} * f_t$$
$$s_t = c_t * o_t$$

where $*$ is element-wise multiplication. The argument vectors are the results of applying a pooling function over the hidden state vectors:

$$a^1_i = f(s^1_{1:N_1,\, i}) \qquad a^2_i = f(s^2_{1:N_2,\, i})$$

In addition to the three pooling functions that we describe in the previous subsection, we also consider using only the last hidden state vector, which should theoretically be able to encode the semantics of the entire word sequence.

$$f_{\text{last}}(s_{1:N}, i) = s_{N,i}$$

Inter-argument interaction and the output layer are modeled in the same fashion as the bag-of-words model once the argument vector is computed.
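For reference, a single step of the LSTM recurrence defined above can be written directly from the gate equations. The NumPy sketch below uses assumed weight shapes and a simple parameter tuple; it is illustrative only, not the Theano implementation used in our experiments.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(w_t, s_prev, c_prev, params):
    # params holds the weight matrices W_*, U_* and bias vectors b_* from the equations.
    W_i, U_i, b_i, W_f, U_f, b_f, W_o, U_o, b_o, W_c, U_c, b_c = params
    i_t = sigmoid(W_i @ w_t + U_i @ s_prev + b_i)       # input gate
    f_t = sigmoid(W_f @ w_t + U_f @ s_prev + b_f)       # forget gate
    o_t = sigmoid(W_o @ w_t + U_o @ s_prev + b_o)       # output gate
    c_tilde = np.tanh(W_c @ w_t + U_c @ s_prev + b_c)   # candidate memory
    c_t = c_tilde * i_t + c_prev * f_t                  # new memory cell
    s_t = c_t * o_t                                     # new hidden state
    return s_t, c_t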

6.2.3 Tree LSTM

The principle of compositionality leads us to believe that the semantics of the argument vector should be determined by the syntactic structures and the meanings of the constituents. For a fair


comparison with the sequential model, we apply the same formulation of LSTM on the binarized constituent parse tree. The hidden state vector now corresponds to a constituent in the tree. These hidden state vectors are then used in the same fashion as the sequential LSTM. The mathematical formulation is the same as Tai et al. (2015).

Sense                         Train    Dev   Test
Comparison.Concession           192      5      5
Comparison.Contrast            1612     82    127
Contingency.Cause              3376    120    197
Contingency.Pragmatic cause      56      2      5
Expansion.Alternative           153      2     15
Expansion.Conjunction          2890    115    116
Expansion.Instantiation        1132     47     69
Expansion.List                  337      5     25
Expansion.Restatement          2486    101    190
Temporal.Asynchronous           543     28     12
Temporal.Synchrony              153      8      5
Total                         12930    515    766

Table 6.1: The distribution of the level 2 sense labels in the Penn Discourse Treebank. The instances annotated with two labels are not double-counted (only the first label is counted here), and partial labels are excluded.

This model is similar to the recursive neural networks proposed by Ji and Eisenstein (2015).

Our model differs from their model in several ways. We use the LSTM networks instead of the

“vanilla” RNN formula and expect better results because LSTM cells are less prone to vanishing and exploding gradients during training. Furthermore, our purpose is to compare the influence of the model structures. Therefore, we must use LSTM cells in both the sequential and tree LSTM models for a fair and meaningful comparison. A more in-depth comparison of our work and the recursive neural network model by Ji and Eisenstein (2015) is provided in the discussion section.

6.3 Corpora and Implementation

The Penn Discourse Treebank (PDTB) We use the PDTB due to its theoretical simplicity in discourse analysis and its reasonably large size. The annotation is done as another layer on

the Penn Treebank on the Wall Street Journal sections. Each relation consists of two spans of text that are minimally required to infer the relation, and the sense is organized hierarchically. The classification problem can be formulated in various ways based on the hierarchy. Previous work on this task has been done over three schemes of evaluation: top-level 4-way classification (Pitler et al., 2009), second-level 11-way classification (Lin et al., 2009; Ji and Eisenstein, 2015), and the modified second-level classification introduced in the CoNLL 2015 Shared Task (Xue et al., 2015).

We focus on the second-level 11-way classification because the labels are fine-grained enough to be useful for downstream tasks and also because the strongest neural network systems are tuned to this formulation. If an instance is annotated with two labels (∼3% of the data), we only use the first label. Partial labels, which constitute ∼2% of the data, are excluded. Table 6.1 shows the distribution of labels in the training set (sections 2-21), development set (section 22), and test set

(section 23).

Training Weight initialization is uniform random, following the formula recommended by Bengio

(2012). The cost function is the standard cross-entropy loss function, as the hinge loss function

(large-margin framework) yields consistently inferior results. We use Adagrad as the optimization algorithm of choice. The learning rates are tuned over a grid search. We monitor the accuracy on the development set to determine convergence and prevent overfitting. L2 regularization and/or dropout do not make a big impact on performance in our case, so we do not use them in the final results.
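For reference, the per-parameter Adagrad update can be sketched as follows; this is the standard textbook update with placeholder hyperparameters, not a transcription of our training script.

import numpy as np

def adagrad_update(param, grad, cache, lr=0.01, eps=1e-8):
    # cache accumulates the squared gradients seen so far for each parameter entry,
    # so frequently updated entries receive progressively smaller effective learning rates.
    cache += grad ** 2
    param -= lr * grad / (np.sqrt(cache) + eps)
    return param, cache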

Implementation All of the models are implemented in Theano (Bergstra et al., 2010; Bastien et al., 2012). The gradient computation is done with symbolic differentiation, a functionality provided by Theano. Feedforward models and sequential LSTM models are trained on CPUs on Intel Xeon

X5690 3.47GHz machines, using only a single core per model. A tree LSTM model is trained on a GPU hosted on an Intel Xeon E5-2660 machine. All models converge within hours.


Model                               Accuracy
PDTB second-level senses
Most frequent tag baseline             25.71
Our best tree LSTM                     34.07
Ji & Eisenstein (2015)                 36.98
Our best sequential LSTM variant       38.38
Our best feedforward variant           39.56
Lin et al. (2009)                      40.20

Table 6.2: Performance comparison across different models for second-level senses.

6.4 Experiment on the Second-level Sense in the PDTB

We want to test the effectiveness of the inter-argument interaction and the three models described above on the fine-grained discourse relations in English. The data split and the label set are exactly the same as previous works that use this label set (Lin et al., 2009; Ji and Eisenstein, 2015).

Preprocessing All tokenization is taken from the gold standard tokenization in the PTB (Marcus et al., 1993). We use the Berkeley parser to parse all of the data (Petrov et al., 2006). We test the effects of word vector sizes. 50-dimensional and 100-dimensional word vectors are trained on the training sections of the WSJ data, which is the same text as the PDTB annotation. Although this seems like too little data, 50-dimensional WSJ-trained word vectors have previously been shown to be the most effective in this task (Ji and Eisenstein, 2015). Additionally, we also test the off-the-shelf word vectors trained on billions of tokens from Google News data, freely available with the word2vec tool. All word vectors are trained with the Skip-gram architecture (Mikolov et al.,

2013b; Mikolov et al., 2013a). Other models such as GloVe and continuous bag-of-words seem to yield broadly similar results (Pennington et al., 2014). We keep the word vectors fixed, instead of

fine-tuning during training.
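The off-the-shelf Google News vectors can be loaded, for example, with gensim's KeyedVectors interface; the sketch below is one possible way to embed an argument, and it assumes the standard GoogleNews-vectors-negative300.bin file (gensim is not the toolkit used in the original experiments).

from gensim.models import KeyedVectors
import numpy as np

# The pre-trained 300-dimensional Skip-gram vectors distributed with word2vec.
vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

def embed_argument(tokens, dim=300):
    # Out-of-vocabulary tokens are simply skipped here; other back-off schemes are possible.
    rows = [vectors[w] for w in tokens if w in vectors]
    return np.vstack(rows) if rows else np.zeros((1, dim))

arg1 = embed_argument("Senator Pete Domenici calls this effort".split())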



Figure 6.2: Summation pooling gives the best performance in general. The results are shown for the systems using 100-dimensional word vectors and one hidden layer.


Figure 6.3: Inter-argument interaction can be modeled effectively with hidden layers. The results are shown for the feedforward models with summation pooling, but this effect can be observed robustly in all architectures we consider.

6.4.1 Results and discussion

The feedforward model performs best overall among all of the neural architectures we explore

(Table 6.2). It outperforms the recursive neural network with a bilinear output layer introduced by Ji and Eisenstein (2015) (p < 0.05; bootstrap test) and performs comparably with the surface-feature baseline (Lin et al., 2009), which uses various lexical and syntactic features and extensive feature selection. The tree LSTM achieves lower accuracy than our best feedforward model. The best configuration of the feedforward model uses 300-dimensional word vectors, one hidden layer, and the summation pooling function to derive the argument feature vectors. The model behaves well during training and converges in less than an hour on a CPU.

The sequential LSTM model outperforms the feedforward model when word vectors are not



Figure 6.4: Comparison between feedforward and sequential LSTM when using summation pooling function.

                        No hidden layer              1 hidden layer               2 hidden layers
Architecture   k    max   mean  sum   last       max   mean  sum   last       max   mean  sum   last
Feedforward    50   31.85 31.98 29.24  -         33.28 34.98 37.85  -         34.85 35.5  38.51  -
LSTM           50   31.85 32.11 34.46 31.85      34.07 33.15 36.16 34.34      36.16 35.11 37.2  35.24
Tree LSTM      50   28.59 28.32 30.93 28.72      29.89 30.15 32.5  31.59      32.11 31.2  32.5  29.63
Feedforward    100  33.29 32.77 28.72  -         36.55 35.64 37.21  -         36.55 36.29 37.47  -
LSTM           100  30.54 33.81 35.9  33.02      36.81 34.98 37.33 35.11      37.46 36.68 37.2  35.77
Tree LSTM      100  29.76 28.72 31.72 31.98      31.33 26.89 33.02 33.68      32.63 31.07 32.24 33.02
Feedforward    300  32.51 34.46 35.12  -         35.77 38.25 39.56  -         35.25 38.51 39.03  -
LSTM           300  28.72 34.59 35.24 34.64      38.25 36.42 37.07 35.5       38.38 37.72 37.2  36.29
Tree LSTM      300  28.45 31.59 32.76 26.76      33.81 32.89 33.94 32.63      32.11 32.76 34.07 32.50

Table 6.3: Compilation of all experimental configurations for 11-way classification on the PDTB test set. k is the word vector size. Bold-faced numbers indicate the best performance for each architecture, which is also shown in Table 6.2.

high-dimensional and not trained on a large corpus (Figure 6.4). Moving from 50 units to 100 units trained on the same dataset, we do not observe much of a difference in performance in either architecture, but the sequential LSTM model beats the feedforward model in both settings. This suggests that only 50 dimensions are needed for the WSJ corpus. However, the trend reverses when we move to 300-dimensional word vectors trained on a much larger corpus. These results suggest an interaction between the lexical information encoded by word vectors and the structural information encoded by the model itself.

Hidden layers, especially the first one, make a substantial impact on performance. This effect is observed across all architectures (Figure 6.3). Strikingly, the improvement can be as high as

8% absolute when used with the feedforward model with small word vectors. We tried up to four

hidden layers and found that the additional hidden layers yield diminishing, if not negative, returns. These effects are not an artifact of the training process, as we have tuned the models quite extensively, although it might be the case that we do not have sufficient data to fit the extra parameters.

Summation pooling is effective for both feedforward and LSTM models (Figure 6.2). The word vectors we use have been claimed to have some additive properties (Mikolov et al., 2013b), so summation pooling in this experiment supports this claim. Max pooling is only effective for

LSTM, probably because the values in the word vector encode the abstract features of each word relative to each other. It can be trivially shown that if all of the vectors are multiplied by -1, then the results from max pooling will be totally different, but the word similarities remain the same.

The memory cells and the state vectors in the LSTM models transform the original word vectors to work well with the max pooling operation, but the feedforward net cannot transform the word vectors to work well with max pooling, as it is not allowed to change the word vectors themselves.

6.4.2 Discussion

Why does the feedforward model outperform the LSTM models? Sequential and tree LSTM models might work better if we were given a larger amount of data. We observe that the LSTM models outperform the feedforward model when word vectors are smaller, so it is unlikely that we train the LSTMs incorrectly. It is more likely that we do not have enough annotated data to train a more powerful model such as an LSTM. In previous work, LSTMs are applied to tasks with a lot of labeled data compared to the mere 12,930 instances that we have (Vinyals et al., 2015; Chiu and Nichols, 2015; İrsoy and Cardie, 2014). Another explanation comes from the fact that the contextual information encoded in the word vectors can compensate for the lack of structure in the model in this task. Word vectors are already trained to encode the words in their linguistic context, especially information from word order.

Our discussion would not be complete without explaining our results in relation to the recursive

neural network model proposed by Ji and Eisenstein (2015). Why do sequential LSTM models outperform recursive neural networks or tree LSTM models? Although this at first comes as a surprise to us, the results are consistent with recent work that uses sequential LSTMs to encode syntactic information. For example, Vinyals et al. (2015) use a sequential LSTM to encode the features for syntactic parse output. Tree LSTM seems to show improvement when there is a need to model long-distance dependencies in the data (Tai et al., 2015; Li et al., 2015). Furthermore, the benefits of tree LSTM are not readily apparent for a model that discards the syntactic categories in the intermediate nodes and makes no distinction between heads and their dependents, which are at the core of syntactic representations.

Another point of contrast between our work and Ji and Eisenstein's (2015) is the modeling choice for inter-argument interaction. Our experimental results show that the hidden layers are an important contributor to the performance of all of our models. We choose linear inter-argument interaction instead of bilinear interaction, and this decision gives us at least two advantages. First, linear interaction allows us to stack up hidden layers without exponential growth in the number of parameters. Second, linear interaction allows us to use high-dimensional word vectors, which we found to be another important component of the performance. The recursive model by Ji and Eisenstein (2015) is limited to 50 units due to the bilinear layer. Our choice of linear inter-argument interaction and high-dimensional word vectors turns out to be crucial to building a competitive neural network model for classifying implicit discourse relations.

6.5 Extending the results across label sets and languages

Do our feedforward models perform well without surface features across different label sets and languages as well? We want to extend our results to another label set and language by evaluating our models on the non-explicit discourse relation data used in the English and Chinese CoNLL 2016


Shared Task. We will have more confidence in our model if it works well across label sets. It is also important that our model works cross-linguistically because other languages might not have resources such as semantic lexicons or parsers, required by some previously used features.

6.5.1 English discourse relations

We follow the experimental setting used in CoNLL 2015-2016 Shared Task as we want to compare our results against previous systems. This setting differs from the previous experiment in a few ways. Entity relations (EntRel) and alternative lexicalization relations (AltLex) are included in this setting. The label set is modified by the shared task organizers into 15 different senses including

EntRel as another sense (Xue et al., 2015). We use the 300-dimensional word vectors used in the previous experiment and tune the number of hidden layers and hidden units on the development set. The best system from last year's shared task serves as a strong baseline; it uses only surface features and achieves state-of-the-art performance under this label set (Wang and Lan,

2015). These features are similar to the ones used by Lin et al. (2009).

6.5.2 Chinese discourse relations

We evaluate our model on the Chinese Discourse Treebank (CDTB) because its annotation is the most comparable to the PDTB (Zhou and Xue, 2015). The sense set consists of 10 different senses, which, unlike in the PDTB, are not organized in a hierarchy. We use the version of the data provided to the CoNLL 2016 Shared Task participants. This version has 16,946 instances of discourse relations in total in the combined training and development sets. The test set was not yet available at the time of writing, so the system is evaluated by the average accuracy over 7-fold cross-validation on the combined training and development sets.

There is no previously published baseline for Chinese. To establish a baseline comparison, we use

MaxEnt models loaded with the feature sets previously shown to be effective for English, namely dependency rule pairs, production rule pairs (Lin et al., 2009), Brown cluster pairs (Rutherford


Model                                    Acc.
CoNLL-ST 2015-2016 English
Most frequent tag baseline               21.36
Our best LSTM variant                    31.76
Wang and Lan (2015) - winning team       34.45
Our best feedforward variant             36.26
CoNLL-ST 2016 Chinese
Most frequent tag baseline               77.14
ME + Production rules                    80.81
ME + Dependency rules                    82.34
ME + Brown pairs (1000 clusters)         82.36
Our best LSTM variant                    82.48
ME + Brown pairs (3200 clusters)         82.98
ME + Word pairs                          83.13
ME + All feature sets                    84.16
Our best feedforward variant             85.45

Table 6.4: Our best feedforward variant significantly outperforms the systems with surface features (p < 0.05). ME = Maximum Entropy model.

and Xue, 2014), and word pairs (Marcu and Echihabi, 2002b). We use the information gain criterion to select the best subset of each feature set, which is crucial in feature-based discourse parsing.
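An information-gain-style selection step can be approximated with off-the-shelf tools; the sketch below uses scikit-learn's mutual-information scorer on a toy binary feature matrix and is illustrative only, not the selection code used in our experiments.

from sklearn.feature_selection import SelectKBest, mutual_info_classif
import numpy as np

rng = np.random.RandomState(0)
X = (rng.rand(2000, 2000) < 0.01).astype(np.float64)   # toy binary feature matrix
y = rng.randint(0, 10, size=2000)                       # toy sense labels

# Score each binary feature by its mutual information with the label and keep the top k.
# (In the real systems roughly 10,000 features were kept from a much larger feature set.)
score_fn = lambda X, y: mutual_info_classif(X, y, discrete_features=True)
selector = SelectKBest(score_fn, k=500)
X_selected = selector.fit_transform(X, y)
print(X_selected.shape)   # (2000, 500)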

Chinese word vectors are induced with the CBOW and Skip-gram architectures in word2vec (Mikolov et al., 2013a) on the Chinese Gigaword corpus (Graff and Chen, 2005) using default settings.

The number of dimensions that we try are 50, 100, 150, 200, 250, and 300. We induce 1,000 and

3,000 Brown clusters on the Gigaword corpus.
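Inducing such vectors amounts to running word2vec over a pre-segmented corpus; a rough gensim (version 4 or later) equivalent is sketched below. The corpus path and segmentation are placeholders, and the original vectors were trained with the word2vec tool rather than gensim.

from gensim.models import Word2Vec

class SegmentedCorpus:
    # Each line of the corpus file is assumed to be one pre-segmented sentence.
    def __init__(self, path):
        self.path = path
    def __iter__(self):
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                yield line.split()

corpus = SegmentedCorpus("gigaword_zh_segmented.txt")   # placeholder path
# sg=1 selects the Skip-gram architecture; sg=0 would give CBOW.
model = Word2Vec(corpus, vector_size=250, sg=1, window=5, min_count=5, workers=4)
model.wv.save("zh_skipgram_250.kv")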

6.5.3 Results

Table 6.4 shows the results for the models which are best tuned on the number of hidden units, hidden layers, and the types of word vectors. The feedforward variant of our model significantly outperforms the strong baselines in both English and Chinese (p < 0.05 bootstrap test). This suggests that our approach is robust against different label sets, and our findings are valid across

languages. Our Chinese model outperforms all of the feature sets known to work well in English despite using only word vectors.

Figure 6.5: Comparing the accuracies across Chinese word vectors for the feedforward model.

The choice of neural architecture used for inducing Chinese word vectors turns out to be crucial.

Chinese word vectors from the Skip-gram model perform consistently better than the ones from the CBOW model (Figure 6.5). These two types of word vectors do not show much difference in the English tasks.

6.6 Conclusions and future work

We report a series of experiments that systematically probe the effectiveness of various neural network architectures for the task of implicit discourse relation classification. Given the small amount of annotated data, we found that a feedforward variant of our model, combined with hidden layers and high-dimensional word vectors, outperforms more complicated LSTM models. Our model performs competitively with or better than models that use manually crafted surface features, and it is the first neural CDTB-style Chinese discourse parser. We will make our code and models publicly available.

Chapter 7

Robust Light-weight Multilingual Neural Discourse Parser

7.1 Introduction

Most of the previous work on discourse relation classification is limited in its applicability because it demonstrates its efficacy only on specific label sets and only in English. Automatic discourse analysis needs to be adapted to the application in order to be useful. Surface-feature approaches are quite effective, but we cannot extend them to other languages where semantic lexicons and syntactic parsers are not available or not accurate (Lin et al., 2009; ?). These approaches also require extensive feature selection to make them effective for a specific label set. Explicit discourse relation classification does not suffer from this problem, as the discourse connectives determine the senses of the relations quite reliably. Non-explicit discourse relations, on the other hand, benefit from more robust methods that work across many settings.

Neural network models address this problem indirectly by avoiding semantic lexicons and relying on word vectors as a data-driven semantic lexicon, where the features are abstract continuous values. However, previous neural network work on this task does require parses, which brings us back to the extensibility problem we faced before (Ji and Eisenstein, 2015). Recent neural approaches also integrate language modeling and considerably improve the results of both language and discourse modeling. It is unclear whether this approach can be extended to label sets other than the four top-level senses.

Why do we care about robustness against varying label sets? Discourse analysis differs from other linguistic annotation in that there is no single correct sense inventory or annotation for discourse. In other words, discourse analysis depends largely on the domain of application or study. One sense inventory (label set) might be appropriate for one application but not for others. Rhetorical Structure Theory (RST), for example, argues that its sense inventory is one of many valid ones under the same theory (Mann and Thompson, 1988). The Penn Discourse

Treebank (PDTB) organizes the labels hierarchically to allow some flexibility in analysis at various levels depending on the application. This suggests that a discourse parsing technique that is designed or tuned for one specific label set might not be as useful in real applications.

Another concern with regard to robustness and applicability comes from the fact that discourse phenomena should be less language dependent than other aspects of linguistic analysis, such as syntactic or morphological analysis. For example, not every language has a case-marking system

, but every language has a causal discourse relation, with or without discourse connectives. That suggests that a good algorithm for discourse parsing must be applicable to many languages. From an engineering point of view, we would like to extend what we know from English discourse analysis to other languages in a straightforward fashion if possible. Discourse relations are being annotated in the style of the PDTB, and multilingual discourse parsing has gained some attention in recent years (Zhou and Xue, 2012; Oza et al., 2009; Al-Saif and Markert, 2010; Zeyrek and

Webber, 2008). More notably, the Chinese Discourse Treebank (CDTB) has also been annotated with non-explicit discourse relations, unlike the other languages.

Here, we propose a light-weight yet robust non-explicit discourse relation classifier. The core of the model comes from word vectors and simple feedforward architecture presented in the previous

chapter. We show that our model performs better under many label sets than models that use surface features derived from syntactic parses or a semantic lexicon. Our model also outperforms such a system in Chinese, making it the first Chinese neural discourse parser. All of this is achieved with around 200k parameters and minutes of training on a CPU, without automatic parses or other manually annotated resources.

Figure 7.1: Model architecture.

7.2 Model description

The Arg1 vector $a^1$ and Arg2 vector $a^2$ are computed by applying an element-wise pooling function $f$ to all of the $N_1$ word vectors in Arg1, $w^1_{1:N_1}$, and all of the $N_2$ word vectors in Arg2, $w^2_{1:N_2}$, respectively:

$$a^1_i = \sum_{j=1}^{N_1} w^1_{j,i} \qquad a^2_i = \sum_{j=1}^{N_2} w^2_{j,i}$$

Inter-argument interaction is modeled directly by the hidden layers that take argument vectors as features. Discourse relations cannot be determined based on the two arguments individually.

Instead, the sense of the relation can only be determined when the arguments in a discourse relation


are analyzed jointly. The first hidden layer $h_1$ is a non-linear transformation of the weighted linear combination of the argument vectors:

$$h_1 = \tanh(W_1 \cdot a^1 + W_2 \cdot a^2 + b_{h_1})$$

where $W_1$ and $W_2$ are $d \times k$ weight matrices and $b_{h_1}$ is a $d$-dimensional bias vector. Further hidden layers $h_t$ and the output layer $o$ follow the standard feedforward neural network model:

$$h_t = \tanh(W_{h_t} \cdot h_{t-1} + b_{h_t}) \qquad o = \mathrm{softmax}(W_o \cdot h_T + b_o)$$

where $W_{h_t}$ is a $d \times d$ weight matrix, $b_{h_t}$ is a $d$-dimensional bias vector, and $T$ is the number of hidden layers in the network.

This model has been studied more extensively in Chapter 6. We have experimented with and tuned most of its components: the pooling function that creates the argument vectors, the type of word vectors, and the model architecture itself. We found the model variant with two hidden layers and 300 hidden units to work well across many settings. The model has a total of around 270k parameters.
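The parameter count can be checked with a quick back-of-the-envelope calculation. The sketch below assumes 300-dimensional word vectors, two hidden layers of 300 units, and a 15-way sense inventory; these sizes are illustrative assumptions that roughly match the configuration described here.

def parameter_count(k=300, d=300, n_senses=15, n_hidden=2):
    # First hidden layer: two d-by-k projections (one per argument) plus a bias.
    first = 2 * d * k + d
    # Each additional hidden layer: a d-by-d matrix plus a bias.
    extra = (n_hidden - 1) * (d * d + d)
    # Output layer: an n_senses-by-d matrix plus a bias.
    out = n_senses * d + n_senses
    return first + extra + out

print(parameter_count())   # 275115, roughly the 270k parameters quoted above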

7.3 Experiments

7.3.1 Data

The English and Chinese data are taken from the Penn Discourse Treebank and the Chinese Discourse Treebank respectively (Prasad et al., 2008; ?). We use all non-explicit discourse relations, which include Entity Relation (EntRel) and Alternative Lexicalization (AltLex) in addition to implicit discourse relations. We use the sense inventory of the CoNLL 2015-2016 Shared Task

(Xue et al., 2015), which differs slightly from the settings that we have used in all previous chapters.


7.3.2 Word vectors

English word vectors are taken from 300-dimensional Skip-gram word vectors trained on Google

News data, provided by the shared task organizers (Mikolov et al., 2013a; Xue et al., 2015). We trained our own 250-dimensional Chinese word vectors on the Gigaword corpus, which is the same corpus used for the 300-dimensional Chinese word vectors provided by the shared task organizers (Graff and Chen, 2005). We found the 250-dimensional version to work better despite having fewer parameters.

7.3.3 Training

Weight initialization is uniform random, following the formula recommended by Bengio (2012).

Word vectors are fixed during training. The cost function is the standard cross-entropy loss function, and we use Adagrad as the optimization algorithm of choice. We monitor the accuracy on the development set to determine convergence.

7.3.4 Implementation

All of the models are implemented in Theano (Bergstra et al., 2010; Bastien et al., 2012). The gradient computation is done with symbolic differentiation, a functionality provided by Theano.

The models are trained on CPUs on Intel Xeon X5690 3.47GHz, using only a single core per model.

The models converge in minutes. The implementation, the training script, and the trained model are already made available.1

7.3.5 Baseline

The winning system from last year’s task serves as a strong baseline for English. We choose this system because it represents one of the strongest systems that utilizes exclusively surface features

1https://github.com/attapol/nn discourse parser

and an extensive semantic lexicon (Wang and Lan, 2015). This approach uses a MaxEnt model loaded with millions of features.

We use Brown cluster pair features as the baseline for Chinese as there is no previous system for

Chinese. We use 3200 clusters to create features and perform feature selection on the development set based on the information gain criterion. We end up with 10,000 features in total.

                                          Development set      Test set             Blind test set
Sense                                      Baseline   Ours      Baseline   Ours      Baseline   Ours
Comparison.Concession                      0          0         0          0         0          0
Comparison.Contrast                        0.098      0.1296    0.1733     0.1067    0          0
Contingency.Cause.Reason                   0.4398     0.3514    0.3621     0.4       0.2878     0.3103
Contingency.Cause.Result                   0.2597     0.1951    0.1549     0.1722    0.2254     0.1818
EntRel                                     0.6247     0.5613    0.5265     0.4892    0.5471     0.5516
Expansion.Alternative.Chosen alternative   0          0         0          0         0          0
Expansion.Conjunction                      0.4591     0.3874    0.3068     0.2468    0.3154     0.2644
Expansion.Instantiation                    0.2105     0.4051    0.3261     0.4962    0.1633     0.25
Expansion.Restatement                      0.3482     0.3454    0.2923     0.3483    0.3232     0.2991
Temporal.Asynchronous.Precedence           0          0.0714    0          0         0          0.125
Temporal.Asynchronous.Succession           0          0         0          0         0          0
Temporal.Synchrony                         0          0         0          0         0          0
Accuracy                                   0.4331     0.4032    0.3455     0.3613    0.3629     0.3767
Most-frequent-tag Acc.                     0.2320               0.2844               0.2136

Table 7.1: F1 scores for English non-explicit discourse relation. The bold-faced numbers highlight the senses where the classification of our model and the baseline model might be complementary.

7.4 Error Analysis

Comparing confusion matrices from the two approaches helps us understand further what the neural networks have achieved. We approximate Bayes Factors with a uniform prior for each sense pair $(c_i, c_j)$ for gold standard $g$ and system prediction $p$:

$$\frac{P(p = c_i, g = c_j)}{P(p = c_i)\,P(g = c_j)}$$
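A minimal sketch of this computation from a confusion matrix is given below; the toy matrix and the cut-off value are placeholders, and with uniform priors the approximate Bayes factor reduces to the ratio of the joint probability to the product of the marginals.

import numpy as np

def confusion_pair_scores(confusion):
    # confusion[i, j] = number of instances predicted as sense i whose gold sense is j.
    total = confusion.sum()
    joint = confusion / total                       # P(p = c_i, g = c_j)
    p_pred = joint.sum(axis=1, keepdims=True)       # P(p = c_i)
    p_gold = joint.sum(axis=0, keepdims=True)       # P(g = c_j)
    with np.errstate(divide="ignore", invalid="ignore"):
        return joint / (p_pred * p_gold)            # approximate Bayes factor

scores = confusion_pair_scores(np.array([[30., 5., 1.],
                                         [2., 40., 8.],
                                         [0., 3., 11.]]))
significant = np.argwhere(scores > 2.0)             # pairs above an arbitrary cut-off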


                          Development set      Test set             Blind test set
Sense                      Baseline   Ours      Baseline   Ours      Baseline   Ours
Alternative                0          0         0          0         0          0
Causation                  0          0         0          0         0          0
Conditional                0          0         0          0         0          0
Conjunction                0.7830     0.7928    0.7911     0.8055    0.7875     0.7655
Contrast                   0          0         0          0         0          0
EntRel                     0.4176     0.4615    0.5175     0.5426    0.0233     0.0395
Expansion                  0.4615     0.4167    0.2333     0.4333    0.2574     0.5104
Purpose                    0          0         0          0         0          0
Temporal                   0          0         0          0         0          0
Accuracy                   0.6634     0.683     0.6657     0.7047    0.6437     0.6338
Most-frequent-tag Acc.     0.6176               0.6351               0.7914

Table 7.2: Sense-wise F1 scores for Chinese non-explicit discourse relation.

The true sense is ...

, confused as ... by ... Instantiation Contrast Result Precedence Synchrony Conjunction + #+ #+ Restatement + Result ## Reason +

Table 7.3: Confusion pairs made by our neural network (#) and the baseline surface features (+) in English.

We tabulate all significant confusion pairs (i.e. Bayes Factor greater than a cut-off) made by each of the systems (Table 7.3). This is done on the development set only.

The distribution of the confusion pairs suggests that the neural network and surface feature systems complement each other in some way. We see that the two systems share only two confusion pairs in common.

Temporal.Asynchronous senses are confused with Conjunction by both systems. Temporal senses are difficult to classify in implicit discourse relations since the annotation itself can be quite ambiguous. Expansion.Instantiation relations are misclassified as Expansion.Restatement

by the surface feature system. The neural network system performs better on Expansion.Instantiation than the surface feature system, probably because it can tease apart Expansion.Instantiation and Expansion.Restatement.

7.5 Conclusions

We present a light-weight neural network model, which is easy to deploy, retrain, and adapt to other languages and label sets. The model only needs word vectors trained on large corpora. Our approach performs competitively with, if not better than, traditional systems with surface features and a MaxEnt model, despite having one or two orders of magnitude fewer parameters.

Chapter 8

Conclusions and Future Work

Over the course of this dissertation, we have explored the task of implicit discourse relation classification in various directions. We have developed Brown cluster pair features, new features under the traditional feature-engineering paradigm. These features address the problem of feature sparsity and also capture the inter-argument interaction, which is characteristic of discourse phenomena.

We have also exploited signals from discourse connectives and their potential power in providing us with more training data. And as recent successes of neural network modeling have surfaced in the natural language processing community, we developed a neural discourse parser and showed that a simple neural architecture can perform comparably to or better than more complicated architectures, while remaining simple enough to replicate, deploy, and, more importantly, adapt to different label sets and languages.

The task of implicit discourse relation classification, and discourse parsing in general, has gained much attention as a body of research has provided the foundation for the basic paradigm of computational approaches. We have organized a shared task, in which researchers were invited to work on this task over the same period of time. Many research groups around the world have participated in our endeavor. We hope that such an endeavor will keep the research momentum going and lead to further understanding and improvement in this task.


Implicit discourse relation classification still needs significant improvement, as previous evaluations have suggested. Here, we propose future research directions based on what we have learned as a research community and on the work presented previously.

8.1 Distant Supervision with Neural Network

The success stories that we see with neural networks in other domains mostly involve large amounts of data obtained through expert annotation, crowdsourcing, or automatic means. We indeed need a lot more data than what is available now. It is difficult or even infeasible to detect causal relations between spans of text because it requires some world knowledge of causes and effects. We cannot manually annotate enough data for discourse without incurring exorbitant cost. This is the motivation for distant supervision for discourse parsing.

Future efforts must be made to obtain more training data automatically based on the existing data and existing neural models. Our distant supervision work in Chapter 4 has suggested that this paradigm is plausible, but it needs much better criteria for data selection. Neural networks have been shown to be very effective discriminative models, so we can harness their power to harvest more data. For example, one might build a classifier directly to harvest more data that are similar enough to the implicit discourse relations annotated in the PDTB.

Another approach along the same vein involves perturbing the data to make the models more robust. This technique is quite standard in preprocessing the data for computer vision. The perturbation artificially expands the dataset and makes the models learn how to distinguish noise better. Essentially, we inject small noise without changing the labels to make the learning process focus better on the signals for the labels. However, it is unclear how one should perturb discourse data for this purpose. This is a research direction that deserves investigation.

8.2 Neural Network Model as a Cognitive Model

We have hardly scratched the surface of what a neural discourse parser can do. One of the strengths of this class of model is its capacity to model complicated processes explicitly. The progress on this task, however, is limited by our knowledge of how humans process and comprehend discourse.

Linguistic theories that deal with discourse-level phenomena are relatively few compared to other branches of linguistics. Neural networks have mostly been used in NLP as discriminative models without trying to model how humans process language. Neural network models can be used as cognitive models of discourse comprehension, as they have been used for modeling cognitive faculties such as morphological learning (McClelland and Patterson, 2002), reinforcement learning (Riedmiller, 2005), and motor program learning (Kohonen, 1990). This aspect is often overlooked, as the convolutional networks (Krizhevsky et al., 2012) in computer vision and deep neural networks (Senior et al., 2015) in speech recognition show excellent classification accuracy without modeling human cognition. A neural caption generator hardly addresses how a human perceives a picture, understands its context, and describes it in language. This points to the need for more precise modeling, which hopefully will also improve the performance.

8.3 Discourse Analysis in Other Languages and Domains

Initial efforts have already been made to make a discourse parser work in more languages and different domains. We have previously made a somewhat strong claim that a theory in discourse analysis or an algorithm in discourse parsing should apply universally to most if not all languages, as the phenomena do not depend on linguistic typology or phylogeny. This remains just a claim, as we do not have strong enough empirical evidence to support it. It is unfortunate that implicit discourse relation annotation is only available in English and Chinese as far as we are aware.


More annotation will obviously need to be conducted.

Discourse analysis covers monologue, dialogue, and multi-party conversation. We have limited the scope of our work to written text or monologue, but we believe that the approaches we developed can be applied to different discourse structures. Dialogue and conversation parsing has grown more important over the past years as our mode of communication has shifted toward mobile texting, social media, and online forums. In these cases, one can ask similar questions: how does one parse a dialogue or multi-party conversation into multiple or smaller discourse units? We hypothesize that the methods used in parsing text or monologue can be augmented with structured models to parse out the relationships between text messages.

8.4 Concluding Remarks

Discourse analysis is indeed a crucial next step in natural language processing, yet the theoretical grounds in this realm are underexplored. A functional discourse parser will open doors for many more applications of natural language understanding, dialogue management, machine translation, and more. It is our hope that we will formulate better theories or models that help us understand discourse phenomena and raise the performance of a discourse parser to the practical level.

Bibliography

[Al-Saif and Markert2010] Amal Al-Saif and Katja Markert. 2010. The leeds arabic discourse treebank: Annotating discourse connectives for arabic. In LREC.

[Bahdanau et al.2014] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473.

[Baroni et al.2014] Marco Baroni, Georgiana Dinu, and Germán Kruszewski. 2014. Don’t count, predict! a systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, volume 1, pages 238–247.

[Bastien et al.2012] Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, James Bergstra, Ian J. Goodfellow, Arnaud Bergeron, Nicolas Bouchard, and Yoshua Bengio. 2012. Theano: new features and speed improvements. Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop.

[Bengio2012] Yoshua Bengio. 2012. Practical recommendations for gradient-based training of deep architectures. In Neural Networks: Tricks of the Trade, pages 437–478. Springer.

[Bergstra et al.2010] James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio. 2010. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy), June. Oral Presentation.

[Bezanson et al.2012] Jeff Bezanson, Stefan Karpinski, Viral B Shah, and Alan Edelman. 2012. Julia: A fast dynamic language for technical computing. arXiv preprint arXiv:1209.5145.

[Biran and McKeown2013] Or Biran and Kathleen McKeown. 2013. Aggregated word pair features for implicit discourse relation disambiguation. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 69–73. The Association for Computational Linguistics.

[Bird2006] Steven Bird. 2006. NLTK: the natural language toolkit. In Proceedings of the COLING/ACL on Interactive presentation sessions, pages 69–72. Association for Computational Linguistics.


[Blacoe and Lapata2012] William Blacoe and Mirella Lapata. 2012. A comparison of vector-based representations for semantic composition. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 546–556. Association for Computational Linguistics.

[Braud and Denis2015] Chloé Braud and Pascal Denis. 2015. Comparing word representations for implicit discourse relation classification. In Empirical Methods in Natural Language Processing (EMNLP 2015).

[Brown et al.1992] Peter F Brown, Peter V deSouza, Robert L Mercer, Vincent J Della Pietra, and Jenifer C Lai. 1992. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467–479, December.

[Carlson et al.2003] Lynn Carlson, Daniel Marcu, and Mary Ellen Okurowski. 2003. Building a discourse-tagged corpus in the framework of rhetorical structure theory. Springer.

[Chang and Lin2011] Chih-Chung Chang and Chih-Jen Lin. 2011. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

[Chen and Manning2014] Danqi Chen and Christopher D Manning. 2014. A fast and accurate dependency parser using neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), volume 1, pages 740–750.

[Chiu and Nichols2015] Jason PC Chiu and Eric Nichols. 2015. Named entity recognition with bidirectional lstm-cnns. arXiv preprint arXiv:1511.08308.

[Cho et al.2014] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using rnn encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, Doha, Qatar, October. Association for Computational Linguistics.

[De Marneffe et al.2006] Marie-Catherine De Marneffe, Bill MacCartney, Christopher D Manning, et al. 2006. Generating typed dependency parses from phrase structure parses. In Proceedings of LREC, volume 6, pages 449–454.

[Devlin et al.2014] Jacob Devlin, Rabih Zbib, Zhongqiang Huang, Thomas Lamar, Richard Schwartz, and John Makhoul. 2014. Fast and robust neural network joint models for statistical machine translation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, volume 1, pages 1370–1380.

[Duchi et al.2011a] John Duchi, Elad Hazan, and Yoram Singer. 2011a. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121–2159.


[Duchi et al.2011b] John Duchi, Elad Hazan, and Yoram Singer. 2011b. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121–2159.

[Elman1990] Jeffrey L Elman. 1990. Finding structure in time. Cognitive science, 14(2):179–211.

[Erhan et al.2010] Dumitru Erhan, Yoshua Bengio, Aaron Courville, Pierre-Antoine Manzagol, Pascal Vincent, and Samy Bengio. 2010. Why does unsupervised pre-training help deep learning? The Journal of Machine Learning Research, 11:625–660.

[Feng and Hirst2012] Vanessa Wei Feng and Graeme Hirst. 2012. Text-level discourse parsing with rich linguistic features. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1, pages 60–68. Association for Computational Linguistics.

[Feng and Hirst2014] Vanessa Wei Feng and Graeme Hirst. 2014. A linear-time bottom-up discourse parser with constraints and post-editing. In Proceedings of The 52nd Annual Meeting of the Association for Computational Linguistics (ACL 2014), Baltimore, USA, June.

[Finkel et al.2005] Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating non-local information into information extraction systems by gibbs sampling. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pages 363–370. Association for Computational Linguistics.

[Fraser1996] Bruce Fraser. 1996. Pragmatic markers. Pragmatics, 6:167–190.

[Fraser2006] Bruce Fraser. 2006. Towards a theory of discourse markers. Approaches to discourse particles, 1:189–204.

[Gers et al.2000] Felix A Gers, Jürgen Schmidhuber, and Fred Cummins. 2000. Learning to forget: Continual prediction with lstm. Neural computation, 12(10):2451–2471.

[Graff and Chen2005] David Graff and Ke Chen. 2005. Chinese gigaword. LDC Catalog No.: LDC2003T09, ISBN, 1:58563–58230.

[Graff et al.2007] D Graff, J Kong, K Chen, and K Maeda. 2007. English gigaword third edition. LDC2007T07.

[Hinton et al.2006] Geoffrey E Hinton, Simon Osindero, and Yee-Whye Teh. 2006. A fast learning algorithm for deep belief nets. Neural computation, 18(7):1527–1554.

[Hinton et al.2012] Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al. 2012. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. Signal Processing Magazine, IEEE, 29(6):82–97.

[Hochreiter and Schmidhuber1997] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8):1735–1780.


[Hutchinson2005] Ben Hutchinson. 2005. Modelling the similarity of discourse connectives. In Proceedings of the 27th Annual Meeting of the Cognitive Science Society (CogSci2005).

[İrsoy and Cardie2014] Ozan İrsoy and Claire Cardie. 2014. Opinion mining with deep recurrent neural networks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 720–728.

[Jacobs et al.1991] Robert A Jacobs, Michael I Jordan, Steven J Nowlan, and Geoffrey E Hinton. 1991. Adaptive mixtures of local experts. Neural computation, 3(1):79–87.

[Japkowicz2000] Nathalie Japkowicz. 2000. Learning from imbalanced data sets: a comparison of various strategies. In AAAI workshop on learning from imbalanced data sets, volume 68.

[Ji and Eisenstein2014] Yangfeng Ji and Jacob Eisenstein. 2014. Representation learning for text-level discourse parsing. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, volume 1, pages 13–24.

[Ji and Eisenstein2015] Yangfeng Ji and Jacob Eisenstein. 2015. One vector is not enough: Entity-augmented distributed semantics for discourse relations. Transactions of the Association for Computational Linguistics, 3:329–344.

[Ji et al.2016] Yangfeng Ji, Gholamreza Haffari, and Jacob Eisenstein. 2016. A latent variable recurrent neural network for discourse relation language models. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

[Jones et al.2006] Rosie Jones, Benjamin Rey, Omid Madani, and Wiley Greiner. 2006. Generating query substitutions. In Proceedings of the 15th international conference on World Wide Web, pages 387–396. ACM.

[Jordan and Ng2002] Michael Jordan and Andrew Ng. 2002. On discriminative vs. generative classifiers: A comparison of logistic regression and naive bayes. Advances in neural information processing systems, 14:841.

[Kalchbrenner and Blunsom2013] Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent convolutional neural networks for discourse compositionality. In Proceedings of the Workshop on Continuous Vector Space Models and their Compositionality, pages 119–126, Sofia, Bulgaria, August. Association for Computational Linguistics.

[Kiros et al.2015] Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors. In Advances in Neural Information Processing Systems, pages 3276–3284.

[Klein and Manning2003] Dan Klein and Christopher D Manning. 2003. Accurate unlexicalized parsing. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 423–430, Morristown, NJ, USA. Association for Computational Linguistics.


[Kohonen1990] Teuvo Kohonen. 1990. The self-organizing map. Proceedings of the IEEE, 78(9):1464–1480.

[Krizhevsky et al.2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105.

[Lascarides and Asher2007] Alex Lascarides and Nicholas Asher. 2007. Segmented discourse representation theory: Dynamic semantics with discourse structure. In Computing meaning, pages 87–124. Springer.

[Le and Mikolov2014] Quoc V Le and Tomas Mikolov. 2014. Distributed representations of sen- tences and documents. arXiv preprint arXiv:1405.4053.

[LeCun et al.2010] Yann LeCun, Koray Kavukcuoglu, and Clément Farabet. 2010. Convolutional networks and applications in vision. In Circuits and Systems (ISCAS), Proceedings of 2010 IEEE International Symposium on, pages 253–256. IEEE.

[Lee et al.2011] Heeyoung Lee, Yves Peirsman, Angel Chang, Nathanael Chambers, Mihai Surdeanu, and Dan Jurafsky. 2011. Stanford’s multi-pass sieve coreference resolution system at the conll-2011 shared task. In Proceedings of the Fifteenth Conference on Computational Natural Language Learning: Shared Task, pages 28–34. Association for Computational Linguistics.

[Lee et al.2013] Heeyoung Lee, Angel Chang, Yves Peirsman, Nathanael Chambers, Mihai Surdeanu, and Dan Jurafsky. 2013. Deterministic coreference resolution based on entity-centric, precision-ranked rules. Computational Linguistics.

[Lee2001] Lillian Lee. 2001. On the effectiveness of the skew divergence for statistical language analysis. In Artificial Intelligence and Statistics, volume 2001, pages 65–72.

[Levin1993] Beth Levin. 1993. English verb classes and alternations: A preliminary investigation, volume 348. University of Chicago Press, Chicago.

[Li and Nenkova2014] Junyi Jessy Li and Ani Nenkova. 2014. Reducing sparsity improves the recognition of implicit discourse relations. In Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL), pages 199–207, Philadelphia, PA, U.S.A., June. Association for Computational Linguistics.

[Li et al.2015] Jiwei Li, Thang Luong, Dan Jurafsky, and Eduard Hovy. 2015. When are tree structures necessary for deep learning of representations? In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2304–2314, Lisbon, Portugal, September. Association for Computational Linguistics.

[Lin et al.2009] Ziheng Lin, Min-Yen Kan, and Hwee Tou Ng. 2009. Recognizing implicit discourse relations in the penn discourse treebank. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1-Volume 1, pages 343–351. Association for Computational Linguistics.


[Lin et al.2010a] Ziheng Lin, Hwee Tou Ng, and Min-Yen Kan. 2010a. A PDTB-Styled End-to-End Discourse Parser. arXiv.org, November.

[Lin et al.2010b] Ziheng Lin, Hwee Tou Ng, and Min-Yen Kan. 2010b. A pdtb-styled end-to-end discourse parser. CoRR, abs/1011.0835.

[Lin et al.2014] Ziheng Lin, Hwee Tou Ng, and Min-Yen Kan. 2014. A pdtb-styled end-to-end discourse parser. Natural Language Engineering, 20(02):151–184.

[Littlestone1988] Nick Littlestone. 1988. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine learning, 2(4):285–318.

[Ljubesic et al.2008] N Ljubesic, Damir Boras, Nikola Bakaric, and Jasmina Njavro. 2008. Comparing measures of semantic similarity. In Information Technology Interfaces, 2008. ITI 2008. 30th International Conference on, pages 675–682. IEEE.

[lou2014] Annie Louis and Ani Nenkova. 2014. Verbose, Laconic or Just Right: A Simple Computational Model of Content Appropriateness under Length Constraints. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics. Association for Computational Linguistics.

[Louis and Nenkova2010] Annie Louis and Ani Nenkova. 2010. Creating local coherence: An empirical assessment. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 313–316. Association for Computational Linguistics.

[Louis et al.2010] Annie Louis, Aravind Joshi, Rashmi Prasad, and Ani Nenkova. 2010. Using entity features to classify implicit discourse relations. In Proceedings of the 11th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 59–62. Association for Computational Linguistics.

[Mann and Thompson1988] William C Mann and Sandra A Thompson. 1988. Rhetorical structure theory: Toward a functional theory of text organization. Text, 8(3):243–281.

[Marchetti-Bowick and Chambers2012] Micol Marchetti-Bowick and Nathanael Chambers. 2012. Learning for microblogs with distant supervision: Political forecasting with twitter. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 603–612. Association for Computational Linguistics.

[Marcu and Echihabi2002a] Daniel Marcu and Abdessamad Echihabi. 2002a. An unsupervised approach to recognizing discourse relations. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 368–375. Association for Computational Linguistics.

[Marcu and Echihabi2002b] Daniel Marcu and Abdessamad Echihabi. 2002b. An unsupervised approach to recognizing discourse relations. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 368–375. Association for Computational Linguistics.


[Marcu1999] Daniel Marcu. 1999. Discourse trees are good indicators of importance in text. Advances in automatic text summarization, pages 123–136.

[Marcus et al.1993] Mitchell P Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a large annotated corpus of english: The penn treebank. Computational linguistics, 19(2):313–330.

[McCallum2002] Andrew Kachites McCallum. 2002. Mallet: A machine learning for language toolkit. http://www.cs.umass.edu/~mccallum/mallet.

[McClelland and Patterson2002] James L McClelland and Karalyn Patterson. 2002. Rules or connections in past-tense inflections: What does the evidence rule out? Trends in cognitive sciences, 6(11):465–472.

[Mikolov et al.2013a] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781.

[Mikolov et al.2013b] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

[Mintz et al.2009] Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2-Volume 2, pages 1003–1011. Association for Computational Linguistics.

[Mohamed et al.2012] Abdel-rahman Mohamed, George E Dahl, and Geoffrey Hinton. 2012. Acoustic modeling using deep belief networks. Audio, Speech, and Language Processing, IEEE Transactions on, 20(1):14–22.

[Oza et al.2009] Umangi Oza, Rashmi Prasad, Sudheer Kolachina, Dipti Misra Sharma, and Aravind Joshi. 2009. The hindi discourse relation bank. In Proceedings of the third linguistic annotation workshop, pages 158–161. Association for Computational Linguistics.

[Park and Cardie2012] Joonsuk Park and Claire Cardie. 2012. Improving implicit discourse relation recognition through feature set optimization. In Proceedings of the 13th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 108–112. Association for Computational Linguistics.

[Paşca et al.2006] Marius Paşca, Dekang Lin, Jeffrey Bigham, Andrei Lifchits, and Alpa Jain. 2006. Names and similarities on the web: fact extraction in the fast lane. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics, pages 809–816. Association for Computational Linguistics.


[Patterson and Kehler2013] Gary Patterson and Andrew Kehler. 2013. Predicting the presence of discourse connectives. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.

[Pennington et al.2014] Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the Empirical Methods in Natural Language Processing (EMNLP 2014), pages 1532–1543.

[Petrov et al.2006] Slav Petrov, Leon Barrett, Romain Thibaux, and Dan Klein. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics, pages 433–440. Association for Computational Linguistics.

[Pitler and Nenkova2008] Emily Pitler and Ani Nenkova. 2008. Revisiting readability: A unified framework for predicting text quality. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 186–195. Association for Computational Linguistics.

[Pitler et al.2008] Emily Pitler, Mridhula Raghupathy, Hena Mehta, Ani Nenkova, Alan Lee, and Aravind K Joshi. 2008. Easily identifiable discourse relations. Technical Reports (CIS), page 884.

[Pitler et al.2009] Emily Pitler, Annie Louis, and Ani Nenkova. 2009. Automatic sense prediction for implicit discourse relations in text. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2-Volume 2, pages 683–691. Association for Computational Linguistics.

[Polyak1964] Boris Teodorovich Polyak. 1964. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17.

[Prasad et al.2007] Rashmi Prasad, Eleni Miltsakaki, Nikhil Dinesh, Alan Lee, Aravind Joshi, Livio Robaldo, and Bonnie L Webber. 2007. The penn discourse treebank 2.0 annotation manual.

[Prasad et al.2008] Rashmi Prasad, Nikhil Dinesh, Alan Lee, Eleni Miltsakaki, Livio Robaldo, Aravind K Joshi, and Bonnie L Webber. 2008. The penn discourse treebank 2.0. In LREC. Citeseer.

[Raghunathan et al.2010] Karthik Raghunathan, Heeyoung Lee, Sudarshan Rangarajan, Nathanael Chambers, Mihai Surdeanu, Dan Jurafsky, and Christopher Manning. 2010. A multi-pass sieve for coreference resolution. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, EMNLP ’10, pages 492–501, Stroudsburg, PA, USA. Association for Computational Linguistics.

[Ratinov et al.2011] Lev Ratinov, Dan Roth, Doug Downey, and Mike Anderson. 2011. Local and global algorithms for disambiguation to wikipedia. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1, pages 1375–1384. Association for Computational Linguistics.


[Reschke et al.2014] Kevin Reschke, Martin Jankowiak, Mihai Surdeanu, Christopher Manning, and Daniel Jurafsky. 2014. Event extraction using distant supervision. In Nicoletta Calzolari (Conference Chair), Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, and Stelios Piperidis, editors, Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), Reykjavik, Iceland, May. European Language Resources Association (ELRA).

[Riedmiller2005] Martin Riedmiller. 2005. Neural fitted q iteration–first experiences with a data efficient neural reinforcement learning method. In Machine Learning: ECML 2005, pages 317–328. Springer.

[Riloff et al.1999] Ellen Riloff, Rosie Jones, et al. 1999. Learning dictionaries for information extraction by multi-level bootstrapping. In AAAI/IAAI, pages 474–479.

[Ritter et al.2011] Alan Ritter, Sam Clark, Oren Etzioni, et al. 2011. Named entity recognition in tweets: an experimental study. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1524–1534. Association for Computational Linguistics.

[Rutherford and Xue2014] Attapol T. Rutherford and Nianwen Xue. 2014. Discovering implicit discourse relations through brown cluster pair representation and coreference patterns. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2014), Gothenburg, Sweden, April.

[Rutherford and Xue2015] Attapol Rutherford and Nianwen Xue. 2015. Improving the inference of implicit discourse relations via classifying explicit discourse connectives. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 799–808, Denver, Colorado, May–June. Association for Computational Linguistics.

[Schourup1999] Lawrence Schourup. 1999. Discourse markers. Lingua, 107(3):227–265.

[Senior et al.2015] Andrew Senior, Hasim Sak, Felix de Chaumont Quitry, Tara N. Sainath, and Kanishka Rao. 2015. Acoustic modelling with cd-ctc-smbr lstm rnns. In ASRU.

[Socher et al.2012] Richard Socher, Brody Huval, Christopher D Manning, and Andrew Y Ng. 2012. Semantic compositionality through recursive matrix-vector spaces. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1201–1211. Association for Computational Linguistics.

[Socher et al.2013a] Richard Socher, John Bauer, Christopher D Manning, and Andrew Y Ng. 2013a. Parsing with compositional vector grammars. In Proceedings of the ACL conference. Citeseer.

[Socher et al.2013b] Richard Socher, Alex Perelygin, Jean Y Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. 2013b. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1631–1642.


[Song et al.2015] Yangqiu Song, Haoruo Peng, Parisa Kordjamshidi, Mark Sammons, and Dan Roth. 2015. Improving a pipeline architecture for shallow discourse parsing. In Proceedings of the Nineteenth Conference on Computational Natural Language Learning - Shared Task, pages 78–83, Beijing, China, July. Association for Computational Linguistics.

[Speriosu et al.2011] Michael Speriosu, Nikita Sudan, Sid Upadhyay, and Jason Baldridge. 2011. Twitter polarity classification with label propagation over lexical links and the follower graph. In Proceedings of the First workshop on Unsupervised Learning in NLP, pages 53–63. Association for Computational Linguistics.

[Sporleder and Lascarides2008] Caroline Sporleder and Alex Lascarides. 2008. Using automatically labelled examples to classify rhetorical relations: An assessment. Natural Language Engineering, 14(03):369–416.

[Stone et al.1968] Philip Stone, Dexter C Dunphy, Marshall S Smith, and DM Ogilvie. 1968. The general inquirer: A computer approach to content analysis. Journal of Regional Science, 8(1).

[Sundermeyer et al.2012a] Martin Sundermeyer, Ralf Schlüter, and Hermann Ney. 2012a. Lstm neural networks for language modeling. In INTERSPEECH.

[Sundermeyer et al.2012b] Martin Sundermeyer, Ralf Schlüter, and Hermann Ney. 2012b. Lstm neural networks for language modeling. In INTERSPEECH.

[Tai et al.2015] Kai Sheng Tai, Richard Socher, and Christopher D. Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1556–1566, Beijing, China, July. Association for Computational Linguistics.

[Thamrongrattanarit et al.2013] Attapol Thamrongrattanarit, Colin Pollock, Benjamin Goldenberg, and Jason Fennell. 2013. A distant supervision approach for identifying perspectives in unstructured user-generated text. In Proceedings of the Sixth International Joint Conference on Natural Language Processing, pages 922–926, Nagoya, Japan, October. Asian Federation of Natural Language Processing.

[Toutanova and Manning2000] Kristina Toutanova and Christopher D Manning. 2000. Enriching the knowledge sources used in a maximum entropy part-of-speech tagger. In Proceedings of the 2000 Joint SIGDAT conference on Empirical methods in natural language processing and very large corpora: held in conjunction with the 38th Annual Meeting of the Association for Computational Linguistics-Volume 13, pages 63–70. Association for Computational Linguistics.

[Toutanova et al.2003] Kristina Toutanova, Dan Klein, Christopher D Manning, and Yoram Singer. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology-Volume 1, pages 173–180. Association for Computational Linguistics.


[Turian et al.2010] Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: a simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 384–394. Association for Computational Linguistics.

[Vinyals et al.2015] Oriol Vinyals, Łukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey Hinton. 2015. Grammar as a foreign language. In C. Cortes, N.D. Lawrence, D.D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2755–2763. Curran Associates, Inc.

[Wang and Lan2015] Jianxiang Wang and Man Lan. 2015. A refined end-to-end discourse parser. In Proceedings of the Nineteenth Conference on Computational Natural Language Learning - Shared Task, pages 17–24, Beijing, China, July. Association for Computational Linguistics.

[Wang et al.2012] Xun Wang, Sujian Li, Jiwei Li, and Wenjie Li. 2012. Implicit discourse relation recognition by selecting typical training examples. In Proceedings of COLING 2012, pages 2757–2772, Mumbai, India, December. The COLING 2012 Organizing Committee.

[Wang et al.2015] Longyue Wang, Chris Hokamp, Tsuyoshi Okita, Xiaojun Zhang, and Qun Liu. 2015. The dcu discourse parser for connective, argument identification and explicit sense classification. CoNLL 2015, page 89.

[Xue et al.2015] Nianwen Xue, Hwee Tou Ng, Sameer Pradhan, Rashmi Prasad, Christopher Bryant, and Attapol Rutherford. 2015. The conll-2015 shared task on shallow discourse parsing. In Proceedings of the Nineteenth Conference on Computational Natural Language Learning - Shared Task, pages 1–16, Beijing, China, July. Association for Computational Linguistics.

[Yao et al.2010] Limin Yao, Sebastian Riedel, and Andrew McCallum. 2010. Collective cross-document relation extraction without labelled data. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 1013–1023. Association for Computational Linguistics.

[Yoshida et al.2015] Yasuhisa Yoshida, Katsuhiko Hayashi, Tsutomu Hirao, and Masaaki Nagata. 2015. Hybrid approach to pdtb-styled discourse parsing for conll-2015. CoNLL 2015, page 95.

[Zeiler2012] Matthew D Zeiler. 2012. Adadelta: An adaptive learning rate method. arXiv preprint arXiv:1212.5701.

[Zeyrek and Webber2008] Deniz Zeyrek and Bonnie L Webber. 2008. A discourse resource for turkish: Annotating discourse connectives in the metu corpus. In IJCNLP, pages 65–72.

[Zhou and Xue2012] Yuping Zhou and Nianwen Xue. 2012. Pdtb-style discourse annotation of chinese text. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1, pages 69–77. Association for Computational Linguistics.

[Zhou and Xue2015] Yuping Zhou and Nianwen Xue. 2015. The chinese discourse treebank: A chinese corpus annotated with discourse relations. Language Resources and Evaluation, 49(2):397–431.
