Learning and Applications of Paraphrastic Representations for Natural Language, Ph.D. Thesis
LEARNING AND APPLICATIONS OF PARAPHRASTIC REPRESENTATIONS FOR NATURAL LANGUAGE

John Wieting

Ph.D. Thesis
Language Technologies Institute
Carnegie Mellon University
July 27th, 2020

John Wieting: Learning and Applications of Paraphrastic Representations for Natural Language, Ph.D. Thesis, © July 27th, 2020

ABSTRACT

Representation learning has had a tremendous impact in machine learning and natural language processing (NLP), especially in recent years. Learned representations provide useful features needed for downstream tasks, allowing models to incorporate knowledge from billions of tokens of text. The result is better performance and generalization on many important problems of interest. Often these representations can also be used in an unsupervised manner to determine the degree of semantic similarity between texts or to find semantically similar items, the latter being useful for mining paraphrases or parallel text. Lastly, representations can be probed to better understand what aspects of language have been learned, bringing an additional element of interpretability to our models.

This thesis focuses on the problem of learning paraphrastic representations for units of language. These units span from sub-words, to words, to phrases, and to full sentences, with the latter being a focal point. Our primary goal is to learn models that can encode arbitrary word sequences into a vector with the property that sequences with similar semantics are near each other in the learned vector space, and that this property transfers across domains.

We first present several simple and effective models, paragram and charagram, for learning word and sentence representations from noisy paraphrases automatically extracted from bilingual corpora. These models outperform contemporary and more complicated models on a variety of semantic evaluations.

We then propose techniques that enable deep networks to learn effective semantic representations, addressing a limitation of our prior work. We found that learning sentence representations with deeper, more expressive neural networks requires large amounts of sentential paraphrase data. Since no such resource existed, we used neural machine translation models to create ParaNMT-50M, a corpus of 50 million English paraphrases that has found numerous uses by NLP researchers, in addition to providing further gains for our learned paraphrastic sentence representations.

We next propose models for bilingual paraphrastic sentence representations. We first propose a simple and effective approach that outperforms more complicated methods on cross-lingual sentence similarity and bitext mining, and we show that we can achieve strong monolingual performance without paraphrase corpora by using parallel text alone. We then propose a generative model capable of concentrating semantic information in our embeddings and separating out extraneous information by viewing parallel text as two different views of a semantic concept. We found that this model improves performance on both monolingual and cross-lingual tasks. Lastly, we extend this bilingual model to the multilingual setting and show that it can be effective on multiple languages simultaneously, significantly surpassing contemporary multilingual models.

Finally, this thesis concludes by showing applications of our learned representations and ParaNMT-50M. The first of these is generating paraphrases with syntactic control in order to make classifiers more robust to adversarial attacks.
We found that we can generate a controlled paraphrase for a sentence by supplying just the top production of the desired constituent parse; the generated sentence then follows this structure, filling in the rest of the tree as needed to create the paraphrase. The second application is using our representations to fine-tune neural machine translation systems with minimum risk training. The conventional approach is to use BLEU (Papineni et al., 2002), since that is what is commonly used for evaluation. However, we found that using an embedding model to evaluate similarity makes the range of possible scores continuous and, as a result, introduces fine-grained distinctions between similar translations. The result is better performance on both human evaluations and BLEU score, along with faster convergence during training.
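To make the role of embedding-based similarity concrete, the following is a minimal sketch, not code from the thesis: it assumes a simple word-averaging composition over invented toy word vectors and compares sentence embeddings with cosine similarity; the vocabulary, vectors, and the embed/cosine helpers are illustrative assumptions only. Because the resulting score is continuous, it can make the kind of fine-grained distinctions between similar translations described above.

    # Minimal illustrative sketch: sentence embeddings by word averaging,
    # compared with cosine similarity. The vocabulary and 4-dimensional
    # vectors below are invented toy values, not trained paraphrastic embeddings.
    import numpy as np

    word_vectors = {
        "the": np.array([0.1, 0.0, 0.2, 0.1]),
        "cat": np.array([0.9, 0.3, 0.1, 0.0]),
        "sat": np.array([0.2, 0.8, 0.1, 0.3]),
        "dog": np.array([0.8, 0.4, 0.2, 0.1]),
        "ran": np.array([0.1, 0.7, 0.3, 0.2]),
    }

    def embed(sentence):
        # Average the vectors of in-vocabulary words to get a sentence vector.
        vecs = [word_vectors[w] for w in sentence.lower().split() if w in word_vectors]
        return np.mean(vecs, axis=0)

    def cosine(u, v):
        # Cosine similarity: a continuous score, unlike coarse n-gram overlap.
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    print(cosine(embed("the cat sat"), embed("the dog ran")))  # related sentences, score below 1.0
    print(cosine(embed("the cat sat"), embed("the cat sat")))  # identical sentences, score 1.0

In the thesis itself the embeddings are trained on paraphrase data so that such scores reflect semantic similarity rather than lexical overlap; the averaging composition here simply stands in for whichever encoder is used.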
CONTENTS

1 introduction  8
2 background  14
  2.1 Representation Learning in Natural Language Processing  14
    2.1.1 Subword Embeddings  14
    2.1.2 Word Embeddings  16
    2.1.3 Sentence Embeddings  17
    2.1.4 Contextualized Embeddings  19
  2.2 Evaluating Representations  20
    2.2.1 Word Embedding Evaluation  20
    2.2.2 Sentence Embedding Evaluation  20
  2.3 Paraphrase Corpora  24
    2.3.1 Paraphrase Database  24
    2.3.2 Sentential Paraphrase Corpora  25

i  paraphrastic embeddings  26

3 recursive neural networks and paragram word embeddings  29
  3.1 Introduction  29
  3.2 Related Work  31
  3.3 New Paraphrase Datasets  32
    3.3.1 Annotated-PPDB  32
    3.3.2 ML-Paraphrase  34
  3.4 Paraphrase Models  35
    3.4.1 Objective Functions  36
  3.5 Experiments – Word Paraphrasing  38
    3.5.1 Training Procedure  38
    3.5.2 Extracting Training Data  39
    3.5.3 Tuning and Evaluation  39
    3.5.4 Results  39
    3.5.5 Sentiment Analysis  40
    3.5.6 Scaling up to 300 Dimensions  41
  3.6 Experiments – Compositional Paraphrasing  42
    3.6.1 Training Procedure  43
    3.6.2 Evaluation and Baselines  43
    3.6.3 Bigram Paraphrasability  43
    3.6.4 Phrase Paraphrasability  45
  3.7 Qualitative Analysis  47
  3.8 Conclusion  49
4 paragram sentence representations defined and explored  51
  4.1 Introduction  52
  4.2 Related Work  54
  4.3 Models and Training  56
    4.3.1 Training  57
  4.4 Experiments  59
    4.4.1 Data  59
    4.4.2 Transfer Learning  59
    4.4.3 paragram-phrase XXL  62
    4.4.4 Using Representations in Learned Models  62
  4.5 Discussion  67
    4.5.1 Under-Trained Embeddings  68
    4.5.2 Using More PPDB  69
  4.6 Qualitative Analysis  70
  4.7 Conclusion  72
5 charagram: improved generalization by embedding words and sentences via character ngrams  73
  5.1 Introduction  73
  5.2 Related Work  74
  5.3 Models  75
  5.4 Experiments  76
    5.4.1 Word and Sentence Similarity  77
    5.4.2 Selecting Negative Examples  77
    5.4.3 POS Tagging Experiments  84
    5.4.4 Convergence  85
    5.4.5 Model Size Experiments  86
  5.5 Analysis  87
    5.5.1 Quantitative Analysis  87
    5.5.2 Qualitative Analysis  88
  5.6 Conclusion  90

ii  automatically creating paraphrase corpora for improved sentence embedding and generation tasks  91

6 revisiting recurrent networks for paraphrastic sentence embeddings  92
  6.1 Introduction  92
  6.2 Related Work  93
  6.3 Models and Training  94
    6.3.1 Models  94
    6.3.2 Training  96
  6.4 Experiments  97
    6.4.1 Transfer Learning  97
    6.4.2 Experiments with Regularization  99
    6.4.3 Supervised Text Similarity  103
  6.5 Analysis  105
    6.5.1 Error Analysis  105
    6.5.2 GRAN Gate Analysis  106
  6.6 Conclusion  108
7 learning paraphrastic sentence embeddings from back-translated bitext  109
  7.1 Introduction  109
  7.2 Related Work  110
  7.3 Neural Machine Translation  112
  7.4 Models and Training  113
    7.4.1 Models  114
    7.4.2 Training  114
  7.5 Experiments  115
    7.5.1 Evaluation  115
    7.5.2 Experimental Setup  116
    7.5.3 Dataset Comparison  116
    7.5.4 Filtering Methods . ..