Multimodal Semantic Understanding: Semantic Role Labeling with Vision and Language

Multimodal Semantic Understanding: Semantic Role Labeling with Vision and Language by Zijiao Yang B.E., Dalian University of Technology 2016 B.E., Ritsumeikan University, 2016 A thesis submitted to the Faculty of the Graduate School of the University of Colorado in partial fulfillment of the requirements for the degree of Master of Science Department of Computer Science 2020 Committee Members: James Martin Martha Palmer Chenhao Tan ii Yang, Zijiao (M.S., Computer Science) Multimodal Semantic Understanding: Semantic Role Labeling with Vision and Language Thesis directed by Prof. James Martin Semantic role labeling (SRL) is an important task in natural language understanding systems. Recently, researchers have moved from traditional syntactic feature based pipelines to end-to-end (merely token to tag, mostly) neural network based SRL system, and state-of-the-art performance has been reported from those systems. Nonetheless, the recent trend of inducing syntactic information back to neural models has led to a big success [Strubell et al., 2018]. On the other hand, language understanding should not be a single modal task, the most pervasive evidence is our- selves. Incorporating information from other modalities has drawn attention [Baltrusaitis et al., 2019]. This thesis introduces two different models trying to utilize image information to help SRL task. The two models use distinct ways to integrate between the vision and language modalities. Also, the models are trained on automatically generated data, and evaluated on ground truth data. The results and analysis are presented. Dedication To my mom, dad, my girlfriend Yuping, and all my friends. iv Acknowledgements I want to thank Professor. James Martin, for his help throughout my thesis process, as well as his insightful opinions on various other projects that could be related to this thesis. I would also like to thank my committee members. This thesis originates from a class project of Professor Martha Palmer's Computational Lexical Semantics class. And Professor Martha Palmer provided me valuable feedback on related topics and connects me with her student Abhidip Bhattacharyya. Also they helped with a lot problems regrading the datasets I used. I thank Professor Chenhao Tan for his valuable comments and feedback and some most important skills I learned for proceeding this thesis is within a class with him. Finally, I want to thank Abhidip Bhattacharyya for the valuable discussions we had and the tools he pointed me to. v Contents Chapter 1 Introduction 1 1.1 The Necessity of Knowledge Enrichment to NLP Systems . .1 1.2 Task Selection . .3 1.3 Goal Description and Question Proposition . .4 1.4 Thesis Outline . .4 2 Previous Work 5 2.1 Semantic Role Labeling . .5 2.2 Multimodal Learning . 10 3 Data Description 14 3.1 Datasets Description . 14 3.2 Training and Validation Data Generation . 15 3.3 Test Data Description . 18 4 Model Description 20 4.1 Text Model . 21 4.1.1 Task Formulation . 21 4.1.2 BiLSTM . 22 4.1.3 Highway Connections . 24 vi 4.1.4 Decoding . 26 4.2 Alignment Model . 26 4.2.1 Image Module . 26 4.2.2 Alignment Module . 28 4.2.3 Loss . 29 4.3 Attention Model . 29 4.3.1 Attention Module . 30 4.4 Training Setup . 30 5 Results and Analysis 33 5.1 Results . 33 5.2 Analysis . 34 5.2.1 Ablation . 34 5.2.2 Confusion Matrix . 35 5.2.3 Attention and Decoding . 36 6 Conclusion and Future Work 49 6.1 Conclusion . 49 6.2 Future Work . 50 Bibliography 52 vii Tables Table 2.1 Some of the common modifiers in current Propbank (not a complete list, only include the semantic role concerned in this thesis) [Bonial et al., 2012], each row is a modifier, its short definition, and one example. Attention on the emphasized texts. .8 5.1 This table gives the average precision, recall and F1 over five training instances with different random initialization for t-SRL, al-SRL, at-SRL. In side the brackets, the standard deviation across 5 training instances is reported . 34 5.2 This table gives the precision, recall, and f1 for each semantic role label, along with overall score for t-SRL model results (This is not a complete table) . 35 5.3 This table gives the precision, recall, and f1 for each semantic role label, along with overall score for al-SRL model results (This is not a complete table) . 36 5.4 This table gives the precision, recall, and f1 for each semantic role label, along with overall score for at-SRL model results (This is not a complete table) . 37 5.5 This table gives the result of ablation, involved models are the baseline model t-SRL, at-SRL, at-SRL without highway connection (no hi), and at-SRL without decoding (no decoding), the precision, recall and F1 for each model is reported. The number in brackets represent the F1 drop compared to at-SRL . 38 5.6 This table gives the precision, recall and F1 score with respect to individual semantic role labels as well as overall results for t-SRL. It is the computed based on the predicted results without decoding (it is not a complete table) . 39 viii 5.7 This table gives the precision, recall and F1 score with respect to individual semantic role labels as well as overall results for al-SRL. It is the computed based on the predicted results without decoding (it is not a complete table) . 39 5.8 This table gives the precision, recall and F1 score with respect to individual semantic role labels as well as overall results for at-SRL. It is the computed based on the predicted results without decoding (it is not a complete table) . 45 5.9 Performance drop without decoder on precision, recall, F-1, for t-SRL, al-SRL, and at-SRL . 45 ix Figures Figure 2.1 An example of span-based SRL . .6 3.1 An example from MSCOCO dataset. Each image is paired with five captions . 16 3.2 An example from Flickr30k dataset. The image data contains bounding boxes de- noted by color-coded squares. (The bounding boxes shown here is from original Flickr30k dataset, which is human annotated, but the bounding box features used in this thesis are generated follow [Tan and Bansal, 2019]) The entities in the captions are also color-coded. Same color indicates correspondences between bounding boxes and entities mentioned in captions. 17 4.1 3 layer BiLSTM with highway connections, pink arrow denotes the highway connection; the red boxes are the gates that control the integration of highway knowledge and hidden states. 22 4.2 The alignment model al-SRL: it consists of three parts: image module(bottom left, blue box), BiLSTM (bottom right), and alignment module (where the alignment loss is computed). A sentence embedding is extracted from BiLSTM hidden states and resized image embedding to be fed into the alignment module for computing alignment loss. Two losses are integrated to form final loss. 27 x 4.3 The attention module: the initial layers of text part are identical to t-SRL4.1. Bound- ing box features are obtained by passing the concatenation of bounding box vector and coordinates to the image module, as described in 4.2.1. Then context vector is computed for each hidden state by attention module. Finally, the context vectors and hidden states are concatenated together to make final predication. 32 5.1 The confusion matrix for t-SRL, (note it is a simplified version of complete confusion matrix. Darker color denotes more mistakes, each celli;j represents should be label i, but predicted as label j . 40 5.2 The confusion matrix for al-SRL, (note it is a simplified version of complete confusion matrix. Darker color denotes more mistakes, each celli;j represents should be label i, but predicted as label j . 41 5.3 The confusion matrix for at-SRL. (note it is a simplified version of the complete confusion matrix.) Darker color denotes more mistakes, each celli;j represents should be label i, but predicted as label j . 42 5.4 This heatmap represents the confusion difference between t-SRL and at-SRL without decoding. It is computed by subtracting the confusion matrix of at-SRL (no decoding) from t-SRL (no decoding). Blue color represents the confusion decrease for at-SRL compared with t-SRL; for example, cellLOC;ARG2 represents at-SRL is less confused about LOC with ARG2 by 11. Furthermore, red color means the confusion increase for at-SRL compared with t-SRL. 43 5.5 This heatmap represents the confusion difference between t-SRL and at-SRL with decoding. It is computed by subtracting the confusion matrix of at-SRL from t-SRL. Blue color represents the confusion decrease for at-SRL compared with t-SRL; for example, cellLOC;ARG2 represents at-SRL is less confused about LOC with ARG2 by 14. Furthermore, red color means the confusion increase for at-SRL compared with t-SRL. 44 xi 5.6 The confusion matrix for t-SRL without decoding, (note it is a simplified version of complete confusion matrix. Darker color denotes more mistakes, each celli;j represents should be label i, but predicted as label j . 46 5.7 The confusion matrix for al-SRL without decoding, (note it is a simplified version of complete confusion matrix. Darker color denotes more mistakes, each celli;j represents should be label i, but predicted as label j . 47 5.8 The confusion matrix for at-SRL without decoding, (note it is a simplified version of complete confusion matrix.

Multimodal Semantic Understanding: Semantic Role Labeling with Vision and Language

Dan Jurafsky: Curriculum Vitae

GAZETTE Dan Jurafsky

Towards a Computational History of the ACL: 1980-2008

Dan Jurafsky Linguistics and Computer Science Stanford University

Dan Jurafsky

Christopher David Manning

Dan Jurafsky

Which Words Are Hard to Recognize? Prosodic, Lexical, and Disfluency

Steven Bethard

Linguistic Structure in Statistical Machine Translation

Dan Jurafsky: Curriculum Vitae

Dan Jurafsky