University of Pennsylvania ScholarlyCommons

Publicly Accessible Penn Dissertations

2019

Exploiting Cross-Lingual Representations For Natural Language Processing

Shyam Upadhyay, University of Pennsylvania, [email protected]


Recommended Citation
Upadhyay, Shyam, "Exploiting Cross-Lingual Representations For Natural Language Processing" (2019). Publicly Accessible Penn Dissertations. 3265. https://repository.upenn.edu/edissertations/3265

Exploiting Cross-Lingual Representations For Natural Language Processing

Abstract
Traditional approaches to supervised learning require a generous amount of labeled data for good generalization. While such annotation-heavy approaches have proven useful for some Natural Language Processing (NLP) tasks in high-resource languages (like English), they are unlikely to scale to languages where collecting labeled data is difficult and time-consuming. Translating supervision available in English is also not a viable solution, because developing a good machine translation system requires expensive-to-annotate resources which are not available for most languages.

In this thesis, I argue that cross-lingual representations are an effective means of extending NLP tools to languages beyond English without resorting to generous amounts of annotated data or expensive machine translation. These representations can be learned in an inexpensive manner, often from signals completely unrelated to the task of interest. I begin with a review of different ways of inducing such representations using a variety of cross-lingual signals, and study algorithmic approaches for using them in a diverse set of downstream tasks. Examples of such tasks covered in this thesis include learning representations to transfer a trained model across languages for document classification, assist in monolingual lexical semantics tasks like word sense induction, identify asymmetric lexical relationships like hypernymy between words in different languages, or combine supervision across languages through a shared feature space for cross-lingual entity linking. In all these applications, the representations make information expressed in other languages available in English, while requiring minimal additional supervision in the language of interest.

Degree Type Dissertation

Degree Name Doctor of Philosophy (PhD)

Graduate Group Computer and Information Science

First Advisor Dan Roth

Keywords cross-lingual, low supervision, multilingual, natural language processing, representation learning

Subject Categories Computer Sciences

This dissertation is available at ScholarlyCommons: https://repository.upenn.edu/edissertations/3265

EXPLOITING CROSS-LINGUAL REPRESENTATIONS FOR NATURAL LANGUAGE PROCESSING

Shyam Upadhyay

A DISSERTATION

in

Computer and Information Science

Presented to the Faculties of the University of Pennsylvania

in

Partial Fulfillment of the Requirements for the

Degree of Doctor of Philosophy

2019

Supervisor of Dissertation

Dan Roth, Professor of Computer and Information Science

Graduate Group Chairperson

Rajeev Alur, Professor of Computer and Information Science

Dissertation Committee

Adam Kalai, Principal Researcher, Microsoft Research

Chris Callison-Burch, Professor of Computer and Information Science

Lyle Ungar, Professor of Computer and Information Science

Mitchell P. Marcus, Professor of Computer and Information Science

EXPLOITING CROSS-LINGUAL REPRESENTATIONS

FOR NATURAL LANGUAGE PROCESSING

© COPYRIGHT

2019

Shyam Upadhyay

This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 License.

To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/3.0/

Dedicated to Maa and Papa.

ACKNOWLEDGEMENT

Nothing of me is original. I am the combined effort of everybody I’ve ever known.

Chuck Palahniuk, Invisible Monsters

Very few Ph.D. students get the opportunity of working with a great advisor like Dan. Without his continuous support, patience, and his open style of advising that grants research freedom, most of the work in this thesis would not have seen the light of day. Besides being a great advisor, Dan is one of the kindest and most considerate people I know. I am immensely grateful for all that he has done for me. I am also indebted to my thesis committee: Adam, Mitch, Lyle, and Chris. Not only have they provided valuable feedback on the thesis at various stages, but they have also been a source of sage advice.

During my Ph.D., I also got the valuable opportunity to work with some of the best researchers in the field through internships at Microsoft Research and Google. Working with Ming-Wei and Scott was instrumental in shaping my thinking and temperament for research. The wisdom I got from Ming-Wei will be invaluable for the rest of my research life. The next summer at MSR Cambridge was equally enjoyable, where Adam, Matt, Kai-Wei, and James gave me the freedom to define and work on a research problem, and helped me find a way towards a solution. Lastly, the summer spent working with Gokhan, Dilek and Larry taught me how to approach problems outside my comfort zone, and to keep in mind practical considerations when thinking about new ideas. I am thankful for all these experiences, which have shaped my outlook and approach towards research.

I would like to thank former and current members of CogComp, especially Mark Sammons, Kai-Wei Chang, Rajhans Samdani, and Gourab Kundu, who helped me fit into the group in my early days. Working in CogComp also brought me in contact with Christos, Parisa, Michael, Snigdha, and Wenpeng, who took the trouble of mentoring me when they were themselves wrestling with hard research problems. Thanks to Christos and Michael for giving me research advice at different stages of my Ph.D., and for constantly reminding me "not to work too hard over the weekends". I am also indebted to Yangqiu Song, who really did all the hard work that later became Chapter 4. Special thanks to Snigdha, who was influential in helping me frame the early draft of this thesis.

During my stay at UIUC and UPenn, I developed close friendships with brilliant fellow students like Subhro, Haoruo, Nitish, Daniel, John, Chen-Tse, Stephen, Dan (Deutsch), Qiang, Anne, Jordan, Reno, João, Daphne, and others. When I reflect on how much I have learned from all of you over the years, it amazes me. I am sure you will have prolific and satisfying careers ahead of you. I will dearly miss the discussions over lunches and happy hours we had. I will especially miss the evening meals I had with Subhro, Nitish, Snigdha, and Shashank. The not-so-good Indian food in the US will taste even worse without your company.

I would be remiss in not thanking my remote collaborators: Manaal, Chris (Dyer), Yogarshi, and Marine. Working with all of you was a unique experience, and I have acquired countless skills and research wisdom from our interactions. Special thanks to Manaal, whom I trouble for research and career advice from time to time.

It would be unfair not to thank the folks behind the scenes who ensure things run smoothly in CogComp at UIUC and UPenn. Thanks to Dawn Cheek, Eric Horn and Jennifer Sheffield, who often went out of their way to help students like me.

Lastly, I would like to thank Maa, Papa and everyone in my family whose efforts and sacrifices got me here. Thanks for your unconditional love and support when things were not going well, and for uplifting my spirits when I was facing the despair that naturally accompanies the research process. This is for you.

ABSTRACT

EXPLOITING CROSS-LINGUAL REPRESENTATIONS

FOR NATURAL LANGUAGE PROCESSING

Shyam Upadhyay

Dan Roth

Traditional approaches to supervised learning require a generous amount of labeled data for good generalization. While such annotation-heavy approaches have proven useful for some Natural Language Processing (NLP) tasks in high-resource languages (like English), they are unlikely to scale to languages where collecting labeled data is difficult and time-consuming. Translating supervision available in English is also not a viable solution, because developing a good machine translation system requires expensive-to-annotate resources which are not available for most languages.

In this thesis, I argue that cross-lingual representations are an effective means of extending NLP tools to languages beyond English without resorting to generous amounts of annotated data or expensive machine translation. These representations can be learned in an inexpensive manner, often from signals completely unrelated to the task of interest. I begin with a review of different ways of inducing such representations using a variety of cross-lingual signals, and study algorithmic approaches for using them in a diverse set of downstream tasks. Examples of such tasks covered in this thesis include learning representations to transfer a trained model across languages for document classification, assist in monolingual lexical semantics tasks like word sense induction, identify asymmetric lexical relationships like hypernymy between words in different languages, or combine supervision across languages through a shared feature space for cross-lingual entity linking. In all these applications, the representations make information expressed in other languages available in English, while requiring minimal additional supervision in the language of interest.

Contents

ACKNOWLEDGEMENT ...... iv

ABSTRACT ...... vi

LIST OF TABLES ...... xviii

LIST OF ILLUSTRATIONS ...... xxiii

CHAPTER 1 : Introduction ...... 1

1.1 Overview ...... 1

1.2 Thesis Statement ...... 4

1.3 Outline of This Thesis ...... 4

CHAPTER 2 : History of Word Representations ...... 7

2.1 Introduction ...... 7

2.2 One-Hot Word Representations ...... 7

2.3 Distributional Semantics ...... 8

2.4 Cluster-based Word Representations ...... 11

2.5 Vector-based Word Representations ...... 13

2.6 Sparse Word Representations using Count Models ...... 15

2.7 Dense Word Representations using Exact Matrix Factorization ...... 17

2.8 Dense Word Representations using Neural Language Modeling ...... 19

2.9 Dense Word Representations using Approximate Matrix Factorization . . . 20

2.10 Limitations of Word Representations ...... 23

CHAPTER 3 : Cross-lingual Word Representations ...... 26

3.1 Introduction ...... 26

3.2 Motivation ...... 26

3.3 Cluster-based Cross-lingual Representations ...... 28

3.4 Vector-based Cross-lingual Representations ...... 29

3.4.1 Sentence and Word-level Alignment ...... 30

3.4.2 Sentence-level Alignment ...... 31

3.4.3 Word-level Alignment ...... 32

3.4.4 Document-level Alignment ...... 33

3.5 Experiments ...... 34

3.5.1 Monolingual Lexical Similarity ...... 35

3.5.2 Cross-lingual Dictionary Induction ...... 37

3.5.3 Cross-lingual Document Classification ...... 38

3.5.4 Cross-lingual Dependency Parsing ...... 40

3.6 Summary ...... 42

CHAPTER 4 : Deriving Cross-lingual Representations from Wikipedia ...... 43

4.1 Introduction ...... 43

4.2 Background ...... 43

4.2.1 Explicit Semantic Representations (ESA) ...... 43

4.2.2 Dataless Document Classification using ESA ...... 46

4.3 Inter-lingual Structure of Wikipedia ...... 47

4.4 Cross-lingual ESA (CLESA) ...... 48

4.5 Experimental Setup and Experiments ...... 50

4.5.1 CLESA v/s Supervised Classification ...... 50

4.5.2 CLESA v/s Monolingual ESA ...... 51

4.6 Summary ...... 52

CHAPTER 5 : Multi-sense Representation Learning with Cross-lingual Signals . . 54

5.1 Introduction ...... 54

5.2 Related Work — Representing Word Senses ...... 55

5.2.1 Sense Representations using Lexical Resources ...... 55

5.2.2 Data-driven Sense Representations ...... 57

5.2.3 Motivation for Our Work ...... 58

5.3 Model Description ...... 59

5.4 Learning and Disambiguation ...... 62

5.5 Multilingual Extension ...... 64

5.6 Experimental Setup ...... 65

5.7 Experiments ...... 67

5.7.1 Word Sense Induction (WSI) Results ...... 67

5.7.2 Effect of Language Family Distance ...... 68

5.7.3 Effect of Window Size ...... 69

5.7.4 Qualitative Analysis of Induced Senses ...... 70

5.8 Summary ...... 71

CHAPTER 6 : Identifying Hypernymy across Languages ...... 73

6.1 Introduction ...... 73

6.2 History of the Hypernymy Detection Task ...... 73

6.2.1 Hypernymy Detection in English ...... 74

6.2.2 Cross-lingual Hypernymy Detection ...... 75

6.3 The BiSparse-Dep Approach ...... 76

6.4 Crowd-Sourcing Annotations ...... 81

6.5 Experimental Setup ...... 84

6.6 Experiments ...... 87

6.6.1 Dependency v/s Window Contexts ...... 87

6.6.2 Ablating Directionality in Context ...... 89

6.6.3 Qualitative Analysis ...... 90

6.7 Evaluating Robustness of BiSparse-Dep ...... 91

6.7.1 No Treebank ...... 91

6.7.2 Small Monolingual Corpus ...... 93

6.7.3 Quality of Bilingual Dictionary ...... 93

6.7.4 Choice of Hypernymy Detection Measure ...... 94

6.8 Summary ...... 94

CHAPTER 7 : Multilingual Supervision for Cross-lingual Entity Linking ...... 96

7.1 Introduction ...... 96

7.2 History of the Entity Linking Task ...... 97

7.2.1 Monolingual Entity Linking ...... 98

7.2.2 Cross-lingual Entity Linking ...... 101

7.2.3 Our Work ...... 103

7.3 Cross-lingual EL with Xelms ...... 105

7.3.1 Mention Context Representation ...... 105

7.3.2 Including Type Information ...... 108

7.4 Training and Inference ...... 109

7.4.1 Candidate Generation ...... 109

7.4.2 Inference ...... 110

7.4.3 Training Objective ...... 111

7.5 Experimental Setup ...... 111

7.6 Experiments ...... 114

7.6.1 Monolingual and Joint Models ...... 115

7.6.2 Multilingual Training ...... 116

7.6.3 Adding Fine-grained Type Information ...... 117

7.7 Experiments with Limited Resources ...... 117

7.7.1 Zero-shot Setting ...... 118

7.7.2 Low-resource Setting ...... 120

7.8 Summary ...... 121

CHAPTER 8 : Bootstrapping Transliteration with Constrained Discovery . . . . . 123

8.1 Introduction ...... 123

8.2 History of the Transliteration Task ...... 124

8.2.1 Transliteration Generation ...... 125

8.2.2 Transliteration Discovery ...... 126

8.2.3 Our Work ...... 127

8.3 Transliteration with Hard Monotonic Attention ...... 128

8.4 Low-Resource Bootstrapping ...... 130

8.4.1 The Bootstrapping Algorithm ...... 131

8.4.2 Discovery Constraints ...... 131

8.5 Experimental Setup ...... 132

8.6 Experiments ...... 134

8.6.1 Full Supervision Setting ...... 134

8.6.2 Low-Resource Setting ...... 136

8.6.3 Error Analysis ...... 137

8.7 Challenges Inherent to Transliteration ...... 137

8.7.1 Source and Target-Specific Issues ...... 138

8.7.2 Disparity between Native and Foreign Words ...... 139

8.8 Case Studies ...... 139

8.8.1 Manual Annotation ...... 140

8.8.2 Candidate Generation ...... 142

8.9 Summary ...... 144

CHAPTER 9 : Conclusion ...... 146

9.1 Summary of Contributions ...... 146

9.2 Future Directions ...... 150

APPENDIX ...... 152

A.1 Statistical Measures used in the Thesis ...... 152

A.2 BiSparse — Bilingual Sparse Coding ...... 155

A.3 Qualitative Analysis for Multi-sense Embeddings ...... 155

BIBLIOGRAPHY ...... 157

List of Tables

TABLE 1 : The distributional hypothesis in action: We (humans) can guess the

meaning of words like wampimuk by the context in which it appears. 8

TABLE 2 : Example of a word-context co-occurrence matrix extracted from a

corpus. Each dimension of a row corresponds to co-occurrence with

a certain context word. For instance, the word apple co-occurs with

fruit (4th dimension) 20 times in the corpus...... 14

TABLE 3 : Word similarity score measured in Spearman’s correlation ratio for

English on SimLex-999. The best score for each language pair is

shown in bold. Scores that are significantly better (per Steiger’s

Method with p < 0.1) than the next lower score are underlined. For

example, for English-Chinese, BiCVM is significantly better than

BiSkip, which in turn is significantly better than BiVCD...... 36

TABLE 4 : Intrinsic evaluation of English word vectors measured in terms of

Qvec score across models. The best scores for each language pair are

shown in bold...... 37

TABLE 5 : Cross-lingual dictionary induction results (top-10 accuracy). The

same trend was also observed across models when computing MRR

(mean reciprocal rank)...... 38

TABLE 6 : Cross-lingual document classification accuracy when trained on language l1 and tested on language l2. The best scores for each language are shown in bold. Majority baselines for English → l2 and l1 → English are 49.7% and 46.7% respectively, for all languages. Scores signif-

icantly better (per McNemar’s Test, p < 0.05) than the next best

score are underlined. For example, for Swedish → English, BiVCD

is significantly better than BiSkip, which is significantly better than

BiCVM...... 39

TABLE 7 : Labeled attachment score (LAS) for dependency parsing when trained

and tested on language l (direct-parsing experiment). Mono refers

to a parser trained with monolingually induced embeddings. Scores

in bold are better than the Mono scores for each language, showing

improvement from cross-lingual training...... 40

TABLE 8 : Labeled attachment score (LAS) for cross-lingual dependency pars-

ing when trained on language l1, and evaluated on language l2 (model- transfer experiment). The best score for each language is shown in

bold...... 41

TABLE 9 : Concept-term matrix constructed using the English Wikipedia. Ex-

plicit Semantic Analysis (ESA) representations are derived from the

matrix above by reweighing each entry using TF-IDF and normal-

izing the columns. Each column describes a word in English using a

“bag-of-concepts” in Wikipedia...... 44

TABLE 10 : Statistics showing the number of articles in Wikipedia for 9 languages,

and the size of the intersection (as identified by inter-language links)

with the English Wikipedia. The last column shows the relative

size of the intersection (as a percentage). For instance, 52% of German

Wikipedia articles have an English Wikipedia counterpart...... 48

TABLE 11 : Concept-term matrix constructed using the English, Spanish and

Hindi Wikipedias (compare to Table 9). From left to right, the words

in Spanish and Hindi translate to tower, big, politician, olympic,

democratic, hawaii respectively. The term counts are computed in

the respective article in the Spanish or . CLESA

representations are derived from this matrix by weighing each entry

using TF-IDF and normalizing the columns. Each column describes

a word using a “bag-of-concepts” in Wikipedia...... 49

TABLE 12 : Comparing monolingual ESA (Mono) and cross-lingual ESA (CLESA)

for classification on translated documents from 20-Newsgroup. The

table shows the number of languages which achieve accuracies in a

certain range with a given ESA representation (Mono or CLESA).

For instance, CLESA achieves accuracy@1 in the range of 70% to

80% (i.e., 0.8 > p ≥ 0.7) for 7 languages. It can be seen that CLESA

achieves high accuracies (e.g., accuracy@1 lies in 1 ≥ p ≥ 0.9) for

20 languages, whereas the same is true for monolingual ESA for 2

languages only...... 52

TABLE 13 : Corpus statistics (in millions) of the parallel corpora used to train

the multi-sense representations. Horizontal lines demarcate corpora

from the same domain...... 66

TABLE 14 : Results on word sense induction (left four columns) in ARI and con-

textual word similarity (last column) in percent correlation. Lan-

guage pairs are separated by horizontal lines. Best results in bold. 67

TABLE 15 : Effect (in ARI) of language family distance on WSI task. Best results

for each column are shown in bold. The improvement from Mono

to Full is shown as (3) - (1). Note that this is not comparable to

results in Table 14, as a different training corpus is used to control

for the domain...... 68

TABLE 16 : Senses discovered for some words under multilingual training, and

their nearest neighbors in the vector space...... 70

TABLE 17 : Crowd-sourced dataset statistics. #pos (#neg) denote the number

of positives (negatives) in the evaluation set, and #crowd-sourced

denote the number of crowd-sourced pairs. Negatives were deliber-

ately under-sampled to have a balanced evaluation set. Some exam-

ples for positive and negative instances in French and Russian are:

(pêcheur (en:fisher), worker) (vêtement (en:clothing), jeans) (замок

(en:castle), structure) (флора (en:flora), sunflower) ...... 83

TABLE 18 : Training data statistics for different languages. While we use parallel

corpora for computing translation dictionaries, our approach can

work with any bilingual dictionary...... 85

TABLE 19 : Comparing the different approaches from Section 6.5 with our BiSparse-

Dep approach on Hyper-Hypo (random baseline= 0.5). Bold de-

notes the best score for each language, and the * on the best score

indicates a statistically significant (p < 0.05) improvement over the

next best score, using McNemar’s test (McNemar, 1947)...... 88

TABLE 20 : Comparing the different approaches from Section 6.5 with our BiSparse-

Dep approach on Hyper-Cohypo (random baseline= 0.5). Bold

denotes the best score for each language, and the * on the best score

indicates a statistically significant (p < 0.05) improvement over the

next best score, using McNemar’s test (McNemar, 1947)...... 88

TABLE 21 : Common errors made by the different models described in Section 6.3. 90

TABLE 22 : Robustness to absence of a treebank: The delexicalized model

is competitive to the best dependency based and the best window

based models on both test sets. For each dataset, * indicates a

statistically significant (p < 0.05) improvement over the next best

model in that column, using McNemar’s test (McNemar, 1947). . . 91

TABLE 23 : Part of a dictionary compiled from Wikipedia hyperlinks showing the

entities that the string chicago can refer to, along with the respective

prior probabilities and counts...... 99

TABLE 24 : Number of train mentions (from Wikipedia) in each language, with

% size relative to English (51.7M mentions). Train mentions from

Wikipedias like Arabic, Turkish and Tamil are <10% the size of

those from the English Wikipedia...... 112

TABLE 25 : Evaluation datasets used in the cross-lingual entity linking experi-

ments...... 113

TABLE 26 : Xelms(joint) improves upon Xelms(mono) and the current State-

of-The-Art (SoTA) on TH-Test and McN-Test, showing the ben-

efit of using additional supervision from English. The best score

is shown bold and ∗ marks statistical significance of best against

SoTA. Refer Section 7.5 for details on SoTA...... 115

TABLE 27 : Linking accuracy on TAC15-Test. Numbers for Sil et al. (2018) from

personal communication...... 116

TABLE 28 : Linking accuracy of a single Xelms(multi) model for four languages

— German, Spanish, French and Italian. For comparison, individ-

ually trained Xelms(joint) scores are also shown. The best score

is shown bold and ∗ marks statistical significance of best against

SoTA. Refer Section 7.5 for details on SoTA...... 116

TABLE 29 : Adding fine-grained type information further improves linking ac-

curacy (compare to Table 26). The best score is shown bold and ∗

marks statistical significance of best against SoTA. Refer Section 7.5

for details on SoTA...... 117

TABLE 30 : Linking accuracy of the zero-shot (Z-S) approach on different datasets.

Zero-shot (w/ prior) is close to SoTA for datasets like TAC15-Test,

but performance drops in the more realistic setting of zero-shot (w/o

prior) (Section 7.7.1) on all datasets, indicating most of the perfor-

mance can be attributed to the presence of prior probabilities. The

slight drop in McN-Test is due to trivial mentions that only have

a single candidate...... 119

TABLE 31 : Cumulative number of person name pairs in Wikipedia inter-language

links. For instance, 45 languages (that cover 22 scripts) have at least

1000 name pairs that can be extracted from inter-language links.

While previous approaches for transliteration generation were appli-

cable to only 24 languages (spanning 15 scripts), our approach is

applicable to 56 languages (23 scripts). When counting scripts we

exclude variants (e.g., all Cyrillic scripts and variants count as one). 127

TABLE 32 : Comparing different generation approaches on the NEWS 2015 dataset

using accuracy@1 as the evaluation metric for five languages —

Hindi (hi), Kannada (kn), Bengali (bn), Tamil (ta) and Hebrew

(he) – in full supervision and low-resource settings. “Ours” denotes

the Seq2Seq(HMA) model, with (.) denoting the inference strategy.

The rest of the approaches are described in Section 8.5. Numbers

for RPI-ISI are from Lin et al. (2016)...... 135

TABLE 33 : Accuracy@1 for native and foreign words for four languages (Sec-

tion 8.7.2). Ratio is native performance relative to foreign...... 140

TABLE 34 : Corpora used for obtaining foreign vocabulary Vf for bootstrapping in the case studies in Section 8.8.1 and Section 8.8.2...... 140

TABLE 35 : Accuracy@1 of Seq2Seq(HMA) model supervised using human an-

notated seed set in Punjabi and Armenian (with and without boot-

strapping). Both languages perform well relative to the other lan-

guages investigated so far. Both annotation sub-tasks took roughly

the same time...... 141

TABLE 36 : Comparing candidate recall@20 for different approaches on Tigrinya

and Macedonian. CV-split refers to consonant-vowel splitting. Using

our transliteration generation model with bootstrapping yields the

highest recall, improving significantly over a name match baseline. 144

List of Figures

FIGURE 1 : Overview of the chapters in this thesis. The double-headed arrows

indicate tasks involving sharing of information between languages,

and single-headed arrows indicate tasks involving transfer from one

language to another...... 5

FIGURE 2 : Vector-based word representations in a 2-dimensional space. . . . 14

FIGURE 3 : Cross-lingual word representations as a vector space approxima-

tion of discrete translation dictionaries. By encoding items present

in the dictionary as points in a continuous vector space, one can

recover some of the missing entries. Note that while the figure uses

vector-based representations (which partition the space implicitly)

a similar reasoning can be applied to cluster-based representations

(which partition the space explicitly)...... 27

FIGURE 4 : Different forms of cross-lingual alignments one can use as supervi-

sion for learning cross-lingual representations. From left to right,

the cost of the supervision required varies from expensive (sentence

and word-level alignments) to cheap (document-level alignment). . 29

FIGURE 5 : Inter-language links in English Wikipedia and Hindi Wikipedia link

the corresponding articles that describe the entity Albert_Einstein in both Wikipedias...... 47

FIGURE 6 : Comparing CLESA based document classification with supervised

classification on the TED dataset (averaged macro-F1 scores over

15 labels) for 7 languages — English (en), Arabic (ar), German

(de), Spanish (es), Dutch (nl), Turkish (tr) and Chinese (zh). . . . 51

FIGURE 7 : Senses listed in WordNet for the word wicked when used as an

adjective. None of the senses listed reflect the sense of wicked in

the sentence, “Tony Hawk did a wicked ollie at the convention”,

where wicked is used as an adjective denoting awesome-ness. . . . 56

FIGURE 8 : Benefit of Multilingual Information (beyond bilingual): Two

different senses of the word interest and their translations to French

and Chinese (word translation shown in [bold]). While the sur-

face forms of both senses of interest are the same in French, they are

different in Chinese, illustrating the benefit of having more than

two languages...... 59

FIGURE 9 : The aligned pair (interest,intérêt) is used to predict monolingual

and cross-lingual context in both languages (see factors in Equa-

tion (5.7)). Each sense vector (here 2nd is shown) for interest,

participates in the update. Only the polysemy in English is mod-

elled...... 61

FIGURE 10 : Tuning window size for English-Chinese and English-French models. 69

FIGURE 11 : The BiSparse-Dep Approach, that learns sparse bilingual em-

beddings using dependency-based contexts. The resulting sparse

embeddings, together with an unsupervised hypernymy detection

measure, can detect hypernyms across languages (e.g., pomme is a

fruit)...... 77

FIGURE 12 : Dependency tree for “The tired traveler roamed the sandy desert,

seeking food”...... 77

FIGURE 13 : Robustness to small corpus: For most languages, BiSparse-

Dep outperforms the corresponding best window based model on

Hyper-Hypo, with about 40% of the monolingual corpora. . . . 92

FIGURE 14 : Robustness to low quality dictionary: For most languages,

BiSparse-Dep outperforms the corresponding best window based

model on Hyper-Hypo, with increasingly lower quality dictionaries. 93

FIGURE 15 : Tamil and English mention contexts containing [mentions] of the

entity Liverpool_F.C. from the respective Wikipedias. Tamil

Wikipedia only has 9 mentions referring to Liverpool_F.C., whereas English Wikipedia has 5303 such mentions. Clearly, there is a

need to augment the limited contextual evidence in low-resource

languages (like Tamil) with evidence from high-resource languages

(like English). The Tamil sentence translates to “Suarez plays for

[Liverpool] and Uruguay.” ...... 101

FIGURE 16 : (a) Xelms uses grounded mentions from two or more languages

(English and Tamil shown) as supervision. The context g, entity

e and type t vectors interact through Entity-Context loss (EC-

Loss), Type-Context loss (TC-Loss) and Type-Entity loss (TE-

Loss). The Tamil sentence is the same as in Figure 15, and other

mentions in it translate to [Suarez] and [Uruguay]. (b) The

Mention Context Encoder (Section 7.3.1) encodes the local con-

text (neighboring words) and the document context (surfaces of

other mentions in the document) of the mention into g. Internal

view of local context encoder is in Figure 17...... 104

FIGURE 17 : Local Context Encoder, for the right context. Figure 16b shows

how it fits inside Mention Context Encoder...... 107

FIGURE 18 : Candidate generation using inter-language links in Wikipedia. Prior

probabilities are computed from surface to index statistics com-

puted over target language Wikipedia...... 109

FIGURE 19 : Linking accuracy v/s the number of train mentions in the target

language L (= Turkish (tr), Chinese (zh) and Spanish (es)). We

compare both Xelms(mono) and Xelms(joint) to the best results

using all available supervision, denoted by L-best. To discount the

effect of the prior, all results above are without it. For number of

train mentions = 0, Xelms(joint) is equivalent to zero-shot without

prior. Best viewed in color...... 120

FIGURE 20 : Limitation of the candidate generation approach described in Fig-

ure 18. For the Russian mention (that transliterates to ),

using inter-language links misses two plausible candidate entities

in the English Wikipedia...... 123

FIGURE 21 : Generation treats transliteration as a sequence transduction task,

while discovery aims to select the correct transliteration from a

given list of names...... 125

FIGURE 22 : Transliteration using Seq2Seq transduction with Hard Monotonic

Attention, or Seq2Seq(HMA). The figure shows how decoding pro-

ceeds for transliterating थनोस to thanos. During decoding, the

model attends to a source character (e.g., थ shown in blue) and outputs target characters (t, h, a) until a step action is generated,

which moves the attention position forward by one character (to

न), and so on. ...... 129
FIGURE 23 : Plot showing accuracy@1 after each bootstrapping iteration for

Hindi, Kannada, Bengali, Tamil and Hebrew, starting with only

500 training pairs as supervision. For comparison, the accuracy@1

of a model trained with all available supervision is also shown (re-

spective dashed lines, marked X-Full)...... 136

FIGURE 24 : Summary of contributions made in the thesis. Different chapters

demonstrated how cross-lingual representations can help achieve

model transfer and sharing (Chapter 3 and 4), aid in lexical se-

mantics tasks either monolingually (Chapter 5) or cross-lingually

(Chapter 6), and facilitate cross-lingual information extraction (Chap-

ter 7 and 8)...... 147

FIGURE 25 : PCA plots for the vectors for {apple, bank, interest, itunes, potato,

west, monetary, desire} with multiple sense vectors for apple (ap-

ple_1 and apple_2), interest (interest_1 and interest_2) and bank

(bank_1 and bank_2) obtained using monolingual (25a), bilingual

(25b) and multilingual (25c) training...... 156

CHAPTER 1 : Introduction

1.1. Overview

Most work in Natural Language Processing (NLP) is driven by the availability of supervision in the form of labeled data. Such labeled data is relatively easily available for simple NLP tasks in high-resource languages (like document classification for English), but this is not true for other languages. In fact, for more sophisticated NLP tasks, fully labeled data is hard to come by even in English, let alone other languages. Consequently, even though there are over 7000 languages from over 140 different language families in use across the world (Lewis, 2009), the state of NLP is skewed towards high-resource languages like English.

Why should one care about NLP in other languages? Developing NLP technology that can operate multilingually has several benefits. A large population of the world is multilingual, and produces information at an unprecedented scale across several languages thanks to the web. Indeed, over 40% of the content on the web is not in English, and the rate at which such content is generated has been growing dramatically.1 Although most of this content is publicly available (online news, social media platforms, etc.), consuming it is complicated by the multitude of languages used. Multilingual NLP makes this information available to a broader audience than the language in which it was expressed can reach, helping overcome the metaphorical language-imposed barrier to knowledge. For instance, knowing what entities appear in a certain tweet written in Hindi can improve the understanding of an English reader. This can help disseminate critical information faster, which can prove crucial in urgent situations (e.g., early warning systems, financial trading). Multilingual NLP can also help from a scientific standpoint, in understanding how languages evolve and interact with each other, revealing previously unknown similarities or differences between languages (Bender, 2009, 2011).

1Based on https://en.wikipedia.org/wiki/Languages_used_on_the_Internet.

Unfortunately, traditional supervised learning approaches, which form the backbone of current NLP technology, are inherently ill-equipped to deal with the lack of labeled data, which poses a significant challenge in scaling to other languages. As a result, advances made in NLP for high-resource languages take longer to percolate to the rest of the languages. To bridge this language barrier, efforts have been made in two directions.

Annotation-heavy Approaches. The first direction is motivated by the overwhelming success of supervised machine learning approaches for English NLP tasks. Given enough training data, traditional machine learning approaches have shown remarkable generalization abilities. One natural step is to apply these approaches to NLP tasks in non-English languages, by providing supervision in the target language. Often, such supervision is derived from multilingual resources that are used in machine translation, such as parliamentary proceedings (Koehn, 2005) or translations of the Bible (Christodouloupoulos and Steedman, 2015; Agić et al., 2015). A popular approach for achieving this is annotation projection, where annotation for some linguistic structure is projected from a high-resource language to a low-resource language, usually by means of some cross-lingual alignment (Yarowsky and Ngai, 2001; Hwa et al., 2005). Nevertheless, the community has also invested in curating annotated resources in non-English languages for various NLP tasks (Xue, 2008; de Melo and Weikum, 2009; McDonald et al., 2013), so that traditional supervised learning approaches can be developed for these languages. Indeed, statistical machine learning approaches have become integral to advancing NLP in English over the past three decades, and it is reasonable to assume that they will succeed for other languages once the appropriate datasets become available.

Translation-based Approaches. The second direction has advocated extending NLP to languages other than English by automatically translating text in the target language to English using machine translation, and then using state-of-the-art NLP models already developed for English. Like annotation-heavy approaches, this direction also relies on the success of the statistical learning approaches for English NLP tasks, with the hope that good translation from a language to English will be sufficient to bridge the language barrier. Indeed, translation quality from a non-English language to English has improved considerably over the past few years with the advent of neural machine translation (Sutskever et al., 2014; Bahdanau et al., 2015; Luong et al., 2015a; Sennrich et al., 2016, inter alia).

While promising, both of these directions suffer from limitations. Traditional supervised learning approaches require a generous amount of labeled data to achieve good generalization. While such annotation-heavy approaches have proved successful for high-resource languages such as English, their applicability to a new language is limited due to lack of labeled data. Collecting high-quality labeled data for any NLP task in an arbitrary language is challenging due to the human effort involved and the paucity of expert annotators. There are practical limitations as well: generating high-quality annotations such as treebanks (Marcus et al., 1993; Prasad et al., 2008; McDonald et al., 2013; Socher et al., 2013) takes time and expertise. This may not be ideal if one has limited time to develop a system in a language to achieve certain goals (e.g., for disaster management).

Similar problems affect machine translation. Building a translation system requires resources in the target language (namely large parallel corpora with English), which are not always available and are expensive to annotate at scale (Lopez and Post, 2013). Unless several million lines of in-domain parallel text are available, statistical machine translation approaches perform poorly (Kolachina et al., 2012; Irvine, 2014), an issue that is exacerbated for the now popular neural machine translation approaches (Koehn and Knowles, 2017). From a practical standpoint too, including a translation system in the pipeline may prove to be an overkill for simple understanding tasks such as document classification. For instance, knowing that गुरुत्वाकर्षण means gravity and क्वांटम यांत्रिकी means quantum mechanics is sufficient to realize that a Hindi document containing the above words is about physics, without translating the entire document. Another practical issue is that even though we can apply NLP tools on the translated text, often the output is desired in the target language itself (e.g., dependency parsing, sequence tagging). For these problems, a translation-based approach incurs the additional cost of projecting the output back to the original language, which, besides introducing latency, might be a hard task in itself. For instance, how does one project back the Part-Of-Speech (POS) tags for tokens in an English phrase that aligns to a single word in Hindi?

All the above challenges become more acute when working with low-resource languages. To extend NLP tools to languages beyond English, we need to find approaches that eliminate or reduce the dependence on high-quality annotated data, and that enable one to scale to multiple languages quickly without investing too much effort.

1.2. Thesis Statement

In this thesis, I argue that cross-lingual representations are an effective means of extending NLP tools to languages beyond English, without resorting to generous amounts of annotated data or expensive machine translation. These cross-lingual representations encode semantics and maintain similarity of lexical items across multiple languages. Such multilingually-enriched semantic representations can be learnt in an inexpensive manner, often from cross-lingual signals completely unrelated to the task of interest. By using these representations as features (in lieu of monolingually derived representations) for training a model for a certain NLP task, we can learn models that operate on input from multiple languages. Using such language-agnostic feature representations in the model enables us to leverage any existing annotations in a high-resource language for joint multilingual training, thereby reducing the amount of supervision required for the task in a new language.

1.3. Outline of This Thesis

The rest of the document is organized as follows.

• Chapter 2 gives a brief history of different approaches for representing words in NLP

applications, and highlights their limitations in capturing cross-lingual semantics.

• Chapter 3 introduces cross-lingual representations as a means to overcome this limitation, and motivates them as an effective way of translating feature spaces. I then describe the different cross-lingual signals that we can exploit to learn cross-lingual representations, and evaluate their suitability for achieving semantic and syntactic transfer. Based on (Upadhyay et al., 2016).

[Figure 1 appears here.] Figure 1: Overview of the chapters in this thesis. The double-headed arrows indicate tasks involving sharing of information between languages, and single-headed arrows indicate tasks involving transfer from one language to another. The figure connects high-resource languages (e.g., English) with other languages, and poses a guiding question for each chapter. Chapter 3: What signals can be used to transfer models from English to other languages? What is the impact of the choice of the signal? Chapter 4: Can we exploit freely available resources like Wikipedia to generate cross-lingual representations? Chapter 5: Can cross-lingual signals help learn better sense representations in English? Chapter 6: How to use cross-lingual representations to understand relationships (other than translation) across languages? Chapters 7 and 8: How to use cross-lingual representations to develop a cross-lingual information extraction system?

• In Chapter 4, I describe another approach to learn cross-lingual representations using

multilingual encyclopedic resources such as Wikipedia to encode words in different

languages in a shared semantic space, and use it to classify documents without any

supervision. Based on (Song et al., 2016).

• In Chapter 5, I describe our work on learning multi-sense representations in English

using cross-lingual signals, where we show that using a small amount of multilingual

data we can train models for sense induction that compare favorably with a model

trained on a large amount of monolingual data. Based on (Upadhyay et al., 2017).

• Chapter 6 shows how, by using appropriate cross-lingual word representations, one

can detect asymmetric relations across languages. The chapter demonstrates that

cross-lingual representations can aid in semantic tasks beyond those requiring coarse

semantics (such as document classification). Based on (Upadhyay et al., 2018).

The next two chapters examine a downstream information extraction task, namely

cross-lingual entity linking, along two dimensions.

• Chapter 7 describes how we can exploit cross-lingual representations to develop a

cross-lingual entity linking system. In particular, I describe how we can augment the

limited supervision in a low-resource language with supervision from a high-resource

language by using a shared multilingual representation space. Based on (Upadhyay

et al., 2018a).

• Chapter 8 examines the problem of name transliteration from a language to English,

a crucial component for performing entity linking in the cross-lingual setting. I show

that we can improve transliteration from low-resource languages into English using

constraints to drive learning with limited supervision. Based on (Upadhyay et al.,

2018b).

Finally, Chapter 9 summarizes the contributions of this thesis and provides an overview of future directions.

CHAPTER 2 : History of Word Representations

2.1. Introduction

The generalizability of any machine learning model is a function of the data representation

(or features) that it operates on to make predictions. For instance, the generalizability of a document classifier depends on what features are used to represent a document. For most NLP applications, features of smaller units of language such as words are used to derive features for larger linguistic structures such as sentences and documents. For instance, a common choice of word-level features in the document classification task is "how often did word w appear in the document", and the document-level feature is simply the sum of all the active word-level features in that document. The choice of feature representation for units such as words is thus a crucial decision when developing an NLP model.

The idea of learning mathematical representations useful for building classifiers is a general one (Bengio et al., 2013), and the aim of this chapter is to review different approaches for representing words as mathematical objects in NLP. The techniques and principles introduced in this chapter for training monolingual representations will serve as the starting point for the cross-lingual representation learning approaches described in the rest of the thesis.

I start with a discussion of one-hot word representations and their limitations. I will then discuss how the principles of distributional semantics can be put into practice to derive cluster-based or vector-based word representations. Owing to the popularity of vector-based representations, I will briefly discuss the different frameworks for learning vector representations and their benefits. I end the chapter by discussing some known limitations of these word representations, which I will address in the rest of the thesis.

2.2. One-Hot Word Representations

Conventional approaches represent each word w in the vocabulary V by its index in V. For instance, for a word vocabulary V = {king, queen, computer, virus, disease}, the one-hot representations will be {1, 2, 3, 4, 5}. This representation scheme is better visualized by treating each word as a vector of 0s with a 1 at the index associated with the word (or a one-hot vector),

          1 0 0 0 0                     0 1 0 0 0                     vking = 0 , vqueen = 0 , vcomputer = 1 , vvirus = 0 , vdisease = 0 (2.1)                               0 0 0 1 0 0 0 0 0 1

It is easy to see that this is an extremely crude representation that fails to capture relationships (or lack thereof) between words. For instance, related words like king and queen are treated as two distinct atomic symbols, just like unrelated words such as computer and king. Features built on top of one-hot word representations also suffer from the same problem. For instance, the parameters learnt for the feature "the word king is present in the document" during training cannot be shared with the feature "the word queen is present in the document" that is only seen at test time. As a result, these one-hot word representations have limited utility when used as the building blocks for an NLP model.

Another drawback is that the representation size increases with the size of the vocabulary, which is cumbersome when working with large corpora.
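To make the scheme and its failure mode concrete, here is a minimal Python sketch (mine, not from the original text; the one_hot helper and the toy vocabulary are illustrative):

```python
import numpy as np

def one_hot(word, vocab):
    # A |V|-dimensional vector of 0s, with a 1 at the word's index in V.
    vec = np.zeros(len(vocab))
    vec[vocab.index(word)] = 1.0
    return vec

vocab = ["king", "queen", "computer", "virus", "disease"]
v_king = one_hot("king", vocab)
v_queen = one_hot("queen", vocab)
v_computer = one_hot("computer", vocab)

# One-hot vectors are pairwise orthogonal, so the related pair (king, queen)
# and the unrelated pair (king, computer) both get a similarity of exactly 0.
print(v_king @ v_queen)     # 0.0
print(v_king @ v_computer)  # 0.0
```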

2.3. Distributional Semantics

Marco saw a hairy little wampimuk crouching behind the tree. During the winter the wampimuk hibernates in his burrow. The girls fed their pet wampimuk berries and acorns.

Table 1: The distributional hypothesis in action: We (humans) can guess the meaning of words like wampimuk by the context in which it appears.

How can we learn more meaningful representations for words that capture some notion of relatedness? Intuitively, humans infer the meaning of a word from the context in which the word appears. For instance, in Table 1, the meaning of a (hypothetical) word like

wampimuk in the sentence “Marco saw a hairy little wampimuk crouching behind the tree.” can be guessed based on the neighboring words.1 Such patterns of word usage also allow humans to guess that words like squirrel or rabbit are similar to wampimuk, because they can substitute for wampimuk in the above context. These observations form the basis of the field of distributional semantics, where word meanings are modeled as a function of statistics computed over a corpus. This intuition can be formalized in two popular distributional semantics hypotheses.

The Distributional Hypothesis

The distributional hypothesis states that you shall know a word by the company it

keeps (Weaver, 1955; Firth, 1957; Furnas et al., 1983, inter alia). The hypothesis

has been restated differently (and more formally) over the years: words that are

similar in meaning occur in similar contexts (Rubenstein and Goodenough, 1965);

words with similar meanings will occur with similar neighbors if enough text material

is available (Schütze and Pedersen, 1995).

The definition of context and similarity in the distributional hypothesis is task-dependent.

The notion of similarity is often dictated by the representation paradigm: vector-based word representations use a geometric similarity metric like cosine or the dot product, while cluster-based word representations simply check for cluster overlap. Most popular representation learning approaches define context to be the lexical neighborhood of the word under consideration, but this choice can also be dictated by the task. For instance, in

Chapter 6, I will describe word representations that use the syntactic context of words to learn

representations for detecting asymmetric relations like hypernymy.

Distributed v/s Distributional. An important distinction is to be made between distributional word representations and distributed representations, a related term used in the representation learning literature (Bengio et al., 2013). Word representations that appeal to the distributional hypothesis are called distributional representations. On the other

1Example inspired from Goldberg (2017).

hand, distributed representations are those where each word’s representation is described by multiple contexts, and each context participates in describing multiple words. Distributed representations are in contrast to local representations, such as the one-hot representation discussed earlier. Note that a representation can be both distributional and distributed

(e.g., representations learnt using Skip-gram with Negative Sampling (SGNS), described in

Section 2.9).

While the distributional hypothesis is widely popular, it is worth noting that not all word representation approaches appeal to it. Another related hypothesis that describes the semantics of a word through statistics computed on a corpus is the bag-of-words hypothesis.

The Bag-of-Words Hypothesis

The bag-of-words hypothesis states that similar documents contain similar words,

and similar words appear in similar documents. More formally, documents that have

similar column vectors in a term-document matrix are similar (Salton et al., 1975;

Salton and Buckley, 1988).

The term-document matrix is computed by treating each document as a bag-of-words, hence the hypothesis is referred to as the bag-of-words hypothesis. The bag-of-words hypothesis examines the document-level co-occurrence of words, and is a close relative of the distributional hypothesis.2
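As a small illustration of the hypothesis (my own sketch, not code from the thesis; the documents and the doc_overlap helper are made up), one can build a term-document matrix of raw counts and compare documents by their shared word mass:

```python
from collections import Counter

# Each document becomes a bag-of-words, i.e., a column of the term-document matrix.
docs = {
    "d1": "the market and the bank raised interest rates",
    "d2": "the bank lowered interest rates again",
    "d3": "the cat is a popular breed of cat",
}
term_doc = {name: Counter(text.split()) for name, text in docs.items()}

def doc_overlap(a, b):
    # Similar documents contain similar words: count the shared word mass.
    return sum(min(a[w], b[w]) for w in a)

print(doc_overlap(term_doc["d1"], term_doc["d2"]))  # 4 (the, bank, interest, rates)
print(doc_overlap(term_doc["d1"], term_doc["d3"]))  # 1 (only 'the')
```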

By invoking the distributional hypothesis and the bag-of-words hypothesis, representation learning approaches can extract contextual knowledge directly from raw corpora. This way of extracting semantics automatically is particularly appealing compared to the time-consuming approach of hand-coding semantics (e.g., building lexical ontologies). Another benefit of these representations is their re-usability: once extracted from a large and general enough corpus, the same word representations can be used for multiple tasks.

2In fact, by defining the context of a word to be the documents it appears in, one can arrive at the bag-of-words hypothesis from the distributional hypothesis.

2.4. Cluster-based Word Representations

The problem with the one-hot representations was that they were extremely high-dimensional (of the size of the vocabulary), and thus did not permit any sharing. An alternative to the crude one-hot representation is to first cluster the words in the vocabulary, such that words that share meaningful linguistic properties get assigned to the same cluster. Intuitively, this cluster-based word representation reduces the effective vocabulary size to the number of word clusters, such that words that are similar (clustered together) have the same one-hot representations. Here, I discuss one popular clustering algorithm for learning such representations, Brown clustering (Brown et al., 1992).

The input to the Brown clustering algorithm is a corpus of words $w_1, w_2, \cdots, w_T$, where $T$ is the size of the corpus. The aim of the algorithm is to assign each word to a cluster, where $C : V \rightarrow \{1, 2, \cdots, K\}$ is a cluster assignment function that maps each word $w_i$ in the vocabulary $V$ to its cluster $C(w_i)$. The sequence of cluster assignments $C(w_1), C(w_2), \cdots, C(w_T)$ for words $w_1, w_2, \cdots, w_T$ is modeled using a class-based bi-gram language model. Under this model, the generative story of the corpus is the following: transition to the current cluster $C(w_i)$ conditioned on the previous cluster $C(w_{i-1})$, sample a word $w_i$ conditioned on the current cluster $C(w_i)$, and repeat the process. The log-likelihood of the corpus $w_1, w_2, \cdots, w_T$ (or the clustering quality) under this model is computed as,

$$\log \Pr(w_1, w_2, \cdots, w_T) = \sum_{i=1}^{n} \log \Pr(w_i \mid C(w_i)) \Pr(C(w_i) \mid C(w_{i-1})) \qquad (2.2)$$

The output of the clustering algorithm is a binary tree, whose leaves are words, and whose internal nodes are clusters of the words in the sub-tree rooted at that node. Each cluster can be expressed using a binary code, which is the representation shared by all the words that belong to that cluster. It can be shown that the clustering quality is a function of the mutual information of adjacent clusters,

$$\text{Quality}(C) = \sum_{c,c'} \Pr(c, c') \log \frac{\Pr(c, c')}{\Pr(c) \Pr(c')} - \sum_{w} \Pr(w) \log \Pr(w) \qquad (2.3)$$

$$= I(C) - H \qquad (2.4)$$

Thus, Brown clustering uses the mutual information of adjacent clusters (i.e., at the bi-gram level) as a measure of distributional similarity between the words in those clusters. Starting with each word in its own distinct cluster, the Brown clustering algorithm greedily merges the pair of clusters that causes the smallest decrease in the likelihood of the corpus with respect to the class-based bi-gram language model. The greedy clustering strategy naturally leads to a hierarchical clustering upon termination. However, naively implementing this clustering strategy has a run-time of $O(|V|^5)$, which was reduced to $O(|V|^3)$ in Brown et al. (1992). Later, Liang (2005) proposed a more efficient approach that reduced the run-time to $O(|V|m^2 + n)$ by starting with an initial cluster for each of the $m$ most popular words, where $m$ is the number of initial clusters and $n$ is the corpus length.
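To make the greedy procedure concrete, here is a brute-force toy sketch of the agglomerative loop (my own simplification, not the optimized Brown et al. (1992) or Liang (2005) algorithm; the probabilities are simple normalized counts, so treat it as illustrative only). It scores a clustering with the class-based bi-gram likelihood of Eq. (2.2) and repeatedly merges the pair of clusters that hurts that likelihood the least:

```python
from collections import Counter
from itertools import combinations
from math import log

def corpus_log_likelihood(corpus, C):
    # Score a clustering C (dict: word -> cluster id) with the class-based
    # bi-gram log-likelihood of Eq. (2.2), using simple normalized counts.
    word_counts = Counter(corpus)
    cluster_counts = Counter(C[w] for w in corpus)
    bigram_counts = Counter((C[a], C[b]) for a, b in zip(corpus, corpus[1:]))
    ll = 0.0
    for prev, cur in zip(corpus, corpus[1:]):
        p_emit = word_counts[cur] / cluster_counts[C[cur]]                  # Pr(w | C(w))
        p_trans = bigram_counts[C[prev], C[cur]] / cluster_counts[C[prev]]  # Pr(C(w_i) | C(w_{i-1}))
        ll += log(p_emit) + log(p_trans)
    return ll

def greedy_merge_step(corpus, C):
    # One agglomerative step: try every pair of clusters and keep the merge
    # that decreases the corpus log-likelihood the least.
    best_ll, best_C = float("-inf"), None
    for c1, c2 in combinations(set(C.values()), 2):
        merged = {w: (c1 if c == c2 else c) for w, c in C.items()}
        ll = corpus_log_likelihood(corpus, merged)
        if ll > best_ll:
            best_ll, best_C = ll, merged
    return best_C

corpus = "the cat sat on the mat and the dog sat on the rug".split()
C = {w: i for i, w in enumerate(sorted(set(corpus)))}  # each word starts alone
while len(set(C.values())) > 4:                        # merge down to 4 clusters
    C = greedy_merge_step(corpus, C)
print(C)
```

Each merge step evaluates every pair of clusters and re-scores the whole corpus, which mirrors the naive cost discussed above; the efficient implementations avoid exactly this recomputation.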

Brown clustering has proved successful in several semi-supervised scenarios, improving performance on tasks like dependency parsing (Koo et al., 2008; Tratz and Hovy, 2011) and syntactic chunking (Turian et al., 2010), with sequence labeling tasks like named-entity recognition enjoying large gains (Freitag, 2004; Miller et al., 2004; Ratinov and Roth, 2009).

Other Clustering Representations. Many variants of the Brown clustering algorithm exist. For instance, Ushioda (1996) extended the Brown clustering algorithm to compute clusters for words and phrases, while Martin et al. (1998) extended the bi-gram model to tri-grams. Similarly, Uszkoreit and Brants (2008) used a more expressive class-based language model that transitions to the current cluster $C(w_i)$ conditioned on the previous word $w_{i-1}$. The word-to-class transitions allow the model more freedom to improve the clustering quality,

$$\log \Pr(w_1, w_2, \cdots, w_T) = \sum_{i=1}^{n} \log \Pr(w_i \mid C(w_i)) \Pr(C(w_i) \mid w_{i-1}) \qquad (2.5)$$

More recently, Stratos et al. (2014) noticed that the greedy strategy employed in Brown clustering is simply a heuristic and is not guaranteed to recover the (provably) correct clustering even when given a sufficient number of training examples. Furthermore, even with Liang (2005)'s improved algorithm, the run-time is prohibitively long. Stratos et al. (2014) proposed an improved cluster-based representation learning approach that is guaranteed to recover the correct clustering and runs around 10 times faster than the Liang (2005) implementation.

Limitations of Cluster-based Representations. A key limitation of clustering-based representations is that they are not amenable to end-to-end learning aimed at a downstream task. This is because the cluster assignment function C is, in general, discontinuous, so end-to-end learning methods like back-propagation cannot be used with them. Furthermore, approaches like Brown clustering scale quadratically in the number of clusters, and are thus extremely slow to train, preventing more expressive partitioning of the space. The discreteness of the space induced by the clustering also means that the similarity score between a pair of words is discretely quantized. For instance, at a certain level of the Brown clusters, two words either belong to the same cluster (similarity of 1), or don't (similarity of 0). One can circumvent this by describing a word by its path from the root in the binary tree (e.g., apple becomes 101101); however, this only partially alleviates the discreteness of the space. We will see that vector space models allow for a more expressive representation equipped with a continuous similarity measure between words, and are amenable to fast and efficient end-to-end learning.

Figure 2: Vector-based word representations in a 2-dimensional space. (The figure plots example words — book, magazine; bike, car; cake, muffin; king, queen — as points, with related words placed near each other.)

2.5. Vector-based Word Representations

Vector-based word representations use a point in an n-dimensional vector space to describe each word in a language (see Figure 2). Under this representation paradigm, geometric proximity in the vector space is used as a surrogate for the similarity of two words. For instance, in Figure 2, related words like king and queen are closer than king and car. The similarity of two words v and w, represented in the vector space as v and w respectively, can then be computed using some natural geometric measure of vector proximity, like the dot product \mathrm{dot}(\mathbf{v}, \mathbf{w}) = \mathbf{v}^{\top}\mathbf{w}, or the cosine \mathrm{cosine}(\mathbf{v}, \mathbf{w}) = \frac{\mathrm{dot}(\mathbf{v}, \mathbf{w})}{|\mathbf{v}||\mathbf{w}|}. This notion of "proximity as a surrogate for similarity" traces its roots to cognition (Lakoff and Johnson, 1980, 1999), as discussed in (Sahlgren, 2006).
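As a minimal illustration of these proximity measures, the sketch below (with made-up two-dimensional vectors, chosen only for illustration) computes the dot product and cosine similarity:

import numpy as np

def cosine(v, w):
    # Dot product normalized by the product of the vector lengths.
    return (v @ w) / (np.linalg.norm(v) * np.linalg.norm(w))

king = np.array([0.9, 0.8])
queen = np.array([0.85, 0.9])
car = np.array([0.1, 0.7])
print(king @ queen)          # unnormalized dot product
print(cosine(king, queen))   # high: related words lie close together
print(cosine(king, car))     # lower: unrelated words lie further apart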

As these representations embed a word in a geometric space, they are also called word embeddings or word vectors. Over the years, several different ways of learning vector-based word representations have been developed, which differ in the form of the vector representation (sparse or dense) and in the learning framework used — count-based models, matrix factorization (either exact or approximate), or prediction tasks like language modeling. I briefly discuss these below.

dimension →   1st        2nd   3rd   4th     5th     6th       7th    ...
context       computer   is    of    fruit   breed   physics   bank   ...
apple         10         70    90    20      0       0         0      ...
cat           0          60    75    5       20      0         0      ...
interest      25         45    55    5       5       30        30     ...

Table 2: Example of a word-context co-occurrence matrix extracted from a corpus. Each dimension of a row corresponds to co-occurrence with a certain context word. For instance, the word apple co-occurs with fruit (4th dimension) 20 times in the corpus.

2.6. Sparse Word Representations using Count Models

Sparse word representations (Lin, 1998a,b; Turney and Pantel, 2010; Baroni and Lenci, 2010; Levy and Goldberg, 2014a, inter alia) describe each word in the vocabulary using a high dimensional but extremely sparse (>90% of the entries of the vector are 0s) vector.

The key ingredient in all sparse word representations is a co-occurrence matrix M ∈ R^{|V|×|C|}, such as the one shown in Table 2, where |V| is the size of the word vocabulary V and |C| is the size of the context vocabulary C. The entry M[w, c] counts how often the word w ∈ V has appeared in a context c ∈ C.3 As a result, these approaches are also called count-based models for learning word representations (Baroni et al., 2014). The matrix M is extremely sparse; e.g., for the text8 corpus containing about 17 million tokens and 71 thousand distinct words, the co-occurrence matrix has only ≈1% non-zero entries (Jiang et al., 2018). The sparsity of M is an artifact of the Zipfian distribution of words in natural language (Zipf, 1949). The representation of a word w can be directly read off from the relevant row w ∈ R^{|C|} of the co-occurrence matrix.

Variants of count-based models differ along two dimensions — the definition of context, and how the entries of M are computed. As discussed in Section 2.3, the definition of the context is task-dependent, with the lexical neighborhood being a popular choice. Similarly, there are several ways to compute the entries of the co-occurrence matrix M. A naive way of computing the entries of the co-occurrence matrix is to simply set each to the raw co-occurrence frequency n(w, c). However, raw frequency is not a good measure of relatedness between words, because it is skewed towards co-occurrences with function words like the, of, is, etc., which do not truly reflect the semantics of the context. Different approaches of computing the entries of M have been proposed to remedy this issue.

3While I yield to the popular usage, the entries of the matrix can denote event frequencies other than simple co-occurrence, such as, "how often did w take c as its parent in a syntactic tree?".

Point-wise Mutual Information (PMI). A popular approach is to weigh each entry of the co-occurrence matrix using point-wise mutual information or PMI (Church and Hanks, 1990). PMI measures the association between word w and context c as,

\mathrm{PMI}(w, c) = \log \frac{\Pr(w, c)}{\Pr(w) \Pr(c)}    (2.6)

The probabilities are simply estimated using counts: \Pr(w, c) = \frac{n(w,c)}{N}, \Pr(w) = \frac{n(w)}{N}, \Pr(c) = \frac{n(c)}{N}, where n(w, c) is the frequency of the (w, c) co-occurrence, n(w) and n(c) are the frequencies of w and c respectively, and N is the total corpus size. Intuitively, PMI measures how much the probability Pr(w, c) differs from what we would expect it to be assuming independence of Pr(w) and Pr(c) (Bouma, 2009).

Positive PMI (PPMI). An issue with PMI is that negative PMI values do not convey meaningful information. When will PMI assume a negative value? When we have Pr(w, c) < Pr(w) Pr(c), that is, when w or c tend to occur individually rather than together. For instance, this can happen when computing co-occurrences with words like the. PPMI helps to ignore such uninformative co-occurrences. Furthermore, for unobserved (w, c) pairs, \mathrm{PMI}(w, c) = \log 0 = -\infty, which makes the co-occurrence matrix dense (and ill-defined). To remedy these issues, Levy and Bullinaria (2001) suggested using positive PMI, which simply ignores negative values: \mathrm{PPMI}(w, c) = \max[\mathrm{PMI}(w, c), 0].

Positive Local Mutual Information (PLMI). PMI and PPMI have a bias towards rare events (i.e., small n(w, c)). To see this, notice that for a perfectly correlated word w and context c (i.e., Pr(w) = Pr(c) = Pr(w, c)), the PMI or PPMI value is -\log \Pr(w, c), which is higher for low Pr(w, c) (i.e., rare co-occurrences). Positive Local Mutual Information (PLMI) (Evert, 2005, 2008) balances PPMI by including a factor for the co-occurrence count n(w, c) of word w and context c: \mathrm{PLMI}(w, c) = n(w, c) \times \mathrm{PPMI}(w, c). This way, rare co-occurrences get appropriately low entries in M.
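The following sketch computes the PMI, PPMI and PLMI weightings from a raw count matrix; the toy counts are hypothetical, and the zeroing of unobserved pairs follows the convention described above:

import numpy as np

def pmi_weightings(M):
    # M is a raw |V| x |C| co-occurrence count matrix.
    N = M.sum()
    Pwc = M / N
    Pw = M.sum(axis=1, keepdims=True) / N
    Pc = M.sum(axis=0, keepdims=True) / N
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(Pwc / (Pw * Pc))
    pmi[~np.isfinite(pmi)] = 0.0        # unobserved pairs: log 0 = -inf, zeroed out
    ppmi = np.maximum(pmi, 0.0)         # PPMI(w, c) = max[PMI(w, c), 0]
    plmi = M * ppmi                     # PLMI(w, c) = n(w, c) x PPMI(w, c)
    return pmi, ppmi, plmi

M = np.array([[10., 70., 20., 0.],     # toy counts in the spirit of Table 2
              [ 0., 60.,  5., 20.]])
pmi, ppmi, plmi = pmi_weightings(M)
print(ppmi.round(2))
print(plmi.round(2))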

Limitations of Sparse Representations. While sparse word representations do overcome most of the limitations of one-hot word representations, they have their own limitations. The entries of M are frequency estimates from a large corpus, which may be unreliable due to chance co-occurrences of unrelated word pairs and missing evidence for unobserved but related word pairs. Furthermore, many columns of the matrix M will be highly correlated. For instance, words that appear frequently with the context word big are also likely to appear with the context word large. These correlated context dimensions introduce redundancy in the representation. The size of the representations also increases with the size of the context vocabulary, just like one-hot representations.

Dense low dimensional word representations alleviate the above limitations of sparse word representations. These representations are dense, in that most entries in the representation are non-zero values. Unlike sparse representations, the representation size is significantly smaller than the size of the vocabulary (e.g., a dense representation of size 100 can be used for a vocabulary of 200k words). The reduced representation size also reduces the redundancy due to correlated dimensions in the representation. Over the years, several different approaches have been used to learn dense word representations. While some approaches derive the dense representations by factorizing the sparse co-occurrence matrix (either exactly or approximately), others directly learn dense representations by using them for a prediction task like neural language modeling.

2.7. Dense Word Representations using Exact Matrix Factorization

Exact factorization approaches first factorize the sparse high dimensional word co-occurrence matrix exactly using a dimensionality reduction technique like Singular Value Decomposition (SVD) (Golub and Kahan, 1965) or Principal Component Analysis (PCA) (Pearson, 1901). A low rank approximation of the matrix is then constructed from the resulting factors, from which the dense representations are derived.

A popular example of this approach is Latent Semantic Analysis (LSA) (Deerwester et al., 1990; Landauer and Dumais, 1997).4 LSA factorizes the co-occurrence matrix M ∈ R^{|V|×|C|} into factors U, Σ, V using SVD,

M = U \Sigma V^{\top}    (2.7)

where Σ ∈ R^{n×n} is a diagonal matrix containing the singular values (σ_1, ..., σ_n), and U ∈ R^{|V|×n} and V ∈ R^{|C|×n} are orthogonal matrices. Note that the factorization is exact, as U, Σ, V can exactly recover M. It can be shown that the best rank-d approximation M_d to M under the squared difference norm (spectral norm) can be computed using the output of SVD, M_d = \sum_{i=1}^{d} \sigma_i u_i v_i^{\top}, where u_i and v_i are the i-th columns of U and V, respectively (Eckart and Young, 1936). The dense word representations are obtained by projecting the rank-d approximation of the co-occurrence matrix using V_d,

M_d V_d = U_d \Sigma_d    (2.8)

where M_d V_d ∈ R^{|V|×d} is the matrix of the word representations. Over the years, other matrix factorization approaches have improved over LSA in different ways, like making LSA translation-invariant (Gardner et al., 2016) or scale-invariant (Dhillon et al., 2012, 2015), or using information theoretic measures for discrete distributions like Hellinger PCA (Lebret and Collobert, 2014).

4Also referred to as Latent Semantic Indexing (LSI).

The motivation behind using exact matrix factorization approaches is to reduce both the dimensionality and the sparsity of the observations simultaneously (Sahlgren, 2006). The low rank approximation merges the dimensions associated with contexts that are "similar", thereby compressing information spread across multiple related dimensions in the sparse representation. An added benefit of this low rank approximation is that it allows the user to control the size of the representation, unlike sparse representations where the size scales with the size of the vocabulary. Another attractive property of exact matrix factorization approaches like SVD is that they do not require hyper-parameter tuning (e.g., learning rates) and can be solved efficiently.
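A minimal sketch of the LSA procedure of Equations 2.7-2.8; the randomly generated matrix below is only a stand-in for a real (e.g., PPMI-weighted) co-occurrence matrix:

import numpy as np

def lsa_embeddings(M, d):
    # Exact SVD: M = U Sigma V^T; the rank-d word vectors are the rows
    # of M_d V_d = U_d Sigma_d (Equation 2.8).
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :d] * S[:d]

M = np.random.rand(500, 2000)   # stand-in for a co-occurrence matrix
W = lsa_embeddings(M, d=100)
print(W.shape)                  # (500, 100): one dense 100-d vector per word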

2.8. Dense Word Representations using Neural Language Modeling

The Brown clustering approach trained word representations by using a class-based bi-gram language model to assign discrete clusters to the words in the vocabulary. However, Bengio et al. (2003) showed that useful word representations can also be learnt as a by-product of using a neural network for performing the language modeling task, i.e., predicting the next word w_t given a sequence of preceding words w_1, ..., w_{t-1}.

In a neural language modeling approach, the sequence of left context words w_1, ..., w_{t-1} is first transformed into a sequence of vectors \mathbf{w}_1, ..., \mathbf{w}_{t-1} using an embedding matrix W ∈ R^{|V|×d}. Using this sequence of vectors, a neural network g computes the score for the word w being the next word. This score is then normalized using the softmax operation to obtain the next word probability distribution Pr(w_t = w | w_1, ..., w_{t-1}),

\Pr(w_t = w \mid w_1, \ldots, w_{t-1}) = \frac{\exp g(w, \mathbf{w}_1, \ldots, \mathbf{w}_{t-1})}{\sum_{w'} \exp g(w', \mathbf{w}_1, \ldots, \mathbf{w}_{t-1})}    (2.9)

The parameters of the neural network g and the embedding matrix W are learnt by maximizing the log-likelihood of the corpus, \frac{1}{T} \sum_{i=1}^{T} \log \Pr(w_i \mid w_1, \ldots, w_{i-1}), using gradient descent (Cauchy, 1847) and back-propagation (Rumelhart et al., 1986). After training, the rows of W are the dense word representations.

Early approaches like that of Bengio et al. (2003) used a fixed number of words from the left context to compute the next word probability using a feed-forward layer. This was improved upon by Collobert and Weston (2008); Collobert et al. (2011), who instead used a convolutional operation to combine the context word representations, minimizing a max-margin loss instead of the cross-entropy loss. Later, Mikolov et al. (2010) showed further improvements using a recurrent neural network (Elman, 1990), which in principle can take into account an arbitrarily long past context of the word, unlike the bi-gram assumption made by Brown clustering.

2.9. Dense Word Representations using Approximate Matrix Factorization

So far, we have seen two approaches to learn dense word representations: either computing a co-occurrence matrix and factorizing it exactly using low-rank matrices, or training a model to predict the next word. These approaches naturally lead to the following question — what if we train a model to predict entries of the word-context co-occurrence matrix, instead of just the next word? Such an approach will approximately factorize the word-context co-occurrence matrix, but can use more expressive ways to model the entries of the co-occurrence matrix.

Skip-gram with Negative Sampling (SGNS)

The language modeling objective only utilizes the past context (i.e., context to the left) to predict the next word. Mikolov et al. (2013a) proposed the skip-gram model, an alternative training strategy to learn word representations that utilizes both the right and left context around a word.

The skip-gram model’s learning objective aims to train a classifier for a prediction task, and word representations are learnt as a by-product of this training objective. The prediction task aims to predict the surrounding words in a window around a central (or pivot) word, using the representations of the pivot and surrounding words as parameters of the classifier.

Formally, the learning objective of the skip-gram approach is,

L = - \sum_{w} \sum_{w_c \in \mathrm{Neighbors}(w)} \log \Pr(w_c \mid w)    (2.10)

The skip-gram formulation attempts to learn representations that are good at predicting neighboring words (in a predefined context window), where the predictions are modeled using a log-linear conditional probability,

\Pr(w_c \mid w) = \frac{\exp(\mathbf{w}_c^{\top} \mathbf{w})}{\sum_{w' \in V} \exp(\mathbf{w}'^{\top} \mathbf{w})}    (2.11)

The denominator of the conditional probability requires a sum over |V| terms, which can become prohibitively expensive to compute for vocabularies derived from large corpora. To avoid this computational bottleneck, efficient implementations of the skip-gram model employ negative sampling to optimize an approximation of the skip-gram objective. The skip-gram with negative sampling (SGNS) formulation poses the following binary classification task — is a given (w, c) pair a co-occurrence observed in the corpus D, or obtained by randomly pairing a word w with a context c? The randomly paired instances are referred to as negative samples from a noise distribution D', leading to the following learning objective,

\max \sum_{(w,c) \in D} \log \Pr(D = 1 \mid (w, c)) + \sum_{(w,c) \in D'} \log \Pr(D = 0 \mid (w, c))    (2.12)

where \Pr(D = 1 \mid (w, c)) = \frac{1}{1 + \exp(-\mathbf{w}^{\top}\mathbf{c})} = 1 - \Pr(D = 0 \mid (w, c))    (2.13)

is the probability that the model assigns to the (w, c) pair originating from the corpus.

Note that the SGNS objective does not involve computing a normalization term over the entire vocabulary, and thus can be efficiently optimized. The negative sampling approach can be shown to be a special case of the more general technique of noise contrastive estimation (Gutmann and Hyvärinen, 2012; Dyer, 2014), used for training un-normalized probabilistic models. It was shown by Levy and Goldberg (2014c) that the SGNS learning objective implicitly factorizes a shifted version of the PPMI matrix.
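A sketch of the SGNS objective of Equations 2.12-2.13 for a single observed (w, c) pair; the vectors and negative samples below are randomly generated stand-ins:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(w, c, noise_contexts):
    # Negative log-likelihood of Equation 2.12 for one observed pair (w, c)
    # and k negative samples; Pr(D=1 | (w, c)) = sigmoid(w^T c) (Equation 2.13).
    loss = -np.log(sigmoid(w @ c))
    for cn in noise_contexts:
        loss -= np.log(sigmoid(-(w @ cn)))   # Pr(D=0) = 1 - sigmoid(w^T c_n)
    return loss

rng = np.random.default_rng(0)
w, c = rng.normal(size=50), rng.normal(size=50)
noise = [rng.normal(size=50) for _ in range(5)]   # k = 5 negative samples
print(sgns_loss(w, c, noise))

Note that, as the text observes, no sum over the full vocabulary appears anywhere: only the observed pair and the k sampled pairs are scored.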

Related Word Prediction Tasks. Instead of predicting the next word using the left context words (as in language modeling), or predicting the context words using the pivot word (as in SGNS), other word prediction tasks can also be used to learn word representations. For instance, Even-Zohar et al. (1999); Even-Zohar and Roth (2000) train word representations by predicting a missing word using the representations of the other words in a sentence. The representations are learnt as a set of discriminators in the Sparse Network of Winnows (SNoW) architecture (Roth, 1998), with the aim of improving context-sensitive spelling correction. A similar learning objective to the above was proposed by Mikolov et al. (2013a), which they termed continuous bag-of-words (CBOW). In contrast to SGNS, the aim of CBOW is to train a classifier for the opposite prediction task — predict the pivot word using all the surrounding words.

L = - \sum_{w} \log \Pr(w \mid \mathrm{Neighbors}(w))    (2.14)

where \Pr(w \mid \mathrm{Neighbors}(w)) = \frac{\exp(\mathbf{w}^{\top} \mathbf{v})}{\sum_{w' \in V} \exp(\mathbf{w}'^{\top} \mathbf{v})} \quad \text{and} \quad \mathbf{v} = \sum_{w_c \in \mathrm{Neighbors}(w)} \mathbf{w}_c    (2.15)

The CBOW model is several times faster than the SGNS model. On the other hand, SGNS works well with even a small amount of training data, because each context window generates multiple training instances, unlike CBOW, where each context window generates a single training instance.

Global Vector Representations (GloVe)

Unlike SGNS, global vector representations (GloVe) (Pennington et al., 2014) explicitly factorize the log of the co-occurrence matrix. GloVe approximates the log co-occurrence count \log n(w, c) using the dot product of the vector representations \mathbf{w} and \mathbf{c} of the word and context, weighted using a function f of the co-occurrence frequency, f(n(w, c)). Formally, the loss for the entry corresponding to word w_i and context c_j in the co-occurrence matrix is,

L(w_i, c_j) = f(n(w_i, c_j)) \times \left( \mathbf{w}_i^{\top} \mathbf{c}_j + b_{w_i} + b_{c_j} - \log n(w_i, c_j) \right)^2    (2.16)

where f(x) = \begin{cases} (x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\ 1 & \text{otherwise} \end{cases}    (2.17)

The choice of f ensures that unobserved entries get a weight of 0, and neither rare co-occurrences nor frequent co-occurrences are over-weighted.5

The GloVe objective is closely related to PMI matrix factorization. Indeed, fixing b_{w_i} = \log n(w_i) and b_{c_j} = \log n(c_j) results in an objective that approximately factorizes the PMI-weighted co-occurrence matrix shifted by \log N, where N is the corpus size. However, unlike exact factorization approaches like LSA, GloVe weighs each entry in the matrix differently (using f(n(w_i, c_j))), and learns b_{w_i} and b_{c_j}, making it more expressive.
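The per-entry GloVe loss of Equations 2.16-2.17 can be written down directly; the vectors, biases and counts below are illustrative stand-ins, not trained values:

import numpy as np

def glove_loss(w, c, b_w, b_c, n_wc, x_max=100.0, alpha=0.75):
    # Weighted squared error of Equation 2.16 with the weighting f of
    # Equation 2.17; cells with n(w, c) = 0 are simply skipped (weight 0).
    f = (n_wc / x_max) ** alpha if n_wc < x_max else 1.0
    return f * (w @ c + b_w + b_c - np.log(n_wc)) ** 2

rng = np.random.default_rng(0)
w, c = rng.normal(size=50), rng.normal(size=50)
print(glove_loss(w, c, b_w=0.1, b_c=0.2, n_wc=20.0))

Training sums this loss over all non-zero cells of the co-occurrence matrix and minimizes it with respect to the vectors and biases.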

2.10. Limitations of Word Representations

Despite their overwhelming success as features for NLP tasks, word representations have their shortcomings. I discuss a few aspects in which these representations fall short, and how the community has attempted to overcome them.

Issues of the Distributional Hypothesis. Distributional approaches do not account for the compositional nature of language, and cannot represent larger units of language like phrases or sentences. Attempts have been made to remedy this by appealing to formal semantics and combining it with distributional representations (Grefenstette and Sadrzadeh, 2011; Garrette et al., 2011; Grefenstette, 2013; Lewis and Steedman, 2013; Beltagy et al., 2016).

Problems with distributional similarity are also exposed in lexical semantics tasks. For instance, distributional methods assign high similarity to word pairs like new and old, or happy and sad, and are thus unable to distinguish antonyms from synonyms (Grefenstette, 1992; Padó and Lapata, 2003). This is an artifact of the distributional hypothesis itself, as antonymous word pairs tend to occur in similar contexts even though they convey opposite meanings in those contexts.

5Pennington et al. (2014) report that α = 3/4 gave the best results.

A related issue is that measures like cosine similarity are symmetric, and are thus unable to reveal any asymmetric relationships that might hold between words. For instance, the cosine between rodent and squirrel is the same regardless of the order in which it is computed, even though there is an asymmetric relation between the two — squirrel is a kind of rodent. In Chapter 6, I will discuss an asymmetric similarity measure that can remedy this issue.

Single Representation per Word. The approaches discussed so far associate each word with a single representation. As a result, representations for polysemous words like screen are forced to capture multiple meanings by superimposing all the relevant meaning representations (Arora et al., 2018). Accurately capturing the semantics of such ambiguous words is important, as many frequent words have multiple senses.6 Several approaches that remedy this limitation by learning a fixed number (> 1) of representations per word have been proposed, using monolingual distributional information (Reisinger and Mooney, 2010; Huang et al., 2012). In Chapter 5, I will describe one such approach that uses multilingual translational information in addition to monolingual distributional information to dynamically learn a different number of representations per word.

Interpretability. A limitation of using dense vector representations is that the low-rank approximation process obscures the meaning of each dimension of the vector. For instance, in Table 2 the 5th dimension of the word cat denotes how often it co-occurs with breed, but after (say) factorizing the co-occurrence matrix, we cannot assign any meaning to the 5th dimension of a GloVe vector. This hurts interpretability — while each dimension in sparse representations corresponds to the presence (or absence) of a particular context word, this is not so for dense representations. To retain the interpretability of the sparse vector spaces, without incurring the limitations of naive sparse word representations, different sparse coding approaches have been proposed (Murphy et al., 2012; Faruqui et al., 2015b; Subramanian et al., 2018), which transform dense d-dimensional vector representations to sparse d-dimensional ones. In Chapter 6, we will examine one such sparse coding objective in the context of learning dependency based cross-lingual representations.

6Principle of Economical Versatility of Words (Zipf, 1949).

Cross-lingual Semantics. While word representations can be trained monolingually for each language using a large corpus, these representations are unable to capture meaningful relationships that exist between words across languages. For instance, monolingually trained word representations in Hindi and English will not reflect the fact that दूध in Hindi and milk in English are translationally equivalent. The reason behind this is that both representations have been trained solely using monolingual distributional information, and have no cross-lingual signal to align the vector spaces that are learnt independently. As a result, any NLP model that uses word representations as features for training is also limited to operating in a single language. This is undesirable, because it forces one to develop separate models for different languages, and prevents utilizing task-specific supervision available in one language to aid another language. Ideally, one would expect that a model trained using supervision available in one language should work reasonably well on a related language, something that a human learner can effectively adapt to do. In Chapter 3, I will describe how we can overcome this limitation by using different forms of cross-lingual alignments to align the semantic spaces across languages. Using such cross-lingual word representations for performing lexical semantic tasks, facilitating model transfer, and sharing supervision across languages is the main theme of this thesis.

CHAPTER 3 : Cross-lingual Word Representations

3.1. Introduction

One of the limitations of the traditional word representations discussed in the last chapter is that they are confined to a single language. As a result, NLP models that employ these representations as features for training cannot be transferred to a related language, despite the underlying task remaining the same. In this chapter, I discuss how word representations can be learnt across languages to capture cross-lingual relationships between words, in addition to the notion of monolingual distributional similarity, and facilitate model transfer.

I will discuss the various cross-lingual alignments that are available for learning cross-lingual representations, and evaluate their ability to extrinsically facilitate model transfer across languages for semantic and syntactic applications. The discussion will also focus on the cost of obtaining the cross-lingual alignment and the suitability of different signals for different tasks.

3.2. Motivation

Why do we expect cross-lingual word representations to aid model transfer? Suppose we have a supervised document classification model in English, and we wish to use it for document classification in French. One of the features used in the classification model in English is "the word law is present in the document". To use the English model, we need to translate this feature to its French equivalent, "the word loi is present in the document". One way to translate the feature representation is to use a discrete bilingual dictionary (such as the one shown in Figure 3), mapping a word in English to its translationally equivalent word in French. In practice, any such dictionary will have missing entries (missing dictionary entries are indicated by ??). Consequently, using such a discrete bilingual dictionary to translate features will have coverage problems, which can lead to the loss of useful features. For instance, the dictionary in Figure 3 cannot translate the feature "the word world is present in the document".

Figure 3: Cross-lingual word representations as a vector space approximation of discrete translation dictionaries. By encoding items present in the dictionary as points in a continuous vector space, one can recover some of the missing entries. Note that while the figure uses vector-based representations (which partition the space implicitly), a similar reasoning can be applied to cluster-based representations (which partition the space explicitly). (The figure shows an English–French dictionary — e.g., money/argent, children/enfants, law/loi, market/marche — with ?? entries for some words, alongside the corresponding English and French vector spaces.)

On the other hand, if the words in both languages are embedded in a shared representation space using the bilingual dictionary (we will see how in Section 3.4.3), such that word pairs that are translationally equivalent are close, it is possible to recover from this sparsity of the discrete dictionary. Intuitively, if two words in English are geometrically close in the vector space, and one of them is missing its translation, the other word's nearest neighbor in the other language can serve as the missing translation. Indeed, the ability to recover such missing entries is the basis of applications like bilingual dictionary induction (Rapp, 1995, 1999; Vulić and Moens, 2015; Irvine and Callison-Burch, 2017, inter alia), a task we will examine in Section 3.5.2.

In addition to capturing cross-lingual relationships, these embeddings can also aid in learning better representations for monolingual lexical semantics. This is possible because two or more words that align to similar words in another language should be semantically similar. Indeed, this intuition is the basis of paraphrase detection (Bannard and Callison-Burch, 2005; Callison-Burch, 2007). We will see cross-lingual signals aiding one such task in Chapter 5.

Notation. Let W = {w_1, w_2, ..., w_{|W|}} be the vocabulary of a language l_1 with |W| words, and \mathbf{W} ∈ R^{|W|×l} be the corresponding embedding matrix denoting word embeddings of length l, such that row i of the matrix \mathbf{W} corresponds to the vector of the word w_i. Similarly, let V = {v_1, v_2, ..., v_{|V|}} be the vocabulary of another language l_2 with |V| words, and \mathbf{V} ∈ R^{|V|×m} be the corresponding embedding matrix denoting word embeddings of length m. We denote the word vector for a word w by \mathbf{w}.

3.3. Cluster-based Cross-lingual Representations

In cross-lingual cluster-based representations, each cluster contains words in two (or more) languages that share some meaningful linguistic properties. A general approach for learning such representations was proposed by Täckström et al. (2012).

Täckström et al. (2012)'s approach involves two stages: words in the source language S are first clustered monolingually, and these clusters are then projected to the target language T using word alignments from parallel corpora. The monolingual clustering can be performed using any of the clustering algorithms discussed in Section 2.4. Each target word is assigned to the cluster with which it is most often aligned according to the word alignments. Formally, a target word v_i is assigned to the cluster C(v_i) which solves,

C(v_i) = \arg\max_{k \in \{1, \ldots, K\}} \sum_{(v_i, w_j) \in Q} n_{i,j} \cdot \mathbb{1}[C(w_j) = k]    (3.1)

where Q is the set of word alignments, n_{i,j} is the number of times word v_i is aligned to word w_j, and \mathbb{1}[p] is an indicator variable that is 1 when condition p is true. Täckström et al. (2012) showed that such cross-lingual word clusters can be included in lieu of brittle, language-specific lexical features for delexicalized transfer models.
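A sketch of the projection step in Equation 3.1; the alignment counts and source-side clusters below are toy assumptions, not data from Täckström et al. (2012):

from collections import defaultdict

def project_clusters(alignment_counts, source_clusters):
    # Equation 3.1: assign each target word to the source cluster it is
    # most frequently aligned with. alignment_counts[v][w] = n_ij.
    target_clusters = {}
    for v, counts in alignment_counts.items():
        votes = defaultdict(int)
        for w, n in counts.items():
            votes[source_clusters[w]] += n
        target_clusters[v] = max(votes, key=votes.get)
    return target_clusters

source_clusters = {"money": 1, "law": 2}              # toy source-side clusters
alignment_counts = {"argent": {"money": 7},
                    "loi": {"law": 5, "money": 1}}    # toy alignment counts
print(project_clusters(alignment_counts, source_clusters))  # {'argent': 1, 'loi': 2}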

An obvious limitation of the two-stage approach is that words outside the alignment dictionary will not get mapped to any cluster. To fix this, Täckström et al. (2012) also suggested a joint clustering approach that alternates between optimizing the monolingual clustering objective in each language and performing the projection step in Equation 3.1.

3.4. Vector-based Cross-lingual Representations

The cross-lingual clustering approach described above suffers from the same limitations of cluster-based representations discussed in Section 2.4. Furthermore, optimization-wise, the approach of fixing the clusters alternatingly is not satisfactory; ideally one would like a joint learning objective where the updates respect both monolingual and cross-lingual signals simultaneously. The vector-based cross-lingual word representations described in this section do not suffer from these issues.

I will describe four different cross-lingual representation learning approaches, and show how each is an instance of a general learning objective (Algorithm 1) to facilitate better understanding. The representation learning approaches use different forms of cross-lingual alignments (illustrated in Figure 4) as supervision for aligning the vector spaces. The alignments differ in the nature of the cross-lingual information they encode, and in terms of the cost of obtaining them. I describe them in the following sections.

Algorithm 1 A general algorithm for inducing cross-lingual word embeddings.
Input: Initial embedding matrices W0, V0; suitably defined monolingual losses A and B, and cross-lingual loss C; scalar weights α and β.
Output: W*, V*
1: Initialize W ← W0, V ← V0
2: (W*, V*) ← arg min α A(W) + β B(V) + C(W, V)
3: return (W*, V*)

3.4.1. Sentence and Word-level Alignment

Parallel corpora are a valuable source of translational information and are available for many high-resource languages. A parallel corpus contains aligned sentences in two (or more) languages, where aligned sentences are translations of each other. Word alignments can be automatically induced from such aligned sentences in a parallel corpus using a statistical aligner (e.g., the IBM Model 1 aligner (Brown et al., 1993), the cdec aligner (Dyer et al., 2013)). These sentence and word-level alignments (Figure 4a) derived from a parallel corpus are used to learn cross-lingual representations. There are many approaches that use these alignments to learn cross-lingual representations (Klementiev et al., 2012; Zou et al., 2013; Kočiský et al., 2014). This section describes one such approach, that of Luong et al. (2015b), henceforth referred to as BiSkip.

Figure 4: Different forms of cross-lingual alignments one can use as supervision for learning cross-lingual representations: (a) sentence and word-level alignment, (b) sentence-level alignment, (c) word-level alignment, and (d) document-level alignment. From left to right, the cost of the supervision required varies from expensive (sentence and word-level alignments) to cheap (document-level alignment).

The learning objective of BiSkip is an extension of the skip-gram model (Mikolov et al., 2013a), where the context of a word is expanded to include bilingual links obtained from word alignments, so that the model is trained to predict words cross-lingually. In particular, given a word alignment (v, w) ∈ Q from word v ∈ V in language l_2 to w ∈ W in language l_1, BiSkip predicts the context words of w using v, and vice-versa. Formally, we can define a cross-lingual loss term D_{12} for predicting the neighbors of word w in language l_1 using the embedding for v in language l_2 as,

D_{12}(\mathbf{W}, \mathbf{V}) = - \sum_{(v,w) \in Q} \; \sum_{w_c \in \mathrm{Neighbors}_1(w)} \log \Pr(w_c \mid v)    (3.2)

where \mathrm{Neighbors}_1(w) is the context of w in language l_1, Q is the set of word alignments, and \Pr(w_c \mid v) \propto \exp(\mathbf{w}_c^{\top} \mathbf{v}). A similar cross-lingual loss term D_{21} can be defined for predicting the neighbors of v in language l_2 using the embedding for w in l_1. In addition to the cross-lingual terms D_{12} and D_{21}, the BiSkip learning objective contains monolingual terms

for predicting the monolingual context of v and w in the respective languages,

D_{11}(\mathbf{W}) = - \sum_{w \in W} \; \sum_{w_c \in \mathrm{Neighbors}_1(w)} \log \Pr(w_c \mid w)    (3.3)

D_{22}(\mathbf{V}) = - \sum_{v \in V} \; \sum_{v_c \in \mathrm{Neighbors}_2(v)} \log \Pr(v_c \mid v)    (3.4)

The BiSkip objective can be cast into Algorithm 1 as,

A(\mathbf{W}) = D_{11}(\mathbf{W}), \quad B(\mathbf{V}) = D_{22}(\mathbf{V})    (3.5)

C(\mathbf{W}, \mathbf{V}) = D_{12}(\mathbf{W}, \mathbf{V}) + D_{21}(\mathbf{W}, \mathbf{V})    (3.6)

where A(\mathbf{W}) and B(\mathbf{V}) are the familiar SGNS formulations of the monolingual part of the objective from Section 2.9.

3.4.2. Sentence-level Alignment

Sometimes a phrase in one language aligns with a single word in another language, a fact that word-level alignments cannot capture. To overcome this limitation, Hermann and Blunsom (2014a) developed the Bilingual Compositional Vector Model (BiCVM) algorithm, which learns cross-lingual word representations using only sentence-level alignment (Figure 4b). Hermann and Blunsom (2014a) argued that allowing the model to learn similarity at a coarser level (namely, the sentence level) is better in such cases than forcing it to respect word alignments. BiCVM assumes that aligned sentences have equivalent meaning, thus their sentence representations should be similar.

Let \vec{v} = [\mathbf{x}_1, \ldots, \mathbf{x}_i, \ldots] and \vec{w} = [\mathbf{y}_1, \ldots, \mathbf{y}_j, \ldots] denote two aligned sentences, where \mathbf{x}_i ∈ \mathbf{V}, \mathbf{y}_j ∈ \mathbf{W} are vectors corresponding to the words in the sentences. Let functions f : \vec{v} → R^n and g : \vec{w} → R^n map sentences to their semantic representations in R^n. BiCVM generates word vectors by minimizing the squared ℓ2 norm between the sentence representations of aligned sentences. To prevent the degeneracy arising from directly minimizing the ℓ2 norm, BiCVM uses a noise-contrastive large-margin update, with randomly drawn sentence pairs (\vec{v}, \vec{w}_n) as negative samples. The margin-based loss (with margin δ) for the sentence pairs (\vec{v}, \vec{w}) and (\vec{v}, \vec{w}_n) can be written as,

E(\vec{v}, \vec{w}, \vec{w}_n) = \max(\delta + \Delta E(\vec{v}, \vec{w}, \vec{w}_n), 0)    (3.7)

where \Delta E(\vec{v}, \vec{w}, \vec{w}_n) = E(\vec{v}, \vec{w}) - E(\vec{v}, \vec{w}_n)    (3.8)

and E(\vec{v}, \vec{w}) = \| f(\vec{v}) - g(\vec{w}) \|^2    (3.9)

This can be cast into Algorithm 1 by,

C(\mathbf{W}, \mathbf{V}) = \sum_{\text{aligned } (\vec{v}, \vec{w}), \; \text{random } \vec{w}_n} E(\vec{v}, \vec{w}, \vec{w}_n)    (3.10)

A(\mathbf{W}) = \|\mathbf{W}\|^2, \quad B(\mathbf{V}) = \|\mathbf{V}\|^2    (3.11)

with A(\mathbf{W}) and B(\mathbf{V}) being regularizers, with α = β.
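A sketch of the margin-based loss in Equations 3.7-3.9, assuming (as one simple choice, not necessarily the encoders used by Hermann and Blunsom (2014a)) additive composition for f and g; the sentence matrices are random stand-ins:

import numpy as np

def bicvm_loss(v_words, w_words, noise_words, delta=1.0):
    # Margin loss of Equations 3.7-3.9; here f and g are additive composition.
    f_v = v_words.sum(axis=0)
    g_w = w_words.sum(axis=0)
    g_n = noise_words.sum(axis=0)
    E = lambda a, b: np.sum((a - b) ** 2)   # squared l2 distance (Equation 3.9)
    return max(delta + E(f_v, g_w) - E(f_v, g_n), 0.0)

rng = np.random.default_rng(0)
v_words = rng.normal(size=(4, 50))      # word vectors of the l2 sentence
w_words = rng.normal(size=(5, 50))      # word vectors of the aligned l1 sentence
noise_words = rng.normal(size=(6, 50))  # word vectors of a randomly drawn l1 sentence
print(bicvm_loss(v_words, w_words, noise_words))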

3.4.3. Word-level Alignment

The previous two approaches relied on parallel corpora in the languages of interest for deriving the cross-lingual alignment, an expensive resource that is not available for many languages (Lopez and Post, 2013). In contrast, bilingual translation dictionaries (Figure 4c) containing a few thousand words are much easier to obtain (or annotate) than a parallel corpus. These dictionaries can serve as another form of cross-lingual alignment, ensuring that the representations for translationally equivalent word pairs (i.e., entries in the dictionary) are similar. This principle is the basis of many representation learning algorithms that use bilingual dictionaries (Mikolov et al., 2013b; Lu et al., 2015; Smith et al., 2017; Artetxe et al., 2017); this section describes the approach proposed in Faruqui and Dyer (2014), henceforth referred to as BiCCA (Bilingual Canonical Correlation Analysis based embeddings).

The BiCCA model projects independently trained monolingual embedding matrices \mathbf{W}, \mathbf{V} using Canonical Correlation Analysis (CCA) (Hotelling, 1936) to respect a bilingual dictionary. CCA is an algorithm that takes as input pairs of observations \{(x_i, y_i)\}_{i=1}^{n} drawn from two different feature spaces (x_i ∈ X, y_i ∈ Y), and finds directions d_1 and d_2 such that the linear projections \{d_1^{\top} x_i\}_{i=1}^{n} and \{d_2^{\top} y_i\}_{i=1}^{n} are maximally correlated.

To apply CCA, matrices \mathbf{W}' ⊆ \mathbf{W}, \mathbf{V}' ⊆ \mathbf{V} are first constructed such that |\mathbf{W}'| = |\mathbf{V}'| and the corresponding rows (\mathbf{w}_i, \mathbf{v}_i) in the matrices are representations of words (w_i, v_i) that are translations of each other. The projection is then computed as:

P_W, P_V = \mathrm{CCA}(\mathbf{W}', \mathbf{V}')    (3.12)

\mathbf{W}^* = \mathbf{W} P_W, \quad \mathbf{V}^* = \mathbf{V} P_V    (3.13)

where P_W ∈ R^{l×d}, P_V ∈ R^{m×d} are the projection matrices with d ≤ min(l, m), and \mathbf{W}^* ∈ R^{|W|×d}, \mathbf{V}^* ∈ R^{|V|×d} are the word vectors that have been aligned using the entries in the bilingual dictionary.

The BiCCA objective can be viewed as the following instantiation of Algorithm 1,1

\mathbf{W}^0 = \mathbf{W}', \quad \mathbf{V}^0 = \mathbf{V}'    (3.14)

C(\mathbf{W}, \mathbf{V}) = \|\mathbf{W} - \mathbf{V}\|^2 + \gamma (\mathbf{V}^{\top} \mathbf{W})    (3.15)

A(\mathbf{W}) = \|\mathbf{W}\|^2 - 1, \quad B(\mathbf{V}) = \|\mathbf{V}\|^2 - 1    (3.16)

where \mathbf{W} = \mathbf{W}^0 P_W and \mathbf{V} = \mathbf{V}^0 P_V, with α = β = γ = ∞ to set hard constraints.
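A sketch of the BiCCA projection using scikit-learn's CCA implementation; note that scikit-learn centers the data internally, so this is only an approximation of Equations 3.12-3.13, and the dictionary and embedding matrices below are toy stand-ins:

import numpy as np
from sklearn.cross_decomposition import CCA

def bicca_project(W, V, dict_pairs, d):
    # Fit CCA on the rows corresponding to dictionary entries (W', V'),
    # then project the full vocabularies (Equations 3.12-3.13).
    W_prime = W[[i for i, _ in dict_pairs]]
    V_prime = V[[j for _, j in dict_pairs]]
    cca = CCA(n_components=d).fit(W_prime, V_prime)
    # x_rotations_ / y_rotations_ play the role of P_W and P_V here.
    return W @ cca.x_rotations_, V @ cca.y_rotations_

rng = np.random.default_rng(0)
W, V = rng.normal(size=(200, 50)), rng.normal(size=(300, 40))
dict_pairs = [(i, i) for i in range(100)]   # toy bilingual dictionary (row indices)
W_star, V_star = bicca_project(W, V, dict_pairs, d=20)
print(W_star.shape, V_star.shape)           # (200, 20) (300, 20)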

3.4.4. Document-level Alignment

Another inexpensive cross-lingual alignment can be derived by identifying comparable documents between two languages. Comparable documents are thematically aligned, in that they contain overlapping information, but neither document is an exact translation of the other (Figure 4d). Examples of comparable documents in different languages are news articles that report on the same event, captions of the same image, or Wikipedia pages describing the same entity. Such document-level alignments offer an attractive alternative to word or sentence level alignments for training cross-lingual representations, as they are relatively easier to obtain. This section describes the approach proposed by Vulić and Moens (2015), henceforth referred to as BiVCD (Bilingual Vectors from Comparable Data), which exploits this cross-lingual information.

1This formulation is described in Section 6.5 of (Hardoon et al., 2004).

Let D_e and D_f denote a pair of comparable documents with lengths (in words) p and q respectively (assume p > q). BiVCD first merges these two comparable documents into a single pseudo-bilingual document. The pseudo-bilingual document is constructed by assuming a monotonic alignment between the words of D_e and D_f. That is, every R-th word of the merged pseudo-bilingual document is picked sequentially from D_f, where R = ⌊p/q⌋ is the length ratio of the two documents. Finally, a skip-gram model is trained on the corpus of pseudo-bilingual documents, to generate vectors for all words in W* ∪ V*. The vectors constituting \mathbf{W}* and \mathbf{V}* can then be easily identified.
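A sketch of the pseudo-bilingual document construction; the interleaving below is one reading of the merging scheme described above, and the documents are toy examples:

def merge_documents(doc_e, doc_f):
    # Build a pseudo-bilingual document: after every R-th word of D_e,
    # insert the next word of D_f, where R = floor(p / q); assumes p >= q.
    R = len(doc_e) // len(doc_f)
    f_words = iter(doc_f)
    merged = []
    for i, w in enumerate(doc_e, start=1):
        merged.append(w)
        if i % R == 0:
            nxt = next(f_words, None)
            if nxt is not None:
                merged.append(nxt)
    return merged

doc_e = "the economy grew strongly this quarter analysts said".split()
doc_f = "l'économie a progressé".split()
print(" ".join(merge_documents(doc_e, doc_f)))

A standard monolingual skip-gram model run over such merged documents then sees words of both languages inside the same context windows.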

Instantiating BiVCD in the general algorithm is straightforward — C(\mathbf{W}, \mathbf{V}) assumes the familiar SGNS objective over the pseudo-bilingual document,

C(\mathbf{W}, \mathbf{V}) = - \sum_{s \in W \cup V} \; \sum_{t \in \mathrm{Neighbors}(s)} \log \Pr(t \mid s)    (3.17)

where t, s ∈ W ∪ V are words in the combined vocabulary, \mathrm{Neighbors}(s) is the neighborhood of s in the pseudo-bilingual document, and \Pr(t \mid s) \propto \exp(\mathbf{t}^{\top}\mathbf{s}), similar to SGNS.

Although BiVCD is designed to use comparable corpora, in the experiments it is provided with parallel data (treating two aligned sentences as comparable) to ensure comparability with other methods.

3.5. Experiments

The experiments in this section evaluate the different cross-lingual representation learning approaches on four different tasks — monolingual lexical similarity (Section 3.5.1), cross-lingual dictionary induction (Section 3.5.2), cross-lingual document classification (Section 3.5.3) and cross-lingual dependency parsing (Section 3.5.4). The aim of the monolingual lexical similarity task is to assess how much the English representations benefit from cross-lingual training. On the other hand, the aim of the remaining experiments is to measure the ability of the cross-lingually trained representations to facilitate the transfer of semantic or syntactic knowledge across languages.

3.5.1. Monolingual Lexical Similarity

The first experiment evaluates whether the inclusion of cross-lingual knowledge improves the quality of English embeddings, by using them for two monolingual lexical semantic tasks.

Word Similarity. Word similarity datasets contain word pairs that are assigned similarity ratings by humans. The task evaluates how well the notion of word similarity according to humans is emulated in the vector space.

This experiment uses the SimLex word similarity dataset for English (Hill et al., 2014), which contains 999 word pairs, with a balanced set of noun, adjective and verb pairs. Evaluation is based on the Spearman's rank correlation coefficient (Spearman, 1904) between the human rankings and the rankings produced by computing the cosine similarity between the vectors of two words. An improvement is considered statistically significant if p < 0.1 according to Steiger's method (Steiger, 1980) for calculating the significance of differences between two dependent correlation coefficients. The Spearman rank correlation coefficient and Steiger's test are described in detail in Appendix A.1.
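A sketch of this evaluation protocol: compute cosine similarities for the dataset's word pairs and correlate them with the human ratings. The vectors and ratings below are stand-ins, not SimLex data:

import numpy as np
from scipy.stats import spearmanr

def evaluate_word_similarity(pairs, human_ratings, vectors):
    # Spearman correlation between human ratings and model cosine scores.
    cos = lambda v, w: (v @ w) / (np.linalg.norm(v) * np.linalg.norm(w))
    model_scores = [cos(vectors[a], vectors[b]) for a, b in pairs]
    rho, _ = spearmanr(human_ratings, model_scores)
    return rho

rng = np.random.default_rng(0)
vectors = {w: rng.normal(size=50) for w in ["old", "new", "ancient", "smart"]}
pairs = [("old", "new"), ("old", "ancient"), ("smart", "new")]
print(evaluate_word_similarity(pairs, [1.6, 6.5, 0.3], vectors))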

Table 3 shows the performance of the English embeddings induced using different cross-lingual alignments on the SimLex word similarity task, for different language pairs. As a baseline, the score obtained by monolingual English embeddings trained on the respective English side of each language pair is shown in the column marked Mono. For all language pairs (except English–Swedish for BiCCA), the bilingually trained vectors achieve better scores than the monolingually trained vectors.

Lang. Pair        Mono   BiSkip   BiCVM   BiCCA   BiVCD
English–German    0.29   0.34     0.37    0.30    0.32
English–French    0.30   0.35     0.39    0.31    0.36
English–Swedish   0.28   0.32     0.34    0.27    0.32
English–Chinese   0.28   0.34     0.39    0.30    0.31
avg.              0.29   0.34     0.37    0.30    0.33

Table 3: Word similarity scores measured in Spearman's correlation ratio for English on SimLex-999. The best score for each language pair is shown in bold. Scores that are significantly better (per Steiger's method with p < 0.1) than the next lower score are underlined. For example, for English–Chinese, BiCVM is significantly better than BiSkip, which in turn is significantly better than BiVCD.

Overall, across all language pairs, BiCVM is the best performing model in terms of Spearman's correlation, but its improvement over BiSkip and BiVCD is often insignificant. It is notable that 2 of the 3 top performing models, BiCVM and BiVCD, need only sentence-aligned and document-aligned corpora respectively, which are easier to obtain than the parallel data with word alignments required by BiSkip.

Qvec. Qvec is another intrinsic evaluation metric for estimating the quality of English word vectors, proposed by Tsvetkov et al. (2015). The score produced by Qvec measures how well a given set of word vectors is able to quantify linguistic properties of words, with higher being better. Tsvetkov et al. (2015) showed that high Qvec scores have a strong correlation with good performance on downstream semantic applications.

Table 4 shows Qvec scores for the English embeddings learnt using different cross-lingual alignments. On average, BiSkip achieves the best score across language pairs, followed by Mono (monolingually trained English vectors), BiVCD and BiCCA. A possible explanation for why the Mono scores are better than those obtained by some of the cross-lingual models is that Qvec measures monolingual semantic content based on a linguistic oracle made for English. Cross-lingual training might affect these semantic properties arbitrarily.

Lang. Pair        Mono   BiSkip   BiCVM   BiCCA   BiVCD
English–German    0.39   0.40     0.31    0.33    0.37
English–French    0.39   0.40     0.31    0.33    0.38
English–Swedish   0.39   0.39     0.31    0.32    0.37
English–Chinese   0.40   0.40     0.32    0.33    0.38
avg.              0.39   0.40     0.31    0.33    0.38

Table 4: Intrinsic evaluation of English word vectors measured in terms of Qvec score across models. The best scores for each language pair are shown in bold.

Interestingly, BiCVM, which was the best model according to SimLex, ranks last according to Qvec. The fact that the best models according to Qvec and SimLex are different reinforces observations made in previous work that performance on word similarity tasks alone does not reflect the quantification of linguistic properties of words (Tsvetkov et al., 2015; Schnabel et al., 2015).

3.5.2. Cross-lingual Dictionary Induction

The cross-lingual dictionary induction task (Vulić and Moens, 2013; Gouws et al., 2015; Mikolov et al., 2013b) judges how good cross-lingual embeddings are at detecting word pairs that are translationally equivalent (e.g., cow in English and vache in French).

The experiment follows the setup of Vulić and Moens (2013), and evaluates whether the correct translation of an English word is present among its top-10 nearest neighbors in the target language, using the word embeddings and cosine similarity as a measure of distance. Instead of manually creating a gold cross-lingual dictionary for evaluation, gold dictionaries are derived from the Open Multilingual WordNet data released by Bond and Foster (2013).

The data includes synset alignments across 26 languages with over 90% accuracy. First, words from each synset whose corpus frequency is less than 1000 are pruned. Then, for each pair of aligned synsets s_1 = {k_1, k_2, ...}, s_2 = {g_1, g_2, ...}, all elements from the set {(k, g) | k ∈ s_1, g ∈ s_2} are included in the gold dictionary, where k and g are the lemmas. Using this approach, dictionaries of sizes 1.5k, 1.4k, 1.0k and 1.6k pairs are constructed for English–French, English–German, English–Swedish and English–Chinese respectively. These dictionaries serve as the gold dictionaries for evaluating the cross-lingual embeddings. Top-10 accuracy is reported in the experiments, which is the fraction of entries (e, f) in the gold dictionary for which f belongs to the list of top-10 neighbors of the word vector of e, according to the induced cross-lingual embeddings. The results are shown in Table 5.

l1        l2        BiSkip   BiCVM   BiCCA   BiVCD
English   German    79.7     74.5    72.4    62.5
English   French    78.9     72.9    70.1    68.8
English   Swedish   77.1     76.7    74.2    56.9
English   Chinese   69.4     66.0    59.6    53.2
          avg.      76.3     72.5    69.1    60.4

Table 5: Cross-lingual dictionary induction results (top-10 accuracy). The same trend was also observed across models when computing MRR (mean reciprocal rank).
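A sketch of the top-10 accuracy computation, with randomly generated stand-in embeddings and a toy gold dictionary:

import numpy as np

def top10_accuracy(gold, W_star, V_star, src_index, tgt_words):
    # Fraction of gold pairs (e, f) with f among the 10 nearest target
    # neighbors of e under cosine similarity.
    Vn = V_star / np.linalg.norm(V_star, axis=1, keepdims=True)
    hits = 0
    for e, f in gold:
        v = W_star[src_index[e]]
        sims = Vn @ (v / np.linalg.norm(v))
        top10 = {tgt_words[i] for i in np.argsort(-sims)[:10]}
        hits += f in top10
    return hits / len(gold)

rng = np.random.default_rng(0)
W_star, V_star = rng.normal(size=(50, 20)), rng.normal(size=(40, 20))
src_index = {"en_%d" % i: i for i in range(50)}
tgt_words = ["fr_%d" % i for i in range(40)]
gold = [("en_0", "fr_3"), ("en_1", "fr_7")]   # toy gold dictionary
print(top10_accuracy(gold, W_star, V_star, src_index, tgt_words))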

From the results, it can be seen that for dictionary induction, performance improves with the quality of supervision. As we move from cheaply supervised methods (e.g., BiVCD) to more expensive supervision (e.g., BiSkip), the accuracy improves. This suggests that for cross-lingual similarity tasks, the more expensive the cross-lingual knowledge available, the better. Models using weak supervision like BiVCD perform poorly in comparison to models like BiSkip and BiCVM, with performance gaps greater than 10 points on average.

3.5.3. Cross-lingual Document Classification

To judge how good these cross-lingual representations are at facilitating model transfer for a downstream semantic task, we use them as features for cross-lingual document classification (CLDC).

In the CLDC task, a document classifier is trained using the document representations in language l_1, and then the trained model is tested on documents from language l_2. The document representation is computed by taking the TF-IDF (Luhn, 1957; Sparck Jones, 1972; Salton and Buckley, 1988) weighted average of the vectors (from the appropriate language) of the words present in the document. By using supervised training data in one language and evaluating without further supervision in another, CLDC assesses whether the learned cross-lingual representations are semantically coherent across multiple languages.

l1        l2        BiSkip   BiCVM   BiCCA   BiVCD
English   German    85.2     85.0    79.1    79.9
English   French    77.7     71.7    70.7    72.0
English   Swedish   72.3     69.1    65.3    59.9
English   Chinese   75.5     73.6    69.4    73.0
German    English   74.9     71.1    64.9    74.1
French    English   80.4     73.7    75.5    77.6
Swedish   English   73.4     67.7    67.0    78.2
Chinese   English   81.1     76.4    77.3    80.9
          avg.      77.6     73.5    71.2    74.5

Table 6: Cross-lingual document classification accuracy when trained on language l1 and tested on language l2. The best scores for each language are shown in bold. Majority baselines for English → l2 and l1 → English are 49.7% and 46.7% respectively, for all languages. Scores significantly better (per McNemar's Test, p < 0.05) than the next best score are underlined. For example, for Swedish → English, BiVCD is significantly better than BiSkip, which is significantly better than BiCVM.
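A sketch of the document representation used here — a TF-IDF weighted average of word vectors; the IDF values and vectors below are illustrative stand-ins:

import numpy as np

def document_vector(tokens, vectors, idf):
    # TF-IDF weighted average of the vectors of the words in the document.
    words = sorted({w for w in tokens if w in vectors and w in idf})
    weights = np.array([tokens.count(w) * idf[w] for w in words])   # tf * idf
    V = np.stack([vectors[w] for w in words])
    return (weights[:, None] * V).sum(axis=0) / weights.sum()

rng = np.random.default_rng(0)
vectors = {w: rng.normal(size=50) for w in ["market", "law", "money"]}
idf = {"market": 1.2, "law": 2.1, "money": 1.7}   # hypothetical IDF values
doc = "the market and the money market".split()
print(document_vector(doc, vectors, idf).shape)   # (50,)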

The cross-lingual document classification (CLDC) setup of Klementiev et al. (2012) and the RCV2 Reuters multilingual corpus2 are used in this experiment. A multi-class classifier is trained using an averaged perceptron (Freund and Schapire, 1999) for 10 iterations, using the document vectors of language l_1 in the RCV2 corpus as features, and then tested on the documents in the target language l_2. When l_1 = English (i.e., for the English → l_2 direction), the model is trained on 1000 documents in English and tested on 1800 documents of the target language l_2. When l_2 = English (i.e., for the l_1 → English direction), the model is trained on 1000 documents of language l_1 and tested on 5000 documents of English. Table 6 shows the classification accuracies of the different models across different language pairs.

2http://trec.nist.gov/data/reuters/reuters.html

l         Mono   BiSkip   BiCVM   BiCCA   BiVCD
German    71.1   72.0     60.4    71.4    58.9
French    78.9   80.4     73.7    80.2    69.5
Swedish   75.5   78.2     70.5    79.0    64.5
Chinese   73.8   73.1     65.8    71.7    67.0
avg.      74.8   75.9     67.6    75.6    66.8

Table 7: Labeled attachment score (LAS) for dependency parsing when trained and tested on language l (direct-parsing experiment). Mono refers to a parser trained with monolingually induced embeddings. Scores in bold are better than the Mono scores for each language, showing improvement from cross-lingual training.

Table 6 shows that in almost all cases, BiSkip performs significantly better than the remaining models. This suggests that for transferring semantic knowledge across languages via embeddings, using both sentence and word level alignment during training is superior to using sentence or word level alignment alone. This observation parallels the trend in the cross-lingual dictionary induction experiments.

It is interesting to note that BiVCD does well on languages like French and Chinese, but poorly on German and Swedish. A possible reason is that BiVCD assumes a monotonic alignment between words in documents to create the pseudo-cross-lingual document for training, which appears to work well for languages with largely the same subject-verb-object order as English.

3.5.4. Cross-lingual Dependency Parsing

How useful are these cross-lingual representations as features for a downstream syntactic task? This is evaluated using two experiments — (a) direct-parsing: using representations for language l_1 trained using some cross-lingual signal as features for training a dependency parser for l_1, and (b) model-transfer: training a dependency parser on the English Treebank and then transferring the model to language l_2 via cross-lingual representations.

The direct-parsing experiment evaluates whether using cross-lingually trained vectors is better than using monolingually trained vectors for training dependency parsers. The comparison is done against parsers trained using monolingually trained word vectors. These vectors are the same as those used as input to BiCCA. All other settings remain the same.

l1        l2        BiSkip   BiCVM   BiCCA   BiVCD
English   German    49.8     47.5    51.3    49.0
English   French    65.8     63.2    65.9    60.7
English   Swedish   56.9     56.7    59.4    54.6
          avg.      57.5     55.8    58.9    54.8
German    English   49.7     45.0    50.3    43.6
French    English   53.3     50.6    54.2    49.5
Swedish   English   48.2     49.0    49.9    44.6
          avg.      50.4     48.2    51.5    45.9

Table 8: Labeled attachment score (LAS) for cross-lingual dependency parsing when trained on language l1 and evaluated on language l2 (model-transfer experiment). The best score for each language is shown in bold.

In the model-transfer experiments, a model is trained using embeddings for language l_1 and then tested on language l_2, replacing the embeddings for language l_1 with those of l_2. For these experiments, the dependency parsing model from Guo et al. (2015) is used. The model trains a transition-based dependency parser using a nonlinear activation function, with the source-side word embeddings as lexical features. These embeddings can be replaced by target-side embeddings at test time. The universal dependency treebank (McDonald et al., 2013) version 2.0 is used for training and evaluation.

The results for the direct-parsing experiment are in Table 7, and the results for the model-transfer experiment are in Table 8. On average, for both experiments, BiCCA and BiSkip perform better than the other models. BiSkip is a close second to BiCCA, with an average performance gap of less than 1 point. For the direct-parsing experiment, improvement over the monolingual embeddings (column Mono in Table 7) was obtained with BiSkip and BiCCA, while BiCVM and BiVCD consistently performed worse. Unlike the semantic transfer tasks considered earlier, models with expensive supervision requirements like BiSkip and BiCVM could not outperform BiCCA, a model with relatively inexpensive requirements, in either experiment. A possible reason for this is that approaches like BiCVM and BiVCD operate on sentence level contexts to learn the embeddings, which only captures the semantic meaning of the sentences and ignores the internal syntactic structure. As a result, embeddings trained using BiCVM and BiVCD are not informative for syntactic tasks. On the other hand, approaches like BiSkip and BiCCA both utilize word alignment information to train their embeddings and thus do better at capturing some notion of syntax.

3.6. Summary

This chapter provided a review and a unifying perspective of different approaches for learning distributional word representations cross-lingually. A myriad of approaches for learning cross-lingual word representations have been proposed, and reviewing all of them is beyond the scope of this thesis (for a more thorough survey, see Ruder et al. (2018)). Nevertheless, the general principles and the algorithms in use are more-or-less the same. We saw how each of these approaches can be viewed as an instance of the same algorithm, revealing the same underlying principle — combine monolingual distributional information with some form of cross-lingual alignment. Experimental evaluation on different semantic and syntactic tasks revealed that some cross-lingual alignments are a natural fit for certain tasks (e.g., word level alignments are essential for syntactic transfer), while for others (e.g., dictionary induction) the performance was correlated with the cost and quality of the cross-lingual alignment. Nevertheless, all forms of alignments successfully facilitated model transfer (to varying degrees) across languages, a desideratum of cross-lingual representations.

CHAPTER 4 : Deriving Cross-lingual Representations from Wikipedia

4.1. Introduction

The previous chapter described different approaches for learning cross-lingual representations by incorporating various forms of cross-lingual alignments into distributional information acquired from unstructured text. In addition to unstructured resources, there are also structured sources of semantic information, such as encyclopedic resources like Wikipedia.

In this chapter, I discuss an approach for learning cross-lingual representations that uses such multilingual encyclopedic resources. Unlike the representations discussed in the previous chapter, all of which appealed to the distributional hypothesis, the representation used in this chapter appeals to the bag-of-words hypothesis (Salton et al., 1975; Salton and Buckley, 1988) described in Chapter 2. Specifically, I show how we can exploit the inter-lingual structure of Wikipedia to apply the bag-of-words hypothesis across languages, and learn shared sparse semantic representations that can be used to classify documents in multiple languages with no supervision.

4.2. Background

This chapter builds on two related research directions — an approach to derive semantic representations of documents using encyclopedic resources, and an approach to perform unsupervised document classification. I briefly describe them below.

4.2.1. Explicit Semantic Representations (ESA)

Gabrilovich and Markovitch (2009) noticed that distributional approaches for learning word representations are completely divorced from efforts to organize world knowledge (like WordNet (Miller, 1995), SUMO (Niles and Pease, 2001), CyC (Lenat, 1995)). As a result, these statistical representations need lots of data to learn relations (e.g., ‘Clinton’ is related to ‘Arkansas’) that are explicitly stated in compiled knowledge resources, because such co-occurrences might not be frequent enough even in large corpora.

entity \ word    president   executive   ···   administration
Barack_Obama         95          11      ···         22
United_States        28           3      ···          3
...
London                0           2      ···          3

Table 9: Concept-term matrix constructed using the English Wikipedia. Explicit Semantic Analysis (ESA) representations are derived from the matrix above by reweighting each entry using TF-IDF and normalizing the columns. Each column describes a word in English using a “bag-of-concepts” in Wikipedia.

However, learning good representations from such compiled knowledge resources is not straightforward. Most of these resources are relatively small in size due to the expensive manual effort involved in creating them, and are unlikely to scale to large vocabularies. Nevertheless, this manual effort is necessary to ensure high quality. What is needed is a resource that compiles human knowledge at scale while maintaining its high quality.

Gabrilovich and Markovitch (2009) argued that encyclopedic resources like Wikipedia are one such resource, and showed how the meaning of words and longer pieces of text can be derived directly from such encyclopedic resources.

Wikipedia is a collaboratively maintained online encyclopedic resource consisting of descriptions of entities, events, and technical terms in hypertext documents, referred to as Wikipedia articles (or pages). Wikipedia is constructed by thousands of volunteer editors around the world, who not only generate content but also ensure its quality. The English Wikipedia contains over 5 million articles, dwarfing other encyclopedic resources like Britannica in both size and quality (Giles, 2005), and is growing in size every year.1

Gabrilovich and Markovitch (2009) introduced Explicit Semantic Analysis (ESA), a method to exploit encyclopedic resources like Wikipedia to generate semantic representations of words and documents.

1An average of 605 articles were added every day to the English Wikipedia during 2017–2018, according to https://en.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia.

ESA assumes each Wikipedia article A corresponds to a concept, and each word can be represented by the concepts that contain it. ESA computes a concept-term matrix T (such as the one shown in Table 9) of size |C| × |V| from Wikipedia, where V is the vocabulary of the English Wikipedia and C = {A_1, ..., A_{|C|}} is the set of Wikipedia articles, such that each row corresponds to an English Wikipedia article, while each column corresponds to a word in the vocabulary. The entry T[i, j] denotes the TF-IDF (Luhn, 1957; Sparck Jones, 1972; Salton and Buckley, 1988) value of the jth word of the vocabulary in the Wikipedia page A_i. The matrix T will be sparse, for reasons similar to those discussed in Section 2.6. The (sparse) explicit semantic representation of word w (whose index in the vocabulary is i) is the ith column of T,

Φ(w) = [Φ(A_1, w)  ...  Φ(A_{|C|}, w)] ∈ R^{|C|}        (4.1)

where Φ(A_i, w) is the weight indicating how important word w is in the Wikipedia article A_i. Note that these representations are explicit (as opposed to latent), as the dimensions of the ESA representation convey meaningful information about the concepts relevant to a word, unlike dense vector representations like LSA.

To generate the representation of a document D using ESA, a TF-IDF weighted average of the ESA representations of the words in the document is computed, as in Section 3.5.3. A document D can be viewed as an indicator vector over words in the vocabulary,

D = [w_1  ...  w_{|V|}] ∈ R^{|V|}        (4.2)

where V is the vocabulary and w_i ∈ {0, 1}. If p_i is the inverse document frequency (IDF) of w_i in the English Wikipedia, then the vector representation for D is,

Φ(D) = (1/|V|) Σ_i p_i Φ(w_i)        (4.3)

Gottron et al. (2011) showed that ESA representations are a kind of generalized vector space model (Wong et al., 1985), where document similarity is affected by correlations between the different context dimensions (encoded as a matrix G), i.e., similarity(x, y) = x^⊤ G y.

I will use Φ(D) and Φ(w) to denote the ESA representation of a document D and a word w, respectively, in the rest of the chapter.
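To make the construction concrete, below is a minimal sketch of ESA in Python using numpy and scikit-learn; the three-article “Wikipedia” and all names are illustrative stand-ins under stated assumptions, not the original implementation.

```python
# Minimal ESA sketch: the columns of a TF-IDF concept-term matrix are word
# vectors (Equation 4.1); documents are IDF-weighted averages (Equation 4.3).
# Toy three-article "Wikipedia"; illustrative only.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

articles = {
    "Barack_Obama":  "president executive administration election president",
    "United_States": "president administration country federal",
    "London":        "city executive administration river",
}

vectorizer = TfidfVectorizer()
# T has shape |C| x |V|: one row per article (concept), one column per word.
T = vectorizer.fit_transform(articles.values()).toarray()
vocab = vectorizer.vocabulary_            # word -> column index

def esa_word(word):
    """Phi(w): the column of T for word w (Equation 4.1)."""
    return T[:, vocab[word]]

def esa_document(doc_words):
    """Phi(D): IDF-weighted average of the word vectors (Equation 4.3)."""
    idf = dict(zip(vectorizer.get_feature_names_out(), vectorizer.idf_))
    vecs = [idf[w] * esa_word(w) for w in doc_words if w in vocab]
    return np.mean(vecs, axis=0)

print(esa_word("president"))                     # a |C|-dimensional vector
print(esa_document(["president", "executive"]))
```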

4.2.2. Dataless Document Classification using ESA

Humans can assign a label to a document without requiring any training examples, simply by understanding what the label means. However, traditional document classification approaches treat the labels as atomic symbols (similar to the one-hot representations in Chapter 2), without ascribing any meaning to them. As a result, these methods require supervision to train a model to map documents to these atomic symbols. How can one assign a more meaningful representation to both documents and labels so that supervision is not needed? Chang et al. (2008) showed how one can achieve this using the ESA representations discussed earlier, by representing documents and labels in the same semantic space. As a result, classification can be done simply by a nearest-neighbor search, without requiring any supervision. They named their approach dataless classification, which we describe below.2

Let D denote a document, and {l_1, ···, l_m} denote the set of m possible labels that can be assigned. Chang et al. (2008) assumed that each label l_i is accompanied by a short description desc(l_i) that contains prototypical terms that illustrate the semantics of the label. For instance, the label sport can be described by the terms “baseball basketball hockey” and the label politics can be described as “democrats gun-rights congress constitution”.

Dataless classification generates a representation Φ(D) for a document D and representations {Φ(l_1), ..., Φ(l_m)} for each of the m labels, in the same semantic space. The document representation Φ(D) is constructed as above (Equation 4.3), while the label representation is computed by treating its description as a ‘document’. For example, the labels sports and politics will be assigned representations Φ(desc(sports)) and Φ(desc(politics)) respectively.

2The name is a bit of a misnomer, as data (from Wikipedia) is being used. It is dataless in the sense that it is unsupervised.

[Figure: inter-language links connect en.wikipedia.org/wiki/Albert_Einstein and hi.wikipedia.org/wiki/अल्बर्ट_आइंस्टीन.]

Figure 5: Inter-language links in English Wikipedia and Hindi Wikipedia link the corresponding articles that describe the entity Albert_Einstein in both Wikipedias.

As both the documents and the labels now reside in the same space, we can predict the label(s) that maximize the similarity of the label and the document representation,

l* = arg max_i  Φ(D)^⊤ Φ(l_i) / (∥Φ(D)∥ ∥Φ(l_i)∥)        (4.4)

Besides being “dataless”, an attractive property of this approach is that classification can be done on-the-fly; that is, the label space is not necessarily known in advance. One can easily handle a new label at no cost, by simply using its description to generate the label representation at test time.
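A minimal sketch of the prediction rule in Equation (4.4), with random stand-in vectors (in real use these would be the ESA vectors of Section 4.2.1); the labels and vectors are illustrative assumptions.

```python
# Dataless classification: nearest label by cosine similarity (Equation 4.4).
import numpy as np

def cosine(a, b):
    return a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def dataless_classify(doc_vec, label_vecs):
    """Pick the label whose description vector is most similar to the document.

    label_vecs maps label name -> ESA vector of its description (the
    description is embedded as if it were a document). New labels can be
    added at test time at no cost ("on-the-fly" classification).
    """
    return max(label_vecs, key=lambda l: cosine(doc_vec, label_vecs[l]))

# Toy usage with random stand-in vectors.
rng = np.random.default_rng(0)
doc = rng.random(5)
labels = {"sports": rng.random(5), "politics": rng.random(5)}
print(dataless_classify(doc, labels))
```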

4.3. Inter-lingual Structure of Wikipedia

While the ESA representations described earlier used the English Wikipedia, Wikipedias exist in over 250 languages, where each article describes a concept in that language. Furthermore, the concepts described in different Wikipedias often overlap, indirectly introducing information redundancy across languages, as some concepts have articles in more than one language. Such articles in Wikipedia are linked through inter-language links. Formally, an inter-language link A_i ↔ B_j indicates that article A_i in language l1’s Wikipedia and article B_j in language l2’s Wikipedia describe the same concept. For instance, Figure 5 shows that the article अल्बर्ट_आइंस्टीन in the Hindi Wikipedia corresponds to the article Albert_Einstein in the English Wikipedia. These inter-language links identify the same concept in different Wikipedias, thereby unifying the concepts across languages. Table 10 shows statistics of the number of articles in the Wikipedias of 9 languages, along with the size of the intersection of the respective Wikipedia with the articles in the English Wikipedia.

Language   # Articles   Intersection   %
German     2.06M        1.06M          52
French     1.87M        1.21M          65
Italian    1.36M        927k           68
Spanish    1.29M        850k           66
Chinese    942k         519k           55
Arabic     521k         330k           63
Turkish    292k         203k           69
Tamil      104k         61k            59
Tagalog    69k          53k            77

Table 10: Statistics showing the number of articles in the Wikipedias of 9 languages, and the size of the intersection (as identified by inter-language links) with the English Wikipedia. The last column shows the relative size of the intersection (in %). For instance, 52% of German articles have an English Wikipedia counterpart.

4.4. Cross-lingual ESA (CLESA)

How can one learn ESA representations for words outside the English vocabulary? The fact that one can resolve Wikipedia articles written in different languages to the same concept can help us define a cross-lingual extension of ESA representations, henceforth referred to as CLESA. CLESA appeals to the bag-of-words hypothesis across words in different languages — if a word in language l1 (e.g., क्रिकेटर in Hindi) and a word in language l2 (e.g., cricketer in English) describe similar concepts in their respective Wikipedias, then they are similar.3

3The Hindi word translates to cricketer.

                      English                                Spanish                       Hindi
entity \ word   president  executive  ···  administration   torre  grande  ···  politico   ओलंपिक  डेमोक्रैटिक  ···  हवाई
Barack_Obama        95         11     ···        22            1      1    ···     17         0        2        ···    2
United_States       28          3     ···         3            0     10    ···      0         0        0        ···    2
...
London               0          2     ···         3           10     12    ···      2         0        0        ···    2

Table 11: Concept-term matrix constructed using the English, Spanish and Hindi Wikipedias (compare to Table 9). From left to right, the words in Spanish and Hindi translate to tower, big, politician, olympic, democratic, hawaii respectively. The term counts are computed in the respective article in the Spanish or Hindi Wikipedia. CLESA representations are derived from this matrix by reweighting each entry using TF-IDF and normalizing the columns. Each column describes a word using a “bag-of-concepts” in Wikipedia.

For generating CLESA vectors for languages l1 and l2, with C1 and C2 articles respectively, the concept-term matrix has to be constructed appropriately. First, the concept space C′ is computed by identifying concepts that have articles in both l1’s Wikipedia and l2’s Wikipedia,

C′ = {C_1, ..., C_{|C′|}} = {..., A_i, ...} ∩ {..., B_j, ...}        (4.5)

Here the intersection is computed by identifying corresponding articles A_i ↔ B_j using the inter-language links, ignoring articles with no counterparts. Next, the vocabulary V′ is extended to languages other than English by computing it over the Wikipedias of the relevant languages. This way, a concept-term matrix of size |C′| × |V′| is constructed, where V′ = V_1 ∪ V_2 is the union of the vocabularies of the Wikipedias in the different languages, and C′ is the intersection of the concepts appearing in l1’s and l2’s Wikipedias. An example concept-term matrix for English, Spanish and Hindi is shown in Table 11. The CLESA representation of a word w ∈ V′ can be defined similarly to Equation (4.1),

Φ(w) ≜ [Φ(C_1, w)  ...  Φ(C_{|C′|}, w)] ∈ R^{|C′|}        (4.6)

The document and the label representation can be defined similarly.
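As a sketch of Equations (4.5)–(4.6), the shared concept space and the joint concept-term matrix can be assembled as follows; the article maps and the inter-language-link table are toy stand-ins for parsed Wikipedia dumps, and the TF-IDF reweighting is omitted for brevity.

```python
# CLESA sketch: restrict concepts to those with articles in BOTH Wikipedias
# (Equation 4.5), then build one matrix row per shared concept over the union
# vocabulary V' = V_en ∪ V_es. All data below is toy/illustrative.
from collections import Counter

en_articles = {"Barack_Obama": Counter(president=95, executive=11),
               "London":       Counter(executive=2, administration=3)}
es_articles = {"Barack_Obama": Counter(torre=1, politico=17),
               "Madrid":       Counter(torre=4, grande=2)}
# Inter-language links: English title -> Spanish title (identical here).
ill = {"Barack_Obama": "Barack_Obama"}

# C' = intersection of concepts, identified via the inter-language links.
concepts = [c for c in en_articles if c in ill and ill[c] in es_articles]

# The two vocabularies are essentially disjoint, so summing the per-language
# counts simply selects the count from whichever Wikipedia contains the word.
vocab = sorted(set().union(*en_articles.values(), *es_articles.values()))
matrix = [[en_articles[c][w] + es_articles[ill[c]][w] for w in vocab]
          for c in concepts]
print(concepts, vocab, matrix)
```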

Other Cross-lingual ESA Variants. Other cross-lingual extensions of ESA that are similar to ours have also been proposed. Potthast et al. (2008) and Sorg and Cimiano (2012) generated ESA-like representations for performing cross-lingual information retrieval. Independently of us, Søgaard et al. (2015) used an inverted index constructed from Wikipedia to represent words using Wikipedia concepts, and used these as features for performing downstream tasks like POS tagging, word alignment, document classification and dependency parsing. Later, Camacho-Collados et al. (2016) developed the NASARI representation using inter-language links in Wikipedia, with the goal of performing lexical semantic tasks like word sense clustering across languages. Though superficially different, the underlying idea in all these works is the same as ours.

4.5. Experimental Setup and Experiments

The aim of the experiments is to: (a) quantify the effectiveness of CLESA-based document classification in terms of the number of labeled examples needed to achieve similar performance, and (b) determine whether CLESA-based document classification is superior to monolingual ESA-based document classification in the target language.

Building CLESA. First, the relevant Wikipedias were tokenized using Lucene’s tokenizer4, and articles with <100 words or <5 inter-language links were removed. This step removes short articles and disambiguation pages from the respective Wikipedias. The remaining articles were used to construct the concept-term matrix for CLESA by creating an inverted index using Lucene. For classification, the label space is assumed to be provided in English. For a language L, the CLESA concept-term matrix is computed by considering the intersection of its Wikipedia articles with English, as described above.

4http://lucene.apache.org
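A sketch of the pruning step, under the assumption of a hypothetical Article record for a parsed dump (the real pipeline used Lucene for tokenization and indexing):

```python
# Pruning sketch: drop short articles and near-disambiguation pages before
# indexing. `Article` is a hypothetical stand-in for a parsed Wikipedia dump
# record; thresholds follow the text (<100 words, <5 inter-language links).
from dataclasses import dataclass

@dataclass
class Article:
    title: str
    tokens: list                 # tokenized article text
    interlanguage_links: list    # titles of linked articles in other languages

def keep(article, min_words=100, min_links=5):
    return (len(article.tokens) >= min_words
            and len(article.interlanguage_links) >= min_links)

def filtered(articles):
    return [a for a in articles if keep(a)]
```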

[Figure: macro-F1 bar chart comparing Full Supervision, 10% Supervision, 15% Supervision, and CLESA Dataless classification per language.]

Figure 6: Comparing CLESA-based document classification with supervised classification on the TED dataset (averaged macro-F1 scores over 15 labels) for 7 languages — English (en), Arabic (ar), German (de), Spanish (es), Dutch (nl), Turkish (tr) and Chinese (zh).

4.5.1. CLESA v/s Supervised Classification

This experiment compares CLESA-based document classification on the TED dataset5 with a fully supervised model, and with models trained using 10% and 15% of the supervision, sampled randomly over 10 different runs (the scores are averaged). The TED dataset contains ≈1200 labeled documents in total (1000 train + 200 test) per language, with a label space of 15 topic labels. An ℓ2-regularized linear support vector machine trained using bag-of-words features is used as the supervised method. Results are in Figure 6.

Figure 6 shows that CLESA-based dataless classification is comparable to supervised learning with 10% labeled data (100 examples), and a little worse than supervised learning with 15% labeled data (150 examples). This suggests that, using unsupervised CLESA-based document classification, one can get accuracy competitive with that of a supervised classifier trained on 100 examples, which is an encouraging result.

5http://www.clg.ox.ac.uk/tedcldc.html

                      accuracy@1          accuracy@3
range of metric p    Mono    CLESA       Mono    CLESA
1 ≥ p ≥ 0.9            2       20          5       28
0.9 > p ≥ 0.8          7        8          9        8
0.8 > p ≥ 0.7         10        7         16        8
0.7 > p ≥ 0           69       53         58       44

Table 12: Comparing monolingual ESA (Mono) and cross-lingual ESA (CLESA) for classification on translated documents from 20-Newsgroups. The table shows the number of languages which achieve accuracies in a certain range with a given ESA representation (Mono or CLESA). For instance, CLESA achieves accuracy@1 in the range of 70% to 80% (i.e., 0.8 > p ≥ 0.7) for 7 languages. It can be seen that CLESA achieves high accuracies (e.g., accuracy@1 in 1 ≥ p ≥ 0.9) for 20 languages, whereas the same is true for monolingual ESA for only 2 languages.

4.5.2. CLESA v/s Monolingual ESA

An alternative approach to performing cross-lingual document classification is to first translate the label space into the target language and then perform monolingual ESA-based classification. This experiment compares this approach with CLESA-based document classification.

The test set is constructed by translating a set of 100 English documents from the 20-Newsgroups dataset (Lang, 1995) and their label descriptions into language L (for all 88 languages supported by Google Translate). Then, monolingual ESA representations are constructed using language L’s Wikipedia, and used for dataless document classification. The results are compared to CLESA-based document classification in Table 12.

From Table 12, it can be seen that CLESA achieves high accuracies (1 ≥ p ≥ 0.9) for a larger number of languages than monolingual ESA-based classification (for instance, 20 compared to 2 for accuracy@1). This suggests that even though the number of concepts used for CLESA is smaller than for monolingual ESA, CLESA is successful for more languages. This result shows the benefit of sharing information across languages through a shared semantic space, rather than using a monolingual semantic space.

4.6. Summary

This chapter described an approach to learning cross-lingual word representations using multilingual encyclopedic resources like Wikipedia, by appealing to the bag-of-words hypothesis. Key to this was the inter-lingual structure of Wikipedia, arising from the information redundancy across Wikipedias in different languages, which allowed us to identify equivalent concepts in two or more languages. These representations can then be used in a ‘dataless’ classification approach for labeling documents in an unsupervised manner, by projecting the document and the labels into the same semantic space. Unlike the representations in the previous chapters, these cross-lingual representations are explicit and hence interpretable: the dimensions correspond to the concepts in Wikipedia to which the word is relevant.

Interesting extensions to ESA and CLESA remain open. For instance, hybrid approaches that combine the distributional information extracted from corpus co-occurrences with the explicit representation of ESA/CLESA could combine the benefits of both these paradigms.

A limitation of ESA- and CLESA-like representations is that all word co-occurrences in the same document are considered equal regardless of the distance between the words, something accounted for in distributional representations like SGNS and GloVe. Hybrid approaches can potentially alleviate this limitation. Extensions of CLESA can also be developed by, say, comparing concepts that behave similarly over time, as in Radinsky et al. (2011), across languages. Such meta-data is available in the form of article revision histories in Wikipedia.

Other structural properties of Wikipedia, like article link structure or category information, can also be incorporated to learn better ESA/CLESA representations (Scholl et al., 2010).

CHAPTER 5 : Multi-sense Representation Learning with Cross-lingual Signals

5.1. Introduction

Traditional distributional word representations encode the meaning of a word in a single vector. This does not allow a word to have different meanings (or senses) in different contexts, and thus these representations are unable to reflect polysemy. For instance, the meaning of the word screen is different in a medical context and an electronics context, something a single vector cannot adequately capture. Several algorithms that rely on the “one sense per collocation” hypothesis of Yarowsky (1993, 1995) have been developed to learn multi-sense representations, which allow a single word to have multiple representations, one for each sense. However, this hypothesis provides only a weak signal for sense identification, and such algorithms require large amounts of training data to learn good sense representations.

In this chapter, I describe an approach to learning multi-sense word representations using cross-lingual translational information and non-parametric sense modeling. Experiments show that, after training on a small multilingual corpus, this approach achieves performance competitive with a model trained monolingually on a much larger corpus.

Why do we expect incorporating cross-lingual translational information to aid in learning sense representations? The intuition behind this is a well-established fact in the literature: different senses of the same word may be translated into different words in another language (Dagan and Itai, 1994; Resnik and Yarowsky, 1999; Diab and Resnik, 2002; Ng et al., 2003). For example, bank in English may be translated to banc or banque in French, depending on whether the sense is financial or geographical. Such bilingual translational information indirectly allows the model to identify which sense of a word is being used. Indeed, much recent work has learned multi-sense embeddings using cross-lingual signals (Bansal et al., 2012; Guo et al., 2014; Kawakami and Dyer, 2015; Šuster et al., 2016). I build upon this line of work, and discuss why using more than two languages to derive the translational information improves on this idea.

5.2. Related Work — Representing Word Senses

I first review different approaches to learning sense-specific word representations, and place our work in their context. These approaches can be broadly categorized into two categories

— approaches that utilize lexical resources, and approaches that are purely data-driven.1

5.2.1. Sense Representations using Lexical Resources

Over the past few decades, many efforts have been made to organize lexical semantic knowledge in resources like WordNet (Miller, 1995), FrameNet (Baker et al., 1998), BabelNet (Navigli and Ponzetto, 2012), etc., which also list word senses. Naturally, many representation learning approaches have attempted to combine distributional information extracted from raw corpora with the information available in these manually created lexical resources. I describe some of the recent work in this direction below, and discuss the limitations of relying on such lexical resources.

Incorporating knowledge expressed in lexical resources into word representations is not a new idea. Early work in this direction was done by Mohammad et al. (2008) and Yih et al. (2012), who combined relations expressed in a thesaurus (Roget, 1852) with distributional representations for differentiating synonyms from antonyms. A more recent line of work that uses lexical resources is that of retrofitting (Faruqui et al., 2015a; Jauhar et al., 2015), which infuses lexical semantic resources like WordNet, FrameNet, PPDB (Ganitkevitch et al., 2013), and ConceptNet (Speer et al., 2017) into embeddings during a post-processing step. Retrofitting takes as input vector-based word representations and a graph expressing semantic relations between words, and adjusts the word representations such that they remain close to their original representations and close to their “neighbors” in the graph.

The most popularly used lexical semantic resource among the ones listed above is WordNet. WordNet (Miller, 1995) is a relation graph describing relationships between nodes denoting synsets, where each synset is a set of sense-specific word lemmas that are synonymous. A polysemous word lemma belongs to multiple synsets, one for each sense. For instance, the lemma good belongs to the synsets good.a.01 and commodity.n.01, one for its sense as an adjective (beneficial), and one for its sense as a noun (commercial object). Relations such as hypernymy, holonymy, etc., are encoded as edges between the synset nodes.

1Also referred to as knowledge-based and unsupervised approaches (Camacho-Collados and Pilehvar, 2018).

[Figure: screenshot of the WordNet 3.1 search results for the adjective wicked, listing its senses.]

Figure 7: Senses listed in WordNet for the word wicked when used as an adjective. None of the senses listed reflect the sense of wicked in the sentence, “Tony Hawk did a wicked ollie at the convention”, where wicked is used as an adjective denoting awesome-ness.

Every lemma belonging to a synset is also associated with a sense frequency, denoting how many times that sense of the lemma was used in a sense-tagged corpus. The sense frequencies in WordNet were estimated using the 220k-word SemCor corpus (Miller et al., 1993) that was manually tagged with WordNet senses.

Limitation of Lexical Resources. While resources like WordNet are available for many languages (Bond and Foster, 2013), they are not exhaustive in listing all possible senses of the words therein. For instance, Figure 7 shows the senses for wicked listed in the English WordNet, none of which describe the sense of wicked in “Tony Hawk did a wicked ollie at the convention”.2 Indeed, the number of senses of a word is highly dependent on the task and cannot be pre-determined using a lexicon, as argued by Kilgarriff (1997). For instance, the sense frequency estimates are hardly representative of all polysemous words in WordNet — Bennett et al. (2016) found that approximately 61% of the polysemous lemmas have no sense annotations in WordNet 3.0, with only 20% having at least 5 sense annotations. Similarly, some of these resources are automatically created (like BabelNet), and thus may contain incorrect or noisy sense annotations. It is clear that all such lexical resources are incomplete in some form (either the sense is not listed, or the word itself is not present), and are likely to remain so, as language is ever-changing.

2Tony Hawk is a professional skateboarder. An ollie is a skateboarding trick.

Ideally, word senses should be inferred in a data-driven manner, so that new senses not listed in such lexicons can be discovered. This necessitates developing data-driven approaches to identify the different senses of a word.

5.2.2. Data-driven Sense Representations

Data-driven approaches induce word senses in an unsupervised manner from un-annotated corpora, and hold the appeal that they can discover new senses of a word that are not listed in any lexical resource. However, learning sense representations in a data-driven manner is complicated by the fact that no large sense-annotated corpora exist. As a result, data-driven approaches work by appealing to the one-sense-per-collocation hypothesis of Yarowsky (1993, 1995). I categorize data-driven approaches for inducing multi-sense embeddings by the learning paradigms they adopt — two-staged approaches and joint learning approaches.

Two-staged approaches. Two-staged approaches first compute context representations from a raw corpus, which are then used to derive sense representations. Popular examples are (Reisinger and Mooney, 2010; Huang et al., 2012), who induce multi-sense embeddings by first clustering the contexts in which a given word appears, and then using the clustering to obtain the sense vectors for that word. The contexts can be topics induced using latent topic models (Liu et al., 2015a,b), Wikipedia (Wu and Giles, 2015), or coarse part-of-speech tags (Qiu et al., 2014).

A few two-staged approaches also utilize the cross-lingual translational information discussed earlier (Bansal et al., 2012; Guo et al., 2014). For instance, Guo et al. (2014) first sense-tag and project sense annotations using bilingual parallel data, and then learn multi-sense representations using a standard neural language modeling objective. This approach ignores any monolingual distributional signal to differentiate senses, and may suffer when the bilingual signal is not sufficient (as I shortly discuss in Section 5.2.3).

Joint Learning Approaches. In contrast to two-staged approaches, some approaches (Neelakantan et al., 2014; Li and Jurafsky, 2015) jointly learn the sense clusters and sense-specific embeddings by maximizing an SGNS-like learning objective (Section 2.9), using Bayesian non-parametrics. Neelakantan et al. (2014) proposed two models to learn a fixed or dynamic number of sense vectors for each word using a modified version of the skip-gram objective. First, the sense representation of the current word is predicted as the vector that is closest to its averaged context representation. This vector is then used for the gradient update. In their dynamic version, Neelakantan et al. (2014) increase the number of sense vectors when the similarity of the current context representation with each of the existing sense vectors is less than a hyper-parameter λ. Li and Jurafsky (2015) build on (Neelakantan et al., 2014) in a more principled manner, using a Chinese Restaurant Process (CRP) to decide if the current appearance of a word is an old sense or a new one (sit at an existing “table” or a new “table” in the CRP).

Our Approach. Our approach belongs to the joint learning category. Specifically, we learn multi-sense representations using a multilingual variant of the skip-gram model that dynamically adds new sense vectors for a word using Bayesian non-parametrics. The closest non-parametric approach to ours is that of Bartunov et al. (2016), who proposed a multi-sense variant of the skip-gram model that learns a different number of sense vectors for each word from a large monolingual corpus (in their case, the English Wikipedia). Our work can be viewed as a multi-view extension of their model, leveraging both monolingual and cross-lingual distributional signals as the views of the data. In our experiments, we compare our model to a monolingually trained version of their model.

5.2.3. Motivation for Our Work

As stated in the introduction, the motivation behind incorporating cross-lingual translational information to aid in learning sense representations was that different senses of the same word may be translated into different words in another language. Many works have used this fact by exploiting bilingual parallel corpora to derive the signal (Bansal et al., 2012; Guo et al., 2014; Kawakami and Dyer, 2015; Šuster et al., 2016).

However, bilingual translational signals often do not suffice, as it is possible that the polysemy of a word survives translation. Figure 8 shows an illustrative example — both senses of interest get translated to intérêt in French. However, this becomes much less likely as the number of languages under consideration grows. For instance, by considering the Chinese translation in Figure 8, it can be seen that these senses translate to different surface forms in Chinese. Note that the opposite can also happen (i.e., the same surface forms in Chinese, but different ones in French). Inspired by this, this chapter proposes a model that can use multiple languages to indirectly identify the sense of a word and learn multi-sense representations.

[Figure: the sentence pairs “I got high [interest] on my savings from the bank.” / “Je suis un grand [intérêt] sur mes économies de la banque.” and “My [interest] lies in History.” / “Mon [intérêt] réside dans l’Histoire.”, with their Chinese translations.]

Figure 8: Benefit of multilingual information (beyond bilingual): two different senses of the word interest and their translations into French and Chinese (word translations shown in [bold]). While the surface forms of both senses of interest are the same in French, they are different in Chinese, illustrating the benefit of having more than two languages.

5.3. Model Description

Let us define some notation before describing the model. Let E = {x^e_1, ..., x^e_i, ..., x^e_{N_e}} denote the words of the English side and F = {x^f_1, ..., x^f_i, ..., x^f_{N_f}} denote the words of the non-English side of the parallel corpus. We assume that we have word alignments A_{e→f} and A_{f→e} mapping words in an English sentence to their translations in the non-English sentence (and vice versa), so that x^e and x^f are aligned if A_{e→f}(x^e) = x^f.

We define Neighbors(x, L, d) as the neighborhood in language L of size d (on either side) around word x in its sentence. The English and non-English neighboring words are denoted by y^e and y^f, respectively. Note that y^e and y^f need not be translations of each other. Each word x^f in the non-English vocabulary is associated with a dense vector x^f ∈ R^m, and each word x^e in the English vocabulary admits at most T sense vectors, with the kth sense vector denoted as x^e_k. A context vector is maintained for each word in the English and non-English vocabularies; the context vector is used as the representation of the word when it appears as the context for another word. As our main goal is to model multiple senses for words in English, we do not model polysemy in the non-English language and use a single vector to represent each word in the non-English vocabulary.

The joint conditional distribution of the context words y^e, y^f given an English word x^e and its corresponding translation x^f in the parallel corpus is defined as,

Pr(y^e, y^f | x^e, x^f ; α, θ),        (5.1)

where θ are the model parameters, i.e., all sense and context representations, and the hyper-parameter α governs the prior on latent senses described below.

By indexing the senses of x^e by the random variable z, Equation (5.1) can be rewritten as,

∫_β Σ_z Pr(y^e, y^f, z, β | x^e, x^f, α; θ) dβ

where β are the parameters determining the model probability of each sense for x^e (i.e., the weight on each possible value of z). We place a Dirichlet process (Ferguson, 1973) prior on the sense assignment for each word. Thus, adding the word-x subscript to emphasize that these are word-specific senses,

Pr(z_x = k | β_x) = β_{xk} ∏_{r=1}^{k−1} (1 − β_{xr})        (5.2)

β_{xk} | α ∼ Beta(β_{xk} | 1, α)  independently, for k = 1, 2, ...        (5.3)

That is, each of the potentially infinite senses of word x has its probability determined by a sequence of independent stick-breaking weights β_{xk}, from the constructive definition of the Dirichlet process (Sethuraman, 1994). The hyper-parameter α (or prior concentration) provides information on the number of senses we expect to observe in our corpus.
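For intuition, the following numpy sketch draws the stick-breaking weights of Equations (5.2)–(5.3), truncated at T senses for illustration; with a small α, most of the probability mass falls on the first few senses.

```python
# Stick-breaking construction of the Dirichlet-process prior on senses
# (Equations 5.2-5.3), truncated at T senses for illustration only.
import numpy as np

def sense_prior(alpha, T, rng):
    beta = rng.beta(1.0, alpha, size=T)   # beta_xk ~ Beta(1, alpha)
    remaining = np.cumprod(1.0 - beta)    # prod_{r<=k} (1 - beta_xr)
    probs = beta.copy()
    probs[1:] *= remaining[:-1]           # Pr(z=k) = beta_k * prod_{r<k}(1-beta_r)
    return probs

rng = np.random.default_rng(0)
print(sense_prior(alpha=0.1, T=5, rng=rng))  # small alpha -> few dominant senses
```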

After conditioning upon word sense, the context probability decomposes as,

Pr(y^e, y^f | z, x^e, x^f ; θ) = Pr(y^e | x^e, x^f, z; θ) · Pr(y^f | x^e, x^f, z; θ).        (5.4)

Both the first and the second terms are sense-dependent, and each factors as,

Pr(y | x^e, x^f, z = k; θ) ∝ Ψ(x^e, z = k, y) · Ψ(x^f, y)        (5.5)

where Ψ(x^e, z = k, y) = exp(y^⊤ x^e_k) and Ψ(x^f, y) = exp(y^⊤ x^f)        (5.6)

where x^e_k is the representation corresponding to the kth sense of the word x^e, and y is the representation of either y^e or y^f. The factor Ψ(x^e, z = k, y) uses the corresponding sense vector in a skip-gram-like formulation. This results in a total of 4 factors,

Pr(y^e, y^f | z, x^e, x^f ; θ) ∝ Ψ(x^e, z, y^e) · Ψ(x^f, y^f) · Ψ(x^e, z, y^f) · Ψ(x^f, y^e)        (5.7)

Figure 9 illustrates each factor. This approach is reminiscent of the BiSkip model of Luong et al. (2015b) from Section 3.4.1. BiSkip jointly learnt representations for two languages l1 and l2 by optimizing an objective containing 4 skip-gram terms for the aligned pair (x^e, x^f) — two predicting monolingual contexts (l1 → l1, l2 → l2), and two predicting cross-lingual contexts (l1 → l2, l2 → l1).
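A sketch of the four factors of Equation (5.7) for a single aligned pair; the vectors below are random stand-ins for the learned sense, word, and context representations.

```python
# The four skip-gram factors of Equation (5.7) for one aligned pair (x^e, x^f):
# the sense vector of x^e and the vector of x^f each predict both the English
# and the non-English context word. Vectors are random stand-ins.
import numpy as np

rng = np.random.default_rng(0)
m = 100
xe_k = rng.standard_normal(m)  # k-th sense vector of the English word (interest)
xf   = rng.standard_normal(m)  # vector of the aligned word (intérêt)
ye   = rng.standard_normal(m)  # context vector of an English neighbour (savings)
yf   = rng.standard_normal(m)  # context vector of a French neighbour (banque)

def psi(x, y):
    return np.exp(y @ x)       # Equation (5.6)

score = psi(xe_k, ye) * psi(xf, yf) * psi(xe_k, yf) * psi(xf, ye)
log_score = (ye @ xe_k) + (yf @ xf) + (yf @ xe_k) + (ye @ xf)  # log of the above
print(score, log_score)
```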

[Figure: the sentence pair “The bank paid me [interest] on my savings.” / “la banque m’a payé des [intérêts] sur mes économies.”, with arrows marking the four factors (interest, 2, savings), (interest, 2, banque), (intérêt, savings), and (intérêt, banque).]

Figure 9: The aligned pair (interest, intérêt) is used to predict monolingual and cross-lingual context in both languages (see the factors in Equation (5.7)). Each sense vector (here the 2nd is shown) for interest participates in the update. Only the polysemy in English is modelled.

5.4. Learning and Disambiguation

Learning. Learning involves maximizing the log-likelihood,

Pr(y^e, y^f | x^e, x^f ; α, θ) = ∫_β Σ_z Pr(y^e, y^f, z, β | x^e, x^f, α; θ) dβ        (5.8)

for which a variational approximation is used. Let q(z, β) = q(z) · q(β), where

q(z) = ∏_i q(z_i),   q(β) = ∏_{w=1}^{V} ∏_{k=1}^{T} q(β_{wk})        (5.9)

are the fully factorized variational approximations of the true posterior in Equation 5.8, where V is the size of the English vocabulary and T is the maximum number of senses for any word. The optimization problem solves for θ, q(z) and q(β) using the stochastic variational inference technique (Hoffman et al., 2013), similar to Bartunov et al. (2016).

The resulting learning algorithm is shown as Algorithm 2.

Algorithm 2 Pseudocode of the Learning Algorithm
Input: Parallel corpus E = {x^e_1, ..., x^e_{N_e}} and F = {x^f_1, ..., x^f_{N_f}};
       word alignments A_{e→f} and A_{f→e}
Hyper-parameters: prior concentration α, maximum number of allowed senses T, window sizes d, d′
Output: θ, q(β), q(z)
 1: for i = 1 to N_e do                                  ▷ Update English vectors.
 2:     w ← x^e_i
 3:     for k = 1 to T do
 4:         z_{ik} ← E_{q(β_w)}[log Pr(z_i = k | x^e_i)]
 5:     end for
 6:     y_c ← Neighbors(x^e_i, E, d) ∪ Neighbors(x^f_i, F, d′) ∪ {x^f_i}, where x^f_i = A_{e→f}(x^e_i)
 7:     for y in y_c do
 8:         sense-update(x^e_i, y, z_i)
 9:     end for
10:     Re-normalize z_i using softmax
11:     Update sufficient statistics for q(β) as in Bartunov et al. (2016)
12:     Update θ using Equation (5.10)
13: end for
14: for i = 1 to N_f do                                  ▷ Jointly update non-English vectors.
15:     y_c ← Neighbors(x^f_i, F, d) ∪ Neighbors(x^e_i, E, d′) ∪ {x^e_i}, where x^e_i = A_{f→e}(x^f_i)
16:     for y in y_c do
17:         skip-gram-update(x^f_i, y)
18:     end for
19: end for
20: procedure sense-update(x_i, y, z_i)
21:     z_{ik} ← z_{ik} + log Pr(y | x_i, k, θ)
22: end procedure

The first for-loop (line 1) updates the English sense vectors using the cross-lingual and monolingual contexts. First, the expected sense distribution for the current English word w is computed using the current estimate of q(β) (line 4). The sense distribution is updated (line 8) using the combined monolingual and cross-lingual contexts (line 6) and re-normalized (line 10). Using the updated sense distribution, the sufficient statistics of q(β) are re-computed (line 11) and the global parameter θ is updated (line 12) as follows,

θ ← θ + ρ_t ∇_θ Σ_{k : z_{ik} > ϵ} Σ_{y ∈ y_c} z_{ik} log Pr(y | x_i, k, θ)        (5.10)

Note that in the above sum, a sense participates in an update only if its probability exceeds a threshold ϵ (= 0.001). The final model retains only the sense vectors whose sense probability exceeds the same threshold. The last for-loop (line 14) jointly optimizes the non-English representations using the English context with standard skip-gram updates.
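A sketch of the thresholded update of Equation (5.10); grad_log_prob is a hypothetical stand-in for the gradient of log Pr(y | x_i, k, θ), and the learning rate ρ_t and threshold follow the text.

```python
# Thresholded global update (Equation 5.10): a sense contributes gradient only
# if its responsibility z_ik exceeds eps. `grad_log_prob` is a hypothetical
# stand-in for the gradient of log Pr(y | x_i, k, theta) w.r.t. theta.
EPS = 0.001

def update_theta(theta, z_i, contexts, rho_t, grad_log_prob):
    for k, z_ik in enumerate(z_i):
        if z_ik <= EPS:
            continue                  # low-probability senses are skipped
        for y in contexts:
            theta = theta + rho_t * z_ik * grad_log_prob(y, k, theta)
    return theta
```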

Disambiguation. Similar to Bartunov et al. (2016), the sense of the word x^e can be disambiguated given its monolingual context y^e as follows,

Pr(z | x^e, y^e) ∝ Pr(y^e | x^e, z; θ) Σ_β Pr(z | x^e, β) q(β)        (5.11)

Although the model trains representations using both monolingual and cross-lingual context, at test time only monolingual context is available. However, it was observed that, so long as the model is trained with multilingual context, it performs well at sense disambiguation on the test data. A similar observation was made by Šuster et al. (2016).
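A sketch of the test-time disambiguation rule (Equation 5.11), scoring each retained sense of a word against its monolingual context; the vectors and the prior are random stand-ins, and the softmax normalization over the vocabulary in the likelihood is omitted.

```python
# Sense disambiguation at test time (Equation 5.11): score each sense by its
# prior weight times the (unnormalized) likelihood of the context words.
import numpy as np

def disambiguate(sense_vecs, prior, context_vecs):
    # log-likelihood of the context under each sense, plus the log prior
    log_scores = np.log(prior) + np.array(
        [sum(y @ s for y in context_vecs) for s in sense_vecs])
    probs = np.exp(log_scores - log_scores.max())
    return probs / probs.sum()

rng = np.random.default_rng(0)
senses = [rng.standard_normal(100) for _ in range(3)]   # retained sense vectors
prior = np.array([0.6, 0.3, 0.1])                       # expected sense weights
context = [rng.standard_normal(100) for _ in range(4)]  # four context words
print(disambiguate(senses, prior, context))             # posterior over senses
```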

5.5. Multilingual Extension

As discussed in Section 5.2.3, the bilingual distributional signal alone may not be sufficient, as polysemy may survive translation into the second language. The model described above can easily be modified to incorporate distributional information from more than one language. To use languages l1 and l2 to learn multi-sense embeddings for English, we train on the concatenation of an English–l1 parallel corpus and an English–l2 parallel corpus. This technique easily generalizes to more than two non-English languages, to learn representations using a large multilingual corpus.

Value of Ψ(y^e, x^f). The factor modeling the dependence of the English context word y^e on the non-English word x^f is crucial to performance when using multiple languages. Consider the case of using French and Spanish contexts to disambiguate the financial sense of the English word bank. In this case, the (financial) sense vector of bank will be used to predict the vector of banco (Spanish context) and banque (French context). If the vectors for banco and banque do not reside in the same space or are not “close”, the model will incorrectly assume they are different contexts and introduce a new sense for bank. This is precisely why bilingual models, like that of Šuster et al. (2016), cannot be extended to the multilingual setting, as they pre-train the embeddings of the second language before running the multi-sense embedding process. As a result of this naive pre-training, the French and Spanish vectors of semantically similar pairs like (banco, banque) will lie in different spaces and need not be close. A similar reasoning applies to the model of Guo et al. (2014).

To avoid this, the vectors for pairs like banco and banque should lie in the same space, close to each other and to the sense vector for bank. The Ψ(y^e, x^f) term attempts to ensure this by using the vectors for banco and banque to predict the vector of bank. This way, the model brings the embedding spaces for Spanish and French closer by using English as a bridge language during joint training. A similar idea of using English as a bridging language was used in the models proposed in (Hermann and Blunsom, 2014b) and (Coulmance et al., 2015). Besides the benefit in the multilingual case, the Ψ(y^e, x^f) term improves performance in the bilingual case as well, as it forces the English and second-language embeddings to remain close in space.

To show the value of the Ψ(y^e, x^f) factor, a variant of Algorithm 2 is trained without the Ψ(y^e, x^f) factor in our experiments, using only the monolingual neighborhood Neighbors(x^f_i, F, d) in line 15 of Algorithm 2. This variant is referred to as the One-Sided model, and the model in Algorithm 2 as the Full model.

5.6. Experimental Setup

We first describe the datasets and the preprocessing methods used to prepare them. We also describe the Word Sense Induction (WSI) task used in the experiments.

Parallel Corpora. The experiments use parallel corpora in English (en), French (fr), Spanish (es), Russian (ru) and Chinese (zh) (corpus statistics are in Table 13). The first 10M lines from the English–French Giga corpus (Callison-Burch et al., 2011) are used for en–fr. The FBIS parallel corpus (LDC2003E14) is used for en–zh. As the domain from which a parallel corpus has been derived can affect the final result, it is necessary to control for domain in all parallel corpora when analyzing which choice of languages provides a suitable disambiguation signal. To this end, we also used the en–fr, en–es, en–zh and en–ru sections of the MultiUN parallel corpus (Eisele and Chen, 2010). Word alignments were generated using the fast_align tool (Dyer et al., 2013) in the symmetric intersection mode. Tokenization was performed using the cdec toolkit.3 The Stanford Segmenter (Tseng et al., 2005) was used to preprocess the Chinese corpora.

Corpus                     Source      Lines (M)   English Words (M)
English–French (en–fr)     EU proc.    ≈ 10        250
English–Chinese (en–zh)    FBIS news   ≈ 9.5       286
English–Spanish (en–es)    UN proc.    ≈ 10        270
English–French (en–fr)     UN proc.    ≈ 10        260
English–Chinese (en–zh)    UN proc.    ≈ 8         230
English–Russian (en–ru)    UN proc.    ≈ 10        270

Table 13: Corpus statistics (in millions) of the parallel corpora used to train the multi-sense representations. Horizontal lines demarcate corpora from the same domain.

Word Sense Induction (WSI). We evaluate our approach on the word sense induction task. In this task, the input consists of several sentences showing usages of the same word, and the required output is a clustering of all sentences that use the same sense of the given word (Nasiruddin, 2013). The predicted clustering is then compared against a reference gold clustering. Note that WSI is a harder task than Word Sense Disambiguation (WSD) (Navigli, 2009), as unlike WSD, WSI does not involve any supervision or explicit human knowledge about the senses of words. The disambiguation approach in Equation 5.11 is used to predict the sense given the target word and four context words.

To allow for a fair comparison with earlier work, the same benchmark datasets as Bartunov et al. (2016) are used — SemEval-2007, SemEval-2010 and Wikipedia Word Sense Induction (WWSI). We report the Adjusted Rand Index (ARI) (Hubert and Arabie, 1985) in the experiments, as ARI is a stricter and more precise metric than F-score and V-measure. Details on ARI can be found in Appendix A.1.
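As a sketch, ARI between a gold and a predicted clustering of sentences can be computed with scikit-learn; the labels below are illustrative.

```python
# Adjusted Rand Index between a gold and a predicted sense clustering of
# sentences; cluster ids are arbitrary, only the grouping matters.
from sklearn.metrics import adjusted_rand_score

gold = [0, 0, 1, 1, 2]   # gold sense of each sentence using the target word
pred = [1, 1, 0, 0, 0]   # predicted sense clusters
print(adjusted_rand_score(gold, pred))  # 1.0 = identical, ~0 = chance level
```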

Parameter Tuning. Five context words on either side are used to update each English word vector in all the experiments. In the monolingual setting, all five words are English; in the multilingual settings, we used four neighboring English words plus the one foreign word aligned to the word being updated (d = 4, d′ = 0 in Algorithm 2). The effect of varying d′, the context window size in the foreign sentence, is also analyzed.

3github.com/redpony/cdec

The parameters α (prior concentration) and T (maximum number of allowed senses per word) were tuned by maximizing the log-likelihood of a held-out English text.4 The parameters were chosen from the following values: α ∈ {0.05, 0.1, ..., 0.25}, T ∈ {5, 10, ..., 30}. All models were trained for 10 iterations with a decaying learning rate of 0.025, decayed to 0. Unless otherwise stated, all embeddings are 100-dimensional.

4The first 100k lines from the en–fr Europarl corpus (Koehn, 2005).

5.7. Experiments

The experiments in this section evaluate the benefit of leveraging bilingual and multilingual information during training. The experiments also analyze how the choice of language (i.e., using closer or farther languages) used in cross-lingual training affect the performance of the embeddings.

5.7.1. Word Sense Induction (WSI) Results

The results for WSI are shown in Table 14. Recall that the One-Sided model is the variant of Algorithm 2 without the Ψ(y^e, x^f) factor. Mono refers to the AdaGram model of Bartunov et al. (2016) trained on the English side of the parallel corpus. In all cases, the Mono model is outperformed by the One-Sided and Full models, showing the benefit of using the cross-lingual signal in training. The best performance is attained by the multilingual model English–(French,Chinese), showing the value of the multilingual signal. The value of the Ψ(y^e, x^f) term is also verified by the fact that the One-Sided model performs worse than the Full model.

The Full model can also be compared5 to the AdaGram model described in Bartunov et al. (2016). AdaGram achieved ARI scores of 0.069, 0.097 and 0.286 on the three datasets respectively, after training 300-dimensional embeddings on English Wikipedia (≈ 100M lines). Note that, as WWSI was derived from Wikipedia, training on Wikipedia gives the AdaGram model an undue advantage, resulting in a high ARI score on WWSI. In comparison, our model did not train on English Wikipedia, and uses 100-dimensional embeddings. Nevertheless, even in this unfair comparison, it is noteworthy that on S-2007 and S-2010 the model achieves, with multilingual training, performance (0.067 and 0.094) comparable to a model trained on almost 5 times more data using higher (300) dimensional embeddings.

5This comparison is unfair to the Full model, which uses a smaller corpus than AdaGram.

Model         S-2007   S-2010   WWSI    avg. ARI
English–French
Mono          .044     .064     .112    .073
One-Sided     .054     .074     .116    .081
Full          .055     .086     .105    .082
English–Chinese
Mono          .054     .074     .073    .067
One-Sided     .059     .084     .078    .074
Full          .055     .090     .079    .075
English–(French,Chinese)
Mono          .056     .086     .103    .082
One-Sided     .067     .085     .113    .088
Full          .065     .094     .120    .093

Table 14: Results on word sense induction (in ARI) on the three WSI datasets, along with the average. Language pairs are separated by horizontal lines. Best results in bold.

5.7.2. Effect of Language Family Distance

Intuitively, the choice of language used in cross-lingual training can affect the results, as some languages may provide better disambiguation signals than others. This experiment systematically evaluates the impact of choosing languages from a closer family (e.g., French, Spanish) or a farther family (e.g., Russian, Chinese) for cross-lingual training along with English. To control for domain, the MultiUN corpus is used for all languages.6 To evaluate the effect of using closer languages, we pair English with French and Spanish. Similarly, English is paired with Russian and Chinese to evaluate the effect of using farther languages. The results for each of these combinations on all datasets are in Table 15.

6Šuster et al. (2016) compared different languages but did not control for domain.

                   S-2007               S-2010               WWSI                 Avg. ARI
Model          en–fr,es  en–ru,zh   en–fr,es  en–ru,zh   en–fr,es  en–ru,zh   en–fr,es  en–ru,zh
(1) Mono        .035      .033       .046      .049       .054      .049       .045      .044
(2) One-Sided   .044      .044       .055      .063       .062      .057       .054      .055
(3) Full        .046      .040       .056      .070       .068      .069       .057      .059
(3) − (1)       .011      .007       .010      .021       .014      .020       .012      .015

Table 15: Effect (in ARI) of language family distance on the WSI task. Best results for each column are shown in bold. The improvement from Mono to Full is shown as (3) − (1). Note that these numbers are not comparable to the results in Table 14, as a different training corpus is used to control for the domain.

From Table 15, it can be seen that using farther languages yields a slightly higher improvement on average than using closer languages, suggesting that using languages from a farther family aids disambiguation more. These findings echo those of Resnik and Yarowsky (1999), who found that the tendency to lexicalize senses of an English word differently in a second language correlated with language distance.

To illustrate the effect of multilingual signals on representation learning, a qualitative analysis of the learnt embeddings is also conducted in Appendix A.3.

5.7.3. Effect of Window Size

Figure 10 shows the effect of increasing the cross-lingual window d′ on the average ARI for the English–French and English–Chinese models. Increasing d′ improves the average score for the English–Chinese model, while the opposite is true for the English–French model. This suggests that it might be beneficial to have a separate d′ per language. This also aligns with the earlier observation that different language families have different suitability for optimal performance (a bigger cross-lingual context from a distant family helped).

69 ·10−2 8.5

8

7.5 avg. ARI

7 en-fr en-zh 0 1 2 3 4 window

Figure 10: Tuning window size for English-Chinese and English-French models.

5.7.4. Qualitative Analysis of Induced Senses

The fraction of polysemous words in the vocabulary is a function of the prior concentration parameter α, the maximum number of allowed senses per word T, and the training paradigm (monolingual or multilingual). For different choices of α and T, about 10–20% of the words in the vocabulary were identified as polysemous under monolingual training, while the same number was about 20–25% under multilingual training. On closer examination, it was found that the senses for a known polysemous word are often over-segmented. For instance, consider the senses detected for the word apple in Table 16. On examining the nearest neighbors of senses apple1 and apple3, it is evident that these actually represent the same sense of the word (i.e., apple the technology company). Ideally, the model should have assigned a single sense representation for this sense instead of two. A similar observation can be made for the senses discovered for plant and bank. These extraneous senses (also called pseudo-senses) can also appear for a word which is used in only one sense in the corpus (i.e., a word which should ideally be monosemous). For instance, the three senses detected for the word pope are all related to concepts in the Catholic Church. These pseudo-senses are generated because the same sense of a word may often appear in sufficiently different contexts that the model decides to allocate a new sense representation for the word. Note that the presence of pseudo-senses is not unique to our model, but a limitation of all data-driven approaches for sense discovery. Indeed, the over-segmentation problem also affects models which learn a fixed number of senses per word, as documented in (Shi et al., 2016).

sense      nearest neighbours
apple1     netscape, vmware, nintendo, tivo, mandriva
apple2     peach, plum, pear, strawberry, banana
apple3     macintosh, smartphones, microsoft, pc, ibm
plant1     medicinal, legume, cultivated, insects, vascular
plant2     fern, tuber, shrub, fertilize, conifer
plant3     fungus, genus, flowering, rosaceae, asteracea
plant4     power, nuclear, reactor, fueled, reprocessing
plant5     manufacturing, fabrication, bottling, steelmaking, compressor
monitor1   observer, un, deploy, mission, peacekeeper
monitor2   propublica, watchdog, politbarometer, internews
monitor3   lcd, display, handheld, infrared, camera
bank1      hsbc, citibank, lender, insurance, macquarie
bank2      loan, deposit, depositor, thrift, asset
bank3      bethlehem, jericho, ramallah, west, kalkilya
bank4      edge, side, opposite, along, shore
pope1      vatican, benedict, sainthood, papal, pontiff, homily
pope2      pius, excommunicate, gregory, antipope, calixtus
pope3      papacy, beatify, pontificate, xxiii, beatification, frail

Table 16: Senses discovered for some words under multilingual training, and their nearest neighbors in the vector space.

Shi et al. (2016) suggested a post-processing step in which such pseudo-senses are detected and eliminated, while preserving the spatial orientation with respect to the rest of the words in the vocabulary. However, this approach only addresses the problem post hoc, instead of preventing over-segmentation of senses during training itself, which is currently an active area of research.

5.8. Summary

In this chapter, I described a multi-sense representation learning approach that exploits translational information from two or more languages, in addition to monolingual distri- butional information, to dynamically learn multiple representations per word in English.

The experiments showed that using the additional multilingual signal yielded performance comparable to a monolingual model that was trained on five times more data. Unlike other chapters, the aim of this chapter was to show that cross-lingual signals can also aid in learning better representations for a monolingual lexical semantic task like sense induction.

One limitation of the sense-specific representation learning approaches is that the sense vectors learnt are static. That is, once trained, the sense vector(s) for a word remain the same, regardless of the context in which it appears. The act of compiling a sense vector for a word divorces it from the contexts that defined (and warranted) its representation. This goes against one of the remarks made by Firth (1935) — the complete meaning of a word is always contextual, and no study of meaning apart from a complete context can be taken seriously. Indeed, this criticism also applies to the single-sense word representations discussed in earlier chapters. Ideally, the representation of a word (or one of its senses) should not be static, but should change dynamically as a function of its context. To truly capture this intuition, dynamic, context-sensitive word representations that generate the representation of a word conditioned on its context have been developed recently (Peters et al., 2018; Devlin et al., 2018). Nevertheless, enumerating possible meanings in sense inventories does hold practical value, evident from the success of resources like WordNet and FrameNet in NLP applications. Consequently, data-driven sense representation approaches like ours remain attractive as a tool to aid lexicographers in estimating how many different senses a word assumes in a large corpus, or in clustering appearances of a word that resemble the same sense, easing the burden of manual annotation.

CHAPTER 6 : Identifying Hypernymy across Languages

6.1. Introduction

Previous chapters showed the effectiveness of cross-lingual representations for primarily coarse semantic tasks (like dictionary induction, document classification, etc.) that required capturing symmetric relationships (e.g., translational similarity) between words in different languages. While translational similarity helps identify correspondences, identifying other, asymmetric semantic relationships can improve language understanding where exact equivalence does not exist. One such relationship is hypernymy — a word x is said to be the hypernym of a word y (referred to as the hyponym) if y is a kind of x (e.g., dog is a hypernym of pitbull).

In this chapter, I will discuss how we can detect asymmetric relations like hypernymy across languages using cross-lingual word representations. That is, identifying that écureuil ("squirrel" in French) is a kind of rodent, or ագռավ ("crow" in Armenian) is a kind of bird. Before examining the cross-lingual hypernymy detection problem, it is prudent to revisit previous work and approaches on the monolingual version, and to examine the challenges of the cross-lingual version.

6.2. History of the Hypernymy Detection Task

The hypernymy detection problem involves predicting whether a given (x, y) pair is a hyponym-hypernym pair (that is, is y the hypernym of x?) or not. For instance, detecting that the word pair (dolphin, mammal) is a hyponym-hypernym pair while the pair (dolphin, reptile) is not.

The ability to detect lexical relations like hypernymy has a wide range of applications — as a component in textual entailment systems (Dagan et al., 2005; Sammons et al., 2012; Dagan et al., 2013), building semantic ontologies (Riloff and Shepherd, 1997; Chklovski and Pantel, 2004; Snow et al., 2006; Kozareva and Hovy, 2010), as a feature in coreference resolution (Ponzetto and Strube, 2006), developing question answering systems (Prager et al., 2001; Yih et al., 2013), or simply doing information extraction (Etzioni et al., 2005; Demeester et al., 2016).

6.2.1. Hypernymy Detection in English

Hypernymy detection in English is a well studied problem, with three prominent threads in the literature — path-based approaches, distributional approaches, and hybrid approaches that combine the two. I discuss them briefly below.

Path-based Approaches. Path-based approaches identify lexico-syntactic patterns (or paths) between x and y in a sentence that indicate lexical relationships like hypernymy (Hearst, 1992; Snow et al., 2004), synonymy (Lin et al., 2003), or meronymy (Girju et al., 2006). Instances of such patterns indicative of hypernymy include "x such as y" (as in "animals such as dogs") and "x and other y" (as in "squirrel and other rodents"), among others. Statistics over these patterns are computed for each candidate pair (x, y) over a large corpus, and weights are learnt for each pattern to determine hypernymy. As these patterns are quite specific, their presence is a precise and accurate indicator of hypernymy.

The key drawback of path-based approaches is that they require the co-occurrence of both x and y in the same sentence, exactly as in the predefined pattern. As a result, reliable statistics are often hard to obtain for word pairs that co-occur rarely, a problem that is exacerbated in corpora of limited size. Consequently, while such path-based approaches have been effective in terms of precision, they leave much to be desired in recall.

Distributional Approaches. Distributional approaches (Lin, 1998a; Weeds and Weir, 2003; Lenci and Benotto, 2012) probe the distributional representations of x and y to determine if the hypernymy relation holds between them. The distributional representations x and y for x and y are probed using a similarity measure Sim(x, y), which is either learnt or unsupervised. Different distributional approaches differ either in the choice of the representations for x and y, or in the unsupervised similarity measure Sim.

Dependency-based distributional representations (Lin, 1998a; Levy and Goldberg, 2014b) have emerged as a popular choice of representation in distributional approaches. On the other hand, a myriad of supervised and unsupervised measures are available to determine if an asymmetric relation like hypernymy holds for a (x, y) pair. Supervised measures learn classifiers using distributional features for a (x, y) pair, which can be any of the following: the concatenation x ⊕ y (Baroni et al., 2012), the difference y − x (Fu et al., 2014), the dot product y⊺x, or a combination of the normalized difference f1 = y/∥y∥ − x/∥x∥ and its element-wise square f2 as in Roller et al. (2014). Such approaches suffer from two major drawbacks — firstly, there exist limited resources that can serve as supervision for the similarity measure, and secondly, many of these approaches have been shown to be prone to lexical memorization (Levy et al., 2015).1 In contrast, unsupervised measures probe the features in word embeddings to determine hypernymy. There exist a variety of similarity measures that can be used to quantify the directional relationship between two words (Lin, 1998a; Weeds and Weir, 2003; Lenci and Benotto, 2012). One such unsupervised similarity measure, named Balanced Average Precision Inclusive (or BalAPinc), will be discussed in detail in Section 6.3.

Hybrid Approaches. Hybrid approaches (Mirkin et al., 2006; Kaji and Kitsuregawa, 2008) attempt to combine the path-based and distributional signals to identify lexical relationships. By combining a high-precision but low-recall path-based approach with a low-precision but high-recall distributional approach, hybrid approaches enjoy the benefits of both. The most recent work in this thread is that of Shwartz et al. (2016), who encoded the paths into a dense vector using an RNN, and combined it with the distributional vectors x and y. The network was supervised using hyponym-hypernym pairs and negative examples sourced from lexical resources like WordNet (Miller, 1995) and WikiData (Vrandečić, 2012).

6.2.2. Cross-lingual Hypernymy Detection

The cross-lingual hypernymy detection problem involves determining whether (x, y), where x and y are in different languages, constitutes a hyponym-hypernym pair. For instance, determining if (écureuil, rodent) constitutes a hyponym-hypernym pair, where écureuil is the French word for squirrel. The rest of the chapter assumes that y is always a word in English.

1That is, memorizing that words like mammal are hypernyms of x, regardless of x.

The ability to detect hypernyms across languages would have benefits similar to those of monolingual hypernymy detection. For instance, it could serve as a building block in cross-lingual tasks, like cross-lingual textual entailment (Negri et al., 2012, 2013), constructing multilingual taxonomies (Fu et al., 2014), event coreference in multilingual news sources (Vossen et al., 2015), evaluating machine translation via entailment (Pado et al., 2009), or detecting semantic divergence in parallel text (Carpuat et al., 2017).

At first glance, translating words to English and then identifying hypernyms in a monolingual setting may appear to be a sufficient solution. However, this approach cannot capture many phenomena. For instance, the English words cook, leader, and supervisor can all be hypernyms of the French word chef, as chef does not have an exact English translation covering all its possible usages in French. But translating chef to cook precludes identifying leader or supervisor as a hypernym. Similarly, language-specific usage patterns can also influence hypernymy decisions. For instance, the French word chroniqueur translates to chronicler in English, but is more frequently used in French to refer to journalists (making journalist its hypernym). This motivates approaches that extend distributional methods for detecting monolingual hypernymy to cross-lingual settings.

Building models that can robustly identify hypernymy across languages is a challenging problem. Firstly, in the cross-lingual setting there is no direct way to incorporate path-based features, because x and y belong to different languages and are unlikely to appear together in the same sentence. This precludes the path-based and hybrid approaches that have been used in monolingual settings. As a result, purely distributional approaches seem to be the only option for detecting cross-lingual hypernymy relationships. However, state-of-the-art distributional approaches for detecting monolingual hypernymy require syntactic analysis (Roller and Erk, 2016; Shwartz et al., 2017) (e.g., dependency parsing) to learn word representations, which may not be available for many languages.

Figure 11: The BiSparse-Dep approach, which learns sparse bilingual embeddings using dependency-based contexts. The resulting sparse embeddings, together with an unsupervised hypernymy detection measure, can detect hypernyms across languages (e.g., pomme is a fruit). (Diagram: monolingual corpora → dependency parser → dependency context extraction → co-occurrence matrices; bilingual dictionary → bilingual sparse coding → hypernymy detection measure.)

6.3. The BiSparse-Dep Approach

This section proposes BiSparse-Dep, a family of approaches that uses bilingual, dependency-based word embeddings to detect hypernymy between words in different languages.

An overview of BiSparse-Dep is in Figure 11. BiSparse-Dep has two key components: (a) dependency-based word representations, which enable generalization across different languages with minimal customization by abstracting away language-specific word order, and (b) bilingual sparse coding, which allows us to align dependency-based word representations in a shared semantic space using a small bilingual dictionary. The resulting sparse bilingual embeddings can then be used with an unsupervised hypernymy detection measure (Section 6.3) to determine hypernymy between word pairs.

Dependency-based Word Representations

As discussed in Chapter 2, the context of a word can be described in multiple ways when learning distributional representations. One such context is defined using the syntactic neighborhood of the word in a dependency graph.

Figure 12: Dependency tree for "The tired traveler roamed the sandy desert, seeking food".

For instance, for the sentence in Figure 12, the syntactic neighborhood for the target word traveler can be described in the following two ways:

• Full context (Padó and Lapata, 2007; Baroni and Lenci, 2010; Levy and Goldberg, 2014b): Children and parent words, concatenated with the label and direction of the relation (e.g., roamed#nsubj⁻¹ and tired#amod are contexts for traveler).

• Joint context (Chersoni et al., 2016): Parent concatenated with each of its siblings (e.g., roamed#desert and roamed#seeking are contexts for traveler).

Both context types encode directionality into the context, either through the label direction or through sibling-parent relations. The two contexts exploit different amounts of syntactic information — Joint does not require labeled parses, unlike Full. Joint context combines parent and sibling information, while Full keeps them as distinct contexts; a sketch of both extraction schemes follows.
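Below is a minimal sketch of extracting both context types from a parsed sentence, assuming tokens carry an index, surface form, head index, and dependency label; the Token container and helper names are illustrative, not taken from any particular toolkit.

    from collections import namedtuple

    Token = namedtuple("Token", ["idx", "form", "head", "deprel"])  # head == 0 means root

    def full_contexts(tokens, t):
        """Full contexts: parent and children, each tagged with the dependency
        label (the parent edge carries an inverse marker for direction)."""
        ctx = []
        parent = next((p for p in tokens if p.idx == t.head), None)
        if parent is not None:
            ctx.append(f"{parent.form}#{t.deprel}-1")   # inverse edge up to the parent
        ctx.extend(f"{c.form}#{c.deprel}" for c in tokens if c.head == t.idx)
        return ctx

    def joint_contexts(tokens, t):
        """Joint contexts: the parent concatenated with each of t's siblings;
        only the unlabeled tree is needed."""
        parent = next((p for p in tokens if p.idx == t.head), None)
        if parent is None:
            return []
        return [f"{parent.form}#{s.form}"
                for s in tokens if s.head == parent.idx and s.idx != t.idx]

    # The sentence of Figure 12, with a simplified parse:
    sent = [Token(1, "The", 3, "det"), Token(2, "tired", 3, "amod"),
            Token(3, "traveler", 4, "nsubj"), Token(4, "roamed", 0, "root"),
            Token(5, "the", 7, "det"), Token(6, "sandy", 7, "amod"),
            Token(7, "desert", 4, "dobj"), Token(8, "seeking", 4, "advcl"),
            Token(9, "food", 8, "dobj")]
    traveler = sent[2]
    print(full_contexts(sent, traveler))   # ['roamed#nsubj-1', 'The#det', 'tired#amod']
    print(joint_contexts(sent, traveler))  # ['roamed#desert', 'roamed#seeking']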

Word representations learnt using the syntactic neighborhood of a word (such as the ones described above) are popularly called dependency-based word representations. Dependency-context based word representations capture functional similarity (e.g., singing and rapping), in contrast to the topical similarity (e.g., singing and dancing) captured by lexical context (Levy and Goldberg, 2014b). Such dependency-based embeddings have been shown to outperform window-based embeddings on many tasks (Bansal et al., 2014; Hill et al., 2014; Melamud et al., 2016). In fact, for the monolingual hypernymy detection task, it has been shown that dependency embeddings can recover Hearst patterns (Roller and Erk, 2016), and are almost always superior to window-based embeddings (Shwartz et al., 2017).

Dependency Contexts without a Treebank

Using dependency contexts in multilingual settings may not always be possible, as large dependency-parsed corpora are hard to obtain. One can parse a raw corpus in the language of interest using a dependency parser, but pre-trained dependency parsers are available for only a handful of languages. Training a parser is also not feasible, as the dependency treebanks used to supervise such parsers are not available for many languages. To circumvent these issues, a weak dependency parser is trained on languages related to the language of interest. Specifically, a delexicalized parser is trained using treebanks of related languages, with the word form features turned off, so that the parser is trained on purely non-lexical features (e.g., POS tags). The rationale is that related languages share common syntactic structure that can be transferred to another language via delexicalized parsing (Zeman and Resnik, 2008; McDonald et al., 2011, inter alia).
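One simple way to approximate this setup is to delexicalize the source treebanks before training, blanking the word forms so that only non-lexical columns remain. The sketch below assumes CoNLL-U formatted treebanks; the file names are placeholders, and real systems typically disable word-form features inside the parser instead of editing the data.

    def delexicalize_conllu(line: str) -> str:
        """Blank the lexical columns (FORM, LEMMA) of one CoNLL-U token line,
        keeping POS tags and the dependency structure intact."""
        cols = line.rstrip("\n").split("\t")
        if len(cols) == 10 and not line.startswith("#"):
            cols[1] = "_"  # FORM
            cols[2] = "_"  # LEMMA
        return "\t".join(cols)

    # Delexicalize a related-language treebank before training the weak parser.
    with open("related_language.conllu") as src, open("delex.conllu", "w") as out:
        for line in src:
            out.write(delexicalize_conllu(line) + "\n")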

Unsupervised Hypernymy Detection Measure

When can one assert that a word y is a hypernym of a word x using their distributional representations? To answer this question, a hypothesis was formalized by Weeds et al. (2004) and Geffet and Dagan (2005):

The Distributional Inclusion Hypothesis

The distributional inclusion hypothesis (DIH) states that if a word y is a hypernym of a word x, then the set of distributional features of x is included in the set of distributional features of y (Weeds et al., 2004; Geffet and Dagan, 2005). That is, the contexts in which x occurs are a subset of those in which y occurs.

DIH was introduced in (Weeds et al., 2004) as distributional generality for hypernymy detection, and was stated more generally for determining lexical entailment in (Geffet and Dagan, 2005). Intuitively, DIH states that a hypernym y can replace appearances of its hyponym x. For example, rodent can replace squirrel in the sentence "the squirrel is hibernating for the winter". Notice that the reverse is not true — squirrel cannot replace rodent in a sentence like "The capybara is the largest living rodent". To apply DIH to detect hypernymy, one needs to quantify the amount of overlap in the co-occurrences of x and y with other contexts.

Known Limitations. The intuition behind DIH is not always correct. For instance, Kartsaklis and Sadrzadeh (2016) noted that in sentences with quantifiers (e.g., "all", "none"), replacing a word with its hypernym is not always appropriate (e.g., changing "all squirrels hibernate" to "all animals hibernate"). In such contexts, DIH fails. Similarly, Rimell (2014) noted that DIH is not correct in contexts that are collocational (e.g., "hot dog" to "hot animal"), highly specific ("squirrels eat nuts" to "animals eat nuts"), or when the hypernym being considered is too general (e.g., entity and squirrel).

Despite these limitations, DIH has enjoyed success in many lexical entailment applications, and several unsupervised hypernymy detection measures have been developed that appeal to it (Weeds and Weir, 2003; Weeds et al., 2004; Clarke, 2009; Lenci and Benotto, 2012).

The BalAPinc Measure. In this work, we use one such measure, named BalAPinc, first described by Kotlerman et al. (2009), to score word pairs for hypernymy. BalAPinc is defined using two other measures: LIN (Lin, 1998b) and APinc (Kotlerman et al., 2009).

The LIN measure defines a symmetric similarity score for a word pair (x, y),

\[
\mathrm{LIN}(x, y) = \frac{\sum_{f \in x \cap y} \big( x[f] + y[f] \big)}{\sum_{f \in x} x[f] + \sum_{f \in y} y[f]}
\tag{6.1}
\]

where x and y are representations of x and y respectively, f ∈ x and f ∈ y are feature indices active in x and y respectively, and f ∈ x ∩ y are feature indices that are active (non-zero) in both x and y. The value of the feature at index f is x[f]. The LIN measure computes the amount of information needed to describe the commonality of x and y, relative to the information needed to fully describe x and y.

On the other hand, APinc is an asymmetric score measuring a relevance-weighted overlap in the context co-occurrences of x and y, computed using a modified version of average precision,2

\[
\mathrm{APinc}(x \to y) = \frac{\sum_{r=1}^{|x|} P(r) \cdot \mathrm{rel}(f_r)}{|x|},
\qquad
\mathrm{rel}(f) =
\begin{cases}
1 - \dfrac{\mathrm{rank}(f, y)}{|y| + 1} & \text{if } f \in y \\
0 & \text{otherwise}
\end{cases}
\tag{6.2}
\]

here rel(f) is a measure of feature f's relevance, rank(f, y) is the rank of feature f (among all active features) in the representation of y, f_r is the feature of x at rank r, and P(r) is the precision at rank r of the features of y with respect to the features of x (i.e., the ratio of the number of features of x included in y from rank 1 to r, to r). APinc computes the proportion of the included (active) features of y, weighted by their relevance score rel(f), with respect to all active features of x. BalAPinc is defined as the geometric mean of these measures,

2A metric used for ranking evaluation.

\[
\mathrm{BalAPinc}(x \to y) = \sqrt{\mathrm{LIN}(x, y) \cdot \mathrm{APinc}(x \to y)}
\tag{6.3}
\]
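The three measures are straightforward to implement over sparse, nonnegative feature vectors. The sketch below assumes dense numpy arrays holding the sparse-coded weights, with inactive features equal to zero; it is an illustration of Eqs. 6.1-6.3, not the reference implementation.

    import numpy as np

    def lin(x, y):
        """LIN similarity (Eq. 6.1) over nonnegative sparse feature vectors."""
        shared = (x > 0) & (y > 0)
        denom = x[x > 0].sum() + y[y > 0].sum()
        return (x[shared] + y[shared]).sum() / denom if denom > 0 else 0.0

    def apinc(x, y):
        """APinc (Eq. 6.2): relevance-weighted inclusion of x's features in y."""
        x_feats = np.argsort(-x)[: int((x > 0).sum())]   # active features of x, by weight
        y_order = np.argsort(-y)[: int((y > 0).sum())]   # active features of y, by weight
        y_rank = {int(f): r + 1 for r, f in enumerate(y_order)}
        score, included = 0.0, 0
        for r, f in enumerate(x_feats, start=1):
            if int(f) in y_rank:                          # rel(f) = 0 when f not in y
                included += 1
                rel = 1.0 - y_rank[int(f)] / (len(y_rank) + 1)
                score += (included / r) * rel             # P(r) * rel(f_r)
        return score / len(x_feats) if len(x_feats) else 0.0

    def bal_apinc(x, y):
        """BalAPinc (Eq. 6.3): geometric mean of LIN and APinc."""
        return float(np.sqrt(lin(x, y) * apinc(x, y)))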

Bilingual Sparse Coding

To compare the features of words x and y belonging to different languages, one has to first align the independently learnt dependency representations. Moreover, the BalAPinc metric described above requires word representations that are sparse, so that the active features for computing the score can be identified. To achieve this, we generate BiSparse-Dep embeddings using the BiSparse framework of Vyas and Carpuat (2016). BiSparse generates sparse, bilingual word embeddings using a dictionary learning objective with a sparsity-inducing l1 penalty. Appendix A.2 describes the learning objective in detail.

BiSparse takes as input pre-computed monolingual embeddings Xe, Xf for two languages, along with a translation matrix S, and outputs sparse matrices Ae and Af that are bilingual representations in a shared semantic space. The translation matrix S (of size ve × vf) captures correspondences between the vocabularies (of size ve and vf) of the two languages. For instance, each row of S can be a one-hot vector that identifies the word in vf most frequently aligned with that row's word in ve in a large parallel corpus, thus building a one-to-many mapping between the two languages.
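A minimal sketch of building such a one-hot S, assuming alignment counts extracted from a word-aligned parallel corpus and dictionaries mapping each vocabulary word to its row or column index; all names are illustrative.

    import numpy as np

    def one_hot_translation_matrix(align_counts, vocab_e, vocab_f):
        """Build the v_e x v_f translation matrix S: each row is a one-hot
        vector marking the f-side word most frequently aligned with that
        e-side word. `align_counts` maps (e_word, f_word) -> count."""
        S = np.zeros((len(vocab_e), len(vocab_f)))
        best = {}  # e_word -> (f_word, count) with the highest count so far
        for (we, wf), c in align_counts.items():
            if we in vocab_e and wf in vocab_f and c > best.get(we, ("", -1))[1]:
                best[we] = (wf, c)
        for we, (wf, _) in best.items():
            S[vocab_e[we], vocab_f[wf]] = 1.0
        return S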

6.4. Crowd-Sourcing Annotations

There is no publicly available dataset to evaluate models of hypernymy detection across multiple languages. While ontologies like the Open Multilingual WordNet (OMW) (Bond and Foster, 2013) and BabelNet (Navigli and Ponzetto, 2012) contain cross-lingual links, these resources are semi-automatically generated and hence contain noisy edges. To make consistent claims about a hypernymy detection model, it is crucial to carefully design a dataset (Carmona and Riedel, 2017). To get reliable and high-quality test beds, evaluation datasets were collected using CrowdFlower.3 The datasets span four languages from distinct families — French (fr), Russian (ru), Arabic (ar) and Chinese (zh) — each paired with English.

Annotation Setup for Cross-lingual Hypernymy. To begin the annotation process, candidate pairs are first pooled using hypernymy edges across languages from OMW and BabelNet, along with translations from monolingual hypernymy datasets (Kotlerman et al., 2010; Baroni and Lenci, 2011; Baroni et al., 2012).

The annotation task requires annotators to be fluent in both English and the non-English language. To ensure only fluent speakers perform the task, the task instructions are provided in the non-English language itself, and the task is restricted to annotators verified by CrowdFlower to have the respective language skills. Annotators also need to pass a quiz based on a small amount of gold standard data to gain access to the task.

Annotators choose between three options for each word pair (pf, qe), where pf is a non-English word and qe is an English word: "pf is a kind of qe", "qe is a part of pf" and "none of the above". Word pairs labeled with the first option are considered positive examples, while those labeled "none of the above" are considered negative.4 The second option was included to filter out meronymy examples that were part of the noisy pool. It is left to the annotator to infer whether the relation holds between any senses of pf or qe, if either of them is polysemous.

3A crowd-sourcing platform, www.figure-eight.com.

For every candidate hypernym pair (pf, qe), annotators are also asked to judge its reversed and translated hyponym pair (qf, pe). That is, if (citron, food) is a hypernym candidate, annotators are also shown (aliments, lemon), a potential hyponym candidate (potential, because, as mentioned in Section 6.2.2, translation need not preserve semantic relationships).

The purpose of presenting the hyponym pair (qf, pe) is two-fold. First, it emphasizes the directional nature of the task to the annotators. Second, it also identifies hyponym pairs, which we use as negative examples. The hyponym pairs are challenging, since differentiating them from hypernyms truly requires detecting asymmetry.

Each pair was judged by at least 5 annotators, and judgments with 80% agreement (at least 4 annotators agree) are considered for the final dataset. This is a stricter condition than certain monolingual hypernymy datasets, such as EVALution (Santus et al., 2015), where agreement by 3 annotators is deemed enough. Inter-annotator agreement measured using Fleiss' Kappa (Fleiss, 1971) was 58.1 (French), 53.7 (Russian), 53.2 (Arabic) and 55.8 (Chinese). This indicates moderate agreement, on par with agreement obtained on related fine-grained semantic tasks (Pavlick et al., 2015). Comparison with monolingual hypernymy annotator agreement is not possible as, to the best of our knowledge, such numbers are not available for existing test sets. Dataset statistics are shown in Table 17.

It was observed that annotators were able to agree on pairs containing polysemous words where hypernymy holds for some sense. For instance, for the French-English pair (avocat, professional), the French word avocat can either mean lawyer or avocado, but the pair was annotated as a positive example.

4The extra negative pairs were sub-sampled so as to keep a balanced dataset for evaluation.

Lang. Pair         #crowd-sourced    #pos (= #neg)
French–English     2115              763
Russian–English    2264              706
Arabic–English     2144              691
Chinese–English    2165              806

Table 17: Crowd-sourced dataset statistics. #pos (#neg) denotes the number of positives (negatives) in the evaluation set, and #crowd-sourced denotes the number of crowd-sourced pairs. Negatives were deliberately under-sampled to obtain a balanced evaluation set. Examples of positive instances in French and Russian: (pêcheur (en:fisher), worker), (замок (en:castle), structure); examples of negative instances: (vêtement (en:clothing), jeans), (флора (en:flora), sunflower).

Two Evaluation Sets. To verify whether the crowd-sourced hyponyms are challenging negative examples, we create two evaluation sets. Both share the (crowd-sourced) positive examples, but differ in the nature of their negative examples:

• Hyper-Hypo — negative examples are the crowd-sourced hyponyms.

• Hyper-Cohypo — negative examples are cohyponyms drawn from OMW.

Cohyponyms are words sharing a common hypernym. For instance, bière ("beer" in French) and vodka are cohyponyms since they share a common hypernym in alcool/alcohol. Cohyponyms were chosen for the second test set for two reasons. First, performing well on this test set requires differentiating between a symmetric relation (word similarity) and an asymmetric relation (hypernymy). For instance, bière and vodka are similar words, yet they do not stand in a hypernymy relationship. Second, cohyponyms are a popular choice of negative examples in other monolingual entailment datasets (Baroni and Lenci, 2011).

6.5. Experimental Setup

Training BiSparse-Dep requires a dependency-parsed monolingual corpus, and a translation matrix for jointly aligning the monolingual vectors. The translation matrix is computed using word alignments derived from parallel corpora. The corpus statistics for both monolingual and parallel corpora are in Table 18. While we use parallel corpora to generate the translation matrix to be comparable to the baselines (Section 6.5), we do not require it — the matrix can be obtained from any bilingual dictionary.

Computing Dependency Co-occurrences. The monolingual corpora are parsed using Yara Parser (Rasooli and Tetreault, 2015), trained on the corresponding treebank from version 1.4 of the Universal Dependency Treebank (McDonald et al., 2013). Yara Parser was chosen as it is fast, and competitive with state-of-the-art parsers (Choi et al., 2015). The monolingual corpora were POS-tagged using TurboTagger (Martins et al., 2013). Dependency contexts for words are induced by first thresholding the language vocabulary to the top 50,000 nouns, verbs and adjectives. A co-occurrence matrix is then computed over this vocabulary using the context types in Section 6.3.

Inducing BiSparse-Dep Embeddings. The entries of the word-context co-occurrence matrix are re-weighted using PPMI (Bullinaria and Levy, 2007) from Section 2.6. The resulting matrix is reduced to 1000 dimensions using SVD (Golub and Kahan, 1965). The embedding dimensionality was chosen based on preliminary experiments with {500, 1000, 2000, 3000} dimensional vectors for English–French. The output matrices are used as Xe, Xf in the setup from Section 6.3 to generate 100-dimensional sparse bilingual embeddings.

Evaluation. Accuracy is used as the evaluation metric in all experiments, as it is easy to interpret when the classes are balanced (Turney and Mohammad, 2015). Both evaluation datasets — Hyper-Hypo and Hyper-Cohypo — are split into a development and a test set in the ratio 1:2 respectively.

Tuning. BalAPinc has two tunable parameters — 1) a threshold on the BalAPinc score above which all examples are labeled as positive, and 2) the maximum number of features to consider for each word. The tuning set is used to tune these two parameters, as well as the various hyper-parameters associated with the models.
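A minimal sketch of the threshold tuning step, assuming precomputed BalAPinc scores and 0/1 gold labels on the development split; the grid of candidate thresholds and the variable names are assumptions.

    def tune_threshold(scores, labels, grid):
        """Pick the decision threshold maximizing accuracy on the dev split.
        `scores` are BalAPinc values; `labels` are 0/1 gold judgments."""
        def accuracy(t):
            return sum((s > t) == bool(y) for s, y in zip(scores, labels)) / len(labels)
        return max(grid, key=accuracy)

    # e.g., best = tune_threshold(dev_scores, dev_labels, [i / 100 for i in range(100)])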

Lang.     Parallel Data                                                    #sent.   Monolingual Data                    #sent.
English   –                                                                –        Wackypedia (Baroni et al., 2009)    43M
French    Europarl (Koehn, 2005)                                           2.7M     Wikipedia                           20M
Russian   NewsCommentary, Wikipedia (Tiedemann, 2012), Yandex-1M           1.6M     Wikipedia                           22M
Arabic    ISI (Munteanu and Marcu, 2007)                                   1.1M     Gigaword 3.0 (Graff, 2007)          17M
Chinese   NewsCommentary, Wikipedia (Tiedemann, 2012), FBIS (LDC2003E14)   9.5M     Gigaword 5.0 (Parker, 2011)         58M

Table 18: Training data statistics for different languages. While we use parallel corpora for computing translation dictionaries, our approach can work with any bilingual dictionary.

Contrastive Approaches

The BiSparse-Dep approach is compared with the following approaches:

(a) Mono-Dep (Translation Baseline): For a word pair (pf, qe) in the test data, pf is translated to English using the most common translation in the translation matrix. Entailment is then determined using sparse, dependency-based embeddings in English.

(b) BiSparse-Lex (Window-Based): The predecessor of the BiSparse-Dep model, from Vyas and Carpuat (2016). This model induces sparse, cross-lingual embeddings using window-based contexts and the bilingual sparse coding objective.

(c) Bivec+ (Window-Based): An extension of the Bivec model of Luong et al. (2015b). Bivec generates dense, cross-lingual embeddings using window-based contexts, by substituting aligned word pairs within a window in parallel sentences. By default, Bivec only trains using parallel data, so to ensure a fair comparison it is initialized with monolingually trained window-based embeddings.

(d) Cl-Dep (Dependency-Based): The model from Vulić (2017), which induces dense, dependency-based cross-lingual embeddings using the word2vecf5 toolkit, by translating syntactic word-context pairs using the most common translation.

Evaluating Robustness of BiSparse-Dep

This set of experiments investigates how robust BiSparse-Dep is when exposed to data scarce settings. Evaluating on a truly low-resource language is complicated by the fact that obtaining an evaluation dataset for such a language is difficult. Therefore, such settings are simulated for the languages in our dataset in multiple ways.

No Treebank. If a treebank is not available in a language, dependency contexts have to be induced using treebanks from other languages (Section 6.3), which can affect the quality of the dependency-based embeddings. To simulate this, a delexicalized parser is trained for each of the four languages. We use treebanks from Slovenian, Ukrainian, Serbian, Polish, Bulgarian, Slovak and Czech (40k sentences) for training the Russian parser, and treebanks from English, Spanish, German, Portuguese, Swedish and Italian (66k sentences) for training the French parser. UDT does not (yet) have languages in the same family as Arabic or Chinese, so for the sake of completeness, Arabic and Chinese delexicalized parsers are trained on treebanks of the language itself. After delexicalized training, the Labeled Attachment Score (LAS) on the UDT test set dropped by several points for all languages — from 76.6% to 60.0% for Russian, from 83.7% to 71.1% for French, from 76.3% to 62.4% for Arabic and from 80.3% to 53.3% for Chinese. The monolingual corpora are then parsed with these weaker parsers, and dependency context co-occurrences are computed as before.

Sub-sampling Monolingual Data. To simulate low-resource behavior along another axis, we sub-sample the monolingual corpora used by BiSparse-Dep to induce monolingual vectors, Xe, Xf. Specifically, Xe and Xf are trained using progressively smaller corpora.

5bitbucket.org/yoavgo/word2vecf/

en with →                 ru      zh      ar      fr      avg.
Model ↓
Translation Baseline
  Mono-Dep                50.1    52.3    51.8    54.5    52.2
Window Based
  BiSparse-Lex            56.6    53.7    50.9    52.0    53.3
  Bivec+                  55.8    52.0    51.5    53.4    53.2
Dependency Based
  Cl-Dep                  60.2    54.4    56.7*   53.8    56.3
  BiSparse-Dep (Full)     59.0    55.9    52.6    56.6    56.0
  BiSparse-Dep (Joint)    53.8    57.0*   52.4    59.9*   55.8
  BiSparse-Dep (Unlab)    55.9    51.2    53.3    55.9    54.1

Table 19: Comparing the different approaches from Section 6.5 with our BiSparse-Dep approach on Hyper-Hypo (random baseline = 0.5). Bold denotes the best score for each language, and * on the best score indicates a statistically significant (p < 0.05) improvement over the next best score, using McNemar's test (McNemar, 1947).

Quality of Bilingual Dictionary. We study the impact of the quality of the bilingual dictionary used to create the translation matrix S. This experiment involves using increasingly smaller parallel corpora to induce the translation dictionary.

6.6. Experiments

The experiments aim to answer the following questions — (a) Are dependency-based embeddings superior to window-based embeddings for cross-lingual hypernymy? (Section 6.6.1) (b) Are our models robust in data-scarce settings? (Section 6.7) (c) Does directionality in the dependency context help cross-lingual hypernymy detection? (d) Is the answer to (a) predicated on the choice of hypernymy detection measure?

6.6.1. Dependency v/s Window Contexts

We compare the performance of the models described in Section 6.5 with the BiSparse-Dep (Full and Joint) models. We evaluate the models on the two test splits described in Section 6.4 — Hyper-Hypo and Hyper-Cohypo.

en with →                 ru      zh      ar      fr      avg.
Model ↓
Translation Baseline
  Mono-Dep                58.7    50.0    65.1    56.9    57.7
Window Based
  BiSparse-Lex            63.8    55.8    65.8    63.2    62.2
  Bivec+                  55.9    64.9    62.2    54.1    58.3
Dependency Based
  Cl-Dep                  56.2    62.7    63.1    61.0    60.0
  BiSparse-Dep (Full)     63.6    67.3    66.8*   66.7*   66.1
  BiSparse-Dep (Joint)    60.6    63.6    65.9    64.9    63.8
  BiSparse-Dep (Unlab)    58.6    66.7    62.4    61.5    62.4

Table 20: Comparing the different approaches from Section 6.5 with our BiSparse-Dep approach on Hyper-Cohypo (random baseline = 0.5). Bold denotes the best score for each language, and * on the best score indicates a statistically significant (p < 0.05) improvement over the next best score, using McNemar's test (McNemar, 1947).

Hyper-Hypo Results. Table 19 shows the results on the Hyper-Hypo set. The benefit of directly modeling the cross-lingual relationship between words (as opposed to first translating to English) is evident in that almost all models (except Cl-Dep on French) outperform the translation baseline. Among dependency-based models, BiSparse-Dep (Full) and Cl-Dep consistently outperform both window models, while BiSparse-Dep (Joint) outperforms them on all except Russian. This confirms that dependency contexts are more effective than window contexts for cross-lingual hypernymy detection. BiSparse-Dep (Joint) and Cl-Dep are the best models for French, Chinese and Arabic, with no statistically significant difference between them for Russian.

Hyper-Cohypo Results. The trends observed on Hyper-Hypo also hold on Hyper-Cohypo, i.e., dependency-based models continue to outperform window-based models (Table 20). Overall, BiSparse-Dep (Full) performs best in this setting, followed closely by BiSparse-Dep (Joint). This suggests that the sibling information encoded in Joint is useful for distinguishing hypernyms from hyponyms (Hyper-Hypo results), while the dependency labels encoded in Full help to distinguish hypernyms from co-hyponyms. Also, all models improve significantly on the Hyper-Cohypo set, suggesting that discriminating hypernyms from co-hyponyms is easier than discriminating hypernyms from hyponyms.

While the BiSparse-Dep models generally performed better than window models on both test sets, Cl-Dep was not as consistent (e.g., it was worse than the best window model on Hyper-Cohypo). One reason for this is that BalAPinc is designed for sparse embeddings and is likely to perform poorly with dense embeddings (Turney and Mohammad, 2015). This explains the relatively inconsistent performance of Cl-Dep.

6.6.2. Ablating Directionality in Context

The context described by the Full and Joint BiSparse models encodes directional infor- mation (Section 6.3) either in the form of label direction (Full), or using sibling information

(Joint). Is directionality in the context useful to capture asymmetric relationship hyper- nymy? To answer this, a third BiSparse model that uses Unlabeled dependency contexts is evaluated. The Unlabeled context is similar to the Full context, except it does not concatenate the label of the relation to the context word (parent or children). For instance, for the word traveler in Figure 12, Unlabeled contexts are roamed and tired.

Experiments on both Hyper-Hypo and Hyper-Cohypo (bottom row, Tables 19 and 20) highlight that directional information is essential — Unlabeled almost always performs worse than Full and Joint, and in many cases worse than even window based models.

6.6.3. Qualitative Analysis

The predictions of the window-based and dependency-based models are analyzed qualitatively by examining the common errors made by both the BiSparse-Dep (Full) and BiSparse-Lex models in Table 21. Often both models have difficulty capturing functional relations, such as époux (spouse in French) is a relative or бас (bass in Russian) is an instrument, though overall, the dependency-based model makes fewer mistakes of this type. For false positives, both models often ignore the directional nature of the task, predicting vêtement (clothing in French) is a kind of jeans or ткань (Russian for tissue) is a kind of nerve as positive instances.

False Negatives                       False Positives
(pêcheur (en:fisher), worker)         (vêtement (en:clothing), jeans)
(époux (en:spouse), relative)         (cercueil (en:coffin), casket)
(équipe (en:team), unit)              (statut (en:status), barony)
(бас (en:bass), instrument)           (ткань (en:tissue), nerve)
(замок (en:castle), structure)        (флора (en:flora), sunflower)
(трейлер (en:trailer), vehicle)       (чувство (en:feeling), cynicism)

Table 21: Common errors made by the different models described in Section 6.3.

Many test pairs were hard for both models due to translation ambiguity. For instance, in (dessous, garment) from the French dataset, dessous can be translated to underneath; however, the apt translation (i.e., the one under which hypernymy holds) is underwear. For another French pair, (poire, fruit), the word poire can be translated to perry, which is a drink, but the apt translation is pear. Similarly, for the Russian pair (лук, produce), лук can be translated to bow, but the apt translation is onion. In (глава, leader), глава can be incorrectly translated to chapter, but the apt translation is head.

6.7. Evaluating Robustness of BiSparse-Dep

The experiments in the previous section highlight the robust behavior of the BiSparse models across languages when exposed to all available information. The experiments in this section investigate how the BiSparse-Dep models behave when exposed to various data-scarce conditions. Three different settings are examined: (a) no dependency treebank in a language, (b) a small monolingual corpus, and (c) a lower quality bilingual dictionary.

6.7.1. No Treebank

These experiments are conducted with a version of BiSparse-Dep that uses the Full context type for both English and the non-English (target) language, but where the target language contexts are derived from a corpus parsed using a delexicalized parser (Section 6.5). This BiSparse-Dep model is compared against the best window-based model and the best dependency-based model from the previous section. Results are in Table 22.

en with →       ru      zh      ar      fr      avg.
Model ↓
Hyper-Hypo
  Best Win.     56.6    53.7    51.5    53.4    53.8
  Delex.        59.1*   55.1*   54.6*   56.1*   56.2
  Best Dep.     60.2    57.0*   56.7*   59.9*   58.5
Hyper-Cohypo
  Best Win.     63.8    64.9    65.8    63.2    64.4
  Delex.        59.4    65.7*   67.5*   66.3*   64.7
  Best Dep.     63.6*   67.3*   66.8*   66.7    66.1

Table 22: Robustness to the absence of a treebank: the delexicalized model is competitive with the best dependency-based and the best window-based models on both test sets. For each dataset, * indicates a statistically significant (p < 0.05) improvement over the next best model in that column, using McNemar's test (McNemar, 1947).

The results show that this version of BiSparse-Dep compares favorably on all language pairs against the best window-based and the best dependency-based models. In fact, it almost consistently outperforms the best window-based model by several points, and is only slightly worse than the best dependency-based model.

Further analysis revealed that the good performance of the delexicalized model is due to the relative robustness of the delexicalized parser on frequent contexts in the co-occurrence matrix. In French and Russian, the most frequent contexts were derived from amod, nmod, nsubj and dobj edges.6 Of these, the nmod edge appears in 44% and 33% of the Russian and French contexts respectively. The delexicalized parser predicts both the label and direction of the nmod edge correctly with an F1 of 68.6 for Russian and 69.6 for French, compared to a fully-trained parser (76.7 F1 for Russian, 76.8 F1 for French).

6Together they make up at least 70% of the contexts.

Figure 13: Robustness to small corpus: for most languages, BiSparse-Dep outperforms the corresponding best window-based model on Hyper-Hypo with about 40% of the monolingual corpora. (Plot: test accuracy (%) vs. corpus size fraction, for fr/zh/ru/ar, dep vs. win models.)

6.7.2. Small Monolingual Corpus

In this experiment, increasingly smaller monolingual corpora (10%, 20%, 40%, 60% and 80%, sampled at random) are used to induce the monolingual vectors for the BiSparse-Dep (Full) model, to evaluate its robustness to small corpora. Trends in Figure 13 indicate that BiSparse-Dep models that use only 40% of the original data remain competitive with a BiSparse-Lex model that uses all of the data. Robust performance with smaller monolingual corpora is desirable, since large corpora are not always easily available.

6.7.3. Quality of Bilingual Dictionary

Bilingual dictionaries derived from smaller amounts of parallel data are likely to be of lower quality compared to those derived from larger corpora. To analyze the impact of dictionary quality on BiSparse-Dep (Full), increasingly smaller parallel corpora are used to induce bilingual dictionaries as the score matrix S (Section 6.3). For this experiment, the top 10%, 20%, 40%, 60% and 80% of sentences from the parallel corpora are used.

Figure 14: Robustness to low quality dictionary: for most languages, BiSparse-Dep outperforms the corresponding best window-based model on Hyper-Hypo with increasingly lower quality dictionaries. (Plot: test accuracy (%) vs. fraction of parallel corpus, for fr/zh/ru/ar, dep vs. win models.)

Figure 14 shows that even with a lower quality dictionary, BiSparse-Dep performs better than BiSparse-Lex.

6.7.4. Choice of Hypernymy Detection Measure

To see if the conclusions drawn depend on the choice of the hypernymy detection measure, the measure is changed from BalAPinc to SLQS (Santus et al., 2014) and the experiments from Section 6.6.1 are redone. SLQS is based on the distributional informativeness hypothesis, which states that hypernyms are less "informative" than hyponyms because they occur in more general contexts. The informativeness Eu of a word u is defined as the median entropy of its top N context dimensions,

\[
E_u = \operatorname*{median}_{k=1}^{N} H(c_k),
\]

where H(c_i) denotes the entropy of dimension c_i. The SLQS score for a pair (u, v) is the relative difference in their entropies,

\[
\mathrm{SLQS}(u \to v) = 1 - \frac{E_u}{E_v}
\]
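A small sketch of the SLQS computation under these definitions, assuming a nonnegative word-context co-occurrence matrix in which ctx_matrix[c] gives the distribution of words co-occurring with context dimension c; the names, the entropy base, and the top-N cutoff are assumptions, not specifics from Santus et al. (2014).

    import numpy as np

    def entropy(counts):
        """Shannon entropy of a (count) distribution."""
        p = counts[counts > 0] / counts[counts > 0].sum()
        return float(-(p * np.log2(p)).sum())

    def informativeness(u_vec, ctx_matrix, n=50):
        """E_u: median entropy of the word's top-N context dimensions."""
        top = np.argsort(-u_vec)[:n]                       # top-N contexts of u
        return float(np.median([entropy(ctx_matrix[c]) for c in top]))

    def slqs(u_vec, v_vec, ctx_matrix, n=50):
        """SLQS(u -> v) = 1 - E_u / E_v (Santus et al., 2014)."""
        return 1.0 - (informativeness(u_vec, ctx_matrix, n)
                      / informativeness(v_vec, ctx_matrix, n))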

Recent work from Shwartz et al. (2017) has found SLQS to be more successful than other measures in monolingual hypernymy detection.

The results with SLQS are consistent with those in Section 6.6.1 — both BiSparse-Dep models still outperform window-based models. Also, the delexicalized version of BiSparse-Dep outperforms the window-based models, showing that the robust behavior demonstrated in Section 6.7 is also invariant across measures. It was found that using BalAPinc led to better results than SLQS. For both BiSparse-Dep models, BalAPinc wins across the board for two languages (Russian and Chinese), and wins half the time for the other two languages compared to SLQS.

6.8. Summary

This chapter demonstrated that cross-lingual representations can also reveal asymmetric relationships like hypernymy between words in different languages. This was shown using a new distributional approach, namely BiSparse-Dep, based on cross-lingual embeddings derived from dependency contexts. By appealing to the distributional inclusion hypothesis to probe these representations, hypernymy relationships between words in different languages were determined. Experimental results show that BiSparse-Dep is superior to standard window-based models and a translation baseline. The approach was also shown to be robust in various low-resource scenarios, especially when the syntactic contexts were derived using a weak dependency parser, an encouraging result for low-resource languages.

Several interesting future directions remain open — incorporating path-based information indirectly into the representation, or replacing the delexicalized parser with more sophisticated transfer strategies (Rasooli and Collins, 2017; Aufrant et al., 2016), might prove beneficial. It remains to be seen how our approach performs for other language pairs beyond the simulated low-resource settings. The approach has the potential to complement existing work on creating cross-lingual ontologies such as BabelNet and the Open Multilingual WordNet, which are noisy because they are compiled semi-automatically, and have limited language coverage. In such settings, distributional approaches can help refine ontology construction for any language where sufficient resources are available.

This chapter assumed that the hypernymy relationship can be detected out of context, i.e., without paying heed to the context in which the words appear. This is a common assumption in most monolingual lexical semantic tasks (Kotlerman et al., 2010; Baroni and Lenci, 2011; Baroni et al., 2012). However, this may not be true for polysemous words like bat (which can take both mammal and equipment as hypernyms). This criticism is related to that of static word representations, discussed in Section 5.8. An important future direction is considering the cross-lingual hypernymy detection task in context, as has been done recently in the monolingual setting by Shwartz and Dagan (2015) and Vyas and Carpuat (2017).

CHAPTER 7 : Multilingual Supervision for Cross-lingual Entity Linking

7.1. Introduction

The web is a valuable source of knowledge, expressed through text written either in English or in other languages.1 As most of the text on the web is unstructured, information extraction tasks (e.g., coreference resolution, named entity recognition, etc.) have been developed to extract structured information and facilitate better text understanding. Additionally, collective efforts to organize knowledge about entities and concepts have led to the creation of knowledge bases (KB) like Wikipedia, DBPedia (Auer et al., 2007), Freebase (Bollacker et al., 2008), and YAGO (Hoffart et al., 2013). Nevertheless, to keep up with continuously generated new information about entities (old and new), one has to find a way to bridge unstructured raw text and structured knowledge bases.

A key step to bridging this gap is to identify and ground named entities expressed in raw text to their corresponding entries in the KB. This task is fraught with the ambiguity and variability of natural language — the same string can refer to multiple entities (e.g., St. Mary's Church can refer to over 500 different churches around the world) and the same entity can be described in many ways (e.g., the Big Apple, NYC, and The Five Boroughs all refer to the entity New_York). Furthermore, grounding entities in multilingual text adds the difficulty of comparing contexts in different languages. Developing approaches to facilitate this task in multilingual texts is the focus of this and the next chapter.

In this chapter, I will discuss how we can leverage cross-lingual representations to share supervision between languages, to perform a downstream information extraction task, namely entity linking in the cross-lingual setting.

I will first briefly describe the entity linking task and its history, including a discussion of the challenges involved in the cross-lingual version of the task.

1>40% of the Web is non-English content (en.wikipedia.org/wiki/Internet_languages).

7.2. History of the Entity Linking Task

The task of entity linking involves grounding a span of text in a document to some background structured knowledge representation. More formally, given a mention m in a document D, entity linking (EL) involves linking m to its gold entity e* in a KB, K = {e1, ···, en}, for all mentions m ∈ D. For instance, for the following document containing the mentions Jay Cutler and Dolphin, and with the English Wikipedia as the target KB, EL involves grounding Jay Cutler to the Wikipedia article https://en.wikipedia.org/wiki/Jay_Cutler and Dolphin to https://en.wikipedia.org/wiki/Miami_Dolphins.

[Jay Cutler] passed his physical and is officially a [Dolphin].

Understanding raw text often requires appealing to background knowledge about entities. For instance, in the example above, we need to know that Dolphin is the nickname for a player of the Miami Dolphins football team, and not (say) the aquatic mammal.

By augmenting a piece of text with such background knowledge, EL helps a human reader access relevant factual knowledge to complement the information expressed in the text. This not only aids understanding, but also serves as a useful component in other NLP systems. For instance, EL has been used for building question answering systems (Khalid et al., 2008; Yih et al., 2015; Sun et al., 2015), incorporating knowledge into coreference resolution systems (Ratinov and Roth, 2012; Hajishirzi et al., 2013; Rahman and Ng, 2011) and extracting new facts for knowledge base population (Ji and Grishman, 2011).

Relation to Other NLP Tasks. Entity linking is closely related to (a) Cross-Document Coreference Resolution (CDCR) (Mayfield et al., 2009), which aims to resolve coreference relationships between entity mentions across documents, and (b) Word Sense Disambiguation (WSD) (Navigli, 2009), discussed earlier in Chapter 5. The relationship with CDCR can be seen when we consider the union of the KB K and the document D as a collection of documents — entity linking now reduces to resolving coreference relationships between entities across documents. The relation with WSD is a more obvious one; the object to be disambiguated in WSD is a word instead of an entity mention, and the KB is a sense inventory instead of an encyclopedic resource like Wikipedia. In principle, both EL and WSD deal with the ambiguity problem in natural language. A major difference between the tasks is that named entities are often ambiguous, whereas most words assume only a single sense (i.e., are monosemous) (Hachey et al., 2013). In this respect, EL is a harder disambiguation task than WSD.

7.2.1. Monolingual Entity Linking

Previous work on entity linking has predominantly examined the problem with respect to English documents and an English KB like Wikipedia or Freebase (Bunescu and Paşca, 2006; Mihalcea and Csomai, 2007; Ratinov et al., 2011; Hoffart et al., 2011; Cheng and Roth, 2013; Shen et al., 2015, inter alia). All such entity linking systems involve two steps, candidate generation and context-sensitive inference, which together deal with the ambiguity and variability of entities in text. I describe them below.

Candidate Generation

Considering all entities in the knowledge base as possible disambiguations for a mention m is impractical, as the KB K can contain millions of entities. Therefore, it is prudent to filter out irrelevant entities that the mention is unlikely to refer to. Candidate generation identifies a small set C(m) of plausible entities that a given mention m can link to. Given m, candidate generation outputs a list of candidate entities C(m) = {e1, e2, ···, eK} of size at most K, each associated with a prior probability Prprior(ei | m) indicating the probability of m referring to ei, given only m's surface form.

To do this, a dictionary mapping mention surfaces (strings) to the entities they can refer to is first compiled. For instance, such a dictionary will map chicago to the set of entities {Chicago, University_of_Chicago, Chicago_Cubs, ···}. One approach to compiling such a dictionary is to crawl a large collection of hyperlinked documents and compute the frequency with which an anchor text (or mention surface) links to a page in Wikipedia, as shown in Table 23. From the frequencies, one can estimate the conditional probability Prprior. For instance, if the string chicago appears hyperlinked 228690 times in the document collection, and links to Chicago and University_of_Chicago 62939 and 17186 times respectively, then the probability of a mention [chicago] referring to Chicago and University_of_Chicago is 0.275 and 0.075 respectively. In the literature, this collection of documents can either be hyperlinked text from the Web (Spitkovsky and Chang, 2012), or simply the articles in Wikipedia itself (Ratinov et al., 2011).

String    Entity (e)                   Prprior(e | m)   counts/total
chicago   Chicago                      0.275            62939/228690
chicago   University_of_Chicago        0.075            17186/228690
chicago   Chicago_Cubs                 0.045            10332/228690
chicago   Chicago_Tribune              0.041            9457/228690
···       Chicago_White_Sox            0.040            9184/228690
···       Chicago_Bears                0.039            9018/228690
···       Art_Institute_of_Chicago     0.015            3538/228690
···       Chicago_(band)               0.005            1225/228690
chicago   Loyola_University_Chicago    0.005            1057/228690

Table 23: Part of a dictionary compiled from Wikipedia hyperlinks showing the entities that the string chicago can refer to, along with the respective prior probabilities and counts.
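A minimal sketch of this dictionary compilation, assuming an iterable of (anchor text, target entity) pairs harvested from hyperlinks; all function names are illustrative.

    from collections import Counter, defaultdict

    def build_candidate_dictionary(anchor_pairs):
        """Compile surface -> [(entity, prior)] from (anchor text, entity)
        hyperlink pairs, estimating Pr_prior(e | m) by relative frequency."""
        counts = defaultdict(Counter)
        for surface, entity in anchor_pairs:
            counts[surface.lower()][entity] += 1
        dictionary = {}
        for surface, ctr in counts.items():
            total = sum(ctr.values())
            dictionary[surface] = [(e, c / total) for e, c in ctr.most_common()]
        return dictionary

    def candidates(dictionary, mention_surface, k=10):
        """C(m): the top-K candidate entities for a mention surface."""
        return dictionary.get(mention_surface.lower(), [])[:k]

For example, feeding in the hyperlink counts of Table 23 would yield candidates(d, "chicago", 2) = [(Chicago, 0.275), (University_of_Chicago, 0.075)].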

Context-Sensitive Inference

Having identified a small set of plausible entities, the context-sensitive inference step predicts some ê ∈ C(m) as the disambiguation for the mention m. For this, a compatibility score ϕ(m → e) of the mention m grounding to a candidate entity e is computed, using features extracted from the context of m and information about e expressed in the KB K. The compatibility score is trained using grounded mentions (i.e., mentions of entities that are already grounded) as supervision, such as the one shown in Figure 15. An inference problem is then solved to find the best assignment for m. The different approaches to context-sensitive inference vary in two respects: the choice of features in the compatibility score (hand-crafted or learnt) and the inference strategy (local or global).

Choice of Features in ϕ(m → e). The score ϕ(m → e) can be computed using hand-crafted or learnt features. Ratinov et al. (2011) use hand-crafted features comparing the content of entity e's page and mention m's context, e.g., the cosine similarity of the TF-IDF vectors of e's page and m's context. All these manually defined features are compiled into a feature vector f(e, m), such that ϕ(m → e) = w⊺f(e, m), where w is a learnt weight vector.2 On the other hand, approaches like Gupta et al. (2017) and Sil et al. (2018) learn feature representations e for the entity and g for the mention's context, so that ϕ(m → e) = e⊺g. The representations e and g are learnt end-to-end by optimizing a loss defined for the disambiguation task using back-propagation (Rumelhart et al., 1986).

Local Inference. Local inference resolves each mention m ∈ D in isolation, with the assignment to mention mi having no influence on the assignment to mention mj where i ≠ j.

$$\hat{e}_i = \arg\max_{e_i \in C(m_i)} \phi(m_i \to e_i) \quad \forall\, m_i \in D \tag{7.1}$$

Local inference is a popular approach (Mihalcea and Csomai, 2007; Durrett and Klein, 2014;

Lazic et al., 2015; Francis-Landau et al., 2016), owing to its simplicity.

Global Inference. The local inference approach does not take into account relationships between candidate entities of different mentions in the same document. For instance, if Steven_Gerrard is a candidate for mention mi ∈ D and Liverpool_F.C. is a candidate for mention mj ∈ D, then assigning mi → Steven_Gerrard, mj → Liverpool_F.C. is coherent because these two entities are related.3 This correlation between entities is expressed through a pairwise coherence score ψ(mi → ei, mj → ej) that is used along with the learnt compatibility score ϕ(m → e) to formulate a global inference problem,

$$(\hat{e}_1, \hat{e}_2, \cdots, \hat{e}_n) = \arg\max_{e_i \in C(m_i)} \sum_{i} \phi(m_i \to e_i) + \sum_{i \neq j} \psi(m_i \to e_i, m_j \to e_j) \tag{7.2}$$

2A non-linear kernel (e.g., polynomial kernel) can also be used.
3Steven_Gerrard captained Liverpool_F.C. from 2003-15.

Everton won against [Liverpool] in an FA Cup match.

Figure 15: Tamil and English mention contexts containing [mentions] of the entity Liverpool_F.C. from the respective Wikipedias. The English sentence is shown above; the Tamil sentence translates to "Suarez plays for [Liverpool] and Uruguay." The Tamil Wikipedia only has 9 mentions referring to Liverpool_F.C., whereas the English Wikipedia has 5303 such mentions. Clearly, there is a need to augment the limited contextual evidence in low-resource languages (like Tamil) with evidence from high-resource languages (like English).

This global inference problem is NP-hard (Cucerzan, 2007), so approximate inference approaches have been proposed that decompose the inference problem into smaller but tractable inference problems using a technique like message passing (Globerson et al., 2016; Ganea and Hofmann, 2017). The pairwise coherence score ψ itself can either be a hand-crafted relational score between entities imposed only at inference time (Cheng and Roth, 2013) or learnt end-to-end with the rest of the model (Ganea and Hofmann, 2017).
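To make the contrast between Equations (7.1) and (7.2) concrete, the following is a minimal illustrative sketch of local inference and of exhaustive global inference over a tiny document. The scoring functions phi and psi are assumed to be given (psi is simplified here to depend only on the entity pair), and the exhaustive enumeration is purely for exposition; real systems use the approximate inference discussed above.

```python
from itertools import product

def local_inference(mentions, candidates, phi):
    """Equation (7.1): resolve each mention independently."""
    return [max(candidates[m], key=lambda e: phi(m, e)) for m in mentions]

def global_inference(mentions, candidates, phi, psi):
    """Equation (7.2): score complete assignments jointly.
    Exhaustive enumeration; only feasible for very small documents."""
    best, best_score = None, float("-inf")
    for assignment in product(*(candidates[m] for m in mentions)):
        score = sum(phi(m, e) for m, e in zip(mentions, assignment))
        score += sum(psi(ei, ej)
                     for i, ei in enumerate(assignment)
                     for j, ej in enumerate(assignment) if i != j)
        if score > best_score:
            best, best_score = assignment, score
    return list(best)
```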

7.2.2. Cross-lingual Entity Linking

The task of Cross-lingual Entity Linking (XEL) (McNamee et al., 2011; Ji et al., 2015;

Tsai and Roth, 2016) involves grounding entity mentions written in any language (i.e., the document D containing the query mention m) to an English Knowledge Base (e.g., the

English Wikipedia). For instance, Figure 15 shows a Tamil (a language with >70 million

speakers) and an English mention (shown [enclosed]) and their mention contexts. XEL involves grounding the Tamil mention (which translates to ‘Liverpool’) to the football club

Liverpool_F.C., and not the city or the university. XEL enables knowledge acquisition directly from documents in any language, without resorting to machine translation.

Unlike monolingual entity linking for English, cross-lingual entity linking is relatively less explored. The cross-lingual nature of the problem introduces new challenges to the entity linking problem. Firstly, candidate generation has to operate with mentions whose surfaces are in a different writing script. For instance, in Figure 15, candidates have to be generated for the Tamil surface form (written in the Tamil script) that translates to Liverpool. Another challenge in the cross-lingual setting is to find supervision for the context-sensitive inference component. In this chapter, I will examine the latter challenge, and discuss an approach to address the former challenge in Chapter 8.

Existing approaches have taken two main directions to obtain supervision for learning XEL models — (a) using mention contexts appearing in the target language (McNamee et al.,

2011; Tsai and Roth, 2016), or (b) using mention contexts appearing only in English (Pan et al., 2017; Sil et al., 2018). We describe these directions and their limitations below.

Target Supervision Only. The first direction, exemplified by approaches like McNamee et al. (2011) and Tsai and Roth (2016), utilizes supervision available in the target language to train the XEL model. McNamee et al. (2011) use annotation projection via parallel corpora to generate supervision in the target language. They generate mention contexts in the target language by first running an entity linking system on the English side of a parallel corpus, and then projecting the resulting annotations via word alignments to the target language. Such annotation projection approaches are not scalable, as parallel data is expensive to obtain. On the other hand, Tsai and Roth (2016) use mention contexts from the target language Wikipedia to learn separate XEL models for each language. Both these approaches have scalability issues for languages with limited resources. Another limitation of these approaches is that they train separate models for each language, which is inefficient when working with multiple languages.

English Supervision Only. Other approaches, like those of Pan et al. (2017) and Sil et al. (2018), only use mention contexts from English. Pan et al. (2017) compute entity coherence statistics from English Wikipedia, and use other mentions in the document to jointly rank candidate entities in an unsupervised manner. Similarly, Sil et al. (2018) perform zero-shot XEL for Chinese and Spanish by using multilingual embeddings to transfer a pre-trained English EL model. Sil et al. (2018) use the term zero-shot in the sense that the model was not trained on the target language. However, intuitively, mention contexts in the target language are a valuable source of language-specific contextual information, and should also be used if available. Indeed, a recent study (Lewoniewski et al., 2017) found that for language-sensitive topics, the quality of information can be better in the relevant language version of Wikipedia than in the English version.

7.2.3. Our Work

As discussed earlier, training an EL model requires grounded mentions, i.e., mentions of entities that are grounded to a Knowledge Base (KB), as supervision (Figure 15). While millions of such mentions are available in English, by virtue of hyperlinks in the English

Wikipedia, this is not the case for most languages. This makes learning XEL models challenging, especially for languages with limited resources (e.g., the Tamil Wikipedia is only 1% of the English Wikipedia in size). To overcome this, it is desirable to augment the limited contextual evidence available in the language with evidence from high-resource languages like English.

In this chapter, I will describe Xelms (XEL with Multilingual Supervision) (Section 7.3), the first approach that fulfills the above desiderata by using multilingual supervision to train an XEL model. Xelms represents the mention contexts of the same entity from different languages in the same semantic space using a single context encoder (Section 7.3.1). Language-agnostic entity representations are jointly learned with the relevant mention context representations, so that an entity and its context share similar representations. Additionally, by encoding freely available structured knowledge, like fine-grained entity types, the entity and context representations can be further improved (Section 7.3.2). The architecture of Xelms is inspired by several monolingual entity linking systems (Francis-Landau et al., 2016; Nguyen et al., 2016b; Gupta et al., 2017), approaches that use type information to aid entity linking (Ling et al., 2015; Gupta et al., 2017; Raiman and Raiman, 2018), and the recent success of multilingual embeddings for several tasks (Ammar et al., 2016a; Duong et al., 2017b).

The ability to use multilingual supervision enables Xelms to learn XEL models for target languages with limited resources by exploiting freely available supervision from high-resource languages (like English). The experiments show that Xelms outperforms existing state-of-the-art approaches that only use target language supervision, across 3 benchmark datasets in 8 languages (Section 7.6.1). Moreover, while previous XEL models (McNamee et al., 2011; Tsai and Roth, 2016) train separate models for different languages, Xelms can train a single model for performing XEL in multiple languages (Section 7.6.2).

One of the goals of XEL is to enable understanding of languages with limited resources. To evaluate this, experiments are performed in two such settings. Experiments in the zero-shot setting (Section 7.7.1), where no supervision is available in the target language, show that the good performance of zero-shot XEL approaches (Sil et al., 2018) can be attributed to the use of prior probabilities. As described earlier (Section 7.2.1), these probabilities are computed from large amounts of grounded mentions, which are not available in realistic zero-shot settings. Experiments in the low-resource setting (Section 7.7.2), where some supervision is available in the target language, show that even when only a fraction of the available supervision in the target language is provided, Xelms can achieve competitive performance by exploiting supervision from English.

7.3. Cross-lingual EL with Xelms

An overview of Xelms is shown in Figure 16a. Xelms computes the conditional probability, Prcontext(e | m), of a mention m referring to entity e ∈ K using a mention context vector g ∈ R^h representing m's context, and an entity vector e ∈ R^h representing the entity e ∈ K (one vector per entity). Xelms can also incorporate structured knowledge like fine-grained entity types (Section 7.3.2) using a multi-task learning approach (Caruana, 1998), by learning a type vector t ∈ R^h for each possible type t (e.g., sports_team) associated with the entity e. The entity vector e, context vector g and the type vector t are jointly trained, and interact through appropriately defined pairwise loss terms — an Entity-Context loss (EC-Loss), a Type-Entity loss (TE-Loss) and a Type-Context loss (TC-Loss).

Figure 16: (a) Xelms uses grounded mentions from two or more languages (English and Tamil shown) as supervision. Mentions are [enclosed]. The context g, entity e and type t vectors interact through the Entity-Context loss (EC-Loss), Type-Context loss (TC-Loss) and Type-Entity loss (TE-Loss). The Tamil sentence is the same as in Figure 15, and other mentions in it translate to [Suarez] and [Uruguay]. (b) The Mention Context Encoder (Section 7.3.1) encodes the local context (neighboring words) and the document context (surfaces of other mentions in the document) of the mention into g. An internal view of the local context encoder is shown in Figure 17.

The mention context vector g is computed using a mention context encoder (Section 7.3.1), shown in Figure 16b. The encoder is a function of the mention context of mention m in a document D, which consists of: (a) neighboring words around the mention, which is referred to as its local context and, (b) surfaces of other mentions appearing in D, which is referred to as its document context.

Xelms is trained using grounded mentions in multiple languages (English and Tamil in Figure 16a) that are derived from Wikipedia (Section 7.5).

7.3.1. Mention Context Representation

To learn from mention contexts in multiple languages, mention context representations are generated using a language-agnostic mention context encoder. An overview of the mention context encoder is shown in Figure 16b. The components of the mention context encoder, namely multilingual word embeddings, a local context encoder and a document context encoder, are described below.

Multilingual Word Embeddings. Multilingual word embeddings jointly encode words in multiple (≥2) languages in the same vector space such that semantically similar words in the same language and translationally equivalent words in different languages are close per cosine similarity (Ammar et al., 2016b; Smith et al., 2017; Duong et al., 2017b). These embeddings generalize bilingual embeddings (such as those discussed in Chapter 3), which do the same for two languages only.

Xelms uses FastText (Bojanowski et al., 2017; Smith et al., 2017) multilingual embeddings, which align monolingual embeddings of multiple languages in the same space using a small bilingual dictionary (∼2500 pairs) from each language to English. Both the embeddings and the dictionary can be easily obtained for languages with limited resources.
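For intuition, the alignment step in this line of work (Smith et al., 2017) can be viewed as solving an orthogonal Procrustes problem over the bilingual dictionary pairs. The sketch below is my illustrative reconstruction of that idea, not the released FastText alignment code; the matrix names are assumptions.

```python
import numpy as np

def align_embeddings(X, Y):
    """X: (n, d) source-language vectors for the n dictionary pairs.
    Y: (n, d) corresponding English vectors.
    Returns an orthogonal map W minimizing ||X W - Y||_F,
    via the SVD of X^T Y (orthogonal Procrustes solution)."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# Applying W to every source-language embedding places it in the English
# space, where nearest neighbors approximate translation equivalents.
```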

The multilingual word embeddings for tokens {w1, w2, ···, wn} are denoted by w1:n = {w1, w2, ···, wn}, where each wi ∈ R^d.

Local Context Representation. The local context of a mention m, spanning tokens i to j, consists of the left context (tokens i − W to j) and the right context (tokens i to j + W). For example, for the mention [Liverpool] in Figure 16b, the left and right contexts are "Everton won against Liverpool" and "Liverpool in an FA Cup match" respectively. The local context encoder (Figure 17) encodes the left and the right contexts into vectors l ∈ R^h and r ∈ R^h respectively, using a convolutional neural network (CNN). These two vectors are then combined to generate the local context vector c ∈ R^h (Figure 16b).

The CNN convolves continuous spans of k tokens using a filter matrix F ∈ R^{kd×h} to project the concatenation (⊕ operator) of the token embeddings in the span. The resulting vector is passed through a ReLU unit4 to generate the convolutional output Oi.

4A ReLU activation unit (Nair and Hinton, 2010) is defined to be the function f(x) = max(0, x).

Figure 17: Local Context Encoder, shown for the right context. Figure 16b shows how it fits inside the Mention Context Encoder.

The outputs {Oi} are pooled by averaging (avg) to obtain the final encoding Enc(w1:n),

$$O_i = \mathrm{relu}\big(F^{\intercal}(w_i \oplus \cdots \oplus w_{i+k-1})\big) \tag{7.3}$$

$$\mathrm{Enc}(w_{1:n}) = \mathrm{avg}(O_1, \cdots, O_{n-k+1}) \tag{7.4}$$

Left and right context vectors l and r are computed using respective Enc(.) layers,

$$l = \mathrm{Enc}_{\mathrm{left}}(w_{i-W} \cdots w_j) \tag{7.5}$$

$$r = \mathrm{Enc}_{\mathrm{right}}(w_i \cdots w_{j+W}) \tag{7.6}$$

These vectors together generate the local context vector c = F_{2h,h}(l ⊕ r). Here F_{d_i,d_o} : v_i → v_o denotes a feed-forward layer that takes v_i ∈ R^{d_i} as input and outputs v_o ∈ R^{d_o}.
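The local context encoder described above can be sketched in PyTorch roughly as follows. The hyperparameter names follow the text (k, d, h), but the module structure is my own illustrative reconstruction, not the released Xelms code.

```python
import torch
import torch.nn as nn

class LocalContextEncoder(nn.Module):
    """Encodes a token window into a vector via CNN + ReLU + average
    pooling, mirroring Equations (7.3)-(7.4)."""
    def __init__(self, d=300, h=100, k=5):
        super().__init__()
        # Conv1d over the token axis plays the role of the filter F.
        self.conv = nn.Conv1d(in_channels=d, out_channels=h, kernel_size=k)

    def forward(self, w):             # w: (batch, seq_len, d)
        x = w.transpose(1, 2)         # (batch, d, seq_len)
        o = torch.relu(self.conv(x))  # (batch, h, seq_len - k + 1)
        return o.mean(dim=2)          # average pooling -> (batch, h)

class MentionLocalContext(nn.Module):
    """Combines left and right encodings into c = F_{2h,h}(l ⊕ r)."""
    def __init__(self, d=300, h=100, k=5):
        super().__init__()
        self.enc_left = LocalContextEncoder(d, h, k)
        self.enc_right = LocalContextEncoder(d, h, k)
        self.ff = nn.Linear(2 * h, h)

    def forward(self, left_tokens, right_tokens):
        l = self.enc_left(left_tokens)
        r = self.enc_right(right_tokens)
        return self.ff(torch.cat([l, r], dim=1))
```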

Document Context Representation. The presence of certain mentions in a document can help disambiguate other mentions. For example, "Suarez" and "Everton" in a document can help disambiguate "Liverpool". To incorporate this, the document context dm of a mention m appearing in document D is defined to be the bag of all other mentions in D. The document context dm is encoded into a dense document context vector d ∈ R^h by a feed-forward layer d = F_{|V|,h}(dm). Here V is the set containing all mention surfaces seen during training. When training jointly over multiple languages, V consists of mention surfaces seen in all languages (e.g., all English and Tamil mention surfaces) during training. This enables parameter sharing by embedding mention surfaces in different languages in the same low-dimensional space.

The local and document context vectors c and d are combined to get the mention context vector g using a feed-forward layer g = F_{2h,h}(c ⊕ d).

Context Conditional Probability. The probability of a mention m linking to entity e is computed using its context vector g and the entity vector e,

$$\Pr_{\mathrm{context}}(e \mid m) = \frac{\exp(g^{\intercal}e)}{\sum_{e' \in C(m)} \exp(g^{\intercal}e')} \tag{7.7}$$

where C(m) denotes the candidate entities of the mention m (Section 7.4.1 explains how C(m) is generated). The negative log-likelihood of Prcontext(e | m) is minimized with respect to the gold entity e∗ against the candidate entities C(m). This loss term is called the Entity-Context loss (EC-Loss) in the rest of the chapter,

$$\text{EC-Loss} = -\log \frac{\Pr_{\mathrm{context}}(e^{*} \mid m)}{\sum_{e' \in C(m)} \Pr_{\mathrm{context}}(e' \mid m)} \tag{7.8}$$
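Since the normalization in Equation (7.7) already runs over the candidate set, the EC-Loss reduces to a cross-entropy over candidate scores. A minimal sketch, assuming the mention encoder above and an entity embedding table (the function name and argument layout are illustrative):

```python
import torch
import torch.nn.functional as F

def ec_loss(g, candidate_entity_vectors, gold_index):
    """g: (h,) mention context vector.
    candidate_entity_vectors: (K, h) vectors for e in C(m).
    gold_index: position of the gold entity e* among the candidates."""
    scores = candidate_entity_vectors @ g     # (K,) dot products g^T e
    log_probs = F.log_softmax(scores, dim=0)  # Equation (7.7), in log space
    return -log_probs[gold_index]             # Equation (7.8)
```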

7.3.2. Including Type Information

Incorporating the fine-grained types of a mention m can help rank entities of the appropriate type higher than others (Ling et al., 2015; Gupta et al., 2017; Raiman and Raiman, 2018). For instance, knowing that sports_team is the type of [Liverpool], and constraining linking to entities with the relevant type, encourages disambiguation to the correct entity.

To make the mention context representation g type-aware, the model predicts the set of fine-grained types of m, T(m) = {t1, ..., t_{|T(m)|}}, using g. Each ti belongs to a pre-defined type vocabulary Γ.5 The probability of a type t belonging to T(m) given the mention context is defined as Pr(t | m) = σ(t⊺g), where σ is the sigmoid function and t is the learnable embedding for type t. For this task, a Type-Context loss (TC-Loss) can be defined as,

$$\text{TC-Loss} = \mathrm{BCE}(T(m), \Pr(t \mid m)) \tag{7.9}$$

where BCE is the Binary Cross-Entropy loss,

$$-\sum_{t \in T(m)} \log \Pr(t \mid m) \;-\; \sum_{t \notin T(m)} \log(1 - \Pr(t \mid m)) \tag{7.10}$$

5The type vocabulary contains 112 fine-grained types from Ling and Weld (2012) (i.e., |Γ| = 112).

The entity-type information is also incorporated in the entity representations, by defining a similar Type-Entity loss (TE-Loss).

To identify the gold types T(m) of a mention m, the model makes the distant supervision assumption (same as Ling et al. (2015)) and assigns the types of the gold entity e∗ to be the types of the mention. Gold fine-grained types of the entities can be acquired from resources like Freebase (Bollacker et al., 2008) or YAGO (Hoffart et al., 2013).
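The multi-label type prediction in Equations (7.9)-(7.10) corresponds to an independent sigmoid per type. A minimal sketch (the type embedding table and names are illustrative):

```python
import torch
import torch.nn.functional as F

def tc_loss(g, type_embeddings, gold_type_mask):
    """g: (h,) mention context vector.
    type_embeddings: (|Gamma|, h) learnable matrix, one row per type t.
    gold_type_mask: (|Gamma|,) floats, 1.0 for t in T(m), else 0.0."""
    logits = type_embeddings @ g  # t^T g for every type in the vocabulary
    # Sigmoid + binary cross-entropy, summed over types = Equation (7.10)
    return F.binary_cross_entropy_with_logits(
        logits, gold_type_mask, reduction="sum")
```

An analogous function over entity vectors instead of context vectors gives the TE-Loss mentioned above.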

7.4. Training and Inference

This section explains how Xelms generates candidate entities, performs inference, and combines the different training losses.

7.4.1. Candidate Generation

Xelms adopts Tsai and Roth (2016)'s candidate generation strategy with some minor modifications. Prprior(e | m) is computed using a surface-title index, by counting how often the surface of m links to entity e in Wikipedia hyperlinks. Redirect title surfaces are also treated as hyperlink surfaces and added to the surface-to-link counts. To generate candidates for mentions in another language (e.g., Chinese), candidates are first generated, as described above, in the Wikipedia of that language, and then mapped to English titles using inter-language links (Figure 18). To improve recall, counts of a version of the hyperlink surface with its non-ASCII characters replaced (e.g., "Algebre" vs "Algèbre") are also maintained. For Chinese, counts are also kept for a version of the string converted to simplified Chinese. For foreign languages with a Latin script (e.g., Spanish, Turkish), a Backoff-to-English strategy is used — if querying the target language surface-to-link index for the mention surface does not generate any candidates, the English surface-to-link index is queried. At most K=20 candidates are generated for each mention.

Figure 18: Candidate generation using inter-language links in Wikipedia. Candidates are first generated over the target language Wikipedia (Tamil shown) and mapped via inter-language links to English titles, e.g., Liverpool_(City) (0.40), Liverpool_F.C. (0.30), University_of_Liverpool (0.25) and Liverpool_Cathedral (0.05). Prior probabilities are computed from surface-to-title index statistics computed over the target language Wikipedia.
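A minimal sketch of this candidate generation procedure, with the surface-to-link indexes modeled as plain dictionaries (all names here are illustrative, not from the released system):

```python
def generate_candidates(surface, target_index, english_index,
                        interlanguage_links, k=20):
    """target_index / english_index: dict surface -> {title: count}.
    interlanguage_links: dict mapping target-language titles to English titles.
    Returns up to k (english_title, prior) pairs."""
    entry = target_index.get(surface)
    if not entry:
        # Backoff-to-English when the target index yields nothing
        mapped = dict(english_index.get(surface, {}))
    else:
        # Map target-language titles to English via inter-language links;
        # titles without a link are (unavoidably) dropped here.
        mapped = {}
        for title, count in entry.items():
            en_title = interlanguage_links.get(title)
            if en_title is not None:
                mapped[en_title] = mapped.get(en_title, 0) + count

    total = sum(mapped.values())
    ranked = sorted(mapped.items(), key=lambda x: -x[1])[:k]
    return [(title, count / total) for title, count in ranked]
```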

The above candidate generation approach works well for the languages studied in this chapter. In Chapter 8, I will describe a limitation of this approach and an improved candidate generation strategy that uses transliteration.

7.4.2. Inference

The context conditional entity probability Prcontext(e | m) computed in Equation (7.7) and the prior probability Prprior(e | m) are combined by taking their union:

$$\Pr_{\mathrm{model}}(e \mid m) = \Pr_{\mathrm{prior}}(e \mid m) + \Pr_{\mathrm{context}}(e \mid m) - \Pr_{\mathrm{prior}}(e \mid m) \times \Pr_{\mathrm{context}}(e \mid m) \tag{7.11}$$

Inference for the mention m picks the entity,

$$\hat{e} = \arg\max_{e \in C(m)} \Pr_{\mathrm{model}}(e \mid m) \tag{7.12}$$
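Equation (7.11) is the inclusion-exclusion (noisy-OR) combination of the two probability estimates. A one-function sketch of the inference step (names are illustrative):

```python
def link_mention(candidates, prior, context_prob):
    """candidates: list of entities in C(m).
    prior, context_prob: dicts entity -> probability for this mention.
    Implements Equations (7.11)-(7.12)."""
    def pr_model(e):
        p, q = prior.get(e, 0.0), context_prob.get(e, 0.0)
        return p + q - p * q  # union of the two probability estimates
    return max(candidates, key=pr_model)
```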

7.4.3. Training Objective

When only training the mention context encoder and entity vectors, the EC-Loss averaged over all training mentions is minimized. When using the two type-aware losses, a weighted sum of EC-Loss, TE-Loss, and TC-Loss is minimized, using the weighting scheme of Kendall et al. (2018),

$$\frac{\text{EC-Loss}}{2\lambda_{EC}^{2}} + \frac{\text{TE-Loss}}{2\lambda_{TE}^{2}} + \frac{\text{TC-Loss}}{2\lambda_{TC}^{2}} + \log \lambda_{EC}^{2} + \log \lambda_{TE}^{2} + \log \lambda_{TC}^{2} \tag{7.13}$$

Here λi are learnable scalar weighting parameters, and the respective 1/(2λi²) and log λi² terms ensure that λi does not grow unboundedly. This way, the model learns the relative weight for each loss term.
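A sketch of this uncertainty-based loss weighting in PyTorch; parameterizing log λ² for numerical stability is my own choice here, not necessarily what the original implementation does.

```python
import torch
import torch.nn as nn

class UncertaintyWeightedLoss(nn.Module):
    """Combines task losses as in Equation (7.13) / Kendall et al. (2018)."""
    def __init__(self, num_tasks=3):
        super().__init__()
        # s_i = log(lambda_i^2), learned jointly with the rest of the model
        self.log_var = nn.Parameter(torch.zeros(num_tasks))

    def forward(self, losses):  # losses: list of scalar task losses
        total = 0.0
        for loss, s in zip(losses, self.log_var):
            # loss / (2 * lambda^2) + log(lambda^2)
            total = total + loss / (2 * torch.exp(s)) + s
        return total
```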

During training, mentions from different languages are mixed using an inverse-ratio mini-batch mixing strategy. That is, if two languages have training data sizes proportional to α : β, then at any time during training, the mini-batches seen from them are in the ratio 1/α : 1/β. This strategy prevents languages with more training data from overwhelming languages with less training data. Though simple, we found this strategy yielded good results.
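For illustration, inverse-ratio mixing can be implemented by sampling the language of each mini-batch with probability inversely proportional to its data size; the sketch below is one plausible realization of the strategy, not the released training loop.

```python
import random

def language_schedule(data_sizes, num_batches):
    """data_sizes: dict lang -> number of training mentions.
    Yields a language for each mini-batch, sampled with probability
    proportional to 1 / data_size (inverse-ratio mixing)."""
    langs = list(data_sizes)
    weights = [1.0 / data_sizes[lang] for lang in langs]
    total = sum(weights)
    probs = [w / total for w in weights]
    for _ in range(num_batches):
        yield random.choices(langs, weights=probs, k=1)[0]

# e.g., with sizes en: 51.7M and ta: 473k (Table 24), Tamil batches are
# sampled roughly 109x more often than English batches.
```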

7.5. Experimental Setup

This section describes the training and evaluation datasets, and the previous XEL approaches from the literature used in the experiments for comparison.

Training Mentions. Following previous work, hyperlinks from Wikipedia (dumps dated 05/20/2017) are used as the source of grounded mentions for supervision. As described earlier (Section 4.3), Wikipedias in different languages contain articles describing the same entity, which can be resolved by using inter-language links. For instance, the article लिवरपूल in the Hindi Wikipedia resolves to Liverpool in English. Training mention statistics are shown in Table 24.
Lang.          # Train Mentions   Size Relative to English Mentions
German (de)    22.6M              43.7%
Spanish (es)   13.8M              26.7%
French (fr)    16.2M              31.3%
Italian (it)   11.5M              22.2%
Chinese (zh)   5.9M               11.4%
Arabic (ar)    3.1M               6.0%
Turkish (tr)   1.8M               3.5%
Tamil (ta)     473k               0.9%

Table 24: Number of train mentions (from Wikipedia) in each language, with % size relative to English (51.7M mentions). Train mentions from Wikipedias like Arabic, Turkish and Tamil are <10% the size of those from the English Wikipedia.

The evaluation spans 8 languages — German (de), Spanish (es), Italian (it), French (fr), Chinese (zh), Arabic (ar), Turkish (tr) and Tamil (ta), each of which has a varying amount of grounded mentions from the respective Wikipedia (Table 24). Note that our method is applicable to any of the 293 Wikipedia languages as a target language.

Evaluation Datasets. Xelms is evaluated on the following benchmark datasets, spanning 8 different languages, thus providing an extensive evaluation.

(a) McN-Test Dataset from (McNamee et al., 2011). The test set was collected by using parallel document collections, and then crowd-sourcing the ground truths. All the test mentions in this dataset consist of person names only.

(b) TH-Test A subset of the dataset used in (Tsai and Roth, 2016), derived from Wikipedia.6 The mentions in the dataset fall in two categories — easy and hard, where hard mentions are those for which the most likely candidate according to the prior probability (i.e., arg max Prprior(e | m)) is not the correct title. Indeed, most Wikipedia mentions can be correctly linked by selecting the most likely candidate (Ratinov et al., 2011). All the hard mentions from Tsai and Roth (2016)'s test splits, henceforth called TH-Test, are used for each language.

6Pan et al. (2017) also created a dataset using Wikipedia, but did not categorize mentions like Tsai and Roth (2016). Preliminary experiments on their dataset showed Xelms consistently beat Pan et al. (2017)'s model. TH-Test was chosen for more controlled experiments.

(c) TAC15-Test Dataset from the TAC-KBP 2015 Trilingual Entity Linking Track (Ji et al., 2015) for Chinese and Spanish. It contains 84 news and 82 discussion forum articles in Chinese (total 166 documents) and 84 news and 83 discussion forum articles in Spanish (total 167 documents).

All models are evaluated using linking accuracy on gold mentions, which is the fraction of test mentions that are linked to the correct entity. It is assumed gold mentions are provided at test time, following common practice (Tsai and Roth, 2016; Ganea and Hofmann, 2017;

Gupta et al., 2017). Table 25 shows the source domains of the evaluation datasets.

Dataset      Lang.                            Source
TH-Test      de, es, fr, it, zh, ar, tr, ta   Wikipedia
McN-Test     de, es, fr, it, zh, ar, tr       News, Parliament Proceedings
TAC15-Test   es, zh                           News, Discussion Forums

Table 25: Evaluation datasets used in the cross-lingual entity linking experiments.

Implementation and Tuning. All models were implemented using PyTorch.7 The ADAM (Kingma and Ba, 2014) optimizer was used with a learning rate of 1e-3 in all experiments. The candidate generator was limited to output the top-20 candidates for all experiments. The local context window was set to W = 25 tokens. The convolutional filter width was set to k = 5. The mention surface vocabulary V was limited to size 1M for both monolingual and joint training. The multilingual embeddings (d=300) were scaled to a fixed norm R (=5.0), and were not updated during training. Dropout (Srivastava et al., 2014) was separately applied to the local context and document context features, each being tuned over {0.4, 0.45, ···, 0.7}. The size of entity, type and context vectors was fixed to h = 100. Batch size was tuned over {128, 256, 512, 1024}.

7github.com/pytorch

The Wikipedia dumps were parsed using the WikiExtractor script.8 The Stanford segmenter was used for Arabic (Monroe et al., 2014) and Chinese (Tseng et al., 2005) segmentation.

8github.com/attardi/wikiextractor

Any dataset-specific tuning is avoided by tuning on a development set and applying the same parameters across all datasets. All tunable parameters were tuned on a development set containing the hard mentions from the train split released by Tsai and Roth (2016).

Comparative Approaches. We compare against the following state-of-the-art (SoTA) approaches, described below with the language from which they use mention contexts given in parentheses,

(a) Tsai and Roth (2016) (Target Only) trains a separate model for each language using mention contexts from the target language Wikipedia only. Current SoTA on TH-Test.

(b) Pan et al. (2017) (English Only) uses entity coherence statistics from English Wikipedia and the document context of a mention for XEL. Current SoTA on McN-Test, except for Italian and Turkish, for which it's McNamee et al. (2011).

(c) Sil et al. (2018) (English Only) uses multilingual embeddings to transfer a pre-trained English entity linking model to perform XEL for Spanish and Chinese. Prior probabilities Prprior are used as a feature. Current SoTA on TAC15-Test.

7.6. Experiments

The experiments in this section aim to evaluate: (a) whether Xelms can train a better entity linking model for a target language by exploiting additional data from a high-resource language like English (Section 7.6.1); (b) how a single XEL model (trained using Xelms) for multiple related languages compares to individually trained models for each language (Section 7.6.2); and (c) whether adding additional type information through a multi-tasking loss to Xelms improves performance (Section 7.6.3).

In all experiments, the linking accuracy of Xelms is reported, averaged over 5 different runs; ∗ marks the statistical significance (p < 0.01) of the best result (shown in bold) against the state-of-the-art (SoTA), using Student's one-sample t-test.

7.6.1. Monolingual and Joint Models

The first experiment compares Xelms(mono), which uses monolingual supervision in the target language only, and Xelms(joint), which uses supervision from English in addition to the monolingual supervision, against the state-of-the-art (SoTA) approaches. The results are in Tables 26 and 27 respectively.

It can be seen that Xelms(mono) achieves similar or better scores than the respective SoTA on all datasets. The SoTA for McN-Test in Turkish and Chinese enhances the model by using transliteration for candidate generation, explaining its superior performance.

Xelms(joint) performs substantially better than Xelms(mono) on all datasets, proving that using additional supervision from a high-resource language like English leads to better linking performance. In particular, Xelms(joint) outperforms the SoTA on all languages in TH-Test, on Spanish in TAC15-Test, and on 4 of the 7 languages in McN-Test.

7.6.2. Multilingual Training

Xelms is the first approach that can train a single XEL model for multiple languages. To demonstrate this capability, a model, henceforth referred to as Xelms(multi), is trained jointly on 5 related languages — Spanish, German, French, Italian and English. Xelms(multi) is compared to the respective Xelms(joint) model for each language. The results are in Tables 27 and 28.

The results show that Xelms(multi) is better than (or at par with) Xelms(joint) on all datasets. This suggests that Xelms(multi) can make more efficient use of available supervision in related languages than previous approaches that trained separate models per language.
Dataset →          TH-Test                   McN-Test
                   Xelms                     Xelms
Lang. ↓    SoTA    mono    joint     SoTA    mono    joint
de         53.3    53.7    55.6∗     89.7    90.9    91.5
es         54.5    54.9    56.6∗     91.5    91.2    91.4
fr         47.5    48.5    49.9∗     92.1    92.6    92.7
it         48.3    48.4    51.9∗     85.9    87.0    87.8∗
zh         57.6    58.1    61.3∗     91.2∗   87.4    88.2
ar         62.1    62.6    63.8∗     80.2    80.3    83.1∗
tr         60.2    61.0    61.7∗     95.3∗   91.0    91.9
ta         54.1    54.7    59.7∗     n/a     n/a     n/a
avg.       54.7    55.2    57.6      89.4    88.6    89.5

Table 26: Xelms(joint) improves upon Xelms(mono) and the current State-of-The-Art (SoTA) on TH-Test and McN-Test, showing the benefit of using additional supervision from English. The best score is shown in bold and ∗ marks the statistical significance of the best result against the SoTA. Refer to Section 7.5 for details on the SoTA.

7.6.3. Adding Fine-grained Type Information

In this experiment, the effect of adding fine-grained type information is studied by comparing Xelms(mono) and Xelms(joint) to Xelms(mono+type) and Xelms(joint+type) respectively, which are versions of Xelms(mono) and Xelms(joint) trained with the type-aware losses. The results are in Tables 29 and 27.

Xelms(mono+type) and Xelms(joint+type) both improve compared to Xelms(mono) and Xelms(joint) on McN-Test and TH-Test (compare Table 29 to Table 26), showing the benefit of using structured knowledge in the form of fine-grained types. Similar trends are also seen on TAC15-Test (Table 27), where Xelms(joint+type) improves on the SoTA.

7.7. Experiments with Limited Resources

One of the motivations of Xelms is to exploit supervision from high-resource languages like

English to aid XEL for languages with limited resources. Two such scenarios are examined in this section,
Lang. →                      es      zh
Model ↓
(Tsai and Roth, 2016)        82.4    85.1
(Sil et al., 2018) (SoTA)    83.9    85.9
Xelms mono                   83.3    84.4
Xelms mono+type              83.5    84.8
Xelms joint                  84.1    85.5
Xelms joint+type             84.4∗   86.0
Xelms multi                  83.9    n/a
Xelms multi+type             84.4∗   n/a

Table 27: Linking accuracy on TAC15-Test. Numbers for Sil et al. (2018) from personal communication.
Dataset →          TH-Test                   McN-Test
                   Xelms                     Xelms
Lang. ↓    SoTA    joint   multi     SoTA    joint   multi
de         53.3    55.6∗   55.2      89.7    91.5    91.4
es         54.5    56.6    56.8∗     91.5    91.4    91.4
fr         47.5    49.9    51.0∗     92.1    92.7    92.6
it         48.3    51.9    52.3∗     85.9    87.8    87.9∗
avg.       50.9    53.5    53.8      89.8    90.8    90.8

Table 28: Linking accuracy of a single Xelms(multi) model for four languages — German, Spanish, French and Italian. For comparison, individually trained Xelms(joint) scores are also shown. The best score is shown in bold and ∗ marks the statistical significance of the best result against the SoTA. Refer to Section 7.5 for details on the SoTA.

(a) Zero-shot setting, that is, no supervision is available in the target language. Analysis in Section 7.7.1 reveals the limitations of zero-shot XEL approaches and finds that prior probabilities, which are unavailable in realistic zero-shot scenarios, play an important role in achieving good performance.

(b) Low-resource setting, that is, some supervision is available in the target language. Section 7.7.2 shows that by combining supervision from a high-resource language, Xelms can achieve competitive performance with a fraction of the available supervision in the target language.
Dataset →          TH-Test                         McN-Test
                   Xelms                           Xelms
Lang. ↓    SoTA    mono+type   joint+type   SoTA    mono+type   joint+type
de         53.3    54.0        55.9∗        89.7    91.2        91.5
es         54.5    55.1        57.2∗        91.5    91.0        91.2
fr         47.5    49.0        50.6∗        92.1    92.6        92.7
it         48.3    49.2        52.2∗        85.9    87.4        87.9∗
zh         57.6    58.9        61.5∗        91.2∗   87.6        88.4
ar         62.1    63.0        64.0∗        80.2    81.1        84.0∗
tr         60.2    61.5        62.0∗        95.3∗   91.2        92.1
ta         54.1    56.0        59.9∗        n/a     n/a         n/a
avg.       54.7    55.8        57.9         89.4    88.9        89.7

Table 29: Adding fine-grained type information further improves linking accuracy (compare to Table 26). The best score is shown in bold and ∗ marks the statistical significance of the best result against the SoTA. Refer to Section 7.5 for details on the SoTA.

7.7.1. Zero-shot Setting

We first explain how Xelms can perform zero-shot XEL, the implications of our zero-shot setting, and how it is more realistic than previous work.

Zero-shot XEL with Xelms. Xelms performs zero-shot XEL by training a model using English supervision and the multilingual embeddings for English, and directly applying it to the test data in another language using the respective multilingual word embeddings instead of the English embeddings.

No Prior Probabilities. Prior probabilities (or the prior), i.e., Prprior, have been shown to be a reliable indicator of the correct disambiguation in entity linking (Ratinov et al., 2011; Tsai and Roth, 2016). These probabilities are estimated from counts over the training mentions in the target language. In the absence of training data for the target language, as in the zero-shot setting, these prior probabilities are not available to an XEL model.
Comparison to Previous Work. The only other model capable of zero-shot XEL is that of Sil et al. (2018). However, Sil et al. (2018) use prior probabilities and coreference chains for the target language in their zero-shot experiments, both of which will not be available in a realistic zero-shot scenario. Compared to Sil et al. (2018), we evaluate the performance of zero-shot XEL in a more realistic setting, and show it is adversely affected by the absence of prior probabilities.

Dataset →                TAC15-Test        TH-Test   McN-Test
Approach ↓               (es)     (zh)     (avg)     (avg)
Xelms (Z-S w/ prior)     80.3     83.9     43.5      88.1
Xelms (Z-S w/o prior)    53.5     55.9     41.1      86.0
SoTA                     83.9     85.9     54.7      89.4

Table 30: Linking accuracy of the zero-shot (Z-S) approach on different datasets. Zero-shot (w/ prior) is close to SoTA for datasets like TAC15-Test, but performance drops in the more realistic setting of zero-shot (w/o prior) (Section 7.7.1) on all datasets, indicating most of the performance can be attributed to the presence of prior probabilities. The slight drop in McN-Test is due to trivial mentions that only have a single candidate.

Is zero-shot XEL really effective? To evaluate the effectiveness of the zero-shot XEL approach, we perform zero-shot XEL using Xelms on all datasets. Table 30 shows zero-shot XEL results on all datasets, both with and without using the prior during inference. Note that zero-shot XEL (with prior) is close to the SoTA (Sil et al. (2018)) on TAC15-Test, which also uses the prior for zero-shot XEL. However, for zero-shot XEL (without prior), performance drops by more than 20% for TAC15-Test, 2.4% for TH-Test and by 2.1% for McN-Test. This drop indicates that zero-shot XEL is not effective in a realistic zero-shot setting (i.e., when the prior is unavailable for inference).

The prior is indeed a strong indicator of the correct disambiguation. For instance, simply selecting the most likely candidate using the prior achieved 77.2% and 78.8% for Spanish and Chinese respectively on the TAC15-Test dataset. Note that both versions of zero-shot XEL (with and without prior) perform worse than the best possible model on TH-Test, because TH-Test was constructed to ensure prior probabilities are not strong indicators (Tsai and Roth, 2016). On McN-Test, an average of 75.9% of mentions have only one (the correct) candidate, making them trivial to link, regardless of the absence of priors.

Figure 19: Linking accuracy vs. the number of train mentions in the target language L (= Turkish (tr), Chinese (zh) and Spanish (es)). We compare both Xelms(mono) and Xelms(joint) to the best results using all available supervision, denoted by L-best. To discount the effect of the prior, all results above are without it. For number of train mentions = 0, Xelms(joint) is equivalent to zero-shot without prior. Best viewed in color.

The results show that most of the XEL performance in zero-shot settings can be attributed to the availability of prior probabilities for the candidates. It is evident that zero-shot XEL in a realistic setting (i.e., when prior probabilities are not available) is still challenging.

7.7.2. Low-resource Setting

This set of experiments analyzes the behavior of Xelms in a low-resource setting, i.e., when some supervision is available in the target language. The aim of this setting is to estimate how much supervision from the target language is needed to get reasonable performance when using it jointly with supervision from English.

For this experiment, a Xelms(joint) model is trained by gradually increasing the number of mention contexts for the target language L (= Spanish, Chinese and Turkish) that are available for supervision. Figure 19 plots the results on the TH-Test dataset. Figure 19 also shows the best results achieved using all available target language supervision (denoted by L-best). For comparison with the monolingually supervised model, the performance of Xelms(mono), which only uses the target language supervision, is also plotted. To discount the effect of prior probabilities, all results reported here are without the prior.

Figure 19 shows that after training on 0.75M mentions from Turkish and Chinese (and 1.0M mentions from Spanish), the Xelms(joint) model is within 2-3% of the respective L-best model that uses all training mentions in the target language. This indicates that Xelms(joint) can reach competitive performance even with a fraction of the full target language supervision. For comparison, a Xelms(mono) model that was trained on the same number of training mentions is 5-10% behind the respective Xelms(joint) model, showing better utilization of target language supervision by Xelms(joint).

7.8. Summary

In this chapter, I described an approach that can combine supervision from multiple languages to train an XEL model. Key to this approach were multilingual embeddings, which formed the feature space upon which language-agnostic mention representations were built for the XEL task. The results in Section 7.6 illustrated the benefits of this shared representation space through extensive evaluation on different benchmarks. Another benefit of our approach is that it can train a single model for multiple related languages, making more efficient use of available supervision than previous approaches that trained separate models.

Further improvements for XEL can be made using a joint inference framework that enforces coherent predictions (Cheng and Roth, 2013; Globerson et al., 2016; Ganea and Hofmann, 2017), or by combining supervision between languages in a more principled way (e.g., through the choice of related languages). Techniques similar to ours can be applied to other information extraction tasks, like relation extraction, to extend them to multilingual settings. An assumption made in this chapter was that the entity mentions are already identified for the EL system. That is, the EL system is only responsible for disambiguation. This assumption may lead to unrecoverable errors due to incorrectly detected mention spans. Several monolingual entity linking systems have shown that this issue can be remedied by performing these two steps jointly (Sil and Yates, 2013; Nguyen et al., 2016a; Kolitsas et al., 2018). Extending cross-lingual entity linking systems in a similar manner, such that they can directly operate on raw text by jointly detecting and disambiguating the mentions, is also a useful direction to pursue. Another key challenge for all XEL approaches is the task of candidate generation, which is currently limited by the existence of a target language Wikipedia. I explore this problem in detail in the next chapter.

CHAPTER 8 : Bootstrapping Transliteration with Constrained Discovery

8.1. Introduction

The performance of an entity linking system is a function of the quality of candidate generation, as poor candidate generation means that the correct candidate might not be in the list of candidates. The previous chapter discussed how one of the challenges of the cross-lingual entity linking problem is generating candidates for mentions in a language that has a different writing script than English. The candidate generation approach in the previous chapter dealt with this challenge by using inter-language links between the target language Wikipedia and the English Wikipedia. However, this approach has an obvious coverage issue when the inter-language link does not exist between some articles.1

Figure 20: Limitation of the candidate generation approach described in Figure 18. For the Russian mention [Берлин] (which transliterates to Berlin), the inter-language links map the Russian Wikipedia titles Берлин_(город) and Берлин_(группа) to Berlin_(City) and Berlin_(Band), but miss two plausible candidate entities in the English Wikipedia, Berlin_(comic) and Berlin_(carriage).

For example, any approach that relies on Wikipedia inter-language links will not generate all plausible candidates for the Russian mention in Figure 20 (which transliterates to Berlin), because of missing inter-language links for entities like Berlin_(comic) and Berlin_(carriage). This problem can be exacerbated for low-resource languages, for which the corresponding Wikipedia contains few articles (and thus few inter-language links). Figure 20 also shows how this issue can be avoided by first transliterating the mention surface (i.e., rewriting the string in a different writing script) into English.

1This can happen either because the corresponding article does not exist in the target language Wikipedia or because the article exists, but the inter-language link was missing in the available meta-data.

Existing approaches for generating transliterations require thousands of name pairs (i.e., a source name and its transliteration) as supervision, which may not be suitable for low-resource scenarios. How can one learn to transliterate when only a few hundred name pairs are available as supervision? This chapter proposes a new bootstrapping algorithm to iteratively improve a weak transliteration generation model trained on a few hundred name pairs.

The approach presented in this chapter marries two threads of work in the literature — transliteration generation and transliteration discovery, which I describe below.

8.2. History of the Transliteration Task

Transliteration is the process of transducing a name from one writing system to another (e.g., ओबामा in Devanagari to Obama in Latin script) while preserving its pronunciation (Knight and Graehl, 1998; Karimi et al., 2011).

Historically, the transliteration task emerged as a component of machine translation systems, to handle the translation of names and technical terms across languages. Such lexical items will have poor coverage in a bilingual dictionary or phrase translation table, thus making it necessary to replace them with their (approximate) phonetic equivalent (Knight and Graehl, 1998). However, transliteration has utility beyond its role as a subroutine in machine translation. For instance, transliterating names from foreign languages to English2 helps in multilingual knowledge acquisition tasks like named entity recognition (Darwish, 2013) and information retrieval (Virga and Khudanpur, 2003; Jaleel and Larkey, 2003).

Two tasks feature prominently in the transliteration literature: generation (Knight and Graehl, 1998), which involves producing an appropriate transliteration for a given word in an open-ended way, and discovery (Sproat et al., 2006; Klementiev and Roth, 2008), which involves selecting an appropriate transliteration for a word from a list of candidates. This section briefly reviews the limitations of existing generation and discovery approaches, and provides an overview of how our work addresses them.

2Also referred to as back-transliteration.

Figure 21: Generation (a) treats transliteration as a sequence transduction task, generating a transliteration for a word in an open-ended way, while discovery (b) treats it as a ranking task, selecting the transliteration for a word from a (long) list of names.

8.2.1. Transliteration Generation

Generation aims to transliterate an input word x = (x1, x2, ··· , xn) in the source writing script without relying on a translation lexicon, by generating a sequence of characters y = (y1, y2, ··· , ym) in the target writing script (Haizhou et al., 2004; Jiampojamarn et al., 2009; Ravi and Knight, 2009; Jiampojamarn et al., 2010; Finch et al., 2015, inter alia). Figure 21a shows how a generation model processes the Hindi string ब्रसेल्स.

Transliteration generation is often formulated as a machine translation (MT) problem by treating the characters xi in the input word as “words”, the input word x as the “sentence”, and the target output y as the “translation” (Virga and Khudanpur, 2003; Irvine et al., 2010). This allows one to use available MT toolkits such as Moses (Koehn et al., 2007) or Joshua (Li et al., 2009) for training a transliteration system. However, these approaches may not exploit structure that is specific to the transliteration task.

Other approaches treat generation as a sequence labeling problem (Ganesh et al., 2008; Reddy and Waxmonsky, 2009; Ammar et al., 2012). Each character xi in the input is assigned a label ti = (yp, ··· , yq), which is a sequence of one or more characters from the target writing script. For instance, the Hindi character थ can be labeled as “tha” when transliterating Hindi → English. One issue with this formulation is that the size of the output label space can be quite large, depending on how many different spans yp, ··· , yq are seen in the training data. For instance, Ammar et al. (2012) reported that Arabic → English transliteration had > 1200 labels, while Thai → English had > 1700 labels.

Both the above formulations require a generous amount of name pairs (≈ 5–10k) as supervision. One popular way to obtain this supervision is identifying name pairs using inter-language links in Wikipedia (Irvine et al., 2010; Tsai and Roth, 2018). However, a truly low-resource language (like Tigrinya) is likely to have limited Wikipedia presence as well, and thus low-resource transliteration remains a challenging problem.
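As a concrete illustration of the sequence labeling formulation, consider the following minimal Python sketch. The label inventory here is a toy, hand-written mapping for a single name; real systems learn these labels (and the underlying character alignments) from training pairs.

# Toy label inventory: each source character is assigned a span of one or
# more target characters. Hand-picked for illustration, not learned.
char_labels = {"थ": "tha", "न": "n", "ो": "o", "स": "s"}

def transliterate_by_labeling(word: str) -> str:
    # Concatenate the label assigned to each source character.
    return "".join(char_labels[ch] for ch in word)

print(transliterate_by_labeling("थनोस"))  # -> "thanos"

The toy example also makes the label-space problem visible: every distinct target span seen in training (e.g., “tha” vs. “t”) becomes its own label.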

8.2.2. Transliteration Discovery

On the other hand, the task of transliteration discovery deals with selecting the correct transliteration y for a word x from a relatively small list of candidates N. The presence of the name list makes discovery essentially a ranking problem — rank all names in N to identify the most appropriate transliteration for a given word in the source script. Figure 21b shows how a discovery model processes the Hindi string ब्रसेल्स.

Discovery is a considerably easier task than generation, owing to the restricted search space. Indeed, for |N| ∼ 50k, unsupervised discovery approaches that use constraints derived from romanization tables have been shown to be successful (Chang et al., 2009). Notice that for a given input x, many names in N can be filtered out quite easily, so the effective search space is even smaller. In general, the hardness of the discovery problem increases with |N|, say, when N contains millions of names.

A key limitation of discovery is the assumption that the correct transliteration(s) is in the list of candidates N. But it is unlikely that N will be exhaustive, as new names are constantly being introduced into the language.3 If no correct transliteration is present in N, a discovery model will end up selecting some name from N, producing false positives. Of course, one can partially remedy this issue by using a larger list of candidates N, at the cost of increased difficulty. Another issue with transliteration discovery approaches is that they often exploit features derived from resources that are unlikely to be available for low-resource languages, like temporally aligned comparable corpora (Sproat et al., 2006; Klementiev and Roth, 2008). To overcome these limitations, it is prudent to develop generation models that can handle inputs whose correct transliteration is not in N, and that operate in low-resource scenarios.

3The honorific Khaleesi from the TV series “Game of Thrones” became a popular baby name in 2016.

The work described in this chapter develops transliteration generation approaches for low-resource languages, by using constrained discovery to drive the learning.

8.2.3. Our Work

Existing generation models require supervision in the form of source-target name pairs (≈ 5–10k) that are often collected from names in Wikipedia inter-language links (Irvine et al., 2010). However, most languages that use non-Latin scripts are under-represented in terms of such resources. Table 31 illustrates this issue. A generation model that requires 50k name pairs as supervision can only support 6 languages, while one that needs 500 could support 56. For a model to be widely applicable, it must function in low-resource settings.

# Name Pairs in Wikipedia    Languages    Scripts
> 50,000                         6            5
> 10,000                        18           14
> 5,000   (Previous Work)       24           15
> 1,000                         45           22
> 500     (Our Approach)        56           23
> 0                             93           30

Table 31: Cumulative number of person name pairs in Wikipedia inter-language links. For instance, 45 languages (that cover 22 scripts) have at least 1000 name pairs that can be extracted from inter-language links. While previous approaches for transliteration generation were applicable to only 24 languages (spanning 15 scripts), our approach is applicable to 56 languages (23 scripts). When counting scripts we exclude variants (e.g., all Cyrillic scripts and variants count as one).

We show that a weak generation model can be iteratively improved using constrained discovery. In particular, our work uses a weak generation model to discover new training pairs, using constraints to drive the bootstrapping. Table 31 shows the extra coverage one can achieve by extending to low-resource languages using our approach. The practicality of our approach is demonstrated in truly low-resource scenarios and downstream applications through two case studies. First, experiments in Section 8.8.1 show that one can obtain the initial supervision from a single human annotator within a few hours for two languages — Armenian and Punjabi. This is a realistic scenario where language access is limited to a single native informant. Second, experiments in Section 8.8.2 show that our approach benefits a typical downstream application, namely candidate generation for cross-lingual entity linking, by improving recall on two low-resource languages — Tigrinya and Macedonian.

To appreciate the difficulty of the transliteration task, an analysis of the challenges inherent to transliteration, and of the trade-off between native (i.e., source) and foreign (i.e., target) vocabulary, is also presented in Section 8.7.

The generation model presented in this chapter is inspired by the success of sequence to sequence generation models (Sutskever et al., 2014; Bahdanau et al., 2015), particularly variants used for string transduction tasks like morphological inflection and derivation generation (Faruqui et al., 2016; Cotterell et al., 2017; Aharoni and Goldberg, 2017; Makarov et al., 2017). The bootstrapping framework can be viewed as an instance of constraint-driven learning (Chang et al., 2007, 2012).

8.3. Transliteration with Hard Monotonic Attention

We view generation as a string transduction task and use a sequence to sequence (Seq2Seq) generation model that uses hard monotonic attention (Aharoni and Goldberg, 2017), henceforth referred to as Seq2Seq(HMA). During generation, Seq2Seq(HMA) directly models the monotonic source-to-target sequence alignments, using a pointer that attends to a single input character at a time. Monotonic attention is a natural fit for transliteration because even though the number of characters needed to represent a sound in the source and target language vary, the sequence of sounds is presented in the same order.4 We review Seq2Seq(HMA) below, and describe how it can be applied to transliteration generation.

4Many Indic scripts, which sometimes write vowels before the consonants they are pronounced after, seem to violate this claim, but their Unicode representations actually preserve the consonant-vowel order.

Figure 22: Transliteration using Seq2Seq transduction with Hard Monotonic Attention, or Seq2Seq(HMA). The figure shows how decoding proceeds for transliterating थनोस to thanos. During decoding, the model attends to a source character (e.g., थ, shown in blue) and outputs target characters (t, h, a) until a step action is generated, which moves the attention position forward by one character (to न), and so on.

Encoding the Input Word. Let Σf be the source alphabet and Σe be the English alphabet. Let x = (x1, x2, ··· , xn) denote an input word where each character xi ∈ Σf. The characters are first encoded using an embedding matrix W ∈ R^(|Σf|×d) to get character embeddings w1, w2, ··· , wn, where each wi ∈ R^d. These embeddings are fed into a bidirectional RNN encoder (Figure 22) to generate encoded vectors h1, h2, ··· , hn, where each hi ∈ R^(2k), and k is the size of the output vector of the forward (and backward) encoder. The encoded vectors h1, h2, ··· , hn are then fed into the decoder.

Monotonic Decoding with Hard Attention. Figure 22 illustrates the decoding process. The decoder RNN generates a sequence of actions {s1, s2, ···}, such that each si ∈ Σe ∪ {step}. The step action controls an attention position a, attending on input character xa with encoded vector ha. Each action si is embedded into si ∈ R^d using an output embedding matrix A ∈ R^((|Σe|+1)×d). At any time during decoding, the decoder uses its last hidden state, the embedding of the previous action si, and the encoded vector ha of the currently attended position to generate the next action si+1. If the generated action is step, the decoder increments the attention position by one. This ensures that the decoding is monotonic, as the attention position can only move forward or stay at the same position during generation. We use Inference(G, x) to refer to the above decoding process for a trained generation model G and input word x.
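For concreteness, the decoding loop can be sketched in a few lines of PyTorch. This is a minimal greedy-decoding sketch, not the actual implementation: the class, its dimension defaults (loosely following the setup in Section 8.5), and the stopping rule (halting once the pointer steps past the last input character, instead of on an explicit end-of-sequence action) are all assumptions for illustration.

import torch
import torch.nn as nn

STEP = 0  # action index reserved for the step action (an assumed convention)

class HardMonotonicDecoder(nn.Module):
    # Decoder loop only; the bidirectional encoder, training, and beam
    # search are omitted.
    def __init__(self, n_actions, d=50, enc_dim=40, hid=64):
        super().__init__()
        self.action_emb = nn.Embedding(n_actions, d)  # A: (|Σe|+1) x d
        self.rnn = nn.GRUCell(d + enc_dim, hid)       # consumes [s_i ; h_a]
        self.out = nn.Linear(hid, n_actions)

    def decode(self, enc, max_len=50):
        # enc: (n, enc_dim) tensor holding the encoded characters h_1..h_n.
        a, prev = 0, STEP                             # attention position, last action
        state = torch.zeros(1, self.rnn.hidden_size)
        chars = []
        for _ in range(max_len):
            inp = torch.cat([self.action_emb(torch.tensor([prev])),
                             enc[a].unsqueeze(0)], dim=-1)
            state = self.rnn(inp, state)
            prev = self.out(state).argmax(-1).item()  # greedy choice of next action
            if prev == STEP:
                if a == enc.size(0) - 1:              # pointer past the input: stop
                    break
                a += 1                                # monotonic: move forward by one
            else:
                chars.append(prev)                    # emit a target character
        return chars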

Training. Training requires the oracle action sequence {si} for input x1:n that generates the correct transliteration y1:m. The oracle sequence is generated using the training name pairs and Algorithm 1 in Aharoni and Goldberg (2017), with the character-level alignment between x1:n and y1:m being generated using the algorithm in Cotterell et al. (2016).

Inference Strategies. An unconstrained and a constrained inference strategy can be used to select the best transliteration ŷ from a beam {y1, ··· , yk} of transliteration hypotheses, sorted in descending order by likelihood. The constrained strategy uses a name dictionary N to guide the inference. These strategies are applicable to any generation model.

• Unconstrained (U) selects the most likely item y1 in the beam as ŷ.

• Dictionary-Constrained (DC) selects the highest scoring hypothesis that is present in N, and defaults to y1 if none are in N.

It is tempting to disallow the model from generating hypotheses that are not in the dictionary N. However, dictionaries are always incomplete, and restricting the search to generate only from N inevitably leads to incorrect predictions if the correct transliteration is not in N. This is essentially the same problem as the one inherent to discovery models.
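Both strategies amount to a few lines of post-processing over the beam. The sketch below assumes the beam is a list of (hypothesis, log-likelihood) pairs already sorted in descending order of likelihood; the function names are illustrative.

def unconstrained(beam):
    # U: return the most likely hypothesis y1.
    return beam[0][0]

def dictionary_constrained(beam, name_dict):
    # DC: return the highest-scoring hypothesis present in N,
    # defaulting to y1 if none of them are in N.
    for hyp, _ in beam:
        if hyp in name_dict:
            return hyp
    return beam[0][0]

# e.g., a beam of hypotheses for the Hindi string थॉमस:
beam = [("tomaz", -0.9), ("tomas", -1.1), ("thomas", -1.4)]
print(dictionary_constrained(beam, {"thomas", "london"}))  # -> "thomas"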

Other Strategies in Previous Work. A related constrained inference strategy was proposed by Lin et al. (2016), who use an entity linking system (Wang et al., 2015) to correct and re-rank hypotheses. Our constrained inference strategy is much simpler, requiring only a name dictionary N . These approaches are compared in the experiments.

8.4. Low-Resource Bootstrapping

Low-resource languages will have a limited number of name pairs for training a generation model. This section proposes a new bootstrapping algorithm to learn a good generation model in this setting, which uses constrained discovery to mine name pairs for re-training the generation model. The algorithm requires a small (≈ 500) seed list of name pairs S for supervision, a dictionary N containing names in English, and a list of words Vf in the foreign script. The algorithm and the constraints used in discovery are described below.

8.4.1. The Bootstrapping Algorithm

Algorithm 3 shows the bootstrapping procedure. First, a weak generation model G0 is initialized using a seed list of name pairs S (line 1). At iteration t, the current generation model Gt produces the top-k transliteration hypotheses {y1, ··· , yk} for each word x ∈ Vf (line 5). A source word and hypothesis pair (x, yi) is added to the set of mined name pairs B if it satisfies a set of discovery constraints (described below) (line 8). A new generation model Gt+1 is trained for the next iteration using the union of the seed list S and the mined name pairs B (line 12). B is purged after every iteration (line 3) to prevent Gt+1 from being influenced by possibly incorrect name pairs mined in earlier iterations. The algorithm converges when accuracy@1 stops increasing on a development set. Note that our bootstrapping approach is applicable to any transliteration generation model.

To ensure that high-quality name pairs are added to the mined set B during bootstrapping, the following discovery constraints are used.

8.4.2. Discovery Constraints

A word-transliteration pair (x, y) is added to the set of mined pairs B only if all of the following constraints are satisfied:

1. y ∈ N, i.e., y is a known name in the dictionary.

2. P(y | x) > δmin. The model is sufficiently confident about the transliteration.

3. The ratio of lengths |y|/|x| should be close to the average ratio estimated from S (Matthews, 2007). This is encoded using the constraint | |y|/|x| − r(S) | ≤ ϵ, where ϵ is a tunable tolerance and r(S) is the average ratio in S.

4. |y| > Lt^min. It was found that false positives were more likely to be short hypotheses in early iterations. As the model improves with each iteration, Lt^min is lowered to allow more new pairs to be mined.

Algorithm 3 Bootstrapping Transliteration Generation via Constrained Discovery
Input: English name dictionary N; Seed training pairs S; Vocabulary in the target language Vf.
Hyper-parameters: initial minimum length threshold L0^min; minimum likelihood threshold δmin; length ratio tolerance ϵ.
Output: Generation model GT
 1: G0 = train(S)                                  ▷ Initialize a weak generation model.
 2: while not converged do
 3:     B = ∅                                      ▷ Purge mined set.
 4:     for x in Vf do
 5:         {y1, ··· , yk} = argtopk Inference(Gt, x)   ▷ Generate top-k hypotheses.
 6:         for yi in {y1, ··· , yk} do
 7:             if (x, yi) satisfies constraints in Section 8.4.2 then
 8:                 B = B ∪ {(x, yi)}              ▷ Add name pair to mined set.
 9:             end if
10:         end for
11:     end for
12:     Gt+1 = train(S ∪ B)
13:     Lt+1^min = Lt^min − 1                      ▷ Reduce length threshold.
14:     t = t + 1                                  ▷ Track iteration.
15: end while

Notice that the bootstrapping algorithm can be formulated as an instance of constraint-driven learning (Chang et al., 2007, 2012).
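One bootstrapping iteration is compact enough to sketch in Python. In the sketch below, train() and inference_topk() are hypothetical stand-ins for the generation model's training and beam-search routines from Section 8.3, scores are assumed to be likelihoods P(y | x), and the threshold defaults are placeholders rather than tuned values.

def satisfies_constraints(x, y, score, name_dict, r_avg, l_min,
                          delta_min=0.5, eps=0.3):
    return (y in name_dict                            # (1) y is a known name in N
            and score > delta_min                     # (2) P(y|x) is high enough
            and abs(len(y) / len(x) - r_avg) <= eps   # (3) plausible length ratio
            and len(y) > l_min)                       # (4) not a short hypothesis

def bootstrap_iteration(G_t, vocab_f, seed_pairs, name_dict, r_avg, l_min, k=10):
    mined = set()                                     # B is purged every iteration
    for x in vocab_f:
        for y, score in inference_topk(G_t, x, k):    # top-k beam hypotheses
            if satisfies_constraints(x, y, score, name_dict, r_avg, l_min):
                mined.add((x, y))
    return train(list(seed_pairs) + list(mined))      # G_{t+1} from S ∪ B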

8.5. Experimental Setup

All generation models are evaluated using accuracy at 1 (acc@1), by comparing the best model prediction ŷ against the reference transliteration y*, unless otherwise specified.

Training and Evaluation. The train and development sets from the Named Entities Workshop 2015 (Duan et al., 2015) (NEWS2015) for Hindi (hi), Kannada (kn), Bengali (bn), Tamil (ta) and Hebrew (he) are used as our train and evaluation sets.5 The sizes of the train sets were ∼12k, 10k, 14k, 10k and 10k respectively, and all evaluation sets were ∼1k.

For the low-resource experiments, 500 examples are sub-sampled from each train set in the NEWS2015 dataset using five random seeds, and the averaged results over these are reported. 1k name pairs from the corresponding NEWS2015 train set of each language are set aside as development data. The foreign script portion of the remaining train data is used as Vf in the bootstrapping algorithm.

Model and Tuning Details. Seq2Seq(HMA) is implemented using PyTorch.6 50-dimensional character embeddings and a single-layer GRU (Cho et al., 2014) encoder with 20 hidden states are used for all experiments. The Adam (Kingma and Ba, 2014) optimizer was used with default hyper-parameters, a learning rate of 0.001, a batch size of 1, and a maximum of 20 iterations in all experiments. Beam search used a width of 10. For low-resource experiments, all bootstrapping parameters were tuned on the development data set aside above. L0^min is chosen from {10, 15, 20, 25}.

Name Dictionary N. A name dictionary of 1.05 million names is constructed from the English Wikipedia (dump dated 05/20/2017) by taking the list of title tokens in Wikipedia sorted by frequency, and removing tokens that appear only once. The same dictionary is used in all experiments.
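A minimal sketch of this construction, assuming wiki_titles is an iterable of English Wikipedia page titles (parsing the dump itself is omitted):

from collections import Counter

def build_name_dictionary(wiki_titles):
    counts = Counter()
    for title in wiki_titles:
        # Titles like "University_of_Liverpool" contribute their tokens.
        counts.update(title.split("_"))
    # Keep tokens that appear more than once, most frequent first.
    return [tok for tok, c in counts.most_common() if c > 1]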

Comparisons. The following generation models are used for comparison:

(a) P&R (Pasternack and Roth, 2009) A probabilistic transliteration generation approach that learns latent alignments between sub-strings in the source and the target words. The model is trained to score all possible segmentations and their alignments, using an EM-like algorithm.

5The test set was not available since the shared task had concluded.
6github.com/pytorch

(b) DirecTL+ (Jiampojamarn et al., 2009) An HMM-like discriminative string transduction model that predicts the output transliteration using many-to-many alignments between the source word and target transliteration. Following Jiampojamarn et al. (2009), the m2m-aligner (Jiampojamarn et al., 2007) is used to generate the many-to-many alignments, and the public implementation of DirecTL+ to train models.7

(c) RPI-ISI (Lin et al., 2016) A transliteration approach that uses an entity linking system (Wang et al., 2015) to jointly correct and re-rank the hypotheses produced by the generation model. The experiments compare to both the unconstrained inference (U) approach and the entity linking constrained inference (+EL) approach.

(d) Seq2Seq w/ Att A sequence to sequence generation model that uses soft attention, as described in Bahdanau et al. (2015). This model does not enforce monotonicity at inference time, and serves as a direct comparison for Seq2Seq(HMA).

8.6. Experiments

The experiments analyze two questions: (a) how effective is Seq2Seq(HMA) for transliteration generation when provided all available supervision (Section 8.6.1)? (b) how effective is the bootstrapping algorithm in the low-resource setting when only 500 examples are available (Section 8.6.2)?

8.6.1. Full Supervision Setting

This experiment compares Seq2Seq(HMA) with the generation approaches described in Section 8.5, when provided all available supervision, to see how it fares under standard evaluation.

Results in the unconstrained inference (U) setting, shown in the top 5 rows of Table 32, demonstrate that Seq2Seq(HMA) (denoted by “Ours(U)”) outperforms previous approaches on Hindi, Kannada, and Bengali, with almost 3-4% gains. Improvements over the Seq2Seq with Attention (Seq2Seq w/ Att) model demonstrate the benefit of imposing the monotonicity constraint in the generation model.

7https://code.google.com/p/directl-p

Lang. →                    hi     kn     bn     ta     he    Avg.
Approach ↓
Full Supervision Setting (5–10k examples)
Seq2Seq w/ Att (U)        35.5   33.4   46.1   17.2   20.3   30.5
P&R (U)                   37.4   31.6   45.4   20.2   18.7   30.7
DirecTL+ (U)              38.9   34.7   48.4   19.9   16.8   31.7
RPI-ISI (U)               40.3   29.8   49.4   20.2   21.5   32.2
Ours(U)                   42.8   38.9   52.4   20.5   23.4   35.6
Approaches using Constrained Inference
RPI-ISI + EL              44.8   37.6   52.0   29.0   37.2   40.1
Ours(DC)                  51.8   43.3   56.6   28.0   36.1   43.2
Low-Resource Setting (500 examples)
Seq2Seq w/ Att (U)        17.0   13.6   14.5    6.0    9.5   12.1
P&R (U)                   21.1   16.6   34.2    9.4   13.0   18.9
DirecTL+ (U)              26.6   25.3   35.5   11.8   10.7   22.0
Ours(U)                   29.1   27.7   37.7   11.5   16.2   24.4
Ours(U) + Boot.           40.1   35.1   50.3   17.8   22.8   33.2

Table 32: Comparing different generation approaches on the NEWS 2015 dataset using accuracy@1 as the evaluation metric for five languages — Hindi (hi), Kannada (kn), Bengali (bn), Tamil (ta) and Hebrew (he) — in full supervision and low-resource settings. “Ours” denotes the Seq2Seq(HMA) model, with (·) denoting the inference strategy. The rest of the approaches are described in Section 8.5. Numbers for RPI-ISI are from Lin et al. (2016).

On Tamil and Hebrew, Seq2Seq(HMA) is on par with the best approaches, with a negligible gap (∼0.3) in scores. Overall, Seq2Seq(HMA) achieves scores better than (and sometimes competitive with) state-of-the-art approaches in full supervision settings. When comparing constrained inference approaches (Table 32, rows 6 and 7), it can be seen that using dictionary-constrained inference (as in “Ours(DC)”) is more effective than using an entity-linking model for re-ranking (RPI-ISI + EL).

8.6.2. Low-Resource Setting

In Table 32 (rows under “Low-Resource Setting”), the different models are evaluated in a low-resource setting when provided only 500 name pairs as supervision. Results are averaged over 5 different random sub-samples of 500 examples.

The results clearly demonstrate that all generation models suffer a drop in performance when provided limited training data. Note that models like Seq2Seq with Attention suffer a larger drop than those which enforce monotonicity, suggesting that incorporating monotonicity into the inference step is essential in the low-resource setting. After bootstrapping our weak generation model using Algorithm 3, the performance improves substantially (last row in Table 32). On almost all languages, the generation model improves by at least 6%, with performance for Hindi and Bengali improving by more than 10%. Bootstrapping results for the languages are within 2-4% of the best model trained with all available supervision.

To better analyze the progress of the transliteration model during bootstrapping, the accuracy@1 of the current model is plotted after each bootstrapping iteration for each of the languages (solid lines in Figure 23). For reference, the best performance of a generation model using all available supervision from Section 8.6.1 is also shown as dotted horizontal lines in Figure 23. We can see that after about 5 bootstrapping iterations, the generation model attains performance competitive with the respective state-of-the-art models trained with full supervision.

Figure 23: Plot showing accuracy@1 after each bootstrapping iteration for Hindi, Kannada, Bengali, Tamil and Hebrew, starting with only 500 training pairs as supervision. For comparison, the accuracy@1 of a model trained with all available supervision is also shown (respective dashed lines, marked X-Full).

8.6.3. Error Analysis

Though our model is state of the art, it does present a few weaknesses. We have found that the dictionary sometimes misleads the model during constrained inference. For example, the correct transliteration of the Hindi word वुल is vidyul, which is not present in the dictionary, but an incorrect hypothesis vidul is, leading to an incorrect prediction.

Another issue comes from the proportion of native (i.e., from the source language) and foreign (i.e., from English or other languages) names in the training data. It is usually not the case that the source and target scripts have the same transliteration rules. For example, य in Hindi might represent ya in English or Hindi names, but ja in German. Similarly, while अ should be a in Hindi names, it could be any of a few vowels in English. The NEWS2015 dataset does not report a native/foreign ratio, but by our estimation, it is about 70/30 for each language. This dichotomy between native and foreign names is one of the inherent challenges in transliteration, and is discussed in detail in the next section.

8.7. Challenges Inherent to Transliteration

The fact that all models in Table 32 perform well or poorly on the same languages suggests that most of the observed performance variation is the result of factors intrinsic to the specific languages. Here we analyze some challenges that are inherent to the transliteration task, and explain why the performance ceiling is well under 100% for all languages, and lower for languages like Tamil and Hebrew than for the others.8

8.7.1. Source and Target-Specific Issues

Source-Driven. Some transliteration errors are due to ambiguities in the source scripts. For instance, the Tamil script uses a single character to denote {ta, da, tha, dha}, a single character for {ka, ga, kha, gha}, etc., while the rest of the Indian scripts have unique characters for each of these. Thus, names like Hartley and Hardley are entirely indistinguishable in Tamil but are distinguishable in the other scripts. We illustrate this problem by transliterating back and forth between Tamil and Hindi. When transliterating Hindi → Tamil, the model achieves an accuracy of 31%, which drops to 15% when transliterating Tamil → Hindi, suggesting that the Tamil script is more ambiguous.

The Hebrew script also introduces errors because it tends to omit vowels or write them ambiguously, leaving the model to guess between plausible choices. For example, the word מלך could be transliterated melech (meaning king) just as easily as malach (meaning he ruled). When Hebrew does write vowels, it reuses consonant letters, again ambiguously. For example, ה can be used to express a or e, so שמונה can be either shmona or shmone (either a masculine eight or a feminine eight). The script also does not reliably distinguish b from v or p from f, among others.

All languages run into problems when they are faced with writing sounds that they do not natively distinguish. For example, Hindi does not make a distinction between w and v, so both vest and west are written as वेस्ट in its script.

These script-specific deficiencies explain why all models struggle on Tamil and Hebrew relative to the others. These issues cannot be completely resolved without memorizing individual source-target pairs and leveraging context.

8A similar analysis to ours was presented in (Merhav and Ash, 2018).

Target-Driven. Some errors arise from the challenges presented by the target script (here, the Latin script for English). To handle English's notoriously convoluted orthography, a model has to infer silent letters, decide whether to use f or ph for /f/; use k, c, ck, ch, or q for /k/; and so on. The problem is made worse because English is not the only language that uses the Latin script. For example, German names like Schmidt should be written with sch instead of sh, and for French names like Margot and Margeau (which are pronounced the same), we have to resort to memorization. The arbitrariness extends into borrowings from the source languages as well. For example, the Indian name Bangalore is written with a silent e, and the name Lakshadweep contains ee, instead of the expected i.

8.7.2. Disparity between Native and Foreign

Certain names are well-integrated into the source language etymologically (Indian names like Jasodhara or Ramanathan for Hindi), while some are not (French Grenoble or Japanese Honshu for Hindi). We refer to the former as native names for that language, and the latter as foreign names. The above datasets include an unspecified mix of native and foreign names. How does the transliteration performance vary across these two name types?

To quantify the effect of this, we annotate native and foreign names in the test split of the four Indian languages, and evaluate performance for both categories. Table 33 shows that the performance is significantly better on native names for all the languages. A possible reason for this is that the source scripts were designed for writing native names (e.g., the Tamil script lacks separate {ta, da, tha, dha} characters because Tamil does not distinguish these sounds). Furthermore, foreign names have a wide variety of origins with their own conventions (Section 8.7.1). The performance gap is proportionally greatest for Tamil, likely due to its script-specific deficiencies.

Language    Native    Foreign    Ratio
Hindi        45.1      31.4      1.44
Bengali      63.1      20.1      3.14
Kannada      42.6      23.1      1.84
Tamil        24.3       5.2      4.67

Table 33: Accuracy@1 for native and foreign words for four languages (Section 8.7.2). Ratio is native performance relative to foreign.

This disparity between native and foreign names is a problem since any model must learn essentially separate transliteration schemes for each name type.

8.8. Case Studies

The practical utility of our approach in low-resource settings and for downstream applications is evaluated through two case studies. First, Section 8.8.1 evaluates if obtaining an adequate seed list is possible with a few hours of manual annotation from a single human annotator. Then, Section 8.8.2 evaluates the impact of our approach on a downstream task of candidate generation for Tigrinya and Macedonian entity linking.

Language      Monolingual Corpus    Vocabulary
Punjabi       Corpus ILCI-II♠       30k
Armenian      TED♣                  50k
Tigrinya      Habit Project♦        225k
Macedonian    TED♣                  60k

♦=habit-project.eu/wiki/TigrinyaCorpus, ♠=tdil-dc.in, ♣=github.com/ajinkyakulkarni14/TED-Multilingual-Parallel-Corpus

Table 34: Corpora used for obtaining the foreign vocabulary Vf for bootstrapping in the case studies in Section 8.8.1 and Section 8.8.2.

8.8.1. Manual Annotation

The manual annotation exercises simulate a low-resource setting in which only a single human annotator is available. This experiment judges the usability of the annotations by training models on them and evaluating the models on test sets of 1000 names each, obtained from Wikipedia inter-language links. For the bootstrapping experiments, the corpora shown in Table 34 are used to obtain the foreign vocabulary Vf.

Languages Studied. The evaluation involves two languages: Armenian and Punjabi.

Spoken in Armenia and Turkey, Armenian is an Indo-European language with no close relatives. It has Eastern and Western dialects with different spelling conventions. One issue was that the Armenian Wikipedia is primarily written in the Eastern dialect, while our annotator was a native Western speaker. The annotator produced Western Armenian name pairs, which were mechanically mapped to “Eastern” Armenian by swapping five Armenian character pairs: դ/տ, պ/բ, ք/կ, ձ/ծ, ճ/ջ, so that evaluation could be performed against name pairs from Armenian Wikipedia.
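This mapping is a single character-level substitution. The sketch below assumes the swap is applied symmetrically within each pair:

# The five Western/Eastern Armenian character pairs listed above.
pairs = [("դ", "տ"), ("պ", "բ"), ("ք", "կ"), ("ձ", "ծ"), ("ճ", "ջ")]
swap = str.maketrans({a: b for x, y in pairs for a, b in ((x, y), (y, x))})

def to_eastern(western_name: str) -> str:
    # Replace each member of a pair with its counterpart.
    return western_name.translate(swap)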

Punjabi is an Indic language from Northwest India and Pakistan that is closely related to Hindi. Our annotator grew up primarily speaking Hindi.

Annotation Guidelines. Annotators were given two tasks. First, they were asked to write two names and their English transliterations for each letter in the source script: one beginning with the letter and another containing it elsewhere (e.g., Julia and Benjamin for the letter j, if the source were English). This is done to ensure good coverage over the alphabet. Next, annotators were shown a list of English words and were asked to provide plausible transliteration(s) into the target script. The list had a mix of recognizable foreign names (e.g., Clinton, Helsinki) and native names (e.g., Sarkessian, Yerevan for Armenian).

About 600 and 500 annotated pairs were collected for Armenian and Punjabi, respectively.

Table 35 shows that the performance of the models trained on the annotated data is comparable to that on the standard test corpora for other languages. This shows that our approach is robust to human inconsistencies and regional spelling variations, and that obtaining an adequate seed list is possible with just a few hours of manual annotation.

Lang. →                    Punjabi   Armenian
Approach ↓
Ours(U)                    33.4      49.9
Ours(U) + Bootstrapping    44.5      55.8
Annotation Time (hours)    5         4

Table 35: Accuracy@1 of the Seq2Seq(HMA) model supervised using a human-annotated seed set in Punjabi and Armenian (with and without bootstrapping). Both languages perform well relative to the other languages investigated so far. Both annotation sub-tasks took roughly the same time.

8.8.2. Candidate Generation

Since transliteration is an intermediate step in many downstream multilingual information extraction tasks (Darwish, 2013; Kim et al., 2012; Jeong et al., 1999; Virga and Khudanpur, 2003; Chen et al., 2006), it is possible to gauge its performance extrinsically by the impact it has on such tasks. This section uses the task of candidate generation (CG), a key step in cross-lingual entity linking, as the subject.

The goal of cross-lingual entity linking (McNamee et al., 2011; Tsai and Roth, 2016; Upadhyay et al., 2018a) is to ground spans of text written in any language to an entity in a knowledge base (KB). For instance, grounding [Chicago] in the following German sentence to Chicago_(band):9

[Chicago] wird in Woodstock aufzutreten.

CG in entity linking identifies a set of plausible entities that a given mention can link to, while ensuring the correct KB entity belongs to that set (Section 7.2.1). For the above German sentence, CG would provide a list of possible KB entities for the string Chicago: Chicago_(band), Chicago_(city), Chicago_(font), etc., so that context-sensitive inference (Section 7.2.1) can select the band. Foreign scripts pose an additional challenge for CG because mentions must be transliterated before they are passed on to candidate generation. For instance, any mention of Chicago in Amharic must first be transliterated from ሺካጎ.

9Translation: Chicago will perform at Woodstock.

Most approaches for CG use Wikipedia inter-language links to generate the lists of candidates (Tsai and Roth, 2016). While recent approaches such as Tsai and Roth (2018) have used name translation for CG, they require over 10k examples for languages with non-Latin scripts, which is prohibitive for low-resource languages with little Wikipedia presence.

Candidate Generation with Transliteration. We evaluate the extent to which our approach improves the recall of a naive CG baseline that generates candidates by performing exact name match. For each span of text to be linked (or query mention), it is first checked whether the naive name matching strategy finds any candidates in the KB. If none are found, the query mention is back-transliterated to English, and at most 20 candidates are generated using an inverted index from English names to KB entities. The evaluation metric is recall@20, i.e., whether the gold KB entity is among the top 20 candidates. The effect of including transliteration on CG is evaluated for two test languages — Tigrinya and Macedonian.
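As a minimal sketch, the fallback logic just described can be written as follows; kb_index, english_index, and transliterate are hypothetical stand-ins for the exact-match KB lookup, the English-name inverted index, and the transliteration model.

    def generate_candidates(mention, kb_index, english_index, transliterate, k=20):
        # First try the naive exact-name-match strategy against the KB.
        candidates = kb_index.get(mention, [])
        if candidates:
            return candidates[:k]
        # Otherwise back-transliterate to English and query the inverted index.
        english_name = transliterate(mention)
        return english_index.get(english_name, [])[:k]

    def recall_at_20(queries, gold, **resources):
        # Fraction of query mentions whose gold entity is among the top 20.
        hits = sum(gold[m] in generate_candidates(m, **resources) for m in queries)
        return hits / len(queries)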

Tigrinya is a South Semitic language related to Amharic, written in the Ethiopic script, and spoken primarily in Eritrea and northern Ethiopia. The Tigrinya Wikipedia has <200 articles, so inter-language links (∼7.5k) from the Amharic Wikipedia are used instead to extract 1k name pairs for the seed set. The monolingual corpus in Table 34 is used for bootstrapping. The model is evaluated on the unsequestered set provided under the NIST LoReHLT evaluation, containing 4,630 query mentions.

The Ethiopic script is an alphasyllabary, where each character is a consonant-vowel pair. For example, the character መ is mä, ሚ with a tail is mi, and ሞ with a line is mo. With 26 consonants and 8 vowels, this leads to a set of >200 characters, creating a sparsity problem since each character has its own Unicode code point. However, the code points are organized so that they can be automatically split10 into unique consonant and vowel codes without explicitly understanding the script. Arbitrary ASCII codes are assigned to each consonant and vowel, so that መ/mä becomes “D 1” and ሞ/mo becomes “D 6.” This consonant-vowel splitting (CV-split) reduces the number of unique input characters to 55.

Approach                           Recall@20
Tigrinya
Name match (baseline)              31.4
Ours                               35.6
Ours (CV-split)                    41.3
Ours (CV-split) + Bootstrapping    46.2
Macedonian
Name match (baseline)              33.6
Ours                               72.2
Ours + Bootstrapping               76.8

Table 36: Comparing candidate recall@20 for different approaches on Tigrinya and Macedonian. CV-split refers to consonant-vowel splitting. Using our transliteration generation model with bootstrapping yields the highest recall, improving significantly over a name match baseline.

Macedonian is a South Slavic language closely related to the languages of the former Yugoslavia and written in a local variant of the Cyrillic alphabet. The experiment uses the Macedonian test set constructed by McNamee et al. (2011), containing 1,956 query mentions. A seed set of 1k name pairs was obtained from the Wikipedia inter-language links for Macedonian, and the monolingual corpus from Table 34 was used for bootstrapping.

Candidate Generation Results. Table 36 shows the results of the candidate generation experiments in the two languages. For Tigrinya, candidate generation with transliteration improves on the baseline by 4.2%. Splitting the characters (CV-split) gives another 5.7%, and adding bootstrapping gives 4.9% more. Our approach yields an overall 14.8% improvement in recall over the baseline, showing that the little available supervision can be effectively exploited by bootstrapping. Macedonian yields more dramatic results: transliteration provides a 38.6% improvement (more than double the baseline), with bootstrapping providing another 4.6%. The differences between Tigrinya and Macedonian are likely due to differences in their test sets, corpora, and writing systems.

10 Consonant = Unicode / 8; Vowel = Unicode % 8

8.9. Summary

The chapter examined the transliteration problem, that serves as an important component of candidate generation for cross-lingual entity linking. We presented a new transliteration generation model, and a new bootstrapping algorithm that can iteratively improve a weak generation model using constrained discovery. The model presented here achieves state- of-the-art results on standard training set sizes, and more importantly, works well in a low-resource setting with the aid of the bootstrapping algorithm. The key benefit of the bootstrapping approach is that it can “recover” most of the performance lost in the low- resource setting when little supervision is available by training with a smaller seed set, an

English name dictionary, and a list of un-annotated words in the target script. Through case studies, it was shown that collecting an adequate seed list is practical with a few hours of annotation. The benefit of incorporating our transliteration approach in a downstream task, namely candidate generation for cross-lingual entity linking, was also demonstrated.

Challenges inherent to the transliteration task were also discussed, particularly the impact of the native/foreign distinction in the training data and the difficulties posed by specific scripts or pairs of scripts.

There are several interesting directions for future work. Performing model combination, either by developing hybrid transliteration models (Nicolai et al., 2015) or by ensembling (Finch et al., 2016), can further improve low-resource transliteration. Jointly leveraging similarities between related languages, such as writing systems or phonetic properties (Kunchukuttan et al., 2018), also shows promise for low-resource settings. Our analysis suggests value in revisiting “transliteration in context” approaches (Goto et al., 2003; Hermjakob et al., 2008), especially for languages like Hebrew.

CHAPTER 9 : Conclusion

So far, progress in NLP has largely benefited just a select few languages. For its fruits to be accessible to more of the world’s diverse set of languages, it is essential to enable NLP in multilingual settings. However, such endeavors are beset with the challenges of working with no or limited supervision in the language of interest.

Throughout this thesis, I discussed different approaches that transfer, share, or exploit knowledge across languages to achieve this goal. The recurring theme was using cross-lingual signals to develop approaches that facilitate NLP in English and other languages. The key ingredient in all approaches was some form of cross-lingual word representation, which made sharing or transfer of supervision possible.

9.1. Summary of Contributions

The thesis began by motivating the need for developing NLP technology for languages other than English in Chapter 1. A large population of the world produces and consumes information in a variety of languages, as is evident from the growing size of the non-English content on the Web. Multilingual NLP can help us leverage this relatively untapped source of knowledge, and can also serve as a means for making social impact in urgent situations like disaster relief. From a scientific standpoint, working in Multilingual NLP scenarios also serves as a test-bed for evaluating existing NLP techniques, revealing their strengths or limitations. Nevertheless, progress in Multilingual NLP is hindered by the lack of annotated data in most languages, posing a challenge for traditional supervised learning approaches. I discussed why existing annotation-based or translation-based approaches are ill-equipped to deal with this challenge, and argued why sharing resources developed for popular languages like English is a more viable strategy.

Chapter 2 gave a brief history of representation learning approaches used in NLP. The driving force behind most of these approaches was two hypotheses rooted in the distributional semantics literature. By appealing to these hypotheses in different ways, one can derive word representations directly from raw text, an attractive alternative to manually building lexical semantic resources. I traced the history of different word representation paradigms, starting with cluster-based representations (à la Brown Clustering) to the more efficient and compact vector-based representations. Different variants of popular vector space representations were discussed, laying the groundwork for understanding their cross-lingual counterparts in later chapters. The chapter ended with a discussion of different aspects where these word representations fall short, with issues arising either from modeling assumptions or from limitations of the distributional hypothesis itself. Importantly, I discussed the inability of monolingual representations to capture cross-lingual relationships between words in different languages, making it difficult to share resources across languages.

Figure 24: Summary of contributions made in the thesis. Different chapters demonstrated how cross-lingual representations can help achieve model transfer and sharing (Chapters 3 and 4; Upadhyay et al., ACL 2016; Song et al., IJCAI 2016), aid in lexical semantics tasks either monolingually (Chapter 5; Upadhyay et al., RepL4NLP 2017) or cross-lingually (Chapter 6; Upadhyay et al., NAACL 2018), and facilitate cross-lingual information extraction (Chapters 7 and 8; Upadhyay et al., EMNLP 2018a,b).

The next few chapters were devoted to demonstrating how cross-lingual representations can help alleviate this limitation, thereby enabling sharing and transfer of resources across languages. Figure 24 summarizes the tasks used as evidence to support this claim.

The next chapter described different approaches to learning cross-lingual representations to overcome the above limitation of monolingually trained distributional representations. By viewing them as a continuous approximation of discrete translation dictionaries that can “translate” features used in a model, I motivated why using cross-lingual representations can aid model transfer across languages. The unifying motif of all approaches was the use of some form of cross-lingual alignment — word-level, sentence-level, document-level, or a combination of these — in addition to the monolingual distributional signal. The chapter also compared these approaches on various tasks, both intrinsic and extrinsic, and evaluated the suitability of the different cross-lingual alignments for each.

Traditional distributional representations require lots of data to learn relationships that might be explicitly stated in manually constructed resources like Wikipedia. In Chapter 4, I discussed ESA, a representation learning approach that derives word representations from Wikipedia by appealing to the bag-of-words hypothesis. Then I showed how ESA can be extended by utilizing metadata that connects Wikipedias in different languages to obtain cross-lingual word representations. Unlike the representations in Chapter 2, these cross-lingual representations were interpretable: the dimensions correspond to concepts in Wikipedia to which the word is relevant. Another attractive property of these representations is that they can be used in unsupervised classification paradigms (referred to as “dataless” in Chapter 4), and achieve accuracies comparable to supervised classification.

In Chapter 5, I discussed how these cross-lingual signals can be used to overcome another limitation of traditional word representations — that they cannot model polysemy. I motivated why data-driven approaches that learn multi-sense word representations hold more appeal compared to approaches that use lexical resources, as the latter are inherently limited to the senses listed in those resources. Building on the line of work that exploits translational information to tease apart word senses, I described a data-driven approach to learn sense representations by exploiting multilingual translational information using Bayesian non-parametrics. The experiments showed that the proposed approach yielded performance comparable to a monolingual model trained on more data.

In Chapter 6, I discussed a new lexical semantic task of identifying asymmetric relationships like hypernymy between words in different languages. Building on work in monolingual hypernymy detection, I proposed an unsupervised approach to probe the distributional representations of the words to detect whether the relationship exists. These representations were derived from syntactic contexts, and aligned cross-lingually using a small bilingual dictionary. Experimental evaluations showed that the model performance was robust to various low-resource scenarios, especially when the syntactic contexts were derived using a weak dependency parser, an encouraging result for languages without a dependency treebank. The proposed approach has the potential to complement efforts to semi-automatically construct cross-lingual ontologies such as BabelNet and the Open Multilingual WordNet.

The overarching theme of Chapters 4, 5, and 6 was showing different applications of cross-lingual representations for lexical semantics tasks. Chapters 7 and 8 examined the downstream information extraction task of entity linking in the cross-lingual setting. Chapter 7 motivated the cross-lingual entity linking problem as a means for acquiring knowledge expressed in multilingual text, similar to the function that monolingual entity linking performs for tasks like question answering and knowledge base population. Apart from dealing with the ambiguity and variability of language, in this task one has to deal with the additional challenge of working with limited supervision in the language of interest. To address this challenge, I showed how cross-lingual representations can be used to leverage supervision from multiple languages jointly for the task of cross-lingual entity linking. Another benefit of the proposed approach is that it can train a single model for multiple related languages, making more efficient use of available supervision than previous approaches that trained separate models.

Later, in Chapter 8, I examined a key component of XEL, namely candidate generation, and showed how its coverage for the XEL task can be improved by using transliteration. For this, I described a transliteration generation approach that can operate in low-resource settings using little supervision and an English KB to bootstrap. The key benefit of bootstrapping is that it can “recover” most of the performance lost in the low-resource setting.

9.2. Future Directions

There are several directions in cross-lingual NLP that warrant further research; I briefly describe a few of them below.

Static vs Dynamic Representations. Section 5.8 touched upon a fundamental limitation of static word representations (either monolingual or cross-lingual): they divorce the context and the word’s representation by compiling a fixed representation for the word. Recently, approaches that use language modeling to learn dynamic, context-sensitive word representations, like ELMo (Peters et al., 2018) and BERT (Devlin et al., 2018), have been proposed to overcome this issue in the monolingual setting. Related research questions can be asked in this direction. For instance, how can one extend these dynamic, context-sensitive representations cross-lingually? Are there other auxiliary tasks, similar to language modeling, for learning contextual representations?

Cross-lingual Understanding and Reasoning. One of the fundamental limitations of distributional representations discussed in Chapter 2 was their inability to account for the compositional nature of language, motivating the need to marry formal semantics and distributional approaches. A similar argument can be made for the need for compositional semantics in cross-lingual settings. Developing approaches for learning compositional meaning representations in cross-lingual settings with little or no language-specific annotation is an active area of research. Such approaches can pave the way for multilingual versions of language understanding tasks like semantic parsing (Duong et al., 2017a; Susanto and Lu, 2017; Zou and Lu, 2018; Zhang et al., 2018a,b), which so far have been confined to English.

Code-mixed and Code-switched NLP. In all the cross-lingual tasks studied in this thesis, an implicit assumption was that the input text was in a single language, and said language was known beforehand. However, a speaker often alternates between two or more languages in the same conversation, a phenomenon referred to as code-switching. Indeed, when conversing with another multilingual, a multilingual person can interchange seamlessly between two or more languages, either at the word, phrase, or sentence level. For instance, a person who is fluent in both Hindi and English can utter the sentence “tum mujhe dinner ke liye fifty-five dollars owe karte ho” (which translates to “you owe me fifty-five dollars for dinner”), a mix of Hindi (written in Latin script) and English. In such situations, it is necessary to develop ways to either identify which language’s feature representation to use, or operate at a feature representation that smooths out the differences between the two languages.

Overall, a wide variety of problems remain to be explored: some (such as semantic parsing) are cross-lingual counterparts of traditional NLP problems, while others (such as code-mixing) are unique to the multilingual setting. These uncharted territories are attractive directions for future work in developing truly language-independent approaches, and in finding better ways of leveraging existing supervision available in other languages.

APPENDIX

A.1. Statistical Measures used in the Thesis

Spearman Rank Correlation Coefficient (Spearman, 1904) The Spearman Rank Correlation Coefficient (SRCC) is a way to compare the “similarity” of rankings produced by two different models. For a set of n items, suppose models A and B produce scores such that item i gets a rank r_i according to model A’s scores and rank R_i according to model B’s scores. Define a_{ij} = (r_i − r_j) as the difference in the rank of items i and j according to model A; a similar term b_{ij} = (R_i − R_j) can be defined for model B. Then, the SRCC is

\tau = \frac{\sum_{i,j} a_{ij} b_{ij}}{\sqrt{\sum_{i,j} a_{ij}^2 \, \sum_{i,j} b_{ij}^2}}    (A.1)

The coefficient takes a value between −1 and 1, with 1 indicating complete agreement between the rankings and −1 indicating complete disagreement.
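A minimal Python sketch of Equation A.1, assuming ranks_a and ranks_b list each item’s rank under models A and B; the O(n²) double loop mirrors the definition rather than the usual sorted-difference shortcut.

    def srcc(ranks_a, ranks_b):
        n = len(ranks_a)
        num = den_a = den_b = 0.0
        for i in range(n):
            for j in range(n):
                a = ranks_a[i] - ranks_a[j]   # a_ij
                b = ranks_b[i] - ranks_b[j]   # b_ij
                num += a * b
                den_a += a * a
                den_b += b * b
        return num / (den_a * den_b) ** 0.5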

Steiger’s Test (Steiger, 1980; Williams, 1959) Suppose the SRCC between the ranking produced by model A and the reference ranking is 0.9, while that between the ranking produced by model B and the reference ranking is 0.7. How do we assess whether this difference is statistically significant?

Steiger’s test is used to calculate the statistical significance of the difference between two dependent correlation coefficients, i.e., the correlations between random variables (X, Y) and (X, Z). For the situation described above, X is the reference ranking, and Y and Z are the rankings produced by the models. Suppose r_{pq} is the rank correlation between variables P and Q; then the p-value can be computed as

p = 1 - \mathrm{cdf}(|T_2|, N - 3)    (A.2)

\text{where } T_2 = d \sqrt{\frac{(N-1)(1 + r_{yz})}{2 \frac{N-1}{N-3} |R| + \bar{r}^2 (1 - r_{yz})^3}}    (A.3)

R = \begin{pmatrix} 1 & r_{xy} & r_{xz} \\ r_{xy} & 1 & r_{yz} \\ r_{xz} & r_{yz} & 1 \end{pmatrix}, \qquad d = r_{xy} - r_{xz}, \qquad \bar{r} = \frac{1}{2}(r_{xy} + r_{xz})    (A.4)

where cdf is the cumulative distribution function of the t-distribution, N is the number of items being ranked, and |R| = 1 - r_{xy}^2 - r_{yz}^2 - r_{xz}^2 + 2 r_{xy} r_{yz} r_{xz} is the determinant of the correlation matrix R.
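A minimal sketch of Equations A.2–A.4, assuming SciPy is available for the t-distribution CDF.

    from scipy.stats import t as t_dist

    def steiger_p_value(r_xy, r_xz, r_yz, n):
        # r_xy, r_xz: each model's correlation with the reference ranking X;
        # r_yz: correlation between the two model rankings; n: items ranked.
        d = r_xy - r_xz
        r_bar = 0.5 * (r_xy + r_xz)
        det_R = 1 - r_xy**2 - r_yz**2 - r_xz**2 + 2 * r_xy * r_yz * r_xz
        t2 = d * ((n - 1) * (1 + r_yz)
                  / (2 * (n - 1) / (n - 3) * det_R
                     + r_bar**2 * (1 - r_yz)**3)) ** 0.5
        return 1 - t_dist.cdf(abs(t2), n - 3)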

McNemar’s Test (McNemar, 1947) McNemar’s test is a non-parametric test to determine whether the difference in the accuracies of two classifiers is statistically significant. McNemar’s test considers the number c_{01} of instances mis-classified by classifier 1 but correctly classified by classifier 2, and a similarly defined term c_{10}. Under the null hypothesis both classifiers have the same error rate, i.e., c_{01} = c_{10}. McNemar’s test uses the \chi^2 test for goodness of fit, comparing the observed counts and the expected counts under the null hypothesis, where the \chi^2 statistic is computed as

\chi^2 = \frac{(|c_{01} - c_{10}| - 1)^2}{c_{01} + c_{10}}    (A.5)

where the −1 in the numerator is a correction term to account for the fact that the statistic is discrete while the \chi^2 distribution is continuous.
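A minimal sketch of Equation A.5, with the p-value taken from the chi-squared survival function (one degree of freedom), assuming SciPy.

    from scipy.stats import chi2

    def mcnemar_test(c01, c10):
        # c01: classifier 1 wrong but classifier 2 right; c10: the reverse.
        statistic = (abs(c01 - c10) - 1) ** 2 / (c01 + c10)
        p_value = chi2.sf(statistic, df=1)
        return statistic, p_value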

Fleiss’ Kappa (Fleiss, 1971) A statistical measure of the reliability of inter-annotator agreement when N items are assigned to k classes by n annotators. It generalizes Cohen’s Kappa (Cohen, 1960), which measures the same for exactly 2 annotators. The statistic computes the probability of the observed agreement, discounting for chance agreement:

\kappa = \frac{\bar{P} - \bar{P}_e}{1 - \bar{P}_e}    (A.6)

where \bar{P}_e is the probability of agreement arising out of chance, and \bar{P} is the probability of observed agreement. If n_{ij} is the number of annotators who assigned the i-th item to the j-th class, these quantities can be computed as

\bar{P} = \frac{1}{N} \sum_{i=1}^{N} P_i, \qquad \bar{P}_e = \sum_{j=1}^{k} p_j^2    (A.7)

\text{where } p_j = \frac{1}{Nn} \sum_{i=1}^{N} n_{ij}, \qquad P_i = \frac{1}{n(n-1)} \sum_{j=1}^{k} n_{ij}(n_{ij} - 1)    (A.8)

where p_j is the observed probability of assigning to the j-th class and P_i is the observed probability of the annotators agreeing on the i-th item.
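A minimal sketch of Equations A.6–A.8, where counts[i][j] holds n_ij, the number of annotators assigning item i to class j.

    def fleiss_kappa(counts):
        N, k = len(counts), len(counts[0])
        n = sum(counts[0])  # annotators per item (constant across items)
        # p_j: observed probability of the j-th class (Equation A.8).
        p = [sum(row[j] for row in counts) / float(N * n) for j in range(k)]
        # P_i: per-item agreement (Equation A.8).
        P = [sum(c * (c - 1) for c in row) / float(n * (n - 1)) for row in counts]
        P_bar = sum(P) / N
        P_e = sum(pj ** 2 for pj in p)
        return (P_bar - P_e) / (1 - P_e)  # Equation A.6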

Adjusted Rand Index (Hubert and Arabie, 1985) The Adjusted Rand Index (ARI) is a measure to compare the quality of a predicted clustering X = {X_1, ...} against a reference clustering Y = {Y_1, ...}, where X_i is a subset with a_i items, and Y_j is a subset with b_j items. It is derived from another measure, the Rand Index (RI) (Rand, 1971), by discounting the expected score for a random clustering. If a denotes the number of pairs that are in the same cluster in both X and Y, and d denotes the number of pairs that are in different clusters in both X and Y, then RI is (a + d)/\binom{n}{2}. It can be shown that a + d can be simplified to a linear transformation of e = \sum_{ij} \binom{n_{ij}}{2}, where n_{ij} is the number of objects common to clusters X_i and Y_j (Yeung and Ruzzo, 2001). Using e, ARI can be defined as

\mathrm{ARI} = \frac{e - \tilde{e}}{e_{\max} - \tilde{e}}    (A.9)

where \tilde{e} is the expected value of e for a random clustering, e_{\max} = \frac{1}{2}\left[\sum_i \binom{a_i}{2} + \sum_j \binom{b_j}{2}\right] is the maximum possible value of e, and n is the number of items being clustered. By assuming the hyper-geometric model, Hubert and Arabie (1985) showed that \tilde{e} = \left[\sum_i \binom{a_i}{2}\right]\left[\sum_j \binom{b_j}{2}\right] / \binom{n}{2}. ARI has a maximum value of 1 (for perfect agreement) and an expected value of 0; unlike RI, which has a non-zero expected value, this makes ARI a more sensitive index.
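A minimal sketch of Equation A.9, computed from the contingency table n_ij between the two clusterings.

    from math import comb

    def adjusted_rand_index(table):
        # table[i][j] = n_ij, objects common to clusters X_i and Y_j.
        a = [sum(row) for row in table]          # cluster sizes a_i in X
        b = [sum(col) for col in zip(*table)]    # cluster sizes b_j in Y
        n = sum(a)
        e = sum(comb(nij, 2) for row in table for nij in row)
        sum_a = sum(comb(ai, 2) for ai in a)
        sum_b = sum(comb(bj, 2) for bj in b)
        e_tilde = sum_a * sum_b / comb(n, 2)
        e_max = (sum_a + sum_b) / 2
        return (e - e_tilde) / (e_max - e_tilde)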

A.2. BiSparse — Bilingual Sparse Coding

For two languages with vocabularies v_e and v_f, and monolingual dependency embeddings X_e and X_f, BiSparse solves the following objective:

\operatorname*{argmin}_{A_e, D_e, A_f, D_f} \; \sum_{i=1}^{v_e} \frac{1}{2} \| A_{e_i} D_e^T - X_{e_i} \|_2^2 + \lambda_e \| A_{e_i} \|_1
\; + \; \sum_{j=1}^{v_f} \frac{1}{2} \| A_{f_j} D_f^T - X_{f_j} \|_2^2 + \lambda_f \| A_{f_j} \|_1
\; + \; \frac{1}{2} \lambda_x \sum_{i,j} S_{ij} \| A_{e_i} - A_{f_j} \|_2^2    (A.10)

\text{s.t. } A_k \ge 0, \quad \| D_{k_i} \|_2^2 \le 1, \quad k \in \{e, f\}

The first two rows and the constraints in Equation A.10 encourage sparsity and non-negativity in the embeddings, by solving a sparse coding problem where (D_e, D_f) represent the dictionary matrices and (A_e, A_f) the final sparse representations. The third row imposes bilingual constraints, weighted by the regularizer \lambda_x, so that words that are strongly aligned according to S have similar representations. The above optimization problem can be solved using a proximal gradient method such as FASTA (Goldstein et al., 2014).
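As a minimal NumPy sketch, the objective in Equation A.10 can be evaluated as follows for candidate matrices (A_e, D_e, A_f, D_f); actually minimizing it (and enforcing the constraints) would require a proximal solver such as FASTA, which is not sketched here.

    import numpy as np

    def bisparse_objective(A_e, D_e, X_e, A_f, D_f, X_f, S, lam_e, lam_f, lam_x):
        # Monolingual sparse-coding terms (reconstruction + L1 sparsity).
        obj = 0.5 * np.sum((A_e @ D_e.T - X_e) ** 2) + lam_e * np.abs(A_e).sum()
        obj += 0.5 * np.sum((A_f @ D_f.T - X_f) ** 2) + lam_f * np.abs(A_f).sum()
        # Bilingual term: strongly aligned word pairs get similar codes.
        for i in range(A_e.shape[0]):
            for j in range(A_f.shape[0]):
                obj += 0.5 * lam_x * S[i, j] * np.sum((A_e[i] - A_f[j]) ** 2)
        return obj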

A.3. Qualitative Analysis for Multi-sense Embeddings

Figure 25 shows PCA plots for 11 sense vectors for 9 words using the monolingual, bilingual, and multilingual models. With monolingual training (Fig. 25a), the model infers two senses for bank (denoted by bank_1 and bank_2), but both are close to financial terms, suggesting their distinction was not recognized. The same observation holds for the senses of apple. With bilingual training (Fig. 25b), the model infers the two senses of bank correctly, and the two senses of apple become more distant. The model can still improve, e.g., by pulling interest towards the financial sense of bank, and pulling itunes towards apple_2. Finally, with multilingual training (Fig. 25c), all senses of the words are more clearly clustered — the senses of apple, interest, and bank are well separated, and are close to sense-specific words.

[Figure 25 panels: (a) Monolingual (English); (b) Bilingual (English-Chinese); (c) Multilingual (English-{French, Chinese}).]

Figure 25: PCA plots for the vectors for {apple, bank, interest, itunes, potato, west, monetary, desire} with multiple sense vectors for apple (apple_1 and apple_2), interest (interest_1 and interest_2) and bank (bank_1 and bank_2), obtained using monolingual (25a), bilingual (25b) and multilingual (25c) training.

BIBLIOGRAPHY

Ž. Agić, D. Hovy, and A. Søgaard. If all you have is a bit of the Bible: Learning POS taggers for Truly Low-resource Languages. In Proceedings of ACL-IJCNLP, volume 2, pages 268–272, 2015.

R. Aharoni and Y. Goldberg. Morphological Inflection Generation with Hard Monotonic Attention. In Proc. of the Annual Meeting of the Association of Computational Linguistics (ACL), July 2017.

W. Ammar, C. Dyer, and N. A. Smith. Transliteration by Sequence Labeling with Lattice Encodings and Reranking. In Proceedings of the 4th Named Entity Workshop, pages 66–70. Association for Computational Linguistics, 2012.

W. Ammar, G. Mulcaire, M. Ballesteros, C. Dyer, and N. Smith. Many Languages, One Parser. Transactions of the Association for Computational Linguistics, 4:431–444, 2016a.

W. Ammar, G. Mulcaire, Y. Tsvetkov, G. Lample, C. Dyer, and N. A. Smith. Massively Multilingual Word Embeddings. arXiv preprint arXiv:1602.01925, 2016b.

S. Arora, Y. Li, Y. Liang, T. Ma, and A. Risteski. Linear Algebraic Structure of Word Meanings, with Applications to Polysemy. TACL, pages 1–25, 2018.

M. Artetxe, G. Labaka, and E. Agirre. Learning Bilingual Word Embeddings with (almost) No Bilingual Data. In Proc. of the Annual Meeting of the Association of Computational Linguistics (ACL), pages 451–462, Vancouver, Canada, July 2017. Association for Computational Linguistics.

S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives. DBpedia: A Nucleus for a Web of Open Data. The Semantic Web, pages 722–735, 2007.

L. Aufrant, G. Wisniewski, and F. Yvon. Zero-resource Dependency Parsing: Boosting Delexicalized Cross-lingual Transfer with Linguistic Knowledge. In Proc. the International Conference on Computational Linguistics (COLING), 2016.

D. Bahdanau, K. Cho, and Y. Bengio. Neural Machine Translation by Jointly Learning to Align and Translate. In Proc. of the International Conference on Learning Representations (ICLR), 2015.

C. F. Baker, C. J. Fillmore, and J. B. Lowe. The Berkeley FrameNet Project. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, COLING-ACL 1998, pages 86–90, 1998.

C. Bannard and C. Callison-Burch. Paraphrasing with Bilingual Parallel Corpora. In Proc. of the Annual Meeting of the Association of Computational Linguistics (ACL), pages 597–604, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics.

M. Bansal, J. DeNero, and D. Lin. Unsupervised Translation Sense Clustering. In Proc. of the Annual Meeting of the North American Association of Computational Linguistics (NAACL), 2012.

M. Bansal, K. Gimpel, and K. Livescu. Tailoring Continuous Word Representations for Dependency Parsing. In Proc. of the Annual Meeting of the Association of Computational Linguistics (ACL), 2014.

M. Baroni and A. Lenci. Distributional Memory: A General Framework for Corpus-based Semantics. Computational Linguistics, 36(4):673–721, Dec. 2010. ISSN 0891-2017.

M. Baroni and A. Lenci. How we BLESSed Distributional Semantic Evaluation. In Proc. of the GEMS Workshop, 2011.

M. Baroni, S. Bernardini, A. Ferraresi, and E. Zanchetta. The WaCky Wide Web: A Collection of Very Large Linguistically Processed Web-Crawled Corpora. Proc. of the International Conference on Language Resources and Evaluation (LREC), 2009.

M. Baroni, R. Bernardi, N.-Q. Do, and C.-c. Shan. Entailment Above the Word Level in Distributional Semantics. In Proc. of the Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2012.

M. Baroni, G. Dinu, and G. Kruszewski. Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proc. of the Annual Meeting of the Association of Computational Linguistics (ACL), volume 1, pages 238–247, 2014.

S. Bartunov, D. Kondrashkin, A. Osokin, and D. Vetrov. Breaking Sticks and Ambiguities with Adaptive Skip-gram. Proc. of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2016.

I. Beltagy, S. Roller, P. Cheng, K. Erk, and R. J. Mooney. Representing Meaning with a Combination of Logical and Distributional Models. Computational Linguistics, 42(4): 763–808, 2016.

E. M. Bender. Linguistically naïve != language independent: Why NLP needs linguistic typology. In Proceedings of the EACL 2009 Workshop on the Interaction between Linguistics and Computational Linguistics: Virtuous, Vicious or Vacuous?, pages 26–32. Association for Computational Linguistics, 2009.

E. M. Bender. On Achieving and Evaluating Language-Independence in NLP. Linguistic Issues in Language Technology, 6(3):1–26, 2011.

Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin. A Neural Probabilistic Language Model. Journal of Machine Learning Research (JMLR), 3:1137–1155, 2003.

Y. Bengio, A. C. Courville, and P. Vincent. Representation Learning: A Review and New Perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35:1798–1828, 2013.

A. Bennett, T. Baldwin, J. H. Lau, D. McCarthy, and F. Bond. LexSemTM: A Semantic Dataset Based on All-words Unsupervised Sense Distribution Learning. In Proc. of the Annual Meeting of the Association of Computational Linguistics (ACL), 2016.

P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov. Enriching Word Vectors with Subword Information. Transactions of the Association for Computational Linguistics (TACL), 5, 2017.

K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data. ACM, 2008.

F. Bond and R. Foster. Linking and Extending an Open Multilingual Wordnet. In Proc. of the Annual Meeting of the Association of Computational Linguistics (ACL), 2013.

G. Bouma. Normalized (Pointwise) Mutual Information in Collocation Extraction. Proceedings of GSCL, pages 31–40, 2009.

P. Brown, S. D. Pietra, V. D. Pietra, and R. Mercer. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, 1993.

P. F. Brown, V. J. D. Pietra, P. V. de Souza, J. C. Lai, and R. L. Mercer. Class-Based n-gram Models of Natural Language. Computational Linguistics, 18(4):467–479, 1992.

J. A. Bullinaria and J. P. Levy. Extracting semantic representations from word co-occurrence statistics: A computational study. Behavior Research Methods, pages 510–526, 2007.

R. Bunescu and M. Paşca. Using Encyclopedic Knowledge for Named Entity Disambiguation. In Proc. of the Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2006.

C. Callison-Burch. Paraphrasing and Translation. PhD thesis, University of Edinburgh, 2007.

C. Callison-Burch, P. Koehn, C. Monz, and O. F. Zaidan. Findings of the 2011 Workshop on Statistical Machine Translation. In WMT Shared Task, 2011.

J. Camacho-Collados and M. T. Pilehvar. From Word to Sense Embeddings: A Survey on Vector Representations of Meaning. CoRR, abs/1805.04032, 2018.

J. Camacho-Collados, M. T. Pilehvar, and R. Navigli. NASARI: Integrating Explicit Knowledge and Corpus Statistics for A Multilingual Representation of Concepts and Entities. Artificial Intelligence, 240:36–64, 2016.

V. I. S. Carmona and S. Riedel. How Well Can We Predict Hypernyms from Word Embeddings? A Dataset-Centric Analysis. Proc. of the Conference of the European Chapter of the Association for Computational Linguistics (EACL), page 401, 2017.

M. Carpuat, Y. Vyas, and X. Niu. Detecting Cross-Lingual Semantic Divergence for Neural Machine Translation. In Proceedings of the First Workshop on Neural Machine Translation, pages 69–79, Vancouver, August 2017. Association for Computational Linguistics.

R. Caruana. Multitask Learning. In Learning to Learn, pages 95–133. Springer, 1998.

A. Cauchy. Méthode générale pour la résolution des systèmes d'équations simultanées. Comp. Rend. Sci. Paris, 25(1847):536–538, 1847.

M.-W. Chang, L. Ratinov, and D. Roth. Guiding Semi-Supervision with Constraint-Driven Learning. In Proc. of the Annual Meeting of the Association for Computational Linguistics (ACL). Association for Computational Linguistics, 2007.

M.-W. Chang, L. Ratinov, D. Roth, and V. Srikumar. Importance of Semantic Representation: Dataless Classification. In Proc. of the Conference on Artificial Intelligence (AAAI), September 2008.

M.-W. Chang, D. Goldwasser, D. Roth, and Y. Tu. Unsupervised Constraint Driven Learning For Transliteration Discovery. In Proc. of the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2009.

M.-W. Chang, L. Ratinov, and D. Roth. Structured Learning with Constrained Conditional Models. Machine Learning, 88(3):399–431, June 2012.

H.-H. Chen, W.-C. Lin, C. Yang, and W.-H. Lin. Translating-Transliterating Named Entities for Multilingual Information Access. Journal of the Association for Information Science and Technology, 57(5), 2006.

X. Cheng and D. Roth. Relational Inference for Wikification. In Proc. of the Conference on Empirical Methods for Natural Language Processing (EMNLP), 2013.

E. Chersoni, E. Santus, A. Lenci, P. Blache, and C.-R. Huang. Representing Verbs with Rich Contexts: an Evaluation on Verb Similarity. In Proc. of the Conference on Empirical Methods for Natural Language Processing (EMNLP), 2016.

T. Chklovski and P. Pantel. VerbOcean: Mining the Web for Fine-grained Semantic Verb Relations. In Proc. of the Conference on Empirical Methods for Natural Language Processing (EMNLP), 2004.

K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. In Proc. of the Conference on Empirical Methods for Natural Language Processing (EMNLP), Oct. 2014.

J. D. Choi, J. R. Tetreault, and A. Stent. It Depends: Dependency Parser Comparison Using A Web-based Evaluation Tool. In Proc. of the Annual Meeting of the Association of Computational Linguistics (ACL), 2015.

C. Christodouloupoulos and M. Steedman. A Massively Parallel Corpus: The Bible in 100 Languages. In Proc. of the International Conference on Language Resources and Evaluation (LREC), pages 375–395, 2015.

K. W. Church and P. Hanks. Word Association Norms, Mutual Information, and Lexicography. Computational Linguistics, 16(1):22–29, Mar. 1990. ISSN 0891-2017.

D. Clarke. Context-theoretic Semantics for Natural Language: an Overview. In Proceedings of the Workshop on Geometrical Models of Natural Language Semantics. Association for Computational Linguistics, 2009.

J. Cohen. A Coefficient of Agreement for Nominal Scales. Educational and Psychological Measurement, 20(1):37–46, 1960.

R. Collobert and J. Weston. A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning. In Proc. of ICML, 2008.

R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. P. Kuksa. Natural Language Processing (Almost) from Scratch. Journal of Machine Learning Research, 12: 2493–2537, 2011.

R. Cotterell, C. Kirov, J. Sylak-Glassman, D. Yarowsky, J. Eisner, and M. Hulden. The SIGMORPHON 2016 Shared Task—Morphological Reinflection. In Proc. of the 14th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, 2016.

R. Cotterell, E. Vylomova, H. Khayrallah, C. Kirov, and D. Yarowsky. Paradigm Completion for Derivational Morphology. In Proc. of the Conference on Empirical Methods for Natural Language Processing (EMNLP), Sept. 2017.

J. Coulmance, J.-M. Marty, G. Wenzek, and A. Benhalloum. Trans-gram, Fast Cross-lingual Word-embeddings. In Proc. of the Conference on Empirical Methods for Natural Language Processing (EMNLP), 2015.

S. Cucerzan. Large-scale Named Entity Disambiguation based on Wikipedia Data. In Proceedings of EMNLP-CoNLL, 2007.

I. Dagan and A. Itai. Word Sense Disambiguation using a Second Language Monolingual Corpus. Computational Linguistics, 1994.

I. Dagan, O. Glickman, and B. Magnini. The PASCAL Recognising Textual Entailment Challenge. In MLCW, 2005.

I. Dagan, D. Roth, M. Sammons, and F. M. Zanzoto. Recognizing Textual Entailment: Models and Applications. Synthesis Lectures on Human Language Technologies, 2013.

K. Darwish. Named Entity Recognition using Cross-lingual Resources: Arabic as an Example. In Proc. of the Annual Meeting of the Association of Computational Linguistics (ACL), 2013.

G. de Melo and G. Weikum. Towards a Universal Wordnet by Learning from Combined Evidence. In Proc. of the ACM Conference on Information and Knowledge Management (CIKM), 2009.

S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman. Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 1990.

T. Demeester, T. Rocktäschel, and S. Riedel. Lifted Rule Injection for Relation Embeddings. In Proc. of the Conference on Empirical Methods for Natural Language Processing (EMNLP), November 2016.

J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805, 2018.

P. Dhillon, J. Rodu, D. Foster, and L. Ungar. Two step CCA: A New Spectral Method for Estimating Vector Models of Words. In Proc. of the International Conference on Machine Learning (ICML), 2012.

P. S. Dhillon, D. P. Foster, and L. H. Ungar. Eigenwords: Spectral Word Embeddings. Journal of Machine Learning Research (JMLR), 16:3035–3078, 2015.

M. Diab and P. Resnik. An Unsupervised Method for Word Sense Tagging using Parallel Corpora. In Proc. of the Annual Meeting of the Association of Computational Linguistics (ACL), 2002.

X. Duan, R. E. Banchs, M. Zhang, H. Li, and A. Kumaran, editors. Proc. of the Fifth Named Entity Workshop. Association for Computational Linguistics, July 2015.

L. Duong, H. Afshar, D. Estival, G. Pink, P. Cohen, and M. Johnson. Multilingual Semantic Parsing And Code-Switching. In Proc. of the Annual Conference on Computational Natural Language Learning (CoNLL), August 2017a.

L. Duong, H. Kanayama, T. Ma, S. Bird, and T. Cohn. Multilingual Training of Crosslingual Word Embeddings. In Proc. of the Conference of the European Chapter of the Association for Computational Linguistics (EACL), pages 894–904, Valencia, Spain, Apr. 2017b. Association for Computational Linguistics.

G. Durrett and D. Klein. A Joint Model for Entity Analysis: Coreference, Typing, and Linking. Transactions of the Association for Computational Linguistics, 2:477–490, 2014.

C. Dyer. Notes on Noise Contrastive Estimation and Negative Sampling. arXiv preprint arXiv:1410.8251, 2014.

C. Dyer, V. Chahuneau, and N. A. Smith. A Simple, Fast, and Effective Reparameterization of IBM Model 2. In Proc. of the Annual Meeting of the North American Association of Computational Linguistics (NAACL), 2013.

C. Eckart and G. Young. The Approximation of One Matrix by Another of Lower Rank. Psychometrika, 1(3):211–218, 1936.

A. Eisele and Y. Chen. MultiUN: A Multilingual Corpus from United Nation Documents. In Proc. of the International Conference on Language Resources and Evaluation (LREC), 2010.

J. L. Elman. Finding Structure in Time. Cognitive Science, 14(2):179–211, 1990.

O. Etzioni, M. Cafarella, D. Downey, A.-M. Popescu, T. Shaked, S. Soderland, D. S. Weld, and A. Yates. Unsupervised Named-Entity Extraction from the Web: An Experimental Study. Artificial Intelligence, 165(1):91–134, 2005.

Y. Even-Zohar and D. Roth. A classification approach to word prediction. In Proc. of the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), pages 124–131, 2000.

Y. Even-Zohar, D. Roth, and D. Zelenko. Word clustering via classification. In Proc. of the Bar-Ilan Symposium on Foundations of Artificial Intelligence (BISFAI), Bar-Ilan, Israel, 1999.

S. Evert. The Statistics of Word Cooccurrences: Word Pairs and Collocations. PhD thesis, University of Stuttgart, 2005.

S. Evert. Corpora and Collocations. Corpus linguistics. An International Handbook, 2: 1212–1248, 2008.

M. Faruqui and C. Dyer. Improving Vector Space Word Representations Using Multilingual Correlation. In Proc. of the Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2014.

M. Faruqui, J. Dodge, S. K. Jauhar, C. Dyer, E. Hovy, and N. A. Smith. Retrofitting Word Vectors to Semantic Lexicons. In Proc. of the Annual Meeting of the North American Association of Computational Linguistics (NAACL), 2015a.

M. Faruqui, Y. Tsvetkov, D. Yogatama, C. Dyer, and N. A. Smith. Sparse Overcomplete Word Vector Representations. In Proceedings of ACL-IJCNLP, July 2015b.

M. Faruqui, Y. Tsvetkov, G. Neubig, and C. Dyer. Morphological Inflection Generation Using Character Sequence to Sequence Learning. In Proc. of the Annual Meeting of the North American Association of Computational Linguistics (NAACL), June 2016.

T. S. Ferguson. A Bayesian Analysis of Some Nonparametric Problems. The Annals of Statistics, 1973.

A. Finch, L. Liu, X. Wang, and E. Sumita. Neural Network Transduction Models in Transliteration Generation. In Proc. of the Fifth Named Entity Workshop, 2015.

A. Finch, L. Liu, X. Wang, and E. Sumita. Target-Bidirectional Neural Models for Machine Transliteration. In Proc. of the Sixth Named Entity Workshop, 2016.

J. R. Firth. The Technique of Semantics. Transactions of the Philological Society, 34(1): 36–73, 1935.

J. R. Firth. A Synopsis of Linguistic Theory 1930-1955. Studies in Linguistic Analysis, pages 1–32, 1957.

J. L. Fleiss. Measuring Nominal Scale Agreement among Many Raters. Psychological Bulletin, 76(5):378, 1971.

M. Francis-Landau, G. Durrett, and D. Klein. Capturing Semantic Similarity for Entity Linking with Convolutional Neural Networks. In Proc. of the Annual Meeting of the North American Association of Computational Linguistics (NAACL), 2016.

D. Freitag. Trained Named Entity Recognition using Distributional Clusters. In Proc. of the Conference on Empirical Methods for Natural Language Processing (EMNLP), 2004.

Y. Freund and R. E. Schapire. Large Margin Classification Using the Perceptron Algorithm. Machine Learning, 37(3):277–296, 1999.

R. Fu, J. Guo, B. Qin, W. Che, H. Wang, and T. Liu. Learning Semantic Hierarchies via Word Embeddings. In Proc. of the Annual Meeting of the Association of Computational Linguistics (ACL), 2014.

G. W. Furnas, T. K. Landauer, L. M. Gomez, and S. T. Dumais. Statistical semantics: Analysis of the potential performance of key-word information systems. The Bell System Technical Journal, 62(6):1753–1806, 1983.

E. Gabrilovich and S. Markovitch. Wikipedia-based Semantic Interpretation for Natural Language Processing. Journal of AI Research (JAIR), 34(2):443, 2009.

O.-E. Ganea and T. Hofmann. Deep Joint Entity Disambiguation with Local Neural Attention. In Proc. of the Conference on Empirical Methods for Natural Language Processing (EMNLP), 2017.

S. Ganesh, S. Harsha, P. Pingali, and V. Verma. Statistical Transliteration for Cross Language Information Retrieval using HMM Alignment Model and CRF. In Proceedings of the 2nd Workshop on Cross Lingual Information Access, 2008.

J. Ganitkevitch, B. Van Durme, and C. Callison-Burch. PPDB: The Paraphrase Database. In Proc. of the Annual Meeting of the North American Association of Computational Linguistics (NAACL), 2013.

M. Gardner, K. Huang, E. Papalexakis, X. Fu, P. Talukdar, C. Faloutsos, N. Sidiropoulos, and T. Mitchell. Translation Invariant Word Embeddings. In Proc. of the Conference on Empirical Methods for Natural Language Processing (EMNLP), 2016.

D. Garrette, K. Erk, and R. Mooney. Integrating Logical Representations with Probabilistic Information using Markov Logic. In Proc. of the International Conference on Computational Semantics (IWCS), pages 105–114. Association for Computational Linguistics, 2011.

M. Geffet and I. Dagan. The Distributional Inclusion Hypotheses and Lexical Entailment. In Proc. of the Annual Meeting of the Association of Computational Linguistics (ACL), 2005.

J. Giles. Internet Encyclopaedias Go Head to Head, 2005.

R. Girju, A. Badulescu, and D. I. Moldovan. Automatic Discovery of Part-Whole Relations. Computational Linguistics, 32:83–135, 2006.

A. Globerson, N. Lazic, S. Chakrabarti, A. Subramanya, M. Ringaard, and F. Pereira. Collective Entity Resolution with Multi-Focal Attention. In Proc. of the Annual Meeting of the Association of Computational Linguistics (ACL), 2016.

Y. Goldberg. Neural Network Methods for Natural Language Processing. Synthesis Lectures on Human Language Technologies, 10(1):1–309, 2017.

T. Goldstein, C. Studer, and R. Baraniuk. A Field Guide to Forward-Backward Splitting with a FASTA Implementation. arXiv eprint, abs/1411.3, 2014.

G. Golub and W. Kahan. Calculating the Singular Values and Pseudo-inverse of a Matrix. Journal of the SIAM, 1965.

I. Goto, N. Kato, N. Uratani, and T. Ehara. Transliteration Considering Context Information based on the Maximum Entropy Method. In Proc. of MT-Summit IX, volume 125132, 2003.

T. Gottron, M. Anderka, and B. Stein. Insights into Explicit Semantic Analysis. In Proc. of the ACM Conference on Information and Knowledge Management (CIKM), pages 1961–1964. ACM, 2011.

S. Gouws, Y. Bengio, and G. Corrado. Bilbowa: Fast Bilingual Distributed Representations without Word Alignments. In Proc. of the International Conference on Machine Learning (ICML), 2015.

D. Graff. Arabic Gigaword 3rd Edition, LDC2003T40, 2007.

E. Grefenstette. Towards a Formal Distributional Semantics: Simulating Logical Calculi with Tensors. In Proc. of the Joint Conference on Lexical and Computational Semantics (*SEM), June 2013.

E. Grefenstette and M. Sadrzadeh. Experimental Support for a Categorical Compositional Distributional Model of Meaning. In Proc. of the Conference on Empirical Methods for Natural Language Processing (EMNLP), 2011.

G. Grefenstette. Finding Semantic Similarity in Raw Text: The Deese Antonyms. In Fall Symposium Series, Working Notes, Probabilistic Approaches to Natural Language, pages 61–65, 1992.

J. Guo, W. Che, H. Wang, and T. Liu. Learning Sense-specific Word Embeddings By Exploiting Bilingual Resources. In Proc. the International Conference on Computational Linguistics (COLING), 2014.

J. Guo, W. Che, D. Yarowsky, H. Wang, and T. Liu. Cross-lingual Dependency Parsing Based on Distributed Representations. In Proc. of the Annual Meeting of the Association of Computational Linguistics (ACL), 2015.

N. Gupta, S. Singh, and D. Roth. Entity Linking via Joint Encoding of Types, Descriptions, and Context. In Proc. of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2017.

M. U. Gutmann and A. Hyvärinen. Noise-Contrastive Estimation of Unnormalized Statistical Models, with Applications to Natural Image Statistics. Journal of Machine Learning Research (JMLR), 13(Feb):307–361, 2012.

B. Hachey, W. Radford, J. Nothman, M. Honnibal, and J. R. Curran. Evaluating Entity Linking with Wikipedia. Artificial Intelligence, 194:130–150, 2013.

L. Haizhou, Z. Min, and S. Jian. A Joint Source-Channel Model for Machine Transliteration. In Proc. of the Annual Meeting of the Association of Computational Linguistics (ACL), 2004.

H. Hajishirzi, L. Zilles, D. S. Weld, and L. Zettlemoyer. Joint Coreference Resolution and Named-Entity Linking with Multi-pass Sieves. In Proc. of the Conference on Empirical Methods for Natural Language Processing (EMNLP), 2013.

D. R. Hardoon, S. Szedmak, and J. Shawe-Taylor. Canonical Correlation Analysis: An Overview with Application to Learning Methods. Neural Computation, 16(12):2639–2664, 2004.

M. A. Hearst. Automatic Acquisition of Hyponyms from Large Text Corpora. In Proc. of the Annual Conference on Computational Natural Language Learning (CoNLL), 1992.

K. M. Hermann and P. Blunsom. Multilingual Models for Compositional Distributional Semantics. In Proc. of the Annual Meeting of the Association of Computational Linguistics (ACL), 2014a.

K. M. Hermann and P. Blunsom. Multilingual Distributed Representations without Word Alignment. In Proc. of the International Conference on Learning Representations (ICLR), 2014b.

U. Hermjakob, K. Knight, and H. Daumé III. Name Translation in Statistical Machine Translation - Learning When to Transliterate. Proc. of the Annual Meeting of the Association of Computational Linguistics (ACL), 2008.

F. Hill, R. Reichart, and A. Korhonen. Simlex-999: Evaluating Semantic Models with (Genuine) Similarity Estimation. arXiv preprint arXiv:1408.3456, 2014.

J. Hoffart, M. A. Yosef, I. Bordino, H. Fürstenau, M. Pinkal, M. Spaniol, B. Taneva, S. Thater, and G. Weikum. Robust Disambiguation of Named Entities in Text. In Proc. of the Conference on Empirical Methods for Natural Language Processing (EMNLP), pages 782–792. Association for Computational Linguistics, 2011.

J. Hoffart, F. M. Suchanek, K. Berberich, and G. Weikum. Yago2: A spatially and temporally enhanced knowledge base from Wikipedia (extended abstract). In Proc. of the International Joint Conference on Artificial Intelligence (IJCAI), 2013.

M. D. Hoffman, D. M. Blei, C. Wang, and J. W. Paisley. Stochastic Variational Inference. Journal of Machine Learning Research (JMLR), 2013.

H. Hotelling. Relations Between Two Sets of Variates. Biometrika, 28(3/4):321–377, 1936. ISSN 00063444.

E. H. Huang, R. Socher, C. D. Manning, and A. Y. Ng. Improving Word Representations via Global Context and Multiple Word Prototypes. In Proc. of the Annual Meeting of the Association of Computational Linguistics (ACL), 2012.

L. Hubert and P. Arabie. Comparing Partitions. Journal of Classification, 1985.

R. Hwa, P. Resnik, A. Weinberg, C. Cabezas, and O. Kolak. Bootstrapping Parsers via Syntactic Projection across Parallel Texts. Journal of Natural Language Engineering (NLE), 11:11–311, 2005.

A. Irvine. Using Comparable Corpora to Augment Low Resource SMT Models. PhD thesis, Johns Hopkins University, 2014.

A. Irvine and C. Callison-Burch. A Comprehensive Analysis of Bilingual Lexicon Induction. Computational Linguistics, 43(2):273–310, 2017.

A. Irvine, C. Callison-Burch, and A. Klementiev. Transliterating from All Languages. In Proc. of the Conference of the Association for Machine Translation in the Americas (AMTA), 2010.

N. A. Jaleel and L. S. Larkey. Statistical Transliteration for English-Arabic Cross Language Information Retrieval. In Proc. of the ACM Conference on Information and Knowledge Management (CIKM), 2003.

S. K. Jauhar, C. Dyer, and E. Hovy. Ontologically Grounded Multi-sense Representation Learning for Semantic Vector Space Models. In Proc. of the Annual Meeting of the North American Association of Computational Linguistics (NAACL), 2015.

K. S. Jeong, S.-H. Myaeng, J. S. Lee, and K.-S. Choi. Automatic Identification and Back-Transliteration of Foreign Words for Information Retrieval. Information Processing & Management, 35(4):523–540, 1999.

H. Ji and R. Grishman. Knowledge Base Population: Successful Approaches and Challenges. In Proc. of the Annual Meeting of the Association of Computational Linguistics (ACL), pages 1148–1158. Association for Computational Linguistics, 2011.

H. Ji, J. Nothman, B. Hachey, and R. Florian. Overview of TAC-KBP2015 Tri-lingual Entity Discovery and Linking. In Text Analysis Conference (TAC), 2015.

S. Jiampojamarn, G. Kondrak, and T. Sherif. Applying Many-to-Many Alignments and Hidden Markov Models to Letter-to-Phoneme Conversion. In Proc. of the Annual Meeting of the North American Association of Computational Linguistics (NAACL), Apr. 2007.

S. Jiampojamarn, A. Bhargava, Q. Dou, K. Dwyer, and G. Kondrak. DirecTL: A Language-Independent Approach to Transliteration. In Proc. of the 2009 Named Entities Workshop: Shared Task on Transliteration, 2009.

S. Jiampojamarn, K. Dwyer, S. Bergsma, A. Bhargava, Q. Dou, M.-Y. Kim, and G. Kondrak. Transliteration Generation and Mining with Limited Training Resources. In Proc. of the 2010 Named Entities Workshop, 2010.

C. Jiang, H.-F. Yu, C.-J. Hsieh, and K.-W. Chang. Learning Word Embeddings for Low-Resource Languages by PU Learning. In Proc. of the Annual Meeting of the North American Association of Computational Linguistics (NAACL), June 2018.

N. Kaji and M. Kitsuregawa. Using Hidden Markov Random Fields to Combine Distributional and Pattern-Based Word Clustering. In Proc. the International Conference on Computational Linguistics (COLING), August 2008.

S. Karimi, F. Scholer, and A. Turpin. Machine Transliteration Survey. ACM Computing Surveys, 2011. doi: 10.1145/1922649.1922654.

D. Kartsaklis and M. Sadrzadeh. Distributional Inclusion Hypothesis for Tensor-based Composition. In Proc. the International Conference on Computational Linguistics (COLING), Osaka, Japan, December 2016.

K. Kawakami and C. Dyer. Learning to Represent Words in Context with Multilingual Supervision. Proc. of ICLR Workshop, 2015.

A. Kendall, Y. Gal, and R. Cipolla. Multi-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

M. A. Khalid, V. Jijkoun, and M. De Rijke. The Impact of Named Entity Normalization on Information Retrieval for Question Answering. In Proc. of the European Conference on Information Retrieval (ECIR), pages 705–710. Springer, 2008.

A. Kilgarriff. I don’t believe in word senses. Computers and the Humanities, 1997.

S. Kim, K. Toutanova, and H. Yu. Multilingual Named Entity Recognition using Parallel Data and Metadata from Wikipedia. In Proc. of the Annual Meeting of the Association of Computational Linguistics (ACL), 2012.

D. Kingma and J. Ba. Adam: A Method for Stochastic Optimization. In Proc. of the International Conference on Learning Representations (ICLR), 2014.

A. Klementiev and D. Roth. Named Entity Transliteration and Discovery in Multilingual Corpora. In Learning Machine Translation. MIT Press, 2008.

A. Klementiev, I. Titov, and B. Bhattarai. Inducing Crosslingual Distributed Representations of Words. In Proc. of the International Conference on Computational Linguistics (COLING), 2012.

K. Knight and J. Graehl. Machine Transliteration. Computational Linguistics, 1998.

T. Kočiský, K. M. Hermann, and P. Blunsom. Learning Bilingual Word Representations by Marginalizing Alignments. In Proc. of the Annual Meeting of the Association of Computational Linguistics (ACL), pages 224–229. Association for Computational Linguistics, 2014.

P. Koehn. Europarl: A Parallel Corpus for Statistical Machine Translation. In Proc. of MT Summit, 2005.

P. Koehn and R. Knowles. Six Challenges for Neural Machine Translation. In Proceedings of the First Workshop on Neural Machine Translation, pages 28–39. Association for Computational Linguistics, 2017.

P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, et al. Moses: Open Source Toolkit for Statistical Machine Translation. In Proceedings of the 45th annual meeting of the ACL on interactive poster and demonstration sessions, pages 177–180. Association for Computational Linguistics, 2007.

P. Kolachina, N. Cancedda, M. Dymetman, and S. Venkatapathy. Prediction of Learning Curves in Machine Translation. In Proc. of the Annual Meeting of the Association of Computational Linguistics (ACL), pages 22–30, Jeju Island, Korea, July 2012. Association for Computational Linguistics.

N. Kolitsas, O.-E. Ganea, and T. Hofmann. End-to-End Neural Entity Linking. In Proc. of the Annual Conference on Computational Natural Language Learning (CoNLL), pages 519–529. Association for Computational Linguistics, 2018.

T. Koo, X. Carreras, and M. Collins. Simple Semi-supervised Dependency Parsing. In Proc. of the Annual Meeting of the Association of Computational Linguistics (ACL), 2008.

L. Kotlerman, I. Dagan, I. Szpektor, and M. Zhitomirsky-Geffet. Directional Distributional Similarity for Lexical Expansion. In Proc. of the ACL-IJCNLP, 2009.

L. Kotlerman, I. Dagan, I. Szpektor, and M. Zhitomirsky-Geffet. Directional Distributional Similarity for Lexical Inference. Journal of Natural Language Engineering (NLE), 16(4):359–389, 2010. ISSN 1351-3249.

Z. Kozareva and E. Hovy. A Semi-Supervised Method to Learn and Construct Taxonomies Using the Web. In Proc. of the Conference on Empirical Methods for Natural Language Processing (EMNLP), pages 1110–1118, Cambridge, MA, October 2010. Association for Computational Linguistics.

A. Kunchukuttan, M. Khapra, G. Singh, and P. Bhattacharyya. Leveraging Orthographic Similarity for Multilingual Neural Transliteration. Transactions of the Association for Computational Linguistics (TACL), 6, 2018.

G. Lakoff and M. Johnson. Metaphors We Live By. University of Chicago Press, 1980.

G. Lakoff and M. Johnson. Philosophy in the Flesh, volume 4. New York: Basic Books, 1999.

T. K. Landauer and S. T. Dumais. A Solution to Plato’s Problem: The Latent Semantic Analysis Theory of Acquisition, Induction, and Representation of Knowledge. Psychological Review, 104(2):211, 1997.

K. Lang. Newsweeder: Learning to Filter Netnews. In Proc. of the International Conference on Machine Learning (ICML), pages 331–339, 1995.

N. Lazic, A. Subramanya, M. Ringgaard, and F. Pereira. Plato: A Selective Context Model for Entity Resolution. Transactions of the Association for Computational Linguistics, 3: 503–515, 2015. ISSN 2307-387X.

R. Lebret and R. Collobert. Word Embeddings through Hellinger PCA. In Proc. of the Conference of the European Chapter of the Association for Computational Linguistics (EACL), pages 482–490, Gothenburg, Sweden, April 2014. Association for Computational Linguistics.

D. B. Lenat. CYC: A large-scale investment in knowledge infrastructure. Communications of the ACM, 38(11):33–38, 1995.

A. Lenci and G. Benotto. Identifying Hypernyms in Distributional Semantic Spaces. In Proc. of the 6th Workshop on Semantic Evaluation, 2012.

J. P. Levy and J. A. Bullinaria. Learning lexical properties from word usage patterns: Which context words should be used? In Connectionist Models of Learning, Development and Evolution. Springer, 2001.

O. Levy and Y. Goldberg. Linguistic Regularities in Sparse and Explicit Word Representa- tions. In Proc. of the Annual Conference on Computational Natural Language Learning (CoNLL), 2014a.

O. Levy and Y. Goldberg. Dependency-Based Word Embeddings. In Proc. of the Annual Meeting of the Association of Computational Linguistics (ACL), 2014b.

O. Levy and Y. Goldberg. Neural Word Embedding as Implicit Matrix Factorization. In Proc. of the Conference on Advances in Neural Information Processing Systems (NIPS), pages 2177–2185, 2014c.

O. Levy, S. Remus, C. Biemann, and I. Dagan. Do Supervised Distributional Methods Really Learn Lexical Inference Relations? In Proc. of the Annual Meeting of the North American Association of Computational Linguistics (NAACL), 2015.

M. Lewis and M. Steedman. Combined Distributional and Logical Semantics. Transactions of the Association of Computational Linguistics, 1:179–192, 2013.

M. P. Lewis, editor. Ethnologue – Languages of the World. SIL International, 16th edition, 2009.

W. Lewoniewski, K. Węcel, and W. Abramowicz. Relative Quality and Popularity Evaluation of Multilingual Wikipedia Articles. Informatics, 2017.

J. Li and D. Jurafsky. Do Multi-Sense Embeddings Improve Natural Language Understand- ing? Proc. of the Conference on Empirical Methods for Natural Language Processing (EMNLP), 2015.

Z. Li, C. Callison-Burch, C. Dyer, J. Ganitkevitch, S. Khudanpur, L. Schwartz, W. N. Thornton, J. Weese, and O. F. Zaidan. Joshua: An Open Source Toolkit for Parsing-based Machine Translation. In Proc. of the Fourth Workshop on Statistical Machine Translation, 2009.

P. Liang. Semi-Supervised Learning for Natural Language. Master’s thesis, Massachusetts Institute of Technology, 2005.

D. Lin. Automatic Retrieval and Clustering of Similar Words. In Proc. of COLING-ACL, 1998a.

D. Lin. An Information-theoretic Definition of Similarity. In Proc. of the International Conference on Machine Learning (ICML), 1998b.

D. Lin, S. Zhao, L. Qin, and M. Zhou. Identifying Synonyms among Distributionally Similar Words. In Proc. of the International Joint Conference on Artificial Intelligence (IJCAI), volume 3, pages 1492–1493, 2003.

Y. Lin, X. Pan, A. Deri, H. Ji, and K. Knight. Leveraging Entity Linking and Related Language Projection to Improve Name Transliteration. In Proc. of the Sixth Named Entity Workshop, 2016.

X. Ling and D. S. Weld. Fine-Grained Entity Recognition. In Proceedings of the National Conference on Artificial Intelligence (AAAI), 2012. URL http://aiweb.cs.washington.edu/ai/pubs/ling-aaai12.pdf.

X. Ling, S. Singh, and D. S. Weld. Design Challenges for Entity Linking. Transactions of the Association for Computational Linguistics, 2015. URL https://transacl.org/ojs/index.php/tacl/article/view/528/133.

P. Liu, X. Qiu, and X. Huang. Learning Context-sensitive Word Embeddings with Neural Tensor Skip-gram Model. In Proc. of the International Joint Conference on Artificial Intelligence (IJCAI), 2015a.

Y. Liu, Z. Liu, T.-S. Chua, and M. Sun. Topical Word Embeddings. In Proc. of the National Conference on Artificial Intelligence (AAAI), 2015b.

A. Lopez and M. Post. Beyond bitext: Five open problems in machine translation. In Proceedings of the EMNLP Workshop on Twenty Years of Bitext, 2013.

A. Lu, W. Wang, M. Bansal, K. Gimpel, and K. Livescu. Deep Multilingual Correlation for Improved Word Embeddings. In Proc. of the Annual Meeting of the North American Association of Computational Linguistics (NAACL), 2015.

H. P. Luhn. A Statistical Approach to Mechanized Encoding and Searching of Literary Information. IBM Journal of Research and Development, 1(4):309–317, 1957.

T. Luong, H. Pham, and C. D. Manning. Effective Approaches to Attention-based Neural Machine Translation. In Proc. of the Conference on Empirical Methods for Natural Language Processing (EMNLP), pages 1412–1421, Lisbon, Portugal, Sept. 2015a. Association for Computational Linguistics.

T. Luong, H. Pham, and C. D. Manning. Bilingual Word Representations with Monolingual Quality in Mind. In Proc. of the Workshop on Vector Space Modeling for NLP, 2015b.

P. Makarov, T. Ruzsics, and S. Clematide. Align and Copy: UZH at SIGMORPHON 2017 Shared Task for Morphological Reinflection. In Proc. of the CoNLL SIGMORPHON 2017 Shared Task: Universal Morphological Reinflection, Aug. 2017.

M. P. Marcus, M. A. Marcinkiewicz, and B. Santorini. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330, 1993.

S. Martin, J. Liermann, and H. Ney. Algorithms for Bigram and Trigram Word Clustering. Speech Communication, 24(1):19–37, 1998.

A. Martins, M. Almeida, and N. A. Smith. Turning on the Turbo: Fast Third-Order Non-Projective Turbo Parsers. In Proc. of the Annual Meeting of the Association of Computational Linguistics (ACL), 2013.

D. Matthews. Machine Transliteration of Proper Names. Master’s Thesis, University of Edinburgh, Edinburgh, United Kingdom, 2007.

J. Mayfield, D. Alexander, B. J. Dorr, J. Eisner, T. Elsayed, T. Finin, C. Fink, M. Freedman, N. Garera, P. McNamee, et al. Cross-Document Coreference Resolution: A Key Technology for Learning by Reading. In AAAI Spring Symposium: Learning by Reading and Learning to Read, volume 9, pages 65–70, 2009.

R. McDonald, S. Petrov, and K. Hall. Multi-source Transfer of Delexicalized Dependency Parsers. In Proc. of the Conference on Empirical Methods for Natural Language Processing (EMNLP), 2011.

R. McDonald, J. Nivre, Y. Quirmbach-Brundage, Y. Goldberg, D. Das, K. Ganchev, K. Hall, S. Petrov, H. Zhang, O. Täckström, C. Bedini, N. Bertomeu Castelló, and J. Lee. Universal Dependency Annotation for Multilingual Parsing. In Proc. of the Annual Meeting of the Association of Computational Linguistics (ACL), 2013.

P. McNamee, J. Mayfield, D. Lawrie, D. W. Oard, and D. S. Doermann. Cross-Language Entity Linking. In Proc. of the International Joint Conference on Natural Language Processing (IJCNLP), 2011.

Q. McNemar. Note on the Sampling Error of the Difference between Correlated Proportions or Percentages. Psychometrika, 12(2):153–157, 1947.

O. Melamud, D. McClosky, S. Patwardhan, and M. Bansal. The Role of Context Types and Dimensionality in Learning Word Embeddings. In Proc. of the Annual Meeting of the North American Association of Computational Linguistics (NAACL), 2016.

Y. Merhav and S. Ash. Design Challenges in Named Entity Transliteration. In Proc. of the International Conference on Computational Linguistics (COLING), 2018.

R. Mihalcea and A. Csomai. Wikify!: Linking documents to encyclopedic knowledge. In Proc. of the ACM Conference on Information and Knowledge Management (CIKM), 2007.

T. Mikolov, M. Karafiát, L. Burget, J. Černocký, and S. Khudanpur. Recurrent neural network based language model. Proc. of Interspeech, 2010.

T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient Estimation of Word Representa- tions in Vector Space. arXiv preprint arXiv:1301.3781, 2013a.

T. Mikolov, Q. V. Le, and I. Sutskever. Exploiting Similarities Among Languages for Machine Translation. arXiv preprint arXiv:1309.4168, 2013b.

G. A. Miller. WordNet: A Lexical Database for English. Communications of the ACM, 1995.

G. A. Miller, C. Leacock, R. Tengi, and R. T. Bunker. A Semantic Concordance. In Proceedings of the workshop on Human Language Technology, pages 303–308. Association for Computational Linguistics, 1993.

S. Miller, J. Guinness, and A. Zamanian. Name Tagging with Word Clusters and Discriminative Training. In Proc. of the Annual Meeting of the North American Association of Computational Linguistics (NAACL), 2004.

S. Mirkin, I. Dagan, and M. Geffet. Integrating Pattern-based and Distributional Similarity Methods for Lexical Entailment Acquisition. In Proceedings of COLING-ACL, pages 579–586. Association for Computational Linguistics, 2006.

S. Mohammad, B. Dorr, and G. Hirst. Computing Word-Pair Antonymy. In Proc. of the Conference on Empirical Methods for Natural Language Processing (EMNLP). Association for Computational Linguistics, 2008.

W. Monroe, S. Green, and C. D. Manning. Word Segmentation of Informal Arabic with Domain Adaptation. In Proc. of the Annual Meeting of the Association of Computational Linguistics (ACL), 2014.

D. S. Munteanu and D. Marcu. ISI Arabic-English Automatically Extracted Parallel Text LDC2007T08, 2007.

B. Murphy, P. Talukdar, and T. Mitchell. Learning Effective and Interpretable Semantic Models using Non-Negative Sparse Embedding. In Proc. of the International Conference on Computational Linguistics (COLING), pages 1933–1950, 2012.

V. Nair and G. E. Hinton. Rectified Linear Units improve Restricted Boltzmann Machines. In Proc. of the International Conference on Machine Learning (ICML), pages 807–814, 2010.

M. Nasiruddin. A State of the Art of Word Sense Induction: A Way Towards Word Sense Disambiguation for Under-resourced Languages. arXiv preprint arXiv:1310.1425, 2013.

R. Navigli. Word Sense Disambiguation: A Survey. ACM Computing Surveys (CSUR), 2009.

R. Navigli and S. P. Ponzetto. BabelNet: The Automatic Construction, Evaluation and Application of a Wide-Coverage Multilingual Semantic Network. Artificial Intelligence, 2012.

A. Neelakantan, J. Shankar, A. Passos, and A. McCallum. Efficient Non-parametric Estimation of Multiple Embeddings per Word in Vector Space. In Proc. of the Conference on Empirical Methods for Natural Language Processing (EMNLP), 2014.

M. Negri, A. Marchetti, Y. Mehdad, L. Bentivogli, and D. Giampiccolo. SemEval-2012 Task 8: Cross-lingual Textual Entailment for Content Synchronization. In Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012), 2012.

M. Negri, A. Marchetti, Y. Mehdad, L. Bentivogli, and D. Giampiccolo. SemEval-2013 Task 8: Cross-lingual Textual Entailment for Content Synchronization. In Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), 2013.

H. T. Ng, B. Wang, and Y. S. Chan. Exploiting Parallel Texts for Word Sense Disambiguation: An Empirical Study. In Proc. of the Annual Meeting of the Association of Computational Linguistics (ACL), 2003.

D. Nguyen, M. Theobald, and G. Weikum. J-NERD: Joint Named Entity Recognition and Disambiguation with Rich Linguistic Features. Transactions of the Association for Computational Linguistics (TACL), 4:215–229, 2016a.

T. H. Nguyen, N. Fauceglia, M. Rodriguez-Muro, O. Hassanzadeh, A. M. Gliozzo, and M. Sadoghi. Joint Learning of Local and Global Features for Entity Linking via Neural Networks. In Proc. of the International Conference on Computational Linguistics (COLING), pages 2310–2320, 2016b.

G. Nicolai, B. Hauer, M. Salameh, A. St Arnaud, Y. Xu, L. Yao, and G. Kondrak. Multiple System Combination for Transliteration. In Proc. of the Fifth Named Entity Workshop, July 2015.

I. Niles and A. Pease. Towards a Standard Upper Ontology. In Proceedings of the International Conference on Formal Ontology in Information Systems - Volume 2001, pages 2–9. ACM, 2001.

S. Padó and M. Lapata. Constructing semantic space models from parsed corpora. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 128–135, Sapporo, Japan, July 2003. Association for Computational Linguistics.

S. Padó and M. Lapata. Dependency-Based Construction of Semantic Space Models. Computational Linguistics, 2007.

S. Padó, M. Galley, D. Jurafsky, and C. D. Manning. Robust machine translation evaluation with entailment features. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, 2009.

X. Pan, B. Zhang, J. May, J. Nothman, K. Knight, and H. Ji. Cross-lingual Name Tagging and Linking for 282 Languages. In Proc. of the Annual Meeting of the Association of Computational Linguistics (ACL), pages 1946–1958. Association for Computational Linguistics, 2017.

R. Parker. Chinese Gigaword 5th Edition, LDC2011T13, 2011.

J. Pasternack and D. Roth. Learning Better Transliterations. In Proc. of the ACM Conference on Information and Knowledge Management (CIKM), November 2009.

E. Pavlick, P. Rastogi, J. Ganitkevitch, B. Van Durme, and C. Callison-Burch. PPDB 2.0: Better Paraphrase Ranking, Fine-grained Entailment Relations, Word Embeddings, and Style Classification. Proc. of ACL-IJCNLP, pages 425–430, 2015.

K. Pearson. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11):559–572, 1901.

J. Pennington, R. Socher, and C. D. Manning. GloVe: Global Vectors for Word Representation. In Proc. of the Conference on Empirical Methods for Natural Language Processing (EMNLP), 2014.

M. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer. Deep Contextualized Word Representations. In Proc. of the Annual Meeting of the North American Association of Computational Linguistics (NAACL), pages 2227–2237. Association for Computational Linguistics, 2018.

S. P. Ponzetto and M. Strube. Exploiting Semantic Role Labeling, WordNet and Wikipedia for Coreference Resolution. In Proc. of the Annual Meeting of the North American Association of Computational Linguistics (NAACL), 2006.

M. Potthast, B. Stein, and M. Anderka. A Wikipedia-based Multilingual Retrieval Model. In Proc. of the European Conference on Information Retrieval (ECIR), pages 522–530, 2008.

J. M. Prager, J. Chu-Carroll, and K. Czuba. Use of WordNet Hypernyms for Answering What-Is Questions. In TREC, 2001.

R. Prasad, N. Dinesh, A. Lee, E. Miltsakaki, L. Robaldo, A. Joshi, and B. Webber. The Penn Discourse Treebank 2.0. In Proc. of the International Conference on Language Resources and Evaluation (LREC), 2008.

L. Qiu, Y. Cao, Z. Nie, Y. Yu, and Y. Rui. Learning Word Representation Considering Proximity and Ambiguity. In Proc. of the National Conference on Artificial Intelligence (AAAI), 2014.

K. Radinsky, E. Agichtein, E. Gabrilovich, and S. Markovitch. A Word at a Time: Computing Word Relatedness using Temporal Semantic Analysis. In Proc. of the International World Wide Web Conference (WWW), 2011.

A. Rahman and V. Ng. Coreference Resolution with World Knowledge. In Proc. of the Annual Meeting of the Association of Computational Linguistics (ACL), pages 814–824. Association for Computational Linguistics, 2011.

J. Raiman and O. Raiman. DeepType: Multilingual Entity Linking by Neural Type System Evolution. In Proc. of the National Conference on Artificial Intelligence (AAAI), 2018.

W. M. Rand. Objective Criteria for the Evaluation of Clustering Methods. Journal of the American Statistical Association, 66(336):846–850, 1971.

R. Rapp. Identifying Word Translations in Non-Parallel Texts. In Proc. of the Annual Meeting of the Association of Computational Linguistics (ACL), pages 320–322, Cambridge, Massachusetts, USA, June 1995. Association for Computational Linguistics.

R. Rapp. Automatic Identification of Word Translations from Unrelated English and German Corpora. In Proc. of the Annual Meeting of the Association of Computational Linguistics (ACL), pages 519–526. Association for Computational Linguistics, 1999.

M. S. Rasooli and M. Collins. Cross-Lingual Syntactic Transfer with Limited Resources. Transactions of the Association for Computational Linguistics, 5:279–293, 2017. ISSN 2307-387X.

M. S. Rasooli and J. R. Tetreault. Yara Parser: A Fast and Accurate Dependency Parser. CoRR, abs/1503.06733, 2015.

L. Ratinov and D. Roth. Design Challenges and Misconceptions in Named Entity Recognition. In Proc. of the Annual Conference on Computational Natural Language Learning (CoNLL), June 2009.

L. Ratinov and D. Roth. Learning-based Multi-Sieve Co-Reference Resolution with Knowledge. In Proc. of the Conference on Empirical Methods for Natural Language Processing (EMNLP), 2012.

L. Ratinov, D. Roth, D. Downey, and M. Anderson. Local and Global Algorithms for Disambiguation to Wikipedia. In Proc. of the Annual Meeting of the Association for Computational Linguistics (ACL), 2011.

S. Ravi and K. Knight. Learning Phoneme Mappings for Transliteration without Parallel Data. In Proc. of the Annual Meeting of the North American Association of Computational Linguistics (NAACL), June 2009.

S. Reddy and S. Waxmonsky. Substring-based Transliteration with Conditional Random Fields. In Proceedings of the 2009 Named Entities Workshop: Shared Task on Transliteration, pages 92–95. Association for Computational Linguistics, 2009.

J. Reisinger and R. J. Mooney. Multi-Prototype Vector-Space Models of Word Meaning. In Proc. of the Annual Meeting of the North American Association of Computational Linguistics (NAACL), 2010.

P. Resnik and D. Yarowsky. Distinguishing Systems and Distinguishing Senses: New Evaluation Methods for Word Sense Disambiguation. Journal of Natural Language Engineering (NLE), 1999.

E. Riloff and J. Shepherd. A Corpus-based Approach for Building Semantic Lexicons. In Proc. of the Conference on Empirical Methods for Natural Language Processing (EMNLP), 1997.

L. Rimell. Distributional Lexical Entailment by Topic Coherence. In Proc. of the Conference of the European Chapter of the Association for Computational Linguistics (EACL), April 2014.

P. M. Roget. Roget’s Thesaurus of English Words and Phrases. Longman Group Limited, 1852.

S. Roller and K. Erk. Relations such as Hypernymy: Identifying and Exploiting Hearst Patterns in Distributional Vectors for Lexical Entailment. Proc. of the Conference on Empirical Methods for Natural Language Processing (EMNLP), 2016.

S. Roller, K. Erk, and G. Boleda. Inclusive yet Selective: Supervised Distributional Hypernymy Detection. In Proc. of the International Conference on Computational Linguistics (COLING), 2014.

D. Roth. Learning to Resolve Natural Language Ambiguities: A Unified Approach. In Proc. of the Conference on Artificial Intelligence (AAAI), pages 806–813, 1998.

H. Rubenstein and J. B. Goodenough. Contextual Correlates of Synonymy. Communications of the ACM, 8(10):627–633, October 1965. ISSN 0001-0782.

S. Ruder, I. Vulić, and A. Søgaard. A Survey of Cross-lingual Word Embedding Models. Journal of AI Research (JAIR), 2018.

D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature, 323(6088):533, 1986.

M. Sahlgren. The Word-Space Model: Using Distributional Analysis to Represent Syntagmatic and Paradigmatic Relations between Words in High-Dimensional Vector Spaces. PhD thesis, Stockholm University, 2006.

G. Salton and C. Buckley. Term-weighting Approaches in Automatic Text Retrieval. Information Processing and Management, 1988.

G. Salton, A. Wong, and C.-S. Yang. A Vector Space Model for Automatic Indexing. Communications of the ACM, 18(11):613–620, 1975.

M. Sammons, V. Vydiswaran, and D. Roth. Recognizing Textual Entailment. In Multilingual Natural Language Applications: From Theory to Practice, pages 209–258. Prentice Hall, May 2012.

E. Santus, A. Lenci, Q. Lu, and S. Schulte im Walde. Chasing Hypernyms in Vector Spaces with Entropy. In Proc. of the Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2014.

E. Santus, F. Yung, A. Lenci, and C.-R. Huang. EVALution 1.0: An Evolving Semantic Dataset for Training and Evaluation of Distributional Semantic Models. In Proc. of the 4th Workshop on Linked Data in Linguistics (LDL-2015), pages 64–69, 2015.

T. Schnabel, I. Labutov, D. Mimno, and T. Joachims. Evaluation Methods for Unsupervised Word Embeddings. In Proc. of the Conference on Empirical Methods for Natural Language Processing (EMNLP), 2015.

P. Scholl, D. Böhnstedt, R. D. García, C. Rensing, and R. Steinmetz. Extended Explicit Semantic Analysis for Calculating Semantic Relatedness of Web Resources. In European Conference on Technology Enhanced Learning, pages 324–339. Springer, 2010.

H. Schütze and J. O. Pedersen. Information Retrieval Based on Word Senses. In Proceedings of the 4th Annual Symposium on Document Analysis and Information Retrieval, 1995.

R. Sennrich, B. Haddow, and A. Birch. Neural Machine Translation of Rare Words with Subword Units. In Proc. of the Annual Meeting of the Association of Computational Linguistics (ACL), 2016.

J. Sethuraman. A Constructive Definition of Dirichlet Priors. Statistica Sinica, 1994.

W. Shen, J. Wang, and J. Han. Entity Linking with a Knowledge Base: Issues, Techniques, and Solutions. IEEE Transactions on Knowledge and Data Engineering, 27(2):443–460, 2015.

H. Shi, C. Li, and J. Hu. Real Multi-Sense or Pseudo Multi-Sense: An Approach to Improve Word Representation. In Proc. of the Workshop on Computational Linguistics for Linguistic Complexity (CL4LC), pages 79–88, Osaka, Japan, Dec. 2016. The COLING 2016 Organizing Committee.

V. Shwartz and I. Dagan. Adding Context to Semantic Data-Driven Paraphrasing. In Proc. of the Joint Conference on Lexical and Computational Semantics (*SEM), pages 108–113, Berlin, Germany, 2015.

V. Shwartz, Y. Goldberg, and I. Dagan. Improving Hypernymy Detection with an Integrated Pattern-based and Distributional Method. In Proc. of the Annual Meeting of the Association of Computational Linguistics (ACL), 2016.

V. Shwartz, E. Santus, and D. Schlechtweg. Hypernyms under Siege: Linguistically-motivated Artillery for Hypernymy Detection. In Proc. of the Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2017.

A. Sil and A. Yates. Re-ranking for joint named-entity recognition and linking. In Proceedings of the 22nd ACM international conference on Conference on information & knowledge management, pages 2369–2374. ACM, 2013.

A. Sil, G. Kundu, R. Florian, and W. Hamza. Neural Cross-lingual Entity Linking. In Proc. of the National Conference on Artificial Intelligence (AAAI), 2018.

S. L. Smith, D. H. Turban, S. Hamblin, and N. Y. Hammerla. Offline Bilingual Word Vectors, Orthogonal Transformations and the Inverted Softmax. In Proc. of the International Conference on Learning Representations (ICLR), 2017.

R. Snow, D. Jurafsky, and A. Y. Ng. Learning Syntactic Patterns for Automatic Hypernym Discovery. In Proc. of the Conference on Advances in Neural Information Processing Systems (NIPS), 2004.

R. Snow, D. Jurafsky, and A. Y. Ng. Semantic Taxonomy Induction from Heterogenous Evidence. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics, pages 801–808. Association for Computational Linguistics, 2006.

R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Y. Ng, and C. Potts. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. In Proc. of the Conference on Empirical Methods for Natural Language Processing (EMNLP), 2013.

A. Søgaard, Ž. Agić, H. Martínez Alonso, B. Plank, B. Bohnet, and A. Johannsen. Inverted Indexing for Cross-lingual NLP. In Proc. of the Annual Meeting of the Association of Computational Linguistics (ACL), 2015.

Y. Song, S. Upadhyay, H. Peng, and D. Roth. Cross-lingual Dataless Classification for Many Languages. In Proc. of the International Joint Conference on Artificial Intelligence (IJCAI), 2016.

P. Sorg and P. Cimiano. Exploiting Wikipedia for Cross-lingual and Multilingual Information Retrieval. Data & Knowledge Engineering, 74:26–45, 2012.

K. Sparck Jones. A Statistical Interpretation of Term Specificity and its Application in Retrieval. Journal of Documentation, 28(1):11–21, 1972.

C. Spearman. The Proof and Measurement of Association between Two Things. The American Journal of Psychology, 15(1):72–101, 1904.

R. Speer, J. Chin, and C. Havasi. ConceptNet 5.5: An Open Multilingual Graph of General Knowledge. In Proc. of the National Conference on Artificial Intelligence (AAAI), pages 4444–4451, 2017.

V. I. Spitkovsky and A. X. Chang. A Cross-Lingual Dictionary for English Wikipedia Concepts. In Proc. of the International Conference on Language Resources and Evaluation (LREC), 2012.

R. Sproat, T. Tao, and C. Zhai. Named Entity Transliteration with Comparable Corpora. In Proc. of COLING-ACL, 2006.

N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research (JMLR), 15(1):1929–1958, 2014.

J. H. Steiger. Tests for Comparing Elements of a Correlation Matrix. Psychological Bulletin, 87(2):245, 1980.

K. Stratos, D.-k. Kim, M. Collins, and D. J. Hsu. A Spectral Algorithm for Learning Class-Based n-gram Models of Natural Language. In Proc. of the Conference on Uncertainty in Artificial Intelligence (UAI), 2014.

A. Subramanian, D. Pruthi, H. Jhamtani, T. Berg-Kirkpatrick, and E. Hovy. SPINE: SParse Interpretable Neural Embeddings. Proc. of the National Conference on Artificial Intelligence (AAAI), 2018.

H. Sun, H. Ma, W.-t. Yih, C.-T. Tsai, J. Liu, and M.-W. Chang. Open Domain Question Answering via Semantic Enrichment. In Proc. of the International World Wide Web Conference (WWW), pages 1045–1055. International World Wide Web Conferences Steering Committee, 2015.

R. H. Susanto and W. Lu. Neural Architectures for Multilingual Semantic Parsing. In Proc. of the Annual Meeting of the Association of Computational Linguistics (ACL), July 2017.

S. Šuster, I. Titov, and G. van Noord. Bilingual Learning of Multi-sense Embeddings with Discrete Autoencoders. In Proc. of the Annual Meeting of the North American Association of Computational Linguistics (NAACL), San Diego, USA, 2016.

I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to Sequence Learning with Neural Networks. In Proc. of the Conference on Advances in Neural Information Processing Systems (NIPS), 2014.

O. Täckström, R. McDonald, and J. Uszkoreit. Cross-lingual Word Clusters for Direct Transfer of Linguistic Structure. In Proc. of the Annual Meeting of the North American Association of Computational Linguistics (NAACL), 2012.

J. Tiedemann. Parallel Data, Tools and Interfaces in OPUS. In Proc. of the International Conference on Language Resources and Evaluation (LREC), 2012.

S. Tratz and E. Hovy. A Fast, Accurate, Non-Projective, Semantically-Enriched Parser. In Proc. of the Conference on Empirical Methods for Natural Language Processing (EMNLP), pages 1257–1268. Association for Computational Linguistics, 2011.

C.-T. Tsai and D. Roth. Cross-lingual Wikification Using Multilingual Embeddings. In Proc. of the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), June 2016.

C.-T. Tsai and D. Roth. Learning Better Name Translation for Cross-Lingual Wikification. In Proc. of the Conference on Artificial Intelligence (AAAI), February 2018.

H. Tseng, P. Chang, G. Andrew, D. Jurafsky, and C. Manning. A Conditional Random Field Word Segmenter for Sighan Bakeoff 2005. In Proc. of the SIGHAN Workshop on Chinese Language Processing (SIGHAN), 2005.

Y. Tsvetkov, M. Faruqui, W. Ling, G. Lample, and C. Dyer. Evaluation of Word Vector Representations by Subspace Alignment. In Proc. of the Conference on Empirical Methods for Natural Language Processing (EMNLP), 2015.

J. Turian, L. Ratinov, and Y. Bengio. Word Representations: A Simple and General Method for Semi-supervised Learning. In Proc. of the Annual Meeting of the Association of Computational Linguistics (ACL), 2010.

P. D. Turney and S. M. Mohammad. Experiments with Three Approaches to Recognizing Lexical Entailment. Journal of Natural Language Engineering (NLE), 2015.

P. D. Turney and P. Pantel. From Frequency to Meaning: Vector Space Models of Semantics. Journal of AI Research (JAIR), pages 141–188, 2010.

S. Upadhyay, M. Faruqui, C. Dyer, and D. Roth. Cross-lingual Models of Word Embeddings: An Empirical Comparison. In Proc. of the Annual Meeting of the Association of Computational Linguistics (ACL), 2016.

S. Upadhyay, K.-W. Chang, M. Taddy, A. Kalai, and J. Zou. Beyond Bilingual: Multi-sense Word Embeddings using Multilingual Context. In Workshop on Representation Learning for NLP (RepL4NLP) at ACL, 2017. Best Paper Award.

S. Upadhyay, N. Gupta, and D. Roth. Joint Multilingual Supervision for Cross-lingual Entity Linking. In Proc. of the Conference on Empirical Methods for Natural Language Processing (EMNLP), 2018a.

S. Upadhyay, J. Kodner, and D. Roth. Bootstrapping Transliteration with Constrained Discovery for Low-Resource Languages. In Proc. of the Conference on Empirical Methods for Natural Language Processing (EMNLP), 2018b.

S. Upadhyay*, Y. Vyas*, M. Carpuat, and D. Roth. Robust Cross-lingual Hypernymy Detection using Dependency Context. In Proc. of the Annual Meeting of the North American Association of Computational Linguistics (NAACL), 2018.

A. Ushioda. Hierarchical Clustering of Words. In Proc. of the Annual Conference on Computational Natural Language Learning (CoNLL), pages 1159–1162. Association for Computational Linguistics, 1996.

J. Uszkoreit and T. Brants. Distributed Word Clustering for Large Scale Class-based Language Modeling in Machine Translation. Proc. of the Annual Meeting of the Association of Computational Linguistics (ACL), 2008.

P. Virga and S. Khudanpur. Transliteration of Proper Names in Cross-lingual Information Retrieval. In Proc. of the Workshop on Multilingual and Mixed-Language Named Entity Recognition, 2003.

P. Vossen, E. Laparra, I. Aldabe, and G. Rigau. Interoperability of Cross-lingual and Cross-document Event Detection. In Proc. of the 3rd Workshop on EVENTS at the NAACL-HLT, 2015.

D. Vrandečić. Wikidata: A New Platform for Collaborative Data Collection. In Proc. of the International World Wide Web Conference (WWW), pages 1063–1064. ACM, 2012.

I. Vulić. Cross-Lingual Syntactically Informed Distributed Word Representations. In Proc. of the Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2017.

I. Vulić and M.-F. Moens. Cross-Lingual Semantic Similarity of Words as the Similarity of Their Semantic Word Responses. In Proc. of the Annual Meeting of the North American Association of Computational Linguistics (NAACL), 2013.

I. Vulić and M.-F. Moens. Bilingual Word Embeddings from Non-Parallel Document-Aligned Data Applied to Bilingual Lexicon Induction. In Proc. of the Annual Meeting of the Association of Computational Linguistics (ACL), 2015.

Y. Vyas and M. Carpuat. Sparse Bilingual Word Representations for Cross-lingual Lexical Entailment. Proc. of the Annual Meeting of the North American Association of Computational Linguistics (NAACL), 2016.

Y. Vyas and M. Carpuat. Detecting Asymmetric Semantic Relations in Context: A Case Study on Hypernymy Detection. In Proc. of the Joint Conference on Lexical and Computational Semantics (*SEM), 2017.

H. Wang, J. G. Zheng, X. Ma, P. Fox, and H. Ji. Language and Domain Independent Entity Linking with Quantified Collective Validation. In Proc. of the Conference on Empirical Methods for Natural Language Processing (EMNLP), 2015.

W. Weaver. Translation. In W. N. Locke and A. D. Boothe, editors, Machine Translation of Languages, pages 15–23. MIT Press, Cambridge, MA, 1955. Reprinted from a memorandum written by Weaver in 1949.

J. Weeds and D. Weir. A General Framework for Distributional Similarity. In Proc. of the Conference on Empirical Methods for Natural Language Processing (EMNLP), 2003.

J. Weeds, D. Weir, and D. McCarthy. Characterising Measures of Lexical Distributional Similarity. In Proc. of the International Conference on Computational Linguistics (COLING), page 1015. Association for Computational Linguistics, 2004.

E. J. Williams. The Comparison of Regression Variables. Journal of the Royal Statistical Society. Series B (Methodological), pages 396–399, 1959.

S. M. Wong, W. Ziarko, and P. C. Wong. Generalized Vector Spaces Model in Information Retrieval. In Proceedings of the 8th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 18–25. ACM, 1985.

Z. Wu and C. L. Giles. Sense-Aware Semantic Analysis: A Multi-Prototype Word Representation Model Using Wikipedia. Proc. of the National Conference on Artificial Intelligence (AAAI), pages 2188–2194, 2015.

N. Xue. Labeling Chinese Predicates with Semantic Roles. Computational Linguistics, 34 (2):225–255, 2008.

D. Yarowsky. One Sense per Collocation. In Proceedings of the Workshop on Human Language Technology, pages 266–271. Association for Computational Linguistics, 1993.

D. Yarowsky. Unsupervised Word Sense Disambiguation Rivaling Supervised Methods. In Proc. of the Annual Meeting of the Association of Computational Linguistics (ACL), 1995.

D. Yarowsky and G. Ngai. Inducing Multilingual POS Taggers and NP Bracketers via Robust Projection across Aligned Corpora. In Proc. of the Annual Meeting of the North American Association of Computational Linguistics (NAACL), 2001.

K. Y. Yeung and W. L. Ruzzo. Details of the Adjusted Rand Index and Clustering Algorithms, Supplement to the Paper “An Empirical Study on Principal Component Analysis for Clustering Gene Expression Data”. Bioinformatics, 17(9):763–774, 2001.

W.-t. Yih, G. Zweig, and J. C. Platt. Polarity Inducing Latent Semantic Analysis. In Proc. of the Conference on Empirical Methods for Natural Language Processing (EMNLP), 2012.

W.-t. Yih, M.-W. Chang, C. Meek, and A. Pastusiak. Question Answering Using Enhanced Lexical Semantic Models. In Proc. of the Annual Meeting of the Association of Computational Linguistics (ACL), 2013.

W.-t. Yih, M.-W. Chang, X. He, and J. Gao. Semantic Parsing via Staged Query Graph Generation: Question Answering with Knowledge Base. In Proceedings of ACL-IJCNLP, Beijing, China, July 2015.

D. Zeman and P. Resnik. Cross-Language Parser Adaptation between Related Languages. In Proc. of the International Joint Conference on Natural Language Processing (IJCNLP), 2008.

S. Zhang, K. Duh, and B. Van Durme. Cross-lingual Semantic Parsing, 2018a.

S. Zhang, X. Ma, R. Rudinger, K. Duh, and B. Van Durme. Cross-lingual Decompositional Semantic Parsing. In Proc. of the Conference on Empirical Methods for Natural Language Processing (EMNLP), October-November 2018b.

G. K. Zipf. Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology. Addison-Wesley, 1949.

W. Y. Zou, R. Socher, D. Cer, and C. D. Manning. Bilingual Word Embeddings for Phrase-Based Machine Translation. In Proc. of the Conference on Empirical Methods for Natural Language Processing (EMNLP), 2013.

Y. Zou and W. Lu. Learning Cross-lingual Distributed Logical Representations for Semantic Parsing. In Proc. of the Annual Meeting of the Association of Computational Linguistics (ACL), July 2018.
