Named Entity Recognition Tagging for Telugu 15-300, Fall 2016

Project Proposal for 15-400, Spring 2017:
Named Entity Recognition Tagging for Telugu
15-300, Fall 2016
Sriharini Pingali
[email protected]
November 7, 2016

1 Project Overview

Faculty

David Mortensen (http://www.cs.cmu.edu/~dmortens/) will be my faculty advisor. He is special faculty in the Language Technologies Institute (LTI), focusing on applications of linguistics to low-resource Language Technology (LT). I will also be working with Lorraine Levin (http://www.cs.cmu.edu/~lsl/), tenured faculty in the Language Technologies Institute, specializing in low-resource Machine Translation (MT). I have been told that a Master's student in LTI who is a native speaker of Malayalam, a Dravidian language related to Telugu, may be able to work with me on this project, depending on his availability. So far I have not received confirmation or details on this, but we are sorting it out.

Description

A major problem in LT today is developing good LT systems for low-resource languages. Most reliable, commonly used Natural Language Processing (NLP) algorithms rely on extremely large, comprehensive corpora to provide training data to the Machine Learning (ML) algorithms that underlie them. With insufficient training data, these algorithms may fail to produce good LT systems. This is a serious problem for the majority of spoken languages, which may not have as abundant a variety of corpora as more widely used languages like English. An example of such a low-resource language, and the language my project will focus on, is Telugu. Telugu is the fifteenth most spoken language in the world by population, with a total of 75 million speakers,[1] yet LT for Telugu has not been developed to an acceptable level of quality.
[1] Statistic taken from http://www.vistawide.com/languages/top_30_languages.htm.

This research group is investigating methods for cross-lingual transfer: techniques for converting a well-functioning LT system for one language into a well-performing system for another language using as little additional training data as possible, which is especially promising for low-resource languages. An outstanding question in this area is, "Are certain pairs of languages more apt to successful cross-lingual transfer than others?" My project hypothesizes that cross-lingual transfer performs best between languages belonging to the same language family; that is, the more closely two languages are related, the more successful cross-lingual transfer between them will be, owing to their linguistic similarities. In historical and comparative linguistics, languages descended from the same ancestral language are said to belong to the same language family. One such language family is the Dravidian family, a group of languages spoken in South India and derived from a Proto-Dravidian language dating back to the second century BCE, including Telugu, Tamil, Kannada and Malayalam. This project will test our hypothesis by evaluating the performance of named entity recognition (NER) taggers built by cross-lingual transfer between two languages belonging to the Dravidian family. NER taggers, an essential component of MT systems, identify and categorize named people, locations, times, etc. (such as "Barack Obama," "White House," and "2016") in a text so that the system does not attempt to translate them literally as ordinary words. Coupling my NLP background with my native knowledge of Telugu, I will build an NER tagger for the Telugu language.
If this is completed before the end of the semester, the next step will be to use my system to build an NER tagger for a different Dravidian language using cross-lingual transfer (most likely Tamil or Malayalam, depending on the data we can acquire), and to compare its performance against a system built using cross-lingual transfer from an NER tagger for English (or some other unrelated, non-Dravidian language).

Motivation

The primary motivation of this project is participation in the Defense Advanced Research Projects Agency (DARPA) initiative for Low Resource Languages for Emergent Incidents (LORELEI).[2] The goal of this initiative is to "enable rapid, low-cost development of capabilities for low-resource languages" in "any incident in which a sudden need emerges for assimilation of information ... about a region of the world where low-resource languages are frequently used." Being able to build working LT systems quickly and with minimal data will therefore support immediate humanitarian assistance for disaster relief in areas where low-resource languages are primarily used. More than 90% of the world's population speaks a language related to one of the top twenty most spoken languages in the world. If our hypothesis holds and we can successfully build an inventory of LT systems for those languages, our research would imply that we could easily build good systems for almost any other language with cross-lingual transfer.

Objectives

This is what I hope to accomplish, if things go as planned:

75%: Get the building blocks for a Telugu NER tagger. Even if I can't finish the NER tagger, I want to finish annotating Telugu corpora and have a working morphological analyzer.
100%: Complete development of a working Telugu NER tagger.
125%: Apply cross-lingual transfer from Telugu to another Dravidian language and begin the testing and evaluation phase to see whether using a related language is actually better.
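
The comparison in the 125% objective needs a common score for the two transferred taggers. The proposal does not name an evaluation metric, so the following is only a minimal sketch, assuming tags are written in the common BIO scheme (B-PER, I-PER, O, ...) and that taggers are compared by entity-level precision, recall, and F1. The function names and the example tags are hypothetical.

```python
from typing import List, Set, Tuple

# (start index, end index exclusive, entity type)
Span = Tuple[int, int, str]


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from a BIO tag sequence such as B-PER I-PER O."""
    spans: Set[Span] = set()
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" closes a trailing span
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
        # A stray I- tag with no open span is ignored in this sketch.
    return spans


def entity_f1(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Entity-level precision, recall, and F1; a span counts only on exact match."""
    gold_spans, pred_spans = bio_to_spans(gold), bio_to_spans(pred)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical tag sequences for one four-token sentence.
    gold = ["B-PER", "I-PER", "O", "B-LOC"]
    from_related_language = ["B-PER", "I-PER", "O", "O"]
    from_unrelated_language = ["B-PER", "O", "O", "B-LOC"]
    print(entity_f1(gold, from_related_language))
    print(entity_f1(gold, from_unrelated_language))
```

Scoring both transferred taggers against the same annotated test set in this way would give directly comparable numbers for the related-language and unrelated-language conditions.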
Web Page

www.andrew.cmu.edu/user/spingali/15400/

2 Project Milestones

First Technical Milestone for 15-300

By the end of this semester, I plan to catch up on the progress the research group has made so far on cross-lingual transfer, and to have clarity on the best approach(es) to building an NER tagger, so that I can anticipate what I might need to accomplish by mid-semester to stay on track. Papers to read are listed in the "Literature" section below.

[2] http://www.darpa.mil/program/low-resource-languages-for-emergent-incidents

Bi-Weekly Milestones for 15-400

These are subject to change, but here is the current plan:

January 30th: Learn to use the systems already in place for this project (NLP side of the project) and the established protocol for writing rules and annotating corpora (computational linguistics side).
February 13th: Finish annotating the Telugu corpora and preparing development data.
February 27th: Finish writing an initial set of phonological rules to develop a finite-state transducer (FST) based morphological analyzer. Begin testing and refining this set of rules.
March 20th: Finish building and testing the morphological analyzer.
April 3rd: Finish planning NER implementation and begin development.
April 17th: Finish building NER tagger and begin testing.
May 1st: Finish testing and refining NER tagger and try cross-lingual transfer if time permits.

3 Project Prerequisites

Literature

V. Chahuneau, E. Schlinger, N. A. Smith, C. Dyer. "Translating into Morphologically Rich Languages with Synthetic Phrases." Sep. 2013.[3]
Y. Zhang, R. Barzilay. "Hierarchical Low-Rank Tensors for Multilingual Transfer Parsing." Aug. 2015.[4]

Software

An interesting and useful tool David has developed, called EPiTran, converts orthographic text in a language to phonetic transcriptions in the International Phonetic Alphabet (IPA). We may be able to refine cross-lingual transfer by analyzing the phonological similarities of these IPA transcriptions in order to find cognates between two languages.
I may need the source code for this tool to incorporate it into the cross-lingual transfer process.

Lori also told me about a project she worked on 10 years ago with a Telugu-speaking professor in Computational Biology, Madhavi Ganapathiraju, to build a morphological analyzer for Telugu. The code is incomplete, out-of-date and does not appear to be restorable, but we may attempt to acquire access to it anyway for reference when building my Telugu morphological analyzer.

[3] http://victor.chahuneau.fr/pub/morphogen/morphogen-chahuneau-emnlp-2013.pdf
[4] https://people.csail.mit.edu/yuanzh/papers/emnlp2015_hierarchy.pdf
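
The cognate-finding idea in the Software section is not specified further in this proposal. One simple way to picture it, offered only as a sketch under stated assumptions, is to score pairs of IPA transcriptions by normalized edit distance and treat high-scoring pairs as cognate candidates. The function names are hypothetical, and the two example transcriptions are approximate, hand-written stand-ins for "eye" in Telugu and Tamil rather than output from the transcription tool.

```python
from typing import Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two sequences of IPA segments."""
    previous = list(range(len(b) + 1))
    for i, seg_a in enumerate(a, start=1):
        current = [i]
        for j, seg_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (seg_a != seg_b)
            current.append(min(previous[j] + 1,      # deletion
                               current[j - 1] + 1,   # insertion
                               substitution))
        previous = current
    return previous[-1]


def phonetic_similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Similarity in [0, 1]: 1.0 means identical, 0.0 means nothing aligns."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


if __name__ == "__main__":
    # Assumed, approximate transcriptions used only for illustration.
    telugu_word = list("kannu")
    tamil_word = list("kaɳ")
    print(phonetic_similarity(telugu_word, tamil_word))
```

This treats every segment substitution as equally costly. A more careful version would presumably weight substitutions by phonological features, so that closely related sounds count as nearer than unrelated ones; that refinement is an assumption about a possible design and not something stated in this proposal.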
