Haitian Creole: How to Build and Ship an MT Engine from Scratchin4 Days, 17 Hours, & 30 Minutes

Haitian Creole: How to Build and Ship an MT Engine from Scratchin4 Days, 17 Hours, & 30 Minutes

Haitian Creole: How to Build and Ship an MT Engine from Scratchin4 days, 17 hours, & 30 minutes William D. Lewis Microsoft Research One Microsoft Way Redmond, WA 98052 [email protected] Abstract with data for our engines. This entire effort would not have succeeded without their help. We describe the effort of the Microsoft Translator team to develop a Haitian Cre- 2 Introduction ole statistical machine translation engine The unprecedented disaster in Haiti triggered a from scratch in a matter of days. Haitian massive relief effort from around the world. Since Creole presents a number of difficulties a significant portion of the population of Haiti for devleoping an SMT system, princi- speaks Haitian Creole—approximately seven out pal among these is the lack of significant of every eight people in Haiti speak the lan- amounts of parallel training data and an in- guage1— and since Haitian Creole is not widely consistent orthography, both of which lead spoken outside of Haiti, it became obvious that to data sparseness. We demonstrate, how- Machine Translation might be a useful tool for the ever, that it is possible to build a trans- relief effort in Haiti, where it could be applied lation engine of reasonable quality over to translate emergency relief documents, medical very little data by engaging with the native documents, SMS text messages, and even common language community and reducing data phrases and expressions. Unfortunately, no readily sparseness in creative ways. As such, we available Machine Translation engine existed for show that MT as a technology and as a ser- Haitian Creole at the time. Our team received an vice can be deployed rapidly in crisis situ- e-mail from colleagues on the ground in Haiti ask- ations. ing if we could develop a translation engine to aid in the relief effort. Here we relate this effort and 1 Credits the daunting challenges we faced to produce such Credit is due to all members of the Microsoft an engine from scratch in just a few days. Translator team who spent many sleepless nights putting together and shipping the Haitian Creole 3 Motivation translation engines. I am proud and humbled to Haitian Creole is a resource poor language spoken call you colleagues. Our whole team in turn is in- principally on the western portion of the Island of debted to the researchers and data providers who Hispaniola. On January 12th, 2010 a devastating helped us locate or directly provided us with train- earthquake struck the island, with an epicenter in ing data for our engines, to the Butler Hill Group and their employees who donated many, many 1This calculation based on the number of speakers hours to the effort, to Moravia Worldwide and We- in Haiti of 6,960,000 in 2001 as listed Ethnologue (http://www.ethnologue.com/show language.asp?code=hat), localize who helped translate content, and most and an extrapolation of the United Nations’ esti- especially to the volunteers involved in the relief mate of Haiti’s total population of 8,326,000 in 2003, reduced by the country’s estimated annual effort, especially those at Ushahidi and Mission growth rate of 1.32% per year from 2000-2005 4636, who offered crucial advice and provided us (http://www.nationsencyclopedia.com/Americas/Haiti- POPULATION.html), to a figure of 8,107,644 for 2001 c 2010 European Association for Machine Translation. the neighborhood of Port-au-Prince. Due to the Cebuano Statistical Machine Translation system, strength of the quake, and generally poor build- built over resources collected as part of the SLE. ing standards in the city, many tens of thousands of people were killed during the quake and in the days 4 The Challenge and weeks that followed; many more were left On January 19th, 2010, one week after the earth- without shelter, food or water.2 The humanitarian quake, we received an e-mail from a colleague crisis prompted relief organizations to come to the who was involved in the aid effort asking us if it aid of the Haitian people. It became evident early would be possible for us to develop a translation in the crisis that medical and relief documentation engine for Haitian Creole. Within a few hours, in the native Haitian Creole language would be of we rallied a small team of developers, testers, and significant utility to relief workers on the ground in computational linguists to decide on how best we the country. Efforts were started to locate medical could develop such an engine. No one on our team and relief related documents in Creole and make had any knowledge of Creole: no native speakers, these freely available.3 Further, since aid organiza- no linguistic background in the language (other tions such as Usahidi and Mission 4636 had built than trivial knowledge learned in college), no idea up an infrastructure to receive text messages from about the grammatical structure, encoding, orthog- those seeking aid, translating the many thousands raphy, registers (e.g., differences in training data of the Haitian Creole text messages into English in between high and low registers), degree of literacy real time became critical. Based on these and other in the speaker population, etc. Further, we had no potential uses, a Haitian Creole Machine Transla- Haitian Creole data of any kind, nor any readily tion engine might be of significant value to the re- available documents or other materials in Haitian lief effort. Creole or about Haitian Creole. In effect, we were Developing NLP tools and resources for low starting at zero. density or resource poor languages is not with- A resource poor language presents a daunt- out precedent. A notable example of this was ing challenge for developing a Statistical Machine the Surprise Language Exercise (SLE) sponsored Translation (SMT) engine. Since, at the time, we by the DARPA Translingual Information Detec- knew of no corpora of any size in the language, in- tion Extraction and Summarization (TIDES) pro- cluding no known annotated corpora nor parallel gram (Oard, 2003). The idea behind this initia- corpora, developing an initial set of tools to ramp tive was to “address the need for technologies in up the effort would prove to be exceedingly dif- less common languages through experiments in ficult. Although there are public and government rapid technology porting where data collection, re- web sites with Haitian Creole content on the Web, source creation, and technology development take much of this content is in idiosyncratic formats place simultaneously within a very short time pe- (such as PDF) which are not easily consumed. riod (i.e., one month)” (Strassel et al., 2003). Six- Fortunately, Carnegie Mellon University re- teen research teams from around the globe partici- leased a small database of parallel Haitian Creole pated in the initiative, along with several additional and English data shortly after the quake4, which organizations that participated as data providers. was made freely available to the public with lim- (Strassel et al., 2003) discusses in detail the Lin- ited restrictions on use. Much of this data was de- guistic Data Consortium’s efforts to cobble to- veloped for a Speech to Speech Translation project gether sources of data for Cebuano and Hindi in a called DIPLOMAT which had been abandoned in compressed timeframes. Further, (Oard and Och, the 1990’s (Frederking et al., 1997). CrisisCom- 2003) present the results of a rapidly deployed mons made some data available to developers 5, although much of this data overlapped with readily 2 See http://en.wikipedia.org/wiki/2010 Haiti earthquake for available data on the Web (such as the Haitian Con- overview and references about the earthquake and the relief 6 effort. stitution) . Also available was the Haitian Creole 3E.g., Hesperian’s distribution of Where 4 There is No Doctor and Where Women Have See http://www.speech.cs.cmu.edu/haitian/ for the mate- No Doctor through free downloads. See rials, and http://www.cmu.edu/news/archive/2010/January/ http://www.kron.com/News/ArticleView/tabid/298/smid/1126/ jan27 haitiancreoletranslation.shtml for the initial press re- ArticleID/4775/reftab/610/t/Berkeley%20Nonprofit%20Helping lease. %20Quake%20Victims%20With%20Medical%20Materials 5http://wiki.crisiscommons.org/wiki/Machine Translation System %20Translated%20into%20Haitian%20Creole/Default.aspx. 6http://pdba.georgetown.edu/constitutions/haiti/haiti1987.html Bible, although the language in the Bible is some- 4636 text messages, in both Creole and En- what “stilted” and archaic, especially on the En- glish, with the former being translated into glish side. the latter by human translators from around the world. We used bilingual speakers to 5 The Plan clean up this content (since SMS messages We rapidly developed a plan for building the en- tend to be quite noisy, and the translations are gine: done quickly and therefore can be of mixed quality), and added it to our training data. • First, we needed to quickly identify avail- Table 1 shows some sample SMS messages able data sources and start processing them. and the various kinds of noise associated with The resources from CMU and the Bible were them.7 the first to be processed. We identified ad- ditional publicly available data, such as doc- 6 Some of the Challenges Creole uments provided by government agencies, Presented which were assigned to a small cadre of de- velopers to process. Many of these were in Beyond being a “low data” language, there were idiosyncratic formats (such as PDF), and so a number of challenges that Creole presented that were often one-off conversion and clean-up made creating a translation engine more difficult. projects, sometimes cleanable only through Since Creole is fairly “young” as a written lan- 8 some manual effort.

View Full Text

Details

  • File Type
    pdf
  • Upload Time
    -
  • Content Languages
    English
  • Upload User
    Anonymous/Not logged-in
  • File Pages
    6 Page
  • File Size
    -

Download

Channel Download Status
Express Download Enable

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

  • Not to be reproduced or distributed without explicit permission.
  • Not used for commercial purposes outside of approved use cases.
  • Not used to infringe on the rights of the original creators.
  • If you believe any content infringes your copyright, please contact us immediately.

Support

For help with questions, suggestions, or problems, please contact us