Corpus development for machine translation between standard and dialectal varieties

Barry Haddow (1), Adolfo Hernández Huerta (2), Friedrich Neubarth (2), Harald Trost (3)
(1) ILCC, School of Informatics, [email protected]
(2) Austrian Research Institute f. Artificial Intelligence (OFAI), {adolfo.hernandez, friedrich.neubarth}@ofai.at
(3) Institute for Artificial Intelligence, Medical University of Vienna, [email protected]

Abstract

In this paper we describe the construction of a parallel corpus between a standard and a non-standard language variety, specifically standard Austrian German and Viennese dialect. The resulting parallel corpus is used for statistical machine translation (SMT) from the standard to the non-standard variety. The main challenges to our task are data scarcity and the lack of an authoritative orthography. We started with the generation of a base corpus of manually transcribed and translated data from spoken text, encoded in a specifically developed orthography. This data is used to train a first phrase-based SMT system. To deal with out-of-vocabulary items we exploit the strong proximity between source and target variety with a backoff strategy that uses character-level models. To arrive at the necessary size for a corpus to be used for SMT, we employ a bootstrapping approach. Integrating additional available sources (comparable corpora, such as Wikipedia) necessitates identifying parallel sentences in substantially differing parallel documents. As an additional task, the spelling of these texts has to be transformed into the above-mentioned orthography of the target variety.

1 Introduction

Statistical machine translation between dialectal varieties and their cognate standard variety is a challenge quite different from translation between major languages with large resources on both sides. Instead of huge corpora that offer themselves for machine learning techniques, substantial written corpora of dialectal language varieties are rare. In addition, there is no authoritative orthography, which calls for methods to normalize the spelling of existing written texts. Parallel resources for a standard language and a dialectal variety thereof are even less common. But such parallel data is the workhorse of modern machine translation systems and key to producing sufficiently natural utterances. On the positive side, the relative proximity between a standard language and its varieties opens up new possibilities to gather parallel data, despite data sparsity.

In this paper we outline methods to acquire such data, developed for a specific pair of varieties: Austrian German (AG), the standard variety, and a dialectal variety spoken in the capital, Viennese dialect (VD) (Schikola, 1954; Hornung, 1998).[1] From a linguistic perspective, it has to be noted that dialects are generally not homogeneous. Lacking standardization initiatives, reinforcement by education or public media, and being predominantly confined to oral usage, dialects most often form a dynamic continuum between different varieties and speaker groups. Being defined by social group rather than geographical region, the Viennese variety is a sociolect in the strict sense, and dialects in urban regions are generally associated with lower social classes (Labov, 2001). Also, speakers with native competence usually adapt their register to the communicative situation as well as to the content of the utterances in a very dynamic way. Switching between varieties and subtle gradual shifts are natural phenomena in such a linguistic situation. While aware that the linguistic conception of a dialect is not uncontroversial, we still think it is feasible and appropriate to model a dialectal variety that conforms to a stereotype of that dialect.

[1] The work presented in this paper is based on the project 'Machine Learning Techniques for Modeling of Language Varieties' (MLT4MLV - ICT10-049) funded by the Vienna Science and Technology Fund (WWTF).

The paper focuses on the generation of the resources necessary for statistical machine translation between a standard variety with rich resources (AG) and a dialectal variety (VD) with almost no resources. The strategy is to create a minimal base
corpus comprising bilingual data in a standardized orthography for VD, and in a second step to apply a bootstrapping strategy in order to gain a sufficient amount of bilingual lexical resources and to increase the data on the basis of automatically generated translations. As the proximity between the varieties works in our favour, we give detailed descriptions of how this linguistic closeness can be exploited to bootstrap the required resources.

2 Background

Pairs of closely related languages (or language varieties) invite exploiting the linguistic proximity in order to overcome the usual scarcity of parallel data. Nakov and Tiedemann (2012) take advantage of the great overlap in vocabulary and the strong syntactic and lexical similarity between Bulgarian and Macedonian. They develop an SMT system for this language pair by employing a combination of character-level and word-level translation models, outperforming a phrase-based word-level baseline. Regarding MT of dialects, Zbib et al. (2012) use crowdsourcing to build Levantine-English and Egyptian-English parallel corpora, while Sawaf (2010) normalizes non-standard, spontaneous and dialectal Arabic into Modern Standard Arabic to achieve translations into English.

A considerable amount of work has been done on extracting parallel sentences from comparable corpora, i.e. sets of documents in different languages that contain similar information. Munteanu and Marcu (2005) use a Maximum Entropy classifier trained on parallel sentences to determine whether a sentence pair is parallel or not. Based on techniques from Information Retrieval, Abdul-Rauf and Schwenk (2011) use the translations of an SMT system in order to find the corresponding parallel sentences on the target-language side of the comparable corpus. Smith et al. (2010) explore Wikipedia to extract parallel sentences: once they achieve an alignment at the document level by taking advantage of the structure of this online encyclopedia, they train Conditional Random Fields to tackle the task of sentence alignment. Tillmann and Xu (2009) extract sentence pairs with a model based on IBM Model-1 (Brown et al., 1993) trained on parallel data. With the exception of Munteanu and Marcu (2005), where bootstrapping techniques find application, these methods require (and presuppose the existence of) a certain amount of resources (i.e. parallel data or lexicon coverage) not available for some languages or varieties.

3 Constructing a Parallel Corpus

For Standard German to Viennese dialect, there were no existing parallel data sets, and moreover, most monolingual text sources that exist are written in an inconsistent way, oscillating between standard conventions and free attempts to encode the phonetic realization of the dialect. The first step was therefore to design an orthographic standard for the target language that would be consistent, unambiguous and phonologically transparent. In light of applicability in language technology, accuracy with respect to phonological properties seemed the most important criterion, on a par with the necessity to minimize lexical ambiguities. This is different from producing literary texts, where readability might be a more prominent issue and orientation towards the standard orthography may have a higher priority.

A second problem with initial data acquisition is the fact that dialect speakers in Vienna very often switch between the dialect and the standard variety, depending on the communicative situation, but also on content that may invite a higher register. Text data with a bias towards the standard by virtue of standard orthography quite often also reflects such switching processes. In order to circumvent such biases, we carefully selected colloquial VD data that is as authentic to the dialect as possible. The basic material consists of transcripts of TV documentaries and free interview recordings of dialect speakers.
The transcripts were manually translated into both AG and VD, the latter translation being vacuous in most cases. This way we could ensure that (rarely occurring) switches into the standard would not end up in the target model. A typical example looks as follows, where AG and VD refer to the standard and the Viennese orthography of a sentence from our corpus:

(1) AG: Ja, ich weiß es doch.
    VD: Jå, i waas s e.
    'Yes, I know it anyway.'

At an early stage, we were interested in finding a way to align these parallel sentences on a word-by-word basis, in order to simultaneously generate lexical resources comprising morphology and morpho-syntactic features (PoS tags and grammatical features such as gender, case, person, number etc.). Given that the two translations are usually syntactically very similar, with little reordering and few n-to-n correspondences, and that many corresponding words are 'cognates', meaning that they are lexically (and morphologically) the same in both varieties but differ in phonology and spelling (e.g., AG 'weiß' corresponds to VD 'waas', "(I) know"), we bootstrapped a word-alignment routine that very soon provided promising results.

The core idea was to use the string edit distance (Levenshtein algorithm) to determine whether two words should be aligned or not. Because it matters whether one or more editing steps (errors) occur in a short or in a long word, we normalized the string edit distance by a factor consisting of the logarithm of the average string length (or a special penalty factor for very short strings). However, the orthographic forms may differ substantially while referring to identical words.
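The exact normalization is not given as a formula in the text, so the following Python sketch fills in plausible details; the length threshold and the penalty value for very short strings are illustrative assumptions.

```python
import math

def levenshtein(a: str, b: str) -> int:
    """Plain string edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def normalized_distance(a: str, b: str,
                        min_len: int = 4, short_penalty: float = 1.0) -> float:
    """Edit distance divided by the log of the average string length.
    Very short strings get a fixed penalty factor instead (assumed value),
    since one edit in a three-letter word weighs much more than in a long one."""
    avg_len = (len(a) + len(b)) / 2.0
    norm = math.log(avg_len) if avg_len >= min_len else short_penalty
    return levenshtein(a, b) / norm
```

Whether a normalized score counts as a cognate match is then simply a threshold choice, tuned on validated pairs.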
So, the second ingredient was to train a character-based translation model between AG and VD, using the data-driven grapheme-to-phoneme converter Sequitur G2P (Bisani and Ney, 2008). The automatically generated strings of dialect words (VD*) are then compared to the words of the target (VD). Given that the initial data is very limited, the results of the G2P conversion are not reliable as translations, but they are still very useful for determining the distance measure. Since the full set of extracted word pairs (after validation) is used to re-train the models in an iterative way, the word alignment gets better the more data is added. In a way, over-fitting, generally carefully avoided in statistical modeling, works to our advantage.

The alignment algorithm in a first step linearly searches for the best path of matches. If the score provided by the string edit distance is above a given threshold, insertions and deletions are the less costly options, and the words will not be aligned. By this method alone, we would only align cognates and miss the more interesting cases where words of AG are translated into different words that may be typical for the dialect (e.g., VD has a special word for AG 'Polizist', "policeman": VD 'kibara'). Therefore two more iterations over the set of aligned pairs try to find these non-cognate pairs. First, adjacent insertion-deletion pairs are aligned regardless of the distance measure. This guarantees that word pairs that are not cognates (with a high degree of similarity) but different lexical items are also captured by the word alignment, given that the syntactic structures of the source and the target sentence are approximately the same. Second, non-adjacent insertion-deletion pairs with a distance measure below the threshold are marked as valid alignments. That way the algorithm, which by itself provides only linear alignments, is also capable of capturing some non-local alignments resulting from syntactic re-ordering.
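The three passes can be pictured roughly as follows. This is a simplified sketch, not the authors' implementation: pass 1 is reduced to a greedy non-crossing matcher, and dist stands for the normalized distance above (possibly computed on G2P-converted forms).

```python
from itertools import product

def align_sentence(src, tgt, dist, threshold=1.5):
    """Three-pass word alignment sketch over two token lists."""
    links, used_s, used_t = [], set(), set()

    # Pass 1 (simplified): greedily link the cheapest cognate pairs,
    # never crossing an existing link; pairs above the threshold stay
    # unaligned (they would be cheaper as insertion/deletion).
    pairs = sorted(product(range(len(src)), range(len(tgt))),
                   key=lambda p: dist(src[p[0]], tgt[p[1]]))
    for i, j in pairs:
        if dist(src[i], tgt[j]) > threshold:
            break                          # remaining pairs are even worse
        if i in used_s or j in used_t:
            continue
        if any((i - k) * (j - l) < 0 for k, l in links):
            continue                       # would cross an existing link
        links.append((i, j)); used_s.add(i); used_t.add(j)

    # Pass 2: one unaligned source word facing exactly one unaligned
    # target word between two aligned neighbours is linked regardless
    # of distance (captures pairs like 'Polizist' ~ 'kibara').
    links.sort()
    for (i1, j1), (i2, j2) in list(zip(links, links[1:])):
        if i2 - i1 == 2 and j2 - j1 == 2:
            links.append((i1 + 1, j1 + 1))
            used_s.add(i1 + 1); used_t.add(j1 + 1)

    # Pass 3: remaining non-adjacent pairs below the threshold are
    # accepted, allowing some non-local alignments from re-ordering.
    for i in range(len(src)):
        for j in range(len(tgt)):
            if i not in used_s and j not in used_t \
               and dist(src[i], tgt[j]) <= threshold:
                links.append((i, j)); used_s.add(i); used_t.add(j)
    return sorted(links)
```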
With regard to SMT, and contemplating the immanent problem of data sparsity, it seems obvious that a factorized translation model (Koehn and Hoang, 2007) will have certain advantages over a translation model that only considers full word forms. This, however, requires the generation of lexical resources for both language varieties. For the source language (AG) such resources already exist. The question is if, and how, the lexical information stemming from the source language can be transferred onto the target language.

Our word alignment is capable of identifying cognates. However, these cognates will only cover certain word forms out of more complex morphological paradigms. Given that for AG the lemma and the information about the paradigm can be automatically retrieved from the word form, the task is to identify the lemma and the paradigm from the VD word form. In many cases it will suffice to strip off the inflectional endings and to transfer the morphological information from the AG entry. However, there are many deviations (from AG to VD) as well as exceptions, and only real cognates can be treated that way, so some manual validation has to be done in order to create a VD lexicon that in the end covers all word forms.

(2) INPUT:  haus NN Neut . -I-a
    OUTPUT: haus  haus+NN+Neut+Sg+NDA
            heisa haus+NN+Neut+Pl+NDA
    sg./pl. forms of VD 'haus' (AG 'Haus', 'house')

When the lemma, the major category and the relevant morphological information are identified, this is sufficient to generate all word forms together with morphological features in a given language variety.
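Once lemma, category and paradigm are identified, generating the full paradigm is mechanical. The snippet below reproduces example (2) with a toy paradigm table; the rules attached to the paradigm code '-I-a' are our own illustrative reconstruction, not the project's actual lexicon format.

```python
# Toy paradigm table: paradigm code -> list of (ending, features, stem change).
# The '-I-a' entry mimics example (2): sg. 'haus' vs. pl. 'heisa', with an
# umlaut-like stem change plus an '-a' ending (purely illustrative rules).
PARADIGMS = {
    "-I-a": [("", "Sg+NDA", None), ("a", "Pl+NDA", ("au", "ei"))],
}

def generate_forms(lemma, cat, gender, paradigm):
    """Yield (word_form, analysis) pairs for one VD lexicon entry."""
    for ending, feats, stem_change in PARADIGMS[paradigm]:
        stem = lemma
        if stem_change:                       # e.g. 'haus' -> 'heis'
            old, new = stem_change
            stem = stem.replace(old, new)
        yield stem + ending, f"{lemma}+{cat}+{gender}+{feats}"

for form, analysis in generate_forms("haus", "NN", "Neut", "-I-a"):
    print(form, analysis)
# haus  haus+NN+Neut+Sg+NDA
# heisa haus+NN+Neut+Pl+NDA
```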
4 Machine Translation Experiments

In this section we report on experiments using the data set described in the previous section to build statistical machine translation systems with Moses (Koehn et al., 2007).

4.1 Corpus

The corpus was split into four sections, TRAIN, DEV, DEVTEST and TEST, where the first was used for the estimation of phrase tables and language models, the second for tuning the MT system parameters, and the third for testing during system development. The last was reserved for final testing. The sizes of the four sections are shown in Table 1.

Section    Sentences   Tokens (AG)   Tokens (VD)
TRAIN         4909        39108         40031
DEV            600         4775          4882
DEVTEST        600         4712          4803
TEST           600         4841          4943

Table 1: Corpus sizes (untokenised)

4.2 Word-level Models

The word-level models are standard phrase-based models built using Moses. The parallel text is tokenised using the Moses tokeniser for German and then lowercased. It is aligned in both directions using GIZA++ (Och and Ney, 2000), and the alignments are symmetrised using the "grow-diag-final-and" heuristic. The aligned parallel text is then used to estimate a translation table using the standard Moses heuristics, and a 3-gram language model is built on the target side of the parallel text using SRILM with Kneser-Ney smoothing. The translation and language models are then combined with a distance-based reordering model, and their weights are optimised for BLEU using MERT on the DEV corpus.

4.3 Character-level Models

Earlier work on MT for closely related languages (Vilar et al., 2007; Tiedemann, 2009; Nakov and Tiedemann, 2012) has shown that character-level translation models can be effective. These character-level models are also built using phrase-based Moses, but allowing it to treat single characters or groups of characters as "tokens". In the unigram character-level model, we treat each character as a separate token by inserting a space between each of them, and we use a special character (||) to indicate word boundaries. For the bigram character-level model, the "tokens" are pairs of adjacent characters, with the same word-boundary character as in the unigram model. Figure 1 shows a German phrase converted into the formats suitable for the character-level unigram and bigram models.

word-level:                und für die tipps
character-level (unigram): || u n d || f ü r || d i e || t i p p s ||
character-level (bigram):  ||u un nd d|| ||f fü ür r|| ||d di ie e|| ||t ti ip pp ps s||

Figure 1: Conversion of a German sentence into forms suitable for training the character-level models
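The round trip between word-level and character-level representations is mechanical. The following sketch shows both directions; it treats the boundary marker as the literal two-character string '||' (in the actual system it is a single special token), and the function names are ours.

```python
BOUNDARY = "||"

def to_unigrams(sentence: str) -> str:
    """'und für' -> '|| u n d || f ü r ||'"""
    out = [BOUNDARY]
    for word in sentence.split():
        out.extend(list(word) + [BOUNDARY])
    return " ".join(out)

def to_bigrams(sentence: str) -> str:
    """'und' -> '||u un nd d||' (overlapping pairs incl. boundaries)."""
    out = []
    for word in sentence.split():
        units = [BOUNDARY] + list(word) + [BOUNDARY]
        out.append(" ".join(units[i] + units[i + 1]
                            for i in range(len(units) - 1)))
    return " ".join(out)

def from_unigrams(tokens: str) -> str:
    """Remove spaces, then turn boundary markers back into spaces."""
    return tokens.replace(" ", "").replace(BOUNDARY, " ").strip()

def from_bigrams(tokens: str) -> str:
    """Drop the first unit (character or boundary marker) of each
    bigram, then proceed as for the unigram representation."""
    units = [tok[len(BOUNDARY):] if tok.startswith(BOUNDARY) else tok[1:]
             for tok in tokens.split()]
    return from_unigrams(" ".join(units))
```

For example, from_bigrams("||u un nd d|| ||f fü ür r||") recovers "und für", matching the back-conversion described in the text below.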

After decoding with one of the character-level models, converting back to word-level text is straightforward in the unigram case: it is just a matter of removing spaces and then replacing the special word-boundary character with a space. For the bigram-level model, we remove the first character in each bigram and then proceed as for the unigram-level model.

Other than the word-to-character conversion of all data, the character-based models are trained using the standard Moses training pipeline. We use the default maximum phrase length of 7 and a 7-gram language model, parameters that were observed to work best in early experiments. During tuning, we maximise word-level BLEU with respect to the reference.

4.4 Backoff Models

After observing the performance of word- and character-level models, we decided to combine them into a backoff model, which uses the word-level translation wherever possible but applies the character-level model for unknown words. Nakov and Tiedemann (2012) found that a similar model combination gave the best results when translating between closely related languages.

Firstly, we experimented with different variations of the character-level model for the unknown words (OOVs). Each of these models is trained and tuned on the TRAIN and DEV sets, and we report accuracies on the OOVs in DEVTEST (OOV according to the phrase table built on DEV). The translations of the OOVs were extracted from the word alignments of the base corpus; out of 330 OOVs, 325 have gold translations.

The first two character-level models are just the unigram and bigram baseline models from Section 4.3. We then built further models by attempting to extract the cognates from the training set. The idea here is that the character-level models are built from "noisy" training data, containing many German-Viennese word pairs which either represent lexical differences or are the result of bad alignments. In order to extract the cognates, we ran GIZA++ alignment on the combined TRAIN and DEV corpora, extracted all source-target token pairs that were aligned, converted the pairs to the BARSUBST representation (see Section 5.1), and filtered using the log-normalised Levenshtein distance.

With this list of cognate pairs, we trained both unigram and bigram models, firstly from a list of the unique cognate pairs and secondly from the same list with frequencies adjusted to match their corpus frequencies. These models were trained using the usual Moses pipeline, estimating phrase tables and language models from 90% of the cognate pairs and tuning on the other 10%.

The OOV accuracies (on DEVTEST) of all six character-level models, as well as a pass-through baseline, are shown in Table 2. We can see that, in general, the cognate models offer small improvements over the models trained on whole sentences, and the unigram models are slightly better than the bigram models.

Model                    Correct   Accuracy (%)
Pass-through                21         6.5
Unigram                    154        47.4
Bigram                     150        46.2
Unigram cognate            154        47.4
Bigram cognate             150        46.2
Unigram cognate (freq)     160        49.2
Bigram cognate (freq)      145        44.6

Table 2: Comparison of accuracy of character-level models on the OOVs in DEVTEST. The plain unigram/bigram models are trained on complete sentences, whereas the cognate models are trained on cognate pairs (unique or frequency weighted) extracted from these sentences.

Finally, Table 3 compares the word-level and character-level systems with the backoff system (using the frequency-adjusted unigram cognate model). The backoff systems are implemented by first examining the tuning and test data for OOVs, then translating these using the character-level model, and creating a second phrase table with these character-level translations. This second phrase table is used in Moses as a backoff table.

Model                       DEVTEST    TEST
Word-level                    63.28    60.04
Character-level (unigram)     65.00    63.17
Character-level (bigram)      64.98    63.43
Backoff to char-level         68.30    66.13

Table 3: BLEU scores for all translation systems

For both test sets, the character-level translation outperforms the word-level translation, but the backoff model offers the best performance of all. The BLEU scores are relatively high compared to typical values reported in the MT literature, reflecting the restricted vocabulary of the data set.
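As a rough illustration of the backoff preparation, the sketch below collects the OOVs, translates each with a character-level model, and writes them out as an additional phrase table. char_translate and the simplified single-score table format are assumptions; a real setup would decode with Moses and plug the resulting table into its configuration as the backoff table.

```python
def build_backoff_table(sentences, phrase_table_vocab, char_translate,
                        out_path="backoff-phrase-table.txt"):
    """For every word without a phrase-table entry, produce one
    translation with the character-level model and emit it as an
    extra phrase pair (simplified 'src ||| tgt ||| score' lines)."""
    oovs = {w for s in sentences for w in s.split()
            if w not in phrase_table_vocab}
    with open(out_path, "w", encoding="utf-8") as out:
        for word in sorted(oovs):
            # char_translate: word -> best character-level translation,
            # e.g. decoding to_unigrams(word) and converting back.
            out.write(f"{word} ||| {char_translate(word)} ||| 1.0\n")
    return oovs
```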

5 Comparable Corpora

Wikipedia is a multilingual free online encyclopedia with currently 285 language versions. Adafre and de Rijke (2006) investigated the potential of this resource for generating parallel corpora by applying different methods for identifying similar texts across multiple languages. We explore this resource as it contains a relatively large bilingual corpus of articles in (Standard) German (DE) and Bavarian dialects (BAR). There are 5135 parallel articles (status from July 2012), of which 219 are explicitly tagged as "Viennese dialect". It can be assumed that parallel articles refer to the same content, but the texts often differ substantially in style and detail. Articles in Bavarian are generally shorter, containing less information than the corresponding German ones, with an average ratio of about 1:6. The challenge of finding corresponding sentence pairs is met by a sentence alignment method that crucially exploits the phonetic similarity between the German standard and Bavarian dialects, specifically Viennese.

5.1 Sentence Extraction

Our sentence alignment algorithm is primarily based on string-edit distance measures. There exist several open-source alignment tools for extracting parallel sentences from bilingual corpora, but none of them is applicable to our data. Some require a substantial amount of data to reliably estimate statistical models, i.e. at least 10k sentence pairs, such as the Microsoft Bilingual Aligner (Moore, 2002); moreover, the number of sentences to be aligned must be almost equal, and with a ratio of 1:6 it was not possible to achieve any reliable results at all. Additionally, the sentences in the parallel texts are presupposed to occur in the same order, which does not apply to the Wikipedia articles under consideration. Similar requirements hold for the Hunalign tool (Varga et al., 2005). Finally, LEXACC (Stefanescu et al., 2012) is a parallel sentence extractor for comparable corpora based on Information Retrieval, but again, certain resources are required beforehand, such as a GIZA++ dictionary created from existing parallel documents. The main obstacle to using any of these algorithms is that the texts in the BAR Wikipedia obey widely differing and mostly ad-hoc orthographic conventions, which are not consistent for a given dialectal variety, even within a single article. In our situation, we had to develop an alignment method that relies only on the linguistic proximity between the two varieties.

Comparing strings of DE in standard orthography directly with strings of BAR in varying non-standard orthography does not make sense unless both forms are transformed into a phonetically based common form. Inspired by Soundex and the Kölner Phonetik algorithm (Postel, 1969), we developed an algorithm (henceforth BARSUBST) that takes into account some characteristics of the Bavarian dialect family (liquid vocalization, i.e. DE 'viel' corresponds to VD 'fü'; vowels are retained as one class of characters; a character for 'dark a' had to be included). This ensures that cognate words will have a very low string edit distance. To give an impression, we calculated the average Levenshtein string edit distance and the average ambiguity of particular word forms of AG and VD from the data of the word-aligned base corpus. As ambiguity we counted the number of occurrences of a given word in distinct word pairs. The baseline values of 1.26/1.27 relate to the fact that for a given word there may be more than one valid translation. When the ambiguity is much higher, this indicates that the distance measure is less reliable (words that should not be related turn out to be identical).

                   LD     amb. DE   amb. VD
Baseline (lcase)   3.47     1.26      1.27
Soundex            1.35     4.53      5.69
Kölner Phonetik    1.00     1.87      2.25
BARSUBST           0.99     1.94      1.96

Table 4: Average distance and ambiguity values

The average LD drops significantly, from 3.47 for the baseline (lowercased word forms) to approximately 1.0 for both Kölner Phonetik and BARSUBST. The average ambiguity is almost equal for both BAR and DE, slightly below a value of 2 with BARSUBST; the Kölner Phonetik algorithm fares better with DE word forms but worse with BAR word forms, which shows that it is justified to adapt the Kölner Phonetik approach to our purposes.
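The concrete rule set of BARSUBST is not spelled out in the text, so the following is only a rough illustration of the general shape of such a phonetic key function; every rule shown is an assumption, loosely modelled on the characteristics mentioned above.

```python
import re

# Illustrative only: a tiny BARSUBST-like phonetic key. The real rule
# inventory (including the 'dark a' class) is more elaborate.
RULES = [
    (r"v", "f"),                   # merge v/f (assumed, Kölner-Phonetik-like)
    (r"sch|ch|ck", "k"),           # collapse consonant digraphs (assumed)
    (r"([aeiouäöüå])l", r"\1a"),   # liquid vocalization: postvocalic l -> vowel
    (r"[aeiouäöüå]+", "V"),        # all vowels (incl. dark a 'å') -> one class
]

def barsubst_key(word: str) -> str:
    key = word.lower()
    for pattern, repl in RULES:
        key = re.sub(pattern, repl, key)
    return key

# Cognates in differing orthographies now collide or nearly so:
print(barsubst_key("viel"), barsubst_key("fü"))   # -> fV fV
```

String edit distance is then computed over these keys rather than over the raw orthographic forms.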
Applying this algorithm to words of both DE and BAR, we defined a scoring function that evaluates possible word alignments against each other in order to find the optimal sentence pairs from related articles. The alignment algorithm works as follows: after creating a matrix of all sentence pairs, each potential alignment is evaluated by a scoring function that takes into account the sum of (positive and negative) scores resulting from a non-linear word alignment based on the transformed character sequences (best matches aligned first), the number of non-aligned words (negative scores), and a penalty for crossing alignments and extra-short word sequences. We selected a set of approx. 50 sentences to manually test the effects of the different parameters of the scoring function. After fine-tuning the parameters, a score above zero proved to be a good indicator of a correct alignment between two sentences. From this matrix, sentence pairs are extracted in the order of their scores (best scores first) until a defined threshold is reached.

We used only articles that are explicitly tagged as 'Viennese' (approx. 200). From these we extracted and aligned 4414 sentences with 40.1k word tokens corresponding to 12.9k word types. Unlike the texts extracted from spontaneous speech recordings, the Wikipedia texts contain many more word types, which is due to the fact that Wikipedia texts tend to contain a large number of named entities. Unfortunately, these are not very useful for SMT by themselves, but still the amount of parallel data can be significantly increased.
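In outline, the extraction over the score matrix might look as follows; the scoring function itself is left as a parameter, since its weights and penalties were hand-tuned as described above.

```python
from itertools import product

def extract_sentence_pairs(de_sents, bar_sents, score, min_score=0.0):
    """Greedy extraction from the score matrix of all sentence pairs:
    best-scoring pairs are accepted first, each sentence is used once,
    and extraction stops when no remaining pair beats the threshold."""
    # score(de, bar) would sum word-match scores over barsubst_key
    # sequences and subtract penalties for unaligned words, crossing
    # alignments and extra-short word sequences (a stub here).
    matrix = [(score(d, b), i, j)
              for (i, d), (j, b) in product(enumerate(de_sents),
                                            enumerate(bar_sents))]
    pairs, used_de, used_bar = [], set(), set()
    for s, i, j in sorted(matrix, reverse=True):
        if s <= min_score:
            break                     # all remaining scores are worse
        if i in used_de or j in used_bar:
            continue
        pairs.append((de_sents[i], bar_sents[j], s))
        used_de.add(i); used_bar.add(j)
    return pairs
```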
5.2 Orthography Normalization

One problem still to be solved in a satisfactory way is how to deal with the non-standardized, inconsistent orthography of dialectal texts. The corpus of parallel sentences from Wikipedia articles can in principle provide ample training data for a character-level translation algorithm between the non-standard orthography of BAR and our specifically designed, standardized orthography for VD. Given a 1-to-1 word alignment of sufficient quality based on the Levenshtein distance of BARSUBST-transformed word forms, we can extract a list of BAR-DE word pairs, but the target, the words in VD orthography, is missing.

To tackle this problem, we used the data from our speech-based corpus of aligned AG-VD word pairs. (We take Austrian German (AG) and Standard German (DE) to refer to the same variety.) We filtered the list of word pairs gained from BAR-DE word alignment for only those DE-BAR pairs where we have an AG-VD word pair in our base corpus. That way, the AG/DE standard is used as an anchor to link non-standardized BAR orthography to our standardized VD orthography.

Figure 2: Correspondences between BAR orthographic forms, AG-VD pairs and VD forms.

As can be seen in Figure 2, the correspondences can be manifold. In order to decide which VD form is the correct one to associate with certain BAR variants, we apply a weighted Levenshtein distance measure, where the weights are chosen in such a way that plausible and frequent substitutions are assigned lower costs than others (see the sketch at the end of this section). When more data is available, these weights can be re-estimated on a statistical basis; for a start, we simply stipulated them based on linguistic knowledge about the two varieties. The matches are not symmetrical under this approach: for example, a BAR ending written ⟨en⟩ matching a VD ending written ⟨n⟩ (which often occurs when dative endings in BAR are written according to the DE standard, while they are pronounced and written as /n/ in VD) is assigned a cost of 0.6, while the reverse match is not defined and receives the default cost value of 1.

Having gathered some initial training data this way, we experimented with training a character-level translation model, again using Sequitur G2P. Of all the pairs, we set aside 25% for testing and used the rest for training, which proved to be very little: approx. 1500 instances of BAR-VD pairs. To increase this number in a sensible way, we created two more training sets, one by adding the set of AG-VD pairs from the base corpus, and one by additionally adding a set of VD-VD pairs, simulating a situation where the BAR input is already in the correct orthography. The results are not fully compelling (50% correct spellings in the optimal case). This may be due to the rather small amount of training data, but also to the high degree of variance in the input data. To enhance the quality of orthography normalization, we foresee combining models of character-level BAR-VD correspondences with the character-level translation models of AG to VD, which will hopefully make it possible to obtain an automatically normalized parallel corpus from the Wikipedia data that conforms to the same standards as the base corpus.
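Returning to the weighted distance used for picking VD forms: below is a minimal sketch with the single published cost (BAR ⟨en⟩ to VD ⟨n⟩ at 0.6); all other operations fall back to the default cost of 1, which makes the measure deliberately asymmetric. The example word forms are illustrative.

```python
# Stipulated, asymmetric substitution costs: (bar_segment, vd_segment) -> cost.
# Only the published example is listed; further pairs would be added from
# linguistic knowledge and later re-estimated from data.
PAIR_COSTS = {("en", "n"): 0.6}
DEFAULT = 1.0

def weighted_distance(bar: str, vd: str) -> float:
    """Levenshtein DP extended with multi-character substitutions whose
    costs come from PAIR_COSTS; ('n', 'en') is not listed, so the
    reverse match costs DEFAULT per edit."""
    n, m = len(bar), len(vd)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            best = float("inf")
            if i > 0:
                best = min(best, d[i - 1][j] + DEFAULT)      # deletion
            if j > 0:
                best = min(best, d[i][j - 1] + DEFAULT)      # insertion
            if i > 0 and j > 0:                              # substitution
                sub = 0.0 if bar[i - 1] == vd[j - 1] else DEFAULT
                best = min(best, d[i - 1][j - 1] + sub)
            for (s, t), cost in PAIR_COSTS.items():          # listed segments
                if bar[:i].endswith(s) and vd[:j].endswith(t):
                    best = min(best, d[i - len(s)][j - len(t)] + cost)
            d[i][j] = best
    return d[n][m]

print(weighted_distance("hausen", "hausn"))  # 0.6 via the <en> -> <n> rule
print(weighted_distance("hausn", "hausen"))  # 1.0: reverse match undefined
```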

6 Discussion and Outlook

Starting from a base corpus of parallel AG and VD sentences, generated by manual transcription of spoken text and translation into the two varieties, we applied various methods to iteratively enhance the word alignment and the generation of lexical resources in the target variety. Using this corpus for SMT provided good preliminary results, given that we employed a backoff strategy for OOV words building on character-level models. To enlarge the corpus with automatic methods, we extracted sentence pairs from corresponding articles of the Bavarian and the German Wikipedia, where the identification of corresponding sentences was based on the similarity of the two varieties. Still, the normalization of Bavarian/Viennese dialectal spelling to our orthography is work in progress. However, methods for the normalization of spelling are crucial for the acquisition of monolingual data from dialectal texts in general. Another line of work will be the bootstrapping of parallel data by generating automatic translations of sentences selected by an active learning algorithm, in order to gain maximal information for the system.

References

Sadaf Abdul-Rauf and Holger Schwenk. 2011. Parallel sentence generation from comparable corpora for improved SMT. Machine Translation, 25(4):341–375.

Sisay Fissaha Adafre and Maarten de Rijke. 2006. Finding similar sentences across multiple languages in Wikipedia. In Proceedings of the 11th Conference of the European Association for Computational Linguistics (EACL), pages 62–69.

Maximilian Bisani and Hermann Ney. 2008. Joint-sequence models for grapheme-to-phoneme conversion. Speech Communication, 50(5):434–451.

Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: parameter estimation. Computational Linguistics, 19(2):263–311.

Maria Hornung. 1998. Wörterbuch der Wiener Mundart. ÖBV, Pädagogischer Verlag.

Philipp Koehn and Hieu Hoang. 2007. Factored translation models. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 868–876, Prague, Czech Republic, June.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (ACL), demonstration session, pages 177–180, Prague, Czech Republic.

William Labov. 2001. Principles of linguistic change (II): social factors. Blackwell, Massachusetts.

Robert C. Moore. 2002. Fast and accurate sentence alignment of bilingual corpora. In Proceedings of the 5th Conference of the Association for Machine Translation in the Americas (AMTA), pages 135–144, Tiburon, CA.

Dragos Stefan Munteanu and Daniel Marcu. 2005. Improving machine translation performance by exploiting non-parallel corpora. Computational Linguistics, 31(4):477–504.

Preslav Nakov and Jörg Tiedemann. 2012. Combining word-level and character-level models for machine translation between closely-related languages. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL), pages 301–305, Jeju, Korea.

Franz Josef Och and Hermann Ney. 2000. Improved statistical alignment models. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics (ACL), pages 440–447, Hong Kong, China.

Hans Joachim Postel. 1969. Die Kölner Phonetik. Ein Verfahren zur Identifizierung von Personennamen auf der Grundlage der Gestaltanalyse. IBM-Nachrichten, 19:925–931.

Hassan Sawaf. 2010. Arabic dialect handling in hybrid machine translation. In Proceedings of the 9th Conference of the Association for Machine Translation in the Americas (AMTA), Denver, Colorado.

Hans Schikola. 1954. Schriftdeutsch und Wienerisch. Österr. Bundesverlag für Unterricht, Wissenschaft und Kunst.

Jason R. Smith, Chris Quirk, and Kristina Toutanova. 2010. Extracting parallel sentences from comparable corpora using document level alignment. In Proceedings of Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL HLT), pages 403–411, Los Angeles, California.

Dan Stefanescu, Radu Ion, and Sabine Hunsicker. 2012. Hybrid parallel sentence mining from comparable corpora. In Proceedings of the 16th Annual Conference of the European Association for Machine Translation (EAMT), pages 137–144, Trento, Italy.

Jörg Tiedemann. 2009. Character-based PSMT for closely related languages. In Lluís Màrquez and Harold Somers, editors, Proceedings of the 13th Annual Conference of the European Association for Machine Translation (EAMT'09), pages 12–19, Barcelona, Spain.

Christoph Tillmann and Jian-Ming Xu. 2009. A simple sentence-level extraction algorithm for comparable data. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL HLT), pages 93–96, Boulder, Colorado.

Dániel Varga, Péter Halácsy, András Kornai, Viktor Nagy, László Németh, and Viktor Trón. 2005. Parallel corpora for medium density languages. In Proceedings of RANLP, pages 590–596.

David Vilar, Jan-Thorsten Peter, and Hermann Ney. 2007. Can we translate letters? In Proceedings of the Second Workshop on Statistical Machine Translation, pages 33–39, Prague, Czech Republic. Association for Computational Linguistics.

Rabih Zbib, Erika Malchiodi, Jacob Devlin, David Stallard, Spyros Matsoukas, Richard Schwartz, John Makhoul, Omar F. Zaidan, and Chris Callison-Burch. 2012. Machine translation of Arabic dialects. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT), pages 49–59, Montreal, Canada.