Recurrent Neural Network Based Loanwords Identification in Uyghur

PACLIC 30 Proceedings

Chenggang Mi1,2, Yating Yang1,2,3, Xi Zhou1,2, Lei Wang1,2, Xiao Li1,2 and Tonghai Jiang1,2

1 Xinjiang Technical Institute of Physics and Chemistry of Chinese Academy of Sciences, Urumqi, 830011, China
2 Xinjiang Laboratory of Minority Speech and Language Information Processing, Xinjiang Technical Institute of Physics and Chemistry of Chinese Academy of Sciences, Urumqi, 830011, China
3 Institute of Acoustics of the Chinese Academy of Sciences, Beijing, 100190, China

{micg,yangyt,zhouxi,wanglei,xiaoli,jth}@ms.xjb.ac.cn

Abstract

A comparable corpus is the most important resource in several NLP tasks. However, it is very expensive to collect manually. Lexical borrowing happens in almost all languages. We can use loanwords to detect useful bilingual knowledge and expand the size of donor-recipient / recipient-donor comparable corpora. In this paper, we propose a recurrent neural network (RNN) based framework to identify loanwords in Uyghur. Additionally, we suggest two features, an inverse language model feature and a collocation feature, to improve the performance of our model. Experimental results show that our approach outperforms several sequence labeling baselines.

1 Introduction

Most natural language processing (NLP) tools rely on large-scale language resources, but many languages in the world are resource-poor. To make these NLP tools widely usable, some researchers have focused on techniques that obtain resources for resource-poor languages from resource-rich languages using parallel data, for NLP applications such as syntactic parsing, word sense tagging, machine translation, semantic role labeling, and some cross-lingual NLP tasks. However, high-quality parallel corpora are expensive and difficult to obtain, especially for resource-poor languages like Uyghur.

Lexical borrowing is very common between languages. It is a phenomenon of cross-linguistic influence (Tsvetkov et al., 2015a). If loanwords in resource-poor languages (e.g. Uyghur) can be identified effectively, we can use the bilingual word pairs as an important factor in comparable corpora building, and comparable corpora are vital resources in parallel corpus detection (Munteanu et al., 2006). Additionally, loanwords can be integrated into bilingual dictionaries directly. Therefore, loanwords are valuable to study in several NLP tasks such as machine translation, information extraction and information retrieval.

In this paper, we design a novel model to identify loanwords (Chinese, Russian and Arabic) in Uyghur texts. Our model is based on an RNN Encoder-Decoder framework (Cho et al., 2014). The encoder processes a variable-length input (a Uyghur sentence) and builds a fixed-length vector representation. Based on the encoded representation, the decoder generates a variable-length sequence (the labeled sequence). To optimize the output of the decoder, we also propose two important features: an inverse language model feature and a collocation feature. We conduct three groups of experiments; the experimental results show that our model outperforms other approaches.

This paper makes the following contributions to this area:

• We introduce a novel approach to loanwords identification in Uyghur. This approach increases the F1 score by 12% relative to the traditional approach on the task of loanwords detection.

• We conduct experiments to evaluate the performance of off-the-shelf loanwords detection tools trained on a news corpus when applied to loanwords detection, utilizing in-domain and out-of-domain data.

• To integrate crucial information for better loanword prediction, we combine two features into the loanwords identification model, so that more information is available when selecting the better loanword candidate.

The rest of this paper is organized as follows: Section 2 presents the background of loanwords in Uyghur; Section 3 explains the framework used in our model; Section 4 introduces our method in detail; Section 5 describes the experimental setup and the analysis of experimental results; Section 6 discusses related work. Conclusions and future work are presented in Section 7.

2 Background

Before we present our loanwords detection model, we provide a brief introduction to Uyghur and to loanwords identification in this section. This will help build relevant background knowledge.

2.1 Introduction of Loanwords

A loanword is a word adopted from one language (the donor language) and incorporated into a different, recipient language without translation. It can be distinguished from a calque, or loan translation, where a meaning or idiom from another language is translated into existing words or roots of the host language. When borrowed, words may undergo several changes to adapt to the recipient language:

• Changes in meaning. Words are occasionally imported with a different meaning than that in the donor language.

• Changes in spelling. Words taken into different recipient languages are sometimes spelled as in the donor language. Sometimes borrowed words retain the original (or near-original) pronunciation, but undergo a spelling change to represent the orthography of the recipient language.

• Changes in pronunciation. In cases where a new loanword has a very unusual sound, the pronunciation of the word is radically changed.

2.2 Loanwords in Uyghur

Uyghur is an official language of the Xinjiang Uyghur Autonomous Region, and is widely used in both social and official spheres, as well as in print, radio and television, and is mostly used as a lingua franca by other ethnic minorities in Xinjiang.

Uyghur belongs to the Turkic language family, which also includes languages such as the more distantly related Uzbek. In addition to the influence of other Turkic languages, Uyghur has historically been influenced strongly by Persian and Arabic, and more recently by Mandarin Chinese and Russian (Table 1).

Loanwords in Uyghur include not only named entities such as person and location names, but also some words in daily use.

Chinese loanwords in Uyghur [in English]    Russian loanwords in Uyghur [in English]
shinjang (新疆) [Xinjiang]                  tElEfon (telefon) [telephone]
laza (辣子) [hot pepper]                    uniwErsitEt (universitet) [university]
shuji (书记) [secretary]                    radiyo (radio) [radio]
koi (块) [Yuan]                             pochta (poqta) [post office]
lengpung (凉粉) [agar-agar jelly]           wElsipit (velosiped) [bicycle]
dufu (豆腐) [bean curd]                     oblast (oblast) [region]

Table 1: Examples of Chinese and Russian Loanwords in Uyghur.

2.3 Challenges in Loanwords Identification in Uyghur

Spelling Change When Borrowed From Donor Languages

To adapt to the pronunciation and grammar of Uyghur, the spelling of words (loanwords) may change when they are borrowed from donor languages. Changes of spelling have a great impact on the loanwords identification task.

Russian loanwords in Uyghur: "radiyo"1 - "radio" ("radio")
Chinese loanwords in Uyghur: "koi" - "块" ("kuai")

1 In this paper, we use the Uyghur Latin Alphabet.

Suffixes of Uyghur Words Affect the Loanwords Identification

A Uyghur word is composed of a stem and several suffixes, which can be formally described as:

Word = stem + suffix_0 + suffix_1 + ... + suffix_N    (1)

If we just use traditional approaches such as edit distance, in some cases these algorithms cannot give reliable results, for example when the length of the suffixes equals or even exceeds the length of the original word.

Data Sparsity Degrades the Performance of Loanwords Identification Model

Loanwords detection can be reformulated as a sequence labeling problem. Most sequence labeling tools (CRF-based, HMM-based, etc.) are built on large-scale labeled data; the lack of available labeled language resources degrades the performance of these "off-the-shelf" tools on loanwords identification in Uyghur.

3 Methodology

Recent developments in deep learning (representation learning) have had a strong impact on the area of NLP (natural language processing). In traditional approaches, the extraction of features often requires expensive human labor and relies on expert knowledge, and these features usually cannot be extended to other situations. The most exciting aspect of deep learning is that the features used in most traditional machine learning models can be learned automatically.

In this section, we first introduce the deep learning models used in this paper and then give the details of our model.

3.1 Recurrent Neural Network

RNNs (Recurrent Neural Networks) are artificial neural network models where connections between units form a directed cycle (Jaeger, 2002). This creates an internal state of the network which allows it to exhibit dynamic temporal behavior. RNNs can use their internal memory to process arbitrary sequences.

3.2 RNN Encoder-Decoder Framework

In this section, we give a brief introduction to the RNN Encoder-Decoder framework, which was proposed by (Cho et al., 2014a) and (Sutskever et al., 2014). We build a novel architecture that learns to identify loanwords in Uyghur texts based on this framework.

[O_0, O_1, O_2, ..., O_(l-1), O_l]
        ^ Decode
[Zv_0, Zv_1, Zv_2, ..., Zv_(l-1), Zv_l]
        ^ Encode
[U_0, U_1, U_2, ..., U_(l-1), U_l]

Figure 1: The Encoder-Decoder Framework Used in Loanword Identification Model.

U_0, U_1, U_2, ..., U_(l-1), U_l is a sequence of Uyghur words, O_0, O_1, O_2, ..., O_(l-1), O_l is a sequence of labels (loanword or not), and Zv_0, Zv_1, Zv_2, ..., Zv_(l-1), Zv_l is a sequence of vector representations of Uyghur words. The boldface "Encode" and "Decode" are the two processes of the encoder and decoder in our loanword identification model, respectively (Figure 1).
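The encoder-decoder flow described in Section 3.2 can be illustrated with a minimal Python sketch. It is not the implementation used in this paper: the weights are random and untrained, so the predicted labels carry no meaning, and the toy sizes, the function names (encode, decode, label_sentence), and the choice to feed the previous label back into the decoder are all assumptions made for illustration. The sketch only shows the data flow: a word sequence is encoded into a fixed-length vector, and the decoder emits one label (0 = not a loanword, 1 = loanword) per input word.

```python
import math
import random

# Toy sizes chosen for illustration only (assumptions, not from the paper).
HIDDEN, EMBED, LABELS = 4, 3, 2  # label 0 = not a loanword, 1 = loanword


def rand_matrix(rows, cols, rng):
    """Random, untrained weight matrix stored as a list of rows."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def rnn_step(terms):
    """Sum the pre-activation vectors elementwise, then apply tanh."""
    return [math.tanh(sum(parts)) for parts in zip(*terms)]


def encode(vectors, w_in, w_rec):
    """Read the word vectors one by one; the final state is the fixed-length code."""
    h = [0.0] * HIDDEN
    for x in vectors:
        h = rnn_step([matvec(w_in, x), matvec(w_rec, h)])
    return h


def decode(context, steps, w_ctx, w_prev, w_rec, w_out):
    """Emit one label per step, conditioned on the context and the previous label."""
    h = [0.0] * HIDDEN
    prev = [0.0] * LABELS  # no previous label at the first step
    labels = []
    for _ in range(steps):
        h = rnn_step([matvec(w_ctx, context), matvec(w_prev, prev), matvec(w_rec, h)])
        scores = matvec(w_out, h)
        best = max(range(LABELS), key=lambda k: scores[k])
        labels.append(best)
        prev = [1.0 if k == best else 0.0 for k in range(LABELS)]
    return labels


def label_sentence(words, seed=0):
    """Hypothetical driver: random embeddings and weights, then encode and decode."""
    rng = random.Random(seed)
    embed = {w: [rng.uniform(-1.0, 1.0) for _ in range(EMBED)]
             for w in sorted(set(words))}
    w_in = rand_matrix(HIDDEN, EMBED, rng)
    w_enc = rand_matrix(HIDDEN, HIDDEN, rng)
    w_ctx = rand_matrix(HIDDEN, HIDDEN, rng)
    w_prev = rand_matrix(HIDDEN, LABELS, rng)
    w_dec = rand_matrix(HIDDEN, HIDDEN, rng)
    w_out = rand_matrix(LABELS, HIDDEN, rng)
    context = encode([embed[w] for w in words], w_in, w_enc)
    labels = decode(context, len(words), w_ctx, w_prev, w_dec, w_out)
    return context, labels


# Tokens taken from Table 1; with untrained weights the labels are arbitrary.
context, labels = label_sentence(["shinjang", "uniwErsitEt", "radiyo"])
print(len(context), labels)
```

In a trained model the weight matrices would be learned from labeled Uyghur sentences; here they are only placeholders that make the shapes and the encode-then-decode order concrete.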
