Unsupervised Separation of Transliterable and Native Words for Malayalam

Deepak P
Queen’s University Belfast, UK
[email protected]

arXiv:1803.09641v1 [cs.CL] 26 Mar 2018

Abstract

Differentiating intrinsic language words from transliterable words is a key step aiding text processing tasks involving different natural languages. We consider the problem of unsupervised separation of transliterable words from native words for text in the Malayalam language. Outlining a key observation on the diversity of characters beyond the word stem, we develop an optimization method to score words based on their nativeness. Our method relies on the usage of probability distributions over character n-grams that are refined in step with the nativeness scorings in an iterative optimization formulation. Using an empirical evaluation, we illustrate that our method, DTIM, provides significant improvements in nativeness scoring for Malayalam, establishing DTIM as the preferred method for the task.

1 Introduction

Malayalam is an agglutinative language from the southern Indian state of Kerala where it is the official state language. It is spoken by 38 million native speakers, three times as many speakers as Hungarian (Vincze et al., 2013) or Greek (Ntoulas et al., 2001), for which specialized techniques have been developed in other contexts. The growing web presence of Malayalam necessitates automatic techniques to process Malayalam text. A major hurdle in harnessing Malayalam text from social and web media for multi-lingual retrieval and machine translation is the presence of a large number of transliterable words. By transliterable words, we mean both (a) words (from English) like police and train that virtually always appear in transliterated form in contemporary Malayalam, and (b) proper nouns such as names that need to be transliterated rather than translated to correlate with English text. On a manual analysis of a news article dataset, we found that transliterated words and proper nouns each form 10-12% of all distinct words. It is useful to transliterate such words for scenarios that involve processing Malayalam text in the company of English text; this will avoid them being treated as separate index terms (with respect to their transliteration) in a multi-lingual retrieval engine, and help a statistical translation system to make use of the link to improve effectiveness. In this context, it is notable that there has been recent interest in devising specialized methods to translate words that fall outside the core vocabulary (Tsvetkov and Dyer, 2015).

In this paper, we consider the problem of separating out such transliterable words from the other words within an unlabeled dataset; we refer to the latter as “native” words. We propose an unsupervised method, DTIM, that takes a dictionary of distinct words from a Malayalam corpus and scores each word based on its nativeness. Our optimization method, DTIM, iteratively refines the nativeness scoring of each word, leveraging dictionary-level statistics modelled using character n-gram probability distributions. Our empirical analysis establishes the effectiveness of DTIM. We outline related work in the area in Section 2. This is followed by the problem statement in Section 3 and the description of our proposed approach in Section 4. Our empirical analysis forms Section 5 followed by conclusions in Section 7.

2 Related Work

Identification of transliterable text fragments, being a critical task for cross-lingual text analysis, has attracted attention since the 1990s. While most methods addressing the problem have used supervised learning, there have been some methods that can work without labeled data. We briefly survey both classes of methods.
2.1 Supervised and ‘pseudo-supervised’ Methods

An early work (Chen and Lee, 1996) focuses on a sub-problem, that of supervised identification of proper nouns for Chinese. Jeong et al. (1999) consider leveraging decision trees to address the related problem of learning transliteration and back-transliteration rules for English/Korean word pairs. Recognizing the costs of procuring training data, Baker and Brew (2008) and Goldberg and Elhadad (2008) explore usage of pseudo-transliterable words generated using transliteration rules on an English dictionary for Korean and Hebrew respectively. Such pseudo-supervision, however, would not be able to generate uncommon domain-specific terms such as medical/scientific terminology for usage in such domains (unless specifically tuned), and is hence limited in utility.

2.2 Unsupervised Methods

A recent work proposes that multi-word phrases in Malayalam text whose component words exhibit strong co-occurrence be categorized as transliterable phrases (Prasad et al., 2014). Their intuition stems from observing contiguous words such as test dose which often occur in transliterated form while occurring together, but get replaced by native words in other contexts. Their method is however unable to identify single transliterable words, or phrases involving words such as train and police whose transliterations are heavily used in the company of native Malayalam words. A recent method for Korean (Koo, 2015) starts by identifying a seed set of transliterable words as those that begin or end with consonant clusters and have vowel insertions; this is specific to Korean since Korean words apparently do not begin or end with consonant clusters. High-frequency words are then used as seed words for native Korean for usage in a Naive Bayes classifier. In addition to the outlined reasons that make both the unsupervised methods inapplicable for our task, they both presume availability of corpus frequency statistics. We focus on a general scenario assuming the availability of only a word lexicon.

2.3 Positioning the Transliterable Word Identification Task

Nativeness scoring of words may be seen as a vocabulary stratification step (upon usage of thresholds) for usage by downstream applications. A multi-lingual text mining application that uses Malayalam and English text would benefit by transliterating non-native Malayalam words to English, so that the transliterable Malayalam token and its transliteration are treated as the same token. For machine translation, transliterable words may be channeled to specialized translation methods (e.g., Tsvetkov and Dyer, 2015) or for manual screening and translation.

3 Problem Definition

We now define the problem more formally. Consider n distinct words obtained from Malayalam text, W = {..., w, ...}. Our task is to devise a technique that can use W to arrive at a nativeness score for each word, w, within it, as w_n. We would like w_n to be an accurate quantification of the nativeness of word w. For example, when words in W are ordered in the decreasing order of w_n scores, we expect to get the native words at the beginning of the ordering and the transliterable words at the end. We do not presume availability of any data other than W; this makes our method applicable across scenarios where corpus statistics are unavailable due to privacy or other reasons.

3.1 Evaluation

Given that it is easier for humans to crisply classify each word as either native or transliterable (proper nouns or transliterated English words) in lieu of attaching a score to each word, the nativeness scoring (as generated by a scoring method such as ours) often needs to be evaluated against a crisp nativeness assessment, i.e., a scoring with scores in {0, 1}. To aid this, we consider the ordering of words in the labeled set in the decreasing (or more precisely, non-increasing) order of nativeness scores (each method produces an ordering for the dataset). To evaluate this ordering, we use two sets of metrics:

• Precision at the ends of the ordering: Top-k precision denotes the fraction of native words within the k words at the top of the ordering; analogously, Bottom-k precision is the fraction of transliterable words among the bottom k. Since a good scoring would likely put native words at the top of the ordering and the transliterable ones at the bottom, a good scoring method would intuitively score high on both these metrics. We call the average of the top-k and bottom-k precision for a given k the Avg-k precision. These measures, evaluated at varying values of k, indicate the quality of the nativeness scoring.

• Clustering Quality: Consider the cardinalities of the native and transliterable words from the labeled set as being N and T respectively. We now take the top-N words and bottom-T words from the ordering generated by each method, and compare against the respective labeled sets as in the case of standard clustering quality evaluation¹. Since the cardinalities of the generated native (transliterable) cluster and the native (transliterable) labeled set are both N (T), the Recall of the cluster is identical to its Purity/Precision, [...]

[...] more strongly and weaken any initial preference to transliterable words. The vice versa holds for the transliterable word models. We will first outline the initialization step followed by the description of the method.

4.1 Diversity-based Initialization

Our initialization is inspired by an observation on the variety of suffixes attached to a word stem. Consider a word stem |pu|ra|², a stem commonly leading to native Malayalam words; its suffixes are observed to start with a variety of characters such as |ttha| (e.g., |pu|ra|ttha|kki|), |me| (e.g., |pu|ra|me|), |mbo| (e.g., |pu|ra|mbo|kku|) and |ppa| (e.g., |pu|ra|ppa|du|). On the other hand, stems that mostly lead to transliterable words often do not exhibit as much diversity. For example, |re|so| is followed only by |rt| (i.e., resort) and |po|li| is usually only followed by |s| (i.e., police). Some stems such as |o|ppa| lead to transliterations of two English words such as open and operation.
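To make the diversity observation of Section 4.1 concrete, the following minimal Python sketch counts, for each stem, how many distinct characters immediately follow it in a lexicon. It is only an illustration of the observation, not the initialization formula used by DTIM. The function name stem_diversity, the fixed stem length of two characters, and the representation of words as tuples of romanized character segments are all assumptions made for this example.

```python
from collections import defaultdict


def stem_diversity(words, stem_len=2):
    """Map each stem (the first `stem_len` characters of a word) to the
    number of distinct characters that immediately follow it.

    Hypothetical helper: the stem length and the word representation
    are assumptions of this sketch, not taken from the paper.
    """
    followers = defaultdict(set)
    for word in words:
        # A word contributes only if it has a character after the stem.
        if len(word) > stem_len:
            followers[tuple(word[:stem_len])].add(word[stem_len])
    return {stem: len(chars) for stem, chars in followers.items()}


# Toy lexicon built from the romanized examples in the text.
lexicon = [
    ("pu", "ra", "ttha", "kki"),
    ("pu", "ra", "me"),
    ("pu", "ra", "mbo", "kku"),
    ("pu", "ra", "ppa", "du"),
    ("re", "so", "rt"),
    ("po", "li", "s"),
]

diversity = stem_diversity(lexicon)
print(diversity[("pu", "ra")])  # 4 distinct following characters
print(diversity[("re", "so")])  # 1 distinct following character
```

On this toy lexicon the native-leaning stem |pu|ra| has four distinct following characters, while |re|so| and |po|li| each have one, which matches the contrast described above.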
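The precision measures defined in Section 3.1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function name precision_at_ends, the dictionary-based inputs, and the toy words and scores are hypothetical and are not data from the paper.

```python
def precision_at_ends(scores, is_native, k):
    """Return (top-k, bottom-k, avg-k) precision for a nativeness scoring.

    scores:    dict mapping word -> nativeness score (higher = more native)
    is_native: dict mapping word -> True for native, False for transliterable
    Both inputs are assumptions of this sketch.
    """
    if not 1 <= k <= len(scores):
        raise ValueError("k must be between 1 and the number of words")
    # Order words by non-increasing nativeness score.
    ordering = sorted(scores, key=scores.get, reverse=True)
    top, bottom = ordering[:k], ordering[-k:]
    top_p = sum(is_native[w] for w in top) / k
    bottom_p = sum(not is_native[w] for w in bottom) / k
    return top_p, bottom_p, (top_p + bottom_p) / 2


# Hypothetical toy scoring over six labeled words.
scores = {"w1": 0.9, "w2": 0.8, "w3": 0.7, "w4": 0.4, "w5": 0.2, "w6": 0.1}
labels = {"w1": True, "w2": True, "w3": False,
          "w4": True, "w5": False, "w6": False}

print(precision_at_ends(scores, labels, 2))  # (1.0, 1.0, 1.0)
```

With k = 3 the same scoring places one transliterable word in the top three and one native word in the bottom three, so both precisions drop to 2/3, which shows why the measures are reported at varying values of k.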
