Unsupervised Separation of Transliterable and Native Words for Malayalam

Padmanabhan, D. (2017). Unsupervised Separation of Transliterable and Native Words for Malayalam. In Proceedings of the 14th International Conference on Natural Language Processing (ICON 2017) (pp. 155-164). https://ltrc.iiit.ac.in/icon2017/proceedings/icon2017/

Published in: Proceedings of the 14th International Conference on Natural Language Processing (ICON 2017)

Document Version: Peer reviewed version


Publisher rights © 2016 NLP Association of India (NLPAI) This work is made available online in accordance with the publisher’s policies. Please refer to any applicable terms of use of the publisher.



Unsupervised Separation of Transliterable and Native Words for Malayalam

Deepak P
Queen's University Belfast, UK
[email protected]

Abstract

Differentiating intrinsic language words from transliterable words is a key step aiding text processing tasks involving different natural languages. We consider the problem of unsupervised separation of transliterable words from native words for text in the Malayalam language. Outlining a key observation on the diversity of characters beyond the word stem, we develop an optimization method to score words based on their nativeness. Our method relies on the usage of probability distributions over character n-grams that are refined in step with the nativeness scorings in an iterative optimization formulation. Using an empirical evaluation, we illustrate that our method, DTIM, provides significant improvements in nativeness scoring for Malayalam, establishing DTIM as the preferred method for the task.

1 Introduction

Malayalam is an agglutinative language from the southern Indian state of Kerala, where it is the official state language. It is spoken by 38 million native speakers, three times as many speakers as Hungarian (Vincze et al., 2013) or Greek (Ntoulas et al., 2001), languages for which specialized techniques have been developed in other contexts. The growing web presence of Malayalam necessitates automatic techniques to process Malayalam text. A major hurdle in harnessing Malayalam text from social and web media for multi-lingual retrieval and machine translation is the presence of a large number of transliterable words. By transliterable words, we mean both (a) words (from English) like police and train that virtually always appear in transliterated form in contemporary Malayalam, and (b) proper nouns such as names that need to be transliterated rather than translated to correlate with English text. In a manual analysis of a news article dataset, we found that transliterated words and proper nouns each form 10-12% of all distinct words. It is useful to transliterate such words for scenarios that involve processing Malayalam text in the company of English text; this will avoid them being treated as separate index terms (wrt their transliteration) in a multi-lingual retrieval engine, and help a statistical translation system make use of the link to improve effectiveness. In this context, it is notable that there has been recent interest in devising specialized methods to translate words that fall outside the core vocabulary (Tsvetkov and Dyer, 2015).

In this paper, we consider the problem of separating out such transliterable words from the other words within an unlabeled dataset; we refer to the latter as "native" words. We propose an unsupervised method, DTIM, that takes a dictionary of distinct words from a Malayalam corpus and scores each word based on its nativeness. Our optimization method, DTIM, iteratively refines the nativeness scoring of each word, leveraging dictionary-level statistics modelled using character n-gram probability distributions. Our empirical analysis establishes the effectiveness of DTIM.

We outline related work in the area in Section 2. This is followed by the problem statement in Section 3 and the description of our proposed approach in Section 4. Our empirical analysis forms Section 5, followed by conclusions in Section 7.

2 Related Work

Identification of transliterable text fragments, being a critical task for cross-lingual text analysis, has attracted attention since the 1990s. While most methods addressing the problem have used supervised learning, there have been some methods that can work without labeled data. We briefly survey both classes of methods.

2.1 Supervised and 'pseudo-supervised' Methods

An early work (Chen and Lee, 1996) focuses on a sub-problem, that of supervised identification of proper nouns for Chinese. (Jeong et al., 1999) consider leveraging decision trees to address the related problem of learning transliteration and back-transliteration rules for English/Korean word pairs. Recognizing the costs of procuring training data, (Baker and Brew, 2008) and (Goldberg and Elhadad, 2008) explore the usage of pseudo-transliterable words generated using transliteration rules on an English dictionary, for Korean and Hebrew respectively. Such pseudo-supervision, however, would not be able to generate uncommon domain-specific terms such as medical/scientific terminology for usage in such domains (unless specifically tuned), and is hence limited in utility.

2.2 Unsupervised Methods

A recent work proposes that multi-word phrases in Malayalam text whose component words exhibit strong co-occurrence be categorized as transliterable phrases (Prasad et al., 2014). Their intuition stems from observing contiguous words such as test dose which often occur in transliterated form while occurring together, but get replaced by native words in other contexts. Their method is however unable to identify single transliterable words, or phrases involving words such as train and police whose transliterations are heavily used in the company of native Malayalam words. A recent method for Korean (Koo, 2015) starts by identifying a seed set of transliterable words as those that begin or end with consonant clusters and have vowel insertions; this is specific to Korean since Korean words apparently do not begin or end with consonant clusters. High-frequency words are then used as seed words for native Korean for usage in a Naive Bayes classifier. In addition to the outlined reasons that make both the unsupervised methods inapplicable for our task, they both presume the availability of corpus frequency statistics. We focus on a general scenario assuming the availability of only a word lexicon.

2.3 Positioning the Transliterable Word Identification Task

Nativeness scoring of words may be seen as a vocabulary stratification step (upon usage of thresholds) for usage by downstream applications. A multi-lingual text mining application that uses Malayalam and English text would benefit by transliterating non-native Malayalam words to English, so that the transliterable Malayalam token and its transliteration are treated as the same token. For machine translation, transliterable words may be channeled to specialized translation methods (e.g., (Tsvetkov and Dyer, 2015)) or for manual screening and translation.

3 Problem Definition

We now define the problem more formally. Consider n distinct words obtained from Malayalam text, W = {…, w, …}. Our task is to devise a technique that can use W to arrive at a nativeness score for each word w within it, as w_n. We would like w_n to be an accurate quantification of the nativeness of word w. For example, when the words in W are ordered in decreasing order of w_n scores, we expect to get the native words at the beginning of the ordering and vice versa. We do not presume the availability of any data other than W; this makes our method applicable across scenarios where corpus statistics are unavailable due to privacy or other reasons.

3.1 Evaluation

Given that it is easier for humans to crisply classify each word as either native or transliterable (nouns or transliterated English words) than to attach a score to each word, the nativeness scoring (as generated by a scoring method such as ours) often needs to be evaluated against a crisp nativeness assessment, i.e., a scoring with scores in {0, 1}. To aid this, we consider the ordering of words in the labeled set in the decreasing (or, more precisely, non-increasing) order of nativeness scores (each method produces an ordering for the dataset). To evaluate this ordering, we use two sets of metrics:

• Precision at the ends of the ordering: Top-k precision denotes the fraction of native words within the k words at the top of the ordering; analogously, Bottom-k precision is the fraction of transliterable words among the bottom k. Since a good scoring would likely put native words at the top of the ordering and the transliterable ones at the bottom, a good scoring method would intuitively score high on both these metrics. We call the average of the top-k and bottom-k precision for a given k the Avg-k precision. These measures, evaluated at varying values of k, indicate the quality of the nativeness scoring.

• Clustering Quality: Consider the cardinalities of the native and transliterable words from the labeled set as being N and T respectively. We now take the top-N words and bottom-T words from the ordering generated by each method, and compare against the respective labeled sets as in standard clustering quality evaluation[1]. Since the cardinalities of the generated native (transliterable) cluster and the native (transliterable) labeled set are both N (T), the Recall of the cluster is identical to its Purity/Precision, and thus the F-measure too; we simply call it Clustering Quality. A cardinality-weighted average of the clustering quality across the native and transliterable clusters yields a single value for the clustering quality across the dataset. It may be noted that we are not making the labeled dataset available to the method generating the ordering; we merely use its cardinalities for evaluation purposes.

[1] https://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html

4 Our Method: DTIM

We now introduce our method, Diversity-based Transliterable Word Identification for Malayalam (DTIM). We use probability distributions over character n-grams to separately model transliterable and native words, and develop an optimization framework that alternately refines the n-gram distributions and the nativeness scoring within each iteration. DTIM involves an initialization that induces a "coarse" separation between native and transliterable words, followed by iterative refinement. The initialization is critical in optimization methods that are vulnerable to local optima; the pure word distribution needs to be initialized to "coarsely" prefer pure words over transliterable words. This will enable further iterations to exploit the initial preference direction to further refine the model to "attract" the pure words more strongly and weaken any initial preference to transliterable words. The vice versa holds for the transliterable word models. We first outline the initialization step, followed by the description of the method.

4.1 Diversity-based Initialization

Our initialization is inspired by an observation on the variety of suffixes attached to a word stem. Consider the word stem |pu|ra|[2], a stem commonly leading to native Malayalam words; its suffixes are observed to start with a variety of characters such as |ttha| (e.g., |pu|ra|ttha|kki|), |me| (e.g., |pu|ra|me|), |mbo| (e.g., |pu|ra|mbo|kku|) and |ppa| (e.g., |pu|ra|ppa|du|). On the other hand, stems that mostly lead to transliterable words often do not exhibit as much diversity. For example, |re|so| is followed only by |rt| (i.e., resort) and |po|li| is usually only followed by |s| (i.e., police). Some stems such as |o|ppa| lead to transliterations of two English words such as open and operation. Our observation, upon which we model the initialization, is that the variety of suffixes is generally correlated with the nativeness (i.e., propensity to lead to a native word) of word stems. This is intuitive since non-native word stems provide limited flexibility to being modified by derivational or inflectional suffixes as compared to native ones.

For simplicity, we use the first two characters of each word as the word stem; we will evaluate the robustness of DTIM to varying stem lengths in our empirical evaluation, while consistently using a stem length of two characters in our description. We start by associating each distinct word stem in W with the number of unique third characters that follow it (among words in W); in our examples, |pu|ra| and |o|ppa| would be associated with 4 and 2 respectively. We initialize the nativeness weights as proportional to the diversity of third characters beyond the stem:

    w_n0 = min( 0.99, |u3(w_stem, W)| / τ )    (1)

where u3(w_stem, W) denotes the set of third characters that follow the stem of word w among words in W. We flatten off w_n0 scores beyond a diversity of τ (note that a diversity of τ or higher will lead to the second term becoming 1.0 or higher, kicking in the min function to choose 0.99 for w_n0) as shown in the above equation. We give a small transliterableness weight even to highly diverse stems to reduce over-reliance on the initialization. We set τ = 10 based on our observation from the dataset that most word stems having more than 10 distinct third characters were seen to be native. As in the case of the word stem length, we study DTIM trends across varying τ in our empirical analysis. w_n0 is in [0, 1]; analogously, (1 − w_n0) may be regarded as a score of transliterableness.

[2] We will represent Malayalam words in transliterated form for reading by those who might not be able to read Malayalam. A pipe separates Malayalam characters; for example, |pu| corresponds to a single Malayalam character.

4.2 Objective Function and Optimization Framework

As outlined earlier, we use separate character n-gram probability distributions to model native and transliterable words. We would like these probability distributions to support the nativeness scoring, and vice versa. While the size of the n-grams (i.e., whether n = 1, 2, 3 or 4) is a system-level parameter, we use n = 1 for simplicity in our description. We denote the native and transliterable distributions as N and T respectively, with N(c) and T(c) denoting the weight associated with the character c according to each distribution. Consider the following function, given a particular state for N, T and the w_n's:

    ∏_{w ∈ W} ∏_{c ∈ w} ( w_n^2 × N(c) + (1 − w_n)^2 × T(c) )    (2)

This measures the aggregate support for the words in W, the support for each word measured as an interpolated support from across the distributions N and T, with weighting factors being the squares of the nativeness scores (i.e., the w_n's) and the transliterableness scores (i.e., the (1 − w_n)'s) respectively. Similar mixing models have been used earlier in emotion lexicon learning (Bandhakavi et al., 2014) and solution post discovery (Deepak and Visweswariah, 2014). The squares of the nativeness scores are used in our model (instead of the raw scores) for optimization convenience. A highly native word should intuitively have a high w_n (nativeness) and a high support from N, and correspondingly low transliterableness (i.e., (1 − w_n)) and support from T; a highly transliterable word would be expected to have exactly the opposite. Due to the design of Eq. 2 in having the higher terms multiplied with each other (and likewise for the lower terms), this function would be maximized for a desirable estimate of the variables θ = {N, T, {…, w_n, …}}. Conversely, by striving to maximize the objective function, we would arrive at a desirable estimate of the variables. An alternative construction yielding a minimizing objective would be as follows:

    ∏_{w ∈ W} ∏_{c ∈ w} ( (1 − w_n)^2 × N(c) + w_n^2 × T(c) )    (3)

In this form, given a good estimate of the variables, the native (transliterable) words have their nativeness (transliterableness) weights multiplied with the support from the transliterable (native) models. In other words, maximizing the objective in Eq. 2 is semantically similar to minimizing the objective in Eq. 3. As we will illustrate soon, it is easier to optimize for N and T using the maximizing formulation in Eq. 2, while the minimizing objective in Eq. 3 yields better to optimizing the word nativeness scores, {…, w_n, …}.

4.3 Learning N and T using the Maximizing Objective

We start by taking the log-form of the objective in Eq. 2 (this does not affect the optimization direction), yielding:

    O_max = Σ_{w ∈ W} Σ_{c ∈ w} ln( w_n^2 × N(c) + (1 − w_n)^2 × T(c) )    (4)

The distributions, being probability distributions over n-grams, should sum to one. This constraint, for our unigram models, can be written as:

    Σ_c N(c) = Σ_c T(c) = 1    (5)

Fixing the values of {…, w_n, …} and T (or N), we can now identify a better estimate for N (or T) by looking for an optimum (i.e., where the objective function has a slope of zero). Towards that, we take the partial derivative (or slope) of the objective with respect to a particular character c′:

    ∂O_max / ∂N(c′) = Σ_{w ∈ W} [ freq(c′, w) × w_n^2 / ( w_n^2 N(c′) + (1 − w_n)^2 T(c′) ) ] + λ_N    (6)

where freq(c′, w) is the frequency of the character c′ in w, and λ_N denotes the Lagrangian multiplier corresponding to the sum-to-unity constraint for N. Equating this to zero does not however yield a closed form solution for N(c′), but a simple re-arrangement yields an iterative update formula:

    N(c′) ∝ Σ_{w ∈ W} [ freq(c′, w) × w_n^2 / ( w_n^2 + (1 − w_n)^2 × T(c′) / N_P(c′) ) ]    (7)

The N_P term in the RHS indicates the usage of the previous estimate of N. The sum-to-one constraint is trivially achieved by first estimating the N(c′)'s by treating Eq. 7 as an equality, followed by normalizing the scores across the character vocabulary. Eq. 7 is intuitively reasonable, due to establishing a somewhat direct relationship between N(c′) and w_n (in the numerator), thus allowing highly native words to contribute more to building N. The analogous update formula for T, fixing N, turns out to be:

    T(c′) ∝ Σ_{w ∈ W} [ freq(c′, w) × (1 − w_n)^2 / ( (1 − w_n)^2 + w_n^2 × N(c′) / T_P(c′) ) ]    (8)

Eq. 7 and Eq. 8 would lead us closer to a maximum of Eq. 4 if their second (partial) derivatives are negative[3]. To verify this, we note that the second (partial) derivative with respect to N(c′) is as follows:

    ∂²O_max / ∂N(c′)² = (−1) × Σ_{w ∈ W} [ freq(c′, w) × (w_n^2)^2 / ( w_n^2 N(c′) + (1 − w_n)^2 T(c′) )^2 ]    (9)

It is easy to observe that the RHS is a product of −1 and a sum of a plurality of positive terms (square terms that are trivially positive, the exception being the freq(., .) term, which is also non-negative by definition), altogether yielding a negative value. That the second (partial) derivative is negative confirms that the update formula derived from the first partial derivative indeed helps in maximizing O_max with respect to N(c′). A similar argument holds for the T(c′) updates as well, which we omit for brevity.

[3] http://mathworld.wolfram.com/SecondDerivativeTest.html

4.4 Learning the nativeness scores, {…, w_n, …}, using the Minimizing Objective

Analogous to the previous section, we take the log-form of Eq. 3:

    O_min = Σ_{w ∈ W} Σ_{c ∈ w} ln( (1 − w_n)^2 × N(c) + w_n^2 × T(c) )    (10)

Unlike the earlier case, we do not have any constraints, since the sum-to-unity constraint on the nativeness and transliterableness scores is built into the construction. We now fix the values of all other variables and find the slope with respect to w′_n, where w′ indicates a particular word in W:

    ∂O_min / ∂w′_n = Σ_{c ∈ w′} [ ( 2 w′_n T(c) + 2 w′_n N(c) − 2 N(c) ) / ( w′_n^2 T(c) + (1 − w′_n)^2 N(c) ) ]    (11)

We equate the slope to zero and form an iterative update formula, much like in the distribution estimation phase:

    w′_n = [ Σ_{c ∈ w′} N(c) / ( w′_n^2 T(c) + (1 − w′_n)^2 N(c) ) ] / [ Σ_{c ∈ w′} ( N(c) + T(c) ) / ( w′_n^2 T(c) + (1 − w′_n)^2 N(c) ) ]    (12)

Using the previous estimates of w′_n for the RHS yields an iterative update form for the nativeness scores. As in the model estimation phase, the update rule establishes a reasonably direct relationship between w′_n and N. Since our objective is to minimize O_min, we would like to verify the direction of optimization using the second partial derivative:

    ∂²O_min / ∂w′_n² = Σ_{c ∈ w′} [ ( N(c) T(c) − ( w′_n T(c) − (1 − w′_n) N(c) )^2 ) / ( w′_n^2 T(c) + (1 − w′_n)^2 N(c) )^2 ]    (13)

We provide an informal argument for the positivity of the second derivative; note that the denominator is a square term, making it enough to analyze just the numerator. Consider a highly native word (high w′_n) whose characters would intuitively satisfy N(c) > T(c). For the boundary case of w′_n = 1, the numerator reduces to T(c) × (N(c) − T(c)), which would be positive given the expected relation between N(c) and T(c). A similar argument holds for highly transliterable words. For words with w′_n → 0.5, where we would expect N(c) ≈ T(c), the numerator becomes N(c)T(c) − 0.25(T(c) − N(c))^2, which is expected to be positive since the difference term is small, making its square very small in comparison to the first product term. To outline the informal nature of the argument, it may be noted that T(c) > N(c) may hold for certain characters within highly native words; but as long as most of the characters within highly native words satisfy N(c) > T(c), there would be sufficient positivity to offset the negative terms induced by such outlier characters.

4.5 DTIM: The Method

Having outlined the learning steps, the method is simply an iterative usage of the learning steps, as outlined in Algorithm 1. In the first invocation of the distribution learning step, where previous estimates are not available, we simply assume a uniform distribution across the n-gram vocabulary for usage as the previous estimates. Each of the update steps is linear in the size of the dictionary, making DTIM a computationally lightweight method. Choosing n = 2 instead of unigrams (as used in our narrative) is easy, since that simply involves replacing the c ∈ w all across the update steps by [c_i, c_{i+1}] ∈ w, with [c_i, c_{i+1}] denoting pairs of contiguous characters within the word; similarly, n = 3 involves the usage of contiguous character triplets and correspondingly learning the distributions N and T over triplets. The DTIM structure is evidently inspired by the Expectation-Maximization framework (Dempster et al., 1977) involving alternating optimizations of an objective function; DTIM, however, uses different objective functions for the two steps, for optimization convenience.

Algorithm 1: DTIM
    Input: A set of Malayalam words, W
    Output: A nativeness scoring w_n ∈ [0, 1] for every word w in W
    Hyper-parameters: word stem length, τ, n
    Initialize the w_n scores for each word using the diversity metric in Eq. 1 with the chosen stem length and τ
    while not converged and number of iterations not reached do
        Estimate the n-gram distributions N and T using Eq. 7 and Eq. 8 respectively
        Learn the nativeness weights for each word using Eq. 12
    end
    return latest estimates of the nativeness weights

5 Experiments

We now describe our empirical study of DTIM, starting with the dataset and experimental setup, leading on to the results and analyses.

5.1 Dataset

We evaluate DTIM on a set of 65068 distinct words from across news articles sourced from Mathrubhumi[4], a popular Malayalam newspaper; this word list is made available publicly[5]. For evaluation purposes, we got a random subset of 1035 words labeled by one of three human annotators; that has been made available publicly[6] too, each word labeled as either native, transliterable or unknown. There were approximately 3 native words for every transliterable word in the labeled set, reflective of the distribution in contemporary Malayalam usage as alluded to in the introduction. We use the whole set of 65068 words as input to the method, while the evaluation is obviously limited to the labelled subset of 1035 words.

[4] http://www.mathrubhumi.com
[5] Dataset: https://goo.gl/DOsFES
[6] Labeled Set: https://goo.gl/XEVLWv

5.2 Baselines

As outlined in Section 2, the unsupervised version of the problem of telling apart native and transliterable words for Malayalam and/or similar languages has not been addressed in the literature, to the best of our knowledge. The unsupervised Malayalam-focused method (Prasad et al., 2014) (Ref: Sec 2.2) is able to identify only transliterable word-pairs, making it inapplicable for contexts such as our health data scenario, where individual English words are often transliterated for want of a suitable Malayalam alternative. The Korean method (Koo, 2015) is too specific to the Korean language and cannot be used for other languages, due to the absence of a generic high-precision rule to identify a seed set of transliterable words. With both the unsupervised state-of-the-art approaches being inapplicable for our task, we compare against an intuitive generalization-based baseline, called GEN, that orders words based on their support from the combination of a unigram and bigram character language model learnt over W; this leads to a scoring as follows:

    w_n^GEN = ∏_{[c_i, c_{i+1}] ∈ w} ( λ × B_W(c_{i+1} | c_i) + (1 − λ) × U_W(c_{i+1}) )    (14)

where B_W and U_W are bigram and unigram character-level language models built over all words in W. We set λ = 0.8 (Smucker and Allan, 2006). We experimented with higher-order models in GEN, but observed drops in evaluation measures, leading us to stick to the usage of the unigram and bigram models. The form of Eq. 14 is inspired by an assumption, similar to that used in both (Prasad et al., 2014) and (Koo, 2015), that transliterable words are rare; thus, we expect they would not be adequately supported by models that generalize over the whole of W. We also compare against our diversity-based initialization score from Section 4.1, which we call INIT. For ease of reference, we outline the INIT scoring:

    w_n^INIT = min( 0.99, |u3(w_stem, W)| / τ )    (15)

The comparison against INIT enables us to isolate and highlight the value of the iterative update formulation vis-a-vis the initialization.

5.3 Evaluation Measures and Setup

As outlined in Section 3, we use top-k, bottom-k and avg-k precision (evaluated at varying values of k) as well as clustering quality in our evaluation. For the comparative evaluation, we set the DTIM parameters as follows: τ = 10 and a word-stem length of 2. We study trends against variations across these parameters in a separate section.

5.4 Experimental Results

5.4.1 Precision at the ends of the Ordering

Table 1 lists the precision measures over various values of k. It may be noted that any instantiation of DTIM (across the four values of the n-gram size, n) is able to beat the baselines on each metric at each value of k, convincingly establishing the effectiveness of the DTIM formulation. DTIM is seen to be much more effective in separating out the native and transliterable words at either end of the ordering than the baselines. It is also notable that the EM-style iterations are able to significantly improve upon the initialization (i.e., INIT). That the bottom-k precision is seen to be consistently lower than the top-k precision needs to be juxtaposed against the observation that there were only 25% transliterable words against 75% native words; thus, the lift in precision against a random ordering is much more substantial for the transliterable words. The trends across varying n-gram sizes (i.e., n) in DTIM are worth noting too; the higher values of n (such as 3 and 4) are seen to make more errors at the ends of the lists, whereas they catch up with the n ∈ {1, 2} versions as k increases. This indicates that smaller-n DTIM is able to tell apart a minority of the words exceedingly well (wrt nativeness), whereas the higher n-gram modelling is able to spread out the gains across a larger spectrum of words in W. Around n = 4 and beyond, sparsity effects (since 4-grams and 5-grams do not occur frequently, making it harder to exploit their occurrence statistics) are seen to kick in, causing reductions in precision.

5.4.2 Clustering Quality

Table 2 lists the clustering quality metric across the various methods. Clustering quality, unlike the precision metrics, is designed to evaluate the entire ordering, without limiting the analysis to just the top-k and bottom-k entries. As in the earlier case, DTIM convincingly outperforms the baselines by healthy margins across all values of n. Consequent to the trends across n as observed earlier, n ∈ {3, 4} are seen to deliver better accuracies, with such gains tapering off beyond n = 4 due to sparsity effects. The words, along with the DTIM nativeness scores for n = 3, can be viewed at https://goo.gl/OmhlB3 (the list excludes words with fewer than 3 characters).

5.5 Analyzing DTIM

We now analyze the performance of DTIM across varying values of the diversity threshold (τ) and word-stem lengths.
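For concreteness, Algorithm 1 can be rendered as a short program. The following is a simplified unigram (n = 1) sketch of the diversity-based initialization (Eq. 1) and the alternating updates (Eqs. 7, 8 and 12); all function and variable names are illustrative, and this is not the authors' released implementation.

```python
from collections import Counter, defaultdict

def dtim(words, tau=10, stem_len=2, iters=5):
    """Sketch of unigram DTIM: diversity-based initialization followed by
    alternating re-estimation of the character models and the scores."""
    words = [w for w in words if len(w) > stem_len]
    # Diversity-based initialization (Eq. 1): nativeness proportional to
    # the number of distinct characters observed right after the stem.
    next_chars = defaultdict(set)
    for w in words:
        next_chars[w[:stem_len]].add(w[stem_len])
    wn = {w: min(0.99, len(next_chars[w[:stem_len]]) / tau) for w in words}

    vocab = {c for w in words for c in w}
    N = {c: 1.0 / len(vocab) for c in vocab}  # native character model
    T = dict(N)                               # transliterable character model

    for _ in range(iters):
        # Model re-estimation (Eqs. 7 and 8), using the previous N and T,
        # followed by normalization to enforce the sum-to-one constraint.
        nN, nT = Counter(), Counter()
        for w in words:
            s, t = wn[w] ** 2, (1.0 - wn[w]) ** 2
            for c, f in Counter(w).items():
                nN[c] += f * s / (s + t * T[c] / N[c])
                nT[c] += f * t / (t + s * N[c] / T[c])
        zN, zT = sum(nN.values()), sum(nT.values())
        N = {c: nN[c] / zN for c in vocab}
        T = {c: nT[c] / zT for c in vocab}
        # Nativeness re-estimation (Eq. 12), using the previous scores
        # on the right-hand side.
        new_wn = {}
        for w in words:
            s, t = wn[w] ** 2, (1.0 - wn[w]) ** 2
            num = sum(N[c] / (s * T[c] + t * N[c]) for c in w)
            den = sum((N[c] + T[c]) / (s * T[c] + t * N[c]) for c in w)
            new_wn[w] = num / den
        wn = new_wn
    return wn
```

On a toy lexicon where one stem is followed by many distinct characters and another by just one, the iterations widen the gap that the initialization induces, mirroring the behaviour reported above.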
                     k=50                 k=100                k=150                k=200
              Top-k  Bot-k  Avg-k  Top-k  Bot-k  Avg-k  Top-k  Bot-k  Avg-k  Top-k  Bot-k  Avg-k
INIT           0.88   0.50   0.69   0.90   0.40   0.65   0.90   0.41   0.66   0.90   0.38   0.64
GEN            0.64   0.10   0.37   0.58   0.11   0.35   0.60   0.15   0.38   0.64   0.17   0.41
DTIM (n=1)     0.94   0.64   0.79   0.90   0.56   0.73   0.90   0.49   0.70   0.92   0.48   0.70
DTIM (n=2)     1.00   0.78   0.89   0.94   0.68   0.81   0.93   0.57   0.75   0.95   0.52   0.74
DTIM (n=3)     0.86   0.76   0.81   0.91   0.75   0.83   0.92   0.69   0.81   0.92   0.64   0.78
DTIM (n=4)     0.82   0.74   0.78   0.87   0.69   0.78   0.83   0.62   0.73   0.85   0.65   0.75

Table 1: Top-k and Bottom-k Precision (best result in each column highlighted)
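The numbers in Table 1 (and the clustering quality reported later) follow the metric definitions of Section 3.1. A minimal sketch of how such metrics can be computed from a sequence of labels ordered by non-increasing nativeness score (names illustrative):

```python
def topk_bottomk_precision(ranked_labels, k):
    """Top-k / Bottom-k / Avg-k precision over labels ('native' or
    'transliterable') sorted by non-increasing nativeness score."""
    top_p = sum(l == "native" for l in ranked_labels[:k]) / k
    bot_p = sum(l == "transliterable" for l in ranked_labels[-k:]) / k
    return top_p, bot_p, (top_p + bot_p) / 2.0

def clustering_quality(ranked_labels):
    """Cardinality-weighted clustering quality of Section 3.1: the top-N
    words form the predicted native cluster and the bottom-T words the
    predicted transliterable cluster, with N and T read off the labels."""
    n = sum(l == "native" for l in ranked_labels)
    t = len(ranked_labels) - n
    native_q = sum(l == "native" for l in ranked_labels[:n]) / n
    translit_q = sum(l == "transliterable" for l in ranked_labels[n:]) / t
    return (n * native_q + t * translit_q) / (n + t)
```

Because both clusters inherit their cardinalities from the labels, precision, recall and F-measure coincide here, which is why a single "quality" number suffices.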

5.5.1 Diversity Threshold

Table 3 analyzes the clustering quality trends of DTIM across varying values of τ. The table suggests that DTIM is extremely robust to variations in the diversity threshold, despite a slight preference towards values around 10 and 20. This suggests that a system designer looking to use DTIM need not carefully tune this parameter, due to its inherent robustness.

              Native  Transliterable  Weighted Average
INIT           0.79        0.38             0.69
GEN            0.73        0.17             0.59
DTIM (n=1)     0.81        0.44             0.72
DTIM (n=2)     0.84        0.50             0.75
DTIM (n=3)     0.86        0.60             0.79
DTIM (n=4)     0.86        0.60             0.79

Table 2: Clustering Quality (best result in each column highlighted)

τ →       5     10    20    50    100   1000
n = 1    0.72  0.72  0.72  0.72  0.72  0.72
n = 2    0.74  0.75  0.75  0.74  0.74  0.74
n = 3    0.77  0.79  0.78  0.78  0.78  0.78
n = 4    0.78  0.79  0.79  0.79  0.79  0.79

Table 3: DTIM Clustering Quality against τ

5.5.2 Word Stem Length

Given the nature of the Malayalam language, where the variations in word lengths are not as high as in English, it seemed very natural to use a word stem length of 2. Moreover, very large words are uncommon in Malayalam. In our corpus, 50% of words were found to contain five characters or less, the corresponding fraction being 71% for up to seven characters. Our analysis of DTIM across variations in word-stem length, illustrated in Table 4, strongly supports this intuition, with clustering quality peaking at a stem length of 2 for n ≥ 2. It is notable, however, that DTIM degrades gracefully on either side of 2. Trends across different settings of the word-stem length are interesting, since they may provide clues about applicability for other languages with varying character granularities (e.g., a single Chinese character corresponds to multiple characters in alphabetic scripts).
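The stem length enters DTIM only through the diversity-based initialization, so the sweep in Table 4 amounts to re-running Eq. 1 with different prefix lengths. A minimal sketch (illustrative names, not the released code):

```python
from collections import defaultdict

def diversity_scores(words, stem_len=2, tau=10):
    """Eq. 1 generalized to an arbitrary stem length: nativeness is
    proportional to the number of distinct characters observed right
    after the stem, capped at 0.99."""
    next_chars = defaultdict(set)
    for w in words:
        if len(w) > stem_len:
            next_chars[w[:stem_len]].add(w[stem_len])
    return {w: min(0.99, len(next_chars[w[:stem_len]]) / tau)
            for w in words if len(w) > stem_len}
```

Lengthening the stem fragments the diversity counts (each stem is seen with fewer continuations), which is consistent with the graceful degradation beyond stem length 2 observed above.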

Stem Length →   1     2     3     4
n = 1          0.64  0.72  0.75  0.56
n = 2          0.58  0.75  0.74  0.55
n = 3          0.59  0.79  0.69  0.60
n = 4          0.58  0.79  0.69  0.62

Table 4: DTIM Clustering Quality against Word Stem Length (best result in each row highlighted)

6 Discussion

6.1 Applicability to Other Languages

In contrast to earlier work focused on specific languages (e.g., Koo (2015)) that uses heuristics very specific to the language (such as expected patterns of consonants), the DTIM heuristics are general-purpose in design. The only heuristic setting that is likely to require some tuning for applicability to other languages is the word-stem length. We expect the approach to generalize well to other Sanskrit-influenced Dravidian languages such as Kannada and Telugu. Unfortunately, we had no Kannada/Telugu/Kodava expertise in our team (Dravidian languages have largely disjoint speaker populations), nor access to labelled datasets in those languages (they are low-resource languages too); testing DTIM on Kannada/Telugu/Tamil would be interesting future work. The method is expected to be less applicable to English, that language being significantly different and containing potentially fewer transliterable words.

6.2 DTIM in an Application Context

Within any target application context, machine-labelled transliterable words (and their automatically generated transliterations) may need manual screening for accountability reasons. The high accuracy at either end of the ordering lends itself to being exploited in the following fashion: in lieu of employing experts to verify all labellings/transliterations, low-expertise volunteers (e.g., students) can be called in to verify labellings at the ends (top/bottom) of the lists, with experts focusing on the middle (more ambiguous) part of the list; this frees up experts' time as against a cross-spectrum expert-verification process, leading to direct cost savings. We also expect that DTIM followed by automatic transliteration of the bottom-k words would aid retrieval and machine translation scenarios.

7 Conclusions and Future Work

We considered the problem of unsupervised separation of transliterable and native words in Malayalam, a critical task in easing automated processing of Malayalam text in the company of other language text. We outlined a key observation on the differential diversity beyond word stems, and formulated an initialization heuristic that coarsely separates native and transliterable words. We proposed the usage of probability distributions over character n-grams as a way of separately modelling native and transliterable words. We then formulated an iterative optimization method that alternately refines the nativeness scorings and the probability distributions. Our technique for the problem, DTIM, which encompasses the initialization and iterative refinement, was seen to significantly outperform other unsupervised baseline methods in our empirical study, establishing DTIM as the preferred method for the task. We have also released our datasets and labeled subset to aid future research on this and related tasks.

7.1 Future Work

The applicability of DTIM to other Dravidian languages would be interesting to study. Due to our lack of familiarity with any other language in the family, we look forward to contacting other groups to further the generalizability study. While nativeness scoring improvements directly translate to a reduction of effort in manual downstream processing, quantifying the gains they bring about in translation and retrieval is interesting future work. Exploring the relationship/synergy between this task and Sandhi splitting (Natarajan and Charniak, 2011) would form another interesting direction for future work. Further, we would like to develop methods to separate out the two categories of transliterable words, viz., foreign-language words and proper nouns. Such a method would enable a more fine-grained stratification of the vocabulary. Transliterable words are often used within Malayalam to refer to very topical content, for which suitable native words are harder to find. Thus, transliterable words could be preferentially treated towards building rules in interpretable clustering (Balachandran et al., 2012) and for modelling context in regex-oriented rule-based information extraction (Murthy et al., 2012). Transliterable words might also hold cues for detecting segment boundaries in conversational transcripts (Kummamuru et al., 2009; Padmanabhan and Kummamuru, 2007).

References

[Baker and Brew2008] Kirk Baker and Chris Brew. 2008. Statistical identification of english loanwords in korean using automatically generated training data. In LREC. Citeseer.

[Balachandran et al.2012] Vipin Balachandran, P Deepak, and Deepak Khemani. 2012. Interpretable and reconfigurable clustering of document datasets by deriving word-based rules. Knowledge and Information Systems, 32(3):475–503.

[Bandhakavi et al.2014] Anil Bandhakavi, Nirmalie Wiratunga, P Deepak, and Stewart Massie. 2014. Generating a word-emotion lexicon from #emotional tweets. In *SEM@COLING, pages 12–21.

[Chen and Lee1996] Hsin-Hsi Chen and Jen-Chang Lee. 1996. Identification and classification of proper nouns in chinese texts. In COLING.

[Deepak and Visweswariah2014] P Deepak and Karthik Visweswariah. 2014. Unsupervised solution post identification from discussion forums. In ACL (1), pages 155–164.

[Dempster et al.1977] A. P. Dempster, N. M. Laird, and D. B. Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 39(1):1–38.

[Smucker and Allan2006] Mark Smucker and James Allan. 2006. An investigation of dirichlet prior smoothing's performance advantage. IR.

[Tsvetkov and Dyer2015] Yulia Tsvetkov and Chris Dyer. 2015. Lexicon stratification for translating out-of-vocabulary words. In ACL Short Papers, pages 125–131.

[Vincze et al.2013] Veronika Vincze, János Zsibrita, and István Nagy. 2013. Dependency parsing for identifying hungarian light verb constructions. In IJCNLP, pages 207–215.

[Goldberg and Elhadad2008] Yoav Goldberg and Michael Elhadad. 2008. Identification of transliterated foreign words in hebrew script. In Computational linguistics and intelligent text processing, pages 466–477. Springer.

[Jeong et al.1999] Kil Soon Jeong, Sung Hyon Myaeng, Jae Sung Lee, and Key-Sun Choi. 1999. Automatic identification and back-transliteration of foreign words for information retrieval. Inf. Proc. & Mgmt., 35.

[Koo2015] Hahn Koo. 2015. An unsupervised method for identifying loanwords in korean. Language Resources and Evaluation, 49(2):355–373.

[Kummamuru et al.2009] Krishna Kummamuru, Deepak Padmanabhan, Shourya Roy, and L Venkata Subramaniam. 2009. Unsupervised segmentation of conversational transcripts. Statistical Analysis and Data Mining: The ASA Data Science Journal, 2(4):231–245.

[Murthy et al.2012] Karin Murthy, P Deepak, and Prasad M Deshpande. 2012. Improving recall of regular expressions for information extraction. In International Conference on Web Information Systems Engineering, pages 455–467. Springer.

[Natarajan and Charniak2011] Abhiram Natarajan and Eugene Charniak. 2011. $S^3$ - statistical sandhi splitting. In IJCNLP, pages 301–308.

[Ntoulas et al.2001] Alexandros Ntoulas, Sofia Stamou, and Manolis Tzagarakis. 2001. Using a www search engine to evaluate normalization performance for a highly inflectional language. In ACL (Companion Volume), pages 31–36.

[Padmanabhan and Kummamuru2007] Deepak Padmanabhan and Krishna Kummamuru. 2007. Mining conversational text for procedures with applications in contact centers. International Journal on Document Analysis and Recognition, 10(3):227–238.

[Prasad et al.2014] Reshma Prasad, Mary Priya Sebastian, et al. 2014. A technique to extract transliterating phrases in statistical machine translation from english to malayalam. In National Conference on Indian Language Computing.