arXiv:1803.09641v1 [cs.CL] 26 Mar 2018

Unsupervised Separation of Transliterable and Native Words for Malayalam

Deepak P
Queen's University Belfast, UK
[email protected]

Abstract

Differentiating intrinsic language words from transliterable words is a key step aiding text processing tasks involving different natural languages. We consider the problem of unsupervised separation of transliterable words from native words for text in Malayalam language. Outlining a key observation on the diversity of characters beyond the word stem, we develop an optimization method to score words based on their nativeness. Our method relies on the usage of probability distributions over character n-grams that are refined in step with the nativeness scorings in an iterative optimization formulation. Using an empirical evaluation, we illustrate that our method, DTIM, provides significant improvements in nativeness scoring for Malayalam, establishing DTIM as the preferred method for the task.

1 Introduction

Malayalam is an agglutinative language from the southern Indian state of Kerala, where it is the official state language. It is spoken by 38 million native speakers, three times as many speakers as Hungarian (Vincze et al., 2013) or Greek (Ntoulas et al., 2001), languages for which many specialized techniques have been developed in other contexts. The growing web presence of Malayalam necessitates automatic techniques to process Malayalam text. A major hurdle in harnessing Malayalam text from social and web media for multi-lingual retrieval and machine translation is the presence of a large amount of transliterable words. By transliterable words, we mean both (a) words (from English) such as police and train that virtually always appear in transliterated form in contemporary Malayalam, and (b) proper nouns such as names that need to be transliterated rather than translated to correlate with English text.
On a manual analysis of a news article dataset, we found that transliterated words and proper nouns form 10-12% of all distinct words. It is useful to transliterate such words for scenarios that involve processing Malayalam text in the company of English text; this will avoid them being treated as separate index terms (wrt their transliteration) in a multi-lingual retrieval engine, and help a statistical translation system make use of the link to improve its effectiveness. In this context, it is notable that there has been recent interest in devising specialized methods to translate words that fall outside the core vocabulary (Tsvetkov and Dyer, 2015).

In this paper, we consider the problem of separating out such transliterable words from the other words within an unlabeled dataset; we refer to the latter as "native" words. We propose an unsupervised method, DTIM, that takes a dictionary of distinct words from a Malayalam corpus and scores each word based on its nativeness. Our optimization method, DTIM, iteratively refines the nativeness scoring of each word, leveraging dictionary-level statistics modelled using character n-gram probability distributions. Our empirical analysis establishes the effectiveness of DTIM.

We outline related work in the area in Section 2. This is followed by the problem statement in Section 3 and the description of our proposed approach in Section 4. Our empirical analysis forms Section 5, followed by conclusions in Section 7.

2 Related Work

Identification of transliterable text fragments, being a critical task for cross-lingual text analysis, has attracted attention since the 1990s. While most methods addressing the problem have used supervised learning, there have been some methods that can work without labeled data. We briefly survey both classes of methods.

2.1 Supervised and 'pseudo-supervised' Methods

An early work (Chen and Lee, 1996) focuses on a sub-problem, that of supervised identification of proper nouns for Chinese. (Jeong et al., 1999) consider leveraging decision trees to address the related problem of learning transliteration and back-transliteration rules for English/Korean word pairs. Recognizing the costs of procuring training data, (Baker and Brew, 2008) and (Goldberg and Elhadad, 2008) explore the usage of pseudo-transliterable words generated using transliteration rules on an English dictionary, for Korean and Hebrew respectively. Such pseudo-supervision, however, would not be able to generate uncommon domain-specific terms such as medical/scientific terminology for usage in such domains (unless specifically tuned), and is hence limited in utility.

2.2 Unsupervised Methods

A recent work proposes that multi-word phrases in Malayalam text whose component words exhibit strong co-occurrence be categorized as transliterable phrases (Prasad et al., 2014). Their intuition stems from observing contiguous words such as test dose, which often occur in transliterated form while occurring together, but get replaced by native words in other contexts. Their method is however unable to identify single transliterable words, or phrases involving words such as train and police whose transliterations are heavily used in the company of native Malayalam words. A recent method for Korean (Koo, 2015) starts by identifying a seed set of transliterable words as those that begin or end with consonant clusters and have vowel insertions; this is specific to Korean since Korean words apparently do not begin or end with consonant clusters. High-frequency words are then used as seed words for native Korean for usage in a Naive Bayes classifier. In addition to the outlined reasons that make both these unsupervised methods inapplicable for our task, they both presume availability of corpus frequency statistics. We focus on a general scenario assuming the availability of only a word lexicon.

2.3 Positioning the Transliterable Word Identification Task

Nativeness scoring of words may be seen as a vocabulary stratification step (upon usage of thresholds) for usage by downstream applications. A multi-lingual text mining application that uses Malayalam and English text would benefit by transliterating non-native Malayalam words to English, so that the transliterable Malayalam token and its transliteration are treated as the same token. For machine translation, transliterable words may be channeled to specialized translation methods (e.g., (Tsvetkov and Dyer, 2015)) or for manual screening and translation.

3 Problem Definition

We now define the problem more formally. Consider n distinct words obtained from Malayalam text, W = {..., w, ...}. Our task is to devise a technique that can use W to arrive at a nativeness score for each word w within it, as w_n. We would like w_n to be an accurate quantification of the nativeness of word w. For example, when the words in W are ordered in decreasing order of their w_n scores, we expect to get the native words at the beginning of the ordering, and vice versa. We do not presume availability of any data other than W; this makes our method applicable across scenarios where corpus statistics are unavailable due to privacy or other reasons.

3.1 Evaluation

Given that it is easier for humans to crisply classify each word as either native or transliterable (nouns or transliterated English words) than to attach a score to each word, the nativeness scoring (as generated by a scoring method such as ours) often needs to be evaluated against a crisp nativeness assessment, i.e., a scoring with scores in {0, 1}. To aid this, we consider the ordering of words in the labeled set in the decreasing (or more precisely, non-increasing) order of nativeness scores (each method produces an ordering for the dataset). To evaluate this ordering, we use two sets of metrics:

- Precision at the ends of the ordering: Top-k precision denotes the fraction of native words within the k words at the top of the ordering; analogously, Bottom-k precision is the fraction of transliterable words among the bottom k. Since a good scoring would likely put native words at the top of the ordering and the transliterable ones at the bottom, a good scoring method would intuitively score high on both these metrics. We call the average of the top-k and bottom-k precision for a given k the Avg-k precision. These measures, evaluated at varying values of k, indicate the quality of the nativeness scoring.

- Clustering Quality: Consider the cardinalities of the native and transliterable words from the labeled set as being N and T respectively. We now take the top-N words and bottom-T words from the ordering generated by each method, and compare against the respective labeled sets as in the case of standard clustering quality evaluation (see https://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html). Since the cardinalities of the generated native (transliterable) cluster and the native (transliterable) labeled set are both N (T), the Recall of the cluster is identical to its Purity/Precision, and thus its F-measure too; we simply call it Clustering Quality. A cardinality-weighted average of the clustering quality across the native and transliterable clusters yields a single value for the clustering quality across the dataset. It may be noted that we are not making the labeled dataset available to the method generating the ordering; we merely use its cardinalities for evaluation purposes.

4 Our Method: DTIM

We now introduce our method, Diversity-based Transliterable Word Identification for Malayalam (DTIM). We use probability distributions over character n-grams to separately model transliterable and native words, and develop an optimization framework that alternately refines the n-gram distributions and the nativeness scoring within each iteration. DTIM involves an initialization that induces a "coarse" separation between native and transliterable words, followed by iterative refinement. The initialization is critical in optimization methods that are vulnerable to local optima; the pure word distribution needs to be initialized to "coarsely" prefer pure words over transliterable words. This enables further iterations to exploit the initial preference direction to further refine the model to "attract" the pure words more strongly and weaken any initial preference to transliterable words. The vice versa holds for the transliterable word models. We will first outline the initialization step, followed by the description of the method.

4.1 Diversity-based Initialization

Our initialization is inspired by an observation on the variety of suffixes attached to a word stem. (We represent Malayalam words in transliterated form for reading by those who might not be able to read Malayalam; a pipe separates Malayalam characters, so that, for example, |pu| corresponds to a single Malayalam character.) Consider a word stem |pu|ra|, a stem commonly leading to native Malayalam words; its suffixes are observed to start with a variety of characters such as |ttha| (e.g., |pu|ra|ttha|kki|), |me| (e.g., |pu|ra|me|), |mbo| (e.g., |pu|ra|mbo|kku|) and |ppa| (e.g., |pu|ra|ppa|du|). On the other hand, stems that mostly lead to transliterable words often do not exhibit so much diversity. For example, |re|so| is followed only by |rt| (i.e., resort) and |po|li| is usually only followed by |s| (i.e., police). Some stems such as |o|ppa| lead to transliterations of two English words, such as open and operation. Our observation, upon which we model the initialization, is that the variety of suffixes is generally correlated with the nativeness (i.e., propensity to lead to a native word) of word stems. This is intuitive since non-native word stems provide limited flexibility to being modified by derivational or inflectional suffixes as compared to native ones.

For simplicity, we use the first two characters of each word as the word stem; we will evaluate the robustness of DTIM to varying stem lengths in our empirical evaluation, while consistently using a stem length of two characters in our description. We start by associating each distinct word stem in W with the number of unique third characters that follow it (among words in W); in our examples, |pu|ra| and |o|ppa| would be associated with 4 and 2 respectively. We initialize the nativeness weights as proportional to the diversity of third characters beyond the stem:

    w_n^0 = min( 0.99, |u3(w_stem, W)| / τ )    (1)

where u3(w_stem, W) denotes the set of third characters that follow the stem of word w among words in W. We flatten off w_n^0 scores beyond a diversity of τ (note that a diversity of τ or higher will lead to the second term becoming 1.0 or higher, kicking in the min function to choose 0.99 for w_n^0), as shown in the above equation. We give a small transliterable-ness weight even to highly diverse stems to reduce over-reliance on the initialization. We set τ = 10 based on our observation from the dataset that most word stems having more than 10 distinct third characters were seen to be native. As in the case of the word stem length, we study DTIM trends across varying τ in our empirical analysis. w_n^0 is in [0, 1]; analogously, (1 − w_n^0) may be regarded as a score of transliterable-ness.
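The diversity-based initialization of Eq. 1 can be sketched as follows. This is a minimal illustrative sketch under our own naming (`diversity_init` is not from the paper's implementation), and it uses single Latin characters as stand-ins for the pipe-separated Malayalam characters; words no longer than the stem are simply skipped here.

```python
# Sketch of the diversity-based initialization (Eq. 1).
from collections import defaultdict

def diversity_init(words, stem_len=2, tau=10):
    """Map each word to an initial nativeness score w_n^0 in [0, 1]."""
    # u3: for each stem, the set of distinct characters seen right after it
    next_chars = defaultdict(set)
    for w in words:
        if len(w) > stem_len:
            next_chars[w[:stem_len]].add(w[stem_len])
    # w_n^0 = min(0.99, |u3(stem, W)| / tau): flattened beyond diversity tau
    return {w: min(0.99, len(next_chars[w[:stem_len]]) / tau)
            for w in words if len(w) > stem_len}
```

For instance, a stem followed by three distinct characters gets score 0.3 at τ = 10, while a stem whose diversity reaches τ is capped at 0.99.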
4.2 Objective Function and Optimization Framework

As outlined earlier, we use separate character n-gram probability distributions to model native and transliterable words. We would like these probability distributions to support the nativeness scoring, and vice versa. While the size of the n-grams (i.e., whether n = 1, 2, 3 or 4) is a system-level parameter, we use n = 1 for simplicity in our description. We denote the native and transliterable distributions as N and T respectively, with N(c) and T(c) denoting the weight associated with the character c according to each distribution. Consider the following function, given a particular state for N, T and the w_n's:

    Π_{w ∈ W} Π_{c ∈ w} ( w_n² × N(c) + (1 − w_n)² × T(c) )    (2)

This measures the aggregate support for words in W, the support for each word measured as an interpolated support from across the distributions N and T, with weighting factors being the squares of the nativeness scores (i.e., the w_n's) and the transliterableness scores (i.e., the (1 − w_n)'s) respectively. Similar mixing models have been used earlier in emotion lexicon learning (Bandhakavi et al., 2014) and solution post discovery (Deepak and Visweswariah, 2014). The squares of the nativeness scores are used in our model (instead of the raw scores) for optimization convenience. A highly native word should intuitively have a high w_n (nativeness) and a high support from N, and correspondingly low transliterable-ness (i.e., (1 − w_n)) and support from T; a highly transliterable word would be expected to have exactly the opposite. Due to the design of Eq. 2 in having the higher terms multiplied with each other (and so for the lower terms), this function would be maximized for a desirable estimate of the variables θ = {N, T, {..., w_n, ...}}. Conversely, by striving to maximize the objective function, we would arrive at a desirable estimate of the variables. An alternative construction, yielding a minimizing objective, would be as follows:

    Π_{w ∈ W} Π_{c ∈ w} ( (1 − w_n)² × N(c) + w_n² × T(c) )    (3)

In this form, given a good estimate of the variables, the native (transliterable) words have their nativeness (transliterableness) weights multiplied with the support from the transliterable (native) models. In other words, maximizing the objective in Eq. 2 is semantically similar to minimizing the objective in Eq. 3. As we will illustrate soon, it is easier to optimize for N and T using the maximizing formulation in Eq. 2, while the minimizing objective in Eq. 3 lends itself better to optimizing the word nativeness scores, {..., w_n, ...}.

4.3 Learning N and T using the Maximizing Objective

We start by taking the log-form of the objective in Eq. 2 (this does not affect the optimization direction), yielding:

    O_max = Σ_{w ∈ W} Σ_{c ∈ w} ln( w_n² × N(c) + (1 − w_n)² × T(c) )    (4)

The distributions, being probability distributions over n-grams, should each sum to one. This constraint, for our unigram models, can be written as:

    Σ_c N(c) = Σ_c T(c) = 1    (5)

Fixing the values of {..., w_n, ...} and T (or N), we can now identify a better estimate for N (or T) by looking for an optimum (i.e., where the objective function has a slope of zero). Towards that, we take the partial derivative (or slope) of the objective for a particular character c′:

    ∂O_max / ∂N(c′) = Σ_{w ∈ W} ( freq(c′, w) × w_n² / ( w_n² N(c′) + (1 − w_n)² T(c′) ) ) + λ_N    (6)

where freq(c′, w) is the frequency of the character c′ in w, and λ_N denotes the Lagrangian multiplier corresponding to the sum-to-unity constraint for N. Equating this to zero does not however yield a closed-form solution for N, but a simple re-arrangement yields an iterative update formula:

    N(c′) ∝ Σ_{w ∈ W} ( freq(c′, w) × w_n² / ( w_n² + (1 − w_n)² × T(c′)/N_P(c′) ) )    (7)

The N_P term in the RHS indicates the usage of the previous estimate of N. The sum-to-one constraint is trivially achieved by first estimating the N(c′)'s by treating Eq. 7 as an equality, followed by normalizing the scores across the character vocabulary. Eq. 7 is intuitively reasonable, due to establishing a somewhat direct relationship between N and w_n (in the numerator), thus allowing highly native words to contribute more to building N. The analogous update formula for T, fixing N, turns out to be:

    T(c′) ∝ Σ_{w ∈ W} ( freq(c′, w) × (1 − w_n)² / ( (1 − w_n)² + w_n² × N(c′)/T_P(c′) ) )    (8)

Eq. 7 and Eq. 8 would lead us closer to a maximum of Eq. 4 if their second (partial) derivatives are negative (by the second derivative test; see http://mathworld.wolfram.com/SecondDerivativeTest.html). To verify this, we note that the second (partial) derivative wrt N(c′) is as follows:

    ∂²O_max / ∂²N(c′) = (−1) × Σ_{w ∈ W} ( freq(c′, w) × (w_n²)² / ( w_n² N(c′) + (1 − w_n)² T(c′) )² )    (9)

It is easy to observe that the RHS is a product of −1 and a sum of a plurality of positive terms (square terms that are trivially positive, the exception being the freq(., .) term, which is also non-negative by definition), altogether yielding a negative value. That the second (partial) derivative is negative confirms that the update formula derived from the first partial derivative indeed helps in maximizing O_max wrt N(c′). A similar argument holds for the T(c′) updates as well, which we omit for brevity.

4.4 Learning the nativeness scores, {..., w_n, ...}, using the Minimizing Objective

Analogous to the previous section, we take the log-form of Eq. 3:

    O_min = Σ_{w ∈ W} Σ_{c ∈ w} ln( (1 − w_n)² × N(c) + w_n² × T(c) )    (10)

Unlike the earlier case, we do not have any constraints, since the sum-to-unit constraint on the nativeness and transliterableness scores is built into the construction. We now fix the values of all other variables and find the slope wrt w′_n, where w′ indicates a particular word in W:

    ∂O_min / ∂w′_n = Σ_{c ∈ w′} ( ( 2w′_n T(c) + 2w′_n N(c) − 2N(c) ) / ( w′_n² T(c) + (1 − w′_n)² N(c) ) )    (11)

We equate the slope to zero and form an iterative update formula, much like in the distribution estimation phase:

    w′_n = [ Σ_{c ∈ w′} N(c) / ( w′_n² T(c) + (1 − w′_n)² N(c) ) ] / [ Σ_{c ∈ w′} ( N(c) + T(c) ) / ( w′_n² T(c) + (1 − w′_n)² N(c) ) ]    (12)

Using the previous estimates of w′_n for the RHS yields an iterative update form for the nativeness scores. As in the model estimation phase, the update rule establishes a reasonably direct relationship between w′_n and N. Since our objective is to minimize O_min, we would like to verify the direction of optimization using the second partial derivative:

    ∂²O_min / ∂²w′_n = Σ_{c ∈ w′} ( ( N(c)T(c) − ( w′_n T(c) − (1 − w′_n) N(c) )² ) / ( w′_n² T(c) + (1 − w′_n)² N(c) )² )    (13)

We provide an informal argument for the positivity of the second derivative; note that the denominator is a square term, making it enough to analyze just the numerator. Consider a highly native word (high w′_n) whose characters would intuitively satisfy N(c) > T(c). For the boundary case of w′_n = 1, the numerator reduces to T(c) × (N(c) − T(c)), which would be positive given the expected relation between N(c) and T(c). A similar argument holds for highly transliterable words.
For words with w′_n → 0.5, where we would expect N(c) ≈ T(c), the numerator becomes N(c)T(c) − 0.25(T(c) − N(c))², which is expected to be positive since the difference term is small, making its square very small in comparison to the first product term. To outline the informal nature of the argument, it may be noted that T(c) > N(c) may hold for certain characters within highly native words; but as long as most of the characters within highly native words satisfy N(c) > T(c), there would be sufficient positivity to offset the negative terms induced by such outlier characters.

4.5 DTIM: The Method

Having outlined the learning steps, the method is simply an iterative usage of the learning steps, as outlined in Algorithm 1. In the first invocation of the distribution learning step, where previous estimates are not available, we simply assume a uniform distribution across the n-gram vocabulary for usage as the previous estimates. Each of the update steps is linear in the size of the dictionary, making DTIM a computationally lightweight method. Choosing n = 2 instead of unigrams (as used in our narrative) is easy, since that simply involves replacing the c ∈ w all across the update steps by [c_i, c_{i+1}] ∈ w, with [c_i, c_{i+1}] denoting pairs of contiguous characters within the word; similarly, n = 3 involves usage of contiguous character triplets and correspondingly learning the distributions N and T over triplets. The DTIM structure is evidently inspired by the Expectation-Maximization framework (Dempster et al., 1977) involving alternating optimizations of an objective function; DTIM, however, uses different objective functions for the two steps, for optimization convenience.

Algorithm 1: DTIM
    Input: A set of Malayalam words, W
    Output: A nativeness scoring w_n ∈ [0, 1] for every word w in W
    Hyper-parameters: word stem length, τ, n
    Initialize the w_n scores for each word using the diversity metric in Eq. 1,
        using the chosen stem length and τ
    while not converged and max number of iterations not reached do
        Estimate the n-gram distributions N and T using Eq. 7 and Eq. 8 respectively
        Learn nativeness weights for each word using Eq. 12
    end
    return latest estimates of nativeness weights

5 Experiments

We now describe our empirical study of DTIM, starting with the dataset and experimental setup, leading on to the results and analyses.

5.1 Dataset

We evaluate DTIM on a set of 65068 distinct words from across news articles sourced from Mathrubhumi (http://www.mathrubhumi.com), a popular Malayalam newspaper; this word list has been made available publicly (Dataset: https://goo.gl/DOsFES). For evaluation purposes, we got a random subset of 1035 words labeled by one of three human annotators; that has been made available publicly too (Labeled Set: https://goo.gl/XEVLWv), each word labeled as either native, transliterable or unknown. There were approximately 3 native words for every transliterable word in the labeled set, reflective of the distribution in contemporary Malayalam usage as alluded to in the introduction. We will use the whole set of 65068 words as input to the method, while the evaluation would obviously be limited to the labeled subset of 1035 words.

5.2 Baselines

As outlined in Section 2, the unsupervised version of the problem of telling apart native and transliterable words for Malayalam and/or similar languages has not been addressed in the literature, to the best of our knowledge. The unsupervised Malayalam-focused method (Prasad et al., 2014) (Ref: Sec 2.2) is able to identify only transliterable word-pairs, making it inapplicable for contexts such as a health data scenario where individual English words are often transliterated for want of a suitable Malayalam alternative. The Korean method (Koo, 2015) is too specific to the Korean language and cannot be used for other languages, due to the absence of a generic high-precision rule to identify a seed set of transliterable words. With both the unsupervised state-of-the-art approaches being inapplicable for our task, we compare against an intuitive generalization-based baseline, called GEN, that orders words based on their support from the combination of unigram and bigram character language models learnt over W; this leads to a scoring as follows:

    w_n^GEN = Π_{[c_i, c_{i+1}] ∈ w} ( λ × B_W(c_{i+1}|c_i) + (1 − λ) × U_W(c_{i+1}) )    (14)

where B_W and U_W are bigram and unigram character-level language models built over all words in W. We set λ = 0.8 (Smucker and Allan, 2006). We experimented with higher-order models in GEN, but observed drops in the evaluation measures, leading us to stick to the usage of the unigram and bigram models. The form of Eq. 14 is inspired by an assumption, similar to that used in both (Prasad et al., 2014) and (Koo, 2015), that transliterable words are rare; thus, we expect they would not be adequately supported by models that generalize over the whole of W. We also compare against our diversity-based initialization score from Section 4.1, which we will call INIT. For ease of reference, we outline the INIT scoring:

    w_n^INIT = min( 0.99, |u3(w_stem, W)| / τ )    (15)

The comparison against INIT enables us to isolate and highlight the value of the iterative update formulation vis-a-vis the initialization.

5.3 Evaluation Measures and Setup

As outlined in Section 3, we use top-k, bottom-k and avg-k precision (evaluated at varying values of k), as well as clustering quality, in our evaluation. For the comparative evaluation, we set the DTIM parameters as follows: τ = 10 and a word-stem length of 2. We will study trends against variations across these parameters in a separate section.

5.4 Experimental Results

5.4.1 Precision at the Ends of the Ordering

Table 1 lists the precision measures over various values of k. It may be noted that any instantiation of DTIM (across the four values of n-gram size, n) is able to beat the baselines on each metric at each value of k, convincingly establishing the effectiveness of the DTIM formulation. DTIM is seen to be much more effective in separating out the native and transliterable words at either end of the ordering than the baselines. It is also notable that the EM-style iterations are able to significantly improve upon the initialization (i.e., INIT). That the bottom-k precision is seen to be consistently lower than the top-k precision needs to be juxtaposed with the observation that there were only 25% transliterable words against 75% native words; thus, the lift in precision against a random ordering is much more substantial for the transliterable words. The trends across varying n-gram sizes (i.e., n) in DTIM are worth noting too; the higher values of n (such as 3 and 4) are seen to make more errors at the ends of the lists, whereas they catch up with the n ∈ {1, 2} versions as k increases. This indicates that smaller-n DTIM is able to tell apart a minority of the words exceedingly well (wrt nativeness), whereas the higher n-gram modelling is able to spread out the gains across a larger spectrum of words in W. Around n = 4 and beyond, sparsity effects (since 4-grams and 5-grams would not occur frequently, making it harder to exploit their occurrence statistics) are seen to kick in, causing reductions in precision.

5.4.2 Clustering Quality

Table 2 lists the clustering quality metric across the various methods. Clustering quality, unlike the precision metrics, is designed to evaluate the entire ordering without limiting the analysis to just the top-k and bottom-k entries. As in the earlier case, DTIM convincingly outperforms the baselines by healthy margins across all values of n. Consistent with the trends across n observed earlier, n ∈ {3, 4} are seen to deliver better accuracies, with such gains tapering off beyond n = 4 due to sparsity effects. The words, along with the DTIM nativeness scores for n = 3, can be viewed at https://goo.gl/OmhlB3 (the list excludes words with fewer than 3 characters).

5.5 Analyzing DTIM

We now analyze the performance of DTIM across varying values of the diversity threshold (τ) and word-stem lengths.
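The precision and clustering-quality numbers reported in the following tables (as defined in Section 3.1) can be sketched as follows; the function names are our own, and labels map each word to True for native and False for transliterable.

```python
# Sketch of the Section 3.1 evaluation metrics over a scored ordering.
def top_bottom_precision(ordering, labels, k):
    """ordering: labelled words sorted by non-increasing nativeness score.
    Returns (top-k, bottom-k, avg-k) precision."""
    top = sum(labels[w] for w in ordering[:k]) / k          # native at the top
    bottom = sum(not labels[w] for w in ordering[-k:]) / k  # transliterable at the bottom
    return top, bottom, (top + bottom) / 2

def clustering_quality(ordering, labels):
    """Cardinality-weighted clustering quality of the induced two-way split."""
    n_native = sum(labels.values())           # N: native cardinality
    n_translit = len(ordering) - n_native     # T: transliterable cardinality
    q_nat = sum(labels[w] for w in ordering[:n_native]) / n_native
    q_tra = sum(not labels[w] for w in ordering[n_native:]) / n_translit
    return (q_nat * n_native + q_tra * n_translit) / len(ordering)
```

A perfect ordering yields 1.0 on all of these; a split with one native/transliterable swap loses quality in both clusters, weighted by their sizes.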
                 k=50                   k=100                  k=150                  k=200
             Top-k Bot-k Avg-k      Top-k Bot-k Avg-k      Top-k Bot-k Avg-k      Top-k Bot-k Avg-k
INIT         0.88  0.50  0.69       0.90  0.40  0.65       0.90  0.41  0.66       0.90  0.38  0.64
GEN          0.64  0.10  0.37       0.58  0.11  0.35       0.60  0.15  0.38       0.64  0.17  0.41
DTIM (n=1)   0.94  0.64  0.79       0.90  0.56  0.73       0.90  0.49  0.70       0.92  0.48  0.70
DTIM (n=2)   1.00  0.78  0.89       0.94  0.68  0.81       0.93  0.57  0.75       0.95  0.52  0.74
DTIM (n=3)   0.86  0.76  0.81       0.91  0.75  0.83       0.92  0.69  0.81       0.92  0.64  0.78
DTIM (n=4)   0.82  0.74  0.78       0.87  0.69  0.78       0.83  0.62  0.73       0.85  0.65  0.75

Table 1: Top-k and Bottom-k Precision (best result in each column highlighted)
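For reference, the GEN baseline scoring of Eq. 14, reported in the tables, can be sketched as a product of interpolated bigram/unigram character language-model probabilities. The function below is our illustrative sketch using maximum-likelihood estimates over W, not the authors' implementation.

```python
# Sketch of the GEN baseline (Eq. 14): interpolated character LM support.
from collections import Counter

def gen_scores(words, lam=0.8):
    uni = Counter(c for w in words for c in w)                 # unigram counts
    total = sum(uni.values())
    bi = Counter((w[i], w[i + 1]) for w in words for i in range(len(w) - 1))
    ctx = Counter(w[i] for w in words for i in range(len(w) - 1))  # left contexts
    def score(w):
        s = 1.0
        for i in range(len(w) - 1):
            bw = bi[(w[i], w[i + 1])] / ctx[w[i]]   # B_W(c_{i+1} | c_i), MLE
            uw = uni[w[i + 1]] / total              # U_W(c_{i+1}), MLE
            s *= lam * bw + (1 - lam) * uw          # lambda = 0.8 in the paper
        return s
    return {w: score(w) for w in words}
```

Since the product ranges over a word's bigrams, rare character sequences (which the paper expects for transliterable words) attract low support from models generalized over the whole of W.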

5.5.1 Diversity Threshold

Table 3 analyzes the clustering quality trends of DTIM across varying values of τ. The table suggests that DTIM is extremely robust to variations in the diversity threshold, despite a slight preference towards values around 10 and 20. This suggests that a system designer looking to use DTIM need not carefully tune this parameter, due to the inherent robustness.

             Native  Transliterable  Weighted Average
INIT         0.79    0.38            0.69
GEN          0.73    0.17            0.59
DTIM (n=1)   0.81    0.44            0.72
DTIM (n=2)   0.84    0.50            0.75
DTIM (n=3)   0.86    0.60            0.79
DTIM (n=4)   0.86    0.60            0.79

Table 2: Clustering Quality (best result in each column highlighted)

       τ →   5     10    20    50    100   1000
n = 1        0.72  0.72  0.72  0.72  0.72  0.72
n = 2        0.74  0.75  0.75  0.74  0.74  0.74
n = 3        0.77  0.79  0.78  0.78  0.78  0.78
n = 4        0.78  0.79  0.79  0.79  0.79  0.79

Table 3: DTIM Clustering Quality against τ

5.5.2 Word Stem Length

Given the nature of the Malayalam language, where the variations in word lengths are not as high as in English, it seemed very natural to use a word stem length of 2. Moreover, very large words are uncommon in Malayalam: in our corpus, 50% of words were found to contain five characters or less, the corresponding fraction being 71% for up to seven characters. Our analysis of DTIM across variations in word-stem length, illustrated in Table 4, strongly supports this intuition, with clustering quality peaking at a stem length of 2 for n ≥ 2. It is notable, however, that DTIM degrades gracefully on either side of 2. Trends across different settings of word-stem length are interesting since they may provide clues about applicability to other languages with varying character granularities (e.g., each Chinese character corresponds to multiple characters in -script).

Stem Length →   1      2      3      4
n = 1           0.64   0.72   0.75   0.56
n = 2           0.58   0.75   0.74   0.55
n = 3           0.59   0.79   0.69   0.60
n = 4           0.58   0.79   0.69   0.62

Table 4: DTIM Clustering Quality against Word Stem Length (best result in each row highlighted)

6 Discussion

6.1 Applicability to Other Languages

In contrast to earlier work focused on specific languages (e.g., (Koo, 2015)) that uses heuristics that are very specific to the language (such as expected patterns of consonants), the DTIM heuristics are general-purpose in design. The only heuristic setting that is likely to require some tuning for applicability to other languages is the word-stem length. We expect the approach would generalize well to other Sanskrit-influenced Dravidian languages such as Kannada/Telugu. Unfortunately, we did not have any Kannada/Telugu/Kodava knowledge in our team (Dravidian languages have largely disjoint speaker populations), or access to labelled datasets in those languages (they are low-resource languages too); testing this on Kannada/Telugu/Tamil would be interesting future work. The method is expected to be less applicable to English, that language being significantly different and having potentially fewer transliterable words.

6.2 DTIM in an Application Context

Within any target application context, machine-labelled transliterable words (and their automatically generated transliterations) may need manual screening for accountability reasons. The high accuracy at either end of the ordering lends itself to be exploited in the following fashion: in lieu of employing experts to verify all labellings/transliterations, low-expertise volunteers (e.g., students) can be called in to verify labellings at the ends (top/bottom) of the lists, with experts focusing on the middle (more ambiguous) part of the list; this frees up experts' time as against a cross-spectrum expert-verification process, leading to direct cost savings. We also expect that DTIM followed by automatic transliteration of the bottom-k words would aid retrieval and machine translation scenarios.

Transliterable words are often used within Malayalam to refer to very topical content, for which suitable native words are harder to find. Thus, transliterable words could be preferentially treated towards building rules in interpretable clustering (Balachandran et al., 2012) and for modelling context in regex-oriented rule-based information extraction (Murthy et al., 2012). Transliterable words might also hold cues for detecting segment boundaries in conversational transcripts (Kummamuru et al., 2009; Padmanabhan and Kummamuru, 2007).

7 Conclusions and Future Work

We considered the problem of unsupervised separation of transliterable and native words in Malayalam, a critical task in easing automated processing of Malayalam text in the company of other language text. We outlined a key observation on the differential diversity beyond word stems, and formulated an initialization heuristic that would coarsely separate native and transliterable words. We proposed the usage of probability distributions over character n-grams as a way of separately modelling native and transliterable words. We then formulated an iterative optimization method that alternately refines the nativeness scorings and probability distributions. Our technique for the problem, DTIM, which encompasses the initialization and iterative refinement, was seen to significantly outperform other unsupervised baseline methods in our empirical study. This establishes DTIM as the preferred method for the task. We have also released our datasets and labeled subset to help aid future research on this and related tasks.

7.1 Future Work

The applicability of DTIM to other Dravidian languages is interesting to study. Due to our lack of familiarity with any other language in the family, we look forward to contacting other groups to further the generalizability study. While nativeness scoring improvements directly translate to a reduction of effort for manual downstream processing, quantifying the gains these bring about in translation and retrieval is interesting future work. Exploring the relationship/synergy between this task and Sandhi splitting (Natarajan and Charniak, 2011) would form another interesting direction for future work. Further, we would like to develop methods to separate out the two categories of transliterable words, viz., foreign language words and proper nouns; such a method would enable a more fine-grained stratification of the vocabulary.

References

[Baker and Brew2008] Kirk Baker and Chris Brew. 2008. Statistical identification of english loanwords in korean using automatically generated training data. In LREC. Citeseer.

[Balachandran et al.2012] Vipin Balachandran, P Deepak, and Deepak Khemani. 2012. Interpretable and reconfigurable clustering of document datasets by deriving word-based rules. Knowledge and Information Systems, 32(3):475–503.

[Bandhakavi et al.2014] Anil Bandhakavi, Nirmalie Wiratunga, P Deepak, and Stewart Massie. 2014. Generating a word-emotion lexicon from #emotional tweets. In *SEM@COLING, pages 12–21.

[Chen and Lee1996] Hsin-Hsi Chen and Jen-Chang Lee. 1996. Identification and classification of proper nouns in chinese texts. In COLING.

[Deepak and Visweswariah2014] P Deepak and Karthik Visweswariah. 2014. Unsupervised solution post identification from discussion forums. In ACL (1), pages 155–164.

[Dempster et al.1977] A. P. Dempster, N. M. Laird, and D. B. Rubin. 1977. Maximum likelihood from incomplete data via the em algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 39(1):1–38.

[Smucker and Allan2006] Mark Smucker and James Allan. 2006. An investigation of dirichlet prior smoothing's performance advantage. Ir.

[Tsvetkov and Dyer2015] Yulia Tsvetkov and Chris Dyer. 2015. Lexicon stratification for translating out-of-vocabulary words. In ACL Short Papers, pages 125–131.

[Vincze et al.2013] Veronika Vincze, János Zsibrita, and István Nagy. 2013. Dependency parsing for identifying hungarian light verb constructions. In IJCNLP, pages 207–215.

[Goldberg and Elhadad2008] Yoav Goldberg and Michael Elhadad. 2008. Identification of transliterated foreign words in hebrew script. In Computational and intelligent text processing, pages 466–477. Springer.

[Jeong et al.1999] Kil Soon Jeong, Sung Hyon Myaeng, Jae Sung Lee, and Key-Sun Choi. 1999. Automatic identification and back-transliteration of foreign words for information retrieval. Inf. Proc. & Mgmt., 35.

[Koo2015] Hahn Koo. 2015. An unsupervised method for identifying loanwords in korean. Language Resources and Evaluation, 49(2):355–373.

[Kummamuru et al.2009] Krishna Kummamuru, Deepak Padmanabhan, Shourya Roy, and L Venkata Subramaniam. 2009. Unsupervised segmentation of conversational transcripts. Statistical Analysis and Data Mining: The ASA Data Science Journal, 2(4):231–245.

[Murthy et al.2012] Karin Murthy, P Deepak, and Prasad M Deshpande. 2012. Improving recall of regular expressions for information extraction. In International Conference on Web Information Systems Engineering, pages 455–467. Springer.

[Natarajan and Charniak2011] Abhiram Natarajan and Eugene Charniak. 2011. $S^3$ - statistical sandhi splitting. In IJCNLP, pages 301–308.

[Ntoulas et al.2001] Alexandros Ntoulas, Sofia Stamou, and Manolis Tzagarakis. 2001. Using a www search engine to evaluate normalization performance for a highly inflectional language. In ACL (Companion Volume), pages 31–36.

[Padmanabhan and Kummamuru2007] Deepak Padmanabhan and Krishna Kummamuru. 2007. Mining conversational text for procedures with applications in contact centers. International Journal on Document Analysis and Recognition, 10(3):227–238.

[Prasad et al.2014] Reshma Prasad, Mary Priya Sebastian, et al. 2014. A technique to extract transliterating phrases in statistical machine translation from english to malayalam. In National Conference on Indian Language Computing.