N-gram based Language Identification of Individual Words

Oluwapelumi Giwa and Marelie H. Davel
Multilingual Speech Technologies, North-West University, Vanderbijlpark, South Africa
[email protected], [email protected]

Abstract—Various factors influence the accuracy with which the language of individual words can be classified using n-grams. We consider a South African text-based language identification (LID) task and experiment with two different types of n-gram classifiers: a Naïve Bayes classifier and a Support Vector Machine. Specifically, we investigate various factors that influence LID accuracy when identifying generic words (as opposed to running text) in four languages. These include: the importance of n-gram smoothing (Katz smoothing, absolute discounting and Witten-Bell smoothing) when training Naïve Bayes classifiers; the effect of training corpus size on classification accuracy; and the relationship between word length, n-gram length and classification accuracy. For the best variant of each of the two sets of algorithms, we achieve relatively comparable classification accuracies. The accuracy of the Support Vector Machine (88.16%, obtained with a radial basis function kernel) is higher than that of the Naïve Bayes classifier (87.62%, obtained using Witten-Bell smoothing), but the latter result is associated with a significantly lower computational cost.

Index Terms: text-based language identification, smoothing, character n-grams, Naïve Bayes classifier, support vector machine.

I. INTRODUCTION

Code switching is the act of mixing words from different languages within a single sentence. In running text, words occurring side-by-side may originally have come from different languages; that is, in a code-switched sentence or phrase, it is typical to find individual words from one language (referred to as the embedded language) embedded within a larger sentence of a different language (referred to as the matrix language) [1]. In regions with multiple languages, it is common amongst younger generations to use mixed language within text conversations [2]. A specific scenario is the use of numerical digits, where people often prefer using numbers from a language other than the matrix language.

In speech and text processing systems, identifying code-switched words is important for applications such as machine translation, speech synthesis, information extraction and pronunciation prediction. For example, Bhargava and Kondrak [3] showed that text-based language identification (LID) can be used to improve the accuracy of grapheme-to-phoneme conversion, and both Font Llitjos and Black [4] and Church [5] demonstrated that being able to identify the language of origin of a name (from its orthography) is important in being able to predict the possible pronunciations of that name. In related work, Basson and Davel [6] showed that the ability to identify code-switched words from orthography alone can be a useful step in building optimised grapheme-based automatic speech recognition (ASR) systems.

We approach the LID task using n-grams. N-gram based methods are widely used ([7], [8], [9]) and are highly suitable for LID with small training sets and short test samples [9]. Botha and Barnard [10] applied Naïve Bayes (NB) classification and Support Vector Machines (SVMs) using n-grams to 11 South African languages, using segments of 15 to 300 characters each. While smoothing was not investigated in [10], Vatanen et al. [9] demonstrated the importance of smoothing when using n-gram models to classify short text segments. The current study extends previous contributions by comparing the two classification techniques mentioned above (SVMs and NB classifiers) for the classification of individual words rather than general text segments, and by specifically analysing the effect of smoothing. In this context, the relationship between word length, n-gram length and classification accuracy is also investigated.

The paper is structured as follows: Section II provides an overview of different LID techniques. Section III describes the methods used during experimentation in more detail. Section IV provides an overview of the experimental approach followed and describes the data set used. Section V presents the various experiments and results. A summary of the main findings concludes the paper in Section VI.

II. BACKGROUND

In principle, the language origin of an input text string can be estimated by creating a model per language (from training data) and selecting the best-fitting model to predict the language source of the input string. Statistical text-based LID techniques include: Naïve Bayes classification [10], ranking methods [7], Markov models [8], [11], support vector machines [12], decision trees [13] and k-nearest neighbour classification [14]. Many LID experiments have adopted character n-gram models; such techniques have demonstrated good performance over a variety of applications.

To the best of our knowledge, limited previous work has focused on identifying the language of generic words in isolation; more results are available with regard to LID of running text. (Exceptions are [15], [13] and [3], discussed below.)

When classifying longer text segments, accuracy quickly approaches 100% given enough text. For example, Cavnar et al. [7] used rank differences to measure the distance between the most frequent n-grams of each language model and those of the test document, as sketched below. They extracted their evaluation set from Usenet newsgroup articles written in 14 different languages, and achieved an accuracy of 99.8% on text of 300 characters or more, while retaining the 400 most common n-grams up to length 5. In related work, Kruengkrai et al. [12] showed a similar result when classifying 17 languages with an average length of 50 bytes, while ignoring the character-encoding system during processing (that is, irrespective of the number of characters, 50 bytes of data were used). They achieved an accuracy of 99.7% with an SVM classifier.
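
The rank-difference ("out-of-place") measure lends itself to a compact sketch. The following is a minimal illustration of the idea rather than a reproduction of [7]: a profile ranks the most frequent character n-grams per language, and a test document is assigned to the language with the smallest summed rank difference. The profile size of 400 and maximum n-gram length of 5 follow the settings reported above; the penalty value for unseen n-grams and all function names are our assumptions.

```python
from collections import Counter

def ngram_profile(text, max_n=5, top_k=400):
    """Rank the top_k most frequent character n-grams (n = 1..max_n)."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    # Rank = position in the frequency-sorted list (0 = most frequent).
    return {gram: rank for rank, (gram, _) in enumerate(counts.most_common(top_k))}

def out_of_place(test_profile, lang_profile, max_penalty=400):
    """Sum of rank differences; n-grams absent from the language
    profile receive the maximum penalty."""
    return sum(abs(rank - lang_profile[gram]) if gram in lang_profile
               else max_penalty
               for gram, rank in test_profile.items())

# Classification picks the language whose profile is closest:
# predicted = min(lang_profiles, key=lambda lang:
#                 out_of_place(ngram_profile(test_text), lang_profiles[lang]))
```

Each language profile is built from that language's training text, so the comparison reduces to a distance between two ranked lists.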
Classification of a short textual fragment is more complex due to the lack of contextual information. Short text segments include proper names, generic words in isolation and very short sentences (less than approximately 15 characters). Vatanen et al. [9] used the Cavnar ranking method and SVMs to identify short text segments. They experimented with 281 languages using a fairly small training set, and for test samples in the range of 5-21 characters, they obtained an accuracy of less than 90%. The smoothing techniques they compared included Katz smoothing, absolute discounting and modified Kneser-Ney discounting; they observed the best result with absolute discounting. Similarly, Bhargava and Kondrak [3] used SVMs to classify proper names, training on a small data set of 900 names and testing on 100 names. They obtained their best identification rate of 84% using a support vector machine with a radial basis function (RBF) kernel.

Not all methods can be applied to words in isolation: linguistic models (such as the stop words used by Johnson [16] or the closed grammatical classes used by Lins and Gonçalves [17]) are not applicable to this task. One technique that is not n-gram based but is worth mentioning is the use of a data compression model for LID, as introduced by Hategan et al. [15]. They evaluated the performance of the algorithm on individual names and isolated words from 6 European languages, and reported an accuracy of above 80% on two-best results.

N-gram based methods have been compared directly to other LID approaches in a number of studies. Hakkinen and Tian [13] compared decision tree and n-gram methods. They concluded that the n-gram based method performs better on longer text samples, while decision trees do better on short words such as proper names. They also emphasised that the decision tree method does well at learning lexical structure.

Apart from the sample text length, the accuracies of these approaches depend on various other factors. Botha and Barnard [10] discussed different factors that could influence text-based LID accuracy. These factors included: the size of the training data, input text size, n-gram size, the LID techniques employed and language similarities.

III. METHODS

In this section, we describe the n-gram based classifiers and smoothing techniques employed in further experiments.

A. LID using a Naïve Bayes classifier

A Naïve Bayes classifier uses the concept of Bayes' theorem [19]. This classifier assigns the most likely class to an input string, based on the highest a posteriori probability given the input string. For text-based LID (T-LID), a Naïve Bayes classifier can be constructed using n-grams as features. Let $T$ be a set of training samples, and let each sample be represented by an $n$-dimensional feature vector $X = (x_1, x_2, \ldots, x_n)$ together with its class label. Let there be $m$ classes: $K_1, K_2, \ldots, K_m$. A sample $X$ is predicted to belong to class $K_i$ if and only if

\[ P(K_i \mid X) > P(K_j \mid X), \quad \text{for } 1 \leq j \leq m,\ j \neq i \qquad (1) \]

where $P(K_i \mid X)$ is the probability of class $K_i$ given the sample. Bayes' theorem states that

\[ P(K_i \mid X) = \frac{P(X \mid K_i)\,P(K_i)}{P(X)} \qquad (2) \]

where $P(X \mid K_i)$ represents the likelihood of a sample $X$ belonging to class $K_i$, and $P(X)$ does not influence model comparison. The class a priori probability, $P(K_i)$, represents the relative frequency of the class in the sample set. Under the Naïve Bayes assumption of statistical independence between features, the class $K_i$ is selected such that $P(K_i) \prod_j P(x_j \mid K_i)$ is optimised, where $P(x_j \mid K_i)$ is then the likelihood of a specific n-gram being observed in a given language, and the word being classified consists of $j$ n-grams.
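
To make the decision rule concrete, the sketch below implements Equations (1) and (2) in log space with character n-grams as features. It is a minimal illustration, not the authors' implementation: the '#' boundary markers, the trigram default, and the add-one smoothing (a simple stand-in for the Katz, absolute-discounting and Witten-Bell variants studied in the paper) are all our assumptions.

```python
import math
from collections import Counter, defaultdict

def char_ngrams(word, n=3):
    """Decompose a word into overlapping character n-grams,
    padded with '#' to mark word boundaries."""
    padded = "#" * (n - 1) + word + "#" * (n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

class NaiveBayesLID:
    def __init__(self, n=3):
        self.n = n
        self.ngrams = defaultdict(Counter)  # language -> n-gram counts
        self.words = Counter()              # language -> training-word count

    def train(self, samples):
        """samples: iterable of (word, language) pairs."""
        for word, lang in samples:
            self.words[lang] += 1
            self.ngrams[lang].update(char_ngrams(word, self.n))

    def log_posterior(self, word, lang):
        """log P(K_i) + sum_j log P(x_j | K_i), cf. Equations (1)-(2);
        P(X) is omitted since it does not influence model comparison."""
        counts = self.ngrams[lang]
        total = sum(counts.values())
        vocab = len(counts) + 1  # crude vocabulary estimate for smoothing
        score = math.log(self.words[lang] / sum(self.words.values()))
        for gram in char_ngrams(word, self.n):
            # Add-one smoothing keeps unseen n-grams from zeroing the product.
            score += math.log((counts[gram] + 1) / (total + vocab))
        return score

    def classify(self, word):
        return max(self.ngrams, key=lambda lang: self.log_posterior(word, lang))
```

Only the smoothing step would change under the Katz, absolute-discounting or Witten-Bell variants; per the abstract, the paper's best NB result (87.62%) was obtained with Witten-Bell smoothing.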
B. Support Vector Machines

Support vector machines estimate a linear hyper-plane that separates the samples of two classes while maximising the distance from the hyper-plane to the closest class samples.
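
The paper's best overall result (88.16%) was obtained with an SVM using an RBF kernel. The sketch below shows one plausible realisation of such a setup using scikit-learn; the library choice, the 1-3 character n-gram feature range and the toy words and labels are our assumptions, not the authors' configuration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Character n-grams (here lengths 1-3) as count features. SVC is a binary
# classifier at its core; scikit-learn combines pairwise (one-vs-one)
# hyper-planes to handle the multi-class LID problem.
clf = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(1, 3)),
    SVC(kernel="rbf", gamma="scale"),
)

# Toy training data: (word, language) pairs; labels are illustrative only.
words = ["water", "huis", "molo", "dumela"]
labels = ["eng", "afr", "xho", "tsn"]
clf.fit(words, labels)
print(clf.predict(["weer"]))  # predicts the closest-matching language
```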
