A Rule Based Approach for Lemmatisation of Punjabi Text Documents


A Rule Based Approach for Lemmatisation of Punjabi Text Documents

Dr. Rajeev Puri
Assistant Professor, Dept. of Comp. Sc., DAV College, Jalandhar
[email protected]

Keywords: Punjabi lemmatiser, stemming, information retrieval

ABSTRACT

Information retrieval is a non-trivial task, especially when the information is available in highly inflectional languages such as Hindi and Punjabi. The huge corpus needs to be reduced significantly in order to speed up the information retrieval process and to generate accurate results. The pre-processing phase of the information retrieval process deals with reducing the size of the corpus. Many pre-processing steps, such as stop word removal, stemming and lemmatisation, need to be performed to improve the accuracy of the information retrieval task. Whereas the stemming process converts the inflectional forms of a word by stripping a prefix or postfix, lemmatisation is the process of generating the normalised dictionary word that has the same or a related meaning as the given word. Lemmatisation provides more accurate results than a stemmer. However, Punjabi being an inflectional language makes the task of lemmatisation more complex: behind each Punjabi word there are many inflectional forms as well as synonyms which need to be dealt with. The Punjabi lemmatiser presented in this paper is an attempt to obtain the lemma of a word by using a rule-based approach.

1. Introduction

Lemmatisation is the process of grouping together the different inflected forms of a word so that they can be analysed as a single item. It is used for finding the normalised form of a word, also called its dictionary form, which is known as the lemma. A lexeme is the group of all forms (the headwords of a dictionary) that share the same meaning, and the lemma is the word chosen as the base form that represents the lexeme.

Research Cell : An International Journal of Engineering Sciences, Special Issue March 2018, Vol. 27, UGC Approved Journal (S.No. 63019), ISSN: 2229-6913 (Print), ISSN: 2320-0332 (Online), Web Presence: http://www.ijoes.vidyapublications.com. © 2018 Vidya Publications. Authors are responsible for any plagiarism issues.
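As a minimal illustration of this grouping of inflected forms under a lemma (sketched in Python with an invented English toy lexicon, since a real system would use a full dictionary of the target language):

```python
# Toy illustration: a lemmatiser maps every inflected form of a word
# to one normalised dictionary form (the lemma). The lexicon below is
# invented for demonstration only.
LEXICON = {
    "go": ["go", "goes", "going", "went", "gone"],
    "good": ["good", "better", "best"],
}

# Invert the lexicon: inflected form -> lemma.
FORM_TO_LEMMA = {
    form: lemma for lemma, forms in LEXICON.items() for form in forms
}

def lemmatise(word: str) -> str:
    """Return the lemma of a word, or the word itself if it is unknown."""
    return FORM_TO_LEMMA.get(word, word)

print(lemmatise("went"))    # -> go
print(lemmatise("better"))  # -> good
```

Here all five forms of "go" are analysed as a single item, which is exactly the reduction that shrinks index files in an information retrieval system.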
Because of the inflectional nature of Punjabi, the task of lemmatisation is more complex than stemming. Both stemming and lemmatisation are important pre-processing steps in the information retrieval task; they are also used in many text-mining applications and play an important role in Natural Language Processing and other linguistic fields. They help search engines find keywords and help reduce the size of index files. Building a lemmatiser requires a proper understanding of the language and a dictionary of that particular language. Indian languages are highly inflectional in nature, and Punjabi is no exception. Various inflectional forms of Punjabi words are shown in Table 1 below.

Table 1: Sample inflectional forms of Punjabi text

Punjabi Word | Inflectional forms
ਮuਡਾ (Boy) | ਮuਡ&, ਮuਿਡਆ, ਮuਿਡਆਂ, ਮuਿਡਓ
ਕਰ (Do) | ਕਰਿਦਆਂ, ਕਰਦੀਆਂ, ਕਰ/, ਕਰ0, ਕਰਦਾ, ਕਰ1, ਕਰਦ0
ਿਠਹਰ (Stop) | ਠਿਹਰਾ, ਠਿਹਰ&, ਠਿਹਿਰਆ, ਠਿਹਰਾਅ
ਰ#5 (Color) | ਰ#5ਣਾ, ਰਗਣ, ਰਗੀਨ, ਰ#5ੀਨੀ, ਰ#5ਾਉਣਾ, ਰ#5ਵਾਉਣਾ, ਰ#5ਾਈ, ਰ#5ਵਾਈ, ਰ#5ਲਾ, ਰ#5ੀਲਾ

2. Literature Survey

Taghva et al. [6] developed a stemmer for the Farsi language. Their system uses a deterministic finite automaton to implement the stemmer, and the minimum length of the stem is limited to three characters. Sarkar et al. [10] proposed a rule-based stemmer for Bengali, which is also a resource-poor language; POS-specific rules are applied to find the stem of a word. Kraaij et al. [4] developed a suffix stripper for Dutch using Porter's algorithm, extending the original algorithm to handle prefixes and infixes. Lovins [1] proposed a stemming algorithm for English, developed for use in library information transfer systems. Bijal et al. [18] discussed different stemming algorithms for Indian and non-Indian languages and presented a comparative study of them. Ganguly et al. [16] developed two rule-based stemmers, for Bengali and Hindi. Gupta et al. [11] developed a Punjabi-language stemmer for nouns and proper names: first the stem of the Punjabi word is generated, and then the stem is checked against a Punjabi noun and proper-name dictionary. Hafer et al. [2] developed word segmentation by the letter-successor-varieties method, using statistical properties of the corpus to decide whether a word should be divided; the approach is based on Z. S. Harris's process. Majumder et al. [9] developed YASS (Yet Another Suffix Stripper), a stemmer based on a clustering approach, where a cluster is a class consisting of a root word and its related word forms. Puri et al. [19] developed a Punjabi stemmer using the Punjabi WordNet database, in which a suffix-removal approach with suffix-stripping rules is used to find the stem of a Punjabi word. El-Shishtawy et al. [13] developed an accurate Arabic root-based lemmatiser for information retrieval purposes; it is the first lemmatiser for the Arabic language based on a non-statistical approach. Mishra et al. [14] developed MAULIK, an effective stemmer for Hindi based on a hybrid approach combining brute force and suffix removal; this stemmer targets the Devanagari script.

3. Research Gap

Punjabi is an Indo-Aryan language spoken by more than 100 million native speakers worldwide, making it the 10th most widely spoken language in the world. The popularity and online presence of the Punjabi language are increasing day by day, as more and more Punjabi corpora are generated and published over the internet. A system is therefore required for efficient searching of Punjabi documents over the web, and that searching should be possible in the native speaker's own language.
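Many of the surveyed stemmers share one core mechanism: strip the longest matching suffix, subject to a minimum stem length. A sketch of that mechanism in Python (the suffix list and length threshold below are invented for illustration and are not taken from any of the cited systems):

```python
# Longest-match suffix stripping with a minimum stem length: the core
# mechanism behind many rule-based stemmers surveyed above. The English
# suffix list below is an invented example, not from any cited system.
SUFFIXES = sorted(["ing", "edly", "ed", "ly", "s"], key=len, reverse=True)
MIN_STEM_LEN = 3  # do not strip if the remaining stem would be too short

def stem(word: str) -> str:
    for suffix in SUFFIXES:  # longest suffixes are tried first
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
            return word[: -len(suffix)]
    return word

print(stem("jumpedly"))  # -> jump ("edly" is the longest matching suffix)
print(stem("sing"))      # -> sing (the length guard prevents over-stripping)
```

The minimum-stem-length guard is what keeps such a stemmer from reducing short words to meaningless fragments, which is why systems such as the Farsi stemmer above impose it explicitly.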
For a faster and more efficient information retrieval system in Punjabi, some pre-processing is required before indexing the documents. This pre-processing includes stop word removal, stemming and a sophisticated lemmatiser for reducing words to their base forms, called headwords. Various lemmatisation and stemming systems are available for different languages, but very little work has been done for Punjabi because of its inflectional nature: for every word in Punjabi there are a number of inflectional forms. An attempt has therefore been made here to develop a lemmatiser for Punjabi that generates an accurate lemma for most Punjabi words.

4. Proposed System

The proposed system uses a rule-based approach to find the normalised form of a word, which helps improve the results of the information retrieval task. The system works with a synonym-replacement module that uses pre-defined rules to obtain the lemma of the word. For example:

Word: ਿਬਮਾਰ
Synonym words: ਰ05ੀ, ਮਰੀਜ, ਮਰੀਜ਼, ਰ05ਗ?@ਤ

The word ਿਬਮਾਰ, with length 15, is the input word for the system. The list of synonyms is collected from the database, and the system picks the shortest word from the list, ਰ05ੀ, with length 12. After that, the rules from the rule list are applied to the synonym word to obtain the accurate lemma as output. Here, the 'B' => "" suffix-stripping rule is applied to the word ਰ05ੀ (12) and it becomes ਰ05 (9); this is the normalised word, a valid word available in the dictionary, and it is displayed as the output. The proposed system uses a number of databases and lists to accomplish this task; they are described in the following sections.

5. Synonyms Database

Synonyms are words that have the same meaning. A synonym database is prepared with an augmented field containing the length of each word for quick comparison. The database is indexed on the word and sorted by word length in ascending order for quick retrieval. For example, the word "ਮੂਰਖ" (Fool) has 26 distinct synonyms: "ਜਾਹਲ", "ਉਜੱਡ", "ਜੜ-", "ਗਵਾਰ", "ਭੁਚੱ", "ਘੋਗਾ", "ਬੇਸ਼ਮਝ", "ਮੂੜ-", "ਅਬੁੱਧ", "ਨਾਸਮਝ", "ਘੁੱਗੂ", "ਘੁੱਘੂ", "ਝੁਡੱੂ", "ਬੇਵਕੂਫ", "ਅਿਗਆਨੀ", "ਮੱਤਹੀਣ", "ਲੱਲ-ਾ", "ਅੰਞਾਣਾ", "ਜੜ_ਮੱਤ", "ਮੂੜਮ-ੱਤ", "ਬੁੱਧੀਹੀਣ", "ਿਨਰਬੁੱਧੀ", "ਲਘੂ_ਬੁੱਧੀ", "ਅਲਪ_ਬੁੱਧੀ", "ਘੱਟ_ਬੁੱਧੀ". Out of all these distinct synonyms, "ਜਾਹਲ" has the minimum length of 12 characters and hence is chosen as the replacement for all its peer words. The synonym-replacement module has an inherent problem of incorrectly replacing a named entity with its synonym; for example, the name "C+ਾਸ਼

A list of words sharing common suffixes is prepared, and the frequency of each valid suffix is calculated:

Suffix | Example words
ਕਾਰੀ | ਕਾਰਜਕਾਰੀ, ਿਹਤਕਾਰੀ, ਲਾਭਕਾਰੀ
ਣਗ& | ਚਲਣਗ&, ਰuਕਣਗ&, 1ਖਣਗ&
L(ਂ | ਹ#=/(ਂ, ਆਗ/(ਂ
ਗਾ | ਆਵ&5ਾ, ਜਾਵ&5ਾ, ਕਰ&5ਾ, ਭਰ&ਗਾ
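The pipeline of Sections 4 and 5 can be sketched as follows (Python; the synonym database, rule table and dictionary below are invented English stand-ins, since the paper's Punjabi resources are not reproducible here):

```python
# Sketch of the proposed lemmatiser: (1) fetch the word's synonyms,
# (2) pick the shortest candidate, mimicking the length-sorted synonym
# database, (3) apply suffix-stripping rules, and (4) accept a result
# only if it is a valid dictionary word. All data is an invented stand-in.
SYNONYMS = {"ill": ["sickly", "unwell", "diseased"]}
RULES = [("ly", ""), ("ed", "")]          # suffix -> replacement
DICTIONARY = {"sick", "ill", "unwell"}

def lemmatise(word: str) -> str:
    # Steps 1-2: the word itself is also a candidate for replacement.
    candidates = SYNONYMS.get(word, []) + [word]
    shortest = min(candidates, key=len)
    # Steps 3-4: apply each rule; keep the first dictionary-valid result.
    for suffix, repl in RULES:
        if shortest.endswith(suffix):
            stripped = shortest[: len(shortest) - len(suffix)] + repl
            if stripped in DICTIONARY:
                return stripped
    return shortest if shortest in DICTIONARY else word

print(lemmatise("sickly"))  # -> sick (rule "ly" => "" yields a valid word)
print(lemmatise("ill"))     # -> ill (already the shortest, already valid)
```

The dictionary check in step 4 mirrors the ਰ05ੀ => ਰ05 example above, where the stripped form is accepted only because it is a valid word in the dictionary; it is also where a safeguard against replacing named entities would have to be added.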