Data Science in Light of Natural Language Processing: an Overview

Home , Corpus linguistics

Available online at www.sciencedirect.com Available online at www.sciencedirect.com ScienceDirect AvailableScienceDirect online at www.sciencedirect.com Procedia Computer Science 00 (2018) 000–000

Procedia Computer Science 00 (2018) 000–000 www.elsevier.com/locate/procedia ScienceDirect www.elsevier.com/locate/procedia Procedia Computer Science 127 (2018) 82–91

The First International Conference On Intelligent Computing in Data Sciences The First International Conference On Intelligent Computing in Data Sciences Data science in light of natural language processing: An overview Data science in light of natural language processing: An overview Imad Zerouala* and Abdelhak Lakhouajaa Imad Zerouala* and Abdelhak Lakhouajaa aFaculty of Sciences, Mohamed First University, Av Med VI BP 717, Oujda 60000, Morocco aFaculty of Sciences, Mohamed First University, Av Med VI BP 717, Oujda 60000, Morocco

Abstract Abstract The focus of data scientists is essentially divided into three areas: collecting data, analyzing data, and inferring information from data.The focus Each ofone data of thesescientists tasks is requires essentially special divided personnel, into three takes areas: time, collecting and costs data, money. analyzing Yet, the data, next and an dinferring the fastidious informat stepion is fromhow data.to turn Each data one into of products. these tasks Therefore, requires this special field personnel, grabs the attentiontakes time, of andmany costs research money. groups Yet, thein academia next and asthe well fastidious as industry step .is In how the tolast turn decades, data into data products.-driven approachesTherefore, this came field into grabs existence the attention and gained of many more research popularity groups because in academia they requireas well asmuch industry less .human In the lasteffort. decades, Natural data Language-driven Processingapproaches (NLP)came intois strongly existence among and gainedthe fields more influenced popularity by becausedata. The they growth require of datamuch is lessbehind human the performanceeffort. Natural improvement Language Processing of most (NLP) NLP isapplications strongly among such theas fieldsmachine influenced translation by data. and The automatic growth ofspeech data isrecogn behindition. the Consequently,performance improvement many NLP applications of most NLPare frequently applications moving such from as rulemachine-based translationsystems and and knowledge automatic-based speech methods recogn to ition.data- drivenConsequently, approaches. many However, NLP applications collected dataare frequentlythat are based moving on undefined from rule design-based criteria systems or and on technicallyknowledge -unsuitablebased methods forms to wil datal be- udrivenseless. approaches. Also, they However, will be neglectedcollected dataif the that size are isbased not onenough undefined to perform design criteriathe required or on technicallyanalysis and unsuitable to infer formsthe accurate will be information.useless. Also, The they chief will purpose be neglected of this overviewif the size is tois shednot someenough lights to performon the vital the rolerequired of data analysis in various and fields to infer and thegive accurate a better uinformation.nderstanding The of chiefdata inpurpose light ofof NLP.this overview Expressly, is toit sheddescribes some whatlights happen on the vitalto data role during of data its in life various-cycle: fields building, and give processing, a better analyzing,understanding and ofexploring data in phases.light of NLP. Expressly, it describes what happen to data during its life-cycle: building, processing, analyzing, and exploring phases.

© 2018 The Authors. Published by Elsevier B.V. © 2018 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/3.0/). Selection and © 2018 The Authors. Published by Elsevier B.V. Thispeer-review is an open under access responsibility article under of theInternational CC BY-NC Neural-ND licenseNetwork (http://creativecommons.org/licenses/by Society Morocco Regional Chapter. -nc-nd/3.0/). Selection andThis peer is an-review open access under articleresponsibility under the of CCInternational BY-NC-ND Neural license Network (http://creativecommons.org/licenses/by Society Morocco Regional Chapter.-nc -nd/3.0/). Selection andKeywords: peer-review Data science; under Naturalresponsibility language of processing; International Data Neuraldriven approches; Network Corpora;Society MoroccoMachine learning Regional Chapter. Keywords: Data science; Natural language processing; Data driven approches; Corpora; Machine learning

1. Introduction 1. Introduction Recently, a variety of Natural Language Processing (NLP) applications are based on data-driven methods such as neuralRecently, networks a variety and Hidden of Natural Markow Language Models Processing (HMMs) (NLP) [1]. Since applications the progress are based in most on data of -thesedriven methods methods is such driven as neural networks and Hidden Markow Models (HMMs) [1]. Since the progress in most of these methods is driven

* Corresponding author. Tel.: +212618417420. *E -Correspondingmail address: [email protected] author. Tel.: +212618417420. E-mail address: [email protected] 1877-0509© 2018 The Authors. Published by Elsevier B.V. This1877 -is0509© an open 2018 access The Authors. article Publishedunder the by CC Elsevier BY-NC B.V.-ND license (http://creativecommons.org/licenses/by-nc-nd/3.0/). Selection andThis peer is an-review open access under articleresponsibility under the of CCInternational BY-NC-ND Neural license Network (http://creativecommons.org/licenses/by Society Morocco Regional Chapter.-nc -nd/3.0/). Selection and peer-review under responsibility of International Neural Network Society Morocco Regional Chapter.

1877-0509 © 2018 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/3.0/). Selection and peer-review under responsibility of International Neural Network Society Morocco Regional Chapter. 10.1016/j.procs.2018.01.101

10.1016/j.procs.2018.01.101 1877-0509 Available online at www.sciencedirect.com Available online at www.sciencedirect.com ScienceDirect Imad Zeroual et al. / Procedia Computer Science 127 (2018) 82–91 83 ScienceDirect Imad Zeroual / Procedia Computer Science 00 (2018) 000–000 Procedia Computer Science 00 (2018) 000–000

Procedia Computer Science 00 (2018) 000–000 www.elsevier.com/locate/procedia www.elsevier.com/locate/procedia The First International Conference On Intelligent Computing in Data Sciences The First International Conference On Intelligent Computing in Data Sciences Data science in light of natural language processing: An overview Data science in light of natural language processing: An overview Imad Zerouala* and Abdelhak Lakhouajaa Imad Zerouala* and Abdelhak Lakhouajaa a Faculty of Sciences, Mohamed First University, Av Med VI BP 717, Oujda 60000, Morocco aFaculty of Sciences, Mohamed First University, Av Med VI BP 717, Oujda 60000, Morocco dictionaries such as the “Dictio Abstract Scottish Tongue”, the “Middle English Dictionary”, the “Dictionary of Old English”, and the “Oxford English Abstract Dictionary” The focus of data scientists is essentially divided into three areas: collecting data, analyzing data, and inferring information from data.The focus Each ofone data of thesescientists tasks is requires essentially special divided personnel, into three takes areas: time, collecting and costs data, money. analyzing Yet, the data, next and an dinferring the fastidious informat stepion is fromhow todata. turn Each data one into of products. these tasks Therefore, requires this special field personnel, grabs the attentiontakes time, of andmany costs research money. groups Yet, thein academia next and asthe well fastidious as industry step .is In how the Nation lastto turn decades, data into data products.-driven approachesTherefore, this came field into grabs existence the attention and gained of many more research popularity groups because in academia they requireas well asmuch industry less .human In the effort.last decades, Natural data Language-driven Processingapproaches (NLP)came intois strongly existence among and gainedthe fields more influenced popularity by becausedata. The they growth require of datamuch is lessbehind human the may not need to become a part of the learners’ output or the teacher performanceeffort. Natural improvement Language Processing of most (NLP) NLP isapplications strongly among such theas fieldsmachine influenced translation by data. and The automatic growth ofspeech data isrecogn behindition. the performance improvement of most NLP applications such as machine translation and automatic speech recognition. Consequently, many NLP applications are frequently moving from rule-based systems and knowledge-based methods to data- drivenConsequently, approaches. many However, NLP applications collected dataare frequentlythat are based moving on undefined from rule design-based criteria systems or and on technicallyknowledge -unsuitablebased methods forms to wil datal be- udrivenseless. approaches. Also, they However, will be neglectedcollected dataif the that size are isbased not onenough undefined to perform design criteriathe required or on technicallyanalysis and unsuitable to infer formsthe accurate will be information.useless. Also, The they chief will purpose be neglected of this overviewif the size is tois shednot someenough lights to performon the vital the rolerequired of data analysis in various and fields to infer and thegive accurate a better uinformation.nderstanding The of chiefdata inpurpose light ofof NLP.this overview Expressly, is toit sheddescribes some whatlights happen on the vitalto data role during of data its in life various-cycle: fields building, and give processing, a better analyzing,understanding and ofexploring data in phases.light of NLP. Expressly, it describes what happen to data during its life-cycle: building, processing, analyzing, and exploring phases. © 2018 The Authors. Published by Elsevier B.V. This© 2018 is an The open Authors. access Published article under by Elsevierthe CC BY B.V.-NC -ND license (http://creativecommons.org/licenses/by-nc-nd/3.0/). Selection DOI andThis peer is an-review open access under articleresponsibility under the of CCInternational BY-NC-ND Neural license Network (http://creativecommons.org/licenses/by Society Morocco Regional Chapter.-nc -nd/3.0/). Selection andKeywords: peer-review Data science; under Naturalresponsibility language of processing; International Data Neuraldriven approches; Network Corpora;Society MoroccoMachine learning Regional Chapter. Keywords: Data science; Natural language processing; Data driven approches; Corpora; Machine learning 1. Introduction 1. Introduction Recently, a variety of Natural Language Processing (NLP) applications are based on data-driven methods such as 2. Life-cycle of data in NLP neuralRecently, networks a variety and Hidden of Natural Markow Language Models Processing (HMMs) (NLP) [1]. Since applications the progress are based in most on data of -thesedriven methods methods is such driven as neural networks and Hidden Markow Models (HMMs) [1]. Since the progress in most of these methods is driven * Corresponding author. Tel.: +212618417420. E* -Correspondingmail address: [email protected] author. Tel.: +212618417420. E-mail address: [email protected] 1877-0509© 2018 The Authors. Published by Elsevier B.V. Manning This1877 -is0509© an open 2018 access The Authors. article Publishedunder the by CC Elsevier BY-NC B.V.-ND license (http://creativecommons.org/licenses/by-nc-nd/3.0/). Selection andThis peer is an-review open access under articleresponsibility under the of CCInternational BY-NC-ND Neural license Network (http://creativecommons.org/licenses/by Society Morocco Regional Chapter.-nc -nd/3.0/). Selection and peer-review under responsibility of International Neural Network Society Morocco Regional Chapter. 2.1. Data design, sources, and format

a 84 Imad Zeroual et al. / Procedia Computer Science 127 (2018) 82–91 –

ith respect to the phenomena under inestigation f there are no specific criteria, the corpus should be designed for a general use to suit most applications. In the 90’s, primarily with the completion of the ritish ational orpus , basic guidelines in corpus design and compilation hae been set by releant specialists – ater, expanded these guidelines in ten fundamental criteria hich are considered as core principles such as representatieness, balance, and homogeneity ased on these design criteria, corpora builders should identify the data source genres and sie The selection and finding suitable resources is much more complicated or example, as stressed by , the components of general corpora typically are representatie of arious genres, hilst specialied corpora can be limited to highlight only one genre or a family of genres egarding the sie and balance, claims that it is not alays possible to collect data in similar or een sufficient uantities for each text category represented in the corpus this is often the case ith historical corpora oeer, seeral sources can be used to build corpora corpus, in one hand, may consist of a single boo lie the one deeloped by to build an ontology of pulmonary diseases, and the one used for commonsense noledge enhanced embeddings to sole pronoun disambiguation problems On the other hand, corpora are usually deeloped using a number of boos eg, Shamela , or editions of a particular nespaper eg, ecently, corpora builders, in particular indiiduals, use the eb to build a ery large corpora in a short time and ith lo cost oeer, e must be cautious hen draing data from the eb, especially if the aim is to build a balanced corpus in hich the language data hae to be dran from a ide range of sources egarding our surey, e proposed ellnon sources boos, eb, magaines , offering the possibility to select multiple choices or adding ne sources that are not listed The results are presented in Table

Table The used sources to build corpora

Sources of corpora umber of corpora boos boos, Official rints, espapers, Magaines Dictionaries umangenerated espapers ecords ideo subtitles eb eb, boos, Official rints, Old Manuscripts, espapers, Magaines eb, umangenerated eb, espapers, Magaines Total

Text encoding or marup is one of the main tass during data collection Typically, corpora consist of electronic ersions of texts taen from arious sources Therefore, a confusion may arise due to different codes used for marup Since the s, the Standard eneralied Marup anguage SM has become increasingly accepted as a standard ay of encoding electronic texts sing SM is considered as a basis for data preparation it facilitates the portability of corpora, enabling them to be reused in different contexts on different euipment, thus saing the cost of repeated typesetting Since SM can be complex for some corpora builders and users, an Extensible Marup anguage M that as deried from SM contains a limited feature set to mae it simpler to use mong the different M standards, the formation of the Text Encoding nitiatie TEb has to be seen as the first, the most comprehensie, and mature encoding standard for machinereadable texts Since its launch in , the TE has become the reference platform for the representation of machinereadable texts TE aims to encode as

b httpteicorg Imad Zeroual et al. / Procedia Computer Science 127 (2018) 82–91 85 – –

ith respect to the phenomena under inestigation f there are no specific criteria, the corpus should be many possile iews e.., tet components as physical, typoraphical or linistic oects of a tet as possile. he designed for a general use to suit most applications. In the 90’s, primarily with the completion of the ritish I , which was the crrent eition at the time this article was first smitte, was release on oemer , 00. ational orpus , basic guidelines in corpus design and compilation hae been set by releant specialists – ater, expanded these guidelines in ten fundamental criteria hich are considered as core principles such as representatieness, balance, and homogeneity ased on these design criteria, corpora builders should identify the data source genres and sie The selection and he earlier corpora that occrre efore the 90s are calle preelectronic ecase they were not compterie. finding suitable resources is much more complicated or example, as stressed by , the components of he roots of eelopin a corps can e trace ac to 9 when the erman linist assemle a lare general corpora typically are representatie of arious genres, hilst specialied corpora can be limited to highlight corps that contains million wors . ase on this corps, col plish the first nown wor only one genre or a family of genres egarding the sie and balance, claims that it is not alays possible freency list sin wor contin. In oin so, he neee the help of oer fie thosan assistants oer a perio of to collect data in similar or een sufficient uantities for each text category represented in the corpus this is often years to process the corps , 9. oweer, the first corps to e ien that name was the rown orps 0, , the case ith historical corpora oeer, seeral sources can be used to build corpora corpus, in one hand, may eelope an release in 9. It consists of tets amontin to oer one million toens after compilin a sinle consist of a single boo lie the one deeloped by to build an ontology of pulmonary diseases, year of plication 9. and the one used for commonsense noledge enhanced embeddings to sole pronoun disambiguation problems nerstanaly enoh, ilin corpora is timeconsmin an often ifficlt to nertae rearin the sie of On the other hand, corpora are usually deeloped using a number of boos eg, Shamela , or editions of a ata an the aopte compilation methos. hroh this srey, it was possile to ather information aot the particular nespaper eg, ecently, corpora builders, in particular indiiduals, use the eb to build a ery methos se to il 00 corpora in relation to their sie an the aerae time consme rin the compilation large corpora in a short time and ith lo cost oeer, e must be cautious hen draing data from the procere. he prpose of this sty is not to etermine the est way to il a corps. Instea, we wol lie to eb, especially if the aim is to build a balanced corpus in hich the language data hae to be dran from a ide now what releant methos are se an how lon they too to complete the procere. ale ehiits the range of sources otaine reslts. egarding our surey, e proposed ellnon sources boos, eb, magaines , offering the possibility to select multiple choices or adding ne sources that are not listed The results are presented in Table ale . ompilation methos of corpora

Table The used sources to build corpora ompilation methos orpora sie of toens of corpora erae time consme

00 years Sources of corpora umber of corpora years months boos boos, Official rints, espapers, Magaines 0 years months Dictionaries anal 0 years

umangenerated 00 years espapers 00 years ecords n years ideo subtitles eb 00 years

eb, boos, Official rints, Old Manuscripts, espapers, Magaines years eb, umangenerated 0 years months eb, espapers, Magaines 0 years Total emiatomate 0 00 years

Text encoding or marup is one of the main tass during data collection Typically, corpora consist of electronic 00 0 years ersions of texts taen from arious sources Therefore, a confusion may arise due to different codes used for marup Since the s, the Standard eneralied Marup anguage SM has become increasingly accepted as n years a standard ay of encoding electronic texts sing SM is considered as a basis for data preparation it facilitates n 0 years months the portability of corpora, enabling them to be reused in different contexts on different euipment, thus saing the 0 years cost of repeated typesetting Since SM can be complex for some corpora builders and users, an Extensible Marup anguage M that as deried from SM contains a limited feature set to mae it simpler to use 0 years b mong the different M standards, the formation of the Text Encoding nitiatie TE has to be seen as the first, 00 years months the most comprehensie, and mature encoding standard for machinereadable texts Since its launch in , the tomate TE has become the reference platform for the representation of machinereadable texts TE aims to encode as 00 years 0 months n years months

n 9 years b httpteicorg 86 Imad Zeroual et al. / Procedia Computer Science 127– (2018) 82–91

00 years 9 months

months rowsorcin 0 years

n ince 00

amifie approach 00 ays

ale ehiits the smmary of compilation methos mentione in the srey. e can osere that corpora ilers sally rely on three maor methos he first an the olest metho eer, the manal metho. he ineitale reslt after the appearance of compters, the atomatic metho. his latter is not completely accrate, ecase the crawle ata are often plicate on the e an nee to e cleane, filtere, an conerte into the riht format. herefore, a manal eition is sseently performe. his is the semiatomatic metho. oncernin the crowsorcin metho, it is catchin p. he crowsorcin was the reslt of the aancement of iital technoloy an the Internet in the mi000s. It is mainly an online sorcin moel in which iniials or oraniations se contritions from online commnities for prolem solin an resorces proction . ne of the main platforms used in crowdsourcing is Amazon’s Mechanical Turkc e.., , . he last osere metho is the gamified approach or “Gamification”. It is a newcomin metho to , it was first mentione in 00 an start to e se in literatre in 00 e.., . y efinition, amification is the se of ame esin elements in nongame contexts to increase users’ motivations towards given activities . hoh amifie approaches are ase on the same stratey as crowsorcin, in this latter, the participants receie money to increase motiation whereas amification incentiies them y ettin entertaine. It can e srmise that procin a corps with a consierale sie an ariety reires years of efforts rearless of the se metho. In the contet of , the first an the most way to collect meaninfl an hih ality of ata is to hire epert linists to manally il or annotate corpora howeer, it taes time, an costs money. he srey shows that of manally ilt corpora contain less than 0 million wors an it taes more than for years to complete them whereas, of atomatically ilt corpora with a sie that aries from 00 million to illions are complete in less than the same perio. rther, the atomatic methos mare a consierale proress lately, mainly e to the recent aances in eep learnin technoloies e.., eep neral networs, especially when it comes to ealin with ery lare ata an access to information. he “Intellient ersonal ssistants” lie pple iri, oole ssistant, icrosoft ortana, maon cho, etc. are certainly sfficient eamples of the sinificant sccess achiee sin these methos. enerally, the semiatomate methos are a comination of oth manal an atomate methos, the aim is to raise the ality of ata within a feasilereasonale time an cost. In oin so, at first stae, the ata are collecte or annotate atomatically an later eite y eperts. It worth mentionin that the shortest time consme for ilin a corps is achiee y the amifie approach, this metho can e promisin, especially it comines some ey featres of the other methos. or eample, it is fast as well as an atomate metho an proies oo ality reslts yet, it is ase on the same stratey as crowsorcin, t its cost oes not scale with ata sie.

fter the ata compilation, a set of eneral analyses an tet processin are essential for scientists. enerally, corps annotation is a process eote to assin linistic featres to lanae ata 9. he annotation implies to se stanars, or at least, a wellelaorate set of proceres that is wiely accepte an freently se into common corps creation an processin tools. er the last years, sinificant efforts hae een pt into eelopin proceres for atomatic annotation an tet processin. he aim is to reatly rece the hman interention an astly epans the empirical asis. hile some corpora ilers attempt to annotate their collecte ata or processin eistin ones, many others still o not other with processin tass an they st proce the corpora in raw format. rther, one of the main

c httpswww.mtr.com Imad Zeroual et al. / Procedia Computer Science 127 (2018) 82–91 87 – –

00 years 9 months aims eing addressed in is developing new forms of annotation and improving the accurac of automatic annotation. The following tasks are some ke features of corpus annotation the are considered sophisticated as a months rowsorcin part of a asic data processing workflow and their design is more contentious. The can e performed manuall 0 years expert linguists to ensure the ualit of annotation or the ma e done automaticall to allow an extensive and n ince 00 difficult analses to e carried out. To name a few amifie approach 00 ays • emmatization The purpose of lemmatization process is to reduce surface words to their canonical form called lemma this latter is a dictionar lookup form. The lemma relates different word forms that have the same ale ehiits the smmary of compilation methos mentione in the srey. e can osere that corpora meaning . urther using lemmatization is found to e efficient in particular for information retrieval . ilers sally rely on three maor methos he first an the olest metho eer, the manal metho. he • art of speech o tagging It is a asic task in corpus linguistics and an underling condition to support man ineitale reslt after the appearance of compters, the atomatic metho. his latter is not completely accrate, suseuent applications. It aims to assign a morphosntactical features to each word in a sentence ecase the crawle ata are often plicate on the e an nee to e cleane, filtere, an conerte into the according to the context also it allows simple sntactic searches to e performed . riht format. herefore, a manal eition is sseently performe. his is the semiatomatic metho. oncernin • arsing The natural successor to o tagging is parsing. asicall it provides a dependenc tree as an output. the crowsorcin metho, it is catchin p. he crowsorcin was the reslt of the aancement of iital ere the goal is predicting for each sentence or clause an astract representation of the grammatical entities and technoloy an the Internet in the mi000s. It is mainly an online sorcin moel in which iniials or the relations etween them. onseuentl the parser assigns a full laelled sntactic tree or racketing of oraniations se contritions from online commnities for prolem solin an resorces proction . ne of constituents to sentences of the corpus . the main platforms used in crowdsourcing is Amazon’s Mechanical Turkc e.., , . he last osere metho According to the surve Tale presents the different annotation processes performed during corpora uilding. is the gamified approach or “Gamification”. It is a newcomin metho to , it was first mentione in 00 ote that man corpora ma contain parts of oth annotated texts and raw data. an start to e se in literatre in 00 e.., . y efinition, amification is the se of ame esin elements in nongame contexts to increase users’ motivations towards given activities . hoh amifie Tale . Tpes of corpora annotation approaches are ase on the same stratey as crowsorcin, in this latter, the participants receie money to increase motiation whereas amification incentiies them y ettin entertaine. It can e srmise that procin a corps with a consierale sie an ariety reires years of efforts aw texts rearless of the se metho. In the contet of , the first an the most way to collect meaninfl an hih Texts with metadata ality of ata is to hire epert linists to manally il or annotate corpora howeer, it taes time, an costs artiall annotated money. he srey shows that of manally ilt corpora contain less than 0 million wors an it taes more than for years to complete them whereas, of atomatically ilt corpora with a sie that aries from 00 ull annotation million to illions are complete in less than the same perio. rther, the atomatic methos mare a consierale arsed proress lately, mainly e to the recent aances in eep learnin technoloies e.., eep neral networs, emantic annotation especially when it comes to ealin with ery lare ata an access to information. he “Intellient ersonal arts of aw texts and Texts with metadata ssistants” lie pple iri, oole ssistant, icrosoft ortana, maon cho, etc. are certainly sfficient arts of aw texts and ull annotation eamples of the sinificant sccess achiee sin these methos. enerally, the semiatomate methos are a arts of Texts with metadata and artiall annotated comination of oth manal an atomate methos, the aim is to raise the ality of ata within a feasilereasonale time an cost. In oin so, at first stae, the ata are collecte or annotate atomatically an arts of Texts with metadata and ull annotation later eite y eperts. It worth mentionin that the shortest time consme for ilin a corps is achiee y the arts of arsed and artiall annotated texts amifie approach, this metho can e promisin, especially it comines some ey featres of the other methos. arts of arsed and ull o annotated texts or eample, it is fast as well as an atomate metho an proies oo ality reslts yet, it is ase on the same Total stratey as crowsorcin, t its cost oes not scale with ata sie. As seen in Tale of corpora contain texts with metadata and with the same rate we have corpora with full annotation which means the contain at least lemmatized and o tagged data. A small rate corpora has een totall or partiall parsed and onl one corpus has een semanticall annotated. These results are expected fter the ata compilation, a set of eneral analyses an tet processin are essential for scientists. knowing that parsing and sentiment analsis reuire an advanced level knowledge of linguistic theories and rules. enerally, corps annotation is a process eote to assin linistic featres to lanae ata 9. he annotation owever automatic sentiment analsis has een increased in recent ears due to the rapid data growth in social implies to se stanars, or at least, a wellelaorate set of proceres that is wiely accepte an freently se networks especiall twitter. into common corps creation an processin tools. er the last years, sinificant efforts hae een pt into egarding the data analsis the are proal the most important procedures in data lifeccle that could help to eelopin proceres for atomatic annotation an tet processin. he aim is to reatly rece the hman solve issues support theories oserve phenomena extract valuale information and produce useful and reliale interention an astly epans the empirical asis. applications in academic commercial and pulic sectors. ext we briefly state the basis analysis used in : hile some corpora ilers attempt to annotate their collecte ata or processin eistin ones, many others still o not other with processin tass an they st proce the corpora in raw format. rther, one of the main • oncordance It aims to search the text and finds ever occurrence of each word and displas all the occurrences with its immediate context in a corpus. urther oncordances can e produced in several formats ut the most usual form is the e ord in ontext I concordance . The first concordance completed in was c httpswww.mtr.com that of ugo of aro ominican monk and cardinal died . It has een said that monks to have 88 Imad Zeroual et al. / Procedia Computer Science 127 (2018) 82–91 – –

een engaged upon its preparation. As the asis of their work the used the text of the atin ulgate the standard ner ilsn rus linguistis inurgh inurgh niersit ress ile of the Middle Ages in estern urope. inlair reliinar reenatins n rus tlg ailale ttwww l i nr trustrust tl • ord freuenc procedure It aims to produce lists of words and their freuenc in the corpus. Moreover it is ier nra een rus linguistis nestigating language struture an use arige niersit ress feasile to produce word freuenc lists using a o tagged corpus not merel their orthographic status. or inlair rus an tetasi riniles e inguist rra uie rat – example eech et al use the tagged ritish ational orpus to uild the word freuencies list in written and eling t rus linguistis an internatinal han alter e ruter ane harlet aulent uiling an ntlg ulnar iseases with natural language ressing tls using tetual spoken nglish. rra nt e n – • ollocation statistics It is a procedure to calculate statistical information aout the association the strength of iua iang inga hu ei ua nsense nwlege nhane eings r ling rnun collocation and the comparative freuencies of word forms in a corpus or multiple corpora. nlike the previous isaiguatin rles in ingra hea hallenge ri rer ri elin agiw an hian el haela argeale istrial rai rus ri rer analtical procedures collocation reuires a igsized corpus. therwise the analst cannot cope with the ri availale data. aauri ies uli aehe ei runa uiri aghuani rai reean art In addition deeper analsis lead to produce applications that help performing sophisticated tasks effectivel a e as a rus ing en the ngra n ussian uer hl in nratin etrieal – ringer that no individual or using knowledgeased methods can do etter not to mention that the could reuire ears or ar he et ning nitiatie ears auulate wis an its tential r a right uture n anguage decades of efforts. As examples of those applications Information etrieval I summarization and Machine ehnlgies igital uanities Translation MT. appling I we can search and retrieve relevant information reuired among enormous data hrshee lhai siri eeling tewritten rai rus with ultints n reeings the e.g. search engines even if the source documents are written in a language while the user’s ueries are in nternatinal rsh n ultilingual ngers he istr an riniles aular ntrl as it aets in general an nglish in artiular he ist another one . ummarization analsis can turn large numers of documents into shorter versions without i changing meaning of the original texts. This is useful for a uestion answering sstem whereas the summarization enne n intrutin t rus linguistis utlege generates a sort of summar of the answer primaril if the uestion is wh or how tpe uestion e.g. . n the ranis uera reuen analsis nglish usage other hand the enefit from online education is still limited language arriers since the content is usuall unstn rus inguistis istrial eelent nl l inguist raha rwsuring ile nline irar generated in nglish. Machine translation can ridge the language gap providing an initial translation which can la rushwit reating language resures r unerresure languages ethlgies an eerients with e later postedited translators e.g. . rai ang esur al – aernal ureh hih arguent is re nining naling an reiting niningness e arguents using iiretinal n reeings the th nnual eeting the ssiatin r utatinal inguistis 3. Conclusion erual l ah ahuaa aiiatin r rai atural anguage ressing eas int ratie rans ah earn rti ntell In this paper we have riefl presented distinct stages in the lifeccle of data in . The goal of this overview Lund, L., O’Regan, P.: Gamifying Natural Language uisitin uantitatie stu n weish antns while eaining the eets nsensus rien rewars is to help readers to etter understand the data design collection annotation and analsis procedures taking an iaee ee uiling a sentient rus using a gaiie raewr n uani antehnlg nratin experiment from alread pulished corpora of the most prominent proects of resource rich–poor languages. In order ehnlg uniatin an ntrl nirnent an anageent nternatinal nerene n – to make the surve rich in information we targeted wellknown corpora in the literature and those pulicl availale et we covered man languages in purpose to ensure that the surve is alanced as much as possile. etering in hale ae r gae esign eleents t gaeulness eining gaiiatin n reeings the th internatinal aaei inre nerene nisining uture eia enirnents – inall this overview is alwas a suect to make further progress and more detailed descriptions and to a critical arsie eeh ner rus anntatin linguisti inratin r uter tet rra alr ranis understanding of the conceptual technical and practical scope of data in light of . ttia an enaith ellish itinar r rai n letrni leigrah in the st entur thining utsie the aer reeings the ee nerene ter allinn stnia – erual ahuaa rai inratin retrieal teing r leatiatin n ntelligent stes an uter isin – e r References erual ahuaa elahi wars a stanar art eeh tagset r the rai language ing au ni ut n i – . Armstrong . hurch . Isaelle . Manzi . Tzoukermann . arowsk . atural language processing using ver large corpora. sarat eah ler ire arsing rhlgiall rih languages ntrutin t the seial issue ut pringer cience usiness Media . inguist – . u . others Introducing corpusased translation studies. pringer . rile nraning ile nline irar . efever . oste . emeval task rosslingual word sense disamiguation. roc emval. – . eeh asn thers r reuenies in written an sen nglish ase n the ritish atinal rus utlege . i . orascu . la M. Giannakopoulos G. Multidocument multilingual summarization corpus preparation part Araic rt etler trhan earh engines nratin retrieal in ratie isnesle eaing english greek chinese romanian. resented at the . hattahara al arar uer ranslatin r rssanguage nratin etrieal using ultilingual r lusters . ing . ong .. hao .. eal A... chmaltz M. u . ntaxtree aligner A weased parallel tree alignment toolkit. In avelet Analsis and attern ecognition IA International onference on. pp. –. I . aa alath .K., Lasantha, W.A.N., Premachandra, K., Nugaliyadde, A., Mallawarachchi, Y.: “Mahoshadha”, the Sinhala . othman . ingland . adford . Murph T. urran .. earning multilingual named entit recognition from ikipedia. Artif. agge rus ase uestin nswering ste n reeings irst nternatinal nerene n nratin an uniatin Intell. – . ehnlg r ntelligent stes lue – ringer . inclair . Intuition and annotation–the discussion continues. ang. omput. – . ansen lala uan ara sustainale glal slutin r aessiilit were unities lunteers n . allida M. Matthiessen .M. Matthiessen . An introduction to functional grammar. outledge . nternatinal nerene n niersal ess in uanuter nteratin – ringer . Milfull I. Mutual Illumination The ictionar of ld nglish and the ngoing evision of the xford nglish ictionar . lorilegium. – . . ation I... Teaching learning vocaular. oston einle engage earning . . irscherg . Manning .. Advances in natural language processing. cience. – . Appendix: Corpora of the survey . Manning .. artofspeech tagging from to is it time for some linguistics In International onference on Intelligent Text rocessing and omputational inguistics. pp. –. pringer . Corpora language(s) Corpus reference (URL or DOI) . eech G. orpora and theories of linguistic performance. ir. orpus inguist. – . laat rai rus rai httatalgelrainrutinhrutsi . all .. Automated text analsis autionar tales. it. inguist. omput. – . lin reean uth htturletrugnlannrtrees . eech G. million words of nglish the ritish ational orpus . ang. es. – . ara languages httaltrirgresureserus rauis rai nglish an renh httwwwalwerganthlg Imad Zeroual et al. / Procedia Computer Science 127 (2018) 82–91 89 – –

een engaged upon its preparation. As the asis of their work the used the text of the atin ulgate the standard ner ilsn rus linguistis inurgh inurgh niersit ress ile of the Middle Ages in estern urope. inlair reliinar reenatins n rus tlg ailale ttwww l i nr trustrust tl • ord freuenc procedure It aims to produce lists of words and their freuenc in the corpus. Moreover it is ier nra een rus linguistis nestigating language struture an use arige niersit ress feasile to produce word freuenc lists using a o tagged corpus not merel their orthographic status. or inlair rus an tetasi riniles e inguist rra uie rat – example eech et al use the tagged ritish ational orpus to uild the word freuencies list in written and eling t rus linguistis an internatinal han alter e ruter ane harlet aulent uiling an ntlg ulnar iseases with natural language ressing tls using tetual spoken nglish. rra nt e n – • ollocation statistics It is a procedure to calculate statistical information aout the association the strength of iua iang inga hu ei ua nsense nwlege nhane eings r ling rnun collocation and the comparative freuencies of word forms in a corpus or multiple corpora. nlike the previous isaiguatin rles in ingra hea hallenge ri rer ri elin agiw an hian el haela argeale istrial rai rus ri rer analtical procedures collocation reuires a igsized corpus. therwise the analst cannot cope with the ri availale data. aauri ies uli aehe ei runa uiri aghuani rai reean art In addition deeper analsis lead to produce applications that help performing sophisticated tasks effectivel a e as a rus ing en the ngra n ussian uer hl in nratin etrieal – ringer that no individual or using knowledgeased methods can do etter not to mention that the could reuire ears or ar he et ning nitiatie ears auulate wis an its tential r a right uture n anguage decades of efforts. As examples of those applications Information etrieval I summarization and Machine ehnlgies igital uanities Translation MT. appling I we can search and retrieve relevant information reuired among enormous data hrshee lhai siri eeling tewritten rai rus with ultints n reeings the e.g. search engines even if the source documents are written in a language while the user’s ueries are in nternatinal rsh n ultilingual ngers he istr an riniles aular ntrl as it aets in general an nglish in artiular he ist another one . ummarization analsis can turn large numers of documents into shorter versions without i changing meaning of the original texts. This is useful for a uestion answering sstem whereas the summarization enne n intrutin t rus linguistis utlege generates a sort of summar of the answer primaril if the uestion is wh or how tpe uestion e.g. . n the ranis uera reuen analsis nglish usage other hand the enefit from online education is still limited language arriers since the content is usuall unstn rus inguistis istrial eelent nl l inguist raha rwsuring ile nline irar generated in nglish. Machine translation can ridge the language gap providing an initial translation which can la rushwit reating language resures r unerresure languages ethlgies an eerients with e later postedited translators e.g. . rai ang esur al – aernal ureh hih arguent is re nining naling an reiting niningness e arguents using iiretinal n reeings the th nnual eeting the ssiatin r utatinal inguistis 3. Conclusion erual l ah ahuaa aiiatin r rai atural anguage ressing eas int ratie rans ah earn rti ntell In this paper we have riefl presented distinct stages in the lifeccle of data in . The goal of this overview Lund, L., O’Regan, P.: Gamifying Natural Language uisitin uantitatie stu n weish antns while eaining the eets nsensus rien rewars is to help readers to etter understand the data design collection annotation and analsis procedures taking an iaee ee uiling a sentient rus using a gaiie raewr n uani antehnlg nratin experiment from alread pulished corpora of the most prominent proects of resource rich–poor languages. In order ehnlg uniatin an ntrl nirnent an anageent nternatinal nerene n – to make the surve rich in information we targeted wellknown corpora in the literature and those pulicl availale et we covered man languages in purpose to ensure that the surve is alanced as much as possile. etering in hale ae r gae esign eleents t gaeulness eining gaiiatin n reeings the th internatinal aaei inre nerene nisining uture eia enirnents – inall this overview is alwas a suect to make further progress and more detailed descriptions and to a critical arsie eeh ner rus anntatin linguisti inratin r uter tet rra alr ranis understanding of the conceptual technical and practical scope of data in light of . ttia an enaith ellish itinar r rai n letrni leigrah in the st entur thining utsie the aer reeings the ee nerene ter allinn stnia – erual ahuaa rai inratin retrieal teing r leatiatin n ntelligent stes an uter isin – e r References erual ahuaa elahi wars a stanar art eeh tagset r the rai language ing au ni ut n i – . Armstrong . hurch . Isaelle . Manzi . Tzoukermann . arowsk . atural language processing using ver large corpora. sarat eah ler ire arsing rhlgiall rih languages ntrutin t the seial issue ut pringer cience usiness Media . inguist – . u . others Introducing corpusased translation studies. pringer . rile nraning ile nline irar . efever . oste . emeval task rosslingual word sense disamiguation. roc emval. – . eeh asn thers r reuenies in written an sen nglish ase n the ritish atinal rus utlege . i . orascu . la M. Giannakopoulos G. Multidocument multilingual summarization corpus preparation part Araic rt etler trhan earh engines nratin retrieal in ratie isnesle eaing english greek chinese romanian. resented at the . hattahara al arar uer ranslatin r rssanguage nratin etrieal using ultilingual r lusters . ing . ong .. hao .. eal A... chmaltz M. u . ntaxtree aligner A weased parallel tree alignment toolkit. In avelet Analsis and attern ecognition IA International onference on. pp. –. I . aa alath .K., Lasantha, W.A.N., Premachandra, K., Nugaliyadde, A., Mallawarachchi, Y.: “Mahoshadha”, the Sinhala . othman . ingland . adford . Murph T. urran .. earning multilingual named entit recognition from ikipedia. Artif. agge rus ase uestin nswering ste n reeings irst nternatinal nerene n nratin an uniatin Intell. – . ehnlg r ntelligent stes lue – ringer . inclair . Intuition and annotation–the discussion continues. ang. omput. – . ansen lala uan ara sustainale glal slutin r aessiilit were unities lunteers n . allida M. Matthiessen .M. Matthiessen . An introduction to functional grammar. outledge . nternatinal nerene n niersal ess in uanuter nteratin – ringer . Milfull I. Mutual Illumination The ictionar of ld nglish and the ngoing evision of the xford nglish ictionar . lorilegium. – . . ation I... Teaching learning vocaular. oston einle engage earning . . irscherg . Manning .. Advances in natural language processing. cience. – . Appendix: Corpora of the survey . Manning .. artofspeech tagging from to is it time for some linguistics In International onference on Intelligent Text rocessing and omputational inguistics. pp. –. pringer . Corpora language(s) Corpus reference (URL or DOI) . eech G. orpora and theories of linguistic performance. ir. orpus inguist. – . laat rai rus rai httatalgelrainrutinhrutsi . all .. Automated text analsis autionar tales. it. inguist. omput. – . lin reean uth htturletrugnlannrtrees . eech G. million words of nglish the ritish ational orpus . ang. es. – . ara languages httaltrirgresureserus rauis rai nglish an renh httwwwalwerganthlg 90 Imad Zeroual et al. / Procedia Computer Science 127 (2018) 82–91 –

.dfage Araic nglish Parallel News Araic, nglish htts:catalog.ldc.uenn.eduL Araic reean Araic htts:catalog.ldc.uenn.eduL Aranea We orora languages htt:unesco.unia.sguest ARAROMANSAL nglish, rench, talian htt:catalog.elra.inforoductinfo.hroductsid ARR nglish htt:www.roects.alc.manchester.ac.uarcher arenen Araic htts:www.setchengine.co.uartentencorus A rench, nglish htt:rali.iro.umontreal.caralifrA oL taliannglish htt:corora.ficlit.unio.it OL nglish, hinese htts:catalog.ldc.uenn.eduL rown nglish htt:clu.uni.noicamerowncm.html ulreean ulgarian htt:www.ultreean.org L nglish, German, utch htts:catalog.ldc.uenn.eduLL rish, Latin, nglish, rench, Sanish, L htt:www.ucc.iecelt talian, Proenal, utch, anish MPlico Portuguese htt:www.linguateca.tMPulico NL orus Portuguese htt:cintil.ul.t onote nglish htt:www.cs.cornell.eduhomelleedataconote.html ORGA Galician htt:corus.cir.escorga ORS talian htt:corora.ficlit.unio.it ORSOS talian htt:corora.ficlit.unio.it Greenglish, ulgariannglish, htt:dl.acm.orgcitation.cfmid orora for eontent rofessionals Sloenenglish, and Seriannglish OKN orus S RNOS Sanish htts:ianladimir.githu.iositiocorusironia orus del saol Sanish htt:www.corusdelesanol.orghistgen orus del saol We Sanish htt:www.corusdelesanol.orgwedial orus do Portugus Portuguese htt:www.corusdoortugues.orghistgen orus do Portugus We Portuguese htt:www.corusdoortugues.orgwedial orus of Soen Lithuanian Lithuanian htt:donelaitis.du.ltsaytinesalostestynas RAR nglish, rench, Sanish htt:catalog.elra.inforoductinfo.hroductsid RA Sanish corus.rae.escreanet.html roatian National orus roatian . hinese, nglish, Gree, Polish, and aniel corus .... Russian eReKo German htt:www.idsmannheim.delroetedereoi.html deWa German htt:wacy.sslmit.unio.it iaORS talian htt:corora.ficlit.unio.it MA orus languages htt:ous.lingfil.uu.seMA.h nglish Gigaword nglish htts:catalog.ldc.uenn.eduldct uroarl uroean languages htt:www.statmt.orgeuroarl rantet rench htt:www.frantet.fr frWa rench htt:wacy.sslmit.unio.it GeRePa German, and rench htt:catalog.elra.inforoductinfo.hroductsid GNA nglish htt:www.nactem.ac.umetanowledgedownload.h Gloal nglish Monitor orus nglish htt:www.corus.ham.ac.ucclgloal.htm GM Georgtown niersity nglish htts:corling.uis.georgetown.edugum Multilayer corus elsini orus nglish htt:www.helsini.fiariengoRcororaelsiniorus G nglish htt:www.ucl.ac.uenglishusageroectsiceg nternational orus of nglish nglish htt:www.ucl.ac.uenglishusageroectsice.htm nternational orus of Learner htt:www.fltr.ucl.ac.efltrgermetanceclecl languages nglish L Proectscleicle.htm NRS nglish, rench, German htt:arts.righton.ac.ustaffrafsalieintersect itWa talian htt:wacy.sslmit.unio.it KAS Araic .s KAah eendency reean Kaah htts:githu.comniersaleendenciesKaah Kaah Language orus Kaah htt:acorus.lcween Korean National orus Korean htt:www.seong.or.r KorusK anish htt:ordnet.dorusdensetlanguageen KSA Araic htt:sucorus.su.edu.sa Lancaster Parsed orus nglish htt:clu.uni.noicamelanes.html htt:uniersal.elra.inforoductinfo.hcPathro LancasterLeeds reean nglish ductsid LASSY utch htt:odur.let.rug.nlannoordLassy Imad Zeroual et al. / Procedia Computer Science 127 (2018) 82–91 91 – –

.dfage LA Mandarin hinese htt:www.liac.org Araic nglish Parallel News Araic, nglish htts:catalog.ldc.uenn.eduL Malay oncordance Proect Malay htt:mc.anu.edu.au htt:www.cs.cornell.edueoleaomoiereiew Araic reean Araic htts:catalog.ldc.uenn.eduL Moie Reiew ata nglish Aranea We orora languages htt:unesco.unia.sguest data Araic, hinese, ech, nglish, ARAROMANSAL nglish, rench, talian htt:catalog.elra.inforoductinfo.hroductsid MultiLing Multilingual Multi htt:multiling.iit.demoritos.gragesiewtasmms rench, Gree, erew, indi, ARR nglish htt:www.roects.alc.manchester.ac.uarcher ocument Summariation orus multidocumentsummariationdataandinformation Romanian, Sanish arenen Araic htts:www.setchengine.co.uartentencorus nglish, rench, Sanish, Araic, A rench, nglish htt:rali.iro.umontreal.caralifrA MultiN htt:www.euromatrilus.netmultiun Russian, hinese, German oL taliannglish htt:corora.ficlit.unio.it OL nglish, hinese htts:catalog.ldc.uenn.eduL Negra German htt:www.coli.unisaarland.deroectssfnegracorus rown nglish htt:clu.uni.noicamerowncm.html NMLAR orus Araic htt:www.rdieg.comProectsnemlar.htm ulreean ulgarian htt:www.ultreean.org OntoNotes nglish, hinese, Araic htts:catalog.ldc.uenn.eduL L nglish, German, utch htts:catalog.ldc.uenn.eduLL rish, Latin, nglish, rench, Sanish, PANAA nglish, rench, Gree htt:catalog.elra.inforoductinfo.hroductsid L htt:www.ucc.iecelt talian, Proenal, utch, anish Parallela nglishPolish htt:aralela.clarinl.eu MPlico Portuguese htt:www.linguateca.tMPulico Par nglish, rench, talian htts:githu.commsangartutreo NL orus Portuguese htt:cintil.ul.t Penn reean nglish htts:catalog.ldc.uenn.eduldct onote nglish htt:www.cs.cornell.eduhomelleedataconote.html Polarity nglish .NM.. ORGA Galician htt:corus.cir.escorga PPARL orus Portuguese htt:catalog.elra.inforoductinfo.hroductsid ORS talian htt:corora.ficlit.unio.it ORSOS talian htt:corora.ficlit.unio.it Russian collection Russian htt:corus.leeds.ac.uruscorora.html Greenglish, ulgariannglish, htt:dl.acm.orgcitation.cfmid Russian National orus Russian htt:www.ruscorora.ru orora for eontent rofessionals Sloenenglish, and Seriannglish OKN North Saami, South Saami, Aanaar SKOR htt:gtwe.uit.noorclangen orus S RNOS Sanish htts:ianladimir.githu.iositiocorusironia Saami, Lule Saami, Solt Saami orus del saol Sanish htt:www.corusdelesanol.orghistgen orus del saol We Sanish htt:www.corusdelesanol.orgwedial Sinica hinese htt:ci.iis.sinica.edu.twKPengersioncorus.htm orus do Portugus Portuguese htt:www.corusdoortugues.orghistgen Sloa National orus Sloa htt:orus.uls.saa.s orus do Portugus We Portuguese htt:www.corusdoortugues.orgwedial Syntactic ataase for modern orus of Soen Lithuanian Lithuanian htt:donelaitis.du.ltsaytinesalostestynas Sanish htt:www.ds.usc.es Sanish S RAR nglish, rench, Sanish htt:catalog.elra.inforoductinfo.hroductsid RA Sanish corus.rae.escreanet.html he American National orus nglish htt:www.anc.org roatian National orus roatian . hinese, nglish, Gree, Polish, and he an of nglish nglish htt:www.lingsoft.fidocengcganofnglish.html aniel corus .... Russian eReKo German htt:www.idsmannheim.delroetedereoi.html he ritish National orus nglish htt:www.natcor.o.ac.u deWa German htt:wacy.sslmit.unio.it he ech National orus ech htts:www.orus.c iaORS talian htt:corora.ficlit.unio.it MA orus languages htt:ous.lingfil.uu.seMA.h he ellenic National orus Gree htt:hnc.ils.grfind.as nglish Gigaword nglish htts:catalog.ldc.uenn.eduldct uroarl uroean languages htt:www.statmt.orgeuroarl he hungarian gigaword corus ungarian htt:corus.nytud.humnsindeeng.html rantet rench htt:www.frantet.fr frWa rench htt:wacy.sslmit.unio.it he nternational orus of Araic Araic htt:www.iale.orgica GeRePa German, and rench htt:catalog.elra.inforoductinfo.hroductsid he Lancaster orus of Mandarin htt:www.lancaster.ac.ufassroectscorusLMdefaul hinese GNA nglish htt:www.nactem.ac.umetanowledgedownload.h hinese t.htm Gloal nglish Monitor orus nglish htt:www.corus.ham.ac.ucclgloal.htm GM Georgtown niersity he National orus of Polish Polish htt:n.l nglish htts:corling.uis.georgetown.edugum Multilayer corus he New Yor imes Annotated nglish htts:catalog.ldc.uenn.eduL elsini orus nglish htt:www.helsini.fiariengoRcororaelsiniorus orus G nglish htt:www.ucl.ac.uenglishusageroectsiceg htt:www.ims.uni GR orus German nternational orus of nglish nglish htt:www.ucl.ac.uenglishusageroectsice.htm stuttgart.deforschungressourcenororatiger.html nternational orus of Learner htt:www.fltr.ucl.ac.efltrgermetanceclecl languages nglish L Proectscleicle.htm PSR omlete nglish htts:catalog.ldc.uenn.eduLA NRS nglish, rench, German htt:arts.righton.ac.ustaffrafsalieintersect htt:www.sfs.uni a treean German itWa talian htt:wacy.sslmit.unio.it tueingen.deenasclresourcescororatuead.html KAS Araic .s urish National orus urish htt:www.tnc.org.tr KAah eendency reean Kaah htts:githu.comniersaleendenciesKaah htts:www.u.tudarmstadt.dedataargumentation Kaah Language orus Kaah htt:acorus.lcween KPonArg nglish Korean National orus Korean htt:www.seong.or.r mininguconargcorus KorusK anish htt:ordnet.dorusdensetlanguageen uWa nglish htt:wacy.sslmit.unio.it KSA Araic htt:sucorus.su.edu.sa Lancaster Parsed orus nglish htt:clu.uni.noicamelanes.html Wiiedia: ataase More than languages htts:en.wiiedia.orgwiiWiiedia:ataasedownload htt:uniersal.elra.inforoductinfo.hcPathro W languages htts:wit.f.eu LancasterLeeds reean nglish ductsid WOA nglish htt:worsho.colis.orgwochat LASSY utch htt:odur.let.rug.nlannoordLassy