Tour de CLARIN

CLARIN: Common Language Resources and Technology Infrastructure

Written by Maria Gavriilidou, Katerina T. Frantzi, Darja Fišer and Jakob Lenardič, and edited by Darja Fišer and Jakob Lenardič

Foreword

Tour de CLARIN highlights prominent user involvement activities of CLARIN national consortia with the aim to increase the visibility of CLARIN consortia, reveal the richness of the CLARIN landscape, and display the full range of activities throughout the CLARIN network that can inform and inspire other consortia as well as show what CLARIN has to offer to researchers, teachers, students, professionals and the general public interested in using and processing language data in various forms.

This brochure presents Greece and is organized in five sections:

• Section One presents the members of the consortium and their work
• Section Two demonstrates an outstanding tool
• Section Three highlights a prominent resource
• Section Four reports on a successful event for researchers and students
• Section Five includes an interview with a renowned researcher from the digital humanities or social sciences who has successfully used the consortium’s infrastructure in their research

Perimetriki Patron, Achaia, Greece | photo by Jason Blackeye | Unsplash

GREECE | INTRODUCTION

Greece

Written by Darja Fišer, Maria Gavriilidou and Jakob Lenardič

The Greek network clarin:el (http://www.clarin.gr/en) has been a member of CLARIN ERIC since February 2015. It was founded by three Greek research institutions:

• the Athena Research and Innovation Centre;
• the National Centre for Scientific Research Demokritos; and
• the Greek Research and Technology Network (GRNET S.A.).

It has since expanded to a nation-wide network currently including five universities and two research centres:

• […] of Athens,
• Aristotle University of Thessaloniki,
• […],
• […],
• […],
• Centre for the […], and
• National Centre for Social Research.

The Greek consortium is coordinated by Stelios Piperidis, Head of the Department of Natural Language Processing and Language Infrastructures of the Institute for Language and Speech Processing/Athena Research and Innovation Centre.

Clarin:el primarily functions as a secure and stable national research infrastructure, offering a network of dynamic repositories that enable the sustainable storage and dissemination of language tools and resources. Researchers can access the tools and resources of the consortium via the clarin:el inventory, which acts as a single access point to a number of local institutional repositories that are part of the clarin:el network. The clarin:el inventory provides user-friendly browsing and search functionalities, offering a customisable faceted search interface that allows researchers to narrow down their search queries on the basis of metadata-based features such as resource type, language, thematic domain, temporal or geographic coverage, access terms and conditions, etc.

Clarin:el currently contains around 500 language resources and 35 tools, which can be accessed by registered and non-registered users, in full compliance with the licence terms defined by the resource providers. Many of the language tools in the inventory are offered as web services for processing content in Greek as well as other languages, which means that researchers can use them to process their data directly through the inventory: users can either select resources from the clarin:el inventory to process, or they can upload their own data for processing. The outcome of the processing constitutes a new resource which can directly be added to the inventory, accompanied by automatically created metadata. Statistics related to the use of the resources (such as number of views and downloads) as well as dynamic recommendations of related resources and services (such as similar resources viewed by other users) are available to all users.

Organisations that are members of the clarin:el network have the ability to set up their own repository within the infrastructure, which can then be accessed through the central inventory. Individuals who join the clarin:el network may store their resources at a dedicated repository, the so-called Hosted Resources Repository, which is also available for the storage of resources provided by organisations which do not wish to maintain their own repository.

Photo: The clarin:el team

The Greek consortium also actively promotes user involvement. Recently, on 27 June 2018, the Greek consortium organised an event intended to deepen the dialogue with digital humanities and social sciences researchers, better understand their requirements and familiarise them with the clarin:el infrastructure. On the one hand, the event featured an interactive session where the researchers had the opportunity to present their own research questions and experiences with using language […] to clarin:el experts, while on the other, a hands-on session was organised, where the researchers were able to familiarise themselves with the clarin:el inventory and the use of its resources and tools. You can read more about the event in Section Four.
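The faceted search described in this section can be illustrated with a short Python sketch. It is only a minimal illustration of the general idea (narrowing a set of records by metadata values and counting what remains under each facet) and does not show how the clarin:el inventory is implemented. The `Resource` record, its field names, the helper functions and the sample entries are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    """Hypothetical metadata record; the field names are illustrative only."""
    name: str
    resource_type: str
    language: str
    domain: str


def facet_filter(resources, **selected):
    """Keep the resources whose metadata match every selected facet value."""
    return [
        r for r in resources
        if all(getattr(r, facet) == value for facet, value in selected.items())
    ]


def facet_counts(resources, facet):
    """Count how many resources fall under each value of one facet."""
    return Counter(getattr(r, facet) for r in resources)


# Invented sample records, not actual clarin:el inventory entries.
INVENTORY = [
    Resource("Parliament minutes", "corpus", "Greek", "politics"),
    Resource("Interview transcripts", "corpus", "Greek", "economy"),
    Resource("Named entity tagger", "tool", "Greek", "general"),
    Resource("Bilingual glossary", "lexicon", "English", "politics"),
]

# Narrow the search step by step, as a user would by ticking facet values.
greek_corpora = facet_filter(INVENTORY, resource_type="corpus", language="Greek")
print([r.name for r in greek_corpora])
print(facet_counts(greek_corpora, "domain"))
```

Each additional facet value passed to `facet_filter` narrows the result further, while `facet_counts` gives the kind of per-value tallies that a faceted interface typically shows next to each filter option.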

Kos, Greece | photo by Mico59 | Pixabay

GREECE | TOOL

The GrNE-Tagger

Written by Maria Gavriilidou, edited by Darja Fišer and Jakob Lenardič

GrNE-Tagger (http://hdl.grnet.gr/11500/ATHENA-0000-0000-23F2-7) is a tool available through clarin:el that automatically recognises proper names (Named Entities) in Greek texts and classifies them into one of the following five category types:

• PERSON: person names, family names;
• LOCATION: political or geographical names such as continents, countries, cities, etc.;
• ORGANISATION: names of entities such as companies, institutions, organisations, etc.;
• FACILITY: names of buildings and other human-created structures, such as streets, bridges, etc.;
• GPE (Geo-political entity): entities whose names coincide with a location name, but whose semantic content actually refers to its government or administration.

Figure 1: The output of GrNE-Tagger (using GATE as a visualisation tool), in which different NEs are marked with different colours.

The GrNE-Tagger is not a single tool, but rather a pre-defined pipeline of tools seamlessly integrated, in the sense that the output of one tool constitutes the input for the next:

Tokenisation > Sentence Segmentation > Part-of-Speech Tagging > Lemmatisation > Chunking > Named Entity Recognition

The annotation processes before Named Entity Recognition constitute the pre-processing of the text. After the pre-processing stage is completed, the Named Entity Recognition algorithm is applied to the text in two stages: it first uses linguistic rules to identify a set of candidate NEs and subsequently checks them against manually created wordlists of existing proper names. If a proper name in the pre-processed text is not identified in this manner, the tool tags it as UNKNOWN. To consolidate a candidate NE or a proper name labelled as UNKNOWN, and to finally place it into the correct category, GrNE-Tagger applies another round of linguistic rules that search for specific keywords in the context of the ambiguous expression. The keywords used for such disambiguation are, for example, professional titles, words denoting nationality or kinship terms such as father of, sister of etc. (in the case of PERSON); prefixes or suffixes denoting company types, such as Corp., Ltd. etc. (for ORGANISATION); words such as street, bridge etc. (for LOCATION) and so on. Based on shallow syntactic parsing, the system also disambiguates between LOCATION and GPE (geo-political entity).

GrNE-Tagger has been integrated in the clarin:el infrastructure as a web service, which means that the users do not need to install the tool locally; they simply select a resource from the clarin:el inventory (or upload their own resource) and process it. After the completion of the processing, the users receive an email with a link to the results of the processing. Furthermore, the tool has already been successfully applied to annotate several resources; for instance, one such resource enriched with GrNE-Tagger is a corpus of interviews conducted with female entrepreneurs in Athens.

GrNE-Tagger has been developed and is maintained by the Institute for Language and Speech Processing / Athena RC, and is available under a licence that permits Non-Commercial – Academic Use.

GREECE | RESOURCE

The Hellenic Parliament Sittings and Hellenic Parliamentary Corpus H-ParCo

Written by Katerina T. Frantzi and edited by Maria Gavriilidou, Darja Fišer and Jakob Lenardič

The corpus Hellenic Parliament Sittings (http://hdl.grnet.gr/11500/AEGEAN-0000-0000-2545-9), developed by the Laboratory of Informatics, Department of Mediterranean Studies of the Aegean University, includes minutes of meetings of the Greek Parliament and speeches of Parliament members, spanning the years 2011–2015. The resource has a total size of approximately 28.7 million words.

The corpus forms part of the dynamic Hellenic Parliamentary Corpus, H-ParCo, whose development was actually inspired by the participation of the University in the clarin:el network. The latest version of H-ParCo consists of language materials from all Plenary Sessions Minutes published by the Hellenic Parliament from 3 July 1989 to 31 April 2018; so in total, 29 years of Plenary Sessions Minutes. This version will soon be available through clarin:el, while the current published version, namely the Hellenic Parliament Sittings corpus, can already be downloaded under the CC-BY-NC licence.

Figure 2: Greek parliamentary corpus in the clarin:el repository.

The collection process has not been an easy task: the Hellenic Parliament publishes data in three formats: as .pdf files, .doc files, and as .txt files. The data have been retrieved manually and classified according to the year and month they pertain to. All .pdf and .doc files have been converted into .txt files, so that they can be processed by existing clarin:el tools. The original files have also been kept and organised so that a one-to-one correlation with the corresponding .txt files is maintained. The corpus also contains rich metadata that specify the date, the parliamentary term and session, the meeting, the original filename, the corresponding .txt filename and the size in terms of number of words for each file.

The actual language material contained in the resource is exactly what is included in the publicised texts of the meetings; no manual or automatic interventions have taken place to alter the recorded language (for instance, to correct errors or “sanitise” the language used).

Given that the development of H-ParCo is an ongoing process, future work involves:

• The continuous addition of minutes of recent Parliament Plenary Sessions (from 31 April 2018 onwards). These files are expected to be of the same formats as the previous ones, so the retrieval and processing procedures are also expected to be the same.
• The addition of minutes of older Parliament Plenary Sessions (before 3 July 1989). These are mostly image files (scanned images); this is expected to hamper the data retrieval and processing procedures, which are expected to be a lot more time-consuming, as the task of their conversion to .txt files is not a straightforward process.
• The development of a similar corpus consisting of the Parliament Plenary Sessions Minutes of the Republic of Cyprus. In this case, the files are in .pdf form. Therefore, the retrieval procedure is expected to be the same as that of H-ParCo.

Generally, H-ParCo is aimed at researchers of various domains and disciplines, such as Linguistics, Political Discourse Analysis, Critical Discourse Analysis, Digital Humanities, Communicational Techniques, Political Sciences, Sociology, Gender Studies and more. It has already been successfully used for Political/Critical Discourse Analysis purposes (Georgalidou et al. 2017 and 2018 [in Greek]).

References:

Georgalidou, M., Frantzi, K., and Giakoumakis, G. (2017). Addressing adversaries in the Greek Parliament: a corpus-based approach. Book of Abstracts of the 13th International Conference on Greek Linguistics, Westminster 7-9 September 2017.

Γεωργαλίδου, Μ., Φραντζή, Κ.Τ., and Γιακουμάκης, Γ. (2018). Κοινοβουλευτικός λόγος, ευγένεια και επιθετικότητα στο ελληνικό κοινοβούλιο [Parliamentary discourse, politeness and aggressiveness in the Greek parliament]. Book of Abstracts of the 39th Annual Meeting of the Department of Linguistics, School of Philology, Aristotle University of Thessaloniki, Thessaloniki 19-21 April 2018.

GREECE | EVENT

Language Data and Technologies in Social and Political Sciences

Written by Maria Gavriilidou, edited by Darja Fišer and Jakob Lenardič

On 27 June 2018, the Greek Infrastructure for language resources, technologies and services clarin:el organised an event on Language Data and Technologies in Social and Political Sciences, which took place at the Institute for Language and Speech Processing (ILSP/"Athena" R.C.).

ILSP, which serves as the coordinator of clarin:el, invited prominent Social and Political Scientists from the National Centre for Social Research (EKKE) and the Department of Political Science and History of the Panteion University (which are the two new member institutions of the Greek CLARIN network) for a focus-group workshop.

Aiming at the mutual acquaintance of the two scientific areas (social and political sciences on the one hand, and language technology on the other), the 15 researchers of EKKE and Panteion University were given the opportunity to present their research questions, the way they work, the current methodologies they follow, and the tools and data processing techniques they are familiar with. This took place in a lively interactive session, which has been recorded on video.

In turn, ILSP researchers presented specific language technology applications developed in the framework of multidisciplinary research projects, and showed how they can be used to tackle qualitative research questions. These presentations reinforced the interconnection between the two scientific areas, highlighting the key role of language technology in facilitating research in the domain of social and political sciences.

The second half of the event familiarised the social and political scientists with the clarin:el infrastructure. A detailed presentation of clarin:el (which focused on the portal, the repositories, the inventory of resources, and the tools and web services) served both as an introduction for the two new member institutions and as a glimpse of their tasks as nodes of the network. Additionally, the researchers were shown how to set up their repositories, prepare relevant documentation and upload their resources.

The presentation was followed by a hands-on training session, where the researchers were given a “guided tour” of all the features of the infrastructure and, through guided exercises, were able to familiarise themselves with the clarin:el inventory and the use of its resources and tools.

The discussion session that closed the event revealed some interesting points as regards the use of language resource infrastructures:

• The most crucial factor for the uptake of the clarin:el infrastructure (as expressed by the participants) was access to tools and web services, followed by access to other resources. Lower in their motivation for using the clarin:el infrastructure was the incentive to share their own resources, or to store resources in the repository.
• The role of language resource infrastructures in terms of certain legal issues (clearance of IPR, copyright issues, distribution issues, standardisation of licensing procedures, etc.) was considered to be very significant as regards the protection of language resource contributors and consumers from illegal use of their data.
• Finally, the promotion of a “sharing culture” and of open language data and tools was also highlighted as a crucial activity for language resource infrastructures.

GREECE | INTERVIEW

Vassiliki Georgiadou

Vassiliki Georgiadou is Associate Professor of Political Science at the Panteion University of Social and Political Sciences. The following interview took place via e-mail and was conducted by Maria Gavriilidou, and edited by Darja Fišer and Jakob Lenardič.

1. Could you tell us a little bit about yourself, your background and your current work?

I am Associate Professor of Political Science at the Department of Political Science and History of Panteion University of Social and Political Sciences. Our university is one of the oldest Greek universities, and the first school of political sciences in Greece. I studied political science in Athens (Panteion) and in Münster, Germany (Institute of Political Science) and I hold a PhD from the Faculty of Philosophy of the Westphalian Wilhelms University of Münster. I am a member of the Steering Committee of the Centre for Political Research of Panteion University, which operates as a laboratory of political sociology and comparative politics. Since April 2016, I have been a member of the National Council against Racism and Intolerance in Greece, which has already started planning a national strategy to combat discrimination and racism.

My research interests focus on political behaviour, far right parties, populism, radicalism and political extremism. I was principal investigator of the XENO@GR research project (http://xenophobia.ilsp.gr/?lang=el), which examines xenophobia in Greece during the economic crisis, based on a computational social sciences approach; currently, I am co-investigator of a research project that examines the different expressions of violence in Greece from 2008 onwards (London School of Economics (LSE)-Hellenic Observatory Grants).

2. How did you hear about CLARIN, and how did you get involved?

I knew about this European research network, since CLARIN is one of the most relevant infrastructures for researchers of the humanities and social sciences working with language related material. I am also involved in the So.Da.Net network, which is another research infrastructure that brings together social science data archives across Europe, and I was aware of the facilities that research infrastructures provide to the scientific community in order to conduct top-level research in their respective fields. I became involved in clarin:el during our project XENO@GR and the collaboration with the research team of ILSP/ATHENA, which coordinates the clarin:el consortium and was also our research partner in the XENO@GR project.

3. Could you describe the XENO@GR project in more detail?

The basic aim of this research effort was to examine the phenomenon of xenophobia in Greece through a large-scale multi-source study based on the use of advanced computational social science approaches. There is a common perception that xenophobia is a deep-rooted social phenomenon that escalates under circumstances of severe economic crisis. In line with this perception, xenophobia should have increased in Greece after the outbreak of the economic crisis in 2009. Drawing on a vast amount of data from a rich variety of sources and exploiting a wealth of research instruments, we tried to test the validity of this, addressing the following research goals:

a. to study the historical evolution of the phenomenon of xenophobia in Greece from the 1990s onwards;
b. to examine whether the recent economic crisis has raised the xenophobic sentiments and behaviour of Greeks against any kind of “others”; and
c. to decompose the effect of the economic crisis on the behaviour of Greek people against the “others”, in order to examine the expressions of continuity as well as the possibility of change, with reference to xenophobia as a social phenomenon deeply rooted in the perceptions and consciousness of Greeks.

To achieve the above goals, we created a large event database capturing events which were related to the phenomenon under study and happened in the timespan of the last twenty years. All entities (people, organisations and locations) involved in these events, as well as the sentiments and emotions expressed, were captured and coded in a knowledge network facilitating the exploration and further analysis of such social interaction in Greek society.

Figure 3: The workflow for creating the event database in the project.

4. On the basis of your work within this project, could you explain how social sciences and humanities researchers collaborate with experts from a research infrastructure offering language technologies (in your case, the Greek CLARIN consortium)?

In our first contact with language technologies we were impressed by the potentialities we had in our hands. We understood that every text, sound or video in the world is a new data source. We were confronted with different and rich data sources and we selected those that could help us answer our research questions. Our next step was to decide how to analyse them. Language technology experts explained all the possibilities and we decided to use event analysis for the newspaper data we had and sentiment analysis for the social media data. We collaborated on building a codebook for the events’ description and a similar first-step categorisation of Twitter data that resulted in different sentiment categories of verbal aggressiveness. This procedure was a step by step collaboration of both teams (social scientists and language technology experts) at both the conceptual and the analytical levels. The findings were coded and then stored in a knowledge network so as to promote the examination and analysis of the focal social interactions in the Greek society.

Figure 4: The main targets of verbal aggressiveness in Greece in terms of national/ethnic background, based on the analysis of Greek newspapers during the XENO@GR project.

5. How has CLARIN influenced your way of working? Would you like to single out any tools and resources provided by the Greek consortium that you used in your work?

The repository of CLARIN (and of every other infrastructure) must be considered as an innovative tool of communication between researchers and a discussion platform for the academic community. With CLARIN, every researcher has the opportunity to use language resources and computational tools for analysing empirical data, while being affiliated with the ethical and technical rules and standards of data and data protection. In addition, data collection and analysis must comply with methodological standards that will facilitate their replication or reproduction and corroboration. Clarin:el offers a number of data analysis tools, like event analysis and sentiment analysis. In particular, in the XENO@GR project we used Natural Language Processing (NLP) tools (offered by clarin:el as web services) for tokenisation, sentence splitting, part-of-speech tagging, and lemmatisation, before embarking on the more semantic, domain-specific event and sentiment analysis of the XENO@GR data. A detailed description of the tools used can be found on the project’s website. All of our processed data were uploaded at clarin:el.

Figure 5: The NLP tasks performed on the XENO@GR data.

6. How easy was it for you to adapt to the changes that language technologies introduced in your research methodology? Do you think that using language technologies opens up new research opportunities for political scientists?

From my point of view as a social scientist, it is quite challenging to collaborate with colleagues from other disciplines. Through interdisciplinarity, I believe that academic research can become more robust and useful to society. However, it is not an easy task to find a common communication code with a discipline that is mostly based on technology. At the beginning we had to adjust to the terminology that the two teams used, in order to choose the appropriate operational definitions for our study; but as soon as this became common ground, we obtained a new perspective for our research and became familiar with the potential that this interdisciplinary collaboration brings. We live in a world with multiple data sources that can be valuable not only for science, but also for the stakeholders. Language technologies give us the opportunity to explore large amounts of data in the minimum amount of time, and thus make more concrete and grounded inferences. I believe that this is a very big step for the social sciences in general and it expands research opportunities as well.

7. Did this experience influence your decision to join the clarin:el network and set up your own LRs repository?

The clarin:el network facilitates access to language data sources that are extremely relevant in our research as social scientists. Having joined the network as a legal entity (Panteion University), we can upload and share our data through the university data repository, thus enabling other researchers to work with our resources, and we can have access to the resources of others. This provides synergies among researchers and improves data availability, accessibility and sustainability.

8. Could you share your experience with the recent clarin:el event for social sciences and humanities researchers (presented in Section Four)? How are such events valuable to your research community?

The purpose of this event was to bring together clarin:el with members of the Social Sciences research community in Greece, notably researchers from Panteion University and the National Centre for Social Research that joined clarin:el in 2017 and 2018, respectively. For us, as new members of the infrastructure, it was extremely important to be informed of what our participation in clarin:el could bring, how to get involved and take advantage of the opportunities offered by the infrastructure. The keynote talk (Maria Gavriilidou) and the presentations (Xaris Papageorgiou, Maria Pontiki, Stelios Piperidis) elaborated the clarin:el infrastructure, architecture, and the contents, as well as technical and legal issues regarding the use and sharing of resources that are available in clarin:el. For us, as social scientists, it was particularly useful that the presentations were followed by an intensive hands-on session, where we were trained on the use of the clarin:el infrastructure.

9. How would you envisage future collaboration of your university with CLARIN?

I hope that more members of our research community at Panteion University will get involved in clarin:el, and that our students, researchers and colleagues will take advantage of the opportunities offered by the infrastructure.

COLOPHON

This brochure is part of the 'Tour de CLARIN' volume I (publication number: CLARIN-CE-2018-1341, November 2018).

Coordinated by Darja Fišer, Jakob Lenardič and Karolina Badzmierowska

Written by Maria Gavriilidou, Katerina T. Frantzi, Darja Fišer and Jakob Lenardič

Edited by Darja Fišer and Jakob Lenardič

Proofread by Paul Steed

Designed by Karolina Badzmierowska

Online version www.clarin.eu/Tour-de-CLARIN/Publication

Publication number CLARIN-CE-2018-1341 November 2018

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International Licence.

Contact
CLARIN ERIC
c/o Utrecht University
Drift 10, 3512 BS Utrecht
The Netherlands
www.clarin.eu

CLARIN: Common Language Resources and Technology Infrastructure
CLARIN 2018