Digitizing, analysing and visualising data for research in Humanities and Social Sciences – use cases and problems

What is science?

vs. What do they need for science?

FIN-CLARIN and CLARIN

Research Infrastructure for Humanities and Social Sciences

Krister Lindén
National Coordinator, FIN-CLARIN

CLARIN ERIC

European Research Infrastructure Consortium, founded on February 29, 2012
https://www.clarin.eu/

International cooperation and sharing of resources
CLARIN centres

Member countries:

The Netherlands, Austria, Bulgaria, Czech Republic, Denmark, DLU, Estonia, Finland, Germany, Greece, Hungary, Italy, Latvia, Lithuania, Norway, Poland, Portugal, Slovenia, Sweden, France, UK, USA / CMU

What kind of data is needed?

https://vlo.clarin.eu

about 1.6 million records
• Computer-mediated communication corpora
• Historical corpora
• L2 learner corpora
• Newspaper corpora
• Parallel corpora
• Parliamentary corpora
• …

Online Corpora

Gw = billion words, Mw = million words, h = hours

Resources (figures for 2018 and 2022)

Text
• Magazines and newspapers 1770- (NLF and Web publ.): 12 Gw (2018), 20 Gw (2022)
• Social media and similar sources 2000- (Suomi24, Ylilauta, …): 4 Gw (2018), 10 Gw (2022)
• Literature and manuscripts (Gutenberg, Fennica, archives): 60 Mw (2018), 70 Mw (2022)

Speech
• News broadcasts (YLE): 10000 h
• Video sessions from the Finnish Parliament 2008-2016: 500 h (2018), 1000 h (2022)
• Dialect and everyday speech (Kotus, Turku): 500 h (2018), 1000 h (2022)
• Sign language resources (Aalto, Kuurojen liitto): 20 h (2018), 500 h (2022)

Multilingual and Other Resources
• Multilingual Resources (EuroParl, laws, Bible, subtitles, …): 3 Gw (2018), 10 Gw (2022)
• Learner’s resources (Oulu, Jyväskylä, Kotus, Aalto): 2 Mw (2018), 5 Mw (2022)
• Open source lexicons and terminologies (Helsinki, Tromssa): 300 Kw (2018), 400 Kw (2022)

Currently, FIN-CLARIN has approx. 19 Gw in >1400 databases

What additional data sets would you like to see in Kielipankki?

• What new data set families should be surveyed in CLARIN?

CLARIN resource families

Text
• Computer-mediated communication corpora
• Historical corpora
• L2 learner corpora
• Newspaper corpora
• Parallel corpora

Speech
• Spoken corpora

Picture & Video
• Parliamentary corpora
• Online news corpora
• …

What tools are needed?

Slide created by Eetu Mäkelä

Development & application of methods

Cleaning, Annotation, Aggregation, Visualisation (CS and LT research):
• Finnish ASR (André Mansikkaniemi) ~96% correct for parliamentary data
• Historical newspaper re-OCR and normalization of text (Senka Drobac) CAR: ~80% -> ~95%
• Language identification (Tommi Jauhiainen) ~99%
• POS annotation (Miikka Silfverberg) ~96%
• FiNER annotation (Pekka Kauppinen) 85-93% (cf. Stanford 73-88%)
• Linguistic annotation pipeline (Jussi Piitulainen)
• PMI metric for small or medium-sized data sets (Aleksi Sahala)
• Like & unlike operators for semantic vectors (Sam Hardwick)
• Sentiment and topic annotation (Sam Hardwick)
• Concordance software development (Jyrki Niemi)

Digital humanities research
• NewsEye and named-entity tagging of historical newspapers (NatLib, EC) (cf. UutisNenä a few years ago!)
• Pseudonymisation of court decisions (HELDIG, MinJust)
• Aligning and enriching Finnish Parliamentary debates with minutes and transcripts (Aalto)

Humanities and Social Science research
• ComHis and understanding the dynamics of the spreading of news in Finnish newspapers (UTU)
• Citizen’s Mindscape and understanding the dynamics of Finnish society (CM, HELDIG)
• Social Network Analysis of Akkadian Deities (ANEE CoE)
• Describing semantic fields in Akkadian text using semantic vector operators as input (AF)

What additional tools should be available in Kielipankki?

Cleaning and understanding

Cleaning
• Ingestion
  • Text: Suomi24, web publications, blogs, ...
  • Images: Clay tablets, magazines, manuscripts
  • Audio: Radio programs
  • Video: Parliamentary and news videos
• Conversion
  • html2txt
  • OCR & HTR
  • ASR
  • video2txt
  • txt2vrt

Understanding
• Annotation
  • Picture element annotation
  • Language identification
  • Base form annotation
  • Named-entity annotation
  • Linguistic annotation
  • Sentiment & fixed topic annotation
• Feature aggregation
  • Feature counting
  • Feature vectors
  • Summarization
• Exploration & visualisation
  • Concordance
  • Trend diagrams
  • Location maps
  • Topic modelling
  • Network analysis
  • Close reading
  • Text, picture & video viewers

Infrastructure for digitizing, analysing and visualising data for research in Humanities and Social Sciences

www.kielipankki.fi

General support
[email protected]

Technical support
[email protected]

Kiitos! Tack! Thank you!