Morphological analyzers and other digital tools for
Jack Rueter, University of Helsinki, Digital Humanities

An outline
– Some terminology
– Word associations
– Working out methodology
– Morphological analyzers under development
– Where it started
– Setting up a rule-based description
– Tools
– Important players

11/10/19 Jack Rueter, Ph.D. University of Helsinki, Digital Humanities

Some terminology
– Open-source
– Language form
– Rule-based
– Finite-state morphology
– Disambiguation
– Multiple reusability
– Linguistics and language technology
– Neural networks
– Transfer learning
– Active learning

Word associations for multi-reusability
– (1) Keyboards
– (2) Spellers
– (3) Dictionaries
– (4) ICALL
– (5) Translation
– (6) Text-to-speech
– (7) …

Searching for a proper methodology
– (1) Extract paradigms from grammars, readers and research to build an analyzer.
– (2) Extract words, part-of-speech information and definitions from existing dictionaries and research. Build on what has already been done (Dutch, French, German, Russian, …).
– (3) Test analysis coverage on written texts. Are the forms unrecognized words?
– (4) Disambiguate morphological analyses based on grammars and research. Point out gaps in descriptions for linguistics.
– (5) Test syntactic disambiguation on example sentences cited in grammatical descriptions of the language, and then retest on text corpora.
– (6) Make disambiguated sentences public, so others can test. One by-product of these gold standards is treebanks.
– (7) Use all phases to benefit the speaker and research community.
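
Step (3), testing analysis coverage, can be sketched roughly as follows. This is a minimal illustration, not the tooling used in the projects described here: the dictionary `TOY_ANALYSES` stands in for a real finite-state analyzer, the Finnish example forms and the tag strings are assumptions chosen only for illustration, and `tokenize`, `coverage` and `toy_analyze` are hypothetical helper names.

```python
import re

# Stand-in for a real morphological analyzer (assumption: a plain lookup table).
# The tag notation is illustrative only.
TOY_ANALYSES = {
    "talo": ["talo+N+Sg+Nom"],
    "talossa": ["talo+N+Sg+Ine"],
    "on": ["olla+V+Ind+Prs+Sg3"],
}


def toy_analyze(token):
    """Return the list of analyses for a token, or an empty list if unknown."""
    return TOY_ANALYSES.get(token, [])


def tokenize(text):
    """Very rough tokenizer: lowercase the text and keep runs of word characters."""
    return re.findall(r"\w+", text.lower())


def coverage(tokens, analyze):
    """Return (share of tokens with at least one analysis, unrecognized tokens)."""
    unknown = [t for t in tokens if not analyze(t)]
    ratio = (len(tokens) - len(unknown)) / len(tokens) if tokens else 0.0
    return ratio, unknown


ratio, unknown = coverage(tokenize("Talossa on kissa"), toy_analyze)
print(unknown)  # ['kissa']
```

The list of unrecognized forms is the useful part: each entry is either a gap in the lexicon, a gap in the paradigm description, or a form that is not a word of the language at all.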

Open-source morphological descriptions for Uralic languages
– First transducers of minority Uralic languages after Finnish 1983 (Kimmo Koskenniemi)
– Meadow Mari ~1986 (Jorma Luutonen)
– Komi-Zyrian 1996 (Jack Rueter)
– Giellatekno ~2000 begins work with Sami descriptions (Trond Trosterud et al.)
– Barents Sea languages, Circumpolar languages
– From ~2004: other Uralic languages

Minority Uralic language forms with finite-state morphology development
– Balto-Finnic: fit = Meänkieli, fkv = Kven, izh = Ingrian, krl = Karelian, liv = Livonian, olo = Olonets-Karelian aka Livvi, vep = Veps, vot = Votic, vro = Võro
– Sami: sjd = Kildin Sami, sje = Pite Sami, sma = South Sami, sme = Northern Sami, smj = Lule Sami, smn = Inari Sami, sms = Skolt Sami
– Mordvin: mdf = Moksha, myv = Erzya
– Mari: mhr = Meadow & Eastern Mari, mrj = Hill Mari aka Western Mari
– Permic: koi = Komi-Permyak, kpv = Komi-Zyrian, udm = Udmurt
– Ob Ugrian: kca = Khanty, mns = Mansi
– Samoyedic: nio = Nganasan, sel = Selkup, yrk = Nenets

Pertinent majority languages with finite-state morphology development
– Uralic languages in majority: est = Estonian, fin = Finnish, hun = Hungarian
– Auxiliary languages: deu = German, lav = Latvian, nob = Norwegian Bokmål, rus = Russian, tat = Tatar

Setting up a morphological analyzer
– Find a source and use the known morphological information
– Find or build a lexicon to propagate this word type

Use known paradigmatic information, Ingrian (Junius 1936)

Make a test file
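
What such a test file might look like, and how it could be checked, is sketched below. Everything here is an assumption made for illustration: the tab-separated layout (analysis, then expected surface form), the tag notation, the Finnish example forms standing in for the Ingrian paradigm, and the hypothetical helpers `parse_test_file`, `run_tests` and `toy_generate`. A real setup would call a finite-state generator instead of the toy suffix table.

```python
# Hypothetical test file: one "analysis<TAB>expected form" pair per line.
TEST_FILE = "\n".join([
    "# analysis\texpected form",
    "talo+N+Sg+Nom\ttalo",
    "talo+N+Sg+Gen\ttalon",
    "talo+N+Sg+Ine\ttalossa",
    "talo+N+Sg+Ela\ttalosta",
])

# Toy generator: the elative is deliberately left undescribed so the test finds a gap.
SUFFIXES = {"+N+Sg+Nom": "", "+N+Sg+Gen": "n", "+N+Sg+Ine": "ssa"}


def toy_generate(analysis):
    """Return the surface form for an analysis, or None if the tags are unknown."""
    lemma, _, tags = analysis.partition("+")
    suffix = SUFFIXES.get("+" + tags)
    return None if suffix is None else lemma + suffix


def parse_test_file(text):
    """Parse test lines into (analysis, expected form) pairs, skipping comments."""
    cases = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        analysis, expected = line.split("\t")
        cases.append((analysis, expected))
    return cases


def run_tests(cases, generate):
    """Return (analysis, expected, actual) for every case the generator gets wrong."""
    failures = []
    for analysis, expected in cases:
        actual = generate(analysis)
        if actual != expected:
            failures.append((analysis, expected, actual))
    return failures


failures = run_tests(parse_test_file(TEST_FILE), toy_generate)
print(failures)  # [('talo+N+Sg+Ela', 'talosta', None)]
```

Writing the expected forms down before the rules exist means every later change to the description can be rerun against the same paradigm.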

Generate an initial paradigm type

Propagate the lexicon
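
The idea of one paradigm type shared by many lexicon entries can be illustrated with a small sketch. It is plain suffix concatenation, not a finite-state transducer, and it rests on stated assumptions: Finnish back-vowel nouns without stem alternation are used as stand-in data, the tag notation is illustrative, and the names `PARADIGM_TYPES`, `N_TALO`, `propagate`, `generate` and `build_analyzer` are hypothetical.

```python
# One paradigm type: a table from tag strings to suffixes (illustrative tags).
PARADIGM_TYPES = {
    "N_TALO": {
        "+N+Sg+Nom": "",
        "+N+Sg+Gen": "n",
        "+N+Sg+Ine": "ssa",
        "+N+Sg+Ela": "sta",
    }
}

# Initial lexicon: the single lemma the paradigm type was built from.
LEXICON = {"talo": "N_TALO"}


def propagate(lexicon, new_lemmas, paradigm_type):
    """Return a copy of the lexicon with new lemmas assigned to a paradigm type."""
    if paradigm_type not in PARADIGM_TYPES:
        raise KeyError(f"unknown paradigm type: {paradigm_type}")
    updated = dict(lexicon)
    for lemma in new_lemmas:
        updated[lemma] = paradigm_type
    return updated


def generate(lexicon, lemma):
    """Return {analysis: surface form} for every cell of the lemma's paradigm."""
    suffixes = PARADIGM_TYPES[lexicon[lemma]]
    return {lemma + tags: lemma + suffix for tags, suffix in suffixes.items()}


def build_analyzer(lexicon):
    """Invert generation into a lookup: surface form -> list of analyses."""
    analyses = {}
    for lemma in lexicon:
        for analysis, form in generate(lexicon, lemma).items():
            analyses.setdefault(form, []).append(analysis)
    return analyses


lexicon = propagate(LEXICON, ["pallo", "kala"], "N_TALO")
analyzer = build_analyzer(lexicon)
print(generate(lexicon, "kala")["kala+N+Sg+Ine"])  # kalassa
print(analyzer["pallon"])  # ['pallo+N+Sg+Gen']
```

Each lemma added to the paradigm type gets the full set of forms at no extra cost, which is why finding or building a lexicon for a described word type pays off quickly.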

Divvun for language community and technology

Giellalt/Tools
– Keyboards
– Spell checkers: Hunspell, Voikko
– Click-in-text dictionaries
– Language learning
– Text-to-speech
– Translation

Giella Dictionaries
– Click-in-text dictionaries
– Giella: https://sanit.oahpa.no/, https://saan.oahpa.no/, https://sanat.oahpa.no/, https://valks.oahpa.no/, https://muter.oahpa.no/, https://kyv.oahpa.no/, https://vada.oahpa.no/
– Language internal and external links
– Akusanat: https://www.akusanat.com/, https://www.akusanat.com/verdd, https://www.akusanat.com/semantics

– Material Collaborations: FU-Lab, University of Turku, University of Tartu, University, EKI, University of Vienna, Livones, Võro Instituut

Language learning
– ICALL at Giella (Giellatekno & Divvun)
– Northern Sami (flagship): http://oahpa.no/davvi

Translation
– Apertium: shallow-transfer for closely related languages
– http://wiki.apertium.org/wiki/Main_Page

Text-to-speech
– Common Voice (Mozilla)
– https://voice.mozilla.org/en

Infrastructures with intense activity
– EKI in Tallinn: http://portaal.eki.ee/sonaraamatud.html
– FU-Lab in Syktyvkar, Komi Republic: https://fu-lab.ru/fulabteam
– Giella in Tromsø, Norway: http://giellatekno.uit.no/
– Mari Research Institute for Language, Literature and History: http://marnii.ru/
– UL Livonian institute: http://www.livones.net/liv
– Võro Instituut: https://wi.ee/

Aitäh! Thank you!
