Language Technology Meets Documentary Linguistics: What We Have to Tell Each Other

Trond Trosterud
Giellatekno, Centre for Saami Language Technology
http://giellatekno.uit.no/
February 15, 2018

Contents
- Introduction
- Language technology for the documentary linguist
- Language technology for the language society
- Conclusion

Introduction

- Giellatekno: started in 2001 (UiT). Research group for language technology on Saami and other northern languages
  - Grammatical modelling, dictionaries, ICALL, corpus analysis, MT, ...
  - Trond Trosterud, Lene Antonsen, Ciprian Gerstenberger, Chiara Argese
- Divvun: started in 2005 (UiT < Min. of Local Government). Infrastructure, proofing tools, synthetic speech, terminology
  - Sjur Moshagen, Thomas Omma, Maja Kappfjell, Børre Gaup, Tomi Pieski, Elena Paulsen, Linda Wiechetek

[Figure: The most important languages we work on]

Language technology for the documentary linguist

Language technology for documentary linguistics ...
... what's in it for the language community I work for?

- Let's pretend there are two types of language communities:
  1. Language communities without plans for revitalisation or use in domains other than oral use
  2. Language communities with such plans

Language communities without such plans

- Gather empirical material and do your linguistic analysis
  - (The triplet: grammar, text collection and dictionary)
- ... thereby doing much of the same groundwork as e.g. the neogrammarians:
  - A dictionary containing all words + (parallel) text collections
  - A phonemic and grammatical analysis
  - A reconstruction of protolanguages at different levels, based upon the analysis of related languages
- The Fenno-Ugricists actually did this

Computational methods as tools for the field linguist?

- Simple letter string analysis:
  - Reverse-sort your lexica
      ... | rev | sort | rev | ...
  - «Give me all stems with the vocalism -a-e-»
      ... | egrep 'a[^aeiou]+e' | ...
- Do wordform statistics (also taking variation into account)
      [å|á|à|ä] (->) a , ...
- «Unix for poets» and beyond

Test your grammar

- Let's say you are a grammarian,
- you write a morphophonological analysis of a language
  - Vowel harmony seems to be conditioned by these stem patterns
- or want to make a generalisation over sandhi phenomena
  - Combining affixes with the following segments will result in these and these changes
- then you could write a computational model of your grammar
- ... and test whether your generalisations actually hold:
  - on a corpus of (all) available text
  - or by generating paradigms with your alleged rules and inspecting the result
  - (or you could do it manually, as before)

Is it really worth it (for the linguist)?

- The reverse-sorting? Definitely, yes.
- But for the automatic grammar testing it is not that clear:
  - Pro:
    - Making a computational model of your grammar will force you to be explicit and comprehensive
    - And yes, it will tell you when your generalisations do (not) hold, and to what extent there is variation in your material
  - Con:
    - It is a machinery to learn
    - ... and perhaps you are explicit and comprehensive anyway

Note: There is an extra bonus!

- Our imagined language society had no intention whatsoever of using the language outside traditional oral settings, or of revitalising it, but ...
- Suddenly the next generation changes its mind (many poorly documented languages have lots of speakers)

Language technology for the language society

Language communities with plans for revitalisation / language use in the modern society

- You now need to write a model for your language, and standardise it to some extent
- Note the conflicting interests:
  - The linguist wants an orthography with all distinctions (you know what I mean)
  - The L1 speaker needs to produce and recognize the words
  - The L2 speaker needs cues to pronounce the words, and a stable norm to remember them
- (We linguists have a sad tradition of confusing transliteration with orthography, and our needs with the needs of the language community)

For the computer, variation is not a problem

- Machines are far better than humans at handling complexity and variation
- Our way of modelling grammars in Tromsø: the finite state transducer

The solution that fell out of fashion

Table: The Chomsky hierarchy

  Grammar  Languages               Automaton                                        Production rules (constraints)
  Type-0   Recursively enumerable  Turing machine                                   α → β (no restr.)
  Type-1   Context-sensitive       Linear-bounded nondeterministic Turing machine   αAβ → αγβ
  Type-2   Context-free            Non-deterministic pushdown automaton             A → γ
  Type-3   Regular                 Finite state automaton (FST)                     A → a and A → aB

The limitation of finite state transducers

- To Chomsky: There are aspects of human language that FSTs do not cover (e.g. Swiss German)
  - But this is ok, we may ignore those aspects
  - ... and there is (in my opinion) a difference between syntax and morphology anyway
- To me (my dissertation): FSTs tell me what, but not why
  - But this is ok as well, this is the limitation of all formal systems
  - The answer to the question why must be sought from outside the formal system

So why FSTs, and what do they give us?

- A way of turning a formal model of the phonology and morphology into a bidirectional machine
- The technology is not new (Koskenniemi 1983), but its virtues do not lie in its fashionability
- This bidirectional machine may
  - ... immediately tell us whether our generalisations are empirically adequate or not
  - ... and provide the foundation for practical programs

What is an FST anyway?
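
As a rough illustration of what "bidirectional machine" means, the following Python sketch runs one and the same set of arcs in both directions: from an analysis string to a wordform, and from a wordform back to its analysis. Everything in it is hypothetical and for illustration only: the toy English lexicon, the tag names `+N`, `+Sg`, `+Pl`, and the function name `transduce`. It is not the formalism or the tooling used in the actual grammar models.

```python
# Each arc is (source state, upper symbol, lower symbol, target state).
# The empty string "" plays the role of epsilon. The lexicon is a toy example.
ARCS = [
    (0, "c", "c", 1),
    (1, "a", "a", 2),
    (2, "t", "t", 3),
    (0, "d", "d", 4),
    (4, "o", "o", 5),
    (5, "g", "g", 3),
    (3, "+N", "", 6),    # the tag is only visible on the upper (analysis) side
    (6, "+Sg", "", 7),   # singular: nothing is added to the wordform
    (6, "+Pl", "s", 7),  # plural: the wordform gets an -s
]
START = 0
FINAL = {7}


def transduce(text: str, direction: str) -> list[str]:
    """Run the arcs 'down' (analysis -> wordform) or 'up' (wordform -> analysis)."""
    # Tuple positions of the side we read from and the side we write to.
    src, dst = (1, 2) if direction == "down" else (2, 1)
    results: list[str] = []

    def walk(state: int, rest: str, out: str) -> None:
        if not rest and state in FINAL:
            results.append(out)
        for arc in ARCS:
            if arc[0] == state and rest.startswith(arc[src]):
                walk(arc[3], rest[len(arc[src]):], out + arc[dst])

    walk(START, text, "")
    return results


print(transduce("cat+N+Pl", "down"))  # generation
print(transduce("dogs", "up"))        # analysis
print(transduce("cow", "up"))         # not in the toy lexicon
```

The point of the sketch is that the grammar is written once, as a relation between two levels, and that generation and analysis both fall out of it. A string the machine rejects, such as `cow` here, is either missing from the lexicon or a counterexample to the rules.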

How we write it

Morphology:

    LEXICON NOUNS
    koira CASELEX ;
    katu CASELEX ;
    ...
    LEXICON CASELEX
    +N+Pl+Nom:^WGt # ;
    ...

Morphophonology:

    t:d <=> _ Vow ^WG: ;

The task for the grammarian:

- Distinguish segmental from suprasegmental
- Make one model of each, and join them together

FSTs as spell checkers

The effect of introducing proofing tools

- ... is unfortunately an under-studied topic
- A personal note
  - My Saami correspondence is dependent upon the speller, and others tell the same
  - Ávvir, the North Saami daily, is dependent upon it
  - ... whereas NRK Sápmi do not use proofing tools, and have a home page with 7.2% Saami
- The reactions in Greenland to getting a spellchecker were:
  - «Of course there is a spellchecker for Greenlandic! The Danes have it, so why not we?»

Proofing tools may be controversial

- Fluent but illiterate elderly speakers may be positive
- The people who invented / govern the orthography may protest against proofing tools
  - They see their knowledge monopoly crumble:
  - «The young will not need to learn the language when there are proofing tools»
- Revitalisers and language learners love the proofing tools
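
The idea behind using a morphological model as a spell checker can be sketched roughly as follows: a word is accepted if the model recognises it, and otherwise the checker proposes recognised wordforms that lie one edit away. In this Python sketch a plain set stands in for the wordforms an FST would accept. The set `VALID_FORMS`, the `ALPHABET` string and the function names are assumptions made for illustration, and the single-edit suggestion strategy is a generic textbook technique, not a description of the Divvun spellers.

```python
# Hypothetical stand-in for the set of wordforms an FST analyser would accept.
VALID_FORMS = {"koira", "koirat", "katu", "kadut"}
ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"


def edits1(word: str) -> set[str]:
    """All strings one deletion, transposition, replacement or insertion away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    swaps = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    replaces = {a + c + b[1:] for a, b in splits if b for c in ALPHABET}
    inserts = {a + c + b for a, b in splits for c in ALPHABET}
    return deletes | swaps | replaces | inserts


def check(word: str) -> tuple[bool, list[str]]:
    """Return (accepted, suggestions); suggestions are empty for accepted words."""
    if word in VALID_FORMS:
        return True, []
    return False, sorted(edits1(word) & VALID_FORMS)


print(check("koira"))
print(check("katut"))
```

With a real transducer the membership test covers every inflected form the grammar can generate, without anyone having to list those forms by hand.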

Our number 1 success story in language communities: e-dictionaries

Table: Amount of base forms
