Language technology meets documentary linguistics: What we have to tell each other

Trond Trosterud
Giellatekno, Centre for Saami Language Technology
http://giellatekno.uit.no/

February 15, 2018

Contents

Introduction

Language technology for the documentary linguist

Language technology for the language society

Conclusion

Introduction

▶ Giellatekno: started in 2001 (UiT). Research group for language technology on Saami and other northern languages
  ▶ Grammatical modelling, dictionaries, ICALL, corpus analysis, MT, ...
  ▶ Trond Trosterud, Lene Antonsen, Ciprian Gerstenberger, Chiara Argese
▶ Divvun: started in 2005 (UiT < Min. of Local Government)
  ▶ Infrastructure, proofing tools, synthetic speech, terminology
  ▶ Sjur Moshagen, Thomas Omma, Maja Kappfjell, Børre Gaup, Tomi Pieski, Elena Paulsen, Linda Wiechetek

The most important languages we work on

Language technology for documentary linguistics ...

... what’s in it for the language community I work for?

▶ Let’s pretend there are two types of language communities:
  1. Language communities without plans for revitalisation or use in domains other than oral use
  2. Language communities with such plans

Language communities without such plans

▶ Gather empirical material and do your linguistic analysis
  ▶ (The triplet: grammar, text collection and dictionary)
▶ ... thereby doing much of the same groundwork as e.g. the neogrammarians:
  ▶ A dictionary containing all + (parallel) text collections
  ▶ A phonemic and grammatical analysis
  ▶ A reconstruction of protolanguages at different levels, based upon the analysis of related languages
▶ The Fenno-Ugricists actually did this

Computational methods as tools for the field linguist?

▶ Simple letter string analysis:
  ▶ Reverse-sort your lexica
    ▶ ... | rev | sort | rev | ...
  ▶ «Give me all stems with the vocalism -a-e-»
    ▶ ... | egrep 'a[^aeiou]+e' | ...
  ▶ Do wordform statistics (also taking variation into account)
    ▶ [å|á|à|ä] (->) a , ...
▶ «Unix for poets» and beyond
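For readers who prefer a script to a shell pipeline, the three letter-string operations above could be sketched in Python roughly as follows. The function names and the toy word lists are illustrative assumptions, not part of the Giellatekno tools, and sorting here is by Unicode code point rather than by any locale.

```python
import re
from collections import Counter

def reverse_sort(words):
    """Sort words by their reversed spelling (like `rev | sort | rev`)."""
    return sorted(words, key=lambda word: word[::-1])

# Same pattern as the egrep call: a, one or more non-vowels, then e.
VOCALISM_A_E = re.compile(r"a[^aeiou]+e")

def stems_with_a_e(words):
    """Return the words that contain the vocalism -a-e-."""
    return [word for word in words if VOCALISM_A_E.search(word)]

# Fold the variants å, á, à, ä to a before counting (illustrative mapping).
FOLD = str.maketrans("åáàä", "aaaa")

def wordform_counts(tokens):
    """Count wordforms after lowercasing and folding spelling variants."""
    return Counter(token.lower().translate(FOLD) for token in tokens)
```

Reverse sorting groups words with the same ending, so suffixes and stem-final patterns end up next to each other in the list.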

Test your grammar

▶ Let’s say you are a grammarian,
  ▶ you write a morphophonological analysis of a language
    ▶ Vowel harmony seems to be conditioned by these stem patterns
  ▶ or want to make a generalisation over sandhi phenomena
    ▶ Combining affixes with the following segments will result in these and these changes
▶ then you could write a computational model of your grammar
▶ ... and test whether your generalisations actually hold:
  ▶ on a corpus of (all) available text
  ▶ or by generating paradigms with your alleged rules and inspecting the result
▶ (or you could do it manually, as before)

Is it really worth it (for the linguist)?

▶ The reverse-sorting? Definitely, yes.
▶ But for the automatic grammar testing it is not that clear:
  ▶ Pro:
    ▶ Making a computational model of your grammar will force you to be explicit and comprehensive
    ▶ And yes, it will tell you when your generalisations do (not) hold, and to what extent there is variation in your material
  ▶ Con:
    ▶ It is a machinery to learn
    ▶ ... and perhaps you are explicit and comprehensive anyway

Note: There is an extra bonus!

▶ Our imagined language society had no intention whatsoever to use the language outside traditional oral settings, or to revitalise it, but...
▶ Suddenly the next generation changes its mind (many poorly documented languages have lots of speakers)

Language communities with plans for revitalisation / language use in the modern society

▶ You now need to write a model for your language, and standardise it to some extent
▶ Note the conflicting interests:
  ▶ The linguist wants an orthography with all distinctions (you know what I mean)
  ▶ The L1 speaker needs to produce and recognize the words
  ▶ The L2 speaker needs cues to pronounce the words, and a stable norm to remember them
▶ (We linguists have a sad tradition of confusing transliteration and orthography, and our needs with the needs of the language community)

For the computer, variation is not a problem

▶ Machines are far better than humans at handling complexity and variation
▶ Our way of modelling grammars in Tromsø: the finite state transducer

The solution that fell out of fashion

Table: The Chomsky hierarchy

Grammar  Languages               Automaton                                        Production rules (constraints)
Type-0   Recursively enumerable  Turing machine                                   α → β (no restrictions)
Type-1   Context-sensitive       Linear-bounded non-deterministic Turing machine  αAβ → αγβ
Type-2   Context-free            Non-deterministic pushdown automaton             A → γ
Type-3   Regular                 Finite state automaton (FST)                     A → a and A → aB

The limitation of finite state transducers

▶ To Chomsky: There are aspects of human language that FSTs do not cover (e.g. Swiss German)
  ▶ But this is ok, we may ignore those aspects
  ▶ ... and there is (in my opinion) a difference between syntax and morphology anyway
▶ To me (my dissertation): FSTs tell me what, but not why
  ▶ But this is ok as well, this is the limitation of all formal systems
  ▶ The answer to the question why must be sought from outside the formal system

So why FSTs, and what do they give us?

▶ A way of turning a formal model of the phonology and morphology into a bidirectional machine
  ▶ The technology is not new (Koskenniemi 1983), but its virtues do not lie in its fashionability
▶ This bidirectional machine may
  ▶ ... immediately tell us whether our generalisations are empirically adequate or not
  ▶ ... and provide the foundation for practical programs

What is an FST anyway?
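One way to picture a transducer is as a set of states connected by arcs, where every arc pairs an upper (analysis) symbol with a lower (surface) symbol. The toy machine below is an assumption made for illustration: it encodes a single path, for the Finnish form kadut, and is not how a production FST library stores or runs its networks. It does show why the same network can be applied in both directions.

```python
EPS = ""  # the empty symbol: the arc reads or writes nothing on that side

# Arcs as (source state, upper symbol, lower symbol, target state).
ARCS = [
    (0, "k", "k", 1),
    (1, "a", "a", 2),
    (2, "t", "d", 3),     # t on the analysis side, d on the surface side
    (3, "u", "u", 4),
    (4, "+N", EPS, 5),
    (5, "+Pl", EPS, 6),
    (6, "+Nom", "t", 7),
]
FINAL_STATES = {7}

def transduce(symbols, upper_to_lower=True):
    """Apply the toy transducer to a list of symbols in either direction."""
    results = []

    def walk(state, position, output):
        if position == len(symbols) and state in FINAL_STATES:
            results.append("".join(output))
        for source, upper, lower, target in ARCS:
            if source != state:
                continue
            read, write = (upper, lower) if upper_to_lower else (lower, upper)
            if read == EPS:
                walk(target, position, output + [write])
            elif position < len(symbols) and symbols[position] == read:
                walk(target, position + 1, output + [write])

    walk(0, 0, [])
    return results
```

Calling `transduce(["k", "a", "t", "u", "+N", "+Pl", "+Nom"])` generates the surface form, and `transduce(list("kadut"), upper_to_lower=False)` analyses it; a form the network does not contain yields an empty list.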

How we write it

Morphology:

    LEXICON NOUNS
    koira CASELEX ;
    katu CASELEX ;
    ...

    LEXICON CASELEX
    +N+Pl+Nom:^WGt # ;
    ...

Morphophonology:

    t:d <=> _ Vow ^WG: ;
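To make the division of labour between the two descriptions concrete, here is a rough Python emulation, under the assumption that ^WG is a weak-grade trigger that itself surfaces as nothing. The lexicon part concatenates stems and suffixes into an intermediate form, and the rule part rewrites t as d before a vowel plus the trigger. A real system compiles both descriptions into one transducer; the lookup table built here only imitates the resulting bidirectionality.

```python
import re

STEMS = ["koira", "katu"]             # LEXICON NOUNS
SUFFIXES = {"+N+Pl+Nom": "^WGt"}      # LEXICON CASELEX: tags -> intermediate form
VOWELS = "aeiouyäö"                   # assumed content of the set Vow

def apply_weak_grade(intermediate):
    """Emulate t:d <=> _ Vow ^WG: and then delete the trigger symbol."""
    weakened = re.sub(rf"t(?=[{VOWELS}]\^WG)", "d", intermediate)
    return weakened.replace("^WG", "")

def generate(lemma, tags):
    """Lemma + tags -> surface wordform."""
    return apply_weak_grade(lemma + SUFFIXES[tags])

# Build the inverse mapping once, so that analysis is a simple lookup.
ANALYSES = {}
for stem in STEMS:
    for tags in SUFFIXES:
        ANALYSES.setdefault(generate(stem, tags), []).append(stem + tags)

def analyse(wordform):
    """Surface wordform -> list of lemma + tag analyses."""
    return ANALYSES.get(wordform, [])
```

Generating every stem with every suffix and reading through the output is also the quickest way to spot a rule that applies too widely or not widely enough.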

The task for the grammarian:

▶ Distinguish segmental from suprasegmental
▶ Make one model of each, and join them together

FSTs as spell checkers

The effect of introducing proofing tools

▶ ... is unfortunately an under-studied topic
▶ A personal note
  ▶ My Saami correspondence is dependent upon the speller, and others report the same
  ▶ Ávvir, the North Saami daily, is dependent upon it
  ▶ ... whereas NRK Sápmi do not use proofing tools, and have a home page with 7.2% Saami
▶ The reactions in Greenland to getting a spellchecker were:
  ▶ «Of course there is a spellchecker for Greenlandic! The Danes have it, so why not we?»

Proofing tools may be controversial

▶ Fluent but illiterate elderly speakers may be positive
▶ The people who invented / govern the orthography may protest against proofing tools
  ▶ They see their knowledge monopoly crumble:
  ▶ «The young will not need to learn the language when there are proofing tools»
▶ Revitalisers and language learners love the proofing tools

Our number 1 success story in language communities: e-dictionaries

Table: Amount of base forms in running text

                                    North Saami    Finnish  Norwegian
Number of words in the text             252 461     45 144     64 994
Number of lemmas in the dictionary       99 071     94 111     38 983
Coverage                                  7.9 %     10.0 %     30.5 %

(Antonsen et al. 2009)
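A figure of this kind can be computed in a few lines, assuming that "coverage" means the share of running wordforms that are identical to a dictionary headword, which is what a user gets from a dictionary without morphological analysis. The function name and the toy data are illustrative; the sketch does not reproduce the procedure of the cited study.

```python
def base_form_coverage(tokens, lemmas):
    """Share of running tokens that literally match a dictionary lemma."""
    if not tokens:
        return 0.0
    lemma_set = {lemma.lower() for lemma in lemmas}
    hits = sum(1 for token in tokens if token.lower() in lemma_set)
    return hits / len(tokens)
```

On the toy text `["kadut", "ovat", "katu", "koirat"]` with the lemmas `["katu", "koira", "olla"]`, only one token in four is a base form, so three out of four lookups would fail without an analyser.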

Morphology-enriched e-dictionaries vs. paper dictionaries?

▶ E-dictionaries:
  ▶ are faster to use than paper dictionaries, and contain all forms
  ▶ are indispensable for languages with prefixing or suprasegmental morphology
  ▶ Clicking on words is fast enough to not forget what sentence you were reading
  ▶ E-dictionaries handle variation and sloppy typing
▶ Paper dictionaries:
  ▶ are nice status symbols in the bookshelf
  ▶ are robust, do not need electricity or internet
▶ The point is not the competition; we may have both.

Our morphology-enriched e-dictionaries

... use FST-analysis and translation via a web service
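The lookup chain can be sketched as follows: analyse the typed wordform to get its lemma, then look the lemma up in the bilingual wordlist, with a relaxed match for typing without diacritics. Both tables are hypothetical stand-ins (a small Finnish to English sample chosen for readability); in the real dictionaries the analysis comes from the FST and the whole chain runs behind a web service.

```python
import unicodedata

# Hypothetical analyser output: wordform -> lemma (an FST would supply this).
ANALYSES = {"kadut": "katu", "koirat": "koira", "pöydät": "pöytä"}
# Hypothetical bilingual wordlist: lemma -> translation.
TRANSLATIONS = {"katu": "street", "koira": "dog", "pöytä": "table"}

def strip_diacritics(text):
    """Remove combining marks, so that 'pöydät' becomes 'poydat'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Relaxed index: diacritic-free spelling -> properly spelled wordform.
RELAXED = {strip_diacritics(form): form for form in ANALYSES}

def look_up(typed):
    """Return (lemma, translation) for a typed wordform, or None."""
    form = typed.lower()
    if form not in ANALYSES:
        form = RELAXED.get(strip_diacritics(form), form)
    lemma = ANALYSES.get(form, form)   # fall back to treating input as a lemma
    translation = TRANSLATIONS.get(lemma)
    return (lemma, translation) if translation else None
```

With these tables, `look_up("kadut")` and the sloppily typed `look_up("poydat")` both reach a dictionary entry, which a plain headword search would not.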

The dictionary understands inflected forms

... and may generate paradigms via FST

Reading dictionary: read and alt-click

Skolt Saami

Komi

Spell relax

▶ http://sanit.oahpa.no ▶ North Saami ↔ Norwegian, Finnish

▶ http://baakoeh.oahpa.no ▶ South Saami ↔ Norwegian

▶ http://saan.oahpa.no ▶ Skolt Saami ↔ Finnish,

▶ http://sanat.oahpa.no ▶ Olonets → Finnish // Kven ↔ Norwegian

▶ http://sonad.oahpa.no ▶ Livonian → Finnish, Estonian, Latvian, // Votic, Võro, Ingrian ↔ ...

▶ http://valks.oahpa.no ▶ Erzya → English, Finnish, Russian, French // Mokša → Finnish, French

▶ http://muter.oahpa.no ▶ Eastern Mari → Finnish // Western Mari → Finnish

▶ http://kyv.oahpa.no ▶ Komi ↔ English, Finnish

▶ http://vada.oahpa.no ▶ Nenets → English, Finnish

▶ http://pikiskwewina.oahpa.no ▶ Plains Cree → English

▶ http://guusaaw.oahpa.no ▶ Northern Haida → English

What happened to syntax?

▶ People have tried doing syntax with FSTs, and failed
  ▶ ... which I find nice:
  ▶ Contrary to the dominant view within theoretical linguistics, I see morphology and syntax as very different
▶ Note: an FST is a filter, as is the rule S → NP VP
  ▶ HPSG and LFG use this formalism, but have still not been able to turn it into robust grammars
▶ Our alternative:

Constraint grammar (CG) and Universal Dependencies (UD)

▶ Like UG, both CG and the other rule-based approaches (LFG and HPSG) offer deep linguistic analysis
▶ CG differs from the other rule-based approaches, though:
  ▶ in achieving high precision and recall
  ▶ in being used in end-user programs
  ▶ ... and in being a word-based, bottom-up approach
▶ The linguistic foundation of CG and UD is thus
  ▶ lexicalism: wordforms = lemma + morphosyntactic properties
  ▶ minimalism: no phrase structure nodes, empty or otherwise
  ▶ bottom-up: words and their dependencies

Constraint grammar in practice

▶ Our North Saami constraint grammar for morphosyntactic disambiguation contains thousands of hand-written rules
▶ Other parts of the CG are easier:
  ▶ POS disambiguation (e.g. for lexicography) requires only a few hundred rules
  ▶ Assigning syntactic functions requires the same
  ▶ Dependency grammar may be reused (North, Lule, South Saami, Faroese, Greenlandic!)
▶ CG can be used for other purposes as well
  ▶ Assigning semantic roles based on syntax + semantic categories
  ▶ Making grammar checkers and e-learning programs
  ▶ Lexical disambiguation for MT, dictionary lookup

Why morphosyntactic features and not glossing?

▶ FSTs are neutral on the issue
  ▶ they may provide glossing
  ▶ and suffix boundaries (we do that for our spellers)
▶ For large corpora, the interesting aspect is the outer aspect of the wordform, not the inner
  ▶ with 9 821 instances of bohtet, we do not want its inner analysis 9 821 times
  ▶ but we want to know whether this particular instance is Prs Pl3, Prt Sg2 or Imprt Pl2
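Choosing among the readings of bohtet is what a constraint grammar rule does: it looks at the neighbouring words and selects or removes readings. The sketch below is a toy Python imitation of one such rule, roughly in the spirit of a CG rule that selects the verb reading agreeing with a preceding nominative pronoun. The tag strings, the single rule, and the two-word sentences are assumptions made for the example, not rules taken from the North Saami grammar.

```python
def select(readings, wanted):
    """Keep the wanted readings; CG never removes the last reading."""
    kept = [reading for reading in readings if wanted(reading)]
    return kept or readings

def disambiguate(cohorts):
    """cohorts: list of (wordform, list of readings); returns a new list."""
    result = []
    for index, (form, readings) in enumerate(cohorts):
        left = cohorts[index - 1][1] if index > 0 else []
        # Toy agreement rule: after an unambiguous nominative pronoun,
        # select the verb readings with the same person-number tag.
        if len(readings) > 1 and len(left) == 1 and left[0].startswith("Pron"):
            tags = left[0].split()
            if tags[-1] == "Nom":
                agreement = tags[-2]   # e.g. "Pl3" in "Pron Pers Pl3 Nom"
                readings = select(
                    readings,
                    lambda r: r.startswith("V") and r.endswith(agreement),
                )
        result.append((form, readings))
    return result

BOHTET = ["V Ind Prs Pl3", "V Ind Prt Sg2", "V Imprt Pl2"]
they_come = [("sii", ["Pron Pers Pl3 Nom"]), ("bohtet", BOHTET)]
you_came = [("don", ["Pron Pers Sg2 Nom"]), ("bohtet", BOHTET)]
```

Running `disambiguate` on the two sentences leaves bohtet with the Prs Pl3 reading after sii and the Prt Sg2 reading after don; when no rule applies, all readings are kept.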

Putting the components together

[Figure: Saami language technology at the University of Tromsø – Giellatekno and Divvun: Linguistics in practice. Diagram linking grammatical analysis to end-user programs. Grammatical analysis: finite state transducer (lexicon, morphology, morphophonology); constraint grammar (disambiguate, add syntactic functions, dependency tree); analysed phonology; semantics. Resources: bilingual wordlists; bilingual transfer lexica + transfer rules. End-user programs: intelligent dictionaries; proofing tools; intelligent computer-assisted language learning (ICALL).]

The Giella infrastructure at UiT

The language-independent workflow

Our offer: Cutting the S-curve

Spellcheckers and grammarcheckers

ICALL programs

Other tools and derivatives

▶ Machine translation (Giella infra → Apertium)
▶ A pipeline for easy keyboard generation for phone and computer
▶ for TTS
▶ ...

What this then leads to

▶ Not sociolinguistics, but linguistics in society, potentially with tremendous impact on language communities
▶ ... and therefore potentially interesting to all linguists

And in case you thought I forgot:

▶ What documentary linguistics has to tell language technologists
  ▶ The facts about the language in question
  ▶ Without you, we will have nothing to formalise
▶ As a linguist I may add: Without grammars and analyses, language technology becomes boring

Conclusion

▶ Language communities wanting to take the language to new generations and domains will not manage without language technology
  ▶ Revitalising their language and putting it into use in a modern society has language technology as one of its most important components.
▶ Field linguists have so far managed without computational modelling, but it would be nice to have models for grammar testing, annotation, and corpus analysis
  ▶ both for your corpus analysis and for your forthcoming reference grammar

http://giellatekno.uit.no/doc
http://giellatekno.uit.no/doc/lang
http://divvun.no, http://divvun.org