Language Technology Meets Documentary Linguistics: What We Have to Tell Each Other

Trond Trosterud
Giellatekno, Centre for Saami Language Technology
http://giellatekno.uit.no/
February 15, 2018

Contents

- Introduction
- Language technology for the documentary linguist
- Language technology for the language society
- Conclusion

Introduction

- Giellatekno: started in 2001 (UiT). Research group for language technology on Saami and other northern languages
  - Gramm. modelling, dictionaries, ICALL, corpus analysis, MT, ...
  - Trond Trosterud, Lene Antonsen, Ciprian Gerstenberger, Chiara Argese
- Divvun: started in 2005 (UiT < Min. of Local Government). Infrastructure, proofing tools, synthetic speech, terminology
  - Sjur Moshagen, Thomas Omma, Maja Kappfjell, Børre Gaup, Tomi Pieski, Elena Paulsen, Linda Wiechetek

The most important languages we work on

Language technology for the documentary linguist

Language technology for documentary linguistics ...

... what’s in it for the language community I work for?

- Let’s pretend there are two types of language communities:
  1. Language communities without plans for revitalisation or use in domains other than oral use
  2. Language communities with such plans

Language communities without such plans

- Gather empirical material and do your linguistic analysis
  - (The triplet: Grammar, text collection and dictionary)
- ... thereby doing much of the same groundwork as e.g. the neogrammarians:
  - A dictionary containing all words + (parallel) text collections
  - A phonemic and grammatical analysis
  - A reconstruction of protolanguages at different levels, based upon the analysis of related languages
- The Fenno-Ugricists actually did this

Computational methods as tools for the field linguist?

- Simple letter string analysis:
  - Reverse-sort your lexica
    - ... | rev | sort | rev | ...
  - «Give me all stems with the vocalism -a-e-»
    - ... | egrep 'a[^aeiou]+e' | ...
  - Do wordform statistics (also taking variation into account)
    - [å|á|à|ä] (->) a , ...
- «Unix for poets» and beyond

Test your grammar

- Let’s say you are a grammarian,
- you write a morphophonological analysis of a language
  - Vowel harmony seems to be conditioned by these stem patterns
- or want to make a generalisation over sandhi phenomena
  - Combining affixes with the following segments will result in these and these changes
- then you could write a computational model of your grammar
- ... and test whether your generalisations actually hold:
  - on a corpus of (all) available text
  - or by generating paradigms with your alleged rules and inspecting the result
- (or you could do it manually, as before)

Is it really worth it (for the linguist)?

- The reverse-sorting? Definitely, yes.
- But for the automatic grammar testing it is not that clear:
  - Pro:
    - Making a computational model of your grammar will force you to be explicit and comprehensive
    - And yes, it will tell you when your generalisations do (not) hold, and to what extent there is variation in your material
  - Con:
    - It is a machinery you have to learn
    - ... and perhaps you are explicit and comprehensive anyway

Note: There is an extra bonus!

- Our imagined language society had no intention whatsoever of using the language outside traditional oral settings, or of revitalising it, but ...
- Suddenly the next generation changes its mind (many poorly documented languages have lots of speakers)

Language technology for the language society

Language communities with plans for revitalisation / language use in the modern society

- You now need to write a model for your language, and standardise it to some extent
- Note the conflicting interests:
  - The linguist wants an orthography with all distinctions (you know what I mean)
  - The L1 speaker needs to produce and recognize the words
  - The L2 speaker needs cues to pronounce the words, and a stable norm to remember them
- (We linguists have a sad tradition of confusing transliteration and orthography, and our needs with the needs of the language community)

For the computer, variation is not a problem

- Machines are far better than humans at handling complexity and variation
- Our way of modelling grammars in Tromsø: The finite state transducer
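
Before turning to transducers, the simple letter-string analyses recommended earlier for the field linguist (reverse-sorting, the vocalism query, wordform statistics that fold spelling variants) can be sketched in a few lines of Python. This is only an illustration of the shell pipelines on the slide, not the author's tooling: the function names are hypothetical, and folding away every diacritic is a deliberately crude assumption that would erase real contrasts in many orthographies.

```python
import re
import unicodedata
from collections import Counter


def reverse_sort(words):
    """Sort words by their endings, like `rev | sort | rev`."""
    return sorted(words, key=lambda w: w[::-1])


# Same pattern as the slide's egrep: a, one or more non-vowels, then e.
# Assumption: only plain a/e/i/o/u count as vowels, as in the original pattern.
VOCALISM_A_E = re.compile(r"a[^aeiou]+e")


def stems_with_vocalism(words):
    """Return the words that contain the vocalism -a-e-."""
    return [w for w in words if VOCALISM_A_E.search(w)]


def fold_variants(word):
    """Map å, á, à, ä (and other accented letters) to their base letter."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def wordform_counts(tokens):
    """Count wordforms after lowercasing and folding spelling variants."""
    return Counter(fold_variants(t.lower()) for t in tokens)
```

For example, `reverse_sort(["koira", "katu", "talo", "kala"])` groups words by ending and returns `["kala", "koira", "talo", "katu"]`, and `wordform_counts(["Gállá", "galla", "gálla"])` counts all three spellings as one wordform, `galla`.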

The solution that fell out of fashion

Table: The Chomsky hierarchy

Grammar  Languages               Automaton                                       Production rules (constraints)
Type-0   Recursively enumerable  Turing machine                                  α → β (no restr.)
Type-1   Context-sensitive       Linear-bounded nondeterministic Turing machine  αAβ → αγβ
Type-2   Context-free            Non-deterministic pushdown automaton            A → γ
Type-3   Regular                 Finite state automaton (FST)                    A → a and A → aB

The limitation of finite state transducers

- To Chomsky: There are aspects of human language that FSTs do not cover (e.g. Swiss German)
  - But this is ok, we may ignore those aspects
  - ... and there is (in my opinion) a difference between syntax and morphology anyway
- To me (my dissertation): FSTs tell me what, but not why
  - But this is ok as well, this is the limitation of all formal systems
  - The answer to the question why must be sought from outside the formal system

So why FSTs, and what do they give us?

- A way of turning a formal model of the phonology and morphology into a bidirectional machine
- The technology is not new (Koskenniemi 1983), but its virtues do not lie in its fashionability
- This bidirectional machine may
  - ... immediately tell us whether our generalisations are empirically adequate or not
  - ... and provide the foundation for practical programs

What is an FST anyway?
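
To make the idea of a bidirectional machine concrete, here is a minimal sketch of a finite state transducer in plain Python. It is a toy built by hand for the Finnish nouns koira and katu from the lexicon slide, not a compiled lexc/twolc grammar and not the Giellatekno infrastructure. The class, the function names and the symbol alignment (tags paired with nothing, `+Nom` paired with `t`, and `t:d` for the weak grade) are assumptions made for illustration.

```python
import re
from collections import defaultdict

EPS = ""  # epsilon: the arc reads or writes nothing on that side


class FST:
    """A tiny acyclic transducer: arcs carry (upper, lower) symbol pairs."""

    def __init__(self):
        self.arcs = defaultdict(list)  # state -> [(upper, lower, target)]
        self.finals = set()
        self._next_state = 1           # state 0 is the start state

    def add_path(self, pairs):
        """Add one path of symbol pairs from the start state, sharing prefixes."""
        state = 0
        for upper, lower in pairs:
            for u, l, target in self.arcs[state]:
                if (u, l) == (upper, lower):
                    state = target
                    break
            else:
                target = self._next_state
                self._next_state += 1
                self.arcs[state].append((upper, lower, target))
                state = target
        self.finals.add(state)

    def apply(self, symbols, down=True):
        """down=True maps upper to lower (generation), down=False the reverse (analysis)."""
        results = []
        stack = [(0, 0, [])]  # (state, position in input, output so far)
        while stack:
            state, pos, out = stack.pop()
            if pos == len(symbols) and state in self.finals:
                results.append(out)
            for upper, lower, target in self.arcs[state]:
                src, dst = (upper, lower) if down else (lower, upper)
                emitted = out + ([dst] if dst != EPS else [])
                if src == EPS:
                    stack.append((target, pos, emitted))
                elif pos < len(symbols) and symbols[pos] == src:
                    stack.append((target, pos + 1, emitted))
        return results


fst = FST()
TAGS = [("+N", EPS), ("+Pl", EPS), ("+Nom", "t")]
# koira+N+Pl+Nom <-> koirat
fst.add_path([(c, c) for c in "koira"] + TAGS)
# katu+N+Pl+Nom <-> kadut (t:d is the weak-grade alternation)
fst.add_path([("k", "k"), ("a", "a"), ("t", "d"), ("u", "u")] + TAGS)


def generate(analysis):
    """Analysis string to wordform(s), e.g. 'katu+N+Pl+Nom'."""
    symbols = re.findall(r"\+[A-Za-z0-9]+|.", analysis)
    return ["".join(r) for r in fst.apply(symbols, down=True)]


def analyse(wordform):
    """Wordform to analysis string(s); an empty list means 'not in the model'."""
    return ["".join(r) for r in fst.apply(list(wordform), down=False)]
```

The same arcs are read in both directions: `generate("katu+N+Pl+Nom")` gives `["kadut"]`, and `analyse("kadut")` gives `["katu+N+Pl+Nom"]`. A form the machine does not accept, such as `katut`, comes back with no analysis, which is the kind of signal a spell checker built on such a model could use.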

How we write it

Morphology:

    LEXICON NOUNS
    koira CASELEX ;
    katu CASELEX ;
    ...

    LEXICON CASELEX
    +N+Pl+Nom:^WGt # ;
    ...

Morphophonology:

    t:d <=> _ Vow ^WG: ;

The task for the grammarian:

- Distinguish segmental from suprasegmental
- make one model of each, and join them together

FSTs as spell checkers

The effect of introducing proofing tools

- ... is unfortunately an under-studied topic
- A personal note
  - My Saami correspondence is dependent upon the speller, and others tell the same
  - Ávvir, the North Saami daily, is dependent upon it
  - ... whereas NRK Sápmi do not use proofing tools, and have a home page with 7.2% Saami
- The reactions in Greenland to getting a spellchecker were:
  - «Of course there is a spellchecker for Greenlandic! The Danes have it, so why not we?»

Proofing tools may be controversial

- Fluent but analphabetic elderly speakers may be positive
- The people who invented / govern the orthography may protest against proofing tools
  - They see their knowledge monopoly crumble:
  - «The young will not need to learn the language when there are proofing tools»
- Revitalisers and language learners love the proofing tools

Our number 1 success story in language communities: e-dictionaries

Table: Amount of base forms
