Russian National Corpus
Russian National Corpus
ruscorpora.ru
Ekaterina Rakhilina, Vladimir Plungian, Olga Lyashevskaya, Dmitry Sichinava
RNC Workshop, SCLC 2014
16 Feb 2014, Harvard University

Preliminary plan
  Russian National Corpus, Season 2014:
    hints and tricks
    new features and plans
  Corpus data for offline research
  Discussion
  Your input is much appreciated!

Main participants
  V.V. Vinogradov Russian Language Institute, Russian Academy of Sciences, Moscow
  Yandex, Internet and technology company

Ilya Segalovitch (1964-2013)
  chief technical officer of Yandex

RNC non-commercial partnership
  universities (Moscow, Saint-Petersburg, Saratov, etc.)
  research institutes (IPPI RAN, ILI RAN)
  IT-companies
  personal membership
  You are welcome to share your corpus data through RNC!
  New goals: Licensing issues and data distribution.

Corpora, RusGram, Statistics & offline data, Dictionaries

RUSCORPORA.RU family
  The main corpus of written Modern Russian (1700-present, 230 MW)
  Newspapers & news (2000-present, 174 MW)
  Corpus of Russian poetry (10 MW)
  Spoken corpus (11 MW)
  Multimedia corpus (4 MW)
  Accentuated corpus (14 MW)
  Parallel corpora (54 MW)
  Syntactic treebank (0.7 MW)
  Corpus of Russian dialects
  Russian-for-Schools corpus

RUSCORPORA.RU - new corpora
  Diachronic corpora:
    Old Russian
    Church-Slavonic
    Middle Russian
  Blogger corpus
  Learner corpora

The full body of search results is freely available online

... also in KWIC format

Sorting results

Saving results in Excel format

Customizing subcorpus
  The main corpus:
    Modern fiction of various genres
    Modern drama
    Memoirs and biographies
    Journalism and literary criticism
    Scientific, popular scientific and teaching texts
    Religious and philosophical texts
    Technical texts
    Business and jurisprudence texts
    Day-to-day life texts, including texts not intended for publication (letters, diaries, etc.)

Hints & tricks
  sorting: надо же было ...
  раз...ся (рас...ся)
  Мама мыла раму.
  hypocoristic personal names not ending with *чка, *нька
  use word-formation
  вс- prefix
  also with possible alternations
  also on the 2nd place

Recent news from the RNC
  Poetry: up to 1990-2002
  MURCO: Multi-media corpus (movies, talks, etc.)
    types of speech situations (welcome, questioning, interview, dispute, quarrel, etc.)
    gestures
    + gestures provided by speech
    + academic talks & discussions
    + Parallel Spoken Russian: Gogol's Revizor on many stages (MultiParC)
  Diachronic evidence (Russian in XII-XVII cc.)
  Parallel corpora

Corpus of Russian poetry

RUSCORPORA.RU - new corpora
  Diachronic corpora
    Old Russian & Birch letters
    Church-Slavonic
    Middle Russian
  Slavic parallel corpora
  Blogger corpus
  Learner corpora

Old-Russian

RNC annotation: the main corpus
  Four major annotation layers:
    meta-textual annotation: register/genre, author, creation date, size, etc.
    word-level morphosyntactic annotation: lemma, POS, inflectional categories, distorted or anomalous forms, etc.
    accentual annotation: normative place of accent, accentual shifts in fixed expressions
    lexico-semantic annotation: lexical classes of verbs, nouns, pronouns, adjectives and adverbs
  + new! word-formation annotation: prefixes, suffixes, roots

N-gram viewer
  http://ruscorpora.ru/ngram.html
  word forms - Графики (Graphs), cf. Google Books Ngram Viewer
    + wildcards: *сторонился
    year span
    by date of creation, not date of publishing (cf. Google Books)
    smoothing (3 to 20 is recommended)
  lemmas, not words - Распределение по годам (Distribution by year; output page)
    Статистика по метаатрибутам (Statistics by meta-attributes)

Графики: сторонился, посторонился, *сторонился
  Year: 1800...2010
  Smoothing: 10

Annotation mistakes and how to fix them
  Please tag mistakes if you come across them in the output data

Even more Russian corpora in cooperation with the RNC
  "Simple" Russian (HSE in Nizhny Novgorod)
    "we cannot ask 5-year-old children to read examples from the corpus" (NB students!)
    a subcorpus of short simple sentences, frequent words from the "lexical minimum"
  "Non-perfect" Russian
    Heritage language in Finland and USA (study of language interference)
    Russian as L2 in Daghestan and other parts of Russia
    Learner corpus of academic writing

Корпус Академического Письма (Corpus of Academic Writing)
  http://web-corpora.net/RussianAcademCorpus/search/
  Essays, drafts of term papers, other academic texts written by students
  sociology, economics, politics, law, psychology, linguistics, management, etc.
  1 MW available so far

Corpus of academic writing
  Three levels of mistake annotation:
  1) linguistic type (orthography, punctuation, lexical choice, grammatical choice & form, discourse-oriented)
  2) weight (minor mistake, medium level, major/critical mistake)
  3) interpretation: what is the cause? (misprint, wrong synonym, mix of constructions, etc.)

Russian learner corpus
  Heritage language
  http://web-corpora.net/RussianLearnerCorpus/search
  National Heritage Language Resource Center (UCLA)
  Polinsky Lab at Harvard
  O. Kisselev, A. Alsufieva, I. Dubinina et al.
  E. Rakhilina and her research lab at HSE

Some examples
  Эти ноутбуки потребляли меньше энергии, но были менее компактнее по объему.
  И прибыль от разрушения гораздо более заметна и быстра, нежели чем от строительства.
  В русском языке семантический диапазон данного слова чрезвычайно широк, нежели в английском (Academic Writing Corpus)
  В России человек больше (!
чаще)чаще считается расистом из за действий (Heritage Corpus)

Corpora, RusGram, Statistics & offline data, Dictionaries

RusGram
  Corpus-based Russian reference grammar
  traditional academic grammar (академическая грамматика)
  morphology (inflection)
  syntax
  + RNC-based statistics
  + lexical anchors in focus
  substandard Russian: negative evidence or "points of future development"?

rusgram.ru

Corpus-based dictionaries
  http://dict.ruslang.ru/
  Frequency dictionary of Modern Russian
    offline version available from my homepage
  New grammatical dictionary
  Russian idiomaticity in real usage (with frequencies):
    Which adjectival intensifier can we use with nouns?
    Which verb can we use with abstract nouns?
  Framebank (the dictionary of argument-predicate constructions attested in the RNC)
    offline release summer 2014

Corpus-based dictionaries
  In progress:
    Grammatical forms of Russian lexemes
    Paradigms of verbs, nouns, adjectives
    Distribution by time & text registers
    Lexical classes: comparative study

Corpora, RusGram, Statistics & offline data, Dictionaries

Statistics & offline use
  Overall idea: to show patterns in your output
    statistics
    visualization
  But: the RNC corpus workbench is not adapted to work with a customized set of data
  Step 1: N-grams

N-grams search (Beta!)
  2-, 3-, 4-, 5-word chains
  не до *
  потрясающе (*о | *е)
  Most frequent N-grams - ЧАСТОТЫ (Frequencies)
  In progress: Search by lemma, morphology, semantics, word formation
  In progress: Explore time & text registers + in any subcorpus of your choice
  In progress: Search with distance between words (incl. repetitions)

Offline data for advanced users & computational resources
  NB! We are linguists, not lawyers: we cannot distribute texts
  But: we can share annotations & statistics on this data
  So far:
    ЧАСТОТЫ (Frequencies): 2-, 3-, 4-, 5-grams, http://ruscorpora.ru/corpora-freq.html
    1 MW Morphological standard (manually disambiguated, shuffled sentences)
  Plans:
    N-grams for other corpora + annotated data
    POS-annotations etc.: V-S-S-CONJ-ADJ-S.
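
For readers who want to try the idea offline, the n-gram frequency lists and wildcard queries (such as не до *) mentioned above can be approximated with a few lines of Python. This is only a minimal sketch under stated assumptions: tokenization is plain whitespace splitting, the three sample sentences are invented for illustration, and the function names (ngrams, ngram_counts, match_pattern) are hypothetical and not part of any RNC tool. Counting stays inside each sentence, so no chain crosses a sentence boundary.

```python
from collections import Counter
from fnmatch import fnmatchcase


def ngrams(tokens, n):
    """Return an iterator over every contiguous chain of n tokens."""
    return zip(*(tokens[i:] for i in range(n)))


def ngram_counts(sentences, n):
    """Count n-grams sentence by sentence (no chains across sentences)."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()  # assumption: whitespace tokenization
        counts.update(ngrams(tokens, n))
    return counts


def match_pattern(counts, pattern):
    """Return (n-gram, frequency) pairs matching a query like 'не до *'.

    Each space-separated part of the pattern is a shell-style wildcard
    applied to the token in the same position.
    """
    parts = pattern.split()
    hits = [
        (gram, freq)
        for gram, freq in counts.items()
        if len(gram) == len(parts)
        and all(fnmatchcase(tok, part) for tok, part in zip(gram, parts))
    ]
    # Most frequent first; ties broken alphabetically for stable output.
    return sorted(hits, key=lambda item: (-item[1], item[0]))


# Invented toy data, not taken from the corpus.
sentences = [
    "мне сейчас не до шуток",
    "ему было не до смеха",
    "нам не до шуток",
]
trigrams = ngram_counts(sentences, 3)
for gram, freq in match_pattern(trigrams, "не до *"):
    print(" ".join(gram), freq)
```

The same match_pattern call works on any table of n-gram counts, for example one loaded from a downloaded frequency list, provided it is first converted to a Counter keyed by token tuples.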
studiorum.ruscorpora.ru
  A companion web site to the RNC
  Corpus methods in linguistic research
  Corpus in teaching Russian as a second language
  Corpus in teaching linguistics, Russian stylistics, philology and social sciences
  Corpus in teaching Russian in school
  References (incl. PhD manuscripts and term papers)
  Corpus resources
  F.A.Q.

Discussion
  Any questions? comments? complaints?
  What would you like to see in the corpus?
  Known issues

Known issues
  1. A bag of words
     Solution: TBA soon
     Lemma: дуло 'muzzle'
     Gram: V
     annotated n-grams
  2. *базар* database search (разбазарить, разбазаривать, пробазарить, базарчик, Базаров)
     NB word-formation: just words in the dictionary
  3. Search across sentence boundaries
  4. Unbalanced portions of data across time
     который и, в, на, они не

Thank you! Спасибо!
http://ruscorpora.ru

Appendix: RNC annotation layers
  meta-text info
  morphology
  lexico-semantic classes

RNC annotation: the main corpus
  Four major annotation layers:
    meta-textual annotation: register/genre, author, creation date, size, etc.
    word-level morphosyntactic annotation: lemma, POS, inflectional categories, distorted or anomalous forms, etc.
    accentual annotation: normative place of accent, accentual shifts in fixed expressions
    lexico-semantic annotation: lexical classes of verbs, nouns, pronouns, adjectives and adverbs
  + new! word-formation annotation: prefixes, suffixes, roots

Subcorpora and meta-textual parameters

Morphological parsing
  Zaliznjak's (1967, 1977) formal model of Russian inflection
  A set of parsers based on the Grammatical dictionary
  MYSTEM (Segalovich 2003) and DIALING (Sokirko 2004) morphological parsers in use
  Lemma, POS and grammatical features:
  Examples: взял ‘take.PAST’ <ana lex="взять"