Swedish Diachronic Texts Resources and User Needs to Consider in a Swedish Diachronic Corpus
Total Page:16
File Type:pdf, Size:1020Kb
RAPPORTER FRÅN SWE-CLARIN SWE-CLARIN REPORT SERIES sweclarin.se/scrs [SCR-02-2019] Swedish diachronic texts Resources and user needs to consider in a Swedish diachronic corpus Eva Pettersson1 and Lars Borin2 1Department of Linguistics and Philology, Uppsala University, eva.pettersson@lingfil.uu.se 2Språkbanken Text, University of Gothenburg, [email protected] CONTENTS 1 Introduction 1 2 Corpus providers 3 2.1 Alvin................................3 2.2 Dramawebben...........................3 2.3 Fornsvenska textbanken......................4 2.4 The Gender and Work project (GaW)...............5 2.5 HaCOSSA.............................5 2.6 Jämtlands läns fornskriftsällskap..................6 2.7 Litteraturbanken..........................6 2.8 Medieval Nordic Text Archive (Menota)..............7 2.9 Project Runeberg and Project Gutenberg..............7 2.10 Samnordisk runtextdatabas.....................8 2.11 Språkbanken Text..........................9 2.12 Svenska fornskriftställskapet....................9 2.13 Vetenskapssocieten i Lund..................... 10 3 Corpora 11 3.1 Runic Swedish (800–1225)..................... 12 3.2 Old Swedish (1225–1526)..................... 13 3.2.1 Religious texts....................... 13 3.2.2 Secular prose....................... 15 ii Swedish diachronic texts 3.2.3 Letters and charters.................... 15 3.2.4 Scientific text (medicine)................. 16 3.2.5 Court records....................... 17 3.2.6 Laws and regulations................... 18 3.2.7 Accounts and registers.................. 18 3.2.8 Old Swedish: Summary.................. 19 3.3 Early Modern Swedish (1526–1732)................ 21 3.3.1 Religious texts....................... 21 3.3.2 Secular prose....................... 22 3.3.3 Diaries and personal stories................ 22 3.3.4 Song texts......................... 23 3.3.5 Periodicals......................... 23 3.3.6 Letters and charters.................... 23 3.3.7 Academic text....................... 24 3.3.8 Court records....................... 24 3.3.9 Laws and regulations................... 26 3.3.10 Governmental texts.................... 26 3.3.11 Accounts and registers.................. 26 3.3.12 Maps............................ 27 3.3.13 Text fragments....................... 28 3.3.14 Early Modern Swedish: Summary............ 28 3.4 Late Modern Swedish (1732–1900)................ 30 3.4.1 Religious texts....................... 30 3.4.2 Secular prose....................... 30 3.4.3 Diaries and personal stories................ 32 3.4.4 Periodicals......................... 33 3.4.5 Newspaper text...................... 33 Resources and user needs to consider in a Swedish diachronic corpus iii 3.4.6 Letters........................... 34 3.4.7 Academic and scientific text............... 34 3.4.8 Court records....................... 34 3.4.9 Laws and regulations................... 35 3.4.10 Governmental texts.................... 36 3.4.11 Late Modern Swedish: Summary............. 36 3.5 Contemporary Swedish (1900–).................. 38 3.5.1 Religious texts....................... 38 3.5.2 Secular prose....................... 38 3.5.3 Diaries and personal stories................ 39 3.5.4 Song texts......................... 39 3.5.5 Periodicals......................... 39 3.5.6 Newspaper text...................... 40 3.5.7 Letters........................... 40 3.5.8 Academic and scientific text............... 40 3.5.9 Court records....................... 41 3.5.10 Laws and regulations................... 42 3.5.11 Governmental texts.................... 42 3.5.12 Essays and school-related texts.............. 42 3.5.13 Accounts and registers.................. 42 3.5.14 Social media text and Wikipedia............. 43 3.5.15 Contemporary Swedish: Summary............ 43 4 User questionnaire 46 4.1 Questionnaire responses: summary................. 47 4.1.1 Research questions for a diachronic corpus........ 47 4.1.2 Previously used corpora.................. 48 4.1.3 Time periods covered................... 50 iv Swedish diachronic texts 4.1.4 Division into time periods................. 51 4.1.5 Text types covered..................... 51 4.1.6 The definition of Swedish................. 52 4.1.7 Search parametres..................... 53 4.1.8 Metadata information................... 54 4.2 Questionnaire responses: Discussion and implications...... 55 5 Summary and conclusions 57 5.1 Future work............................. 59 References 62 1 INTRODUCTION HE CLARIN research infrastructure1 (short for “Common Language Re- T sources and Technology Infrastructure”) aims to make digital language resources available to researchers from all disciplines, with a special focus on the humanities and social sciences. As part of the activities in the Swedish CLARIN node, Swe-Clarin,2 we aim to develop a freely accessible Swedish diachronic corpus. We strongly believe that the existence of such a resource would be very valuable to facilitate large-scale research on Swedish language change, and to enable comparative studies of the Swedish language develop- ment as compared to other languages for which diachronic corpora exist. In a preceding companion report, we investigated the structure and contents of diachronic and historical corpora available for other languages, with the aim to identify important aspects to be taken into consideration in the development of a Swedish diachronic corpus, and how the corpus could be structured in order for it to be comparable to other diachronic and historical corpora (Pettersson and Borin 2019). When planning for the structure and contents of the Swedish diachronic corpus, it is also crucial to have an idea of what text types and amounts of text that are available for different time periods throughout the history of the Swedish language. In the current report, we therefore continue our work towards a Swedish diachronic corpus by examining textual resources available for the Swedish language, for different time periods and for different genres. In addition, we want to take the needs and wishes of the primary target users into consideration when building the corpus. For this reason, we sent out a questionnaire to a number of researchers in the humanities with a special interest in historical linguistics and the development of the Swedish language, asking for their experience of historical corpora, and what features that are important in order for a Swedish diachronic corpus to be useful for them. 1https://www.clarin.eu/ 2https://sweclarin.se/ 2 Swedish diachronic texts The report is structured as follows: In Section2, we present corpus providers of special interest to our goals, and in Section3, we give an overview of avail- able textual resources for Swedish, based on the traditional division into time periods, i.e. Runic Swedish (appr. 800–1225), Old Swedish (appr. 1225–1526), Early Modern Swedish (appr. 1526–1732), Late Modern Swedish (appr. 1732– 1900), and Contemporary Swedish (appr. 1900–) (Bergman 1995). The an- swers to the user questionnaire are reported in Section4. Finally, the findings are summarized and conclusions are drawn in Section5, including some direc- tions for future work. 2 CORPUS PROVIDERS N THIS SECTION, we give a short presentation of corpus providers that are I of special interest to the development of a Swedish diachronic corpus. 2.1 ALVIN Alvin3 is a national platform for long-term preservation of and accessibility to digitised collections and digital cultural heritage material from Swedish cul- tural heritage institutions. It is hosted by the Uppsala University Library, in collaboration with the Gothenburg University Library and Lund University Li- brary. Several other cultural heritage institutions are also members of Alvin. The Alvin database contains a wide range of genres, and is constantly growing, as new material is digitised by the libraries. Most of the digitised material in Alvin is free to download and use, but material from the 1920s and onwards may be copyright-protected. Unfortunately for our purposes, most of the texts in Alvin are only available as images, and there is currently no indexing in the database to tell which documents that are also available in a transcribed format. More information on the Alvin platform can be found here: • Alvin webpage: http://info.alvin-portal.org/ 2.2 DRAMAWEBBEN Dramawebben4 (The Drama Web) aims to make texts representative of the Swedish history of theatre available to the general public. The corpus includes first edition dramas for adults and for children, distributed over widely differ- ent genres, such as comedy, tragedy, drama, farce, and fairy plays. In total, 3http://info.alvin-portal.org/ 4https://litteraturbanken.se/dramawebben 4 Swedish diachronic texts Dramawebben distributes 456 dramas, written by 49 male and 78 female au- thors, born in the time period 1579–1894. About 200 of these plays have been OCRed. Dramawebben is part of Litteraturbanken (see further Section 2.7) since 2018. Part of the corpus (currently 790,456 tokens) is also available for download via Språkbanken Text (see further Section 2.11), and searchable through the Korp interface. More information on Dramawebben can be found here: • Dramawebben webpage: https://litteraturbanken.se/dramawebben/om • Dramawebben at Språkbanken Text: https://spraakbanken.gu.se/swe/resurs/ drama 2.3 FORNSVENSKA TEXTBANKEN Fornsvenska textbanken (Delsing 2002) is a collection of machine-readable editions of Old Swedish and Early Modern Swedish texts, covering the time period 1162–1758. The collection currently contains