Prendo La Parola in Questo Consesso Mondiale: a Multi-Genre 20Th Century Corpus in the Political Domain
Total Page:16
File Type:pdf, Size:1020Kb
Prendo la Parola in Questo Consesso Mondiale: A Multi-Genre 20th Century Corpus in the Political Domain Sara Tonelliy, Rachele Sprugnoliz, Giovanni Morettiyz yFondazione Bruno Kessler, Trento, Italy zUniversita` Cattolica, Milano, Italy fsatonelli,[email protected] [email protected] Abstract major issues, especially in those countries where no or only limited public initiatives have been un- English. In this paper we present a multi- dertaken to support the distribution of this kind genre corpus spanning 50 years of Euro- of documents. For example, while in the US pean history. It contains a comprehensive the Federal Digital System grants access to pub- collection of Alcide De Gasperi’s pub- lic Presidential documents through APIs and bulk- lic documents, 2,762 in total, written or data repositories, in Italy an effort along this line transcribed between 1901 and 1954. The has started only recently with the support of the corpus comprises different types of texts, Archive of the President of the Republic6, but has including newspaper articles, propaganda not delivered substantial results so far. documents, official letters and parliamen- tary speeches. The corpus is freely avail- This work represents a first attempt to deal with able and includes several annotation lay- this lack of data, since we present and make avail- ers, i.e. key-concepts, lemmas, PoS tags, able a large corpus of Italian public documents in person names and geo-referenced places, the political domain. In particular, we release a representing a high-quality ‘silver’ anno- comprehensive collection of Alcide De Gasperi’s tation. We believe that this resource can public documents issued between 1901 and 1954, foster research in historical corpus anal- which had been previously published in four vol- ysis, stylometry and computational social umes by Il Mulino (De Gasperi, 2006; De Gasperi, science, among others.1 2008a; De Gasperi, 2008b; De Gasperi, 2009) but were not machine-readable. Our repository con- 1 Introduction tains all documents in three formats: txt, XML and tab-separated. Raw text files contain only In recent years, political scientists and history the body of the documents, and may be straight- scholars have started to exploit the availability of forwardly used to extract embeddings or topics. digital material to enrich their research, taking ad- XML files include metadata that cover not only the vantage of freely accessible online archives and title, the date and the place of publication, but also easy-to-use tools for text processing and data ex- key-concepts automatically extracted from each traction. Active communities have been created text and genre labels manually assigned by do- around topics such as the study of Parliamentary main experts. Furthermore, the release includes corpora (see the ParlaCLARIN2 and ParlaFormat silver annotation for lemma, part of speech, per- workshops3), the analysis of political manifestos4 son names and place names with associated co- and of Presidential speeches.5 Despite the impor- ordinates in a CoNLL-like format. All files and tance of this research field, copyright and avail- the corresponding descriptions can be downloaded ability in machine-readable format still represent at https://dh.fbk.eu/technologies/ 1Copyright ©2019 for this paper by its authors. Use per- corpus-de-gasperi (with CC BY-NC-SA li- mitted under Creative Commons License Attribution 4.0 In- ternational (CC BY 4.0). cense). The corpus can also be navigated using 2https://www.clarin.eu/ParlaCLARIN the ALCIDE platform (Moretti et al., 2016) at this 3https://www.clarin.eu/event/2019/ link: http://alcidedigitale.fbk.eu/. parlaformat-workshop 4https://manifesto-project.wzb.eu/ 5https://www.presidency.ucsb.edu/ documents 6https://archivio.quirinale.it/aspr/ 2 Related Work notated and then partially revised by hand (De Fe- lice et al., 2018). Compared with these two last The political domain has been studied in com- works, our corpus is broader, having a multi- putational linguistics from various perspectives. layered semantic analysis, and completely avail- Annotated corpora have been created to analyse able for download in different formats, thus open rhetoric and metaphors in political communica- to further analysis by the research community. tion (Cardie and Wilkerson, 2008; Ahrens et al., 2018), study the impact of speeches on the audi- 3 Corpus Description ence (Guerini et al., 2013; Thomas et al., 2006) and understand the relationship between ideol- Our corpus contains the complete collection of ogy and linguistic complexity (Schoonvelde et al., public documents by Alcide De Gasperi, the first 2019). Resources have also been developed to Prime Minister of the Italian Republic and one of train and test automatic systems for several types the founding fathers of the European Union. It in- of NLP tasks, such as persuasiveness prediction cludes 2,762 documents published between 1901 (Strapparava et al., 2010), sentiment and emotion and 1954, for a total of around 3,000,000 tokens. analysis (Young and Soroka, 2012; Rheault et al., The corpus is released as raw text, as XML with 2016), text classification (Yu et al., 2008), topic- a minimal set of meta-data and associated key- based agreement detection (Menini et al., 2017) concepts, and as CoNLL-like format, with addi- and recognition of ideological positions (Hirst et tional information that have been fully or semi- al., 2010). automatically annotated (see Section 4). Texts, Many research activities have recently dealt date and place of publication were automatically with the digitisation and release of corpora con- generated starting from the PDF files used to is- taining historical political texts. For example, the sue the volumes edited by Il Mulino. Each doc- corpus of speeches given in the British Parlia- ument of the collection was classified manually ment from 1803 to 2005 (i.e. the Hansard Cor- by a group of history scholars on the basis of a pus) has been automatically tagged using the His- two-layered hierarchy that takes into consideration torical Thesaurus Semantic Tagger (Piao et al., whether the text was originally released in an oral 2014; Wattam et al., 2014) and then a part of it or written form, and its specific genre. It is impor- has been semantically enriched with information tant to note that different text genres correspond to about speakers and topics (Nanni et al., 2019). different roles covered by De Gasperi during his In addition, the Canadian Parliamentary Debates life: e.g. daily press when he worked as a journal- (1901-present) have been standardised, enriched ist for newspapers in Trentino, speeches in institu- and distributed within the “Digging into Linked tional venues when he was a Member of the Italian Parliamentary Data” project (Beelen et al., 2017). Parliament. The period from 1947 to 2017 is instead covered History scholars identified also four time spans by a dataset of Dutch and Danish party congress to which each document can be assigned, that speeches (Schumacher et al., 2019). characterise different periods in De Gasperi’s life. As for Italian, to the best of our knowledge, These correspond to the four volumes of the the only available comprehensive study of the lan- printed edition and are used to split the corpus into guage of Italian politicians is the one by Bolasco different periods based on the date of publication: (2015). He analyses the parliamentary proceed- Vol. I : De Gasperi was a journalist and a students’ ings of the Italian Chamber of Deputies in the pe- leader. He was active mainly in Trento and in 7 riod 1953-2008 using the TalTac2 software , thus the Austrian Parliament (1901 – 1918). providing a lexical and statistical analysis. An- other project related to our work is “Voci della Vol. II : De Gasperi founded Partito Popolare, be- Grande Guerra” whose online platform allows to came Parliament member in Rome and then explore a corpus of documents related to the first left the Italian political life for several years World War including samples of parliamentary after opposing the Fascist regime, working at proceedings and political speeches (Lenci et al., the Vatican library and as a publicist (1919 – 2016). Similarly to what we present in this pa- 1942). per, such documents have been automatically an- Vol. III : De Gasperi founded the Christian- 7http://www.taltac.it/ Democratic Party, became Prime Minister Document Type Number concepts, that is a weighted list of n-grams rep- Monographs / Prefaces 4 Daily press 963 resenting the most important concepts of a text, Written documents Magazines 228 automatically extracted using KD (Moretti et al., Official documents 433 2015). Electoral / propaganda 473 Speeches Party conferences 188 Institutional venues 419 5 Annotation Evaluation Not specified Not specified 54 We evaluated the quality of the automatic anno- Table 1: Genre labels with corresponding statis- tation produced by TextPro modules on a subset tics. of our corpus. Indeed, since these modules were developed to perform best on contemporary texts, and was Italian delegate at the World War II and typically trained on news, it is important to peace conference (1943 – 1948). assess to what extent they can be reliably used on Italian documents of the XX Century in the polit- Vol. IV : After Christian Democracy led by De ical domain. To this end we manually annotated Gasperi won the first general elections of the a gold standard made of documents written by Italian Republic, he launched a plan of re- De Gasperi between 1906 and 1911 for a total of forms to reconstruct Italy including social 8,872 tokens. We chose texts belonging to the first housing, labor policy and unemployment in- period of De Gasperi’s life because they are the surance (1949 – 1954). oldest in the corpus and therefore the most linguis- tically different from the texts used for training the 4 Annotated Information modules.