Identifying Populist Paragraphs in Text: A machine-learning approach

Authors: Jogilė Ulinskaitė¹ and Lukas Pukelis²

¹ Lecturer, Vilnius University Institute of International Relations and Political Science (VU IIRPS)
² Data Scientist, Public Policy and Management Institute (PPMI)

Abstract: In this paper we present an approach to developing a text-classification model that can identify populist content in text. The developed BERT-based model is largely successful in identifying populist content and produces only a negligible number of false negatives, which makes it well suited as a content-analysis automation tool that shortlists potentially relevant content for human validation.

Introduction

This paper presents our attempt to develop a machine-learning (ML) model to detect populist content in text. If successful, this model could benefit many researchers by automating the most resource-intensive part of the research and enabling more extensive and more ambitious research projects. This methodological improvement could enable more detailed and broader comparative analyses, leading to a better understanding of populism.

However, as attractive as this may seem, there are some critical challenges to overcome in developing such a model. First, the term "populism" has received much attention in academia for several years now, which has led to a proliferation of different definitions and, in many cases, to vague operationalization and concept-stretching (Pappas 2016). In order to develop an operational definition of "populism", a comprehensive literature analysis is necessary. Furthermore, this definition should focus on the intrinsic characteristics of populism and not depend on the national context, the register of the text, or the author's ideological position. The second challenge is to assemble a training dataset for machine-learning models that is large and diverse enough to allow the developed model to perform well on diverse, previously unseen data. Finally, the third challenge is to validate the performance of the developed model in a way that provides a realistic understanding of how the model would perform "in the wild," i.e. on new data that might differ from the training data in a significant number of ways.

In our approach, we define populism primarily as a discursive strategy that actors across the ideological spectrum can employ. We see "populism" as composed of two distinct components: people-centrism (referring to "the people" as a single entity with homogeneous interests) and anti-elitism (a sentiment that the current governing elites are corrupt and act against the interests of "the people"). These two components, although sometimes appearing together, are distinct and have been coded separately in our analysis. In addition, we developed two sets of ML models to detect these two dimensions of populism. To train the models, we have developed a new dataset based on established data sources, where each paragraph of text is coded as containing or not containing people-centric or anti-elitist sentiment.

To validate the model's performance, we prepared a separate dataset by manually coding the 2016 and 2020 election manifestos of Lithuanian political parties. We carried out the validation to simulate a real-life scenario in which a researcher uses the ML model for a specific research project. Because we did not use any data from Lithuania in the original training set, this reduced the risk of contamination, i.e. of the data used to "test" the model not being new but having somehow appeared in the training set. To further reduce the risk, we split the Lithuanian dataset into two parts and used one part as a "test" set during model development and the other as a "hold-out" set once model development was complete.

The developed model performed reasonably well on the validation dataset (accuracies of 0.86 and 0.95 for people-centrism and anti-elitism, respectively). It had a slight tendency to over-predict (generate false positives), which is acceptable because it is designed to act as an automation aid for researchers, with human coders checking and validating its predictions.

The paper is structured as follows: the first part presents an overview of existing research and the operational definition of populism used in this paper. The second part describes the data and methodology, and the third part presents the results of the model validation.

1. Overview of existing research

Research on populism started with, and was for a long time dominated by, in-depth analyses of specific cases of populism (Grabow & Hartleb, 2013, Mudde & Kaltwasser, 2012). Recent research seems to be shifting its focus to broader-scale comparative analysis across countries, periods, and sources. Classical content analysis, started by J. Jagers and S. Walgrave (2007), is still one of the most widely used methods of populist discourse analysis. With slight differences between the methods, researchers most often compare the proportion of populist content by coding specific excerpts of text such as a paragraph (Rooduijn & Pauwels, 2011, Rooduijn, 2014, Pauwels & Rooduijn, 2015, Rooduijn & Akkerman, 2017), a statement (Ernst et al., 2017, Manucci & Weber, 2017, Ernst et al., 2019, Bernhard & Kriesi, 2019), an issue-specific claim (Bernhard et al., 2015), a sentence (Vasilopoulou et al., 2014) or a quasi-sentence (March, 2018). Researchers frequently attempt to maintain the validity of classical content analysis and make the process easier by adding semi-automation tools (Caiani & Graziano, 2016, Ernst et al., 2017, Wettstein et al., 2018, Ernst et al., 2019).

Since classical content analysis is very time- and labour-consuming, more extensive comparative studies involve automated methods such as the dictionary-based approach (Pauwels, 2011). Even though the validity of the computer-based coding method is somewhat lower than that of classical content analysis (Rooduijn & Pauwels, 2011), both approaches generate reasonably valid results (Storz & Bernauer, 2018). The dictionary-based approach is extensively used to analyze both media content (Hameleers & Vliegenthart, 2020, Gründl, 2020) and party-generated data (Storz & Bernauer, 2018, Elçi, 2019, Payá, 2019). Further developments of the method (Bonikowski & Gidron, 2016) and manual checks of the text excerpts (Pauwels, 2017) have been suggested to improve the validity of the dictionary-based approach.

Holistic grading is another approach, specifically developed to make manual coding more efficient. The method combines the benefits of classical content analysis (holistic approach, human interpretation) and of the dictionary-based approach (ability to compare large amounts of data). The whole text (usually a speech) is coded by human coders using an explicit rubric (Hawkins, 2009). The approach enabled researchers to develop the Global Populism Database, consisting of various political texts (Hawkins et al., 2019).

Finally, several different expert-based populism databases have been established in recent years: the Populism and Political Parties Expert Survey (POPPA), The PopuList, the Global Party Survey and the Timbro Authoritarian Populism Index. They categorize populist political actors and provide a more comprehensive perception of the populist phenomena across different regions and time-frames.

Despite various methodological improvements and developments, automated textual analysis has not yet gained momentum. The exception is an attempt by Hawkins and Silva (2018) to use elastic-net regression for the supervised classification of party manifestos. Their results suggest that the model can identify very populist manifestos and very non-populist documents but does not perform very well on the documents in between. They conclude that using more training data could improve results. We also suggest that dividing manifestos into shorter excerpts (paragraphs) and hand-coding them could improve the model.

In recent years, artificial neural network models have demonstrated outstanding results in many spheres of application. Arguably, since 2018 the biggest progress has been made in the area of natural language processing, where these models have been applied to a number of natural language understanding and text-classification tasks. Given the magnitude of these improvements, it is reasonable to expect that similar techniques could also be applied to improve the classification of populist text.

2. Methods

To develop a machine-learning model to recognize populist content, we have employed a standard machine-learning workflow: first, we collected and prepared a training dataset used to develop a machine-learning model (more precisely, an ensemble of models). The performance of the models was tested using a small set of manually coded Lithuanian political party manifestos. Finally, the performance of the trained model was validated using a more extensive, separately coded dataset of all the Lithuanian political party manifestos from the 2016 and 2020 parliamentary elections. This train-test-holdout approach was chosen because the more commonly used train-test

conceptualizations, we claim that populism as an ideology is reflected in the discourse (Pauwels, 2011). We consider it an attribute of a text rather than a feature of a politician (Rooduijn, 2014).

For coding, we follow the instructions suggested by Rooduijn and Pauwels (2011). The coding unit is a paragraph, as it allows distinguishing between different arguments (Pauwels, 2011) and is a sufficiently long passage of text