Ethical considerations of the GermEval20 Task 1. IQ assessment with natural language processing: Forbidden research or gain of knowledge?

Dirk Johannßen
MIN Faculty, Dept. of Computer Science, Universität Hamburg & Nordakademie
http://lt.informatik.uni-hamburg.de/

Chris Biemann
MIN Faculty, Dept. of Computer Science, Universität Hamburg, 22527 Hamburg, Germany

David Scheffer
Faculty of Economics, NORDAKADEMIE, 25337 Elmshorn, Germany

Abstract

The use of Intelligence Quotient (IQ) testing as a measure for intellectual ability is controversial. Even though IQ testing is considered to be among the most valid measures of psychology, findings and current research have sparked a debate over racial or socioeconomic biases, as well as the label of 'pseudoscience' for many situations that involve IQ testing. The GermEval20 Task 1 asked researchers to investigate NLP approaches for approximating the ranking based on IQ and high school scores from so-called implicit motive texts. Quickly, a vivid discourse on whether the shared task should be viewed as unethical and forbidden was held within the NLP community. In this paper, we investigate ethical considerations and arguments against and in favor of such a task and come to the conclusion that this type of research should be conducted despite the undoubtedly associated ethical issues, as it can shed light on thus-far non-transparent methods and offers valuable gains of knowledge.

1 Introduction

The use of Intelligence Quotient (IQ) testing as a measure for intellectual ability is highly controversial. On the one hand, different IQ measures were established by psychologists more than a century ago and are held to be among the most valid, stable, and reliable measures in the whole scientific field of psychology (Benson, 2003). Cognitive abilities are influential predictors for multiple criteria of professional success (Schmidt and Hunter, 1998; Kramer et al., 2009; Ones et al., 2017). On the other hand, recent studies have shown that many conducted IQ tests are fundamentally flawed in that they introduce racial or socioeconomic biases (Turkheimer et al., 2003).

Even the term intelligence is ambiguous, as are the assumed impacts IQ testing has on aptitude diagnostics, as the title of the paper 'intelligence is what the intelligence test measures' suggests (Maas et al., 2014), whilst most definitions at least agree that intelligence is always connected to successfully overcoming challenges of everyday life situations (Rost, 2009).

Since those aspects already are concerns, any work that involves automation by the use of NLP techniques and IQ testing information based on natural language texts understandably raises concerns. In the case of the GermEval20 Task 1, the research has caused more than just concerns: it triggered a so-called shitstorm on the social platform Twitter¹.

Figure 1: Images to be interpreted by participants utilized for the operant motive test (OMT) on the left and the motive index (MIX) on the right. The motives are the affiliation motive (A), the power motive (M), achievement (L), and freedom (F). A 0 stands for the zero / unassigned motive.

The GermEval20 Task 1 on the Classification and Regression of Cognitive and Emotional Style from Text (Johannßen et al., 2020)²,³ was the trigger for a heated debate on this topic.

In short, the task is about researching the validity of so-called implicit motives and their potential to substitute controversial metrics utilized in aptitude diagnostics. Metrics like IQ tests, high school grades, or math tests are commonly used in aptitude diagnostics, but can quickly be utilized in inherently flawed settings. Research on implicit motives has shown that they are indicators for long-term behavior and development, which could replace the other metrics (Scheffer, 2004). For those implicit motives, participants are asked to answer questions about ambiguous drawings (Scheffer and Kuhl, 2013). Those association tasks are less socioeconomically or racially biased and show intrinsic desires rather than task comprehension (Schultheiss, 2008, p. 439 ff.). However, as there is no clear and unambiguous rule system for labeling those implicit motives, they have yet to be understood better (Johannßen and Biemann, 2019). This is what the shared task aimed for: a better understanding of implicit motives and their role in aptitude diagnostics by the use of Natural Language Processing (NLP) methods.

The task contains two subtasks. For Subtask 1, participants are asked to reproduce a ranking of students based on different high school grades and IQ scores solely from implicit motive texts. For Subtask 2, participants are asked to classify each motive text into one of 30 classes as a combination of one of five implicit motives and one of six levels (Johannßen et al., 2020).

During the heated debate on the shared task, moderate critics of the task called for the organizers to 'pull the plug'; others went as far as comparing the task with Eugenics in the Third Reich (even though those critics were clearly out of line with the constructive and fair debate held by most researchers).

Three main topics emerged from the debate, which will be discussed in this paper: i) IQ testing and biases, ii) forbidden research, and iii) reasons for even building such a system.

In Section 2, we will first describe the shared task in more detail and then provide some background information on IQ testing in Section 3. A discourse as to why this is an ethical question and the course of the emerged Twitter shitstorm and heated debate is presented in Section 4. The three points of discussion are covered in Section 5 (IQ and bias), Section 6 (reasons for building such a system), and Section 7 (whether there is forbidden research). A final discussion in Section 8 concludes.

¹ https://www.twitter.com
² GermEval is a series of shared task evaluation campaigns that focus on Natural Language Processing (NLP) for the German language. The workshop is held as a joint Conference SwissText & KONVENS 2020 in Zürich.
³ https://competitions.codalab.org/competitions/22006

2 The role of implicit motives for the GermEval20 Task 1

Researchers have found indications that linguistic features such as function words used in a prospective student's writing perform better in predicting academic development (Pennebaker et al., 2014) than other methods such as GPA values.

The purpose of the GermEval20 Task 1 was to investigate, firstly, whether implicit motives can be classified on a human level and, secondly, whether those implicit motives are sufficient for compensating flawed predictors utilized during aptitude-diagnostic evaluations, such as GPA or IQ scores (Johannßen et al., 2020).

During an aptitude test, participants are asked to write freely associated texts in response to provided questions on shown images (such as those displayed in Figure 1). Psychologists can identify so-called implicit motives from those expressions. Implicit motives are unconscious intrinsic desires that can be measured by implicit methods (Gawronski and De Houwer, 2014; McClelland et al., 1989). Those implicit psychometrics are said to be predictors of behavior and long-term development from utterances (McClelland, 1988; Scheffer, 2004; Schultheiss, 2008).

From a small sample of an aptitude test collected at a college in Germany, the classification and regression of cognitive and motivational styles from a German text can be investigated (Johannßen and Biemann, 2019).

3 Introductory words on IQ testing

IQ testing is a form of psychometrical testing, mostly utilized in the area of aptitude diagnostics and structural assessments. The term intelligence itself is debated, as are those types of tests themselves (see Section 5). Nowadays, IQ testing is in an academic crisis, forming a paradigm shift towards more precise tests of isolated, single skills rather than one defining metric, dissolving misconceptions of IQ testing as being a sort of personality trait or fixed property of an individual.

Problems and questions from IQ tests vary greatly and range from the use of language, proverbs, and mathematics to causalities or mechanical problems, as displayed in Figure 2.

Attention, auditory, visual and tactile perception, language, memory, and executive function all need to be considered when assessing the IQ (Christie, 2005). One of the most comprehensive IQ tests is the Wechsler test. It samples verbal and non-verbal areas of intellectual functioning (Wechsler, 2011).

For each IQ test, the mean of all participants marks 100 testing points, with a standard deviation of 15 points. As a result, about 68% of a population score between 85 and 115 IQ testing points.

Since IQ tests can only discriminate abilities between the percentiles from 3 percent to 97 percent of the tested population, very weak or very strong individuals cannot be identified. IQ testing may only be valid by identifying and testing a truly representative peer group. The more homogeneous the peer group, the more valid the test. As humans are rich in diversity, this criterion is hardly accomplishable in practice and should thus be accounted for when interpreting scores.

cerned with how humans should live and what is considered to be just or unjust. Morals, on the other hand, are normative and rather connected to the activities performed by humans under the premise of one's ethics. Whilst philosophical ethics define a set of ground truths about how to live in a just way, moralities are implications on what to act out in certain situations.

Moral dilemmas are issues of conflicts between what a person should do and should not do. In those cases, where there is neither a choice of clearly right nor wrong action in a moral situation, the dilemma becomes present (Braunack-Mayer, 2001).

As will be described in Subsection 4.1, there is a necessity for aptitude testing in nowadays hu-