Ethical considerations of the GermEval20 Task 1. IQ assessment with natural language processing: Forbidden research or gain of knowledge?

Dirk Johannßen, MIN Faculty, Dept. of Computer Science, Universität Hamburg & Nordakademie
Chris Biemann, MIN Faculty, Dept. of Computer Science, Universität Hamburg, 22527 Hamburg, Germany
David Scheffer, Faculty of Economics, NORDAKADEMIE, 25337 Elmshorn, Germany
http://lt.informatik.uni-hamburg.de/
{biemann, johannssen}@informatik.uni-hamburg.de, [email protected]

Abstract

The use of Intelligence Quotient (IQ) testing as a measure for intellectual ability is controversial. Even though IQ testing is considered to be among the most valid measures in psychology, findings and current research have sparked a debate over racial or socioeconomic biases, as well as the label of ‘pseudoscience’ for many situations that involve IQ testing. The GermEval20 Task 1 asked researchers to investigate NLP approaches for approximating a ranking based on IQ and high school scores from so-called implicit motive texts. Quickly, a vivid discourse arose within the NLP community on whether the shared task should be viewed as unethical and forbidden. In this paper, we investigate ethical considerations and arguments against and in favor of such a task and come to the conclusion that this type of research should be conducted despite the undoubtedly associated ethical issues, as it can shed light on thus-far non-transparent methods and offers valuable gains of knowledge.

1 Introduction

The use of Intelligence Quotient (IQ) testing as a measure for intellectual ability is highly controversial. On the one hand, different IQ measures were established by psychologists more than a century ago and are held to be among the most valid, stable, and reliable measures in the whole scientific field of psychology (Benson, 2003). Cognitive abilities are influential predictors for multiple criteria of professional success (Schmidt and Hunter, 1998; Kramer et al., 2009; Ones et al., 2017). On the other hand, recent studies have shown that many conducted IQ tests are fundamentally flawed in that they introduce racial or socioeconomic biases (Turkheimer et al., 2003).

Even the term intelligence is ambiguous, as are the assumed impacts IQ testing has on aptitude diagnostics. The title of the paper ‘intelligence is what the intelligence test measures’ suggests as much (Maas et al., 2014), whilst most definitions at least agree that intelligence is always connected to successfully overcoming challenges of everyday life situations (Rost, 2009).

Since those aspects already are concerns, any work that involves automation by the use of NLP techniques and IQ testing information based on natural language texts understandably raises concerns. In the case of the GermEval20 Task 1, the research has caused more than just concerns, namely a so-called shitstorm on the social platform Twitter¹.

Figure 1: Images to be interpreted by participants utilized for the operant motive test (OMT) on the left and the motive index (MIX) on the right. The motives are the affiliation motive (A), the power motive (M), achievement (L), and freedom (F). A 0 stands for the zero / unassigned motive.

The GermEval20 Task 1 on the Classification and Regression of Cognitive and Motivational Style from Text (Johannßen et al., 2020)²,³ was the stumbling block for a heated debate on this topic.

In short, the task is about researching the validity of so-called implicit motives and their potential to substitute controversial metrics utilized in aptitude diagnostics. Metrics like IQ tests, high school grades, or math tests are commonly used in aptitude diagnostics, but can quickly be utilized in inherently flawed settings. Research on implicit motives has shown that they are indicators for long-term behavior and development, which could replace the other metrics (Scheffer, 2004). For those implicit motives, participants are asked to answer questions on ambiguous drawings (Scheffer and Kuhl, 2013). Those association tasks are less socioeconomically or racially biased and show intrinsic desires rather than task comprehension (Schultheiss, 2008, p. 439 ff.). However, as there is no clear and unambiguous rule system for labeling those implicit motives, they have yet to be understood better (Johannßen and Biemann, 2019). This is what the shared task aimed for: a better understanding of implicit motives and their role in aptitude diagnostics by the use of Natural Language Processing (NLP) methods.

The task contains two subtasks. For Subtask 1, participants are asked to reproduce a ranking of students based on different high school grades and IQ scores solely from implicit motive texts. For Subtask 2, participants are asked to classify each motive text into one of 30 classes as a combination of one of five implicit motives and one of six levels (Johannßen et al., 2020).

During the heated debate on the shared task, moderate critics of the task called for the organizers to ‘pull the plug’; others went as far as comparing the task with Eugenics in the Third Reich (even though those critics were clearly out of line with the constructive and fair debate held by most researchers).

Three main topics emerged from the debate, which will be discussed in this paper: i) IQ testing and biases, ii) reasons for even building such a system, and iii) forbidden research.

In Section 2, we will first describe the shared task in more detail and then provide some background information on IQ testing in Section 3. A discourse as to why this is an ethical question and the course of the emerged Twitter shitstorm and heated debate is presented in Section 4. The three points of discussion are covered in Section 5 (i, IQ and bias), Section 6 (ii, reasons for building such a system), and Section 7 (iii, is there forbidden research). A final discussion in Section 8 concludes.

¹ https://www.twitter.com
² GermEval is a series of shared task evaluation campaigns that focus on Natural Language Processing (NLP) for the German language. The workshop is held as a joint Conference SwissText & KONVENS 2020 in Zürich.
³ https://competitions.codalab.org/competitions/22006

2 The role of implicit motives for the GermEval20 Task 1

Researchers have found indications that linguistic features such as function words used in a prospective student’s writing perform better in predicting academic development (Pennebaker et al., 2014) than other methods such as GPA values.

The purpose of the GermEval20 Task 1 was to investigate, firstly, whether implicit motives can be classified on a human level and, secondly, whether those implicit motives are sufficient for compensating flawed predictors utilized during aptitude diagnostic evaluations, such as GPA or IQ scores (Johannßen et al., 2020).

During an aptitude test, participants are asked to write freely associated texts in response to questions on shown images (such as those displayed in Figure 1). Psychologists can identify so-called implicit motives from those expressions. Implicit motives are unconscious intrinsic desires that can be measured by implicit methods (Gawronski and De Houwer, 2014; McClelland et al., 1989). Those implicit psychometrics are said to be predictors of behavior and long-term development from utterances (McClelland, 1988; Scheffer, 2004; Schultheiss, 2008).

From a small sample of an aptitude test collected at a college in Germany, the classification and regression of cognitive and motivational styles from a German text can be investigated (Johannßen and Biemann, 2019).

3 Introductory words on IQ testing

IQ testing is a form of psychometric testing, mostly utilized in the area of aptitude diagnostics and structural assessments. The term intelligence itself is debated, as are those types of tests themselves (see Section 5). Nowadays, IQ testing is in an academic crisis, which drives a paradigm shift towards more precise tests of isolated, single skills rather than one defining metric, dissolving misconceptions of IQ as a sort of personality trait or fixed property of an individual.

Problems and questions from IQ tests vary greatly and range from the use of language, proverbs, and mathematics to causalities or mechanical problems, as displayed in Figure 2.

Attention, auditory, visual and tactile perception, language, memory, and executive function all need to be considered when assessing the IQ

(Christie, 2005). One of the most comprehensive IQ tests is the Wechsler test. It samples verbal and non-verbal areas of intellectual functioning (Wechsler, 2011).

For each IQ test, the mean of all participants marks 100 testing points, with a standard deviation of 15 points. As a result, about 68% of a population score between 85 and 115 IQ testing points.

Since IQ tests can only discriminate abilities between the percentiles from 3 percent to 97 percent of the tested population, very weak or very strong individuals cannot be identified. IQ testing can only be valid if a truly representative peer group is identified and tested. The more homogeneous the peer group, the more valid the test. As humans are rich in diversity, this criterion is hardly accomplishable in practice and should thus be accounted for when interpreting scores.

Furthermore, the cultural and environmental exposures of peer groups and individuals are crucial. For many international IQ tests, only individuals who were exposed to representative use of the English language at all stages of life may be comparable to other peers.

Aptitude diagnosticians call for IQ tests to be utilized only to identify possible weaknesses of children in order to purposefully support them with additional educational offers (Christie, 2005).

Figure 2: Different parts from an IQ test utilized at the Nordakademie. Upper left: logical comprehension, upper right: memory skills, lower left: technical comprehension, lower right: linguistic comprehension.

4 Discourse – why is this an ethical question?

When there is a need for discussing the ethical considerations of a shared task, there has to be a moral dilemma in the first place. Philosophical ethics is a branch of philosophy which is concerned with how humans should live and what is considered to be just or unjust. Morals, on the other hand, are normative and rather connected to the activities performed by humans under the premise of one’s ethics. Whilst philosophical ethics define a set of ground truths about how to live in a just way, morals are implications on how to act in certain situations.

Moral dilemmas are conflicts between what a person should do and should not do. The dilemma becomes present in those cases where a moral situation offers neither a clearly right nor a clearly wrong action (Braunack-Mayer, 2001).

As will be described in Subsection 4.1, there is a necessity for aptitude testing in today’s human interactions, be it that a scholar or a new employee is to be chosen or a high school grade is to be assigned. Psychologists face the challenge that many cognitive processes are difficult to measure and to observe. Oftentimes, psychologists have to rely on behavioral observations or questionnaires. However, any consciously given response can never reveal unconscious desires, which is why e.g. implicit motives (see Section 1) might be valuable.

One downside to the use of implicit motives is that they have yet to be fully understood and researched in terms of their validity. Many metrics are not yet fully understood, even though psychologists are confident in their explanatory powers.

IQ tests have been researched for more than 100 years and are said to be among the most stable, valid, and reliable metrics in psychology (Benson, 2003). However, recent research has clearly shown that hardly any IQ testing has been performed without introducing harmful biases (see Section 5). As standardized tests are the closest feasible form of aptitude diagnostics, they can hardly be completely omitted either. This is a strong moral dilemma.

This very dilemma caused the GermEval20 Shared Task 1 to be so eagerly debated. The task bears the chance of learning more about implicit motives, which could even compensate for IQ testing and replace it completely. However, any research on IQ testing-related data bears the danger of biases, pseudoscience, and misuse of approaches or results.
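The scale described in Section 3 (mean 100, standard deviation 15) can be made concrete with a short calculation. The following is a minimal sketch that assumes IQ scores are exactly normally distributed, which is an idealization and not a claim made by any specific test; the names `IQ_SCALE`, `share_between`, and `score_at_percentile` are hypothetical helpers introduced only for illustration. It computes the share of a population between 85 and 115 points and the scores that correspond to the 3rd and 97th percentiles, between which IQ tests are said to discriminate.

```python
from statistics import NormalDist

# Assumption: IQ scores follow a normal distribution with
# mean 100 and standard deviation 15 (an idealized model).
IQ_SCALE = NormalDist(mu=100, sigma=15)


def share_between(low, high, dist=IQ_SCALE):
    """Share of the population scoring between low and high."""
    return dist.cdf(high) - dist.cdf(low)


def score_at_percentile(percentile, dist=IQ_SCALE):
    """IQ score below which the given percentage of the population falls."""
    return dist.inv_cdf(percentile / 100)


# Share of the population within one standard deviation of the mean.
share_85_115 = share_between(85, 115)

# Scores at the 3rd and 97th percentiles of the idealized scale.
lower_bound = score_at_percentile(3)
upper_bound = score_at_percentile(97)

print(f"Share between 85 and 115: {share_85_115:.1%}")
print(f"3rd percentile score: {lower_bound:.1f}")
print(f"97th percentile score: {upper_bound:.1f}")
```

Under this idealized model, scores outside the two percentile bounds fall into the range where, according to Section 3, individuals can no longer be reliably distinguished.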

4.1 Aptitude diagnostics and IQ testing in the NLP research community

When it comes to working with natural language processing methods in combination with IQ tests, there are different disciplines.

One of the most broadly researched disciplines is benchmarking artificial intelligence (AI) systems by their capability of scoring well on IQ tests intended for humans. Even though data resources for performing those AI benchmarks are limited, small, and poorly standardized (Liu et al., 2019), there are still many experiments on them. The goals of those benchmarks include advancing AI, validating AI systems, investigating intelligence testing further, and better understanding what human intelligence is.

Another discipline is more closely linked to human behavior and tries to correctly classify or predict human performances on IQ or comprehension tests according to the defined measures of success of the test. Different from the benchmarking, where the IQ test itself is the research object, this discipline always involves the IQ test and human performance. The shared task provides data for such studies. Another example of such a task is the Automated Student Assessment Prize (ASAP) Short Answer (SA) challenge.⁴ The ASAP-SA was conducted by the Hewlett Foundation and aimed for predictions of students’ grades based on short answers given by those students. Even though there were ethical discussions beforehand and during the task, the organizers specifically dismissed those and asked for methodology papers only. Regarding the ethical concerns, the organizers argued that standardized tests occupy professionals who, were they not manually evaluating those tests, would be able to craft more sophisticated, more individual and more insightful testing procedures.

The third discipline is a meta-analysis of circumstances connected with NLP on IQ tests. Those studies include e.g. bias research, statistical evaluations, or behavioral consequences. Ethical considerations, such as this work, can be included in that discipline as well (Tsvetkov et al., 2018; Hovy and Spruit, 2016).

⁴ https://www.kaggle.com/c/asap-sas/overview/description

4.2 Heated Community Discussion

On December 4, 2019, the first call for participation was released on multiple channels, one being the corpora list⁵. The original call was entitled ‘GermEval 2020 Task 1 on the Prediction of Intellectual Ability and Personality Traits from Text: 1st Call for Participation’. It described the task in very technical terms with a focus on the NLP methodology.

A first reaction on the Corpora List considered that “[a]s a community, we should think carefully about whether it is appropriate to work with IQ test results as data, and what the applications of this research might be.” and “In the , there is considerable evidence that IQ tests are racially biased”⁶.

Besides the first topic of discussion, biases in IQ testing, the direct response to this introduced the second topic of discussion: “This task seems irresponsible/poorly conceived to me. Before designing such a task, I think it is imperative to consider its use cases: When and why would we want to predict IQ scores or high school grades from the text?”⁷.

The discussion continued in an argumentative manner and a respectful tone. There were roughly equal numbers of supporters and opponents of the task. Even though many assumptions were made (e.g. on whether the data was provided voluntarily by the aptitude test participants), panelists quickly came to the conclusion that too little of the underlying circumstances of the task was described in the first call to sensibly continue the discussion.

In the meantime, the discourse continued on another channel: Twitter. As shown in Figure 3, concerns were raised by an initial tweet.

Figure 3: A first tweet carried the discussion from the NLP specific corpora list to the more visible and international social platform Twitter, raising concerns about the shared task.⁸

Unlike the well-formulated and balanced concerns on the Corpora List, tweets are usually much shorter. Many researchers asked for more details, others drifted into speculations on details of the task’s circumstances that had not been provided. As the tone got increasingly hostile and demands for ‘pulling the plug’ and for starting petitions against the task arose, the task organizers posted a request for some time to formulate a call revision. On the 5th of December, one day after the 1st call for participation, the organizers released this revision as a statement, clarifying motivations for this task and explaining some of the task’s circumstances⁹. Meanwhile, the shitstorm on Twitter had escalated to the low point of comparisons to Nazi Germany and Eugenics, as displayed in Figure 4.

Figure 4: The so-called shitstorm on Twitter went so far that there were calls for petitions against the task, demands for ‘pulling the plug’, as well as Nazi Germany and Eugenics comparisons. However, many participants strived for a sensible, respectful and constructive discussion.

Two days after the published 1st call for participation, the third topic of discussion emerged from the discourse: forbidden research. One community researcher wrote a Medium article on the topic with the title ‘Is there research that shouldn’t be done? Is there research that shouldn’t be encouraged?’ on the 7th of December¹⁰. The organizers published an explanatory public reaction to the heated discussion on their accompanying task website, on a Codalab competition website, as well as on Twitter¹¹.

The heated discussion quickly died down after the 6th of December. The task organizers revised their task website to include more detailed information on the task’s circumstances and ethical considerations, and changed the name of the task from ‘Prediction of Intellectual Ability and Personality from Text’ to ‘Classification and Regression of Cognitive and Motivational Style from Text’, as this is by far more precise in terms of the task’s goals (as described in Section 1).

⁵ https://mailman.uib.no/public/corpora/2019-December/
⁶ https://mailman.uib.no/public/corpora/2019-December/030882.html by Jacob Eisenstein
⁷ https://mailman.uib.no/public/corpora/2019-December/030883.html by Emily M. Bender
⁸ A message on the short messaging service Twitter is called a tweet and can be up to 280 characters in length.
⁹ https://www.inf.uni-hamburg.de/en/inst/ab/lt/resources/data/germeval-2020-cognitive-motive/germvval2020-task1-public-statement.txt
¹⁰ https://medium.com/@emilymenonbender/is-there-research-that-shouldnt-be-done-is-there-research-that-shouldn-t-be-encouraged-b1bf7d321bb6 by Emily M. Bender.
¹¹ https://www.inf.uni-hamburg.de/en/inst/ab/lt/resources/data/germeval-2020-cognitive-motive/germvval2020-task1-public-statement.txt

5 IQ testing and biases

This section will explore and discuss the first of the three topics of discussion that emerged from the discourse: i) IQ testing and biases. Effects like measurable biases and a training effect will be addressed, as well as measures taken against those biases and a general discussion on systems theory.

5.1 There is a socioeconomic bias in IQ testing

Minorities can be discriminated against by a bias that stems from unequal environmental circumstances and measurements in non-representative groups (Rushton and Jensen, 2005).

Firstly, the term intelligence in intelligence testing is highly misleading, as there is no well-defined concept of intelligence. Rather than intelligence, IQ tests are said to measure the skill at the specific tasks employed in an IQ test.

For what IQ scores do measure, they are thought to be among the best measures in the scientific field of psychology. Those skills are often what aptitude diagnostics define as relevant for many modern skill-oriented jobs (validity). They furthermore stay relatively stable across different stages of life, starting with the age of 8, and over time (stability). They stay consistent when performed with the same setups but different observers and observees (reliability). Lastly, parts of IQ tests correlate with each other, suggesting an influence of both environmental and genetic factors (Plomin and Deary, 2015).

However, there are strong signs that contradict this notion of a genetic pool. One of them is the Flynn effect, which states that the IQ scores among the population grow by 3 points each decade, by far too fast to be connected to evolution. This effect is rather caused by skills acquired from environmental changes (Flynn, 1987). For example, when investigating the development of refugees, their early environmental circumstances are suboptimal. However, later positive environmental factors can compensate for those early difficulties (Dweck, 2017).

Not so much race, but socioeconomics is thought to be the difference (Turkheimer et al., 2003). This shows that IQ potential is determined by genes, but whether this potential develops is dependent on the environment (who is rich can achieve anything). Thus, an IQ score would only truly say anything if it was ensured that the compared individuals enjoyed the same good environment.

IQ tests are good measures of innate skill ability if all other factors are held steady, which is, in fact, impossible. The present differences in IQ scores between minorities and the majority point to a very serious issue: inequality of opportunity. It is this socioeconomic bias which leads to unequal opportunities, especially in countries where there is a rich diversity amongst the population.

5.2 Standardized tests are trainable

As stated before, terms like intelligence or cognitive ability are misleading when describing aptitude diagnostics and IQ testing. Those tests measure a pre-defined set of skills.

In the area of psychology, there are certain principles for experiments and procedures that have developed over many years. One of those principles is that participants always need to know beforehand what is being tested. It needs to be clear what the results of a procedure or test are, and how those results relate to the participants.

Another principle is that most experiments and tests in psychology are only truly objective if participants have not yet completed this particular test and do not know the exact type of testing that the whole experiment or parts of it will set up.

As soon as participants know or are accustomed to the underlying testing procedure, they will most likely perform more in the direction of what they think would maximize a reward than they would if they had not taken a test yet. In other words, IQ tests are trainable. The more often participants perform IQ tests that are thought to be the norm at a given point in time (in reference to the Flynn effect), the higher participants are thought to score on those tests.

However, this implication does not just hold for the IQ tests themselves but goes further. As stated before, IQ tests do not test intelligence but skill. Those skills are connected with many skills required by modern jobs and with many skills trained and taught in schools (e.g. thinking in abstract terms, pattern recognition, and basic math skills like the Fibonacci series). Well-educated and often home-schooled individuals therefore tend to train those skills more and thus perform better on intelligence testing procedures, even though high skills on any non-related task cannot be guaranteed.

5.3 IQ Tests Discriminate Minorities

It is this bias which leads to unequal opportunities, especially in countries where there is a rich diversity among the population. Intelligence testing has had a dark history. Eugenics during the great wars, e.g. in the US by sterilizing citizens¹² or in Germany during the Third Reich (Reddy, 2007), are some of the most gruesome parts of history.

But even in modern days, the IQ is misused. Recently, IQ scores have been used in the US to determine which death row inmate shall be executed and which might be spared. Since IQ scores show too large a variance, the Supreme Court has ruled against this definite threshold of 70 (Cooke et al., 2015). However, Sanger (2015) has researched an even more present practice of ‘racial adjustment’: adjusting the IQ of minorities upwards as a countermeasure to the racial bias in IQ testing, which results in the execution of death row inmates who originally were below the 70 points threshold.

There is an ethical necessity to carefully view, understand and research the way intelligence testing is conducted and how those scores are, if at all, correlated with what we understand as ‘intelligence’, as they might be mere cognitive and motivational styles. Further valuable research can be conducted to investigate connections between other personalities. Racial biases are measurable, variances are great and many critics state that IQ scores reflect skill or cognitive and motivational style rather than real intelligence as it is broadly understood.

¹² https://supreme.justia.com/cases/federal/us/274/200/

5.4 Wrong Wording: ’Cognitive ability’

As stated before, the term intelligence is highly misleading. Furthermore, for the first subtask, we were trying to reproduce what is being utilized as a selection mechanism, in which the IQ test is only one part. High school grades are another, matched to implicit motive tests. For the second subtask, implicit motive texts were to be classified directly.

Both of those tasks aimed at researching an underlying pattern or truth to implicit motives, which have been researched and show no inherent bias. Furthermore, high school grades in Germany consist to 60% of participation in class. Thus the whole term cognitive ability in the 1st call for participation of GermEval20 Task 1 was nonsensical and plain wrong, but it was revised quickly after first concerns were raised.

5.5 Relation to German socioeconomic structure

There is a broad understanding that intelligence tests are especially prone to biases towards the environments and circumstances in which they were developed. As a result, the tests designed in Western, white societies are problematic when utilized for testing richly diversified cultures (Vahidi, 2015).

Unfortunately, the German education system is known to have a strong socioeconomic bias, which leads to a vast under-representation of people with a migration background in higher education (Diehl and Fick, 2016; Fernandez-Kelly, 2015).

This, paradoxically, leads to the data of the GermEval20 Task 1 being less prone to the influence of such biases, with respect to the ground population of the underlying data. There is no information on names, nationalities, or other demographic data, as it is forbidden in Germany to record personal information according to EU laws. However, e.g. pictures of graduation years indicate that there is little diversity amongst those graduates, as displayed in Figure 5.

Figure 5: The exemplary 2014 year of graduation from the NORDAKADEMIE illustrates the cultural homogeneity, as the vast majority of graduates are white (https://www.shz.de/lokales/elmshorner-nachrichten/lasst-die-huete-fliegen-id19354606.html). In Germany, a strongly biased socioeconomic filter is already present at the high school level.

5.6 Challenging IQ testing biases in Germany

During the long history of research in the field of IQ testing, many mistakes were made and investigated. Aptitude diagnosticians have spent decades challenging and correcting the strong socioeconomic biases that were present in most of the earliest IQ tests. Nowadays, there are many different variants of and approaches to IQ testing.

In Germany, there is little diversity amongst private college applicants. Researchers at the NORDAKADEMIE try to actively challenge those socioeconomic biases by employing implicit motives, which are known to be less biased than other metrics in the field of aptitude diagnostics. In addition, the employed IQ test also accounts for the little diversity of the participant population.

The NORDAKADEMIE utilized the IST 2000 R intelligence structure test by Liepmann et al. (2007), which was normalized on high school graduates¹³. Since only about a third of students attend high school in Germany, the basic population of this IQ test accounts for the little diversity of most applicants at the NORDAKADEMIE, who have already experienced a socioeconomic filter. Even though this filter is a discrimination already, the employed IQ test objectively accounts for the type of the basic population that takes the test and thus challenges this bias.

¹³ In Germany, the secondary school tier consists of three types of schools: the Hauptschule (practice-oriented vocational education), the Realschule (theory-oriented vocational education) and the Gymnasium (high school, preparation for pursuing a college education). Only about 30% of graduates go to college in Germany (Fernandez-Kelly, 2015).

5.7 Systems theory by Luhmann: communication is the relation of systems and none can be transmitted without noise

The systems theory by Luhmann is a philosophical and sociological communication theory that describes agents of an environment not as instances but in their relations to other agents. Communication, according to Luhmann, is the constructing principle of an environment and not just a mere tool. An agent is understood as an autonomous part of this environment, which offers its inner structure as a matter of communication to other agents (Görke and Scholl, 2006).

However, as the channel model of communication by Shannon (1948) describes, there is no communication between agents (sender and receiver) without it being obscured and disturbed by noise.

One environment or system is science. Every scientific discipline can be described as an agent in this environment. Whenever there is incomplete knowledge of the inner state of an agent, any type of communication between those systems gets obscured by noise and thus assumptions about those inner states can range from approximations to mere guesses. In any case, the assumptions are flawed.

Applied to the GermEval20 Task 1 and the ethical dilemma of IQ testing at hand, it can be stated that the scientific field of applied NLP does not comprehend the inner state of the scientific field of psychology and aptitude diagnostics. Assumptions about the implications, limits, and effects of IQ testing from any non-psychological researcher must therefore be viewed with caution, especially if no correspondence has taken place, as truth is the interaction between correspondence, consensus and consistency (Sahakian and Sahakian, 1993).

6 Reasons for building a system that GermEval20 Task 1 proposes

This section deals with topic ii) of the discussion: why should such a potentially dangerous system, as proposed by the GermEval20 Task 1 organizers, be built in the first place? Difficulties and possible misuses are the subjects of this section, as well as some background on the implications of researching implicit motives.

6.1 Short text classification is difficult and vague

Short text classification is a very difficult task. The most widely used method is keyphrase extraction (Zhang et al., 2018). However, the implicit motive theory asks annotators to examine the narratives of texts rather than single keywords or keyphrases (Scheffer and Kuhl, 2013). Thus, the most promising method for short text classification is not applicable to implicit motive texts, and it is therefore doubtful that the mere focus on an automatic classification procedure creates valid results.

6.2 The task reflects an established practice in Europe

It can be debated whether researchers should focus on theoretical tasks or whether a very practical focus is legitimate. The NORDAKADEMIE is a university of applied sciences and the whole context of the GermEval20 Task 1 aims at researching implicit motives in the very application-oriented field of aptitude diagnostics¹⁴.

Even if there are very good and strong arguments against aptitude diagnostics, assessment centers, the consideration of socioeconomically biased high school grades or personal job interviews, it is a very common practice in Germany and Europe to examine all of those approaches for decisions on whom to employ.

Mainly companies in Europe employ IQ tests for selecting capable applicants. In the United Kingdom, roughly 69 percent of all companies utilize IQ tests. In Germany, the estimate is 13 percent (Nachtwei and Schermuly, 2009).

Academia has the responsibility to conduct research for the benefit of society. Even though the organizers of the GermEval20 Task 1 do not focus on IQ testing but on the implications of implicit motives, there is an academic responsibility to research the implications of IQ testing, since it is part of the conducted practice in Germany and Europe. Furthermore, science nowadays is called upon to make efforts towards findings that are closely related to everyday society, as Bormann (2013) points out.

¹⁴ https://idw-online.de/de/news492748

6.3 Purpose of the task

Especially the early reactions of the shitstorm described in Section 4.2 were shaped by misconceptions, incomplete information and misleading assumptions about what the GermEval20 Task 1 was about and what it was not about:¹⁵

“[...] I would worry about any research project whose organisers chose to include ”prediction of intellectual ability” in the very title. Presumably a careful choice for a big research project. [...]”

This becomes apparent as some tweets raised concerns mostly based on the headline of the 1st call rather than its content, although the 1st call paired with the provided websites did not include much background information either.

However, as the organizers stated in their task companion paper (Johannßen et al., 2020) and more prominently on their revised companion website: “Any research performed on this aptitude test or the annually conducted assessment center (AC) at the NORDAKADEMIE is under the premise of researching methods of supporting human resource decision-makers, but never to create fully automated, stand-alone filters”¹⁶.

7 Forbidden research

This section explores topic iii) of the discussion: Is there research that should not be done? Is there forbidden research? As these are questions for broad fundamental debates, we will only focus on those aspects that appear most connected to the GermEval20 Task 1.

7.1 Knowledge is not harmless

The first general principle to acknowledge is that knowledge is not harmless. There are many examples of theoretical research being utilized for destructive follow-up research or directly for dangerous devices. For example, Alfred Nobel did not intend dynamite to be used for war, but rather for mining. Historians assume that Nobel included a Nobel Prize dedicated to peace in his last will because his invention was misused for war purposes¹⁷.

This is an example of a so-called dual use of inventions. When inventions intended for civil uses are misused for military purposes without the consent of the inventor, this is called dual use. Williams-Jones et al.
(2014) describe dual use The defined goal of the GermEval20 Task 1 more generally as being used for good and bad ei- Subtask 1 was to reproduce a ranking of stu- ther intentionally or unintentionally by the inven- dents based on the sum of z-standardized high tors. school grades and IQ scores solemnly based on Furthermore, the authors describe the dilemma provided implicit motive texts (Johannßen et al., of this dual use, as there is rarely any impactful 2020). Those ranks were not calculated to indicate research that could not be considered dual use. the superiority of single individuals over others. Most meaningful findings could be utilized for the From an aptitude diagnostical view, this would not good and the bad. Moreover, at times it is not make sense. E.g. a student that might achieve a even possible to imagine the negative or bad dual high IQ score and high German high school grades use of one’s inventions, as further research has not but worse math and English grades might have a been conducted yet and novel products have yet higher overall rank compared to a student whose not been seen (Williams-Jones et al., 2014). metrics are all above average but without any es- One infamous example of dual use that was pecially high ones. Yet companies might prefer not necessary imaginable is nuclear energy and its someone who is above average in every aspect characteristics, which has to lead to a lot of scien- over anyone, that might have high grades in one tific progress (e.g. research on cancer treatments), subject but very low ones in other (Hell et al., civilian use (e.g. nuclear energy), but also great 2007). destructions and threats (e.g. nuclear weapons) Moreover, the critics of this shared task and (Tucker, 2012, p. 74 ff.). 
the organizers themselves have criticized the broad 7.2 NLP can easily be misused for consideration of IQ scores and high school grades pseudoscience in Germany and the EU, as they discriminate against minorities with their socioeconomic bias. IQ scores are prone to pseudoscientific settings and are not easily distinguishable from serious 15https://mailman.uib.no/public/corpora/ and sophisticated settings, thus masking the over- 2019-December/030896.html by Mike Scott. all utility of IQ testing. 16https://www.inf.uni-hamburg.de/en/ inst/ab/lt/resources/data/germeval-2020- 17https://www.nobelprize.org/alfred-nobel/alfred-nobels- cognitive-motive.html thoughts-about-war-and-peace/

Some participants of the heated discussion of the GermEval20 Task 1 criticized this task for being “dangerously pseudoscientific”. To understand what this criticism refers to, one must first understand pseudoscience. Hansson, a Swedish philosopher, first differentiates science from pseudoscience in that scientists share a common raison d'être: to provide the reader with the most epistemically warranted statements (Hansson, 2013, p. 62 ff.) by employing known and broadly respected methods for finding those statements.

Furthermore, Hansson describes the correspondence between different scientific fields and disciplines that are interconnected. No given statement violates statements made by other disciplines and fields.

As for pseudoscience, authors are mostly divided as to which characteristics define it. However, two major characteristics appear to be agreed upon by most authors: i) non-science posing as science and ii) doctrinal components (Hansson, 2017).

For pseudoscience to pose as science, considerable effort is undertaken to present statements as having been made according to those scientific principles, even if they have not. As science offers the advantage of describing true phenomena and reality, pseudoscientists strive for acceptance by readers with statements that normally would not withstand the thorough process of scientific work.

For pseudoscience to be of deviant doctrine, the pseudoscientists put sustained effort into promoting standpoints different from those that have scientific legitimacy. Thus, pseudoscientists disregard major principles of scientific work, like correspondence, consensus, and consistency, as well as transparent methodology, replicability or intersubjectivity (Sahakian and Sahakian, 1993).

As for the GermEval20 Task 1, critics saw either non-scientific work being presented as scientific work or a doctrine that disregards established methods from the corresponding scientific fields, which are NLP and psychology. The main argument for calling the shared task pseudoscience is most likely the view that, since IQ testing is viewed by many researchers as biased and imprecise, even asking for machine learning systems would be pseudoscientific. They view the methodology as not being reconcilable with established ones.

Furthermore, discussion participants criticized that a shared task carries a scientific premise, which they did not view as appropriate for a task that is, in their view, methodologically flawed.

Figure 6: Critics used sarcasm to express their view of the shared task being methodologically flawed. This, paired with the scientific framing that GermEval offers, led to accusations of pseudoscience. The point of the task, to research implicit motives, appears to have been missed by some.

However, as shown in Section 6 on point of discussion ii), many critics mistakenly assumed that the task is about building an automated system for ranking students or classifying IQ scores, whereas it is only about researching the implicit motive theory, as the response on the corpora mailing list shows:[18]

“[...] lending legitimacy to the use of similar tools that are used as a pseudoscientific mantle to disguise (essentially) the automation of racial/ethnic/cultural discrimination and biases [...]”

[18] https://mailman.uib.no/public/corpora/2019-December/030885.html by Yannick Versley.

7.3 Marketplace of ideas

Even if the criticism of the GermEval20 Task 1 setup, of automation, or of IQ testing is legitimate and points to real issues, there are still strong ethical arguments for educational institutions against giving in when they are broadly and publicly exposed to social media sanctions and calls for “pulling the task”.[19]

[19] https://medium.com/@emilymenonbender/is-there-research-that-shouldnt-be-done-is-there-research-that-shouldn-t-be-encouraged-b1bf7d321bb6 by Emily M. Bender.

One of those arguments is the marketplace of ideas, which was first discussed by John Stuart Mill in his 1859 book On Liberty (Mill, 2011). The marketplace of ideas is an analogy to the free market and assumes that when ideas, statements, and thoughts are presented with almost perfect information (that is, on a transparent, replicable and reliable basis), only the truth will emerge from this marketplace. Gordon (1997) describes the marketplace of ideas as a metaphor in which people speak and exchange ideas freely. Freely means that there is as little interference from the government and from society as possible.

Applied to the GermEval20 Task 1, there is a violation of this philosophical and liberal principle of freely spoken ideas. Whilst the government did not interfere with the ideas presented, society did so by way of strong social media pressure and sanctions, not discussing the idea itself but rather demanding that the idea be stopped without professional discourse (at least on Twitter, as discussions on the corpora mailing list were mostly argumentative[20]).

[20] https://mailman.uib.no/public/corpora/2019-December/

At times, ideas, scientific interests, and projects might provoke criticism due to a Zeitgeist, even if they are legitimate and worth exploring. Darwin (1859) was heavily criticized and even had his novel work on his theory of evolution banned. Criticism of his theory arose not only from religious organizations but even from respected and well-established fellow researchers.[21] However, if the marketplace of ideas had been applied to Darwin, his theory and findings would have been openly discussed with all forms of scientific research, arguments, and findings, leaving it to the audience and the research community to determine whether his theory holds for the moment. If his ideas, however, had been banned, as suggested by some critics of the GermEval20 Task 1, there would not have been an open debate. Furthermore, if Darwin's theory had been utterly wrong, it would not have been able to compete and would thus have vanished.

[21] https://en.wikipedia.org/wiki/Reactions_to_On_the_Origin_of_Species

7.4 Knowledge cannot be restrained

As Gershon (1983) describes, multiple researchers announced that they would leave science after having discovered how to isolate DNA fragments for the first time. They feared that this discovery would lead to political and social pressure. One of those scientists even formed a group that categorically campaigned against any scientific work in this genetic field. Nonetheless, DNA sequencing has continued to be researched.

There are indications that research discoveries, at least in basic research, cannot be fully prevented or stopped, as the so-called multiple discovery or simultaneous invention principle calls for them to be made. This multiple discovery principle is the hypothesis that most discoveries are made independently by multiple scientists at the same time, often internationally. The Nobel Prize committee often recognizes this hypothesis by rewarding multiple scientists who, at that time, did not collaborate directly.

This hypothesis is thought to be observable, since existing discoveries, theories and scientific tools enable practicing scientists of a field to make new discoveries at that point in time. As the circumstances are ideal in an internationally spread research community, simultaneous inventions are made possible. One example is radar technology, which was discovered by multiple countries independently and at the same time (Galati, 2015). Thus, many believe that the suppression of scientific progress is not possible.

On the other side, Martin (1978) argues that the notion of science, and of technology emerging from that science, developing independently of the thoughts and desires of single scientists and of pressure from society is historically incorrect. In his article, the author argues with selected examples, namely nuclear power, food additives, transport policy, genetic engineering, and automation, all of which are characterized as technologies that emerged from basic research and experienced pressure and concerns from the research community and society. What the author does not dispute is the value of basic research itself. He states that the path of scientific and technological development is not usually predictable beforehand. Furthermore, Martin notes that concerns over scientific and technological development almost always have to do with applications and implications for the wider society.

At times, the researcher could have anticipated what negative impact a discovery or invention could have on society, as Nobel, who invented dynamite mainly to support mining, could have imagined its use for military purposes. Nonetheless, the individuals utilizing dynamite to build weaponry are to blame rather than Nobel himself, even if he greatly regretted that his discovery was used in this way.[22]

[22] https://www.nobelprize.org/alfred-nobel/alfred-nobels-thoughts-about-war-and-peace/

7.5 Pushing scientists out of academia

Whilst the US has spent 4,545.7 million dollars (Pece, 2020) on research and development (R&D) in computer sciences and mathematics, the US Department of Defense possessed an R&D budget of 52,973.3 million dollars, which is more than 40% of the total US R&D budget. Some of the most influential advancements in computer science have been researched behind closed doors for military purposes, such as the RSA cryptosystem, which had already been invented by the GCHQ four years before the later patented, peer-reviewed method,[23] or the predecessor of the internet, the ARPANET, which was developed under the U.S. Department of Defense's Advanced Research Projects Agency (ARPA) in 1969 (O'Neill, 1995).

[23] https://www.wired.com/1999/04/crypto/

Some private companies possess comparably large R&D budgets as well: Alphabet, the parent company of Google, spent 26,018 million dollars on R&D.[24] Even though the most recent scientific advancements were made open source and have been peer-reviewed, such as the bidirectional encoder representations from transformers (BERT; Devlin et al., 2019) and TensorFlow 1.0.0 (Fujita et al., 2017, p. 564), earlier developments, such as the Google PageRank algorithm, were kept hardly reproducible, even though patents describe the basic procedure (Lindberg, 2008).

[24] https://abc.xyz/investor/

One consequence and risk of violations of the marketplace of ideas is that researchers who experience pressure might leave public academia to pursue research in the private sector, which does not necessarily publish research to be reviewed, discussed, and criticized by the public. This could lead to knowledge monopolies, as well as to fraudulent or improperly conducted research.

This is further reflected by the recent development that influential technology companies have caused a so-called AI brain drain, meaning that many countries experience the emigration of AI researchers. A national brain drain is observable from the public research sector and academia to private firms due to higher salaries, greater funding, and at times more academic freedom (Kunze, 2019).

8 Discussion

Whenever basic research leads to new technologies and applications, there is a risk of misuse. As humans nowadays produce a vast amount of digitally available textual resources, research on NLP applications could quickly lead to questionable and possibly dangerous results. The GermEval20 Task 1 has rightfully sparked a heated debate on the ethical considerations of this task, as it involves not only NLP methods but also aptitude diagnostics, psychometrics, and IQ testing, all of which can be and have been misused.

However, as we have shown in this paper, the three main topics of discussion, i) IQ testing and biases, ii) reasons for building such a system and iii) forbidden research, have been evaluated positively in terms of their ethical implications.

IQ tests are very prone to biases. As the data was collected from a small university of applied sciences in Germany, the peer groups are heterogeneous, no score can be reverse-engineered from the available data, and the main point of the task is not to automate IQ testing but to research implicit motives, we believe the discussion has lost track of what the task is truly about.

This also leads to the second topic of discussion. We have shown that in Germany, high schools already function as destructive socioeconomic filters that discriminate against minorities. Implicit motives have been shown to be far less biased and more neutral. If they were better understood and validated, aptitude diagnostics could finally move away from bias-prone metrics such as high school grades or IQ scores.

Lastly, we have shown that most calls to forbid research topics are not only misleading but harmful. In a marketplace of ideas, only truth can emerge. The past has additionally shown that progress is hardly containable. Moreover, public shaming and condemnation of research ideas could lead to them moving to the private sector, which already has a lot of innovative power without the necessity to present and discuss ideas and to have them criticized by peers, all of which are keystones of scientific work.

We view this ethical discussion as partly lacking objectivity but have also seen a valuable discourse from most of the participants. It is right to view any research project critically. However, it is always important to closely investigate what a research idea is truly about and to honor scientific freedom, as forbidding certain ideas puts this freedom at stake.

9 Acknowledgements

Firstly, we want to thank the NLP community, which participated in the important discourse on this task. Especially the fellow researchers from the corpora list gave valuable input and points of view.

Furthermore, we want to thank Emily M. Bender, who first raised reasonable concerns about the task and extended the discourse to Twitter, as well as to Medium, where she wrote two posts summarizing and evaluating the main reasons for those concerns. Also, we would like to thank her for participating in the KONVENS 2020 Ethics and NLP panel. Additionally, we want to thank the Twitter users who also took part in the discourse, providing even more and broader insights into the social implications of the task.

We want to thank Michele Loi, who wrote a profound ethical assessment of the task, provided us with objective and neutral ethical arguments and agreed to participate in a constructive NLP + society session alongside Emily M. Bender and Dirk Johannßen at the SWISSTEXT & KONVENS conference.

We want to thank the SWISSTEXT & KONVENS committee for supporting our task, neutrally investigating its ethical impacts and providing expertise and aid in determining the task's risks and chances.

We want to thank our colleagues and lab members, who supported us with advice and helped us keep an overview and an objective point of view.

Last but not least, we want to thank the participants of the shared task for staying open-minded and interested in the research, providing well-founded empirical evidence of whether such a task is solvable and discussing its impacts on the research community and society.

References

Etienne Benson. 2003. Intelligent intelligence testing. Monitor of Psychology, 34(2):48–49.

Lutz Bornmann. 2013. What is social impact of research and how can it be assessed? A literature survey. Journal of the American Society for Information Science and Technology, 64:217–233.

Annette Joy Braunack-Mayer. 2001. What makes a problem an ethical problem? An empirical perspective on the nature of ethical problems in general practice. Journal of Medical Ethics, 27(2):98–103.

Deborah Christie. 2005. Introduction to IQ testing. Psychiatry, 4:22–25.

Brian K. Cooke, Dominque Delalot, and Tonia L. Werner. 2015. Hall v. Florida: Capital Punishment, IQ, and Persons With Intellectual Disabilities. Journal of the American Academy of Psychiatry and the Law Online, 43(2):230–234.

Charles Darwin. 1859. On the origin of species by means of natural selection, or, The preservation of favoured races in the struggle for life, volume 1859. London: John Murray.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, MN, USA. Association for Computational Linguistics.

Claudia Diehl and Patrick Fick. 2016. Ethnische Diskriminierung im deutschen Bildungssystem. In Claudia Diehl, Christian Hunkler, and Cornelia Kristen, editors, Ethnische Ungleichheiten im Bildungsverlauf: Mechanismen, Befunde, Debatten, pages 243–286. Springer Fachmedien, Wiesbaden.

Carol S. Dweck. 2017. From needs to goals and representations: Foundations for a unified theory of motivation, personality, and development. Psychological Review, 124(6):689–719.

Patricia Fernandez-Kelly. 2015. The Unequal Structure of the German Education System: Structural Reasons for Educational Failures of Turkish Youth in Germany. Spaces & flows: an international journal of urban and extraurban studies, 2:93–112.

James Flynn. 1987. Massive IQ gains in 14 Nations: What IQ tests really measure. Psychological Bulletin, 101:171–191.

Hamido Fujita, Ali Selamat, and Sigeru Omatu. 2017. New Trends in Intelligent Software Methodologies, Tools and Techniques: Proceedings of the 16th International Conference SoMeT 2017. IOS Press, Washington, DC, USA.

Gaspare Galati. 2015. A Simultaneous Invention – The Former Developments. In 100 Years of Radar, 1st ed. 2016 edition, pages 55–77. Springer, NY, USA.

Bertram Gawronski and Jan De Houwer. 2014. Implicit measures in social and personality psychology. Handbook of research methods in social and personality psychology, 2:283–310.

Elliot S. Gershon. 1983. Should science be stopped? The case of recombinant DNA research. The Public Interest, 71:3–16.

Jill Gordon. 1997. John Stuart Mill and the “Marketplace of Ideas”. Social Theory and Practice, 23(2):235–249.

Alexander Görke and Armin Scholl. 2006. Niklas Luhmann's theory of social systems and journalism research. Journalism Studies, 7:644–655.

Sven O. Hansson. 2013. Defining pseudoscience and science. University of Chicago Press, Chicago, IL, USA.

Sven O. Hansson. 2017. Science and Pseudo-Science. In Edward N. Zalta, editor, The Stanford Encyclopedia of Philosophy, summer 2017 edition. Metaphysics Research Lab, Stanford University.

Benedikt Hell, Sabrina Trapmann, and Heinz Schuler. 2007. Eine Metaanalyse der Validität von fachspezifischen Studierfähigkeitstests im deutschsprachigen Raum. Empirische Pädagogik, 21(3):251–270.

Dirk Hovy and Shannon Spruit. 2016. The Social Impact of Natural Language Processing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 591–598, Berlin, Germany. Association for Computational Linguistics.

Dirk Johannßen and Chris Biemann. 2019. Neural classification with attention assessment of the implicit-association test OMT and prediction of subsequent academic success. In Proceedings of the 15th Conference on Natural Language Processing, KONVENS, Erlangen, Germany. German Society for Computational Linguistics & Language Technology.

Dirk Johannßen, Chris Biemann, Steffen Remus, Timo Baumann, and David Scheffer. 2020. GermEval 2020 Task 1 on the Classification and Regression of Cognitive and Motivational style from Text. In Proceedings of the GermEval 2020 Task 1 Workshop in conjunction with the 5th SwissText & 16th KONVENS Joint Conference 2020, pages 1–10, Zurich, Switzerland (online). German Society for Computational Linguistics & Language Technology.

Rolf-Torsten Kramer, Werner Helsper, Sven Thiersch, and Carolin Ziems. 2009. Selektion und Schulkarriere: Kindliche Orientierungsrahmen beim Übergang in die Sekundarstufe I. Studien zur Schul- und Bildungsforschung. VS Verlag für Sozialwissenschaften.

Lars Kunze. 2019. Can We Stop the Academic AI Brain Drain? KI - Künstliche Intelligenz, 33(1):1–3.

Detlev Liepmann, André Beauducel, Burkhard Brocke, and Rudolf Amthauer. 2007. Intelligenz-Struktur-Test 2000 R. Hogrefe Verlag, Göttingen, Germany.

Van Lindberg. 2008. Intellectual Property and Open Source: A Practical Guide to Protecting Code. O'Reilly & Associates, Sebastopol, CA, USA.

Yusen Liu, Fangyuan He, Haodi Zhang, Guozheng Rao, Zhiyong Feng, and Yi Zhou. 2019. How Well Do Machines Perform on IQ tests: a Comparison Study on a Large-Scale Dataset. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pages 6110–6116. International Joint Conferences on Artificial Intelligence Organization.

Han Maas, Kees-Jan Kan, and Denny Borsboom. 2014. Intelligence Is What the Intelligence Test Measures. Seriously. Journal of Intelligence, 2:12–15.

Brian Martin. 1978. Can scientific development be stopped? Australian Science Teachers Journal, 24(1):65–70.

David C. McClelland. 1988. Human Motivation. Cambridge University Press.

David C. McClelland, Richard Koestner, and Joel Weinberger. 1989. How do self-attributed and implicit motives differ? Psychological Review, 96(4):690–702.

John Stuart Mill. 2011. On Liberty. Cambridge Library Collection - Philosophy. Cambridge University Press.

Jens Nachtwei and Carsten C. Schermuly. 2009. Acht Mythen über Eignungstests. Harvard Business Manager, (04/2009):6–10.

Judy O'Neill. 1995. The role of ARPA in the development of the ARPANET, 1961-1972. Annals of the History of Computing, IEEE, 17:76–81.

Deniz S. Ones, Stephan Dilchert, Chockalingam Viswesvaran, and Jesús F. Salgado. 2017. Cognitive ability: Measurement and validity for employee selection. In Handbook of Employee Selection, Second Edition, pages 251–276. Taylor and Francis.

Christopher Pece. 2020. Federal R&D Obligations Increase 8.8% in FY 2018; Preliminary FY 2019 R&D Obligations Increase 9.3% Over FY 2018. Technical Report 20-308, National Science Foundation.

James W. Pennebaker, Cindy K. Chung, Joey Frazee, Gary M. Lavergne, and David I. Beaver. 2014. When Small Words Foretell Academic Success: The Case of College Admissions Essays. PLOS ONE, 9(12):e115844.

Robert Plomin and Ian J. Deary. 2015. Genetics and intelligence differences: five special findings. Molecular Psychiatry, 20(1):98–108.

Ajitha Reddy. 2007. The eugenic origins of IQ testing: Implications for post-Atkins litigation. DePaul L. Rev., 57:667.

Detlef H. Rost. 2009. Intelligenz: Fakten und Mythen. Beltz, Weinheim, Germany.

John P. Rushton and Arthur R. Jensen. 2005. Thirty years of research on race differences in cognitive ability. Psychology, Public Policy, and Law, 11(2):235–294.

William S. Sahakian and Mabel L. Sahakian. 1993. Ideas of the Great Philosophers. Barnes & Noble, New York, NY, USA.

Robert M. Sanger. 2015. IQ, Intelligence Tests, 'Ethnic Adjustments' and Atkins. SSRN Scholarly Paper ID 2706800, Social Science Research Network, Rochester, NY, USA.

David Scheffer. 2004. Implizite Motive: Entwicklung, Struktur und Messung [Implicit Motives: Development, Structure and Measurement]. Hogrefe Verlag, Göttingen, Germany.

David Scheffer and Julius Kuhl. 2013. Auswertungsmanual für den Operanten Multi-Motiv-Test OMT. sonderpunkt Verlag, Münster, Germany.

Frank L. Schmidt and John E. Hunter. 1998. The validity and utility of selection methods in personnel psychology: Practical and theoretical implications of 85 years of research findings. Psychological Bulletin, 124(2):262–274.

Oliver C. Schultheiss. 2008. Implicit motives. Handbook of personality: Theory and research, pages 603–633.

Claude E. Shannon. 1948. A mathematical theory of communication. Bell System Technical Journal, 27(3):379–423.

Yulia Tsvetkov, Vinodkumar Prabhakaran, and Rob Voigt. 2018. Socially Responsible NLP. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Tutorial Abstracts, pages 24–26, New Orleans, LA, USA. Association for Computational Linguistics.

Jonathan B. Tucker. 2012. Innovation dual use and security: Managing the risks of emerging biological and chemical technologies. MIT Press.

Eric Turkheimer, Andreana Haley, Mary Waldron, Brian D'Onofrio, and . 2003. Socioeconomic Status Modifies Heritability of IQ in Young Children. Psychological Science, 14:623–8.

Siamak Vahidi. 2015. Intelligence Testing and Cultural Diversity: Pitfalls and Promises | The National Research Center on the Gifted and Talented (1990-2013). Library Catalog: nrcgt.uconn.edu.

David Wechsler. 2011. WASI-II: Wechsler abbreviated scale of intelligence. NCS Pearson, San Antonio, TX, USA.

Bryn Williams-Jones, Catherine Olivier, and Elise Smith. 2014. Governing 'Dual-Use' Research in Canada: A Policy Review. Science and Public Policy, 41:76–93.

Yingyi Zhang, Jing Li, Yan Song, and Chengzhi Zhang. 2018. Encoding Conversation Context for Neural Keyphrase Extraction from Microblog Posts. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1676–1686, New Orleans, LA, USA. Association for Computational Linguistics.
