Research Articles: Behavioral/Cognitive When implicit prosociality trumps selfishness: the neural valuation system underpins more optimal choices when learning to avoid harm to others than to oneself https://doi.org/10.1523/JNEUROSCI.0842-20.2020
Cite as: J. Neurosci 2020; 10.1523/JNEUROSCI.0842-20.2020 Received: 8 April 2020 Revised: 9 June 2020 Accepted: 7 July 2020
This Early Release article has been peer-reviewed and accepted, but has not been through the composition and copyediting processes. The final version may differ slightly in style or formatting and will contain links to any extended data.
Copyright © 2020 the authors
1 Title: When implicit prosociality trumps selfishness: the neural valuation system underpins more
2 optimal choices when learning to avoid harm to others than to oneself
3 Abbreviated title: Neural valuation underpins prosocial choices
4 Authors: Lukas L. Lengersdorff 1, Isabella C. Wagner 1, Patricia L. Lockwood 2,3, and Claus Lamm 1
5 Affiliation:
6 1 Social, Cognitive and Affective Neuroscience Unit, Department of Cognition, Emotion, and Methods
7 in Psychology, Faculty of Psychology, University of Vienna, Liebiggasse 5, 1010 Vienna, Austria
8 2 Department of Experimental Psychology, University of Oxford, Oxford OX1 3PH, United Kingdom
9 3Centre for Human Brain Health, University of Birmingham, Birmingham B15 2TT, United Kingdom
10 Corresponding author: Claus Lamm, [email protected]
11 Number of pages: 49
12 Number of figures: 4
13 Number of tables: 4
14 Number of words: Abstract: 247 | Significance statement: 114 | Introduction: 645 | Discussion: 1651
15 Conflict of interest statement:
16 The authors declare no competing financial interests.
17 Acknowledgements:
18 This work was funded by the Vienna Science and Technology Fund (WWTF VRG13-007). We would
19 like to thank Lei Zhang and Nace Mikus for helpful discussions concerning the modeling analyses, and
20 feedback on the manuscript, and David Sastre-Yagüe, Gloria Mittmann, André Lüttig, Sophia Shea
21 and Leonie Brög for assistance during data collection.
22 Abstract
23 Humans learn quickly which actions cause them harm. As social beings, we also need to learn to
24 avoid actions that hurt others. It is currently unknown if humans are as good at learning to avoid
25 others’ harm (prosocial learning) as they are at learning to avoid self-harm (self-relevant learning).
26 Moreover, it remains unclear how the neural mechanisms of prosocial learning differ from those of
27 self-relevant learning. In this fMRI study, 96 male human participants learned to avoid painful stimuli
28 either for themselves or for another individual. We found that participants performed more
29 optimally when learning for the other than for themselves. Computational modeling revealed that
30 this could be explained by an increased sensitivity to subjective values of choice alternatives during
31 prosocial learning. Increased value-sensitivity was further associated with empathic traits. On the
32 neural level, higher value-sensitivity during prosocial learning was associated with stronger
33 engagement of the ventromedial prefrontal cortex (VMPFC) during valuation. Moreover, the VMPFC
34 exhibited higher connectivity with the right temporoparietal junction during prosocial, compared to
35 self-relevant, choices. Our results suggest that humans are particularly adept at learning to protect
36 others from harm. This ability appears implemented by neural mechanisms overlapping with those
37 supporting self-relevant learning, but with the additional recruitment of structures associated with the
38 social brain. Our findings contrast with recent proposals that humans are egocentrically biased when
39 learning to obtain monetary rewards for self or others. Prosocial tendencies may thus trump
40 egocentric biases in learning when another person's physical integrity is at stake.
41 Significance Statement
42 We quickly learn to avoid actions that cause us harm. As “social animals”, we also need to learn and
43 consider the harmful consequences our actions might have for others. Here, we investigated how
44 learning to protect others from pain (prosocial learning) differs from learning to protect oneself (self-
45 relevant learning). We found that human participants performed better during prosocial learning
46 than during self-relevant learning, as they were more sensitive towards the information they
47 collected when making choices for the other. Prosocial learning recruited similar brain areas as self-
48 relevant learning, but additionally involved parts of the “social brain” that underpin perspective-
49 taking and self-other distinction. Our findings suggest that people show an inherent tendency
50 towards “intuitive” prosociality.
51 Introduction
52 To ensure survival, it is essential that we learn to refrain from actions that cause ourselves harm.
53 Physical pain acts as a powerful learning signal, indicating the immediate need to adjust our behavior
54 to avoid injury (Wiech and Tracey, 2013; Vlaeyen, 2015; Tabor and Burr, 2019). As social beings, we
55 also have to learn and adapt our behavior to avoid harm to others, and “interpersonal harm
56 aversion” has been proposed as the basis of prosocial behavior and morality (Gray et al., 2012;
57 Crockett, 2013; Chen et al., 2018; Decety and Cowell, 2018). However, it remains unknown if humans
58 are as good at learning to avoid harm to others (prosocial learning) as they are at learning to avoid
59 harm to themselves (self-relevant learning). Moreover, while the neural underpinnings of self-
60 relevant learning are well-established, the mechanisms behind prosocial learning remain unclear.
61 Recent evidence suggests that other people's physical integrity is a highly relevant factor for human
62 behavior. Crockett and colleagues (2014, 2015) found that individuals are willing to spend more
63 money to protect others from pain than to protect themselves. This suggests that individuals are
64 “hyperaltruistic” in situations where they can deliberately weigh self- and other-relevant outcomes.
65 It remains unclear, though, if hyperaltruism extends to situations where prosociality depends on
66 more implicit processes, such as operant learning (Zaki and Mitchell, 2013). Other studies have
67 proposed an egocentric bias in learning: Participants learned more slowly to gain financial rewards,
68 and to avoid financial losses, for others than for themselves (Kwak et al., 2014; Lockwood et al.,
69 2016), and to associate objects with others compared to oneself (Lockwood et al., 2018). Crucially, it
70 is thus possible that humans show superior self-relevant learning in the context of financial outcomes
71 and basic associative learning, but similar, or superior, prosocial learning when another person's
72 health is at stake. Moreover, prosocial learning performance might vary greatly between individuals,
73 due to differences in socio-cognitive traits such as empathy (Lockwood et al., 2016; Olsson et al.,
74 2016; Lamm et al., 2019). Here, we therefore investigated people’s performance in prosocial learning
75 in the context of harm-avoidance, and how it is associated with empathic traits.
76 Advances in the neuroscience of reinforcement learning (Lee et al., 2012) suggest that self-relevant
77 learning is implemented by neural systems for two processes: valuation, i.e., choosing between
78 alternatives based on their subjective values, and outcome evaluation, i.e., updating values in
79 response to choice outcomes. Converging evidence suggests that both self- and other-relevant
80 valuation engages the ventromedial prefrontal cortex (VMPFC; Kable and Glimcher, 2009; Ruff and
81 Fehr, 2014). VMPFC activation related to valuation in decisions affecting others appears to be further
82 modulated by brain areas implicated in social cognition, such as the temporoparietal junction (TPJ), a region
83 linked to self-other distinction and perspective-taking (Hare et al., 2010; Janowski et al., 2013). With
84 respect to outcome evaluation, it has been found that learning based on pain as a feedback signal
85 engages the anterior cingulate cortex (ACC) and the anterior insula (AI), both when pain is directed to
86 oneself, and when witnessing pain in conspecifics (Olsson et al., 2007; Lindström et al., 2018; Keum
87 and Shin, 2019). It is thus likely that these brain areas also underpin prosocial learning, but the extent
88 of their involvement is unknown.
89 Here, we conducted a high-powered study with male human participants (N=96) who learned to
90 avoid painful stimuli either for themselves or for another individual. Combining computational modeling
91 with fMRI, we tested whether participants were better, or worse, at learning to avoid others’ harm
92 compared to self-harm. We expected that differences in learning behavior should be reflected by
93 different recruitment and connectivity of the VMPFC during valuation, as well as the ACC and AI
94 during outcome evaluation. Our major aim was to clarify whether humans are actually “selfish”
95 learners, as suggested by previous evidence using monetary outcomes as learning signals, or if
96 prosocial tendencies can trump egocentricity in the face of harm.
97 Materials and methods
98 Data reported here were acquired as part of a longitudinal project investigating the effects of violent
99 video games on social behavior (outside the scope of the present report). In brief, participants
100 completed two fMRI sessions approximately two weeks apart. After the first session, participants
101 came to the behavioral laboratory seven times over the course of two weeks to complete a
102 violent (or non-violent) video game training (discussed elsewhere). Here, our focus is exclusively on
103 the data from the first session, prior to video game training. Note that at this point of the
104 experiment, participants had not yet been exposed to any experimental manipulation, instructions,
105 or other types of information associated with violent video games. This ensures that the present
106 results are not influenced by the design of the overarching study, and that we could exploit the high
107 sample size of that study to pursue the present scientific aims related to prosocial learning.
108 Participants
109 Ninety-six male volunteers (age range: 18 - 35 years) participated in the study. The novelty of the
110 present design precluded a formal a priori power analysis. As a benchmark, though, we considered the effect
111 size reported by Lockwood and colleagues (2016), who used a similar task, but with financial rewards
112 as positive reinforcers, rather than pain as negative reinforcers. They found an effect size of d = 0.87
113 when comparing the learning rates for self-relevant learning to the learning rates for prosocial
114 learning. With a sample size of 96, we had a power > 99.9% to find an effect of this size in our study.
115 Sensitivity analysis further indicated that we had a power of 80% to detect an effect of d = 0.29,
116 suggesting that our study was adequately powered to detect medium-to-small effects (Cohen, 2013).
117 Power analyses were performed with the software G*Power 3 (Faul et al., 2007). All participants were
118 healthy, right-handed, had normal or corrected-to-normal vision, reported no history of neurological
119 or psychiatric disorders or drug abuse, and fulfilled standard inclusion criteria for MRI
120 measurements. Only male participants aged 18–35 years were tested, based on
121 constraints related to the overarching project on violent video games. Hence, the findings presented
122 here only apply to this population. All participants gave written consent prior to participation and
123 received financial reimbursement, and the study had been approved by the Ethics Committee of the
124 Medical University of Vienna.
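The power figures reported above can be reproduced with a short sketch. The original analyses used G*Power 3; here is a Python equivalent, assuming a two-tailed paired/one-sample t-test on the self-vs-other learning-rate difference (the comparison underlying the Lockwood et al., 2016, benchmark):

```python
# Sketch of the reported power calculations (Python stand-in for G*Power 3),
# assuming a two-tailed paired t-test with n pairs and effect size d.
import numpy as np
from scipy import stats

def paired_ttest_power(d, n, alpha=0.05):
    """Power of a two-tailed paired/one-sample t-test."""
    df = n - 1
    ncp = d * np.sqrt(n)                          # noncentrality parameter
    t_crit = stats.t.ppf(1 - alpha / 2, df)       # two-tailed critical value
    # Mass of the noncentral t-distribution beyond both critical values
    return stats.nct.sf(t_crit, df, ncp) + stats.nct.cdf(-t_crit, df, ncp)

print(paired_ttest_power(0.87, 96))   # benchmark effect: power > 0.999
print(paired_ttest_power(0.29, 96))   # sensitivity analysis: power of about 0.80
```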
125 Study timeline and procedures
126 Participant arrival and interaction with the confederate
127 Each participant was paired with another male participant who in reality was a member of the
128 experimental team (i.e., a confederate). This way, we ensured that the participant was under the
129 impression of completing the prosocial learning task together with another person (see below for
130 details of the task). As a cover story, the participant was told that he had been randomly assigned to
131 the role of the "MRI participant", and would complete the experimental tasks inside the MR scanner,
132 while the confederate had been assigned to the role of "pulse measures participant" from whom we
133 would take pulse measurements only. After arrival at the MR facility, it was explained that the
134 participant would first receive task explanations in another room while the confederate would
135 complete the pain calibration, and that they would switch places afterwards (only the participant
136 actually underwent the experimental procedures). After the pain calibration (see below), the
137 participant was positioned in the MR scanner and the confederate was seated next to the scanner at
138 a table with an MR compatible monitor and keyboard. To uphold the impression of the confederate
139 actually participating in the task, the confederate remained seated next to the scanner for the entire
140 duration of the experiment, and also responded audibly to the experimenter’s mock questions about
141 his physical comfort via the intercom.
142 Participants were formally interviewed about whether they had doubted the deception (e.g., whether they
143 suspected that the confederate was not a real participant) at the end of the second session of the overarching project (two
144 weeks later). At this point, six participants reported doubts that the confederate was a real
145 participant. No participant expressed doubts about the confederate after the first session, but please
146 note that they were not explicitly interviewed regarding doubts at that point in time, to not raise
147 suspicions about the cover story.
148 Electrical stimulation and pain calibration
149 Electrical stimulation was delivered with a Digitimer DS5 Isolated Bipolar Constant Current Stimulator
150 (Digitimer, Ltd., Clinical & Biomedical Research Instruments) using concentric surface electrodes with
151 7 mm diameter and a platinum pin (WASP electrode, Specialty Developments) attached to the
152 dorsum of the left hand. With this setup, electrical stimulation was constrained to the skin under and
153 directly adjacent to the electrodes. To determine each participant’s subjective pain threshold, we
154 employed an adaptive staircase procedure (as previously used in Rütgen et al., 2015b, 2015a). In
155 brief, the participant received short stimuli (500 ms) of increasing intensity and was asked to rate
156 pain intensity on a ten-point numeric scale from 0 to 9. The numbers corresponded to the following
157 subjective perceptions: 0 = "not perceptible"; 1 = "perceptible, but not painful", 3 = "a little painful",
158 5 = "moderately painful", 7 = "very painful", 9 = "extremely painful, highest tolerable pain". To
159 account for habituation and familiarization effects, this procedure was repeated once, after which
160 the participant received 30 random electrical stimuli. The average intensities of electrical stimuli that
161 were rated as 1 and 7 were then chosen as the intensities of non-painful and painful stimulation
162 during the pain-avoidance task.
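The final step of this calibration can be sketched as follows. This is a minimal illustration of how the two stimulation levels were derived from the ratings; the function name and toy data are ours, not from the study's scripts:

```python
# Minimal sketch: the task's stimulation intensities are the mean intensities
# rated 1 ("perceptible, but not painful") and 7 ("very painful").
import numpy as np

def calibration_intensities(intensities, ratings):
    intensities = np.asarray(intensities, dtype=float)
    ratings = np.asarray(ratings)
    non_painful = intensities[ratings == 1].mean()  # level for non-painful trials
    painful = intensities[ratings == 7].mean()      # level for painful trials
    return non_painful, painful
```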
163 Prosocial learning task
164 The participant was instructed to learn the association between abstract symbols and painful
165 electrical stimuli. During each trial (Figure 1A), the participant had to choose one of two symbols,
166 with one of the symbols resulting in non-painful electrical stimulation in 70% of the trials (painful in
167 30%), and the other in non-painful stimulation in only 30% of the trials (painful in 70%; note that participants were not informed about
168 these contingencies). Importantly, electrical stimulation as a consequence of the choice made was
169 either delivered to the participant (Self condition) or to the other person, i.e., the confederate (Other
170 condition). For both conditions, the participant completed 3 blocks of 16 trials (resulting in 48 trials
171 per condition, and 96 trials in total). The two experimental conditions alternated per block (so the
172 participants never completed the same condition twice in a row), and the starting condition was
173 counter-balanced across participants. Each block contained a different pair of symbols, and the
174 participant was instructed to learn this new set of symbol-stimulus contingencies once more.
175 At the start of a new block, the participant was instructed that he would now have to perform the
176 task for himself ("Play for yourself") or for the confederate (e.g., "Play for Michael”). During each
177 trial, the two symbols were presented for 3000 ms and the participant was asked to choose one
178 symbol (choice phase). Choices were made with the right hand via a button box. If the participant
179 made a choice within this time limit, a red or blue arrow (presented for 1000 ms) indicated whether
180 the current recipient would receive painful or non-painful stimulation, respectively (outcome phase).
181 The arrow pointed downwards if the participant received electrical stimulation himself (self-directed
182 electrical stimulation) or to the right towards the confederate (other-directed electrical stimulation).
183 If no button press occurred within the time limit, the message "too slow!" was displayed and was
184 followed by painful electrical stimulation to either the participant or the confederate, depending on
185 the current condition. After an interval of 2500 to 4500 ms (uniformly jittered, in steps of 250 ms),
186 the electrical stimulation was delivered (stimulation phase). During self-directed electrical
187 stimulation (500 ms), the participant saw a pixelated photograph of himself (1000 ms, same onset as
188 electrical stimulus). Pain intensity was indicated by a red or blue lightning icon in the lower right of
189 the photograph. During other-directed electrical stimulation, the participant saw a photograph of the
190 confederate with a neutral or painful facial expression, during non-painful or painful electrical
191 stimulation respectively. The next trial started after an inter-trial interval between 2500 and 4500 ms
192 (uniformly jittered, in steps of 250 ms). The task was presented using COGENT
193 (http://www.vislab.ucl.ac.uk/cogent.php), implemented in MATLAB 2017b (The MathWorks Inc.,
194 Natick, MA, USA). The total task duration was approx. 23 min.
195 As part of the cover story, the participant was instructed that the other participant (i.e., the
196 confederate) would never make choices that could result in stimulation for himself or the participant.
197 Instead, he would be presented with the choices of the participant, and would have to indicate per
198 button press if he would have made the same decision. We chose this approach (instead of saying
199 that the confederate would only passively receive stimulation without any further instructions) to
200 increase the believability of the deception.
201 MRI data acquisition
202 MRI data were acquired with a 3 Tesla Siemens Skyra MRI system (Siemens Medical, Erlangen,
203 Germany) and a 32-channel head coil. Blood oxygen level dependent (BOLD) functional imaging was
204 performed using a multiband accelerated echoplanar imaging sequence with the following
205 parameters: Echo time (TE): 34 ms; Repetition time (TR): 1200ms; flip angle: 66°; interleaved
206 ascending acquisition; 52 axial slices coplanar to the connecting line between anterior and posterior
207 commissure; multiband acceleration factor 4, resulting in 13 excitations per TR; field-of-view: 192
208 ×192 × 124.8 mm, matrix size: 96 × 96, voxel size: 2 × 2 × 2 mm, interslice gap 0.4 mm. Structural
209 images were acquired using a magnetization-prepared rapid gradient-echo sequence with the
210 following parameters: TE = 2.43 ms; TR = 2300 ms; 208 sagittal slices; field-of-view: 256 × 256 × 166
211 mm; voxel size: 0.8 × 0.8 × 0.8 mm. To correct functional images for inhomogeneities of the magnetic
212 field, field map images were acquired using a double echo gradient echo sequence with the following
213 parameters: TE1/TE2: 4.92/7.38 ms; TR = 400 ms; flip angle: 60°; 36 axial slices with the same
214 orientation as the functional images; field-of-view: 220 × 220 × 138 mm; matrix size: 128 × 128 × 36;
215 voxel size: 1.72 × 1.72 × 3.85 mm.
216 Trait measures
217 As a measure of trait empathy, participants completed the Questionnaire of Cognitive and Affective
218 Empathy (QCAE) (Reniers et al., 2011), which assesses empathic traits along the two dimensions
219 cognitive empathy (with the subdimensions perspective taking and online simulation) and affective
220 empathy (with the subdimensions emotion contagion, proximal responsivity and peripheral
221 responsivity). The questionnaire was completed after the MR session. Due to an error in data
222 collection, questionnaire data from nine participants were not available for analysis. We therefore
223 restricted analyses related to trait measures of empathy to the remaining 87 participants.
224 Data analysis
225 Our analysis plan was as follows: first, we tested whether participants showed better learning
226 behavior during prosocial or self-relevant learning, as measured by an objective index of optimal
227 behavior (described in detail below). Then, we used computational modeling to investigate if
228 behavioral differences could be explained by differences in the mechanisms underlying these two
229 types of learning. Finally, we used fMRI analyses to relate differences on the behavioral level to
230 differences in the neural mechanisms.
231 Analysis of optimal behavior
232 We first investigated if participants showed more optimal behavior during prosocial or self-relevant
233 learning. In decision theory, optimal choices are defined as those choices that maximize the expected
234 rewards (or minimize the expected loss), given all available information (Kulkarni and Gilbert, 2011).
235 Mathematically, our learning task is equivalent to a two-armed bandit problem with unknown
236 reward probabilities of the two choice alternatives, pA and pB. For this kind of problem, the optimal
237 choice in each trial can be derived algorithmically, as described by Steyvers, Lee and Wagenmakers
238 (2009; see also Kaelbling et al., 1996). For mathematical details, we refer the reader to these
239 references, and to our commented scripts (see section Code and data availability). In short, a choice
240 is optimal if an "ideal participant" with perfect memory and the ability to calculate all possible
241 outcome sequences would make the same choice, given the information at hand. Importantly, an
242 optimal choice maximizes not only the expected outcome of the current trial, but also the possible
243 outcomes of subsequent trials, balancing exploration and exploitation. Optimal choices also depend
244 on the prior information the decider has about the choice alternatives. This can be modelled by
245 defining a prior distribution for the reward probabilities pA and pB. Following Steyvers and colleagues
246 (2009), we modelled prior information about the learning task with a Beta distribution, with
247 parameters a and b set to 1.5. This gives a symmetric distribution with considerable probability mass
248 for all but the most extreme values of pA and pB. Thus, the distribution reflects that participants were
249 informed that both symbols would lead to painful stimulation in some cases, but did not receive any further
250 information about the probabilities. To investigate how sensitive our results were to these exact
251 values of a and b, we also calculated optimal choices under Beta distributions with both values set to
252 1 (reflecting a uniform distribution) or 2 (putting more probability mass on average probabilities).
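The optimal-choice algorithm described above can be sketched as finite-horizon dynamic programming over Beta posteriors, in the spirit of Steyvers, Lee, and Wagenmakers (2009). This is our own illustrative implementation, not the published scripts: outcomes are coded 1 = non-painful ("success") and 0 = painful, and a, b are the Beta prior parameters (1.5 in the main analysis):

```python
# Bayes-optimal play in a finite-horizon two-armed bandit with Beta(a, b)
# priors on the outcome probabilities pA and pB (sketch; names are ours).
from functools import lru_cache

def optimal_arms(sA, fA, sB, fB, trials_left, a=1.5, b=1.5):
    """Arms an ideal learner may optimally choose, given the successes (s)
    and failures (f) observed so far for each arm and the trials remaining."""

    @lru_cache(maxsize=None)
    def value(sA, fA, sB, fB, t):
        # Expected number of future non-painful outcomes under optimal play
        if t == 0:
            return 0.0
        return max(q("A", sA, fA, sB, fB, t), q("B", sA, fA, sB, fB, t))

    def q(arm, sA, fA, sB, fB, t):
        s, f = (sA, fA) if arm == "A" else (sB, fB)
        p = (s + a) / (s + f + a + b)  # posterior mean outcome probability
        if arm == "A":
            win = value(sA + 1, fA, sB, fB, t - 1)
            lose = value(sA, fA + 1, sB, fB, t - 1)
        else:
            win = value(sA, fA, sB + 1, fB, t - 1)
            lose = value(sA, fA, sB, fB + 1, t - 1)
        return p * (1.0 + win) + (1.0 - p) * lose

    qA = q("A", sA, fA, sB, fB, trials_left)
    qB = q("B", sA, fA, sB, fB, trials_left)
    if abs(qA - qB) < 1e-12:
        return {"A", "B"}          # tie: both choices count as optimal
    return {"A"} if qA > qB else {"B"}
```

With the symmetric prior and no observations, both arms are optimal; a participant's choice on a trial is then scored as optimal if it falls within this set, which automatically balances exploration and exploitation over the remaining trials.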
253 Preprocessing, analysis and visualization of behavioral data were performed using R 3.4.4 (R Core
254 Team, 2017). To test for differences in the number of optimal choices between prosocial and self-
255 relevant learning, we computed a generalized linear mixed model (GLMM), with optimal choice as
256 binary response variable (binomial family, logit link function), using the R packages lme4 (Bates et al.,
257 2015) and afex (Singmann et al., 2020). Trial (1 to 16; numeric, mean centered), condition (Self vs.
258 Other; categorical dichotomous; coded as Self = -1, Other = 1) and their interaction trial x condition
259 were specified as predictors. To account for dependencies between observations within the same
260 participants, we treated participants as levels of a random factor. Following the recommendation of
261 Matuschek and colleagues (2017), we selected the random effects structure that led to the most
262 parsimonious fit of the GLMM, as indicated by the Akaike Information Criterion (AIC; Akaike, 1998).
263 We then tested the significance of the fixed effects of the final model, using Type III drop-one-term
264 likelihood ratio tests, as implemented by the “mixed” function of the package afex.
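The fixed-effects structure of this GLMM can be illustrated with a short sketch. The model itself was fitted in R with lme4/afex; the Python function below only mirrors the predictor coding and the logit link, and the coefficient values are arbitrary placeholders, not estimates from the study:

```python
# Illustrative sketch of the GLMM fixed effects: logit link, trial
# mean-centered over 1..16, condition coded Self = -1 / Other = +1,
# plus the trial x condition interaction. Coefficients are placeholders.
import numpy as np

def p_optimal_choice(trial, condition, b0=0.5, b_trial=0.1, b_cond=0.2, b_inter=0.0):
    t_c = trial - np.mean(np.arange(1, 17))       # mean-center trial (1..16)
    c = {"Self": -1, "Other": 1}[condition]       # condition coding
    eta = b0 + b_trial * t_c + b_cond * c + b_inter * t_c * c
    return 1.0 / (1.0 + np.exp(-eta))             # inverse logit link
```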
265 Computational modeling
266 We used computational modeling to investigate differences in the processes underlying prosocial
267 and self-relevant learning. In computational modeling, one first defines a model as a set of
268 mathematical equations that relate observable behavior to theoretical quantities. If the model is true
269 (or, at least, useful; Box, 1976), it should help explain the observed behavior of a participant.
270 Importantly, the exact relationship between a model's theoretical quantities and actual behavior is
271 governed by a number of free parameters, which have to be estimated from the data. Using
272 techniques of model comparison, one can then select the most useful model from a set of plausible
273 candidates. Here, we utilized reinforcement learning (RL) models, which already have a long history
274 of use in neuroscientific research (Hackel and Amodio, 2018). Following this modeling framework, we
275 assumed that participants chose between symbols A and B based on the symbols' subjective values
276 (valuation). Then, they updated these subjective values based on the outcome of their choice
277 (outcome evaluation).
278 Models for valuation
279 We used the softmax function to model how subjective values are mapped onto choice probabilities.
280 This function takes as input the subjective values of symbols A and B (here denoted as V_A,t and V_B,t, and
281 bound in the interval [0,1]), and gives as output the probability of choosing one symbol over the
282 other. In the case of only two alternatives/symbols, the softmax simplifies to the logistic function,
283 and is thus
P(choice_t = "A" | V_A,t, V_B,t, β) = 1 / (1 + e^(−β(V_A,t − V_B,t))),
284 where P(choice_t = "A" | V_A,t, V_B,t, β) is the probability of choosing symbol "A" in trial t, given specific
285 values of V_A,t and V_B,t; e is the base of the exponential function, and β is the free inverse
286 temperature parameter, defined on the positive real numbers (0, ∞). Note that in the simple case of
287 two symbols, the probability of choosing symbol "B" is given by 1 − P(choice_t = "A" | V_A,t, V_B,t, β). The
288 probability of choosing symbol "A" increases nonlinearly (in a sigmoid shape) with the difference
289 V_A,t − V_B,t; for large positive values of the difference, the probability approaches 1, while for strongly
290 negative values of the difference, the probability approaches 0. If the difference is close to 0, the
291 probability is close to 0.5, that is, the choices of the participant are nearly random. Importantly, the
292 free parameter β defines how sensitive the softmax function, and thus the participants’ choices, are
293 to the difference in subjective values. For very high β, even small differences in subjective values lead
294 to a high probability of choosing the symbol with the higher value. For β close to 0, choices are very
295 random even if the difference in subjective values is high. In this paper, we will refer to β as a
296 measure of value-sensitivity, consistent with the parameter’s mathematical definition in the RL
297 model (see also Katahira, 2015; Chung et al., 2017). Importantly, this interpretation is also consistent
298 with the interpretation of β as an indicator of the exploration/exploitation trade-off (e.g., Humphries
299 et al., 2012; Cinotti et al., 2019). In other words, a high β may indicate exploitative choices, as choice
300 behavior is very sensitive to differences in value, and the option with the higher value is frequently
301 chosen. In contrast, a low β leads to choices that are less dependent on the value difference, which
302 may reflect the exploration of the alternative with the lower values.
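The choice rule above can be written as a one-line sketch. This is a minimal illustration (function name ours): with two options, the softmax reduces to the logistic function of the value difference scaled by β:

```python
# Two-option softmax = logistic function of the scaled value difference.
import numpy as np

def p_choose_A(V_A, V_B, beta):
    return 1.0 / (1.0 + np.exp(-beta * (V_A - V_B)))

p_choose_A(0.6, 0.4, 0.0)    # beta = 0: choices are random (probability 0.5)
p_choose_A(0.6, 0.4, 20.0)   # high beta: a small value edge is exploited
```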
303 Importantly, to test for differences between self-relevant and prosocial learning, we also considered
304 models with condition-specific parameters (see Lockwood et al., 2016, for a similar approach). Thus,
305 we also estimated models in which different inverse temperature parameters β_Self and β_Other
306 governed participants' choice behavior during self-relevant and prosocial learning, respectively.
307 Differences here would indicate that a participant showed differences in value sensitivity between
308 the two conditions, i.e., for self-relevant and prosocial learning.
309 Models for outcome evaluation
310 Following the RL modeling framework, we assumed that participants updated the subjective value of
311 the chosen symbol based on the received outcome. Here, we will refer to outcomes that indicate the
312 delivery of non-painful stimulation as positive outcomes, and to outcomes that indicate painful
313 stimulation as negative outcomes. According to the simplest RL model (Sutton and Barto, 1998;
314 Hackel and Amodio, 2018), subjective values are updated according to the formula
V_A,t+1 = V_A,t + α(R_t − V_A,t),
315 where V_A,t is the value of the chosen symbol "A" in trial t, V_A,t+1 is the updated value, and α is the
316 free learning rate parameter, which is bound in the interval [0,1]. R_t is the number encoding the
317 outcome of trial t, here defined as 1 for positive outcomes/”no pain”, and 0 for negative
318 outcomes/”pain”. We chose this coding scheme to make our algorithm more comparable to those
319 used in other studies using the RL model (e.g., Lockwood et al., 2016). But note that this coding
320 scheme is mathematically equivalent to that used in other aversive learning paradigms where pain
321 outcomes are instead coded as -1 and avoidance of pain as 0 (e.g., Seymour et al., 2004; Roy et al.,
322 2014; for a mathematical proof of this statement, see https://osf.io/h9txe/). The term R_t − V_A,t is
323 usually referred to as the prediction error of trial t. The free parameter α controls the degree to
324 which the value of the chosen symbol is updated by the prediction error. For α close to 0, there is
325 nearly no updating. For α close to 1, the subjective value is strongly updated by the last outcome.
326 Note though that the learning rate α cannot directly be interpreted as a measure of how fast
327 participants understand which of the two symbols is better (see also Zhang et al., 2020). Rather, it
328 describes how strongly subjective values are influenced by the last outcome, regardless of the
329 outcomes that came before. V_A,1 and V_B,1 were both initialized to 0.5, reflecting the assumption that
330 participants were equally likely to prefer either symbol on the first trial, and that they had no prior
331 information about symbol-outcome contingencies.
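The update rule, and the claimed equivalence of the two outcome codings, can be checked numerically. This is our own toy demonstration (outcome sequence invented): recoding outcomes from {1 = "no pain", 0 = "pain"} to {0, −1} shifts every value by −1 but leaves value differences, and hence choice probabilities, unchanged:

```python
# Rescorla-Wagner update plus a numerical check of the coding-equivalence claim.
def rw_update(V, outcome, alpha):
    return V + alpha * (outcome - V)   # V_{t+1} = V_t + alpha * (R_t - V_t)

alpha = 0.3
V_pos, V_neg = 0.5, -0.5               # initial values under the two codings
for R in [1, 0, 1, 1, 0]:              # toy outcome sequence
    V_pos = rw_update(V_pos, R, alpha)       # outcomes coded {1, 0}
    V_neg = rw_update(V_neg, R - 1, alpha)   # outcomes coded {0, -1}
# V_pos - V_neg remains 1 throughout, so value differences are identical
```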
332 Participants might have responded differently to positive outcomes (i.e., non-painful stimulation)
333 than to negative outcomes (i.e., painful stimulation). We therefore also considered an extension of
334 the simple RL model which allowed for outcome-specific learning rate parameters (Den Ouden et al.,
335 2013). In this model, participants weighted positive prediction errors with the learning rate α_pos, and
336 negative prediction errors with the learning rate α_neg. Crucially to our research question, we also
337 estimated models in which condition-specific learning rate parameters α_Self and α_Other controlled
338 the rate with which participants updated the subjective values during self-relevant and prosocial
339 learning, respectively. If we found beforehand that outcome-specific learning rate parameters are of