Experimental Protocol, Design, and Timeline: Destructive Behavior, Judgment, and Economic Decision-making under Thermal Stress

Ingvild Almås, Maximilian Auffhammer, Tessa Bold, Ian Bolliger, Aluma Dembo, Solomon Hsiang, Shuhei Kitamura, Edward Miguel, and Robert Pickmans

Not for publication

Contents

Appendices

H Lab environment
  H.1 General description
  H.2 Temperature
  H.3 Relative humidity
  H.4 Carbon dioxide
  H.5 Sound and lighting

I Lab Protocol and Specifications
  I.1 Economic preferences
    I.1.1 Risk
    I.1.2 Time
  I.2 Social behavior
    I.2.1 Fairness (Real Effort Dictator Game)
    I.2.2 Public contribution (Public Goods Game)
    I.2.3 Trust
    I.2.4 Joy of Destruction
    I.2.5 Charitable donation
  I.3 Affect
  I.4 Cognition
    I.4.1 Precision task
    I.4.2 Fluid Intelligence
    I.4.3 Cognitive reflection
  I.5 Demographic survey questions
  I.6 Debriefing questions from the post-experiment survey

J Experimental timeline

K Statistical power and MDEs

H Lab environment

H.1 General description

To ensure that our experiment isolated the temperature treatment effect, we measured and controlled for several other environmental characteristics that could potentially induce an effect on an individual's performance in one or more of the experimental modules. The field of building science has conducted extensive research on the effects of the indoor environment on cognitive performance and workplace productivity. Because the effects on interpersonal dynamics, generosity, and the other measures we test in this study are less well known, we use the cognitive performance literature to guide our selection of environmental characteristics to measure and control in our experiment. In addition to air temperature, this list includes operative temperature, relative humidity, carbon dioxide, lighting, and background noise. A thorough summary of the effects of the indoor environment on productivity can be found in Wargocki et al. (2006).

For each environmental characteristic, the following sub-sections briefly discuss the reasoning behind its inclusion and summarize the measurements we obtained across sites and condition groups. The sensors used for measuring each of the characteristics are summarized in Experimental Protocol, Design, and Timeline (EPDT) Table H.1.1. In California, temperature control in the treatment room consisted of a combination of two Tangkula Electric Oil Filled Radiator Heaters (1500 W) and a Sanyo KS0951/C0951 Split System Air Conditioning unit set to fan mode only (for air circulation). The thermostat setting on the radiators was tuned during the pilot study such that the measured air temperature was as close as possible to 30°C when the room was filled with participants. In the control room, an identical A/C unit was set such that the measured air temperature was as close as possible to 22°C when the room was filled with participants. In Nairobi, the temperature control equipment consisted of a combined heating and A/C wall unit in both the treatment and control rooms, plus an additional space heater in the treatment room, needed to reach our target temperature of 30°C. All sensors and control equipment, with the exception of the operative temperature sensor (see EPDT Section H.2), were hidden from the participants' view. Summaries of the additional environmental metrics we measured are displayed in EPDT Table H.1.2.

H.2 Temperature

Temperature was the treatment we isolated, so the reasoning for its inclusion in the measurement and control suite is clear. To measure temperature, we installed eight sensors in each room. Two were installed at opposite ends of the room to measure average room temperature. The temperature values reported in the main text for this study were obtained by averaging the values reported from these two loggers. The remaining six were installed underneath the desks of the study participants to measure individual-level air temperature deviations across each room. During the pilot study, it was determined that these sensors were affected by the body heat of the participants. Thus, it proved difficult to use these data to quantify within-room temperature variation across participants.

While air temperature was our primary treatment, evidence suggests that indoor thermal comfort is driven by operative temperature (Comité Européen de Normalisation, 2007; International Organization for Standardization, 2005; Huizenga et al., 2006).

Table H.1.1: The sensing equipment used in both the California and Nairobi studies

Metric                        Sensor
Indoor Air Temperature/RH     HOBO UX100-003 Temperature/Relative Humidity 3.5% Data Logger
Outdoor Temperature/RH        HOBO MX2302 External Temperature/RH Sensor Data Logger
Indoor Operative Temperature  Custom instrument based on Dallas DS18B20 Temperature Sensor
CO2                           HOBO MX1102 CO2 Logger
Background Noise              RISEPRO Digital Sound Level Meter 30
Lighting                      Minolta Illuminance Meter T-1H handheld / Dr. Meter LX1330B Digital Illuminance/Light Meter (1)

(1) Due to sensor availability and cost, the Minolta sensor was used in California and the Dr. Meter sensor in Nairobi.

Operative temperature (Top) is practically defined as the average of air temperature and mean radiant temperature, and was measured using a Dallas DS18B20 Temperature Sensor placed inside a ping pong ball coated in grey paint. This approximates the size, shape, and color of the ideal Top sensor outlined in Simone et al. (2007). The sensors were hung close to the middle of each room, such that the surface of the ping pong ball was exposed to as many of the room's surfaces as possible. While the ball was visible to participants, it did not obviously look like a temperature sensor, so we did not believe it would expose the nature of the experiment to participants. To validate the calibration of the Top sensors across experimental rooms, they were placed in the same room and left for 60 minutes. The readings obtained from the final 10 minutes were analyzed and found to lie within 0.2 degrees of each other, with no discernible bias.

Operative temperature was measured for a subset of sessions in both California and Nairobi. The comparison between Top and T (where T represents air temperature) is shown in EPDT Figure H.2.1. The distributions of Top are generally similar to those of T, indicating that our air temperature treatment mechanism affected indoor thermal comfort (proxied by operative temperature) as intended. Notably, slightly wider distributions are observed in Top than in T, especially in Nairobi, and include many potential outlier measurements. This indicates that the measurements for Top were noisier than those of T and/or that some deviations between air and mean radiant temperature occurred during the experiment, creating some noise in the relationship between air temperature and perceived thermal comfort. However, the mean differences between control and treatment groups are similar within a given site across both Top and T, with a slightly stronger treatment effect observed for Top in Nairobi. One will also note in this figure the slightly higher treatment temperatures observed in Nairobi across both T and Top, as compared with California. This systematic offset was a function of the different lab settings and equipment.
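In symbols, the working definition above is simply the equal-weight average used at low air speeds (our notation for reference; this is not a formula taken from the instrument's documentation):

\[ T_{op} \;\approx\; \frac{T_{air} + T_{mrt}}{2}, \]

where \(T_{air}\) denotes air temperature and \(T_{mrt}\) the mean radiant temperature of the surrounding surfaces.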

Table H.1.2: Measurements of additional environmental metrics in the treatment and control rooms in both California and Nairobi

                              Nairobi                          California
Variable                Control    Heat      n    diff   Control    Heat      n    diff
Top (°C)                22.1       33.3      18   **     22.4       29.9      30   **
                        (1.6)      (4.8)                 (0.8)      (1.1)
CO2 (ppm)               2036.8     903.3     18   **     817.6      867.3     6
                        (330.1)    (134.7)               (87.0)     (90.1)
Lighting (lux)          76.7       80.4      6           466.8      410.8     6
                        (13.4)     (18.7)                (171.3)    (224.9)
Background noise (dBa)  48.4       55.5      6    **     44.0       47.0      6
                        (3.5)      (1.7)                 N/A        N/A

Note: Top and CO2 measurements were collected during a subset of sessions, while lighting and background noise measurements were collected only once. Standard deviations displayed for Top and CO2 refer to variance in the average value collected for a given metric during a given session, across sessions. Standard deviations displayed for lighting and background noise refer to variance in measurements collected at each of the six participant workstations, across workstations. In California, background noise was only collected in the center of the room, so no standard deviations are reported.

Outdoor temperature (and relative humidity) measurements were also obtained at both sites using a HOBO MX2302 External Temperature/RH Sensor Data Logger. These measurements allowed us to characterize the climate at each site. Outdoor temperature and relative humidity measurements are depicted in Appendix Figures A.2.1 and A.2.2.

Figure H.2.1: Distribution of 5-min-interval air temperature and operative temperature during experiment sessions

Note: (A) shows the measured temperatures in the control room, (B) shows those measured in the treatment room, and (C) shows the distribution of instantaneous differences across rooms. For each distribution, the median is indicated by a white dot, the IQR by a thick black line, and the limits of points classified as non-outliers by a thin black line. This figure only includes data from the subset of sessions for which we had a Top sensor installed.

H.3 Relative humidity

Thermal comfort is also driven by relative humidity (RH). Low RH can cause irritation to mucous membranes in the eyes and nose, reducing cognitive performance, especially in visual tasks. Meanwhile, at high temperatures, high relative humidity can inhibit the body's ability to displace heat through sweat and can decrease perceived indoor air quality (IAQ) (Fang et al., 2004; Wolkoff and Kjærgaard, 2007). RH is a measure of actual water vapor pressure in the air, relative to the saturated vapor pressure. Since warm air has a higher saturated vapor pressure, heating indoor air is associated with a decrease in RH. Due to this phenomenon, we see higher RH levels in our control group than in our treatment group (EPDT Figure H.3.1). Humidity levels are also generally higher in Nairobi than in California, which is consistent with ambient outdoor RH trends (Appendix Figures A.2.1 and A.2.2).

The literature on the impacts of indoor relative humidity on cognitive performance does not conclusively identify clear thresholds beyond which measurable effects occur (Wolkoff, 2018), yet the range experienced by our participants certainly does not exceed any proposed upper bound of a nominal range. In the treatment rooms, particularly in California, we do experience low RH levels that occasionally fall below the sensor's measurement threshold of 15%. The majority of cognitive and behavioral effects of low RH are assumed to occur through dryness (Lagercrantz et al., 2003), which can affect performance in tasks where

blink rate matters (such as our cognitive performance modules) (Wyon et al., 2006), or could potentially influence an individual's affect; however, effects of low RH on blink rate and perceived IAQ are typically found to be small, especially over short exposure times (Lagercrantz et al., 2003). Without firm thresholds to guide an estimate of the treatment effect of the observed RH differences, we use the lack of an observed treatment effect in the cognitive modules as an indication that there is likely little impact of low RH on module performance. Future studies could employ active humidification in the treatment room to maintain a constant RH across control and treatment groups. Outdoor RH was also measured for inclusion in exploratory analyses as a potential effect modifier. See Appendix Figures A.2.1 and A.2.2 for a depiction of the collected data.
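To see the mechanical size of this effect, recall that RH is the ratio of actual to saturation vapor pressure, and saturation vapor pressure rises with temperature. A minimal sketch using the standard Magnus approximation (our illustration, not part of the study's analysis code):

```python
import math

def saturation_vapor_pressure(temp_c):
    """Magnus approximation to saturation vapor pressure, in hPa."""
    return 6.1094 * math.exp(17.625 * temp_c / (temp_c + 243.04))

def rh_after_heating(rh_initial, t_initial_c, t_final_c):
    """RH after heating air while holding its water-vapor content fixed."""
    e_actual = (rh_initial / 100.0) * saturation_vapor_pressure(t_initial_c)
    return 100.0 * e_actual / saturation_vapor_pressure(t_final_c)

# Heating 22 C control-room air at 50% RH to the 30 C treatment target:
print(round(rh_after_heating(50.0, 22.0, 30.0), 1))  # ~31.1% RH
```

This is why the treatment rooms mechanically exhibit lower RH than the control rooms, absent active humidification.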

Figure H.3.1: Distribution of 1-min-interval indoor RH measurements during experiment sessions

Note: (A) shows the measured RH values in the control room, (B) shows those measured in the treatment room, and (C) shows the distribution of instantaneous differences across rooms. For each distribution, the median is indicated by a white dot, the IQR by a thick black line, and the limits of points classified as non-outliers by a thin black line.

H.4 Carbon dioxide

Carbon dioxide (CO2) can also influence comfort and performance in indoor environments. Two primary theories exist about the mechanism for impact. The first suggests that there is a direct effect of CO2 (Satish et al., 2012), but this is a controversial and disputed finding. The more strongly supported theory is that CO2 is highly correlated with ventilation rates, and that indoor pollutant concentrations increase at low ventilation rates (Zhang et al., 2017; American Society of Heating, Refrigerating and Air-Conditioning Engineers, 1968). Regardless, high CO2 levels have been shown to

be correlated with Perceived Air Quality (PAQ), affect, and performance (Allen et al., 2015; Zhang et al., 2017). To assess any differential ventilation rates across rooms, we measured CO2 in each room (control and treatment) across both sites for a subset of the sessions. The distributions of CO2 measurements are shown in EPDT Figure H.4.1. In California, CO2 levels were relatively constant across the two rooms; in Nairobi, we observed larger differences that are driven by a higher mean (and variance) of CO2 concentrations in the control room. This is likely the result of a slightly smaller and stuffier room being used as the control in Nairobi. As expected, CO2 levels increase over the duration of each session due to the presence of multiple participants in a small experiment room (not shown). It is expected that low ventilation rates in the control room would contribute negatively to affect and performance, thus reducing any true treatment effect of temperature rather than increasing it. For this reason, the difference in CO2 concentrations in Nairobi rooms should not confound our significant findings; nevertheless, future studies would do well to confirm this hypothesis through strict control of CO2 concentrations across treatment and control rooms.

Figure H.4.1: Distribution of 5-min-interval CO2 measurements during experiment sessions

Note: (A) shows the measured CO2 values in the control room, (B) shows those measured in the treatment room, and (C) shows the distribution of instantaneous differences across rooms. For each distribution, the median is indicated by a white dot, the IQR by a thick black line, and the limits of points classified as non-outliers by a thin black line.

H.5 Sound and lighting

Background noise (Ryherd and Wang, 2008; Hygge and Knez, 2001; Kim and de Dear, 2013) and ambient lighting (Baron et al., 1992; Hygge and Knez, 2001; Hoffmann et al., 2008)

have also been shown to be associated with performance, interpersonal behavior, and/or subjective perceptions of comfort. We assume these metrics to be constant across sessions, so measurements were only obtained once. For each metric, measurements were taken immediately prior to a session and were obtained twice at each workstation, for a total of 12 measurements per room. The means and standard deviations of these measurements are shown in EPDT Table H.1.2.

Lighting levels show a difference across sites, yet do not significantly vary across treatment arms within a site and thus are not expected to bias our estimated treatment effect. In addition, the fact that individuals interact with the experiment modules through a backlit computer screen mitigates the vision impacts of low lighting – one major pathway through which lighting levels can affect productivity. In California, low noise levels are observed in both rooms, and a small 3 a-weighted decibel (dBa) difference across rooms is unlikely to have a measurable effect on performance, affect, or interpersonal behavior. In Nairobi, higher noise levels were observed, along with a greater difference in noise levels across condition groups. It is unclear whether this 7 dBa difference across control and treatment rooms (48.4 vs. 55.5 dBa) would have a physically meaningful treatment effect on the participants, but it is a smaller difference than is typically employed in studies of the effect of sound levels (such as Hygge and Knez (2001)). The lack of an observed treatment effect on productivity measures is further evidence that the difference in sound levels across rooms did not have a meaningful impact on the participants. Because HVAC equipment often creates noise, it can be difficult to simultaneously create a temperature difference across rooms in a building while maintaining similar sound levels. Nevertheless, future studies would benefit from controlling for noise levels across conditions (perhaps through the installation of a white noise machine in any quieter rooms, though the frequency and quality of the noise have also been shown to have a productivity impact in addition to the volume (Ryherd and Wang, 2008)).

I Lab Protocol and Specifications

The experimental treatment consisted of temperature manipulation. In both California and Nairobi, the target temperature in the control room was set to 22°C, while the target temperature in the treatment room was set to 30°C. In incentivized modules, participants played over tokens, where the conversion rate was 3.5 tokens to 1 Kenyan shilling in Nairobi and 2.1 tokens to 1 cent in California.2 Participants were drawn from the University of California, Berkeley for the California sample and the University of Nairobi for the Nairobi sample.3 While the results of the study are subject to the use of university students, who may not necessarily be representative of the population as a whole (Sears, 1986), the use of student subjects does not intrinsically pose a problem for the study's external validity (Druckman and Kam, 2011).

For each session, we aimed to recruit 12 participants, who were randomly assigned to either the control or treatment group (we refer to each such group as a session-condition).4 In Nairobi, once the 12 participants were seated in the waiting room, we took two different colored sets of place cards numbered 1 to 6, put all the cards into a big envelope, and shuffled them. A card was randomly pulled out and given to each participant in the order in which they were seated. The color of the place card one received determined whether the participant was in the control or the treatment group. The participants in the control group were assigned a seat in the control room, whereas those in the treatment group were assigned a seat in the treatment room. Meanwhile, in California, randomized sequences of numbers were generated for every session and used to assign participants who sat in a waiting area to stations in the control and treatment rooms.

To better ensure sufficient exposure to treatment for all modules, participants at each location were made to wait 20 minutes before beginning the experimental modules. During this time, they were asked to sit quietly at their seats and received instructions about the experiment. Each participant's desk was prepared with a water bottle, a notepad for writing, and a pen. Photos of the experimental setup are included below.

2 For context, 1 US dollar is equal to about 100 Kenyan shillings. The exchange rate was previously 1.75 tokens to 1 cent in California, but was changed early in the experimental timeline, on September 28th, 2017, both to keep payments at the original target of $25 an hour and to sustain recruitment. After all experiments were completed, average earnings in California were around $35, while average earnings in Nairobi were around 2000 Kenyan shillings.

3 Due to the University of Nairobi closing temporarily because of political events, some recruitment from Strathmore University did occur. Please see EPDT Section J for more discussion.

4 In practice, while we over-recruited by 50% to better ensure having 12 participants show up, we were not always able to secure 12 participants. We would still run a session with 11 participants, while canceling sessions where 10 or fewer showed up.
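As a minimal illustration of this assignment procedure (a sketch of the card-draw logic under our assumptions about the deck's composition; this is not the lab's actual randomization code):

```python
import random

def assign_rooms(participants, seats_per_room=6, seed=None):
    """Deal a shuffled 'deck' of (room, seat) place cards -- two colors,
    each numbered 1 to 6 -- to participants in waiting-room seating order."""
    deck = [(room, seat)
            for room in ("control", "treatment")
            for seat in range(1, seats_per_room + 1)]
    random.Random(seed).shuffle(deck)
    return dict(zip(participants, deck))

session = assign_rooms([f"participant_{i}" for i in range(1, 13)], seed=42)
print(session["participant_1"])  # e.g., ('control', 4) -- room color and seat number
```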

Figure I.1: Nairobi treatment room

Figure I.2: Nairobi control room

Figure I.3: California treatment room

Figure I.4: California control room

Note: this photo was taken during clean-up, after a session was conducted.

The modules of the experiment were ordered in the following way:

1. Precision task (Gill and Prowse, 2012).

2. Fairness (Almås et al., 2010).

3. Risk (Eckel and Grossman, 2008).

4. Time (for Patience and Time inconsistency) (Andreoni et al., 2015).

5. Trust (Johnson and Mislin, 2011).

6. Public contribution (Fischbacher et al., 2001).

7. Fluid intelligence (Penrose and Raven, 1936).

8. Joy of Destruction (Abbink and Sadrieh, 2009).

9. Cognitive reflection (Frederick, 2005).

10. Short survey, including affect questions for happiness and alertness (Russell and Barrett, 1999).

11. Charitable donation (Andreoni, 2006).

Below, we first describe the modules that measure economic preferences, that is, Risk (for Risk-taking and Rational choice violation I) and Time (for Patience and Time inconsistency). Second, we describe the modules that elicit social behavior, which include Fairness (a real effort dictator game), Trust (a trust game), Public contribution (a public goods game), Destructive behavior (the Joy of Destruction game), and Charitable donation. Third, we describe the modules that measure affect, namely non-incentivized survey questions related to alertness and happiness. Last, we describe the modules that measure cognition, namely the Precision task (which serves to create endowments for the Fairness module), the Fluid intelligence task (which serves to create endowments for the Joy of Destruction), and the Cognitive reflection task (part of the non-incentivized survey held towards the end of the experiment). Outcomes from the modules used in analyses are featured in italics in parentheses below.

I.1 Economic preferences

I.1.1 Risk

The Risk module elicits risk preferences using choice over lotteries with equal probability. There are two menus to choose from, each tracing out a different budget line: menu A has a slope of -2, per Eckel and Grossman (2008), while menu B has a slope of -1 (and thus has the same expected payout along the entire budget). The intercepts are 2880 tokens and 2160 tokens, respectively. For both menus A and B we include the risk-neutral point (H = 0), the risk-averse choice (H = T), one choice below the 45-degree line (H > T), and the intersection point of menus A and B. Of the remaining seven points, three fall above the intersection and

three fall below the intersection. The point below the 45-degree line has the same variance as the point between the 45-degree line and the intersection of the two lines.

The purpose of the risk module is twofold. First, the module enables us to look at the effect of temperature on risk preferences. One primary outcome variable used for this purpose is Risk-taking, defined as the variance of the coin toss from menu A, in tokens.5 Second, the module allows us to identify transitivity violations (by the indicator of transitivity violation using both menus A and B) and first order stochastic dominance violations (by the indicator of choice of coin 7 in menu A). Risk-taking (risktaking) and transitivity violation (transitivity, renamed as Rational choice violation I in the experiment) are analyzed as primary outcomes of interest. Meanwhile, the outcome capturing first order stochastic dominance violation (FOSDviolation), which is renamed as Rational choice violation II, is analyzed for exploratory analysis. The spoken instructions for menu A were the following:

“In this task, you make a choice between different amounts of tokens that you can receive based on a toss of a coin. You have the choice between seven coins, each with different payouts for heads and tails. All coins have an equal chance of landing on heads or on tails. Once you have chosen, the computer will toss a virtual coin representing the coin you have chosen. Your earnings from this part of the experiment depend on your choice and the result of the coin toss. Therefore, it is in your interest to carefully pick the option you truly prefer. These are the 7 different coins from which you will choose one:

• Coin 1: 0 tokens if heads and 2880 tokens if tails

• Coin 2: 240 tokens if heads and 2400 tokens if tails

• Coin 3: 480 tokens if heads and 1920 tokens if tails

• Coin 4: 720 tokens if heads and 1440 tokens if tails

• Coin 5: 840 tokens if heads and 1200 tokens if tails

• Coin 6: 960 tokens if heads and 960 tokens if tails

• Coin 7: 1080 tokens if heads and 720 tokens if tails

Remember, there is no correct answer; what we are interested in is your personal preference. Please press “Next” when you are ready to proceed, and please wait afterwards.”

5 Although the pre-analysis plan specified that the risk measure outcome would be a categorical variable indicating coin choice from menu A, both interpretation and accounting for multiple hypothesis testing motivated the primary outcome for risk to be defined as above. In Supplement Section G.2 we also investigate the effect of the temperature treatment on risk where 1) we assume a mean-variance utility function and use the backed-out lambda parameter as the outcome variable with OLS, and 2) we use a categorical indicator of choice with ordered logit and ordered probit estimation.

Figure I.1.1.1: Risk module budget A

The spoken instructions for menu B were the following:

“You will now make a choice among a different set of 7 coins. All coins have an equal chance of landing on heads or on tails. The computer will toss another (different) virtual coin. Your earnings from this part of the experiment depend on your choice and the result of the coin toss. Therefore, it is in your interest to carefully pick the option you truly prefer. These are:

• Coin 1: 0 tokens if heads and 2160 tokens if tails

• Coin 2: 240 tokens if heads and 1920 tokens if tails

• Coin 3: 480 tokens if heads and 1680 tokens if tails

• Coin 4: 720 tokens if heads and 1440 tokens if tails

• Coin 5: 960 tokens if heads and 1200 tokens if tails

• Coin 6: 1080 tokens if heads and 1080 tokens if tails

• Coin 7: 1200 tokens if heads and 960 tokens if tails

Again, remember, there is no correct answer; we are interested in your personal preference. Please press “Next” when you are ready to proceed.”

Figure I.1.1.2: Risk module budget B
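To make the two budgets concrete, the sketch below tabulates the expected value and variance of each 50/50 coin (our illustration, not the experimental code; the variance of the chosen menu A coin underlies the Risk-taking outcome):

```python
# (heads, tails) payoffs in tokens for each coin
menu_a = [(0, 2880), (240, 2400), (480, 1920), (720, 1440),
          (840, 1200), (960, 960), (1080, 720)]
menu_b = [(0, 2160), (240, 1920), (480, 1680), (720, 1440),
          (960, 1200), (1080, 1080), (1200, 960)]

def coin_stats(heads, tails):
    """Mean and variance of a fair coin paying `heads` or `tails`."""
    mean = (heads + tails) / 2
    variance = ((tails - heads) / 2) ** 2
    return mean, variance

for label, menu in (("A", menu_a), ("B", menu_b)):
    for i, (h, t) in enumerate(menu, start=1):
        mean, var = coin_stats(h, t)
        print(f"Menu {label}, coin {i}: EV = {mean:6.0f}, Var = {var:9.0f}")
```

On menu A, expected value falls as variance falls (the slope of -2), while every menu B coin has the same expected value of 1080 tokens; coin 7 on menu A is first-order stochastically dominated by coin 5, which is what the Rational choice violation II indicator exploits.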

I.1.2 Time

We use the traditional protocol for eliciting so-called "beta-delta" preferences, namely a convex time budget (CTB) design following Andreoni et al. (2015). We do this for the overall population by treatment, as well as by site and gender. Each participant was shown two budget lines for today vs. 3 weeks and two budget lines for 3 vs. 7 weeks. We use the structural estimation procedure to obtain aggregate estimates of the discounting (δ) and time inconsistency (β) parameters, corresponding to Patience and Time inconsistency, respectively.
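For reference, a standard quasi-hyperbolic specification underlying such CTB estimates (a sketch in the spirit of Andreoni et al. (2015); the curvature parameter \(\alpha\) here is illustrative and is not itself a reported outcome) is

\[ U(c_t, c_{t+k}) \;=\; c_t^{\alpha} \;+\; \beta^{\mathbf{1}\{t=0\}}\, \delta^{k}\, c_{t+k}^{\alpha}, \]

where each budget line is split between a sooner payment \(c_t\) and a later payment \(c_{t+k}\), \(\delta\) is the per-period discount factor underlying Patience, and \(\beta\) captures present bias (Time inconsistency), entering only when the sooner payment date is today (\(t = 0\)).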

The purpose of this module is again twofold. First, we test the effect of temperature on Patience and Time inconsistency. Second, for exploratory analysis, we test the effect of temperature on the tendency to violate transitivity (a Generalized Axiom of Revealed Preference violation, renamed as Rational choice violation III in the experiment), as well as on the individual rank of choices in menus A through D. The primary outcome variables of interest are the aggregate estimates for discounting (δ) and time inconsistency (β). The spoken instructions were the following:

“On the following screens you will find a series of four questions. In each question, you are asked to make decisions involving earnings over time. Each row features a series of options, consisting of a sooner payment and a later payment. There is a trade-off between the sooner payment and the later payment across the options – as the sooner payment goes down, the later payment goes up. Please choose the option that you prefer. There are no right or wrong answers. After you make your choices for all four questions, the computer will randomly select one of the questions. Your earnings from this part of the experiment are whatever option you choose for the question selected by the computer. Recall that tokens will be converted to actual currency after the experiment. Tokens will be converted to actual currency at the same conversion rate, irrespective of when payments are made. Remember that the computer will pick only one question and any question could be picked. It is in your interest to carefully answer each question, and in each question it is in your best interest to pick the option you truly prefer. If everything is clear, please press “Next”.”

Figure I.1.2.1: Time module instruction

Figure I.1.2.2: Time module menus

I.2 Social behavior

I.2.1 Fairness (Real Effort Dictator Game)

Earnings from the Precision task are used to establish clear entitlements for the real effort dictator game. Within the latter, we match participants with equal productivity (either high or low) and give them the information about what each of them has earned in the Precision task, i.e., we give them a clear suggestion that the fair outcome is an equal split. Participants are matched in pairs and asked how much of the joint earnings (2400 tokens in the high group, 1200 tokens in the low group) they want to transfer to the other participant. All participants act as dictators, and they know that there is a fifty percent chance that their decision will be implemented (see, e.g., Almås et al., 2010). The purpose of this module is to study whether temperature affects pro-social behavior. The primary outcome variable is the share of the total amount of tokens that one starts with that is allocated to the other participant (fairness). The spoken instructions were as follows:

“You and another participant are now paired. You have both worked on the slider task and each earned the same amount based on your performance. Together, the two of you have earned twice as much. You will now decide how much of the total you suggest for yourself and how much you suggest for the other participant. You are free to choose any amount of tokens. The other participant in your pair will also decide how to distribute the earnings between the two of you. The computer will randomly choose which one of the decisions, either yours or the other participant's, to implement. There is an equal chance that your choice is implemented and that the other participant's choice is implemented. Earnings are based on the implemented choice. Please note that the other participant will not be able to know that you are paired with them, nor will you know which other participant is paired with you. Your choice is anonymous. Press “Next” when you are ready to continue.”

Figure I.2.1.1: Real effort dictator game

I.2.2 Public contribution (Public Goods Game)

The public goods module consisted of a standard public goods game with 3 players and a multiplier of 2 (see Fischbacher et al., 2001, who use a 4-player public goods game with a multiplier of 1.6). The purpose is to elicit how prosocial behavior – in particular, cooperation – may be affected by temperature. The primary outcome variable of interest is cooperation, measured by the amount of tokens put into the fund (cooperation). As an exploratory outcome, we also measure one's correct guesses about others' contributions to the fund (beliefs), measured through an indicator variable for whether the individual guesses correctly about another's contribution within their group. The spoken instructions are as follows:

“You are now randomly matched with two other participants. You and the other participants are each endowed with 1200 tokens, as indicated on your screens. You must decide how much of your endowment to put into a shared fund. At the same time that you make your choice, the other two participants will simultaneously decide how many tokens to put into the shared fund. Each token put into the fund is multiplied by 2. The shared fund is then split equally between the three of you. Each token you choose not to put into the fund is yours to keep. Your earnings from this part of the experiment depend on your choice and the choices of the two other participants in the experiment.

We will now give you a few examples: If you put 1200 tokens, and both of the other participants put 1200 tokens each, there is a total of 3600 tokens in the shared fund. This amount is multiplied by two to get 7200 tokens, and the fund is shared equally among you, so you each get back 2400 tokens. If, on the other hand, no one puts anything into the shared fund, you each keep the original endowment of 1200 tokens. And if one of you puts in 1200 tokens and the two others do not put anything, then the shared fund will be 2400 tokens. Each of you gets 800 tokens back from the fund. The person who put the tokens in ends up with 800 tokens and the other two end up with 2000 tokens. Let's do another example to make sure everyone understands the instructions. Say you put 600 tokens into the pot, one of the other participants puts in 300 tokens, and the third participant puts 600 tokens into the pot; how many tokens would you get in the end? [Calculate example for them: each would get back 1000 tokens, you would have your 600 tokens that you kept plus the 1000 tokens, so you end up with 1600 tokens]. Please note that the other participants will not be able to know that you are paired with them, nor will you know which other participants are paired with you. Your choice is anonymous. Press “Next” when you are ready to continue.

Now that you have selected how much you put into the shared fund, we would like to know what you believe to be the amount that each of the other players put in the fund. Depending on how accurate your guess is, you have an opportunity to earn tokens. If you guess correctly, you will receive an extra 175 tokens.”
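The payoff rule behind these quoted examples is each player's kept tokens plus an equal share of the doubled fund. A quick sketch reproducing the spoken examples (our illustration, not the experimental code):

```python
def pg_payoffs(contributions, endowment=1200, multiplier=2):
    """Tokens kept plus an equal share of the multiplied shared fund."""
    share = multiplier * sum(contributions) / len(contributions)
    return [endowment - c + share for c in contributions]

print(pg_payoffs([1200, 1200, 1200]))  # [2400.0, 2400.0, 2400.0]
print(pg_payoffs([1200, 0, 0]))        # [800.0, 2000.0, 2000.0]
print(pg_payoffs([600, 300, 600]))     # [1600.0, 1900.0, 1600.0]
```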

Figure I.2.2.1: Public goods module

I.2.3 Trust

In the trust module, participants are matched in pairs. They play the game twice, each time with a different partner. Participant A is given an initial amount of tokens (X; 600 tokens in practice) and the other participant, Participant B, is not given any endowment.6 Participant A decides how many tokens, Y, to pass on to Participant B. This amount is multiplied by 3. Participant B then decides how much, Z, of 3Y to send back to Participant A. Participant A's payment is X - Y + Z, and Participant B's payment is 3Y - Z (see Johnson and Mislin, 2011, for a similar design). The purpose is again twofold. First, we study how temperature affects the share of tokens sent to Participant B (often referred to as "trust"), and second, we

6 Note that everyone plays as Participant A before playing as Participant B, because we deliberately want to give priority to the measurement of sending behavior over "sending back" behavior.

study how temperature affects the share sent back (of the pool that Participant B started with) to Participant A (often referred to as "trustworthiness"). The main motivation for studying this behavior is that trust may be important for societies' economic and social performance (see, e.g., Knack and Keefer, 1997). The mechanisms behind the observed sending behavior may be several, including trust, efficiency, fairness, inequality, and self-interest, while the most plausible mechanisms for sending back include fairness and reciprocity. The purpose of this module is not to separately pin down the strength of these potential mechanisms (we are not equipped to do so), but rather to establish whether there is a change in sending behavior attributable to temperature. The share of the total amount that one begins the module with that is sent to the other participant is the primary outcome variable of interest (trust), and the share of tokens sent back to the paired participant in the second round (sharesentback) is an exploratory outcome variable of interest. The spoken instructions were the following:

[First Round] “In this part of the experiment, you are endowed with 600 tokens. In the next round another participant, “B”, will be matched with you. Participant B is not given anything. You can choose to send any amount, between 0 and 600 tokens, to Participant B. Any amount not sent to Participant B is yours to keep. Any amount you send to Participant B is multiplied by 3. In the second round, Participant B can choose to send back some or all of this amount to you. Any amount that is not sent back to you by Participant B is kept by Participant B.

We will now give you a few examples. Suppose you send Participant B your full endowment of 600 tokens. This amount is multiplied by 3 to make 1800 tokens. If Participant B chooses to send back 0 tokens, you end up with 0 tokens and Participant B has 1800 tokens. If instead Participant B chooses to send back 600 tokens, you end up with 600 tokens and Participant B has 1200 tokens. If instead you send Participant B 0 tokens, then you will have kept 600 tokens and Participant B will have 0 tokens. Suppose instead that you send Participant B 500 tokens, some of your endowment; this amount is multiplied by 3 to make 1500 tokens. If Participant B chooses to send back 900 tokens, then you will have 900 tokens plus the 100 tokens left over from your endowment, for a total of 1000 tokens, and Participant B will have 600 tokens. Your earnings from this part of the experiment depend on the choice you make in this first round as well as the choice of one other Participant B in the next round. Please note that the other participant will not be able to know that you are paired with them, nor will you know which other participant is paired with you. Your choice is anonymous. Press “Next” when you are ready to proceed.

[After 1st round ends, read the following] Second round (Participant B): You are now randomly matched with another participant (not yourself) from the first round, Participant A. In the first round Participant A chose to send an amount of tokens that will be indicated on your screen. This amount has been multiplied by 3, also indicated on your screen. You can now choose to send back some of this multiplied amount to Participant A. Any amount not sent to Participant A will be yours to keep. Your earnings from this part of the experiment depend only on the choice you make in this round. Please note that the other participant will not be able to know that you are paired with them, nor will you know which other participant is paired with you. Your choice is anonymous. Press “Next” when you are ready to proceed.”
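The payoff arithmetic in the quoted examples follows directly from the X - Y + Z and 3Y - Z rules. A minimal sketch (ours, for illustration):

```python
def trust_payoffs(endowment, sent, returned, multiplier=3):
    """Participant A receives X - Y + Z; Participant B receives 3Y - Z."""
    assert 0 <= sent <= endowment and 0 <= returned <= multiplier * sent
    return endowment - sent + returned, multiplier * sent - returned

print(trust_payoffs(600, 600, 0))    # (0, 1800)
print(trust_payoffs(600, 600, 600))  # (600, 1200)
print(trust_payoffs(600, 500, 900))  # (1000, 600)
```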

Figure I.2.3.1: Trust module

I.2.4 Joy of Destruction

Participants were informed that everyone had earned different amounts (up to six) of $1 Amazon gift cards in California, or Airtime vouchers worth 50 Ksh each in Nairobi, from the Fluid intelligence task. They were then matched into anonymous pairs (with others in their assigned treatment room) and told that their partner had earned X. They could destroy any number between 0 and X. The computer could also destroy some of the remaining vouchers (after flipping a virtual coin: if the coin landed heads, it destroyed nothing; otherwise, it destroyed a random number of the remaining cards). The lab assistant destroyed the total number of cards implied by the computer's and the participant's choices. The other participant did not know whether the earnings were destroyed because of the computer or because of the other participant's decision (though some inference is possible). The idea here is that participants can partly hide their purposeful destruction behind random destruction. The game was

conducted with gift cards/vouchers so that actual destruction (and not reallocation) can take place. The purpose in this experiment is to measure whether willingness to destroy increases with temperature (see Abbink and Sadrieh, 2009). The primary outcome considered here is the measure of destruction, measured by the proportion of the partner's gift cards or vouchers destroyed by the participant (destroyed). The spoken instructions were the following:

“You are now randomly matched with another participant. You have both completed the puzzle task and earned a number of $1 Amazon gift cards. You may now choose that some of your partner's cards are to be destroyed. If you decide to do so, the lab assistants will later destroy the gift cards like this. Again, you are anonymous and the other participant cannot identify your choice. Please note that the other participant will not be able to know that you are paired with them, nor will you know which other participant is paired with you. Your choice is anonymous.

After your choice has been made, the computer will flip a virtual coin. If the coin is heads, the computer does not destroy anything. If it is tails, the computer will randomly choose whether to destroy some of the other participant's remaining gift cards. The other player will never be able to distinguish how many gift cards are destroyed because of your choice, and how many are destroyed because of the computer. Your choice does not affect your own earnings, only the earnings of the other participant. Here are some examples: [Take the appropriate number of sheets of paper provided and illustrate the physical destruction of a gift card whenever an example participant decides to destroy X number of gift cards.]

1. The other participant has 6 $1 Amazon gift cards. You decide not to destroy any of the gift cards, and the computer also randomly does not destroy any gift cards. Hence, we will not destroy any of the other participant's gift cards.

2. The other participant has 6 $1 Amazon gift cards. You decide not to destroy any of the gift cards, but the computer randomly decides to destroy 3 gift cards. Hence, we will destroy 3 of the other participant's gift cards.

3. The other participant has 6 $1 Amazon gift cards. You decide to destroy 2 of these and the computer randomly decides to destroy 1 of them. Hence, in total, we will destroy 3 of the other participant's gift cards.

4. The other participant has 6 $1 Amazon gift cards. You decide to destroy 3 of these and the computer randomly decides to destroy 0 of them. Hence, we will destroy 3 of the other participant's gift cards.

5. The other participant has 6 $1 Amazon gift cards. You decide to destroy 3 of the gift cards and the computer randomly decides to destroy 3 gift cards. Hence, we will destroy all of the other participant's gift cards.

The other participant you are paired with will also be given an opportunity to destroy some of your gift cards. At the conclusion of the experiment, the number of gift cards belonging to you that the other participant chose to destroy, in addition to the ones randomly destroyed by

the computer, will be destroyed. Your earnings from this part of the experiment depend only on the choice the other participant makes and the random choice of the computer. Please press “Next” when you are ready.”
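Since the spoken script only says the computer destroys "any number" of the remaining cards on tails, simulating the total destruction requires assuming a distribution; the sketch below assumes a uniform draw over the remaining cards (an assumption of ours, not a documented feature of the experimental software):

```python
import random

def total_destroyed(partner_cards, chosen_to_destroy, rng=random):
    """Participant's choice plus the computer's random destruction:
    nothing on heads, a uniform draw over the remaining cards on tails."""
    assert 0 <= chosen_to_destroy <= partner_cards
    remaining = partner_cards - chosen_to_destroy
    by_computer = 0 if rng.random() < 0.5 else rng.randint(0, remaining)
    return chosen_to_destroy + by_computer

print(total_destroyed(6, 2))  # 2 on heads; between 2 and 6 on tails
```

The coin stage is what lets participants partially hide purposeful destruction behind random destruction, as described above.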

Figure I.2.4.1: Joy of Destruction

I.2.5 Charitable donation

At the very end of the experiment, after the survey, participants were offered the opportunity to donate part of their earnings to a charity. Participants were randomly allocated to charities on a list (7 in Nairobi, 6 in California) and could donate a percentage (up to 40%) of their earnings. We wanted to know whether higher temperatures make participants more or less likely to give, and also, as a secondary outcome, whether they are more or less likely to reveal potential in-group biases based on ethnicity (in Kenya) or residency status (in California). Thus, the charities selected in Nairobi were associated with different ethnicities in Kenya, while the charities selected in California were either nationwide charities or were based in and served the San Francisco Bay Area. The primary outcome variable of interest is the absolute amount of tokens donated by the participant (donation).

Figure I.2.5.1: Charitable donation

I.3 Affect

For exploratory analysis gauging the effect of temperature on affect, we included two Likert-type scales. One scale asks "On a scale from 1-7, with 1 being sad and 7 being happy, how do you feel right now?" and the other asks "On a scale from 1-7, with 1 being tired and 7 being alert, how do you feel right now?" These scales appeared at the end of the experiment, between the trust scale and other demographic questions. The values coming from the former scale (happiness) and the latter scale (alertness) are included as exploratory outcomes.

Figure I.3.1: Happiness and alertness scales

I.4 Cognition

I.4.1 Precision task

The participants were asked to place a slider on an assigned number from 1 to 100 using the touchscreen in Nairobi (or a mouse in California). For each slider, the participant received one point if the number was correct, and 0 otherwise. The participant was to complete as many of these slider tasks as possible within three minutes. Final earnings from the Precision task reflect either "high" (weakly above the median) or "low" (below the median) precision, where the median was calculated within the treatment cohort and a participant exactly at the median was randomly assigned to either high or low. The purpose of this module is two-fold. First, it enables us to measure the effect that temperature has on precision, and second, it provides the necessary work effort to create real effort stakes in the dictator game. The primary outcome variable of interest is the total number of points earned (precision; absolute, not normalized). The spoken instructions were the following:

“In this part of the experiment, you earn tokens from conducting a sequence of tasks over 3 minutes. For each task, you are asked to move a slider to match an image shown to you. Try to correctly move the slider to match the goal, and then hit “Next”, as many times as you can, within 3 minutes. The more times you match the slider correctly, the more tokens you earn. Your earnings depend on your performance. If you have any questions, please raise your hand and I will come to you. We will now have a practice round, so you can become familiar with the task. This practice round consists of two slider tasks, performed sequentially. After everyone is finished with the practice round, please wait until instructed to move to the round with actual earnings, with slider tasks appearing sequentially. Please note that you do not earn any tokens from the practice round. When you are ready, press “Next”. [Practice Round]: The practice round is complete. We will now start the actual task. Please press “Next” to continue.”

Figure I.4.1.1: Precision task

I.4.2 Fluid Intelligence

To measure fluid intelligence, we use Raven's matrices. Participants were not told how many matrices they completed correctly in the module, and they were not given their payment following this module. The purpose is to test cognitive ability, which enables us first to identify the effect of temperature on mental acuity, and second to create destructible earnings for the Joy of Destruction task (see Penrose and Raven, 1936). The primary outcome variable of interest is the proportion of puzzles answered correctly (puzzles). The spoken instructions were the following:

“In this part of the experiment you will complete a series of tasks. For each task, you are shown an image with a piece missing. Identify the correct answer by selecting the missing piece that completes the image. You will complete two practice rounds to familiarize you with the type of task you will be completing in this part of the experiment. You will then be given 6 real tasks. Your earnings from this part of the experiment will be given in the form of $1 Amazon gift cards. Your earnings depend on the number of tasks (out of six possible) that you complete correctly and on a decision that another player makes in the next round. Please press “Next” when you are ready.”

Figure I.4.2.1: Raven’s matrices instruction

Figure I.4.2.2: Raven’s matrices practice rounds

I.4.3 Cognitive reflection

As part of the non-incentivized survey, we used a cognitive reflection test (CRT) consisting of five survey questions to elicit potential treatment effects on cognitive reflection. The five survey questions are all standard versions of such questions (see, e.g., Frederick, 2005). The purpose is to study whether temperature affects cognitive reflection. We study whether there is a treatment effect on the CRT score, and whether such a potential effect is driven by an increase in general noise or by a decrease in the probability of overriding the intuitive answer. The primary outcome variable of interest is the share of questions answered correctly (sharecorrect). Exploratory outcome variables of interest include time spent answering all five questions (timespentoverall)7 and the probability that an incorrect question was given an intuitive answer (answerintuitive, where the latter was to be measured

7 The maximum amount of time that participants were allowed on the page was 3 minutes.

only if there was a difference in the share of questions answered correctly).8 As the CRT was given as part of the non-incentivized survey that follows the experiment (after asking about father's and mother's education), there were no separate spoken instructions for the CRT.

Figure I.4.3.1: Cognitive reflection test

I.5 Demographic survey questions

The demographic survey included non-incentivized questions on trust, body mass, demographics, and parental education. Several of the variables formed from these questions are included in regression specifications in the case of imbalance across treatment arms, as pre-specified. The questions that form part of this survey include:

• Trust measure (Dohmen et al., 2011) (also used as an exploratory outcome of interest)

• Height & weight

• Age

• Gender

• Ethnicity/in-state status

8 Given that there was no statistically significant difference found, this latter test was not carried out.

• Parental employment, income, and education

Figure I.5.1: Demographic survey – trust measure

Figure I.5.2: Demographic survey – participant information (with question 5 in California)

Figure I.5.3: Demographic survey question 5 (in Nairobi)

Figure I.5.4: Demographic survey – parental information

I.6 Debriefing questions from the post-experiment survey

At the end of the experiment came a set of debriefing questions from the post-experiment survey, asking participants about their perceptions of the laboratory environment, as well as their thoughts as to what the experiment was studying. The overarching purpose of this set of questions was to gauge whether the perception of room temperature as an object of study differed between treatment and control; other questions about the environment served mostly to conceal that aim.9 Note that the debriefing questions were instituted roughly midway into the experiment, on November 13th, 2017 in California and November 29th, 2017 in Nairobi.

Figure I.6.1: Debriefing questions from the post-experiment survey (1 of 5)

9 However, the additional questions also broadly serve as checks that the laboratory environment was similar between treatment and control. Please see Appendix Table C.3 for the full set of responses to the debriefing questions.

Figure I.6.2: Debriefing questions from the post-experiment survey (2 of 5)

Figure I.6.3: Debriefing questions from the post-experiment survey (3 of 5)

Figure I.6.4: Debriefing questions from the post-experiment survey (4 of 5)

Figure I.6.5: Debriefing questions from the post-experiment survey (5 of 5)

J Experimental timeline

As mentioned in the pre-analysis plan, earlier versions of the experiment (the "pre-pilots") were held in November 2016 and January 2017 at the Busara Center in Nairobi, Kenya. These pre-pilots were carried out largely to resolve the design issues surrounding the use of specific temperature points in a laboratory setting with multiple people, as well as to help determine which experimental modules would be included in the main experiment. The pre-analysis plan for the main experiment was uploaded to the AEA RCT Registry on September 13th, 2017. The pilot for the main experiment was run at the Xlab in Berkeley, California, USA, from September 18th to September 20th, 2017. These data were not used in the analysis, but were used instead to uncover any remaining issues with the experimental code. The main experiments started and ended at roughly the same times across sites, running from September 25th, 2017 to February 15th, 2018 at the Xlab, and from September 29th, 2017 to February 15th, 2018 at the Busara Center. However, this general timeline contained smaller events and changes in recruitment that occurred within the experimental windows. To frame the discussion, these smaller lab events fall into three general categories: 1) changes in recruitment relating to the political situation in Nairobi; 2) the addition of sensors to record operative temperature and other measurements for the experiment; and 3) changes in recruitment to achieve gender balance.

1. Political events: As a result of the protests and other activity surrounding the elections in Kenya (please refer to Appendix Section A.1 for a review), we were emailed by Busara Center staff on October 3rd, 2017. In the email, staff noted that the University of Nairobi (from which we were recruiting participants at the time) had closed down temporarily, but indefinitely, requiring all students to immediately leave for home. The staff member noted that the "political temperature around the country has been quite high, with university students actively taking part in political demonstrations, one of which ended up being violent." The University of Nairobi was especially worried that student unrest could get out of hand, especially as the re-election date came close. To continue recruitment, participants were recruited from Strathmore University, a private university located in Nairobi. Students from Strathmore University participated from October 12th to October 19th, and lab activities in Nairobi were halted between October 24th and 31st, given the proximity to the elections. Despite the University of Nairobi remaining closed, students from the University of Nairobi began participating again in sessions starting November 1st (while no more students from Strathmore University participated after October 19th). The University of Nairobi reopened and resumed courses in early November. The lab halted again on November 20th as a result of the uncertainty surrounding the ruling announcement on the presidential election, but continued afterwards.

2. Additional measurements and measurement tools: Tools for measurements not originally specified in the pre-analysis plan were brought into each site during and after the course of the experiment. These additions were made following the recognition that individual participant temperature/relative humidity sensors placed at workstations were affected by body heat due to their location under participants' desks, where they

were extremely close to the individuals. Other changes were made following discussion with UC Berkeley's Center for the Built Environment (CBE), which occurred from the end of October to mid-November 2017. These discussions centered around the need to measure other environmental factors across rooms that could be influential. These tools included: 1) two additional HOBO Temperature/Relative Humidity Data Loggers, placed in more open (but still hidden) areas to more accurately record room temperature; 2) one HOBO MX1102 Carbon Dioxide Data Logger in California (Telaire 7001 CO2 monitors were used instead in Nairobi due to their lower cost), used to measure carbon dioxide; 3) one operative temperature sensor developed by UC Berkeley's Center for the Built Environment, used to measure operative temperature; 4) one RISEPRO Digital Sound Level Meter 30, used to measure background noise and used just once, when no participants were in the room; and 5) one Dr. Meter LX1330B Digital Illuminance/Light Meter, used to measure illuminance at each workstation and used just once, when no participants were in the room. Wind speed was also measured at the Xlab, but given the perception that this was not meaningful in the Busara Center, the equipment was not shipped over to the Busara Center, in part also due to cost.

In California, the two additional HOBO Temperature/Relative Humidity Data Loggers began recording for the first main experimental session on September 25th, 2017. On November 8th, both wind speed and lighting were recorded within the rooms. The sensors for operative temperature began recording on November 9th, 2017. The equipment for CO2 began recording for the sessions beginning on November 15th, while sound was recorded on November 16th. Meanwhile, in Nairobi, the two additional HOBO Temperature/Relative Humidity Data Loggers began recording on October 16th. The sensors for additional measurements arrived in Nairobi only in mid-January, after 800 participants had already been reached (by December 14th). In order to have real measurements on real participants, we decided on January 16th, 2018 to increase the sample collected at each site up to 900. Thus, sound and lighting were recorded on January 22nd, and the sensors for operative temperature and CO2 began recording on January 23rd. Please refer to the addendum to the pre-analysis plan for further details.

3. Gender ratios: As noted in the addendum to the pre-analysis plan, we observed during the course of data collection that the gender ratios were not equal across sites: the majority of participants in California were female, while the majority of participants in Nairobi were male. On January 23rd, 2018 we began more explicit discussion and investigation of the potential relevance of the share of male participants in a study room. To make the gender composition more comparable across sites, we increased recruitment of females in Nairobi and of males in California. We initially aimed to have the expansion of the sample to 900 incorporate this change in demographics, but recruitment at the Busara Center had already reached 900 participants before we could effect any change in targeting. Thus, we decided (on January 30th) to expand recruitment at the Busara Center to 1,000 participants, while keeping the target at 900 participants for the Xlab. At each site, we initially aimed to start recruiting at a site-specific 1:5 gender ratio,

starting the week of January 29th, 2018. Thus, in California we first aimed to recruit one female for every five males for all further sessions, and in Nairobi we first aimed to recruit five females for every one male for all further sessions. In practice, we moved to a site-specific 1:2 gender ratio for recruitment for sessions after January 31st, 2018. Figure 1 in the addendum to the pre-analysis plan shows the distribution of the proportions of male participants (weighted at the session-condition level) leading up to January 23rd, 2018. EPDT Figure J.1 displays the same information over the entire experiment, showing that the changes in recruiting made the gender balance more comparable across sites.

Figure J.1: Distribution of proportions of male participants in rooms, by site

Note: This figure is comparable to Figure 1 in the addendum to the pre-analysis plan, but with updated labels: the turquoise bars for the Xlab are now light green bars for California, while the pink bars for the Busara Center are now dark orange bars for Nairobi. Although six participants were usually recruited to a room, there were several instances in which there were five participants. The above values also condition on having identified as either male or female; those who preferred not to respond are not included in the total number of people in a room.

EPDT Figure J.2 presents site-specific timelines of the experiment in California and Nairobi, with each timeline also displaying the cumulative number of participants over the course of the experiment. Notable events from the above discussion are incorporated into the figure, as are events from the Nairobi and California political context subsections in Appendix Sections A.1 and B.1, respectively. Events directly relating to the lab are located directly above the timeline, while events relating to the political context are located directly below it. Note also that while the original 800-participant target was reached in Nairobi by December 14th, 2017, this was not the case in California: for practical reasons, recruitment to 800 in California required waiting until students finished their exams in December 2017 and returned from break in mid-January 2018.10

10Recruitment among the California pool was difficult from when classes ended for the semester (December 8th, 2017) until classes began for the spring semester (January 16th, 2018). Meanwhile, recruitment was more feasible during the analogous period for University of Nairobi students (November 30th, 2017 to January 9th, 2018).

Figure J.2: Experimental timeline

A: Nairobi

[Timeline figure: cumulative number of participants in Nairobi (final n = 1,015), August 2017 to February 2018, with lab events (additional temp/RH sensors begin recording; recruitment from Strathmore University begins and ends; Busara Center closes for elections and reopens; operative temperature and CO2 sensors begin recording; sound and lighting recorded; recruitment targets raised to 900 and then 1,000; 1:5 and then 1:2 gender-ratio recruiting begins) marked above the timeline, and political events surrounding the 2nd-round presidential election (annulment of the August 2017 results, Odinga's boycott, the October re-run, violent clashes, the ban on opposition protests, and the Supreme Court upholding the results) marked below.]

B: California

[Timeline figure: cumulative number of participants in California (final n = 903), August 2017 to February 2018, with lab events (experiment begins and additional temp/RH sensors begin recording; wind speed and lighting recorded; operative temperature, CO2, and sound recording begin; recruitment target raised to 900; 1:5 and then 1:2 gender-ratio recruiting begins) marked above the timeline, and political events (Charlottesville protests, DACA protests, 2018 Women's March) marked below.]

Note: Panel A shows the experimental timeline in Nairobi, Kenya, while Panel B shows the experimental timeline in California, USA. The colored bars display the cumulative number of participants going through the experiment. Events directly relating to the lab are located directly above the timeline, while events relating to the political context are located directly below the timeline, by site. Information on events directly relating to the lab has been drawn from EPDT Section J. Information on events relating to the political context has been drawn from Appendix Sections A.1 and B.1 for Panels A and B, respectively.

K Statistical power and MDEs

To determine the desired sample sizes at each site for the main experiment, we analyzed the relationship between sample sizes and the minimum detectable effect by drawing a series of "minimum detectable effect (MDE)-sample size graphs" (for a more detailed discussion, please see the pre-analysis plan). In the analysis, we set statistical power to 0.8, and both the significance level and the false discovery rate (FDR) to 0.05. Standard errors were adjusted for intraclass correlations by exploiting data from the "pre-pilot" experiments conducted between November 2016 and January 2017 (i.e., before the main experiment) at the Busara Center in Nairobi. We also employed two methods (the Bonferroni adjustment and the FDR adjustment) to adjust for multiple hypothesis testing. We concluded that recruiting 800 students at each site would achieve an MDE of about 0.264 standard deviations at each site separately, and about 0.188 standard deviations in the pooled sample (assuming the proportion of true null hypotheses to be 0.80). At the end of the experiment, we had 903 participants in California and 1,015 participants in Nairobi, exceeding the initial target of 800 participants per site (see EPDT Section J for further discussion of the additional recruitment).

To see whether our sample size is indeed large enough to detect conventional effect sizes with enough statistical power, we re-calculate statistical power using the numbers of observations, standard deviations, and intraclass correlations from the main experiment for the pooled sample, as well as the full set of responses for each primary outcome. EPDT Figure K.1 shows the relationship between statistical power (on the y-axis) and the MDE in standard deviations (on the x-axis). It shows, for example, that statistical power is approximately 0.8 if the MDE is 0.2 standard deviations (when the proportion of true null hypotheses π0 = 0.50). As expected, and similar to what we found in the pre-analysis plan, the figure shows that the Bonferroni adjustment gives the most conservative estimates over a wide range of MDEs and that the curves using the FDR adjustments fall between the unadjusted and Bonferroni lines for most values. EPDT Figure K.2 instead shows MDEs in standard deviations when power is set at 0.8. For most primary outcomes of interest, our sample can detect an effect size of approximately 0.199 standard deviations when π0 = 0.50 and of approximately 0.237 standard deviations when π0 = 0.80. The MDEs for the Joy of Destruction are slightly higher, due in part to having fewer observations. The MDEs for Patience and Time inconsistency are also higher, in part because these outcomes are calculated after dropping individuals whose choices consisted entirely of corner cases and violations of GARP, leaving fewer participant observations.

Although we have a larger sample size, these estimates of the MDE are larger than what we established in the pre-analysis plan, where in the pooled sample of 1,600 participants we estimated an MDE close to 0.158 if π0 = 0.50 and an MDE close to 0.188 if π0 = 0.80. These differences are driven largely by the average Moulton factor used in our current calculations (1.56, compared to 1.09 from the pilot data used in the pre-analysis plan). The two Moulton factors differ because of the differing composition of outcomes from which intraclass correlations are drawn: not all outcomes featured in the main experiment were present in the pre-pilot, and the modules that did appear in both did not always retain the same form. Differences also arise from having collected more participant data, which yield different estimates of the intraclass correlations that enter the Moulton factor.
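To make these calculations concrete, the following minimal sketch (in Python) reproduces the standard two-sample, normal-approximation version of the MDE formula with the Moulton design-effect adjustment described in the notes to EPDT Figures K.1 and K.2. The function names are illustrative and this is not the code used to produce the figures; in particular, the FDR-adjusted columns require an additional iterative step that is omitted here.

# A minimal sketch (not the authors' code) of the MDE calculation under the
# standard two-sample normal approximation, with the Moulton design-effect
# adjustment for session-level clustering described in the figure notes.
from scipy.stats import norm

def moulton_factor(icc, group_size):
    # Moulton factor: sqrt(1 + (m - 1) * rho); about 1.56 for rho = 0.13, m = 12
    return (1 + (group_size - 1) * icc) ** 0.5

def mde_sd_units(n_treat, n_control, icc=0.13, group_size=12,
                 alpha=0.05, power=0.80, n_outcomes=1):
    # n_outcomes > 1 applies the Bonferroni correction (alpha / 12 for the
    # 12 primary outcomes); the FDR adjustment is not implemented here.
    se = moulton_factor(icc, group_size) * (1 / n_treat + 1 / n_control) ** 0.5
    z_alpha = norm.ppf(1 - (alpha / n_outcomes) / 2)
    z_power = norm.ppf(power)
    return (z_alpha + z_power) * se

print(mde_sd_units(958, 960))                 # unadjusted: ~0.199 SD
print(mde_sd_units(958, 960, n_outcomes=12))  # Bonferroni: ~0.264 SD

With the pooled sample sizes above (958 treatment, 960 control), this sketch yields an unadjusted MDE of roughly 0.199 standard deviations and a Bonferroni-adjusted MDE of roughly 0.264, consistent with the ordering noted above, in which the FDR-adjusted values fall between the unadjusted and Bonferroni lines.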

Figure K.1: Power curves

Note: Power (y-axis) is plotted against effect sizes (x-axis), where intraclass correlations are taken into account. To compute power, we assume that the significance level is 0.05. The numbers of observations, standard deviations, and intraclass correlations are derived from our sample. The Moulton factor is constructed from the average of the intraclass correlations associated with the primary outcomes (excluding Patience and Time inconsistency) and sessions (the clusters). The average of the intraclass correlations (0.13) and the group size (12) yield a Moulton factor of 1.56. We plot power curves that are (a) unadjusted for multiple hypothesis testing, as well as power curves that adjust for it by (b) applying the Bonferroni correction (dividing the significance level by the number of primary outcomes, i.e., 12) and (c) controlling the false discovery rate (Type I errors) at the level of 0.05, where the proportion of true null hypotheses (π0) is assumed to be either 0.50 or 0.80. We use the full set of responses for each primary outcome. Thus, observations are 958 (for the treatment group) and 960 (for the control group) for all rows except for 1) 'Patience' and 'Time inconsistency', for which we compute aggregate estimates at the individual-menu choice level (excluding individuals who exhibit corner cases and violations of GARP), so that the respective numbers of observations are 3,392 (for the treatment group, 848 participants) and 3,360 (for the control group, 840 participants), and 2) 'Joy of Destruction', for which individuals could only destroy conditional on having a partner who earned at least one voucher or gift card, so that observations are 948 (for the treatment group) and 951 (for the control group). Standard deviations for outcomes are: 'Production': 7.50 (treatment), 7.09 (control); 'Fairness': 0.24 (treatment), 0.24 (control); 'Risk-taking': 429.41 (treatment), 450.56 (control); 'Rational choice violation I': 0.15 (treatment), 0.16 (control); 'Patience': 0.00640 (treatment), 0.00638 (control); 'Time inconsistency': 0.192 (treatment), 0.201 (control); 'Trust': 0.32 (treatment), 0.32 (control); 'Cooperation': 417.50 (treatment), 425.97 (control); 'Fluid intelligence': 0.17 (treatment), 0.18 (control); 'Joy of Destruction': 0.26 (treatment), 0.25 (control); 'Cognitive reflection': 0.28 (treatment), 0.29 (control); and 'Charitable donation': 639.05 (treatment), 636.11 (control).
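The power curves themselves follow from inverting the same approximation: for a hypothesized effect of δ standard deviations, power is approximately Φ(δ/SE − z), where SE is the Moulton-inflated standard error used above and z is the two-sided critical value at the chosen significance level. The following minimal sketch traces out the unadjusted curve under the same assumptions as the MDE sketch above; it is an illustration, not the code behind the figure, and again omits the FDR-adjusted curves.

# A minimal sketch of the multiple-hypothesis-unadjusted power curve in
# EPDT Figure K.1, under the same normal approximation and Moulton
# adjustment as the MDE sketch above (FDR-adjusted curves omitted).
from scipy.stats import norm

def power_at_effect(delta_sd, n_treat=958, n_control=960, icc=0.13,
                    group_size=12, alpha=0.05, n_outcomes=1):
    # Two-sided power for an effect of delta_sd standard deviations.
    moulton = (1 + (group_size - 1) * icc) ** 0.5
    se = moulton * (1 / n_treat + 1 / n_control) ** 0.5
    z_alpha = norm.ppf(1 - (alpha / n_outcomes) / 2)
    return norm.cdf(delta_sd / se - z_alpha)

for delta in (0.10, 0.15, 0.20, 0.25):
    print(delta, round(power_at_effect(delta), 2))
# Power is roughly 0.8 at delta = 0.20, in line with the discussion above.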

Figure K.2: Minimum Detectable Effects under different multiple hypothesis testing adjustment strategies

Note: The minimum detectable effects (MDEs) in standard deviations are shown, where intraclass correlations are taken into account and power is set at 0.8. To compute the MDEs, we assume that the significance level is 0.05. The numbers of observations, standard deviations, and intraclass correlations are derived from our sample. The Moulton factor is constructed from the average of the intraclass correlations associated with the primary outcomes (excluding Patience and Time inconsistency) and sessions (the clusters). The average of the intraclass correlations (0.13) and the group size (12) yield a Moulton factor of 1.56. The first column shows MDEs, unadjusted for multiple hypothesis testing, across all outcomes. The second column shows MDEs applying the Bonferroni correction (dividing the significance level by the number of primary outcomes, i.e., 12). The last two columns control the false discovery rate (Type I errors) at the level of 0.05, where the proportion of true null hypotheses (π0) is assumed to be either 0.50 or 0.80. We use the full set of responses for each primary outcome. Thus, observations are 958 (for the treatment group) and 960 (for the control group) for all rows except for 1) 'Patience' and 'Time inconsistency', for which we compute aggregate estimates at the individual-menu choice level (excluding individuals who exhibit corner cases and violations of GARP), so that the respective numbers of observations are 3,392 (for the treatment group, 848 participants) and 3,360 (for the control group, 840 participants), and 2) 'Joy of Destruction', for which individuals could only destroy conditional on having a partner who earned at least one voucher or gift card, so that observations are 948 (for the treatment group) and 951 (for the control group). Standard deviations for outcomes are: 'Production': 7.50 (treatment), 7.09 (control); 'Fairness': 0.24 (treatment), 0.24 (control); 'Risk-taking': 429.41 (treatment), 450.56 (control); 'Rational choice violation I': 0.15 (treatment), 0.16 (control); 'Patience': 0.00640 (treatment), 0.00638 (control); 'Time inconsistency': 0.192 (treatment), 0.201 (control); 'Trust': 0.32 (treatment), 0.32 (control); 'Cooperation': 417.50 (treatment), 425.97 (control); 'Fluid intelligence': 0.17 (treatment), 0.18 (control); 'Joy of Destruction': 0.26 (treatment), 0.25 (control); 'Cognitive reflection': 0.28 (treatment), 0.29 (control); and 'Charitable donation': 639.05 (treatment), 636.11 (control).
