
The Accuracy and Precision of Asteroid Impact: Tools for Validating Reaction Time Stimuli

Thesis

Presented in Partial Fulfillment of the Requirements for the Degree Master of Arts in the

Graduate School of The Ohio State University

By

Leandro Avila Calcagnotto

Graduate Program in Communication

The Ohio State University

2020

Thesis Committee

Gerald Kosicki, Advisor

Richard Huskey


Copyrighted by

Leandro Avila Calcagnotto

2020


Abstract

Our ability to test and falsify theories rests on valid and reliable measurement.

Minimizing noise, or measurement error, is a hurdle that researchers must overcome.

Researchers use state-of-the-art tools to investigate the cognitive underpinnings of human communicative behavior and measurement error is likely to differ by instrument. It is important to determine if a given instrument is sufficiently accurate to extract signal from noise. Reaction time is commonly used to measure cognitive processes during media use.

A reaction time measurement tool has been recently implemented in Asteroid Impact, an open source, naturalistic, experimental video game stimulus. We tested if reaction time latency differed across computer operating system (macOS, Windows 10, Ubuntu 18.04), software (Python 2, Python 3), and trial modality (audio, visual) in Asteroid Impact.

We used an Arduino microcontroller to generate a ground truth value of average response time latency. Specifically, we measured the latency between the onset of visual and auditory trials and a response from the Arduino. The Arduino was programmed to look for changes in luminance (visual trial) or sound amplitude (audio trial) every millisecond. When a change was detected, the Arduino was programmed to issue a keypress response. Any observed latency between trial onset and the microcontroller’s response reflects measurement error in the computer hardware and software. Ideally, this latency will have a small mean and variance, which would indicate that Asteroid Impact is accurate in measuring reaction times. All code to fully reproduce and replicate this study is available on GitHub (https://anonymous.4open.science/r/8b618f1e-6cf4-4bdd-9e3b-4058a37bfb53/).

ANOVA results showed a significant three-way interaction, F(2, 1188) = 199.95, p < .001, η2 = .25. Auditory trials using Python 3 on macOS had the lowest latency (M = 44.6, SD = 8.4). Auditory trials using Python 2 on Ubuntu 18.04 had the longest latency (M = 121, SD = 18.4). Raincloud plots of the main effects and two-way interactions (all p’s < .001) are included to further characterize the distribution of results.

Which configuration of Asteroid Impact is most accurate when measuring reaction times? It depends. Auditory trials are most accurate on macOS (regardless of Python version). However, visual trials are most accurate using Python 2 on Ubuntu 18.04. Practically speaking, Windows 10 may be the best solution as it has acceptable accuracy for both visual and auditory trials (regardless of Python version). We are encouraged to see that reaction times collected using Asteroid Impact are comparable to other response-latency measuring software. Moreover, these values are well below the speed at which the human visual system can detect and respond to complex visual perception tasks. Together, these results demonstrate that Asteroid Impact is a valid and reliable stimulus for measuring reaction time data.


Dedication

In dedication to my family and friends who supported me throughout this process.


Acknowledgments

My deepest gratitude to Richard Huskey and Gerald Kosicki for their unwavering guidance, support, and devotion to this project. My sincere thanks to The Ohio State

University for the opportunities and life experience.


Vita

2009 ...... Palos Verdes Peninsula High School

2013 ...... B.A., Double Major in Psychology and Communication, University of California, Santa Barbara

2018 to 2019 ...... Graduate Research Fellow, The Ohio State University

2019 to 2020 ...... Graduate Teaching Associate, School of Communication, The Ohio State University

2020 ...... M.A. Communication, The Ohio State University

Fields of Study

Major Field: Communication


Table of Contents

Abstract
Dedication
Acknowledgments
Vita
List of Tables
List of Figures
Chapter 1. Introduction
The Big Problem in Science
Identifying Instrument Noise
Validating Experimental Tools
Chapter 2. Literature Review
The Impetus for Reaction Time Methodology
The Variable Field of Communication
Reaction Time Measures in Communication
Current Trends in Communication’s Empirical Philosophy
Asteroid Impact as an Experimental Platform
Sources of Response Latency Noise in the Modern Era
Microcontrollers as Validation Tools
The Current Investigation
Chapter 3. Methods
Overview and Design
Materials
Procedure
Chapter 4. Results
Purpose of Experiment and Statistical Approach
Interpreting Response Latencies
Overall ANOVA Model and Main Effects
Moderation Analysis for Two and Three-Way Interactions
Commentary


Chapter 5. Discussion
Summary of Principal Findings
Limitations
Value of Present Research
Summation
Chapter 6. Conclusion
Implications for Future Researchers
Arduinos as Empirical Tools
From Method to Theory
Final Statement
Bibliography
Appendix A: Figures
Appendix B: Tables


List of Tables

Table 1 Fixed-Effects ANOVA results using RT as the criterion
Table 2 Descriptive Statistics for RT as a function of Operating System
Table 3 Descriptive Statistics for RT as a function of Python version
Table 4 Descriptive Statistics for RT as a function of Modality
Table 5 Means and SD for RT as a function of a 3(OS) X 2(Modality) design
Table 6 Means and SD for RT as a function of a 2(Python) X 2(Modality) design
Table 7 Means and SD for RT as a function of a 2(Python) X 3(OS) design
Table 8 Means and SD for RT within a 2(Modality) X 2(Python) design for Ubuntu OS
Table 9 Means and SD for RT within a 2(Modality) X 2(Python) design for macOS
Table 10 Means and SD for RT within a 2(Modality) X 2(Python) design for Windows


List of Figures

Figure 1 Arduino Microcontroller-Breadboard Circuit
Figure 2 Plot Diagram for Three-Way Interaction
Figure 3 Raincloud Plot for Operating System
Figure 4 Raincloud Plot for Python Version
Figure 5 Raincloud Plot for Modality
Figure 6 Raincloud Plot for Operating System and Modality Interaction
Figure 7 Raincloud Plot for Operating System and Python Interaction


Chapter 1. Introduction

The Big Problem in Science

A limitation encountered when conducting empirical research is the accuracy and precision of measurement. The noise associated with any measure is likely to differ by instrument and acts as a bottleneck for the level of depth and sophistication of scientific inquiry. The limited accuracy and precision of an instrument also constrains the insights and contributions capable of being drawn from its measures. Thus, in order to effectively test, falsify, and contribute to theory, valid and reliable measurement is required.

Identifying Instrument Noise

It is important to determine if a given instrument is sufficiently able to extract signal from noise. Controlled experiments that use computer-based instruments are vulnerable to noise from various sources within their hardware and software. This noise can elude researchers and lead to inaccurate data and conclusions, which may result in inconsistent research findings across labs, universities, and entire fields. Such inconsistencies can potentially undermine replication efforts, inflate empirical relationships, and contribute to false-positive findings in empirical science (Plant, 2016).

One way of correcting for the potential noise in a measure is to validate the instrument.


Validating Experimental Tools

A reaction time measurement tool has been recently implemented in Asteroid Impact (Huskey, Craighead, Miller, & Weber, 2018), an open source, naturalistic, experimental video game stimulus. In line with previous research (Neath, Earle, Hallett & Surprenant, 2011; Schubert, Murteira, Collins & Lopes, 2013), this study will use specialized hardware to establish a ground truth of measurement error for the experimental platform. This investigation will 1) convey the importance and widespread use of reaction time methodology, 2) highlight current issues in using reaction time measures, and 3) present a tool to assess response time latency across operating system (macOS, Windows 10, Ubuntu 18.04), Python version (Python 2, Python 3), and trial modality (audio, visual) in Asteroid Impact. The results will highlight the importance of validating measuring tools and point the way towards a contribution to theory through improvements in methodology.


Chapter 2. Literature Review

The initial traces of reaction time methodology stem back to the pioneering work of Galileo Galilei in 1638. In an experiment designed to determine the velocity of light,

Galileo set two experimenters apart at a distance (actual distance unknown), facing one another with coverable lanterns (Foschi & Leone, 2009). The first experimenter was instructed to reveal the light from his lantern to the second experimenter, who would reactively respond by uncovering and revealing his own lantern’s light. Galileo reasoned that if light traveled at a finite velocity there would be an observable delay in luminance between the lanterns (Foschi & Leone, 2009). Foreseeing the potential for human error,

Galileo had his experimenters practice the speed of their reactions until their timings were near instantaneous. This practice to reduce experimenter error reveals Galileo’s understanding of a fundamental principle in empirical research: controlling the variability and error contributed by the observer. Though the simplistic and primitive nature of Galileo’s experiment meant he could not calculate the velocity of light, he was able to conclude that the velocity of light is indeed finite (Foschi & Leone, 2009). It would only be after the invention of more precise and accurate measuring instruments that a better estimation of the velocity of light could be determined. This overview of

Galileo’s experiment highlights important scientific practices that continue to operate today, such as controlling for human error and variability, implementing methods and techniques to ensure precise and accurate measures, and an understanding that novel scientific discovery often depends on the quality of measuring instruments.

The Impetus for Reaction Time Methodology

The utility of reaction time methodology in isolating mental processes was established by a series of experiments in the late 19th century. This began when an intertwining of scientific concerns from physiologists to understand the “mechanisms of sensation”, astronomers to reduce observational error, and physicists to understand discrete mental processes culminated in the field of experimental psychology (Kirsch, 1976). Eventually, reaction time methodology would establish itself within the field of experimental psychology, where it would serve a foundational role in uncovering, measuring, and understanding mental processes.

In 1795, astronomers needed to control for human error in observation after they developed an “eye and ear” method to record distances of stellar movement (Kirsch, 1976). This measure was subject to variation between astronomers until Friedrich Wilhelm Bessel developed a calculation for the average difference between observations, named the “personal equation” (Kirsch, 1976). This personal equation was an initial step towards accounting for individual differences in scientific observation.

Then, in the 1830s, physiologists became interested in determining the speed of neural impulses. Physiologist Hermann von Helmholtz attempted to answer this question by stimulating a frog’s nerves at varying distances from a muscle and then measuring the response time delay between muscular contractions (Kirsch, 1976). After replicating this experiment on sensory nerves in humans, Helmholtz began to investigate how response times could be used to measure active mental processes (Kirsch, 1976).

Franciscus Cornelis Donders furthered this research by using response latencies to measure isolated mental processes (Roelofs, 2018). He accomplished this by combining visual and motor tasks and measuring the time taken to identify and respond to each stimulus. Donders hypothesized that response times would differ across human subjects when identifying and responding to each stimulus, which would indicate differences in mental processing (Roelofs, 2018).

Finally, Wilhelm Wundt, considered by some as one of the founders of modern experimental psychology at the University of Leipzig, performed an experiment that required a research subject to simultaneously respond to a visual and an auditory stimulus (Kirsch, 1976; Robinson, 2001). This was coined the “complication experiment” because differences in subjects’ allocation of attention across both stimuli resulted in errors in either visual or auditory perception. In this way, Wundt’s work led to the first discrete measure of the mental process of attention (Kirsch, 1976; Robinson, 2001).

The collective scientific pursuits of the 19th century led to the establishment of reaction time methodology for measuring cognitive processes (Kirsch, 1976). It also helped form the basis for psychometry, the study of the duration of mental processes (Kirsch, 1976). Together, these experiments show how incrementally increasing the precision and accuracy of measurement deepens the insights and conclusions capable of being drawn.


The Variable Field of Communication

Communication research has for decades drawn the interest of people from a vast array of academic disciplines. Schramm summarized this accurately when he described communication as “an academic crossroad where many have passed, but few tarried.”

Communication has developed into a “variable field” (like economics or political science), unlike the “level fields” of psychology and sociology from which quantitative communication research emerged (Slater & Gleason, 2012). A level field typically locates its subject matter concepts at a given level of analysis. For example, psychology is characterized by the study of individual level and intra-individual concepts, but incorporates levels such as biological, cognitive, and sociocultural (Paisley, 1972). As a variable field, communication is able to investigate variables across several levels of analysis, which encourages the borrowing of methods, tools, and perspectives from other disciplines (Slater & Gleason, 2012).

Perhaps more than other disciplines, communication science has been derived from scholars trained in related level and variable fields such as quantitative sociology, cognitive and social psychology, as well as psychology, education, and political science, among others (Paisley, 1972; Slater & Gleason, 2012). The resulting graduate training borrows heavily from current, promising quantitative methods that are popular at any given time. Accordingly, communication has a history of methodological pluralism, which lends itself to the adoption and use of reaction time measures.


Reaction Time Measures in Communication

Reaction time measures in communication have contributed to a vast breadth of research findings across various topics. Reaction time measures are used to support theories on cognitive resource availability, such as the Limited Capacity Model of

Motivated Message Processing (LC4MP) (Lang, Bradley, Park, Shin & Chung, 2006).

They have also been used to measure levels of aggression after exposure to violent media

(Ferguson, Smith, Miller-Stratton, Fritz & Heinrich, 2008), and are embedded into implicit association tests (IAT) to measure unconscious bias (Hong & Len-Rios, 2015;

Ottaway, Hayden & Oakes, 2001). In media neuroscience investigations, reaction time measures have been used to evaluate cognitive and perceptual loads (Lang et al., 2006;

Fisher, Huskey, Keene & Weber, 2018), and support measures of attention and engagement in flow experiences (Huskey et al., 2018). Even in observational social science research such as random population surveys, timers are critical to monitor the steady progress of the respondents throughout the activity (Bassili & Fletcher, 1991;

Lavrakas, Traugott, Kennedy, Holbrook, Leeuw & West, 2019). This breadth of research showcases the versatility and expanse of reaction time measures in the context of both embedded experimental treatments and laboratory settings within communication.

Current Trends in Communication’s Empirical Philosophy

In the vanguard of communication science are media neuroscientists who use state-of-the-art tools and methods to investigate the cognitive underpinnings of human communicative behavior. Media neuroscientists adopt a neurophysiological perspective which prioritizes cognitive and biological explanations for human communication behavior (Weber, Sherry & Mathiak, 2008). Such an approach requires the use of a variety of measurement tools such as fMRI, EEG, and, relevant for the present study, reaction time measurements. These tools help pry open the so-called “black box” (Lang & Ewoldsen, 2009) of communication processes. Importantly, our ability to look inside this black box is contingent on the accuracy of our measuring tools.

Asteroid Impact as an Experimental Platform

At the Media Neuroscience Lab at the University of California, Santa Barbara, a naturalistic, experimental stimulus, Asteroid Impact, was created to serve as an experimental platform for media neuroscience investigations. Asteroid Impact has been used to support a variety of research findings, such as flow experiences (Huskey et al., 2018). Asteroid Impact uses an embedded reaction time tool, a secondary-task reaction time (STRT) measure, to evaluate subjects’ responses to stimuli. STRTs are an effective way to indicate differences in cognitive processing (Lang, 2009). For STRT measures, subjects engaged in a primary task must simultaneously respond to and perform a secondary task (Lang, 2009). In Asteroid Impact, subjects’ primary task is to prevent their spacecraft (mouse cursor) from being hit by flying asteroids. Subjects must concurrently perform a secondary task of pressing the SPACE key when an audio (ping sound) or visual (colorful gem) stimulus is presented. The time difference between the onset of the stimulus (gem or audio ping) and the subject’s response (pressing the SPACE key) is the STRT measure. Theoretically, the STRT should measure the time needed for a subject to identify and respond to a stimulus. However, sources of noise within computer-based measuring equipment confound this measure in the form of added response time latency.


Sources of Response Latency Noise in the Modern Era

Response latency variability remains a universal and pervasive problem in computer-based experiments. Rather than reducing it, the evolution of modern hardware has led to greater sources of variability and error than at any time before (Plant, 2016). This problem stems from computer manufacturers’ incentive to increase multi-functionality while reducing equipment costs and substituting hardware functions for software alternatives (Plant, 2016). The issue of response latency in computer-based measuring equipment is even more important in fields that use methods such as fMRI, EEG, and eye-tracking tools, which require millisecond-level accuracy (Plant, 2016). The benefits of millisecond-level accuracy are highlighted by findings that show how human visual systems can register the outlines of animals with 83% accuracy in just a single millisecond (Thurgood, Whitfield, & Patterson, 2011). Even higher-precision EEG tools demonstrated that subjects show unconscious electrical responses to stimuli presented for just 250 microseconds (1/4 of a millisecond) (Sperdin, Spierer, Becker, Michel, & Landis, 2015). The ability of researchers to demonstrate such fine-grained findings relies on minimizing error, particularly response time latency, within measures.

The widespread use of standardized experimental hardware and software overshadows its highly variable nature. Uncontrolled response time latency is problematic because it constrains the efforts of researchers and laboratories to ensure precision in their measures, leading to inconsistent research findings across scientific fields. Thus, whether due to funding constraints, convenience, or inflated perceptions of competence, researchers’ use of computer-based equipment introduces a major problem in current science. The resolution to this problem follows from what Plant (2016) has recommended for all researchers: to self-validate experimental measuring instruments.

Microcontrollers as Validation Tools

In light of the potential issues with using computer-based measuring equipment, one fundamental question emerges: How can researchers account for response latency noise associated with their measuring tools? The answer can be found by looking to a specialized piece of hardware known as an Arduino microcontroller.

The Arduino microcontroller is an electronic controller device built with hardware and software from Arduino, a global open-source project and user community based at the Interaction Design Institute Ivrea in Italy (Introduction, n.d.; retrieved from https://www.arduino.cc/en/guide/introduction). Arduino microcontrollers contain a microprocessor capable of performing basic computing functions such as sending and receiving electrical inputs. The utility of the Arduino stems from its open-source programming that allows users to tailor its functionality to their specific needs. Arduino microcontrollers are often combined with external sensory-modality equipment such as photoresistors, microphones, LEDs, and buzzers. The critical utility of the Arduino stems from its reliable and near-instantaneous processing speed, which makes it an ideal tool to test for response latencies in computer equipment (Schubert, D’Ausilio, & Canto, 2013).

Previous literature shows that Arduino boards can even be used to track and test the response latencies of equipment by emulating a keyboard or mouse (Schubert, D’Ausilio, et al., 2013). Arduino technology has also been used to validate online-browser reaction time programs that are used to correct for response latency issues in online experiments.
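To make the keyboard-emulation idea concrete, the sketch below shows the general approach on a Leonardo-class board, which can present itself to the host computer as a USB keyboard through the Arduino Keyboard library. The pin, threshold, and trigger condition are illustrative placeholders rather than the code used in the studies cited above.

#include <Keyboard.h>

const int SENSOR_PIN = A0;   // illustrative analog input (e.g., a light or sound sensor)
const int THRESHOLD  = 600;  // illustrative detection threshold on the 0-1023 analog scale

void setup() {
  Keyboard.begin();          // the board enumerates as an ordinary USB keyboard
}

void loop() {
  // When the placeholder detection condition is met, the host computer receives
  // a SPACE keystroke exactly as if a human participant had pressed the key.
  if (analogRead(SENSOR_PIN) > THRESHOLD) {
    Keyboard.write(' ');
    delay(1000);             // crude lockout so one event produces one keystroke
  }
  delay(1);                  // poll roughly once per millisecond
}

Because the keystroke arrives through the same USB keyboard pathway a human response would travel, any latency the experiment software records for it reflects the hardware and software chain rather than the “participant.”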


Example 1: Schubert, Murteira, et al. (2013) validated a ground truth of response latency for ScriptingRT, a software library for collecting response latencies in online experiments. Using an Arduino to emulate a keyboard responding to the program, these researchers found differences in average latency across internet browsers that ranged from 18-47 ms, with standard deviations of approximately 7 ms (Schubert, Murteira, et al., 2013).

Example 2: Neath et al. (2011) measured average response latencies (in ms) for experimental software on configurations of Macintosh computers across Java (M = 99.7, SD = 6.7), Flash (M = 91.6, SD = 8.1), and JavaScript (M = 88, SD = 5.9). Neath et al. (2011) then compared these latencies to native software (M = 49.9, SD = 2.6) to conclude that the differences between native software and Java, Flash, and JavaScript were acceptable. As evidenced, Arduino technology offers an effective method for measuring response latencies in computer-based experimental designs, which this investigation will perform for Asteroid Impact.

The Current Investigation

Asteroid Impact has been used as an experimental platform to support research in media neuroscience but has not yet been validated. This investigation will address this gap in the literature by testing different hardware and software configurations while running Asteroid Impact. We used an Arduino Leonardo microcontroller to test the response time latency in Asteroid Impact’s STRT measure across computer operating system (macOS, Windows 10, Ubuntu 18.04), Python version (Python 2, Python 3), and trial modality (audio, visual). Ideally, this latency will have a small mean and variance, which would indicate that Asteroid Impact is accurate in measuring reaction times. We expect these response latencies to be in line with previous research (Schubert, Murteira, et al., 2013; Neath et al., 2011). The results from this study will help to establish a ground truth of measurement error for experimental configurations running Asteroid Impact.


Chapter 3. Methods

Overview and Design

We independently measured the response time latency of Asteroid Impact’s STRT measure across twelve unique experimental configurations. We programmed an Arduino microcontroller to function as an emulated keyboard in a data-collection simulation for Asteroid Impact’s STRT measure with the following factorial design: 3 (operating system: macOS vs Windows 10 vs Ubuntu 18.04) × 2 (Python version: Python 2 vs Python 3) × 2 (modality: audio vs visual). The response latency of each configuration was tested for both audio and visual stimuli, which simulated what human subjects would experience when playing Asteroid Impact. The total response latency measure was derived from Asteroid Impact’s internal clock, from the time of the trial onset to when it registered a SPACE keypress from the Arduino. Any observed latency between trial onset and interpreting the keypress reflects measurement error in the configuration. We ran 100 trials for each of the twelve configurations, totaling 1200 trials (N = 1200).

The Asteroid Impact software measured the time in milliseconds between the STRT trial onset and offset. We monitored the consistency of the Arduino’s own response latency by programming it to start and stop a timer (in microseconds) from the moment it detected a stimulus to when it issued the SPACE keypress. This latency was minuscule at approximately 80 microseconds (less than 1/10 of a millisecond).


Lastly, we connected and programmed a blue LED to blink when the Arduino registered an audio or visual stimulus. This was done to help the experimenters ensure that the Arduino was constantly scanning the environment and did not impact the results.

Materials

We programmed an Arduino Leonardo Microcontroller with Headers to function as an emulated keyboard that connected to a computer via micro USB. We programmed the Arduino using the open-source Arduino Integrated Development Environment (IDE) with functions written in C/C++. We created a circuit between the Arduino and an Elegoo MB-102 Solderless Breadboard by connecting jumper wires from the Arduino’s ground and power pins to the breadboard’s power rails. We then connected an Anmbest Microphone Sensor, an Elegoo photoresistor, and a blue LED directly onto the breadboard. We inserted a 10k ohm resistor to limit the electrical current distributed to the photoresistor and a 220k ohm resistor to the LED. This setup constituted the Arduino-breadboard circuit that acted as our “fake human” (see Figure 1).

We tested the Arduino-breadboard circuit independently on each of three laptop computers running either macOS, Windows 10, or Ubuntu 18.04 (Linux) operating systems. The specifications for the laptop running macOS were: 2.9 GHz Intel Core i7, 16 GB 2113 MHz LPDDR3, Radeon Pro 560 4 GB with Intel HD Graphics 630 1536 MB. The specifications for the laptop running Windows 10 were: i7-7700HQ 3.8 GHz, 16 GB DDR4 RAM at 2400 MHz, PCIe M.2 SSD, NVIDIA(R) GeForce(R) GTX 1050 with 4 GB GDDR5 Graphics. The specifications for the laptop running Ubuntu 18.04 were: Intel i7-7500U 3.5 GHz, 16 GB DDR4 RAM at 1600 MHz, PCIe M.2 SSD, and Intel HD 620 Graphics. Each operating system-specific laptop then independently ran Asteroid Impact natively with Python 2 (version 2.7.13) or Python 3 (version 3.7.3) software. For visual trials, we programmed Asteroid Impact to replace its black background with a white background to cause a change in display luminance that the photoresistor could detect.

For audio trials, we programmed Asteroid Impact to emit a ping sound at maximum volume to cause a change in sound amplitude that the microphone sensor could detect.

Procedure

The Arduino was programmed to scan the environment every millisecond for changes in either sound amplitude or luminance. For visual trials, the photoresistor was placed within an inch of the laptop display and all other surrounding light sources were turned off. When the photoresistor detected a change in luminance that met the threshold, the Arduino was programmed to issue a SPACE keypress to the laptop running Asteroid Impact. For audio trials, the microphone sensor was placed within an inch above the laptop’s speakers. When the microphone detected a change in sound amplitude that exceeded the decibel threshold, the Arduino was programmed to issue a SPACE keypress to the laptop running Asteroid Impact. Upon receiving the SPACE keypress, Asteroid Impact would close the STRT prompt, delay for 3 seconds, and then repeat subsequent trials. Performing 100 trials of data collection took approximately 5 minutes.
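The following sketch illustrates this detection-and-respond loop. It follows the logic described in this chapter, but the pin assignments, thresholds, and timing constants are illustrative values rather than the exact firmware settings; the full code used in this study is available in the repository linked in the Abstract.

#include <Keyboard.h>

// Illustrative pin assignments and thresholds; the actual values were calibrated
// to each laptop's display and speakers.
const int LIGHT_PIN = A0;          // photoresistor voltage divider (visual trials)
const int MIC_PIN   = A1;          // microphone sensor output (audio trials)
const int LED_PIN   = 7;           // blue status LED
const int LIGHT_THRESHOLD = 600;   // luminance change threshold (0-1023)
const int SOUND_THRESHOLD = 600;   // sound amplitude threshold (0-1023)

void setup() {
  pinMode(LED_PIN, OUTPUT);
  Keyboard.begin();                // emulated USB keyboard
  Serial.begin(9600);              // used only to log the Arduino's own latency
}

void loop() {
  int light = analogRead(LIGHT_PIN);
  int sound = analogRead(MIC_PIN);

  if (light > LIGHT_THRESHOLD || sound > SOUND_THRESHOLD) {
    unsigned long detected = micros();            // timestamp at stimulus detection
    Keyboard.write(' ');                          // emulated SPACE keypress to Asteroid Impact
    unsigned long ownLatency = micros() - detected;
    Serial.println(ownLatency);                   // roughly 80 microseconds in our monitoring

    digitalWrite(LED_PIN, HIGH);                  // blink to confirm the stimulus was registered
    delay(100);
    digitalWrite(LED_PIN, LOW);

    delay(2500);                                  // wait out most of the 3-second inter-trial delay
  }

  delay(1);                                       // rescan the environment about every millisecond
}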


Chapter 4. Results

Purpose of Experiment and Statistical Approach

We delineated the optimal configurations of Asteroid Impact’s reaction time measure across operating system, Python version, and modality. We performed a 3 (OS: macOS, Windows 10, Ubuntu 18.04) × 2 (Python version: Python 2, Python 3) × 2 (modality: audio, visual) factorial analysis of variance (ANOVA) to assess differences in mean response time latency across each configuration. Levene’s test for homogeneity of variance, F(11, 1188) = 25.78, p < .001, and the Shapiro-Wilk test for normality (p < .001) showed that both assumptions were violated. However, we purposefully ran an ANOVA due to its robustness to these violations (Girden, 1992). Reaction time data are prone to non-normality and heterogeneity of variance, and simulation studies indicate that the ANOVA F test tolerates such violations reasonably well (Lix, Keselman & Keselman, 1996). A post-hoc Tukey’s honest significant difference test showed that macOS and Windows had lower mean response latencies than Ubuntu at the p < .001 level. The macOS also had a slightly smaller mean response latency compared to Windows at the p < .001 level.
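In model form, the 3 × 2 × 2 factorial ANOVA above corresponds to the standard three-factor fixed-effects specification (the notation here is ours and is included only for reference):

RT_{ijkl} = \mu + \alpha_i + \beta_j + \gamma_k + (\alpha\beta)_{ij} + (\alpha\gamma)_{ik} + (\beta\gamma)_{jk} + (\alpha\beta\gamma)_{ijk} + \varepsilon_{ijkl}

where \alpha_i denotes operating system (i = 1, 2, 3), \beta_j denotes Python version (j = 1, 2), \gamma_k denotes trial modality (k = 1, 2), the parenthesized terms are the two- and three-way interactions, and \varepsilon_{ijkl} is the residual error for trial l (l = 1, ..., 100) within each cell.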

Interpreting Response Latencies

What exactly do we mean by response latency? When we report a mean response latency, we are reporting the average time in milliseconds between the start and end of a reaction time trial in Asteroid Impact. Any delay (in milliseconds) between the onset and offset of a trial is hypothesized to be due to sources of noise within the system’s features, which we have categorized into our three independent factors. The trial’s onset prompts a stimulus cue that the Arduino will detect and respond to by issuing a keypress to Asteroid Impact. Once Asteroid Impact receives and registers the keypress, it will close the trial and record the time between the start and end of the trial. Hence, a larger mean response latency would indicate that the configuration is slower to process inputs passing through the system. For example, if a subject identifies and responds to a stimulus in 500 ms, then the raw data score will have additional sources of response time latency (error) added to it. This response latency would exist both before the stimulus appeared on the screen (e.g., the latency of the display to present an image, which depends on refresh rate) and after the subject initiated a keypress (e.g., the latency of the Asteroid Impact software to receive and interpret the keypress). Pre- and post-response latencies are not a true measure of the subject’s reaction time, but nonetheless will be systematically incorporated into the subject’s data score and all other data collected with that particular configuration. The fundamental issue with these errors is that they cloud researchers’ interpretations of the data. Depending on how deeply a researcher wants to investigate, the level of accuracy and precision of their measuring tools might mean the difference between detecting an effect or not. As such, the interpretation of our response time latency results can be boiled down to the adage “less is more,” meaning that small means and standard deviations indicate faster, more consistent performance for a configuration.
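Expressed as a simple equation (the symbols are ours, introduced only to summarize the logic above), the decomposition is:

RT_{observed} = RT_{true} + e_{pre} + e_{post}

where e_{pre} is latency added before the stimulus is physically presented (e.g., display refresh) and e_{post} is latency added after the response is initiated (e.g., the software receiving and interpreting the keypress). In the 500 ms example, a configuration that added a combined 60 ms of such latency (an illustrative figure) would record roughly 560 ms.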


Overall ANOVA Model and Main Effects

What did the resulting ANOVA model show? Table 1 shows that the fixed-effects ANOVA for response time latency produced significant main effects, two-way interactions, and a three-way interaction across all three factors (all p < .001). Furthermore, the model has large error degrees of freedom (df = 1188) derived from the sample size (N = 1200). Table 1 presents F values, which show the ratio of between-group variance to within-group variance. The larger the F value, the more likely it is that a significant difference exists between factors and that the results did not occur by chance. Table 1 also shows partial effect sizes (η2), which indicate the proportion of variance attributable to differences between factors compared to differences within the factor. Another way of conceptualizing this is as the ratio of explained to unexplained variance, such that we can explain the differences between factors but cannot explain the within-factor variation. As shown in Table 1, we found statistical significance for the ANOVA omnibus test, which means that our model significantly differs from the null model (in which no significant differences exist between independent variables), F(2, 1188) = 199.95, p < .001, η2 = .25. That is, there are actual differences attributable to each factor that account for differences in mean response time latency.
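For reference, these two statistics are defined in the usual way (standard definitions, not reproduced from Table 1):

F = \frac{MS_{effect}}{MS_{error}}, \qquad \eta^2_{p} = \frac{SS_{effect}}{SS_{effect} + SS_{error}}

where SS denotes a sum of squares and MS a mean square (the sum of squares divided by its degrees of freedom).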

What did the main effects show? Tables 2-4 show the main effects of each factor, including means, standard deviations, and the upper and lower limits of the confidence interval. The mean indicates the overall average response time latency at one level of a factor, while the standard deviation indicates the average deviation (variability) of the entire sample from the mean. Small standard deviations indicate better consistency across results. The upper and lower limits of the confidence interval indicate that there is a 95% chance that the true mean lies within the limits. A lack of overlap across confidence intervals means that there is a significant difference between them.
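As a reminder, a 95% confidence interval around a cell mean is conventionally constructed as (standard formula, included here only for reference):

CI_{95\%} = M \pm t_{0.975,\,df} \times \frac{SD}{\sqrt{n}}

where M is the cell mean, SD its standard deviation, n the number of trials in that cell, and t_{0.975, df} the critical value of the t distribution.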

From Table 2 we found that on average for operating systems, macOS (M = 58.6, SD = 14.7) was fastest and comparable to Windows 10 (M = 62.6, SD = 10.3), while Ubuntu 18.04 was slowest (M = 80, SD = 28.8), F(2, 1188) = 592.74, p < .001, partial η2 = .50 (see Table 1). This means that Ubuntu had the highest latency and variability when compared to macOS and Windows 10, which performed similarly. The partial effect size (η2) was notable for operating system because it indicates that 50% of the variance in response time latency was explained by differences in operating system. Essentially, operating system alone contributes to half of the variability in each measure. This means controlling for operating system in an experiment could significantly influence results.

From Table 3 we see that on average for Python version, Python 3 (M = 64.6, SD = 14.4) was significantly faster and more consistent than Python 2 (M = 69.5, SD = 26.8), F(1, 1188) = 83.82, p < .001, partial η2 = .07 (see Table 1). This means that Python 3 is more reliable and outperforms Python 2. However, this relationship only explains 7% of the variability in response latency, which is relatively small.

From Table 4 we see that on average for modality type, visual data (M = 62.4, SD = 10.4) were processed significantly faster and more consistently than audio data (M = 71.7, SD = 28.1), F(1, 1188) = 297.96, p < .001, partial η2 = .20 (see Table 1). This indicates that visual stimuli were processed quicker and more consistently throughout the system. Accordingly, modality accounted for 20% of the variation in response time latency, which is a reasonably sized effect.

The main effects and partial effect sizes for these three factors highlight the large role that operating system and modality play in response latency variability within Asteroid Impact. Python version, by comparison, made a smaller contribution. While these main effects were significant, they were conditional on the broader pattern of interaction effects across all three factors in a configuration.

Moderation Analysis for Two and Three-Way Interactions

Were there any two-way moderation effects across different factors? Two-way interactions were found amongst all factors: operating system was moderated by modality, modality was moderated by Python version, and Python version was moderated by operating system. A two-way interaction indicates that the effect of one factor differs across levels of another factor. Tables 5-7 display means and standard deviations across the two-way interactions. Additionally, they contain marginal means, which represent the average score of one level of a factor across all levels of another factor.

In the two-way interaction between operating system and modality shown in Table 5, macOS testing audio modality trials performed the fastest (M = 45.3, SD = 8.4), while Ubuntu 18.04 testing audio modality trials performed slowest (M = 102, SD = 24.6), F(2, 1188) = 1447.42, p < .001, partial η2 = .71 (see Table 1). Windows 10 performed relatively consistently across both audio (M = 67.6, SD = 6.7) and visual (M = 57.5, SD = 10.8) modalities. This two-way interaction accounted for 71% of the variance in mean response latency, which is a substantial contribution.


In the two-way interaction between Python version and modality shown in Table 6, Python 3 (M = 64.8, SD = 18.4) ran audio modality trials faster and more consistently than Python 2 (M = 78.6, SD = 33.8), F(1, 1188) = 270, p < .001, partial η2 = .19 (see Table 1). There was a negligible difference between Python 2 (M = 60.4, SD = 11.3) and Python 3 (M = 64.4, SD = 9) when running visual modality trials. This two-way interaction accounted for 19% of the variance in mean response time latency.

In the two-way interaction between operating system and Python version shown in Table 7, across both versions of Python (2, 3), macOS (M = 58.9, SD = 14.2; M = 58.4, SD = 15.3) and Windows 10 (M = 62.8, SD = 11.1; M = 62.3, SD = 9.4) performed similarly, F(2, 1188) = 68.4, p < .001, partial η2 = .10 (see Table 1). When Ubuntu 18.04 ran Python 2 (M = 86.9, SD = 37), it performed the slowest and was the most volatile in comparison to when it ran Python 3 (M = 73, SD = 13.8). Altogether, the two-way interaction between operating system and Python version accounted for only 10% of the variance in mean response latency.

What was the significance of the three-way interaction? In general, the three-way interaction across all factors and levels of each configuration means that reaction time latency depends on the individual influences of each level within operating system, Python version, and modality, both on and across one another. Three-way interactions lend themselves to complicated explanations. Simply put, our three-way interaction was significant in our ANOVA model, F(2, 1188) = 199.95, p < .001, η2 = .25 (see Table 1). It accounted for 25% of the variance in mean response latency. Having this significant three-way interaction means that the previous main effects and two-way interactions only occurred in certain cases. That is, Table 8 shows that Python 2 had a noticeably negative effect on Ubuntu 18.04 when it tested audio trials (M = 121.1, SD = 18.4), resulting in the slowest overall performance. By contrast, Table 9 shows that when macOS ran audio trials using Python 3 software (M = 44.6, SD = 8.4) it had the lowest response latency overall, earning it the fastest overall performance.

Commentary

A plot diagram of the three-way interaction with confidence intervals was created to highlight the moderation effects (see Figure 2). The plot’s y-axis corresponds to mean response time latency from 40-120 ms. Any overlap between error bars in the confidence interval indicates that no significant difference can be determined. From Figure 2, it is clear that Python had the largest impact on Ubuntu 18.04. MacOS is clearly the fastest for audio trials (regardless of Python version). Ubuntu 18.04 is most accurate with visual trials and least accurate for audio trials with Python 2. Windows 10 is better for visual trials than macOS (regardless of Python version).

We further characterized the data by using raincloud plots. Raincloud plots were used because they combine statistical effects with raw data (Allen, Poggiali, Whitaker, Marshall, & Kievit, 2019). Raincloud plots show the density of raw data points with an overarching distribution. This gives the viewer better perspective on the ratio of raw data to the shape of the distribution, which makes the results easier to interpret (Allen et al., 2019). In each raincloud plot, the x-axis shows increments of response time latency in milliseconds. Below each probability distribution (the “cloud”) sit the raw data points (the “rain”). By viewing the patterns of raw data points and the peaks and valleys of the distribution it is possible to quickly distinguish and compare the performances of levels across a factor.

What did the distributions of response time latency look like? By analyzing the main effect raincloud plots (Figures 3-5) for each of the three factors we can see the performance of each feature more clearly. Raincloud plots for operating system in Figure 3 highlight the relatively normal distribution of Windows 10 response times. This is an important observation because it shows that the raw data points are more clustered in the center and equally distributed on either end about the mean. The narrow and tall peak for Windows 10 indicates that its response latency will fall within a small margin about the mean, making it easier to predict and control. Conversely, Figure 3 shows that Ubuntu 18.04 has the most non-normal distribution. This non-normality occurs because of significant outliers that may have resulted in the violation of assumptions for the ANOVA model. Additionally, Ubuntu 18.04’s wide distribution shows that its response latencies are more inconsistent compared to macOS or Windows 10. While macOS performed similarly to Windows 10, its raw data points (“rain”) clustered in a more irregular pattern, indicating more variability compared to Windows 10. In terms of balancing low response latency with high consistency, Windows 10 presents the best option for operating system.

The raincloud plots for the main effect of Python in Figure 4 showcase that

Python 3 had a narrower bell-shaped distribution compared to Python 2. Python 2 also had a significant number of high outliers between 100-150 ms that contributed to its uneven distribution. Overall, this presents Python 3 as the more efficient version.


The raincloud plot for the main effect of modality in Figure 5 highlights how visual modality trials were more consistent and predictable than audio. Visual modality trials showed centralized peaks in two areas surrounding the 50 ms time increment, while audio modality response latencies spanned widely from 40 ms to nearly 150 ms. This variability shows that in general visual modality trials perform better than audio trials.

We created two-way interaction raincloud plots in Figures 6-7. Though overlaying multi-level distributions can complicate interpretation, there are several evident takeaways. First, Windows 10 has the most centralized distribution with generally high peaks and bell-shaped distributions, which shows it is the most balanced operating system. That is, even though macOS outperforms Windows 10 for audio modality trials, Windows 10 is more accommodating to differences in trial modality than macOS. Second, Figure 6 highlights how audio modality trials contain most of the variant data points compared to visual modality trials that are more centralized within the 50-100 ms range. Finally, Figure 7 highlights how Python 2 has a distinctly negative effect on

Ubuntu 18.04, whereas Python 3 is more consistent across the three operating systems.

Overall the moderation effects from the two and three-way interactions show a significant influence on response time latencies in Asteroid Impact. This confirms our reasoning that different hardware and software configurations can impact measures from computer-based instruments. Moreover, our results have shown that response time latency does vary significantly across configurations of operating system, Python version, and trial modality, and that a hierarchy of performance does exist among them.


Chapter 5. Discussion

Summary of Principal Findings

We used an Arduino microcontroller to generate a ground truth value of average response time latency for different configurations of Asteroid Impact. Specifically, we measured the latency between the onset of reaction time trials and a registered response from the Arduino across operating system (macOS, Windows 10, Ubuntu 18.04), Python version (Python 2, Python 3), and trial modality (audio, visual). Our results showed that interactions between the levels in each factor had a significant impact on mean response time latency. The moderation effects from these interactions produced several actionable results that are relevant to future researchers. Specifically, if primarily concerned with optimizing performance in Asteroid Impact to test audio data, then macOS is the optimal choice and Ubuntu 18.04 is the worst, regardless of Python version. If primarily concerned with optimizing performance in Asteroid Impact to test visual data, then

Ubuntu 18.04 running Python 2 software is the best. However, Windows 10 is not markedly slower than Ubuntu 18.04 in this respect and runs more consistently (across both Python version and modality), so it is a more balanced choice. Statistically significant differences were found between macOS and Windows 10 processing speeds, but these differences were practically negligible. Thus, if selecting a configuration for overall performance, with no preference for either data type, then Windows 10 offers the best performance and consistency.


Limitations

One major design limitation was testing each operating system on a different laptop computer. While the hardware specifications across the three laptop computers were comparable, the lack of experimental control for this introduced potential variation that undoubtedly contributed sources of error and inconsistency to our measures. In the future, using the same laptop equipment for all operating systems would be recommended. However, this would restrict researchers to using Apple hardware since only it can run all three operating systems. Another limitation was the lack of control regarding noise from background programs running on the laptops. For instance, pop-up blockers, firewalls, anti-virus programs, updates, and other software/firmware running in the background of each laptop computer could have added latency artifact into the measures (Plant, 2016). Lastly, we did not specifically measure differences in display luminance, refresh rate, or sound quality across each laptop, which may have differentially affected the Arduino’s ability to read audio and visual cues.

Value of Present Research

Despite these limitations, we are encouraged to see that Asteroid Impact’s reaction time measures are comparable to other response-latency measuring software. ScriptingRT (Schubert, Murteira, et al., 2013) as well as Matlab and Psychtoolbox (Neath et al., 2011) showed mean response latencies ranging from 18-100 ms with small standard deviations below 10 ms. Several of our configurations produced response latencies within these limits. Moreover, these values are well below the speed at which the human visual system can detect and respond to complex visual perception tasks (Thorpe, Fize, & Marlot, 1996). Together, these results address a gap in the literature regarding the validity of Asteroid Impact as an accurate and precise platform for measuring reaction times.

Summation

Our results demonstrate how experimental platforms such as Asteroid Impact can benefit from external validation. Our analysis of response latencies validates previous research findings derived from Asteroid Impact and establishes credibility for its future use in media neuroscience investigations. Our results suggest that researchers should be meticulous in their selection of experimental tools and equipment as they can become sources of measurement error. Future research on Asteroid Impact should investigate the influence of additional hardware components such as keyboards, computer mice, and headphones that human subjects would be using during an actual experiment. At a micro-level, this investigation showed that the selection of computer operating system, Python software, and data modality in Asteroid Impact does matter. At the macro-level, our results show that researchers should exert more control over their experimental designs by validating the precision and accuracy of their measuring tools.


Chapter 6. Conclusion

External validation of measuring instruments is critical in producing valuable insights and conclusions across scientific disciplines. Unvalidated measures may appear precise, but may not necessarily be accurate (Plant, 2016). Thus, a researcher’s overconfidence in standardized, computer-based measuring instruments can contribute to replication errors across labs and universities. This can inflate research findings within fields of study or result in replication failures that contribute to the file-drawer problem in empirical research (Plant & Quinlan, 2013). Fundamentally, researchers who seek to distinguish acute differences between variables should validate their measuring instruments to ensure that they are both precise and accurate.

Implications for Future Researchers

The usage of untested measuring instruments remains empirically unacceptable given the level of precision and accuracy that is required to effectively contribute to scientific theory. One recommendation for academic journals is to enforce experimental tool validation reports within the results section (Plant, 2016). Researchers should also be aware of other potential sources of error such as anti-virus software, display monitor refresh rates, and firmware updates. While it is impossible to eradicate all sources of error, a significant first step is for researchers to externally validate the precision and accuracy of their measuring instruments (Plant, 2016).


Arduinos as Empirical Tools

We demonstrated the utility of the Arduino microcontroller as a simple and effective tool to validate the accuracy and precision of reaction time measures. We hoped to communicate the value of Arduino technology and encourage researchers to add it to their empirical toolbelts. In line with the ethos of open science, we share the instructions and code required to create our “fake human” on GitHub

(https://anonymous.4open.science/r/8b618f1e-6cf4-4bdd-9e3b-4058a37bfb53/).

From Method to Theory

One final point we make is that method-driven contributions within the field of communication can be as rewarding and useful as theory-driven research. This idea is encapsulated by Greenwald (2012), who wrote, “There is nothing so theoretical as a good method.” Greenwald (2012) inverted the relationship between method and theory by showing that more Nobel prizes were awarded for method-driven contributions in psychology-related findings than for theory-driven ones in the past half century. This suggests that by pursuing better methods we can advance the depth of scientific inquiry and uncover novel research findings, thereby producing valuable theoretical contributions

(Greenwald, 2012). We believe that the present research supports method-driven research by validating the measuring instrumentation within Asteroid Impact, which can then be used as a platform for future research to drive theoretical contributions.

Final Statement

We grounded this investigation within the neurophysiological perspective in media neuroscience, which lays the groundwork for more sophisticated scientific inquiries in communication science. We close with a simple and meaningful message from Weber et al. (2008): By broadening the scope of our understanding we are able to answer the fundamental “why” questions of human behavior.


Bibliography

Allen, M., Poggiali, D., Whitaker, K., Marshall, T. R., & Kievit, R. A. (2019). Raincloud plots: a multi-platform tool for robust data visualization. Wellcome Open Research, 4.

Bassili, J. N., & Fletcher, J. F. (1991). Response-time measurement in survey research a method for CATI and a new look at nonattitudes. Public Opinion Quarterly, 55(3), 331-346.

Ferguson, C. J., Smith, S., Miller-Stratton, H., Fritz, S., & Heinrich, E. (2008). Aggression in the laboratory: Problems with the validity of the modified Taylor Competitive Reaction Time Test as a measure of aggression in media violence studies. Journal of Aggression, Maltreatment & Trauma, 17(1), 118-132.

Fisher, J. T., Huskey, R., Keene, J. R., & Weber, R. (2018). The limited capacity model of motivated mediated message processing: Looking to the future. Annals of the International Communication Association, 42(4), 291-315.

Foschi, R., & Leone, M. (2009). Galileo, measurement of the velocity of light, and the reaction times. Perception, 38(8), 1251-1259.

Girden, E. R. (1992). ANOVA: Repeated Measures (No. 84). Sage.

Glass, G. V., Peckham, P. D., & Sanders, J. R. (1972). Consequences of failure to meet assumptions underlying the fixed effects analyses of variance and covariance. Review of Educational Research, 42(3), 237-288.

Greenwald, A. G. (2012). There is nothing so theoretical as a good method. Perspectives on Psychological Science, 7(2), 99-108.

Harwell, M. R., Rubinstein, E. N., Hayes, W. S., & Olds, C. C. (1992). Summarizing Monte Carlo results in methodological research: The one-and two-factor fixed effects ANOVA cases. Journal of Educational Statistics, 17(4), 315-339.

Huskey, R., Craighead, B., Miller, M. B., & Weber, R. (2018). Does intrinsic reward motivate cognitive control? A naturalistic-fMRI study based on the synchronization theory of flow. Cognitive, Affective, & Behavioral Neuroscience, 18(5), 902-924.

Hong, S., & Len-Riós, M. E. (2015). Does race matter? Implicit and explicit measures of the effect of the PR spokesman's race on evaluations of spokesman source credibility and perceptions of a PR crisis' severity. Journal of Public Relations Research, 27(1), 63-80.

Introduction. (n.d.). Retrieved from https://www.arduino.cc/en/guide/introduction.

Kirsch, I. (1976). The impetus to scientific psychology: A recurrent pattern. Journal of the History of the Behavioral Sciences, 12(2), 120-129.

Lang, A. (2009). The limited capacity model of motivated mediated message processing. The SAGE Handbook of Media Processes and Effects, 193-204.

Lang, A., Bradley, S. D., Park, B., Shin, M., & Chung, Y. (2006). Parsing the resource pie: Using STRTs to measure attention to mediated messages. Media Psychology, 8(4), 369-394.

Lang, A., & Ewoldsen, D. (2009). Beyond effects: Conceptualizing communication as dynamics, complex, nonlinear, and fundamental. In S. Allen (Ed.), Rethinking Communication Keywords in Communication Research (pp. 109–120). New York: Hampton Press.

Lavrakas, P. J., Kennedy, C., de Leeuw, E. D., West, B. T., Holbrook, A. L., & Traugott, M. W. (2019). Probability Survey‐Based Experimentation and the Balancing of Internal and External Validity Concerns. Experimental Methods in Survey Research: Techniques that Combine Random Sampling with Random Assignment, 1-18.

Lix, L. M., Keselman, J. C., & Keselman, H. J. (1996). Consequences of assumption violations revisited: A quantitative review of alternatives to the one-way analysis of variance F test. Review of Educational Research, 66(4), 579-619.

Neath, I., Earle, A., Hallett, D., & Surprenant, A. M. (2011). Response time accuracy in Apple Macintosh computers. Behavior Research Methods, 43, 353-362.

Ottaway, S. A., Hayden, D. C., & Oakes, M. A. (2001). Implicit attitudes and racism: Effects of word familiarity and frequency on the implicit association test. Social Cognition, 19(2), 97-144.


Paisley, W. (1972). Communication research as a behavioral discipline. Palo Alto, CA: Stanford University Institute for Communication Research.

Plant, R. R. (2016). A reminder on millisecond timing accuracy and potential replication failure in computer-based psychology experiments: An open letter. Behavior Research Methods, 48(1), 408-411.

Plant, R. R., & Quinlan, P. T. (2013). Could millisecond timing errors in commonly used equipment be a cause of replication failure in some neuroscience studies? Cognitive, Affective, & Behavioral Neuroscience, 13(3), 598-614.

Robinson, D. K. (2001). Reaction-time experiments in Wundt’s institute and beyond. In Wilhelm Wundt in History (pp. 161-204). Springer, Boston, MA.

Roelofs, A. (2018). One hundred fifty years after Donders: Insights from unpublished data, a replication, and modeling of his reaction times. Acta Psychologica, 191, 228-233.

Schubert, T. W., D’Ausilio, A., & Canto, R. (2013). Using Arduino microcontroller boards to measure response latencies. Behavior Research Methods, 45(4), 1332-1346.

Schubert, T. W., Murteira, C., Collins, E. C., & Lopes, D. (2013). ScriptingRT: A software library for collecting response latencies in online studies of cognition. PLOS One, 8(6), e67769.

Slater, M. D., & Gleason, L. S. (2012). Contributing to theory and knowledge in quantitative communication science. Communication Methods and Measures, 6(4), 215-236.

Sperdin, H. F., Spierer, L., Becker, R., Michel, C. M., & Landis, T. (2015). Submillisecond unmasked subliminal visual stimuli evoke electrical brain responses. Human Brain Mapping, 36(4), 1470-1483.

Thorpe, S., Fize, D., & Marlot, C. (1996). Speed of processing in the human visual system. Nature, 381(6582), 520-522.

Thurgood, C., Whitfield, T. A., & Patterson, J. (2011). Towards a visual recognition threshold: New instrument shows humans identify animals with only 1 ms of visual exposure. Vision Research, 51(17), 1966-1971.

Weber, R., Sherry, J., & Mathiak, K. (2008). The neurophysiological perspective in mass communication research. Biological Dimensions of Communication: Perspectives, Methods, and Research, 41-71.

Appendix A: Figures


[Figure: photograph of the Arduino microcontroller-breadboard circuit, with components labeled A through I as listed in the note below.]

Note. A = Microphone sensor, B = Photoresistor, C = LED, D = Arduino Leonardo, E = Breadboard, F = 10k resistor, G = 220k resistor, H = Jumper wire, I = Micro USB port.

Figure 1 Arduino Microcontroller-Breadboard Circuit
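
For readers unfamiliar with the Arduino platform, the following minimal sketch illustrates the kind of detect-and-respond loop that the circuit in Figure 1 supports: poll the photoresistor and microphone roughly once per millisecond and emit a keypress when a signal threshold is crossed. The pin assignments, threshold values, and response key below are illustrative placeholders, not the values used in this study; the actual firmware is available in the project repository.

// Illustrative sketch only: placeholder pins, thresholds, and response key.
#include <Keyboard.h>

const int PHOTO_PIN = A0;             // photoresistor voltage divider (assumed analog pin)
const int MIC_PIN = A1;               // microphone sensor output (assumed analog pin)
const int LED_PIN = 13;               // onboard LED, lit while a response is issued
const int LUMINANCE_THRESHOLD = 600;  // placeholder ADC threshold (0-1023)
const int AMPLITUDE_THRESHOLD = 600;  // placeholder ADC threshold (0-1023)

bool responded = false;               // respond only once per stimulus onset

void setup() {
  pinMode(LED_PIN, OUTPUT);
  Keyboard.begin();                   // Leonardo registers as a USB keyboard
}

void loop() {
  int luminance = analogRead(PHOTO_PIN);
  int amplitude = analogRead(MIC_PIN);

  // Assumes stimulus onset drives the sensor reading above its threshold.
  if (!responded && (luminance > LUMINANCE_THRESHOLD || amplitude > AMPLITUDE_THRESHOLD)) {
    digitalWrite(LED_PIN, HIGH);
    Keyboard.write(' ');              // issue a single keypress as the "response"
    digitalWrite(LED_PIN, LOW);
    responded = true;
  } else if (responded && luminance <= LUMINANCE_THRESHOLD && amplitude <= AMPLITUDE_THRESHOLD) {
    responded = false;                // re-arm once the stimulus has ended
  }

  delay(1);                           // poll roughly every millisecond
}

The Arduino Leonardo (component D in Figure 1) is a natural choice for this role because its ATmega32u4 has native USB support, allowing the board to act as a standard keyboard and send keypresses to the stimulus computer without additional hardware.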


[Figure: three-panel plot (Mac, Ubuntu, Windows) of mean response time by modality (Audio, Visual), with separate lines for Python 2 and Python 3; y-axis spans approximately 40-120 ms; error bars represent the 95% CI of the mean.]

Note. Y-axis represents mean response time in ms. Error bars indicate the lower and upper limits of the 95% confidence interval for the mean, respectively.

Figure 2 Plot Diagram for Three-Way Interaction


[Figure: raincloud plot of response time distributions by operating system (Mac, Ubuntu, Windows).]

Note. X-axis: Response time in ms. Y-axis: Probability distribution of raw data scores.

Figure 3 Raincloud Plot for Operating System


[Figure: raincloud plot of response time distributions by Python version (Python 2, Python 3).]

Note. X-axis: Response time in ms. Y-axis: Probability distribution of raw data scores.

Figure 4 Raincloud Plot for Python Version


[Figure: raincloud plot of response time distributions by trial modality (Audio, Visual).]

Note. X-axis: Response time in ms. Y-axis: Probability distribution of raw data scores.

Figure 5 Raincloud Plot for Modality


[Figure: raincloud plot of response time distributions for the operating system (Mac, Ubuntu, Windows) by modality (Audio, Visual) interaction.]

Note. X-axis: Response time in ms. Y-axis: Probability distribution of raw data scores.

Figure 6 Raincloud Plot for Operating System and Modality Interaction


[Figure: raincloud plot of response time distributions for the operating system (Mac, Ubuntu, Windows) by Python version (Python 2, Python 3) interaction.]

Note. X-axis: Response time (RT) in ms. Y-axis: Probability distribution of raw data.

Figure 7 Raincloud Plot for Operating System and Python Interaction


Appendix B: Tables


Predictor                Sum of Squares   df     Mean Square   F          p      partial η²   95% CI [LL, UL]
(Intercept)              5394172.52       1      5394172.52    61965.79   .000
Modality                 25937.70         1      25937.70      297.96     .000   .20          [.16, .24]
OS                       103197.78        2      51598.89      592.74     .000   .50          [.46, .53]
Python                   7296.40          1      7296.40       83.82      .000   .07          [.04, .09]
Modality x OS            251998.50        2      125999.25     1447.42    .000   .71          [.68, .73]
Modality x Python        23505.60         1      23505.60      270.02     .000   .19          [.15, .22]
OS x Python              11903.79         2      5951.90       68.37      .000   .10          [.07, .14]
Modality x OS x Python   34812.33         2      17406.17      199.95     .000   .25          [.21, .29]
Error                    103416.37        1188   87.05

Note. OS = operating system, RT = response time, Python = Python version. LL and UL indicate the lower and upper limits of the 95% confidence interval for partial η²; N = 1200.

Table 1 Fixed-Effects ANOVA results using RT as the criterion
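
As an arithmetic check on Table 1, the partial η² and F statistic reported for the three-way interaction can be recovered from the standard fixed-effects definitions, using the sums of squares and degrees of freedom in the table:

\[
\eta_p^2 = \frac{SS_{\text{effect}}}{SS_{\text{effect}} + SS_{\text{error}}} = \frac{34812.33}{34812.33 + 103416.37} \approx .25
\]

\[
F = \frac{SS_{\text{effect}}/df_{\text{effect}}}{SS_{\text{error}}/df_{\text{error}}} = \frac{34812.33/2}{103416.37/1188} = \frac{17406.17}{87.05} \approx 199.95
\]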


OS        M       M 95% CI [LL, UL]   SD
Mac       58.61   [57.16, 60.06]      14.74
Ubuntu    79.96   [77.13, 82.79]      28.76
Windows   62.57   [61.56, 63.58]      10.28

Note. M and SD represent mean and standard deviation; N = 1200.

Table 2 Descriptive Statistics for RT as a function of Operating System

Python version   M       M 95% CI [LL, UL]   SD
Python 2         69.51   [67.36, 71.66]      26.79
Python 3         64.58   [63.42, 65.74]      14.45

Note. M and SD represent mean and standard deviation; N = 1200.

Table 3 Descriptive Statistics for RT as a function of Python version

Modality   M       M 95% CI [LL, UL]   SD
Audio      71.69   [69.45, 73.94]      28.06
Visual     62.40   [61.56, 63.23]      10.37

Note. M and SD represent mean and standard deviation; N = 1200.

Table 4 Descriptive Statistics for RT as a function of Modality


                      Modality
           Audio            Visual           Marginal
OS         M        SD      M       SD       M       SD
Mac        45.32    8.37    71.90   3.27     58.61   14.74
Ubuntu     102.16   24.62   57.76   7.82     79.96   28.76
Windows    67.61    6.71    57.53   10.76    62.57   10.28
Marginal   71.69    28.06   62.40   10.37

Note. M and SD represent mean and standard deviation; N = 1200.

Table 5 Means and SD for RT as a function of a 3(OS) X 2(Modality) design

                     Modality
           Audio            Visual           Marginal
Python     M       SD       M       SD       M       SD
Python 2   78.59   33.84    60.44   11.25    69.51   26.79
Python 3   64.80   18.35    64.36   9.01     64.58   14.45
Marginal   71.69   28.06    62.40   10.37

Note. M and SD represent mean and standard deviation; N = 1200.

Table 6 Means and SD for RT as a function of a 2(Python) X 2(Modality) design


                     OS
           Mac              Ubuntu           Windows          Marginal
Python     M       SD       M       SD       M       SD       M       SD
Python 2   58.85   14.18    86.88   37.05    62.80   11.11    69.51   26.79
Python 3   58.37   15.32    73.04   13.76    62.34   9.40     64.58   14.45
Marginal   58.61   14.74    79.96   28.76    62.57   10.28

Note. M and SD represent mean and standard deviation; N = 1200.

Table 7 Means and SD for RT as a function of a 2(Python) X 3(OS) design

                     Python
           Python 2         Python 3         Marginal
Modality   M        SD      M       SD       M        SD
Audio      121.12   18.37   83.20   12.46    102.16   24.62
Visual     52.64    7.30    62.88   4.10     57.76    7.82
Marginal   86.88    37.05   73.04   13.76

Note. Ubuntu OS data, n = 400. M and SD represent mean and standard deviation.

Table 8 Means and SD for RT within a 2(Modality) X 2(Python) design for Ubuntu OS


                     Python
           Python 2         Python 3         Marginal
Modality   M       SD       M       SD       M       SD
Audio      46.00   8.30     44.64   8.43     45.32   8.37
Visual     71.71   1.24     72.09   4.47     71.90   3.27
Marginal   58.85   14.18    58.37   15.32

Note. macOS data, n = 400. M and SD represent mean and standard deviation.

Table 9 Means and SD for RT within a 2(Modality) X 2(Python) design for macOS

                     Python
           Python 2         Python 3         Marginal
Modality   M       SD       M       SD       M       SD
Audio      68.64   7.30     66.57   5.92     67.61   6.71
Visual     56.96   11.21    58.10   10.32    57.53   10.76
Marginal   62.80   11.11    62.34   9.40

Note. Windows OS data, n = 400. M and SD represent mean and standard deviation.

Table 10 Means and SD for RT within a 2(Modality) X 2(Python) design for Windows
