
Crowdsourcing using brain signals

Keith Michael Davis III

Helsinki, April 3, 2020

UNIVERSITY OF HELSINKI
Department of Computer Science

Faculty: Faculty of Science
Study Programme: Computer Science
Author: Keith Michael Davis III
Title: Crowdsourcing using brain signals
Supervisor: Tuukka Ruotsalo
Level: Master's thesis
Date: April 3, 2020
Pages: 68 pages

Abstract

We study the use of data collected via electroencephalography (EEG) to classify stimuli presented to subjects using a variety of mathematical approaches. We report an experiment with three objectives: 1) to train individual classifiers that reliably infer the class labels of visual stimuli using EEG data collected from subjects; 2) to demonstrate brainsourcing, a technique that combines brain responses from a group of human contributors, each performing a recognition task, to determine the classes of stimuli; 3) to explore collaborative filtering techniques applied to data produced by individual classifiers to predict subject responses for stimuli for which data are unavailable or otherwise missing. We show that all individual classifier models perform better than a random baseline, while a brainsourcing model using data from as few as four participants achieves performance superior to any individual classifier. We also show that matrix factorization applied to classifier outputs as a collaborative filtering approach achieves predictive results that are better than random. Although the technique is fairly sensitive to the sparsity of the dataset, it nonetheless demonstrates a viable proof-of-concept and warrants further investigation.

ACM Computing Classification System (CCS):
Applied computing → Life and medical sciences
Human-centered computing → Human computer interaction (HCI)
Human-centered computing → Brain-computer interfacing (BCI)

Keywords: crowdsourcing, EEG, BCI, collaborative filtering

Additional information: Thesis for the Algorithms, Data Analytics and Machine Learning subprogramme

Contents

1 Introduction

2 Background
  2.1 Crowdsourcing
  2.2 Electroencephalography
    2.2.1 EEG Equipment
    2.2.2 Properties of EEG Data
    2.2.3 Artifacts
  2.3 Event-related Potentials
    2.3.1 P300 Component
    2.3.2 N170 Component
  2.4 Brain-computer Interfaces
    2.4.1 Invasive and Non-invasive BCIs
    2.4.2 Active, Reactive, and Passive BCIs
    2.4.3 Assistive and Non-Assistive BCIs
  2.5 Machine Learning for BCIs
    2.5.1 Linear Discriminant Analysis
    2.5.2 Classification of EEG Data Using LDA
    2.5.3 Comparing EEG Data Between Subjects and Trials
  2.6 EEG Data Preprocessing
    2.6.1 Filtering
    2.6.2 Baseline Correction
    2.6.3 Artifact Rejection and Correction
  2.7 Collaborative Filtering
    2.7.1 Matrix Factorization for Collaborative Filtering

3 Research Design
  3.1 Research Questions
  3.2 Experimental Setup
    3.2.1 Participants
    3.2.2 Stimuli
    3.2.3 Apparatus
    3.2.4 Tasks
    3.2.5 EEG Preprocessing
    3.2.6 Performance Measures
  3.3 Experiments
    3.3.1 Single-Trial ERP Classification Experiment
    3.3.2 Brainsourcing Experiment
    3.3.3 Collaborative Filtering Experiment

4 Results
  4.1 Neurophysiological Results
  4.2 Per-Participant Classification Results
  4.3 Brainsourcing Results
    4.3.1 Characteristics of Brainsourcing Output
    4.3.2 Analysis of Sparsity and Crowd Size
  4.4 Matrix Factorization Results

5 Discussion
  5.1 Summary of Contributions
  5.2 Limitations
  5.3 Future Work
  5.4 Social and Ethical Concerns
    5.4.1 BCI and Personal Privacy
    5.4.2 Consequences of Ubiquitous BCI

6 Conclusion

References

1 Introduction

Interconnection between people is at greater levels than at any time before in history. From 2009 to 2019, the percentage of the world population connected to the internet increased from approximately 26% to 54% [1]. This interconnection has produced, among many other things, the possibility for individuals all over the world to collaborate to solve a wide variety of problems, unbounded by time or space. Crowdsourcing, a broad field centered around the completion of distributed tasks by individuals, has experienced significant growth as a method to generate new data and knowledge artifacts within academia and for-profit companies [2]. Furthermore, crowdsourcing can be used to produce collaborative artifacts that would otherwise be difficult or time-consuming for a single person to produce on their own, such as collaboratively written articles.

The most common crowdsourcing paradigms in use today rely upon information that is explicitly provided by many individuals, or upon the collection of implicit data generated by many individuals while they complete other tasks. Explicit data may include artifacts generated in direct response to a specific task, such as translated sentences [3], digital transcriptions of historical documents [4], opinion mining through review websites like Yelp and TripAdvisor [5], and recordings of traffic and police activity [6]. Explicit data are consciously produced by the user. Implicit data, by contrast, are typically generated as a natural consequence of some other task completed by the user. Examples of implicit data include clickthrough rates and time spent on a page [7], scrolling behavior [8], as well as physiological feedback such as facial expressions [9], eye movements [10], and even skin conductance and muscular activity in the brow [11].

While used in a wide variety of contexts with reasonable rates of success, both implicit and explicit crowdsourcing approaches have their limitations. Explicit crowdsourcing tasks require conscious effort from individual participants, who are subject to fatigue over prolonged periods of time; participants may be tempted to cut corners when a task runs too long, reducing the quality of the collected data [12]. Implicit crowdsourcing tasks require careful design and strong validation of assumptions about what the implicit data they produce may suggest; an improperly designed task that relies upon incorrect assumptions may lead to the wrong conclusions being drawn from the data. Despite these limitations, explicit and implicit data can be quickly and cheaply collected at scale over the internet through platforms such as Amazon Mechanical Turk, where crowdsourcing tasks can be distributed to groups of crowdworkers. The relative affordability and flexibility offered by crowdsourcing makes it an appealing method for approaching a wide variety of problems.

Recent advances in the computational power of typical personal computers, combined with improvements in lightweight equipment capable of recording neurophysiological data in real-time, present an emergent opportunity to use machine learning and real-time neurophysiological data to enable interaction with devices and applications using brain activity.

Brain-computer interfaces (BCIs) are systems which allow an individual to control a device or application using their own brain activity. While most typical BCIs make use of electroencephalography (EEG), BCIs can function using a variety of sensors and physiological data sources. BCIs offer alternative methods of user interaction with complex systems, as well as providing a means of interaction for those with impaired or otherwise limited physical movement that prevents the full use of a mouse, keyboard, or touchscreen. Because they rely on brain signal information rather than just the physical gestures of a given user, sufficiently advanced BCIs could capture nuances and intentions of the user that are otherwise lost or compressed when user intentions are "translated" into physical inputs. Physical inputs, while precise and easy to measure, may not capture the full intentions or interests of the user. Beyond merely replacing or augmenting keyboard- and mouse-driven interaction with applications, neurophysiological data captured in real-time shows promising applications within the information retrieval community. Moshfeghi et al. (2013) showed, using functional magnetic resonance imaging (fMRI), that exposure to relevant content produces neurological activity distinguishable from the activity produced by non-relevant content [13]. Furthermore, Eugster et al. (2014, 2016) demonstrated that EEG data can be used to determine the relevance of textual content with respect to a topic of interest [14, 15].

Brainsourcing, presented in this thesis, is an approach that allows for crowdsourcing of information through the use of brain signals collected from many individuals. While Blankertz et al. (2011) and others have demonstrated that information can be inferred from brain signals collected via EEG [16], the performance of systems using such information varies widely between individuals, ranging from 50-80% classification accuracy per individual, compared to average individual accuracies of around 95% for explicit classification tasks [17]. This significant performance gap leaves brain-based opinion inference noncompetitive with explicit opinion mining. In this thesis, we hypothesize that brainsourcing can alleviate this gap by applying a crowdsourcing paradigm to brain-based predictions. Brainsourcing combines individual predictions inferred from multiple users into a single, more accurate prediction. We propose that such group-based classification should achieve better classification results than any model trained on a single individual. Additionally, matrix factorization techniques are applied to EEG-based predictions to motivate further exploration of brain-based content recommendation systems.

This thesis investigates how EEG data collected from individuals can be used to train classifiers that infer the relevance of stimuli, how the results of such classifiers can be combined to create significantly more accurate classifiers, and how raw EEG data may be used in collaborative filtering techniques to predict EEG patterns. The following hypotheses are tested:

H1: Stimuli classes can be predicted through differences in EEG signals collected from individuals as they are visually presented with the stimuli.

H2: Stimuli class predictions derived from the brain signals of multiple individuals can be combined to produce more accurate predictions.

H3: Matrix factorization techniques applied to predictions derived from the brain signals of multiple individuals can produce accurate predictions for stimuli for which brain signal observations are not available.
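H2 describes an aggregation step: noisy per-participant predictions are pooled into a single group prediction. Below is a minimal sketch of one natural pooling rule, majority voting over binary labels; the prediction matrix is invented for illustration, and the aggregation procedure actually used in the experiments is described with the brainsourcing experiment in Section 3.3.2.

```python
import numpy as np

# Hypothetical per-participant binary predictions for six stimuli
# (rows = participants, columns = stimuli); all values are invented.
predictions = np.array([
    [1, 0, 1, 1, 0, 0],
    [1, 0, 0, 1, 0, 1],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
])

# Majority vote across participants: a stimulus is labeled 1 when more
# than half of the individual classifiers predict 1.
votes = predictions.mean(axis=0)
group_prediction = (votes > 0.5).astype(int)
print(group_prediction)  # -> [1 0 1 1 0 0]
```

Even when each individual classifier is only modestly better than chance, such pooling tends to improve accuracy as the crowd grows, provided the individual errors are not perfectly correlated.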

Previous research has consistently demonstrated that EEG captures salient features that correspond to underlying neurological activity, with applications beyond the medical domain. When using EEG data to classify perceived stimuli in a binary class setting, accuracy typically ranges between 60% and 80% per individual. While doing this through brain signals alone may be impressive, it has previously been achieved only using stimuli of distinct categories, such as an image dataset consisting of animals and non-animals. Using brain signals to distinguish between images of two separate classes within the same meta-category (such as identifying hair color in a dataset of faces) has not been demonstrated before.

The benefit of conducting classification using brain signals derives from its speed: subjects can typically be presented with a stimulus every 500 ms. Additionally, presentation and classification of the stimuli can be achieved with relatively little effort from participants: they are instructed simply to observe a computer screen with a particular target in mind ("look for faces that are smiling") as stimuli are rapidly presented to them.

Crowdsourced data annotations via Amazon Mechanical Turk achieve accuracies from 30% to 80% in various contexts [18, 19]; however, such tasks are usually quite complex, e.g. sentiment analysis of a body of text. Sentiment analysis is several orders of magnitude more complex than simply distinguishing between blond hair and dark hair. A better comparison is crowdsourced annotation of visual features, such as bounding boxes for certain objects. The labels within the YouTube-BoundingBoxes dataset are estimated to have an accuracy of approximately 95% [20]. As this thesis focuses on crowdsourcing labels of visual data, this is a suitable performance benchmark.

In addition to introducing a new crowdsourcing paradigm, matrix factorization successfully applied to brainsourcing results would suggest that there are future opportunities in creating brain-based content recommendation systems. While such systems would likely require widespread use of BCIs with highly accurate sensors, they are worth exploring in an experimental context.

F1 classification scores exceeding 0.90 were achieved by a brainsourcing model using data from 8 participants, with asymptotic improvements to these results as data from additional participants were added. A matrix factorization approach achieved a mean F1 classification score of 0.67 across all tasks at 70% sparsity, with significant reductions in performance as the level of sparsity increased. Our results suggest a significant addition to existing crowdsourcing techniques: mining information directly from the brains of individuals via changes in neurological electrical activity, and combining predictions from many individual users to produce a significantly more accurate classification model.

Brainsourcing could allow the crowdsourcing community to utilize neurological data containing novel features and latent information that otherwise cannot be captured using conventional methods.

This thesis is organized as follows: Section 2 provides a detailed overview of the relevant background literature. Section 3 provides an overview of the documented experiments, including research questions, study design, and general methodologies. Section 4 answers the research questions and provides details of the experimental findings. In Section 5 we discuss the findings with the express purpose of understanding them within a greater context. Finally, in Section 6 we provide closing remarks.

2 Background

This section provides an overview of the relevant scientific literature. Section 2.1 defines and describes crowdsourcing and its various forms. Section 2.2 defines electroencephalography and provides a brief overview of its history, typical equipment configurations, data properties, and common sources of noise. Section 2.3 explains in detail event-related potentials and their properties as they relate to this thesis. Section 2.4 describes the history and framework of brain-computer interfaces. Section 2.5 provides an overview of the relationship between BCIs and machine learning methods, and explains in detail how these methods may be applied to classify EEG data. Section 2.6 discusses the preprocessing and cleaning of EEG data, and Section 2.7 provides a brief overview of collaborative filtering and matrix factorization.

2.1 Crowdsourcing

Crowdsourcing refers to a distributed problem-solving and production model in which human workers, unbounded by time or location, perform designated tasks over networked computing systems [21, 22]. While these tasks are difficult to automate, they can nonetheless be solved jointly by humans and computer systems by dividing the work among many individuals [23, 24]. Within a crowdsourcing paradigm, a large and complex problem is split into simpler components, which can then be solved by a variety of contributors. Crowdsourcing can provide complete or partial solutions to many non-trivial problems, ranging from purely routine and cognitively simple tasks such as text transcription via reCAPTCHA to complicated and creative tasks such as researching purchase decisions and collaborative writing projects [25, 26, 27]. Independent of an actual task's nature or purpose, a generic crowdsourcing task must be divisible into lower-level tasks, such that each lower-level task can be completed by individual members of the crowd.

Crowdsourcing tasks can be divided into two primary categories: macrowork and microwork [28]. Macrowork often requires specialized skills and can take much longer to complete (such as open source development of software).

Microwork typically involves performing small tasks that are difficult for computers but otherwise require no specialized skills. Simple examples of microwork tasks include labeling or transcribing sections of videos [29], defining regions of images or videos that are likely to draw the visual attention of an observer [30], and recognizing features within an image that reliably facilitate a fine-grained categorization goal [31].

Microwork can be further divided into implicit and explicit tasks [32]. Explicit tasks can be defined as tasks in which the artifacts produced by the crowdworkers are the artifacts of interest, while implicit tasks are tasks in which the outputs may be outside the conscious awareness of the workers. Implicit crowdsourcing is distinguished from explicit crowdsourcing in that it does not involve users directly contributing to a problem's solution. Instead, it involves users completing a task that provides the system with information for another task, based on the user's actions and activities recorded while completing the implicit task. Implicit crowdsourcing allows for the inference of useful information from many users simply by observing how they interact with a given system. Rather than instructing users to directly provide the information of interest, they are instead asked to perform some task while their behavior is observed and data are collected. These data are then processed and analyzed for insights and novel information.

Figure 1: Left: Screen capture of the ESP game, where users try to describe an image using the same words as their team member. Right: A version of Google’s reCAPTCHA service that has users prove they are human by selecting images that contain a specific feature (in this case, swimming pools).

Implicit crowdsourcing can take a wide variety of forms. One example is the ESP game, where two users collaborate to provide single-word descriptions of images while being restricted by a list of "taboo" words. Words that both users suggest are then used as labels to tag Google images [33].

Another example is reCAPTCHA, which may present text scanned from old books that could not be digitally transcribed by computer and have users complete recognition tasks involving known and unknown texts in order to prove they are human. The results are then used to digitize the scanned books [34]. Another form of Google's reCAPTCHA service has users select images that contain a certain feature, such as road markings or automobiles, to prove that they are human, while the results are used to featurize image datasets so they can be used to train machine vision models.

Implicit crowdsourcing has also been used to estimate preferences in recommender systems [35], to gather relevance assessments from search engine usage [36, 37], and even to understand behavioral patterns from different types of search behavior [38]. Such approaches typically require users to complete a small research or fact-finding task while their cursor movements and keystrokes are recorded. These data are then analyzed to discover novel latent behaviors and features.

Implicit crowdsourcing is not limited to constructing datasets for purposes of building machine learning models or designing better search engines. Foldit, a puzzle game where players fold proteins, has facilitated the discovery of previously unknown native structures for enzymes [39] and is credited with the first crowdsourced redesign of a protein catalyst commonly used in synthetic chemistry [40].

Figure 2: Left: Screen capture of Foldit, a physics-based puzzle game where users fold proteins. High-scoring solutions are investigated by scientists to determine whether novel native structures can be applied to real-world proteins. Right: From Athukorala et al. (2016), browser behavior is recorded while a user completes different search tasks. Such data can be used to identify key differences in behavioral patterns with respect to the search task or to optimize search engine results [38].

2.2 Electroencephalography

Electroencephalography (EEG) is a technique used to measure voltage fluctuations produced by neurons in the brain.

EEG first appeared in the late 19th century: in 1875, English physician Richard Caton published his findings on electrical phenomena occurring in rabbit and monkey cortices [41]. In 1890, Polish physiologist Adolf Beck published his own experimental findings, in which he placed electrodes directly on the surface of rabbit and dog brains and concluded that the changing electrical activity suggested the existence of brain waves [42]. In 1924, German physiologist and psychiatrist Hans Berger became the first to record the electrical activity of the human brain and is credited as the inventor of the electroencephalogram, the record of activity as recorded via EEG [41]. Berger's work was distinguished from previous work like Caton's and Beck's in that he measured the electrical activity through the scalp, rather than through the previously used invasive methods that required surgical application of the electrodes. As with Berger's work, modern EEG is typically conducted non-invasively via electrodes placed on the surface of the scalp.

EEG was refined throughout the 1930s by Edgar Adrian and Bryan Matthews, who studied abnormalities in Berger rhythms, while Gibbs, Davis, and Lennox demonstrated that EEG could be used to detect epilepsy [43]. In the 1960s, William Grey Walter discovered the contingent negative variation (CNV) effect, a negative spike of electrical activity which appears in the brain approximately half a second before a person becomes consciously aware of an intention to move. Since then, and with the aid of improved analysis tools made possible by significant advancements in computer systems, EEG has been used clinically to study differing brain patterns in those suffering from sleep disorders [44], schizophrenia [45], and autism [46, 47], among many other neurological conditions. In addition, EEG has been used extensively to study human cognition, ranging from altered states of consciousness induced by meditation [48] and psychoactive chemicals [49, 50, 51], to problem solving [52], language [53, 54], and human emotions [55, 56].

Figure 3: Left: The EEG montage containing approximate scalp locations of electrodes used throughout the neurophysiological experiment presented in this thesis. Middle: Approximate brain regions relevant to the neurophysiological experiment (superior view). Right: Approximate brain regions relevant to the neurophysiological experiment (lateral view).

2.2.1 EEG Equipment

Compared to other methods for measuring neurological activity, EEG equipment is relatively affordable and offers high temporal resolution and mobility. Magnetoencephalography (MEG), another technique that measures electrical activity of the brain, requires shielding from external magnetic signals (such as Earth's magnetic field) and requires the subject to remain seated. MEG equipment, which can cost upwards of 2 million USD, is typically housed in a special magnetically sealed room, further limiting its mobility. EEG equipment, by contrast, is quite mobile, does not require a special room or facility, and the effects of most electrical interference can be mitigated through basic preprocessing techniques applied to the collected data. Functional near-infrared spectroscopy (fNIRS), another technique for measuring neuronal activity, offers similar mobility and spatial resolution; however, it has lower temporal resolution than EEG. Due to EEG's cost-effectiveness, mobility, and long history within the field of neuroscience, EEG has been used to produce an extensive body of scientific literature.

Modern EEG equipment consists of a cap covering most of the head that contains electrodes which measure voltage fluctuations across the scalp. Not all EEG configurations are the same: some experimenters may rely upon the use of 32 or more electrode sites, while others use 10 or fewer. While the number of electrodes can vary across EEG configurations, the electrodes have standardized locations and labels [57]. Figure 3 (left) displays the locations and labels of the 32 electrodes used in the experiments described within this thesis. Electrode labels are abbreviated in a manner such that locating an individual electrode is a matter of recognizing the brain area over which it is placed. The regions are denoted as follows: Pre-frontal (Fp), Frontal (F), Central (C), Temporal (T), Parietal (P), and Occipital (O). Electrodes that are placed between these regions have labels combining the associated regions: Frontal/Central (FC), Frontal/Temporal (FT), Central/Parietal (CP), and Temporal/Parietal (TP). Additional sites include the external occipital protuberance or inion (I), located at the back of the skull, and the nasion (N), located at the depression in the skull above the bridge of the nose and between the eyes. The numbers associated with each label indicate an electrode's lateral location: even numbers are used for electrodes placed on the right hemisphere, while odd numbers are used for electrodes placed on the left hemisphere. Electrodes placed at the lateral center use the letter 'z' instead of a number.

While it is common to refer to measured scalp voltages by a single electrode site, standard EEG equipment utilizes a differential amplifier, which records the difference between two or more inputs before amplifying the associated signal. Therefore, the measured voltages from a single electrode site are always a difference relative to voltages measured elsewhere. These differences are typically calculated using one of two methods: common reference derivation and average reference derivation. Common reference derivation calculates the difference between a reference site (often placed near the ears) and the electrode in question. Average reference derivation calculates the difference between a single electrode and the average voltage of all other electrodes on the scalp.

When using common reference derivation, reference sites must be carefully selected to avoid introducing a polarity bias or additional noise into the recordings. For example, reference sites along the jaw can be problematic, as they will pick up electrical activity related to talking and regular clenching behavior, while reference sites placed on one side of the head will most likely produce an artificial asymmetry between hemispheres [58].
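To make the two derivations concrete, the sketch below re-references a synthetic multichannel recording both ways. The array shape and the use of channel 0 as the common reference are illustrative assumptions, not the montage used in this thesis.

```python
import numpy as np

rng = np.random.default_rng(0)
eeg = rng.normal(size=(32, 10_000))  # 32 channels x 10,000 samples (synthetic)

# Common reference derivation: subtract one designated reference channel
# (channel 0 here stands in for an electrode placed near the ears).
common_ref = eeg - eeg[0]

# Average reference derivation: subtract the mean voltage of all
# electrodes from every channel at each time point.
avg_ref = eeg - eeg.mean(axis=0)
```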

2.2.2 Properties of EEG Data

EEG data are continuous recordings of voltages from multiple electrodes placed on the surface of the scalp. Thus, EEG data can be described using both spatial and temporal features. The spatial resolution of EEG data is rather poor when compared with other brain imaging techniques, such as functional magnetic resonance imaging (fMRI). Despite electrodes being placed in standardized locations associated with particular brain regions, an EEG phenomenon measured at a given electrode site is not necessarily a consequence of neurological activity occurring beneath the electrode. The signals recorded from an electrode are the result of many electric potentials generated by millions of neurons; using EEG to attribute any recorded event to a particular group of neurons is not feasible. Additionally, the conductivity of the brain and skull is heterogeneous in nature. As the electrical potentials produced by neurons propagate throughout the brain and reach the scalp, their properties are subject to interference and change. Thus, special mathematical techniques must be used to estimate the localization of these events [59].

Neural Oscillations EEG's high temporal resolution makes it suitable for detecting a variety of neural oscillations and other phenomena that can be used to characterize different mental states. Neural oscillations are grouped by frequency band. Delta waves, the slowest detected neural oscillations, are associated with deep sleep and occur between 0.5 and 4 Hz [60]. Theta waves (4-8 Hz) are not yet well understood, although they have been linked to arousal [61], location within an environment [62], as well as learning and memory [63]. Alpha waves (8-12 Hz) are associated with relaxation and are inversely associated with focused attention and sleep quality; in other words, the better rested and the more engaged a subject is when completing a particular task, the lower the amplitudes of their alpha waves. Alpha waves can serve both as a source of noise and as a component of other neurological phenomena. The term alpha waves typically refers to patterns that originate from the visual cortex; mu waves, the motor counterpart to alpha waves, are found over the motor cortex [58]. Beta waves (12.5-30 Hz) are associated with normal waking consciousness and alertness [64]. The function of gamma waves, the fastest oscillations, occurring between 25 and 140 Hz, is disputed [65]; however, they have been linked to attention, object recognition [66], and working memory [67].

Abnormalities in gamma waves are associated with a variety of cognitive disorders, such as schizophrenia [68], Alzheimer's disease [69], as well as bipolar [70] and major depressive [71] disorders.
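These band definitions translate directly into spectral measurements. As a minimal sketch, the snippet below estimates the power in each band for a synthetic single-channel signal using Welch's method; the sampling rate and the signal itself are assumptions made for illustration.

```python
import numpy as np
from scipy.signal import welch
from scipy.integrate import trapezoid

fs = 500                            # assumed sampling rate in Hz
rng = np.random.default_rng(1)
channel = rng.normal(size=fs * 60)  # one minute of a synthetic channel

# Frequency bands as described above (Hz).
bands = {"delta": (0.5, 4), "theta": (4, 8), "alpha": (8, 12),
         "beta": (12.5, 30), "gamma": (25, 140)}

# Welch's method: average periodograms over overlapping 2 s segments.
freqs, psd = welch(channel, fs=fs, nperseg=fs * 2)
for name, (lo, hi) in bands.items():
    mask = (freqs >= lo) & (freqs < hi)
    band_power = trapezoid(psd[mask], freqs[mask])  # integrate PSD over band
    print(f"{name}: {band_power:.6f}")
```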

2.2.3 Artifacts

This section introduces the common noise sources typically found within EEG recordings. While EEG is designed to detect electrical phenomena produced by neurons within the brain, all non-invasive approaches use electrodes placed on the scalp to measure voltage fluctuations. Electrical activity at the surface of the scalp is produced by a wide array of physiological phenomena; neural activity constitutes only a small fraction of the detectable fluctuations in scalp voltage. Thus, EEG recordings also capture a significant amount of noise and activity unrelated to neural processes. These sources of noise are known as artifacts and, if not properly accounted for, can prevent proper analysis and even lead to incorrect conclusions being drawn from the data. In order to maximize the signal-to-noise ratio within EEG recordings, which is necessary to produce reliable and reproducible findings, it is critical to understand potential noise sources and how they can be isolated from EEG signals.

Physiological Artifacts Physiological artifacts are produced by a variety of biological sources. Muscular activity, such as jaw clenching or head and hand movements, generates electric potentials as the associated muscles contract. Eye movements and blinking are the most common sources of physiological artifacts. A stationary eye forms a dipole, which produces a voltage gradient across the scalp. As the eye moves, this gradient shifts, with more positive voltages appearing on the side of the scalp toward which the eyes are pointed [72]. Similarly, eye blinks cause vertical movements, which produce artifacts most noticeable at electrodes placed toward the front of the scalp. During an EEG experiment, it is common to record eye movements with electrooculography (EOG) in order to aid with data cleaning. It is also possible to reduce these physiological sources of noise by simply instructing subjects to avoid unnecessary arm, hand, and mouth movements, and to reduce unnecessary blinking. When feasible, designing experiments that do not require physical inputs from the user, or that keep a subject's gaze fixed on a single point, can also reduce these sources of noise.

There are also sources of physiological noise that are completely beyond the control of a subject. The salt in human sweat changes the conductivity of the scalp and associated reference sites, causing gradual changes in voltage measurements as subjects perspire. Respiration is another source of noise, as it causes cyclical movements of the head and chest, which may produce slow waves in the EEG recordings. A subject's pulse may also produce regular artifacts if an electrode is placed near an artery or blood vessel [73].

Non-physiological Artifacts In addition to physiological sources, noise can be produced by a variety of external and experimental conditions.

Line noise from electrical infrastructure and equipment (such as computers within the experimentation room) can generate small amounts of constant noise (typically within the 50-60 Hz range) that is detected by the EEG equipment. Artifacts may also be generated by an EEG cap that is not properly secured to the subject's head. Electrodes must remain still throughout data collection; any movement can cause the measured voltages to fluctuate significantly. Furthermore, electrodes that are not placed close enough to the scalp may not record any voltage changes at all. Artifacts generated by improperly secured or applied electrodes are easy to identify, as they are usually only present in certain channels. Depending on the extent and severity of the noise, these types of artifacts can be difficult to correct for, and addressing them can involve simply dropping the affected channels from the analysis.
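Line noise at a fixed mains frequency is one of the easier artifacts to suppress, typically with a narrow notch filter. The sketch below applies a zero-phase 50 Hz notch to a synthetic channel; the sampling rate, quality factor, and injected interference are illustrative assumptions rather than the preprocessing used in this thesis.

```python
import numpy as np
from scipy.signal import iirnotch, filtfilt

fs = 500            # assumed sampling rate in Hz
line_freq = 50.0    # 50 Hz mains; use 60.0 where applicable
quality = 30.0      # quality factor: higher values give a narrower notch

# Design a second-order IIR notch filter centered on the mains frequency.
b, a = iirnotch(w0=line_freq, Q=quality, fs=fs)

rng = np.random.default_rng(2)
t = np.arange(fs * 10) / fs
# Synthetic channel: broadband noise plus injected 50 Hz line interference.
channel = rng.normal(size=t.size) + 2.0 * np.sin(2 * np.pi * line_freq * t)

# filtfilt applies the filter forward and backward, avoiding phase distortion.
cleaned = filtfilt(b, a, channel)
```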

2.3 Event-related Potentials

Event-related potentials (ERPs) are brain responses to a specific stimulus, cognitive event, or motor event [58]. ERPs depict the changes in recorded scalp voltages correlated with some experimental event (e.g. a face with blond hair presented to the subject). In a typical experiment utilizing ERPs, the test subject is presented with some stimuli, such that the introduction of the stimuli can be easily linked to a point in time within the collected electroencephalogram. Stimuli may vary in their sensory modalities; in other words, if it can be perceived by a human sensory system, it can be used as a stimulus. While the stimuli used in the experiment presented here fall exclusively within the visual domain, experiments utilizing ERPs may also use stimuli that can be heard (auditory), felt (tactile), smelled (olfactory), tasted (gustatory), or are kinesthetic in nature (engagement of muscles and/or joints). In the case of the visual modality, the stimuli may consist of faces, images of objects and scenery, or other content related to the research area of interest, such as words and sentences for neurolinguistic research.

ERPs can be isolated by splitting a continuous EEG recording into time windows coinciding with stimulus onset, known as epochs. Epochs typically consist of EEG data 100-250 ms before a stimulus and 500-1200 ms after a stimulus. The epoching of EEG data is depicted in Figure 4.

ERPs are characterized by positive and negative voltage fluctuations and their associated peaks, referred to as ERP components. ERP components are named using a letter indicating the polarity of the peak (N for negative and P for positive), followed by either the latency of the peak in milliseconds relative to stimulus onset or an ordinal indicating when the peak occurred. In the latter case, ordinals can be translated using multiples of 100: the P1 component is roughly equivalent to the P100, the P2 to the P200, and so on. These naming conventions are approximations: the P300 component may occur between 300 ms and 800 ms, depending on the task, stimulus, and subject. It is not uncommon for ERP components to have a latency that does not quite match what their description suggests (for example, the N400 component can occur before the P300 component).

Figure 4: Epoching of EEG data. Each rectangle represents an epoch, while the green vertical lines indicate the onset of a stimulus. The vertical axis denotes changes in voltage. In this example, the EEG data are distinct for each epoch; however, it is also common for epochs to contain overlapping EEG data.

Since ERPs are isolated slices of continuous voltage fluctuations, ERP components are often heavily influenced by components that precede them, including components from different epochs. A strong positive component that occurs before a negative component will affect the amplitude of the negative component, and vice versa. Voltage peaks that occur before each ERP component can also affect the latency of the next ERP component, and changes in latency can in turn cause additional changes in peak amplitudes. This makes isolating individual ERP components quite difficult, as there are multiple confounding factors that influence how a component may be expressed.

Due to these factors, it is worth bearing in mind that ERP components are not necessarily identified from their peaks alone. An ERP component requires an analysis of multiple features in order to be properly isolated and identified; a component's associated peak is simply an additional salient feature that is often easiest to identify upon first inspection [58].

There are three types of ERP components: exogenous, endogenous, and motor. Exogenous components are affected by the presence of the stimulus itself and its associated characteristics (such as an image's brightness or the volume of a tone). The N100 is an example of an exogenous component and can be found at frontal, parietal, occipital, and especially central recording sites [74]. Endogenous components are typically unaffected by stimulus modality or characteristics and are associated with higher cognitive processing; they usually occur later than exogenous components. The P300 is an endogenous component. Motor components are produced during the preparation and execution of a motor movement. The readiness potential (RP), originally the German Bereitschaftspotential, is an example of a motor component. It is characterized by a negative shift at the frontal and central electrode sites preceding a physical response, appearing up to a full second before the actual physical movement is made. The RP's lateralized derivative, the lateralized readiness potential (LRP), is easy to isolate from other ERP components and has been widely used in cognitive studies [75, 76, 77].

Next, we provide an overview of the ERP components relevant to this thesis. A much more thorough discussion of other ERP components, including interpretations and relevant experiments, can be found in [58].
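Epoching itself reduces to a slicing operation over the continuous recording. The sketch below cuts fixed windows around stimulus onsets and subtracts a pre-stimulus baseline; the onset times and window lengths are invented, and only the 2000 Hz rate matches the configuration described later in this thesis.

```python
import numpy as np

fs = 2000                  # sampling rate in Hz
pre, post = 0.1, 0.5       # window: 100 ms before to 500 ms after onset

rng = np.random.default_rng(3)
eeg = rng.normal(size=(32, fs * 120))       # 32 channels, 2 minutes
onsets = np.arange(fs, fs * 100, fs // 2)   # hypothetical stimulus onsets

pre_s, post_s = int(pre * fs), int(post * fs)
# Slice one fixed-length window around each stimulus onset.
epochs = np.stack([eeg[:, o - pre_s : o + post_s] for o in onsets])
# epochs.shape == (n_stimuli, 32, pre_s + post_s)

# Baseline correction: subtract each channel's mean over the pre-stimulus
# interval so that epochs are comparable across trials.
baseline = epochs[:, :, :pre_s].mean(axis=2, keepdims=True)
epochs -= baseline
```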

2.3.1 P300 Component

The first report documenting the P300 component (sometimes referred to as the P3 component) was published by Sutton et al. (1965), who found a parietal scalp positivity occurring around 300 ms that varies in magnitude based on how predictable or probable a stimulus is [78]. Duncan-Johnson and Donchin (1977) further explored this and found that the less common a relevant stimulus is, the larger the P300 wave it produces when presented to a subject [79]. In other words, when looking for a specific target stimulus, the more non-target stimuli seen before the target stimulus of interest appears, the larger the positivity produced in response to the target stimulus. Exploiting this phenomenon, now known as the oddball paradigm, research by Squires et al. (1975) distinguished two types of P300 component: the P3a and the P3b [80]. The P3a component appears as a maximal positivity at the scalp over the frontal lobe, whereas the P3b component is a maximal positivity measured at the top rear of the scalp above the parietal lobe. Typically, when the P300 or P3 component appears in a research text, the authors are referring to the P3b component. For the sake of consistency, we will use P300 to refer to the P3b component throughout the rest of this text.

Since Sutton's preliminary work, multiple theories of the P300's neurological origins have been proposed, although there is still no clear consensus as to how or why the P300 is produced. A theory first popularized by Donchin (1981) argues that the P300 component is associated with context updating and the allocation of attention [81]. The more attentional resources used for a given task, the larger the amplitude of an evoked P300. The converse is also true: P300 amplitudes tend to be smaller if a subject is allocating their attention to multiple unrelated tasks.

The P300's latency is directly associated with a task's difficulty and the cognitive abilities of an individual [82]. The relevance of stimuli to a given task also matters: the cognitive processing of irrelevant stimuli (such as a sequence of dark-haired faces in a task where the user is instructed to look for faces with blond hair) produces smaller P300 amplitudes than stimuli that are relevant (in this example, faces with blond hair). The P300's peak latency can vary dramatically, ranging from 250 ms to 500 ms or even longer depending on the task. This latency also varies between individuals and is influenced by a variety of factors. Age and the progression of certain neurological diseases, like dementia, increase P300 latency [83]. Cognitive abilities also influence the P300's latency: individuals with larger working memories exhibit a shorter average P300 peak latency than individuals with smaller working memories [84].

In context-updating theory, the P300 is an indication of a mental abstraction of the stimulus being updated. While some work mistakenly assumes context updating refers to working memory, this was not Donchin's intended meaning, although experiments that assume a shared relationship between the P300 component and working memory updates have nonetheless achieved reasonable results [58]. While the specific details remain to be settled, it can safely be assumed that the P300 reflects some mental model being updated after an unexpected event or stimulus.

More recently, distinctions in P300 components have been used to assess the relevance of content via word informativeness within a sentence [15, 85]. In [85], high-information words, such as "India" or "genome", elicit ERP components that are distinct in nature from the ERP components elicited by low-information words, such as "is" and "from". The informativeness of a given word (as determined by the entropy of its associated distribution within a language model) is associated with larger positivity in the P200, P300, and P600 components.

2.3.2 N170 Component

The N170 was first recognized by Jeffreys (1989), who compared ERPs produced by face and non-face stimuli and noticed a small difference at central midline sites from 150 ms to 200 ms [86]. The N170 component indicates that the average human brain can distinguish between faces and non-faces within approximately 150 milliseconds. This response may be slower when the faces presented are obscured or otherwise difficult to immediately distinguish. Segmentation of faces into their individual components can also increase the latency of the response; however, this generally only applies when the features are shuffled in a manner such that typical facial structures are no longer preserved. Visual processing of faces is considered to be at least somewhat automatic [87]. Rossion & Jacques (2012) found that the N170 component is indeed related to face perception, rather than being a product of processing lower-level features that happen to appear more frequently in faces [88]. The fusiform gyrus, a region of the brain associated with facial processing, is considered a likely contributor to the N170 component [89, 90, 91].

The N170 is not associated only with facial recognition. Recent research has suggested that it is also linked to some neural mechanism related to visual domain expertise. Tanaka et al. (2001) found differences in the N170 response of bird and dog experts depending on the image they were shown. Dog experts had more pronounced measured N170 components when viewing images of dogs than of birds, while the measured N170 components of bird experts were stronger for images of birds than of dogs. In both instances, the N170 components were stronger in experts than in non-experts presented with the same images [92]. In experiments conducted by Rossion et al., simultaneous presentation of a face stimulus alongside a stimulus for some item of expertise resulted in a reduced N170 response elicited by the face stimulus, suggesting these processes compete for the same neural circuitry [93, 94].

2.4 Brain-computer Interfaces

A brain-computer interface (BCI) can be defined as a device that allows for interaction with a machine or system using neurological activity. The term was first introduced in 1973 by Jacques Vidal [95]. Vidal used a non-invasive approach that relied upon EEG recordings, and his work was significantly limited by the computational power available; the paper demonstrated a proof of concept, although without practical results. In 1977, Vidal published work that demonstrated the first human-controlled BCI application using EEG, in which a user controlled a cursor to navigate through a simple digital maze [96]. While BCIs typically employ data derived exclusively from neurological sources, they may be enhanced with additional physiological information, such as eye movements [97], heart rate [98, 99], and skin conductivity [100].

2.4.1 Invasive and Non-invasive BCIs

BCIs can be non-invasive (e.g. EEG, fNIRS) or invasive (implanted in the brain). Invasive BCIs, while effective and capable of capturing more accurate data from the brain, are prone to complications and can gradually lose their effectiveness over time as scar tissue builds up around the implantation site [101]. Semi-invasive BCIs, such as those utilizing electrocorticography, offer similar data quality and are less prone to complications [102]. In electrocorticography, electrodes are placed beneath the dura mater (the outermost of the three protective membranes surrounding the brain and lining the inside of the skull). Due to the risk of medical complications and the high surgical costs, invasive and semi-invasive BCIs are typically more suitable for individuals with a defined medical need for such a device.

Non-invasive BCIs, on the other hand, do not require surgery and may utilize a variety of equipment types, ranging from affordable consumer-grade wearable headsets for simple applications where low-precision readings are sufficient [103], to more expensive research-grade equipment that requires longer setup times but also offers more accurate recordings. While non-invasive BCIs could theoretically use any type of neuroimaging approach, EEG, fNIRS, and MEG are the most popular, as they offer the temporal resolution necessary to correlate neural features with external events in real-time (thus enabling real-time interaction by the subject).

2.4.2 Active, Reactive, and Passive BCIs

BCIs can also be classified into three categories based on how a user may interact with them, as defined by Zander and Kothe (2011): active, reactive, and passive [104]. With active BCIs, users must exert some level of control over their brain activity. In Hex-O-Spell, a mental typewriter developed by Blankertz et al. (2006), this is done by imagining various motor movements: a user imagines a right hand movement to move a cursor clockwise, and imagines a right foot movement to select the character the arrow is pointed at [105].

In a reactive BCI, the application is controlled using the user's brain activity in response to a stimulus, often modulated by the user. In RelaWorld, a virtual reality meditation system, neurofeedback is used to guide subjects through various meditation practices [106]. As the user progresses in the exercises (measured by alpha and theta activity), a virtual "energy bubble" of increasing opacity appears around the user as they become more relaxed, and their relative height within the virtual space increases as they become more focused. Users of the system reported greater feelings of well-being, concentration, and presence than controls. In Eugster et al. (2016), a proof-of-concept application recommends a Wikipedia article based on its perceived relevance with respect to a topic word of interest, as measured by the brain activity of participants while they read the article's introduction [15].

In a passive BCI, brain activity is recorded and utilized without any conscious effort from the user. Khan and Hong (2015) presented a passive BCI utilizing fNIRS that is worn by users in a driving simulation. As the user navigates around a city, their brain signals are recorded and analyzed; if the user's brain signals indicate they are becoming drowsy, an auditory warning is played [107].

2.4.3 Assistive and Non-Assistive BCIs

Much of the initial motivation to develop sophisticated BCI applications was to provide assistive systems for the disabled [108], particularly disabled individuals with progressive diseases that may affect their ability to use assistive technologies requiring physical inputs [109]. Assistive BCIs may be invasive [110] or non-invasive [111]. The general principle behind assistive BCIs is to facilitate interaction between a person with limiting conditions and the physical or digital world. While there has been some success and progress since the early 2000s, assistive BCIs do not necessarily perform better than other assistive technologies that do not attempt to harness brain activity.

A notable example of a BCI failure involves the astrophysicist Stephen Hawking. Hawking, who suffered from Amyotrophic Lateral Sclerosis (ALS), a progressive neurodegenerative disease that increasingly limits the muscle movements of those afflicted with the condition, eventually lost his ability to speak due to the progression of the disease. Hawking was nonetheless granted some degree of autonomy through the use of a motorized wheelchair and a computer system that would speak sentences he wrote using the muscle function remaining in his hands [112]. As the disease progressed, it became increasingly difficult for Hawking to use the assistive system, and communicating complex ideas eventually took far too long for Hawking to engage in ordinary discussion. From 2011 to 2012, attempts were made by the company Neurovigil to implement a BCI utilizing scalp electrodes that would allow Hawking to control his computer using motor imagery brain responses (produced when Hawking imagined the movement of his left or right hand) [113]. However, the developers could not get the device to record a signal accurately enough to reliably control Hawking's computer system, and the project was abandoned [114].

A collaboration with Intel eventually produced the Assistive Context Aware Toolkit, which used a mixture of specialized UI design and predictive text input to more than double the rate at which Hawking could communicate his ideas [115]. This system still required Hawking to use the muscles in his face, rather than his brain signals, to control the application. While Neurovigil's BCI for Hawking failed to achieve the desired quality of results, attempts to produce BCIs for patients with disabilities (including ALS) have demonstrated promise [116, 117, 118], including non-invasive applications that function outside a laboratory and are suitable for everyday use [119].

BCIs for healthy users have been less popular, as they typically do not provide a superior alternative to existing modes of interaction. Most BCIs intended for healthy users are non-invasive, and thus suffer from poor spatial resolution and a low signal-to-noise ratio, which leads to poorer performance. For example, users of Hex-O-Spell were capable of writing less than 1 word per minute [105]. Currently, this is not remotely competitive with other input methods: typing speeds of around 36 words per minute are achieved on mobile phones with touch screens (SD = 13.22, N = 37,370) [120], and 52 words per minute (SD = 20.20, N = 168,960) using a physical keyboard [121]. Another common limitation of non-assistive BCIs for healthy individuals is the cost-benefit ratio: most EEG systems used in academic research are beyond the price range of the casual able-bodied hobbyist. More affordable consumer-grade systems, such as the Muse headband, collect EEG data of sufficient quality to adequately visualize various ERP components (including the P300), making them theoretically suitable for real-time BCI applications [122]. Such findings should be carefully considered, as the nature and quality of the components captured by such consumer-grade wearables differ significantly from those captured by laboratory-grade equipment [122]. Furthermore, studies that use data collected from consumer-grade wearables typically rely upon slower deep learning models to accurately classify the recorded signals, making such approaches unsuitable for real-time applications [123, 124].

Since their first introduction in the 1970s, BCIs have changed dramatically in scope and capability. As computational and brain imaging technologies improved, more advanced applications of BCIs became possible. In Jacques Vidal's preliminary work, the main computer his team used was an IBM 360/91, which was equipped with, in Vidal's words, an "exceptionally large core memory of 4M bytes". With the advent of powerful consumer hardware, real-time BCI applications have become much easier to develop and test, such that they are now limited more by data quality and sensor accuracy than by computational power.

2.5 Machine Learning for BCIs

Most non-invasive BCIs require some statistical classification technique applied to the collected data in order to function. A variety of methods can be used, from support vector machines (SVMs) to neural networks.

Here, we focus our discussion on linear discriminant analysis (LDA): it is computationally simpler than SVMs and neural networks, and our work centers on ERP studies, which usually involve a binary classification problem. LDA is also robust to unbalanced data and has demonstrated competitive results in previous work [16, 125]. A more comprehensive review of classification techniques within the context of EEG-based BCIs, from simple to computationally advanced, can be found in Lotte et al. (2007) and Lotte et al. (2018) [125, 126].

2.5.1 Linear Discriminant Analysis

LDA is a dimensionality reduction and classification technique that seeks to find a linear combination of features that separates two or more classes of data. It is a supervised method which finds component axes that maximize the separation between classes. This can be contrasted with principal component analysis (PCA), an unsupervised method which finds component axes that maximize variance across the entire data set. LDA utilizes a generalization of Fisher's linear discriminant [127] with several differences, such as the assumptions that the data are normally distributed and that the classes share the same covariance matrix.

The general procedure of LDA for classification is separated into two parts: dimensionality reduction and classification. For dimensionality reduction, LDA learns a weight vector that projects high-dimensional data into a lower dimension such that it maximizes between-class distance and minimizes within-class variation. For classification, unseen data samples are projected into the lower dimension using the learned weight vector. A visualization of LDA's key components is provided in Figure 5.


Figure 5: LDA learns a dimension to project the data onto that maximizes class separation. In (a), the class overlap between the data in both dimensions is high. In (b), a new dimension is learned, onto which the data is projected in (c). In (d), the resulting distributions of the two classes are visualized in the new dimension, where they have considerably less overlap.

18 The linear discriminant function1 which projects the data into lower dimensions can be written as follows: y(x) = WT x (1) where x is the input vector and W is the projection vector (also referred to as the weight vector). Learning the components for W to maximize separation between classes and minimize within-class variance is comprised of four steps. First, the d-dimensional mean vectors for the different classes are computed: 1 m = X x (2) k N n k n∈C1 where xn is data sample n, Nk are the number of samples for class k, and mk is the computed mean vector. Second, the within-class and in-between-class covariance matrices are computed. The within-class covariance matrix SW and the between- class covariance matrix SB can be computed with Equations 3 and 4

$$S_W = \sum_{k=1}^{K} \sum_{n \in C_k} (\mathbf{x}_n - \mathbf{m}_k)(\mathbf{x}_n - \mathbf{m}_k)^T \tag{3}$$

$$S_B = \sum_{k=1}^{K} N_k (\mathbf{m}_k - \mathbf{m})(\mathbf{m}_k - \mathbf{m})^T \tag{4}$$

where $\mathbf{m}$ is the overall mean, $\mathbf{m}_k$ is the mean for class $k$, and $N_k$ is the number of samples in class $k$. The third step is to formulate and solve a criterion for class separability, such that the outcome is maximized when the covariance between classes is large and the covariance within classes is small. For this example, the criterion in Equation 5 is used (for the full derivation, see [129]).

$$J(\mathbf{W}) = \mathrm{Tr}\left\{ S_W^{-1} S_B \right\} \tag{5}$$

Next, the eigenvectors and eigenvalues of the square matrix $S_W^{-1} S_B$ are calculated. Finally, the eigenvectors are sorted by their associated eigenvalues, the largest $D'$ eigenvalues and their associated eigenvectors are selected, and a $D' \times d$ dimensional matrix $\mathbf{W}$ is created. For binary classification, the class of a new sample $\mathbf{x}_n$ can be determined by solving

$$\mathbf{W}^T \left( \mathbf{x}_n - \frac{\boldsymbol{\mu}_1 + \boldsymbol{\mu}_2}{2} \right) > \log \frac{P(C_1)}{P(C_2)} \tag{6}$$

where $\mathbf{x}_n$ is a data sample, $\boldsymbol{\mu}_1$ and $\boldsymbol{\mu}_2$ are the mean vectors of the respective classes, and the probabilities for classes $C_1$ and $C_2$ are estimated from the dataset.

¹ This formulation is based on work from [128, 129]. For a Bayesian approach, see [130].
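To make the procedure concrete, the steps in Equations 2-6 can be sketched in a few lines of NumPy. This is an illustrative implementation, not code from the thesis; the function names and the equal-priors simplification in the decision rule are our own.

```python
# A minimal NumPy sketch of the LDA procedure in Equations 2-6.
import numpy as np

def fit_lda(X, y, n_components=1):
    """X: (n_samples, d) data matrix; y: integer class labels."""
    classes = np.unique(y)
    d = X.shape[1]
    overall_mean = X.mean(axis=0)

    S_W = np.zeros((d, d))  # within-class covariance, Eq. 3
    S_B = np.zeros((d, d))  # between-class covariance, Eq. 4
    for k in classes:
        X_k = X[y == k]
        m_k = X_k.mean(axis=0)                  # class mean, Eq. 2
        centered = X_k - m_k
        S_W += centered.T @ centered
        diff = (m_k - overall_mean).reshape(-1, 1)
        S_B += len(X_k) * (diff @ diff.T)

    # Maximize J(W) = Tr{S_W^{-1} S_B} (Eq. 5) via eigendecomposition.
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(S_W) @ S_B)
    order = np.argsort(eigvals.real)[::-1]       # sort by eigenvalue
    return eigvecs[:, order[:n_components]].real # columns form W

def predict_binary(W, X, mu1, mu2):
    """Decision rule of Eq. 6, assuming equal class priors (log ratio 0)."""
    return (X - (mu1 + mu2) / 2) @ W > 0
```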

2.5.2 Classification of EEG Data Using LDA

Blankertz et al. (2011) demonstrated that it is feasible to optimize linear discriminant analysis with careful use of shrinkage estimators to produce competitive classification results for ERPs [16]. The dimensionality of EEG signal data tends to introduce difficulties in estimating the covariance matrix, which leads to poorer performance of standard LDA models. This is because with EEG signal data the ratio of features to samples is very high, which results in large eigenvalues being estimated as even larger and small eigenvalues being estimated as even smaller. LDA with shrinkage, also known as regularized LDA, helps manage this "curse of dimensionality" problem.

Assumptions EEG data across channels and subjects usually follow a normal distribution [58, 16]. LDA is also rather robust to data that violates this assumption of normality. LDA assumes a shared covariance matrix between classes, and therefore it also assumes that the noise and variance present within the data are caused by factors unrelated to differing experimental conditions. Thus, it is important to control for conditions during experimental data collection that may introduce bias or correlative noise that otherwise would not exist. Suppose a subject is being presented with images that are either graphically offensive or pleasing in nature. If the subject frequently blinks when they see an offensive image, but rarely when viewing a pleasant image, the covariance matrix will fail to capture this noise, as it is correlated with a variation in experimental or subject conditions.

Featurization In order to find features within the ERPs, the epochs for each subject must be converted into vectorized representations. Given a 32-channel EEG configuration running at 2000 Hz with epochs 500 ms in length, this produces data with the shape n × 32 × 1000. The spatial and temporal dimensions are concatenated, producing an n × 32000 matrix. To train our LDA classifier, class labels are needed for the data. In this case, labels are assigned to each epoch based on the presented stimuli. For the task "look for blond faces", the epochs associated with blond-haired faces are assigned the class label 1, while the remaining epochs, for stimuli without blond hair, are assigned the class label 0. With these labels and the associated vectorized data, the parameters of the LDA model can be estimated, and the model can then be tested on new data unseen during training to determine its performance. These training and testing steps are repeated for all subjects for which data is available.
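A minimal sketch of this featurization and training step, using synthetic data in place of real EEG; the array shapes follow the text, and scikit-learn's `lsqr` solver with `shrinkage="auto"` corresponds to the regularized LDA discussed in this subsection.

```python
# Illustrative featurization and training; random data stands in for
# real EEG epochs (n epochs x 32 channels x 1000 samples).
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

n = 200
epochs = np.random.randn(n, 32, 1000)     # n x channels x time points
labels = np.random.randint(0, 2, size=n)  # 1 = target, 0 = non-target

X = epochs.reshape(n, -1)                 # concatenate to n x 32000

# Regularized LDA with automatic Ledoit-Wolf shrinkage selection.
clf = LinearDiscriminantAnalysis(solver="lsqr", shrinkage="auto")
clf.fit(X[:150], labels[:150])            # train on the first 150 epochs
print(clf.score(X[150:], labels[150:]))   # evaluate on held-out epochs
```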

2.5.3 Comparing EEG Data Between Subjects and Trials

Due to individual differences in electrode placement, skull shape, hair thickness, and neurophysiology, EEG recordings and associated ERPs are unique to the individual.

Data from two subjects, or even from the same subject collected in different trials, cannot be trivially combined and used interchangeably. While it is acceptable to average data from many subjects in order to form a better general understanding of a neurological response to an experimental condition, EEG data from multiple subjects cannot easily be used to train a single classification model. While training such a model is possible, as demonstrated by Kang & Choi (2014), their experimental results relied upon a mathematically complex model trained using EEG data produced from motor imagery tasks (such as asking subjects to imagine moving their left or right hand) rather than the implicit responses produced from viewing visual stimuli [131].

2.6 EEG Data Preprocessing

In order to produce reliable predictions from EEG data, most machine learning approaches require data with a high signal-to-noise ratio. This means preprocessing the data to remove as many artifacts and sources of noise as possible. While there are ways to reduce the likelihood of certain types of artifacts (such as by instructing subjects to sit still and avoid blinking when possible), artifacts and noise are nonetheless inevitable in EEG recordings. Removal of such artifacts varies depending on their properties. Small and constant artifacts (e.g. line signal noise) require different techniques than artifacts that are large and irregular (e.g. eye movements). In this section, we discuss in detail methods which can be used to deal with both types of artifacts. A more thorough discussion can be found in Chapter 6 of [58].

2.6.1 Filtering

The first and most common technique used to clean EEG data is to apply various filters so that only frequencies of interest are left within the data. Notch filters, which attenuate frequencies within a certain range while leaving other frequencies unaffected, are a good choice for regular oscillations, such as noise from power lines and electrical systems (typically occurring in the 50-60 Hz range). In some circumstances, low-pass filtering is also an effective method; however, it removes all oscillations that occur above the cutoff frequency, so it is not suitable for all EEG studies. In ERP studies, the oscillations of interest occur at frequencies lower than 60 Hz, making low-pass filters a viable data cleaning method. To remove low frequency oscillations, such as those caused by breathing, a high-pass filter may be used. Care should be taken when applying high-pass filters with cutoffs of 0.3 Hz or higher, as they may actually introduce bias into the cleaned data, which can adversely affect research conclusions [132]. Note that combining a high-pass and a low-pass filter is, in principle, the same as applying a band-pass filter. Various filters applied to signal data and their results are visualized in Figure 6.

(a) Unfiltered signal (b) High-pass filter (c) High-pass and low-pass filters (d) High-pass, low-pass, and notch filters

Figure 6: Visualization of various filter types applied to audio signals. In (b), a high-pass filter attenuates frequencies that occur below a cutoff frequency (for comparison, the original signal is shown as a transparency). In (c), the addition of a low-pass filter attenuates signals that occur above a given frequency. In (d), a notch filter removes only signals within a narrowly defined band (in this example, at 60 Hz).
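The filters discussed above can be sketched with SciPy; the cutoff values and sampling rate below are illustrative examples rather than prescriptions.

```python
# SciPy sketch of band-pass and notch filtering on a synthetic signal.
import numpy as np
from scipy.signal import butter, filtfilt, iirnotch

fs = 2000.0                                 # sampling rate in Hz
t = np.arange(0, 2, 1 / fs)
signal = np.sin(2 * np.pi * 10 * t) + 0.5 * np.sin(2 * np.pi * 50 * t)

# High-pass and low-pass combined as a band-pass: keep 0.2-35 Hz.
b, a = butter(4, [0.2, 35], btype="bandpass", fs=fs)
bandpassed = filtfilt(b, a, signal)

# Notch filter: attenuate a narrow band around 50 Hz line noise.
b_n, a_n = iirnotch(w0=50.0, Q=30.0, fs=fs)
notched = filtfilt(b_n, a_n, signal)
```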

2.6.2 Baseline Correction

Recall that since EEG data is a continuous recording, each epoch may contain voltage offsets introduced by events from previous epochs. These voltage offsets, if not properly accounted for, can prevent effective analysis of the data by introducing a significant amount of variance and making it difficult to compare epochs. Voltage offsets can be produced by a variety of sources, such as ERP components in a previous epoch or gradual changes in recorded voltages due to perspiration or the drying of conductive electrode gel. Baseline correction can also reduce effects from eye blinks and other movements, although these corrected results are not always consistent (see Chapter 8 of [58] for more details). The most common baseline correction technique is to take the average voltage of an epoch's pre-stimulus period and subtract it from the remaining data points within that epoch. Provided that the pre-stimulus periods of each epoch do not vary significantly due to differences in experimental setup or improperly standardized stimuli, this technique produces epochs that have fairly similar average voltages. If pre-stimulus periods contain residual ERP components from previous epochs or artifacts such as eye blinks, baseline correction may produce adverse results that bias the data [133].
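A minimal sketch of this subtraction, assuming the epochs are stored as an array with the pre-stimulus samples first:

```python
# Pre-stimulus baseline correction for an array of epochs with shape
# (n_epochs, n_channels, n_samples); the first `pre` samples are
# assumed to precede stimulus onset.
import numpy as np

def baseline_correct(epochs, pre):
    baseline = epochs[:, :, :pre].mean(axis=2, keepdims=True)
    return epochs - baseline   # subtract per-epoch, per-channel average
```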

2.6.3 Artifact Rejection and Correction

While filtering and baseline correction can reduce noise within EEG data, certain noise or artifacts are simply too large or persistent for these techniques to be of any use. In these cases, so long as it is not necessary to process the data in an "on-line", real-time manner, techniques may be applied to simply remove affected epochs or channels (artifact rejection) or to mathematically correct the observed noise (artifact correction). Such techniques are useful for EEG research; however, they are not typically suited for BCIs, as they often involve manual input from a trained researcher or require the entire dataset to be considered at once.

Artifact Rejection There are many ways to approach artifact rejection, from manual techniques and simple heuristics to sophisticated mathematical approaches. During visual inspection of artifacts, an experienced researcher manually screens the EEG data to identify noisy or flat signals caused by malfunctioning electrodes or other fairly recognizable phenomena. Depending on how thorough the researcher is, visual inspection can be rather slow, particularly for experiments with many subjects and trials. Additionally, it can introduce bias into the data, so it is important that the researcher conducting the visual inspection does so without any knowledge of the experimental conditions associated with the specific epochs or trials they are manually inspecting.

Some basic heuristics for artifact rejection include using minimum and maximum voltage thresholds. Channels or epochs that exceed these thresholds can be rejected. In some instances, an epoch may contain amplitudes that could not have been produced by a human brain (sometimes caused by eye blinks or muscle movements); in such cases, these epochs can be safely discarded. Another heuristic is to analyze the variance of each channel or epoch. Channels or epochs that have unusually high variance may be rejected as overly noisy, whereas those with unusually low variance may be rejected as failing to record a proper signal.

Due to individual physiological and neurological differences, defining the variance or voltage thresholds to use for artifact rejection is not always straightforward. Subjects with thicker skulls or particularly dense hair will likely have lower recorded scalp voltages than subjects with thinner skulls or hair that does not interfere with proper electrode placement. Thus, it may be necessary to first visually inspect each subject's data and spend time defining parameters that more or less hold true throughout the remainder of the session [58].
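A sketch of these heuristics in NumPy; the threshold values are placeholders that, as noted above, would need to be tuned per subject.

```python
# Voltage- and variance-based epoch rejection heuristics.
import numpy as np

def reject_epochs(epochs, max_abs=80.0, min_var=1e-3, max_var=1e3):
    """epochs: (n_epochs, n_channels, n_samples), e.g. in microvolts."""
    peak = np.abs(epochs).max(axis=(1, 2))   # peak |voltage| per epoch
    var = epochs.var(axis=(1, 2))            # signal variance per epoch
    keep = (peak <= max_abs) & (var >= min_var) & (var <= max_var)
    return epochs[keep], keep
```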

Artifact Correction Sometimes, artifact rejection may result in too few remaining epochs or channels to allow for sufficient analysis of the data. In such cases, more advanced techniques may be used that attempt to correct noise and artifacts rather than remove the affected segments from the dataset. A technique popularized by Makeig et al. (1996) uses independent component analysis (ICA) to isolate artifacts to a few output channels while removing them from the remaining channels [134].

Using ICA, noise components within the data can be visually identified by understanding their relation to known properties of EEG phenomena. First, noise components do not usually correlate with the onset of stimuli. Second, noise components often contain frequencies that exist outside of those common to neural oscillations. Third, noise components usually consist of topographies that imply a non-neural origin, e.g. the components of blinks and other eye movements are largest at the frontal electrodes. Using EOG and other physiological sources of data can further improve the ICA artifact correction approach [135].

While powerful, ICA is not without its limitations. ICA requires the researcher to visually inspect components and is subject to significant interpretation. ICA works best for artifacts that occur fairly frequently and regularly; using ICA to remove infrequent or irregular artifacts can produce unreliable results. A careless researcher can inadvertently remove components that also contain signals of significant neural origin, further skewing the analysis. ICA also significantly alters the EEG waveform data, which can influence how the signals are interpreted. Thus, ICA should be applied with care and by an experienced researcher [58]. Recently, methods have been proposed that seek to automate ICA processing of EEG data, though more research is needed to determine their efficacy [136]. As with artifact rejection methods that require manual input from a researcher, ICA is not suitable for applications which respond to real-time EEG data, such as BCIs.
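For illustration, a simplified artifact-correction round trip using scikit-learn's FastICA as a stand-in for EEG-specific tooling (in practice a dedicated package such as MNE-Python would typically be used); the artifact component index is assumed to come from manual inspection of the components.

```python
# ICA-based correction on a continuous recording (channels x samples):
# decompose, zero out an artifact component, and reconstruct.
import numpy as np
from sklearn.decomposition import FastICA

def remove_component(data, artifact_idx):
    ica = FastICA(random_state=0)
    sources = ica.fit_transform(data.T)      # (samples, components)
    sources[:, artifact_idx] = 0.0           # drop e.g. a blink component
    return ica.inverse_transform(sources).T  # back to channels x samples
```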

2.7 Collaborative Filtering

Collaborative filtering is a process in which unknown information about a user or entity is estimated by comparing data collected from the user to data from many other users. A common application of collaborative filtering is to determine user preferences and provide content recommendations [137]. While collaborative filtering can be applied in contexts outside of content recommendation systems, its application in this context has been studied widely in the literature, and thus will be the focus of this section. As collaborative filtering plays only a small role in motivating the research presented in this thesis, the discussion will be kept brief. A more thorough review of collaborative filtering and content recommendation systems can be found in [138, 139, 140].

In order to produce recommendations for a user of interest, a collaborative filtering system typically requires the user to provide preference information, such as star ratings of movies. The system then seeks to find other users that have similar preferences, and produces estimated ratings for content the user of interest has not yet seen or rated, as depicted in Figure 7. Items that the system estimates will receive a high rating from the user are recommended to the user, who may then consume and rate the suggested content accordingly. In addition to explicitly provided preference data, collaborative filtering systems can use other sources of information, such as historical purchase data. Such systems may seek to determine whether or not a user may be interested in certain products in an online store [141].

Figure 7: A typical collaborative filtering problem. In this example, given item ratings from users A, B, and C, the collaborative filtering system tries to estimate user D's ratings for items 1 and 4. While this example uses a scale rating system, binary ratings and other types of data that indicate behavioral preferences, such as purchase data and clickthrough rates, may also be used.

2.7.1 Matrix Factorization for Collaborative Filtering

A common approach in collaborative filtering systems is to apply matrix factorization techniques to user data in order to find lower dimensional representations that can be used to predict user preferences. A popular matrix factorization technique is the singular value decomposition (SVD), which has achieved competitive results in large-scale recommendation systems, including the Netflix prize [142, 143]. Using SVD with the movie example, an input matrix of users and movie ratings M is represented as a product of three matrices U, Σ, and V. This is formalized as

$$M_{[u \times i]} = U_{[u \times r]} \, \Sigma_{[r \times r]} \, (V_{[i \times r]})^\top \tag{7}$$

where u indexes users, i indexes movies, and r is the rank of matrix M. Note that the singular value matrix Σ is a diagonal matrix of non-negative values, where the r dimensions are sometimes referred to as "concepts". In the context of movies, r can be conceptualized as film genres and other related elements (such as narrative tone or style). In Σ, the magnitude of the singular values can be approximated as the relative strength of each concept. Matrices U and V can be conceptualized as a means of translating user-to-concept and movie-to-concept, respectively. That is, U maps the relative strength of how each genre is preferred by a given user, while V maps the strength of each genre present within a given movie.
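A toy NumPy example of the decomposition in Equation 7, including a low-rank approximation that keeps only the strongest concepts; M is a small invented rating matrix, not data from this thesis.

```python
# SVD of a toy user x movie rating matrix.
import numpy as np

M = np.array([[5, 4, 0, 1],
              [4, 5, 1, 0],
              [1, 0, 5, 4]], dtype=float)

U, s, Vt = np.linalg.svd(M, full_matrices=False)
Sigma = np.diag(s)            # singular values ~ "concept" strengths

# Rank-2 approximation: keep only the two strongest concepts.
M_hat = U[:, :2] @ Sigma[:2, :2] @ Vt[:2, :]
```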

Thus, each movie is associated with a vector $q_i$ and each user with a vector $p_u$.

For movie i, the vector $q_i$ quantifies the amount of each "concept" present in the movie. For user u, $p_u$ captures their interest in items containing the corresponding factors. The interest of user u in a given item i can then be calculated as $q_i^\top p_u$. This leads to the formal representation of the rating prediction $\hat{r}_{ui}$:

$$\hat{r}_{ui} = \mu + b_i + b_u + q_i^T p_u \tag{8}$$

where $b_i$ and $b_u$ are bias terms and $\mu$ is the overall average rating. In order to estimate the parameters $b_i$, $b_u$, $q_i$, and $p_u$, Equation 9 is minimized:

$$\min_{b_*,\, q_*,\, p_*} \sum_{(u,i) \in K} \left( r_{ui} - \mu - b_i - b_u - q_i^T p_u \right)^2 + \lambda \left( b_i^2 + b_u^2 + \lVert q_i \rVert^2 + \lVert p_u \rVert^2 \right) \tag{9}$$

where $\lambda$ is a regularization parameter. This equation can be minimized using a stochastic gradient descent approach popularized by Funk (2006) [144]. For each rating $r_{ui}$ in the training set, a rating $\hat{r}_{ui}$ is predicted. The resulting error is $e_{ui} = r_{ui} - \hat{r}_{ui}$. The parameter updates, with $\gamma$ as the learning rate, are computed using the following:

$$b_u \leftarrow b_u + \gamma (e_{ui} - \lambda b_u) \tag{10}$$

$$b_i \leftarrow b_i + \gamma (e_{ui} - \lambda b_i) \tag{11}$$

$$p_u \leftarrow p_u + \gamma (e_{ui} \cdot q_i - \lambda p_u) \tag{12}$$

$$q_i \leftarrow q_i + \gamma (e_{ui} \cdot p_u - \lambda q_i) \tag{13}$$
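The update loop can be transcribed almost directly from Equations 10-13; the sketch below uses illustrative hyperparameters and data layout, and caches the old user factors so that Equation 13 uses pre-update values.

```python
# Stochastic gradient descent for the matrix factorization of Eq. 9.
import numpy as np

def sgd_mf(ratings, n_users, n_items, n_factors=5,
           gamma=0.005, lam=0.02, n_epochs=100):
    """ratings: iterable of (u, i, r) triples with integer u, i."""
    mu = np.mean([r for _, _, r in ratings])
    b_u, b_i = np.zeros(n_users), np.zeros(n_items)
    p = np.random.normal(0, 0.1, (n_users, n_factors))
    q = np.random.normal(0, 0.1, (n_items, n_factors))

    for _ in range(n_epochs):
        for u, i, r in ratings:
            e = r - (mu + b_i[i] + b_u[u] + q[i] @ p[u])  # error e_ui
            b_u[u] += gamma * (e - lam * b_u[u])          # Eq. 10
            b_i[i] += gamma * (e - lam * b_i[i])          # Eq. 11
            p_old = p[u].copy()
            p[u] += gamma * (e * q[i] - lam * p[u])       # Eq. 12
            q[i] += gamma * (e * p_old - lam * q[i])      # Eq. 13
    return mu, b_u, b_i, p, q
```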

Other methods for minimizing Equation 9, such as alternating least squares, are beyond the scope of our discussion. More detailed explanations of matrix factorization techniques for content recommendation can be found in [145] and Chapter 5 of [137].

3 Research Design

Our research focuses on the feasibility of using brain signals within the context of a BCI to classify content in a crowdsourcing setting. The data used in this thesis were collected over the course of several months in Fall 2018 and are part of a larger research project funded by the Academy of Finland. The greater context of this research project is centered around BCI development and neuroscience research; thus, the analyses and discussion presented in this thesis are motivated by these two topics. This section provides a detailed discussion of our study's design and a brief review of its motivation. Section 3.1 introduces in detail our central problem and the research questions used to answer it, and Section 3.2 describes the experimental setup and the data collection and preprocessing procedures, and outlines the methodologies used.

3.1 Research Questions

For clarity, the central hypotheses that motivate this research are restated below:

H1: Stimuli classes can be predicted through differences in EEG signals collected from individuals as they are visually presented with the stimuli.

H2: Stimuli class predictions derived from the brain signals of multiple individuals can be combined to produce more accurate predictions.

H3: Matrix factorization techniques applied to predictions derived from the brain signals of multiple individuals can produce accurate predictions for stimuli in which brain signal observations are not available.

To summarize, we study the performance of single-trial classification of facial stimuli and whether it is possible to improve this performance via brainsourcing, such that the approach may have pragmatic, real-world value. The application of matrix factorization to EEG-derived predictions is also explored. To address our hypotheses, we ask the following research questions:

RQ 1: Can regularized LDA be used to classify EEG signals collected within a binary class framework from a facial recognition task?

RQ 2: Can EEG data collected across many individuals be combined in a crowdsourcing setting to predict a class label for a given stimulus, such that the collective performance is better than the performance of any individual user?

RQ 3: How many participants are enough to reliably predict a stimulus class, and at what point does the addition of participants fail to significantly improve the results?

RQ 4: Can collaborative filtering using matrix factorization on the LDA classifier outputs produce reasonable results, such that it could be used to make class predictions for unseen stimuli for a given individual?

3.2 Experimental Setup

In this section, we outline the study design used to collect and clean the data necessary to answer the research questions from Section 3.1. We also explain in more detail the justifications for the study design and the selection of performance measures. In order to answer the research questions, we use neurological signals measured via EEG from participants in response to a visual recognition task, in a single-trial setting. A trial is defined as a session in which electrodes are applied to the scalp of a participant, EEG data are collected while the participant completes a series of tasks, and the electrodes are removed upon completion of the requested tasks. The corresponding neurophysiological experiment, illustrated in the upper part of Figure 8, is described in this section.

3.2.1 Participants

Thirty-one volunteers, 18 males and 13 females, were recruited from the University of Helsinki. The volunteers self-reported as being healthy with regard to neuropathological history. Two participants were left-handed, while the remaining 29 were right-handed. Prior to participating in the neurophysiological experiment, they were fully briefed as to its nature and purpose. All participants signed informed consent in accordance with the Declaration of Helsinki and were instructed of their rights as participants, including the right to withdraw from the experiment at any time without fear of negative consequences. One participant did in fact withdraw from the study early, which prevented data from being collected for a single task ("old"), resulting in full data from 30 participants. However, the remaining data for this participant were still valid and thus used to perform the analyses detailed in this thesis. The study was approved by the Ethical Board of the University of Helsinki. All 31 participants received ticket vouchers to a local cinema chain as compensation for their participation.

3.2.2 Stimuli

Stimuli were selected such that they would be relatively easy for an individual to classify using a binary label (target or non-target). It was also necessary that participants be able to discern important features in a straightforward manner, but not recognize the individual stimulus. That is, their response would be based on the stimulus's class instead of any confounding and unrelated features (such as recognizing a famous person). Additionally, it was important to avoid significant variations between stimuli, such as stark differences in background color, brightness, or other visual features not related to the actual task.

Such differences could produce unintended neurological effects that could negatively influence task performance. In order to avoid such effects, we decided to use images of artificially created faces as stimuli. This allowed us to use homogeneous stimuli that did not represent any known individual yet had easily recognizable features (e.g. age, hair color, facial expression). Stimuli were generated using a GAN architecture trained on a large dataset of celebrity faces, sampled via a random process from 70,000 latent vectors drawn from a 512-dimensional multivariate normal distribution [146]. Each generated face was manually assessed by a human to ensure the faces were free of errors or unnatural features produced by the generation process. Generated faces of satisfactory quality were then placed into a distinct category corresponding to one of the eight recognition tasks, as presented in Table 1. A total of 1961 unique images were used in the experiment.

Task     Blond        Female  Young  Smiling      Dark-haired  Male    Old    Not smiling
Inverse  Dark-haired  Male    Old    Not smiling  Blond        Female  Young  Smiling

Table 1: All visual recognition tasks and associated inverses used in the experiment.

3.2.3 Apparatus

The LCD display used to present stimuli was positioned 60 cm from the participants, running at 60 Hz with a screen resolution of 1680 by 1050 pixels. Stimuli were presented using Psychology Software Tools E-Prime 3.0.3.60, which helped facilitate the optimization of display timing and EEG amplifier trigger control [147]. Each electroencephalogram was recorded from 32 Ag/AgCl electrodes, positioned at Fp1, Fp2, F7, F3, Fz, F4, F8, FC5, FC1, FC2, FC6, T7, C3, Cz, C4, T8, TP9, CP5, CP1, CP2, CP6, TP10, P7, P3, Pz, P4, P8, PO9, PO10, O1, O2, and Iz using EasyCap elastic caps (EasyCap GmbH, Herrsching, Germany). Hardware amplification, filtering, and digitization were achieved using a QuickAmp USB (BrainProducts GmbH, Gilching, Germany) amplifier running at 2000 Hz. The electrooculogram (EOG) was collected using two pairs of bipolar electrodes, situated 1 cm lateral to the outer canthi of both eyes and 2 cm inferior and superior to the right pupil. The purpose of the EOG was to collect data regarding eye movements and blinking behavior; such data were used during the preprocessing steps to remove epochs with significant noise associated with eye blinks.

3.2.4 Tasks

Participants were presented with eight recognition tasks, one task for each category in the dataset (see Table 1). All stimuli presented during each task were categorized as one of two classes: target or non-target. For example, for the task "smile", participants were shown faces that were either smiling or not smiling, and were instructed to make a mental note when they saw a face that appeared to be smiling.

Figure 8: Diagram of brainsourcing steps. 1. An example of a recognition task (in this case, the task "smile"), with preparatory prompt, masking image, sample stimuli, and ending prompt. Data are collected in one-minute segments, in which the subject is shown approximately 70 stimuli spaced 500 ms apart. 2. Classifiers are trained individually for each subject. 3. EEG data from new stimuli are classified using these models. 4. Predictions from the separate models are combined to produce a brainsourced estimation of class probability, which is used to determine the consensus label of the new stimuli.

To ensure that the participants understood the task, each recognition task was preceded by a demonstration task, in which the participants were shown four example stimulus images and asked to manually select the target images. These images were not used as stimuli in the actual task. The recognition task consisted of showing the participants twenty stimuli of the target class and fifty stimuli of the non-target class in rapid serial visual presentation format. To ensure enough data was collected for each participant, the recognition task and demonstration task for each of the eight image categories were repeated a total of four times, for a grand total of 32 iterations.

3.2.5 EEG Preprocessing

Recall that, due to the sensitivity of the measuring equipment, data collected via EEG are susceptible to a variety of different sources of signal noise, such as large changes in measured scalp voltage caused by movements of the participant, or less obvious sources of external electrical interference, such as those induced by electrical systems in the room or building where the EEG is being recorded.

Standard signal cleaning procedures were used to maximize the signal-to-noise ratio of the data [58]. Furthermore, the preprocessing pipeline was designed such that it could feasibly reduce the signal noise in an on-line application, by using only simple, easily automated operations to clean the signals.

First, a band-pass filter was applied at the frequency range 0.2-35 Hz, which mostly removed slow signal fluctuations (e.g. from respiration) and high-frequency noise (e.g. from power lines and electrical systems) from the data. After filtering, the data were split into time-locked epochs ranging from -200 to 900 ms relative to stimulus onset. Baseline correction was applied to each epoch using its pre-stimulus period (-200 to 0 ms). Transient artifacts such as eye blinks were removed using a threshold-based heuristic; calibration periods consisting of the first 2000 epochs of each participant were used to determine a maximum per-participant voltage threshold, which in turn was used to identify contaminated epochs. This threshold was set at the 90th percentile of the distribution of epochs' maximum voltages in the calibration period, and capped such that it was at least 10 µV and at most 80 µV. Epochs in which the maximum voltage exceeded the threshold were removed, which led to the removal of approximately 11% of each participant's epochs. The final dataset consisted of, on average, 3251 epochs per participant.
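The calibration-threshold rule described above can be sketched in a few lines; this is an illustrative reconstruction, not the pipeline's actual code.

```python
# Per-participant rejection threshold: the 90th percentile of per-epoch
# maximum voltages in the calibration period, clipped to 10-80 µV.
import numpy as np

def rejection_threshold(calib_epochs):
    """calib_epochs: (n_epochs, n_channels, n_samples), in microvolts."""
    max_v = np.abs(calib_epochs).max(axis=(1, 2))
    return float(np.clip(np.percentile(max_v, 90), 10.0, 80.0))
```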

3.2.6 Performance Measures

To evaluate the performance of the individual classifiers, the brainsourcing models, and the matrix factorization predictions, the performance measures precision, recall, and F1 score were selected. These measures are particularly popular within the information retrieval community [148, 149, 150] and were selected for a variety of reasons.

First, the classification problem presented in this experiment is treated as an information retrieval task. In this context, good classification performance for target (positive) classes is prioritized. In other words, it is assumed that the classifier missing a positive target is less harmful than producing a false positive. For example, given a task to recognize males, performance on correctly recognized males is valued more highly than missing a male target, as repeating the task on a large crowd could be used to correct for missed targets. False positives, on the other hand, may not necessarily be corrected in the same manner.

Second, the EEG data used to train the individual LDA classifiers are unbalanced, containing approximately 30% target stimuli and 70% non-target stimuli. Given the unbalanced nature of the data, using a simple measure of accuracy to assess model performance would produce misleading results; a model that predicts everything to be a non-target stimulus would yield an accuracy of around 70%. Because the F1 score relies upon precision and recall to assess performance (as opposed to accuracy alone), it is a suitable method for assessing the performance of models trained on unbalanced data [150]. While the dataset used for the brainsourcing model predictions was artificially balanced using downsampling, we still wanted to quantify the precision/recall tradeoff, which the F1 score captures.
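The accuracy pitfall described above is easy to demonstrate with a degenerate classifier on a 70/30 label split; this is a toy illustration, not data from the experiment.

```python
# A model that always predicts "non-target" reaches 70% accuracy on a
# 70/30 split but scores zero F1 for the target class.
from sklearn.metrics import accuracy_score, f1_score

y_true = [1] * 30 + [0] * 70              # 30% targets, 70% non-targets
y_pred = [0] * 100                        # degenerate all-non-target model

print(accuracy_score(y_true, y_pred))              # 0.70
print(f1_score(y_true, y_pred, zero_division=0))   # 0.0
```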

3.3 Experiments

In this section we discuss in detail the experiments used to answer the research questions from Section 3.1. We explain how the EEG data were featurized, how the individual classification models were trained and evaluated, and the approach we used to apply matrix factorization to the classification model results. We also provide additional explanation as to why certain methods were selected.

3.3.1 Single-Trial ERP Classification Experiment

In order to answer research question 1, classifiers were trained for each participant and their results were assessed to determine whether they performed better than a random baseline. For each participant and for each task, a regularized linear discriminant analysis (LDA) classifier with automatic selection of shrinkage using the Ledoit-Wolf lemma [151] was used to predict binary labels (target or non-target). The procedure used in this experiment is drawn from a similar procedure used in Blankertz et al. (2011) [16]. Each classifier was trained with vectorized representations of the ERPs along with a binary label indicating class membership. The label indicated whether the ERP was associated with a target stimulus or not. For example, in the case of the "blond" task, where the participant was asked to take a mental note of blond persons, the label indicated whether an ERP was associated with a blond person or not. The vector representation consisted of spatial and temporal features, using time points in the 50-800 ms post-stimulus range and all 32 channels. This resulted in a data tensor X_{n×c×t} with n epochs, c = 32 channels, and t = 125 time points. To reduce the data dimensionality and speed up the training procedure, the time points were split into t′ = 15 equidistant time windows, for which average voltages were computed. Finally, the temporal and spatial dimensions of the tensor were concatenated to produce a data matrix X_{n×ct′}.

Each classifier predicted the class probabilities for one task and used the other tasks as training data. Since there was overlap in the stimuli used for a task and its inverse, data from the task's inverse were omitted from the training dataset. For example, when training a classifier for the "young" task, data from all tasks except "young" and "old" were used. By leaving out the inverse task, we ensure that the classifier only predicts the class probabilities for ERPs of unseen stimuli, rather than learning other latent features that may be present in the data. This technique also allowed us to assign class probabilities for all stimuli presented to each subject that remained after the preprocessing step, with an average of 250 stimuli per participant, per task.

Predicted class probabilities were converted to binary class labels using a simple threshold technique. For each classifier, the mean class probability over all of its predicted outputs was computed. Class probabilities greater than the mean indicated a target, while those less than the mean indicated a non-target. These binary labels were then used to assess classifier performance, where ground truth labels for the stimuli were compared to the predicted class labels.

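The windowed featurization and mean-threshold labeling described above can be sketched as follows, with synthetic arrays standing in for the real epochs and classifier outputs; the shapes follow the text (n × 32 × 125 reduced to n × 480).

```python
# Window-averaged featurization and mean-threshold labeling.
import numpy as np

n, c, t, n_windows = 300, 32, 125, 15
X = np.random.randn(n, c, t)

# Average voltages within (near-)equidistant time windows.
windows = np.array_split(np.arange(t), n_windows)
X_win = np.stack([X[:, :, w].mean(axis=2) for w in windows], axis=2)
X_flat = X_win.reshape(n, -1)             # concatenate: n x (c * 15)

# Convert predicted class probabilities to labels around their mean.
probs = np.random.rand(n)                 # stand-in for LDA outputs
labels = (probs > probs.mean()).astype(int)
```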

3.3.2 Brainsourcing Experiment

Research questions 2 and 3 concern whether or not "brainsourcing" is possible. First, it must be determined whether it is possible, by combining predictions from individual models, to achieve classification results that are better than those achieved using individual models alone. Second, if such improvements are possible, we should determine at what point using additional participants fails to improve the results significantly. The brainsourcing methodology, depicted graphically in the lower part of Figure 8, consists of taking class predictions produced by the individual classification models and averaging them to produce a crowd consensus label. The brainsourcing methodology is described in detail below.

Brainsourcing Setup Predictions are based on the estimated class probabilities produced by the individually trained LDA classifiers. These class probabilities are stored in a probability matrix A_{X×Y} for each task, where X denotes the participants and Y denotes the stimuli. Finally, crowd decisions are used to infer a label for a given stimulus: the estimations for a given stimulus are selected from N randomly chosen participants in X, and the mean of these estimations is computed to produce the crowdsourced estimation.

Brainsourcing Model From the probability matrix A, 100 unique datasets were created by drawing submatrices of A containing all participants and a selection of stimuli. The datasets were randomly downsampled in a manner which ensured each dataset was distinct and contained predictions for an equal number of target and non-target stimuli. For the brainsourcing step, a downsampled dataset was selected at random. Next, a stimulus column was chosen at random. From this, N random datapoints were sampled, each containing a class prediction for the selected stimulus. The mean of these individual predictions was computed and stored as the consensus probability. The selected stimulus column was then dropped from the downsampled dataset, and the sampling procedure was repeated until 250 brainsourced estimations were produced.

To determine the class labels, the same thresholding procedure used to determine binary labels for the individual classification models was used. The global mean for the entire iteration was computed; consensus probabilities greater than the mean indicated a target, while those falling below the mean indicated a non-target. After computing the binary labels, the brainsourcing procedure was repeated until a total of 100 iterations, with 250 brainsourcing estimations per iteration, were produced. The results of each iteration were compared with the ground truth labels to evaluate the performance of brainsourcing.

The brainsourcing steps were conducted in this manner to simulate many unique cohorts of individuals collaborating to produce crowd consensus labels for a set of stimuli.

Additionally, this technique further reduces the likelihood of introducing bias into our results (such as by selecting individuals who consistently perform well on the individual classification tasks).
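The core sampling-and-averaging step can be sketched in a few lines of NumPy; here A is a toy participants-by-stimuli probability matrix standing in for the experiment's actual data.

```python
# One brainsourced estimation: sample N participants' class
# probabilities for a stimulus and average them.
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((31, 250))                   # participants x stimuli

def consensus(A, stimulus, n_participants, rng):
    rows = rng.choice(A.shape[0], size=n_participants, replace=False)
    return A[rows, stimulus].mean()         # crowd probability estimate

probs = np.array([consensus(A, s, 4, rng) for s in range(A.shape[1])])
labels = (probs > probs.mean()).astype(int) # global-mean threshold
```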

3.3.3 Collaborative Filtering Experiment

Research question 4 concerns the use of collaborative filtering applied to the predictions produced by the LDA classifiers. We applied singular value decomposition (SVD) to the results from the single-trial ERP classification experiment to answer this question. SVD was selected for its relative simplicity yet robustness, with proven performance in collaborative filtering tasks [142]. Matrix factorization techniques have also been successfully used to improve the quality of crowdsourced labels [152]. As this experiment was a proof-of-concept designed to determine whether collaborative filtering in a BCI context may warrant further investigation, more advanced approaches were left unexplored.

Collaborative Filtering Setup The input data for the collaborative filtering approach are similar to the data used in the brainsourcing experiment, except that binarized class labels are used rather than class probabilities. Using a grand average over all outputs produced by the LDA classifiers, any predicted class probability above the mean was assigned a categorical label of '1' to indicate target, while all predicted class probabilities below or equal to the mean were assigned the categorical label of '0' to indicate non-target. This resulted in a matrix A_{X×Y} for each task, where X denotes the participants and Y denotes the stimuli. Using these task-specific result matrices, 16 unique matrices (henceforth referred to as SVD datasets) were constructed based on groups of four tasks, omitting the inverses of each task (recall that target stimuli for a given task may be used as non-target stimuli for the task's inverse). The SVD model itself was parameterized to preserve 5 latent factors and trained with stochastic gradient descent for 100 epochs. The remaining parameters were left at their defaults. Further details on the SVD implementation used can be found in the documentation for the Surprise Python library².

Collaborative Filtering Evaluation Procedure Using SVD as our matrix factorization method of choice, the following steps were performed to assess its validity as a predictive model for EEG-derived class labels (a code sketch follows the list), for each sparsity level S ∈ {0.70, 0.80, 0.90, 0.95, 0.98, 0.99}:

1. Select one of the 16 SVD datasets produced in Section 3.3.3.

2. Randomly remove observations until the total sparsity of the selected dataset is equal to S.

² https://surprise.readthedocs.io/en/stable/matrix_factorization.html

3. Train an SVD model using 80% of the data as examples. Predict values for the remaining 20% of the data and compare predictions with the ground truth to calculate F1 score.

4. Randomly permute the predicted labels from Step 3 and calculate the F1 score, comparing it to the F1 score calculated in Step 3. Repeat 1000 times to estimate a P value.

5. Repeat steps 2, 3, and 4 a total of 100 times, such that there are 100 sets of SVD predictions for each unique dataset and a P value determined via random label permutation.

6. Select another SVD dataset and repeat steps 2 through 5.

7. Repeat steps 1 through 6 for each sparsity level S.
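Steps 3 and 4 can be sketched with the Surprise library and a simple permutation loop; the DataFrame below is a random stand-in for an SVD dataset, and the column names are our own.

```python
# Surprise SVD training, F1 evaluation, and permutation p value.
import numpy as np
import pandas as pd
from sklearn.metrics import f1_score
from surprise import SVD, Dataset, Reader

df = pd.DataFrame({"user": np.random.randint(0, 31, 2000),
                   "item": np.random.randint(0, 250, 2000),
                   "label": np.random.randint(0, 2, 2000)})
train, test = df[:1600], df[1600:]

data = Dataset.load_from_df(train, Reader(rating_scale=(0, 1)))
algo = SVD(n_factors=5, n_epochs=100)
algo.fit(data.build_full_trainset())

pred = np.array([int(algo.predict(u, i).est >= 0.5)
                 for u, i in zip(test["user"], test["item"])])
f1 = f1_score(test["label"], pred)

# Permutation baseline: how often do shuffled predictions score as well?
perm = [f1_score(test["label"], np.random.permutation(pred))
        for _ in range(1000)]
p_value = np.mean(np.array(perm) >= f1)
```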

4 Results

In this section, we describe in detail the results from the neurophysiological experiment, the performance of the per-participant classification models, the performance of the brainsourcing models, and the performance of the collaborative filtering techniques applied to the dataset.

4.1 Neurophysiological Results

ERPs for target and non-target stimuli were averaged across all participants and analyzed by channel. Large differences in grand average scalp voltages between the target and non-target classes were observed at the Cp1, Cp2, Cz, P3, P4, and Pz sites, with the largest visible difference in average voltages found at the Pz site. This difference, and grand average scalp voltages represented as topographical plots, are provided in Figure 9. Target images are associated with a strong scalp positivity, beginning at around 200 ms, peaking at ca. 280 ms, and continuing until approximately 600 ms. This suggests that target detections in general amplified the P300 component to images, conforming to the known psychophysiology literature [153, 154]. Additional cranial topographical plots, depicting changes in scalp voltage averaged in 100 ms bands from 0 ms (stimulus onset) to 500 ms (onset of the next stimulus), can be found in Figure 10.

Figure 9: Grand average voltages of each ERP component for target and non-target stimuli, and cranial topomaps of average scalp voltages from 250 ms to 800 ms, for the Cz (left) and Pz (right) channels. On the cranial topomap, the channel of interest is circled in green.


Figure 10: Grand average-based topographic scalp plots of ERPs from 0-100 ms, 100-200 ms, 300-400 ms, and 400-500 ms after stimulus onset. Top: ERPs associated with non-target stimuli. Bottom: ERPs associated with target stimuli minus ERPs associated with non-target stimuli.

4.2 Per-Participant Classification Results

Per-participant classifiers were tested against a random baseline, computed using label permutation [155, 156]. F1 scores for single participants averaged 0.67 across all tasks, with task "old" achieving the lowest F1 score at 0.63 and task "female" achieving the highest at 0.74. The classifiers of all participants performed better than the random baseline (p = 0.01), indicating that the classifiers had learned meaningful structure from the data. The results of the per-participant classification experiment, by task, are shown in Table 2.

Task          P     R     F1
Blond         0.77  0.64  0.70
Female        0.81  0.68  0.74
Young         0.72  0.59  0.65
Smiling       0.75  0.59  0.65
Dark-haired   0.76  0.60  0.67
Male          0.71  0.62  0.69
Old           0.71  0.57  0.63
Not smiling   0.74  0.59  0.66
Mean          0.75  0.61  0.67

Table 2: Mean precision (P), recall (R), and F1 scores for all tasks, for the per-participant LDA models (N = 1). Precision was consistently higher than recall, and overall performance for each task was significantly better than random.

While all individual models performed significantly better than a random baseline, the performance of each model (see Figure 11) also varied significantly from model to model. The best performing model achieved a mean F1 score of 0.87 and the worst performing model achieved a mean F1 score of 0.72.

Figure 11: F1 scores for overall task performance, by participant. Results for all participants were significantly better than those achieved via random label permutation (p < 0.001).

4.3 Brainsourcing Results

Table 3 shows the classifier performance for the brainsourcing experiment. This performance, with the random baseline computed using permuted labels, is depicted in Figure 13. As additional participants were added, performance improved significantly, with the largest incremental improvement occurring at N = 2, which had an average F1 score of 0.77, an improvement of 0.10 over N = 1. Significant performance improvements continued through N = 12, with an average F1 score of 0.94 across all tasks. Task "old" experienced the largest improvement in performance, with an F1 score of 0.98 at N = 12 and a ∆N=1 of 0.35. Conversely, task "young" improved the least, with an F1 score of 0.65 at N = 1 compared with 0.86 at N = 12, for a ∆N=1 of 0.21. Throughout the experiment, precision was significantly higher than recall. These differences were most pronounced with lower numbers of participants (N < 7); however, they became less pronounced as more participants were used (N ≥ 8), where precision and recall begin to stabilize for most tasks.

Using the Wilcoxon signed-rank test, results estimated using different numbers of participants N were compared to each other to determine whether the differences in estimations were statistically significant. Estimations calculated using N ≥ 2 participants were all significantly better than those from a single participant (N = 1). For values of N ≥ 2, statistically significant differences required a ∆N ≥ 4. That is, to achieve significantly better brainsourcing results compared to those achieved with two participants, at least four more participants must be added to the brainsourcing model, for a total of six participants used in calculating the estimations. A matrix depicting the significance results can be found in the lower right of Figure 13. A representative sample of labeling outputs with corresponding stimuli and varying N is shown in Figure 12, where performance can also be seen to improve significantly as the number of participants increases.

4.3.1 Characteristics of Brainsourcing Output

The stimuli used in this experiment were produced by sampling from a multivariate normal distribution. By grouping the stimuli into binary categories containing a target class and its inverse, we expected that this would lead to a bimodal distribution. Intuition suggests that a properly functioning brainsourcing model should output predictions that also follow a bimodal distribution. Histograms in Figure 14, produced from the results of the brainsourcing model, reveal that the predictions indeed follow a bimodal distribution, corresponding to the nature of the stimuli dataset. This bimodal property becomes more pronounced as the number of participants used to create the brainsourced estimation increases. The results indicate that as the size of the crowd increases, the confidence of the classifier predictions improves, as the predictions start to follow the distribution of the ground truth data more closely. This suggests that the improved empirical performance of brainsourcing can be attributed to increased confidence of the brainsourcing model.

Figure 12: Representative sample of correct and incorrect labelings for target and non-target stimuli within the task "blond", by number of participants used to estimate the true label. Brainsourcing classification performance continues to improve as the number of participants N increases. Performance at N > 9 achieves an F1 score of 0.90.

              N = 1              N = 2                   N = 4                   N = 6                   N = 12
Task          P    R    F1       P    R    F1   ∆N=1     P    R    F1   ∆N=1     P    R    F1   ∆N=1     P    R    F1   ∆N=1
Blond         0.77 0.64 0.70     0.84 0.75 0.79 0.09     0.91 0.84 0.87 0.17     0.95 0.89 0.92 0.22     0.98 0.95 0.97 0.27
Female        0.81 0.68 0.74     0.85 0.82 0.83 0.09     0.92 0.89 0.91 0.17     0.94 0.93 0.93 0.19     0.97 0.99 0.98 0.24
Young         0.72 0.59 0.65     0.77 0.71 0.74 0.09     0.82 0.76 0.79 0.14     0.85 0.78 0.82 0.17     0.85 0.87 0.86 0.21
Smiling       0.75 0.59 0.65     0.81 0.75 0.78 0.13     0.89 0.83 0.86 0.19     0.93 0.86 0.89 0.22     0.97 0.87 0.92 0.27
Dark-haired   0.76 0.60 0.67     0.81 0.74 0.78 0.11     0.90 0.82 0.86 0.19     0.94 0.86 0.89 0.22     0.98 0.92 0.95 0.28
Male          0.71 0.62 0.69     0.82 0.72 0.77 0.08     0.90 0.79 0.84 0.15     0.93 0.82 0.87 0.18     0.97 0.86 0.91 0.22
Old           0.71 0.57 0.63     0.76 0.69 0.72 0.09     0.82 0.69 0.76 0.13     0.89 0.75 0.81 0.18     0.96 0.99 0.98 0.35
Not smiling   0.74 0.59 0.66     0.79 0.74 0.76 0.10     0.87 0.78 0.82 0.16     0.91 0.87 0.88 0.22     0.97 0.94 0.96 0.30
Mean          0.75 0.61 0.67     0.81 0.74 0.77 0.10     0.88 0.80 0.82 0.15     0.92 0.85 0.88 0.21     0.96 0.92 0.94 0.27

Table 3: Precision, recall, F1 score, and the improvement of the F1 score with respect to N = 1 for the target task, given N participants used in the brainsourcing estimation. All ∆N=1 values are statistically significant (p ≤ 0.0001). Performance for each task improved dramatically as the number of participants used to estimate class labels for stimuli increased.

Figure 13: Distribution of mean predictions produced from the brainsourcing model. A brainsourcing prediction above the mean indicates that a given stimulus is more likely to be of the target class; conversely, a prediction below the mean indicates that a given stimulus is more likely a non-target stimulus. The results accurately reflect the bimodal nature of the stimuli data, and the distribution becomes increasingly distinct as more participants are used to estimate class labels.

4.3.2 Analysis of Sparsity and Crowd Size

Due to the study design, it was not practical to show all participants the same stimuli. This presented an upper limit to how many brainsourcing "workers" we could utilize in our estimations. Most stimuli for a given task were viewed by at least two participants, with the total sparsity of the cleaned data averaging 60%. Less than 3% of stimuli were viewed by only a single participant. These stimuli were equally distributed between the target and non-target classes and did not significantly affect model performance. After accounting for observations lost due to eye blinks and other sources of noise, an average of 11 participants viewed any given stimulus.

To avoid introducing bias into the results, we selected 12 participants as our upper limit for the number of brainsourcing "workers".

Figure 14: Distribution of mean predictions produced from the brainsourcing model. A brainsourcing prediction above the mean indicates that a given stimulus is more likely to be of the target class. The results correctly show the bimodal distribution of the classified stimuli, and this bimodal distribution becomes increasingly clear as data from more participants are used to estimate class labels.

4.4 Matrix Factorization Results

The matrix factorization approach was expected to perform better than random at lower sparsity levels, with performance declining as sparsity increased. These expectations were reflected in the results. In contrast with the LDA and brainsourcing classification models, which predicted the "true" class of a stimulus, the matrix factorization approach attempted to predict the user's response to a stimulus as characterized by the LDA classifier. Thus, it was also expected that the task-wise performance of the matrix factorization may be influenced by the quality of the LDA classifier outputs. F1 scores for the SVD model did depend significantly on the sparsity of the dataset, with the best overall performance achieved at 70% sparsity (a mean F1 score of 0.67) and the worst at 99% sparsity (a mean F1 score of 0.54). At 70% sparsity, the best performance was for task "male" (0.71) and the worst for tasks "young" and "old" (0.63). The results also seem to further reflect the inadequacies of the individual LDA classifiers, as tasks that achieved poor classification results with the LDA classifiers also achieved poor results with the SVD model. Boxplots by task and sparsity level are shown in Figure 15. Results by task and sparsity are provided in Table 4.

Figure 15: Left: SVD performance by sparsity level. Performance decreases at greater levels of sparsity. For sparsity levels of 0.98 and above, it could not be shown that the SVD model performed better than random. Right: SVD performance by task, at 70% sparsity. Performance for tasks "smile" and "not smiling" had particularly low variance with respect to the other tasks and could not be attributed to random chance.

Task          70%      80%      90%      95%     98%   99%
Blond         0.67**   0.65**   0.64***  0.60*   0.54  0.55
Female        0.69***  0.68***  0.64***  0.61*   0.56  0.56
Young         0.63**   0.61*    0.59*    0.58*   0.54  0.52
Smiling       0.66***  0.64***  0.61**   0.59*   0.53  0.53
Dark-haired   0.64*    0.62*    0.61*    0.59    0.52  0.53
Male          0.71***  0.71***  0.65***  0.62**  0.56  0.53
Old           0.63**   0.61*    0.60     0.58    0.54  0.54
Not smiling   0.67***  0.65***  0.63***  0.60**  0.53  0.54
Mean          0.67     0.64     0.62     0.60    0.55  0.54

Significance codes: ***p < 0.001, **p < 0.01, *p < 0.05

Table 4: Mean F1 scores for all tasks, by sparsity level. At 90% sparsity, all results performed better than random except for task "old". Results at 98% sparsity and above could not be determined to perform better than random for any task.

Individual results at 70% sparsity, shown in Figure 16, ranged from a maximum mean F1 score of 0.75 to a minimum mean F1 score of 0.57, with the results of 29 out of 31 participants being significantly better than random.³ At greater sparsity levels, performance dropped, as did statistical significance. At 90% sparsity, the results of 15 out of 31 participants were significantly better than random. Results at sparsity levels of 98% and above were not statistically significant.

Figure 16: SVD performance by participant at 70% sparsity.

³ Due to an oversight in the implementation of the subject anonymization script, subject numbers in Figure 16 do not correspond to subject numbers in Figure 11.

5 Discussion

In this section, we discuss the results presented in Section 4. We consider their implications in a broader context, discuss the limitations of the current work, and explore how it may be extended in the future. Additionally, we consider the broader social and ethical implications of brainsourcing and how it might be misused.

5.1 Summary of Contributions

In order to study whether brainsourcing is possible and how it performs, we asked four research questions. Here, we discuss the results accordingly.

RQ1: Can regularized LDA be used to classify EEG signals collected within a binary class framework from a facial recognition task?

A1: Our results show that EEG signals collected during a facial recognition task can be successfully classified into their binary labels using a simple regularized LDA model.

RQ2: Can EEG data collected across many individuals be combined in a crowdsourcing setting to predict a class label for a given stimulus, such that the collective performance is better than the performance of any individual user?

A2: Our results show that human brain responses can be utilized to collectively assign class labels to stimuli, and that the average performance of a crowd consisting of only two participants significantly outperforms results achieved using the average participant model.

RQ3: How many participants are enough to reliably predict a stimulus class, and at what point does the addition of participants fail to significantly improve the results?

A3: Brainsourcing demonstrates increasing performance as a function of crowd size; our results indicate significant improvements were achieved when the crowd size increased by at least three participants, with performance peaking at the maximum crowd size of 12 for all measures used. Our results suggest that even small crowds can be used in a brainsourcing setting to produce high quality output.

RQ4: Can collaborative filtering using matrix factorization on the individual classifier outputs produce reasonable results, such that it could be used to make class predictions for unseen stimuli for a given individual?

A4: Matrix factorization techniques show promise as a way to make predictions for data that may otherwise be lost due to noise or other artifacts. Currently, the technique applied to our data appears somewhat sensitive to sparsity, and does not perform well under sparsity conditions similar to those found in other collaborative filtering contexts. Our results show that it is worth further investigating the possibility that, as in content recommendation systems, BCI systems may be able to segment users into clusters based on general patterns in their neurological data.


5.2 Limitations

While valuable as a proof-of-concept demonstration, brainsourcing as presented in this thesis does suffer from several limitations. Currently, the approach exploits binary class datasets rather than multiclass datasets, and the stimuli used were selected so that they could easily be separated into target and non-target. Edge cases, such as androgynous-appearing faces for the male/female tasks, or light brown hair for the blond/dark-haired task, have been left unexplored. Additionally, we only investigated strict recognition tasks and relied on well-known BCI methodology [16]. More complex tasks, such as subjective preference assignment, are likely to prove more challenging for this approach.

In order to effectively utilize the oddball paradigm, the stimulus dataset was structured so that, during the experiment, approximately 70% of the stimuli presented were of the non-target class and the remaining 30% were of the target class. ERPs generated from frequent target stimuli are more difficult to distinguish because the relevant ERP components are selectively evoked by infrequent stimuli [78, 157, 82]. Effectively balanced data ensuring significantly more non-target than target stimuli cannot be guaranteed in real-world datasets, although this balancing could be partially achieved by randomly mixing already classified non-target stimuli into a new dataset.

Due to eye blinks, electrical interference, and other non-preventable disturbances, approximately 10% of the data collected were ultimately discarded. While discarding data is undesirable, the number of stimuli that can be shown to a participant in a single session (approximately 1 stimulus per 500 ms) compensates for the inevitable loss of data incurred by noise, artifacts, and subject fatigue. Brainsourcing achieves results using relatively few participants and can thus tolerate data loss due to artifacts.

Another significant limitation of our work with respect to practical applications is the devices used, setup time, and cost. Participants were connected to a 32-channel EEG system, which required approximately 20 minutes of physical preparatory work by a trained laboratory technician, followed by an additional 30 minutes to complete the recognition tasks and another 10 minutes of disassembly and clean-up. EEG systems are continually developing, with current research evolving to adopt techniques that drastically reduce setup time (e.g. flexible silicone technologies or active electrode amplifiers [158]). Innovations in EEG equipment [159] and other brain imaging technologies, such as functional near-infrared spectroscopy [160] and magnetoencephalography [161], are likely to push the boundaries of what brainsourcing can achieve.

It is also important to note that while our approach is collaborative, it relies on personalized models for each individual. It is not yet feasible to simply take the EEG signals from all participants and train a single model using these signals.

It is not yet feasible to simply take the EEG signals from all participants and train a single model on them. While some EEG potentials can be localized to common regions of the brain, individual differences in an array of physiological variables, such as the orientations of dipole sources, skull thickness [162, 163], and skin resistance, mean that EEG is specific to the individual and even to the recording setting. The differences between two or more subjects are therefore likely too large for such a model to perform well. The individual classification models currently enable us to avoid this problem.

From a machine learning perspective, this work is limited by the use of simple models. Rather than using state-of-the-art approaches, we reasoned that the success of less complex, potentially worse-performing models would provide stronger argumentative support when demonstrating a proof-of-concept. Ways in which this research could be extended are discussed in Section 5.3.

While applying matrix factorization techniques to the individual classifier outputs achieved performance better than random, it is not entirely clear whether the model is learning patterns related to an individual's personal responses or merely patterns already inherent in the data. This issue should be explored more fully using EEG data from a subjective preference experiment; successfully applying collaborative filtering techniques to EEG data collected in response to subjective, rather than objective, classification problems would be a more robust demonstration of its feasibility.
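As noted above, the oddball balance could be partially recreated for real-world data by mixing in stimuli that have already been classified as non-targets. A minimal sketch of this mixing follows; the function name, the stimulus-ID lists, and the worst-case assumption that every new stimulus is a target are illustrative choices, not part of the experimental pipeline.

import random

def balance_oddball(new_stimuli, known_nontargets, nontarget_ratio=0.7, seed=0):
    # Pad a new stimulus set with already-classified non-targets so that,
    # even if every new stimulus turns out to be a target, at least
    # nontarget_ratio of the presented stimuli are non-targets.
    rng = random.Random(seed)
    n_pad = int(len(new_stimuli) * nontarget_ratio / (1.0 - nontarget_ratio))
    if n_pad > len(known_nontargets):
        raise ValueError("not enough classified non-targets to pad with")
    padded = list(new_stimuli) + rng.sample(known_nontargets, n_pad)
    rng.shuffle(padded)
    return padded

For 30 unlabeled stimuli this pads in 70 known non-targets, so the presented sequence respects the approximate 70/30 split even in the worst case.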

5.3 Future Work

While brainsourcing yields promising results, it could be improved using a variety of techniques. More sophisticated machine learning approaches, which could better learn the structural differences between the ERP classes, could produce better performance outcomes, particularly when the number of available crowdworkers is low (N < 5). The current method takes the mean of multiple classifier outputs and assigns a label based on a simple threshold value, as sketched below; performance may improve significantly with a Bayesian approach.
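A minimal sketch of this mean-and-threshold aggregation is given below, under the assumption that each participant's classifier emits a probability for the target class; the 0.5 threshold and the example scores are illustrative rather than the tuned values from our experiments.

import numpy as np

def brainsource_label(classifier_outputs, threshold=0.5):
    # classifier_outputs: per-participant predicted probabilities that a
    # single stimulus belongs to the target class.
    crowd_score = float(np.mean(classifier_outputs))
    return int(crowd_score >= threshold), crowd_score

# A hypothetical crowd of four participants, three leaning towards "target".
label, score = brainsource_label(np.array([0.8, 0.6, 0.7, 0.3]))
print(label, score)  # -> 1 0.6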

Currently, the proposed approach only demonstrates the feasibility of using a supervised approach to train models for stimulus classification. Using unsupervised methods to cluster stimuli into their likely classes also warrants further exploration. The approach could also be extended by translating EEG signals from separate individuals across a common set of stimuli into a common latent space; this would enable us to combine translated EEG data or to train models per task, rather than per participant per task. The probabilistic methods proposed in [131] could likewise be explored within the context of brainsourcing.

Since the participants were explicitly asked to recognize features in the images, the experiment detailed in this thesis falls under explicit crowdsourcing according to the categorization in [32]: the participants were instructed to perform an explicit task. Future research could apply brainsourcing to an implicit crowdsourcing setting, such as recording brain signals associated with completing a gamified microtask. The task instructions in our experiment were to make a mental note when a target was observed; an approach based on completely passive reactions, in which subjects are presented with stimuli from a wide variety of classes and instructed only to observe, is worth exploring as well.

The P300 evoked potential utilized in our implementation of brainsourcing is produced when a stimulus matches an expected mental target [164]. In this thesis, we used the P300 evoked by images matching the target class to assist in training our model, yet the P300 does not require a specific stimulus type or modality. Indeed, the P300 has been shown to be produced by target stimuli presented through a variety of sensory modalities, including auditory [165], haptic [166], and olfactory [167]. Thus, the brainsourcing paradigm could, in theory, be applied to classify data otherwise not easily accessible to a computer system, such as the smell or taste of a food or the feeling of a touch.

While the P300 is highly task-relevant [164], it only allows the solving of explicit microtasks; participants must be aware of the task in order to elicit a P300 strong enough to be utilized in a classification setting. However, it should be possible to also leverage brain potentials that occur without explicit conscious control. The N170, for example, could be targeted in a face detection task (rather than in distinguishing between two categories of faces, as presented in this thesis). Another implicitly valuable potential is the N400, which has been associated with the violation of expectation in linguistic contexts [168]. This attribute could be utilized by brainsourcing to detect semantic incongruities within text or speech by instructing participants simply to read text or listen to audio while their brain activity is recorded. In this way, the participants would not be explicitly solving a crowdsourced task, but would accomplish it as a side product of reading.

Yet another line of research would be to extend the tasks beyond simple recognition and classification, restructuring them so that brainsourcing could allow crowds and computers to detect socially cohesive opinions, or even to detect crowds with contradicting views or internally dissonant opinions. Our current results show greater observed variance among specific tasks and disparate improvements in performance as more brainsourcing participants were added to the model. The largest task-wise performance increase was in the task "Old" and the smallest improvement was in the task "Young". One possible explanation is that these tasks were the most subject to individual interpretation: the definition of an old or a young person may depend heavily on the participant's age, culture, and gender. This suggests that brainsourcing may have potential for unlocking subjective, implicit cohesion and divergence of opinions, emotions, and even attitudes [169, 170, 171, 172].

From the perspective of collaborative filtering, SVD is one of many techniques that may be used. While SVD has achieved competitive results, new approaches have been introduced since SVD-based methods were popularized by the Netflix Prize in 2007-2009, such as systems that utilize autoencoder architectures [173]. This thesis explores only a single technique among the many that have proven effective.

A significant challenge for collaborative filtering approaches, within the context of EEG studies, is the relative lack of data. Typical collaborative filtering datasets, while much sparser than the data used in this thesis, contain information pertaining to hundreds of thousands or millions of users, which reduces the negative effect of high sparsity. Unfortunately, constructing such datasets is infeasible within a laboratory setting, and until BCIs become ubiquitous, applying collaborative filtering techniques to a much larger EEG-derived dataset is not yet possible. Nonetheless, data from cheaper consumer-grade EEG headsets, which offer much lower spatial resolution, may present a unique challenge, provided that data can be collected from a much larger population; achieving similar collaborative filtering results with data from far less sophisticated equipment would be a significant contribution.
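To make the factorization step concrete, the sketch below implements a Funk-style stochastic gradient descent factorization in the spirit of [144], fitted only to the observed entries of a participants-by-stimuli matrix of classifier outputs; entries missing due to artifacts are then predicted from the learned factors. The matrix shapes, hyperparameters, and function name are illustrative assumptions, not the exact SVD configuration used in our experiments.

import numpy as np

def funk_mf(R, mask, k=8, lr=0.01, reg=0.05, n_epochs=200, seed=0):
    # R:    (participants x stimuli) matrix of classifier outputs; entries
    #       where mask is False are missing and ignored during training.
    # Returns the dense reconstruction P @ Q.T, whose values at the
    # missing positions serve as predicted responses.
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    P = 0.1 * rng.standard_normal((n_users, k))
    Q = 0.1 * rng.standard_normal((n_items, k))
    rows, cols = np.nonzero(mask)
    for _ in range(n_epochs):
        for u, i in zip(rows, cols):
            pu = P[u].copy()
            err = R[u, i] - pu @ Q[i]
            P[u] += lr * (err * Q[i] - reg * pu)
            Q[i] += lr * (err * pu - reg * Q[i])
    return P @ Q.T

Because the updates touch only observed cells, the sparsity sensitivity discussed above manifests directly: with too few observed entries per participant, the factors P and Q are underdetermined and the predicted responses degrade.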

5.4 Social and Ethical Concerns

Like many digital technologies, BCI systems can be misused, applied unethically, or compromised by malicious agents. As modern societies become increasingly digitized and the technologies on which BCI systems are based continue to improve, BCI systems are increasingly likely to become as commonplace as the keyboard, mouse, and touchscreen of 2019.

The nature of BCI applications leaves them particularly vulnerable to exploitation. The data collected via EEG can be used to infer neurological conditions, individual preferences, and behavioral patterns that might otherwise go undetected by the typical person. It is therefore important to explore the potential ramifications of widespread adoption of BCI systems, as well as the potential ways in which brainsourcing might be used unethically.

5.4.1 BCI and Personal Privacy

Owing to the private nature of EEG data, ensuring that an individual's raw EEG data is protected from external sources is a primary concern when designing BCI systems. EEG data has been used to diagnose a variety of neurological conditions, and can also be used to infer the risk of developing certain neurological problems. Such information could be exploited by insurance companies to deny coverage to prospective customers without their knowledge, or by employers interested in hiring employees with fewer existing medical conditions in order to reduce the risk of incurring financial losses caused by less healthy (and potentially less productive) employees.

5.4.2 Consequences of Ubiquitous BCI

In a society where BCI applications are as common as present-day smartphones, novel ways in which these technologies can be used unethically may emerge. Governments, corporations, and criminal organizations may use these systems in ways that violate individual rights to privacy or autonomy. New labor paradigms may emerge that exploit the convenience of these systems and the widespread availability of brainsourcing workers. While frameworks to facilitate labor rights among crowdworkers have been proposed, such as Turkopticon [174], these must be adapted to account for the unique challenges that BCI-based labor introduces. In this section, we expand upon existing work involving crowdworker rights and design fictions within the context of ubiquitous BCI adoption [175, 176], and discuss how the techniques presented in this thesis may be used or extended in ways that would be deemed unethical or otherwise immoral.

Subliminal probing It has been demonstrated that private information (specifically, recognition of a face) can be obtained from individuals using a BCI system without their knowledge or consent, through the use of subliminal probing techniques [177]. Such techniques may be extended or modified in ways that could expose PIN codes, passwords, and private social relationships. Corporations or political campaigns could secretly probe consumer or voter attitudes to better market their products [178, 179]. Governments and corporations may be tempted to conduct BCI-driven surveillance to monitor their citizens or employees, and could enhance their findings using brainsourcing techniques.

Surveillance Systems In 2018, several articles were published [180, 181] documenting the use of brain monitoring systems by Chinese businesses, namely Hangzhou Zhongheng Electric and State Grid Zhejiang Electric Power, where the technology was first introduced in 2014. The technology has since become widespread in China, deployed in a variety of environments and working conditions. As of 2018, Changhai Hospital in Shanghai was actively collaborating with Fudan University to develop similar brain monitoring devices to monitor patient emotions and mental states in order to prevent potentially violent incidents. Such implementations, while not BCI systems in themselves, exploit advancements in the BCI field. The ethics of these systems is highly suspect, as it is not difficult to imagine scenarios in which they violate fundamental human rights to personal autonomy and privacy.

Medical fingerprinting and privacy EEG data should be considered personal medical data, and protecting it is particularly important because EEG data can be used to diagnose neurological conditions [182, 43]. EEG data can also serve as a biometric identifier [183, 184], and thus could be used to identify an otherwise anonymous user if similar data is made publicly available (e.g., via a security breach in a BCI application that collects EEG data). Privacy-preserving techniques for BCI applications need to be designed so that the raw user data collected cannot be obtained by third parties without the individual's explicit knowledge and consent. Furthermore, models produced using this sensitive information should be safeguarded so that the data used to build them cannot be reverse-engineered [185].

Abuse of microtask labor There is a risk that this system is abused through the establishment of businesses and organizations that deploy it unethically. Brainsourcing can be conducted using healthy individuals with minimal training; a typical worker can be trained in under 15 minutes. Companies that perform brainsourcing for financial gain may have reduced incentives to maintain the mental and physical welfare of their employees, who require little training and are thus more likely to be treated as expendable commodities. Widespread adoption of consumer-level BCI systems could result in brainsourcing tasks being completed by the average computer user; under such circumstances, brainsourcing tasks could serve as an alternative to existing online monetization methods. Such an approach could have negative consequences when combined with subliminal probing, with users at risk of unwittingly consenting to their sensitive data being transmitted and utilized.

6 Conclusion

The objective of our research was to study the feasibility of using EEG to classify binary stimuli and to extend this application to a crowdsourcing setting. We also investigated how collaborative filtering techniques could be applied to the classifier results.

Previous work has relied upon categorizing stimuli from distinct categories (inanimate object vs. person) [186], rather than from the same metacategory (faces) with varying salient features (hair color, apparent age, smiling or not). We extended findings from previous work and demonstrated that such binary classification with EEG data is possible even when stimuli belong to the same metacategory.

We presented brainsourcing: a novel crowdsourcing paradigm that utilizes the brain responses of a group of human contributors, each performing a recognition task, to determine classes of stimuli. An experiment utilizing brainsourcing was reported for an image labeling task. Compared to individual classifiers, significant classification performance gains were achieved using brainsourcing, showing that it is indeed possible to achieve competitive classification results using relatively simple techniques for combining multiple predictions from several individuals. Performance improvements achieved using brainsourcing approach near-perfect accuracy at around 12 participants, with a mean F1 score of 0.96 across all tasks.

We also explored the use of collaborative filtering techniques to make predictions for stimuli for which EEG data is unavailable. The results were significantly better than random: we predicted statistically significant responses for 29 out of 31 individuals. The results warrant further exploration to determine whether clusters have been identified within the data, or whether the matrix factorization technique has learned latent information about the stimuli captured from the predicted responses.

Crowdsourcing via human brain signals is a promising paradigm, and may offer a better way to rapidly mine information from individuals. While currently limited by its poor spatial resolution, the high temporal resolution and overall greater dimensionality of EEG may allow more information to be captured than is otherwise possible through existing implicit and explicit approaches.

References

1 “Measuring digital development: Facts and figures 2019,” International Telecom- munication Union, 2019.

2 K. B. Sheehan and M. Pittman, Amazon’s Mechanical Turk for academics: The HIT handbook for social science research. Melvin & Leigh, Publishers, 2016.

3 O. F. Zaidan and C. Callison-Burch, "Crowdsourcing translation: Professional quality from non-professionals," in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, pp. 1220–1229, Association for Computational Linguistics, 2011.

4 M. Moyle, J. Tonra, and V. Wallace, “Manuscript transcription by crowdsourcing: Transcribe bentham,” Liber Quarterly, vol. 20, no. 3-4, 2011.

5 G. Jin, J. Lee, M. Luca, et al., "Aggregation of consumer ratings: an application to yelp.com," Quantitative Marketing and Economics, vol. 16, no. 3, pp. 289–339, 2018.

6 M. Amin-Naseri, P. Chakraborty, A. Sharma, S. B. Gilbert, and M. Hong, "Evaluating the reliability, coverage, and added value of crowdsourced traffic incident reports from waze," Transportation research record, vol. 2672, no. 43, pp. 34–43, 2018.

7 D. Kelly and J. Teevan, "Implicit feedback for inferring user preference: a bibliography," in Acm Sigir Forum, vol. 37, pp. 18–28, ACM New York, NY, USA, 2003.

8 M. Melucci and R. W. White, “Discovering hidden contextual factors for implicit feedback,” in Held in conjunction with the 6th International and Interdisciplinary Conference on Modeling and Using Context, p. 69, 2007.

9 I. Arapakis, K. Athanasakos, and J. M. Jose, "A comparison of general vs personalised affective models for the prediction of topical relevance," in Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval, pp. 371–378, 2010.

10 J. Salojärvi, I. Kojo, J. Simola, and S. Kaski, "Can relevance be inferred from eye movements in information retrieval," in Proceedings of WSOM, vol. 3, pp. 261–266, 2003.

11 O. Barral, I. Kosunen, T. Ruotsalo, M. M. Spapé, M. J. Eugster, N. Ravaja, S. Kaski, and G. Jacucci, "Extracting relevance and affect information from physiological text annotation," User Modeling and User-Adapted Interaction, vol. 26, no. 5, pp. 493–520, 2016.

12 J. M. Rzeszotarski, E. Chi, P. Paritosh, and P. Dai, "Inserting micro-breaks into crowdsourcing workflows," in First AAAI Conference on Human Computation and Crowdsourcing, 2013.

13 Y. Moshfeghi, L. R. Pinto, F. E. Pollick, and J. M. Jose, "Understanding relevance: An fmri study," in European conference on information retrieval, pp. 14–25, Springer, 2013.

14 M. J. Eugster, T. Ruotsalo, M. M. Spapé, I. Kosunen, O. Barral, N. Ravaja, G. Jacucci, and S. Kaski, "Predicting term-relevance from brain signals," in Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval, SIGIR '14, (New York, NY, USA), pp. 425–434, ACM, 2014.

15 M. J. A. Eugster, T. Ruotsalo, M. M. A. Spapé, O. Barral, N. Ravaja, G. Jacucci, and S. Kaski, “Natural brain-information interfaces: Recommending information by relevance inferred from human brain signals,” Scientific Reports, vol. 6, 2016.

16 B. Blankertz, S. Lemm, M. Treder, S. Haufe, and K.-R. Müller, “Single-trial analysis and classification of erp components - a tutorial,” NeuroImage, vol. 56, no. 2, pp. 814–825, 2011.

17 K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” in Proceedings of the IEEE international conference on computer vision, pp. 1026–1034, 2015.

18 W. Tang and M. Lease, "Semi-supervised consensus labeling for crowdsourcing," in SIGIR 2011 workshop on crowdsourcing for information retrieval (CIR), pp. 1–6, 2011.

19 P. G. Ipeirotis, F. Provost, and J. Wang, "Quality management on amazon mechanical turk," in Proceedings of the ACM SIGKDD workshop on human computation, pp. 64–67, 2010.

20 E. Real, J. Shlens, S. Mazzocchi, X. Pan, and V. Vanhoucke, "Youtube-boundingboxes: A large high-precision human-annotated data set for object detection in video," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5296–5305, 2017.

21 D. C. Brabham, “Crowdsourcing as a model for problem solving: An introduction and cases,” Convergence: The International Journal of Research into New Media Technologies, vol. 14, no. 1, pp. 75–90, 2008.

22 M. Yuen, I. King, and K. Leung, "A survey of crowdsourcing systems," in 2011 IEEE Third International Conference on Privacy, Security, Risk and Trust and 2011 IEEE Third International Conference on Social Computing, pp. 766–773, Oct 2011.

23 J. Howe, “The rise of crowdsourcing,” Wired Magazine, vol. 14, 06 2006.

24 J. Howe, Crowdsourcing: Why the Power of the Crowd Is Driving the Future of Business. New York, NY, USA: Crown Publishing Group, 1 ed., 2008.

25 G. B. Schmidt, “Fifty days an mturk worker: The social and motivational context for amazon mechanical turk workers,” Industrial and Organizational Psychology, vol. 8, no. 2, pp. 165–171, 2015.

26 E. Estellés-Arolas and F. González-Ladrón-De-Guevara, “Towards an integrated crowdsourcing definition,” J. Inf. Sci., vol. 38, pp. 189–200, Apr. 2012.

27 A. Kittur, B. Smus, S. Khamkar, and R. E. Kraut, “Crowdforge: Crowdsourcing complex work,” in Proceedings of the 24th annual ACM symposium on User interface software and technology, pp. 43–52, 2011.

28 J. Cheng, J. Teevan, S. T. Iqbal, and M. S. Bernstein, "Break it down: A comparison of macro- and microtasks," in Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, pp. 4061–4064, ACM, 2015.

29 A. Kovashka, O. Russakovsky, L. Fei-Fei, and K. Grauman, “Crowdsourcing in computer vision,” Foundations and Trends in Computer Graphics and Vision, vol. 10, no. 3, pp. 177–243, 2016.

30 N. W. Kim, Z. Bylinskii, M. A. Borkin, K. Z. Gajos, A. Oliva, F. Durand, and H. Pfister, “Bubbleview: An interface for crowdsourcing image importance maps and tracking visual attention,” ACM Trans. Comput.-Hum. Interact., vol. 24, pp. 36:1–36:40, Nov. 2017.

31 J. Deng, J. Krause, and L. Fei-Fei, “Fine-grained crowdsourcing for fine-grained recognition,” in Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition, CVPR ’13, (Washington, DC, USA), pp. 580–587, IEEE Computer Society, 2013.

32 A. Doan, R. Ramakrishnan, and A. Y. Halevy, “Crowdsourcing systems on the world-wide web,” Communications of the ACM, vol. 54, no. 4, pp. 86–96, 2011.

33 L. von Ahn and L. Dabbish, "Labeling images with a computer game," in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '04, (New York, NY, USA), pp. 319–326, ACM, 2004.

34 L. von Ahn, B. Maurer, C. McMillen, D. Abraham, and M. Blum, “recaptcha: Human-based character recognition via web security measures,” Science, vol. 321, no. 5895, pp. 1465–1468, 2008.

35 C. H. Lin, E. Kamar, and E. Horvitz, "Signals in the silence: Models of implicit feedback in a recommendation system for crowdsourcing," in Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, AAAI'14, pp. 908–914, AAAI Press, 2014.

36 M. Lease and E. Yilmaz, "Crowdsourcing for information retrieval," SIGIR Forum, vol. 45, pp. 66–75, Jan. 2012.

37 X. Chen, P. N. Bennett, K. Collins-Thompson, and E. Horvitz, "Pairwise ranking aggregation in a crowdsourced setting," in Proceedings of the Sixth ACM International Conference on Web Search and Data Mining, WSDM '13, (New York, NY, USA), pp. 193–202, ACM, 2013.

38 K. Athukorala, D. Głowacka, G. Jacucci, A. Oulasvirta, and J. Vreeken, “Is exploratory search different? a comparison of information search behavior for exploratory and lookup tasks,” Journal of the Association for Information Sci- ence and Technology, vol. 67, no. 11, pp. 2635–2651, 2016.

39 F. Khatib, F. DiMaio, S. Cooper, M. Kazmierczyk, M. Gilski, S. Krzywda, H. Zabranska, I. Pichova, J. Thompson, Z. Popović, et al., "Crystal structure of a monomeric retroviral protease solved by protein folding game players," Nature structural & molecular biology, vol. 18, no. 10, pp. 1175–1177, 2011.

40 C. B. Eiben, J. B. Siegel, J. B. Bale, S. Cooper, F. Khatib, B. W. Shen, B. L. Stoddard, Z. Popovic, and D. Baker, “Increased diels-alderase activity through backbone remodeling guided by foldit players,” Nature biotechnology, vol. 30, no. 2, p. 190, 2012.

41 L. F. Haas, "Hans berger (1873–1941), richard caton (1842–1926), and electroencephalography," Journal of Neurology, Neurosurgery & Psychiatry, vol. 74, no. 1, pp. 9–9, 2003.

42 A. Coenen, E. Fine, and O. Zayachkivska, “Adolf beck: a forgotten pioneer in electroencephalography,” Journal of the History of the Neurosciences, vol. 23, no. 3, pp. 276–286, 2014.

43 S. Smith, “Eeg in the diagnosis, classification, and management of patients with epilepsy,” Journal of Neurology, Neurosurgery & Psychiatry, vol. 76, no. suppl 2, pp. ii2–ii7, 2005.

44 M. L. Perlis, H. Merica, M. T. Smith, and D. E. Giles, “Beta eeg activity and insomnia,” Sleep medicine reviews, vol. 5, no. 5, pp. 365–376, 2001.

45 E. L. Merrin, T. C. Floyd, and G. Fein, “Eeg coherence in unmedicated schizophrenic patients,” Biological psychiatry, vol. 25, no. 1, pp. 60–66, 1989.

46 M. Murias, S. J. Webb, J. Greenson, and G. Dawson, “Resting state cortical connectivity reflected in eeg coherence in individuals with autism,” Biological psychiatry, vol. 62, no. 3, pp. 270–273, 2007.

47 R. Bernier, G. Dawson, S. Webb, and M. Murias, “Eeg mu rhythm and imitation impairments in individuals with autism spectrum disorder,” Brain and cognition, vol. 64, no. 3, pp. 228–237, 2007.

48 J.-P. Banquet, "Spectral analysis of the eeg in meditation," Electroencephalography and clinical neurophysiology, vol. 35, no. 2, pp. 143–151, 1973.

49 E. Rodin and E. Luby, “Effects of lsd-25 on the eeg and photic evoked responses,” Archives of general psychiatry, vol. 14, no. 4, pp. 435–441, 1966.

50 R. Rynearson, M. Wilson Jr, and R. Bickford, "Psilocybin-induced changes in psychologic function, electroencephalogram, and light-evoked potentials in human subjects," in Mayo Clinic Proceedings, vol. 43, p. 191, 1968.

51 J. Acosta-Urquidi, “Qeeg studies of the acute effects of the visionary tryptamine dmt,” Cosmos and History: The Journal of Natural and Social Philosophy, vol. 11, no. 2, pp. 115–129, 2015.

52 A. Fink, R. H. Grabner, M. Benedek, G. Reishofer, V. Hauswirth, M. Fally, C. Neuper, F. Ebner, and A. C. Neubauer, “The creative brain: Investigation of brain activity during creative problem solving by means of eeg and fmri,” Human brain mapping, vol. 30, no. 3, pp. 734–748, 2009.

53 S. Frisch, M. Schlesewsky, D. Saddy, and A. Alpermann, "The p600 as an indicator of syntactic ambiguity," Cognition, vol. 85, no. 3, pp. B83–B92, 2002.

54 M. Kutas and K. D. Federmeier, “Thirty years and counting: finding meaning in the n400 component of the event-related brain potential (erp),” Annual review of psychology, vol. 62, pp. 621–647, 2011.

55 P. C. Petrantonakis and L. J. Hadjileontiadis, “Emotion recognition from eeg using higher order crossings,” IEEE Transactions on Information Technology in Biomedicine, vol. 14, no. 2, pp. 186–197, 2009.

56 Y.-P. Lin, C.-H. Wang, T.-P. Jung, T.-L. Wu, S.-K. Jeng, J.-R. Duann, and J.-H. Chen, “Eeg-based emotion recognition in music listening,” IEEE Transactions on Biomedical Engineering, vol. 57, no. 7, pp. 1798–1806, 2010.

57 H. H. Jasper, “The ten-twenty electrode system of the international federation,” Electroencephalogr. Clin. Neurophysiol., vol. 10, pp. 370–375, 1958.

58 S. J. Luck, An introduction to the event-related potential technique. MIT press, 2014.

59 P. L. Nunez, R. Srinivasan, et al., Electric fields of the brain: the neurophysics of EEG. Oxford University Press, USA, 2006.

60 F. Amzica and M. Steriade, “Electrophysiological correlates of sleep delta waves,” Electroencephalography and clinical neurophysiology, vol. 107, no. 2, pp. 69–83, 1998.

61 J. Winson, “Patterns of hippocampal theta rhythm in the freely moving rat,” Electroencephalography and clinical neurophysiology, vol. 36, pp. 291–301, 1974.

62 J. O'Keefe and M. L. Recce, "Phase relationship between hippocampal place units and the eeg theta rhythm," Hippocampus, vol. 3, no. 3, pp. 317–330, 1993.

63 M. E. Hasselmo, "What is the function of hippocampal theta rhythm? Linking behavioral data to phasic properties of field potential and unit recording data," Hippocampus, vol. 15, no. 7, pp. 936–949, 2005.

64 R. E. Dustman, R. S. Boswell, and P. B. Porter, “Beta brain waves as an index of alertness,” Science, vol. 137, no. 3529, pp. 533–534, 1962.

65 C. Vanderwolf, “Are neocortical gamma waves related to consciousness?,” Brain research, vol. 855, no. 2, pp. 217–224, 2000.

66 C. S. Herrmann and A. Mecklinger, "Gamma activity in human eeg is related to highspeed memory comparisons during object selective attention," Visual Cognition, vol. 8, no. 3-5, pp. 593–608, 2001.

67 J. Yordanova, V. Kolev, and T. Demiralp, "The phase-locking of auditory gamma band responses in humans is sensitive to task processing," NeuroReport, vol. 8, no. 18, pp. 3999–4004, 1997.

68 J. S. Kwon, B. F. O'Donnell, G. V. Wallenstein, R. W. Greene, Y. Hirayasu, P. G. Nestor, M. E. Hasselmo, G. F. Potts, M. E. Shenton, and R. W. McCarley, "Gamma frequency–range abnormalities to auditory stimulation in schizophrenia," Archives of general psychiatry, vol. 56, no. 11, pp. 1001–1005, 1999.

69 J. Van Deursen, E. Vuurman, F. Verhey, V. van Kranen-Mastenbroek, and W. Riedel, "Increased eeg gamma band activity in alzheimer's disease and mild cognitive impairment," Journal of neural transmission, vol. 115, no. 9, pp. 1301–1311, 2008.

70 G. G. Yener and E. Başar, “Brain oscillations as biomarkers in neuropsychiatric disorders: following an interactive panel discussion and synopsis,” in Supplements to Clinical neurophysiology, vol. 62, pp. 343–363, Elsevier, 2013.

71 P. J. Fitzgerald and B. O. Watson, “Gamma oscillations as a biomarker for major depression: an emerging topic,” Translational psychiatry, vol. 8, no. 1, pp. 1–7, 2018.

72 M. Plöchl, J. P. Ossandón, and P. König, "Combining eeg and eye tracking: identification, characterization, and correction of eye movement artifacts in electroencephalographic data," Frontiers in human neuroscience, vol. 6, p. 278, 2012.

73 P. J. Allen, G. Polizzi, K. Krakow, D. R. Fish, and L. Lemieux, “Identification of eeg events in the mr scanner: the problem of pulse artifact and a method for its subtraction,” Neuroimage, vol. 8, no. 3, pp. 229–239, 1998.

74 G. R. Mangun and S. A. Hillyard, "Modulations of sensory-evoked brain potentials indicate changes in perceptual processing during visual-spatial priming," Journal of Experimental Psychology: Human perception and performance, vol. 17, no. 4, p. 1057, 1991.

75 H. G. Vaughan Jr, L. D. Costa, and W. Ritter, “Topography of the human motor potential,” Electroencephalography and clinical neurophysiology, vol. 25, no. 1, pp. 1–10, 1968.

76 L. Deecke, P. Scheid, and H. H. Kornhuber, "Distribution of readiness potential, pre-motion positivity, and motor potential of the human cerebral cortex preceding voluntary finger movements," Experimental brain research, vol. 7, no. 2, pp. 158–168, 1969.

77 F. T. Smulders, J. O. Miller, S. Luck, et al., “The lateralized readiness potential,” The Oxford handbook of event-related potential components, pp. 209–229, 2012.

78 S. Sutton, M. Braren, J. Zubin, and E. R. John, “Evoked-potential correlates of stimulus uncertainty,” Science, vol. 150, no. 3700, pp. 1187–1188, 1965.

79 C. C. Duncan-Johnson and E. Donchin, “On quantifying surprise: The variation of event-related potentials with subjective probability,” Psychophysiology, vol. 14, no. 5, pp. 456–467, 1977.

80 N. K. Squires, K. C. Squires, and S. A. Hillyard, "Two varieties of long-latency positive waves evoked by unpredictable auditory stimuli in man," Electroencephalography and clinical neurophysiology, vol. 38, no. 4, pp. 387–401, 1975.

81 J. B. Isreal, G. L. Chesney, C. D. Wickens, and E. Donchin, "P300 and tracking difficulty: Evidence for multiple resources in dual-task performance," Psychophysiology, vol. 17, no. 3, pp. 259–273, 1980.

82 J. Polich, "Updating p300: an integrative theory of p3a and p3b," Clinical neurophysiology, vol. 118, no. 10, pp. 2128–2148, 2007.

83 S. S. Ball, J. T. Marsh, G. Schubarth, W. S. Brown, and R. Strandburg, "Longitudinal p300 latency changes in alzheimer's disease," Journal of Gerontology, vol. 44, no. 6, pp. M195–M200, 1989.

84 J. Polich, L. Howard, and A. Starr, “P300 latency correlates with digit span,” Psychophysiology, vol. 20, no. 6, pp. 665–669, 1983.

85 L. Kangassalo, M. Spapé, G. Jacucci, and T. Ruotsalo, "Why do users issue good queries?: Neural correlates of term specificity," in Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR'19, (New York, NY, USA), pp. 375–384, ACM, 2019.

86 D. Jeffreys, "A face-responsive potential recorded from the human scalp," Experimental brain research, vol. 78, no. 1, pp. 193–202, 1989.

87 D. Carmel and S. Bentin, "Domain specificity versus expertise: factors influencing distinct processing of faces," Cognition, vol. 83, no. 1, pp. 1–29, 2002.

88 B. Rossion and C. Jacques, "The oxford handbook of event-related potential components," 2012.

89 A. S. Ghuman, N. M. Brunet, Y. Li, R. O. Konecky, J. A. Pyles, S. A. Walls, V. Destefino, W. Wang, and R. M. Richardson, "Dynamic encoding of face information in the human fusiform gyrus," Nature communications, vol. 5, p. 5672, 2014.

90 T. Allison, A. Puce, D. D. Spencer, and G. McCarthy, "Electrophysiological studies of human face perception. i: Potentials generated in occipitotemporal cortex by face and non-face stimuli," Cerebral cortex, vol. 9, no. 5, pp. 415–430, 1999.

91 I. Deffke, T. Sander, J. Heidenreich, W. Sommer, G. Curio, L. Trahms, and A. Lueschow, "Meg/eeg sources of the 170-ms response to faces are co-localized in the fusiform gyrus," Neuroimage, vol. 35, no. 4, pp. 1495–1501, 2007.

92 J. W. Tanaka and T. Curran, "A neural basis for expert object recognition," Psychological science, vol. 12, no. 1, pp. 43–47, 2001.

93 B. Rossion, C.-C. Kung, and M. J. Tarr, "Visual expertise with nonface objects leads to competition with the early perceptual processing of faces in the human occipitotemporal cortex," Proceedings of the National Academy of Sciences, vol. 101, no. 40, pp. 14521–14526, 2004.

94 B. Rossion, D. Collins, V. Goffaux, and T. Curran, "Long-term expertise with artificial objects increases visual competition with early face categorization processes," Journal of Cognitive Neuroscience, vol. 19, no. 3, pp. 543–555, 2007.

95 J. J. Vidal, "Toward direct brain-computer communication," Annual review of Biophysics and Bioengineering, vol. 2, no. 1, pp. 157–180, 1973.

96 J. J. Vidal, "Real-time detection of brain events in eeg," Proceedings of the IEEE, vol. 65, no. 5, pp. 633–641, 1977.

97 E. C. Lee, J. C. Woo, J. H. Kim, M. Whang, and K. R. Park, "A brain–computer interface method combined with eye tracking for 3d interaction," Journal of neuroscience methods, vol. 190, no. 2, pp. 289–298, 2010.

98 R. Scherer, G. Müller-Putz, and G. Pfurtscheller, "Self-initiation of eeg-based brain–computer communication using the heart rate response," Journal of neural engineering, vol. 4, no. 4, p. L23, 2007.

99 G. Pfurtscheller, R. Leeb, D. Friedman, and M. Slater, "Centrally controlled heart rate changes during mental practice in immersive virtual environment: a case study with a tetraplegic," International journal of psychophysiology, vol. 68, no. 1, pp. 1–5, 2008.

100 M. Guirgis, T. Falk, S. Power, S. Blain, and T. Chau, "Harnessing physiological responses to improve nirs-based brain-computer interface performance," in Proc. ISSNIP Biosignals Biorobotics Conf, pp. 59–62, 2010.

101 V. S. Polikov, P. A. Tresco, and W. M. Reichert, “Response of brain tissue to chronically implanted neural electrodes,” Journal of neuroscience methods, vol. 148, no. 1, pp. 1–18, 2005.

102 M. Serruya and J. Donoghue, "Design principles of a neuromotor prosthetic device," in Neuroprosthetics: Theory and Practice (K. W. Horch and G. S. Dhillon, eds.), ch. 3, 2003.

103 F. Nijboer, B. Van De Laar, S. Gerritsen, A. Nijholt, and M. Poel, “Usability of three electroencephalogram headsets for brain–computer interfaces: a within subject comparison,” Interacting with computers, vol. 27, no. 5, pp. 500–511, 2015.

104 T. O. Zander and C. Kothe, "Towards passive brain–computer interfaces: applying brain–computer interface technology to human–machine systems in general," Journal of neural engineering, vol. 8, no. 2, p. 025005, 2011.

105 B. Blankertz, G. Dornhege, M. Krauledat, M. Schröder, J. Williamson, R. Murray-Smith, and K.-R. Müller, “The berlin brain-computer interface presents the novel mental typewriter hex-o-spell.,” 2006.

106 I. Kosunen, M. Salminen, S. Järvelä, A. Ruonala, N. Ravaja, and G. Jacucci, “Relaworld: neuroadaptive and immersive virtual reality meditation system,” in Proceedings of the 21st International Conference on Intelligent User Interfaces, pp. 208–217, 2016.

107 M. J. Khan and K.-S. Hong, “Passive bci based on drowsiness detection: an fnirs study,” Biomedical optics express, vol. 6, no. 10, pp. 4063–4078, 2015.

108 A. Kübler, B. Kotchoubey, J. Kaiser, J. R. Wolpaw, and N. Birbaumer, “Brain– computer communication: Unlocking the locked in.,” Psychological bulletin, vol. 127, no. 3, p. 358, 2001.

109 K. Li, R. Sankar, Y. Arbel, and E. Donchin, "Single trial independent component analysis for p300 bci system," in 2009 Annual International Conference of the IEEE Engineering in Medicine and Biology Society, pp. 4035–4038, IEEE, 2009.

110 G. Sharma, D. A. Friedenberg, N. Annetta, B. Glenn, M. Bockbrader, C. Majstorovic, S. Domas, W. J. Mysiw, A. Rezai, and C. Bouton, "Using an artificial neural bypass to restore cortical control of rhythmic movements in a human with quadriplegia," Scientific reports, vol. 6, p. 33807, 2016.

111 F. Cincotti, D. Mattia, F. Aloise, S. Bufalari, G. Schalk, G. Oriolo, A. Cherubini, M. G. Marciani, and F. Babiloni, "Non-invasive brain–computer interface system: towards its application as assistive technology," Brain research bulletin, vol. 75, no. 6, pp. 796–803, 2008.

112 K. Ferguson, Stephen Hawking: An Unfettered Mind. St. Martin’s Press, 2012.

113 D. E. Duncan, "A little device that's trying to read your thoughts," New York Times, 2012.

114 J. Medeiros, "How intel gave stephen hawking a voice," Wired, 2015.

115 P. Denman, L. Nachman, and S. Prasad, "Designing for "a" user: Stephen hawking's ui," in Proceedings of the 14th Participatory Design Conference: Short Papers, Interactive Exhibitions, Workshops - Volume 2, pp. 94–95, 2016.

116 J. R. Wolpaw, N. Birbaumer, D. J. McFarland, G. Pfurtscheller, and T. M. Vaughan, “Brain–computer interfaces for communication and control,” Clinical neurophysiology, vol. 113, no. 6, pp. 767–791, 2002.

117 A. Kubler, F. Nijboer, J. Mellinger, T. M. Vaughan, H. Pawelzik, G. Schalk, D. J. McFarland, N. Birbaumer, and J. R. Wolpaw, “Patients with ALS can use sensorimotor rhythms to operate a brain-computer interface,” Neurology, vol. 64, pp. 1775–1777, May 2005.

118 S. Borgheai, J. McLinden, A. Zisk, S. Hosni, R. Deligani, M. Abtahi, K. Mankodiya, and Y. Shahriari, "Enhancing communication for people in late-stage als using an fnirs-based bci system," IEEE transactions on neural systems and rehabilitation engineering: a publication of the IEEE Engineering in Medicine and Biology Society, 2020.

119 E. W. Sellers, T. M. Vaughan, and J. R. Wolpaw, “A brain-computer interface for long-term independent home use,” Amyotrophic lateral sclerosis, vol. 11, no. 5, pp. 449–455, 2010.

120 K. Palin, A. M. Feit, S. Kim, P. O. Kristensson, and A. Oulasvirta, "How do people type on mobile devices? observations from a study with 37,000 volunteers," in Proceedings of the 21st International Conference on Human-Computer Interaction with Mobile Devices and Services, pp. 1–12, 2019.

121 V. Dhakal, A. M. Feit, P. O. Kristensson, and A. Oulasvirta, “Observations on typing from 136 million keystrokes,” in Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, pp. 1–12, 2018.

122 O. E. Krigolson, C. C. Williams, A. Norton, C. D. Hassall, and F. L. Colino, “Choosing muse: Validation of a low-cost, portable eeg system for erp research,” Frontiers in neuroscience, vol. 11, p. 109, 2017.

123 G. Wiechert, M. Triff, Z. Liu, Z. Yin, S. Zhao, Z. Zhong, R. Zhaou, and P. Lingras, "Identifying users and activities with cognitive signal processing from a wearable headband," in 2016 IEEE 15th International Conference on Cognitive Informatics & Cognitive Computing (ICCI*CC), pp. 129–136, IEEE, 2016.

124 Z. Ni, A. C. Yuksel, X. Ni, M. I. Mandel, and L. Xie, "Confused or not confused? disentangling brain activity from eeg data using bidirectional lstm recurrent neural networks," in Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, pp. 241–246, 2017.

125 F. Lotte, M. Congedo, A. Lécuyer, F. Lamarche, and B. Arnaldi, “A review of classification algorithms for eeg-based brain–computer interfaces,” Journal of neural engineering, vol. 4, no. 2, p. R1, 2007.

126 F. Lotte, L. Bougrain, A. Cichocki, M. Clerc, M. Congedo, A. Rakotomamonjy, and F. Yger, “A review of classification algorithms for eeg-based brain–computer interfaces: a 10 year update,” Journal of neural engineering, vol. 15, no. 3, p. 031005, 2018.

127 R. A. Fisher, "The use of multiple measurements in taxonomic problems," Annals of eugenics, vol. 7, no. 2, pp. 179–188, 1936.

128 C. M. Bishop, “Pattern recognition and machine learning,” pp. 181–192, Springer, 2006.

129 K. Fukunaga, “Introduction to statistical pattern recognition,” pp. 445–466, Academic Press, 1990.

130 T. Hastie, R. Tibshirani, and J. Friedman, "The elements of statistical learning: data mining, inference, and prediction," pp. 106–119, Springer Science & Business Media, 2009.

131 H. Kang and S. Choi, “Bayesian common spatial patterns for multi-subject eeg classification,” Neural Networks, vol. 57, pp. 39–50, 2014.

132 D. J. Acunzo, G. MacKenzie, and M. C. van Rossum, “Systematic biases in early erp and erf components as a result of high-pass filtering,” Journal of neuroscience methods, vol. 209, no. 1, pp. 212–218, 2012.

133 T. P. Urbach and M. Kutas, “Interpreting event-related brain potential (erp) distributions: implications of baseline potentials and variability with application to amplitude normalization by vector scaling,” Biological psychology, vol. 72, no. 3, pp. 333–343, 2006.

134 S. Makeig, A. J. Bell, T.-P. Jung, and T. J. Sejnowski, “Independent component analysis of electroencephalographic data,” in Advances in neural information processing systems, pp. 145–151, 1996.

135 R. N. Vigário, "Extraction of ocular artefacts from eeg using independent component analysis," Electroencephalography and clinical neurophysiology, vol. 103, no. 3, pp. 395–404, 1997.

136 I. Winkler, S. Haufe, and M. Tangermann, "Automatic classification of artifactual ica-components for artifact removal in eeg signals," Behavioral and Brain Functions, vol. 7, no. 1, p. 30, 2011.

137 F. Ricci, L. Rokach, and B. Shapira, Introduction to recommender systems handbook. Springer, 2011.

138 D. H. Park, H. K. Kim, I. Y. Choi, and J. K. Kim, "A literature review and classification of recommender systems research," Expert Systems with Applications, vol. 39, pp. 10059–10072, Sept. 2012.

139 L. Lü, M. Medo, C. H. Yeung, Y.-C. Zhang, Z.-K. Zhang, and T. Zhou, "Recommender systems," Physics reports, vol. 519, no. 1, pp. 1–49, 2012.

140 J. Bobadilla, F. Ortega, A. Hernando, and A. Gutiérrez, “Recommender systems survey,” Knowledge-Based Systems, vol. 46, pp. 109–132, July 2013.

141 G. Linden, B. Smith, and J. York, "Amazon.com recommendations: Item-to-item collaborative filtering," IEEE Internet computing, vol. 7, no. 1, pp. 76–80, 2003.

142 A. Paterek, "Improving regularized singular value decomposition for collaborative filtering," in Proceedings of KDD cup and workshop, vol. 2007, pp. 5–8, 2007.

143 J. Bennett, S. Lanning, et al., “The netflix prize,” in Proceedings of KDD cup and workshop, vol. 2007, p. 35, Citeseer, 2007.

144 S. Funk, "Netflix update: Try this at home." https://sifter.org/~simon/journal/20061211.html, 2006.

145 Y. Koren, R. Bell, and C. Volinsky, "Matrix factorization techniques for recommender systems," Computer, vol. 42, no. 8, pp. 30–37, 2009.

146 T. Karras, T. Aila, S. Laine, and J. Lehtinen, “Progressive growing of gans for improved quality, stability, and variation,” arXiv preprint arXiv:1710.10196, 2017.

147 M. Spape, R. Verdonschot, and H. v. Steenbergen, The E-Primer: An introduction to creating psychological experiments in E-Prime®: Second edition updated for E-Prime 3. Leiden University Press, 2019.

148 M. Sokolova and G. Lapalme, “A systematic analysis of performance measures for classification tasks,” Information processing & management, vol. 45, no. 4, pp. 427–437, 2009.

149 D. M. Powers, "Evaluation: from precision, recall and f-measure to roc, informedness, markedness and correlation," 2011.

150 M. Bekkar, H. K. Djemaa, and T. A. Alitouche, “Evaluation measures for models assessment over imbalanced data sets,” J Inf Eng Appl, vol. 3, no. 10, 2013.

151 O. Ledoit and M. Wolf, “A well conditioned estimator for large dimensional covariance matrices,” 2000.

152 H. J. Jung and M. Lease, "Improving quality of crowdsourced labels via probabilistic matrix factorization," in Workshops at the Twenty-Sixth AAAI Conference on Artificial Intelligence, 2012.

153 A. Azizian, A. Freitas, T. Watson, and N. Squires, "Electrophysiological correlates of categorization: P300 amplitude as index of target similarity," Biological Psychology, vol. 71, no. 3, pp. 278–288, 2006.

154 J. E. Hoffman, R. F. Simons, and M. R. Houck, “Event-related potentials during controlled and automatic target detection,” Psychophysiology, vol. 20, no. 6, pp. 625–632, 1983.

155 M. Ojala and G. C. Garriga, "Permutation tests for studying classifier performance," Journal of Machine Learning Research, vol. 11, no. Jun, pp. 1833–1863, 2010.

156 P. Good, Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses. Springer, 2nd ed., 2000.

157 L. A. Farwell and E. Donchin, “Talking off the top of your head: toward a mental prosthesis utilizing event-related brain potentials,” Electroencephalography and clinical Neurophysiology, vol. 70, no. 6, pp. 510–523, 1988.

158 A. C. Metting van Rijn, A. Peper, and C. A. Grimbergen, "High-quality recording of bioelectric events," Medical and Biological Engineering and Computing, vol. 28, no. 5, pp. 389–397, 1990.

159 S. Debener, R. Emkes, M. De Vos, and M. Bleichner, “Unobtrusive ambulatory eeg using a smartphone and flexible printed electrodes around the ear,” Scientific reports, vol. 5, p. 16743, 2015.

160 N. Naseer, M. J. Hong, and K.-S. Hong, "Online binary decision decoding using functional near-infrared spectroscopy for the development of brain–computer interface," Experimental Brain Research, vol. 232, pp. 555–564, Feb 2014.

161 E. Boto, N. Holmes, J. Leggett, G. Roberts, V. Shah, S. S. Meyer, L. D. Muñoz, K. J. Mullinger, T. M. Tierney, S. Bestmann, G. R. Barnes, R. Bowtell, and M. J. Brookes, "Moving magnetoencephalography towards real-world applications with a wearable system," Nature, vol. 555, no. 7698, pp. 657–661, 2018.

162 D. Hagemann, J. Hewig, C. Walter, and E. Naumann, "Skull thickness and magnitude of eeg alpha activity," Clinical Neurophysiology, vol. 119, no. 6, pp. 1271–1280, 2008.

65 162 D. Hagemann, J. Hewig, C. Walter, and E. Naumann, “Skull thickness and mag- nitude of eeg alpha activity,” Clinical Neurophysiology, vol. 119, no. 6, pp. 1271– 1280, 2008.

163 M. Dannhauer, B. Lanfer, C. H. Wolters, and T. R. Knösche, “Modeling of the human skull in eeg source analysis,” Human Brain Mapping, vol. 32, no. 9, pp. 1383–1399, 2011.

164 T. R. Bashore and M. W. van der Molen, “Discovery of the P300: A tribute,” Biological Psychology, vol. 32, pp. 155–171, Oct. 1991.

165 J. Polich, "Attention, probability, and task demands as determinants of P300 latency from auditory stimuli," Electroencephalography and clinical neurophysiology, vol. 63, no. 3, pp. 251–259, 1986.

166 S. Yamaguchi and R. T. Knight, "P300 generation by novel somatosensory stimuli," Electroencephalography and clinical neurophysiology, vol. 78, no. 1, pp. 50–55, 1991.

167 K. Sano, Y. Tsuda, H. Sugano, S. Aou, and A. Hatanaka, “Concentration Effects of Green Odor on Event-related Potential (P300) and Pleasantness,” Chemical Senses, vol. 27, pp. 225–230, 03 2002.

168 M. Kutas and K. D. Federmeier, “Thirty Years and Counting: Finding Meaning in the N400 Component of the Event-Related Brain Potential (ERP),” Annual Review of Psychology, vol. 62, pp. 621–647, Dec. 2010.

169 M. Soleymani, S. Asghari-Esfeden, Y. Fu, and M. Pantic, “Analysis of eeg signals and facial expressions for continuous emotion detection,” IEEE Transactions on Affective Computing, vol. 7, pp. 17–28, Jan 2016.

170 Y. Lin, C. Wang, T. Jung, T. Wu, S. Jeng, J. Duann, and J. Chen, "Eeg-based emotion recognition in music listening," IEEE Transactions on Biomedical Engineering, vol. 57, pp. 1798–1806, July 2010.

171 A. T. Sohaib, S. Qureshi, J. Hagelbäck, O. Hilborn, and P. Jerčić, "Evaluating classifiers for emotion recognition using eeg," in Foundations of Augmented Cognition (D. D. Schmorrow and C. M. Fidopiastis, eds.), (Berlin, Heidelberg), pp. 492–501, Springer Berlin Heidelberg, 2013.

172 D. Plass-Oude Bos, “Eeg-based emotion recognition,” The Influence of Visual and Auditory Stimuli, 01 2006.

173 S. Sedhain, A. K. Menon, S. Sanner, and L. Xie, "Autorec: Autoencoders meet collaborative filtering," in Proceedings of the 24th International Conference on World Wide Web, pp. 111–112, 2015.

174 L. C. Irani and M. S. Silberman, "Turkopticon: Interrupting worker invisibility in amazon mechanical turk," in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '13, (New York, NY, USA), pp. 611–620, ACM, 2013.

175 R. Y. Wong, N. Merrill, and J. Chuang, "When bcis have apis: Design fictions of everyday brain-computer interface adoption," in Proceedings of the 2018 Designing Interactive Systems Conference, DIS '18, (New York, NY, USA), pp. 1359–1371, ACM, 2018.

176 S. Burwell, M. Sample, and E. Racine, "Ethical aspects of brain computer interfaces: a scoping review," BMC Medical Ethics, vol. 18, no. 1, p. 60, 2017.

177 M. Frank, T. Hwu, S. Jain, R. T. Knight, I. Martinovic, P. Mittal, D. Perito, I. Sluganovic, and D. Song, "Using eeg-based bci devices to subliminally probe for private information," in Proceedings of the 2017 on Workshop on Privacy in the Electronic Society, WPES '17, (New York, NY, USA), pp. 133–136, ACM, 2017.

178 G. Vecchiato, J. Toppi, F. Cincotti, L. Astolfi, F. D. V. Fallani, F. Aloise, D. Mattia, S. Bocale, F. Vernucci, and F. Babiloni, "Neuropolitics: Eeg spectral maps related to a political vote based on the first impression of the candidate's face," in 2010 Annual International Conference of the IEEE Engineering in Medicine and Biology, pp. 2902–2905, IEEE, 2010.

179 A. Telpaz, R. Webb, and D. J. Levy, "Using eeg to predict consumers' future choices," Journal of Marketing Research, vol. 52, no. 4, pp. 511–529, 2015.

180 T. F. Chan, "China is monitoring employees' brain waves and emotions - and the technology boosted one company's profits by $315 million," Business Insider, May 2018.

181 S. Chen, "Forget the facebook leak: China is mining data directly from workers' brains on an industrial scale," South China Morning Post, April 2018.

182 S. Boyd, A. Harden, and M. Patton, "The eeg in early diagnosis of the angelman (happy puppet) syndrome," European journal of pediatrics, vol. 147, no. 5, pp. 508–513, 1988.

183 C. Miyamoto, S. Baba, and I. Nakanishi, "Biometric person authentication using new spectral features of electroencephalogram (eeg)," in 2008 international symposium on intelligent signal processing and communications systems, pp. 1–4, IEEE, 2009.

184 L. De Gennaro, C. Marzano, F. Fratello, F. Moroni, M. C. Pellicciari, F. Ferlazzo, S. Costa, A. Couyoumdjian, G. Curcio, E. Sforza, A. Malafosse, L. A. Finelli, P. Pasqualetti, M. Ferrara, M. Bertini, and P. M. Rossini, "The electroencephalographic fingerprint of sleep is genetically determined: A twin study," Annals of Neurology, vol. 64, no. 4, pp. 455–460, 2008.

185 S. J. Oh, B. Schiele, and M. Fritz, Towards Reverse-Engineering Black-Box Neural Networks, pp. 121–144. Cham: Springer International Publishing, 2019.

186 P. Shenoy and D. S. Tan, "Human-aided computing: Utilizing implicit human processing to classify images," in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '08, (New York, NY, USA), pp. 845–854, ACM, 2008.
