Towards the Amelioration of Classification Models for Evoked Potentials in Brain-Computer Interface

by

Chad Anthony Mello

B.S., College of Santa Fe, Santa Fe, 2009

M.S., University of Colorado Colorado Springs, 2014

A dissertation submitted to the Graduate Faculty of the

University of Colorado Colorado Springs

in partial fulfillment of the

requirements for the degree of

Doctor of Philosophy

Department of Computer Science

2020

© 2020 Chad Anthony Mello. All Rights Reserved.

This dissertation for the Doctor of Philosophy degree by

Chad Anthony Mello

has been approved for the

Department of Computer Science

by

Terrance Boult, Chair

Jugal Kalita

Ethan Rudd

Andrew White

Yanyan Zhuang

Date: December 16th, 2020

Mello, Chad Anthony (Ph.D., Engineering: Computer Science)

Towards the Amelioration of Classification Models for Evoked Potentials in Brain-Computer Interface

Dissertation directed by Professor Terrance E. Boult

ABSTRACT

Brain-Computer Interface technology has the potential to improve the lives of millions of people around the world. This study investigates how we may improve the performance of brain-computer interface for evoked potentials; we address some of the predominant challenges that deter its widespread availability and application, demonstrating ways to augment system bootstrapping and performance with the adaptation of classifiers that seem better suited to generalizing across human electroencephalographic data. This dissertation introduces ways in which deep transfer learning, together with interpretability, may ameliorate the approach to deploying pre-trained brain-computer interface systems that generalize well across users and tasks.

To the one who gave more than she had to give to this effort...

Imelda, thank you for providing strength and encouragement on those days when I could find neither. Here's looking to another twenty-seven years with you!

Te adoro.

Te necesito.

Te amo.

ACKNOWLEDGMENTS

I'd like to thank my PhD committee members for taking the time to participate in this effort. Some of you took extra time to learn about certain topics just so you could contribute here. Thank you for that! Each member of the committee was chosen because you brought something important to the table that my work ultimately benefited from in some way. Thank you for supporting and encouraging me through this journey.

I'd like to take this opportunity to thank my advisor and committee chair, Dr. Terrance Boult, for his guidance and support throughout this process. Terry provided job opportunities, invaluable insight, wisdom, and expert knowledge in critical areas. Terry guided me towards putting all of this research into a coherent, cohesive body of work. Above all, Terry provided calm at times when I was about to commit hara-kiri. Dr. Boult is undeniably a world-class researcher, and I am honored to have had the opportunity to work with him.

I must also extend special acknowledgment and thanks to Ginger Boult who cleared the way for Terry to take on the advisory role for my work. She did this knowing she risked seeing less of her husband during that time period. Thank you, Ginger, for sharing some of your husband’s time with me. It means a lot to me, because I know it meant a lot to you.

How could I not thank Ali Langfels for putting up with all my absent-minded mistakes and missed deadlines that undoubtedly made her life hell at times. Ali helped to keep me on track with accurate record-keeping, proctored my PhD examinations, and kept items that were important for graduating in front of my face the entire time. She made sure that I dotted all my i's and crossed all my t's. I would not be where I am today if she were not there. Thank you, Ali.

A special thanks to Dr. Ethan Rudd who on a regular basis provided me with much insight through his experience and vast knowledge in machine learning, Python and beyond. He gave up many hours of personal time to allow me to bounce ideas off of him. His direct efforts resulted in much-welcomed funding that went towards this work, and it was a tremendous blessing. Ethan was instrumental beyond what I could express here. Thank you, Ethan.

I've worked with Dr. Andrew White for many years now. It's an honor to continue to work with him. Over the years he's provided insight into how the brain works. Much of my knowledge in this area stems from the hours we spent on interesting and exciting research in epileptic brain function. Since then I've come to rely on Andy's professional knowledge and understanding of the human brain. Thank you, Andy, for sharing your insight and for your continued interest and support of my work.

I'd like to thank Dr. Yanyan Zhuang for agreeing to join my committee. From her feedback I've gained a lot of insight into how other people outside of BCI and machine learning might see this work. Yanyan passed along many pointers that helped me to prepare for my defense, and I honestly felt much more confident going into the defense after spending considerable time going over it with her beforehand. Thank you, Yanyan.

Dr. Jugal Kalita provided valuable help over the years, and I'd like to thank him for preparing me for my defense by allowing me to give presentations on various topics related to my work to his graduate students. Presenting ideas and having those ideas challenged is the better part of what science and research is about. This hands-on approach helped me to become a better skeptic and debater. Jugal, thank you for the tough questions and for challenging my ideas. It all helped to bring the bigger picture into focus.

I'd like to express a special thank you to Dr. Rory Lewis. I learned a lot by working with Rory throughout my grad school years. I never knew when Dr. Lewis might call me up in front of a crowd of people to speak, ask me to conduct a last-minute lecture for one of his classes, or when he might correct me while speaking in front of any audience. This kept me on my toes and forced me to become a better public speaker, a critical life skill for anyone looking to become anything in academia. I learned what it was like to author a book, to write and publish research papers, and to dream up wild ideas... all the things that make doing science fun. Some of the work we did together is covered in Chapter 3.

Thanks for the wild ride, Rory.

To the staff who make UCCS go, and those faculty members who dedicate themselves to teaching and mentoring the next generation of scientists and engineers, thank you for all your hard work, and thank you for extending the honor and privilege to me by allowing me to share in the experience through lecturing and teaching undergraduate courses at UCCS. I know your job can be tough at times, perhaps leaving you feeling discouraged on occasion. Take solace in the fact that you do make a difference in the lives of many people who are looking to create a meaningful and purposeful life on this earth.

I'd like to thank the United States Air Force Academy CS faculty and staff for their encouragement and support since I started working there less than a year ago. They helped me to stay focused while encouraging me to put a lid on this work sooner rather than later. Special thanks to Dr. Troy Weingart for dangling the carrot on a stick, prodding me towards an assistant professor position with the academy after graduation. Thank you!

Two very good friends, Phillip Blanton and Terry Torres, provided me with nonstop energy and encouragement every step of the way. I want to thank them for demonstrating levels of positivity and enthusiasm such as I've never experienced before. That sort of tangible energy can drive people to accomplish extraordinary things. Gentlemen, thank you for sharing it with me.

To Mom, Dad, Luke, and Amanda: Thank you for never balking at my crazy ideas, for all your patience and endless love. To my daughters, Tanya and Rachel, thank you for putting up with your Dad’s absentmindedness and crankiness over the last few years (as if I was ever anything but). My son and I missed out on a lot of time together in recent months. Thank you for understanding, David. I’m thankful that we still have a few more years left before you embark on your own adventures. You can bet I’ll be there to help you along.

For all the times she had picked up the slack when I couldn't be there, I want to thank my wife, Imelda. She never hounded, she never browbeat, and she always gave me the space and time I needed to work. She kept things together and maintained a sense of normalcy for the family when things were anything but normal. She was there every step of the way; she shared in the highs as well as the lows. Imelda, thank you for all your hard work, grace and poise throughout this journey.

TABLE OF CONTENTS

1 Thesis, Motivation, Claims, and Contributions
1.1 Terminology
1.2 Motivation
1.3 Claims
1.4 Investigatory Items
1.5 Contributions
2 Whither Goest Thou, BCI?
2.1 A New Era of Technology: Fuel for Renewed Interest
2.2 BCI: The "Interface" in Brain-Computer Interface
2.3 P300 BCI
2.3.1 Trends in P300 BCI Application
2.3.2 Mobility
2.3.3 Electronic Prosthesis
2.3.4 Gaming
2.3.5 Home Assistance and Environmental Control
2.3.6 Assessing Attention State and Mindfulness
2.3.7 Other P300 BCI Uses
2.3.8 Spelling and Communication Systems (P300 SCS)
2.4 Assumptions, Limitations, and Delimitation
2.4.1 Electroencephalography, Nothing More
3 Previous Work: Classifying Oscillatory Brain Function
3.1 EEG Feature Generation and Examination
3.1.1 Theory for Observable Seizure Progression in EEGs
3.1.2 The neuralClustering Algorithm
3.1.2.1 Signal Bisection
3.1.2.2 Signal Preparation
3.1.3 Summations of Amplitudinal Area
3.1.4 Temporally-Bounded Clustering
3.1.5 Movement of Centroids Through Time
3.2 Classifying Oscillatory Brain Function
3.2.1 A Different Perspective on Scatter Plot
3.2.2 Human-Assisted Classification
3.3 Selecting Linearly Separable Class Attributes
3.4 Results and Comparisons
3.5 The Custom EEG Software Suite: IERSS
3.6 Visual Signal Inspection and Manipulation
3.7 Data Transformation and Visual Animations
3.8 The Data Classification Tool
3.9 The Training Tool
3.10 A Classification Example: the Seizure Identification System
3.11 Conclusions
4 Review of ERP and Popular BCI Classifiers
4.1 Event-Related Potentials (ERPs)
4.1.1 Steady-State Visual-Evoked Responses
4.1.2 Sensorimotor Rhythms [255]
4.1.3 The P300 Wave
4.2 P300 BCI: A Closer Look [171, 174]
4.3 EP Detection
4.4 A (Brief) Review of Traditional Classification Algorithms for BCI
4.4.1 Linear discriminant analysis (LDA)
4.4.2 Support vector machine (SVM)
4.4.3 Random Forests (RF)
4.4.4 Dynamic stopping (DS)
4.4.5 Hidden Markov Models (HMMs)
4.4.6 Convolutional Neural Networks (CNN)
4.5 Riemannian-based Classifiers
4.6 Deep Learning
4.7 Deep Transfer Learning
4.8 Conclusion
5 The Dataset & Analysis
5.1 Why Alcoholism?
5.2 The Analysis
6 Experimental Comparison
6.1 Model Overview
6.1.1 Feed-forward Neural Network (FFNN)
6.1.2 Convolutional Neural Network (CNN)
6.1.3 Random Forest (RF)
6.1.4 Gradient Boosting (GB)
6.1.5 Linear Discriminant Analysis (LDA)
6.1.6 Logistic Regression (LR)
6.1.7 Riemannian Minimum Distance To Mean (MDM)
6.2 The Experiments
6.3 Receiver Operating Characteristic (ROC) Curves
6.4 Match Vs No-match Experiment
6.5 High-Risk Vs Low-Risk Experiment
6.6 Conclusions
7 Transfer Learning Experiments
7.1 Transfer Learning for BCI
7.2 Transfer Learning Theoretical Foundations
7.3 Methodology
7.3.1 The Baseline (CNN) Model
7.4 The Experiments
7.4.1 Pre-train, Fine-tune (w/ and w/o weight-freezing; same task)
7.4.2 Pre-train then fine-tune (cross-task)
7.4.3 Pre-train w/ auto-encoder then fine-tune
7.4.4 Pre-train w/ Siamese loss then fine-tune
7.5 The Results
7.5.1 Match vs No-match
7.5.2 High-risk vs Low-risk
7.5.3 Cross-task Results
7.6 Conclusions
8 Interpretability
8.1 AI's Black Box
8.2 What is Interpretability?
8.2.1 Explainability
8.3 Global vs. Local Interpretability
8.4 Additive Feature Attribution
8.5 Shapley Values
8.6 Kernel SHapley Additive exPlanations
8.7 Integrated Gradients (IG)
8.8 Local Interpretable Model-Agnostic Explanations (LIME)
8.8.1 Applying Interpretability to Electroencephalography
8.9 Approach to Explainability: Integrated Gradients (IG)
8.10 Interpreting the Explainers
8.10.1 High Risk Vs. Low Risk Explainers
8.10.2 Match Vs. No-Match Explainers
8.11 Conclusions
9 Conclusions and Discussion
9.1 Future Work
References
A Additional Experimental Perspectives
A.1 Traditional Classifier Experiments
A.2 Traditional Vs. Transfer Experiments
B About Electroencephalograms
B.1 Quantitative Electroencephalography (qEEG)
C Biosignals
C.1 Biosignals as Bodies of Continuous Data
C.1.1 Signals Classified
C.1.2 Signals of Interest & Associated Data
C.1.2.1 Electromyography (EMG)
C.1.3 ECG
C.1.4 Electrodermography (EDG)
D Software Packages and Datasets
D.1 Other Available BCI Data Sets

TABLES

3.1 Filters and Techniques: kernels applied with various kernel sizes. In the case of the windowed-sinc filters, they worked best using larger kernel sizes, but performance was poor; these filter kernels are Gaussian, and therefore require higher fidelity, and best performed using frequency convolution through FFT [206].

3.2 Results of identifying pathological oscillations in rats from 42 GB of EEG data.

FIGURES

2.1 (Top) The RC speller highlights a whole row or column at once with 6 symbols each. (Bottom, left) The SC speller highlights each character individually. (Bottom, right) Electrode positions according to the international 10/20 electrode system used for the EEG measurements. The head is viewed from above and the nose points to the top of the page. Source: [86]

2.2 P300 BCI wheelchair prototype. Source: [183]

2.3 Brain-Computer Interface. © "The Brain-Computer Interface Project", http://www.ece.ubc.ca/~garyb/BCI.htm

3.1 EEG showing onset of seizure. Note the drift in amplitude and frequency from left to right as the seizure progresses. Frequency drops while amplitude increases; this is a typical manifestation of neuronal hyper-synchronization.

3.2 Section of EEG bisected with a mean line (in red) created using line segments averaged over 500 sample points, connected and interpolated.

3.3 Originally taken from a 4-value boxcar kernel, the new kernel functions as a differential function that forces a signal to become symmetric around the origin (zero).

3.4 Non-seizure. (a) Notice the wide spread as well as amplitude clipping. (b) Notice that the signal revolves around the zero line, and that the mean line is at or near zero as well. Spikes on both ends are easily classified as non-seizure.

3.5 Seizure: applying the differential filter ultimately leaves seizure intact, while minimizing noise and artifact. In (b), the artifact on the left is easily classified, while the seizure can also be correctly classified.

3.6 (a) Amplitudinal area above and below the mean line is converted into 2-dimensional points where Y = area and X = temporal duration. (b) is an actual screenshot of EEG with calculated amplitudinal sums in red.

3.7 Initial verifiable evidence in Matlab showing clustered movement during seizure: 8-second windows with 7-second overlap. (A) indicates a time before onset at the origin, (B) shows the beginning of seizure movement, (C) illustrates movement away from the origin as the seizure progresses, (E) shows how activity goes back toward the origin as the seizure subsides. Total duration of movement: 2.33 minutes.

3.8 Example of a moving centroid. Shown is about 11 minutes of EEG activity. The centroid's current position is marked in red, while its past positions are shaded gray. The current position corresponds to seizure activity. Notice the dense area in the lower-right; this corresponds to normal brain function. Each point represents a 3-second window of time, while the tail of the centroidal vector reaches back 1 second in time. For visual purposes, the time forward was incremented .05 seconds (2.95-second overlap).

3.9 4 seizures from 2 different animals (2 for each rat). Notice how the patterns are similar, and the centroids for each cluster (circled in red) are close in proximity to one another. Each seizure represents between 30 and 60 seconds of activity.

3.10 9 normal oscillatory sections taken at various times from one animal. Notice how similar patterns can be grouped. Even though there are distinctly two different patterns here, they occur at various times throughout the EEG recordings. This model comparison points towards a way to quantify distinctly different states of brain function that are regularly repeated. Each cluster represents roughly 4 to 20 seconds of activity.

3.11 Black clusters are normal brain function, while red clusters are seizures. The larger squares are the cluster centroids. Notice that from this perspective, the data is not linearly separable.

3.12 Recap: (I) EEG data and signal prep to (II) bisection, amplitudinal area vs. temporal duration, bounded and overlapped time windows of activity, reduced to centroidal movement to (III) centroidal scatter plot.

3.13 At the top-right, a scatter plot representing approximately 13 minutes of centroidal cluster movement. Below, stretching from left to right, is the same cluster, only it is "stretched" out sequentially and bisected with a red mean line known as the feature threshold line. Notice a large "hill" in the middle. This is a seizure lasting about 46 seconds.

3.14 Now we have a way to select specific features from a large cluster. In white is a selected seizure feature cluster that crosses above the feature threshold line. The corresponding cluster segment is pictured in purple at the upper right, with its own centroid in red. We have now defined a new cluster that is a subset of the larger structure. This can now be classified as a seizure in discrete terms.

3.15 Just as we were able to select seizure features in Figure 3.14, we can select non-seizure features and classify them as well. Note that 1 is seizure, 2 is the corresponding cluster, 3 and 5 are non-seizure, and 4 and 6 are the related clusters with centroids displayed prominently.

3.16 Feature area vs. arc length: (a) green is normal function, while blue is seizure. Notice the high covariance that exists between X and Y. (b) This data is also linearly separable, and can be used to train a linear classifier.

3.17 Y = area/cDistance vs. X = pCount/cDistance: (a) green is normal function, while blue is seizure. (b) This data is also linearly separable, and can be used to train a linear classifier.

3.18 IERSS preview window allows the user to search and inspect parts of the signal before loading the data.

3.19 IERSS allows visual inspection and the application of a bisecting line, too. The bisecting line is displayed in red, while the physical intersection is displayed using blue boxes. The zero line is a thick gray horizontal line.

3.20 IERSS allows visual inspection of a selected intersection between the signal and the bisecting line function.

3.21 IERSS allows the user to turn on the shaded area view for inspecting signal features where the signal crosses the bisecting line function.

3.22 IERSS offers the ability to zoom into the actual individual signal samples for closer examination.

3.23 IERSS performing centroidal animation time-synced with the signal at the top. Notice the blue vertical bar over the signal. The bar's width indicates the size of the time window (3 seconds), and it moves in sync with the moving centroid below it. At the bottom is the feature threshold line as described in chapter 3.2; it is being formed in sync with centroidal movement.

3.24 Shown here is the classification UI. The user may select any feature above the feature threshold by clicking on it in the bottom window (shown in white). The associated points are highlighted in the scatter plot window for visual confirmation (purple with red centroid). The user may classify the feature by selecting "seizure" or "non-seizure" from the dialog.

3.25 The classification training tool allows training on two different planes, and allows the user to select which elements from the training set to train against. Once the plane and elements are selected, clicking the "train" button will invoke the associated PLA algorithm.

3.26 The seizure identification system, like all other IERSS functions, is visual. It keeps a running log, and highlights identified seizures when discovered in the bottom feature threshold window. The window displays the current segment of the EEG file being analyzed.

3.27 Here is an example of the thumbnails IERSS produces upon positive identification of seizure.

4.1 Various ERPs through time with locale. Source: [57]

4.2 Prominent P300 peak. © 2010. Source: [5]

4.3 P3a and P3b components during three different stimuli runs. Source: [29]

4.4 EEG screen with high frequency noise, eyelid blinks and epileptiform events (black arrow). (© Adriano O. Andrade et al 2013)

5.1 Typical 10/20 extended electrode scalp placement.

5.2 Single trial, random subject, 64 channels in 2D and 3D (inset images); the black line is the mean across all channels.

5.3 Mean across all trials, channels and subjects per group.

5.4 Mean across only no-match trials, all channels and subjects per group.

5.5 Mean across only match trials, all channels and subjects per group.

5.6 Dipoles representing the mean across no-match trials for all HR subjects, for each channel. Notice the heat map of the P300 response that corresponds to the thin green strip around 300 ms into the trial vs the same trial for LR subjects in Figure 5.7.

5.7 Dipoles representing the mean across no-match trials for all LR subjects, for each channel. Notice the heat map of the P300 response that corresponds to the thin green strip around 300 ms into the trial vs the same trial for HR subjects in Figure 5.6.

5.8 Histogram and kernel density estimation for a single subject, 3 randomly selected no-match trials.

6.1 All channels used in the 60-channel tests are labeled.

6.2 Red highlighted channels are included in the 8-channel tests.

6.3 Red highlighted channels are included in the 12-channel tests.

6.4 ROC curves and AUC for match/no-match experimentation.

6.5 ROC curves and AUC for high-risk/low-risk experimentation.

7.1 Data partitions used for our original benchmarking (middle row) and transfer learning experiments (bottom row). F1, F2, and F3 (up to 75% of the remaining data) are sections of data used for fine tuning, leaving section T (25%) for testing.

7.2 Part 1 of our baseline ConvNet: (conv1): Conv1d(64, 32, kernel_size=(3,), stride=(1,), padding=(1,)); (conv2): Conv1d(64, 32, kernel_size=(7,), stride=(1,), padding=(3,)); (conv3): Conv1d(64, 32, kernel_size=(11,), stride=(1,), padding=(5,)); (mp1): MaxPool1d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False).

7.3 Part 2 of our baseline ConvNet continues with: Sequential((0): LayerNorm((4096,), eps=1e-05, elementwise_affine=True); (1): ELU(alpha=1.0); (2): Dropout(p=0.5, inplace=False); (3): Linear(in_features=4096, out_features=1024, bias=True); (4): LayerNorm((1024,), eps=1e-05, elementwise_affine=True); (5): ELU(alpha=1.0); (6): Linear(in_features=1024, out_features=128, bias=True); (7): LayerNorm((128,), eps=1e-05, elementwise_affine=True); (8): ELU(alpha=1.0); (9): Linear(in_features=128, out_features=1, bias=True); (10): Sigmoid()).

7.4 A typical autoencoder (© towardsdatascience.com). Input first passes through the encoder to produce the representative code. The decoder, which has a similar "mirrored" structure to the encoder, reproduces the original input using only the code as input.

7.5 D is defined as the Euclidean distance between all outputs (F1 and F2) of the Siamese network.

7.6 A typical Siamese network construct. The two identical subnetworks are defined as a series of fully connected layers. Both networks accept input from our convolutional layers. The outputs from the Siamese subnetworks are feature vectors F1 and F2, which are further reduced feature representations.

7.7 ROC curves and AUC for match/no-match transfer baseline.

7.8 ROC curves and AUC for match/no-match pre-trained, fine-tuned, and with weight freezing.

7.9 ROC curves and AUC for HR/LR baseline.

7.10 ROC curves and AUC for HR/LR with Siamese network, no weight freeze.

8.1 Below: 2D graph of a 1-second trial with 61 channels. Above: the same trial across channels, giving us a 3D spatial perspective.

8.2 A dipole heat map representing what our Siamese network learned over all HR tasks. This image shows aggregated feature contributions (depicted through IG) across all TP high risk (alcoholic) subjects averaged across the entire length of a 1-second trial, showing which channels spatially contributed negatively (blue) or positively (red) towards the prediction.

8.3 A joint temporal dipole heat map representing what our Siamese network learned over all HR tasks. This image shows aggregated feature contributions (depicted through IG) across all TP high risk (alcoholic) subjects through time over the entire length of a 1-second trial, showing which channels spatially and temporally contributed negatively (blue) or positively (red) towards the prediction. Below the dipole maps we are able to follow the contributions of each EEG channel through time.

8.4 A temporal heat map representing what our Siamese network learned over all HR tasks. This image shows aggregated feature contributions (depicted through IG) across all TP high risk (alcoholic) subjects through time over the entire length of a 1-second trial, showing which channels contributed negatively or positively towards the prediction.

8.5 A dipole heat map representing what our Siamese network learned over all LR tasks. This image shows aggregated feature contributions (depicted through IG) across all low risk (non-alcoholic) subjects averaged across the entire length of a 1-second trial, showing which channels spatially contributed negatively (blue) or positively (red) towards the prediction.

8.6 A joint temporal dipole heat map representing what our Siamese network learned over all LR tasks. This image shows aggregated feature contributions (depicted through IG) across all low risk (non-alcoholic) subjects through time over the entire length of a 1-second trial, showing which channels spatially and temporally contributed negatively (blue) or positively (red) towards the prediction. Below the dipole maps we are able to follow the contributions of each EEG channel through time.

8.7 A temporal heat map representing what our Siamese network learned over all LR tasks. This image shows aggregated feature contributions (depicted through IG) across all low risk (non-alcoholic) subjects through time over the entire length of a 1-second trial, showing which channels contributed negatively or positively towards the prediction.

8.8 A dipole heat map representing what our binary (CNN) network learned over all match tasks. This image shows aggregated feature contributions (depicted through IG) across all TP match tasks averaged across the entire length of a 1-second trial, showing which channels spatially contributed negatively (blue) or positively (red) towards the prediction.

8.9 A joint temporal dipole heat map representing what our binary (CNN) network learned over all match tasks. This image shows aggregated feature contributions (depicted through IG) across all TP match tasks through time over the entire length of a 1-second trial, showing which channels spatially and temporally contributed negatively (blue) or positively (red) towards the prediction. Below the dipole maps we are able to follow the contributions of each EEG channel through time.

8.10 A temporal heat map representing what our binary (CNN) network learned over all match tasks. This image shows aggregated feature contributions (depicted through IG) across all TP match tasks through time over the entire length of a 1-second trial, showing which channels contributed negatively or positively towards the prediction.

8.11 A dipole heat map representing what our binary (CNN) network learned over all no-match tasks. This image shows aggregated feature contributions (depicted through IG) across all TN match tasks averaged across the entire length of a 1-second trial, showing which channels spatially contributed negatively (blue) or positively (red) towards the prediction.

8.12 A joint temporal dipole heat map representing what our binary (CNN) network learned over all no-match tasks. This image shows aggregated feature contributions (depicted through IG) across all TN match tasks through time over the entire length of a 1-second trial, showing which channels spatially and temporally contributed negatively (blue) or positively (red) towards the prediction. Below the dipole maps we are able to follow the contributions of each EEG channel through time.

8.13 A temporal heat map representing what our binary (CNN) network learned over all no-match tasks. This image shows aggregated feature contributions (depicted through IG) across all TN match tasks through time over the entire length of a 1-second trial, showing which channels contributed negatively or positively towards the prediction.

A.1 Logistic Regression ROC curves and AUC for match/no-match experimentation: left assesses stability across training set size, middle assesses stability across dimensionality, and right assesses the HR/LR task where the algorithm performed at its worst.

A.2 Linear Discriminant Analysis ROC curves and AUC for match/no-match experimentation: left assesses stability across training set size, middle assesses stability across dimensionality, and right assesses the HR/LR task where the algorithm performed at its worst.

A.3 Riemannian Minimum Distance to Mean ROC curves and AUC for HR/LR experimentation: left assesses stability across training set size, middle assesses stability across dimensionality, and right assesses the match/no-match task where the algorithm performed at its worst.

A.4 Feed-forward Neural Network ROC curves and AUC for match/no-match experimentation: left assesses stability across training set size, middle assesses stability across dimensionality, and right assesses the HR/LR task where the algorithm performed at its worst.

A.5 Convolutional Neural Network ROC curves and AUC for match/no-match experimentation: left assesses stability across training set size, middle assesses stability across dimensionality, and right assesses the HR/LR task where the algorithm performed at its worst.

A.6 Gradient Boosting ROC curves and AUC for match/no-match experimentation: left assesses stability across training set size, middle assesses stability across dimensionality, and right assesses the HR/LR task where the algorithm performed at its worst.

A.7 Random Forest ROC curves and AUC for match/no-match experimentation: left assesses stability across training set size, middle assesses stability across dimensionality, and right assesses the HR/LR task where the algorithm performed at its worst.

A.8 Comparison of the best performing traditional BCI classification with the best transfer learning model for HR detection on 8 channels across various training proportions. The solid red is the Siamese model with no weight freezing compared to the solid cyan representing the Riemannian Minimum Distance to Mean. CNN and FFNN were provided for additional contrast.

A.9 Comparison of the best performing traditional BCI classification with the best transfer learning model for HR detection on 12 channels across various training proportions. The solid red is the Siamese model with no weight freezing compared to the solid cyan representing the Riemannian Minimum Distance to Mean. CNN and FFNN were provided for additional contrast. Notice MDM shows slight improvement with the increase in channels.

A.10 Comparison of the best performing traditional BCI classification with the best transfer learning model for HR detection on 60 channels across various training proportions. The solid red is the Siamese model with no weight freezing compared to the solid cyan representing the Riemannian Minimum Distance to Mean. CNN and FFNN were provided for additional contrast. Here you will notice how RMDM becomes unstable and degrades in performance when high dimensionality is introduced.

A.11 Comparison of the best performing traditional BCI classification with the best transfer learning model for match detection on 8 channels across various training proportions. The solid red is the binary model with weight freezing compared to the solid blue representing our FFNN model. LDA and RF were provided for additional contrast.

A.12 Comparison of the best performing traditional BCI classification with the best transfer learning model for match detection on 12 channels across various training proportions. The solid red is the binary model with weight freezing compared to the solid blue representing our FFNN model. LDA and RF were provided for additional contrast.

A.13 Comparison of the best performing traditional BCI classification with the best transfer learning model for match detection on 60 channels across various training proportions. The solid red is the binary model with weight freezing compared to the solid blue representing our FFNN model. LDA and RF were provided for additional contrast.

B.1 Typical EEG recording of a human brain.

B.2 Changing the membrane potential for a giant squid by closing the Na gates and opening the K gates. Sanei and Chambers [197]

B.3 (a) The human brain. (b) Section of cerebral cortex showing microcurrent sources due to synaptic and action potentials. (c) A 4-second epoch of alpha rhythm and power spectrum are shown. (Artech House [224])

C.1 The possible classifications of biosignals according to their (a) existence, (b) dynamic, and (c) origin, with indicated heart rate fC, respiratory rate fR, and additional information. (Springer-Verlag Berlin Heidelberg)

C.2 Pictorial outline of the decomposition of the surface EMG signal into its constituent motor unit action potentials. (Adapted from De Luca et al. 1982a.)

C.3 Several seconds of ECG recording.

C.4 ECG of a heart in normal sinus rhythm.

C.5 A sample GSR signal of 60 seconds duration.

CHAPTER 1

THESIS, MOTIVATION, CLAIMS, AND CONTRIBUTIONS

Brain-Computer Interface (BCI) is a rapidly evolving, multidisciplinary field that incorporates aspects of neuroscience, electrical engineering, and computer science. BCIs capture, measure and record brain function via brainwaves and relate those measurements to some sort of external classification, action or task [117, 197, 224, 245]. Brainwaves are the byproduct of a functioning brain (i.e. your brain), and they're typically measured and recorded as electrical activity. These recordings are referred to as electroencephalography (EEG). Refer to Appendix B for more details on EEG. Brainwaves contain modulations encoded by the brain. Some modulations are performed purposefully by you; an example would be performing an action such as hand movement or whistling a tune. Other brainwave modulations are the result of involuntary brain processes, such as initial responses to light, sound, or touch. BCI systems capture these modulations and then decode and translate them into tasks. Tasks are executed according to the specific design of the BCI. Simply stated: BCI users may instruct computers and/or robots to perform specific actions by modulating their own brainwaves. To many readers this idea may seem too far-fetched, gimmicky, or sci-fi-like; however, the technology to support BCI does exist today, and it is rapidly progressing.

BCI depends on the ability to effectively identify important features related to brainwave modulations in EEG. While all BCI have a similar foundation for capturing and interpreting brainwaves, their methodology and application vary widely. Building off of experience from the author's prior work in EEG (see chapter 3), this study explores ways in which novel application of deep transfer learning may help overcome shortcomings that currently inhibit BCI from becoming viable consumer products. So that you may better understand our contributions, we provide some standard terminology in the following section. Those familiar with the space may skip to section 1.2.

1.1 Terminology

• Electroencephalograph (EEG): contains recorded biosignals (see appendix C), and they represent the signatures of neurological activity in the brain [197]; represents brainwaves and components therein such as ERP, EP, P300, etc. (see below).

• Event-related potential (ERP): contains very small voltages generated in brain structures in response to specific events or stimuli. They are EEG changes that are time-locked to sensory, motor or cognitive events [216]. See section 4.1.

• Evoked potential (EP): contains the electrical signals produced by the nervous system in response to an external stimulus. Sensory EPs can be recorded following stimulation in any sensory modality, but visual EPs (VEPs), auditory EPs (AEPs), and somatosensory EPs (SEPs) are most often used for clinical diagnosis and testing [103, 104, 208].

• P300: The P300 component is a positive evoked potential occurring somewhere between 250 ms and 500 ms after a novel stimulus (visual, tactile, or auditory) is introduced. The P300 continues to be an important signature of cognitive processes such as attention and working memory, and of their dysfunction in neurological and mental disorders [130, 171]. See sections 2.3 and 4.2.

• P3a & P3b components: Both are P300 sub-components. P3a originates from stimulus-driven frontal attention mechanisms during task processing, whereas P3b originates from temporal-parietal activity associated with attention and appears related to subsequent memory processing. P3b is more prominent during a novel stimulus when that stimulus has the attention (or focus) of an individual subject, whereas P3a is more prominent when the subject is disinterested in the novel stimulus [171, 174].

• Brain-Computer Interface: Slightly refined for this study, we recognize BCI to be a system that enables its users to interact with external technology using nothing more than their own brainwaves (i.e. EEG). To distill the term further, BCI is a system designed to decode and execute actionable tasks from EEG. Tasks are related to the design and purpose of the BCI; they take on meaning within the context of a specific application (mobility, communications, diagnostics, etc.) [24, 85, 153, 219]. See chapter 2 for more details on BCI systems.

• Steady-State Visual-Evoked Potential (SSVEP): a VEP that corresponds to high visual stimulus rates modulated above 6 Hz. SSVEPs may also be referred to as steady-state visual-evoked responses (SSVERs) [237, 250].

• Sensorimotor Rhythm (SMR): an ERP that emanates from the sensorimotor cortex region of the brain, which is the area associated with coordinating muscle motion and feedback from that motion. BCIs based on SMR are typically designed to assist with some sort of thought-based movement [255].

• Active Oddball Response Technique: a technique where, intermittently, an oddball pattern is applied to each task (i.e. choice) in a BCI system; it is designed to elicit a prominent P3a component. When this oddball pattern is applied to the actual target (desired) choice (the relevant task), the subject's brain produces a prominent P3b wave [68].

• Deep Transfer Learning: a form of domain adaptation, referred to as transfer learning, applied to deep neural networks; it has recently received increasing attention from computer science researchers and has been successfully applied to many real-world applications to date. Transfer learning relaxes the hypothesis that the training data must be independent and identically distributed with the test data [218]. A minimal sketch of this workflow follows this list.

• Interpretability: a methodology whereby complex machine learning (ML) models might be made explainable to a human observer. Almost all nonlinear (and even many linear) models have a certain black-box quality to them: we cannot easily see how certain features in the data are learned and weighted. Interpretability and explainability of ML algorithms have thus become pressing subjects of research seeking to know whether we can explain what a given model is learning [222].
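To make the pre-train/fine-tune pattern concrete, the following minimal PyTorch sketch shows a feature extractor pre-trained on source-domain data and then fine-tuned on a small amount of target-user data, with the extractor's weights optionally frozen. The network, layer sizes, and data here are hypothetical placeholders for illustration only; the models actually used in this work are described in Chapter 7 (see Figures 7.2 and 7.3).

    import torch
    import torch.nn as nn

    # Hypothetical toy model; not this dissertation's actual architecture.
    class TinyEEGNet(nn.Module):
        def __init__(self, n_channels=64):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv1d(n_channels, 32, kernel_size=7, padding=3),
                nn.ELU(),
                nn.AdaptiveAvgPool1d(16),
                nn.Flatten(),
            )
            self.head = nn.Sequential(nn.Linear(32 * 16, 1), nn.Sigmoid())

        def forward(self, x):
            return self.head(self.features(x))

    model = TinyEEGNet()
    # 1) Pre-train on pooled source-domain EEG (many subjects/tasks): omitted.

    # 2) Optionally freeze the pre-trained feature extractor ("weight freezing").
    for p in model.features.parameters():
        p.requires_grad = False

    # 3) Fine-tune only the head on a small batch of target-user trials.
    optimizer = torch.optim.Adam(
        (p for p in model.parameters() if p.requires_grad), lr=1e-3)
    loss_fn = nn.BCELoss()
    x = torch.randn(8, 64, 256)               # 8 one-second trials, 64 ch @ 256 Hz
    y = torch.randint(0, 2, (8, 1)).float()   # binary task labels
    optimizer.zero_grad()
    loss_fn(model(x), y).backward()
    optimizer.step()

Without weight freezing, the same loop simply fine-tunes every layer; Chapter 7 compares both variants.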

1.2 Motivation

There are several key motivators behind this research:

1. To reduce the time and cost of bootstrapping a BCI system.

2. To support and enhance efforts towards developing more effective BCI interfaces.

3. To identify promising work in ML research that may carry this work forward well into the future.

4. To improve usability: push the use of BCI further towards commercial and clinical use outside of academic and research environments.

5. To further the advancement in drop-in actionable ERP detection irrespective of BCI application.

Despite the overwhelming evidence demonstrating immense interest in BCI across a multitude of domains, as well as the technological advancement in hardware that supports it, consumer and clinical viability remains low at the time of this writing. There are many aspects to contend with that are buried deep within the very nature of EEG: it's noisy, non-linear, highly non-stationary, not to mention the high dimensionality of the data [197, 224]. These attributes represent difficult challenges for BCI classifiers to overcome. When we refer to EEG, we are also referring to multivariate (i.e. multichannel) time series data. For a typical dataset there are between 8 and 128 channels of EEG recorded at a rate between 256 Hz and 2000 Hz. This spatial component adds much more complexity to the data because EP may not manifest in all channels; furthermore, EP vary spatially and temporally between tasks and users [46, 126].
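As a rough illustration of that dimensionality, consider a hypothetical dataset shaped like the one analyzed later in this work (64 channels, 1-second trials at 256 Hz; the specific sizes below are ours, chosen only for illustration). Each trial is a channels-by-samples matrix, and a dataset stacks thousands of such trials:

    import numpy as np

    n_trials, n_channels, sfreq = 1000, 64, 256         # assumed sizes
    eeg = np.random.randn(n_trials, n_channels, sfreq)  # (trials, channels, samples)

    print(eeg.shape)      # (1000, 64, 256)
    trial = eeg[0]        # one multivariate trial: 64 concurrent time series
    # An EP may appear in only a subset of these channels, at slightly
    # different latencies per user and task, which makes the problem
    # spatially as well as temporally difficult.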

The typical challenges in BCI (as in almost all areas of application that utilize EEG) are culling the features of interest from background EEG noise and then classifying the brain wave patterns in near real-time (see section 4.3). This translates into slow performance [181]. In fact, the more accurate a BCI is at predicting a user's tasks, the slower the system becomes. The inverse also holds: the faster a BCI system performs, the less accurate it is at predicting the user's tasks; in other words, the more mistakes the system will make.

To further complicate matters, there are other factors that hinder the creation of fast and accurate BCI. These issues arise from how the interface works to elicit EPs from the user [50, 68, 127]:

• Repetition blindness [198]: a second target may be missed if two identical targets in a stream of non-targets are pulsed at intervals between 100 ms and 500 ms.

• Attentional blink [148]: manifests if the interval between two targets is less than 500 ms.

• Habituation [182]: the P300 amplitude may decrease with repeated stimulus (also referred to as user normalization).

• Target-to-target interval [83]: P300 amplitude is related to the interval between target events.

Research in BCI interfaces aims to reduce these problems through a better understanding of the physiological factors that influence the performance of BCI systems (attention and concentration, motivation, control beliefs, visuo-motor coordination, etc.) [111]. Nevertheless, interface deficiencies remain, and they negatively impact the algorithms that perform EP classification.

BCI interface deficiencies combined with artifact, inherent noise and the dimensionality of EEG are causes for slow and inconsistent BCI performance and place a heavy computational burden on the classifiers; this makes BCI impractical for many applications. For example, a typical P300-based BCI speller manages a low system throughput of 3-6 characters per minute with accuracy in the mid-80% range. One of the best performers to date claims around 18 characters per minute with 95.8% accuracy [225]; however, this technique has not found its way into the marketplace, and its performance gains are tied to complex interface techniques that are not easily separated from its specific implementation.
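One standard way to place such spellers on a single scale is Wolpaw's information transfer rate (ITR), which folds symbol count, accuracy, and speed into bits per minute. The sketch below applies it to the two spellers just cited, under our own assumptions of a 36-symbol matrix and one selection per character; it is a back-of-the-envelope comparison, not a figure reported by the cited works.

    import math

    def wolpaw_itr(n, p, selections_per_min):
        """Wolpaw ITR in bits/min for n symbols at accuracy p."""
        bits = math.log2(n)
        if 0.0 < p < 1.0:
            bits += p * math.log2(p) + (1 - p) * math.log2((1 - p) / (n - 1))
        return bits * selections_per_min

    # A typical speller: roughly 4.5 characters/min at 85% accuracy.
    print(f"{wolpaw_itr(36, 0.85, 4.5):.1f} bits/min")    # about 17 bits/min
    # The best performer cited above: 18 characters/min at 95.8% accuracy.
    print(f"{wolpaw_itr(36, 0.958, 18.0):.1f} bits/min")  # about 85 bits/min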

Now that hardware technology for capturing and recording EEG is both inexpensive and easy to acquire off-the-shelf, much interest and enthusiasm currently surrounds BCI, and it is being considered and studied for many uses. However, EP BCI is much too slow and inaccurate to be used in any practical way outside of academia and research. Specifically, EP BCI requires improvements in two main areas: (1) the interface and (2) the classifiers.

Notwithstanding issues that persist with the BCI interface, EP-based BCI depends on the reliable detection of EP features in EEG. This study does not focus on the interface directly; our focus is on machine learning models that may compensate for some of the interface shortcomings as well as differences across users and tasks, effectively reducing these effects as a function of design. Many of the current techniques used to detect and classify ERP assume the importance and presence of certain attributes in EEG over others in the signal. Spatial and/or temporal attributes in EEG are either lost or greatly reduced during the application of common filters and preprocessing techniques that cull out hand-picked attributes.

In many respects it is difficult to understand what these classifiers are actually classifying, and we cannot say with certainty that the attributes these classifiers are attempting to detect are, in fact, optimal. In addition, none of the traditional classifiers in EP BCI offer scalability across users. Training must be performed on each user multiple times throughout a relatively short time span as users normalize to the stimuli offered up through the interface. This technology cannot be offered to consumers until more effective improvements are worked into the domain that turn BCI into a reliable, uncomplicated and useful tool.

Thesis: Coupling deep transfer learning models with interpretable visual explainers may aid in the creation of BCI-specific classifiers that can generalize well across users and tasks (i.e. context); these models may contribute to the creation of robust, practical BCI systems that exhibit good performance and reliability for use in critical assistive, as well as clinical, settings.

1.3 Claims

The purpose of this dissertation is to advance the state of the art in classification algorithms by uncovering algorithms that show potential for good performance and scalability for multichannel, multi-user, unprocessed electroencephalography. In this work we disconnect EP-related tasks from the specific BCI, the underlying physiology (i.e. the user), and its application. We focus on improving classification for embedded actionable ERP from raw EEG. This is the first work to develop deep transfer-learning based approaches on learned attributes that scale across users and tasks irrespective of specific BCI application. In short, we want the classifier to determine what's important and what is not.

Furthermore, instead of trusting black-box approaches which mysteriously improve performance, we wish to discover what our classifiers are learning. Knowing what a classifier learns may aid in building and configuring BCI systems that perform optimally within a specific BCI context. These improved classifiers may displace current classifiers in many popular ERP BCI to provide increased system throughput and accuracy.

Claim 1. To date there has been limited investigation of transferability between subjects and tasks for BCI. This is worth investigating further.

• Reason 1: Success in other domains (examples are cited in this work) provides insight into approaches we may explore for BCI.

• Reason 2: To make BCI a practical technology for everyday use, we need to explore both task and subject transfer.

• Reason 3: Classifiers that can generalize well across subjects are important elements for supporting clinical diagnostic tools as well as actionable BCI systems.

Claim 2. Transfer learning may facilitate the discovery of more effective models that work with less training data and require less time to train within the context of a given task. There is huge precedent for models that operate on learned features as opposed to the hand-crafted features common in traditional BCI classifiers today:

• Reason 1: For some domains, especially those with spatial/temporal locality, these models (which "learn their own features") yield state-of-the-art performance. EEG data contains both temporal and spatial locality.

• Reason 2: Intuitively, hand-crafted features are giving up a degree of performance, presuming we can learn it through some form of optimization algorithm. Hand-crafted features tend to obliterate (oftentimes unwittingly) potentially important information.

• Reason 3: More scalable train/deploy considerations. We may rapidly deploy pre-trained classifiers for use across BCI applications and modalities.

Claim 3. Models that examine just a detected particulate of the EEG, e.g., P300, are potentially leaving a lot of performance on the table if other less-prominent, but important patterns are contained in the signal.

Claim 4. There is a need for investigating classifiers which operate on learned EEG features while considering an entire trial sample. This could negate the need to manually chop trials down to certain time-spans for training purposes. Ideal classifiers should have the ability to narrow in on the important temporal aspects during the training process.

Claim 5. It is important to explore whether machine-learned features make sense; this may help with building and tuning BCI systems for specific use cases.

1.4 Investigatory Items

Questions this study attempts to answer:

Query 1. How well can we apply deep transfer learning:

• To compensate for differences across users of BCI?

• To perform well despite "normalization" effects in individual users?

Query 2. How may we leverage learned cross-task features (i.e. the building blocks) in EEG for BCI applications (e.g. match vs no-match vs alcoholic vs control)?

Query 3. Can we reasonably know what EEG features are learned by our models?

Query 4. What available dataset may assist with these efforts?

1.5 Contributions

This work makes the following contributions:

• Novel transfer learning approaches towards generalization techniques that allow for simplified and shortened classifier training specific to EEG.

• Evaluation of traditional BCI classifiers with respect to performance on raw EEG data (i.e. no hand-crafted features).

• Evaluation of classifiers with respect to generalizability across subjects and tasks.

• Evaluation of minibatch-based nonlinear model performance aimed at the eventual explosion in size and availability of EEG datasets.

• Novel approach to visualization-assisted BCI development via interpretability techniques.

CHAPTER 2

WHITHER GOEST THOU, BCI?

BCI is not a new idea, and this chapter is a review of its background. The origins of BCI go back to the early twentieth century, when it was first discovered that the human brain generates electrical current; this led to the development of EEG in 1929 [227]. In 1968, Kamiya demonstrated that humans can indeed purposely modulate certain brainwaves [230]. The idea of BCI was first introduced by Vidal in 1971 as Brain-Computer Communication [238].

There is a rich and broad field of study and application for BCI. There are many forms of BCI in existence. Some BCI are designed to allow people with disabilities to communicate using on-screen words and sentences as well as auditory speech [37, 44, 231, 249]. Other systems are designed to extend mobility to people who otherwise could not move from place to place of their own accord [92, 106, 183, 259]. Still other systems are designed to interface with prosthetic limbs that act as arms or legs [42, 53, 66, 166].

Then, there are BCI designed around entertainment (gaming) and creativity (such as composing music, drawing or "painting images") [31, 69, 70, 105, 144, 173, 177]. To afford older or disabled individuals a certain degree of independence, there are BCI for changing channels on the TV, browsing the internet, and adjusting environmental controls [10, 34, 77, 90, 260].

There is much research interest in exploring BCI for assessing mental state, including level of attention, mental workload, and engagement, as well as mindfulness, well-being, and quality of sleep [30, 58, 95, 143, 180, 200, 247]. Research targeting clinical uses for BCI is plentiful. Areas of interest include identifying pathological brain function such as cognitive limitations, Alzheimer's, schizophrenia, depression, PTSD, ADHD, and more [48, 71, 79, 80, 102, 110, 175, 176, 180, 200].

2.1 A New Era of Technology: Fuel for Renewed Interest

There was a period in time when research into BCI had waned; this was due to inadequate and expensive technology that rendered anything practical in BCI all but impossible outside of research settings [159]. For example, even now the preferred way to record EEG is through the use of wet electrodes, which are cumbersome and inconvenient to prepare for use in any practical setting [257]. Excluding intracortical or subdural electrodes (implanted into or onto the brain), wet electrodes are preferred in research due to their lower impedance over (newer) dry electrodes; however, recent advances in dry electrode technology place its performance on par with wet electrodes. In addition, until recently, the equipment used to amplify and record EEG was too bulky and prohibitively expensive for use outside of clinical and research settings [45, 203]. Today, virtually none of these particular roadblocks remain.

Technology that lends itself to BCI applications is now offered by a number of open source and commercial companies. A number of existing papers on the performance of consumer-grade EEG products vs research and medical grade agree that the gap between medical-grade and consumer-grade technologies is closing, and it is clear that consumer-grade products provide a path to commercially viable BCI technology that is applicable to everyday life for individuals who need it. Hardware technology to support BCI exists, is readily available, performs well, and costs considerably less than it did even five years ago [7, 49, 74, 146, 169, 196, 232].

Research interest in BCI is growing exponentially [142, 246]. This growth is directly related to recent advancements in technology. These advancements have aroused renewed interest in BCI along with new ideas for its applications. BCI is now entering an advanced stage in its adoption and development [152].

2.2 BCI: The "Interface" in Brain-Computer Interface

So, how do users interact with BCI? Before we explore this question, there are a few things we need to know about BCI design. There are two types of EEG-based BCI systems in use today: (1) those that use event-related potentials (ERPs) and (2) those that use multiple-sensor EEG activities; this dissertation does not cover the latter. The latter is considered a more comprehensive approach because it can discover and use signals from specific regions of the brain, and may not require a specific stimulus. Researchers are also combining these systems to explore ways of improving BCI with hybrid approaches [90, 138, 249, 252]. Regardless of the approach, commands for these systems are embedded either in amplitudinal EEG data or in the EEG frequency domain [197, 224]. This will be a cursory overview, since our study does not directly rely on the type of interface (or the many variations thereof) used in soliciting ERPs. The purpose of this section is to provide a complete picture of an end-to-end BCI system so that the reader can better understand the context for this study.

There are several popular BCI systems that exploit various components in brainwaves for the purpose of encoding and decoding actionable system commands. This study is not particularly focused on any specific BCI system; the algorithms and techniques we explore are thought to be suitable for various event-related potentials (ERPs): potentials that are a result of some internal or external stimulus (sensory or cognitive) and that are time-locked with that stimulus. Section 4.1 provides the reader with more details on ERP.

A popular ERP component is the P300 (or P3) wave, which appears in the ERP over the brain's parietal lobe; it is further explained in Section 4.2. Many BCI systems have been designed around the P300. P300 BCIs tend to exhibit several attractive features, such as less required training (compared to systems based on other ERP components), speed, and ease of use. Since the P300 wave can be elicited across a variety of sensory modalities, research continues with P300 BCI systems [68]. Because the P300 is a popular ERP, the following section reviews how the P300 BCI interface works to elicit ERPs from users, giving the reader an applied example of a specific type of BCI system.

2.3 P300 BCI

P300 is a relatively late evoked response that the brain produces when stimulated through any sensory input, such as vision, touch, or audition. This particular brain response is called "P300" because it is a positively peaked (P) wave that forms around 300 ms after some stimulus is introduced and partially processed by the brain (see Section 4.1); we say that the P300 has a 300 ms response latency. We can take advantage of the nature of this particular response in such a way as to enable an individual to easily encode instructions with it that a computer system may decode into actionable commands; tasks involving such things as mobility, spelling, and gaming may be performed with a P300 BCI, to name a few. Clinical use of P300 BCI is also possible. We may also examine the modulation of the P300 wave to discover important properties related to the system (a person's brain) it came from, such as mental state, awareness, and task comprehension.

Because our study focuses on time-locked ERP, we shall highlight the P300 interface here. All BCI interfaces (the part of the BCI that users directly interact with) are designed to effectively elicit pronounced ERP waveforms from the user. The P300 is modulated by what is referred to as the brain's novelty response, in addition to task relevance and engagement [8]; therefore, the P300 interface is able to exploit this system state (i.e. the brain's functional state) regardless of stimulus modality (i.e. touch, auditory, or visual) [9]. All P300 BCI interfaces are based on the "active oddball" response technique, modality notwithstanding. The typical P300 BCI interface is based on the row/column (RC) paradigm first introduced in 1988 by Farwell and Donchin in their original paper [66]. Since most P300 BCI user interfaces (those based on visual stimuli) revolve around the P300 speller, or heavily borrow from it, we shall use the speller interface to demonstrate the concept. Keep in mind that interfaces based on auditory and tactile stimuli work in a similar way using similar patterns, but are designed to work best with senses other than vision to evoke the P300 wave. To better understand what evoked responses are, and for a more in-depth explanation of the P300, please see Sections 4.1 and 4.2.

Continuing with the P300 speller interface example, the typical RC interface consists of a grid with rows and columns of letters, where a consistent visual "flashing" pattern is applied to the entire grid by row and/or by column (see Figure 2.1). The number of columns and rows can and does vary (6x6, 8x9, etc.). Each letter will receive an "oddball" pattern within its respective row/column, making it briefly stand out from the rest of the characters if the user is actually looking at that specific location (staring at the desired letter in the grid). When this oddball pattern is applied to the target letter, the P300 response is produced by the brain, detected and processed (decoded) by the BCI, and that letter is chosen as part of the current word being spelled out by the user. The overall performance of this system amounts to about 5 characters per minute (roughly 0.5 bits per second) [43]. Note that the older UIs, and those that are based on the classical UI, tend not to use colors other than whites and grays.

If you now imagine changing the speller's grid from being a matrix of letters to being a matrix of command selections, then you can see how this paradigm can be mapped onto applications going beyond spelling out words. The task that is encoded in the user's P300 response is decoded by the BCI as an actionable task (i.e. command); that command is then carried out by the system. In the case of the speller, a letter is displayed on the screen as part of a word or sentence. The letter that appears on the screen will be the action carried out by the BCI in response to the user's modulated P300 wave.

Figure 2.1: (Top) The RC speller highlights a whole row or column at once with 6 symbols each. (Bottom, left) The SC speller highlights each character individually. (Bottom, right) Electrode positions according to the international 10/20 electrode system used for the EEG measurements. The head is viewed from above and the nose points to the top of the page. Source: [86].
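To make the decoding step concrete, consider the following minimal Python sketch. It is purely illustrative - the 6x6 grid layout and the per-flash classifier scores are hypothetical stand-ins for the output of a real P300 detector - but it captures how a target symbol falls out of the row/column intersection.

import numpy as np

# Hypothetical 6x6 speller grid laid out in the classic RC style.
GRID = np.array([list("ABCDEF"), list("GHIJKL"), list("MNOPQR"),
                 list("STUVWX"), list("YZ1234"), list("56789_")])

def decode_symbol(row_scores, col_scores):
    # Each score is a (made-up) classifier confidence that the
    # corresponding row/column flash elicited a P300, averaged
    # over repeated flash sequences.
    r = int(np.argmax(row_scores))   # row whose flash "stood out"
    c = int(np.argmax(col_scores))   # column whose flash "stood out"
    return GRID[r, c]                # target = row/column intersection

# Example with made-up scores: row 2 and column 3 dominate, so the
# decoded target is the letter "P".
print(decode_symbol([.1, .2, .9, .1, .3, .2], [.2, .1, .1, .8, .2, .3]))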

In recent years there has been growing interest in finding better ways to use stimuli in P300 BCI interfaces. While the P300 can be elicited and detected irrespective of the stimulus modality, it has been shown that visual stimuli are the most effective means [9]; nevertheless, research seeks ways to optimally apply P300 BCI for users with impaired vision, or who cannot control gaze [107]. Research continues to focus, in large part, on improving the visual P300 BCI UIs so as to elicit ideal modulated P300 waveforms; this is being done by exploring various ways of improving the UI. Some of these approaches involve varying the dimensions of the typical "speller" grid, varying the distance between symbols, varying the stimulus frequency [149] applied to symbols, and utilizing colors as an effective means of stimulus [195].

Still, other research considers alternative paradigms [67] and hybrid approaches [254] as well. It has been shown that providing BCI with more interactive interfaces may improve concentration, and thus performance, for users [56]. Below is a list of P300 BCI interfaces that recent works have focused on:

• The single character UI (SC): Works similarly to the RC, but flashes a single character at a time rather than an entire row or column. RC tends to elicit better results and faster throughput [86].

• The checkerboard (CB) UI: Designed to improve both user accuracy and system throughput; the screen is arranged in a checkerboard pattern with visual pattern regularization similar to that of RC. Results show a marked improvement [226].

• The color checkerboard (CCB) UI: Combining the checkerboard paradigm (above) and a color paradigm seems to further improve results over checkerboard alone [194].

• Text on 9 keys (T9): The T9 approach uses a 3x3 matrix with each key representing one of 3 characters - similar to texting on a mobile flip phone. Combined with a new random forest P300 classifier, the paper claims a 51.87 percent improvement over the traditional P300 speller [4].

• Using faces and other image stimuli: Using imagery as stimulus is showing promise. For example, flashing famous faces transparently superimposed on characters outperforms standard P300 spellers [108, 113]. Furthermore, combining faces with colors seems to enhance this approach [128]. Along these lines, it has been shown that, in general, object cues could be more effective than words and letters [55].

2.3.1 Trends in P300 BCI Application

P300 BCI is receiving a lot of renewed attention. There exist many proposed uses for P300 BCI, and it has been applied to many BCI domains to date. While P300 BCI can potentially become exceedingly practical for various uses, its full potential appears as yet untapped. It should be mentioned here that while there are a number of applications for P300 BCI (many still in research stages), P300 BCI is generally applied towards language-based communication through electronic devices.

Figure 2.2: P300 BCI wheelchair prototype. Source: [183]

BCI for mobility and electronic prosthetics are typically built around Steady-State Visual-Evoked Responses (SSVER) and Sensorimotor Rhythms (SMR), as these brain responses tend to be better suited, and thus more practical, for continuous motion interfaces and related tasks (see Section 4.1). The sections that follow briefly explore some of the popular trends for P300 BCI.

2.3.2 Mobility

As far as we can tell, there are no actual mobility devices in mainstream use today that are based on P300 BCI; however, prototypes have been produced and research remains active in this area. Most of the literature is focused on the feasibility of P300 BCI wheelchairs for would-be users with physical disabilities. There are several proposals that utilize a simplified version of the classic visual P300 speller interface (see 2.3.8). Another interesting approach relies on a tactile P300 interface. Of course, there are some proposals where P300 is included as a part of a hybrid system. Overall, it seems that P300 BCI research in this area is waning in favor of SSVER- and SMR-based BCI for these tasks. In spite of this fact, there are some notable papers that incorporate P300 BCI worth reading on the subject:

• Controlling a wheelchair using a BCI with low information transfer rate: A prototype built to utilize a simplified visual interface where preprogrammed destinations may be selected from an on-board portable computer. The system also incorporates simple stop and go commands as well [183].

• Toward brain-computer interface based wheelchair control utilizing tactually-evoked event-related potentials: Technology depending on visual or auditory input may not be feasible as these modalities are dedicated to processing of environmental stimuli. Hence, the researchers validated the feasibility of a BCI based on tactually-evoked ERP for wheelchair control [106].

• Wheelchair control by elderly participants in a virtual environment with a BCI and tactile stimulation: Typically BCI relies on intact vision. Such BCI are impossible to be operated by end-users with impaired or lost vision... For such end-users, BCI controlled by means of tactile stimulation may constitute a viable alternative [92].

• Control of a Wheelchair in an Indoor Environment Based on a BCI and Automated Navigation: ...a wheelchair based on unstable and noisy electroencephalogram signals is unreliable and generates a significant mental burden for the user. A feasible solution is to integrate a brain-computer interface (BCI) with automated navigation techniques [259].

2.3.3 Electronic Prosthesis

Moving electronic prosthetic limbs (arms, hands, legs, etc.) via BCI is mostly the domain of SMR BCI, which utilize motor imagery (MI) potentials. Nevertheless, this fact does not preclude the application of P300 BCI here. Just as with mobility, there is research seeking to understand the effectiveness of P300 as applied to electronic prosthesis. While it may not seem likely that P300 BCI will provide an optimal approach towards controlling prosthetic limbs, it could simply be that an applicable method has yet to be discovered.

• Control of a 9-DoF Wheelchair-mounted robotic arm system using a P300 Brain Computer Interface: Initial experiments: A wheelchair-mounted robotic arm (WMRA) system was designed and built to meet the needs of mobility-impaired persons with limitations of the upper extremities, and to exceed the capabilities of current devices of this type. A P300 Brain Computer Interface (BCI), the BCI2000, was implemented to control the WMRA system. It should be noted that the particular BCI system in this case caused up to a 15-second delay between command and control of the arm - too much to be practical [166].

• Reaching and Grasping a Glass of Water by Locked-In ALS Patients through a BCI-Controlled Humanoid Robot: Four ALS patients and four healthy controls were recruited and trained to operate this humanoid robot through a P300-based BCI. A few minutes of training was sufficient to efficiently operate the system in different environments. Inter-subject variability in EEG-BCI performance has been reported (Guger et al., 2012), but the relationship with clinical or demographic variables is still not clear. The publishers did note something interesting: performance remained quite stable through sessions (Sellers et al., 2006a). That is, users do not develop a larger P300 because of training, nor exhibit a decline resulting from habituation, which can occur in P300 experiments without feedback (Ravden and Polich, 1998) [209].

• Assistive robot operated via P300-based brain computer interface: Researchers present an architecture for the operation of an assistive robot, ultimately aimed at allowing users with severe motion disabilities to perform manipulation tasks that may help in daily-life operations. The motion of the manipulator is controlled relying on a closed-loop inverse kinematic algorithm that simultaneously manages multiple set-based and equality-based tasks. The robotic platform has to be capable of interacting with the environment and with the user; thus a system is needed to make the robot aware of the position of the user and of the possible objects [15].

2.3.4 Gaming

Similar to why P300 BCI is not best suited to continuous motion tasks such as mobility and robotic arms and extensions, P300 BCI has not proven to be very effective in the action game genre [144]. However, P300 BCI seems nicely suited for several game genres, such as strategy (chess), adventure (puzzle-solving), role-playing (RPG), and puzzle games (Tetris, Hex, etc.) [105, 144]. Since many games in these genres do not require a time component, they may be enjoyed by players even with slow BCI throughputs. For many users who have physical disabilities (especially those with limited mobility or who are housebound), gaming through P300 BCI can serve as a way to engage with on-line gaming communities or to enjoy a simple game of Solitaire by oneself. A related benefit we stand to gain through P300 BCI gaming is a better understanding of engagement and attention levels during game play [105]. A simple gaming interface might be a good way to train and calibrate a general P300 BCI system for a particular user. Chapter 5 explains how we may take ERPs elicited from activities under one context and infer interesting things from them that may map onto activities under a different context.

2.3.5 Home Assistance and Environmental Control

P300 BCI is a mechanism through which those with disabilities may gain control of and interact with the environment around them [10, 260]; P300 is well-suited to such tasks. One way to increase independence for such individuals is to allow easy access to certain environmental controls in the home, thereby reducing the requirement for human assistance. For example, the ability to dim lights, turn lights on or off, control the television, door locks, and the thermostat may allow those with disabilities to create a comfortable environment for themselves while affording a high degree of independence [19, 34].

2.3.6 Assessing Attention State and Mindfulness

This topic is of interest in many fields, mainly because it may help guide approaches towards treating disorders related to attention and mindfulness [58] as well as cognitive disorders [95]. This problem domain is challenged with assessing and classifying the quality or state of consciousness in humans.

There are many useful applications for reliable intelligence here: discerning quality of sleep and related disorders [247]; identifying PTSD [47, 80]; assessing overall brain health; early detection of conditions such as Alzheimer's disease [175, 176], ADHD [48], and schizophrenia [71, 102, 110]; monitoring cognitive workloads [200]; assessing attention levels towards improving effective learning (especially for e-learning) in the classroom [79, 180] and related attention-aware systems (AAS); and improving brain-computer interfacing (BCI) itself [25, 155].

2.3.7 Other P300 BCI Uses

Among some of the lesser-studied applications for P300 BCI worth mentioning here are musical composition [19, 173], "brain painting" (visual art) [19, 31, 87], web browsing [77, 90, 145], and assistive devices such as a LEGO device for turning pages in a book [42].

2.3.8 Spelling and Communication Systems (P300 SCS)

The P300 speller (and similar applications) is among the oldest and most popular applications for the P300 BCI [38], and there are a variety of implementations to date. The P300 speller paradigm has been, and still is, the standard benchmark for P300 BCI systems [68, 225]. It is also the chosen paradigm by which we will measure our own work. The idea behind the speller is to simply allow the user to spell out words and sentences that are then displayed on screen and/or sent to voice synthesizers for audio output. These spellers are popular subjects both for communication purposes and as virtual keyboards for general computing tasks.

As previously mentioned, there are individuals who live with a variety of physiological issues which might preclude them from using visual P300 BCI systems. Therefore, it stands to reason that tactile and auditory approaches are also being explored and applied to P300 BCI [33, 75, 107, 231]. Furthermore, it makes perfect sense that systems utilizing combinatory stimuli are also of particular interest [23, 193, 253].

To conclude this section, we can see that there is much interest in BCI, and BCI has the potential to improve the lives of many individuals who live with physical disabilities as well as those who don’t. While P300 BCI can be used for both mobility and prosthetic applications it will probably take a backseat to other BCI systems that show more promise in those areas, such as SSVER and SMR. Nonetheless, there are plenty of applications for which P300 BCI is well suited. Users may access environmental controls, lights, doors, and other electronic equipment via P300 BCI. In addition, we can enable individuals to communicate through P300 BCI spellers and synthesizers. P300 BCI has the potential to support and enhance creativity and artful activities for those who cannot use their limbs for such tasks. Individuals may engage with the outside world through BCI browsers and gaming interfaces.

2.4 Assumptions, Limitations, and Delimitations

It is necessary here to provide a bit of clarification on the stricter definition of BCI this study recognizes. From a general perspective, BCI is composed of software and hardware that work together to connect its users to some sort of system that interprets their brain activity. The definition and application of BCI varies, and depends on whom you might ask. There is the definition originally put forth in a 1971 paper, when the term "Brain-Computer Communication" was first coined by Jacques J. Vidal; he described it as "...utilizing the brain signals in a man-computer dialogue..." and "...that is, as a means of control over external processes, such as computers or prosthetic devices" [238]. This definition is still more or less valid; however, it is essentially a subset of what the term encompasses today.

Figure 2.3: Brain-Computer Interface. © "The Brain-Computer Interface Project" http://www.ece.ubc.ca/~garyb/BCI.htm

A modern, more accepted definition was set forth in 2012 by Wolpaw & Wolpaw; the authors define BCI as "a system that measures central nervous system (CNS) activity and converts it into artificial output that replaces, restores, enhances, supplements, or improves natural CNS output and thereby changes the ongoing interactions between the CNS and its external or internal environment" [245]. This study does not directly focus on this very broad definition of BCI, nor does it address the many variations and uses. Instead, this study focuses predominantly on actionable tasks encoded in certain types of human brainwaves; however, there is nothing specifically in this work that would preclude it from applying to the broader sense of the definition.

Slightly refined for this study, we recognize BCI to be a system that enables its users to interact with external technology using nothing more than their own brainwaves. To distill the term further, BCI is a system designed to decode and execute actionable tasks from EEG. Tasks are related to the design and purpose of the BCI; they take on meaning within the context of a specific application.

Some examples of tasks might be moving objects on a screen, spelling words, or generating sentences. For example, a very simple BCI speller, which allows its user to spell and communicate with individuals, might consist of tasks that spell out symbols, letters, and/or sentences onto a screen - these actions may also involve sending output directly to a speech synthesizer. Tasks may be physical in nature, such as moving a prosthetic arm or leg, or controlling a wheelchair or drone. Finally, a task may take on a more abstract definition in that it may not be directly related to anything requiring execution; rather, it is merely detected and classified. Each type of task requires a BCI specifically designed for and applied to its execution. The execution of a task is an action that is taken on behalf of the user.

2.4.1 Electroencephalography, Nothing More

The work presented herein could possibly be applied to other forms of recorded brain activity in the context of BCI; however, EEG is the data this study revolves around, and it is referenced heavily throughout. It is beyond the scope of this dissertation, as well as the author's area of expertise, to write with any authority on the matter of the underlying physiology of EEG; however, for those interested in learning some basic details about the physiology behind EEG, additional information and authoritative references are supplied in Appendix B.

This study focuses on EEG gathered through external means. That is to say, we have not directly worked with EEG collected through the use of intracortical or subdural electrodes during this study. These methods require surgery for permanent or temporary placement of electrodes into or onto the brain itself. Moreover, they are typically used in medical procedures for research, diagnostics, or treatment involving individuals with Parkinson's, epilepsy, or other neurological disorders [172, 189, 212, 228, 248]. Since the main focus of this study is improving BCI for general consumer use, it hardly makes sense to require that users undergo major surgery before using it.

The EEG signals the author focuses on are specifically categorized as electrical biosignals, or multivariate bio-electrical time-series data. To ensure that this research is both practical and effective, the author has narrowed the scope to this specific physiological signal type and the narrowly associated topics as described.

A distinction (or rather the lack thereof) should be noted here. This dissertation may on occasion use a slightly different term when referring to ERP: "evoked potential" (EP). Oftentimes EP and ERP are used as synonymous terms in high-level discussion; however, there is an important distinction to highlight here. An EP often refers to a time-locked potential, and as such, an EP is typically phase-locked to the onset of an external stimulus or event [224]. ERP can also refer to potentials that are not phase-locked to the stimulus. Technically, what this means is that averaging of a signal usually brings out EP while hiding such ERP.
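As a minimal illustration of that averaging step (a sketch, not code from this study), stimulus-locked epochs of a single EEG channel can be averaged so that phase-locked activity survives while non-phase-locked activity tends to cancel:

import numpy as np

def average_epochs(eeg, stim_onsets, fs, t_min=-0.1, t_max=0.6):
    # eeg: 1-D signal; stim_onsets: sample indices of stimulus onsets;
    # fs: sampling rate in Hz; window spans t_min..t_max around onset.
    pre, post = int(round(-t_min * fs)), int(round(t_max * fs))
    epochs = [eeg[s - pre:s + post] for s in stim_onsets
              if s - pre >= 0 and s + post <= len(eeg)]
    # The mean across epochs emphasizes the phase-locked EP.
    return np.mean(epochs, axis=0)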

Unless we refer to this specific situation, EP and ERP are synonymous terms in this study.

CHAPTER 3

PREVIOUS WORK: CLASSIFYING OSCILLATORY BRAIN FUNCTION

The research summarized in this chapter represents a multi-year effort that highlighted the importance of finding better features in EEG signals, as well as the significant amount of effort it takes to find and validate new hand-crafted features. It provides both a background on how hand-crafted features are developed and motivation for why we want to move to machine-learned features.

This technique was applied in several of the author's published works [122, 123, 125, 151]. Dubbed neuralClustering, this approach provided a platform to explore a wide range of applications for EEG feature classification. In addition, this work provided an approach towards useful visualization tools dubbed the Invaluable Electroencephalographic Research Software Suite, or IERSS (pronounced "iris"). IERSS proved useful not only for training simple linear classifiers that were good performers, but also for gaining insight into how EEG changes with different mental tasks [150]. The new visualization techniques in Chapter 8 are an extension of what we learned from these earlier works.

Together, IERSS and neuralClustering were applied to research that explored identifying pathological brain function, changes in mental workload, and monitoring patients in critical clinical care settings. The overarching theme focused on identifying interesting attributes embedded in EEG and then exploring how simplified linear methods perform on classification with custom hand-crafted features. These studies utilized large datasets containing contiguous, 30-day recordings of rat brain activity as well as human datasets.

This research succeeded in demonstrating that the dimensionality of EEG analysis may be significantly reduced to provide a simpler approach to oscillatory brain function classification. When compared to similar notable studies at the time [3, 78, 109, 205], our results yielded better sensitivity and overall precision.

After culling usable feature vectors from EEG data, applying two forward-connected, single-layer, 2-dimensional perceptrons to classification proved successful. We were able to identify, with a high degree of sensitivity and precision, oscillatory brain activity within the time domain of EEG signals. Furthermore, once trained, this system is capable of automatically identifying oscillatory features hidden in vast amounts of EEG data.

3.1 EEG Feature Generation and Examination

To the trained eye, most seizure activity is clearly delineable from artifact and normal activity in an EEG recording. In most cases, clinicians can easily point out and identify abnormal brain patterns by simply observing an EEG recording in its native time domain format; however, it may surprise many to learn that there exists no reliable automated machine method for identifying pathological brain function such as is present during seizures. While much research in this area has brought to light possible ways to automate the identification of abnormal brain function, there is no tool in clinical use today that does this.

In previously published work on machine learning related to pathological EEG patterns, the author of this dissertation, along with Dr. Andrew White and Dr. Rory Lewis, identified several challenges to existing approaches to automated epileptiform identification: 1) much EEG data contain artifacts and noise throughout the recording; 2) in the case of multi-channel data, the technique for consistently tracking correlation among channels between ictal periods is non-trivial, because not all seizures progress the same way between interictal periods; and 3) there is disagreement among epileptologists as to what actually constitutes a seizure. To simplify the problem, and to overcome the complexities therein, the authors devised a technique that greatly reduced the dimensionality of the problem, called neuralClustering [123, 124].

3.1.1 Theory for Observable Seizure Progression in EEGs

Early experimentation led to the idea that, by culling out the right attributes from EEG signals, an approach could be devised that made discretely quantifying differences in pathological oscillations vs. normal oscillations a relatively simple task. Furthermore, this information could be obtained while remaining within the time domain of the signal. This hypothesis was based largely on the discovery that artifact is randomly distributed while normal neural activity remains relatively stationary throughout the signal [123].

Figure 3.1: EEG showing the onset of a seizure. Note the drift in amplitude and frequency from left to right as the seizure progresses. Frequency drops while amplitude increases - this is a typical manifestation of neuronal hyper-synchronization.

The relatively stationary nature of healthy EEGs is largely due to consistently incoherent activity between healthy functioning neuronal groups activating independently from other neuronal groups. This is manifested in the EEG as higher frequency, lower amplitudinal data throughout the majority of the recordings. The inverse (pathological) was understood to be hyper-synchronization taking place between neuronal groups, manifesting as a systemic frequency drift with increased amplitudinal strength in EEGs during epileptiform activity (see Figure 3.1). Therefore, it was reasonable to assume that a clearly delineable pattern should emerge above all other information during ictal periods [123, 124].

Evidence for this theory was originally glimpsed using Matlab, which seemed to confirm the idea of clustered movement during seizure; however, this was done on a small scale due to the vast amount of data present in the EEGs. In order to process many gigabytes of data, a custom software prototype was incrementally created for the task. Henceforth, each idea presented in this chapter was incrementally implemented by the author using C# in a software project specifically created to assist with this research. After implementation and application of the neuralClustering algorithm, clustered movement was positively observed during seizure; the neuralClustering algorithm now provides us with discrete attributes from EEG recordings.

3.1.2 The neuralClustering Algorithm

While neuralClustering cannot directly provide classification of neuronal activity, it provides features whereupon classification may be based. This technique was devised while searching for an efficient and uncomplicated way to reduce the dimensionality of EEG data for classification purposes [123]. neuralClustering brings out both spatial and temporal attributes from the EEG signal, all while staying within the signal’s time domain. This process involves several steps.

3.1.2.1 Signal Bisection

A key function in initiating the neuralClustering algorithm is the splitting of the signal, or signal bisection. Bisecting the signal allows the creation of a baseline that provides context for what might be happening in the signal at a given point in time. It allows the detection of amplitudinal and frequency drift from within the time domain of the signal.

For this research, the signal was bisected using a mean line constructed by averaging the signal every n sample points, where n is equal to half the signal's sampling rate. The EEG signals used here were sampled at 1000 Hz; therefore, the mean line was put together piecemeal using averages over every 500 points in the signal. Each joint was interpolated for smoothness with

$y = y_0 + (x - x_0) \times \frac{y_1 - y_0}{x_1 - x_0}$

where y is the amplitude and x is time; this smooths the line function by eliminating extreme slopes (see Figure 3.2).

Figure 3.2: Section of EEG bisected with mean line (in red) created using line segments averaged over 500 sample points, connected and interpolated.
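A minimal Python sketch of this construction follows. It is illustrative only - the original implementation was in C#, and anchoring each segment mean at the segment's center is an assumption made here for the interpolation step:

import numpy as np

def mean_line(signal, fs):
    # Average the signal over every n = fs/2 samples (500 at 1000 Hz)...
    n = fs // 2
    n_seg = len(signal) // n
    seg_means = [signal[i * n:(i + 1) * n].mean() for i in range(n_seg)]
    # ...then linearly interpolate between the segment means so the
    # mean line has a smooth value at every sample index.
    centers = np.arange(n_seg) * n + n // 2
    return np.interp(np.arange(len(signal)), centers, seg_means)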

3.1.2.2 Signal Preparation

The neuralClustering algorithm should be able to pull important features from the EEG signal without requiring any filtering or other preparation. Therefore, filtering or other preprocessing is reserved for signals that may have unusually high background noise or artifact (spikes in the signal not related to events of interest). Depending on the origin of the EEG recording, there may be a steady background noise in the 50 Hz (Europe) or 60 Hz (US) regions. This noise does not interfere with neuralClustering because it is well above the bandwidth of interest. Note that the data used in this research did not require any filtering.

Table 3.1: Filters and Techniques: kernels applied with various kernel sizes. In the case of the windowed-sinc filters, they worked best using larger kernel sizes, but performance was poor; these filter kernels are Gaussian, and therefore require higher fidelity, and performed best using frequency convolution through FFT [206].

Kernel                                       | Kernel Size      | Notes
Moving Average                               | 15 and 51        | Fast, simple, and effective.
Cubic Spline                                 | 5 and 100        | Reconstructs entire signal by substituting cubic spline segments.
Windowed-Sinc: Blackman, fc = .14 (α range)  | 50, 200, and 500 | Poor performance in time domain; good filter articulation above 500 points.
Windowed-Sinc: Hamming, fc = .14 (α range)   | 50, 200, and 500 | Poor performance in time domain; good filter articulation above 500 points.

After extensive experimentation with various low-pass filters and techniques, it was determined that filtering in this instance did not improve upon the algorithm's ability to effectively pull out the attributes used for classification. In fact, if the filter created a signal that was too smooth by eliminating high frequency noise above 13 Hz, the seizure attributes were actually more difficult to discern from artifact. This was a result of eliminating too much of the contrasting information, making artifact curved and prominent as well. In this case, there was a tendency for higher false positives in our result sets. The experimentation included high-resolution kernels as well as low-resolution kernels. The filters applied, all using convolution in the time domain, were attempts at targeting a frequency range of 5 Hz - 15 Hz, the area thought to be of interest for targeting seizure activity.

Eliminating gross artifact was necessary in one rat's EEG recordings. There were areas in the signal that resulted in large, clipped spikes and other artifact. This artifact was spread throughout the signal, and in the range of interest, proving difficult to eliminate with simple filtering. The solution was to re-shape the signal using a modified boxcar kernel. The boxcar kernel is named for its impulse response, which is shaped like a boxcar; it is a kernel with n positive numbers, all derived from 1/n, and it is commonly used in moving average filters. In this case, a 4-value boxcar was modified from [.25, .25, .25, .25] to [.25, .25, -.25, -.25]. The result was a differential kernel (not a boxcar at all) that had a profound effect on reducing low frequency artifact when convolved with the signal.

Figure 3.3: Originally taken from a 4-value boxcar kernel, the new kernel functions as a differential function that forces a signal to become symmetric around the origin - zero. (a) Differential kernel. (b) Filter's output.

Figure 3.4: Non-Seizure. (a) Before: artifact in the signal can result in a false positive. Notice the wide spread as well as amplitude clipping. (b) After: artifact eliminated, having been reduced to background noise. Notice that the signal revolves around the zero line, and that the mean line is at or near zero as well. Spikes on both ends are easily classified as non-seizure.

Essentially, when convolved with the original signal, the kernel forces the signal's sinusoidal function to become symmetric around the origin; thus a mean function would lie on zero throughout the function. The overall effect is that high frequency noise falls drastically in amplitude around zero, while high amplitude, low frequency activity maintains prominent symmetry around zero. This technique proved to be valuable in the latter steps of neuralClustering, creating clearer separation between artifact and low frequency, high amplitude integrals around the origin.

Figure 3.5: Seizure: Applying the differential filter ultimately leaves seizure intact, while minimizing noise and artifact. (a) Before: seizure with artifact (A and B) in the signal. (b) After: seizure (B) with artifact (A) separated; the artifact on the left is easily classified, while the seizure can also be correctly classified.

$y[i] = x[i] \ast h[i] = \sum_{j=0}^{M-1} h[j]\,x[i-j]$ (3.1)

Overall, the idea behind the neuralClustering algorithm is that it be fast, efficient, and not require unnecessary steps for obtaining results. Preprocessing is recommended only in cases where classifier training yields results below the desired precision, and where it is proven that precision greatly improves after preprocessing. This is mainly because convolution in the time domain is expensive, with time complexity of $O(N^2)$ (see the convolution function, Equation 3.1).

3.1.3 Summations of Amplitudinal Area

With a properly conditioned signal, the next step in neuralClustering involves integrating the signal where it intersects with the bisecting line. This yields the area of each of the signal's segments above and below the mean line as it crosses the mean line. Instead of performing actual integration, an approximate summation works just as well, and helps to keep the cost of processing low. Note that the $-\,l(x_{i-1})$ and $-\,l(x_i)$ terms compensate for the mean line function - we want that excluded from the calculation (see Equation 3.2).

The amplitudinal summation calculation was performed using multiple sums with the Trapezoidal Rule, $TRAP = (LEFT + RIGHT)/2$, which is an approximate technique for calculating the definite integral in the limit of $\Delta x_i$, known as the Riemann integral of $f(x)$ over the interval $[a, b]$ (refer to Equation 3.2). The shaded areas in Figure 3.6's plots show the lower & upper sums. This information is converted into two-dimensional points where Y holds the value of the area and X holds the duration between the first and second intersections with the mean line (see Figure 3.6).

$Y = \int_a^b f(x)\,dx \approx \sum_{i=1}^{n-1} \frac{\bigl(f(x_{i-1}) - l(x_{i-1})\bigr) + \bigl(f(x_i) - l(x_i)\bigr)}{2}\,\Delta x_i$ (3.2)

Figure 3.6: (a) Amplitudinal area above and below the mean line is converted into 2-dimensional points where Y = area and X = temporal duration. (b) An actual screenshot of EEG with the calculated amplitudinal sums in red.

3.1.4 Temporally-Bounded Clustering

Once the integrated points have been calculated and stored, a picture representing a pattern of brain activity over time can be generated. This picture forms a scatter plot. The scatter plot represents functional disposition with respect to time in EEG data, and provides a basis from which attributes may be expressed in discrete terms for further analysis. From the scatter plot, we can observe some interesting spatial and temporally-related qualities found both in normal and abnormal neuronal function. When these qualities are contrasted, it becomes possible to consider data classification for pathological and non-pathological function. An additional step is required in order to observe gradual changes in brain function over time in spatial terms. This step incorporates building overlapping time sequences, bound by a specific duration, that incrementally move forward through time. This step essentially clusters together points that are related to a specific window of time.

Figure 3.7: Initial verifiable evidence in Matlab showing clustered movement during seizure: 8-second windows with 7-second overlap. (A) indicates a time before onset at the origin, (B) shows the beginning of seizure movement, (C) illustrates movement away from the origin as the seizure progresses, (E) shows how activity goes back toward the origin as the seizure subsides. Total duration of movement: 2.33 minutes.

The first model for this idea was tested in Matlab using 8-second windows with 7-second overlaps over all three of the EEG signals. Matlab served to provide evidence that corroborated the theory that normal brain function remained stationary while epileptiform activity moved away from the stationary origin in clusters; it was the first concrete evidence to be had up to that point in the research (see Figure 3.7). The amount of generated data overwhelmed Matlab, as there were thousands of points to animate every second in time. It was clear that a custom program could be created to specifically address performance issues. In addition, it was thought that by reducing the complexity of the clusters, and reducing the time windows to finer increments, it might be easier to gain more insight with respect to what was happening during the progression of a seizure over time.

3.1.5 Movement of Centroids Through Time

To further reduce the dataset's complexity, and to increase understanding of seizure progression in EEG data, the points of a cluster within a window were further reduced to a single centroidal point; all points were eliminated in favor of using the cluster's calculated center point. To gain a better record of movement, it was determined that a 3-second window with a 2.75-second overlap provided sufficient ability to track systemic changes in the EEG signal caused by the onset of epileptiform activity. 3-second snapshots, moved forward in time by .25 seconds, provided a smooth transition with enough refresh rate to follow neuronal activity without skipping over potentially important details. As demonstrated in Figure 3.8, the problem was now vastly simplified, requiring the examination of an immensely reduced problem set.
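The windowing and centroid reduction can be sketched as follows (illustrative Python; the representation of points and times is an assumption, not the IERSS data model):

import numpy as np

def centroid_track(points, times, window=3.0, step=0.25):
    # points: (N, 2) array of (duration, area) scatter points;
    # times: the time in seconds at which each point occurred.
    points, times = np.asarray(points), np.asarray(times)
    centroids, t = [], times.min()
    while t + window <= times.max():
        mask = (times >= t) & (times < t + window)   # 3-second window
        if mask.any():
            centroids.append(points[mask].mean(axis=0))  # cluster center
        t += step                                    # 2.75-second overlap
    return np.array(centroids)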

Recorded centroidal movement provides a simplified model of brain activity. Using this abstraction, we can better analyze the structure of normal oscillatory function as well as the pathological. One way to gain a better understanding of the brain's oscillatory function is to compare models with each other. For example, we can now observe not only differences between normal and pathological activity, but also the similarities between similar events. Figure 3.9 shows how we are now able to overlay seizure patterns for model comparison. This can also be performed on normal activity as well (see Figure 3.10).

Figure 3.8: Example of a moving centroid. Shown is about 11 minutes of EEG activity. The centroid's current position is marked in red, while its past positions are shaded gray. The current position corresponds to seizure activity. Notice the dense area in the lower-right; this corresponds to normal brain function. Each point represents a 3-second window of time, while the tail of the centroidal vector reaches back 1 second in time. For visual purposes, the time forward was incremented .05 seconds (2.95-second overlap).

It appears quite obvious that comparing models in this way lends itself to many applications with respect to further research into brain pathologies and prediction. However, with respect to this research, it serves only to prove that brain function can be consistently and effectively classified; a machine can be taught to identify pathological brain function from this scatter plot.

Figure 3.9: 4 seizures from 2 different animals (2 for each rat). Notice how the patterns are similar, and the centroids for each cluster (circled in red) are close in proximity one to another. Each seizure represents between 30 and 60 seconds of activity.

Figure 3.10: 9 normal oscillatory sections taken at various times from one animal. Notice how similar patterns can be grouped. Even though there are distinctly two different patterns here, they occur at various times throughout the EEG recordings. This model comparison points towards a way to quantify distinctly different states of brain function that are regularly repeated. Each cluster represents roughly 4 to 20 seconds of activity.

3.2 Classifying Oscillatory Brain Function

A scatter plot based on amplitudinal area vs. temporal duration exists due to the neuralClustering technique described in Section 3.1. However, before a classifier can be chosen, more knowledge about the data is required. The goal is to find a set of attributes that could be classified using linear classifiers. If linearly separable attributes existed between epileptiform and normal brain function, it would be a rather trivial affair to process and classify these attributes using a simple algorithm; however, as can be seen from Figure 3.11, the scatter plot in its current form is not linearly separable in its 2-dimensional space. It becomes obvious at this juncture that we must look at these clusters differently if there is any chance of pulling linearly separable data from them.

Figure 3.11: Black clusters are normal brain function, while red clusters are seizures. The larger squares are the cluster centroids. Notice that from this perspective, the data is not linearly separable.

3.2.1 A Different Perspective on the Scatter Plot

The scatter plot represents brain activity in 2-dimensional space: area vs. temporal duration. Unlike in some applications that deal with clusters of information, an advantage here is that we happen to know that these centroidal clusters are built sequentially in time, even though it is not obvious by looking at the scatter plot. It then becomes a matter of looking at these clusters linearly through time. To do this, the Y value (area) is maintained, while the X value is replaced by an incremented index through time that starts at zero.

Metaphorically, this is akin to unraveling a knotted and tangled string by stretching it out end to end so that it does not cross itself. While the line won't be flat, the hills and valleys on the line are exactly what we desire to examine. We are still working with 2 dimensions, but now we are on a different plane. In fact, if we were to bisect this string with another string, we would be looking at a function similar to the original signal when we bisected it, but this function will have a much simpler structure. The bisecting line is referred to as the feature threshold. This line is important because it defines a natural threshold whereby all data below it can be ignored; we are only interested in anything above the feature threshold.

3.2.2 Human-Assisted Classification

Figure 3.12: Recap: (I) EEG data and signal preparation, to (II) bisection, amplitudinal area vs. temporal duration, and bounded, overlapped time windows of activity reduced to centroidal movement, to (III) the centroidal phase plot.

We now have a way to single out certain activity from a massive centroidal cluster. Refer to Figures 3.13 and 3.14. The software extends the ability to select distinct feature clusters and use those clusters for later training. The cluster itself is defined by all of the points between when the function first crosses above the feature threshold line and when it falls back below that line. This cluster is known as the feature cluster. For example, if someone were to highlight a feature of interest, it would look similar to what is shown in Figure 3.14. In Figure 3.14 we see what a seizure looks like, and in Figure 3.15 we can easily see how one could select a feature cluster and then designate it as being either "seizure" or "normal". This is the first step in the assisted machine learning approach. We can now create and store human-classified data that can be later used to train our classifiers.

At this juncture we now have a way to cull human-classified data from our scatter plot; however, it is still not clear what type of classifier we will be training. To understand that, we need to further examine these new feature clusters.

Figure 3.13: At the top-right, a scatter plot representing approximately 13 minutes of centroidal cluster movement. Below, stretching from left to right, is the same cluster, only it is "stretched" out sequentially and bisected with a red mean line known as the feature threshold line. Notice a large "hill" in the middle. This is a seizure lasting about 46 seconds.

Figure 3.14: Now we have a way to select specific features from a large cluster. In white is a selected seizure feature cluster that crosses above the feature threshold line. The corresponding cluster segment is pictured in purple at the upper right, with its own centroid in red. We have now defined a new cluster that is a subset of the larger structure. This can now be classified as a seizure in discrete terms.

Figure 3.15: Just as we were able to select seizure features in Figure 3.14, we can select non-seizure features and classify them as well. Note that 1 is a seizure, 2 is its corresponding cluster, 3 and 5 are non-seizure, and 4 and 6 are the related clusters with centroids displayed prominently.

3.3 Selecting Linearly Separable Class Attributes

Experimentation led to the discovery of several attributes that could be obtained by examining the feature clusters after the clusters were identified, classified, and stored for training. The first attribute of importance was the area over the feature threshold line, the feature area. Notice the white area of selected features in Figures 3.14 and 3.15; it is an important attribute because almost all seizures, with few exceptions found in the sample data, have area that is much greater than non-seizures. This integral was derived in the same way area was calculated when creating the original scatter plot, by bisecting the signal (see Figure 3.6 and Equation 3.2 for details).

Another useful attribute was the linear length of a given feature cluster, the arc length. The arc length was approximated by summing the line segments, using the Euclidean distance between points, for the duration of the activity (see Equation 3.3). Remember that the duration of the activity is defined by when the function first crosses above the feature threshold until it falls back below the threshold. Refer to Figure 3.14, and notice the highlighted purple features of the selected event. The arc length is based on when the event started and when it ended: how "far" a given point traveled during that time.

It turns out that these two attributes together have a special relationship; they consistently show high covariance above the feature threshold, at least for the seizures observed during this research. In two dimensions, if we use the feature area measurement for the Y axis and arc length for the X axis, we can observe some interesting spatial differences between seizure activity and certain non-seizure activity when plotted (see Figure 3.16). In addition, and more importantly, seizure and non-seizure are linearly separable in a single-dimensional hyperplane, because seizure functions tend to have more area and greater arc length than their non-seizure counterparts.

It is important to note that most non-seizure activity would remain in the lower-left corner of the hyperplane, because it remains below the feature threshold. This activity will be ignored altogether by our automated program because it is so insignificant and does not rise above the feature threshold. Only select non-seizure activity needs to be included in the training set, and that activity should only be from a sample that rises above the feature threshold line, as seen in Figure 3.14.

$l = \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2 + \cdots + (q_n - p_n)^2} = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}$ (3.3)

Figure 3.16: Feature area vs. arc length: (a) green is normal function, while blue is seizure. Notice the high covariance that exists between X and Y. (b) This data is also linearly separable, and can be used to train a linear classifier.
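Both attributes are cheap to compute; a minimal sketch follows (illustrative Python, not the IERSS code):

import numpy as np

def arc_length(points):
    # Sum of Euclidean distances between consecutive cluster points
    # (Equation 3.3 applied segment by segment).
    points = np.asarray(points, dtype=float)
    return float(np.sum(np.linalg.norm(np.diff(points, axis=0), axis=1)))

def feature_area(area_seq, threshold):
    # Total area of the stretched sequence above the feature threshold.
    a = np.asarray(area_seq, dtype=float)
    return float(np.sum(np.clip(a - threshold, 0.0, None)))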

In addition to area vs. arc length, another set of linearly separable attributes was discovered. These attributes are cDistance and pCount. cDistance is a number indicating the average Euclidean distance between a feature cluster's points and its centroid. pCount is simply the number of feature vectors found in the entire feature cluster. When related to other features, they can help us to understand the density of a feature cluster. If we take Y = area/cDistance vs. X = pCount/cDistance, together these features give us data that is linearly separable in a single-dimensional hyperplane between seizure and non-seizure. See Figure 3.17.

Now we have two distinctly different classification hyperplanes to work with. The next task is to select a linear classifier, implement it, and train against it.

Figure 3.17: Y = area/cDistance vs. X = pCount/cDistance: (a) green is normal function, while blue is seizure. (b) This data is also linearly separable, and can be used to train a linear classifier.

3.4 Results and Comparisons

There were a total of four animals for which EEG data, sampled at 1000 Hz, was analyzed. The animal subjects were labeled 246, 288, 307, and 309. In addition to the EEG data were Excel spreadsheets, created by trained researchers, identifying all the seizures within each file for each animal. The goal for this research was to get IERSS to automatically identify epileptiform activity, and to match all of the seizure entries previously identified by trained researchers contained in the spreadsheets for each animal.

It should be noted here that during the perceptron training process, it was necessary to modify the training set several times per animal. Because the perceptron algorithm does not produce an optimal hyperplane, it is easily overtrained so that there exists no optimal margin between the two classes of data. In most cases, less training data worked best so long as the sampling was good, keeping the training set to an average of five seizures and fifteen non-seizure events.

To ensure proper comparison to other seizure identification methods, a standard of performance measurement was adopted here:

• True positive (TP): correctly classified as an epileptiform event.

• True negative (TN): correctly classified as a non-epileptiform event.

• False positive (FP): incorrectly classified as an epileptiform event.

• False negative (FN): incorrectly classified as a non-epileptiform event (a miss).

• Sensitivity is defined as: $S = \frac{TP}{TP + FN}$

• Specificity is defined as: $K = \frac{TN}{TN + FP}$

In our results, the true negatives were not tracked. Essentially, everything not classified as a seizure qualifies as being TN. As such, actual specificity cannot be provided here. In its stead, we present precision: $P = \frac{TP}{TP + FP}$.

Out of the four animals, animal 246 was the most difficult to train for. Its EEG was plagued with much artifact and other anomalous frequency drift throughout. In fact, the initial results for 246 were 89.6% sensitivity with precision of only around 85% due to false positives. In this case, the differential filter (see Section 3.1.2.2) was applied to 246's EEG. The final results saw improvement in both sensitivity (95.6%) and precision (89.7%).

Out of the 42 GB of total data for all animals, 822 of the 851 pathological oscillations were identified, which equals 96.6% sensitivity. Among the identified pathological oscillations there were 48 false positives, leaving us with a 94.5% precision rate. The final results for each rat are shown in Table 3.2. The identification routine was able to process almost 8 GB of EEG per hour. This performance is reasonable, but can be improved upon substantially: the current system devotes a lot of processing power to displaying graphical information the entire time it is processing. If the graphical processing were removed, it is estimated that performance could be more than doubled.

Rat      Sensitivity                False Positives   Precision
307      267 out of 281 = 95.00%    8                 (281/289) = 97.00%
288      136 out of 141 = 96.40%    0                 100.00%
246      108 out of 113 = 95.60%    12                (108/120) = 90.00%
309      311 out of 316 = 98.40%    28                (311/339) = 91.70%
Total    822 out of 851 = 96.60%    48                (822/870) = 94.50%

Table 3.2: Results of identifying pathological oscillations in rats from 42 GB of EEG data

3.5 The Custom EEG Software Suite: IERSS

While performing this research, a custom software suite was incrementally built by the author to support it. As the need arose, new pieces were added to the software. It was meant to be a tool to help visualize concepts, implement algorithms, and prove or disprove certain ideas about the data we were working with. The software has since matured into a "suite" of tools that have been invaluable to the research and production of this thesis. The software has been dubbed the Invaluable EEG Research Software Suite, or IERSS (pronounced iris) for short. IERSS remains in prototype form, has several bugs in non-critical areas, and lacks professional UI polish and ease-of-use in some areas. Nevertheless, it will continue to evolve in directions dictated by future research.

The entire software suite was developed on the Microsoft .NET 4.0/4.5.1 framework using C# as the programming language. It was compiled as a 64-bit system, and can only run on a 64-bit OS. The reason for this is to overcome the 4 GB resource limit of 32-bit Windows processes; for various animals, the EEG data per animal could exceed 10 GB. As a 64-bit process, IERSS did not need to contend with advanced memory mapping and other memory-management techniques. This chapter will briefly cover the IERSS features used most for this research.

3.6 Visual Signal Inspection and Manipulation

One very important feature to have during this research was the ability to visually inspect the raw EEG data recorded from the animals. This included the ability to scroll forward and backward through the signal, stretch and scrunch the signal, scale the signal vertically, and view the individual sample points in the signal. In addition, the ability to search the signal by date/time indexes was very important; this allowed us to locate predetermined seizures and artifact for further investigation and results comparison.

Figure 3.18: IERSS preview window allows user to search and inspect parts of the signal before loading the data.

The ability to apply bisections to the signal as well as highlight signal intersections

proved helpful in allowing us to pick up drifts in amplitude and frequency. The ability to

point and click on any region within the signal and obtain information at that offset like

date/time and area above/below the bisecting curve helped with forming initial theories

about the data in the signal. IERSS even offers a feature that allows the user to preview

sections of the signal before loading it into the main interactive window (see Figure 3.18).

All of these features were designed and implemented from scratch to ensure maximum

usability and flexibility for this research.

Figure 3.19: IERSS allows visual inspection and the application of a bisecting line, too. The bisecting line is displayed in red, while the physical intersection is displayed using blue boxes. The zero line is a thick gray horizontal line.

Figure 3.20: IERSS allows visual inspection of a selected intersection between the signal and the bisecting line function.

Figure 3.21: IERSS allows the user to turn on shaded area view for inspecting signal features where it crosses the bisecting line function.

Figure 3.22: IERSS offers the ability to zoom into the actual individual signal samples for closer examination.

3.7 Data Transformation and Visual Animations

The results of this research rested heavily on the ability of IERSS to perform. IERSS contains the entire implementation of the neuralClustering algorithm, the data classification and training tools, and the automated epileptiform identification algorithm. As explained in section 3.1, the neuralClustering algorithm takes EEG data and transforms that data into moving centroidal information through a series of steps. Throughout the entire process, it was important to have visual confirmation as well as the discrete mathematical output, to verify correctness and to reinforce understanding of the data's nature. The visual tools include plot animation (Figure 3.8), visual time synchronization (Figure 3.23), and plot model overlays for data comparison (Figures 3.9 and 3.10).

Figure 3.23: IERSS performing centroidal animation time-synced with the signal at the top. Notice the blue vertical bar over the signal. The bar's width indicates the size of the time window (3 seconds), and it moves in sync with the moving centroid below it. At the bottom is the feature threshold line as described in section 3.2; it is being formed in sync with centroidal movement.

3.8 The Data Classification Tool

Prior to training the classifiers, there must be a way to classify the data. The classifier tool was built into the scatter plot analysis user interface; this functionality is described in section 3.2.2. The tool is simple: the user selects a feature above the feature threshold line and classifies it as either seizure or non-seizure (see Figures 3.13, 3.14, and 3.15). The user is not limited in how many classifications may be performed. Once classified, the data is saved in simple ASCII-delimited files for use by the training tool. The information in the file includes all the points in the attribute cluster, the feature area, the date/time offset, the sample rate of the signal, the duration of the event, and the name of the originating EEG file. Refer to Figure 3.24 for a look at the entire classification tool UI.

Figure 3.24: Shown here is the classification UI. The user may select any feature above the feature threshold by clicking on it in the bottom window (shown in white). The associated points are highlighted in the scatter plot window for visual confirmation (purple with red centroid). The user may classify the feature by selecting "seizure" or "non-seizure" from the dialog.

3.9 The Training Tool

Training happens in a two-stage fashion; both perceptrons are trained independently.

After selecting the perceptron to train, (area/cDistance vs. pCount/cDistance) or (area vs. arc length),

the user has a choice to include any number of classified data elements in the training set.

After the perceptron and training elements are selected, the user clicks the "Train" button

to invoke the PLA. This process must be done for each of the two planes. If the PLA cannot

converge, the user will be informed. To get around non-convergence, the user can change

the training elements in the set. After training both perceptrons, the result is a training

set for two planes to be used by the auto-identification system. See Figure 3.25 for UI details.
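For reference, a minimal sketch of the perceptron learning algorithm (PLA) invoked by the "Train" button is shown below. This is the textbook update rule, not the IERSS implementation; the non-convergence case described above is modeled here by the max_epochs cutoff, after which the user would be informed and could change the training elements.

```python
import numpy as np

def train_pla(X: np.ndarray, y: np.ndarray, max_epochs: int = 1000):
    """Perceptron Learning Algorithm over 2-D feature vectors.

    X: (n, 2) features, e.g. (area, arc length); y: labels in {-1, +1}
    (+1 = seizure). Returns weights (w0, w1, w2) of the separating line
    w0 + w1*x1 + w2*x2 = 0, or None if the PLA fails to converge on
    this training set (i.e., it is not linearly separable).
    """
    Xb = np.hstack([np.ones((len(X), 1)), X])   # prepend a bias term
    w = np.zeros(Xb.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(Xb, y):
            if np.sign(xi @ w) != yi:           # misclassified point
                w += yi * xi                     # classic PLA update
                errors += 1
        if errors == 0:
            return w                             # converged
    return None                                  # no convergence: adjust the training set
```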

Figure 3.25: The classification training tool allows training on two different planes, and allows the user to select which elements from the training set to train against. Once the plane and elements are selected, clicking the "train" button will invoke the associated PLA algorithm.

3.10 A Classification Example: the Seizure Identification System

The purpose of developing IERSS, and the reason behind this thesis, is of course to support a way in which an automated system can be built to identify epileptiform oscillatory activity from EEG data. To that end, a system was built into IERSS to utilize the trained perceptron-based neural network for applied automated identification of seizures. To test the effectiveness of such a system, 42 GB of EEG data obtained from 4 laboratory rats, each containing between 480 and 648 hours of data, was processed through this system. Seizures for these animals were previously identified and recorded manually by trained clinicians.

The records included dates and time offsets for each identified seizure. IERSS was trained for and run against each rat individually. The results were then compared against each animal’s log.

Figure 3.26: The seizure identification system, like all other IERSS functions, is visual. It keeps a running log, and highlights identified seizures when discovered in the bottom feature threshold window. The window displays the current segment of the EEG file being analyzed.

The identification system is easy to use: point the system to a directory where the trained classification data resides, and also to where the EEG files are stored for a particular rat. Clicking "Go!" starts the process. The system will process all EEG files in the directory and export text logs and visual thumbnails of all identified seizures. The logs and thumbnails help to confirm correctness and to compare results against the manual identification logs. The system captures features that rise above the feature threshold line (detailed in section 3.2.1) and sends the feature cluster to the neural network for classification. If the feature cluster is positively identified as seizure, an entry is made into the log, and a snapshot of the visual structure at that location is saved to disk to aid in user verification later. Refer to Figures 3.26 and 3.27.

Figure 3.27: Here is an example of the thumbnails IERSS produces upon positive identification of seizure.

3.11 Conclusions

Note that despite the good performance and the valuable visual insights offered (an initial approach towards explainability and assistive classifier construction), these results could only be achieved by re-training on each subject, a time-consuming process. Since this was applied to a clinical tool meant to cut the time clinicians spent identifying seizures in multi-day continuous recordings, or observing simplified animations of brain function within various contexts, the work and time involved in training and tweaking was acceptable in order to save many hours of work on the back end; this is not the case with BCI applications. Furthermore, this particular technique of producing hand-crafted attributes only works with temporally sustained ERP attributes that stand out from the signal noise, a situation that is impossible when dealing with EPs due to their extremely low SNR. Note also that a great deal of work and time-consuming analysis was required to ensure we culled the proper features from the signal. In cases where artifact was prominent, a differential filter was applied to the signal in order to reduce it; the culled features were otherwise contaminated. Finally, a clinician would need to select the "best" channel to analyze, since this algorithm did not take multiple channels into account. This system's shortcomings notwithstanding, it provides extremely useful insight, fueling many of the ideas towards improving BCI classifiers covered in this dissertation.

CHAPTER 4

REVIEW OF ERP AND POPULAR BCI CLASSIFIERS

BCI systems are built around one or more sub-types of EP (i.e. ERPs). ERPs, when matched to tasks (i.e. a given context), provide the basic building blocks for a BCI system. Research into the classification of these ERPs finds that certain classifiers perform better for some ERPs than for others. Here we examine ERPs more closely so as to create a clearer picture of their nature insofar as it concerns our own research interests. In addition, we examine various approaches to classifying ERP. The information herein is vital to our research, because it provides some of the foundational elements whereupon we build our work. However, this is all standard ERP/BCI material, and those familiar with the literature can jump ahead to chapter 5.

4.1 Event-Related Potentials (ERPs)

ERPs are measured responses in the brain that are the result of sensory, cognitive, or motor events; they appear in response to some specific stimulus [197]. ERPs are low-frequency components that make up a transient brain response to external stimuli. Various responses propagate throughout different areas of the brain over a period of time, and manifest in stages as the brain processes the stimulus. EEGs can be used to effectively record and measure these ERPs.

Figure 4.1: Various ERP through time with locale. Source: [57]

Because EEGs reflect thousands of simultaneous brain processes (see Appendix B), the brain's response to a stimulus or event may not be readily visible in the EEG recording during a single trial. To cull a response from a signal, it may be necessary to conduct several (perhaps many) trials and then average the results together, forcing random brain activity to average out while the relevant waveform survives for further analysis.
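The averaging effect is easy to demonstrate with synthetic data; everything below (the toy P300-like waveform, the noise level, the trial count) is illustrative, not drawn from any recording:

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 256)                       # one second at 256 Hz
erp = np.exp(-((t - 0.3) ** 2) / 0.002)          # toy P300-like bump at 300 ms

# 100 trials time-locked to the stimulus; per-trial SNR is far below 1,
# so the waveform is invisible in any single trial.
trials = erp + rng.normal(scale=3.0, size=(100, t.size))

average = trials.mean(axis=0)                    # the ERP emerges from the mean
```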

ERP is actually a category of event-related responses. There are several different

ERPs that are used to accomplish specific BCI-related tasks that we will briefly mention here. Some important descriptive characteristics for ERP waveforms (refer to Figure 4.1)

[197, 224] are the following:

• Positive potential: P

• Negative potential: N

• Followed by a number indicating latency in milliseconds (such as in P300)

• May be followed by a number indicating the major peak order. Example: P300 is oftentimes referred to as the P3 wave because it is the third major positive peak in late sensory evoked potential.

• Potentials with latency ≤ 100 ms are exogenous

– influenced by physical attributes of stimuli such as intensity, modality, and rate

• Potentials with latency > 100 ms and < 250 ms are mesogenous

– lie at the interface of purely exogenous and endogenous components (Picton, 1980)
– often referred to as the vertex potential components
– thought to represent the various integrative and inferential processes of perception and early attention
– the three main (independent) mesogenous components are the N1, P2 and N2

• Potentials with latency ≥ 250 ms are endogenous

– psychological or cognitive attributes of stimuli
– nonobligatory responses to stimuli
– vary in amplitude, latency, and scalp distribution

4.1.1 Steady-State Visual-Evoked Responses

Visual evoked potentials (VEPs) represent the processing of visual information in the primary visual cortex and its related pathways. Steady-state visual-evoked potentials

(SSVEPs) are VEPs that correspond to high stimulus rates modulated above 6 Hz. SSVEPs may also be referred to as steady-state visual-evoked responses (SSVERs) [237, 250]. SSVEP was the first type of EP used in BCI systems. There are two important physiological mechanisms behind SSVERs that make them particularly useful to BCI systems [250]:

• The photic driving response: an increase in amplitude at the stimulus frequency results in significant fundamental and second harmonics.

• The central magnification effect: the amplitude of the SSVEP increases significantly as the stimulus is brought closer to the center of vision.

The fundamental idea behind SSVER-based BCI systems is to have each available choice in the system encoded with a different (flashing) frequency (above 6 Hz). As the user moves the target choice into the center of vision, an SSVER is produced with the same frequency, thus allowing the system to decode the associated command.
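A minimal sketch of that decoding step follows, assuming a single-channel recording and a known set of candidate flicker frequencies; the function name and interface are ours, and a practical system would also weigh the harmonics:

```python
import numpy as np

def decode_ssvep(signal: np.ndarray, fs: float, choices_hz: list[float]) -> float:
    """Return the candidate flicker frequency whose fundamental carries
    the most spectral power in the recorded signal."""
    spectrum = np.abs(np.fft.rfft(signal)) ** 2          # power spectrum
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    powers = [spectrum[np.argmin(np.abs(freqs - f))] for f in choices_hz]
    return choices_hz[int(np.argmax(powers))]
```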

4.1.2 Sensorimotor Rhythms [255]

Sensorimotor Rhythms (SMRs) emanate from the sensorimotor cortex region of the brain, which is the area associated with coordinating muscle motion and feedback from that motion. BCIs based on SMR are typically designed to assist with some sort of thought-based movement. These commands might be translated into simple cursor movements on a screen, or transferred to an external robotic device for physical locomotion. SMRs (also referred to as Mu waves) oscillate between 8-12 Hz, in what is commonly known as the alpha range of brain function.

In some instances, the user (subject) of the system must learn to hone the modulation of brain wave patterns so that they are made prominent in the signal, thus allowing for easy

(or easier) decoding by the BCI system. Oftentimes feedback is provided to help the user improve this process.

4.1.3 The P300 Wave

As shown in Figure 4.2, the P300 ERP (also referred to as the P3 wave) component occurs in the parietal lobe, and is maximally recorded from the mid-line centroparietal electrodes (at least Fz, Cz, and Pz) [201]. It is believed that the P300 wave may represent the transfer of information to either consciousness or memory. This may involve cognitive functions such as attention, contextual updating, response modulation, and response resolution [174]. The P300 wave is referred to as a late positive component. While its latency can vary with the difficulty of discriminating the target stimulus from the standard stimulus, a typical peak latency is around 300 milliseconds (ms); however, this can vary between 250 ms and 600 ms. P300 can be elicited from stimuli involving any sensory modality, and occurs only if the subject is actively engaged in the related task of identifying a specific target. Moreover, its amplitude may vary with the improbability of a given target, making the P300 wave a very popular ERP component used in BCI systems [171]. More details are provided in Section 4.2.

4.2 P300 BCI: A Closer Look [171, 174]

In this study, time-locked ERPs like the P300 are of primary focus; therefore, we wish to offer the reader more details about this particular ERP. To better understand how the P300 BCI works, a better understanding of the P300 ERP component is warranted. The P300 BCI was first introduced by Farwell and Donchin in 1988 [66]. It received little attention then, and between 1988 and 2000 there were actually no peer-reviewed papers published on the matter [59]. However, in recent years, P300 has been pushed to the forefront as a major BCI category [68]. The P300 wave itself is also referred to as the P3 component, and it is recorded in the parietal lobe, maximally from the mid-line centroparietal electrodes (at least Fz, Cz, and Pz) [197, 245]. It is believed that the P300 wave may represent the transfer of information to either consciousness or memory. This may involve cognitive functions such as attention, contextual updating, response modulation, and response resolution [174].

The P300 wave is referred to as a late positive component. While its latency can vary with the difficulty of discriminating the target stimulus from the standard stimulus, a typical peak latency is around 300 milliseconds (ms); however, this can vary between users from 250 ms to 600 ms or more. P300 can be elicited from stimuli involving any sensory modality, and occurs only if the subject is actively engaged in the related task of identifying a specific target. Moreover, its amplitude may vary with the improbability of a given target, making the P300 wave a very popular ERP component used in BCI systems. How this endogenous component works is quite interesting.

Figure 4.2: Prominent P300 Peak. © 2010 Source: [5]

The P300 is not elicited in the absence of the subject's explicit attention, making it a desirable component to work with for BCI. Of course, this also implies that attention state and state of mind are important factors when utilizing a P300-based BCI; one may also think of the terms engagement and arousal as being important players here. Oftentimes, the P3 is broken into two distinct components, P3a and P3b. There are essentially two important factors that serve to modulate the P3 wave: (1) unexpectedness/stimulus frequency, and (2) task relevance; these operate independently of each other.

Together, these factors produce varying P3 components. P3a is more prominent during an unexpected or novel stimulus (an external stimulus that the subject's brain has not fully acclimated to); it occurs early in the P3, and tends to have a sharper peak as novelty increases; it is prominent when the user associates the stimulus with a desired task. The P3b component is affected by how relevant the stimulus is to the task at hand, and although relatively prominent, may not possess as steep a slope as P3a. Essentially, it is thought that the P300 works in a two-stage committal process: P3a signifies an event and how novel it might be, while P3b determines how relevant it is to the task at hand [171]. A prominent P3b component indicates that the novel event is likely committed to working memory. Another way of looking at this is that P3a has a high peak when a novel stimulus is introduced, and P3b indicates how much attention was given to that stimulus. Together, P3a and P3b are the essential targets of the P300 BCI system.

P300-based BCI systems typically employ an "active oddball" response technique whereby regularized patterns of stimuli are passed over all possible choices made available by the system. Intermittently, an oddball pattern is applied to each of these choices, and is designed to elicit a prominent P3a component. When this oddball pattern is applied to the actual target choice (the relevant task), the subject's brain produces a prominent P3b wave. Together, P3a and P3b make up the modulation of the P300 wave that occurs approximately 300 ms after the relevant stimulus. This response can be detected by the BCI and decoded into a command that is then executed by the system. One of the most important tasks of a P300 BCI system is to maximize the differences between target and non-target ERPs. Since the P300 wave can be used for a variety of sensory modalities, research continues with P300 BCI systems [68]. Recall our review of the P300 Speller in section 2.3; it works similarly to the basic technique described in section 4.3 below.

4.3 EP Detection

The goal of good EP-based UI design is to solicit the ideal EP waveform.

There is no denying that the BCI [user] interface is a very important component, as it provides the way for a user to interact with the system; however, it is just as important that the EP components elicited from the user are quickly detected within the ERP and accurately classified.

Figure 4.3: P3a and P3b components during three different stimuli runs. Source: [29]

There are two important classification stages involved in decoding EP: (1) the detection of the EP response components (usually binary), and (2) recognizing the intended/associated task.

This is typically achieved by examining the accumulation of one or more ERP components within a certain time window from the initial stimulus evocation [39]. A good BCI system requires a fast and reliable ERP detection scheme with short training periods and instantaneous interaction between the user and the system; detecting EP from raw EEG is challenging because of the non-stationarity within the signal, along with noise from other background EEG activities [121]. To better understand the challenge of detecting ERP from EEG, we shall examine them in detail.

There are a number of natural properties to EEG that make classification difficult.

First, we are dealing with time series (TS) data here, and there are certain characteristics present in time-series data that can make it different from other forms of data. The basic challenges with this data are [116]:

• Sampled TS data tends to be noisy, with high dimensionality.

• In many cases it is unclear whether there is enough information contained in the data to understand the process (this may not be the case for P300 alone, but when combining P300 with other components, this could be true).

• TS data has an explicit dependency on the time variable, making more input from the past necessary, or it may require a memory of past inputs.

• TS can be non-stationary: mean, variance, and frequency tend to change with time.

• Invariance: domains such as vision require that certain features be invariant to translation, rotation, and scale; by comparison, TS may require that features be invariant to translations in time.

Furthermore, these constraints bring up two issues of concern for this research [63]:

• Data representation: How can we properly represent the shape characteristics in a given signal? What invariant properties should the representation satisfy?

• How may any given time-series pair be distinguished or matched? How can perceptually similar objects be recognized, despite not being mathematically identical?

Figure 4.4: EEG screen with high-frequency noise, eyelid blinks, and epileptiform events (black arrow). (© Adriano O. Andrade et al., 2013)

4.4 A (Brief) Review of Traditional Classification Algorithms for BCI

This section is not critical for the reader, provided the reader is familiar with these

algorithms. Chapter 6 provides experimental results and comparisons for the most popular

among these algorithms.

Great strides have been made towards improving EEG classifiers for BCI in the last

ten years. Here we review a handful of the popular classifiers that we will refer to as the

more traditional classifiers. We will also review what’s currently being done for BCI with

deep learning. Keep in mind that many of the classifiers reviewed below rely on EEG

preprocessing, filtering, and/or transformations; our research differs in that we aim to avoid

transformations and hand-crafted features typically fed to EEG classifiers. With respect

to feature extraction, EEG is typically filtered in the frequency/time domains as well as

in the spatial domain [135, 262]. We will begin by examining some of the more popular traditional EEG classifiers.

4.4.1 Linear discriminant analysis (LDA)

LDA is regularly applied to EP BCI for classification. The advantage of LDA is the low computational effort required to use it; additionally, it is simple to use and achieves good results [113]. LDA provides rapid training because it does not require any parameterization [11]. LDA is related to principal component analysis (PCA), but it explicitly attempts to model the difference between multiple classes. Performance with the P300 BCI was compared against step-wise linear discriminant analysis (SWLDA), Bayesian linear discriminant analysis (BLDA), linear support vector machine (LSVM), and Gaussian support vector machine (GSVM), with no statistical difference in overall performance between them [11].

In another paper, LDA (FLD) was juxtaposed with kernel Fisher discriminant (KFD), support vector machine (SVM), and general single-, double-, and triple-layered neural networks (NN). LDA performed better overall, according to the results [154]. While LDA may not work well for BCIs where linear classifiers perform poorly, LDA seems to perform well for both motor imagery and ERP [136]. It should be noted that there are many papers that incorporate specialized component-analysis spin-offs for ERP. For example, independent component analysis (ICA) [201] and constrained independent component analysis (cICA) [120] have been found to work well for denoising and as channel filters, thus providing improved separability.

4.4.2 Support vector machine (SVM)

As with LDA, SVMs also use a linear discriminant hyperplane to differentiate classes;

however, the selected hyperplane maximizes the margins (distance from the nearest training

points) between the classes and the discriminant line [221]. SVMs are known to possess good generalization properties, to be insensitive to over-training, and to handle high dimensionality well; however, these advantages come at a cost, namely slower execution [136]. Notwithstanding, SVMs were used in P300 speller BCI competitions in 2003 and 2005 [154]; however, it is unclear at this time if LDA, in fact, consistently outperforms SVMs (all else being equal).

4.4.3 Random Forests (RF)

Apparently, RF shows improved accuracy over SVM when fewer repetitions are used for stimulus in the user interface, potentially improving the speed at which the user can use the system. Together with user interface improvements, the system provides a 51.87 percent improvement over the traditional P300 speller interface, at around 0.6 words per minute [4].

4.4.4 Dynamic stopping (DS)

In short, DS works by determining the amount of data collection based on a confidence that a particular task (or character in spellers) is the correct target. Several papers coupled it with a data-driven Bayesian approach that worked well when applied to BCI [140, 236].

While this technique is interesting, it does not appear to have improved performance in any meaningful way that we can tell.

4.4.5 Hidden Markov Models (HMMs)

HMMs may hold the ability to increase P300 BCI throughput. A 2008 paper demonstrated promising results towards P300 BCI improvement with HMM [91]. A subsequent paper showed that the use of HMMs garnered better performance and throughput over naive Bayes and dynamic stopping [211].

4.4.6 Convolutional Neural Networks (CNN)

Out of the host of neural networks used, the CNN seems to be a popular choice for P300 BCI research. While CNNs are successfully applied to imagery and vision-related learning [62], CNNs also have potential for successful application in the BCI domain, yielding state-of-the-art performance. Interesting results from one study suggest that template training for tactile-based P300 BCI is possible utilizing a CNN classifier [112], with a supposed 100 percent accuracy. Another paper utilizes multiple classifiers through a CNN to detect the P300 in the time domain. The writers suggest they discovered a multi-classifier solution with a recognition rate of 95.5 percent (training is performed per individual). The network they used included a "special" topology that contained more than one hidden layer [39], leading us to consider perhaps a better approach.

The drawback of many feature learning systems is their complexity and computational

expense, and yet it may not be required that we contend with much of it. For example, some

research suggests that simpler constructs such as single-layer K-means might work better

than multi-layered auto-encoders, RBMs, and Gaussian mixtures for certain datasets. The

key may be tied to using a larger number of hidden nodes with dense feature extraction [51].

4.5 Riemannian-based Classifiers

We do not unpack the application of Riemannian classifiers in this study; however, we

do employ one in our experiments. The research community holds out high hopes for BCI

with Riemannian classifiers [32, 135, 251], so it’s worth mentioning here. However, there is

no evidence that this category of classifiers is offering much beyond its purported potential

at the time of this writing.

4.6 Deep Learning

In the context of BCI, most research interest since 2014 lies with deep learning. Much of this work focuses on discriminative models, with CNNs featured in up to 70% of these papers; however, representative models as well as generative models are of considerable interest as well [262]. Even though our focus is on discriminative models, we do explore how representative models might perform in contrast by examining autoencoders coupled with domain adaptation (i.e. transfer learning).

There are a host of papers presenting various neural network topologies and applications towards improving BCI performance [1, 2, 132, 139, 207, 233, 256]. At the time of this writing most of the research into modern machine learning incorporates the exploration and use of deep networks. Therefore, it is of no surprise that this seems to be the case with

BCI as well. Most of the more recent BCI-related papers have deep learning as a theme.

Compared to some domains (like vision and NLP, for example), BCI has comparatively little deep learning (DL) applied to date. What’s more, deep learning performance has not caught up with other methods used for BCI; it’s suspected this is due to the lack of large training sets often required for good DL performance [135]. For example, Riemannian and

boosting models are purportedly better performers for P300 [93, 251].

DL has become a popular tool in the machine learning domain because of its potential to simplify overwhelmingly complex systems. Using the proper constructs and a properly configured topology for a given system and problem domain, DL is capable of discovering and describing important features in complex, high-dimensional systems through raw data. Essentially, deep learning is able (in many cases) to discover the important aspects of a system while reducing the irrelevant, thereby significantly simplifying (transforming) the system. This simplification then makes it possible to perform effective analysis such as classification and prediction [84]. Deep learning is setting state-of-the-art performance in many major areas of machine learning today [118]; therefore, it makes sense to make DL a significant part of our exploration and application during this research.

As mentioned earlier in this chapter, EEG is a nonlinear, multivariate time series; because we gather data over multiple electrodes (channels) in such a multidimensional system, it is all but impossible to know and measure all of the relevant variables [224]. However, from a practical standpoint, we can overcome this mathematically by monitoring a single variable (i.e. voltage) over time to gain an understanding of the important dynamic aspects of the system. By monitoring more than one variable over time (multiple channels), we can follow the dynamics of interaction of different parts of the system we are investigating [197, 224]. This is easier said than done, as the standard methods for time-series analysis are not applicable (e.g., power analysis, linear orthogonal transformations, parametric linear modeling). Not only do these methods fail to detect critical features, they may falsely suggest random noise [224] as well.

4.7 Deep Transfer Learning

This topic is covered more deeply in Chapter 7. Applied deep transfer learning for

BCI is still in its infancy. At the time of this writing, only the most recent papers on BCI and transfer learning (since 2018) appear to focus on the utilization of deep learning models; this study focuses heavily on this area.

4.8 Conclusion

EEG is complex; it's noisy and highly non-stationary (even between back-to-back trials recorded from a specific user). Most of the classifiers in use today do not generalize across users or tasks. Hand-crafted features tend to decimate potentially important information embedded within EEG. There is plenty of research that points in many directions towards finding optimized classifiers that may perform well in BCI systems; there's also a lot of "noise" in the research itself. There is no clear performer that generalizes across users and applications (tasks). Deep learning holds great potential for offering good performance in BCI systems. We draw this conclusion based on results in other fields of applied machine learning and from the high number of recent papers applying DL to BCI; however, deep learning in BCI suffers from the lack of abundant data. Coupling domain adaptation to deep learning may hold much promise towards the amelioration of BCI systems. It may help our classifiers generalize well despite having access to only small amounts of clean data.

CHAPTER 5

THE DATASET & ANALYSIS

We searched for a dataset that contains many subjects; almost all available datasets for this type of research contain only between 3 and 30 subjects. We eventually found an ideal dataset for our purposes, despite the fact that its original purpose was studying something entirely different. We settled on a dataset from a large study that examined EEG correlates of genetic predisposition to alcoholism (1995). While the dataset is somewhat dated, it is one of the few publicly available EEG datasets of sufficient size and labeling that meets our needs. Several papers can be found that used and/or described the dataset chosen for this research: [21, 52, 97, 134, 204].

The dataset includes 122 subjects in 2 groups: alcoholic predisposition (77 people) and a control group (45 people). It is a balanced dataset in that trials were split roughly 50-50 in a form of the popular go/no-go experiments often used in biological research with animal and human subjects (explained later as match/no-match trials). 64 scalp electrodes (channels) were used with a standard 10/20 extended spatial configuration. Data was recorded at a sample rate of 256 Hz with an applied bandpass filter between 0.02 Hz and 50 Hz. There were 120 trials per subject, where recordings were segmented into 1-second windows composed of 3.9 ms epochs (one sample every 1/256 second). Overall data size amounted to 700 MB and 14,640 unique samples (122 subjects times 120 trials).

Subjects were seated in a reclining chair in a sound-attenuated room. A computer display was placed at a point in the center of the room, 1 meter away from the subject. All subjects had normal vision and were right-handed. Stimuli were composed of 90 images chosen from the 1980 Snodgrass and Vanderwart picture set. All images represented easily identifiable concrete objects against a white background. Images were 5-10 cm in height and 5-10 cm wide.

Figure 5.1: Typical 10/20 extended electrode scalp placement.

To elicit event-related potentials (ERPs), a delayed matching-to-sample task was

used. Two picture stimuli appeared in succession with a 1.6 second inter-stimulus interval.

First, a random image (S1) from the set was displayed for 300 ms. The image that followed

was either a match to the first (S2 Match), or a different image altogether (S2 No Match).

There was a 3.2-second interval between trials. All trials were randomized and balanced; one

half of the trials contained matches, while the other half did not. Each subject’s task was

to decide if the images were matching or not. The subject was asked to press a button in

one hand if the images matched or press a button in the other hand for no match.

A few things to note: after removing bad data (i.e., any trial that had bad data in one or more channels) and omitting the initial image flash for each epoch, we were left with roughly 7,150 trials for experimentation. We also removed the ground and reference channels from the set, leaving 61 channels for each trial. With a sample rate of 256 Hz, each 1-second trial consisted of 15,360 samples.

5.1 Why Alcoholism?

Initially, this dataset might seem irrelevant to the task of studying BCI classifiers.

However, this dataset provided us with certain advantages not commonly found in typical

BCI datasets. First, it's one of the largest, most consistent EP datasets available. Second,

it’s a well-curated collection of experiments consisting of balanced, randomized trials. Third,

this dataset provided us the opportunity to explore two different BCI-related scenarios: (1)

clinical (alcoholic vs control), and (2) assisted task-based BCI (match vs no-match, i.e.

go/no-go). Lastly, we were able to explore whether we could get classifiers to generalize

across clinical groups of individuals with potentially different brain patterns.

5.2 The Analysis

As with any dataset chosen for research and experimentation, it is important to

understand some statistical details so as to gain a better understanding of the information

therein. We started by examining a single trial with all of its channels in both 2D and 3D.

Point zero in time is when an image is flashed on the screen. We cannot easily observe the evoked response as it progresses spatially through a one-second period in time; despite the initial bandpass filter, there clearly remains too much noise in this trial to see

anything (refer to Figure 5.2).

If we opt to become more selective in our observations, we can get a clearer picture

perhaps by separating out the groups of subjects and then taking the mean across all

channels for all subjects in each group. Indeed we can see a difference between the two

groups using this method (refer to Figure 5.3); this suggests that we might be able to

separate out subjects from the control group vs those with a high risk of alcoholism. Still,

there is more to uncover here. 77

We'd like to see whether or not it's possible to separate out the differences in ERP that relate to the match/no-match trials. Also, we'd like to know if there are substantial differences between the groups there as well (refer to Figures 5.4 and 5.5). When we do this we can begin to see even more differences between ERP patterns, not only in match vs no-match, but also between the groups. This leads us to hypothesize that it may indeed be possible not only to differentiate between high-risk (HR) and low-risk (LR) subjects, but also between the tasks related to the subjects' ERPs. Of course, averaging across all subjects and channels presents an overly-simplified temporal picture; however, this is an encouraging start.

Figure 5.2: Single trial, random subject, 64 channels in 2D and 3D (inset images); black line is the mean across all channels.

We notice that in each of these groups of averaged trials there are some interesting patterns about 300 milliseconds into the trial, from the onset of the visual stimulus. This is where the P300 wave appears. So, to start, we can see that the no-match trials invoke a prominent P3 waveform while the match trials consistently invoke a lower P3 response.

Figure 5.3: Mean across all trials, channels and subjects per group.

Figure 5.4: Mean across only no-match trials, all channels and subjects per group.

A no-match elicits a clear novelty response related to intent from the subjects; however, the HR group exhibits a much smaller P3 response. This could mean that certain BCI systems based on P300 will be less effective for the HR group vs. the LR group. On the other hand, this could also mean that we should be able to create classifiers that could delineate subjects based on whether they are at high risk for alcoholism or not.

Figure 5.5: Mean across only match trials, all channels and subjects per group.

So far we've gotten some interesting details by simply observing overly-simplified temporal properties. Could we gather more useful information by visualizing the spatial components of the dataset? By sampling the mean across each channel, but over all no-match trials for each group, we can see some interesting patterns emerge.

It's tempting to conclude that classification should be a simple task for a given classifier; however, we must keep in mind the difference between a single random sample vs the smooth patterns that emerge over the mean of many trials and subjects. To bring us back to reality, we only need to confirm the noisy, non-stationary properties of EEG. Notice that even for a single subject, no matter which subject, the data distribution changes from one trial to the next, even if the trials are of the same type.

Figure 5.6: Dipoles representing the mean across no-match trials for all HR subjects, for each channel. Notice heat map of P300 response that corresponds to the thin green strip around 300ms into the trial vs the same trial for LR subjects in Figure 5.7.

Figure 5.7: Dipoles representing the mean across no-match trials for all LR subjects, for each channel. Notice heat map of P300 response that corresponds to the thin green strip around 300ms into the trial vs the same trial for HR subjects in Figure 5.6. 81

Figure 5.8: Histogram and kernel density estimation for a single subject, 3 randomly selected no-match trials.

CHAPTER 6

EXPERIMENTAL COMPARISON

To gain a better understanding of how well some of the more popular classifiers perform, we decided to conduct some hands-on experiments for an apples-to-apples comparison. We selected several classifiers from the group of what we refer to as traditional BCI classifiers (see section 4.4 for reference). Each classifier is trained on the same dataset using the same proportions as described below. We compare the performance of each classifier against all others in the group; these experiments provide a baseline used to assess our transfer learning models in chapter 7. In addition, these models were chosen as category exemplars to provide a better understanding of each category's overall strengths and weaknesses for various tasks.

6.1 Model Overview

We selected seven models: a simple feed-forward neural network (FFNN), a convolutional neural network (CNN), gradient boosting (GB), random forest (RF), linear discriminant analysis (LDA), logistic regression (LR), and Riemannian minimum distance to mean (MDM).

6.1.1 Feed-forward Neural Network (FFNN)

We built a simple FFNN as a baseline. The goal here (as it was with our CNN model, too) was to use good modern elements while keeping the number of learned parameters small, given the small amount of data we had to work with. Surprisingly, our FFNN performed better than anticipated when contrasted with the other models. These early results indicate that there is potential for obtaining good performance with FFNNs in future studies. For this work we constructed the following FFNN model (a code sketch follows the list):

• Input layer: fully-connected layer of 64 nodes that accepts attributes from across all 64 channels; a linear activation is used.

• Normalization layer (Layer Normalization): applies layer normalization over all inputs for the mini-batch as explained in [20]. Unlike batch normalization and instance normalization, which apply scalar scale and bias for each (entire) channel using an affine option, layer normalization applies per-element scale and bias with an element-wise affine option:

$$y = \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}} \cdot \gamma + \beta \tag{6.1}$$

where $\gamma$ and $\beta$ are learnable affine transform parameters over the dimensions of our input.

• Exponential Linear Unit (ELU) layer: uses a smooth, slow-rising activation function and does not suffer from vanishing or exploding gradients. In addition, ELU leads to shorter training times and higher accuracy in neural networks compared to ReLU and its variants. See the function and its derivative below:

$$f(x) = \begin{cases} \alpha(\exp(x) - 1) & \text{for } x \le 0 \\ x & \text{for } x > 0 \end{cases} \tag{6.2}$$

$$f'(x) = \begin{cases} f(x) + \alpha & \text{for } x \le 0 \\ 1 & \text{for } x > 0 \end{cases} \tag{6.3}$$

• Dropout layer: dropout is oftentimes used to prevent our model from over-fitting; this is a form of regularization.

• Fully-connected layer of 128 nodes; linear activation.

• Sigmoid layer: uses a sigmoid ("s"-shaped) activation to determine whether to classify a sample as "1" for alcoholic or match, or "0" for control group or no-match.
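Below is one possible PyTorch rendering of this layer list. The dropout rate and the final 1-unit projection ahead of the sigmoid are our assumptions, since the list does not state them explicitly:

```python
import torch
import torch.nn as nn

class SimpleFFNN(nn.Module):
    """One reading of the FFNN layer list above; hidden sizes follow the
    text, while the dropout rate and single-logit output are assumed."""

    def __init__(self, n_features: int = 64, dropout: float = 0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 64),  # input layer, linear activation
            nn.LayerNorm(64),           # per-element scale and bias (Eq. 6.1)
            nn.ELU(),                   # smooth activation (Eqs. 6.2-6.3)
            nn.Dropout(dropout),        # regularization against over-fitting
            nn.Linear(64, 128),         # fully-connected layer of 128 nodes
            nn.Linear(128, 1),          # assumed single-logit projection
            nn.Sigmoid(),               # 1 = alcoholic/match, 0 = control/no-match
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

model = SimpleFFNN()
probs = model(torch.randn(8, 64))       # (8, 1) probabilities per trial
```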

6.1.2 Convolutional Neural Network (CNN)

Our convolutional neural network performed slightly better than our FFNN model. CNNs tend to possess greater efficiency than FFNNs because, unlike FFNNs, CNN nodes are sparsely connected. Our CNN consists of the following layers (a code sketch follows the list):

• 3 1D CNN layers. Note: tensor dimensions in convolutional layers must be consistent from layer to layer; for the 1D case, the output width of a layer is W_out = (W_in − F + 2P)/S + 1, where W_in is the input width (sample size), F the filter size, P the padding, and S the stride.

• Normalization layer (Layer Normalization) with 4096 nodes

• Exponential Linear Unit (ELU) layer

• Dropout layer (to prevent over-fitting)

• Linear activation layer with 4096 nodes

• Normalization layer (Layer Normalization) with 1024 nodes

• Exponential Linear Unit (ELU) layer

• Dropout layer (to prevent over-fitting)

• Linear activation layer with 1024 nodes

• Normalization layer (Layer Normalization) with 128 nodes

• Exponential Linear Unit (ELU) layer

• Dropout layer (to prevent over-fitting)

• Linear activation layer with 128 nodes

• Sigmoid layer: uses a sigmoid ("s"-shaped) activation to determine whether to classify a sample as "1" for alcoholic or match, or "0" for control group or no-match

6.1.3 Random Forest (RF)

Random forests consist of a large number of individual decision trees that operate together as an ensemble. Ensemble methods use multiple algorithms together to produce better performance over what could be achieved by the individual algorithms alone. In the case of Random Forests, a large number of uncorrelated trees operating together as a committee will work better than any individual tree in the collection. We chose to include 85

RF in our experiments over SVM due to its ability to perform better with less stimulus (see section 4.4.3). In these experiments, we trained a random forest classifier on our training set using a maximum of 500 individual decision trees, based on experimentation we performed using various numbers of trees.
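A scikit-learn sketch of this configuration follows; the data here is a stand-in, and all settings other than the tree count are library defaults:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Stand-in data: flattened raw-EEG trials of shape (n_trials, channels * samples).
rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 61 * 256)), rng.integers(0, 2, 200)

rf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=0)
rf.fit(X, y)
scores = rf.predict_proba(X[:5])[:, 1]   # per-trial probability of class "1"
```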

6.1.4 Gradient Boosting (GB)

Gradient boosting is a very powerful technique for building predictive (learning) mod-

els. Essentially, boosting turns weak learners into very strong learners. A weak learner on

its own may return results only slightly better than random chance. Boosting works by fil-

tering observations, leaving only those observations that the weak learner can handle while

focusing on developing new weak learners to handle the remaining difficult observations.

Gradient boosting does this by minimizing a certain loss function through gradient descent.

Essentially, this technique gradually minimizes error until it converges on some minimum;

rather than random "trees" as in random forest, gradient boosting finds trees that reduce its loss. These trees refine the values used for "splits", and subsequent tree models are added to improve the residuals in the predictions. This

model performed well in both detection cases; like our neural network models, gb holds

potential worth exploring further in future work.
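The same scikit-learn interface applies here; a hedged sketch follows (the hyperparameters are illustrative, not necessarily the values used in our experiments):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Stand-in data; each row is a flattened raw-EEG trial.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 64)), rng.integers(0, 2, 200)

# Each new tree is fit against the gradient of the loss (the residuals),
# nudging the ensemble toward a loss minimum.
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1)
gb.fit(X, y)
scores = gb.predict_proba(X[:5])[:, 1]
```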

6.1.5 Linear Discriminant Analysis (LDA)

LDA is a linear decision boundary generated by fitting class-conditional densities

to the data and applying Bayes’ rule; it fits a Gaussian density to each class, with the

assumption that all classes share the same covariance matrix.

LDA, used in BCI for its ability to detect sustained ERP patterns in EEG, has found

success in systems built on motor imagery and for assessing cognitive loads [26, 96, 147]. 86

LDA is a moderate general performer [26], and research into its application in BCI systems continues today [73]. There are many different ways LDA is applied to BCI: it is used both to classify data and to reduce the dimensionality of the input by projecting it onto the most discriminative directions. In general, LDA does not generalize well across users for EP-based tasks, such as the P300 speller.

For our experiments, we used LDA without attaching any feature selection or companion algorithms to it; of course, we did this for all the algorithms we reviewed. We used a singular value decomposition solver, often recommended for data with a large number of features. The value for the absolute threshold (used to estimate the rank of the feature matrix) was set at 1.0e−4. Since we used no priors, the class proportions are inferred from the training data.
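This configuration maps directly onto scikit-learn's LDA implementation; the data below is a stand-in:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Stand-in data; in our experiments each row is a flattened raw-EEG trial.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 61 * 256)), rng.integers(0, 2, 200)

# SVD solver (recommended for many features), absolute threshold 1.0e-4,
# and no priors, so class proportions are inferred from the training data.
lda = LinearDiscriminantAnalysis(solver="svd", tol=1e-4, priors=None)
lda.fit(X, y)
scores = lda.decision_function(X[:5])
```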

6.1.6 Logistic Regression (LR)

Much like LDA, linear logistic regression is widely used in discriminating motor imagery patterns in EEG [98, 114, 223], with varying rates of success in other forms of EP [141, 162]. It is included here so that we may examine how well this linear method performs on multivariate, nonlinear, non-stationary signals with no hand-crafted features selected prior to classification. We implemented this algorithm in PyTorch using a single linear layer coupled with a sigmoid activation, the Adam optimizer ("A Method for Stochastic Optimization") with an L2 penalty (constant of 1e−3), and a binary cross-entropy loss function.
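A minimal PyTorch sketch matching this description follows. Whether the 1e−3 constant refers to the learning rate or to the L2 penalty weight is ambiguous, so both are set to it here; the feature width and batch are stand-ins:

```python
import torch
import torch.nn as nn

n_features = 61 * 256                       # flattened raw-EEG trial (assumed)
model = nn.Sequential(nn.Linear(n_features, 1), nn.Sigmoid())
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-3)
loss_fn = nn.BCELoss()                      # binary cross-entropy

X = torch.randn(32, n_features)             # stand-in mini-batch
y = torch.randint(0, 2, (32, 1)).float()
for _ in range(100):                        # minimal training loop
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimizer.step()
```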

6.1.7 Riemannian Minimum Distance To Mean (MDM)

As briefly mentioned in Chapter 4, of recent interest in BCI research is the application of Riemannian-based classifiers; there are many [126, 188, 251, 258]. We chose to work with a Riemannian minimum distance to mean classifier (RMDM or MDM) [22, 188, 251] to see how well it performs against the other classifiers in this list, and also to gain a better understanding of how well it performs under varying contexts on raw EEG features.

Riemannian models belong to a class of metric-based classifiers, where the Riemannian metric defines an N-dimensional manifold; we utilize this metric to measure distances and to map points between localized N-dimensional Euclidean space and N-dimensional Riemannian space (i.e. by projecting between localized spaces). Riemannian models depend on a very simple yet powerful property of symmetric positive definite (SPD) matrices that guarantees a smooth, differentiable manifold for any number of dimensions. MDM is a fully deterministic and parameter-free classifier, so no parameters need to be tuned by cross-validation or other methods that may jeopardize generalization.
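A sketch of such a classifier using the pyriemann library (our use of this particular library here is an assumption, as is the covariance estimator): each trial is reduced to an SPD covariance matrix, and MDM assigns the class whose Riemannian mean is closest.

```python
import numpy as np
from pyriemann.estimation import Covariances
from pyriemann.classification import MDM

# Stand-in epochs shaped (n_trials, n_channels, n_samples).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 61, 256))
y = rng.integers(0, 2, 100)

covs = Covariances(estimator="oas").fit_transform(X)  # one SPD matrix per trial
mdm = MDM(metric="riemann")                           # distance-to-class-mean classifier
mdm.fit(covs, y)
pred = mdm.predict(covs[:5])
```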

6.2 The Experiments

Classifiers were trained to detect two different pieces of information from the EEG recordings:

• Whether a given subject was part of the high-risk (HR) alcoholic group or the low-risk (LR) control group (labeled 1 = alcoholic, 0 = non-alcoholic).

• Whether a given subject trial resulted in a match or no-match scenario (labeled 1 = match, 0 = no-match).

These things would be revealed by training our models on nothing more than the raw EEG data in the dataset.

The dataset was split into three sets: (1) the training set (70%), (2) the validation set (10%), and (3) the test set (20%). Keep in mind that the 70/10/20 splits were based on subjects, so there were no overlapping subjects across the data sets; this means that subjects used in training were not present in the validation set, nor in the test set.
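One way to realize such a subject-wise split (not necessarily our exact procedure) is with grouped splitting, where the subject id is the group key so that a subject's trials never straddle a boundary:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Stand-in metadata: one subject id per trial.
rng = np.random.default_rng(0)
subjects = rng.integers(0, 122, size=7150)

# First carve off 70% of subjects for training...
outer = GroupShuffleSplit(n_splits=1, test_size=0.30, random_state=0)
train_idx, rest_idx = next(outer.split(np.zeros(len(subjects)), groups=subjects))

# ...then split the remaining 30% of subjects into validation (10%) and test (20%).
inner = GroupShuffleSplit(n_splits=1, test_size=2 / 3, random_state=0)
val_rel, test_rel = next(inner.split(np.zeros(len(rest_idx)),
                                     groups=subjects[rest_idx]))
val_idx, test_idx = rest_idx[val_rel], rest_idx[test_rel]
```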

There are several aspects of performance that we measured for each classifier:

• Which task a given classifier performed best at: (1) short elicited response patterns (match vs no-match) or (2) sustained ERP patterns (as in clinical classification or motor imagery).

• Which classifiers performed well with increased channels (which may increase both signal and noise).

• Which classifiers performed well with less training data.

We performed in total 126 experiments under 2 task groups: (1) match vs no-match and (2) high-risk (HR) vs low-risk (LR). There were 9 experiments performed for each of the 7 classifiers for each task group. For each group we performed 3 sets of tests, where each set used an increasing amount of the allocated training data (i.e. an increased number of trials for each subject). Within each test we incrementally increased the number of channels used during the training process. The tests progressed as follows:

• Each test in each experiment started out using 10% of the trials for each subject of the original 70% training set; the training set was then incremented to 50% of the trials, and finally to 100% of the subjects' trials.

• Within each test, we started with 8 channels, then 12 channels, and finally 60 channels. Selected channels correspond to typical BCI configurations (see Figures 6.1, 6.3, and 6.2).

• Performance for each classifier was recorded, and ROC curves were compared relative to all other classifiers within each experiment.

• Thresholds were evaluated to determine how well each classifier would perform under a given context (critical vs non-critical).

Figure 6.1: All channels used in the 60-channel tests are labeled.

Figure 6.2: Red highlighted channels included in the 8-channel tests.

Figure 6.3: Red highlighted channels included in the 12-channel tests.

6.3 Receiver Operating Characteristic (ROC) Curves

Receiver Operating Characteristic (ROC) curves show the trade-off between the true positive rate (TPR) and the false positive rate (FPR) obtained by applying different thresholds to a classifier's output scores. Here, we examine experimental results for both the shape of the curves and the area under the curves (AUC); a minimal computation is sketched after the list below. For additional ROC curves resulting from these experiments, see Section A.1 in Appendix A.

• Each point along the curve corresponds to a different threshold.

• One curve is "better" than another in aggregate if it is further up and to the left, or the area under the curve (AUC) is closer to 1.0.

• Keep in mind that parts of an aggregate-inferior ROC curve can perform "better" than an overall better performer at certain thresholds (for example, when two curves cross).

• A diagonal ROC curve indicates random chance.

Figure 6.4: ROC curves and AUC for match/no-match experimentation

Figure 6.5: ROC curves and AUC for high-risk/low-risk experimentation

6.4 Match Vs No-match Experiment

This experiment allowed us to explore how well our chosen classifiers could perform in

BCI designed for actionable EP. These types of systems could allow you to spell, communi- cate, browse the internet, adjust environmental controls, flip lights on of off, etc.. Typically, they are built around VEPs, or other early to mid-level latent responses such as P300 (see

Chapter 4 for more details).

Given the curves in Figure 6.4, there are a few interesting things to note about our classifiers. Starting with the poor performer in the match/no-match experiment. The

Riemannian (RMDM) model did not perform at all, performing no better than random chance; this was true for any test, regardless of the number of channels or the amount of training data. Clearly, the RMDM classifier does not perform well as-is against raw EEG data where detecting short EP patterns is a requirement. One paper saw similar results on this type of EP (P300), but surmised it was due to applying a prior spatial filter to the data [54]. The authors’ conclusion is highly suspect, since our data has no‘ such filter applied to it.

When the amount of training data was low, LDA showed an interesting trend of increasingly better performance as the number of channels (i.e. feature space) increased.

In fact, LDA outperformed all other classifiers but gradient boosting (GB) when all 60 channels were used with the lowest amount of training data (examine the left-hand column of curves in Figure 6.4. LDA contends well with small amounts of high dimensional data.

However, as the amount of data increases other more robust classifiers catch up and then surpass the performance of LDA. This suggests that LDA does not generalize well across subjects and it doesn’t contend well with increasing noise. As we know, EEG for any subject is non-stationary over time; this means that LDA would require re-training for any 95 user over a short period of time. Theoretically, however, LDA could be used with limited

BCI applications that have access to no prior training data.

Unlike LDA, the logistic regression (LR) model scaled better as the amount of train- ing data (trials) was increased, showing good stability with 8 and 12 channels; however, performance dropped considerably with 60 channels. Performance was as bad, or worse than the RMDM classifier. This suggests that LR cannot scale well with high dimensional feature space, but generalizes across subjects where LDA cannot. Examine the top two rows in Figure 6.4.

Both of our ensemble methods, random forest (RF) and gradient boosting (GB)

showed stable performance across all the tests. Boosting shows a consistent edge in perfor-

mance over random forest as well as a bit more stability, albeit a slight edge in both cases. In

fact, GB was outperformed only by our convolutional neural network (CNN) model when

the full test set and all channels were in play; nevertheless, GB proved stable across all

test. This suggests that ensemble approaches are remain powerful, relevant models to BCI

systems. While more recent research (including the work in this dissertation) remains cen-

tered on deep learning and Riemannian models, ensemble methods seem to hold untapped

potential awaiting further research.

Finally, we come to our deep learners. Both feed-forward neural network (FFNN)

and the convolutional neural network (CNN) were the overall top performers, suggesting

why deep learning is getting a lot of attention. Overall, the CNN model outperforms our

FFNN as the data and number of channels increases. The FFNN shows some potential to perform slightly better on smaller quantities with a lesser feature space. Since there are any number of configurations for either of these models, we use these simplified models to provide us with two ideas: (1) deep learning models show potential for good performance and 96 generalization across subjects, noise (i.e. variance) and feature space with both temporal and spatial properties.

For raw EEG, both ensemble and deep learning models have demonstrated their ability to correctly learn features on their own under various constraints related to feature space and available data. Furthermore, these models show a remarkable ability to deal with noise and variability in EEG data with EP, suggesting that they are good generalizes across users for BCI systems.

6.5 High-Risk Vs Low-Risk Experiment

This second round of tests evaluates how well our models might perform for BCI sys- tems meant for clinical diagnostics or motor imagery. These types of ERPs carry sustained patterns across longer periods of time, and are usually consistently cyclical in nature. These

ERPs can help clinicians identify pathological patterns in a subjects brain, or allow a user to move a prosthetic are, hand, leg or wheelchair. Initially, we had expected results similar to the math/no-match experiment. Surprisingly, the results were very different. We refer you to Figure 6.5.

One of the most unexpected results were with the RMDM model; it is clearly the top performer in these tests. No other classifier came close, with RF coming in at a distant second. While it’s clear the RMDM cannot content with the shorter, noisier EP related tasks performed in the previous experiment, it shows here that RMDM does have an ability to draw out other patterns that none of the other classifiers could learn with these experiments.

Clearly, Riemannian space provides ways to differentiate complex patterns that otherwise are difficult to see or learn. However, there are limits. RMDM remained stable and performs well regardless of how much training data was used. In fact, performance increased when we stepped up from 8 channels to 12, but quickly deteriorated, becoming less stable when 97 incorporating all channels. This demonstrates that there is clearly a limit to the dimensions in a given feature space. This paper seems to back up this assumption [188]. The other classifiers performed poorly here, and showed a fair amount of instability as well.

Both CNN and FFNN performed well against all channels and all subjects, but under

performed with less of either. This is expected, since it’s known that deep models perform

better with more data. The lack of large training sets for BCI presents the greatest challenge

in advancing the domain; therefore, turning towards transfer learning for performance gains

seems prudent at this juncture.

6.6 Conclusions

This chapter provides the foundation we upon which we further build our transfer

learning experiments. We will use these results as a starting point so as to provide an apples-

to-apples comparison (i.e. baseline) with our transfer learning models. Through these

experiments we were able to sort out which classifiers perform well under a given context,

and have established some good, stable performers for various types of BCI systems. CHAPTER 7 TRANSFER LEARNING EXPERIMENTS

A major component of this study is the exploration of classifiers that seem bet- ter suited to generalizing across human electroencephalographic (EEG) data. The ideal classifier would offer a high degree of scalability across large quantities of heterogeneous electroencephalographic data while requiring relatively little data and time to customize for any particular application or user. Here we explore how closely we might approach this idealistic model through transfer learning techniques. Transfer learning falls under a more general category of machine learning referred to as representation learning [84]. Our interest in transfer learning for BCI is to enable improved generalization across EEG and users of BCI systems. Our initial research and experimentation in this area leads us to believe that transfer learning may assist in the development of classifiers that offer superior performance over classifiers currently applied to EP BCI.

7.1 Transfer Learning for BCI

Recall some of the challenges inherent in classifying EEG (covered in-depth in Chapter 4):

EEG has a very poor (low) signal-to-noise ratio (SNR). • Sampled EEG data tends to be noisy, with high dimensionality (multiple channels). • EPs in EEG have an explicit dependency on the time variable (it is time-locked with • the stimulus), making more input from the past necessary. EEG is highly non-stationary: mean, variance, and frequency tend to change over • time. EEG contains biological artifact such as eye blinks, muscle artifact, and head move- • ments. 99

These issues make BCI calibration tedious for the user. To use BCI for the first time an individual must sit through a lengthy ”calibration” session to train the underlying classifiers. For EP BCI, this process can take anywhere from 15 minutes to more than an hour. Typical classifiers such as LDA or PCA are not well-suited to the non-stationary nature of EEG. Even after a system has been trained, the user will oftentimes perform this calibration several times in a single day or between days. The non-stationary properties of EEG are affected by user fatigue, concentration, and normalization (brain normalizes to stimuli). The low SNR may present an even greater challenge to the practical usability of

BCI. Because of extremely low SNR, EP is detected only after averaging across many trials, effectively filtering out the noise and eventually producing the EP for classification. This results in extremely slow performance for the user.

Some effort has gone into applying transfer learning techniques to BCI in recent years [18, 164, 170, 240]. However, applied deep transfer learning for BCI is still in its infancy. At the time of this writing, most recent papers on BCI and transfer learning since

2018 appear to focus on the utilization of deep learning models [64, 165, 218, 239, 264]; we explore some of the more promising deep transfer learning approaches for BCI, using them as a foundation for our own models and experiments.

We focus on approaches from recently published works in transfer learning that demonstrate potential for good generalization for BCI. We feel that deep transfer learn- ing for BCI is gaining more attention because deep learning consistently demonstrates the ability to generalize for noisy data better than other techniques. Several papers published between 2019 and 2020 claim to beat traditional BCI classifiers, or even outperform state of the art methods [133, 217, 229, 264]. What sets our work apart is how we assess and identify good performers for BCI and how these classifiers settle into specific types of BCI 100 systems that can work with EP as well as a broader range of ERP; this work shows that there may not be a one-size-fits-all approach to covering EP and ERP applications with transfer learning.

7.2 Transfer Learning Theoretical Foundations

When discussing transfer learning, we shall refer to tasks and domains. Therefore, given a source domain DS and a (learning) task TS, a target domain DT and its task TT , transfer learning seeks to improve the learning of the predictive target function f ( ) in T · D by using knowledge in D and T , assuming (D = D ),or(T = T ). A domain is T S S S 6 T S 6 T a set D = X,P (X) , where X is the feature space and P (X) is the marginal probability { } distribution. Ataskisaset T = Y,f( ) , where Y is the label space and f( ) is the function { · } · used to predict the corresponding label f(x), where x X [240]. For purposes of this study, ∈ a domain D is a subject or number of subjects whose EEG data is used to train a specific

model, and a task T is mapped to EP or ERP we wish to associate with actionable tasks.

It is necessary to offer the reader a more narrowed explanation with respect to how we

approach transfer learning for this study. Transfer learning is actually a broader field that

encompasses several approaches towards generalization and information transfer; we do not

explore all types of transfer learning here. There are (roughly) two categories of transfer

learning: (1) the transfer of discriminative information and (2) the transfer of stationary

information [240]. We work inside of the envelope of the former, discriminative information

transfer.

Discriminative information transfer is typically applied to domains where sample sizes

are (relatively) small, oftentimes helping to avoid over-fitting while also improving the

quality of classification; these qualities are desired for BCI where datasets are orders of

magnitude smaller when compared with other fields in machine learning such as vision 101 or natural language processing (NLP). Since we are exploring ways to generalize on raw

EEG, our approach avoids applying any direct feature extraction to the data, and this is the main reason for employing discriminative transfer; we forgo techniques that rely solely on stationary transfer, such as feature-representation-transfer approaches like spatial

filtering [46, 89, 126]. Furthermore, we focus on two major approaches that fall (largely) under discriminative information transfer: (1) Instance transfer, and (2) Classifier transfer.

The instance transfer approach transfers prior knowledge by utilizing the weights learned on data from a given source domain. This approach is different from a feature representation approach where discriminative information is transferred via some sort of applied feature representation transformation from the source domain [188,240,258]. While we avoid applying direct feature extraction to the data, working within those confines we also try to avoid narrowly descriptive transfer learning categories. For example, domain adaptation and transfer learning are oftentimes used as synonymous terms or meaning; however, there is a distinction - domain adaptation being a subcategory within transfer learning.

A widely accepted distinction is described thusly: domain adaptation is typically given a set of identical labels (Y = Y ) with differing marginal distribution P (X) = P (X), S D S 6 T whereas, in the general sense, transfer learning is given a set of differing labels (Y = Y ) S 6 D with related marginal distribution P (X) P (X) [170]. Note that this distinction is con- S ∼ T tested and hardly agreed upon. However, this dissertation makes no distinction between the

two terms - unless we feel it’s important to differentiate, we shall use transfer learning when discussing techniques designed to transfer prior knowledge from one domain to another. 102

7.3 Methodology

We use the same dataset for these experiments as we did in Chapter 6; review Chapter

5 for details about the dataset. We experiment with adapting some of the deep learning- oriented tasks from recent studies for our own purposes; this approach allows us to quickly analyze and identify techniques that are the most performant. We evaluate results from a number of experiments:

Pre-train then fine-tune (w/ and w/o weight-freezing; same task) • Pre-train then fine-tune (cross-task) • Pre-train w/ auto-encoder then fine-tune • Pre-train w/ Siamese loss then fine-tune •

Pre-training consists of the initial training our base CNN model and all of its layers.

Fine-tuning is where the transfer learning happens. For the base model, this entails either

freezing weights in the convolutional layers or not. The sections that follow elaborate on

each experiment listed above.

Our subject data is split into typical partitions: 70% training set, 20% test set, and

10% validation set. This partitioning is used for our original experiments, holding the

validation set aside to validate model hyperparameters; however, we partitioned the data

differently for our transfer learning experiments. When training our baseline CNN, we

combined the training and validation sets into a single training set (using roughly 80% of

the subject trials). For our transfer learning experiments, we used an increasing sequence

of proportions of the remaining data from the transfer learning partition (i.e. the remaining

20% of subjects) for three rounds of fine tuning (F1 through F3), leaving the remaining

data (T) for validation. 103

The sequence starts with using 25% (F1) of the entire tuning partition (F1, F2, F3, and T), then 50% (F1+F2), and finally 75% (F1+F2+F3); T (the last 25%) was consistently

withheld for evaluation. It should be clearly noted here that unlike the benchmark partition,

the transfer learning partition (specifically T1, T2, and T3) did include overlapping subjects

for fine tuning, but with no overlapping subject trials. Refer to Figure 7.1 for data partition

details.

In addition, we segmented experiments by incrementing the training data and number

of channels used in the baseline model in the same way we had done for the experimental

comparisons in Chapter 6. In total we performed 216 experimental tests: 54 cross-task +

81 match/no-match + 81 HR/LR.

Figure 7.1: Data partitions used for our original benchmarking (middle row) and transfer learning exper- iments (bottom row). F1, F2, and F3 (up to %75 of the remaining data) are sections of data used for fine tuning, leaving section T (25%) for testing. 104

7.3.1 The Baseline (CNN) Model

We use the CNN model constructed for the experiments in Chapter 6 as our baseline.

The CNN consists of 3 1-dimensional convolutional layers where the results from the convo- lutions are pooled and then sent to an exponential linear unit (ELU) layer, which provides the input to a sequence of fully-connected layers. The layers for the network are configured thusly: Part I: (conv1): Conv1d(64, 32, kernel size=(3,), stride=(1,), padding=(1,)) (conv2): Conv1d(64, 32, kernel size=(7,), stride=(1,), padding=(3,)) (conv3): Conv1d(64, 32, kernel size=(11,), stride=(1,), padding=(5,)) (mp1): MaxPool1d(kernel size=2, stride=2, padding=0, dilation=1, ceil mode=False) Part II: Sequential( (0): LayerNorm((4096,), eps=1e-05, elementwise affine=True) (1): ELU(alpha=1.0) (2): Dropout(p=0.5, inplace=False) (3): Linear(in features=4096, out features=1024, bias=True) (4): LayerNorm((1024,), eps=1e-05, elementwise affine=True) (5): ELU(alpha=1.0) (6): Linear(in features=1024, out features=128, bias=True) (7): LayerNorm((128,), eps=1e-05, elementwise affine=True) (8): ELU(alpha=1.0) (9): Linear(in features=128, out features=1, bias=True) (10): Sigmoid()) See Figures 7.2 and 7.3 for graph and details. The network was trained over 7 epochs

(the point at which repeated testing showed stabilized loss), each containing a batch size of 128 trials over approximately 3,700+ trials. We use cross entropy loss during back propagation, which is the negative log-likelihood over the entire data set, presuming some probabilistic model on label noise:

M E = y , log(p , ), (7.1) entropy − o c o c Xc=1

Here, M is the number of possible classes and p is the predicted probability that the

observation o is class c. In our case, we only need to distinguish between two classes: EP

and not EP. Presuming a Bernoulli model on label noise, this is equivalent to the binary 105 cross entropy loss function:binary cross entropy:

E = (y log(p) + (1 y)log(1 p)) (7.2) entropy − − −

Figure 7.2: Part 1 of our baseline ConvNet: (conv1): Conv1d(64, 32, kernel size=(3,), stride=(1,), padding=(1,)) (conv2): Conv1d(64, 32, kernel size=(7,), stride=(1,), padding=(3,)) (conv3): Conv1d(64, 32, kernel size=(11,), stride=(1,), padding=(5,)) (mp1): MaxPool1d(kernel size=2, stride=2, padding=0, dilation=1, ceil mode=False) 106

Figure 7.3: Part 2 of our baseline ConvNet continues with: Sequential( (0): LayerNorm((4096,), eps=1e-05, elementwise affine=True) (1): ELU(alpha=1.0) (2): Dropout(p=0.5, inplace=False) (3): Linear(in features=4096, out features=1024, bias=True) (4): LayerNorm((1024,), eps=1e-05, element- wise affine=True) (5): ELU(alpha=1.0) (6): Linear(in features=1024, out features=128, bias=True) (7): LayerNorm((128,), eps=1e-05, elementwise affine=True) (8): ELU(alpha=1.0) (9): Linear(in features=128, out features=1, bias=True) (10): Sigmoid() )

7.4 The Experiments

7.4.1 Pre-train, Fine-tune (w/ and w/o weight-freezing; same task)

Simple fine tuning without weight freezing begins with the pre-trained baseline model with a set of channels and proportion of the initial training set. As explained in the previous section, (regardless of the size of the initial training data set) our fine tuning method uses 107 an increasing sequence of proportions of the remaining data from the transfer learning partition, sections F1, F2, and F3. Section T is used for evaluation in all 3 rounds. See

Figure 7.1 for partition details.

The sequence starts with using section F1, then combines sections F1 and F2 for the second round, with the final round combining sections F1, F2, and F3. Sections F1, F2, and F3 include trials from overlapping subjects; however, the pre-training dataset utilized entirely disjointed subjects (not present in test) – this is to simulate a realistic use-case where during per-subject training, a base representation learned from other subjects can be used as a prior for fine-tuning a subject-specific classifier.

During fine-tuning, the base learning rate was adjusted to 1e-4 (from 1e-3 used for

pre-training) to ensure that pre-training representation knowledge was not “over-written”

and to ensure convergence. We tried two different approaches: one with no weight freezing

and another with weight-freezing on all parameters except for the last two dense layers (i.e.,

the “frozen” network parameters were not updated during fine-tuning).

7.4.2 Pre-train then fine-tune (cross-task)

Our approach to the cross-task experiments is essentially the same as the same task

(match/no-match EP) approach, except for the labels. For cross-task experiments, we

pre-train our classifier using labels from a different task (alcoholic/nonalcoholic) and use

match/no-match labels for fine tuning and evaluation.

7.4.3 Pre-train w/ auto-encoder then fine-tune

This set of experiments attempts to pre-train using an unsupervised technique, then

performs transfer learning in the standard supervised manner. Autoencoders are neural

networks designed to reconstruct the original signal from a learnt “code-space” which no- 108 tionally learns salient features from the data. This form of machine learning is referred to as representation learning. Essentially, we are looking for useful intel from the auto-encoder during training and fine tuning.

Autoencoders reduce the dimensions of the original data based on the learned at- tributes. These learned attributes serve as a representation of the original data. In our case, the autoencoder distills each subject trial down to some representative form. That representative form can then be used reconstruct an approximation of the original data, giving the network the ability to classify the data. An autoencoder works from 3 main parts: (1) the encoder, (2) the resultant data representation (output from the encoder), and (3) the decoder which receives the representative code as input and reconstructs an approximation of the original.

Both encoder and decoder are nothing more than fully-connected feed-forward (deep) neural networks. It is important to note that the representational output from the encoder is oftentimes referred to as the code, and it can be thought of as a compact summation of the original data it represents; it is essentially a feature vector of some arbitrary dimensionality, albeit considerably reduced from the input of the encoder. See Figure 7.4 for a graphical representation of a typical autoencoder.

For the encoder, we used all layers from our pre-trained CNN up to the penultimate layer. The decoder consists of several fully-connected layers with layer normalization as a regularizer on all but the final fully-connected layer. Note that we could have used transpose convolutions to construct a convolutional decoder and potentially improve performance.

The decoder is to be used only during pre-training. At transfer time, the decoder is discarded and the last two fully connected layers, initialized with Xavier weights are attached to the network. Xavier weights help avoid exploding or vanishing gradients by keeping the gradient 109 norms closer to 1 by setting a layer’s weights to values chosen from a random uniform distribution bounded between

√6 , (7.3) ± √ni + ni+1

where ni is the number of incoming network connections to the layer, and ni+1 is the number

of outgoing network connections from that layer. The standard network architecture is then

used at inference [82].

During pre-training of the auto-encoder, Mean Squared Error loss was used. For each

sample, given the ith element (n elements total where each element is the voltage at a

specific window in time for a specific channel) yi, and estimate of that element from the

autoencoder,y ˆi, this loss is given as follows. The net loss over each minibatch is the sum of these losses per sample; see equation 7.4).

n 1 MSE = (y y˜ )2 (7.4) n i − i Xi=1

7.4.4 Pre-train w/ Siamese loss then fine-tune

For these experiments, we pre-trained the CNN using a Siamese loss. Siamese net- works are similar to metric learning insofar as they aim to optimize a deep network’s rep- resentation to place samples from the same class close to each other in projected space and samples from differing classes far away. In our case, we’re trying to get the network to maximize distances between match and non-match signals and minimize distances between match/match. It is important to note here that we are not employing anything that ap- proaches the formal definition of a metric space. Formal metric spaces are far more intricate 110

Figure 7.4: A typical autoencoder ( c towardsdatascience.com). Input first passes through the encoder to produce the representative code. The decoder, which has a similar “mirrored” structure as the encoder, reproduces the original input using only the code as input. and complex, and it is beyond the scope of this study to elaborate on the topic; here we are using metric learning in the descriptive sense.

A Siamese network requires that there be two identical subnet architectures as well as shared weights between them. The typical output of a Siamese network is binary; indicating whether or not inputs belong to the same class. When training a Siamese network, training data are split into two equal-size batches during each epoch (this means that a batch size of

128 trials would be split into two equal batches of 64 trials). One batch is sent to the first subnetwork while the other is sent to the second subnetwork. The outputs of each network are typically in vector form. In our case there are two vectors, F1 and F2, corresponding to the penultimate layer of our stock CNN. We take the Euclidean distance between these vectors and then pass that through a sigmoid. Loss is evaluated in terms of binary cross entropy on the label. During fine-tuning, we use the shared weight network representation to initialize all of the weights on our model except for a final hidden layer which is initialized using an Xavier weight initialization. 111

Note that the forward pass of the baseline network is replaced with a Siamese pass that essentially utilizes only a subset of the fully connect layers in the baseline CNN. The output from the two subnetworks are two vectors, F1 and F2. The Euclidean distance between the two vectors is calculated (see Figures 7.5 and 7.6) and then passed through a sigmoid activation layer. The loss from this sigmoid activation is computed using binary cross entropy. See Figure 7.6. Note that when we apply fine-tuning to this model, it is applied in the same manor as it is for the baseline model.

n D = v F 1 F 2 2 u { i − i} uXi=1 t

Figure 7.5: D is defined as the Euclidean distance between all outputs (F 1 and F 2) of the Siamese network.

Figure 7.6: A typical Siamese network construct. The two identical subnetworks are defined as a series of fully connected layers. Both networks accept input from our convolutional layers. The outputs from the Siamese subnetworks are feature vectors F1 and F2, which are further reduced feature representations.

7.5 The Results

We only show relevant ROC curve results here as it’s not feasible to display all 72

related charts in this chapter. That stated, we discuss some of the poor performers as well.

All of our evaluations are in relation to the baseline training for the respective experimental

group (match/no-match, HR/LR, and cross-task). 112

Figure 7.7: ROC curves and AUC for match/no-match transfer baseline.

7.5.1 Match vs No-match

This set of experiments is compared against the baseline CNN in Figure 7.7. Recall that this model is the same CNN used in Chapter 6 experiments, trained on fewer epochs,

7 as opposed to 200. As can be seen from the ROC curves, the baseline starts out as a fairly poor performer and somewhat unstable across training data, channel, and tuning proportions. The best performers were the simpler pretrained, fine-tuned CNN models.

Performance difference between weight freezing and no-freeze is inconclusive. Depending how the training data was randomized, we could see a slight performance gain using either 113

Figure 7.8: ROC curves and AUC for match/no-match pre-trained, fine-tuned, and with weight freezing. method. This is most likely due to the lack of training data; we require more training data to determine performance related to weight freezing (examine results in Figure 7.8). These results represent a 16% AUC improvement over the baseline, and a marginal improvement of 6% over the results of the original experiments; however, with a much improved ROC curve, indicating improved performance and stability over every point (i.e. threshold) along the curve.

A word of caution here. Choosing the “best” curves from Figure 7.8 based on AUC alone would not be advisable; note that some of the curves cross. In instances where ROC curves cross, you must consider the context of the classifier and which matters more to a 114

Figure 7.9: ROC curves and AUC for HR/LR baseline. given BCI system, a higher TPR or a lower FPR [88]. For example, a BCI system that provides mobility based on P300 would be safer using a threshold that results in a lower

FPR. Despite the slower performance of the system, it would be less likely to interpret an unintended move. On the other hand, a speller with an auto-correction feature could provide fast and reasonably accurate performance with a higher FPR, because false positives may easily and quickly be compensated for by the auto-correction feature (assuming a good

TPR). 115

Figure 7.10: ROC curves and AUC for HR/LR with Siamese network, no weight freeze.

The poorest performers here were the autoencoders with weight freeze. Still, the autoencoder demonstrated more stability and overall better curves over the baseline (albeit it slight). See Appendix A for details.

7.5.2 High-risk vs Low-risk

The curves for Siamese network without weight freezing show substantial performance over the match/no-match tasks, and show a moderate overall gain over the baseline. Sur- 116 prisingly, even the baseline performed strongly here (see Figure 7.9), outperforming the original CNN model in the previous experiments, as well as the RMDM model. What’s more, the worst performer for this category was the strongest performer for the match/no- match experiments: the pretrained, fine-tuned CNN model, performing no better than or below random chance. Examine Figure 7.10 for details.

7.5.3 Cross-task Results

All cross-task experiments were poor performers, coming in far below the baseline, slightly better than random chance, and unstable. This was true regardless of training data proportion or number of channels used.

7.6 Conclusions

Transfer learning can substantially augment performance and reduce training data requirements. For example, comparing our baseline performance in Figure 7.7 with our binary transfer results in Figure 7.8 shows that even when using 1/3 of the transfer dataset to pre-train, our model substantially outperforms the baseline trained on the full transfer set by nearly 10% in terms of area under the ROC curve. For the HR/LR results there are performance gains across all training sizes.

These experiments brought to light several important factors. The most prominent evidence here is that there is no single classifier in the bunch that performs well across tasks.

For short latent EP, our fine-tuned binary CNN model outperforms every other method we explored in this work, including our selected traditional BCI classifiers. For long, sustained

ERP patterns our Siamese approach outperformed all others by a large margin. For both of these classifiers weight freezing is inconclusive for less data; however, when the amount of data increases, evidence suggests that releasing this constraint may allow for continued 117 increase in performance. The performance trend continued upward as we increased the amount of data used for both training and fine-tuning, until we ran out of data. Since our dataset is extremely small when compared to datasets used in other domains such as vision or NLP, this evidence suggests that increasing our dataset will yield improved performance in both of these top performers. This is something to explore with larger datasets we may collect in the near future.

From these experiments we can see that deep transfer learning provides good, stable performance across dimensions and data size over our traditional BCI classifiers. These classifiers generalized well across subjects within a given context (i.e. task), but did not provide any performance in cross-task scenarios. Regardless, judging from this evidence, it may be possible to couple multiple classifiers in a single BCI engine to general functionality out of the box for a multitude of BCI-related purposes, simplifying the construction and configuration of commercial systems in the future.

Weight freezing helps (on average), but not much and not consistently. We suspect that generally this constraint isn’t helpful. We require more data and experimentation to make any conclusions.

Unsupervised pre-training using an Auto-Encoder as well cross-task pre-training (sur- prisingly) reduce performance below the baseline model. This suggests that for pre-training to be useful, the tasks (labels) must be similar in addition to the data modality. This does not necessarily preclude the possibility of leveraging data across tasks or in an unsuper- vised manner when learning from EEG, but it does suggest that it is respectively harder to do so in comparison to other techniques with which we were able to achieve gains (e.g., pre-training a CNN or utilizing a Siamese network). CHAPTER 8 INTERPRETABILITY

To better understand what our models are learning and why they make certain pre-

dictions, we shall explore a handful of techniques that apply particularly well to the mod-

els we’ve chosen to experiment with for this research. Since there isn’t a one-size-fits-all

method towards interpretability, we present evidence for our chosen approach for BCI EEG interpretability: integrated gradients, SHAP, and LIME. Before unpacking some of these methods we shall further explore what interpretability is, why it has become so important in the field of machine learning, and why it’s important here.

In recent years interest has grown around the question as to how we might determine whether certain ML models can be trusted to function properly, once they’re deployed into the real world. Specifically, can we know how our models conjure their predictions, and can we determine whether or not those outputs are trustworthy [60,156,185]? Furthermore, is it possible to ascertain other qualities about our models, such as their level of safety, fairness, or reliability? Unlike metrics commonly used to analyze performance (accuracy, sensitivity, specificity and efficiency), the safety and trustworthiness of our models are rarely precisely quantified; however, these seemingly ancillary objectives may become very important in a given application and setting [115, 222, 235]. For example, BCI holds much potential for use as a tool in clinical settings. Such applications may aid clinicians and patients in the diagnosis and treatment of addiction disorders, learning disabilities, depression, dementia, epilepsy and more [6, 24, 41, 153, 157, 167, 190, 219]. Under these circumstances, we must ensure the highest trust and reliability in our models. 119

How might we train our models towards these objectives while articulating the results in some meaningful and quantifiable way, an explainable way? Here, we look towards a field

of study that explores methods and metrics designed to make our models more transparent,

or interpretable. These interests particularly apply to systems of deep learning and ensemble methods because of their black-box nature, their inner workings hidden from users and sometimes even from the designers of the models [14, 27, 60, 81]. It should be noted that prior to mainstream use of deep learning models, interpretability was heavily explored for application mainly to fuzzy models and ensembles [12, 13, 76, 99, 100, 119, 178, 265].

8.1 AI’s Black Box

During training, ML models learn relationships between inputs and labels (i.e. weights); afterwards, these models cannot know why a certain input should receive one label vs. an- other. This is particularly true in the context of deep learning. Deep neural networks are oft referred to as black boxes; you put something in, and you get something out. How (let alone why) these predictions were made is anyone’s guess. To deem the predictions ”passable”, one only requires some metric-based score, the result of some analysis over predictions when coupled with ground truth. This process, though quite common, oftentimes leads to less than desirable performance in the real world. Furthermore, the inability to determine ex- actly why a given model performed poorly on a given instance is not easily knowable. This is the very reason for which interpretability methods are explored and developed, which suggests that predictions, coupled with calculated metrics, do not convey the entire story behind our models, nor about the data we feed to them. 120

8.2 What is Interpretability?

Interpretability yields qualitative information about our ML models; it provides a representation (i.e. explanation) of what’s happening inside our black box. It is a means whereby we may build up trust in our models by providing intuitive ways for humans

(i.e. subject matter experts) to examine the basis for predictions coming from our algo-

rithms [131,186]. To be clear, there is no mathematical definition available for interpretabil-

ity [36]. Additionally, there is no specific approach towards creating interpretable informa-

tion for a given ML model or dataset; it’s highly contextual in nature [36, 60, 131, 191].

There are efforts underway to formalize interpretability models and methods. In fact, there

are some proposed methods, categories, and approaches to interpretability, given a set of

circumstances, that may serve as rough guidelines [60]; however, there is no consensus on

anything formal at the time of this writing.

There are essentially three categories of interpretability models one may choose to

apply: pre-model, in-mode (intrinsic), and post-model (or post hoc) [36]. Pre-model is

actually an extension to data analysis, and this stage oftentimes provides us with the initial

insights we use to construct our ML models. In-model methods are built into ML models

where the model itself is interpretable [36,191]; we do not explore any models or methods in

this category. Post hoc approaches are applied to the output from models that are consid-

ered to be black boxes or too complex for a human to understand. Post hoc interpretability

is also referred to as model agnostic, as it doesn’t typically rely on specific processes going

on inside of the original model.

We review several interpretability techniques in this chapter: Local Interpretable

Model-Agnostic Explanations (LIME) [185], Kernel SHapley Additive exPlanations (Ker- nelSHAP) [137], and Integrated Gradients (IG) [215]. 121

8.2.1 Explainability

The key to interpretability is to offer intuitive explanations about a complex ML

model. These explanations should contain simplified and insightful information into a

complex model’s predictions. Interpretability and explainability are oftentimes used in-

terchangeably; however, there is a subtle differentiation one should keep in mind when

formally discussing interpretability for ML. In the context of interpretability, the act of

interpreting a complex model is essentially useless if the interpretation is no more useful

than its original output.

A good interpretability method will yield a simplified ”explainer” model (i.e. an

analog) that may be used to gain insight into what the more complex model is doing. Good

explainers are not always easy to construct because they will not (cannot) have perfect

fidelity; otherwise, explainers would equal the original complex output of our models [191].

This means that explainers may be inaccurate representations of a black box model. Careful

consideration must go into choosing an interpretable approach as well as methods used to

explain the interpretations.

Our research explores visual representations over our explainer models that we may

use to determine which features in the EEG our models are learning during training. Our

explainers allow us to better understand our models and their performance differences. This

insight allows us to determine which models have the potential to perform better in the real

world.

8.3 Global vs. Local Interpretability

The notion of global interpretability does exist; however, it is rarely practical in

application. We can imagine that global interpretability would require knowledge about 122 the distribution of a model’s output based on all its features; this essentially requires that we already have an answer for how the trained model conjures its predictions. For this to be possible on any practical level the model would have to be extremely simple to begin with.

For all practical purposes a model like this could be considered self explanatory [36,131,156].

The basic approach for all black-box approaches towards interpretability involves cre- ating localized explainers, i.e. explaining a single prediction. Explaining a single prediction is in essence approximating a relatively small region of interest in our model and produc- ing a relatively simplistic analog to it. This surrogate model is considered a reasonably good, interpretable approximation [192]. Interestingly, it is suggested that local explana- tions can, in fact, be more accurate than global interpretations, because local predictions might only depend on simple, linear relationships with features rather than having more complex dependencies [36, 156].

We can approximate something that approaches global interpretability by aggregat- ing and joining many local interpretations, treating them as if they represent an accurate sampling of the entire dataset [131,156], which is why we apply the technique here. We are assuming local high fidelity, allowing us to build a global approximation piece-wise using locally linear models. In fact, for our purposes, we rely on a common linear class among the interpretability techniques used in our analysis, additive feature attribution.

8.4 Additive Feature Attribution

Of course, the perfect explanation for a given model would be the model itself, pro- vided it was simple enough; this would result in an explainer with perfect fidelity. With the exception of simple linear models, or models with a very small feature space (perhaps 4 or less dimensions), explainers for complex models (ensemble methods or deep networks) must use simplified models that approximate the original model. The authors of [137] provided 123 evidence that many popular local interpretability methods, including the LIME, SHAP, and

IG methods used in this study, employ the same explanation model: the additive feature attribution (AFA) method. AFA has three main properties which we review, local accuracy,

missingness, and consistency.

Let f be the (complex) model we will interpret and g be the explainer model. f(x)

is a prediction based on a single sample input x. Our explainers accept simplified input x′

that map to the original input through the mapping function x = hx. Local methods will

(reasonably) ensure that g(z′) f(h (z′)) whenever z′ x′. Keep in mind that because h ≈ x ≈ x ′ ′ is specific to the current input x, hx(x )= x, despite x containing less information than x.

n ′ ′ let g(z )= φ0 + φizi, (8.1) Xi=1 where z′ 0, 1 n, n is the number of simplified input features, and φ R. These ∈ { } i ∈

models all attribute an effect φi to every feature, and summing these effects approximates

the output of the original model prediction f(x). In short, the AFA method is simply a

linear regression over a binary mask where deviations are guaranteed to be linear for local

instances (think locality). This is also referred to as local accuracy.

n ′ ′ f(x)= g(x )= φ0 + φixi, (8.2) Xi=1 In addition to this property, AFA method brings out two more ideal properties among

the group of models that employ it; we may use these properties to compare various AFA

explainer models against (with only SHAP possessing all three properties).

Property 2 is referred to as missingness; it constrains certain features to have no

attributed impact where x′=0.

x′ =0= φ = 0 (8.3) ⇒ i 124

Property 3 is consistency; the idea behind it states that an input’s attribution should not decrease if a model changes so that some simplified input’s contribution increases or stays the same regardless of other inputs, that input’s attribution should not decrease

[137, 156].

Let f(z′)=f(h (z′)) and z j′ indicate that z′ = 0 for any two models fandf ′ that x \ j satisfy:

f ′ (z′) f ′ (z′ ) f (z′) f (z′ ) (8.4) x − x \j ≥ x − x \j

for all input z′ 0, 1 M , then: φ (f ′,x) φ (f,x) ∈ { } j ≥ j

8.5 Shapley Values

Shapley values are computed according to a concept known as cooperative gaming

theory, where the each player’s contribution during a game is determined for the payout

(outcome) of that game [244]; this is also referred to as game theory attribution. Shapley

values are simply the average of the marginal payout for each player (i.e. each feature value)

in a given game (i.e. prediction). Put another way, the Shapley value of a feature value is

the contribution of that feature to the model’s prediction, weighted and summed over all

possible combinations of feature values. This method is model agnostic, and will work for

both linear and non linear models. Let us observe the following [137, 156]:

S !( v S 1)! ϕ (val)= | | | |−| |− (val(S v ) val(S)) (8.5) j v ! ∪ { j} − S⊆{v1,vX2,...,vn}\vj | |

where S is a subset of the features in the model, v is the vector of feature values of the

instance to be explained and v is the number of features and ! is factorial. val (S) is the | | v prediction for feature values in set S that are marginalized over the features that are not 125 included in set S:

∧ ∧ valv(S)= f (v1,v2,...,vn)dn EX (f (V )), (8.6) Z v/∈S −

∧ Where f (V ) is the model prediction over all features, EX is the mean effect estimate function. Note that we’re required to perform multiple integrations for each feature not in

S. Consider this toy example of a model that works with 4 features: v1,v2,v3, and v4. Let us evaluate the prediction for the coalition S consisting of values v1 and v3:

∧ ∧ valv(S)= valv( v1,v3 )= f (v1,V2,v3,V4)dnV2V 4 EX (f (V )) (8.7) { } ZR ZR −

It now becomes easier to imagine the difficulties inherent in calculating all possible coali- tions of feature values for an entire feature set with more than a few values; the number of possible coalitions increases exponentially with each feature. Fortunately, there are agnostic approaches to approximating the Shapley value that allow for more practical and efficient implementations: Shapley sampling value [213], KernelSHAP [137], and Monte-Carlo sam- pling [156].

8.6 Kernel SHapley Additive exPlanations

While there are several model-specific approaches using SHapley Additive exPlana- tions (SHAP) that hone the method for more efficiency with certain models, we explored only the agnostic approach KernelSHAP here. Consistent with 8.1, πx,L, and Ω are con- sistent with properties 1-3:

Ω(g) = 0,

(M 1) π (z′)= − , x (Mchoose z′ ) z′(M z′ ) (8.8) | | | −| | ′ ′ 2 ′ L(f, g, π ′ )= [f(h (z )) g(z )] π ′ (z ), x x − x zX′∈Z 126 where M is the maximum coalition size, z′ is the number of non-zero elements in z’, Ω(g) | |

is the complexity of the explanation, πx is the SHAP kernel function (which amounts to a

proximity function between the original instance x and z′), L is the loss function optimized

over the features in our explainer model g, f as the original prediction function, and Z as the training data.

The SHAP method, detailed in this foundational paper [137], is an improved approach over using Shapley values alone to produce an explainer. SHAP creates an optimization problem from the Shapley values method and uses an approximating linear kernel func- tion KernelSHAP to determine how much each feature value contributes to the overall prediction of a given data sample; this approach relates SHAP to other local models such as LIME. A major contribution of the SHAP method is that the resultant Shapley value explainer consists of a(n) linear additive feature attribution method. Unlike the Shapley values approach, SHAP explainers result in a sparse vector.

8.7 Integrated Gradients (IG)

Integrated Gradients (IG) is relatively new approach relative to LIME or SHAP; this attribution method requires choosing a baseline input hyperparameter. A typical baseline for image classification interpretations is an all-black pixel mask. We followed suite here by using a matrix consisting of all ZEROs in addition to a matrix of randomized noise. With that said, there is very little research on the impact of the attribution baseline with IG, and there is no “best” baseline to use; however, this paper [214] provides some insight into how to approach selecting a good baseline, relating it to an assumption of missingness (i.e. how the model’s output is affected when a particular attribute is omitted from the sample instance) for a given model and data distribution. 127

As do the previous two interpretability methods covered here, IG seeks to score the importance of an attribute’s contribution in a given instance towards the overall classifica- tion by the model. Given a target x, a network function f, and attribute i, IG assigns a

′ score φi(f,x,x ) for attribute i. A score close to zero indicates that the attribute in question contributed little to nothing towards the classification score. Negative or positive numbers indicate whether the score was positively or negatively affected by a given attribute.

In this seminal paper [215] the authors introduce IG as a self-evident (axiomatic) method towards explainability; additive feature attribution (i.e. completeness) can be viewed thusly: n IntegratedGrads (x)= F (x) F (x′), where x is the original input, x′ i=1 i − P is the baseline, and F : Rn [0, 1] is the output of a neural network, assuming F : Rn R → → is differentiable almost everywhere.

Formally, the integrated gradients defines the contribution of a given attribute, i for

an input x and baseline x′ thusly: let ∂F (x) be the gradient of F (x) along the ith dimension. ∂xi

1 ′ ′ ′ ∂F (x + α (x x )) IntegratedGradsi(x) ::= (xi xi) × − (8.9) − × Zα=0 ∂xi

Put another way, integrated gradients will aggregate the gradients along the inputs

that fall on the (straight) line between the baseline and the input [215]. This idea shows that

IG is among a group of methods known as path methods. Unlike many gradients methods

used to interpret neural networks, IG avoids saturation by integrating over a path, which

tells us, when scaled along this path, which attribute increased the output towards the

correct classification [214, 215].

8.8 Interpretable Model-Agnostic Explanations (LIME)

As with SHAP, LIME [185] also builds on locally interpretable analogs to produce

model explanations, i.e. interpretable representations that are locally faithful to the clas- 128 sifier. Instead of weighting features according to what the coalition would receive in the

Shapley value estimate, LIME weights are according to how close neighboring sampled per- turbations are to the original. From that information LIME learns a sparse linear model based on predictions of the perturbed neighboring samples, where features with the largest coefficients are considered to be most important. Define g G, where G is a class of po- ∈ tentially interpretable models. Since g acts over the presence/absence of the interpretable

′ features, it is essentially a feature mask with a domain of 0, 1 d . As expressed by the { } authors, LIME produces an explanation thusly:

ξ(x) = argmin L(f, g, πx)+Ω(g) (8.10) g∈G where f is the model, x is the target to explain. LIME will sample perturbed samples around x and choose a model g from some interpretable functional space G. πx is the

distribution around x, and Ω(g) is the model complexity penalty. Some variant of least

absolute shrinkage and selection operator (LASSO) trained with an L1 prior is the typical

method used for producing a sparse solution. While LIME does provide a method towards

greater transparency for black box models, it tends to harbor more uncertainty than SHAP

and integrated gradients (IG) [263].

8.8.1 Applying Interpretability to Electroencephalography

We experimented applying the above interpretability techniques to our classifiers so

that we may better understand which attributes these models were weighting most heav-

ily towards a given output. From the perspective of multichannel EEG, we’re interested

in understanding both the temporal (along the x-axis) and spatial contributes (across all

channels). See Figure 8.1 for 2D temporal vs. 3D spatial and temporal illustration. 129

Figure 8.1: Below: 2D graph of a 1-second trial with 61 channels. Above: the same trial across channels, giving us a 3D spatial perspective.

After performing various experiments, we settled on applying integrated gradients for

creating our explainers. We propose several visual techniques so that the human in the loop

(HIM) can gain insight into what the ML models are learning. Coupled with our transfer

learning approach, this could allow for the creation of a BCI system tailored towards a

specific ”out-of-the-box” use deployed with a higher level of confidence than what would

otherwise be possible.

Here, we interpret the best performers from our transfer learning models from Chapter

7. This information will lead us to determine how we might approach configuring a BCI

system with these models, and if they can ultimately be trusted to perform after deployed.

These experiments were conducted using the dataset covered in Chapter 5 and learning

models from Chapter 7. 130

8.9 Approach to Explainability: Integrated Gradients (IG)

In addition to using ROC curves and AUC for evaluating our learning models, we chose to implement an approach towards gaining more insight into what a given model might be learning. Rather than treat each model as a black box ranked on scores based on outputs alone, we coupled our assessments with integrated gradients (covered in Section 8.7). IG provides an approach for explaining the prediction of any classifier by learning a locally interpretable model around the predictions. We may take nonlinear classifiers, such as those used in our work, zoom in around localized areas within a given prediction, and perform an analysis in linear terms; this can be thought of as a simplified abstraction of the data used in the original prediction. Once we obtain this information, it is used to create explainers that become simplified, but true, representations of the original data.

These explainers can be visualized in many ways, giving us insight that we might not otherwise have. Through these explainers we may learn whether or not to trust certain predictions, as well as how much we should trust the classifiers making those predictions. One may think of explainers as part of a decision support system (DSS) that may help both technical and nontechnical persons in assessing the performance of complex learning models by producing interpretable models that are much easier for humans to examine and assess.

For our purposes, we are placing the human in the loop so that we may learn the regions within the EEG signals where each classifier placed the greatest importance (highest weights). Knowing this information allows us to gain a better understanding of our data as well as to determine our trust in both the predictions of our models and the models themselves. To our knowledge, this work is the first to apply integrated gradients explainers to electroencephalography and BCI.

We implemented IG because it proved to have superior properties among the three methods we reviewed. We performed the following steps on the top performers for match/no-match and HR/LR (a minimal code sketch of step 2(d) follows the list). For a given model and threshold, do the following:

1. We start with the predictions (scores) related to the pretrained model.

2. For each examined sample (positive or negative) label, create an explainer:

(a) Evaluate the model using an allotment of test data.

(b) Collect and rank all TP and TN scores using a chosen prediction threshold.

(c) Aggregate the data samples across channels for the top N scores for each TP and TN.

(d) Obtain the feature attributions (i.e., weights) via integrated gradients (this results in a 2D array of IG weights for each channel and sample therein).

(e) Using these weights, create a temporal and spatial heat map that corresponds to each channel and its attributed factor.

(f) Similarly, use these weights to create a temporal and spatial dipole heat map that depicts the spatial and temporal contribution of each feature in the sample.

3. We now have a way of examining what a classifier learns; we can better visualize from different perspectives what our models are learning.

4. We can examine each explainer to get an idea of which channels contribute positively or negatively to the overall prediction.
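The following is a minimal sketch of step 2(d), approximating integrated gradients for a single EEG trial. It assumes a PyTorch classifier mapping a (channels, samples) tensor to class logits; the model, the baseline choice, and the step count are illustrative assumptions rather than the exact configuration used in our experiments.

    # Riemann-sum approximation of integrated gradients along the straight-line
    # path from a baseline signal to the trial; returns one IG weight per channel
    # and per time sample, matching the 2D array described in step 2(d).
    import torch

    def integrated_gradients(model, trial, baseline=None, target=1, steps=50):
        if baseline is None:
            baseline = torch.zeros_like(trial)            # flat signal as reference
        alphas = torch.linspace(0.0, 1.0, steps).view(-1, 1, 1)
        path = baseline.unsqueeze(0) + alphas * (trial - baseline).unsqueeze(0)
        path.requires_grad_(True)
        scores = model(path)[:, target]                   # target-class logits
        grads, = torch.autograd.grad(scores.sum(), path)  # gradient at each path step
        return (trial - baseline) * grads.mean(dim=0)     # (channels, samples) IG weights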

We may examine the positive and negative contributions over space and time using interactive tools. This insight can lead to better BCI system tuning for systems applied to specific tasks. As an example, we can visualize what features contribute to high-risk TP and TN in our Siamese model to understand what leads to a prediction. In addition, we can visualize the spatial components to determine whether or not we can reduce the number of channels used.

8.10 Interpreting the Explainers

The reader should not discount the fact that examining information related to EEG requires a certain level of knowledge in neuroscience, electroencephalography, BCI, or some combination of these topics. Interpretability and related explainers are simply tools provided so that subject matter experts may better understand what features our models are learning. We provide an interpretation of the explainers below so that readers without any background may examine what these explainers show.

In EP (match/no-match) we would expect the presence of early latent activity in

EEG between 0 and 400 milliseconds after the stimulus is introduced to the subject (refer

to Section 4.1.3 for details); therefore, we should see some indication that our classifiers have

learned some sort of representation of this early temporal activity. Similarly, for sustained

ERP (HR vs LR), we might see some sort of difference in how regions of the brain react to

stimulus over time between high-risk and low-risk subjects. According to our data analysis,

normal progression should start towards the temporal region of the brain and eventually

progress towards the frontal lobes; see Figures 5.6 and 5.7. We expect our models to have

learned these differences in spatial progression between HR and LR subjects. We fashioned

our explainers to convey these details, assuming that they exist.

We decided to build three types of visuals around our explainers: (1) a spatial dipole heat map representing weighted features as they relate to regions of the brain over an entire 1-second trial, representing spatial details only; (2) a joint temporal dipole heat map that represents weighted features as they progress over the length of a trial, also showing weights spatially and temporally for each channel; and (3) a channel heat map that shows how each feature's weighted contributions change over time within each channel. We show several examples of these explainers and how we interpreted them. Keep in mind that these images depict feature contributions (i.e., weighted features) towards a particular classification; they do not necessarily correspond directly to the actual EEG amplitudes we examined in Chapter 5.
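As an illustration of visual (3), a channel heat map can be rendered directly from the IG weight array. This is a minimal matplotlib sketch under stated assumptions: ig_weights (a channels-by-samples array of attributions), channel_names, and the sampling rate are hypothetical inputs.

    # Channel heat map: rows are EEG channels, columns are time samples, and the
    # diverging color scale encodes negative (blue) vs. positive (red) IG weights.
    import numpy as np
    import matplotlib.pyplot as plt

    def channel_heatmap(ig_weights, channel_names, sfreq=256):
        vmax = np.abs(ig_weights).max()                   # symmetric color limits
        fig, ax = plt.subplots(figsize=(10, 8))
        im = ax.imshow(ig_weights, aspect="auto", cmap="seismic",
                       vmin=-vmax, vmax=vmax,
                       extent=[0, ig_weights.shape[1] / sfreq, len(channel_names), 0])
        ax.set_yticks(np.arange(len(channel_names)) + 0.5)
        ax.set_yticklabels(channel_names, fontsize=6)
        ax.set_xlabel("Time (s)")
        fig.colorbar(im, ax=ax, label="IG weight")
        plt.show()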

8.10.1 High Risk Vs. Low Risk Explainers

Choosing a threshold of 0.51, we took all true positives (representing HR) and all true negatives (representing LR) and created explainers for each category. Figures 8.2, 8.3, and 8.4 present the same information in different ways: all depict how EEG features contributed to high-risk classification. Figures 8.5, 8.6, and 8.7 all depict how EEG features contributed to low-risk classification.
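For reference, the thresholded selection of true positives and true negatives can be sketched as follows; scores, labels, the 0.51 threshold, and top_n are illustrative parameters in the spirit of steps 2(b) and 2(c) above, not the exact code used in our pipeline.

    # Collect and rank true positives and true negatives at a chosen threshold,
    # keeping the top-N most confident of each for aggregation.
    import numpy as np

    def split_tp_tn(scores, labels, threshold=0.51, top_n=50):
        preds = (scores >= threshold).astype(int)
        tp = np.where((preds == 1) & (labels == 1))[0]    # correctly flagged positives (HR)
        tn = np.where((preds == 0) & (labels == 0))[0]    # correctly flagged negatives (LR)
        tp = tp[np.argsort(scores[tp])[::-1][:top_n]]     # highest scores first
        tn = tn[np.argsort(scores[tn])[:top_n]]           # lowest scores first
        return tp, tn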

Comparing aggregated dipoles across subjects in Figures 8.2 and 8.5 reveals the spatial differences between the two groups. We can clearly see these differences when the images are juxtaposed. These images tell us nothing about what was learned of ERP progression over time, but could be useful if we are not concerned with those temporal details; this would depend on the application.

To gain a better understanding of what our Siamese transfer models learned about temporal progression, we can compare Figures 8.3 and 8.6. The biggest differences in the way the features contribute over time seem to be between the frontal lobes and the temporal lobe. LR subjects have large oscillatory feature swings starting in the temporal lobe that contribute to most of the activity within the first half of a trial, then shifting to the frontal lobes, while HR subjects seem to show features related to the frontal lobes contributing more throughout the entire trial. Interestingly enough, the LR group (also the control group) shows patterns that seem to jibe with what we would expect to see in a normal subject. HR is clearly different in this regard.

If we change the temporal perspective slightly, we get yet another piece of information not readily visible in the first two representations. Figures 8.4 and 8.7 depict temporal heat maps. At first glance, these maps seem to introduce nothing but confusion; however, looking more closely shows that there is less “noise” in the LR map. This could indicate that certain LR ERP features are more pronounced, overwhelming other, less relevant features and leaving those features with barely any contribution (i.e., neutral weight). This could also indicate that some BCI users may require systems trained on more data from their respective clinically designated risk groups.

Figure 8.2: A dipole heat map representing what our Siamese network learned over all HR tasks. This image shows aggregated feature contributions (depicted through IG) across all TP high risk (alcoholic) subjects averaged across the entire length of a 1-second trial, showing which channels spatially contributed negatively (blue) or positively (red) towards the prediction.

8.10.2 Match Vs. No-Match Explainers

Just as we did with the HR vs. LR explainers for our Siamese model, we wish to gain some understanding of what our binary transfer model is learning about match/no-match trials. Choosing a threshold of 0.51, we took all true positives (representing match) and all true negatives (representing no-match) and created explainers for each category. Unlike HR vs. LR, we expect to see most activity (i.e., most contributions) somewhere in the vicinity of 0 to 400 milliseconds.

Figure 8.3: A joint temporal dipole heat map representing what our Siamese network learned over all HR tasks. This image shows aggregated feature contributions (depicted through IG) across all TP high risk (alcoholic) subjects through time over the entire length of a 1-second trial, showing which channels spatially and temporally contributed negatively (blue) or positively (red) towards the prediction. Below the dipole maps we are able to follow the contributions of each EEG channel through time.

Comparing the aggregate dipoles across subjects in Figures 8.8 and 8.11 reveals the spatial differences between the two trial types; however, we are much more interested in knowing whether our model has learned the difference between P3a and P3b (see Section 4.2).

Contrasting Figures 8.9 and 8.12, we can clearly see that there is a similar progression of feature contributions in the temporal region between 0 and 400 milliseconds, just as we would expect from a classifier that was properly learning early-to-mid-latent EP features. Notice how most features after 400 milliseconds have barely any positive contributions. Our classifier is clearly learning the proper temporal aspects. If we examine these contributions between match and no-match, the differences are also made clearer there.

Figure 8.4: A temporal heat map representing what our Siamese network learned over all HR tasks. This image shows aggregated feature contributions (depicted through IG) across all TP high risk (alcoholic) subjects through time over the entire length of a 1-second trial, showing which channels contributed negatively or positively towards the prediction.

The temporal heat maps in Figures 8.10 and 8.13 provide a better picture of which features in each channel contribute the most in the early temporal regions. We could use these sorts of heat maps to determine which channels we may omit, creating less expensive, more portable BCI systems by cutting out channels that contribute less to the overall prediction coming from our classifiers.
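A simple way to act on this observation is to rank channels by their aggregate attribution and keep only the strongest contributors. The sketch below assumes a hypothetical ig_weights array of shape (channels, samples) and an arbitrary keep count; it is one plausible ranking scheme, not a prescription.

    # Rank channels by mean absolute IG weight; low-ranked channels are candidates
    # for omission in a reduced-channel, more portable BCI montage.
    import numpy as np

    def rank_channels(ig_weights, keep=32):
        importance = np.abs(ig_weights).mean(axis=1)  # aggregate |weight| per channel
        order = np.argsort(importance)[::-1]          # most important first
        return order[:keep], order[keep:]             # channels to keep vs. drop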

Figure 8.5: A dipole heat map representing what our Siamese network learned over all LR tasks. This image shows aggregated feature contributions (depicted through IG) across all low risk (non-alcoholic) subjects averaged across the entire length of a 1-second trial, showing which channels spatially contributed negatively (blue) or positively (red) towards the prediction.

Figure 8.6: A joint temporal dipole heat map representing what our Siamese network learned over all LR tasks. This image shows aggregated feature contributions (depicted through IG) across all low risk (non-alcoholic) subjects through time over the entire length of a 1-second trial, showing which channels spatially and temporally contributed negatively (blue) or positively (red) towards the prediction. Below the dipole maps we are able to follow the contributions of each EEG channel through time.

Figure 8.7: A temporal heat map representing what our Siamese network learned over all LR tasks. This image shows aggregated feature contributions (depicted through IG) across all low risk (non-alcoholic) subjects through time over the entire length of a 1-second trial, showing which channels contributed negatively or positively towards the prediction.

Figure 8.8: A dipole heat map representing what our binary (CNN) network learned over all match tasks. This image shows aggregated feature contributions (depicted through IG) across all TP match tasks averaged across the entire length of a 1-second trial, showing which channels spatially contributed negatively (blue) or positively (red) towards the prediction.

Figure 8.9: A joint temporal dipole heat map representing what our binary (CNN) network learned over all match tasks. This image shows aggregated feature contributions (depicted through IG) across all TP match tasks through time over the entire length of a 1-second trial, showing which channels spatially and temporally contributed negatively (blue) or positively (red) towards the prediction. Below the dipole maps we are able to follow the contributions of each EEG channel through time.

Figure 8.10: A temporal heat map representing what our binary (CNN) network learned over all match tasks. This image shows aggregated feature contributions (depicted through IG) across all TP match tasks through time over the entire length of a 1-second trial, showing which channels contributed negatively or positively towards the prediction.

Figure 8.11: A dipole heat map representing what our binary (CNN) network learned over all no-match tasks. This image shows aggregated feature contributions (depicted through IG) across all TN match tasks averaged across the entire length of a 1-second trial, showing which channels spatially contributed negatively (blue) or positively (red) towards the prediction.

Figure 8.12: A joint temporal dipole heat map representing what our binary (CNN) network learned over all no-match tasks. This image shows aggregated feature contributions (depicted through IG) across all TN match tasks through time over the entire length of a 1-second trial, showing which channels spatially and temporally contributed negatively (blue) or positively (red) towards the prediction. Below the dipole maps we are able to follow the contributions of each EEG channel through time.

Figure 8.13: A temporal heat map representing what our binary (CNN) network learned over all no-match tasks. This image shows aggregated feature contributions (depicted through IG) across all TN match tasks through time over the entire length of a 1-second trial, showing which channels contributed negatively or positively towards the prediction.

8.11 Conclusions

Providing interpretable explainers can give insight into what our models are learning. For example, we may gain a much better understanding of why it is difficult to classify short latent EP (match/no-match) tasks by examining the explainers. Notice how closely TP and TN resemble one another in the aggregate; the spatial and temporal similarities are extremely narrow, and yet (astonishingly) transfer learning has enabled us to create a classifier that not only performs reasonably well at this task but also generalizes extremely well. In addition, we now have a better understanding of why classification for HR vs. LR (clinical) performs much better than our match/no-match classifiers; the differences are more easily differentiated. Understanding which channels or which areas of the head contribute most towards what our models learn may help in constructing specialized classifiers for use with specific BCI tasks and perhaps even with particular groups of users.

CHAPTER 9

CONCLUSIONS AND DISCUSSION

This study focused on generalizing BCI classifiers using deep transfer learning models so that we might improve EP-based BCI systems, pushing the technology closer towards consumer and clinical markets.

We conducted experiments designed to evaluate many traditional BCI classifiers in order to better understand how they might generalize across subjects and tasks when trained only on raw EEG data. The results of these experiments helped identify the best performers for both short EP and sustained ERP BCI systems, and provided the baseline performance against which our transfer models were compared.

We introduced transfer learning approaches to generalization that allow for simplified, shortened classifier training specific to EEG, where performance was improved and sustained across dimensionality and training-set size relative to pretrained models as well as traditional BCI classifiers. Our transfer models were trained using mini-batches in anticipation of the eventual explosion in the size and availability of EEG datasets.

We provided evidence that interpretable explainer models may offer intuitive visual representations of learned features. We examined state-of-the-art BCI research and the difficulties in classifying EP stemming from the inherent properties of EEG; we demonstrated that crafting BCI classifiers that generalize well across users and tasks is extremely difficult. We have demonstrated that deep transfer learning combined with interpretability may provide a way towards overcoming the challenges of crafting good classifiers for use in BCI systems.

We demonstrated that deep transfer learning allows for good performance with learned features over hand-crafted features despite training on little data, and can even improve said performance while significantly increasing the feature space. In addition, we showed that transfer learning generalizes better across subjects.

We were not able to demonstrate a good way to generalize across tasks, however. Instead, we uncovered good performers for the two main task types that almost all BCI systems utilize: short latent actionable EP (go/no-go, match/no-match) and sustained ERP patterns found in motor imagery, mental workload, or clinical pathologies.

9.1 Future Work

We will continue to search for classifiers that generalize across tasks, as such a classifier may offer more flexibility, thus simplifying the task of building and training BCI systems. In addition, research into better interpretable explainers could help provide a way towards building an assistive tool for training and evaluating BCI classifiers. Lastly, we see more collaborative research efforts between the author and subject matter experts (SMEs) in both clinical and commercial settings as a very important step towards building better BCI. Interpretable explainers may improve through these collaborative efforts; better explainers may positively impact our work towards discovering better BCI classifiers.

REFERENCES

[1] Enas Abdulhay, Rami Oweis, Areej Mohammad, and Lujain Ahmad. Investigation of a wavelet-based neural network learning algorithm applied to P300 based brain- computer interface. Biomedical Research, pages 1–1, 2017.

[2] David Achanccaray, Christian Flores, Christian Fonseca, and Javier Andreu-Perez. A P300-based brain computer interface for smart home interaction through an anfis ensemble. In Fuzzy Systems (FUZZ-IEEE), 2017 IEEE International Conference, pages 1–5. IEEE, 2017.

[3] N. Acir, I. Oztura, M. Kuntalp, B. Baklan, and C. Guzelis. Automatic detection of epileptiform events in EEG by a three-stage procedure based on artificial neural networks. Biomedical Engineering, IEEE Transactions, 52(1):30–40, Jan 2005.

[4] Faraz Akram, Seung Moo Han, and Tae-Seong Kim. An efficient word typing P300-BCI system using a modified t9 interface and random forest classifier. Computers in Biology and Medicine, 56:30–36, 2015.

[5] Tarik Al-Ani and Dalila Trad. Signal Processing and Classification Approaches for Brain-Computer Interface. 01 2010.

[6] Abeer Al-Nafjan, Manar Hosny, Yousef Al-Ohali, and Areej Al-Wabil. Review and classification of emotion recognition based on EEG brain-computer interface system research: a systematic review. Applied Sciences, 7(12):1239, 2017.

[7] Audrey Aldridge, Eli Barnes, Cindy L Bethel, Daniel W Carruth, Marianna Kocturova, Matus Pleva, and Jozef Juhar. Accessible electroencephalograms (EEGs): A comparative review with OpenBCI's Ultracortex Mark IV headset. In 2019 29th International Conference Radioelektronika (RADIOELEKTRONIKA), pages 1–6. IEEE, 2019.

[8] Brendan Z Allison and Jaime A Pineda. ERPs evoked by different matrix sizes: implications for a brain computer interface (BCI) system. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 11(2):110–113, 2003.

[9] F Aloise, I Lasorsa, F Schettini, A Brouwer, D Mattila, F Babiloni, S Salinari, M Marciani, and F Cincotti. Multimodal stimulation for a P300-based BCI. Int. J. Bioelectromagn, 9:128–130, 2007.

[10] F Aloise, F Schettini, P Aricò, F Leotta, S Salinari, D Mattia, F Babiloni, and F Cincotti. P300-based brain–computer interface for environmental control: an asynchronous approach. Journal of Neural Engineering, 8(2):025025, 2011.

[11] Fabio Aloise, Francesca Schettini, Pietro Aricò, Serenella Salinari, Fabio Babiloni, and Febo Cincotti. A comparison of classification techniques for a gaze-independent P300-based brain–computer interface. Journal of Neural Engineering, 9(4):045012, 2012.

[12] José M Alonso, Luis Magdalena, and Gil González-Rodríguez. Looking for a good fuzzy system interpretability index: An experimental approach. International Journal of Approximate Reasoning, 51(1):115–134, 2009.

[13] José M Alonso, Luis Magdalena, and Serge Guillaume. HILK: A new methodology for designing highly interpretable linguistic knowledge bases using the fuzzy logic formalism. International Journal of Intelligent Systems, 23(7):761–794, 2008.

[14] David Alvarez-Melis and Tommi S Jaakkola. On the robustness of interpretability methods. arXiv preprint arXiv:1806.08049, 2018.

[15] Filippo Arrichiello, Paolo Di Lillo, Daniele Di Vito, Gianluca Antonelli, and Stefano Chiaverini. Assistive robot operated via P300-based brain computer interface. In Robotics and Automation (ICRA), 2017 IEEE International Conference on, pages 6032–6037. IEEE, 2017.

[16] Yashodhan Athavale and Sridhar Krishnan. Biosignal monitoring using wearables: Observations and opportunities. Biomedical Signal Processing and Control, 38:22 – 33, 2017.

[17] H. L. Attwood and W. A. MacKay. Essentials of Neurophysiology. B. C. Decker, Burlington, Ontario, First edition, 1989.

[18] Adham Atyabi, Martin Luerssen, Sean P Fitzgibbon, Trent Lewis, and David MW Powers. Reducing training requirements through evolutionary based dimension reduction and subject transfer. Neurocomputing, 224:19–36, 2017.

[19] Eda AKMAN AYDIN, Omer Faruk BAY, and Inan Guler. P300 based asynchronous brain computer interface for environmental control system. IEEE Journal of Biomed- ical and Health Informatics, 2017.

[20] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.

[21] Garima Bajwa and Ram Dantu. Neurokey: Towards a new paradigm of cancelable biometrics-based key generation using electroencephalograms. Computers & Security, 62:95–113, 2016.

[22] Alexandre Barachant, Stéphane Bonnet, Marco Congedo, and Christian Jutten. Multiclass brain–computer interface classification by Riemannian geometry. IEEE Transactions on Biomedical Engineering, 59(4):920–928, 2011.

[23] Andrei Belitski, Jason Farquhar, and Peter Desain. P300 audio-visual speller. Journal of Neural Engineering, 8(2):025022, 2011.

[24] Abdelkader Nasreddine Belkacem, Nuraini Jamil, Jason A Palmer, Sofia Ouhbi, and Chao Chen. Brain computer interfaces for improving the quality of life of older adults and elderly patients. Frontiers in Neuroscience, 2020.

[25] Chris Berka, Daniel J Levendowski, Michelle N Lumicao, Alan Yau, Gene Davis, Vladimir T Zivkovic, Richard E Olmstead, Patrice D Tremoulet, and Patrick L Craven. EEG correlates of task engagement and mental workload in vigilance, learning, and memory tasks. Aviation, Space, and Environmental Medicine, 78(5):B231–B244, 2007.

[26] Saugat Bhattacharyya, Anwesha Khasnobish, Somsirsa Chatterjee, Amit Konar, and DN Tibarewala. Performance analysis of LDA, QDA and KNN algorithms in left-right limb movement classification from EEG data. In 2010 International Conference on Systems in Medicine and Biology, pages 126–131. IEEE, 2010.

[27] Adrien Bibal and Benoît Frénay. Interpretability of machine learning models and representations: an introduction. In ESANN, 2016.

[28] B. Blankertz, K. R. Muller, D. J. Krusienski, G. Schalk, J. R. Wolpaw, A. Schlogl, G. Pfurtscheller, J. R. Millan, M. Schroder, and N. Birbaumer. The BCI competition III: validating alternative approaches to actual BCI problems. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 14(2):153–159, June 2006.

[29] Christoph Bledowski, David Prvulovic, Rainer Goebel, Friedhelm E Zanella, and David E.J Linden. Attentional systems in target and distractor processing: a combined ERP and fMRI study. NeuroImage, 22(2):530–540, 2004.

[30] Maria V Ruiz Blondet, Adarsha Badarinath, Chetan Khanna, and Zhanpeng Jin. A wearable real-time BCI system based on mobile cloud computing. In 2013 6th International IEEE/EMBS Conference on Neural Engineering (NER), pages 739–742. IEEE, 2013.

[31] Loïc Botrel, Elisa Mira Holz, and Andrea Kübler. Using brain painting at home for 5 years: Stability of the P300 during prolonged BCI usage by two end-users with ALS. In International Conference on Augmented Cognition, pages 282–292. Springer, 2017.

[32] Michael M Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst. Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017.

[33] Anne-Marie Brouwer and Jan BF Van Erp. A tactile P300 brain-computer interface. Frontiers in Neuroscience, 4, 2010.

[34] Roberta Carabalona, Ferdinando Grossi, Adam Tessadri, Paolo Castiglioni, Antonio Caracciolo, and Ilaria de Munari. Light on! Real world evaluation of a P300-based brain–computer interface (BCI) for environment control in a smart home. Ergonomics, 55(5):552–563, 2012.

[35] Jessica Carlsen, Heidi Grabenstatter, Rory Lewis, Chad A. Mello, Amy Brooks-Kayal, and Andrew M. White. Identification of seizures in prolonged video-EEG recordings. 66th American Epilepsy Society Annual Meeting, November 30 - December 4, San Diego, CA, USA: Poster, 2012.

[36] Diogo V Carvalho, Eduardo M Pereira, and Jaime S Cardoso. Machine learning interpretability: A survey on methods and metrics. Electronics, 8(8):832, 2019.

[37] Alberto Casagrande, Joanna Jarmolowska, Marcello Turconi, Francesco Fabris, and Piero Paolo Battaglini. Polymorph: A P300 polymorphic speller. In International Conference on Brain and Health Informatics, pages 297–306. Springer, 2013.

[38] Hubert Cecotti. Spelling with non-invasive brain–computer interfaces–current and future trends. Journal of Physiology-Paris, 105(1):106–114, 2011.

[39] Hubert Cecotti and Axel Graser. Convolutional neural networks for P300 detection with application to brain-computer interfaces. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(3):433–445, 2011.

[40] Ka Xiong Charand. Action potentials. http://hyperphysics.phy-astr.gsu.edu/hbase/biology/actpot.html, 2014.

[41] Camille Chatelle, Camille A Spencer, Sydney S Cash, Leigh R Hochberg, and Brian L Edlow. Feasibility of an EEG-based brain-computer interface in the intensive care unit. Clinical Neurophysiology, 129(8):1519–1525, 2018.

[42] SC Chen, AR See, CK Liang, and YY Lee. Evaluating the performance of the P300-based brain computer interface for the lego page turner. In Proceedings of the 2nd International Conference on Intelligent Technologies and Engineering Systems (ICITES2013), pages 765–771. Springer, 2014.

[43] Xiaogang Chen, Yijun Wang, Masaki Nakanishi, Xiaorong Gao, Tzyy-Ping Jung, and Shangkai Gao. High-speed spelling with a noninvasive brain–computer interface. Proceedings of the National Academy of Sciences, 112(44):E6058–E6067, 2015.

[44] Yuqian Chen, Yufeng Ke, Guifang Meng, Jin Jiang, Hongzhi Qi, Xuejun Jiao, Minpeng Xu, Peng Zhou, Feng He, and Dong Ming. Enhancing performance of P300-speller under mental workload by incorporating dual-task data during classifier training. Computer Methods and Programs in Biomedicine, 152:35–43, 2017.

[45] Yu M. Chi, Yijun Wang, Yu-Te Wang, Tzyy-Ping Jung, Trevor Kerth, and Yuchen Cao. A Practical Mobile Dry EEG System for Human Computer Interfaces. In Dylan D. Schmorrow and Cali M. Fidopiastis, editors, Foundations of Augmented Cognition, pages 649–655, Berlin, Heidelberg, 2013. Springer Berlin Heidelberg.

[46] Eleni Chiou and Sadasivan Puthusserypady. Spatial filter feature extraction methods for P300 BCI speller: A comparison. In Systems, Man, and Cybernetics (SMC), 2016 IEEE International Conference on, pages 003859–003863. IEEE, 2016.

[47] Anna Chrapusta, Juri D Kropotov, and Maria Pachalska. Neuromarkers of post-traumatic stress disorder (PTSD) in a patient after bilateral hand amputation–ERP case study. Ann Agric Environ Med, 24(2):265–270, 2017.

[48] Ching-Lin Chu, I Hui Lee, Mei Hung Chi, Kao Chin Chen, Po See Chen, Wei Jen Yao, Nan Tsing Chiu, and Yen Kuang Yang. Availability of dopamine transporters and auditory P300 abnormalities in adults with attention-deficit hyperactivity disorder: preliminary results. CNS Spectrums, pages 1–7, 2017.

[49] Nikolay Chumerin, Nikolay V Manyakov, Marijn van Vliet, Arne Robben, Adrien Combaz, and Marc M Van Hulle. Steady-state visual evoked potential-based computer gaming on a consumer-grade EEG device. IEEE Transactions on Computational Intelligence and AI in Games, 5(2):100–110, 2012.

[50] Caterina Cinel, Riccardo Poli, and Luca Citi. Possible sources of perceptual errors in P300-based speller paradigm. Repository.essex.ac.uk, 2004.

[51] Adam Coates, Andrew Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. In Geoffrey Gordon, David Dunson, and Miroslav Dudík, editors, Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, volume 15 of Proceedings of Machine Learning Research, pages 215–223, Fort Lauderdale, FL, USA, 11–13 Apr 2011. PMLR.

[52] Mehmet Çokyilmaz and Nahit Emanet. Classification of alcoholic subjects using multi channel ERPs based on channel optimization and probabilistic neural network. In 2011 IEEE 9th International Symposium on Applied Machine Intelligence and Informatics (SAMI), pages 225–229. IEEE, 2011.

[53] Jennifer L Collinger, Brian Wodlinger, John E Downey, Wei Wang, Elizabeth C Tyler-Kabara, Douglas J Weber, Angus JC McMorland, Meel Velliste, Michael L Boninger, and Andrew B Schwartz. High-performance neuroprosthetic control by an individual with tetraplegia. The Lancet, 381(9866):557–564, 2013.

[54] Marco Congedo, Pedro Rodrigues, and Christian Jutten. The Riemannian minimum distance to means field classifier. 2019.

[55] Tim RH Cutmore, Tatjana Djakovic, Mark R Kebbell, and David HK Shum. An object cue is more effective than a word in ERP-based detection of deception. International Journal of Psychophysiology, 71(3):185–192, 2009.

[56] Leandro da Silva-Sauer, Luis Valero-Aguayo, Alejandro de la Torre-Luque, Ricardo Ron-Angevin, and Sergio Varona-Moya. Concentration on performance with P300- based BCI systems: A matter of interface features. Applied Ergonomics, 52:325–332, 2016.

[57] Jerome Daltrozzo and Christopher M. Conway. Neurocognitive mechanisms of statistical-sequential learning: what do event-related potentials tell us? Frontiers in Human Neuroscience, 8:437, 2014.

[58] Yu-Qin Deng, Song Li, and Yi-Yuan Tang. The relationship between wandering mind, depression and mindfulness. Mindfulness, 5(2):124–128, 2014.

[59] Emanuel Donchin, Kevin M Spencer, and Ranjith Wijesinghe. The mental prosthesis: assessing the speed of a P300-based brain-computer interface. IEEE Transactions on Rehabilitation Engineering, 8(2):174–179, 2000.

[60] Finale Doshi-Velez and Been Kim. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608, 2017.

[61] Riccardo Miotto, Fei Wang, Shuang Wang, Xiaoqian Jiang, and Joel T. Dudley. Deep learning for healthcare: review, opportunities and challenges. Briefings in Bioinformatics, 2017.

[62] Greg Efland, Sandip Parikh, Himanshu Sanghavi, and Aamir Farooqui. High performance DSP for vision, imaging and neural networks. IEEE Hot Chips, 28, 2016.

[63] Philippe Esling and Carlos Agon. Time-series data mining. ACM Computing Surveys (CSUR), 45(1):12, 2012.

[64] Fatemeh Fahimi, Zhuo Zhang, Wooi Boon Goh, Tih-Shi Lee, Kai Keng Ang, and Cuntai Guan. Inter-subject transfer learning with an end-to-end deep convolutional neural network for EEG-based BCI. Journal of Neural Engineering, 16(2):026007, 2019.

[65] Bahar Farahani, Farshad Firouzi, Victor Chang, Mustafa Badaroglu, Nicholas Constant, and Kunal Mankodiya. Towards fog-driven IoT eHealth: Promises and challenges of IoT in medicine and healthcare. Future Generation Computer Systems, 2017.

[66] Lawrence Ashley Farwell and Emanuel Donchin. Talking off the top of your head: toward a mental prosthesis utilizing event-related brain potentials. Electroencephalography and Clinical Neurophysiology, 70(6):510–523, 1988.

[67] Reza Fazel-Rezai and Kamyar Abhari. A comparison between a matrix-based and a region-based P300 speller paradigms for brain-computer interface. In Engineering in Medicine and Biology Society, 2008. EMBS 2008. 30th Annual International Conference of the IEEE, pages 1147–1150. IEEE, 2008.

[68] Reza Fazel-Rezai, Brendan Z Allison, Christoph Guger, Eric W Sellers, Sonja C Kleih, and Andrea Kübler. P300 brain computer interface: current challenges and emerging trends. Frontiers in Neuroengineering, 5, 2012.

[69] Raffaella Folgieri, Claudio Lucchiari, Marco Granato, and Daniele Grechi. Brain, technology and creativity. brainart: A BCI-based entertainment tool to enact creativity and create drawing from cerebral rhythms. In Digital Da Vinci, pages 65–97. Springer, 2014.

[70] Raffaella Folgieri and C Lucciari. Creative thinking: a brain computer interface of art. In International Conference on Live Interfaces. https://doi.org/10.13140/rg, volume 2, 2016.

[71] Judith M Ford. Schizophrenia: the broken P300 and beyond. Psychophysiology, 36(6):667–682, 1999.

[72] Abdur Rahim Mohammad Forkan and Ibrahim Khalil. Peace-home: Probabilistic estimation of abnormal clinical events using vital sign correlations for reliable home-based monitoring. Pervasive and Mobile Computing, 38:296–311, 2017. Special Issue IEEE International Conference on Pervasive Computing and Communications (PerCom) 2016.

[73] Islam A Fouad, Fatma El-Zahraa M Labib, Mai S Mabrouk, Amr A Sharawy, and Ahmed Y Sayed. Improving the performance of P300 BCI system using different methods. Network Modeling Analysis in Health Informatics and Bioinformatics, 9(1):1–13, 2020.

[74] Jeremy Frey. Comparison of a consumer grade EEG amplifier with medical grade equipment in BCI applications. 2016.

[75] Adrian Furdea, Sebastian Halder, DJ Krusienski, Donald Bross, Femke Nijboer, Niels Birbaumer, and Andrea Kübler. An auditory oddball (P300) spelling system for brain-computer interfaces. Psychophysiology, 46(3):617–625, 2009.

[76] Ioannis Gadaras and Ludmil Mikhailov. An interpretable fuzzy rule-based classification methodology for . Artificial Intelligence in Medicine, 47(1):25–41, 2009.

[77] Sofien Gannouni, Nourah Alangari, Hassan Mathkour, Hatim Aboalsamh, and Kais Belwafi. Bcwb: A P300 brain-controlled web browser. International Journal on Semantic Web and Information Systems (IJSWIS), 13(2):55–73, 2017.

[78] Andrew B. Gardner, Abba M. Krieger, George Vachtsevanos, and Brian Litt. One-class novelty detection for seizure analysis from intracranial EEG. J. Mach. Learn. Res., 7:1025–1044, December 2006.

[79] Alan Gevins and Michael E Smith. Neurophysiological measures of working memory and individual differences in cognitive ability and cognitive style. Cerebral Cortex, 10(9):829–839, 2000.

[80] Casey S Gilmore, Craig A Marquardt, Seung Suk Kang, and Scott R Sponheim. Reduced P3b brain response during sustained visual attention is associated with remote blast mTBI and current PTSD in US military veterans. Behavioural Brain Research, 2016.

[81] Leilani H Gilpin, David Bau, Ben Z Yuan, Ayesha Bajwa, Michael Specter, and Lalana Kagal. Explaining explanations: An overview of interpretability of machine learning. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pages 80–89. IEEE, 2018.

[82] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249–256, 2010.

[83] Craig J Gonsalvez and John Polich. P300 amplitude is determined by target-to-target interval. Psychophysiology, 39(3):388–396, 2002.

[84] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.

[85] Xiaotong Gu, Zehong Cao, Alireza Jolfaei, Peng Xu, Dongrui Wu, Tzyy-Ping Jung, and Chin-Teng Lin. EEG-based brain-computer interfaces (BCIs): A survey of recent studies on signal sensing technologies and computational intelligence approaches and their applications. arXiv preprint arXiv:2001.11337, 2020.

[86] Christoph Guger, Shahab Daban, Eric Sellers, Clemens Holzner, Gunther Krausz, Roberta Carabalona, Furio Gramatica, and Guenter Edlinger. How many people are able to control a P300-based brain–computer interface? Neuroscience Letters, 462(1):94–98, 2009.

[87] S Halder, I Käthner, and A Kübler. Training leads to increased auditory brain–computer interface performance of end-users with motor impairments. Clinical Neurophysiology, 127(2):1288–1296, 2016.

[88] David J. Hand. Mismatched models, wrong results, and dreadful decisions: On choosing appropriate data mining tools. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '09, page 1–2, New York, NY, USA, 2009. Association for Computing Machinery.

[89] He He and Dongrui Wu. Transfer learning enhanced common spatial pattern filtering for brain computer interfaces (BCIs): Overview and a new approach. In International Conference on Neural Information Processing, pages 811–821. Springer, 2017.

[90] Shenghong He, Tianyou Yu, Zhenghui Gu, and Yuanqing Li. A hybrid BCI web browser based on EEG and eog signals. In Engineering in Medicine and Biology Society (EMBC), 2017 39th Annual International Conference of the IEEE, pages 1006–1009. IEEE, 2017.

[91] Salah Helmy, Tarik Al-ani, Yskandar Hamam, and Essam El-madbouly. P300 based brain-computer interface using hidden markov models. In Intelligent Sensors, Sensor Networks and Information Processing, 2008. ISSNIP 2008. International Conference on, pages 127–132. IEEE, 2008.

[92] Andreas Herweg, Julian Gutzeit, Sonja Kleih, and Andrea Kübler. Wheelchair control by elderly participants in a virtual environment with a brain-computer interface (BCI) and tactile stimulation. Biological Psychology, 121(Part A):117–124, 2016.

[93] Ulrich Hoffmann, Gary Garcia, J-M Vesin, Karin Diserens, and Touradj Ebrahimi. A boosting approach to P300 detection with application to brain-computer interfaces. In Conference Proceedings. 2nd International IEEE EMBS Conference on Neural Engineering, 2005., pages 97–100. IEEE, 2005.

[94] M. Hooshmand, D. Zordan, D. Del Testa, E. Grisan, and M. Rossi. Boosting the Battery Life of Wearables for Health Monitoring through the Compression of Biosignals. IEEE Internet of Things Journal, PP(99):1–1, 2017.

[95] Fleur M Howells, Victoria L Ives-Deliperi, Neil R Horn, and Dan J Stein. Mindfulness based cognitive therapy improves frontal control in bipolar disorder: a pilot EEG study. BMC Psychiatry, 12(1):15, 2012.

[96] Eduardo Iáñez, José M Azorín, Andrés Úbeda, Eduardo Fernández, and José L Sirvent. LDA-based classifiers for a mental tasks-based brain-computer interface. In 2010 IEEE International Conference on Systems, Man and Cybernetics, pages 546–551. IEEE, 2010.

[97] Lester Ingber. Statistical mechanics of neocortical interactions: I. basic formulation. Physica D: Nonlinear Phenomena, 5(1):83–107, 1982.

[98] NE Md Isa, A Amir, MZ Ilyas, and MS Razalli. Motor imagery classification in brain computer interface (BCI) based on EEG signal by using machine learning technique. Bulletin of Electrical Engineering and Informatics, 8(1):269–275, 2019.

[99] Hisao Ishibuchi, Yutaka Kaisho, and Yusuke Nojima. Complexity, interpretability and explanation capability of fuzzy rule-based classifiers. In 2009 IEEE International Conference on Fuzzy Systems, pages 1730–1735. IEEE, 2009.

[100] Hisao Ishibuchi, Yusuke Nakashima, and Yusuke Nojima. Search ability of evolutionary multiobjective optimization algorithms for multiobjective fuzzy genetics-based machine learning. In 2009 IEEE International Conference on Fuzzy Systems, pages 1724–1729. IEEE, 2009.

[101] Jenna Wiens and Erica S Shenoy. Machine learning for healthcare: On the verge of a major shift in healthcare epidemiology. Clinical Infectious Diseases.

[102] Yang-Whan Jeon and John Polich. Meta-analysis of P300 and schizophrenia: patients, paradigms, and practical implications. Psychophysiology, 40(5):684–701, 2003.

[103] E. Kaniusas. Biomedical Signals and Sensors I: Linking Physiological Phenomena and Biosignals. Biological and Medical Physics, Biomedical Engineering. Springer Berlin Heidelberg, 2012.

[104] Eugenijus Kaniusas. Fundamentals of Biosignals. Springer Berlin Heidelberg, Berlin, Heidelberg, 2012.

[105] Alexander Y Kaplan, Sergei L Shishkin, Ilya P Ganin, Ivan A Basyul, and Alexander Y Zhigalov. Adapting the P300-based brain–computer interface for gaming: a review. IEEE Transactions on Computational Intelligence and AI in Games, 5(2):141–149, 2013.

[106] Tobias Kaufmann, Andreas Herweg, and Andrea Kübler. Toward brain-computer interface based wheelchair control utilizing tactually-evoked event-related potentials. Journal of Neuroengineering and Rehabilitation, 11(1):7, 2014.

[107] Tobias Kaufmann, Elisa M Holz, and Andrea Kübler. Comparison of tactile, auditory, and visual modality for brain-computer interface use: a case study with a patient in the locked-in state. Frontiers in Neuroscience, 7, 2013.

[108] Tobias Kaufmann, SM Schulz, Claudia Grünzinger, and Andrea Kübler. Flashing characters with famous faces improves ERP-based brain–computer interface performance. Journal of Neural Engineering, 8(5):056016, 2011.

[109] Y.U. Khan and J. Gotman. Wavelet based automatic seizure detection in intracerebral electroencephalogram. Clinical Neurophysiology, 114(5):898 – 908, 2003.

[110] Minah Kim, Tak Hyung Lee, Ji-Hun Kim, Hanwoom Hong, Tae Young Lee, Youngjo Lee, Dean F Salisbury, and Jun Soo Kwon. Decomposing P300 into correlates of genetic risk and current symptoms in schizophrenia: An inter-trial variability analysis. Schizophrenia Research, 2017.

[111] Sonja C Kleih and Andrea Kübler. Psychological factors influencing brain-computer interface performance. In Systems, Man, and Cybernetics (SMC), 2015 IEEE International Conference, pages 3192–3196. IEEE, 2015.

[112] Takumi Kodama and Shoji Makino. Convolutional neural network architecture and input volume matrix design for ERP classifications in a tactile P300-based brain- computer interface. In Engineering in Medicine and Biology Society (EMBC), 2017 39th Annual International Conference of the IEEE, pages 3814–3817. IEEE, 2017.

[113] Dean J Krusienski, Eric W Sellers, François Cabestaing, Sabri Bayoudh, Dennis J McFarland, Theresa M Vaughan, and Jonathan R Wolpaw. A comparison of classification techniques for the P300 speller. Journal of Neural Engineering, 3(4):299, 2006.

[114] Rafal Kus, Diana Valbuena, Jaroslaw Zygierewicz, Tatsiana Malechka, Axel Graeser, and Piotr Durka. Asynchronous BCI based on motor imagery with automated calibration and neurofeedback training. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 20(6):823–835, 2012.

[115] Owen Lahav, Nicholas Mastronarde, and Mihaela van der Schaar. What is interpretable? Using machine learning to design interpretable decision-support systems. arXiv preprint arXiv:1811.10799, 2018.

[116] Martin Längkvist, Lars Karlsson, and Amy Loutfi. A review of unsupervised feature learning and deep learning for time-series modeling. Pattern Recognition Letters, 42:11–24, 2014.

[117] Mikhail A Lebedev and Miguel AL Nicolelis. Brain–machine interfaces: past, present and future. Trends in Neurosciences, 29(9):536–546, 2006.

[118] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.

[119] Hyong-Euk Lee, Kwang-Hyun Park, and Zeungnam Zenn Bien. Iterative fuzzy clustering algorithm with supervision to construct probabilistic fuzzy rule base from numerical data. IEEE Transactions on Fuzzy Systems, 16(1):263–277, 2008.

[120] Te-Won Lee, Mark Girolami, and Terrence J Sejnowski. Independent component analysis using an extended infomax algorithm for mixed subgaussian and supergaussian sources. Neural Computation, 11(2):417–441, 1999.

[121] Wee Lih Lee, Tele Tan, and Yee Hong Leung. An improved P300 extraction using ica-r for P300-BCI speller. In Engineering in Medicine and Biology Society (EMBC), 2013 35th Annual International Conference of the IEEE, pages 7064–7067. IEEE, 2013.

[122] Rory Lewis, Michael Bihn, and Chad Mello. Machine intelligence: The neuroscience of chordal semantics and its association with emotion constructs and social demographics. In International Symposium on Methodologies for Intelligent Systems, pages 290–299. Springer, 2015.

[123] Rory Lewis, Chad A. Mello, Amy Brooks-Kayal, Jessica Carlsen, Heidi Grabenstatter, and Andrew M. White. Autonomous neuroclustering of pathologic oscillations using discretized centroids. MDA 2013, 2013.

[124] Rory Lewis, Chad A. Mello, and Andrew M. White. Tracking epileptogenesis progressions with layered fuzzy k-means and k-medoid clustering. ICCS 2012, 2012.

[125] Rory Lewis, Chad A. Mello, Yanyan Zhuang, Martin K.-C. Yeh, Yu Yan, and Dan Gopstein. Rough sets: Visually discerning neurological functionality during thought processes. In Foundations of Intelligent Systems, pages 32–41. Springer International Publishing, 2018.

[126] Feng Li, Yi Xia, Fei Wang, Dengyong Zhang, Xiaoyu Li, and Fan He. Transfer learning algorithm of P300-EEG signal based on xdawn spatial filter and Riemannian geometry classifier. Applied Sciences, 10(5):1804, 2020.

[127] Kun Li, Vanitha Narayan Raju, Ravi Sankar, Yael Arbel, and Emanuel Donchin. Advances and challenges in signal analysis for single trial P300-BCI. Foundations of Augmented Cognition. Directing the Future of Adaptive Systems, pages 87–94, 2011.

[128] Qi Li, Shuai Liu, Jian Li, and Ou Bai. Use of a green familiar faces paradigm improves P300-speller brain-computer interface performance. PloS One, 10(6):e0130325, 2015.

[129] Jeong-Hwan Lim, Han-Jeong Hwang, Chang-Hee Han, Ki-Young Jung, and Chang- Hwan Im. Classification of binary intentions for individuals with impaired oculomotor function: ‘eyes-closed’ ssvep-based brain–computer interface. Journal of Neural En- gineering, 10(2):026021, 2013.

[130] David EJ Linden. The P300: where in the brain is it produced and what does it tell us? The Neuroscientist, 11(6):563–576, 2005.

[131] Zachary Chase Lipton. The mythos of model interpretability. CoRR, abs/1606.03490, 2016.

[132] Ju-Chi Liu, Hung-Chyun Chou, Chien-Hsiu Chen, Yi-Tseng Lin, and Chung-Hsien Kuo. Time-shift correlation algorithm for P300 event related potential brain-computer interface implementation. Computational intelligence and neuroscience, 2016, 2016.

[133] Yisi Liu, Zirui Lan, Jian Cui, Olga Sourina, and Wolfgang Müller-Wittig. EEG-based cross-subject mental fatigue recognition. In 2019 International Conference on Cyberworlds (CW), pages 247–252. IEEE, 2019.

[134] Carla D Lopes, Josias O Mainardi, Milton A Zaro, and Altamiro Amadeu Susin. Classification of event-related potentials in individuals at risk for alcoholism using wavelet transform and artificial neural network. In 2004 Symposium on Computational Intelligence in Bioinformatics and Computational Biology, pages 123–128. IEEE, 2004.

[135] Fabien Lotte, Laurent Bougrain, Andrzej Cichocki, Maureen Clerc, Marco Congedo, Alain Rakotomamonjy, and Florian Yger. A review of classification algorithms for EEG-based brain–computer interfaces: a 10 year update. Journal of Neural Engineering, 15(3):031005, 2018.

[136] Fabien Lotte, Marco Congedo, Anatole Lécuyer, Fabrice Lamarche, and Bruno Arnaldi. A review of classification algorithms for EEG-based brain–computer interfaces. Journal of Neural Engineering, 4(2):R1, 2007.

[137] Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems, pages 4765–4774, 2017.

[138] Teng Ma, Hui Li, Lili Deng, Hao Yang, Xulin Lv, Peiyang Li, Fali Li, Rui Zhang, Tiejun Liu, Dezhong Yao, et al. The hybrid BCI system for movement control by combining motor imagery and moving onset visual evoked potential. Journal of Neural Engineering, 14(2):026015, 2017.

[139] RK Maddula, J Stivers, M Mousavi, S Ravindran, and VR de Sa. Deep recurrent convolutional neural networks for classifying P300 BCI signals. In Proceedings of the Graz BCI Conference, 2017.

[140] BO Mainsah, LM Collins, KA Colwell, EW Sellers, DB Ryan, K Caves, and CS Throckmorton. Increasing BCI communication rates with dynamic stopping towards more practical use: an ALS study. Journal of Neural Engineering, 12(1):016013, 2015.

[141] Joseph N Mak, Dennis J McFarland, Theresa M Vaughan, Lynn M McCane, Phillippa Z Tsui, Debra J Zeitlin, Eric W Sellers, and Jonathan R Wolpaw. EEG correlates of P300-based brain–computer interface (BCI) performance in people with amyotrophic lateral sclerosis. Journal of Neural Engineering, 9(2):026014, 2012.

[142] Joseph N Mak and Jonathan R Wolpaw. Clinical applications of brain-computer interfaces: current state and future prospects. IEEE Reviews in Biomedical Engineering, 2:187–199, 2009.

[143] Scott Makeig, Grace Leslie, Tim Mullen, Devpratim Sarma, Nima Bigdely-Shamlo, and Christian Kothe. First demonstration of a musical emotion BCI. In International Conference on Affective Computing and Intelligent Interaction, pages 487–496. Springer, 2011.

[144] David Marshall, Damien Coyle, Shane Wilson, and Michael Callaghan. Games, gameplay, and BCI: the state of the art. IEEE Transactions on Computational Intelligence and AI in Games, 5(2):82–99, 2013.

[145] Víctor Martínez-Cagigal, Javier Gomez-Pilar, Daniel Álvarez, and Roberto Hornero. An asynchronous P300-based brain-computer interface web browser for severely disabled people. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 25(8):1332–1342, 2017.

[146] Rytis Maskeliunas, Robertas Damasevicius, Ignas Martisius, and Mindaugas Vasiljevas. Consumer-grade EEG devices: are they usable for control tasks? PeerJ, 4:e1746, 2016.

[147] Raoof Masoomi and Ali Khadem. Enhancing LDA-based discrimination of left and right hand motor imagery: Outperforming the winner of BCI competition II. In 2015 2nd International Conference on Knowledge-Based Engineering and Innovation (KBEI), pages 392–398. IEEE, 2015.

[148] Genevieve McArthur, Timothy Budd, and Patricia Michie. The attentional blink and P300. Neuroreport, 10(17):3691–3695, 1999.

[149] Dennis J McFarland, William A Sarnacki, George Townsend, Theresa Vaughan, and Jonathan R Wolpaw. The P300-based brain–computer interface (BCI): effects of stimulus rate. Clinical Neurophysiology, 122(4):731–737, 2011.

[150] Chad A. Mello. Machine learning and training system for epileptiform oscillations in electroencephalograms. Master’s thesis, University of Colorado; Colorado Springs, 5 2014.

[151] Chad A. Mello, Rory Lewis, Amy Brooks-Kayal, Jessica Carlsen, Heidi Grabenstatter, and Andrew M. White. Supervised Learning for the Intensive Care Unit Using Single-Layer Perceptron Classifiers, pages 231–241. Springer International Publishing, Cham, 2014.

[152] J d R Millán, Rüdiger Rupp, Gernot R Müller-Putz, Roderick Murray-Smith, Claudio Giugliemma, Michael Tangermann, Carmen Vidaurre, Febo Cincotti, Andrea Kübler, Robert Leeb, et al. Combining brain-computer interfaces and assistive technologies: state-of-the-art and challenges. Frontiers in Neuroscience, 4, 2010.

[153] Jesus Minguillon, Miguel Angel Lopez-Gordo, Christian Morillas, and Francisco Pelayo. A mobile brain-computer interface for clinical applications: From the lab to the ubiquity. In International Work-Conference on the Interplay Between Natural and Artificial Computation, pages 68–76. Springer, 2017.

[154] H Mirghasemi, R Fazel-Rezai, and MB Shamsollahi. Analysis of P300 classifiers in brain computer interface speller. In Engineering in Medicine and Biology Society, 2006. EMBS’06. 28th Annual International Conference of the IEEE, pages 6205–6208. IEEE, 2006.

[155] J. Mladenović, J. Mattout, and F. Lotte. A generic framework for adaptive EEG-based BCI training and operation. ArXiv e-prints, July 2017.

[156] Christoph Molnar. Interpretable machine learning. Lulu.com, 2019.

[157] Masaki Nakanishi, Yu-Te Wang, Tzyy-Ping Jung, John K Zao, Yu-Yi Chien, Alberto Diniz-Filho, Fabio B Daga, Yuan-Pin Lin, Yijun Wang, and Felipe A Medeiros. Detecting glaucoma with a portable brain-computer interface for objective assessment of visual function loss. JAMA Ophthalmology, 135(6):550–557, 2017.

[158] C Neuper, G.R Müller, A Kübler, N Birbaumer, and G Pfurtscheller. Clinical application of an EEG-based brain–computer interface: a case study in a patient with severe motor impairment. Clinical Neurophysiology, 114(3):399–409, 2003.

[159] Luis Fernando Nicolas-Alonso and Jaime Gomez-Gil. Brain computer interfaces, a review. Sensors, 12(2):1211–1279, 2012.

[160] K. D. Nielsen, A. F. Cabrera, and Omar Feix do Nascimento. EEG based BCI-towards a better control. Brain-computer interface research at Aalborg University. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 14(2):202–204, June 2006.

[161] P.L. Nunez. Electric fields of the brain: the neurophysics of EEG. Oxford University Press, 1981.

[162] Damir Nurseitov, Abzal Serekov, Almas Shintemirov, and Berdakh Abibullaev. Design and evaluation of a P300-ERP based BCI system for real-time control of a mobile robot. In 2017 5th International Winter Conference on Brain-Computer Interface (BCI), pages 115–120. IEEE, 2017.

[163] US National Library of Medicine. Electromyography mesh descriptor data 2017. https://meshb.nlm.nih.gov/record/ui?name=Electromyography, 08 2017.

[164] Ozan Özdenizci, Ye Wang, Toshiaki Koike-Akino, and Deniz Erdoğmuş. Transfer learning in brain-computer interfaces with adversarial variational autoencoders. In 2019 9th International IEEE/EMBS Conference on Neural Engineering (NER), pages 207–210. IEEE, 2019.

[165] Ozan Özdenizci, Ye Wang, Toshiaki Koike-Akino, and Deniz Erdoğmuş. Learning invariant representations from EEG via adversarial inference. IEEE Access, 8:27074–27085, 2020.

[166] Mayur Palankar, Kathryn J De Laurentis, Redwan Alqasemi, Eduardo Veras, Rajiv Dubey, Yael Arbel, and Emanuel Donchin. Control of a 9-dof wheelchair-mounted robotic arm system using a P300 brain computer interface: Initial experiments. In Robotics and Biomimetics, 2008. ROBIO 2008. IEEE International Conference, pages 348–353. IEEE, 2009.

[167] Jiahui Pan, Qiuyou Xie, Pengmin Qin, Yan Chen, Yanbin He, Haiyun Huang, Fei Wang, Xiaoxiao Ni, Andrzej Cichocki, Ronghao Yu, et al. Prognosis for patients with cognitive motor dissociation identified by brain-computer interface. Brain, 143(4):1177–1189, 2020.

[168] Vasilios Papaioannou, Christos Dragoumanis, and Ioannis Pneumatikos. Biosignal analysis techniques for weaning outcome assessment. Journal of Critical Care, 25(1):39–46, 2010.

[169] Suneth Pathirana, David Asirvatham, and Gapar Johar. A critical evaluation on low-cost consumer-grade electroencephalographic devices. In 2018 2nd International Conference on BioSignal Analysis, Processing and Systems (ICBAPS), pages 160–165. IEEE, 2018.

[170] Novi Patricia and Barbara Caputo. Learning to learn, from transfer learning to domain adaptation: A unifying perspective. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1442–1449, 2014.

[171] Terence W Picton. The P300 wave of the human event-related potential. Journal of Clinical Neurophysiology, 9(4):456–479, 1992.

[172] M Pierantozzi, MG Palmieri, P Mazzone, MG Marciani, PM Rossini, A Stefani, P Giacomini, A Peppe, and P Stanzione. Deep brain stimulation of both subthalamic nucleus and internal globus pallidus restores intracortical inhibition in Parkinson's disease paralleling apomorphine effects: a paired magnetic stimulation study. Clinical Neurophysiology, 113(1):108–113, 2002.

[173] Andreas Pinegger, Hannah Hiebel, Selina C Wriessnegger, and Gernot R Müller-Putz. Composing only by thought: Novel application of the P300 brain-computer interface. PloS One, 12(9):e0181584, 2017.

[174] John Polich. Updating P300: an integrative theory of P3a and P3b. Clinical Neurophysiology, 118(10):2128–2148, 2007.

[175] John Polich and Jody Corey-Bloom. Alzheimer's disease and P300: review and evaluation of task and modality. Current Alzheimer Research, 2(5):515–525, 2005.

[176] John Polich, Christine Ladish, and Floyd E Bloom. P300 assessment of early Alzheimer's disease. Electroencephalography and Clinical Neurophysiology/Evoked Potentials Section, 77(3):179–189, 1990.

[177] Mirjana Prpa and Philippe Pasquier. Brain-computer interfaces in contemporary art: A state of the art and taxonomy. In Brain Art, pages 65–115. Springer, 2019.

[178] Pietari Pulkkinen and Hannu Koivisto. Fuzzy classifier identification using decision tree and multiobjective evolutionary algorithms. International Journal of Approximate Reasoning, 48(2):526–543, 2008.

[179] SA Ramapure, PV Jadhav, and SV Kulkarni. Technique for detecting, measuring & comparing bio signals. International Journal of Computational Intelligence Research, 13(5):917–921, 2017.

[180] Gary E Raney. Monitoring changes in cognitive load during reading: an event-related brain potential and reaction time analysis. Journal of Experimental Psychology: Learning, Memory, and Cognition, 19(1):51, 1993.

[181] Mamunur Rashid, Norizam Sulaiman, Mahfuzah Mustafa, Sabira Khatun, Bifta Sama Bari, and Md Jahid Hasan. Recent trends and open challenges in EEG based brain-computer interface systems. In Ahmad Nor Kasruddin Nasir, Mohd Ashraf Ahmad, Muhammad Sharfi Najib, Yasmin Abdul Wahab, Nur Aqilah Othman, Nor Maniha Abd Ghani, Addie Irawan, Sabira Khatun, Raja Mohd Taufika Raja Ismail, Mohd Mawardi Saari, Mohd Razali Daud, and Ahmad Afif Mohd Faudzi, editors, InECCE2019, pages 367–378, Singapore, 2020. Springer Singapore.

[182] Daran Ravden and John Polich. Habituation of P300 from visual stimuli. International Journal of Psychophysiology, 30(3):359–365, 1998.

[183] Brice Rebsamen, Etienne Burdet, Cuntai Guan, Chee Leong Teo, Qiang Zeng, Marcelo Ang, and Christian Laugier. Controlling a wheelchair using a BCI with low information transfer rate. In Rehabilitation Robotics, 2007. ICORR 2007. IEEE 10th International Conference, pages 1003–1008. IEEE, 2007.

[184] Ericka Janet Rechy-Ramirez and Huosheng Hu. Bio-signal based control in assistive robots: a survey. Digital Communications and Networks, 1(2):85–101, 2015.

[185] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why should I trust you?" Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144, 2016.

[186] Greg Ridgeway, David Madigan, Thomas Richardson, and John O'Kane. Interpretable boosted naïve Bayes classification. In KDD, pages 101–104, 1998.

[187] Gordon Robertson, Graham Caldwell, Joseph Hamill, Gary Kamen, and Saunders Whittlesey. Research methods in biomechanics, 2E. Human Kinetics, 2013.

[188] Pedro Luiz Coelho Rodrigues. Exploring invariances of multivariate time series via Riemannian geometry: validation on EEG data. PhD thesis, 2019.

[189] Patrick J Rousche and Richard A Normann. Chronic recording capability of the Utah intracortical electrode array in cat sensory cortex. Journal of Neuroscience Methods, 82(1):1–15, 1998.

[190] Yannick Roy, Hubert Banville, Isabela Albuquerque, Alexandre Gramfort, Tiago H Falk, and Jocelyn Faubert. Deep learning-based electroencephalography analysis: a systematic review. Journal of Neural Engineering, 16(5):051001, 2019.

[191] Cynthia Rudin. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, 1(5):206–215, 2019.

[192] Stefan Rüping. Learning interpretable models. PhD thesis, 2006.

[193] Tomasz M Rutkowski and Hiromu Mori. Tactile and bone-conduction auditory brain computer interface for vision and hearing impaired users. Journal of Neuroscience Methods, 244:45–51, 2015.

[194] DB Ryan, G Townsend, NA Gates, K Colwell, and EW Sellers. Evaluating brain-computer interface performance using color in the P300 checkerboard speller. Clinical Neurophysiology, 128(10):2050–2057, 2017.

[195] Mathew Salvaris and Francisco Sepulveda. Visual modifications on the P300 speller BCI paradigm. Journal of Neural Engineering, 6(4):046011, 2009.

[196] Mohammad Reza Haji Samadi and Neil Cooke. VOG-enhanced ICA for SSVEP response detection from consumer-grade EEG. In 2014 22nd European Signal Processing Conference (EUSIPCO), pages 2025–2029. IEEE, 2014.

[197] Saeid Sanei and Jonathon Chambers. EEG Signal Processing. Wiley, 2009.

[198] Haline E Schendan, Nancy G Kanwisher, and Marta Kutas. Early brain potentials link repetition blindness, priming and novelty detection. Neuroreport, 8(8):1943–1948, 1997.

[199] Christos N Schizas and Constantinos S Pattichis. Learning systems in biosignal analysis. Biosystems, 41(2):105–125, 1997.

[200] Holger Schultheis and Anthony Jameson. Assessing cognitive load in adaptive hypermedia systems: Physiological and behavioral methods. In Adaptive Hypermedia and Adaptive Web-Based Systems, pages 225–234. Springer, 2004.

[201] Hilit Serby, Elad Yom-Tov, and Gideon F Inbar. An improved P300-based brain-computer interface. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 13(1):89–98, 2005.

[202] G. M. Shepherd. The Synaptic Organization of the Brain. Oxford University Press, London, 1974.

[203] Jerry J Shih, Dean J Krusienski, and Jonathan R Wolpaw. Brain-computer interfaces in medicine. In Mayo Clinic Proceedings, volume 87, pages 268–279. Elsevier, 2012.

[204] SH Shobeihi, AM Nasrabadi, and MH Moradi. Classification of individuals at risk for alcoholism using non-matching ERPs based on wavelet statistic features and artificial neural network. In Cairo International Biomedical Engineering Conference, 2006.

[205] Ali Shoeb, Herman Edwards, Jack Connolly, Blaise Bourgeois, S. Ted Treves, and John Guttag. Patient-specific seizure onset detection. Epilepsy and Behavior, 5(4):483–498, 2004.

[206] Steven W. Smith. The Scientist and Engineer's Guide to Digital Signal Processing. California Technical Publishing, San Diego, CA, USA, 1997.

[207] Mohsen Soltanpour, Karim Faez, Saeed Sharifian, and Vahid Pourahmadi. Enhance evoked potentials detection using RBF neural networks: Application to brain-computer interface. In Signal Processing and Intelligent Systems (ICSPIS), International Conference of, pages 1–6. IEEE, 2016.

[208] Leif Sörnmo and Pablo Laguna. Bioelectrical Signal Processing in Cardiac and Neurological Applications, volume 8. Academic Press, 2005.

[209] Rossella Spataro, Antonio Chella, Brendan Allison, Marcello Giardina, Rosario Sorbello, Salvatore Tramonte, Christoph Guger, and Vincenzo La Bella. Reaching and grasping a glass of water by locked-in ALS patients through a BCI-controlled humanoid robot. Frontiers in Human Neuroscience, 11, 2017.

[210] Erwin-Josef Speckmann, Christian E. Elger, and Ali Gorji. Neurophysiologic basis of EEG and DC potentials. Electroencephalography: Basic Principles, Clinical Applications, and Related Fields, 1999.

[211] William Speier, Corey Arnold, Jessica Lu, Aniket Deshpande, and Nader Pouratian. Integrating language information with a hidden Markov model to improve communication rate in the P300 speller. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 22(3):678–684, 2014.

[212] Susan S Spencer, Dennis D Spencer, Peter D Williamson, and Richard Mattson. Combined depth and subdural electrode investigation in uncontrolled epilepsy. Neurology, 40(1):74–74, 1990.

[213] Erik Štrumbelj and Igor Kononenko. Explaining prediction models and individual predictions with feature contributions. Knowledge and Information Systems, 41(3):647–665, 2014.

[214] Pascal Sturmfels, Scott Lundberg, and Su-In Lee. Visualizing the impact of feature attribution baselines. Distill, 5(1):e22, 2020.

[215] Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. arXiv preprint arXiv:1703.01365, 2017.

[216] Shravani Sur and VK Sinha. Event-related potential: An overview. Industrial Psychiatry Journal, 18(1):70, 2009.

[217] Chuanqi Tan, Fuchun Sun, Tao Kong, Bin Fang, and Wenchang Zhang. Attention-based transfer learning for brain-computer interface. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1154–1158. IEEE, 2019.

[218] Chuanqi Tan, Fuchun Sun, Tao Kong, Wenchang Zhang, Chao Yang, and Chunfang Liu. A survey on deep transfer learning. In International Conference on Artificial Neural Networks, pages 270–279. Springer, 2018.

[219] Ioan Țarcă, Radu Țarcă, and Tiberiu Vesselenyi. Application of BCI technologies.

[220] Michal Teplan. Fundamentals of EEG measurement. Measurement Science Review, 2(2):1–11, 2002.

[221] Sergios Theodoridis and Konstantinos Koutroumbas. Chapter 3 - linear classifiers. In Sergios Theodoridis and Konstantinos Koutroumbas, editors, Pattern Recognition (Fourth Edition), pages 91–150. Academic Press, Boston, fourth edition, 2009.

[222] Erico Tjoa and Cuntai Guan. A survey on explainable artificial intelligence (XAI): towards medical XAI. arXiv preprint arXiv:1907.07374, 2019.

[223] Ryota Tomioka, Kazuyuki Aihara, and Klaus-Robert Müller. Logistic regression for single trial EEG classification. In Advances in Neural Information Processing Systems, pages 1377–1384, 2007.

[224] Shanbao Tong and Nitish V. Thakor, editors. Quantitative EEG Analysis Methods and Applications. Artech House, 2009.

[225] G Townsend and V Platsko. Pushing the P300-based brain–computer interface beyond 100 bpm: extending performance guided constraints into the temporal domain. Journal of Neural Engineering, 13(2):026024, 2016.

[226] George Townsend, BK LaPallo, CB Boulay, DJ Krusienski, GE Frye, CK Hauser, NE Schwartz, TM Vaughan, Jonathan R Wolpaw, and EW Sellers. A novel P300-based brain–computer interface stimulus presentation paradigm: moving beyond rows and columns. Clinical Neurophysiology, 121(7):1109–1120, 2010.

[227] Mario Tudor, Lorainne Tudor, and Katarina Ivana Tudor. Hans Berger (1873–1941): the history of electroencephalography, 2005.

[228] István Ulbert, Eric Halgren, Gary Heit, and George Karmos. Multiple microelectrode-recording system for human intracortical applications. Journal of Neuroscience Methods, 106(1):69–79, 2001.

[229] Axel Uran, Coert van Gemeren, Rosanne van Diepen, Ricardo Chavarriaga, and José del R Millán. Applying transfer learning to deep learned models for EEG analysis. arXiv preprint arXiv:1907.01332, 2019.

[230] Ronald S Valle and John M Levine. Expectation effects in alpha wave control. Psychophysiology, 12(3):306–309, 1975.

[231] Marjolein van der Waal, Marianne Severens, Jeroen Geuze, and Peter Desain. Introducing the tactile speller: an ERP-based brain–computer interface for communication. Journal of Neural Engineering, 9(4):045002, 2012.

[232] Marijn van Vliet, Arne Robben, Nikolay Chumerin, Nikolay V Manyakov, Adrien Combaz, and Marc M Van Hulle. Designing a brain-computer interface controlled video-game using consumer grade EEG hardware. In 2012 ISSNIP Biosignals and Biorobotics Conference: Biosignals and Robotics for Better and Safer Living (BRC), pages 1–6. IEEE, 2012.

[233] Lukáš Vařeka and Pavel Mautner. Stacked autoencoders for the P300 component detection. Frontiers in Neuroscience, 11, 2017.

[234] Hernán Darío Vargas Cardona, Mauricio A. Álvarez, and Álvaro A. Orozco. Multi-task learning for subthalamic nucleus identification in deep brain stimulation. International Journal of Machine Learning and Cybernetics, Feb 2017.

[235] Alfredo Vellido. The importance of interpretability and visualization in machine learning for applications in medicine and health care. Neural Computing and Applications, pages 1–15, 2019.

[236] Hannes Verschore, Pieter-Jan Kindermans, David Verstraeten, and Benjamin Schrauwen. Dynamic stopping improves the speed and accuracy of a P300 speller. In International Conference on Artificial Neural Networks, pages 661–668. Springer, 2012.

[237] François-Benoît Vialatte, Monique Maurice, Justin Dauwels, and Andrzej Cichocki. Steady-state visually evoked potentials: focus on essential paradigms and future perspectives. Progress in Neurobiology, 90(4):418–438, 2010.

[238] Jacques J Vidal. Toward direct brain-computer communication. Annual Review of Biophysics and Bioengineering, 2(1):157–180, 1973.

[239] A Vo, Diep Nguyen, Thuy Pham, Kha Ha, and Eryk Dutkiewicz. Subject-independent ERP-based brain-computer interfaces. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 2018.

[240] Peitao Wang, Jun Lu, Bin Zhang, and Zeng Tang. A review on transfer learning for brain-computer interface classification. In 2015 5th International Conference on Information Science and Technology (ICIST), pages 315–322. IEEE, 2015.

[241] Wikipedia. Biofeedback — wikipedia, the free encyclopedia, 2017. [Online; accessed 14-August-2017].

[242] Wikipedia. Electrocardiography — wikipedia, the free encyclopedia, 2017. [Online; accessed 14-August-2017].

[243] Wikipedia. Electromyography — wikipedia, the free encyclopedia, 2017. [Online; accessed 14-August-2017].

[244] Eyal Winter et al. The Shapley value. Handbook of Game Theory with Economic Applications, 3(2):2025–2054, 2002.

[245] Jonathan Wolpaw and Elizabeth Winter Wolpaw. Brain-computer interfaces: principles and practice. OUP USA, 2012.

[246] Jonathan R Wolpaw, Niels Birbaumer, William J Heetderks, Dennis J McFarland, P Hunter Peckham, Gerwin Schalk, Emanuel Donchin, Louis A Quatrano, Charles J Robinson, and Theresa M Vaughan. Brain-computer interface technology: a review of the first international meeting. IEEE Transactions on Rehabilitation Engineering, 8(2):164–173, 2000.

[247] Thakerng Wongsirichot and Anantaporn Hanskunatai. A classification of sleep disorders with optimal features using machine learning techniques. Journal of Health Research, 31(3):209–217, 2017.

[248] Allen R Wyler, George A Ojemann, Ettore Lettich, and Arthur A Ward. Subdural strip electrodes for localizing epileptogenic foci. Journal of Neurosurgery, 60(6):1195–1200, 1984.

[249] Minpeng Xu, Hongzhi Qi, Baikun Wan, Tao Yin, Zhipeng Liu, and Dong Ming. A hybrid BCI speller paradigm combining P300 potential and the SSVEP blocking feature. Journal of Neural Engineering, 10(2):026001, 2013.

[250] Ali Yadollahpour and Abdolhossein Bagdeli. Brain computer interface: Principles, recent advances and clinical challenges. 2014.

[251] Florian Yger, Maxime Berar, and Fabien Lotte. Riemannian approaches in brain-computer interfaces: a review. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 25(10):1753–1762, 2016.

[252] Erwei Yin, Timothy Zeyl, Rami Saab, Tom Chau, Dewen Hu, and Zongtan Zhou. A hybrid brain–computer interface based on the fusion of P300 and SSVEP scores. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 23(4):693–701, 2015.

[253] Erwei Yin, Timothy Zeyl, Rami Saab, Dewen Hu, Zongtan Zhou, and Tom Chau. An auditory-tactile visual saccade-independent P300 brain–computer interface. International Journal of Neural Systems, 26(01):1650001, 2016.

[254] Erwei Yin, Zongtan Zhou, Jun Jiang, Fanglin Chen, Yadong Liu, and Dewen Hu. A novel hybrid BCI speller based on the incorporation of SSVEP into the P300 paradigm. Journal of Neural Engineering, 10(2):026012, 2013.

[255] Han Yuan and Bin He. Brain–computer interfaces using sensorimotor rhythms: current state and future perspectives. IEEE Transactions on Biomedical Engineering, 61(5):1425–1435, 2014.

[256] Raheel Zafar, Sarat C Dass, and Aamir Saeed Malik. Electroencephalogram-based decoding cognitive states using convolutional neural network and likelihood ratio based score fusion. PloS One, 12(5):e0178410, 2017.

[257] Thorsten Oliver Zander, Moritz Lehne, Klas Ihme, Sabine Jatzev, Joao Correia, Christian Kothe, Bernd Picht, and Femke Nijboer. A dry EEG-system for scientific research and brain–computer interfaces. Frontiers in Neuroscience, 5:53, 2011.

[258] Paolo Zanini, Marco Congedo, Christian Jutten, Salem Said, and Yannick Berthoumieu. Transfer learning: A Riemannian geometry framework with applications to brain-computer interfaces. IEEE Transactions on Biomedical Engineering, 65(5):1107–1116, 2017.

[259] Rui Zhang, Yuanqing Li, Yongyong Yan, Hao Zhang, Shaoyu Wu, Tianyou Yu, and Zhenghui Gu. Control of a wheelchair in an indoor environment based on a brain–computer interface and automated navigation. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 24(1):128–139, 2016.

[260] Rui Zhang, Qihong Wang, Kai Li, Shenghong He, Si Qin, Zhenghui Feng, Yang Chen, Pingxia Song, Tingyan Yang, Yuandong Zhang, et al. A BCI-based environmental control system for patients with severe spinal cord injuries. IEEE Transactions on Biomedical Engineering, 2017.

[261] Tongda Zhang, Renate Fruchter, and Maria Frank. Are they paying attention? a model-based method to identify individuals’ mental states. Computer, 50(3):40–49, 2017.

[262] Xiang Zhang, Lina Yao, Xianzhi Wang, Jessica Monaghan, David Mcalpine, and Yu Zhang. A survey on deep learning based brain computer interface: Recent advances and new frontiers. arXiv preprint arXiv:1905.04149, 2019.

[263] Yujia Zhang, Kuangyan Song, Yiming Sun, Sarah Tan, and Madeleine Udell. "Why should you trust my explanation?" Understanding uncertainty in LIME explanations. arXiv preprint arXiv:1904.12991, 2019.

[264] Z. Zhang and X. Li. Use transfer learning to promote identification ADHD children with EEG recordings. In 2019 Chinese Automation Congress (CAC), pages 2809–2813, 2019.

[265] Shang-Ming Zhou and John Q Gan. Low-level interpretability and high-level interpretability: a unified view of data-driven interpretable fuzzy system modeling. Fuzzy Sets and Systems, 159(23):3091–3131, 2008.

Appendix A

ADDITIONAL EXPERIMENTAL PERSPECTIVES

There are many ways one could examine and compare the results of our experiments; therefore, we include additional graphs for examination. These graphs contrast performance for all models by themselves and give you an opportunity to examine the stability of each model across training set size and dimensionality. We also provide results where the models performed poorly for a given experiment. In addition, we compare some of the traditional classifiers with our transfer learning models to provide contrast across groups of experiments.

A.1 Traditional Classifier Experiments

Here we show that the amount of data the traditional classifiers were trained on mattered little with respect to performance; dimensionality had a greater effect on overall classifier performance. The related ROC curves set up a baseline comparison that helps to demonstrate that our transfer learning models are capable of performing better than all of these traditional classifiers on the same tasks. See the figures below, which include experimental results across training set proportions as well as dimensionality for LR, LDA, RMDM, FFNN, CNN, GB, and RF.
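To make the evaluation recipe concrete, here is a minimal sketch of how such stability curves can be produced with scikit-learn, one of the packages listed in Appendix D. The arrays `X` and `y` are placeholders for real epoch features and binary labels, and logistic regression stands in for any of the traditional classifiers; this illustrates the evaluation loop under those assumptions, not our exact code.

```python
# Sketch: ROC/AUC stability of one traditional classifier across
# training-set proportions. X and y are placeholder data.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.randn(1000, 64)            # trials x feature dimensions
y = rng.randint(0, 2, size=1000)   # binary match/no-match labels

plt.figure()
for train_prop in (0.2, 0.4, 0.6, 0.8):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, train_size=train_prop, random_state=0, stratify=y)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    scores = clf.predict_proba(X_te)[:, 1]     # P(class = 1) per trial
    fpr, tpr, _ = roc_curve(y_te, scores)
    plt.plot(fpr, tpr, label=f"train={train_prop:.0%}, AUC={auc(fpr, tpr):.2f}")
plt.plot([0, 1], [0, 1], "k--")                # chance diagonal
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```

Repeating this loop for each classifier, and again while varying channel count rather than training proportion, would yield stability plots analogous to Figures A.1 through A.7.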

Figure A.1: Logistic Regression ROC curves and AUC for match/no-match experimentation: left assesses stability across training set size, middle assesses stability across dimensionality, and right assesses HR/LR task where the algorithm performed at its worst.

Figure A.2: Linear Discriminant Analysis ROC curves and AUC for match/no-match experimentation: left assesses stability across training set size, middle assesses stability across dimensionality, and right assesses HR/LR task where the algorithm performed at its worst.

Figure A.3: Riemannian Minimum Distance to Mean ROC curves and AUC for HR/LR experimentation: left assesses stability across training set size, middle assesses stability across dimensionality, and right assesses match/no-match task where the algorithm performed at its worst.

Figure A.4: Feed-forward Neural Network ROC curves and AUC for match/no-match experimentation: left assesses stability across training set size, middle assesses stability across dimensionality, and right assesses HR/LR task where the algorithm performed at its worst.

Figure A.5: Convolutional Neural Network ROC curves and AUC for match/no-match experimentation: left assesses stability across training set size, middle assesses stability across dimensionality, and right assesses HR/LR task where the algorithm performed at its worst.

Figure A.6: Gradient Boosting ROC curves and AUC for match/no-match experimentation: left assesses stability across training set size, middle assesses stability across dimensionality, and right assesses HR/LR task where the algorithm performed at its worst.

Figure A.7: Random Forest ROC curves and AUC for match/no-match experimentation: left assesses stability across training set size, middle assesses stability across dimensionality, and right assesses HR/LR task where the algorithm performed at its worst.

A.2 Traditional Vs. Transfer Experiments

For direct comparisons of traditional BCI classifiers with our transfer models we present the following results. In each graph there are two solid ROC curves: one represents the best transfer performer, while the other represents the best traditional performer. The dotted curves represent other models from the traditional experiments, provided for additional contrast. When examining these curves, notice how the performance progresses (or regresses) as dimensionality is increased.
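The overlay convention used in Figures A.8 through A.13 can be expressed as a small plotting helper. This is only a sketch: `plot_comparison` and its `(label, fpr, tpr)` tuple inputs are hypothetical names, with curve data assumed to come from `roc_curve` as in the previous sketch.

```python
# Sketch of the overlay convention: solid curves for the two best
# performers, dotted curves for the contrast models.
import matplotlib.pyplot as plt

def plot_comparison(best_transfer, best_traditional, contrast):
    """Each argument holds (label, fpr, tpr); contrast is a list of them."""
    plt.figure()
    for label, fpr, tpr in contrast:
        plt.plot(fpr, tpr, linestyle=":", alpha=0.7, label=label)  # dotted
    label, fpr, tpr = best_transfer
    plt.plot(fpr, tpr, color="red", label=label)                   # solid
    label, fpr, tpr = best_traditional
    plt.plot(fpr, tpr, color="cyan", label=label)                  # solid
    plt.xlabel("False positive rate")
    plt.ylabel("True positive rate")
    plt.legend()
    plt.show()
```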

Figure A.8: Comparison of best performing traditional BCI classification with best transfer learning model for HR detection on 8 channels across various training proportions. The solid red is the Siamese model with no weight freezing compared to the solid cyan representing the Riemannian Minimum Distance to Mean. CNN and FFNN were provided for additional contrast.

Figure A.9: Comparison of best performing traditional BCI classification with best transfer learning model for HR detection on 12 channels across various training proportions. The solid red is the Siamese model with no weight freezing compared to the solid cyan representing the Riemannian Minimum Distance to Mean. CNN and FFNN were provided for additional contrast. Notice that RMDM shows slight improvement with the increase in channels.

Figure A.10: Comparison of best performing traditional BCI classification with best transfer learning model for HR detection on 60 channels across various training proportions. The solid red is the Siamese model with no weight freezing compared to the solid cyan representing the Riemannian Minimum Distance to Mean. CNN and FFNN were provided for additional contrast. Here you will notice how RMDM becomes unstable and degrades in performance when high dimensionality is introduced.

Figure A.11: Comparison of best performing traditional BCI classification with best transfer learning model for match detection on 8 channels across various training proportions. The solid red is the binary model with weight freezing compared to the solid blue representing our FFNN model. LDA and RF were provided for additional contrast.

Figure A.12: Comparison of best performing traditional BCI classification with best transfer learning model for match detection on 12 channels across various training proportions. The solid red is the binary model with weight freezing compared to the solid blue representing our FFNN model. LDA and RF were provided for additional contrast.

Figure A.13: Comparison of best performing traditional BCI classification with best transfer learning model for match detection on 60 channels across various training proportions. The solid red is the binary model with weight freezing compared to the solid blue representing our FFNN model. LDA and RF were provided for additional contrast.

Appendix B

ABOUT ELECTROENCEPHALOGRAMS

Electroencephalograms (EEGs) are biosignals (see Appendix C), and they represent the signatures of neurological activity in the brain [197]. These signals are captured in a variety of ways using multiple electrodes placed either on the outside of the brain or on the inside. Signals can be collected non-invasively using a cap that is placed over the outside of the scalp, semi-invasively through subdural matrices placed on the surface of the brain, or intracortically by placing electrodes directly into the cortex.

Figure B.1: Typical EEG recording of a human brain.

Within the central nervous system (CNS) are billions of neurons, and each produces a potential. Contained within the membrane of a neuron's cell body is a potential with negative polarity, ranging from 60 to 70 mV [210]. This is known as the cell's resting potential. This potential fluctuates with changes in synaptic activity. There are two states of synaptic function: excitatory and inhibitory.

Figure B.2: Changing the membrane potential for a giant squid by closing the Na gates and opening the K gates. (Sanei and Chambers [197])

A synapse in an excitatory state will trigger an excitatory post-synaptic potential (EPSP) in the receiving neuron on the other side of the gap. The EPSP depolarizes the cell, causing it to send an action potential (AP) along its axon. Conversely, an inhibitory synaptic state will trigger an inhibitory post-synaptic potential (IPSP), causing hyper-polarization in the receiving cell, and the cell will not send an AP along its axon. If several APs arrive closely together, they can also amount to a summation of EPSPs, provided a specific threshold membrane potential is met [210]. Depending on the synaptic state of IPSP or EPSP, the receiving neuron will either fire an AP and reset to its resting potential, or remain in a hyperpolarized state until its threshold is met, triggering an EPSP [17] [202].

Once a neuron is stimulated through an EPSP, an AP is produced by the cell. An ionic exchange occurs between sodium and potassium ions that ultimately triggers a change in potential along the cell membrane. During this process, certain sodium gates are opened, creating an initial spike caused by depolarization, whereupon the sodium gates close and the potassium gates open. Shortly thereafter, the cell is temporarily hyperpolarized before returning to its resting potential [220] [40]. This process is illustrated for a giant squid in Figure B.2.

Figure B.3: (a) The human brain. (b) Section of cerebral cortex showing microcurrent sources due to synaptic and action potentials. (c) A 4-second epoch of alpha rhythm and power spectrum are shown. (Artech House [224])

It is not practical to detect and monitor the activity within a single neuron in a live human system due to the number of individual neurons present in the system. This cellular activity, when observed on a macro scale, is the summation of synchronized activity occurring in potentially hundreds of thousands of localized neurons [161] [210]. Cells in the CNS typically form synaptically connected groups that are responsible for performing various tasks, and tend to function rhythmically together. This is the information we are most interested in recording and analyzing.

Neuronal groups may function at differing rates, depending on the speed of electrical interactions between the cells involved. These groups tend to form oscillatory patterns at consistent intervals. The amplitudes of electrical impulses in these neuronal groups are picked up and recorded over time using electrodes connected to advanced electronic equipment. Such a recording produces an electroencephalogram (EEG) (see Figure B.1). An EEG is a recording of the oscillatory activity of electrical impulses generated by the brain and relayed through electrodes [224] [197]. Refer to Figure B.3 for illustration. These oscillations are better known as brain rhythms, or brain waves. This dissertation refers to EEG data as potentially coming from external scalp potentials, intracortical recordings, or subdural recordings.
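To make the notion of a brain rhythm concrete, the sketch below band-pass filters a multichannel EEG recording into the classic alpha band (roughly 8–12 Hz) using MNE, one of the packages listed in Appendix D. MNE's bundled sample recording is used here purely as a stand-in for our own data.

```python
# Minimal sketch: isolating the alpha rhythm (~8-12 Hz) from a multichannel
# EEG recording with MNE. The bundled sample dataset is a stand-in; any
# Raw object would be processed the same way.
import mne

data_path = mne.datasets.sample.data_path()  # downloads MNE's sample data
raw = mne.io.read_raw_fif(
    f"{data_path}/MEG/sample/sample_audvis_raw.fif", preload=True)
raw.pick_types(meg=False, eeg=True)                  # keep only EEG electrodes
alpha = raw.copy().filter(l_freq=8.0, h_freq=12.0)   # band-pass to alpha
alpha.plot(duration=4.0, n_channels=10)              # a 4-second epoch, cf. Figure B.3
```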

B.1 Quantitative Electroencephalography (qEEG)

qEEG is simply a resultant pattern from the analysis or preprocessing of the raw EEG data. In fact, much of our machine learning research will utilize qEEG data: features we extract from the raw EEG data. For the purpose of exploring machine learning with Brain-Computer Interfacing (BCI), we will gather and process qEEG data. BCI is a direct, online channel between the brain and an interface that makes it possible to use "neural prosthesis and human augmentation" [224]. We can use BCI in many applications where humans may interface with machines solely through thought. This is especially important when discussing the potential of improving the lives of people with disabilities; however, it does not preclude use by healthy people for the purposes of entertainment and convenience. Some of these applications include robotic & mobility interfacing, synthetic speech interfacing, and game interaction. According to recent research performed by the author, there is reason to believe that qEEG is a viable way to apply machine learning towards quantifying attention state and mindfulness as well.

While BCI systems can be based on other methods, we will focus on BCI with EEG interfacing. qEEG-based BCI is now entering an advanced stage in its adaptation and development. There are two main types of BCI systems based on oscillatory EEG in use today: (1) steady-state visual-evoked responses from the visual cortex (SSVERs) and (2) the sensorimotor rhythm (SMR) from the sensorimotor cortex. Codes for both systems are hidden either in amplitude data or in a signal's frequency [224] [197]. BCI interfacing works by interpreting brain wave patterns and turning them into actionable commands. In some instances, the user (subject) of the system must learn to hone the modulation of brain wave patterns so that they are made prominent in the signal, thus allowing for easy (or easier) decoding by the BCI system. Oftentimes feedback is provided to improve this process for the user.
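As a concrete, hedged illustration of turning raw EEG into qEEG features, the sketch below computes average band power per channel using Welch's method from SciPy. The sampling rate, band edges, and placeholder `epoch` array are illustrative assumptions rather than our actual pipeline.

```python
# Minimal sketch of qEEG feature extraction: average band power per channel
# via Welch's method. `epoch` is a placeholder (channels x samples) array.
import numpy as np
from scipy.signal import welch

fs = 250.0                                   # assumed sampling rate (Hz)
epoch = np.random.randn(8, int(fs * 2))      # placeholder 2-second, 8-channel epoch

bands = {"delta": (1, 4), "theta": (4, 8), "alpha": (8, 12), "beta": (12, 30)}
freqs, psd = welch(epoch, fs=fs, nperseg=int(fs))  # PSD per channel

features = {
    name: psd[:, (freqs >= lo) & (freqs < hi)].mean(axis=1)
    for name, (lo, hi) in bands.items()
}  # one band-power value per channel per band: a simple qEEG feature set
```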

Appendix C

BIOSIGNALS

C.1 Biosignals as Bodies of Continuous Data

Biological systems are constantly producing and providing data related to simply existing and functioning. While research and application involving electrical biosignals spans decades [103], recent technological advances and the availability of cheap sensor technology in the private sector are driving increased interest in furthering the collection and application of high definition data across multiple sectors in industry and research [16,65,101]. In addition to traditional equipment used in medical and research settings, there are currently many practical products available in the marketplace that allow for continuous, noninvasive acquisition and processing of biosignals [72,94,103,261].

The data contained in biosignals are important, not only for (classical) medical diagnosis and treatment, but also for future applications such as a daily driver monitor, assessing a person's cognitive or emotional state, evaluating a person's current health status and well-being, or assisting the impaired with walking and talking [129,152,158,160]. The techniques employed on this data that go beyond standard signal processing are many, and related research in this space is vast and vibrant. Nevertheless, machine learning algorithms are applied to biosignals with varying degrees of success (i.e., performance), depending on the problem domain, quality of research, and quality and quantity of available data. Furthermore, it is unclear which learning techniques may offer the best performance under a given context or circumstance; in addition, labeling some data is further complicated by clinical subjectivity and disagreement [28,151,168,199].

Indeed, it is clear that computing systems for biosignals can be used to recognize many conditions of interest, both in humans and animals. It is of considerable interest how reliable these systems can become, and to what degree they may be trusted to perform under critical conditions such as clinical settings or home patient care [61,72,179]. There are studies which show promising results, with much of the focus on templated learning; that is, if you train on and classify information from one or more subjects, then good results should be possible for other subjects that were not part of the original training set(s). This approach may work for many problem domains; however, other problems may require training on a specific subject's data alone in order to produce good results [35,123,234]. Of course, training on a specific subject requires that we collect enough data from that subject to produce a good training set [150]. For example, to individually train a machine for a patient with epilepsy to identify pathological events, we can usually gather enough data over a course of days to record and label the events we are interested in. This is usually performed in a controlled medical setting [151].

This ideal setting is not always possible. For example, training on a patient in the ER with sudden traumatic brain injury may not be feasible for a number of reasons: (1) we cannot know the subject's particular patterns before the incident, (2) the environment can be chaotic due to personnel in the area, and (3) the focus is on saving a patient's life, not training a machine. In that case, if we were to create and provide a decision support system (DSS) for emergency personnel in the ER, it would have to consist of a system that was pre-trained on numerous patients; it would require a trained template.
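The templated approach described above is commonly evaluated with leave-one-subject-out cross-validation: train on every subject but one, and test on the held-out subject. Below is a minimal sketch using scikit-learn, where `X`, `y`, and `subject_ids` are placeholder arrays standing in for real per-trial features, labels, and subject assignments.

```python
# Sketch of "templated" evaluation: leave-one-subject-out cross-validation.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.RandomState(0)
X = rng.randn(600, 32)                      # trials x features (placeholder)
y = rng.randint(0, 2, size=600)             # binary event labels (placeholder)
subject_ids = np.repeat(np.arange(10), 60)  # 10 subjects, 60 trials each

logo = LeaveOneGroupOut()
for train_idx, test_idx in logo.split(X, y, groups=subject_ids):
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    held_out = subject_ids[test_idx][0]     # the subject never seen in training
    print(f"subject {held_out}: accuracy={clf.score(X[test_idx], y[test_idx]):.2f}")
```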
Given the vast number of topics and applications, it is reasonable to assume that there may not be a one-size-fits-all approach to machine learning with biosignals. A successful application that yields good results in a practical way will depend on how the data is collected, what data is collected, what is needed from the data, the computing power of the target platform(s), and the environmental circumstances under which the system must perform. These circumstances, however, do not preclude a way towards standardization while improving the performance of applied machine learning with biosignals for selected problem domains; this notion underpins the objective of this research.

C.1.1 Signals Classified

There is a virtually endless variety of what can be construed as biosignal data. Here we give a very brief overview of biosignal classification; the purpose of this section is to differentiate and make clear the classification of biosignals.

First, there are two accepted groups of biosignals: those which are permanent and those which are induced. Induced biosignals are artificially triggered; they last for the duration of the excitation, and the signal decays at a certain rate once the excitation is over [103]. An example of such a signal would be an electrical current induced across the surface of skin or muscle tissue, causing the tissue to contract. This is a form of electromyography (EMG). The data collected through electrodes during that time would be considered an induced signal. Permanent biosignals, on the other hand, consist of data that exist without any outside stimuli; the source is already inside the body. Electrocardiographic signals (ECGs) and electroencephalograms (EEGs) belong to this group of signals.

The secondary classification considers the dynamic aspect of biosignals. There are two such classifications: static (or quasi-static) and dynamic. Static signals carry information which rarely changes, or changes slowly over time; this can be considered a steady-state function. A signal consisting of body temperature is an example of a quasi-static signal. Dynamic biosignals may yield extensive changes in time, where the dynamic process itself may hold information of interest. Whereas an ECG recording may settle into a steady-state function (static-like), it can become a signal with highly dynamical events; this attribute makes ECG (as well as EEG) a dynamic biosignal.

The third classification considers the origin of the biosignal. This classification breaks down into seven different origins, listed below with some examples:

• Electric biosignals: Electroencephalography (EEG), Quantitative Electroencephalography (qEEG), Electrocardiography (ECG & EKG), and Electrodermography (EDG)

• Magnetic biosignals: Magnetocardiography (MCG), Magnetoencephalography (MEG), Magnetogastrography (MGG)

• Mechanic biosignals: Mechanomyography (MMG), pulmonary/respiratory plethysmography (using a piezoelectric respiratory belt), which I will refer to as PPG

• Optic biosignals: Photoplethysmography (RPG)

• Acoustic biosignals: Phonography, Phonocardiography (PCG), Bronchophonographia (BPG)

• Chemical biosignals: reflect chemical composition and its temporal changes in body solids, liquids, and gases.

• Thermal biosignals: usually assess highly heterogeneous mechanisms of heat loss and heat absorption in the body. Example: body core temperature recordings.

Figure C.1: The possible classifications of biosignals according to their (a) existence, (b) dynamic, and (c) origin, with indicated heart rate fC, respiratory rate fR, and additional information. (Springer-Verlag Berlin Heidelberg)

C.1.2 Signals of Interest & Associated Data

For this research, we will be collecting and analyzing electrical biosignals, or bioelectrical time-series data, excluding most, if not all, of the other biosignal categories described above. To be more precise, we will concentrate heavily on signals that fall into the classification of permanent, dynamic or quasi-static, electric time-series data. How we acquire this data may vary. The information contained in biosignals that is of interest to us will be culled out in several forms, depending on the signal type and associated problem domain. Currently, there are several readily available sources of EEG data for humans and rats (Dr. White et al, Dr. Zhuang et al), and ECG, EDG, and RPG data are also available through EASE (Dr. Boult et al). Other data of interest may be collected either by the author or by other research teams the author will work with. For the edification of the committee members, included here is a brief overview of the various types of signals we may wish to experiment with during our research.

C.1.2.1 Electromyography (EMG)

Figure C.2: Pictorial outline of the decomposition of the surface EMG signal into its constituent motor unit action potentials. (Adapted from De Luca et al. 1982a.)

Electromyography (EMG) is a technique for monitoring and recording the electrical activity produced by skeletal muscles [187]. EMG is conducted using an electromyograph to record and produce an electromyogram. An electromyogram is a recording of the electric potential that is generated by "muscle cells when they are electrically or neurologically activated" [163]. These signals can be used to analyze the biomechanics of human or animal movement as well as to detect system function and abnormalities [243]. For the purposes of this research, EMGs are not of primary interest. EMG data may be used to complement our machine learning with qEEG data for P300 BCI. Research has shown that under certain conditions, a BCI may be improved by combining EMG data with qEEG information [184]. To that end, EMG data could be utilized during some experimentation.

C.1.3 Electrocardiography (ECG)

Electrocardiography (ECG or EKG) is the process of recording the overall electrical activity of the heart. The resultant recording is made up of small electrical changes on the skin over time that arise from the heart's "electrophysiologic pattern of depolarizing and repolarizing during each heartbeat" [242]. Just as with EMGs, for the purposes of this research, ECGs could provide ancillary information for experimentation purposes.

Figure C.3: Several seconds of ECG recording.

Figure C.4: ECG of a heart in normal sinus rhythm.

ECG data may be used to complement our machine learning with EEG or other data, such as EDG. This is especially true with respect to research in emotional and mental state detection, should we pursue that subject matter. The author's recent work with data related to EASE (Dr. Boult et al) suggests that ECG data, combined with EDG data, may be useful for BCI-related purposes. The author suspects that EMG, when combined with EEG and EDG data, may provide enough information to train a machine to classify mental state, awareness, or other related attributes for a given subject. To that end, ECG data may be utilized during some experimentation.

C.1.4 Electrodermography (EDG)

Figure C.5: A sample GSR signal of 60 seconds duration

Electrodermography (EDG) is the process of recording skin conductance and skin potential, as well as (indirectly) electrical resistance on the surface of a subject's skin. Data related to EASE correspond specifically to the galvanic skin response (GSR), which is a measurement of electrical skin resistance over a period of time. The resultant electrodermograph is a collection of resistance values expressed in kΩ (thousands of ohms). This data is already commonly used by clinicians and therapists to assess and treat anxiety disorders; GSR is also a component used in polygraphy [241]. EDG may provide ancillary information for experimentation during this research, should we choose to utilize it at all. EDG data could be used to complement our machine learning for BCI. This is especially true with respect to research in emotional and mental state detection in the context of BCI UI use. The author's recent work with data related to EASE (Dr. Boult et al) suggests that EDG data, combined with ECG data, could be useful in our research.
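To illustrate what combining these modalities could look like, here is a hypothetical feature-level fusion sketch in which per-trial EEG features are concatenated with ECG and EDG summary features before classification. All array names and shapes are illustrative assumptions, not a description of the EASE data.

```python
# Hypothetical sketch of feature-level fusion across modalities: concatenate
# per-trial EEG, ECG, and EDG features into one vector, then classify.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

n_trials = 200
eeg_feats = np.random.randn(n_trials, 32)   # e.g., band powers per channel
ecg_feats = np.random.randn(n_trials, 4)    # e.g., heart-rate statistics
edg_feats = np.random.randn(n_trials, 2)    # e.g., mean/variance of GSR resistance
labels = np.random.randint(0, 2, size=n_trials)

fused = np.hstack([eeg_feats, ecg_feats, edg_feats])   # one vector per trial
clf = RandomForestClassifier(random_state=0).fit(fused, labels)
```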

Appendix D

SOFTWARE PACKAGES AND DATASETS

Development and testing were performed on several machines:

• Windows 10 with 4.5 GHz Intel Core i9 (14 cores), 64GB memory, and two NVIDIA GTX 1080 Ti Duke OC cards, each with 11GB RAM and 3,584 cores

• Windows 10 with 3.7 GHz Intel Core i7 (8 cores), 24GB memory, and an NVIDIA GTX 1660 Ti OC with 6GB RAM and 1,536 cores

• MacBook Pro (late 2017) with a 2.9 GHz Intel Core i7 processor, 16GB memory, a Radeon Pro 560 with 4 GB, and OSX 10.14.6

The following software packages and tools were used:

• TexStudio 2.12.22 – https://texstudio.org/

• PyCharm Community 2020.1 – https://www.jetbrains.com/pycharm/download/#section=mac

• Jupyter Notebook – https://jupyter.org/

• Python 3.7 – https://www.python.org/

• captum – https://captum.ai/

• seaborn – https://seaborn.pydata.org/

• matplotlib – https://matplotlib.org/

• MNE – https://mne.tools/stable/index.html

• scikit-learn – https://scikit-learn.org/stable/

• numpy – https://numpy.org/

• Keras – https://keras.io/

• PyTorch – https://pytorch.org/

• TensorFlow – https://www.tensorflow.org/

• OpenBCI – https://openbci.com/

D.1 Other Available BCI Data Sets

Besides the alcoholism dataset we used for our research, there are other (limited) data sets available online that may aid in future work.

• BCI Competition III, dataset II: The goal of the "BCI Competition III" is to validate signal processing and classification methods for Brain-Computer Interfaces (BCIs). http://www.bbci.de/competition/iii/#datasets

• BNCI Horizon 2020: Fosters collaboration and communication among key stakeholders in the European CSA, specifically in the field of brain-computer interfaces. http://bnci-horizon-2020.eu/

• Multimedia Signal Processing Group: This webpage acts as a showcase and gateway to the projects' main achievements. https://mmspg.epfl.ch/page-58318-en.html

• Repositories of medical signals and images: dedicated to sharing experimental data recorded and used by the laboratory of UAM. http://akimpech.izt.uam.mx/dokuwiki/doku.php?id=deposits.en

• Engineuring.com: A collection of available EEG and BCI datasets on the Internet (some may no longer be available). https://engineuring.wordpress.com/2009/07/08/downloadable-eeg-data/