Application of Transfer Learning in EEG Decoding Based on Brain-Computer Interfaces: A Review
Kai Zhang 1,2, Guanghua Xu 1,2,*, Xiaowei Zheng 1,2 , Huanzhong Li 1, Sicong Zhang 1, Yunhui Yu 1 and Renghao Liang 1
1 School of Mechanical Engineering, Xi’an Jiaotong University, Xi’an 710049, China; [email protected] (K.Z.); [email protected] (X.Z.); [email protected] (H.L.); [email protected] (S.Z.); [email protected] (Y.Y.); [email protected] (R.L.) 2 State Key Laboratory for Manufacturing Systems Engineering, Xi’an Jiaotong University, Xi’an 710049, China * Correspondence: [email protected]
Received: 18 October 2020; Accepted: 4 November 2020; Published: 5 November 2020
Abstract: The algorithms of electroencephalography (EEG) decoding are mainly based on machine learning in current research. One of the main assumptions of machine learning is that training and test data belong to the same feature space and are subject to the same probability distribution. However, this may be violated in EEG processing. Variations across sessions/subjects result in a deviation of the feature distribution of EEG signals in the same task, which reduces the accuracy of the decoding model for mental tasks. Recently, transfer learning (TL) has shown great potential in processing EEG signals across sessions/subjects. In this work, we reviewed 80 related published studies from 2010 to 2020 about TL application for EEG decoding. Herein, we report what kind of TL methods have been used (e.g., instance knowledge, feature representation knowledge, and model parameter knowledge), describe which types of EEG paradigms have been analyzed, and summarize the datasets that have been used to evaluate performance. Moreover, we discuss the state-of-the-art and future development of TL for EEG decoding. The results show that TL can significantly improve the performance of decoding models across subjects/sessions and can reduce the calibration time of brain–computer interface (BCI) systems. This review summarizes the current practical suggestions and performance outcomes in the hope that it will provide guidance and help for EEG research in the future.
Keywords: EEG; transfer learning; review; decoding; classification
1. Introduction

A brain–computer interface (BCI) is a communication method between a user and a computer that does not rely on the normal neural pathways of the brain and muscles [1]. According to the methods of electroencephalography (EEG) signal collection, BCIs can be divided into three types, namely, non-invasive, invasive, and partially-invasive BCIs. Among them, non-invasive BCIs realize the control of external equipment by transforming EEG recordings into commands, and they have been widely used due to their convenient operation. Figure 1 shows a typical non-invasive BCI system framework based on EEG, which usually consists of three parts: EEG signal acquisition, signal decoding, and external device control. During this process, signal decoding is the key step to ensure the operation of the whole system.
Sensors 2020, 20, 6321; doi:10.3390/s20216321 www.mdpi.com/journal/sensors
Figure 1. Framework of an electroencephalography (EEG)-based brain–computer interface (BCI) system.

The representation of EEG typically takes the form of a high-dimensional matrix, which includes the information of sampling points, channels, trials, and subjects [2]. Meanwhile, the most common features of EEG-based BCIs include spatial filtering, band power, time points, and so on. Recently, machine learning (ML) has shown its powerful ability for feature extraction in EEG-based BCI tasks [3,4]. BCI technology based on EEG has made great progress, but the challenges of weak robustness and low accuracy greatly hinder the application of BCIs in practice [5]. From the perspective of signal decoding, the reasons are as follows: First, one of the main assumptions of ML is that training and test data belong to the same feature space and are subject to the same probability distribution. However, this assumption is often violated in the field of bioelectric signal processing, because differences in physiological structure and psychological states may cause obvious variation in EEG. Therefore,
signals from different sessions/subjects on the same task show different features and distributions. Second, EEG signals are extremely weak and are always accompanied by unrelated artifacts from other areas of the brain, which potentially mislead discriminant results and decrease the classification accuracy. Third, the strict requirements for the experimental conditions of BCI systems make it difficult to obtain large and high-quality datasets in practice. It is difficult for a classification model based on small-scale samples to obtain strong robustness and high classification accuracy. However, large-scale and high-quality datasets are the basis for guaranteeing the decoding accuracy of models. One promising approach to solve these problems is transfer learning (TL). The principle of TL is realizing knowledge transfer from different but related tasks, i.e., using existing knowledge learned from accomplished tasks to help with new tasks. The definition of TL is as follows: A given domain D consists of a feature space X and a marginal probability distribution P(X).
A task T consists of a label space Y and a prediction function f. A source domain D_S and a target domain D_T may have different feature spaces or different marginal probability distributions, i.e., X_S ≠ X_T or P_S(X) ≠ P_T(X). Meanwhile, tasks T_S and T_T may be subject to different label spaces. The aim of TL is to help improve the learning ability of the target predictive function f_T(·) in D_T using the knowledge in D_S and T_S [6]. There are two main scenarios in EEG-based BCIs, namely, cross-subject transfer and cross-session transfer. The goal of TL is to find the similarity between new and original tasks and then to realize the discriminative and stationary information transfer across domains [7]. In this study, we attempted to summarize the transferred knowledge for EEG based on the following three types: knowledge of instances, knowledge of feature representation, and knowledge of model parameters. This review of TL applications for EEG classification attempted to address the following critical questions: What problems does TL solve for EEG decoding?
(Section 3.1); which paradigms of EEG are used for TL analysis? (Section 3.2); what kind of datasets can we refer to in order to verify the performance of these methods? (Section 3.3); what types of TL frameworks are available? (Section 3.4). First, the search methods for the identification of studies are introduced in Section 2. Then, the principle and classification criteria of TL are analyzed in Section 3. Next, the TL algorithms for EEG from 2010 to 2020 are described in Section 4. Finally, the current challenges of TL in EEG decoding are discussed in Section 5.

2. Methodology

A wide literature search from 2010 to 2020 was conducted, resorting to the main databases, such as Web of Science, PubMed, and IEEE Xplore.
The keywords used for the electronic search were TL, electroencephalogram, brain–computer interface, inter-subject, and covariate shift. Table 1 lists the collection criteria for inclusion or exclusion.

Table 1. Inclusion and exclusion criteria.

Inclusion Criteria:
• Published within the last 10 years (as transfer learning (TL) for EEG has been proposed and developed in recent years).
• A focus on non-invasive EEG signals (as the object for discussion in this review).
• A specific explanation of how to apply TL to EEG signal processing.

Exclusion Criteria:
• A focus on the processing of invasive EEG, electrocorticography (ECoG), magnetoencephalography (MEG), source imaging, fMRI, and so on, or joint studies with EEG.
• No specific description of TL for EEG processing.
The search method of this review is shown in Figure 2, which was used to identify and to narrow down the collection of TL-based studies, resulting in a total of 246 papers. Duplicates between all databases and studies without full-text links were excluded. Finally, 80 papers that meet the inclusion criteria were included.
Figure 2. The search method for identifying relevant studies.
3. Results
3.1. What Problems Does Transfer Learning Solve?

This review of the literature on TL applications for EEG attempted to address the following critical questions:
3.1.1. The Problem of Differences across Subjects/Sessions

Although advanced methods such as machine learning have been proven to be a critical tool in EEG processing and analysis, they still suffer from some limitations that hinder their wide application in practice. Consistency of the feature space and probability distribution of training and test data is an important prior condition of machine learning. However, in the field of biomedical engineering, such as EEG-based BCIs, this hypothesis is often violated. Obvious variation in feature distribution typically occurs in representations of EEG across sessions/subjects. This phenomenon results in a scattered distribution of EEG signal features, an increase in the difficulty of feature extraction, and a reduction in the performance of the classifier.
3.1.2. The Problem of Small Sample Size

In recent years, machine learning and deep neural networks have provided good results for the classification of linguistic features, images, sounds, and natural texts. A main reason for their success is that massive amounts of data guarantee the performance of the classifier. However, in practical applications of BCI, it is difficult to collect high-quality and large EEG datasets due to the strict requirements for the experimental environment and the limited availability of subjects. The performance of these methods is highly sensitive to the number of samples; a small sample size tends to lead to overfitting during model training, which adversely affects the classification accuracy [8].
3.1.3. The Problem of Time-Consuming Calibration

A large amount of data is required to calibrate a BCI system when a subject performs a specific EEG task. This requirement commonly takes a long calibration session, which is inevitable for a new user. For example, when a subject performs a steady-state visually evoked potential (SSVEP) speller task, the various commands cause a long calibration time. Collecting calibration data is time-consuming and laborious, which reduces the efficiency of the BCI system.
3.2. EEG Paradigms for Transfer Learning

There are four paradigms of EEG-BCIs discussed in this paper, and the percentages of these paradigms across the collected studies are shown in Figure 3.
Figure 3. The percentage of different EEG pattern strategies across collected studies.
3.2.1. Motor Imagery

Motor imagery (MI) is a mental process that imitates motor intention without real motion output [9], which activates the neural potential in primary sensorimotor areas. Different imagery tasks will induce potential activity in different regions of the brain. Thus, this response can be converted into a classification task. The feature of MI signals is often expressed in the form of frequency or band energy [10]. Due to task objectives and various feature representations, a variety of machine learning algorithms (e.g., deep learning and Riemannian geometry) can be applied to the decoding of MI [11,12].
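Since MI features are commonly expressed as band power in task-relevant frequency ranges, the idea can be illustrated with a minimal sketch. The array shapes, sampling rate, and 8–30 Hz band below are assumptions chosen for the example, not parameters taken from the reviewed studies.

```python
# Minimal sketch of band-power feature extraction for motor imagery (MI).
# Assumes `trial` is a NumPy array of shape (n_channels, n_samples); the
# sampling rate and mu/beta band limits are illustrative assumptions.
import numpy as np

def band_power(trial, fs=250.0, band=(8.0, 30.0)):
    """Average periodogram power in a frequency band, per channel."""
    n = trial.shape[-1]
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    psd = np.abs(np.fft.rfft(trial, axis=-1)) ** 2 / n
    mask = (freqs >= band[0]) & (freqs <= band[1])
    return psd[..., mask].mean(axis=-1)  # shape: (n_channels,)

rng = np.random.default_rng(0)
trial = rng.standard_normal((3, 1000))  # 3 channels, 4 s at 250 Hz
features = band_power(trial)
print(features.shape)  # (3,)
```

In practice, such band-power features would be computed per trial and channel and then fed to a classifier.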
3.2.2. Steady-State Visually Evoked Potentials

When a human receives a fixed frequency of flashing visual stimuli, the potential activity of the cerebral cortex is modulated to produce a continuous response related to the frequency (same or multiples) of these stimuli. This physiological phenomenon is referred to as an SSVEP [13]. Due to their stable and obvious signal representation, BCI systems based on SSVEP are widely used to control equipment such as mobile devices, wheelchairs, and spellers.
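Because SSVEP responses are frequency-locked to the stimulus, a simple detector can score each candidate stimulus frequency by its spectral magnitude. The sketch below is illustrative only (the candidate frequencies and the synthetic signal are assumptions); practical SSVEP decoders typically use more robust methods such as canonical correlation analysis.

```python
# Illustrative sketch of SSVEP target detection by spectral peak matching.
# The candidate frequencies and the synthetic test signal are assumptions
# for this example, not taken from the reviewed studies.
import numpy as np

def detect_ssvep(eeg, fs, candidates):
    """Return the candidate stimulus frequency with the largest FFT magnitude."""
    spectrum = np.abs(np.fft.rfft(eeg))
    freqs = np.fft.rfftfreq(eeg.size, d=1.0 / fs)
    # Score each candidate by the magnitude at the nearest FFT bin.
    scores = [spectrum[np.argmin(np.abs(freqs - f))] for f in candidates]
    return candidates[int(np.argmax(scores))]

fs = 250.0
t = np.arange(0, 2.0, 1.0 / fs)
# Synthetic 10 Hz SSVEP response plus background noise.
eeg = np.sin(2 * np.pi * 10.0 * t) + 0.5 * np.random.default_rng(1).standard_normal(t.size)
best = detect_ssvep(eeg, fs, [8.0, 10.0, 12.0])
print(best)  # → 10.0
```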
3.2.3. Event-Related Potentials

Event-related potentials (ERPs) are responses to multiple or diverse stimuli corresponding to specific meanings [14]. P300 is the most representative type of ERP, which occurs about 300 ms after a visual or auditory stimulus. A feature classification model can be used for decoding P300.
3.2.4. Passive BCIs A passive BCI is a form of interaction that does not rely on external stimuli. It achieves a brain control task by encoding the mental activity from different states of the brain [15]. Common types of passive BCI tasks include driver drowsiness, emotion recognition, mental workload assessment, and epileptic detection [16], which can be decoded by regression and classification models [17,18].
3.3. Case Studies on a Shared Dataset

Analysis across different datasets is generally not valid because they use different equipment or communication protocols. In addition, different mental tasks and collection procedures also introduce great differences into EEG. Therefore, the reviewed studies mainly concentrate on TL across subjects/sessions within the same dataset. In Table 2, we briefly summarize the publicly available EEG datasets in this review.
Table 2. Datasets.

Datasets | Task | Subject | Channel | Amount of Data (Per Subject) | Sampling Rate | Reference
BCIC-II-IV | 2 MI classes | 1 | 28 | 3 sessions/416 trials | 1000 Hz | [19]
BCIC-III-II | P300 | 2 | 64 | 5 sessions | 240 Hz | [20]
BCIC-III-IVa | 2 MI classes | 5 | 118 | 4 sessions/280 trials | 1000 Hz | [21]
BCIC-IV-2a | MI | 9 | 25 | 2 sessions/288 trials | 250 Hz | [22]
BCIC-IV-2b | 2 MI classes | 9 | 6 | 720 trials | 250 Hz | [23]
P300 speller | P300 | 8 | 8 | 5 sessions/20 trials | 256 Hz | [24]
DEAP | ER | 32 | 40 | 125 trials | 128 Hz | [25]
BCIC-III-IVc | MI | 1 | 118 | 630 trials | 200 Hz | [26]
SEED | ER | 15 | 64 | 3 sessions/15 trials | 200 Hz | [27]
OpenMIIR | Music imagery | 10 | 64 | 5 sessions/12 trials in four tasks | 512 Hz | [28]
CHB-MIT | ED | 22 | 23 | 844 hours' collection | 256 Hz | [29]
3.4. Transfer Learning Architecture

In this review, we summarize previous studies according to "what knowledge should be transferred in EEG processing." Multi-step processing for EEG across subjects/sessions results in discriminative information in different steps. Therefore, determining what should be transferred is the key problem according to different EEG tasks. Pan et al. [6] proposed authoritative classification approaches based on "what to transfer." All papers collected in this review were classified according to this method (Figure 4). In the following sections, we have selected several representative methods for analysis.
Figure 4. Different approaches to transfer learning.
3.4.1. Transfer Learning Based on Instance Knowledge

It is often assumed that we can easily obtain large amounts of labeled data from a source domain, but these data cannot be directly reused. Instance transfer approaches re-weight some source domain data as a supplement for the target domain. Based on instance transfer, the majority of the literature utilized a measurement method to evaluate the similarity between data from the source and target domains. The similarity metric was then converted into a transfer weight coefficient, which was directly used for instance transfer by re-weighting the source domain data [30–32]. Herein, we have listed a few typical methods based on instance transfer.

Reference [33] proposed an instance TL method based on K–L divergence measurements. They measured the similarity of the normal distributions between two domains and transformed this similarity into a transfer weight coefficient for the target subject.
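The re-weighting idea can be sketched as follows: fit a Gaussian to the features of each domain, compute the closed-form K–L divergence between the two fits, and map the divergence to a weight. The exp(−KL) mapping and the synthetic data below are assumptions chosen for illustration; this is not the exact scheme of [33].

```python
# Illustrative sketch of instance transfer by distribution similarity.
# Source-domain data are weighted by exp(-KL) between Gaussian fits of the
# source and target feature distributions (the weighting rule here is an
# assumption for the example, not the exact method of [33]).
import numpy as np

def gaussian_kl(mu0, cov0, mu1, cov1):
    """Closed-form KL(N0 || N1) for multivariate Gaussians."""
    k = mu0.size
    inv1 = np.linalg.inv(cov1)
    diff = mu1 - mu0
    return 0.5 * (diff @ inv1 @ diff + np.trace(inv1 @ cov0)
                  - np.log(np.linalg.det(cov0) / np.linalg.det(cov1)) - k)

def transfer_weight(source_feats, target_feats):
    """Similarity weight for a source subject: larger when domains match."""
    kl = gaussian_kl(source_feats.mean(0), np.cov(source_feats.T),
                     target_feats.mean(0), np.cov(target_feats.T))
    return float(np.exp(-kl))

rng = np.random.default_rng(0)
target = rng.normal(0.0, 1.0, size=(200, 2))
similar = rng.normal(0.1, 1.0, size=(200, 2))     # close to the target domain
dissimilar = rng.normal(3.0, 2.0, size=(200, 2))  # far from the target domain
w_similar = transfer_weight(similar, target)
w_dissimilar = transfer_weight(dissimilar, target)
print(w_similar > w_dissimilar)  # → True
```

A similar source subject thus contributes more strongly to the target model than a dissimilar one.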
Suppose that the normal distributions of the two datasets N0 and N1 can be expressed as:
\[
N_0 \sim \mathcal{N}(\mu_0, \Sigma_0), \quad N_1 \sim \mathcal{N}(\mu_1, \Sigma_1) \tag{1}
\]

where \(\mu_i\) and \(\Sigma_i\) are the mean value and covariance (i = 0, 1), respectively. The K–L divergence of the two distributions can be expressed as:

\[
\mathrm{KL}(N_0 \,\|\, N_1) = \frac{1}{2}\left[(\mu_1 - \mu_0)^{T}\,\Sigma_1^{-1}\,(\mu_1 - \mu_0) + \operatorname{trace}\!\left(\Sigma_1^{-1}\Sigma_0\right) - \ln\frac{\det\Sigma_0}{\det\Sigma_1} - K\right] \tag{2}
\]

where \(K\) denotes the dimension of the data, \(\mu\) represents the mean value, \(\Sigma\) is the covariance, and \(\det\) represents calculation of the determinant. The similarity weight \(\delta_s\) can be calculated by: