PCA+HMM+SVM FOR EEG PATTERN CLASSIFICATION

Hyekyung Lee and Seungjin Choi

Department of Computer Science and Engineering, POSTECH, Korea {leehk, seungjin}@postech.ac.kr

ABSTRACT

Electroencephalogram (EEG) pattern classification plays an important role in the domain of brain computer interface (BCI). The hidden Markov model (HMM) may be a useful tool for EEG pattern classification, since EEG data is multivariate time series data which contains noise and artifacts. In this paper we present methods for EEG pattern classification which jointly employ principal component analysis (PCA) and HMM. Along this line, two methods are introduced: (1) PCA+HMM; (2) PCA+HMM+SVM. The usefulness of principal component features and of our hybrid method is confirmed through the classification of EEG recorded during the imagination of a left or right hand movement.

1. INTRODUCTION

The automatic classification of EEG patterns plays an important role in an EEG-based BCI system, which provides a new communication channel between the human brain and a computer. In general, EEG data is very noisy and contains several types of artifacts. Moreover, EEG data consists of mixtures of several brain sources (which are not directly observable) and noise sources, which makes the problem even harder.

Several attempts have been made to build an EEG-based BCI system. Such a system consists of two procedures: (1) feature extraction; (2) classification. For feature extraction, the adaptive autoregressive (AAR) model, Hjorth parameters, and the power spectrum have been widely used. As classifiers, linear discriminant analysis (LDA), neural networks, and recently HMMs have been used [6].

In this paper, we consider principal component features, which capture the second-order statistical structure of the data. Although PCA has mainly been used to analyze spatial data, we show in this paper that PCA is also a useful tool for time series data. Since PCA retains maximum variance, it is expected to provide features that are robust to small noise.

Based on principal component features, we employ an HMM classifier, which is a popular tool for modelling time series data. A recent work on an HMM-based BCI system can be found in [5], where Hjorth parameters were used. In this paper we show that principal component features improve the classification performance of an HMM (PCA+HMM). In addition, we present a hybrid method which combines HMM and the support vector machine (SVM). These two methods are described in Sec. 3 and their usefulness is confirmed by computer simulations.

2. BACKGROUND

2.1. PCA

PCA aims to find a linear orthogonal transformation v = W u (where u is the observation vector) such that the retained variance is maximized. Alternatively, PCA can be viewed as a minimizer of the reconstruction error. Both principles (variance maximization and reconstruction error minimization) lead to a symmetric eigenvalue problem. The row vectors of W correspond to the normalized orthogonal eigenvectors of the data covariance matrix.

A simple approach to PCA is to use the singular value decomposition (SVD). Let us denote the data covariance matrix by R_u = E{u u^T}, where the superscript T denotes the transpose of a vector or matrix. Then the SVD of R_u has the form

    R_u = U_u D_u U_u^T,    (1)

where U_u is the eigenvector matrix and D_u is the diagonal matrix whose diagonal elements correspond to the eigenvalues of R_u. The linear transformation W for PCA is then given by

    W = U_u^T.    (2)

For dimensionality reduction, one can choose the p dominant column vectors of U_u, i.e., the eigenvectors associated with the p largest eigenvalues, in order to construct the linear transform W. Many different methods for PCA have been developed; see [1, 4] for further details.
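For illustration, the transform of Eqs. (1)-(2) can be sketched in a few lines of NumPy. The function name pca_transform and the assumption that the data blocks are already collected as the columns of a matrix are ours, for illustration only:

```python
import numpy as np

def pca_transform(U, p):
    """Estimate the p x M PCA matrix W of Eq. (2) from data blocks.

    U : (M, N) array whose columns u_1, ..., u_N are data blocks.
    p : number of principal components to retain.
    """
    U = U - U.mean(axis=1, keepdims=True)     # remove the mean of each variable
    R = (U @ U.T) / U.shape[1]                # sample covariance R_u = E{u u^T}
    eigvals, eigvecs = np.linalg.eigh(R)      # symmetric eigendecomposition (Eq. (1))
    order = np.argsort(eigvals)[::-1]         # sort eigenvalues in decreasing order
    return eigvecs[:, order[:p]].T            # rows = p dominant eigenvectors

# Projection of a block u onto the principal subspace: v = W @ u
```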

Figure 1: Schematic diagram of PCA-HMM1: raw data is preprocessed in the data segmentation stage to extract principal components, which are used to learn two HMMs, one corresponding to left and one to right hand movement; the final decision is based on the likelihood scores calculated by the two HMMs.

Figure 2: Schematic diagram of PCA-HMM2: raw data is preprocessed in the data segmentation stage to extract principal components; the principal components extracted from each channel are used to learn two HMMs, so four HMMs are trained in total; the likelihood scores are fed into an SVM to make the final decision.

2.2. HMM

The HMM is a widely-used probabilistic method for modelling time series data. It has been extensively used in speech recognition and computational biology. It is a simple dynamic Bayesian network which can represent probability distributions over sequences of observations.

Let us denote a sequence of observations by {y_t} and a sequence of hidden states by {s_t}, where t = 1, ..., T. A HMM assumes two sets of conditional independence relations: (1) the observation y_t is independent of all other observations and states given s_t; (2) the state s_t depends only on s_{t-1}, i.e., the states satisfy the first-order Markov property. It follows from these conditional independence relations that the joint distribution of states and observations can be factorized as

    P(s_1, ..., s_T, y_1, ..., y_T) = P(s_1) P(y_1|s_1) ∏_{t=2}^{T} P(s_t|s_{t-1}) P(y_t|s_t).    (3)

A HMM assumes that the hidden state variables are discrete-valued, i.e., s_t ∈ {1, ..., K}. The state vector s_t is a K-dimensional vector with only one element equal to unity and the rest equal to zero; which element is unity depends on which state value is active. Then P(s_t|s_{t-1}) can be represented by a K × K state transition matrix denoted by Φ, and P(s_1) is a K-dimensional vector of initial state probabilities denoted by π.

A HMM allows either discrete-valued observations (discrete HMM) or real-valued observations (continuous HMM). In this paper we consider only a continuous HMM because EEG is real-valued data. For real-valued observation vectors, P(y_t|s_t) can be modelled in many different forms, such as a Gaussian, a mixture of Gaussians, or a neural network.

Learning an HMM consists of two steps: (1) an inference step, where the posterior distribution over hidden states is calculated; (2) a learning step, where the parameters (such as the initial state probabilities, state transition probabilities, and emission probabilities) are identified. The well-known forward-backward recursion allows us to infer the posterior over hidden states efficiently. More details on HMM can be found in [7, 2].

2.3. SVM

The support vector machine (SVM) has been widely used in pattern recognition and regression due to its computational efficiency and good generalization performance. It originated from the idea of structural risk minimization developed by Vapnik in the 1970s [9].

Suppose we have a set of training data {(x_1, z_1), ..., (x_N, z_N)}. The decision function f(x) has the form

    f(x) = sgn( Σ_{i=1}^{N} z_i α_i k(x, x_i) + b ),    (4)

where the {α_i} are embedding coefficients and k(x, x_i) is a kernel represented by the dot product, i.e.,

    k(x, x_i) = <x, x_i>,    (5)

where <·, ·> denotes the dot product. The decision function is obtained by solving the following quadratic program:

    maximize  J = Σ_{i=1}^{N} α_i − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} α_i α_j z_i z_j k(x_i, x_j)    (6)

    subject to  α_i ≥ 0, i = 1, ..., N, and Σ_{i=1}^{N} α_i z_i = 0.    (7)

More details on SVM can be found in [8].
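As an illustration of the decision function in Eq. (4), the following sketch fits a linear SVM on toy data with scikit-learn and reproduces f(x) from the fitted dual coefficients; scikit-learn and the toy data are our assumptions for illustration only:

```python
import numpy as np
from sklearn.svm import SVC

# Toy two-class data with labels z in {-1, +1}, as in Eqs. (4)-(7).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 0.5, (20, 2)), rng.normal(+1.0, 0.5, (20, 2))])
z = np.array([-1] * 20 + [+1] * 20)

clf = SVC(kernel="linear", C=1.0).fit(X, z)   # solves the dual problem (6)-(7)

# f(x) = sgn( sum_i z_i alpha_i <x, x_i> + b ); dual_coef_ stores z_i * alpha_i
# for the support vectors, and intercept_ stores b.
x_new = np.array([0.3, -0.2])
score = clf.dual_coef_[0] @ (clf.support_vectors_ @ x_new) + clf.intercept_[0]
print(np.sign(score), clf.predict(x_new[None, :])[0])   # the two labels agree
```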
3. METHODS

We present two methods which jointly employ PCA and HMM for EEG pattern classification. In our methods we consider only the C3 and C4 channels, located over the sensorimotor cortex, because we focus on binary classification of EEG patterns recorded during the imagination of either a left or a right hand movement.

Schematic diagrams of our methods, named PCA-HMM1 and PCA-HMM2, are shown in Fig. 1 and Fig. 2, respectively. Both methods employ a data segmentation procedure in which the time series data is decomposed into overlapping blocks in order to extract principal components. In PCA-HMM1, principal component features extracted separately from the C3 and C4 channels are concatenated and then fed into the corresponding HMM (which models either left or right movement) for training. In PCA-HMM2, on the other hand, the principal component features from each channel are fed into two HMMs separately, which results in four HMMs in total; an SVM is then employed to make the final decision from the likelihood scores computed by the HMMs.
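Both pipelines ultimately compare likelihood scores of the form P(y|HMM). As background, the following NumPy/SciPy sketch computes such a log-likelihood with the forward recursion mentioned in Sec. 2.2 for a Gaussian-emission HMM; SciPy and all names below are illustrative assumptions, not taken from the paper:

```python
import numpy as np
from scipy.stats import multivariate_normal

def hmm_log_likelihood(y, pi, Phi, means, covs):
    """log P(y_1, ..., y_T) for a Gaussian-emission HMM via the forward recursion.

    y     : (T, d) observation sequence
    pi    : (K,) initial state probabilities
    Phi   : (K, K) transition matrix, Phi[i, j] = P(s_t = j | s_{t-1} = i)
    means : (K, d) emission means;  covs : (K, d, d) emission covariances
    """
    T, K = len(y), len(pi)
    # log emission probabilities log P(y_t | s_t = k)
    log_b = np.array([[multivariate_normal.logpdf(y[t], means[k], covs[k])
                       for k in range(K)] for t in range(T)])
    log_alpha = np.log(pi) + log_b[0]                     # initialization
    for t in range(1, T):                                 # recursion
        log_alpha = log_b[t] + np.logaddexp.reduce(
            log_alpha[:, None] + np.log(Phi), axis=0)
    return np.logaddexp.reduce(log_alpha)                 # termination
```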

3.1. Feature Extraction: PCA


Figure 3: Data segmentation: the time series data is converted into a data matrix whose columns u_1, u_2, ..., u_N are overlapping blocks.

The time series data is decomposed into N overlapping blocks to construct an M × N data matrix, where M is the number of data points in each block (see Fig. 3). This matrix is used to find the p × M matrix W for PCA. In our case, four matrices W_{C3,L}, W_{C4,L}, W_{C3,R} and W_{C4,R} are calculated in the training phase, where the subscripts C3 and C4 denote the channels and L and R correspond to the imagination of left-hand and right-hand movement, respectively. The feature vector is then computed by v_n = W u_n.

Small noise appears mainly in the minor component directions, which correspond to the small eigenvalues, whereas useful information is expected to lie in the principal directions. Therefore PCA can be expected to reduce the effect of noise as well as to extract useful features from time series data. Exemplary basis functions learned by PCA are shown in Fig. 4; they look similar to wavelet basis functions. A major difference between PCA and the wavelet transform is that the former learns basis functions from the ensemble of data, whereas the latter uses basis functions that are fixed in advance.

Figure 4: Basis functions calculated by PCA. From left to right and from top to bottom, the basis functions are ordered by the size of the corresponding eigenvalues.
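A minimal NumPy sketch of the segmentation and projection just described, reusing the pca_transform sketch from Sec. 2.1; the numerical values in the usage comment follow the settings reported in Sec. 4, and all names are illustrative assumptions:

```python
import numpy as np

def segment(x, M, step):
    """Decompose a 1-D channel signal x into overlapping blocks of length M.

    Returns an (M, N) data matrix whose columns u_1, ..., u_N are the blocks
    (cf. Fig. 3).
    """
    starts = range(0, len(x) - M + 1, step)
    return np.stack([x[s:s + M] for s in starts], axis=1)

# Illustrative usage for one trial of the C3 channel, left-movement class:
#   U = segment(x_c3_left, M=64, step=8)   # 0.5 s blocks at 128 Hz, 87.5% overlap
#   W_c3_left = pca_transform(U, p=32)     # p x M projection matrix (Sec. 2.1 sketch)
#   V = W_c3_left @ U                      # principal component features v_n = W u_n
```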

3.2. Classification: HMM+SVM

The principal component feature vector from the C3 channel for HMM_L (L stands for "left hand movement") is given by

    v_{C3,L} = W_{C3,L} u,    (8)

and v_{C4,L} is computed analogously by v_{C4,L} = W_{C4,L} u.

In the case of PCA-HMM1, the feature vector for HMM_L, denoted y_L, concatenates the principal components v_{C3,L} and v_{C4,L}, i.e., y_L = [v_{C3,L}^T, v_{C4,L}^T]^T. The feature vector y_R for HMM_R is constructed in the same manner. Both HMM_L and HMM_R are learned from the corresponding sets of features computed from EEG data recorded during the imagination of a left or a right hand movement, respectively. Given test EEG data, we compute the likelihood values P(y|HMM_L) and P(y|HMM_R) and assign the data to the class with the larger likelihood.

In the case of PCA-HMM2, the principal component feature vectors v_{C3,L}, v_{C3,R}, v_{C4,L} and v_{C4,R} are used separately to train the corresponding HMMs. For classification, given test data y, four different likelihood values are produced by HMM_{Ci,L} and HMM_{Ci,R} (i = 3, 4). These likelihood values are fed into the SVM to make the final decision. Although PCA-HMM1 takes the dependence between the channels into account, the dimension of its feature vector is twice that of PCA-HMM2, which causes more complexity.
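The classification rules of PCA-HMM1 and PCA-HMM2 can be sketched as follows with the hmmlearn library for continuous (Gaussian-emission) HMMs; hmmlearn, scikit-learn, and all names below are our assumptions for illustration, not part of the original implementation:

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM   # continuous (Gaussian-emission) HMMs

def train_hmm(feature_seqs, n_states):
    """Fit one Gaussian HMM on a list of (T, d) feature sequences."""
    X = np.vstack(feature_seqs)
    lengths = [len(seq) for seq in feature_seqs]
    model = GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=50)
    model.fit(X, lengths)
    return model

def classify_pca_hmm1(hmm_left, hmm_right, y):
    """PCA-HMM1 rule: pick the class whose HMM gives the larger log-likelihood."""
    return "left" if hmm_left.score(y) > hmm_right.score(y) else "right"

# For PCA-HMM2, the four per-channel log-likelihoods
#   [hmm_c3_l.score(v_c3), hmm_c3_r.score(v_c3), hmm_c4_l.score(v_c4), hmm_c4_r.score(v_c4)]
# would instead be collected into a 4-dimensional vector and passed to an SVM
# (e.g. sklearn.svm.SVC) trained on such vectors to make the final decision.
```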


4. RESULTS

Two bipolar EEG channels were recorded over the left and right sensorimotor areas, close to electrode positions C3 and C4. The EEG was sampled at 128 Hz and bandpass filtered between 0.5 and 30 Hz. Each trial proceeded as follows. From 0 to 2 s a fixation cross was presented, followed by a cue at 2 s. At 3 s an arrow was displayed at the center of the monitor for 1.25 s; depending on the direction of the arrow (left or right), the subject was instructed to imagine a movement of either the left or the right hand. A feedback session then continued from 4.25 to 8.0 s. One session consists of 40 repetitions of this trial (20 left and 20 right). There were four sessions in total, so the number of trials is 160: 80 left and 80 right. We did not use the feedback session, so only the data from 3 to 4.25 s are used [3]. The window size for each data block is 0.5 s with an overlap of 87.5%, and the dimension is reduced to half of the window size (at 128 Hz this corresponds to 64-sample blocks with an 8-sample step and 32-dimensional features).

Figure 5: Classification performance with respect to the number of hidden states: (a) PCA-HMM1 (9, 12, 15, 18, 21 states); (b) PCA-HMM2 (6, 9, 12, 15, 18 states). The vertical axis is classification accuracy (%).

           HMM1     HMM2
PCA        75.70    78.15
Raw        60.63    64.38
Hjorth     56.88    66.50

Table 1: Performance comparison (classification accuracy, %) for PCA features, raw EEG data, and Hjorth parameters under the two structures (HMM1 and HMM2).

Fig. 5 shows box plots of the classification accuracy for PCA-HMM1 and PCA-HMM2 as the number of states varies. The median is shown as a line across the box, the lower and upper quartiles form the lower and upper boundaries of the box, and the minimum and maximum points are shown as lines. The performance of the HMM did not vary much with the number of hidden states.

The performance comparison for three different features (PCA, raw data, and Hjorth parameters) in PCA-HMM1 and PCA-HMM2 is summarized in Table 1. One can observe that PCA features outperform the other features. PCA-HMM2 gave slightly better performance than PCA-HMM1; the reason is that in PCA-HMM2 separate HMMs were trained on principal component features from separate channels. In all cases we did not use the feedback session, but used the cue session between 3 and 4.25 s. With the Hjorth parameters the results are worse than with raw data, because they extract wrong information when the EEG data are mixed with artifacts. Principal component features improved the performance of the HMM by almost 10% and speeded up the convergence of HMM training. Hence PCA is a suitable feature extractor for EEG signals.

5. CONCLUSION

In this paper we have presented two methods for EEG pattern classification which jointly employ PCA and HMM (with SVM for PCA-HMM2). The experimental study showed that PCA is a good feature extractor for time series data. Currently we are investigating theoretically why PCA gives better performance when it is combined with HMM for EEG pattern classification. In addition, the new structure (PCA-HMM2) showed slightly better performance compared to PCA-HMM1. The reason is that PCA is applied to each channel separately, so that the separate HMMs model the data better.

6. ACKNOWLEDGMENT

We thank the Graz BCI research group for sharing their EEG data. This work was supported by ETRI, by the Korea Ministry of Science and Technology under the Brain Science and Engineering Research Program and the International Cooperative Research Program, and by Brain Korea 21.

7. REFERENCES

[1] K. I. Diamantaras and S. Y. Kung, Principal Component Neural Networks: Theory and Applications. John Wiley & Sons, Inc., 1996.

[2] Z. Ghahramani, "An introduction to hidden Markov models and Bayesian networks," International Journal of Pattern Recognition and Artificial Intelligence, vol. 15, no. 1, pp. 9-42, 2001.

[3] C. Guger, A. Schlögl, C. Neuper, D. Walterspacher, T. Strein, and G. Pfurtscheller, "Rapid prototyping of an EEG-based Brain-Computer Interface (BCI)," IEEE Trans. Rehabilitation Engineering, vol. 9, no. 1, pp. 49-58, 2001.

[4] I. T. Jolliffe, Principal Component Analysis, 2nd Edition. Springer, 2002.

[5] B. Obermaier, C. Guger, C. Neuper, and G. Pfurtscheller, "Hidden Markov models for online classification of single trial EEG data," Pattern Recognition Letters, vol. 22, pp. 1299-1309, 2001.

[6] G. Pfurtscheller and C. Neuper, "Motor imagery and direct brain-computer communication," Proc. of the IEEE, vol. 89, no. 7, pp. 1123-1134, 2001.

[7] L. R. Rabiner and B. H. Juang, "An introduction to hidden Markov models," IEEE ASSP Magazine, vol. 3, pp. 4-16, 1986.

[8] B. Schölkopf and A. J. Smola, Learning with Kernels. MIT Press, 2002.

[9] V. N. Vapnik, The Nature of Statistical Learning Theory. Springer, 1995.