Abstract

How the human brain processes phonemes has long been a subject of interest for linguists and neuroscientists. Electroencephalography (EEG) offers a promising approach to observing the neural activity of phoneme processing in the brain, thanks to its high temporal resolution, low cost and noninvasiveness. Studies of Mismatch Negativity (MMN) effects in EEG activity in the 1990s suggested the existence of a language-specific central phoneme representation in the brain. Recent findings using magnetoencephalography (MEG) also suggested that the brain encodes the complex acoustic-phonetic information of speech into representations of phonological features before the lexical information is retrieved. However, very little success has yet been reported in classifying the brain activity associated with phoneme processing.

In my work, I proposed a classification framework which incorporates Principal Components Analysis (PCA), cross-validation and support vector machine (SVM) methods. The initial classification rates were not very good. Progress was made by using a bootstrap aggregation (Bagging) scheme and introducing phase calculations. To calculate phase, I computed the Discrete Fourier Transform (DFT) of the original time-domain signal and kept only the angles of a finite set of frequency components. The resulting EEG spectral representation contains only the phase and frequency information and ignores the amplitudes. Using this method, the accuracy rate of classifying averaged test samples of eight consonants improved from 41% to 51%.

Furthermore, the qualitative analysis of the similarities between the EEG representations, derived from the confusion matrices, illustrates the invariance of brain and perceptual representations of phonemes. For brain and perceptual representations of consonants, voicing is the most distinguishable feature among voicing, continuant and place of articulation. In addition, the feature vowel-height is more robust than vowel-backness in both brain and perceptual representations of vowels.

By extending and further refining these methods, it is likely that significant classification of other phonemes and features can be achieved.



Acknowledgments

First of all, I would like to express my gratitude to my principal advisor, Professor Patrick Suppes, for directing me to this interesting area and giving me his invaluable support and guidance throughout my study. The enthusiasm he has for research is infectious and encouraging. I want to thank Professor Bernard Widrow and Professor Stephen Boyd for helpful advice on both my research and academic progress and very insightful comments on the draft of this dissertation. I would also like to thank Professor Christopher Potts for serving as the chairman of my oral exam and giving valuable suggestions from the perspective of linguistics.

I am very fortunate to have pursued my Ph.D. degree in a supportive and inspiring environment at Stanford University. Being able to work closely with a group of outstanding researchers was important in making my Ph.D. pursuit productive and enjoyable. I am especially grateful to the members of the Suppes Brain Lab. In particular, I would like to acknowledge Marcos Perreau Guimaraes, who gave helpful advice and tips on the SVM-with-Bagging methods of EEG classification and the similarity analysis discussed in this dissertation. Dik Kin Wong, Logan Grosenick, Claudio Carvalhaes, Acacio de Barros and Lene Harbott contributed many thoughtful ideas and asked motivating questions in group discussions. Blair Bohannan and Duc Nguyen helped me collect EEG data.

Finally, I would like to thank my family and my parents for their love and support.


Table of Contents

Chapter 1 Introduction

1.1 Phonemes and distinctive features

1.2 Brain activities in phoneme perception

1.2.1 Measurements of brain activities

1.2.2 Brain activities in phoneme perception

1.3 Motivation and Contribution

1.4 Outline of the thesis

Chapter 2 Relevant EEG Data

2.1 Syllables-I data

2.2 Syllables-III data

2.3 Isolated-vowels data

Chapter 3 Signal Processing Methods for Classifying EEG Data

3.1 EEG pre-processing

3.2 Classifiers based on brain-speech mapping

3.2.1 Methodology

Diagram of the classification model

Speech features

Parameters search

Significance level: p-value

3.2.2 Experimental results

3.3 Support Vector Machine (SVM) classifiers

3.3.1 Methodology

SVM with Bootstrap aggregating

Diagram of the classifier

3.3.2 Classification results

Linear vs. Nonlinear Kernels

Leave-out-one-subject experiment

Experiment on the number of trials to calculate average

Experiment on classifying individual EEG trials using data from single channel

3.4 Summary

Chapter 4 Frequency Analysis of EEG Signals

4.1 EEG signals in frequency domain

4.2 EEG spectral features

4.3 Classification results

4.3.1 Compare the EEG features based on DFT

4.3.2 Frequency selection

Chapter 5 Invariant Similarities between Brain and Perceptual Representations of Phonemes

5.1 Psychological experiments on phoneme perception

5.2 Similarity measurements

5.2.1 Semi-Order and Invariant Partial Order of similarities

5.2.2 Partition tree of similarities

5.3 Experimental data analysis

5.3.1 Vowels

5.3.2 Consonants

Chapter 6 Classifiers Based on Distinctive Features

6.1 Classifying the distinctive features

6.2 Distinctive-feature-based classifiers

6.3 Parallel structure vs. Hierarchical structure

Chapter 7 Conclusion and Prospects

List of References


List of Tables

Table 2.1: The traditional phonological features of the 8 consonants and 4 vowels

Table 2.2: Chomsky-Halle's Distinctive features of the 8 initial consonants

Table 3.1: Results of classifying the 4 consonants of Syllables-I data using brain-speech mapping method

Table 3.2: Phoneme classification results using SVM-with-Bagging method with linear or non-linear kernels

Table 3.3: Leave-one-subject-out classification results using SVM-with-Bagging method

Table 4.1: Comparing the classification rates of 4 EEG spectral features

Table 4.2: SVM-with-Bagging classification results using the EEG phase feature in the frequency range from 2Hz to 9Hz

Table 5.1: Normalized confusion matrices of 4 vowels

Table 5.2: Normalized confusion matrices of 8 consonants

Table 6.1: Classifying the distinctive features

Table 6.2: Vowels classification results using DF-based classifiers

Table 6.3: Initial consonants classification results using DF-based classifiers

Table 6.4: The results of classifying the combination of voicing and continuant using SVM-with-Bagging model


List of Figures

Figure 1.1: Spectrogram of the English syllable /pɑ/ and /fɑ/

Figure 1.2: Comparing the English syllables /fɑ/ and /vɑ/

Figure 2.1: EEG international 10-20 sensor location system

Figure 2.2: The layout of EGI-128 sensors system

Figure 3.1: Example of EEG artifacts removing

Figure 3.2: Independent Components Analysis

Figure 3.3: Diagram of classifying brainwaves of speech stimuli by estimating the mapping between EEG and speech signal

Figure 3.4: Classifying the 4 consonants /p/, /t/, /b/, /g/ of the Syllables-III data using brain-speech mapping method

Figure 3.5: Diagram of SVM with bootstrap aggregating

Figure 3.6: Diagram of the SVM-with-Bagging EEG classifier

Figure 3.7: Mean validation accuracy on parameter search grid (K,C) using linear kernel

Figure 3.8: The changing of 8 initial consonants classification rates with respect to the number of trials to calculate averages

Figure 3.9: Performance of SVM-with-Bagging method on classifying 4 initial consonants using single channel data

Figure 4.1: Average power spectral densities of EEG signal sampled at 250Hz

Figure 4.2: Magnitude and phase frequency response of 1 Hz high-pass filter

Figure 4.3: Mean classification rates of parameter pair (L, H) obtained from 10-fold cross validation

Figure 5.1: The similarities of brain representation and perceptual representation of 4 vowels

Figure 5.2: Invariant partial order between brainwave and perceptual confusions of the vowels

Figure 5.3: The similarities of brain and perceptual representation of 8 consonants

Figure 5.4: Invariant partial order between brainwave confusions and perceptual confusions of the consonants

Figure 6.1: Classifying 4 vowels in F1-F2 space

Figure 6.2: Hierarchical models for classifying 8 classes


Chapter 1 Introduction

1.1 Phonemes and distinctive features

Natural languages are organized hierarchically: sentences are built from phrases, phrases from words, words from syllables and syllables from phonemes. A phoneme is the smallest segmental unit of speech that differentiates meaningful words (Handbook of IPA, 1999). For example, in American English, the words light and right differ only in their initial consonants /l/ and /r/; thus /l/ and /r/ are different phonemes in American English. Two sounds that belong to separate phonemes in one language or dialect may be variants of one phoneme in another language or dialect. (The two sounds are called allophones if they belong to the same phoneme in the language.) It is widely recognized that phonemes are language-specific. All the phonemes studied in this thesis are American English phonemes. In most languages, the number of phonemes ranges from twenty to sixty. Although the pronunciation of a phoneme can be slightly different in various contexts, a phoneme has relatively stable articulatory and acoustic properties. Thus, besides being used to derive and describe phonological rules, the concept of the phoneme is also extensively used in building computational models of natural speech. Most modern large-vocabulary speech recognition systems and speech synthesis systems are based on statistical models of the acoustic features of phonemes. Phonemes are also very important for modeling the brain activities of speech production or perception.

Linguists have proposed that phonemes can be further decomposed into distinctive features. Phonological features such as nasal and stop had been used to describe speech sounds for a long time before the concept of the distinctive feature was proposed. Those features are commonly referred to as traditional features. Such a feature relates to either articulatory or acoustic properties of the sound. Traditional features are not necessarily binary and so may have more than two values. Ladefoged (1982) gave a good summary of the traditional features in his book. Around the middle of the 20th century, Jakobson and Halle introduced the notion of “distinctive features” as the smallest language components that are able to differentiate meaningful units (Jakobson & Halle, 1956). Unlike phonemes, distinctive features can overlap in time; thus they are the suprasegmental elements of language that carry lexical contrasts. Jakobson and Halle also proposed a set of distinctive features and gave both acoustic and articulatory descriptions of them. The Jakobson-Halle distinctive features are binary, which means each feature has two relative values. The distinctive-feature system most commonly used in the literature today is largely taken from Chomsky and Halle's work “The Sound Pattern of English” (1968). Chomsky and Halle proposed 27 distinctive features in total. Each feature takes two values: a positive value, [+], denotes the presence of a feature, while a negative value, [-], indicates its absence. Their feature set is considered to be “universal”, which means the features “represented the phonetic capabilities of man” and are therefore the same for all languages. Any phoneme can be represented as a set of distinctive features. For example, according to the Chomsky-Halle system, /p/ can be represented as: “[-vocalic] [+consonantal] [-high] [-back] [-low] [+anterior] [-coronal] [-voice] [-continuant] [-nasal] [-strident]” (Chomsky & Halle, 1968).
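The idea that a phoneme is a bundle of binary features can be made concrete with a small sketch. The bundle for /p/ below is copied from the Chomsky-Halle example above; the bundle for /b/ is an assumption added only for illustration (it is taken to differ from /p/ in [voice] alone), and the helper names `format_bundle` and `contrast` are hypothetical.

```python
# Feature bundle for /p/, copied from the Chomsky-Halle example in the text.
P = {
    "vocalic": False, "consonantal": True, "high": False, "back": False,
    "low": False, "anterior": True, "coronal": False, "voice": False,
    "continuant": False, "nasal": False, "strident": False,
}

# ASSUMPTION (not stated in the text): /b/ differs from /p/ only in [voice].
B = dict(P, voice=True)


def format_bundle(bundle):
    """Render a bundle in the bracket notation used in the text."""
    return " ".join(f"[{'+' if v else '-'}{k}]" for k, v in bundle.items())


def contrast(a, b):
    """Return the sorted names of the features on which two bundles disagree."""
    return sorted(k for k in a if a[k] != b[k])


print(format_bundle(P))
print(contrast(P, B))
```

Representing phonemes this way makes it easy to ask which features separate two sounds, which is the kind of question the later chapters pose about brain and perceptual confusions.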

Limited by the availability of brain data, we cannot explore all the distinctive features in this thesis. We will focus only on the brain representations of phonemes associated with the following features:

• Height and backness of vowels:

To describe the vowels, we use the vowel features of the International Phonetic Alphabet (IPA) chart: height and backness.

Vowel height is named for the vertical position of the tongue relative to either the roof of the mouth or the aperture of the jaw. In high vowels, such as /i/ and /u/, the tongue is positioned high in the mouth, whereas in low vowels, such as /ɑ/, the tongue is positioned low in the mouth. In the IPA chart, the terms close and open are used to describe the jaw as being relatively closed or open. Although described using articulatory terms, vowel-height is nowadays defined as an acoustic quality according to the relative frequency of the first formant (F1)¹. The higher the F1 value, the lower (more open) the vowel is. Height is thus inversely correlated with F1.

Vowel-backness refers to the position of the tongue during the articulation of a vowel. In front vowels, such as /i/, the tongue is positioned forward in the mouth, whereas in back vowels, such as /u/, the tongue is positioned towards the back of the mouth. Similar to vowel-height, vowel-backness is defined according to the frequency of the second formant (F2). The back vowels have lower F2 values and front vowels have higher F2 values. Thus vowel-backness is inversely correlated to F2.

• Continuant:

Continuant/non-continuant is a feature used to describe consonants. In the production of a continuant sound, the primary constriction of the vocal tract is not completely closed, so the air flow past the constriction is not blocked. Fricatives such as /s/ or /z/ are continuant sounds. When we articulate a fricative sound, the oral tract is held narrow enough that the airflow generates turbulent noise. In speech spectrograms, this friction noise often shows a clear power concentration in a specific frequency range. The non-continuant sounds include stops, such as /p/, /t/ and /g/, and nasal sounds, such as /m/ and /n/. In this thesis, we will only focus on brain representations of plosive stops and fricatives. Plosive stops are characterized by a spectrographic "burst" with an abrupt onset. Figure 1.1 compares the spectrograms of the plosive stop /p/ and the fricative /f/, each followed by the same vowel /ɑ/. The spectrogram of the plosive stop /p/ has a sudden burst of energy across the whole frequency range after a short closure at the beginning of the articulation. The formant pattern of the vowel /ɑ/ emerges shortly after the burst, which shows that the duration of the plosive stop is very short. The spectrogram of /f/ is characterized by high-frequency noise with a gradual onset. In addition, the duration of the fricative /f/ is much longer than that of /p/.

¹ Formants are defined by Fant (1960) as “the spectral peaks of the sound spectrum |P(f)|”. They are produced by resonances of the vocal tract. The lowest resonant frequency is called the first formant (F1), followed by the second (F2) and the third (F3).

Figure 1.1: Spectrogram of the English syllable /pɑ/ and /fɑ/²

• Voicing:

The feature voicing is used to characterize the vibration of the vocal folds, which creates a periodic source wave during the articulation. Voiced sounds are produced with vibration of the vocal folds and voiceless sounds are produced without it. Periodicity is the main characteristic that distinguishes voiced sounds from voiceless sounds. Figure 1.2 shows the waveforms and the spectrograms of /fɑ/ and /vɑ/. The waveform of /v/ has an obvious periodic structure, which comes from the vibration of the vocal folds. The formant-like low-frequency energy distribution in the spectrogram of /v/ is another indicator of voicing.

² The spectrograms in Figure 1.1 and Figure 1.2(b) were generated using Praat (Boersma & Weenink, 2011).


(a) Speech waveforms of syllables /fɑ/ and /vɑ/

(b) Speech spectrograms of syllables /fɑ/ and /vɑ/

Figure 1.2: Comparing the English syllables /fɑ/ and /vɑ/


Phoneticians have also found that the voice onset time (VOT), which denotes the time interval between the release of the articulatory occlusion and the onset of low-frequency periodicity, is the primary perceptual cue for distinguishing voiced stops from voiceless stops (Lisker & Abramson, 1964). The voiceless stops in English, such as /p/, /t/ and /k/, feature a short VOT of around 20 ms. The voiced stops, in contrast, usually have negative VOTs, which means that the voicing onset leads the articulatory release. Negative VOTs are characterized by a low buzz during the consonant closure. Another phonetic attribute that distinguishes the English voiceless stops /p/, /t/ and /k/ from the voiced stops /b/, /d/ and /g/ is aspiration. Aspiration is very important in separating the two sets in initial position, because both sets are commonly produced with silent closure intervals in such cases (Lisker & Abramson, 1964).

• Place of articulation:

Traditionally, the place of articulation of consonants refers to the place and manner of the obstruction of the airflow going through the vocal tract. In English, the obstruction may occur at many places along the oral tract, from bilabial (between the lips) to velar (between the back of the tongue and the soft palate). The production of speech sounds can be simulated as an excitation source, either periodic for voiced sounds or white noise for voiceless sounds, passing through a filter that reflects the shape of the vocal tract. Different places of articulation modify the frequency response of the vocal tract filter and change the spectral properties of the output. In Jakobson & Halle's distinctive feature system, the place of articulation is denoted by several features describing the spectrum of the sound, such as grave/acute, flat and sharp. Chomsky & Halle's distinctive feature system uses a series of cavity features (coronal, anterior, high, low, back, etc.) to characterize the shape of the oral tract for both consonant and vowel articulation.

For plosive stops, the primary acoustic cue of the place of articulation is mostly in the transition portion of the F2 of the vowel that follows. The place of articulation in fricatives changes the resonant frequency of the front vocal cavity and is reflected by the position and shape of the peak in speech spectra. (Johnson, 2003)
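The source-filter view of speech production described under place of articulation can be sketched in a few lines. This is only an illustration under stated assumptions, not the procedure used to produce the stimuli: the sampling rate, fundamental frequency, resonance frequency and bandwidth are arbitrary, and a single two-pole resonator stands in for the full vocal tract filter. All function names are hypothetical.

```python
import cmath
import math
import random


def impulse_train(n_samples, f0, fs):
    """Periodic excitation (voiced source): one pulse per pitch period."""
    period = round(fs / f0)
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]


def white_noise(n_samples, seed=0):
    """Aperiodic excitation (voiceless source): seeded Gaussian noise."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n_samples)]


def resonator(x, fc, bw, fs):
    """Two-pole resonance at fc Hz with bandwidth bw Hz (stand-in for the vocal tract)."""
    r = math.exp(-math.pi * bw / fs)
    a1 = 2.0 * r * math.cos(2.0 * math.pi * fc / fs)
    a2 = -r * r
    y, y1, y2 = [], 0.0, 0.0
    for sample in x:
        out = sample + a1 * y1 + a2 * y2
        y.append(out)
        y2, y1 = y1, out
    return y


def magnitude_at(x, freq, fs):
    """Magnitude of the Fourier sum of x evaluated at a single frequency."""
    w = -2j * math.pi * freq / fs
    return abs(sum(v * cmath.exp(w * n) for n, v in enumerate(x)))


FS = 8000  # assumed sampling rate in Hz
voiced = resonator(impulse_train(800, 100, FS), fc=1000, bw=100, fs=FS)
voiceless = resonator(white_noise(800), fc=1000, bw=100, fs=FS)

# The filter concentrates energy near its resonance for the periodic source.
print(magnitude_at(voiced, 1000, FS) > magnitude_at(voiced, 3000, FS))
```

Moving the resonance (`fc`) changes where the output energy concentrates while leaving the source untouched, which mirrors how a change in place of articulation reshapes the spectrum.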


1.2 Brain activities in phoneme perception

1.2.1 Measurements of brain activities

When a neuron is firing, it generates action potentials, which are discrete electrical pulses, and postsynaptic potentials, which typically last tens or even hundreds of milliseconds. The summation of postsynaptic potentials of thousands of approximately synchronized cortical neurons can induce the potential fluctuations on the scalp. Thus the brain cortical activities can be roughly observed by placing an electrode sensor on the scalp and recording the amplified signal. This technology is called electroencephalography, or EEG.

The magnetic field produced by the electrical activities of cortical neurons can also be measured; this technique is called magnetoencephalography, or MEG. Both EEG and MEG have high temporal resolution and can record the activities at sampling rates of 1 kHz or higher. However, the blurring of the potentials caused by the skull, which is a high-resistance conductor, can be avoided by recording the magnetic field. Thus MEG has better spatial resolution than EEG and provides more precise localization. On the other hand, since the MEG signals are on the order of a few femtoteslas, shielding from external magnetic signals, including the Earth's magnetic field, is necessary. The magnetic shielding equipment, usually a magnetically shielded room, is very expensive and not portable.

Since the development of the functional Magnetic Resonance Imaging (fMRI) technique in the early 1990s, fMRI has rapidly come to dominate the brain mapping field because of its non-invasiveness and high spatial resolution, up to 1 mm. fMRI measures the increased blood flow to regions of increased neural activity, marked by the blood-oxygen-level-dependent (BOLD) signal in a magnetic resonance imaging (MRI) scan. The BOLD response occurs after the increased neural activity with a delay of approximately 1 to 5 seconds and rises to a peak over 4 to 5 seconds. Therefore fMRI has very low temporal resolution. English speech is delivered at a rate of roughly 3 words per second. Thus fMRI by itself cannot be used to observe the details of fast-changing brain activities, such as the processing of phonological or lexical information.

1.2.2 Brain activities in phoneme perception

How the human brain processes phonemes has long been a subject of interest for linguists and neuroscientists. Historically, behavioral experiments on phoneme perception were carried out to explore the psychological discrimination of phonemes under various conditions (Miller & Nicely, 1955; Pickett, 1957; Wang & Bilger, 1973; Phatak et al., 2008). A more detailed introduction to the behavioral experiments can be found in Chapter 5. Since the discovery of Mismatch Negativity (MMN) effects in EEG activities (Näätänen et al., 1978), MMN and its magnetic equivalent, MMNm, have been used extensively to measure the neural activities reflecting subjects' ability to discriminate phonemes (see Näätänen, 2001, for a review). These results also suggested the existence of a language-specific central phoneme representation in the brain and pointed to its probable left-hemisphere locus (Näätänen, 1997). More recently, human brain activities indicating the perception of acoustic cues and of more complex phonological features were examined using MEG recordings (Obleser et al., 2004; Eulitz et al., 2007; Frye et al., 2007). These findings suggested that the brain encodes the complex acoustic-phonetic information of speech into representations of phonological features before the lexical information is retrieved. Invasive recordings of animal neural responses to human speech also demonstrate the temporal and spatial characteristics of the cortical activities reflecting the distinctive features of phonemes (Steinschneider et al., 1995; Steinschneider et al., 2003). Such recordings also show that the discriminability of the neural activity patterns matches the animals' behavioral discrimination of phonemes (Engineer, 2008) as well as the human psychological confusion of phonemes (Mesgarani, 2008). fMRI also provides a non-invasive method to pinpoint the location of the cortical activities of phoneme perception in the healthy human brain (Liebenthal et al., 2005). Formisano (2008) reported success in classifying the brain activities of isolated vowels using fMRI. Considering the limited temporal resolution of the fMRI technique, it would be difficult to extend this work to phonemes that have a more complex time course than vowels presented in isolation.

1.3 Motivation and Contribution

Among the most commonly used technologies that can observe human brain activities, EEG provides a promising method for examining the brain activities of natural language processing because of its low cost, high temporal resolution and non-invasiveness. To study the brain activities of language processing using EEG signals, we need to solve two problems.

First, EEG recordings usually consist of large amounts of data contaminated by considerable noise. Appropriate signal-processing or statistical methods are needed to reduce the noise and extract the meaningful components of the signal that carry the target information. In the ideal scenario, the data can be compressed into a set of parameters, called EEG feature parameters, without much loss of useful information.

Second, we need to develop mathematical models to describe the properties and distributions of the EEG feature parameters of the language processing activity in the brain. The complexity and the computational cost of constructing the mathematical model are closely related to the number of EEG feature parameters. Generally, the smaller the EEG parameter list, the simpler the mathematical model required to describe it.

To demonstrate the effectiveness of an EEG feature parameter or a mathematical model, one of the most convincing approaches is to test whether unknown EEG samples, represented as feature parameters, can be classified using the mathematical model. Researchers in our lab have been working on the statistical problem of classifying EEG brainwaves associated with stimuli of language constituents since the 1990s. We successfully classified brainwaves of sentences (Suppes & Han, 1998; Suppes & Han, 1999; Wong, Perreau-Guimaraes, et al., 2004), words (Suppes & Lu, 1997) and syllables (Suppes, Han, et al., 1999). Classifying the brainwaves of auditory stimuli is often more challenging than classifying those of visually presented linguistic stimuli (Suppes & Han, 1999). An experiment in classifying the brainwaves of phonemes was also reported (Suppes, Perreau-Guimaraes et al., 2009). In this experiment, 42% of trials of 4 consonants were correctly classified. However, the classification method was tested on the syllable data of the 1997 experiment, which collected only 6 channels of 800 trials from each subject. The size of the data is insufficient to test a more complicated classification model.

With this consideration in mind, I designed and implemented a new experiment to collect EEG data of syllables. The experiment focused on 8 consonants and 4 vowels, which were carefully selected to represent 5 distinctive features: voicing, continuant, place of articulation, vowel-height and vowel-backness. The new dataset includes a total of 21540 trials for all 32 syllables. The number of trials from each subject ranges from 3584 to 7168. A supplemental dataset of isolated vowels was also collected in 2010.

The phoneme recognition results reported by Suppes, Perreau-Guimaraes et al. (2009) were obtained using Singular Value Decomposition (SVD) and Linear Discriminant Classification (LDC) methods in a framework with two-layer cross-validation. I kept the original EEG preprocessing module of the framework and modified the classification methods to implement out-of-sample testing, classification of averaged trials and classification using SVM with bootstrap aggregating (Bagging). Introducing SVM made non-linear classification possible. However, the classification results show that the non-linear methods do not improve the classification accuracy. The modified algorithm with a linear kernel can classify 46% of 426 averaged test samples of 8 consonants and 69% of 141 averaged test samples of 4 isolated vowels.
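The bootstrap aggregating idea mentioned above can be sketched as follows. This is a minimal illustration, not the classifier used in this thesis: a nearest-centroid rule is swapped in for the SVM to keep the example dependency-free, the bootstrap is drawn within each class (an assumption made so that every resample contains all classes), and the two-dimensional toy samples and the labels "p" and "b" are made up.

```python
import random
from collections import Counter, defaultdict


def train_centroids(samples):
    """Base learner: per-class mean vector (a stand-in for the SVM)."""
    by_label = defaultdict(list)
    for features, label in samples:
        by_label[label].append(features)
    return {
        label: [sum(col) / len(rows) for col in zip(*rows)]
        for label, rows in by_label.items()
    }


def predict_centroid(centroids, features):
    """Assign the label of the closest class centroid."""
    def sq_dist(center):
        return sum((a - b) ** 2 for a, b in zip(features, center))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def bagging_fit(samples, n_models=25, seed=0):
    """Train one base learner per bootstrap resample of the training set."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for sample in samples:
        by_label[sample[1]].append(sample)
    models = []
    for _ in range(n_models):
        boot = []
        for rows in by_label.values():
            # Resample with replacement inside each class.
            boot.extend(rng.choice(rows) for _ in rows)
        models.append(train_centroids(boot))
    return models


def bagging_predict(models, features):
    """Aggregate the base learners by majority vote."""
    votes = Counter(predict_centroid(m, features) for m in models)
    return votes.most_common(1)[0][0]


# Made-up toy feature vectors, for illustration only.
train = [
    ([0.0, 0.1], "p"), ([0.2, -0.1], "p"), ([-0.1, 0.0], "p"),
    ([5.0, 5.1], "b"), ([4.9, 5.0], "b"), ([5.2, 4.8], "b"),
]
models = bagging_fit(train)
print(bagging_predict(models, [0.1, 0.0]), bagging_predict(models, [5.0, 5.0]))
```

Voting over many resampled learners reduces the variance of a single learner, which is the usual motivation for bagging when training data are noisy.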

I also proposed a new approach to classifying the brainwaves of auditory stimuli: classification by estimating the mapping between the speech signal and the EEG brainwave signal. A preliminary study on estimating the linear transformation between the brainwaves and the speech stimuli has been carried out. For the best subject of the EEG data collected in 1997, the classification model can recognize 45% of individual test trials of 4 consonants, which is slightly better than the result of the SVD-LDC methods.

Furthermore, using the classification model with Bagging SVM, I explored the frequency-domain representations of EEG brainwaves evoked by phoneme stimuli. I found that the EEG signals can be classified without loss of accuracy when the amplitude information of DFTs is eliminated. For classifying the averaged test samples of 8 consonants, the accuracy rate increased to 51% if only the phase pattern of frequency components from 2Hz to 9Hz is used.
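The phase-only representation described above can be sketched as follows. This is a hedged illustration for a single channel: the 50 Hz sampling rate, the one-second synthetic signal and the function name `phase_features` are assumptions chosen so that the DFT bins fall on whole hertz, and the naive DFT is used only to keep the example dependency-free.

```python
import cmath
import math


def phase_features(signal, fs, f_lo=2.0, f_hi=9.0):
    """Return (frequency, phase) pairs of the DFT bins inside [f_lo, f_hi] Hz.

    Only the angle of each coefficient is kept; the amplitude is discarded.
    """
    n = len(signal)
    feats = []
    for k in range(n // 2 + 1):
        freq = k * fs / n
        if f_lo <= freq <= f_hi:
            coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                        for i, x in enumerate(signal))
            feats.append((freq, cmath.phase(coeff)))
    return feats


FS = 50.0  # assumed sampling rate in Hz
N = 50     # one second of data, so bins are 1 Hz apart
# Synthetic single-channel signal: 4 Hz at phase 0.5 rad plus 7 Hz at phase -1.0 rad.
sig = [math.cos(2 * math.pi * 4 * i / N + 0.5)
       + 0.5 * math.cos(2 * math.pi * 7 * i / N - 1.0)
       for i in range(N)]

feats = dict(phase_features(sig, FS))
print(len(feats), round(feats[4.0], 6), round(feats[7.0], 6))
```

Because only the angles are kept, rescaling the signal by a positive constant leaves the feature values at the active frequencies unchanged, which is what it means to eliminate the amplitude information.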

I analyzed the similarities between the EEG representations, derived from the confusion matrices obtained using the Bagging SVM methods, and demonstrated the invariant similarities of brain and perceptual representations of phonemes. For brain and perceptual representations of consonants, voicing is the most distinguishable feature among voicing, continuant and place of articulation. In addition, the feature vowel-height is more robust than vowel-backness in both brain and perceptual representations of vowels.
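One simple way to read similarities off a confusion matrix is sketched below. It is an assumption-laden illustration, not the semiorder analysis of Chapter 5: the counts are made up, the class labels A to D are placeholders rather than phonemes, and the symmetric average of the two confusion rates is just one possible similarity score.

```python
from itertools import combinations


def row_normalize(counts):
    """Turn raw confusion counts into per-class confusion rates (rows sum to 1)."""
    return {
        true: {pred: c / sum(row.values()) for pred, c in row.items()}
        for true, row in counts.items()
    }


def symmetric_similarity(conf):
    """Score each unordered pair by the mean of its two confusion rates."""
    return {
        (a, b): 0.5 * (conf[a][b] + conf[b][a])
        for a, b in combinations(sorted(conf), 2)
    }


# Made-up counts: rows are presented classes, columns are predicted classes.
counts = {
    "A": {"A": 7, "B": 2, "C": 1, "D": 0},
    "B": {"A": 3, "B": 6, "C": 0, "D": 1},
    "C": {"A": 1, "B": 0, "C": 8, "D": 1},
    "D": {"A": 0, "B": 1, "C": 2, "D": 7},
}

sim = symmetric_similarity(row_normalize(counts))
most_similar = max(sim, key=sim.get)
print(most_similar, sim[most_similar])
```

Pairs that are confused more often receive higher scores, so the ranking of pairs can be compared between brainwave-based and perceptual confusion matrices.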

I further refined the Bagging SVM classification model based on the findings that the brainwaves evoked by different phonemes with similar phonological properties are close to each other in the EEG feature domain. A simplified classification model based on distinctive features was proposed. In this model, brainwaves of phonemes are classified using the ensemble of binary classifiers, one for each distinctive feature. The binary classifiers can be organized hierarchically. This simplified classifier can recognize 47% of test samples of the 8 consonants and 65% of test samples of the 4 isolated vowels, which is slightly worse than the original Bagging SVM classification model. However, the distinctive-feature-based classifier can be directly extended to classify more phonemes.
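The parallel arrangement of binary classifiers described above can be sketched as a lookup from feature decisions to phonemes. The three-bit coding (voiced, continuant, labial) is an assumption chosen for illustration so that each of the 8 consonants receives a unique code; it is not necessarily the feature set used in Chapter 6. The threshold functions stand in for trained binary classifiers, and the score names are hypothetical.

```python
# ASSUMPTION: illustrative binary coding (voiced, continuant, labial).
CODES = {
    "p": (0, 0, 1), "t": (0, 0, 0),
    "b": (1, 0, 1), "g": (1, 0, 0),
    "f": (0, 1, 1), "s": (0, 1, 0),
    "v": (1, 1, 1), "z": (1, 1, 0),
}
DECODE = {code: phoneme for phoneme, code in CODES.items()}


def classify_parallel(binary_classifiers, sample):
    """Run one binary classifier per feature and decode the joint decision."""
    bits = tuple(int(clf(sample)) for clf in binary_classifiers)
    return DECODE[bits]


# Hypothetical stand-ins for trained classifiers: each thresholds one score.
classifiers = [
    lambda s: s["voicing_score"] > 0,
    lambda s: s["continuant_score"] > 0,
    lambda s: s["labial_score"] > 0,
]

sample = {"voicing_score": 0.8, "continuant_score": -0.3, "labial_score": 0.4}
print(classify_parallel(classifiers, sample))
```

With this design, adding a phoneme only requires a new entry in the code table, provided the existing feature classifiers can separate it; this is the sense in which a feature-based classifier extends directly to more phonemes.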


1.4 Outline of the thesis

Chapter 2 gives the detailed description of the EEG data used in my work and the experiment setup for collecting the EEG recordings.

In Chapter 3, I introduce two models to classify brainwaves of phonemes: One is the brain-speech mapping method and the other is the classifier with Bagging SVM. The first method was tested on classifying the individual trials of 4 consonants using Syllables-I and Syllables-III data. The second method, which focuses on classifying averaged test samples, was tested using Syllables-III and isolated-vowels data.

In Chapter 4, I examine EEG representations of phonemes in the frequency domain by classifying the EEG responses to phonemes using four EEG spectral features. The feature DFT consists of the Discrete Fourier Transform (DFT) coefficients of the EEG time-domain signals, computed channel by channel. The feature AMP is composed of the amplitudes of all the frequency components of the DFT coefficients. In the features PHS-1 and PHS-2, the amplitudes of the DFT coefficients are eliminated and only the phase information is kept. The classification results for the four spectral features are discussed. I also identify the frequency range of the rhythmic EEG activities that are related to phoneme perception, using our experimental data.

I analyze the similarities between the brainwave representations using the classification confusion matrices in Chapter 5. The graphs of semiorder and hierarchical trees are used to illustrate the similarities. The brain similarities of the phonemes are compared with perceptual similarities of phonemes obtained from psychological experiments.

In Chapter 6, the results of classifying distinctive features using Bagging SVM methods are discussed. I also extend the Bagging SVM algorithm to classify speech stimuli based on distinctive features and present the experimental results.

Chapter 7 concludes the thesis.


Chapter 2 Relevant EEG Data

Three datasets of EEG recordings of phoneme perception are used in our study. All these EEG experimental data were collected in our laboratory.

2.1 Syllables-I data

These EEG recordings of auditory syllable stimuli were collected in November 1998 as an exploratory experiment. The experiment addressed 8 consonant-vowel (CV) format syllables and 24 syllable pairs made up of 4 consonants (/p/, /t/, /b/ and /g/) and 3 vowels (/ɑ/ as in spa, /u/ as in zoo and /oʊ/ as in boat). The stimulus syllables are listed below:

/tu/, /pɑ/, /gu/, /bɑ/, /toʊ/, /pu/, /goʊ/, /bu/

/babu/, /bɑpɑ/, /bubɑ/, /goʊgu/, /goʊtu/, /gugoʊ/, /gutoʊ/, /gutoʊ/

/pɑpu/, /pubɑ/, /pupɑ/, /tugoʊ/, /tugu/, /tutoʊ/, /toʊgoʊ/, /toʊtu/

/bɑpu/, /bupɑ/, /bupu/, /goʊtoʊ/, /pɑbɑ/, /pɑbu/, /pubu/, /toʊgu/

All 32 speech stimuli were spoken by a male native speaker of American English, who is also the speaker of the stimuli in the other two experiments. We presented the auditory stimuli to participants via stereo speakers. The 32 stimuli were randomized and presented to the subject 12 times as the first part of the session. Then, after a short break, all the stimuli were presented again 13 times as the second part. Nine subjects participated in the experiment, but only the data from 3 subjects were used in this thesis. The subjects were instructed to listen to the stimuli attentively; no behavioral response was required. The trial length, measured from the onset of one syllable to the onset of the next, is 2050 ms. In total, 800 trials were collected from each subject. Model-12 Grass amplifiers and Neuroscan's Version 3.0 software were used to measure and record the EEG data. Sensors were attached to the scalp of the subjects according to the standard EEG 10-20 system, as shown in Figure 2.1.

Figure 2.1: EEG international 10-20 sensor location system.

Only 6 sensors, C3, C4, T3, T4, T5 and T6, were connected in the first part. In the second part, an additional sensor, Cz, was also connected. Previous analysis results on this dataset were reported in (Suppes, 1999; Suppes, 2009).

2.2 Syllables-III data

In 2008, we collected a new dataset of EEG recordings of the perception of 32 CV syllables, each made of one of the eight consonants /p/, /t/, /b/, /g/, /f/, /s/, /v/ and /z/ and one of the four vowels /i/ (see), /æ/ (cat), /u/ (zoo) and /ɑ/ (spa). The experiment was designed with several considerations in mind. First, we wanted to check whether the significant classification accuracies on initial consonants obtained with the Syllables-I EEG data (Suppes, 2009) are repeatable. Second, we extended the initial consonants from the 4 plosive stops to a set of 8 consonants in order to investigate three major phonological features of consonants: voicing, continuant (stop versus fricative) and place of articulation. We also carefully selected the vowels so that they lie at the corners of the American-English vowel space and hence are acoustically separated. Table 2.1 and Table 2.2 list the phonological features of the consonants and vowels. Moreover, EEG collection techniques have improved significantly in recent years. The newest equipment, which supports up to 128 sensors, can record EEG activities with much higher spatial resolution.

Table 2.1: The traditional phonological features of the 8 consonants and 4 vowels

Consonants
              voiceless               voiced
              Labial    Alveolar      Labial    Alveolar/Velar
stop          p         t             b         g
fricative     f         s             v         z

Vowels (rows: backness)
              open      close
front         æ         i
back          ɑ         u

Table 2.2: Chomsky-Halle's distinctive features of the 8 initial consonants

                              p   t   b   g   f   s   v   z
Cavity features  High         -   -   -   +   -   -   -   -
                 Back         -   -   -   +   -   -   -   -
                 Coronal      -   +   -   -   -   +   -   +
                 Anterior     +   +   +   -   +   +   +   +
Source features  Voiced       -   -   +   +   -   -   +   +
                 Strident     -   -   -   -   +   +   +   +
Manner           Continuant   -   -   -   -   +   +   +   +

We recorded the Syllables-III data using EGI's Geodesic EEG System (GES) 300 platform. To take the variation of pronunciation into account, the auditory stimuli include 7 repetitions of each of the 32 syllables, read by a male native speaker of American English. The recordings are saved as 44.1 kHz mono WAV files. In a brainwave collection session, all 224 sound stimuli were presented to the subject 4 times in pseudo-random order using stereo speakers. The subjects were instructed to listen to the sounds attentively while looking at a focus point on the computer screen. We recorded the EEG data at a sampling rate of 1000 Hz using the EGI 128-sensor system, with 124 monopolar channels sharing a common reference Cz. Two bipolar reference channels of eye movements were also recorded. The locations of the sensors are shown in Figure 2.2. The length of one trial of brainwave recording is one second. In total, 24 sessions from 4 subjects were collected. The number of trials from one subject ranges from 3584 to 7168. The complete dataset includes about 672 brainwave recordings of each syllable. Therefore we obtained approximately 672×4=2688 recordings for each consonant and 672×8=5376 recordings for each vowel.
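The trial counts quoted above follow directly from the experimental design. The short sketch below only reproduces that arithmetic; the variable names are illustrative and do not come from the original analysis code.

```python
# Trial bookkeeping for the Syllables-III design (illustrative names).
N_CONSONANTS, N_VOWELS = 8, 4
REPETITIONS = 7      # spoken versions of each syllable
PRESENTATIONS = 4    # times each stimulus is played in one session
SESSIONS = 24        # sessions over all 4 subjects

syllables = N_CONSONANTS * N_VOWELS              # 32 CV syllables
stimuli = syllables * REPETITIONS                # 224 sound files
trials_per_session = stimuli * PRESENTATIONS     # 896 trials
total_trials = trials_per_session * SESSIONS     # 21504 trials
per_syllable = total_trials // syllables         # 672 recordings
per_consonant = per_syllable * N_VOWELS          # 2688 recordings
per_vowel = per_syllable * N_CONSONANTS          # 5376 recordings
```

With 896 trials per session, the per-subject range of 3584 to 7168 trials corresponds to between 4 and 8 sessions per subject.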

Figure 2.2: The layout of EGI-128 sensors system


2.3 Isolated-vowels data

The isolated-vowels data, recorded in 2010, are complementary to the Syllables-III data. We recorded 7 repetitions of the 4 vowels used in the Syllables-III data, spoken by the same speaker. In one EEG collection session, the 28 stimuli were presented to the subject 32 times in random order using the same experimental setup as for Syllables-III. We recorded 8 sessions from one subject and collected 1792 trials for each isolated vowel.


Chapter 3 Signal Processing Methods for Classifying EEG Data

3.1 EEG pre-processing

The potential changes on the scalp generated by cortical neuron activity are as small as a few microvolts. The EEG signals of interest are usually submerged in a large amount of electrical noise. There are two major types of noise: noise from the equipment and environment, and noise from other biological sources. The environmental noise includes the AC power-supply noise, which is around 50-60 Hz, the noise from the computers used for presenting stimuli and recording EEG data, and the noise from the analog amplifiers, which amplify the EEG signal by several orders of magnitude. The biological noise comes from activities such as eye blinks, heartbeats and muscle contractions. Therefore, before applying any analysis or classification method, we need to pre-process the EEG data to obtain cleaner signals.

We used digital filters to remove most of the environmental noise. A high-pass filter with a cut-off frequency of 1 Hz removes the DC offset of the equipment and the slow artifacts associated with skin conductance fluctuations. The AC power-line noise can be removed by a notch filter at 60 Hz. Our previous studies on EEG responses to language stimuli show that the frequency components between 2 and 30 Hz are more important for classification (Suppes, 1999). Thus, in the present research, we usually down-sample the EEG signals to 50-60 Hz after applying anti-aliasing filters. The down-sampling significantly reduces the dimension of the data to be analyzed and removes the high-frequency noise as well.
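As a rough illustration of the filtering chain described above, the following sketch uses SciPy. This is an assumption: the thesis does not state which filtering routines were used. The filter order, the notch quality factor `Q=30`, the zero-phase filtering, and the decimation factor of 16 (1000 Hz down to 62.5 Hz) are illustrative choices, and `preprocess` is a hypothetical helper name.

```python
import numpy as np
from scipy import signal

def preprocess(eeg, fs=1000.0):
    """High-pass at 1 Hz, notch at 60 Hz, then down-sample by 16.

    eeg: array of shape (channels, samples). Returns (filtered, new_fs).
    """
    # 4th-order Butterworth high-pass (order is an assumption), applied
    # forward and backward so that no phase shift is introduced.
    sos = signal.butter(4, 1.0, btype="highpass", fs=fs, output="sos")
    x = signal.sosfiltfilt(sos, eeg, axis=-1)

    # Narrow notch at the 60 Hz power-line frequency.
    b, a = signal.iirnotch(60.0, Q=30.0, fs=fs)
    x = signal.filtfilt(b, a, x, axis=-1)

    # decimate() low-pass filters (anti-aliasing) before down-sampling.
    # Two stages of 4 give an overall factor of 16.
    for _ in range(2):
        x = signal.decimate(x, 4, ftype="fir", axis=-1, zero_phase=True)
    return x, fs / 16.0
```

Splitting the decimation into two stages keeps each anti-aliasing filter short; a single stage with a larger factor would also be a reasonable choice.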


The noise from other biological activities has a different character. Figure 3.1(a) shows 4 seconds of EEG recordings from the first 60 sensors of the EGI 128-sensor system, sampled at 1 kHz. The muscle-contraction noise is characterized by a burst of high-frequency noise and usually disappears after low-pass filtering, as seen in the down-sampled data in Figure 3.1(b). Eye-blink artifacts are short, high-amplitude peaks, commonly seen at the prefrontal electrodes.

Figure 3.1(a): Original EEG recording with 1 kHz sampling rate (eye-blink and muscle-contraction artifacts are marked)

Figure 3.1(b): EEG signal after high-pass filtering and down-sampling to 62.5 Hz

Figure 3.1(c): Resulting EEG signal after removing eye artifacts

Figure 3.1: Example of EEG artifact removal

The eye movement artifacts can be removed by visually inspecting the trials and rejecting the contaminated ones. But this is not practical in our study considering the large amount of data involved; for instance, more than 20000 trials were collected in the Syllables-III experiment. Since eye blinks and eye movements are usually independent of the brain responses to the stimuli, we can eliminate the artifacts from eye movements efficiently using Independent Component Analysis (ICA).

The ICA method solves the problem illustrated in Figure 3.2. Assume there are n independent signal sources in the target region, $s_1, s_2, \ldots, s_n$, and the source signals are transmitted instantaneously to the m receptors on the scalp, $r_1, r_2, \ldots, r_m$. At each receptor, the received signal is a weighted mixture of the sources:

$$r_i = \sum_{j=1}^{n} a_{ij} s_j, \quad i = 1, 2, \ldots, m \tag{3.1}$$

Then we have $r = As$, where $r \in \mathbb{R}^m$, $s \in \mathbb{R}^n$ and $A \in \mathbb{R}^{m \times n}$. A is often referred to as the mixing matrix. When $m = n$ and A is invertible, let $W = A^{-1}$; then the sources can be recovered as $s = Wr$. Here W is called the un-mixing matrix. In practice, A is always unknown and the ultimate goal of ICA is to find the un-mixing matrix that maximizes the statistical independence of the sources.

In our study, we took all the signals from the monopolar channels as the received signal r and estimated the un-mixing matrix using the Infomax method (Jung et al., 2000; Bell & Sejnowski, 1995). Next, we calculated the correlation coefficients between the derived sources and the signals from each of the reference channels, which were placed around the eyes to record the horizontal and vertical eye movements. If the method works well, most of the correlation coefficients should be very low. We then removed the sources that were highly correlated with the reference channels by setting them to zero. More specifically, we removed all the independent sources with a correlation coefficient higher than 0.2 in our experiments. Finally, the remaining sources were re-mixed to reconstruct the monopolar signals. Figure 3.1(c) shows the reconstructed EEG monopolar signals using the signals in Figure 3.1(b) as the input. We can see that the eye-blink artifacts were removed.
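The removal step can be sketched as follows, assuming an un-mixing matrix W has already been estimated (the Infomax estimation itself is not implemented here). Two details are assumptions made for this sketch: the threshold is compared against the absolute value of the correlation, and the function name `remove_eye_sources` is hypothetical.

```python
import numpy as np

def remove_eye_sources(r, refs, W, threshold=0.2):
    """Zero the sources correlated with eye channels and re-mix.

    r:    (m, T) monopolar signals
    refs: (q, T) eye-movement reference channels
    W:    (m, m) un-mixing matrix, assumed invertible
    Returns the cleaned signals and a boolean mask of removed sources.
    """
    s = W @ r                                   # recovered sources, s = W r
    n_src = s.shape[0]
    # Correlation between every source (rows) and every reference (columns).
    corr = np.corrcoef(np.vstack([s, refs]))[:n_src, n_src:]
    bad = np.abs(corr).max(axis=1) > threshold
    s_clean = s.copy()
    s_clean[bad] = 0.0                          # drop eye-related sources
    A = np.linalg.inv(W)                        # mixing matrix, A = W^{-1}
    return A @ s_clean, bad
```

Because the removed sources are set to zero before re-mixing, the cleaned signals are exactly the contribution of the remaining sources to each channel.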



Figure 3.2: Independent Components Analysis


3.2 Classifiers based on brain-speech mapping

3.2.1 Methodology

This section introduces the preliminary study of classifying EEG brainwaves of phoneme stimuli by estimating the mapping between the speech signal and the EEG brainwave signal. The basic idea underlying this approach is to consider the whole phoneme perception process in the brain as the activity of a system in a “black box”. The only observable aspects of the system are the input, which is the sound wave of the speech stimulus, and the output, which is the EEG brainwave. Hence, if we could estimate the inverse system, we would be able to map brainwaves back to approximate speech inputs, and classify the brainwaves by comparing the estimated inputs to the speech prototype candidates.

Diagram of the classification model

The classification procedure is shown in Figure 3.3.

Figure 3.3: Diagram of classifying brainwaves of speech stimuli by estimating the mapping between EEG and speech signal


At the pre-processing phase, the EEG signals are down-sampled and filtered. The speech waves of the stimuli are represented by feature vectors of reduced size and, at the same time, a prototype of the speech signal is created for each phoneme. Details of the speech signal processing are given later. Then the EEG data are randomly divided into a training set and a test set. The training/test partition is balanced across the stimuli; in other words, the numbers of training trials associated with each stimulus are equal. We compute the optimal mapping $\hat{F}$, which minimizes the mean square estimation error between F(x) and y. Figure 3.3 shows the scheme of estimating one global transformation that is applied to all the classes. Alternatively, we could assume that the transformation between brainwaves and speech is unique for each phoneme. In this case, N transformations should be estimated for the N-class classification problem. A test sample x is classified as:

$$\hat{k} = \arg\min_{k = 1, \ldots, N} \left\| \hat{F}(x) - \tilde{Y}_k \right\|^2 \quad \text{for the global transformation} \tag{3.2}$$

$$\hat{k} = \arg\min_{k = 1, \ldots, N} \left\| \hat{F}_k(x) - \tilde{Y}_k \right\|^2 \quad \text{for class-specific transformations} \tag{3.3}$$

For exploratory purposes, we assume the transformation is linear, i.e. $F(x) = Ax$. If we estimate a linear transformation using m training samples $(x^{(i)}, y^{(i)})$, $i = 1, \ldots, m$, where $x^{(i)} \in \mathbb{R}^n$ denotes the observed EEG signal and $y^{(i)} \in \mathbb{R}^p$ are the features of the associated speech stimulus, the optimal linear transformation $A \in \mathbb{R}^{p \times n}$ is the solution of the least-squares optimization problem:

$$\min_{A} \sum_{i=1}^{m} \left\| A x^{(i)} - y^{(i)} \right\|^2 \tag{3.4}$$

which can be easily calculated as:

$$A = \left( \sum_{i=1}^{m} y^{(i)} x^{(i)T} \right) \left( \sum_{i=1}^{m} x^{(i)} x^{(i)T} \right)^{-1} \tag{3.5}$$


When the number of training samples m is too small compared to the number of variables in x, the matrix $\sum_{i=1}^{m} x^{(i)} x^{(i)T}$ will be singular or close to singular, and its inverse cannot be computed reliably. Thus, to get an accurate estimate of the transformation matrix, we need sufficient training samples and the EEG observation vector cannot be too long.

Speech features

To appropriately represent the speech stimuli, we look for speech features that have a size comparable to the EEG brainwaves and are also able to distinguish different phonemes. The Mel-Frequency Cepstral Coefficients (MFCC), which describe the temporal-spectral distribution of speech, have proved to be successful features for representing speech signals and are commonly used in modern speech recognition systems (Rabiner & Juang, 1993). We therefore use MFCC as the speech features and to construct the prototypes of the phoneme stimuli. The speech pre-processing procedure includes the following steps:

1) We manually examine the audio files of speech stimuli and mark the beginning and end time of the phonemes. Because of the co-articulation, the boundaries between the adjacent phonemes are not well-defined. As a result, the segmentation of phonemes can only be roughly determined.

2) The speech segments of targeted phonemes are cut into 30ms short frames, with 20ms overlap.

3) We calculate the 12th-order MFCC speech features of each frame.

4) For each stimulus, we compute the average of the feature vectors across all the frames of the targeted phoneme. The average vector is the training target vector Y.

5) We average the MFCC features of all the frames corresponding to the initial consonant k to obtain the prototype $\tilde{Y}_k$.
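The framing and averaging steps above can be sketched in plain Python. The MFCC computation itself is not shown, since it depends on the toolbox used; `frame_bounds` and `mean_vector` are hypothetical helper names, and rounding the frame length to whole samples is an assumption.

```python
def frame_bounds(n_samples, fs, frame_ms=30.0, overlap_ms=20.0):
    """Start and end sample indices of overlapping analysis frames."""
    frame = int(round(fs * frame_ms / 1000.0))
    # A 30 ms frame with 20 ms overlap advances by 10 ms each step.
    hop = int(round(fs * (frame_ms - overlap_ms) / 1000.0))
    return [(s, s + frame) for s in range(0, n_samples - frame + 1, hop)]

def mean_vector(vectors):
    """Element-wise average of equally sized feature vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]
```

At the 44.1 kHz sampling rate of the stimuli, each frame holds 1323 samples and consecutive frames start 441 samples apart. Applying `mean_vector` to the per-frame features of one stimulus gives the target vector of step 4; applying it to all frames of one consonant gives the prototype of step 5.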

Parameter search

Our previous studies show that when we classify EEG signals in the time domain, the classification rates may be improved if we only use the observations within a given temporal interval (Wong, 2004). But the best temporal interval is data-specific and task-specific. In our experiments, we used Q-fold cross-validation to search for the best interval on a parameter grid. The two parameters to be optimized are the start point of the interval s and the interval duration d. The candidate parameter values form a search grid of points (s, d). In cross-validation, all the training trials are randomly divided into Q groups of equal size. At each validation step, one of the Q groups is used for testing and the other Q-1 groups are combined for training. A classification rate is obtained for each point of the search grid. The optimal parameters are those that maximize the average number of correctly classified trials across the Q validation tests.

Significance level: p-value

The p-value is a statistical measure of the significance of experimental results. Consider coin-flipping experiments. Suppose that in one experiment we get 7 heads out of 10 flips, while in another experiment 70 heads show in 100 flips. Although both experiments estimate the probability of observing a head as 70%, we can claim with more confidence that the coin used in the second one is biased. The p-value is the probability that the outcome is at least as extreme as the actually observed value, assuming the null hypothesis is true. In this example, the null hypothesis ($H_0$) is that the coin is fair, i.e., the chance of observing a head in one flip is 0.5. Then the p-value of the first experiment is:

$$\Pr(\text{heads} \geq 7 \mid H_0) = \sum_{i=7}^{10} \binom{10}{i} 0.5^{i} (1 - 0.5)^{10-i} = 0.1719 \tag{3.6}$$

For the second experiment:

$$\Pr(\text{heads} \geq 70 \mid H_0) = \sum_{i=70}^{100} \binom{100}{i} 0.5^{i} (1 - 0.5)^{100-i} = 3.93 \times 10^{-5} \tag{3.7}$$


The smaller the p-value, the more confident we can be in rejecting the null hypothesis, and hence the more significant the result is.

In the N-class EEG classification problem, the null hypothesis is that the classifier cannot recognize any test sample and randomly assigns a label to each sample. The probability that one test sample is correctly recognized is p=1/N under the assumption of the null hypothesis. Thus if k of m test samples are classified correctly in one experiment, the p-value of the result is:

m m i im value-p     pp )1( (3.8) ki  i 

3.2.2 Experimental results

We first tested the classifier based on brainwave-speech mapping by classifying the 4 initial consonants, /p/, /t/, /b/ and /g/, of the Syllables-I data. The data from six bipolar channels, C3-T5, C4-T6, T3-C3, T4-C4, T5-T3 and T6-T4, were down-sampled to 50 Hz and passed through a 4th-order Butterworth band-pass filter with cut-off frequencies at 2 Hz and 20 Hz. The consonants were classified separately for each channel and each subject. For each subject, we collected 24 EEG trials of each stimulus. We randomly drew 16 trials for training and used the remaining 8 for testing. Since there are 8 syllables that start with a given consonant, we have in total 8 × 16 = 128 training trials and 8 × 8 = 64 test trials for each class. The total number of test trials is 256. The EEG interval defined by the start time s and duration d is optimized using 8-fold cross-validation. Table 3.1 summarizes the classification rates and significance levels of the results.

We can see that the classification accuracies vary widely among subjects. For subject AB, 44.9% of the 256 test trials are correctly classified using the best channels, with a p-value less than $10^{-11}$. Subject PS shows a slightly lower classification rate of 39.8%, with a p-value less than $10^{-6}$. The significance levels of these results are high enough to demonstrate the effectiveness of the model on those subjects. However, the classification model barely works for subject SO. The classifier estimating class-specific transformations works better than the classifier using a global transformation.

Table 3.1: Results of classifying the 4 consonants of Syllables-I data using brain-speech mapping method

                      Class-specific transformation    Global transformation
Subject   Channel     Rate      p-value                Rate      p-value
AB        C3-T5       37.9%     <10^-5                 31.6%     0.0098
AB        C4-T6       44.9%     <10^-11                35.5%     <10^-3
AB        T3-C3       44.9%     <10^-11                33.2%     0.0020
AB        T4-C4       44.9%     <10^-11                31.6%     0.0098
AB        T5-T3       39.8%     <10^-6                 32.8%     0.0030
AB        T6-T4       44.9%     <10^-11                28.1%     0.1399
PS        C3-T5       37.5%     <10^-5                 28.1%     0.1399
PS        C4-T6       31.3%     0.0141                 30.5%     0.0275
PS        T3-C3       37.1%     <10^-4                 33.2%     0.0020
PS        T4-C4       39.8%     <10^-6                 31.6%     0.0098
PS        T5-T3       38.7%     <10^-6                 34.4%     <10^-3
PS        T6-T4       36.3%     <10^-4                 34.0%     <10^-3
SO        C3-T5       25.4%     0.4665                 28.5%     0.1110
SO        C4-T6       29.7%     0.0504                 28.5%     0.1110
SO        T3-C3       24.2%     0.6369                 27.0%     0.2558
SO        T4-C4       30.5%     0.0275                 32.0%     0.0068
SO        T5-T3       26.2%     0.3553                 24.6%     0.5812
SO        T6-T4       25.0%     0.5240                 27.0%     0.2558

To check how the brain-speech mapping method performs when a large amount of training data is available, we classified the same 4 initial consonants of the Syllables-III data using the classifier with class-specific transformation matrices. We combined all 8 sessions from subject LK and obtained 32 trials for each stimulus, 24 of which were used for training and 8 for testing. Hence each transformation matrix can be estimated using 24 × 4 × 7 = 672 instances, and in total 4 × 4 × 7 × 8 = 896 trials are available to test the classification accuracy.


The classification was run on each of the 124 monopolar channels separately, and the classification rates of all the channels are shown on a brain map in Figure 3.4. Each number on the brain map denotes the classification rate obtained with the monopolar channel data collected from the sensor at the corresponding scalp location. Although the classification rate is not improved (it is 36% for the best channels), the significance level of the results is very high (p-value < $10^{-11}$) because more test trials were available. The brain map also shows that the signals from channels located over the left hemisphere carry more information about the phoneme than those from the right channels. The best rates were obtained from the channels around the left ear.

Figure 3.4: Classifying the 4 consonants /p/, /t/, /b/, /g/ of the Syllables-III data using brain-speech mapping method


3.3 Support Vector Machine (SVM) classifiers

3.3.1 Methodology

This section proposes a different approach to classifying the EEG signals of phoneme stimuli. This method follows the traditional pattern classification strategy. The trials from each class, i.e. each phoneme, are given a unique class label. For example, to classify the 8 initial consonants in the Syllables-III data, we can label the 8 classes using the numbers 1 to 8, which stand for /p/, /t/, /b/, /g/, /f/, /s/, /v/ and /z/ respectively. In other words, neither acoustic nor phonological information about the speech stimuli is taken into account in classification.

The main scheme underlying the statistical classification approach is SVM with bootstrap aggregating. I first introduce the idea of SVM with bootstrap aggregating and then describe the diagram of the classifier.

SVM with Bootstrap aggregating

We use a soft-margin SVM as the basic classification unit in this classification model (Cortes & Vapnik, 1995). The original SVM is a binary classifier that looks for a separating hyperplane that maximizes the margin, which is the distance from the hyperplane to the nearest training data points of either class. If the training data consist of m samples $(x^{(i)}, y^{(i)})$, $i = 1, \ldots, m$, with $x^{(i)} \in \mathbb{R}^n$ and $y^{(i)} \in \{-1, 1\}$ denoting the class label of each sample, we write the hyperplane as the set of points x that satisfy:

$$w^T x + b = 0 \tag{3.9}$$

When the training data are separable, the optimal hyperplane can be found by solving the optimization problem:

$$\max_{\gamma, w, b} \; \frac{\gamma}{\|w\|} \quad \text{subject to} \quad y^{(i)} (w^T x^{(i)} + b) \geq \gamma, \quad i = 1, \ldots, m \tag{3.10}$$

where $\gamma$ is the margin. With the scaling constraint $\gamma = 1$, the optimization problem is equivalent to

$$\min_{w, b} \; \frac{1}{2} \|w\|^2 \quad \text{subject to} \quad y^{(i)} (w^T x^{(i)} + b) \geq 1, \quad i = 1, \ldots, m \tag{3.11}$$

which can be efficiently solved.

However, the solution is very sensitive to outliers when the training data are noisy, and cannot be applied to non-linearly separable cases. Therefore the soft margin is introduced to allow training samples with margin less than 1 or even negative.

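Suppose a sample $(x^{(i)}, y^{(i)})$ has the margin $1 - \xi_i$; the objective function then increases by $C \xi_i$, where C is a cost factor. The optimization problem is reformulated as:

$$\min_{w, b, \xi} \; \frac{1}{2} w^T w + C \sum_{i=1}^{m} \xi_i \quad \text{subject to} \quad y^{(i)} (w^T x^{(i)} + b) \geq 1 - \xi_i, \quad \xi_i \geq 0, \quad i = 1, \ldots, m \tag{3.12}$$

SVM can implement non-linear classification by applying the kernel trick. With a feature mapping $\phi$ associated with the kernel, the optimization problem becomes

$$\min_{w, b, \xi} \; \frac{1}{2} w^T w + C \sum_{i=1}^{m} \xi_i \quad \text{subject to} \quad y^{(i)} (w^T \phi(x^{(i)}) + b) \geq 1 - \xi_i, \quad \xi_i \geq 0, \quad i = 1, \ldots, m \tag{3.13}$$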

It can be solved by optimizing the dual problem

$$\min_{\alpha} \; \frac{1}{2} \alpha^T Q \alpha - \mathbf{1}^T \alpha \quad \text{subject to} \quad y^T \alpha = 0, \quad 0 \leq \alpha_i \leq C, \quad i = 1, \ldots, m \tag{3.14}$$

where Q is an m-by-m positive semi-definite matrix with

$$Q_{ij} = y^{(i)} y^{(j)} K(x^{(i)}, x^{(j)}) \tag{3.15}$$

and $K(x^{(i)}, x^{(j)})$ is the kernel. The kernel is the inner product of $x^{(i)}$ and $x^{(j)}$ in the linear case. The decision function for a test sample x is:

$$h(x) = \operatorname{sgn}\left( \sum_{i=1}^{m} \alpha_i y^{(i)} K(x^{(i)}, x) + b \right) \tag{3.16}$$

The predicted class label of the test sample is 1 if the decision function is greater than zero and is -1 if the decision function is less than zero.

The following kernels were tested in our study:

• Linear kernel: $K(x, z) = x^T z$

• Gaussian radial basis function (RBF): $K(x, z) = \exp(-\gamma \|x - z\|^2)$

• Polynomial kernel: $K(x, z) = (x^T z + C_0)^d$ for d=2 and d=3.
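The three kernels listed above can be written as plain functions. This sketch assumes the RBF kernel has a single width parameter `gamma` and the polynomial kernel a single offset `c0`; these parameter names, and the function names, are illustrative and are not taken from the toolbox.

```python
import math

def linear_kernel(x, z):
    """Inner product of two equally sized vectors."""
    return sum(a * b for a, b in zip(x, z))

def rbf_kernel(x, z, gamma):
    """Gaussian radial basis function of the squared distance."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)

def polynomial_kernel(x, z, c0, d):
    """Shifted inner product raised to the power d."""
    return (linear_kernel(x, z) + c0) ** d
```

Each function returns a single number, so the matrix Q in the dual problem can be filled by evaluating the chosen kernel on every pair of training samples.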

To solve an N-class classification problem, we construct a binary SVM for each pair of the N classes and assign a test sample to the class that wins the maximum number of “votes” from the binary classifiers. In total, N(N-1)/2 “one-against-one” binary classifiers are needed.
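The voting rule can be sketched independently of how the binary classifiers are trained. In the sketch below, each pairwise classifier is assumed to be a callable that returns one of its two class labels; this interface, the function names, and the tie-breaking rule (the class listed first wins) are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

def n_binary_classifiers(n_classes):
    """Number of pairwise classifiers, N(N-1)/2."""
    return n_classes * (n_classes - 1) // 2

def one_vs_one_predict(x, classes, binary_classifiers):
    """Class that collects the most votes from the pairwise classifiers.

    binary_classifiers maps each pair (a, b), ordered as in `classes`,
    to a callable returning either a or b for the sample x.
    """
    votes = Counter()
    for a, b in combinations(classes, 2):
        votes[binary_classifiers[(a, b)](x)] += 1
    # max() keeps the first class in `classes` when votes are tied.
    return max(classes, key=lambda c: votes[c])
```

For the 8 consonants this gives 28 pairwise classifiers, and for the 4 vowels it gives 6.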

We use the Matlab toolbox libsvm (Chang & Lin, 2001) to implement SVM training and prediction. The toolbox trains the SVM using a Sequential-Minimal-Optimization (SMO)-type decomposition method. Since the solution is always sub-optimal, and since the noisy training data may not represent the structure of unseen data, the performance of the SVM can be very unstable. A solution to this problem is Bootstrap Aggregating (Bagging). Bagging is a method that generates multiple versions of a classifier via bootstrap sampling and uses these to obtain an aggregate classification (Breiman, 1996; Kim, 2002). The scheme of SVM with Bagging is shown in Figure 3.5.


Figure 3.5: Diagram of SVM with bootstrap aggregating

Let $TR = \{z_1, z_2, \ldots, z_m;\; z_i = (x^{(i)}, y^{(i)})\}$ denote the training set. The bootstrap method randomly draws observations from TR with replacement to produce a replicate dataset of TR, denoted $TR^{(j)}$, and repeats this drawing process B times. Each replication is drawn independently and is used to train an SVM classifier. Thus we get a set of B SVM classifiers, each trained independently on one replication of the training set. To predict the class of a test sample, we aggregate the SVMs via majority voting. We first test the sample with all the SVMs and obtain a vector of prediction labels $C = (c_1, c_2, \ldots, c_B)$. Assuming the prediction errors of the SVMs are random and independent, the final prediction label of the test sample is the class that occurs most often in C. Empirical studies have shown that the Bagging method generally outperforms the single SVM trained on the original training set TR (Wang et al., 2009).

Diagram of the classifier

The diagram of the EEG classifier based on SVM is shown in Figure 3.6. The pre-processing steps include high-pass filtering with a 1 Hz cutoff frequency, down-sampling and removing artifacts using ICA. For each trial of EEG recording, we concatenate the data from all the channels to create an observation vector. The length of the observation vector is the product of the number of channels and the trial length. In addition, all the individual EEG trials are relabeled using the class index, i.e. the EEG trials are labeled using the numbers 1 to 8 for classifying 8 consonants and the numbers 1 to 4 for classifying 4 vowels. This means that, at this stage, the classification is a blind process that uses no information about the other phonemes presented in the syllable stimuli.

Figure 3.6: Diagram of the SVM-with-Bagging EEG classifier. (a) Classifier; (b) Training bootstrap repetition #i


We divided the EEG trials randomly into a training set (TR) and an Out-Of-Sample (OOS) test set (TE). The classifier parameters are estimated using the training set only and are hence independent of the OOS test set. The OOS test set is used to test the classification accuracy and generate the confusion matrices. Besides the SVM with Bagging, the classification model also makes use of the following statistical methods: Principal Component Analysis, averaging and cross-validation. Next, each module of the classification model is explained in detail.

• Principal Components Analysis (PCA)

If the observed data include a large number of variables, it is very likely that some variables are correlated. For EEG signals, data from adjacent channels are highly correlated with each other. Thus, when we classify EEG using data from multiple channels, we apply PCA to reduce the number of variables. The PCA algorithm rotates the data to a new coordinate system via an orthogonal linear transformation. The resulting data have the greatest variance along the first coordinate, the second greatest variance along the second coordinate, and so on (Jolliffe, 2002). In pattern classification, PCA can be used to reduce the feature size under the assumption that variables with very small variances contribute little to separating data from different classes. Thus we can truncate the transformed data and keep only the first K principal components without losing much information for classification. This is equivalent to projecting the data onto a reduced subspace with only K coordinates. The number K of principal components kept in the feature vector needs to be optimized using empirical data.

The PCA orthogonal linear transformation can be calculated with the following algorithm.

Suppose we have m observed samples $x^{(i)}$, $i = 1, \ldots, m$, and each sample has n variables, $x^{(i)} \in \mathbb{R}^n$. First, we zero out the mean of the data by replacing each $x^{(i)}$ with $x^{(i)} - \mu$, where

$$\mu = \frac{1}{m} \sum_{i=1}^{m} x^{(i)} \tag{3.17}$$

Then the empirical covariance matrix of x is calculated as:

$$\Sigma = \frac{1}{m} \sum_{i=1}^{m} x^{(i)} x^{(i)T} \tag{3.18}$$

Next, we find the eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_n$ and the unit-length eigenvectors $v_1, v_2, \ldots, v_n$ of $\Sigma$. The matrix $V = [v_1 \; v_2 \; \cdots \; v_n]$ diagonalizes the covariance matrix as $V^{-1} \Sigma V = D$, in which $D = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n)$. Rearrange the order of the columns of V so that the diagonal elements of D are in descending order. Then the transformation $y^{(i)} = V^T x^{(i)}$ rotates the original data to their principal components.

In practice, the PCA is usually calculated using a Singular Value Decomposition (SVD) of the data matrix $X^T$ (Wall et al., 2003). In this classification model, the PCA transformation matrix is estimated using all the individual trials of the training set at the first step of training. We apply the transformation to both TR and TE to convert the original observations to principal components.
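The eigen-decomposition route of (3.17) and (3.18) can be sketched with NumPy as follows. The function names and the row-wise data layout are assumptions for illustration. The transformation is fitted on training trials only and then applied unchanged to test trials, as described above.

```python
import numpy as np

def pca_fit(X):
    """Mean, eigenvectors and eigenvalues of the empirical covariance.

    X: (m, n) data matrix with one observation per row.
    Eigenvector columns are sorted by descending eigenvalue.
    """
    mu = X.mean(axis=0)
    Xc = X - mu                               # zero out the mean
    cov = Xc.T @ Xc / X.shape[0]              # empirical covariance, 1/m
    eigvals, V = np.linalg.eigh(cov)          # eigh: symmetric input
    order = np.argsort(eigvals)[::-1]
    return mu, V[:, order], eigvals[order]

def pca_transform(X, mu, V, K):
    """Project rows of X onto the first K principal components."""
    return (X - mu) @ V[:, :K]
```

The same eigenvalues can be obtained from the singular values of the centered data matrix, since the squared singular values divided by m equal the eigenvalues of the covariance matrix.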

• Bootstrap repetitions

After applying PCA, all the training trials, represented as principal components, are passed to B independent bootstrap repetitions to train B SVM classifiers independently. The structure of each bootstrap repetition is illustrated in Figure 3.6(b). In the ith bootstrap repetition, we randomly draw 80% of the trials in the training set TR to create a bootstrap replication of TR, denoted TR(i). The remaining 20% of the trials in TR are used as a test set TE(i) to monitor the SVM classifier's accuracy, although this accuracy rate is not directly related to the final result and is not reported.
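The replication and aggregation steps can be sketched without reference to a particular classifier. In the sketch below, the 80/20 draw is read as a random split of trial indices, and each trained classifier is assumed to be a callable returning a class label; both are assumptions, as are the function names.

```python
import random
from collections import Counter

def bootstrap_split(n_trials, train_fraction=0.8, rng=random):
    """Randomly split trial indices into a replication and a monitor set."""
    indices = list(range(n_trials))
    rng.shuffle(indices)
    cut = int(round(train_fraction * n_trials))
    return sorted(indices[:cut]), sorted(indices[cut:])

def majority_vote(labels):
    """Most frequent label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]

def bagged_predict(x, classifiers):
    """Aggregate the predictions of B classifiers by majority voting."""
    return majority_vote([classify(x) for classify in classifiers])
```

Calling `bootstrap_split` B times with independent random draws yields the index sets for the B repetitions; each classifier is trained on its own replication and `bagged_predict` combines them at test time.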

• Averaging

Computing the average of multiple trials in a given condition is a widely accepted method in EEG studies for extracting the common structure of a class of signals. The same technique is applied in our research. Averaging cancels out the uncorrelated noise and improves the signal-to-noise ratio. However, the averaging procedure considerably reduces the number of training and testing samples and produces a data-deficiency problem when sophisticated classification models are estimated. Therefore, when we compute averages, we have to reuse the individual trials in an efficient way without biasing the classification accuracy. In our classification model, we use a sample-with-replacement scheme to randomly select the trials for computing averages. The sampling is done for the training and testing sets separately, which means that an individual trial used to calculate an averaged training sample cannot be used to compute an averaged testing sample.

For instance, to calculate the average of M individual EEG trials of initial consonant $o_i$ as a training sample, we randomly draw M trials from a pool, which consists of all $n_i$ training trials whose corresponding auditory stimuli start with the consonant $o_i$, and calculate their mean. The M trials are then put back into the pool for calculating other averages. We repeat this procedure until a sufficient number of averaged samples is obtained. When the number of trials in the pool $n_i$ is much greater than M, it is very unlikely that two averaged samples are identical.

Note that the PCA algorithm essentially rotates the coordinate system and aligns the axes with the directions in which the signal has the greatest variance. If a set of data has large variance in one direction, its averages also have large variance in that direction, provided that the number of trials in the pool is much greater than the number of trials used to calculate one averaged trial. Thus we estimated the PCA transformation on the individual trials and applied it to them before averaging, which avoids repeating the SVD calculation that would otherwise dramatically slow down the computation.

Moreover, when we compute the p-value, we use the binomial distribution, which assumes that the test of each sample is independent. If the averaged test samples are constructed using a sample-with-replacement scheme, two samples may share the same source individual trial and hence are not statistically independent. We therefore apply the sample-without-replacement scheme when calculating the averaged OOS test samples for accurate p-value estimation, as shown in Figure 3.5(a). Consequently, no more than $n_i / M$ averaged samples can be created for the phoneme $o_i$, where $n_i$ is the number of individual trials that belong to class $o_i$ in the OOS test set.
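The two sampling schemes can be sketched as follows. This is a minimal illustration, not the code used in the experiments: the function names, the synthetic pool of 110 trials, and the choice to draw M distinct trials within a single average are all assumptions.

```python
import numpy as np

def averages_with_replacement(trials, m, n_averages, rng):
    """Each average is the mean of m trials; the trials return to the pool afterwards."""
    n = trials.shape[0]
    out = np.empty((n_averages,) + trials.shape[1:])
    for a in range(n_averages):
        # Assumption: the m trials inside one average are distinct.
        idx = rng.choice(n, size=m, replace=False)
        out[a] = trials[idx].mean(axis=0)
    return out

def averages_without_replacement(trials, m, rng):
    """Each trial contributes to at most one average, so at most n // m averages exist."""
    n = trials.shape[0]
    n_averages = n // m
    order = rng.permutation(n)[: n_averages * m]
    grouped = trials[order].reshape(n_averages, m, *trials.shape[1:])
    return grouped.mean(axis=1)

rng = np.random.default_rng(0)
pool = rng.standard_normal((110, 32))      # hypothetical pool: 110 trials, 32 samples each
train_avgs = averages_with_replacement(pool, m=25, n_averages=110, rng=rng)
test_avgs = averages_without_replacement(pool, m=25, rng=rng)
print(train_avgs.shape, test_avgs.shape)   # (110, 32) (4, 32)
```

With 110 trials and M = 25, the without-replacement scheme yields only four averaged samples, which illustrates why it is reserved for the OOS test set.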

 Optimizing SVM parameters

Choosing the appropriate parameters of the SVM model, such as the number of principal components to be kept K and the cost factor C, is crucial to obtain a high performance classifier. In each bootstrap repetition, we determine the optimal parameter of each SVM classifier via nested Q-fold cross-validations using TR(i).

Suppose $t$ independent parameters $(\lambda_1, \lambda_2, \ldots, \lambda_t)$ need to be optimized, and the numbers of candidate values for these parameters are $(m_1, m_2, \ldots, m_t)$. All the candidate parameter values form a searching grid with $m_1 m_2 \cdots m_t$ points, denoted as

$$P_k, \quad k = 1, \ldots, m_1 m_2 \cdots m_t .$$

The procedure of cross-validation is:

(1) The individual EEG trials of TR(i) are randomly divided into Q groups with approximately equal numbers of trials.

(2) Repeat the following for each cross-validation loop:

(2.1) For the jth cross-validation loop, use the jth group of training trials as the validation test set VTE(j) and combine the other Q-1 groups as the validation training set VTR(j). The averaged training and testing samples are calculated using the sampling-with-replacement scheme from VTR(j) and VTE(j) respectively.

(2.2) For each point $P_k$ of the parameter searching grid, we estimate an SVM classifier, configured with the candidate parameter values, using the averaged samples of VTR(j), and test its accuracy using the averaged samples of VTE(j). A classification rate is obtained for $P_k$, denoted as $\mathrm{Pr}^{(j)}(P_k)$.

(3) The mean cross-validation accuracy rate for each $P_k$ is calculated and the optimal parameter set is chosen as:

$$\hat{P} = \arg\max_{P_k} \frac{1}{Q} \sum_{j=1}^{Q} \mathrm{Pr}^{(j)}(P_k) \qquad (3.19)$$

Next, we construct the averaged samples of set TR(i) and train the SVM classifier of the ith bootstrap repetition using the optimal parameter configuration $\hat{P}$.
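The grid search of steps (1) to (3) can be sketched as below. The `score` callable is a hypothetical stand-in for "train an SVM on the averaged samples of VTR(j) and test it on the averaged samples of VTE(j)"; the toy score and the candidate values are illustrative assumptions, not results.

```python
import itertools
import random
from statistics import mean

def grid_search_cv(trials, candidates, q, score, seed=0):
    """Return the grid point with the highest mean accuracy over Q folds.

    trials     : identifiers of the individual trials in TR(i)
    candidates : one list of candidate values per parameter
    score      : hypothetical callable (point, train_ids, valid_ids) -> accuracy
    """
    ids = list(trials)
    random.Random(seed).shuffle(ids)
    folds = [ids[j::q] for j in range(q)]          # Q groups of nearly equal size
    grid = list(itertools.product(*candidates))    # m_1 * ... * m_t grid points
    best_point, best_rate = None, -1.0
    for point in grid:
        rates = []
        for j in range(q):
            valid = folds[j]
            train = [t for f, fold in enumerate(folds) if f != j for t in fold]
            rates.append(score(point, train, valid))
        mean_rate = mean(rates)                    # inner term of the arg max
        if mean_rate > best_rate:
            best_point, best_rate = point, mean_rate
    return best_point, best_rate, len(grid)

def toy_score(point, train_ids, valid_ids):
    """Toy accuracy surface peaking at K = 50, C = 2**-6 (illustration only)."""
    k, c = point
    return 1.0 / (1.0 + abs(k - 50) / 50 + abs(c - 2.0 ** -6) * 10)

k_values = list(range(5, 201, 5))
c_values = [2.0 ** e for e in range(-8, 0)]
best, rate, n_points = grid_search_cv(range(100), [k_values, c_values], q=5, score=toy_score)
print(best, n_points)   # (50, 0.015625) 320
```

The number of calls to `score` is the grid size times Q, which is why the cost grows so quickly with the number of parameters.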

The computation cost of cross-validation increases exponentially with the number of parameters to be optimized. Thus we cannot afford to search over more than three parameters. Since the parameters are optimized independently in each bootstrap repetition, the resulting SVMs may have different structures.

As the last step, the SVM classifiers are aggregated via “majority-voting” and tested on the averaged test samples, which are generated using a sample-without-replacement scheme.
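A minimal sketch of the majority-voting aggregation is given below. The tie-breaking rule (the label encountered first wins) is an assumption, since the text does not specify one, and the example predictions are invented for illustration.

```python
from collections import Counter

def majority_vote(predictions):
    """predictions[b][s] is the label that classifier b assigns to test sample s."""
    n_samples = len(predictions[0])
    voted = []
    for s in range(n_samples):
        counts = Counter(clf_preds[s] for clf_preds in predictions)
        # most_common keeps first-seen order among ties (assumed tie rule).
        voted.append(counts.most_common(1)[0][0])
    return voted

# Three hypothetical classifiers voting on three averaged test samples.
preds = [["p", "t", "b"],
         ["p", "b", "b"],
         ["t", "t", "g"]]
print(majority_vote(preds))   # ['p', 't', 'b']
```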

3.3.2 Classification results

We tested the SVM classification model using the Syllables-III data and the Isolated-vowels data. The raw brainwave data, sampled at 1 kHz, were down-sampled by a factor of 16 to 62.5 Hz. For classifying the initial consonants, only the first 32 samples of each trial, representing 512 ms of EEG signal, were used. The full-length trials with 62 samples were used to classify the vowels.

Linear vs. Nonlinear Kernels

First, we combined EEG trials from all four subjects of the Syllables-III data and tested the performance of the SVM-with-Bagging classifier using linear and non-linear kernels. We classified the 8 consonants and 4 vowels as two independent classification problems. We also tested the classifier on recognizing 4 vowels in the Isolated-vowels data. The classification experiment was configured as follows:

All 124 monopolar channels are concatenated as the EEG observation vector.

The training set TR included half of the individual EEG trials and OOS test set TE contained the other half of the trials. The training/OOS testing partition is random.


35 SVMs were built using the Bagging scheme.

Each averaged sample was computed from 25 individual trials.

For the linear-kernel SVM model, there are only two parameters to be optimized: the size of the principal-component vector K and the cost factor C of the SVM. Thus the linear-kernel SVM was tested first. For the 8-consonant classification, we used 5-fold cross-validation to choose the number of principal components K from [5,10,15,...,200] and the cost factor C from $\{2^{-8}, 2^{-7}, \ldots, 2^{-1}\}$. The mean of the validation rates across all the bootstrap repetitions is plotted in Figure 3.7. We can see that if C is fixed and K increases from 5, the recognition accuracy improves dramatically at first, and the growth rate decreases after K reaches a certain level. The cost factor C affects the sensitivity of the classification rate with respect to K. Although a larger number of principal components K leads to a better recognition rate, it also considerably increases the computation cost. With an appropriate C, a better recognition rate can be achieved with a smaller number of principal components.

Figure 3.7: Mean validation accuracy on parameter search grid (K,C) using linear kernel


Now we look at the computation cost of training the linear-kernel SVM-with-Bagging classifier. The parameters were optimized via 5-fold cross-validation over a grid of 40×8=320 points. Thus around 320×5+1=1601 SVM training runs are needed to estimate each SVM classifier: five for each grid point, plus one final run with the selected parameters. An ensemble of 35 such SVMs was used to make the final predictions. Therefore, in total 1601×35=56,035 SVM training runs are required to construct the EEG classification model with the linear kernel. For the non-linear kernel experiments, since more parameters need to be optimized, a full search of the parameter grid becomes infeasible. Thus we fixed the cost factor at C=0.02, corresponding to the fastest ascending slope of the mean validation rate with respect to K in the linear-kernel experiment, and the number of principal components at K=200, and optimized only the kernel parameter for the Gaussian-kernel experiment and the two kernel parameters for the polynomial-kernel experiment. The classification accuracy rates and the significance levels (p-values) are shown in Table 3.2.
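The training-run count above can be reproduced with a few lines of arithmetic. The variable names are illustrative only; the numbers are those quoted in the text.

```python
n_k_values = len(range(5, 201, 5))   # 40 candidate values for K
n_c_values = 8                       # 8 candidate values for C
n_folds = 5                          # 5-fold cross-validation
n_classifiers = 35                   # SVMs in the Bagging ensemble

grid_points = n_k_values * n_c_values              # 320
fits_per_classifier = grid_points * n_folds + 1    # 1601, including the final fit
total_fits = fits_per_classifier * n_classifiers   # 56035
print(grid_points, fits_per_classifier, total_fits)
```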

Table 3.2: Phoneme classification results using SVM-with-Bagging method with linear or non-linear kernels

| Task | 8 consonants in CV syllables | | 4 vowels in CV syllables | | 4 vowels (isolated) | |
|---|---|---|---|---|---|---|
| Number of test samples | 426 | | 426 | | 140 | |
| Kernel | rate | p-value | rate | p-value | rate | p-value |
| Linear | 46.0% | <10^-64 | 41.5% | <10^-13 | 68.8% | <10^-26 |
| Gaussian | 42.7% | <10^-53 | 36.9% | <10^-7 | 65.3% | <10^-23 |
| Quadratic | 42.7% | <10^-53 | 34.3% | <10^-5 | 62.4% | <10^-20 |
| Cubic | 43.7% | <10^-56 | 41.5% | <10^-13 | 65.9% | <10^-23 |

The result shows that the linear SVM-with-Bagging model correctly classified 46% of the 426 consonant test samples (p-value < $10^{-64}$). The classification rates of the 4 vowels in the CV syllables are lower than the consonant results, with a 41.5% accuracy rate using the linear kernel (p-value < $10^{-13}$), even though only four classes have to be separated. However, we find that the model works well on the same 4 vowels presented in isolation, achieving a classification rate of 68.8% using the linear kernel. The high significance levels support the effectiveness of the SVM-with-Bagging classification method in modeling the averaged EEG recordings of auditory phoneme stimuli. However, the method works much better on phonemes presented at the beginning of the stimuli than on the phonemes that follow. As mentioned in Chapter 1, the EEG recording reflects postsynaptic potentials, which may last longer than the actual duration of the sound stimuli. Thus the EEG response to the initial consonants may impose extra noise on the brainwaves of vowel perception and make them harder to classify.

Theoretically, the non-linear kernels should achieve at least the same performance as the linear kernel. However, limited by the computational capabilities of our computers, we could not run a full parameter grid search to find the optimal parameters of the non-linear kernels. As a result, the classifiers with non-linear kernels did not outperform the classifier with the linear kernel in this experiment.

Leave-one-subject-out experiment

With the same experiment setup, we tested the invariance of EEG representations among subjects using the Syllables-III data. We used trials from one subject to create test samples and trials from the other three subjects to train the linear SVM model. This procedure was repeated for each of the four subjects. The resulting classification rates and p-values are shown in Table 3.3.

Table 3.3: Leave-one-subject-out classification results using SVM-with-Bagging method

| Subject for testing | 8 consonants: number of test samples | rate | p-value | 4 vowels: number of test samples | rate | p-value |
|---|---|---|---|---|---|---|
| DS | 138 | 26.8% | <10^-5 | 141 | 35.5% | 0.0036 |
| SA | 176 | 33.5% | <10^-12 | 176 | 30.7% | 0.051 |
| LK | 280 | 30.4% | <10^-10 | 284 | 37.0% | <10^-5 |
| LH | 248 | 25.0% | <10^-7 | 248 | 40.7% | <10^-7 |

The classification rates for 8 consonants range from 25.0% (p-value < $10^{-7}$) to 33.5% (p-value < $10^{-12}$). Although these rates are considerably lower than the results of the previous experiment, they are highly significant and indicate that the EEG representations of consonants are approximately invariant among different subjects. In contrast, the vowel classification results, which vary from 30.7% to 40.7%, are comparable to the results obtained from mixing the trials of all the subjects. The result suggests that the EEG representations of vowels have stronger inter-subject invariance than the EEG representations of consonants.

Experiment on the number of trials to calculate average

In all the experiments examined above in this section, we trained and tested the classification model using the averaged samples. The number of individual trials used to calculate each average, M, was fixed at 25. To explore how the averaging process affects the classification, we also classified the initial consonants using the linear SVM with various values of M, and plotted the relation between M and the percentage of test samples that are correctly classified in Figure 3.8.

Figure 3.8: The changing of 8 initial consonants classification rates with respect to the number of trials to calculate averages

The figure shows that when classifying individual trials, only 17.3% of the test trials were correctly classified. The accuracy rate increases rapidly as M is increased. The rate of increase slows slightly when M is greater than 20. Although our data size is insufficient to test where the classification rates saturate, from the given result we conclude that averaging can efficiently improve the signal-to-noise ratio, which indicates that the noise in the EEG signals is mainly uncorrelated across trials.

Experiment on classifying individual EEG trials using data from a single channel

We also classified the individual EEG trials of 4 initial consonants /p/, /t/, /b/ and /g/ from subject LK using the SVM-with-Bagging model, so that the performance can be compared with the Brain-speech mapping classification results shown in Figure 3.4.

In this experiment we combined the 8 sessions from subject LK in the Syllables-III data. We randomly drew 75% of the individual EEG trials associated with the targeted phonemes as the training set and used the remaining 25% as the OOS test set. To match the experiment setup of the Brain-speech mapping classification, we trained and tested the SVM-with-Bagging classifier using individual EEG trials from each of the 124 monopolar channels separately. The parameter setup of the experiment was:

The first 32 samples of each EEG trial, representing approximately the first 500 ms of the brain response, were used as the observation data vector. PCA is not necessary in this case.

Linear kernel was adopted in the SVM-with-Bagging classification model.

35 SVMs were built using the Bagging scheme.

The cost factor of the soft-margin SVMs was optimized via nested 5-fold cross-validation loops and chosen from $\{2^{-10}, 2^{-9}, \ldots, 2^{-2}\}$.

The resulting classification rates are shown in the brain map in Figure 3.9. The numbers denote the classification rates obtained using the monopolar channel data collected from the corresponding scalp locations.


Figure 3.9: Performance of SVM-with-Bagging method on classifying 4 initial consonants using single channel data

We can see that this brain map matches the brain map in Figure 3.4 very well. Both experiments obtained the highest classification rates from the channels around the left ear, and the best single-channel classification rate is the same in both, 36%. The major difference between the two brain maps is that the SVM-with-Bagging method can classify the consonants reasonably well using some channels from the right hemisphere, which is not seen in the classification results of the Brain-speech mapping model.

3.4 Summary

In this chapter, we proposed two different approaches to classifying the EEG brainwaves of auditory phoneme stimuli. The first method makes use of the acoustic properties of the auditory stimuli and examines the relations between the brainwaves and the speech sound waves. The second approach follows the traditional pattern recognition strategy and makes use of statistical signal processing methods such as PCA and SVM-with-Bagging. We used these methods to classify the individual EEG trials and averaged EEG trials. Both methods achieved significant results, especially on classifying the initial consonants. The performance of the two classification models is similar when classifying the individual trials of 4 initial consonants. The classification rates might be further improved by bringing the two models together. For example, the Brain-speech mapping method can take advantage of the Bagging scheme and PCA methods as well. Using the SVM-with-Bagging method, we showed that non-linear kernels did not outperform the linear kernel in our experiments and that the EEG representation of phonemes is approximately subject-invariant.

Using the second method as the baseline, in the next chapters, we examine how phonetic differences are reflected in EEG brainwaves and explore if the classification methods can be improved by introducing the phonological information of stimuli into classification.


Chapter 4 Frequency Analysis of EEG Signals

4.1 EEG signals in frequency domain

The long history of studying EEG in the frequency domain started almost at the same time as the first successful recordings of human EEG in the 1920s. Researchers found that EEG signals contain rhythmic activities across a spectrum of frequencies. Oscillations in a certain frequency range may reflect a specific cognitive state of the brain. For example, alpha waves in the frequency range of 8 to 12 Hz are believed to be related to wakeful relaxation with closed eyes. Beta waves, observed from frontal sensors, range from 12 Hz to 30 Hz and are closely linked to motor activities (Pfurtscheller, 1999). Gamma activities in the frequency range from 30 to 100 Hz seem related to the binding of neural processes in different brain areas for carrying out a coherent cognitive or motor activity (Tallon-Baudry & Bertrand, 1999). However, in the literature on EEG related to phoneme perception, the focus of this thesis, researchers have been more interested in temporal information. Very few reports on frequency analysis of EEG activities evoked by phoneme stimuli have been published. In this chapter, we address the question of whether frequency analysis can extract attributes of EEG associated with auditory phoneme perception.

To examine the EEG signals generated by auditory perception in the frequency domain, we first plot the power spectral densities (PSD) of our recordings. Only one EEG session on syllables from subject LK in the Syllables-III experiment is examined. The EEG signals were passed through a 1 Hz high-pass filter and down-sampled by a factor of four to 250 Hz. The PSD of each monopolar channel was computed using the covariance method, and then the average PSD across all 124 monopolar channels was calculated. This average PSD from 0 Hz to the Nyquist frequency of 125 Hz is plotted on the dB scale in Figure 4.1.


Figure 4.1: Average power spectral densities of EEG signal sampled at 250Hz

The plot shows that besides the dominant 60 Hz AC power supply noise, the power of the EEG signal is mainly distributed in the low-frequency range from 0 to 20 Hz, and the power is inversely related to the frequency. The power at 20 Hz is more than 20 dB below the maximum at 2 Hz. It is natural to think that the essential information about brain activities is carried by the frequency components with higher energies. Hence reducing the size of the data by down-sampling them to 62.5 Hz, as we did in Chapter 3, will not lose much useful information. In the remainder of this chapter, we focus only on the lower frequency range from 0 Hz to approximately 31 Hz.


4.2 EEG spectral features

The Discrete Fourier Transform (DFT) is commonly used to convert a finite discrete time-domain signal to the frequency domain. For a time-domain signal $x(n),\ n = 0, \ldots, N-1$, the N-point DFT is the set of N complex numbers calculated as:

$$X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-i \frac{2\pi}{N} nk}, \quad k = 0, \ldots, N-1 \qquad (4.1)$$

And $x(n)$ can be reconstructed from $X(k)$ using the inverse transformation:

$$x(n) = \frac{1}{N} \sum_{k=0}^{N-1} X(k)\, e^{i \frac{2\pi}{N} nk}, \quad n = 0, \ldots, N-1 \qquad (4.2)$$

A real time-domain signal $x(n)$ has a conjugate symmetric spectrum, which means

$$X(k) = X^{*}(N-k), \quad k = 1, \ldots, N-1 \qquad (4.3)$$

and both $X(0)$ and $X(N/2)$ are real when N is even.

Now we only consider the case when $x(n)$ is real and N is even. If we write the complex DFT coefficient $X(k)$ as

$$X(k) = A_k e^{i\theta_k} \qquad (4.4)$$

then the inverse DFT can be reformulated as:

$$\begin{aligned} x(n) &= \frac{1}{N} \sum_{k=0}^{N-1} A_k\, e^{i\left(\frac{2\pi}{N} kn + \theta_k\right)} \\ &= \frac{1}{N} \sum_{k=0}^{N-1} A_k \cos\!\left(\frac{2\pi}{N} kn + \theta_k\right) \\ &= \frac{1}{N} \left[ A_0 \cos\theta_0 + A_{N/2} \cos\!\left(\pi n + \theta_{N/2}\right) + 2 \sum_{k=1}^{N/2-1} A_k \cos\!\left(\frac{2\pi}{N} kn + \theta_k\right) \right] \end{aligned} \qquad (4.5)$$

Equation (4.5) shows that the DFT represents the time-domain signal as a superposition of a series of discrete sinusoidal functions, each of which is defined by three attributes: frequency, amplitude and phase.
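The cosine-superposition form of the inverse DFT can be checked numerically. The sketch below assumes NumPy, whose `np.fft.fft` uses the same sign convention as Equation (4.1); the random test signal is arbitrary, and the terms for k = 0 and k = N/2 are taken from the real parts of X(0) and X(N/2).

```python
import numpy as np

rng = np.random.default_rng(0)
N = 64
x = rng.standard_normal(N)        # any real signal with even length N
X = np.fft.fft(x)                 # forward DFT
A, theta = np.abs(X), np.angle(X)

n = np.arange(N)
k = np.arange(1, N // 2)          # k = 1, ..., N/2 - 1
# Sum of A_k * cos(2*pi*k*n/N + theta_k) over the interior frequencies.
interior = (A[k][:, None]
            * np.cos(2 * np.pi * k[:, None] * n[None, :] / N + theta[k][:, None])
            ).sum(axis=0)
# X(0) and X(N/2) are real, so they equal A*cos(theta) at those two indices.
x_rec = (X[0].real + X[N // 2].real * np.cos(np.pi * n) + 2 * interior) / N

print(np.allclose(x, x_rec))      # True
```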


To explore the relation between these spectral attributes and the phoneme perception process in the brain, we constructed several EEG features based on DFT that reflect all or partial spectral attributes. Then using the SVM classifier introduced in Chapter 3, we tested if the frequency-domain features can be used to predict the brain representation of phoneme stimuli.

 DFT of EEG

The first feature is the DFT, which is calculated separately for each trial and each channel. Since the time-domain signal $x(n)$ is real, half of the N-point DFT is redundant. We represent $x(n)$ by

$$X_{\mathrm{DFT}} = \left[ X(0),\ \mathrm{Re}\,X(1), \ldots, \mathrm{Re}\,X(N/2-1),\ \mathrm{Im}\,X(1), \ldots, \mathrm{Im}\,X(N/2-1),\ X(N/2) \right] \qquad (4.6)$$

which is a vector with N non-redundant real numbers. Theoretically, $X_{\mathrm{DFT}}$ should be equivalent to the time-domain signal and achieve the same classification accuracy.
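The equivalence between the DFT feature and the time-domain signal can be illustrated by packing the N real numbers and then inverting the packing. This is a sketch assuming NumPy; the function names and the ordering of the entries inside the vector are illustrative choices.

```python
import numpy as np

def dft_feature(x):
    """Pack the N-point DFT of a real, even-length signal into N real numbers."""
    N = len(x)
    X = np.fft.fft(x)
    half = X[1 : N // 2]                      # X(1), ..., X(N/2 - 1)
    return np.concatenate(([X[0].real], half.real, half.imag, [X[N // 2].real]))

def signal_from_dft_feature(f):
    """Invert dft_feature, showing that no information was lost."""
    N = len(f)
    m = N // 2 - 1
    half = f[1 : 1 + m] + 1j * f[1 + m : 1 + 2 * m]
    spectrum = np.concatenate(([f[0]], half, [f[-1]]))   # X(0), ..., X(N/2)
    return np.fft.irfft(spectrum, n=N)

rng = np.random.default_rng(1)
x = rng.standard_normal(64)
feature = dft_feature(x)
print(feature.shape, np.allclose(signal_from_dft_feature(feature), x))   # (64,) True
```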

 Amplitude of DFT

The amplitude feature of EEG keeps only the amplitude corresponding to each sinusoidal component. From the conjugate symmetry property, we have

$$A_k = A_{N-k}, \quad k = 1, \ldots, N-1 \qquad (4.7)$$

Thus the non-redundant representation of the amplitudes is:

$$X_{\mathrm{AMP}} = \left[ A_0, A_1, \ldots, A_{N/2} \right] \qquad (4.8)$$

 EEG features based on phase

Similarly, when we define EEG features based on the phases of the DFT, only N/2 values need to be included:

$$X_{\mathrm{PHS}} = \left[ \theta_1, \ldots, \theta_{N/2} \right] \qquad (4.9)$$

The EEG signal was passed through several filters at the pre-processing stage to remove noise and artifacts. A low-pass filter with zero-phase response was used for anti-aliasing before down-sampling. But the 4th-order Butterworth high-pass filter with the cutoff frequency at 1 Hz has a non-zero phase response. The frequency response of the high-pass filter in the low-frequency range is shown in Figure 4.2. The high-pass filter introduces non-linear phase distortion that needs to be compensated. Suppose the high-pass filter shifts the phase by $\delta_k$ at the frequency of the kth sinusoidal component; the phase of the kth sinusoidal component, $\theta_k$, should then be replaced by $\theta_k - \delta_k$.

Figure 4.2: Magnitude and phase frequency response of 1 Hz high-pass filter

Furthermore, all the elements in $X_{\mathrm{PHS}}$ are angular values, which means $\theta_k$ is identical to $\theta_k + 2\pi$. The linear methods used in classification, such as averaging and the linear separating hyperplane, cannot work appropriately for angular observation values. Here we propose two different approaches to overcome this problem.

The first method is to describe the phase angle $\theta_k$ as a unit-length vector in the complex plane and use its real and imaginary parts, $\cos\theta_k$ and $\sin\theta_k$, as the observed values. Then the EEG feature is written as:

$$X_{\mathrm{PHS\text{-}1}} = \left[ \cos\theta_1, \sin\theta_1, \ldots, \cos\theta_{N/2}, \sin\theta_{N/2} \right] \qquad (4.10)$$

In the other approach, we keep only the phase information in the DFT and transform it back to the time domain using the inverse DFT. More specifically, for each element of $X(k),\ k = 0, \ldots, N-1$, the modified DFT is defined as:

$$\tilde{X}(k) = \begin{cases} e^{i\theta_k} & \text{if } A_k \neq 0 \\ 0 & \text{if } A_k = 0 \end{cases} \qquad (4.11)$$

And the EEG feature PHS-2 is

$$X_{\mathrm{PHS\text{-}2}} = \mathrm{IDFT}\big(\tilde{X}(k)\big) \qquad (4.12)$$

Obviously $\tilde{X}(k)$ is also conjugate symmetric, so the derived time-domain signal is real. Thus $X_{\mathrm{PHS\text{-}2}}$ is a vector of N real numbers, which may be longer than the original EEG data. Although $X_{\mathrm{PHS\text{-}2}}$ is a time-domain signal, it is constructed from only the phase pattern of the original signal, and all the amplitude differences of the non-zero sinusoidal components are eliminated. Hence $X_{\mathrm{PHS\text{-}2}}$ is still considered a phase feature, while all the time-domain signal processing methods can also be applied to it.
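Both phase features can be sketched for a single channel as follows. The code assumes NumPy; the function names, the 64-point DFT length, and the synthetic 62-sample trial are illustrative assumptions, and the phase compensation for the high-pass filter is omitted.

```python
import numpy as np

def phs1_feature(x, n_fft=64):
    """Cosine and sine of the phases theta_1, ..., theta_{N/2}."""
    theta = np.angle(np.fft.fft(x, n=n_fft))[1 : n_fft // 2 + 1]
    return np.column_stack((np.cos(theta), np.sin(theta))).ravel()

def phs2_feature(x, n_fft=64):
    """Set every non-zero DFT amplitude to one, then apply the inverse DFT."""
    X = np.fft.fft(x, n=n_fft)             # zero-pads x up to n_fft points
    A = np.abs(X)
    X_tilde = np.zeros_like(X)
    nonzero = A > 0
    X_tilde[nonzero] = X[nonzero] / A[nonzero]   # e^{i theta_k}
    return np.fft.ifft(X_tilde).real       # imaginary part is rounding noise only

rng = np.random.default_rng(2)
trial = rng.standard_normal(62)            # one hypothetical 62-sample trial
phs1 = phs1_feature(trial)
phs2 = phs2_feature(trial)
print(phs1.shape, phs2.shape)              # (64,) (64,)
```

The PHS-2 output has 64 values although the trial has 62 samples, which is the sense in which the feature may be longer than the original data.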

4.3 Classification results

4.3.1 Compare the EEG features based on DFT

To test how well the EEG spectral features can describe the brain activities of phoneme perception, we computed the four proposed EEG spectral features of the Syllables-III data and the Isolated-vowels data. We used the features to classify the brain representations of the phoneme stimuli with the linear-kernel SVM model introduced in Section 3.3. The classification scheme is identical to the one described in Figure 3.6, except that the EEG time-domain signal of each trial is converted to the spectral features immediately after the ICA cleaning in the pre-processing stage. For EEG data sampled at 62.5 Hz, we used 62 samples as the time-domain observations of one second of EEG activity and zero-padded them for a 64-point DFT. Thus the frequency resolution of the DFT is approximately 1 Hz. The classifiers were trained and tested using the following configurations:
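The sampling figures quoted above follow from simple arithmetic, sketched here with illustrative variable names. The helper `bin_to_hz` is hypothetical and only maps a DFT index to its frequency.

```python
fs_raw = 1000.0                 # raw sampling rate in Hz
fs = fs_raw / 16                # 62.5 Hz after down-sampling by a factor of 16
n_samples = 62                  # time samples kept per trial
n_fft = 64                      # DFT length after zero-padding

duration_s = n_samples / fs     # about 0.992 s of EEG per trial
resolution_hz = fs / n_fft      # spacing between neighbouring DFT bins

def bin_to_hz(k):
    """Frequency in Hz of DFT index k, for 0 <= k <= n_fft // 2."""
    return k * fs / n_fft

print(duration_s, resolution_hz, bin_to_hz(n_fft // 2))
```

The bin spacing is 0.9765625 Hz, so a DFT index is close to, but not exactly, its frequency in Hz, and the highest index corresponds to the Nyquist frequency of 31.25 Hz.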

EEG channels: 124 monopolar channels

Number of SVMs for Bagging: 35

Number of trials for averaging: 25

Number of principal components used for classification: optimized in nested 5-fold cross-validation loops and chosen from [5,10,15,...,200]

Cost factor of SVMs: optimized in nested 5-fold cross-validation loops and chosen from $\{2^{-10}, 2^{-9}, \ldots, 2^{-2}\}$

The percentages of test samples that were classified correctly are summarized in Table 4.1.

Table 4.1: Comparing the classification rates of 4 EEG spectral features

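| Feature | 8 consonants | 4 vowels in CV syllables | 4 isolated vowels |
|---|---|---|---|
| Number of test samples | 426 | 426 | 140 |
| Temporal signal (TIME) | 41.5% | 35.5% | 68.8% |
| Spectral feature: DFT | 38.0% | 31.5% | 70.2% |
| Spectral feature: AMP | 10.3% | 28.5% | 27.7% |
| Spectral feature: PHS-1 | 35.2% | 29.7% | 51.8% |
| Spectral feature: PHS-2 | 39.2% | 27.8% | 55.3% |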

The initial-consonant classification results show that the DFT spectral features achieve slightly lower classification rates than the full-length temporal signal. The AMP feature barely worked for distinguishing the EEG representations of initial consonants: it reached only 10.3% accuracy for the classification of 8 consonants, which is no better than the chance level of 12.5%. Hence the amplitudes of the DFT carry very little information for distinguishing the initial-consonant brain images. Among the four spectral features, PHS-2 gave the best classification rate, 39.2%, which is better than that of the full DFT representation. The results suggest that eliminating amplitude information can improve the initial-consonant classification rates. A similar pattern of performance differences can be found in the isolated-vowel brain-image classification, except that the phase-related features cannot classify the vowels as well as the temporal and DFT features. Since the total number of test samples of isolated vowels is 140, much less than the 426 test samples of initial consonants, the isolated-vowel classification rates are not as robust as the rates of initial-consonant classification.

We also found that the superiority of the phases over the amplitudes of the DFT is not shown clearly in classifying EEG brainwaves of the four vowels in CV syllables. This is because the actual start times of the vowels presented at the non-initial position differ, due to the various durations of the preceding consonants. Therefore, if the distinctions between EEG images of phonemes are reflected by the phases of the DFT, which describe temporal delays of the sinusoidal components of the signal, these distinctions can be contaminated by the different delays when the phonemes are not presented as the initial phoneme of a stimulus. These results also help to explain why the classification rates of vowels in CV syllables are significantly lower than the others, and they indicate a possible direction for improving the rates.

Moreover, although the DFT feature is mathematically equivalent to the TIME feature, which means one of them is fully determined by the other, the DFT feature does not reach as high a rate as TIME under the linear classification methods. Similar results were obtained from the PHS-1 and PHS-2 features. This may suggest that the proposed SVM classification algorithm works better on the EEG signal when represented in the time domain.

In short, the remarkable differences in classification rates using spectral features show that the phoneme brain representations are nearly independent of the amplitude of sinusoid components of EEG but much more reflected in their phase pattern.


4.3.2 Frequency selection

Now we know that the phases of the DFT can describe the EEG of phoneme perception process rather well. This is shown by the classification experiments discussed above, which used the spectral properties across the frequency range from DC to Nyquist frequency as the observation features. However, it is natural to think that the frequency components contribute differently to the classification. Those that are unrelated to the target neural activities may impose extra noise on the classifiers and reduce the classification rate. By optimizing the choices of frequency components with respect to maximizing the classification rate, we may be able to find the frequency range of EEG activities that are more directly related to brain processing of phonemes in humans.

The ideal approach is to include a full search of possible frequency choices while training the Bagging SVMs, as we did for the number of principal components and the SVM cost factor. But this would increase the computation time to an impractical level. Thus we look for the approximately optimal frequency range via a 10-fold cross-validation using only the training set. Since the number of trials in the Isolated-vowels data is insufficient, we only examine the best frequency range for classifying initial consonants using the Syllables-III data.

First of all, we assume the frequency components that carry the information to distinguish the initial consonants lie in a continuous range from $f_L$ to $f_H$, corresponding to the frequency indices L and H of an N-point DFT. Our purpose is to find the optimal parameter pair (L, H) in the searching grid

$$\left\{ (L, H) \;\middle|\; L, H \in \mathbb{Z};\ 0 \le L \le \frac{N}{2},\ L \le H \le \frac{N}{2} \right\} \qquad (4.13)$$

which maximizes the mean classification rate of cross-validation. In this experiment we down-sampled the EEG data to 62.5 Hz and applied a 64-point DFT. The optimization procedure is as follows:

(1). All the training trials (TR) are randomly divided into 10 groups with approximately the same number of trials.


(2). Repeat the following steps for each pair (L,H) of the grid.

(2.1) Calculate the modified PHS-2 feature of the frequency band between L and H and use it as the observation vector of the trial. If $X(k),\ k = 0, \ldots, N-1$ is the DFT of the EEG signal collected from one channel, we calculate:

$$\tilde{X}(k) = \begin{cases} e^{i\theta_k} & \text{if } A_k \neq 0 \text{ and } L \le k \le H \\ e^{i\theta_k} & \text{if } A_k \neq 0 \text{ and } L \le N-k \le H \\ 0 & \text{otherwise} \end{cases} \qquad (4.14)$$

The band-limited PHS-2 feature is the IDFT of $\tilde{X}(k)$.

(2.2) Transform the modified PHS-2 features into their principal components using PCA. Only the first 200 principal components are used in the following computation. Each EEG trial is thus reduced to a vector of 200 elements that depend only on the phase pattern of the DFT within the frequency band $[f_L, f_H]$.

(2.3) Repeat the following for the 10 cross-validation loops.

(2.3.1) At cross-validation loop i, use group i as the test set VTE(i) and combine the other 9 groups as the training set VTR(i).

(2.3.2) Using the sample-with-replacement method discussed in Chapter 3, create averaged training and test samples; each sample is the mean of the PHS-2 features across 25 individual trials. The training set VTR(i) and the test set VTE(i) are sampled separately. The total number of training samples is the same as the number of individual trials in VTR(i), and the total number of test samples is the same as the number of individual trials in VTE(i).

(2.3.3) Train an 8-class linear SVM classifier using the training samples in VTR(i) with the cost factor $C = 2^{-6}$. Then use it to predict the class labels of the test samples and compute the percentage of samples that are classified correctly, denoted $r^{i}_{(L,H)}$.


(2.4) The mean classification accuracy of the parameter pair (L, H) is defined as:

$$\bar{r}_{(L,H)} = \frac{1}{10} \sum_{i=1}^{10} r^{i}_{(L,H)} \qquad (4.15)$$

And the optimal pair is chosen as $(\hat{L}, \hat{H}) = \arg\max_{(L,H)} \bar{r}_{(L,H)}$.
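The band-limited PHS-2 feature of step (2.1) can be sketched for one channel as follows. The code assumes NumPy; the function name and the synthetic trial are illustrative, and the indices L = 2 and H = 9 in the usage lines are example values only.

```python
import numpy as np

def band_limited_phs2(x, low, high, n_fft=64):
    """Unit-amplitude phases inside DFT indices [low, high] and their mirror images."""
    X = np.fft.fft(x, n=n_fft)
    A = np.abs(X)
    k = np.arange(n_fft)
    in_band = ((k >= low) & (k <= high)) | ((n_fft - k >= low) & (n_fft - k <= high))
    keep = in_band & (A > 0)
    X_tilde = np.zeros_like(X)
    X_tilde[keep] = X[keep] / A[keep]      # e^{i theta_k} inside the band, 0 elsewhere
    return np.fft.ifft(X_tilde).real       # real because X_tilde is conjugate symmetric

rng = np.random.default_rng(3)
trial = rng.standard_normal(62)            # one hypothetical 62-sample trial
feature = band_limited_phs2(trial, low=2, high=9)
spectrum = np.abs(np.fft.fft(feature))
print(feature.shape, np.round(spectrum[:12], 3))
```

The printed amplitudes are one inside the selected band and zero outside it, so only the phase pattern of the chosen frequencies remains in the feature.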

The mean classification accuracies of all the candidate (L, H) pairs that belong to the searching grid $\{ (L, H);\ 0 \le L \le 4,\ L \le H \le 20 \}$ are shown in Figure 4.3. We find that the parameter pair (2, 9) gives the best mean classification rate of 45.5% for 8 classes. These DFT indices correspond approximately to the frequency range from 2 Hz to 9 Hz.

Figure 4.3: Mean classification rates of parameter pair (L, H) obtained from 10-fold cross validation

To test if the optimal frequency band can be generalized to the OOS test trials and compare the result with other temporal and spectral representations of EEG, we classified the EEG signal of phonemes, represented as the modified PHS-2 feature with limited bandwidth [2Hz, 9Hz], using the linear-kernel SVM-with-Bagging classifier. Although the best frequency range was obtained using the EEG responses of initial consonants, we also applied it to classifying the isolated vowels to check if any improvement can be made.

The experiments were configured as follows:

EEG channels: 124 monopolar channels

Number of SVMs for Bagging: 35

Number of trials for averaging: 25

EEG observations: modified PHS-2 for 64-point DFT and L=2, H=9

Number of principal components used for classification: 200

Cost factor of SVMs: optimized in nested 5-fold cross-validation loops and chosen from $\{2^{-10}, 2^{-9}, \ldots, 2^{-2}\}$

The classification results are summarized in Table 4.2 and compared with the classification rates obtained using the temporal signal and the phase features across the full frequency range from DC to the Nyquist frequency.

Table 4.2: SVM-with-Bagging classification results using the EEG phase feature in the frequency range from 2Hz to 9Hz

| Feature | 8 initial consonants | 4 isolated vowels |
|---|---|---|
| Temporal Signal | 46.0% (500ms); 41.5% (1sec) | 68.8% |
| PHS-2 | 39.2% | 55.3% |
| PHS-2 [2Hz, 9Hz] | 51.4% | 73.8% |

For the 8 initial consonants, 219 out of 426 test samples were classified correctly. The accuracy rate is 51.4%, with a p-value less than $10^{-82}$. This result is significantly better than that obtained by classifying the EEG of initial consonants using the time-domain signal. The classification rate on isolated vowels also improved, from 68.8% to 73.8%.
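One plausible way to obtain a significance level for this result is an exact one-sided binomial test against the chance level of 1/8. The text does not state here which test was used, so the choice of test and the name `binomial_tail` are assumptions of this sketch; only the counts 219 and 426 and the number of classes come from the text.

```python
from fractions import Fraction
from math import comb


def binomial_tail(successes, trials, classes):
    """Exact P(X >= successes) for X ~ Binomial(trials, 1/classes)."""
    # Each term is C(n, k) * 1^k * (classes - 1)^(n - k) over classes^n.
    numerator = sum(
        comb(trials, k) * (classes - 1) ** (trials - k)
        for k in range(successes, trials + 1)
    )
    return Fraction(numerator, classes ** trials)


# 219 of 426 averaged test samples correct, 8 equally likely classes.
p_value = float(binomial_tail(219, 426, 8))
```

Exact rational arithmetic is used because the individual terms are far too small for a naive floating-point product to be trusted; the conversion to `float` happens once, at the end.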

In conclusion, the EEG features built on the phase pattern of DFT can describe the brain image of the phoneme as well as the original time-domain EEG signal. Eliminating the amplitude information of DFT will not diminish the distinctions of EEG representations of different phonemes at the initial position of auditory stimuli. The phase pattern of sinusoidal components in the frequency range from 2Hz to 9Hz is more important than other frequency components in distinguishing the EEG image of phonemes.


Chapter 5 Invariant Similarities between Brain and Perceptual Representations of Phonemes

Our degree of success in classifying brain representations of phonemes supports the investigation of brain activities at a level below phonemes, i.e. the brainwaves reflecting distinctive features of phonemes compared to perceived phonological features. Intuitively, if two phonemes are perceptually close, the brainwaves evoked by them should be close. In this chapter, we derive the similarities between brainwaves of phonemes according to the confusion matrices of classification results and compare them with the perceptual similarities obtained from corresponding perceptual experiments.

5.1 Psychological experiments on phoneme perception

The phonological features introduced in Chapter 1 are not equally efficient in discriminating phonemes perceptually. Historically, researchers have studied the effectiveness of distinctive features in separating phonemes via psychological experiments. In these experiments, auditory speech tokens are presented to listeners, who are instructed to identify the phonemes they heard. The perceptual confusion between each pair of phonemes is recorded. In typical experimental settings, the utterances are presented via a noisy speech channel with frequency distortions to create the necessary confusions. One of the first psychological experiments on consonant confusion is the renowned work of Miller and Nicely (1955). They recorded the perceptual confusions in identifying 16 consonants, which were filtered and presented with different signal-to-noise ratios (SNR), and used the confusion data to determine the robustness of the distinctive features under filtering or noise-masking conditions. They found that some features, voicing and nasal for instance, are very robust, but the discernibility of the place of articulation is likely to be affected. The Miller-Nicely experiment results were reliably reproduced in 2005 using modern computerized techniques and digital audio recordings (Phatak & Allen, 2008). Wang and Bilger (1973) conducted a similar but more thorough experiment. They calculated the perceptual confusion matrices of 24 consonants, which cover all the distinctive consonant sounds in most English dialects, in CV or VC syllables with different vowels, and evaluated the robustness of phonological features in a variety of contexts and listening conditions. Relatively little work on vowel confusions has been carried out. Besides frequency distortion and noise masking, researchers have also studied phoneme perceptual confusions under other conditions, such as short-term memory (Wickelgren, 1966) and impacted hearing capability (Munson, 2002).

5.2 Similarity measurements

We applied the similarity analysis tools, semi-orders and hierarchical partition trees, to interpret both the brainwave confusions and psychological confusions data. Then the invariance between brainwave similarity of phoneme images and the corresponding perceptual similarity can be derived. We now briefly describe these methods, which follow those of Suppes, Perreau-Guimaraes & Wong (2009).

5.2.1 Semi-Order and Invariant Partial Order of similarities

When we classify the brainwaves of phonemes, we count the test samples of phoneme $o_j$ that are classified as belonging to phoneme $o_i$ and normalize the count by the total number of test samples of phoneme $o_j$. This gives the estimated conditional probability $p(o_i^+ \mid o_j^-)$, where “+” and “-” denote the prototypes and the test samples respectively. If we repeat this for each pair (i, j), a conditional probability matrix is obtained. The normalized confusion matrix of classification results provides empirical evidence to order the similarity differences of the brainwave representations of phonemes. Briefly speaking, it is natural to think that the test sample $o_j^-$ is more similar to the prototype $o_i^+$ than the test sample $o_{j'}^-$ is to the prototype $o_{i'}^+$ if and only if

$$p(o_i^+ \mid o_j^-) > p(o_{i'}^+ \mid o_{j'}^-) \tag{5.1}$$

We denote this similarity-difference relation as

$$o_i^+ \mid o_j^- \succ o_{i'}^+ \mid o_{j'}^- \tag{5.2}$$

Because the confusion matrices are generally not symmetric, the similarity differences are not necessarily symmetric. In practice, the difference between the similarity of $o_i^+$ and $o_j^-$ and the similarity of $o_{i'}^+$ and $o_{j'}^-$ is considered statistically insignificant if $p(o_i^+ \mid o_j^-)$ and $p(o_{i'}^+ \mid o_{j'}^-)$ are close enough. We therefore introduce a numerical threshold $\varepsilon$ such that

$$o_i^+ \mid o_j^- \succ o_{i'}^+ \mid o_{j'}^- \quad \text{iff} \quad p(o_i^+ \mid o_j^-) > p(o_{i'}^+ \mid o_{j'}^-) + \varepsilon \tag{5.3}$$

It can be proven that the similarity-difference ordering defined by the estimated conditional probabilities with the numerical threshold $\varepsilon$ is a semiorder on $A = \{o_i^+ \mid o_j^- : i = 1, \ldots, N;\ j = 1, \ldots, N\}$, i.e. it is irreflexive, strongly transitive and an interval order on A.

In our study, we also need to compare the structural invariance between two semiorders: the similarity-difference ordering of brainwaves, denoted $\succ_{br}$, and the perceptual similarity-difference ordering of phonemes, denoted $\succ_{per}$. The invariance is given by the intersection of the two semiorders

$$\succ_{br} \cap \succ_{per} \; = \; \succ_{inv} \tag{5.4}$$

which is a strict partial order.

To graph the semiorders and the invariant partial order, the relation $o_i^+ \mid o_j^- \succ o_{i'}^+ \mid o_{j'}^-$ is illustrated by an arrow from the vertex which denotes $o_i^+ \mid o_j^-$ to the vertex which denotes $o_{i'}^+ \mid o_{j'}^-$. To further simplify the graph, we define the congruence relation ≡ as:

$$a \equiv b \quad \text{iff for all } c: \quad \text{(i) } a \succ c \text{ iff } b \succ c, \quad \text{(ii) } c \succ a \text{ iff } c \succ b \tag{5.5}$$

The congruence relation is an equivalence relation, i.e. reflexive, symmetric and transitive. In the graph of the invariant partial order, we put $o_i^+ \mid o_j^-$ and $o_{i'}^+ \mid o_{j'}^-$ in the same vertex if

$$o_i^+ \mid o_j^- \equiv o_{i'}^+ \mid o_{j'}^- \tag{5.6}$$

Given that the phonemes' prototypes are always on the left of the similarity notation and their test samples on the right, the + and - signs can be omitted in the graph without generating any confusion.
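The definitions of this subsection can be made concrete with a short sketch. It estimates the conditional probabilities from classifier output and evaluates the thresholded similarity-difference relation for two pairs. The function names and the dictionary layout are hypothetical, and the sketch assumes that every class has at least one test sample.

```python
def conditional_probabilities(true_labels, predicted_labels, classes):
    """Row-normalised confusion matrix.

    Returns p with p[j][i] estimating p(o_i^+ | o_j^-): the share of test
    samples of class j that were classified as class i.
    """
    counts = {j: {i: 0 for i in classes} for j in classes}
    for truth, guess in zip(true_labels, predicted_labels):
        counts[truth][guess] += 1
    return {
        j: {i: counts[j][i] / sum(counts[j].values()) for i in classes}
        for j in classes
    }


def more_similar(p, pair_a, pair_b, eps):
    """True iff pair_a is strictly above pair_b in the thresholded ordering.

    Each pair is (prototype, test sample), matching the convention that the
    prototype is written on the left of the similarity notation.
    """
    (i, j), (i2, j2) = pair_a, pair_b
    return p[j][i] > p[j2][i2] + eps
```

Because the comparison is strict and the threshold is non-negative, no pair is ever ranked above itself, which is the irreflexivity required of a semiorder.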

5.2.2 Partition tree of similarities

The similarity-difference ordering is the basis for generating a qualitative partition tree, which shows a hierarchical partition of the combined set of test samples and prototypes $O = \{o_1^+, \ldots, o_N^+, o_1^-, \ldots, o_N^-\}$ in a binary tree structure. First, we define the “merged product” of two subsets of O, $O_I$ and $O_J$, as:

$$O_I O_J = \{\, o_i^+ \mid o_j^- : (o_i^+ \in O_I \ \&\ o_j^- \in O_J) \ \text{or} \ (o_i^+ \in O_J \ \&\ o_j^- \in O_I) \,\} \tag{5.7}$$

The inductive procedure starts from a partition $P_0$ which consists of the 2N singleton sets of elements of O. In the $k$th inductive step, two subsets in the partition $P_{k-1}$ are chosen to be merged such that the least pair of their merged product under the similarity-difference ordering $\succ$ is maximized among all the possible merges. Consequently, the subsets with greater similarity are merged earlier than the subsets with smaller similarity. Each step of the recursive procedure reduces the cardinality of the partition by 1. Thus step 2N-1 reaches a partition with only one block, which is the set O. The similarity tree is constructed by using the 2N hierarchical partitions in reverse order. The root node of the tree denotes the single set O. The two branches from the root node lead to the partition of step 2N-2, which has two blocks. The same procedure continues until all the leaves of the tree are the elements of O. The partition tree provides a fairly intuitive approach to summarizing the similarity of the test samples and prototypes in a matrix of conditional probability densities. (Further details of the semiorder and similarity tree can be found in Suppes, Perreau-Guimaraes & Wong 2009.)
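The merging procedure can be sketched as a greedy loop. This sketch simplifies the method in three ways that should be read as assumptions: it compares raw conditional probabilities instead of the thresholded semiorder, it scores an empty merged product as minus infinity so that such merges are taken last, and it breaks ties by list order. The function names are hypothetical; the example data are the EEG values of Table 5.1(a).

```python
def least_cross_pair(block_a, block_b, prob):
    """Smallest p(o_i^+ | o_j^-) in the merged product of two blocks.

    Elements are (name, sign) tuples and prob[(i, j)] holds p(o_i^+ | o_j^-).
    """
    values = [
        prob[(proto[0], test[0])]
        for first, second in ((block_a, block_b), (block_b, block_a))
        for proto in first if proto[1] == "+"
        for test in second if test[1] == "-"
    ]
    return min(values, default=float("-inf"))


def partition_sequence(names, prob):
    """Return the partitions P_0, ..., P_{2N-1} produced by greedy merging."""
    blocks = [frozenset({(name, sign)}) for name in names for sign in "+-"]
    partitions = [list(blocks)]
    while len(blocks) > 1:
        candidates = [(a, b) for k, a in enumerate(blocks) for b in blocks[k + 1:]]
        # Merge the two blocks whose least cross pair is the largest.
        a, b = max(candidates, key=lambda ab: least_cross_pair(ab[0], ab[1], prob))
        blocks = [blk for blk in blocks if blk not in (a, b)] + [a | b]
        partitions.append(list(blocks))
    return partitions


# Table 5.1(a): rows are test samples, columns are prototypes (percent).
VOWELS = ["i", "ae", "u", "a"]
EEG = [
    [66.3, 6.0, 21.0, 6.7],
    [6.0, 79.0, 5.3, 9.7],
    [22.0, 6.0, 65.0, 7.0],
    [8.3, 19.3, 7.3, 65.0],
]
PROB = {(VOWELS[i], VOWELS[j]): EEG[j][i] for i in range(4) for j in range(4)}
PARTITIONS = partition_sequence(VOWELS, PROB)
```

Reading `PARTITIONS` from the last entry backwards gives the tree from the root down; the second-to-last entry holds the two blocks that hang directly below the root.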

5.3 Experimental data analysis

5.3.1 Vowels

Since the classifiers predict the EEG images of isolated vowels much more accurately than those of the vowels in CV syllables, we use the Isolated-vowels data to generate the confusion matrix of EEG representations of vowels. If the sample-without-replacement scheme were used to calculate the averaged test samples, only 140 samples would be available. This is not enough to produce a confusion matrix with reliable off-diagonal structure. Thus when we constructed the vowel confusion matrix, we took the time-domain signal as the EEG observation vector and created 300 averaged test samples for each vowel from the OOS test set using the sample-with-replacement method. In this experiment, the PHS-2 EEG feature with the limited frequency range from 2 to 9 Hz is used to represent EEG signal. The class labels of the test samples were predicted using the Bagging SVM model with linear kernel. As a result, among 1200 test samples, 826 were correctly classified. The classification rate was 68.8%. The normalized confusion matrix of classifying EEG images of vowels is shown in Table 5.1(a). The ith element in the jth row is the probability that the test samples of the phoneme oj are classified as oi. The summation of each row is 100%.

We compare the EEG confusion matrix of vowels with the results of the vowel perception experiments conducted by Pickett in 1957. The Pickett experiment presented 12 English vowels in artificial syllables of the form bVb, spoken in a short carrier phrase. Pickett reported the perceptual confusion matrices of vowels when the utterances were masked by noise in various frequency ranges. Considering that the brainwave data were collected in quiet office surroundings, only the perceptual confusion matrix of flat noise is examined here. The perceptual conditional probabilities are estimated by taking the elements associated with the four targeted vowels from Table I(B) in (Pickett, 1957) and forming a sub-matrix. Each element of the sub-matrix is then divided by the sum of the corresponding row to get the conditional probabilities shown in Table 5.1(b). The overall perceptual accuracy for these 4 vowels is 82.8%.

Table 5.1: Normalized confusion matrices of 4 vowels

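(a) The confusion matrix of EEG isolated-vowel classification.

| % | i | æ | u | ɑ |
|---|---|---|---|---|
| i | 66.3 | 6.0 | 21.0 | 6.7 |
| æ | 6.0 | 79.0 | 5.3 | 9.7 |
| u | 22.0 | 6.0 | 65.0 | 7.0 |
| ɑ | 8.3 | 19.3 | 7.3 | 65.0 |

(b) The confusion matrix of Pickett 1957 vowel-perception experiment.

| % | i | æ | u | ɑ |
|---|---|---|---|---|
| i | 87.0 | 0.2 | 11.8 | 1.1 |
| æ | 0.2 | 92.6 | 0 | 7.2 |
| u | 45.3 | 0.2 | 53.6 | 0.9 |
| ɑ | 0 | 1.9 | 0 | 98.1 |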

Figure 5.1 compares the similarity trees derived from the brain and perceptual confusion matrices. Looking at the similarity tree for the brain representation of vowels, we can make several remarks. First, any vowel-test is more similar to its own prototype than to any other vowel. A more interesting finding is the separation between the open vowels /æ/ and /ɑ/ and the close vowels /i/ and /u/. The tree suggests that the brain representation of vowel-height is more robust than that of vowel-backness. Since the vowel-height reflects the frequency of the first formant (F1) and the vowel-backness is inversely correlated to the second formant (F2), the results suggest that the EEG activity is more sensitive to the low-frequency contrast around the F1 range (less than 1000Hz) than to the higher-frequency contrast around F2 (1000-2500Hz). The finding is consistent with the fact that the human cochlea, where the sound wave pressure is converted to the original neural signals, has higher resolution at low frequencies.

The merging pattern of the perceptual similarity tree is almost identical to the brain similarity tree. The slight difference between the brainwave confusions and the perceptual confusions of vowels can be found only in the confusion matrices. In the psychological experiment, although overall 82.8% of the vowels can be perceived accurately, the close vowels /u/ and /i/ are much more confused than the open vowels /æ/ and /ɑ/. This distinction is not found in classifying brainwaves of vowels. As Pickett mentioned, the perceptual intelligibilities of four vowels are highly related to their intensities (Pickett 1957 Table II). We think the strong perceptual confusions between /i/ and /u/ are mainly on account of the fact that the vowels of low intensities are less intelligible when the masking noises are present. Therefore the acoustic distinctions of the vowels, reflected by their locations in the F1-F2 space, are qualitatively mirrored better in the similarity differences derived from the statistical model of EEG images than in the perceptual confusions generated by the masking noise.

The graph of the invariant partial order between brain and perceptual confusions of vowels is shown in Figure 5.2. We computed the intersection using a threshold of $\varepsilon = 0.01$. We notice that the pairs with the same height, æ+|ɑ-, ɑ+|æ-, u+|i- and i+|u-, generally rank higher than the pairs that have the same backness, æ+|i-, i+|æ-, u+|ɑ- and ɑ+|u-. The graph of the invariant partial order also demonstrates that the greater robustness of vowel-height compared to that of vowel-backness in distinguishing the vowels is invariant in the perceptual and brain representations of vowels.
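The intersection can be reproduced from the two matrices of Table 5.1 with a short sketch. The helper name `semiorder` and the set-of-ordered-pairs representation are choices made for this example; the threshold of 0.01 on probabilities is written as 1.0 because the tables are in percent.

```python
VOWELS = ["i", "ae", "u", "a"]

# Table 5.1: rows are test samples (o_j^-), columns are prototypes (o_i^+), in percent.
EEG = [
    [66.3, 6.0, 21.0, 6.7],
    [6.0, 79.0, 5.3, 9.7],
    [22.0, 6.0, 65.0, 7.0],
    [8.3, 19.3, 7.3, 65.0],
]
PERCEPTUAL = [
    [87.0, 0.2, 11.8, 1.1],
    [0.2, 92.6, 0.0, 7.2],
    [45.3, 0.2, 53.6, 0.9],
    [0.0, 1.9, 0.0, 98.1],
]


def semiorder(matrix, eps_percent):
    """Set of ordered pairs (a, b) with a strictly above b by more than eps.

    Each element a or b is itself a (prototype, test sample) pair.
    """
    value = {
        (VOWELS[i], VOWELS[j]): matrix[j][i] for i in range(4) for j in range(4)
    }
    return {(a, b) for a in value for b in value if value[a] > value[b] + eps_percent}


# Invariant partial order: relations that hold in both representations.
INVARIANT = semiorder(EEG, 1.0) & semiorder(PERCEPTUAL, 1.0)
```

For instance, the same-height pair u+|i- lies above the same-backness pair æ+|i- in both matrices, so that relation belongs to `INVARIANT`.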


(a) (b)

The four vowels /i/, /æ/, /u/ and /ɑ/ are labeled as “i”, “ae”, “u” and “a” respectively. (a) The similarity tree of brainwave representation of vowels, derived from the classification results of the linear SVM model. The set of test samples and the prototype of a phoneme are denoted with “-” and “+” respectively. (b) The perceptual similarity tree of vowels, derived from results of Pickett's psychological experiments.

Figure 5.1: The similarities of brain representation and perceptual representation of 4 vowels

Figure 5.2: Invariant partial order between brainwave and perceptual confusions of the vowels


5.3.2 Consonants

Among all the experiments of classifying EEG images of consonants, the best result was obtained when we classified the PHS-2 spectral feature limited to the frequency range of 2 to 9Hz using Bagging SVMs with linear kernel. Here we use the same classification model to generate the confusion matrix of the brainwaves of consonants. 300 averaged test samples for each consonant were constructed from OOS test trials using the sample-with-replacement scheme. The classifier correctly predicted the class labels of 1185 test samples out of 2400, with an accuracy rate of 49.4%. The normalized confusion matrix is shown in Table 5.2(a). And we show the resulting similarity tree in Figure 5.3(a).

Table 5.2: Normalized confusion matrices of 8 consonants

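(a) The confusion matrix of EEG consonants classification. (b) The perceptual confusion matrix from Miller-Nicely experiment. The ith element in the jth row is the probability that the test samples of the phoneme oj are classified as oi. Each row sums to 100%.

(a)

| % | p | t | b | g | f | s | v | z |
|---|---|---|---|---|---|---|---|---|
| p | 38.7 | 31.0 | 2.3 | 7.0 | 7.3 | 3.7 | 6.7 | 3.3 |
| t | 31.7 | 44.0 | 2.0 | 6.0 | 4.0 | 3.0 | 3.7 | 5.7 |
| b | 1.3 | 1.3 | 60.0 | 23.3 | 0.3 | 0.3 | 10.3 | 3.0 |
| g | 6.7 | 7.7 | 29.3 | 40.3 | 0.7 | 2.0 | 8.0 | 5.3 |
| f | 3.0 | 6.3 | 0.7 | 2.0 | 55.7 | 15.7 | 13.3 | 3.3 |
| s | 8.7 | 5.3 | 0.7 | 1.0 | 13.0 | 59.7 | 5.0 | 6.7 |
| v | 11.0 | 7.3 | 12.7 | 11.3 | 6.7 | 3.3 | 38.0 | 9.7 |
| z | 2.7 | 3.0 | 8.0 | 11.7 | 1.7 | 7.0 | 7.3 | 58.7 |

(b)

| % | p | t | b | g | f | s | v | z |
|---|---|---|---|---|---|---|---|---|
| p | 45.5 | 33.3 | 1.0 | 0.7 | 13.5 | 4.2 | 1.4 | 0.4 |
| t | 40.4 | 42.2 | 0.9 | 0.3 | 7.5 | 7.5 | 0.3 | 0.9 |
| b | 1.3 | 0.5 | 52.3 | 7.2 | 6.1 | 2.9 | 24.3 | 5.3 |
| g | 1.4 | 0.5 | 10.8 | 44.6 | 0.5 | 2.4 | 8.9 | 31.0 |
| f | 12.5 | 8.7 | 2.5 | 0.3 | 66.2 | 6.6 | 2.5 | 0.8 |
| s | 8.3 | 6.9 | 2.4 | 3.1 | 19.4 | 55.4 | 1.4 | 3.1 |
| v | 0.0 | 0.3 | 22.6 | 7.5 | 3.7 | 1.6 | 57.5 | 6.9 |
| z | 2.1 | 0.4 | 10.6 | 19.2 | 0.7 | 5.3 | 14.2 | 47.5 |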


(a) (b)

(a) The similarity tree of brainwave representations of the consonants. The set of test samples and the prototype of a phoneme are denoted with “-” and “+” respectively. (b) The similarity tree of perceptual representations of the consonants, derived from results of (Miller-Nicely, 1955).

Figure 5.3: The similarities of brain and perceptual representation of 8 consonants

Here we compare the similarities of the brainwave representation of consonants with the perceptual confusion data from the Miller-Nicely (1955) experiment. Only the confusion matrices of the frequency response 200Hz-6500Hz were inspected, to match the experimental setup of the brainwave data. We calculated the sum of the matrices of Table II and Table III in (Miller & Nicely, 1955), which are the perceptual confusions in the listening conditions of SNR = -12dB and SNR = -6dB respectively, and drew the elements between each pair of the eight consonants from the summed matrix to construct the confusion matrix for the targeted consonants. The accuracy rate of the perceptual confusion matrix, which is the ratio between the sum of the diagonal elements and the sum of all the elements, is 52.3%. It is very close to the classification rates on the brainwaves of consonants and provides a good foundation to study invariance between brain and perceptual representations. The normalized perceptual confusion matrix and the subsequent similarity tree are shown in Table 5.2(b) and Figure 5.3(b).
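The construction described above (sum the count matrices, keep the rows and columns of the targeted phonemes, compute the accuracy and normalize each row) can be sketched as follows. The function name and argument layout are hypothetical, and any counts passed in for experimentation are the caller's own; no values from Miller and Nicely are reproduced here.

```python
def sub_confusion(count_matrices, labels, targets):
    """Build a normalised confusion sub-matrix from raw count matrices.

    count_matrices: list of square count matrices that share the row and
                    column order given by labels.
    targets:        the phonemes to keep, in the desired output order.
    Returns (row-normalised sub-matrix, overall accuracy).
    """
    idx = [labels.index(t) for t in targets]
    # Element-wise sum of the matrices, restricted to the target phonemes.
    summed = [[sum(m[r][c] for m in count_matrices) for c in idx] for r in idx]
    total = sum(map(sum, summed))
    # Accuracy: sum of diagonal elements over the sum of all elements.
    accuracy = sum(summed[k][k] for k in range(len(idx))) / total
    # Conditional probabilities: divide each element by its row sum.
    normalised = [[value / sum(row) for value in row] for row in summed]
    return normalised, accuracy
```

The accuracy is computed before normalization, on the summed counts, which matches the definition as a ratio of sums; computing it as the mean of the normalized diagonal would weight the rows equally instead and can give a slightly different number.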

We make the following remarks about the similarity trees of consonants.

(1) Among the three distinctive features being examined (voicing, affrication and place of articulation), voicing is the most robust feature for both the brain and perceptual representations of consonants, as shown by the fact that the voiced and voiceless consonants are joined only at the last merge in both trees. The robustness of voicing for brainwaves suggests that the temporal difference of the auditory input, such as the voice onset time (VOT), which is the primary acoustic cue for the voicing contrast (Lisker & Abramson 1964), is well preserved in the brain representation.

(2) For voiceless consonants, the affrication is more distinctive than the place of articulation for both brain and perceptual representations of consonants. In fact, the place of articulation is the most confused feature for brainwave representations, since 3 of the 4 pairs of consonants that differ only in place of articulation, /p/ and /t/, /b/ and /g/, as well as /f/ and /s/, are merged first.

(3) The major difference lies in the grouping structure of the voiced consonants. Unlike in the brainwave results, where /b/ is mainly attracted by /g/, /b/ is perceptually more confused with the voiced fricative /v/, which shares its place of articulation. The contrast between /b/ and /g/ lies mostly in the transition portion of the F2 of the vowel that follows (Miller & Nicely 1955), while the primary perceptual cues to distinguish /b/ and /v/ are the abrupt onset of the stop sound /b/ and the turbulent noise of the friction of /v/ (Fujimura & Erickson, 1997). We also notice that although the attraction between /b/ and /v/ is commonly seen in the perceptual consonant categorization data using masking noise (Miller & Nicely 1955; Wang & Bilger 1973; Phatak et al, 2008), it is not clearly shown in the perceptual experiment of short-term memory (Wickelgren, 1966) or in the neural activity discriminations of animals' responses to human speech stimulation (Mesgarani, 2008). Consequently, a possible explanation for the mismatch between the brainwave and perceptual confusions is the fact that friction is more perceptually distorted by white noise than the formant transitions.

The significant invariance between the similarity of brainwave and perceptual confusions of consonants is further illustrated by the invariant partial order graph in Figure 5.4. It shows that the similarity differences between the voiceless stops /p/ and /t/, p+|t- and t+|p- are very small for both brainwave and perception, and lie on the top part of the graph. Although the brain representation of /b/ is mainly confused with /g/, /v/ has strong attraction to /b/ as well. Combined with the fact that /v/ and /b/ are very confused in the perceptual experiment, the similarities v+|b- and b+|v- ranked high in the invariant partial order graph.

Figure 5.4: Invariant partial order between brainwave confusions and perceptual confusions of the consonants

Finally, let us revisit the classification rates. As remarked in Chapter 3, the classifier achieves higher classification rates for the initial consonants than for vowels. For the vowels in CV syllables, this difference may arise because the cognitive processing of the initial consonant lasts longer than the actual duration of its sound, thus imposing extra noise on the brainwave of vowel perception and making it unintelligible. The classification model of the averaged trials is more sensitive to the beginning part of the auditory stimuli than to the later portion. However, the classification rates on isolated vowels are not as significant as the results on consonants either. By examining the similarity differences of the EEG image representations of phonemes, we found that the EEG observations reflect the temporal distinctions of auditory stimuli, such as VOT, more accurately than the spectral distinctions, such as the formant transitions. This can be another reason that the classifier performs very well on consonants and not as well on vowels. Considering the generally accepted tonotopic organization of the human auditory cortex (Talavage, 2004), this may be due to the relatively low spatial resolution of the EEG signals. Extracting more spatial information from EEG or combining it with other more space-sensitive technologies, such as MEG or fMRI, may improve the classification rates significantly.


Chapter 6 Classifiers Based on Distinctive Features

6.1 Classifying the distinctive features

The results of Chapter 5 show that phonological distinctions of phonemes, interpreted by the distinctive features, can be revealed in the brainwaves of phonemes and captured by our EEG classification model efficiently. The similarity analysis of the phoneme classification results shows that brainwaves of phonemes that differ in some phonological features, for instance voicing, are not likely to be confused with each other when they are represented in the EEG feature space. The similarities of the brain representation of phonemes and the perceptual representation of phonemes are approximately invariant. This finding naturally leads to the question of whether we can predict phonological features of EEG brainwaves using the same classification model.

To answer this question, we ran an experiment to classify distinctive features, which take binary values. We classified three distinctive features of initial consonants (voicing, continuant and place of articulation) using 8 sessions of brainwave recordings from the Syllables-III data, and two features of vowels (height and backness) using 8 sessions of Isolated-vowels data. All the brainwaves used in this experiment were collected from one subject (LK). In total 7168 trials were available for each classification task. When we classified one distinctive feature, the EEG trials were grouped into two classes, which take opposite values on the feature, for example voiced and voiceless. Since our choices of phonemes are balanced on the distinctive features, the number of trials in each class is around 3084. The binary grouping of phonemes for each feature is shown in Table 6.1.

As mentioned in Chapter 1, some distinctive features, such as voicing, continuant and nasal, are widely adopted in most feature systems. The place of articulation, however, is a more complicated property of the sound, because the obstruction may occur at many places along the oral tract. In this experiment, we tested two kinds of grouping for the place of articulation. Under the traditional definition of the place of articulation, the 8 initial consonants take three different values: /p/, /b/, /f/ and /v/ are labial; /t/, /s/ and /z/ are alveolar; /g/ is velar. Following this approach, we combined the alveolar and velar consonants to form a “non-labial” group, as opposed to the labial consonants. We also tested the feature “coronal”, which was proposed by Chomsky and Halle. Coronal sounds are produced with the blade of the tongue raised from its neutral position. Among the 8 initial consonants, /t/, /s/ and /z/ are coronal while /p/, /b/, /g/, /f/ and /v/ are non-coronal.

The SVM-with-Bagging model with linear kernel was used for the classification. The classifiers were trained and tested using the following configuration:

EEG channels: 124 monopolar channels

EEG feature: PHS-2 spectral feature with limited frequency band of 2 to 9Hz

Number of SVMs for Bagging: 35

Number of trials for averaging: 25

Number of principal components used for classification: 100

Cost factor of SVMs is optimized in nested 5-fold cross-validation loops and chosen from  10  9  2,,2,2 2 .

The binary classification accuracies and p-values are shown in Table 6.1 as well.


Table 6.1: Classifying the distinctive features

| | feature | grouping | rate | p-value |
|---|---|---|---|---|
| consonant | voicing | voiceless: /p/ /t/ /f/ /s/; voiced: /b/ /g/ /v/ /z/ | 92.1% | $<10^{-26}$ |
| consonant | continuant | stop: /p/ /t/ /b/ /g/; fricative: /f/ /s/ /v/ /z/ | 81.4% | $<10^{-13}$ |
| consonant | place (labial) | labial: /p/ /b/ /f/ /v/; non-labial: /t/ /g/ /s/ /z/ | 69.3% | $<10^{-5}$ |
| consonant | place (coronal) | coronal: /t/ /s/ /z/; non-coronal: /p/ /b/ /g/ /f/ /v/ | 77.9% | $<10^{-10}$ |
| vowel | height | open: /æ/ /ɑ/; close: /i/ /u/ | 83.8% | $<10^{-16}$ |
| vowel | backness | front: /æ/ /i/; back: /ɑ/ /u/ | 71.8% | $<10^{-7}$ |

We found that for all the features under investigation, the classification rates are well above the chance level and the p-values are less than $10^{-5}$. Among the phonological features of consonants, the classification of voicing achieved the highest accuracy, 92.1%. The classification of continuant gave a somewhat lower rate of 81.4%. For the place of articulation, the classification of the coronal feature achieved a significantly better rate than the classification of labial, which indicates that the brainwaves of consonants that differ in the coronal feature are more separated than the brainwaves of consonants that differ in the labial feature when represented in the EEG feature space. For the features of vowels, vowel-height can be classified more accurately than vowel-backness. The differences between the classification rates are consistent with the similarity analysis results in Chapter 5.

The success in classifying binary distinctive features shows that the brainwaves of phonemes that have the same value on a distinctive feature are clustered in the EEG feature space. This suggests a new approach to classifying the brainwaves of auditory stimuli of language constituents. If we code the phonemes, syllables, words, etc. as binary features, we can use a small set of binary classifiers to separate them. For example, only 5 binary classifiers are needed to distinguish the 32 syllables in the Syllables-III data. This can be easily generalized to all the phonemes/syllables without making the classification model too complicated. In fact, according to Chomsky and Halle's work, all the phonemes in human speech can be represented using 27 binary phonological features with some degree of redundancy. Far fewer features are needed to represent a particular language such as American English.
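The counting argument can be checked with a small sketch that assigns each syllable a code built from the feature groupings of Table 6.1. Two things are assumptions of this example rather than statements from the text: that the 32 syllables are the 8 x 4 consonant-vowel combinations, and the particular 0/1 assignment within each feature. The dictionary and function names are hypothetical.

```python
from itertools import product

# Bits follow the groupings of Table 6.1.
# Consonants: (voiced, fricative, labial). Vowels: (open, front).
CONSONANT_BITS = {
    "p": (0, 0, 1), "t": (0, 0, 0), "b": (1, 0, 1), "g": (1, 0, 0),
    "f": (0, 1, 1), "s": (0, 1, 0), "v": (1, 1, 1), "z": (1, 1, 0),
}
VOWEL_BITS = {"i": (0, 1), "ae": (1, 1), "u": (0, 0), "a": (1, 0)}


def minimum_bits(n_classes):
    """Smallest k with 2**k >= n_classes, i.e. the ceiling of log2(n_classes)."""
    return (n_classes - 1).bit_length()


# Assumed here: every consonant combines with every vowel to form a CV syllable.
SYLLABLE_CODES = {
    c + v: CONSONANT_BITS[c] + VOWEL_BITS[v]
    for c, v in product(CONSONANT_BITS, VOWEL_BITS)
}
```

With these five features the 32 codes are all distinct, so the coding is as compact as possible for this stimulus set; replacing the labial feature by the coronal feature would give /b/ and /g/ the same code.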

Since our brainwave data cover only a small subset of the distinctive features, we can only manage a preliminary study on this approach.

6.2 Distinctive-feature-based classifiers

The main frame of the distinctive-feature (DF)-based classification model for the brainwaves of phonemes is identical to that of the SVM-with-Bagging classifier introduced in Section 3.3 and is kept intact as in Figure 3.6. To implement the N-class SVM, we use an ensemble of binary SVM classifiers based on the phonological distinctive features of the stimuli instead of N(N-1)/2 “one-against-one” binary SVM classifiers.

Suppose the N speech stimuli can be distinguished using k phonological features, each of which takes two values, denoted 0 and 1. Then each stimulus can be coded as a unique k-bit binary number $b_1 b_2 \cdots b_k$, where $b_i \in \{0, 1\}$, with each bit denoting the value of a phonological feature. If we use a two-class SVM classifier to predict one bit of the code, then k SVMs are needed to classify N stimuli. The number of distinctive features should not be less than $\log_2 N$. However, since the code is not randomly assigned but reflects phonological properties of the speech stimuli, for a specific classification task the binary coding may not be very compact, which means the number of distinctive features may be much greater than $\log_2 N$. Moreover, if the speech stimuli do not cover all the possible combinations of the distinctive features, it is very likely that the combination of predicted labels of the k binary classifiers does not correspond to any stimulus. In our classification model, since the classification results of the N-class SVMs are aggregated via majority voting, we can drop any non-decodable result from the SVMs in the aggregation.
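The decoding and aggregation steps described above can be sketched as follows. The codebook follows the consonant groupings of Table 6.1 with an assumed bit order of (voiced, fricative, labial); the function names, the use of `None` for a non-decodable code and the tie-breaking of the vote are choices made for this illustration.

```python
from collections import Counter

# Bit order (voiced, fricative, labial), following the groupings of Table 6.1.
CODEBOOK = {
    (0, 0, 1): "p", (0, 0, 0): "t", (1, 0, 1): "b", (1, 0, 0): "g",
    (0, 1, 1): "f", (0, 1, 0): "s", (1, 1, 1): "v", (1, 1, 0): "z",
}


def decode(bits, codebook=CODEBOOK):
    """Map predicted feature bits to a stimulus, or None if no stimulus has that code."""
    return codebook.get(tuple(bits))


def aggregate(bit_predictions, codebook=CODEBOOK):
    """Majority vote over ensemble members, ignoring non-decodable codes.

    bit_predictions: one predicted bit tuple per member of the bagging ensemble.
    Returns None when no member produced a decodable code.
    """
    votes = Counter(
        label
        for label in (decode(bits, codebook) for bits in bit_predictions)
        if label is not None
    )
    return votes.most_common(1)[0][0] if votes else None
```

With all eight consonants present every 3-bit code is decodable; a non-decodable result only arises for a stimulus set that leaves some combination of feature values unused, for example a hypothetical codebook from which /z/ has been removed.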


To test the performance of the DF-based classification model and compare the results with those in previous chapters, we classified the 8 initial consonants using all 24 sessions of the Syllables-III data and the 4 vowels using the Isolated-vowels data. The first three features in Table 6.1 (voicing, continuant and place of articulation (labial)) are used to distinguish the 8 consonants. Vowel-height and vowel-backness characterize the 4 vowels. The parameters of the classification model are configured as in the experiments of Section 6.1. The model correctly classified 39.0% of the test samples of initial consonants, with a p-value less than $10^{-42}$, and 53.5% of the test samples of isolated vowels, with a p-value less than $10^{-11}$.

Although the classification rates are not as good as the results obtained in Chapter 5, the success of classifying the brainwaves of phonemes using distinctive features indicates that distinctive features may underlie the mechanism by which the brain parses and retrieves phonemic information when it processes speech inputs. The algorithm also works when only a small amount of training data is available, since each binary SVM can be estimated using all the training samples.

6.3 Parallel structure vs. Hierarchical structure

In the DF-based classification model, the binary SVMs, which predict the values of the distinctive features of each test sample, are trained in a parallel manner. This requires the underlying assumption that all the distinctive features are represented in brainwaves independently. This independence assumption can be written as follows: for any $i \neq j$, the optimal separating hyperplane between class $(b_i, b_j) = (0, 0)$ and class $(b_i, b_j) = (1, 0)$ approximately coincides with the optimal separating hyperplane between class $(b_i, b_j) = (0, 1)$ and class $(b_i, b_j) = (1, 1)$. The assumption is very strict and usually false in practice.


(a) Classifying the vowel-height and vowel-backness independently

(b) Classifying the vowel-height and vowel-backness hierarchically Figure 6.1: Classifying 4 vowels in F1-F2 space


For instance, as we mentioned in Chapter 1, the phonological features vowel-height and vowel-backness are closely related to the first formant F1 and the second formant F2 respectively. We cut the auditory stimuli of the Syllables-III data into frames of 10ms length and calculate the F1 and F2 of each frame within the vowel segments. Then we plot the frames of the 4 vowels, /i/, /æ/, /u/ and /ɑ/, in the F1-F2 space as shown in Figure 6.1. Now we look for the optimal hyperplane that separates the vowels that differ in vowel-height or vowel-backness in the F1-F2 domain. The blue solid line in Figure 6.1(a) and Figure 6.1(b) is the optimal separation line between open vowels (/æ/ and /ɑ/) and close vowels (/i/ and /u/), estimated using a linear soft-margin SVM model with C=0.15. Only 1.1% of the data points lie on the wrong side of the separation line. The blue dashed line in Figure 6.1(a) illustrates the optimal separation line for the feature vowel-backness, regardless of whether the vowels are open or close. The samples of front vowels are located below the line and those of back vowels above it, with 4.7% exceptions. In Figure 6.1(b) the hyperplanes that divide front and back vowels are estimated separately for open vowels and close vowels, shown as a blue dashed line and a green dashed line respectively. The blue dashed line and the green dashed line are clearly apart from each other, which means that the separation hyperplane of vowel-backness is different for open vowels and for close vowels. Therefore the distinctive features of vowels, vowel-height and vowel-backness, do not satisfy the independence assumption. Consequently, the brainwave representations of the distinctive features will not be independent either.

With this in mind, I proposed a DF-based classification model with a hierarchical structure, in which the decision rules of the distinctive features are allowed to depend on one another.

Suppose a test sample $x$ belongs to the class labeled $y$, which is coded using $k$ binary distinctive features as $y = \{b_1 b_2 \cdots b_k\}$, where $b_i \in \{0, 1\}$ for $i = 1, \ldots, k$. We use a two-class classifier to predict the value of one feature, and write the decision rule of the $i$-th classifier as $h_i(x) = \hat{b}_i$. The classification model with parallel structure is then described as

$$
x \;\rightarrow\;
\left\{
\begin{array}{c}
h_1(x) = \hat{b}_1 \\
h_2(x) = \hat{b}_2 \\
\vdots \\
h_k(x) = \hat{b}_k
\end{array}
\right\}
\;\rightarrow\; \hat{y}
\qquad (6.1)
$$

In the hierarchical classification model, by contrast, the values of the distinctive features are predicted in sequence, and the classification of the $i$-th distinctive feature depends on the predicted labels of the previous features, i.e. $\hat{b}_1, \ldots, \hat{b}_{i-1}$. The classification process for sample $x$ therefore has a binary-tree structure. Figure 6.2 shows an 8-class classifier with the hierarchical structure.

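[Diagram: a binary tree with three levels. DF #1: one classifier, $h_1(x) = \hat{b}_1$, splits the samples into 0xx and 1xx. DF #2: two classifiers, $h_{21}(x) = \hat{b}_2$ and $h_{22}(x) = \hat{b}_2$, split them into 00x, 01x, 10x and 11x. DF #3: four classifiers, $h_{31}(x) = \hat{b}_3$ to $h_{34}(x) = \hat{b}_3$, yield the leaves 000, 001, 010, 011, 100, 101, 110 and 111.]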

Figure 6.2: Hierarchical models for classifying 8 classes

Using the hierarchical structure, an N-class classification problem can be solved with N-1 binary classifiers. Since errors propagate from top to bottom in this structure, the crucial step in constructing the classifier is to find the optimal ordering of the features. Intuitively, the distinctive feature that achieved the highest classification rate in the binary classification experiments should be predicted first, to provide a good foundation for the subsequent predictions.
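The difference between the two structures can be illustrated with a minimal Python sketch. It is not the implementation used in this work: the `threshold` rules are hypothetical stand-ins for trained two-class classifiers, and the cut-off values and the sample are made up for illustration. The hierarchical tree for two features holds 3 = N-1 rules for N = 4 classes, and the rule for the second feature depends on the label predicted for the first.

```python
from typing import Callable, Dict, Sequence, Tuple

Sample = Sequence[float]
BinaryClassifier = Callable[[Sample], int]  # maps a sample to a feature value, 0 or 1


def threshold(index: int, cut: float) -> BinaryClassifier:
    """Hypothetical stand-in for a trained two-class classifier."""
    return lambda x: int(x[index] > cut)


def predict_parallel(x: Sample, classifiers: Sequence[BinaryClassifier]) -> Tuple[int, ...]:
    """Parallel structure: every feature is predicted independently of the others."""
    return tuple(h(x) for h in classifiers)


def predict_hierarchical(
    x: Sample, tree: Dict[Tuple[int, ...], BinaryClassifier], k: int
) -> Tuple[int, ...]:
    """Hierarchical structure: the classifier for feature i is selected by the
    labels already predicted for features 1..i-1 (the path through the tree)."""
    path: Tuple[int, ...] = ()
    for _ in range(k):
        path += (tree[path](x),)
    return path


# Two features, four classes. The parallel model has one rule per feature.
parallel = [threshold(0, 0.5), threshold(1, 0.5)]

# The hierarchical model has one rule for feature 1 and a separate rule for
# feature 2 under each outcome of feature 1 (illustrative cut-offs only).
tree = {
    (): threshold(0, 0.5),
    (0,): threshold(1, 0.3),
    (1,): threshold(1, 0.7),
}

sample = (0.2, 0.4)
print(predict_parallel(sample, parallel))     # (0, 0)
print(predict_hierarchical(sample, tree, 2))  # (0, 1)
```

The two structures disagree on this sample only because the second-level rule was allowed to differ between the two branches. When both branches share the same rule, the hierarchical model reduces to the parallel one.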


The DF-based classifier with hierarchical structure was tested and compared with the classifier with parallel structure, using both the Isolated-vowels data and the initial-consonants data of the Syllables-III experiment. Only two distinctive features need to be classified to classify the four isolated vowels. We tested the two possible orderings of the distinctive features, denoted height→backness and backness→height. The percentages of test samples that were classified correctly and the significance levels are summarized in Table 6.2.

Table 6.2: Vowels classification results using DF-based classifiers

classifier           rates    p-values
parallel             53.5%    <10^-11
height→backness      65.2%    <10^-23
backness→height      57.0%    <10^-15

We found that the hierarchical model that classifies vowel-height before vowel-backness correctly classifies 65.2% of the test samples, well above the parallel classifier (53.5%) and the hierarchical classifier with the order backness→height (57.0%). The results are consistent with our prediction that the distinctive feature with the higher binary classification rate should be ranked higher in the hierarchical structure.

Hence, for classifying the 8 initial consonants, we put voicing, the distinctive feature that can be predicted with an accuracy of 92.1% using the binary model, at the top of the tree structure. Table 6.3 shows the classification results of the parallel classifier and of the hierarchical classifiers with the orders voicing→continuant→place and voicing→place→continuant. The results show that the hierarchical classifier with the order voicing→continuant→place achieves the best classification rate, 47.2%.


Table 6.3: Initial consonants classification results using DF-based classifiers

classifier                   rates    p-values
parallel                     39.0%    <10^-42
voicing→continuant→place     47.2%    <10^-67
voicing→place→continuant     40.8%    <10^-47

Although the best performance of the DF-based classifiers is slightly worse than the best results obtained using N(N-1)/2 one-against-one SVM classifiers, the DF-based approach provides a simpler model that can be easily extended to more complicated speech stimuli. We can also use the DF-based classification model to analyze how the distinctive features relate to each other when they are processed in the brain. For example, we tested a 4-class classification task using 8 sessions from LK of the Syllables-III data. Each class contains 2 initial consonants that differ only in place of articulation: voiceless stops /p/ and /t/, voiceless fricatives /f/ and /s/, voiced stops /b/ and /g/, and voiced fricatives /v/ and /z/. Thus the four classes are distinguished by two distinctive features: voicing and continuant. We tested the classification accuracies of the classification model with parallel and with hierarchical structure. The classification results, including the binary classification rates for these two features, are summarized in Table 6.4.

Table 6.4: The results of classifying the combination of voicing and continuant using SVM-with-Bagging model

classification tasks                              rates
2 classes, voiced vs. voiceless                   92.1%
2 classes, stop vs. continuant                    81.4%
4 classes, voicing + continuant (parallel)        74.3%
4 classes, voicing→continuant (hierarchical)      75.7%
4 classes, continuant→voicing (hierarchical)      76.4%

We found that the 4-class classification accuracies of the parallel classifier and the two hierarchical classifiers are about the same. They are close to the product of the rates obtained from the binary classification experiments on the two features, which is 92.1% × 81.4% ≈ 75.0%. The results indicate that, although the feature continuant cannot be predicted as accurately as the feature voicing from brainwaves, the brain may process the two features independently.
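The comparison with the product of the binary rates can be checked with a few lines of Python. The rates are copied from Table 6.4. Reading the product as the expected 4-class accuracy assumes that the errors on the two features are statistically independent, which is an interpretation made for this sketch rather than a result established by the experiment.

```python
# Binary and 4-class classification rates, copied from Table 6.4.
binary_rates = {"voicing": 0.921, "continuant": 0.814}
four_class_rates = {
    "parallel": 0.743,
    "voicing->continuant": 0.757,
    "continuant->voicing": 0.764,
}

# If the two feature decisions err independently, the chance of getting both
# features right is the product of the two binary rates.
expected = 1.0
for rate in binary_rates.values():
    expected *= rate

# Gap between each observed 4-class rate and the independence prediction.
gaps = {name: rate - expected for name, rate in four_class_rates.items()}

print(f"expected under independence: {expected:.1%}")
for name, gap in gaps.items():
    print(f"{name}: observed {four_class_rates[name]:.1%}, gap {gap:+.1%}")
```

All three observed rates fall within about 1.5 percentage points of the product, which is the basis for the statement that the two features may be processed independently.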

The significant results of classifying the EEG image of phonemes using distinctive features suggest that the human brain may use a distinctive-feature-based parallel computation mechanism to process phonemes.

The hierarchical DF-based classification model can be improved by adopting algorithms that optimize decision trees. In particular, classifying more phonemes involves more distinctive features, and it becomes impractical to find the optimal structure of the decision tree by examining all possible combinations. Efficient data-driven methods for designing decision trees can be used in this case.


Chapter 7 Conclusion and Prospects

A mathematical model that can recognize the brain activities of phoneme processing is not only essential for developing language-based brain-computer interfaces; it can also provide a powerful method for studying the mechanisms that the human brain uses to process language.

To achieve the goal of this work, developing a mathematical-statistical method to recognize the EEG brainwaves of phonemes, two major problems need to be solved. One is to compress the redundant and noisy EEG data into compact features that contain the crucial information on phoneme perception. The other is to develop statistical methods that can identify the phonemes as represented by the features.

I started my work by solving the latter problem using EEG time-domain signals. Three classification approaches were studied in this thesis. In the first approach, the brain-speech mapping method, we considered the brain as a dynamic system that takes speech, described using acoustic features, as input and produces EEG brainwaves as output. Linear transformations were estimated to simulate the inverse system. The EEG brainwaves can then be classified by passing them through the inverse system and comparing the estimated speech input with speech prototypes. In the second approach, purely statistical methods such as PCA and SVM with Bagging were used to construct a classifier. The third approach is a modification of the SVM-with-Bagging method: it classifies phonemes by classifying their distinctive features. All three methods were implemented and tested using EEG recordings collected in our lab. The brain-speech mapping method can classify 44.9% of the individual EEG trials of 4 initial consonants of the Syllables-I data. The SVM-with-Bagging method achieved an accuracy of 46.0% in classifying the averaged test trials of 8 consonants and an accuracy of 68.8% in classifying the averaged test trials of 4 isolated vowels when the linear kernel was used. However, these methods show limitations in classifying vowels in CV syllables. How to use knowledge of the preceding phonemes to classify phonemes in non-initial positions of the stimuli is one of the major challenges in extending these methods to syllables and words.
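The bootstrap aggregating step of the second approach can be sketched in Python as follows. This is an illustration under stated assumptions, not the classifier used in this thesis: a nearest-centroid rule stands in for the SVM base learner, and the number of bootstrap rounds (`n_models`), the seed and the toy data are hypothetical.

```python
import random
from collections import Counter
from typing import Callable, Dict, List, Sequence, Tuple

Point = Tuple[float, ...]
Model = Callable[[Point], int]


def fit_nearest_centroid(xs: Sequence[Point], ys: Sequence[int]) -> Model:
    """Stand-in base learner (NOT an SVM): assign the label of the closest class mean."""
    counts = Counter(ys)
    sums: Dict[int, List[float]] = {}
    for x, y in zip(xs, ys):
        acc = sums.setdefault(y, [0.0] * len(x))
        for d, value in enumerate(x):
            acc[d] += value
    centroids = {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

    def predict(x: Point) -> int:
        return min(
            centroids,
            key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])),
        )

    return predict


def bagging_fit(xs: Sequence[Point], ys: Sequence[int],
                n_models: int = 25, seed: int = 0) -> List[Model]:
    """Train one base learner per bootstrap resample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_nearest_centroid([xs[i] for i in idx], [ys[i] for i in idx]))
    return models


def bagging_predict(models: Sequence[Model], x: Point) -> int:
    """Aggregate the base learners by majority vote."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


# Hypothetical, well-separated toy data: class 0 near (0, 0), class 1 near (5, 5).
xs = [(0.1 * i, 0.0) for i in range(10)] + [(5.0 + 0.1 * i, 5.0) for i in range(10)]
ys = [0] * 10 + [1] * 10
models = bagging_fit(xs, ys)
print(bagging_predict(models, (0.2, 0.1)), bagging_predict(models, (5.3, 4.8)))
```

Replacing `fit_nearest_centroid` with any other function that returns a predictor leaves the resampling and voting logic unchanged.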

The three approaches can be combined to further improve the classification accuracy. For example, the bootstrap aggregating method can also be incorporated into the brain-speech mapping method. Moreover, the results of classifying initial consonants using data from single channels show that some channels do not contribute to classifying phonemes. The classification model could be further improved by using only the channels closely related to phoneme perception or by introducing more sophisticated spatial analysis methods.

Using the SVM-with-Bagging method, I was able to address the first problem and examine the frequency-domain decomposition of EEG. I found that the phase pattern of brainwave oscillations in the frequency range from 2 Hz to 9 Hz is highly related to phoneme processing. Using the phases of the sinusoidal components from 2 Hz to 9 Hz, the classifier can recognize 51.4% of the test samples of 8 initial consonants, up from 41.5% when the EEG time-domain signal is used.
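A minimal Python sketch of such a phase-only representation is given below. It computes the Discrete Fourier Transform directly from its definition and keeps only the angles of the components between 2 Hz and 9 Hz. The function name `phase_features`, the sampling rate of 100 Hz, the one-second trial length and the synthetic test signal are assumptions made for illustration and are not the recording parameters of the experiments.

```python
import cmath
import math
from typing import List, Sequence


def phase_features(signal: Sequence[float], fs: float,
                   f_lo: float = 2.0, f_hi: float = 9.0) -> List[float]:
    """Return the DFT phase angles (radians) of the bins whose frequency lies
    in [f_lo, f_hi]. Amplitudes are discarded."""
    n = len(signal)
    phases = []
    for k in range(n // 2 + 1):
        freq = k * fs / n  # frequency of DFT bin k in Hz
        if f_lo <= freq <= f_hi:
            coeff = sum(
                s * cmath.exp(-2j * math.pi * k * t / n)
                for t, s in enumerate(signal)
            )
            phases.append(cmath.phase(coeff))
    return phases


# Hypothetical trial: 1 s sampled at 100 Hz, so the DFT bins are 1 Hz apart
# and the 2-9 Hz band contains 8 bins.
fs = 100.0
n = 100
trial = [math.cos(2 * math.pi * 5 * t / fs + math.pi / 4) for t in range(n)]

features = phase_features(trial, fs)
louder = phase_features([3.0 * s for s in trial], fs)

# The 5 Hz component is the fourth bin in the band (2, 3, 4, 5 Hz).
print(len(features), features[3], louder[3])
```

Scaling the signal leaves the phase of the 5 Hz component unchanged, which is the sense in which the representation ignores amplitude. In this synthetic signal the other bins carry essentially no energy, so their angles are determined by rounding error and are not meaningful.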

In this thesis, I also studied the ordinal similarity differences of the brainwave representations of phonemes, derived from the confusion matrices of the classification results, and compared them with the perceptual similarity of phonemes. The robustness of the feature voicing can be found in both the brain and the perceptual representations of consonants. Likewise, the feature vowel-height is more distinct than vowel-backness in both the brain and the perceptual representations of vowels. The invariant similarity in the brain and perceptual representations of phonemes supports the claim that the brain activities of perceiving phonological features can be effectively observed by measuring EEG activities and are captured by our detailed model.


