Antje Strauß: Neural oscillatory dynamics of spoken word recognition. Leipzig: Max Planck Institute for Human Cognitive and Brain Sciences, 2015 (MPI Series in Human Cognitive and Brain Sciences; 163)
Neural oscillatory dynamics of spoken word recognition

Impressum
Max Planck Institute for Human Cognitive and Brain Sciences, 2015
This work is licensed under the following Creative Commons license: http://creativecommons.org/licenses/by-nc/3.0
Print: Sächsisches Druck- und Verlagshaus Direct World, Dresden
Cover image: © Antje Strauß, 2015
ISBN 978-3-941504-47-9

Neural oscillatory dynamics of spoken word recognition
Dissertation
submitted to the Faculty of Biosciences, Pharmacy and Psychology
of the University of Leipzig
for the academic degree of
doctor rerum naturalium
(Dr. rer. nat.)
presented by
Antje Strauß, Magistra Artium
born 8 June 1985 in Blankenburg/Harz
Leipzig, 1 October 2014
Bibliographic details
Antje Strauß
Neural oscillatory dynamics of spoken word recognition
Fakultät für Biowissenschaften, Pharmazie und Psychologie, Universität Leipzig
Dissertation
163 pages, 359 references, 22 figures
This thesis investigated slow oscillatory signatures of spoken word recognition. In particular, we aimed to dissociate alpha (~10 Hz) and theta (~4 Hz) band oscillations to understand the underlying neural mechanisms of lexico-semantic processing. Three experiments were conducted while recording the electroencephalogram (EEG): i) an auditory lexical decision task in quiet, ii) an auditory lexical decision task in white noise, and iii) an intelligibility rating of cloze probability sentences at different levels of noise-vocoding (spectrally degraded speech). The results show that alpha oscillations play a role during spoken word recognition in three possible ways: First, induced alpha power scaled with lexicality, that is, with the difficulty of mapping the phonological representation onto meaning. Post-lexical alpha power was suppressed for words, indicating processing of lexico-semantic information; in turn, alpha power was enhanced for pseudowords, indicating the inhibition of lexico-semantic processing. Second, induced alpha power was enhanced at the beginning of words embedded in noise compared to clear speech, in line with the presumed inhibitory function of alpha. We propose a framework to further assess the role of alpha in selectively inhibiting task-irrelevant auditory objects. Third, pre-stimulus alpha phase was found to modulate lexical decision accuracy in noise. We interpreted this finding as reflecting selective inhibition, in the sense that stimuli coinciding with the excitatory phase were more likely to be thoroughly processed, and thus ultimately judged correctly, than stimuli coinciding with the inhibitory phase. Furthermore, we were able to associate theta oscillations with lexico-semantic processing. First, induced theta power was post-lexically enhanced selectively for ambiguous pseudowords that differed in only one vowel from their real-word neighbours.
We interpreted this finding in terms of resolving the response conflict induced by these pseudowords' proximity to real words. We suggest that phonemic information needed to be "replayed" in order to be re-compared with long-term memory representations and thus to resolve the ambiguity. Second, in high cloze probability sentences, theta power was enhanced just before the onset of the sentence-final word, indicating anticipatory activation of lexico-semantics in long-term memory. The results provide novel evidence on the temporal mechanisms of spoken word recognition. These findings are discussed with regard to their implications for the nonlinearity of speech processing and for the reassessment of event-related potentials.
Acknowledgements
First of all, I am much obliged to my supervisor Jonas Obleser, who turned me into a natural scientist. I owe him most of my knowledge about signal processing, data analysis and the art of typography. He founded the incredible "Auditory Cognition" group, which, with its inspiring and challenging methodological discussions, became almost a family.
Mathias Scharinger, sitting to my left, and Molly Henry, sitting to my right, became my scientific foster parents. I would like to thank them for discussing crazy brain measures, playing word games, listening to Bach and Metal simultaneously, and for their moral support throughout this time. I would like to thank Malte Wöstmann, Anna Wilsch, Björn Herrmann, Julia Erb, Alex Brandmeyer and Sung-Joo Lim for the constant critical exchange, preferably with a cup of espresso or a coke in hand. I am grateful to Dunja Kunke and a crew of student assistants, among them Sergej Schwigon, David Stoppel, Christina Otto, Christoph Daube, Steven Kalinke and Leo Waschke, who helped acquire and preprocess the data. They created a wonderful working environment.
Furthermore, I would like to thank Sonja Kotz for her initial ideas and encouragement at different stages of the PhD period. I greatly appreciated the discussions with Hellmuth Obrig about the implications of my results for aphasic patients. Finally, I thank Jörg Jescheniak for accepting and assessing my work.
This work is dedicated to my loving grandfathers Wolfgang Witt and Otto Strauß. It was not granted to both of them to witness the completion of this dissertation with me.
Contents
1 General Introduction 1
1.1 Spoken word recognition and its cognitive efforts ...... 2
1.1.1 Psycholinguistic models of spoken word recognition ...... 2
1.1.2 Recognition of spoken words in noise ...... 3
1.2 Spoken word recognition and its neural basis ...... 4
1.2.1 Alpha oscillations and attention ...... 7
1.2.2 Theta oscillations and semantic memory ...... 8
1.3 General Hypotheses ...... 9
2 General Methods 11
2.1 The auditory lexical decision task ...... 11
2.2 Adaptive tracking procedures ...... 15
2.3 Electroencephalography ...... 16
2.3.1 The neurophysiological basis of EEG ...... 17
2.3.2 Preprocessing and artefact rejection ...... 17
2.3.3 Event-related potentials ...... 18
2.3.4 Time–frequency analysis ...... 18
2.3.5 Source localization ...... 20
3 Alpha and theta power dissociate in spoken word recognition 23
3.1 Introduction ...... 23
3.2 Methods ...... 25
3.2.1 Participants ...... 25
3.2.2 Stimuli ...... 25
3.2.3 Experimental procedure ...... 26
3.2.4 Electroencephalogram acquisition ...... 27
3.2.5 Data analysis: event-related potentials ...... 28
3.2.6 Data analysis: time–frequency representations ...... 28
3.2.7 Source localisation of time–frequency effects ...... 29
3.3 Results ...... 30
3.3.1 Highly accurate performance ...... 30
3.3.2 Sequential effects of word-pseudoword discrimination in ERPs ...... 30
3.3.3 Differential signatures of wordness in time–frequency data ...... 31
3.3.4 Source localization of alpha and theta power changes ...... 31
3.3.5 Two separate networks disclosed by an alpha–theta index ...... 33
3.4 Discussion ...... 34
3.4.1 Wordness effect in the alpha band ...... 34
3.4.2 Ambiguity effect in the theta band ...... 35
3.4.3 Relationship of evoked potentials and induced oscillations ...... 37
3.4.4 Conclusion ...... 37
4 Alpha oscillations as a tool for auditory selective inhibition 39
4.1 Introduction ...... 39
4.2 A framework to test auditory alpha inhibition ...... 39
4.3 A short review of auditory alpha inhibition ...... 41
4.4 Conclusion ...... 43
5 Alpha phase determines successful lexical decision in noise 45
5.1 Introduction ...... 45
5.2 Methods ...... 46
5.2.1 Participants ...... 46
5.2.2 Stimuli ...... 46
5.2.3 Experimental procedure ...... 47
5.2.4 Data acquisition and preprocessing ...... 47
5.2.5 Data analysis: the phase bifurcation index ...... 48
5.3 Results ...... 50
5.3.1 Accuracy of lexical decisions ...... 50
5.3.2 Alpha phase predicts lexical-decision accuracy ...... 50
5.3.3 Accuracy is not predicted by other measures ...... 51
5.3.4 Phase effects in the theta band ...... 52
5.4 Discussion ...... 53
5.4.1 Fluctuations in the probability of attentional selection ...... 53
5.4.2 Alpha phase reflects decision weighting ...... 54
5.4.3 Accuracy is not predicted by other measures ...... 55
5.4.4 Theta vs alpha phase effects on lexical decision ...... 56
5.4.5 Conclusion ...... 56
5.5 Supplement Behaviour ...... 57
5.5.1 Introduction ...... 57
5.5.2 Methods ...... 57
5.5.3 Results ...... 58
5.5.4 Discussion ...... 60
5.6 Supplement Bifurcation Index ...... 63
5.6.1 Introduction ...... 63
5.6.2 Methods and Results ...... 63
5.6.3 Discussion ...... 66
6 Narrowed expectancies in degraded speech 67
6.1 Introduction ...... 67
6.1.1 Semantic context ...... 68
6.1.2 Neural signatures of context in language comprehension ...... 68
6.1.3 Semantic benefits in adverse listening ...... 69
6.2 Methods ...... 71
6.2.1 Participants ...... 71
6.2.2 Stimuli and design ...... 71
6.2.3 Pilot study ...... 73
6.2.4 Electroencephalogram acquisition ...... 74
6.2.5 Data analysis ...... 75
6.3 Results ...... 76
6.3.1 Intelligibility rating and reaction time ...... 76
6.3.2 Event-related potentials to sentence onset: N100–P200 ...... 77
6.3.3 Event-related potentials to sentence-final word: N400 ...... 77
6.4 Discussion ...... 79
6.4.1 N400 and behavioural responses: fast vs. delayed processes ...... 81
6.4.2 Prediction capacities and other cognitive resources ...... 83
6.4.3 Conclusion ...... 85
6.5 Supplement Theta power and phase ...... 86
6.5.1 Introduction ...... 86
6.5.2 Methods ...... 86
6.5.3 Results ...... 86
6.5.4 Discussion ...... 88
7 General Discussion 91
7.1 Summary of experimental findings ...... 91
7.2 The dissociation of alpha and theta activity ...... 93
7.3 Spoken word recognition as a nonlinear process ...... 94
7.4 N400 and inter-trial phase coherence ...... 95
7.5 Alpha activity along the auditory pathway ...... 97
7.6 Theta oscillations and speech processing ...... 98
References 101
List of Figures 125
List of words and pseudowords 127
List of cloze probability sentences 131
Summary 137
Zusammenfassung 143

Was das Gehör betreffe, so schreibe, und zwar nur auf das Oberflächlichste, soll Konrad zum Baurat gesagt haben, sagt Wieser, entweder ein Arzt, was gänzlich falsch sei, oder ein Philosoph darüber, was gänzlich falsch sei. Schreibe ein Arzt über das Gehör, sei das völlig wertlos. Schreibe ein Philosoph darüber, sei das auch völlig wertlos. Man darf nicht nur Arzt und man darf nicht nur Philosoph sein, wenn man sich eine Sache wie das Gehör vornimmt und an sie herangehe. Dazu müsse man auch Mathematiker und Physiker und also ein vollkommener Naturwissenschaftler und dazu auch noch Prophet und Künstler sein und das alles in höchstem Maße.
[Konrad is supposed to have said to the inspector [...] that it is usually either a philosopher or a doctor who writes about the human ear. Neither is adequately prepared for the task and in either case they only treat the phenomenon of hearing in the most superficial manner. If a doctor writes about hearing it is entirely worthless. If a philosopher writes about hearing it is equally worthless. When dealing with such a thing as the human ear, one must be more than a doctor, more than a philosopher. One must be a mathematician, and a physicist, a well-rounded scientist in fact. Nor is that enough either, as one must be something of a prophet and an artist, too—and not just of the common kind.]
Thomas Bernhard. Das Kalkwerk. [Loosely translated by A.S.]
1 General Introduction
Understanding deficient speech is a challenge for every listener in everyday life. Noise caused by traffic and construction sites, or by interfering talkers such as a group of toddlers on the playground, imposes problematic hearing situations. Speech might also be internally degraded because of age-related hearing loss or signal distortions induced by hearing aids or cochlear implants. Besides these acoustic limitations, speech may be effortful to process because a speaker is lisping or mispronouncing words. The ubiquitous issue of listening under adverse conditions has consequently been examined by all kinds of scientific branches. Psychologists, for example, have investigated whether cognitive processing capacities are deployed and whether additional attention is allocated to deal with these perceptual challenges. Linguists, furthermore, have asked what kind of speech information enables the listener to compensate for sparse perceptual evidence and, for example, how knowledge about the semantic context can support the comprehension of upcoming words. The current thesis concerns the interface of both the psychological and the linguistic perspective and aims at answering the outlined questions in a neuroscientific framework by asking about the neural temporal dynamics of speech processing under adverse conditions. Speech processing ultimately means that meaning is derived from acoustic-phonetic input that unfolds over time. A spoken word, in particular, is supposed to be processed in analogy to reading, in a left-to-right fashion. That is, as the word unfolds in time, more and more linguistic information is accumulated until the word is recognized and semantics can be mapped onto the phonological representation. Word recognition can be achieved as soon as the word becomes uniquely different from all other possible words (the so-called word recognition point; Marslen-Wilson, 1987).
For example, the recognition point of banana occurs at the second /a/ because at this point banana is the only possible word candidate that remains. That means most multisyllabic words can be recognized before the complete word has been heard. The time point of word recognition can even be shifted to an earlier position within the word by embedding the word into sentence context (Miller et al., 1951; Grosjean, 1980). In contrast, if noise is introduced to the acoustic signal, word recognition might be delayed and additional cognitive efforts are required in order to achieve semantic mapping. One problem is the increased confusion of segmental information, i.e. vowels and consonants, in noise (Phatak et al., 2008), which necessitates top-down compensatory processes like
attentional efforts (Rönnberg et al., 2013). Also, word recognition in noise can be improved when words are embedded in predictive sentence contexts (Kalikow et al., 1977). The current thesis investigates spoken word recognition in ideal and adverse listening conditions by means of electroencephalography. In particular, it asks about the underlying neural temporal dynamics when semantic mapping is more effortful. The focus lies on determining signatures of slow neural oscillations, thus extending current knowledge gained by analysing event-related potentials. The following sections provide an overview of current models of spoken word recognition. Then, compensatory strategies for spoken word recognition in noise are outlined. Subsequently, neural oscillations and their putative role in speech processing are introduced. Finally, the general hypotheses of the current thesis are derived.
1.1 Spoken word recognition and its cognitive efforts
1.1.1 Psycholinguistic models of spoken word recognition
In spoken word recognition, the basic problem is that an auditory signal unfolding in time needs to be processed such that phonemic evidence is accumulated and mapped onto a representation in long-term memory, i.e. the mental lexicon. The mental lexicon contains all known words of a language together with information about their pronunciation and their semantic and syntagmatic relationships (for discussion of its organisation see for example Elman, 2004). Classical ideas about spoken word recognition assume three steps from acoustic-phonetic analysis to semantic mapping, namely phonetic identification, lexical selection, and finally integration (Marslen-Wilson, 1987). First, lexical processes are initialized by identifying the first phonemes at word onset. Second, as more input is received, matching lexical entries can be pre-selected, and the lexical search can be more and more refined. Third, lexical access is accomplished by integrating lexical information and by mapping semantic information onto the phonological representation. One of the most influential models, the Cohort model, implements word recognition as a purely bottom-up driven process, i.e. in analogy to reading from left to right (Marslen-Wilson and Tyler, 1980; for discussion see Norris et al., 2000). Word onsets pre-activate a cohort of possible words, and as the signal unfolds and more phonemic information becomes available, fewer entries of the cohort match until only one of them is left. Unfortunately, in this model word recognition fails as soon as a wrong phoneme occurs (in the worst case already at word onset), as the cohort would immediately be empty; the model does not allow any feedback loop that would inform the segmental level about lexical knowledge.
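As a purely illustrative sketch (the toy lexicon and the use of letters as stand-ins for phonemes are my own assumptions, not the thesis's stimuli), the Cohort model's bottom-up narrowing toward a recognition point might look like this:

```python
# Toy sketch of the Cohort model's strictly bottom-up narrowing
# (Marslen-Wilson and Tyler, 1980). Lexicon and "phonemes" are illustrative.

LEXICON = {"banana", "bandit", "band", "ballad", "candle"}

def cohort(prefix, lexicon=LEXICON):
    """All words still compatible with the segments heard so far."""
    return {w for w in lexicon if w.startswith(prefix)}

def recognition_point(word, lexicon=LEXICON):
    """Index at which the cohort shrinks to the single target word,
    or None if the word never becomes unique (e.g. 'band' is embedded
    in 'bandit' and cannot be uniquely identified bottom-up)."""
    for i in range(1, len(word) + 1):
        if cohort(word[:i], lexicon) == {word}:
            return i
    return None

print(recognition_point("banana"))  # 4: unique once "bana" has been heard
print(cohort("ban"))                # banana, bandit, band still compete
print(cohort("p"))                  # empty set: one wrong onset segment and
                                    # the cohort collapses irrecoverably
```

The empty cohort after a mispronounced onset is exactly the brittleness discussed above: without lexical-to-segmental feedback, the model has no way to recover.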
This implementation contradicts experimental results showing that participants believe they perceive phonemes that had actually been masked (Warren, 1970; Samuel and Ressler, 1986; Sivonen et al., 2006), or that mispronunciations may stay undetected (Cole et al., 1978) because of overriding lexical knowledge.
A first attempt to account for this shortcoming was offered by Trace (McClelland and Elman, 1986), where the identity of a phoneme varies as a function of lexical context, forward as well as backward (for criticism see Grossberg and Kazerounian, 2011). The influence of contextual information, however, has been further developed to arrive at more precise predictions about the accuracy of spoken word recognition. Hence, there are models available which consider the beneficial effect of higher word frequency (Howes, 1954), the interaction between neighbourhood density and frequency (Goldinger et al., 1989; Cluff and Luce, 1990; Newman et al., 1997), and confusion matrices for vowels and consonants (Miller and Nicely, 1955; Ladefoged, 2005; Phatak and Allen, 2007) to appropriately weight lexical activation (for example, NAM: Luce and Pisoni, 1998, and Shortlist B: Norris and McQueen, 2008). These models are able to predict word recognition accuracy for words in ideal and adverse listening conditions (for a comprehensive review of these models see Jusczyk and Luce, 2002). In the current thesis, lexical access is studied first by comparing real words and pseudowords. Pseudowords closely resemble real words but have no representation in the mental lexicon, i.e. they have no meaning. The resemblance, though, triggers some initial lexical search, so that by comparing real words and pseudowords, successful and failed semantic mapping can be investigated. Second, the facilitation of lexical access by preceding semantic context is studied, as it reveals how strongly context and target word are associated with each other. The robustness of dissociating words and pseudowords on the one hand, and of predicting words from context on the other, will be tested by introducing background noise and by degrading the spectral information of the speech signal itself.
Therefore, in the following section, adverse listening conditions and the cognitive mechanisms required to achieve spoken word recognition in noise will be introduced.
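The frequency-weighted competition that models such as NAM formalize can be sketched in a deliberately simplified form (the one-substitution neighbour rule and the toy frequency counts below are my own assumptions; the full models additionally weight by segment-level confusion probabilities):

```python
# Rough sketch in the spirit of the Neighborhood Activation Model
# (Luce and Pisoni, 1998): identification probability is the target's
# frequency-weighted evidence divided by the summed evidence of the
# target plus its lexical neighbours. Toy frequencies, not real counts.

def neighbours(word, lexicon):
    """Words differing from `word` by exactly one substituted segment."""
    return [w for w in lexicon
            if len(w) == len(word) and w != word
            and sum(a != b for a, b in zip(w, word)) == 1]

def identification_prob(word, lexicon, freq):
    own = freq[word]
    return own / (own + sum(freq[n] for n in neighbours(word, lexicon)))

freq = {"cat": 50, "cap": 20, "cut": 10, "bat": 20}
lexicon = list(freq)

# A dense, high-frequency neighbourhood lowers identification probability:
print(identification_prob("cat", lexicon, freq))  # 50 / (50+20+10+20) = 0.5
```

The design choice mirrors the text: higher target frequency helps, while dense neighbourhoods of frequent competitors hurt, which is how such models predict accuracy differences between ideal and adverse listening.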
1.1.2 Recognition of spoken words in noise
Adding noise to the speech signal increases the confusability of segmental information such as consonants and vowels (Felty, 2007; Phatak et al., 2008). Thus, in order to overcome this confusability, compensatory processes are needed to enable word recognition. For example, working memory, as a short-term storage with limited capacity (for a review see Awh et al., 2006), can be used for temporary compensation. A recent model by Rönnberg et al. (2013) suggests that as soon as a mismatch emerges between what can be encoded from the acoustic signal and what is represented in the listener's mental lexicon, additional working memory resources are used to disambiguate confusing speech signals on-line. It has been suggested that listeners with higher working memory capacity experience less listening effort under adverse listening conditions (Pichora-Fuller and Singh, 2006; Rudner et al., 2012). This might be because more resources can be engaged in reducing confusability and thus listening effort. Traditionally, the capacity of working memory has been determined by the number of items (e.g., words) that can be stored. Recently, a new concept has emerged that also ties capacity to the encoding precision of each item (Ma et al., 2014). Crucially for the current thesis, if speech is degraded, confusability is high and stimulus encoding cannot be precise. Therefore, more working memory resources are needed to increase encoding precision. The subprocess of working memory dedicated to the short-term storage of phonemically coded information is usually referred to as the phonological loop (Baddeley and Hitch, 1974; Baddeley, 2012). Encoding of degraded speech can be improved (thus reducing working memory load) if attention is allocated in order to enhance the task-relevant signal and to suppress the task-irrelevant noise (Broadbent, 1958; for review see Driver, 2001).
The top-down increase of the signal-to-noise ratio is defined as the attentional gain (Ling and Carrasco, 2006). People with higher working memory capacities have been found to allocate attention more effectively, linking the concept of attention closely to working memory resources (for discussion see Awh et al., 2006). In the current thesis, attentional processes during encoding and retrieval of words will be of main interest. The psychological frameworks of Rönnberg et al. (2013) and Baddeley (2012) constitute important bridges between the psycholinguistic modelling of spoken word recognition reported in the previous section and the neuropsychological examinations described in more detail in the next section. For example, attempts to find the neural basis of the phonological loop have helped to describe functions of cortical regions (Paulesu et al., 1993). Conversely, neuropsychological advances can inform these psychological frameworks and modify their conception. The continuing search for a single piece of cortex subserving the function of the phonological loop has so far failed, so that the concept might need to be reconsidered (Buchsbaum and D'Esposito, 2008). One assumption of the current thesis is that the phonological loop and attention, both especially beneficial in adverse listening conditions, will be reflected in slow neural oscillations. Oscillatory mechanisms indicate dynamic synchronization of brain areas in certain frequency bands, thus temporarily enabling or inhibiting information processing. These assumed neural mechanisms will be laid out in the following section.
1.2 Spoken word recognition and its neural basis
In cognitive neuroscience, word recognition in the sense of meaning retrieval was first investigated in the visual domain and by means of electroencephalography (EEG; for a detailed description of the method see Section 2.3). The most prominent neural correlate of lexico-semantic processing was found when participants read sentences that were completed by either congruent or incongruent words. Semantically incongruent words elicited a more negative amplitude, peaking around 400 ms after word onset, in comparison to congruent words (Kutas and Hillyard, 1980). This seminal study triggered
30 years of experimental work investigating the so-called N400 component (for review see Kutas and Federmeier, 2011; Van Petten and Luka, 2012). Besides semantically incongruent sentence contexts, the N400 has also been found to be sensitive to segmental manipulations. This has been shown using the lexical decision paradigm, which is supposed to tap into lexico-semantic mapping comparably to the context manipulation. In this experimental setting, participants hear words or word-like sounds and are asked to respond whether what they just heard was a word or not. Word-like stimuli, or pseudowords, are words with some phonotactically legal, segmental (or phonetic) alterations. Compared to words or to phonotactically illegal nonwords, pseudowords elicit larger N400 magnitudes (i.e., absolute amplitudes; e.g., Bentin, 1987; for review see Kutas and Van Petten, 1994). This is in line with the common interpretation of the N400 as a marker of the neural processing effort of semantic mapping. Since pseudowords are phonotactically legal, lexical search is induced, but mapping onto lexico-semantic representations in long-term memory is difficult; thus, neural processing effort is increased. In sentences, however, neural processing effort is increased because the context is incongruent with the sentence-final word. This led to the view that congruent context facilitates lexico-semantic mapping and therefore reduces the N400 response, whereas incongruent context increases semantic integration effort and thus increases the N400 magnitude. Although the underlying neural effort of processing phonotactically legal pseudowords compared to processing words with preceding incongruent contexts might be fundamentally different, both are nonetheless reflected in an increased N400 magnitude.
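The ERP logic behind such N400 comparisons, averaging time-locked epochs per condition and then comparing mean amplitude in a window around 400 ms, can be sketched on synthetic data (the signal shape, noise level, trial count, and analysis window below are assumptions for illustration, not the thesis's recordings):

```python
import numpy as np

fs = 500                              # sampling rate (Hz)
t = np.arange(-0.2, 0.8, 1 / fs)      # epoch time axis (s), 0 = word onset
rng = np.random.default_rng(0)

def simulate_epochs(n400_gain, n_trials=60):
    """Noisy single-trial epochs with a negativity peaking near 400 ms."""
    n400 = -n400_gain * np.exp(-((t - 0.4) ** 2) / (2 * 0.05 ** 2))
    return n400 + rng.normal(0.0, 2.0, size=(n_trials, t.size))

# Averaging across trials suppresses the noise and leaves the ERP:
erp_word = simulate_epochs(n400_gain=1.0).mean(axis=0)
erp_pseudo = simulate_epochs(n400_gain=3.0).mean(axis=0)

win = (t >= 0.3) & (t <= 0.5)         # typical N400 analysis window
print(erp_word[win].mean(), erp_pseudo[win].mean())  # pseudoword more negative
```

Note that this averaging step is precisely what discards non-phase-locked (induced) activity, which is why the oscillatory analyses introduced next can reveal effects the ERP cannot.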
Some authors would argue that the astonishing invariance in latency reflects in both cases the initial access to long-term memory, independent of word recognition, which happens only at a later stage (as described in Section 1.1.1; Kutas and Federmeier, 2011). Here, another, more recently emerged perspective will be introduced that tries to explain linguistic processes based on neural oscillations (Ghitza, 2011; Giraud and Poeppel, 2012; for a discussion of the relationship between neural oscillations and event-related potentials like the N400 component see Section 2.3). The importance of induced neural oscillations for cognitive functioning has so far been underestimated, and they have often been disregarded as the noise in the EEG signal. Hence, although the N400 appears to be consistent, neural oscillatory patterns might differ between the two experimental settings and might thus reveal different involved cognitive functions. In principle, oscillatory accounts of speech processing assume that the temporal structure of the input signal, e.g. a spoken sentence, is coupled to the frequency of the neural oscillation applied to processing this information. From the neuronal perspective, it is known that neuronal populations oscillate intrinsically at their preferred frequencies (Buzsáki and Draguhn, 2004) and, because of their resonating characteristics, neurons "select" sensory input based on their preferred frequency range (Schroeder and Lakatos, 2009). This has been suggested to lead to a rhythmic sampling of linguistic information (for a review see
Ding and Simon, 2014). Sampling (also often referred to as chunking) evolves because neural oscillations reflect fluctuations in cortical excitability, so that linguistic information coinciding with the excitable phase is more thoroughly processed than information coinciding with the inhibitory phase. These ideas will now be discussed in more detail. The correspondence between naturally occurring frequency bands in brain oscillations and the rhythms in speech has been modeled computationally, for example, by Ghitza (2011) (an earlier version of the model can be found in Ghitza and Greenberg, 2009). Including some experimental evidence, he argues that delta oscillations (~1 Hz) sample words or prosodic phrases whose physical duration is greater than a second, theta (~4 Hz) samples syllables with durations of about 250 ms, beta (~15 Hz) samples phonemes, and gamma (>30 Hz) samples phonetic features. Poeppel (2003) suggests that sampling frequencies might be asymmetric in the left and right hemispheres of the brain (the so-called asymmetric sampling in time (AST) hypothesis). Although speech processing activates primary auditory cortex bilaterally, there might be different temporal integration windows in higher association areas of the left and right auditory cortices. Based on initial neurophysiological evidence, he elaborates that the left might sample rapid changes in the gamma range whereas the right integrates over longer time windows in the theta range. One must not forget that there are also neurophysiological reasons why neuron populations would oscillate faster or slower. According to the communication-through-coherence view (Fries, 2005; Tiesinga and Sejnowski, 2010; Akam and Kullmann, 2012), phase-locked oscillations in the same frequency band indicate information exchange between the affected neurons.
On the one hand, the frequency range depends on the size of the neuron populations that communicate with each other: the bigger the population, the slower the oscillation frequency (Buzsáki and Draguhn, 2004). On the other hand, the frequency range depends on the distance between two communicating neuronal populations: the further apart, the slower the oscillation frequency (Buzsáki and Draguhn, 2004). In the case of speech processing, both reasonings, the type of speech information and the neurophysiological constraints, converge, because the binding of phonetic features certainly affects fewer neuron populations (perhaps confined to primary auditory cortex) than the semantic integration of several words in a sentence. Another, multimodal approach to the chunking idea emphasizes that the auditory cortices might settle on preferred frequencies accommodated to articulation-conditioned speech rhythms. Specifically, syllables are characterized not only by rhythmic acoustic amplitude fluctuations but also by cycling mouth openings. That means that the articulatory motor system generates output that is optimal for the central auditory system to process (Giraud and Poeppel, 2012). This sets the stage for interesting evolutionary reasonings about the correspondence between brain and speech rhythms, which are beyond the scope of the current thesis. Beyond the acoustic analysis in auditory cortex, slow neural oscillations have been shown to play a role in higher cognitive functions as well. Most important for the purpose of the current thesis, as outlined in the previous sections, are attention and long-term (or semantic) memory when retrieving lexico-semantic information. Two frequency bands are associated with these functions, namely alpha (8–12 Hz) and theta (3–7 Hz) oscillations, respectively.
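The communication-through-coherence idea rests on quantifying how consistently two signals keep a fixed phase relation; one standard measure is the phase-locking value (PLV). A minimal sketch on synthetic signals (the frequencies, noise levels, and Hilbert-based phase estimate are illustrative assumptions, not a claim about any specific dataset):

```python
import numpy as np
from scipy.signal import hilbert

fs, f = 250, 10                        # sampling rate and carrier frequency (Hz)
t = np.arange(0, 2, 1 / fs)
rng = np.random.default_rng(1)

def plv(x, y):
    """Phase-locking value: 1 = perfectly locked phases, near 0 = unrelated."""
    dphi = np.angle(hilbert(x)) - np.angle(hilbert(y))
    return np.abs(np.mean(np.exp(1j * dphi)))

# Two noisy oscillations with a constant phase lag are "communicating":
a = np.sin(2 * np.pi * f * t) + 0.3 * rng.standard_normal(t.size)
b = np.sin(2 * np.pi * f * t + 0.8) + 0.3 * rng.standard_normal(t.size)
noise = rng.standard_normal(t.size)    # no shared rhythm

print(plv(a, b))      # high: consistent phase relation despite the lag
print(plv(a, noise))  # low: phase relation drifts randomly
```

Note that PLV is insensitive to the size of the lag itself, only to its consistency, which is what makes it a proxy for a stable communication channel between two populations.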
In the following, both frequency bands will be characterized in detail, leading finally to the specific hypotheses of the current thesis.
1.2.1 Alpha oscillations and attention¹
Neural oscillations in the alpha frequency range (~10 Hz) are the most dominant signals measurable in the human magneto- and electroencephalogram (M/EEG), going back to their first description by Hans Berger (Berger, 1931). The earliest observations of the alpha rhythm revealed that its amplitude is enhanced in humans who are awake but not actively engaged in any task. This finding initially led to the view that high alpha power might simply reflect the default state of brain inactivity or "cortical idling" (for a review, see Pfurtscheller et al., 1996). Only within the last two decades has the functional significance of alpha oscillations been recognized, along with its ubiquitous role across sensory modalities (visual: for review see Mathewson et al., 2011; sensorimotor: e.g., Haegens et al., 2012; auditory: e.g., Hartmann et al., 2012) and cognitive tasks (working memory: e.g., Jensen et al., 2002; attention: for a review see Klimesch, 2012; decision making: e.g., Cohen et al., 2009). One unifying mechanism suggested for alpha rhythms across modalities and brain areas is that they provide a neural means to functionally inhibit the processing of currently task-irrelevant or task-detrimental information (Jensen and Mazaheri, 2010; Foxe and Snyder, 2011). The functional inhibition hypothesis has received neurophysiological support. For example, both alpha power (i.e., squared amplitude) and alpha phase modulate neuronal spike rate (Haegens et al., 2011) and thus can directly affect the efficiency of neural information flow. In future work beyond the scope of the current thesis, the alpha network needs to be further characterized by its phase–amplitude coupling to gamma oscillations (Jensen et al., 2012) and its role in top-down control as implemented in different cortical layers (Buffalo et al., 2011; Spaak et al., 2012) or in thalamo-cortical communication (Strauss et al., 2010; Roux et al., 2013).
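Extracting the alpha power (squared amplitude) and phase referred to above from a single channel is commonly done with a band-pass filter followed by a Hilbert transform; the sketch below runs on a synthetic signal (the signal composition and filter settings are assumptions for illustration):

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

fs = 250
t = np.arange(0, 4, 1 / fs)
rng = np.random.default_rng(2)

# Synthetic "EEG": a 10 Hz alpha component buried in broadband noise.
eeg = np.sin(2 * np.pi * 10 * t) + 0.5 * rng.standard_normal(t.size)

# Band-pass 8-12 Hz (zero-phase filtering), then take the analytic signal.
b, a = butter(4, [8, 12], btype="bandpass", fs=fs)
alpha = filtfilt(b, a, eeg)
analytic = hilbert(alpha)

power = np.abs(analytic) ** 2   # instantaneous alpha power (squared amplitude)
phase = np.angle(analytic)      # instantaneous alpha phase in radians

print(power.mean())             # near 1 for a unit-amplitude alpha component
```

The phase time series is what pre-stimulus phase analyses (such as the lexical decision study summarized in the abstract) bin trials by, while the power time series underlies the enhancement and suppression effects discussed throughout this section.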
Despite the abundance of studies on the role of alpha activity for visual selective inhibition, there are currently few studies that directly examine the role of alpha activity in the auditory modality. Recently, a series of studies found modulations in alpha power in a variety of auditory tasks prompted by degraded spectral detail (Obleser and Weisz, 2012), missing temporal expectations (Wilsch et al., 2014), working memory load (Obleser et al., 2012; Leiberg et al., 2006), or syntactic complexity (Meyer et al., 2013). Together, these
1 This section is adapted from parts of the article published by Strauß, Wöstmann, and Obleser (2014). Front Hum Neurosci 8, 350.
findings provide good evidence that alpha oscillatory power can be a reliable indicator of auditory cognitive load (see also Luo et al., 2005; Kaiser et al., 2007). For the current interest in spoken word recognition, alpha oscillations might therefore serve, on the one hand, as an important neural mechanism for inhibiting task-irrelevant noise via increased alpha power. On the other hand, lexico-semantic processing might be indexed by suppressed alpha power, that is, by enabled neural information flow. In sum, alpha activity might be a neural means to implement attention, reflecting what is task-relevant and what is task-irrelevant.
1.2.2 Theta oscillations and semantic memory
Neural oscillations in the theta frequency range (~4 Hz) were first described in the context of animal studies, where they were observed as the dominant rhythm of the hippocampus (Jung and Kornmüller, 1938; Green and Arduini, 1954). To date, it is not clear how the hippocampal theta and the cortical theta rhythm observed in the human EEG are related to each other (Cantero et al., 2003; Lisman and Jensen, 2013). However, hippocampal theta oscillations have been reliably shown to be associated with memory encoding and retrieval in animals (for review see Düzel et al., 2010; Fell and Axmacher, 2011). In humans, depth recordings suggest that theta oscillations are involved in mediating the functional coupling of the medial temporal lobe and prefrontal cortex in order to subserve memory functions (for review see Johnson and Knight, 2015). For example, one potential mechanism underlying working memory might be a periodic reactivation of maintained information in theta-timed oscillatory cycles (Fuentemilla et al., 2010). For the current interest in spoken word recognition, this makes theta oscillations a putative neural means to implement the phonological loop sketched in the previous section (Roux and Uhlhaas, 2014). Another issue is the functional overlap between information retrieval from long-term memory and from semantic memory (Ralph, 2014), suggesting that theta oscillations should also be found in semantic manipulations. Indeed, the few studies that investigated slow neural oscillations in language processing found theta power to be enhanced, for example, when semantic knowledge had been violated in a sentence context (Hagoort et al., 2004). Interestingly, theta enhancement has been found over temporal areas if words described auditory contents and over occipital areas if words described visual contents, in line with the idea of sensory-specific semantic memory retrieval (Bastiaansen et al., 2008).
These results suggest that theta oscillations could play an important role in both manipulations used in the current thesis, that is, when lexico-semantics are more or less predictable from context and when comparing words with meaningless pseudowords.
1.3 General Hypotheses
The previous literature review introduced initial ideas about the relationship between slow neural oscillations and spoken word recognition. First, oscillations might be important for the acoustic analysis of the incoming speech signal. Oscillations might chunk speech into smaller units by temporally aligning peaks of neural excitability with the most informative acoustic cues. Hence, effects in the slightly faster alpha frequency range might be observed if vowels had been manipulated, and effects in the slightly slower theta frequency range might be observed if lexical semantics had been manipulated. Second, oscillations might dynamically build neural assemblies by synchronizing in one slow frequency band depending on the task-relevant cognitive function. For example, enabled lexico-semantic processing might be reflected by reduced alpha power, whereas accessing long-term memory to semantically integrate words in a sentence might be reflected by effects in the theta frequency range.

Thus, the current thesis aims at determining slow oscillatory signatures of spoken word recognition. In particular, experimental work will tackle the role of oscillations for understanding how lexico-semantic access is achieved if the auditory signal is ambiguous or degraded. To this end, different methodological approaches are applied. Emphasis will be laid on, first, the functional dissociation of alpha and theta oscillations during spoken word recognition, and second, the differential signatures of oscillatory power and phase (or phase-locking) in spoken word recognition, especially in effortful listening situations. A third interest lies in the relationship between traditionally analyzed event-related potentials and the oscillatory patterns, in order to reconsider N400 interpretations.

Because the signatures of slow neural oscillations in spoken word recognition are unclear, Chapter 3 first addresses the question how alpha and theta oscillations contribute to lexical access.
This problem is approached by using the classical lexical decision task (Marslen-Wilson, 1980) comparing words and word-like pseudowords. We asked whether slow neural oscillations can dissociate lexical integration and ambiguity resolution during lexical access. In particular, we hypothesized to observe alpha power suppression reflecting enabled lexical integration for real words, and theta enhancement for pseudowords reflecting periodic reactivation of the word-like phonological patterns to resolve ambiguity. Oscillatory patterns will be related to commonly analyzed event-related potentials. Especially, the interpretation of the N400 as a marker of effortful lexico-semantic processing will be reassessed.

In the next steps, the auditory signal will be degraded by, first, adding white noise to the speech signal and, second, noise-vocoding the speech signal (thus reducing its spectral content) to increase confusability and task difficulty. The motivation is twofold: On the practical level, experimental results from degraded speech provide insights that can be transferred to special populations (depending on the type of noise, e.g., elderly people or cochlear implant patients). On the experimental level, degrading the speech signal allows the controlled lowering of word recognition accuracy, enabling within-participant correlations of brain and behaviour which otherwise would be impossible due to ceiling effects. Robust versus more vulnerable neural processes can thus be distinguished. As a note of caution, adding white noise to the speech signal might trigger additional processes or alter linguistic processes which might not have been induced in quiet listening conditions. This possibility is discussed as a preface to Chapter 4. The short excursion reviews oscillatory mechanisms to accomplish speech recognition in noise and develops the importance of selective inhibition to suppress irrelevant information (Driver, 2001).
The comparison of speech in quiet and in noise is also an interesting case to point out functional differences between induced power and phase-locked oscillations. Chapter 4 paves the way to the analyses of neural phase in Chapter 5. While isolating speech from noise backgrounds might be implemented on the one hand as selective inhibition of the task-irrelevant noise, it might on the other hand be implemented as enhancement of the task-relevant information, e.g. by allocating attention. In Chapter 5, we test the hypothesis that the selection of a speech stimulus is reflected by neural phase. This has been shown only for low-level perceptual objects, such that stimuli coinciding with the excitable neural phase are more likely to be perceived than when coinciding with the inhibitory neural phase (Lakatos et al., 2005; Henry and Obleser, 2012). Here, we ask whether neural phase effects are also crucial for higher cognitive functions such as spoken word recognition. To answer this question, the lexical decision task is repeated with stimuli embedded in white noise such that word recognition accuracy is reduced to 70 % correct. Neural phase is analyzed in the alpha and theta frequencies. Results will give first insights about the generalizability of rhythmic sensory selection (Schroeder and Lakatos, 2009) and the idea of chunking linguistic information (Giraud and Poeppel, 2012).

Besides inhibitory and sensory selection, Chapter 6 finally aims to clarify the benefits of semantic context as a facilitating top-down mechanism to improve word recognition. On the one hand, expectations will be gradually reduced by manipulating the cloze probability of sentence-final words (Taylor, 1953; Kalikow et al., 1977; Bloom and Fischler, 1980). On the other hand, the severity of speech degradation will be progressively enhanced in order to uncover the interaction of adverse listening conditions with semantic facilitation.
Traditional event-related potentials will be extended by analyses of slow oscillations. Again, N400 interpretations will be reassessed, leading to a functional dissociation of induced and phase-locked oscillations.

In sum, this thesis extends current knowledge about the neural temporal dynamics of spoken word recognition by analysing not only commonly applied event-related potentials but also slow neural oscillations. The results will have important implications for clinical populations such as aphasics and cochlear implant patients and will also enhance the knowledge about the neuropsychological mechanisms during spoken word recognition.

2 General Methods
2.1 The auditory lexical decision task
Research on lexical access has used a variety of different experimental paradigms, one of the most frequent being the auditory lexical decision task (first usage by Marslen-Wilson, 1980). In the auditory lexical decision task, participants are asked to judge as quickly as possible whether a just-heard sound was a known word or not (“Yes”/“No”; for a critique of the task’s reliability see Diependaele et al., 2012). The experimental paradigm is assumed to tap into lexico-semantic processing. Therefore, it is the method of choice to tackle the current research questions.

The auditory lexical decision task has well-known advantages and pitfalls (summarized in Goldinger, 1996). It provides the possibility to contrast processes of word acceptance, nonword rejection (for a modelling approach to distinguish the two see Dufau et al., 2012), and ambiguity resolution, as studied here in Chapter 3. Also, behavioural responses are gathered on every single trial, allowing data analysis with signal-detection methods (Macmillan and Creelman, 2005; see Chapter 5). One major disadvantage of the auditory lexical decision task is its limited ecological validity. In everyday life one never has to decide on the lexicality of speech; one rather naturally attempts to assign meaning to what has been heard. Therefore, conclusions about the natural process of lexical access should be treated with caution. As the next paragraph describes in detail, the current design accounts for this by using word-like pseudowords, allowing the “nonword” response to map partly onto the more ecologically valid concept of “mispronounced word”. Because reaction times in the lexical decision task have been found to be mainly explained by word frequency (Balota and Chumbley, 1984; Keuleers et al., 2012), word frequency was controlled in the current set of experiments in order to focus on effects of lexical semantics.
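The signal-detection analysis mentioned above can be made concrete with a short sketch. The counts below are invented for illustration; the snippet computes sensitivity d′ and criterion c from yes/no lexical decision counts, treating “yes” to a real word as a hit and “yes” to a pseudoword as a false alarm, with a log-linear correction to avoid infinite z-scores at hit or false-alarm rates of 0 or 1.

```python
from statistics import NormalDist

def dprime(hits, misses, false_alarms, correct_rejections):
    """Sensitivity d' and criterion c for a yes/no lexical decision task.

    A log-linear correction (add 0.5 to counts, 1 to totals) keeps the
    z-transform finite at extreme rates.
    """
    h = (hits + 0.5) / (hits + misses + 1)
    fa = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)
    z = NormalDist().inv_cdf          # inverse of the standard normal CDF
    return z(h) - z(fa), -(z(h) + z(fa)) / 2

# Hypothetical counts from 60 word and 60 pseudoword trials
d, c = dprime(hits=45, misses=15, false_alarms=10, correct_rejections=50)
```

With these counts, d′ comes out around 1.6 and the criterion slightly conservative (c > 0), i.e. a mild bias towards “no” responses.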
When interpreting lexical decision data, it needs to be considered that performing a decision task presumably affects perceptual processes. Two opposing modelling approaches of the relation between word recognition and the lexical decision task are available: One model assumes that lexical processing occurs first, so that in a second step decisions can be made depending on whether word recognition was successful or not (Ratcliff et al., 2004).
According to the other model, perceptual and decisional processes might be completely integrated (Norris, 2009). Interestingly, both approaches reach the same accuracy in predicting reaction times. Thus, the question about the actual relationship between lexical processing and decisional processes still needs to be answered by neurolinguistic endeavours. Implications of the current data concerning this matter will be discussed in Chapter 3 and in Chapter 5.
Stimulus material: words and pseudowords. Stimuli used in Chapters 3 to 5 were adapted from a previous study by Raettig and Kotz (2008) and refined as described below to match requirements for EEG studies and to fit the purpose of the current research questions. From 60 tri-syllabic, concrete German nouns (e.g., /banane/, engl. [banana]; condition labeled as ‘real’), two types of pseudowords were derived.

First, ‘ambiguous’ pseudowords were derived by manipulating the core vowel of the second syllable (e.g., /banene/). Therefore, vowels of the real word conditions were exchanged amongst each other as far as possible, simultaneously considering i) equal exchange probability (i.e., replace /a/ as often with /e/ as with any other vowel), and ii) that the cohort is empty at the onset of the manipulated vowel (see Section 1.1.1). By keeping the third syllable intact, the original word remains the only neighbouring real word (e.g., /banane/ is the only real-word neighbour of /banene/).

Second, ‘opaque’ pseudowords were derived by scrambling the syllables across words while keeping the position-in-word fixed (e.g., /ba·poss·ner/ consists of the first syllable of /ba·na·ne/, the second syllable of /a·pos·tel/, engl. [apostle], and the third syllable of /rab·bi·ner/, engl. [rabbi]). This way, the overall stress patterns and vowel qualities could be retained (except for 7 items) so that, for example, reduced vowels in third syllables would not be changed.

Furthermore, 60 abstract tri-syllabic real words (e.g., /botanik/) were used as fillers to ensure a balanced word–pseudoword ratio. The complete set of words and pseudowords can be found in Appendix 7.6.
Psycholinguistic considerations. In contrast to Raettig and Kotz (2008), ambiguous pseudowords were manipulated only on the second (not the third) syllable, both to ensure the precise timing necessary in EEG and to experimentally dissect the seemingly sequential process of spoken word recognition, according to the following rationale: Auditory word recognition depends heavily on the beginning of the word, that is, the initial syllable (Taft and Forster, 1976; Marslen-Wilson and Zwitserlood, 1989). As studies using word fragment (i.e., first syllable) priming showed (e.g., Marslen-Wilson and Zwitserlood, 1989; Friedrich et al., 2009; Scharinger and Felder, 2011), a cohort becomes pre-activated and lexical candidates are isolated. By choosing tri-syllabic German nouns, we were able to use the first syllable to build up initial lexical context, which is identical in all conditions (e.g., /ba·/ would amongst others pre-activate /ba·nane/, engl. [banana]).
At the second syllable, we then introduced some variation by either following a potentially pre-activated real-word trace (e.g., /ba·na·/), by exchanging the core vowel (e.g., /ba·ne·/), or by replacing the entire second syllable with a random one (e.g., /ba·poss·/). That way, we perturbed further cohort activation (Taft and Hambly, 1986), i.e. the linear accumulation of lexical evidence towards word identification, in two degrees of severity. The third and final syllable, however, either completed a clear pseudoword by continuing the wordness violation (e.g., /ba·poss·ner/) or created an ambiguous case by continuing the initially expected word despite the local manipulation on the second syllable (e.g., /ba·ne·ne/). Thus, word identification should be perturbed but remain possible. By presenting an ending commensurate with the cohort prediction pre-activated at the first syllable, we hypothesized to observe two valid neural strategies of the listener (see Chapter 3). One strategy would be to ignore the local prediction error prompted by the second syllable and to emphasize the global congruence (first and third syllable, as well as suprasegmental features such as prosody) with the overall most likely lexical candidate, comparable to the perception of a slight mispronunciation. The second strategy would be to resolve the ambiguity prompted by the lexical decision task in order to accomplish the task accurately.

In Chapters 4 and 5, we explore lexical decisions in noise. Compensatory processes should change because the manipulated vowel is more easily confused with the original vowel, so that word recognition accuracy depends on successfully increasing the signal-to-noise ratio by allocating attention. Also, performance in noise might particularly depend on lexical stress patterns, as stress has been found to be preferentially used for segmenting speech in noise (Mattys, 2004).
Comparison of vowels. In Chapter 5, real words and ambiguous pseudowords are compared. These conditions differ by the core vowel of the second syllable only. Vowel identification depends primarily on formant information. Therefore, a post-hoc comparison between formants of ‘real’ and ‘ambiguous’ vowels is conducted, as summarized in Figure 2.1, in order to reveal the occurrence of any systematic vowel shifts. The Euclidean distance between vowels over three formants is calculated as follows:
$$\Delta = \sqrt{\sum_{f=1}^{3} \left(\mathrm{real}_f - \mathrm{am}_f\right)^2}$$

where real_f denotes the formants of the real-word vowel and am_f those of the ambiguous counterpart. Formant distances varied between 58.8 Hz and 1995.1 Hz, but two thirds of the vowel-pair distances were below 1000 Hz (Fig. 2.1B). Greater distance leads to less confusability (Felty, 2007). Systematic vowel shifts were assessed by using a bootstrapping procedure because formants and formant distances were not normally distributed (see Fig. 2.1B). First and second formants are the most informative dimensions for vowel identification (Ladefoged,
Figure 2.1: Features of the word and pseudoword corpus. A. Vowel space of the second-syllable vowels. Black dots mark real-word formants and red dots the ambiguous-pseudoword formants. The grey arrow depicts an example shift from /elefant/ to /elufant/. B. Histogram of Euclidean distances of vowels. 2/3 of all distances, e.g. /e/–/u/, are below 1 kHz. C. Bootstrapping of formant distances between real and ambiguous vowels and their correlation. Ideally, the difference between conditions should be zero, which is marked by the thick black line.
2005), which is why this analysis focuses on those two. Sixty differences between real and ambiguous vowel formants were randomly drawn 10,000 times with replacement to generate the distributions in Figure 2.1C. Positive values indicate higher formants for real words and negative values higher formants for the ambiguous counterpart. The correlation between formant distances was calculated 10,000 times as well. Unfortunately, some systematicity in the vowel shifts was disclosed. There is a tendency towards a downward shift of the first formant from real to ambiguous stimuli (Fig. 2.1C, left panel; p < 0.07). No significant shift of the second formant between conditions occurred (Fig. 2.1C, middle panel; p < 0.26). Differences of first and second formants are negatively correlated (Fig. 2.1C, right panel; ρ = −0.19, p < 0.05), indicating that the larger the mean first formant distance, the smaller the second formant distance. This result was driven by the /u/-sounds in the ambiguous condition, which were less variable and clustered at lower first-formant frequencies. Although vowel shifts cannot explain the results reported in the following experimental chapters, future studies might want to control for these in order to reduce performance variability introduced by varying trial-to-trial difficulty in discriminating words and pseudowords.
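The distance computation and the bootstrapping procedure can be sketched as follows. The formant values below are simulated placeholders (the thesis used the measured F1–F3 of the 60 vowel pairs); only the Euclidean distance and the resampling logic mirror the analysis described above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Placeholder formant values in Hz (F1-F3) for 60 real/ambiguous vowel pairs.
real = rng.normal([500, 1500, 2500], [80, 200, 250], size=(60, 3))
ambiguous = real + rng.normal([-30, 0, 0], [60, 150, 200], size=(60, 3))

# Euclidean distance over the three formants (the equation above)
dist = np.sqrt(((real - ambiguous) ** 2).sum(axis=1))

# Bootstrap the mean F1 and F2 shifts: 10,000 resamples with replacement
n_boot = 10_000
idx = rng.integers(0, 60, size=(n_boot, 60))
diff = real - ambiguous               # positive = higher formant in real word
boot_f1 = diff[idx, 0].mean(axis=1)   # bootstrap distribution of mean F1 shift
boot_f2 = diff[idx, 1].mean(axis=1)

# Two-sided bootstrap p-value: fraction of resampled means crossing zero
p_f1 = 2 * min((boot_f1 <= 0).mean(), (boot_f1 >= 0).mean())
```

The same resampled indices can be reused to bootstrap the F1–F2 correlation, as done for the right panel of Figure 2.1C.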
2.2 Adaptive tracking procedures
Word-pseudoword confusion, and vowel confusion in particular, can be enhanced by presenting speech in noise (Felty, 2007). The increase in perceptual uncertainty allows us to study compensatory neural mechanisms of effortful listening on the one hand and neural signatures of successful versus failing lexical access on the other hand. In order to account for the large inter-subject variability in hearing, psychoacoustic measures are indispensable. Adaptive procedures are handy because an individual signal-to-noise ratio (SNR) can be estimated without the time-consuming data collection needed to model a listener’s entire psychometric function.

Psychometric functions describe the relationship between performance increase and stimulus level increase as a sigmoid function (Macmillan and Creelman, 2005; see Fig. 2.2A). The point of subjective equality corresponds to the stimulus level at which both response options (e.g., in lexical decisions “Yes”- and “No”-responses) are equally probable and thus accuracy is about 50 % correct. In the current thesis, individual thresholds to discriminate word and pseudoword vowels were of interest, at which participants performed about 70.7 % correct. This allows the data to be analysed with signal detection methods because a sufficient number of incorrect trials will be gathered. At the same time, the subjective experience of listening frustration is kept limited. Most importantly, though, compensatory neural mechanisms are best observed at an intermediate level of difficulty, that is, at performance levels above chance (above 50 % accuracy). An accuracy of 70.7 % is yielded by a “two-down-one-up” staircase procedure (Levitt, 1971). The procedure is adaptive such that the correctness of the response in one trial determines the SNR of the subsequent one. Here, 70.7 % (but not 50 %) accuracy was targeted, which necessitates the following algorithm: If the responses of two subsequent trials were correct, the SNR of the next
trial decreases so that intelligibility gets worse. As soon as one response is incorrect, the SNR increases so that the next trial is easier. Figure 2.2B illustrates three adaptive tracking procedures to estimate the threshold for one participant from the dataset reported in Chapter 5. According to Levitt (1971), three parameters are of interest to successfully use adaptive procedures: the initial SNR of the first trial, the step size between two trials, and the number of trials selected to calculate the final empirical threshold estimate. First, the initial SNR needs to be set without prior knowledge. If the initial SNR is too far from or too close to the assumed empirical threshold, the adaptive procedure becomes inefficient. Second, greater step sizes between two trials are efficient in the beginning of the adaptive procedure to rapidly advance towards the empirical threshold. In the vicinity of the threshold, smaller step sizes allow a more precise sampling. Third, in order to estimate a reliable threshold, only trials at later stages of the adaptive procedure should be considered, i.e. after several reversals, which closely fluctuate around the empirical threshold. Reversals are defined as the turning points whenever correct responses change to incorrect responses (or vice versa).

Figure 2.2: A. Psychometric function. Dashed lines exemplify how the one-down-one-up staircase procedure would sample the stimulus level to reach 50 % accuracy and how the two-down-one-up staircase procedure would sample the threshold for 70.7 % correct performance. B. Example of the two-down-one-up staircase procedure. Three threshold estimates of one subject (from Chapter 5) are shown. Boxes frame the last 8 reversals, respectively, which were averaged to determine the SNR threshold.
Here, the duration of the adaptive tracking procedure was set to 12 reversals, but empirical thresholds were determined by averaging across the last eight reversals only, thus discarding the first four.

To match the adaptive tracking procedure with the auditory lexical decision task as closely as possible and thereby yield transferable thresholds, participants performed a discrimination task (instead of a frequently used detection task) during the tracking. To this end, the second syllables were extracted from the real words and their ambiguous-pseudoword counterparts, including the critical vowel manipulation. Syllable pairs were presented successively (the second syllable starts 500 ms after the onset of the first syllable) in a stream of white noise. Syllable pairs consisted either of the real-word syllable twice or of the real-word syllable followed by its ambiguous counterpart. Participants had to indicate whether the second vowel in each pair was the “same” as or “different” from the first one. The sound pressure level (SPL) of the white noise was fixed and the SPL of the syllables adapted across trials. Because of the variable formant distances (see Fig. 2.1B), some single trials might already be more difficult at relatively high SNR than others. Therefore, threshold estimation was repeated three times (as depicted in Fig. 2.2B). These three thresholds were averaged to set the final SNR for the subsequent auditory lexical decision task.
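A minimal sketch of the two-down-one-up rule, run against a simulated listener, may clarify the algorithm. The function names, step sizes, and the logistic observer are illustrative choices, not the exact parameters used in the experiments.

```python
import math
import random

def two_down_one_up(initial_snr, respond, n_reversals=12,
                    step_big=2.0, step_small=0.5):
    """Levitt (1971) staircase converging on the 70.7 % correct point.

    `respond(snr)` returns True for a correct trial. Two consecutive correct
    responses lower the SNR (harder); one incorrect response raises it (easier).
    """
    snr, streak, direction = initial_snr, 0, 0
    reversals = []
    while len(reversals) < n_reversals:
        # large steps early on, smaller steps near the threshold
        step = step_big if len(reversals) < 4 else step_small
        if respond(snr):
            streak += 1
            if streak == 2:            # two down: make the next trial harder
                streak = 0
                if direction == +1:    # direction change: log a reversal
                    reversals.append(snr)
                direction = -1
                snr -= step
        else:                          # one up: make the next trial easier
            streak = 0
            if direction == -1:
                reversals.append(snr)
            direction = +1
            snr += step
    # average the last eight reversals, discarding the first four
    return sum(reversals[-8:]) / 8

def simulated_listener(snr, threshold=-6.0, slope=1.0):
    """Logistic observer ranging from 50 % (guessing) to 100 % correct."""
    p = 0.5 + 0.5 / (1 + math.exp(-slope * (snr - threshold)))
    return random.random() < p

random.seed(1)
estimate = two_down_one_up(initial_snr=0.0, respond=simulated_listener)
```

For this observer, the 70.7 % point lies slightly below −6 dB SNR, and the averaged reversals should settle in its vicinity.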
2.3 Electroencephalography
In the current thesis, the electroencephalogram (EEG) recorded from the human scalp was used in all experiments to study auditory word recognition. The excellent temporal resolution of EEG in the range of milliseconds (Speckmann and Elger, 2005) is the decisive advantage over functional magnetic resonance imaging (fMRI), where the temporal resolution is several seconds. Compared to the magnetoencephalogram (MEG), in turn, EEG is easier for clinical application in terms of execution and costs. In the following section, the neurophysiological basis of EEG is described, followed by the basics of EEG data preprocessing and analysis. Beginning with the traditional approach of studying event-related potentials (ERPs) during lexico-semantic processes, the computationally more advanced procedure of time–frequency power and phase analysis is outlined and source localization techniques are introduced.
2.3.1 The neurophysiological basis of EEG
Recording the EEG from the human scalp quite directly measures neural activity. Voltage fluctuations on the scalp are thought to be a consequence of post-synaptic potentials of cortical pyramidal neurons. Complementary to MEG, EEG captures mostly radially oriented cortical neuron populations but less so the tangentially oriented ones (Lutzenberger et al., 1985). Excitatory post-synaptic potentials at the apical dendrites, for example, generate electronegativity such that a current flows from the nonexcited and electropositive cell soma to the dendrites (Pizzagalli, 2007). Synchronized firing of neuronal populations can reflect corticocortical or thalamocortical information exchange (discussed below).

The greatest limitation of EEG is its spatial resolution. First, because of volume conduction between electrical sources and scalp electrodes, the neuronal signal is captured by neighbouring channels. Second, electrodes measure the sum of huge cortical cell assemblies, not only in terms of spread but also in terms of depth, i.e. thalamic sources. However, time–frequency analysis of EEG data allows, to some extent (and arguably in a more sophisticated manner than ERPs), inference by analogy with results from intracranial recordings, for example, from electrocorticography in humans or single-cell recordings in animal studies.
2.3.2 Preprocessing and artefact rejection
Because of its high sensitivity to electric activity, EEG is prone to artefacts arising from line voltage as well as any muscular activity, most prominently eye movements or heart beat (for an overview, see Blume et al., 2002). Several automatized techniques are at hand for artefact detection and removal. Commonly, frequency-based filters are applied first, such as high-pass filters to even out slow frequency drifts or notch filters to eliminate the line-voltage frequency. Also, artefacts can be identified by their characteristic topographical distributions, such as the bipolar frontal activity indicating eye movements (Debener et al., 2010). Independent component analysis (ICA; Jung et al., 2000), for example, is a technique of blind source separation which relies on spatio-temporal characteristics of the signal, and is thus useful to extract (and possibly reject) statistically independent source signals from the mixture present in raw EEG recordings. After the identification of independent components and their selective rejection, the remaining components are backprojected to regain the now artefact-free signal mixture at the electrodes. The greatest advantage of this procedure is the recovery of trials that otherwise would have had to be rejected completely. This is highly relevant for valuable patient data, which are often noisier than data from young healthy adults, but it also benefits the present experiments, because analyses of neural phase depend heavily on the number of trials; in particular, the bifurcation index used here becomes more reliable with more trials, as will be shown in Section 5.6.
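The reject-and-backproject step can be illustrated with a toy two-channel example. For brevity, the unmixing matrix is taken as the known inverse of a made-up mixing matrix rather than estimated by Infomax or FastICA; only the component rejection and backprojection steps mirror the procedure described above.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 2, 500)

# Two "sources": neural alpha activity and an eye-blink artefact
sources = np.stack([np.sin(2 * np.pi * 10 * t),                # 10 Hz alpha
                    np.exp(-((t - 1.0) ** 2) / 0.005) * 5.0])  # blink transient
A = np.array([[1.0, 0.8],        # mixing matrix: each channel sees both sources
              [0.9, 0.2]])
eeg = A @ sources                # 2 channels x 500 samples

W = np.linalg.inv(A)             # ideal unmixing matrix (ICA estimates this)
components = W @ eeg             # "independent components"

components[1, :] = 0             # reject the blink component ...
cleaned = np.linalg.inv(W) @ components   # ... and backproject to channel space
```

After backprojection, both channels contain only the alpha source (scaled by its column of the mixing matrix), while all trials are retained.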
2.3.3 Event-related potentials: advantages and limitations
Studying linguistic processes in the brain has traditionally focussed on event-related potentials (ERPs), defined as an average over multiple trials time-locked to stimulus onset (Picton et al., 2000; Luck, 2005). Averaging has been thought to diminish the “noise”, i.e. ongoing oscillations, in the EEG signal. Indeed, averaging enhances phase-locked responses and therefore mainly reflects synchronized post-synaptic potentials (see Fig. 2.3, left). However, as elaborated below, additional information can be obtained by means of time–frequency analysis distinguishing amplitude and phase in different frequency bands.
2.3.4 Time–frequency analysis: evoked versus induced activity
Time–frequency analysis decomposes the EEG signal into different frequency bands, allowing the frequency-specific investigation of amplitude effects detached (to a certain extent) from phase influences (Makeig et al., 2004). As exemplified in Figure 2.3 (right column), single trials are Fourier transformed and averaged per frequency, yielding a time–frequency representation that contains not only evoked but also induced oscillations (Tallon-Baudry and Bertrand, 1999). In Fig. 2.3, the red blob in the left column, i.e. the evoked activity, is also represented in the right column amongst other (induced) activity. Technically, ERPs can be understood as a mixture of evoked, induced, and instantaneous oscillations, although the definite relationship between ERPs and oscillations is still a matter of debate (Mazaheri and Jensen, 2006; Min et al., 2007; Hanslmayr et al., 2007; Klimesch et al., 2007). Evoked activity, that is, what is emphasized by ERPs, is phase-locked to the stimulus onset and consistent across stimulus repetitions. Induced activity, in contrast, is not strictly phase- but time-locked and thus correlates with the experimental condition. Instantaneous (sometimes also called spontaneous) activity is uncorrelated with the stimulation (Herrmann et al., 2005).
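The distinction between evoked and induced activity can be demonstrated numerically: averaging across trials preserves a phase-locked component but cancels a component whose phase varies from trial to trial, whereas averaging single-trial power keeps the energy of both. The frequencies, trial counts, and noise level below are arbitrary illustration values.

```python
import numpy as np

rng = np.random.default_rng(3)
t = np.linspace(-0.2, 0.8, 1000)
n_trials = 100

# Gaussian window around 300 ms post-stimulus
window = np.exp(-((t - 0.3) ** 2) / 0.01)

trials = np.empty((n_trials, t.size))
for i in range(n_trials):
    # Evoked: 5 Hz burst with the SAME phase on every trial (phase-locked)
    evoked = np.sin(2 * np.pi * 5 * t) * window
    # Induced: 10 Hz burst with a RANDOM phase per trial (time-, not phase-locked)
    induced = np.sin(2 * np.pi * 10 * t + rng.uniform(0, 2 * np.pi)) * window
    trials[i] = evoked + induced + rng.normal(0, 0.5, t.size)

erp = trials.mean(axis=0)        # averaging keeps evoked, cancels induced
power = (trials ** 2).mean(axis=0)  # single-trial power keeps both components
```

The resulting `erp` correlates highly with the evoked template alone, while the induced 10 Hz burst survives only in the single-trial power average.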
Fourier transformation by using wavelets. Different methods to achieve the frequency decomposition of EEG signals are available. In the current thesis, Morlet wavelets have been used for the Fourier transformation (Tallon-Baudry et al., 1997). They are complex functions consisting of Gaussian-shaped sinusoidal oscillations (an example is schematically depicted between the column titles in Fig. 2.3). The real part of the wavelet function represents a sinusoidal oscillation within a certain frequency band, and the imaginary part yields, like a Hilbert transform, a 90° phase-shifted signal (Herrmann et al., 2005). The wavelet function slides across the EEG signal and is convolved with it at each time point (± the window width). This way, sinusoidal EEG activity is detected. The number of sinusoidal cycles in the wavelet function should be frequency-specific, since one cycle in lower frequencies covers a longer time window than in higher frequencies. If the number of cycles is kept constant across frequencies, time resolution will be worse in lower than in higher frequencies. At the same time, including more time points in lower frequencies leads to higher frequency resolution than in higher frequencies. To account for this trade-off, fewer cycles should be used for the transformation of lower than of higher frequency components.

Figure 2.3: Schematic extraction of event-related potentials (ERPs) and time–frequency representations. Left column: single trials recorded in EEG after preprocessing. Their average under the thick line represents the ERP. Below the ERP, its representation in time–frequency space shows high synchronization in lower frequency bands right after stimulus onset. Right column: Fourier transform of every single trial. Fourier transformation is achieved by using Morlet wavelets; an example wavelet is depicted between the column titles. If averaging is done over each frequency separately, a mixture of evoked and induced oscillations is obtained. See text for details.
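The wavelet convolution described above can be sketched in a few lines of numpy. This is a minimal illustration, not the Fieldtrip/Matlab implementation used in the thesis; the function name and the particular cycle counts are chosen for demonstration only.

```python
import numpy as np

def morlet_tfr(signal, sfreq, freqs, n_cycles):
    """Complex time-frequency transform of one EEG trial via Morlet wavelet
    convolution (illustrative sketch, not the Fieldtrip implementation)."""
    n = len(signal)
    tfr = np.empty((len(freqs), n), dtype=complex)
    for i, (f, nc) in enumerate(zip(freqs, n_cycles)):
        sigma_t = nc / (2.0 * np.pi * f)                 # temporal width of the Gaussian envelope
        t = np.arange(-3 * sigma_t, 3 * sigma_t, 1.0 / sfreq)
        # complex sinusoid (real + 90 deg shifted imaginary part) under a Gaussian
        wavelet = np.exp(2j * np.pi * f * t) * np.exp(-t**2 / (2 * sigma_t**2))
        wavelet /= np.sqrt(np.sum(np.abs(wavelet)**2))   # unit-energy normalisation
        tfr[i] = np.convolve(signal, wavelet, mode="same")
    return tfr

# frequency-specific cycle counts: fewer cycles at low frequencies
# preserve time resolution, more cycles at high frequencies sharpen frequency resolution
freqs = np.array([3.0, 10.0, 30.0])
n_cycles = np.array([2, 5, 12])
```

Power is then obtained as `np.abs(tfr)**2` and phase as `np.angle(tfr)`, in line with the definitions in the next section.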
The resulting complex Fourier transform F(f,t) at frequency f and time t has the form F(f,t) = x + yi, where x represents the real part and y the imaginary part (see Fig. 2.4A). Magnitude (or amplitude in EEG data), also referred to as the complex modulus or absolute value of the complex number, is thus defined as |F(f,t)| = √(x² + y²) according to the rules of Pythagoras. Power, in turn, which is more often analyzed in the neural oscillation literature (see also Chapters 3 and 4, and Section 6.5), is defined as the squared magnitude: Power(f,t) = |F(f,t)|². The phase angle φ is implicitly given by the complex number and can be derived by calculating φ = arctan(y/x). Finally, inter-trial phase coherence (ITPC; used in Chapter 5 and Section 6.5) is defined as
ITPC(f,t) = | (1/N) · Σ_{n=1}^{N} F_n(f,t) / |F_n(f,t)| |

(Lachaux et al., 1999, 2002), where N is the total number of trials and F_n(f,t) is the Fourier transform of the nth trial. Essentially, in the formula, the Fourier data are normalized to unit length via division by their magnitude. The sum of the normalized Fourier data is then divided by the number of trials. The absolute value of this complex mean corresponds to the resultant vector length across trials (Berens, 2009). Although mathematically clearly distinct measures, power and phase may not be independent in EEG data. In empirical data, power may be seen as the envelope of the EEG signal in a specific frequency band and phase as the fast fluctuation underneath this envelope. For instance, a sinusoidal oscillation as depicted in Figure 2.4B and C shows, as the phase progresses from 0° to 90° (or 0 to π/2 rad), a simultaneous change in magnitude (or amplitude, for that matter), as indicated by the arrows. Although the absolute magnitude is independent of phase, high magnitude across trials and high inter-trial phase-locking often accompany each other. It has thus been argued that phase effects might just be a more sensitive measure of stimulus-locked activity and that power and phase should not be seen as complementary measures (Ding and Simon, 2013).
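The three quantities just defined translate directly into code. The following numpy sketch assumes a trials × frequencies × time-points array of complex Fourier coefficients; the array layout and function name are illustrative assumptions.

```python
import numpy as np

def power_phase_itpc(F):
    """Power, phase, and inter-trial phase coherence from complex Fourier data.
    F: complex array of shape (n_trials, n_freqs, n_times) -- assumed layout."""
    power = np.abs(F) ** 2               # squared magnitude per trial
    phase = np.angle(F)                  # phase angle phi = arctan(y / x), in radians
    unit = F / np.abs(F)                 # normalise each trial to unit length
    itpc = np.abs(unit.mean(axis=0))     # resultant vector length across trials
    return power, phase, itpc
```

With identical phases across trials the resultant vector length is 1; with phases cancelling out it approaches 0, matching the interpretation of ITPC as phase consistency.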
2.3.5 Source localization
When aiming at estimating underlying sources from M/EEG data, the main problem is that there are infinitely many source solutions for a given EEG scalp topography, which is referred to as the inverse problem (Helmholtz, 1853). In order to localize sources as in Chapter 3, a forward solution and an inverse solution need to be computed. The forward model calculates for each source grid point the resulting scalp topography, taking volume conduction into account. For this, first, a source model is needed, which could be a standard template MRI if, as in the case of the current thesis, no individual MRI is available. Second, a head model extracted from the anatomical MRI scan is needed, that is, a realistically shaped three-layer boundary element model (BEM) of the head (Oostenveld et al., 2003) containing information about the skin, skull, and brain surfaces. MRI and individual EEG electrode locations need to be co-registered. Then, the so-called lead field can be calculated for each source grid point (in the current thesis with a 1 cm resolution). The lead field contains the forward solution for each source grid point, that is, information about how each source grid point projects onto the surface, i.e. onto each electrode. After calculating the unique solution of the forward model, the inverse model is estimated. Here, a beamforming technique using DICS (Dynamic Imaging of Coherent Sources; Gross et al., 2001) was applied, which estimates sources in the frequency domain, in contrast to other beamforming approaches that estimate sources in the time domain, such as the Linearly Constrained Minimum Variance beamformer (LCMV; Van Veen et al., 1997). First, the EEG signal at each electrode is Fourier transformed using multitapers based on discrete prolate spheroidal sequences (DPSS, also Slepian sequences; Slepian, 1978; Mitra and Pesaran, 1999; Jarvis and Mitra, 2001). Multiple tapers improve the spectral precision of power estimates. Second, the cross-spectral density matrix (CSD) is calculated by the cross-correlation of two complex signals (Welch, 1967). In DICS, the two submitted signals are all possible electrode combinations (Gross et al., 2001). Because time information is lost in the CSD, only time windows and frequencies of interest are considered.

Figure 2.4: A. Complex numbers. B. Unit circle. C. Cosine function.
Again, there is a trade-off: longer time windows allow better frequency resolution, whereas wider frequency ranges improve time resolution. For example, when aiming at beamforming 10 Hz alpha power, a 700 ms time window subsumes 7 alpha cycles and allows a frequency resolution of 1/0.7 s ≈ 1.4 Hz. The more time points and frequency bins are available, the better the final source estimation will be. The last step uses the forward model, i.e. the individual lead fields, and the CSD to compute an adaptive spatial filter for each source grid point. If a common cross-spectral density matrix for baseline and conditions (in a time and frequency window of interest) has been calculated, single trials of each condition can be projected into source space by using this so-called common filter. Thus, single-trial power and subsequent statistical contrasts can be estimated.
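The window arithmetic above can be made concrete with a small helper. Note that the taper count here uses a floor(2·T·W) rule of thumb, which reproduces the counts reported later in this thesis (two tapers for 10 ± 2 Hz, three for 4.5 ± 2.5 Hz) but differs by one from the classic 2TW − 1 bound; exact conventions vary across toolboxes, so treat this as an illustrative sketch.

```python
import math

def dics_window(freq_hz, window_s, smoothing_hz):
    """Cycles covered, Rayleigh frequency resolution, and an approximate DPSS
    taper count for a DICS time-frequency window (rule-of-thumb sketch)."""
    n_cycles = freq_hz * window_s            # oscillatory cycles inside the window
    resolution_hz = 1.0 / window_s           # basic (Rayleigh) frequency resolution
    # floor(2 * T * W) tapers for +/- W Hz of spectral smoothing;
    # implementations differ by one taper, so this is only indicative
    n_tapers = max(1, math.floor(2.0 * window_s * smoothing_hz))
    return n_cycles, resolution_hz, n_tapers
```

For the alpha example in the text, `dics_window(10.0, 0.7, 2.0)` yields 7 cycles and a resolution of about 1.4 Hz.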
3 Alpha and theta power dissociate in spoken word recognition²
3.1 Introduction
Accumulating evidence shows that speech comprehension is more completely described by looking not only at evoked but also at induced components of the electrophysiological brain response (Giraud and Poeppel, 2012). Besides research concerning phase (for review see Peelle and Davis, 2012), power changes of transient slow oscillations have also been found to determine language processes (Hald et al., 2006; Bastiaansen et al., 2008; Obleser et al., 2012; Meyer et al., 2013). A functional differentiation between alpha (~10 Hz) and theta (~4 Hz) oscillations, even though previously put forward (Klimesch, 1999; Roux and Uhlhaas, 2014; for the current debate in audition see Weisz et al., 2011), remains to be shown for speech processing (e.g. an open issue in Obleser et al., 2012; Tavabi et al., 2011). Generally, alpha oscillations are the predominant rhythm in ongoing neuronal communication and therefore observable in diverse cognitive functions such as auditory processing (sometimes labeled "tau"; Lehtelä et al., 1997; Tavabi et al., 2011; Hartmann et al., 2012), attention (Klimesch, 2012), working memory (e.g., Meyer et al., 2013; Obleser and Weisz, 2012; Wilsch et al., 2014), or decision making (Cohen et al., 2009). A tentative theoretical account of the role of alpha oscillatory activity has been put forward only recently (Jensen and Mazaheri, 2010; Klimesch et al., 2007; Klimesch, 2012): functional inhibition. In fact, most of the above-cited data are compatible with increased needs for inhibition of concurrent, task-irrelevant, or task-detrimental neural activity. Direct evidence for alpha-mediated inhibition of local neural activity, as expressed in spiking (Haegens et al., 2011) or gamma-band activity (Roux et al., 2013; Spaak et al., 2012), has been provided.
First evidence has shown that greater post-stimulus alpha suppression is associated with more effective language processing: alpha oscillations in response to single words were found to be suppressed as a function of the intelligibility of acoustically degraded words (Obleser et al., 2012). This is in line with the inhibition account, meaning that alpha power remains high when the language processing network is inhibited, the mechanism crucial for the present study.
²This chapter is adapted from the published article by Strauß, Kotz, Scharinger, and Obleser (2014). NeuroImage 97, 387–395.
In contrast to the functional inhibition across a range of general cognitive functions plausibly associated with alpha, theta oscillations in human EEG have been related more consistently to episodic memory (e.g., Hanslmayr et al., 2009), sequencing of memory content (e.g., Lisman and Jensen, 2013; Roux and Uhlhaas, 2014), and matching of new information with memory content (e.g., Klimesch, 1999). Moreover, neural periodic reactivation of information held in human short-term memory has been directly related to theta-timed oscillatory cycles (Fuentemilla et al., 2010). Such "replay" of sensory evidence in order to arrive at accurate lexical decisions might be decisive in the present design, especially when input is somewhat ambiguous, as outlined below. Interestingly, theta power enhancement has been observed in a series of language- or speech-specific effects. For example, semantic violations more than world knowledge violations drive theta enhancement during sentence processing (Hagoort et al., 2004; Hald et al., 2006); also, the retrieval of lexico-semantic information (Bastiaansen et al., 2008) and the increasing intelligibility of acoustically degraded words (Obleser et al., 2012) lead to theta enhancement. In the latter study, the alpha suppression reported above was directly proportional to the theta enhancement. These results tie theta enhancements in language paradigms to the neural re-analysis of difficult-to-interpret stimulus materials. In the present study, we want to dissociate neural oscillatory dynamics in the alpha and theta frequency bands in order to link them to segregable functions in spoken word recognition. As a control, however, we also extracted event-related potentials (ERPs), because the N400 component in particular has proven to be a robust index of 'wordness' (Chwilla et al., 1995; Desroches et al., 2009; Friedrich et al., 2009; Laszlo et al., 2012; for review see Friederici, 1997; Van Petten and Luka, 2012).
Larger N400 amplitudes, elicited by unexpected words (Kutas and Hillyard, 1980; Connolly and Phillips, 1994; Strauß et al., 2013), infrequent words (Rugg, 1990; Van Petten and Kutas, 1990; Dufour et al., 2013), or pseudowords (Friedrich et al., 2006), compared to highly probable or high-frequency real words, have mostly been associated with increased neural processing effort in matching the input signal to items in the mental lexicon. We aim at elucidating this matching process by investigating alpha and theta activity, framed in terms of inhibition and replay. We designed an auditory lexical decision task in which a word–pseudoword continuum would induce a stepwise reduction in lexical accessibility ('wordness'). Additionally, ambiguous stimuli would evoke a task-dependent conflict (task: "Is it a word?" (yes/no)) and call for re-evaluation of the auditory input. First, we hypothesize that a 'wordness' effect should be observable in the alpha band, with less alpha power when auditory input approximates real words held in the mental lexicon. This effect should be prominent in brain areas associated with lexical processes (e.g., left middle temporal gyrus; Kotz et al., 2002; Minicucci et al., 2013) and would characterize alpha as a signature of enabling lexical integration. Second, we hypothesize that the power of theta oscillations, with their ascribed functionality in memory and lexico-semantics, would vary with the need for resolving ambiguity.
Altogether, our focus on dissociable slow neural oscillations and their corresponding functional roles during spoken word recognition allows us to contribute to long-standing debates on whether recognition is best conceived as a serial, feed-forward mechanism (Norris et al., 2000) or as parallel, interacting processes (McClelland and Elman, 1986; Marslen-Wilson, 1987). Importantly, time–frequency analyses of ongoing EEG activity are ideally suited to extract potentially parallel cognitive processes.
3.2 Methods
3.2.1 Participants
Twenty participants (10 female, 10 male; 25.6 ± 2.0 years, M ± SD) took part in an auditory electroencephalography (EEG) experiment. All of them were native speakers of German, right-handed, with normal hearing abilities, and reported no history of neurological or language-related problems. They gave their informed consent and received financial compensation for their participation. All procedures were approved by the ethics committee of the University of Leipzig.
3.2.2 Stimuli
Adapted from Raettig and Kotz (2008), stimuli were 60 tri-syllabic, concrete German nouns (termed 'real', e.g., 'Banane' [banana]). For the 'ambiguous' condition, we exchanged the core vowel of the second syllable (e.g., 'Banene'). Finally, for the 'pseudoword' condition, we scrambled syllables across words (concrete and abstract, see below) while keeping their position in the word fixed (e.g., 'Bapossner'). Note that there was a fourth condition with 60 tri-syllabic, abstract German nouns, not relevant for the current analyses, which was necessary to maintain an equal ratio of words and pseudowords. These were considered fillers and not analyzed further. Previous studies used word-like stimuli in order to investigate lexicality effects on phoneme discrimination (Connine and Clifton, 1987; Frauenfelder et al., 1990; Wurm and Samuel, 1997). An important difference to these studies is that we created a distribution of formant distances between real-word vowels and their pseudoword equivalents. For illustration purposes, these differences can be quantified by calculating the Euclidean distance of the first three formants for each vowel pair (Obleser et al., 2003): distances ranged from 200 Hz (/ɛ/ → /ɪ/, Geselle → Gesille) to 2100 Hz (/oː/ → /iː/, Kommode → Kommide). The majority (approximately one third) of vowel pairs were 600 to 1000 Hz apart from each other (/ə/ → /ɔ/, Batterie → Battorie). Therefore, exchanging a vowel here means that stimuli were lexically but not phonetically ambiguous, which calls for ambiguity resolution processes on a decisional rather than a perceptual level (for discussion see Norris et al., 2000). However, we show with this acoustic analysis that lexical ambiguity necessarily corresponds to variance in acoustic input.
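The Euclidean formant distance used in this acoustic analysis is a one-liner; the sketch below illustrates the computation, with made-up formant values rather than the thesis' actual measurements.

```python
import numpy as np

def formant_distance(formants_a, formants_b):
    """Euclidean distance (in Hz) between two vowels, each given as a
    (F1, F2, F3) triple of formant frequencies. Illustrative sketch."""
    a = np.asarray(formants_a, dtype=float)
    b = np.asarray(formants_b, dtype=float)
    return float(np.linalg.norm(a - b))

# hypothetical formant triples (Hz), NOT the measured values from the stimuli
dist = formant_distance((500.0, 1500.0, 2500.0), (500.0, 1800.0, 2500.0))
```

A vowel pair differing by 300 Hz in F2 alone would thus yield a distance of 300 Hz, which sits below the 600–1000 Hz range reported for the majority of pairs.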
Importantly, we controlled for an equal ratio of stress patterns across conditions, because in unstressed syllables formant distance decreases, which raises perceptual confusion and task difficulty. The substitution of the vowel marked the deviation point from any existing German word but at the same time did not violate German phonotactic rules. The same holds true for clear pseudowords, even though deviation points were not as exactly timed as in the ambiguous condition and alternated between the first and second phoneme of the second syllable. Please note that ambiguous stimuli had only one real-word neighbour, whereas clear pseudowords might have evoked several real-word associations. All words and pseudowords were spoken by a trained female speaker and digitized at 44.1 kHz. Post-editing included down-sampling to 22.05 kHz, cutting at zero crossings closest to articulation on- and offsets, and RMS normalization. In sum, the experimental corpus consisted of 240 stimuli with a mean length of 754.2 ± 83.5 ms (M ± SD).
3.2.3 Experimental procedure
In an electrically shielded and sound-proof EEG cabin, participants were instructed to listen carefully to the words or word-like stimuli and to perform a lexical decision task. Figure 3.1A shows the detailed trial timing. After each stimulus, a delayed prompt indicated that a response should be given via button press ("Yes"/"No") to answer whether or not a German word had been heard. The response delay was introduced in order to gain longer trial periods free of exogenous components (due to the visual prompt) or artefacts (i.e., button presses), which are required for a clean time–frequency estimation and source localisation of oscillatory activity. The button assignment (left/right) was counterbalanced across participants such that 10 participants used their left and the other 10 their right index finger for the 'Yes' response. Accuracy scores (percentage correct) and reaction times were acquired. Subsequently, in order to better control for eye-related EEG activity, an eye symbol marked the time period during which participants could blink. The duration of the blink break and the onset of the next stimulus were jittered to avoid a contingent negative variation. Prior to the experiment there was a short familiarization phase. It consisted of 10 trials taken from Raettig and Kotz (2008), which had similar manipulations but were not used in the present experiment. Then each participant listened to all 240 stimuli. Listeners paused at their own discretion after blocks of 60 trials. The overall duration of the experimental procedure was about 30 min. Each participant received an individually pseudo-randomized stimulus sequence. Note that the order of occurrence for a given ambiguous pseudoword (e.g., 'Banene') and its real-word complement (e.g., 'Banane') was counterbalanced across participants in order to control for facilitated word recognition due to ordering effects.
As a constraint on pseudo-randomization, their sequential distance was kept maximal (i.e., ~120 other items in between).
Figure 3.1: Study design and behavioural measures. A. Stimulus design and schematic time course of one trial. Stimuli were tri-syllabic German nouns ('real'), 'ambiguous' pseudowords (one vowel exchanged), and clear 'pseudowords' (syllables scrambled across items). B. Mean percentage correct ± 1 SEM (between-subjects standard error of the mean). *** p < 0.001, ** p < 0.01, * p < 0.05. C. Mean reaction times relative to the prompt, ± 1 SEM. D. Grand average of ERPs over midline electrodes. Grey shaded bars indicate statistical differences.
3.2.4 Electroencephalogram acquisition
The electroencephalogram (EEG) was recorded from 64 Ag–AgCl electrodes positioned according to the extended 10–20 standard system on an elastic cap, with a ground electrode mounted on the sternum (Oostenveld and Praamstra, 2001). The electrooculogram (EOG) was acquired along a horizontal (left and right eye corner) and a vertical (above and below the left eye) line. All impedances were kept below 5 kΩ. Signals were referenced against the left mastoid and digitized online with a sampling rate of 500 Hz, with a pass-band of DC to 140 Hz. Individual electrode positions were determined after the EEG recording with the Polhemus FASTRAK electromagnetic motion tracker (Polhemus, Colchester, VT, USA) for more precise source reconstructions.
3.2.5 Data analysis: event-related potentials
Data pre-processing and analysis were done offline using the open-source Fieldtrip toolbox for Matlab, which is developed at the F.C. Donders Centre for Cognitive Neuroimaging in Nijmegen, the Netherlands (Oostenveld et al., 2011). Data were re-referenced to linked mastoids and band-pass filtered from 0.1 Hz to 100 Hz. To reject systematic artefacts, independent component analysis was applied and components were rejected according to the 'bad component' definition by Debener et al. (2010). Remaining artefacts were removed when the EOG channels exceeded ±60 µV for frequencies between 0.3 and 30 Hz, which led to whole-trial exclusion (3.6 ± 5.3 trials per participant). The resulting clean data were used for subsequent analyses. To extract event-related potentials (ERPs), epochs were low-pass filtered using a 6th-order Butterworth filter at 15 Hz, baseline-corrected (baseline −0.2 to 0 s), and then averaged over trials per condition. As in previous studies (Strauß et al., 2013; Obleser and Kotz, 2011), auditory evoked potentials were considered to be strongest over midline electrodes (FPz, AFz, Fz, FCz, Cz, CPz, Pz, POz, Oz), which were defined as a region of interest (ROI) for the ERP analysis, best capturing the dynamics of the N400 component. On the ERP amplitudes, we performed a time series analysis (49 consecutive steps of 50 ms width; windows overlap by 25 ms, thereby covering a time range from 0 to 1.25 s) using a repeated-measures ANOVA with the factor wordness (pseudo, ambiguous, real). We assessed p values with Greenhouse–Geisser-corrected degrees of freedom. If p values survived false discovery rate (FDR) correction for multiple comparisons (i.e., time windows), post-hoc t tests for pairwise comparisons were performed within these time windows.
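The FDR step across the 49 time windows can be illustrated with a minimal Benjamini–Hochberg procedure. This is a sketch for clarity, not the Matlab routine actually used in the analysis.

```python
import numpy as np

def fdr_bh(pvals, q=0.05):
    """Benjamini-Hochberg FDR: boolean mask of the p values that survive
    correction at level q (illustrative sketch)."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    # step-up thresholds q * k / m for the k-th smallest p value
    thresh = q * (np.arange(1, m + 1) / m)
    below = p[order] <= thresh
    passed = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.where(below)[0])       # largest rank meeting the criterion
        passed[order[:k + 1]] = True         # all smaller p values also pass
    return passed
```

Only time windows whose p values fall inside the returned mask would then be followed up with post-hoc pairwise t tests.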
3.2.6 Data analysis: time–frequency representations
In order to obtain time–frequency representations (TFRs), the clean data were re-referenced to the average reference. This is important for comparability with the source analysis, since the forward model needs a common average reference as well. For power estimates of non-phase-locked oscillations, Morlet wavelets were applied to single-trial data in 20-ms steps from −700 to 2100 ms with a frequency-specific window width (linearly increasing from 2 to 12 cycles for frequencies logarithmically spaced from 3 to 30 Hz). Single trials were subsequently baseline-corrected (against the mean of a −500 to 0 ms pre-stimulus window of all trials) and submitted to a multi-level or "random effects" statistics approach (for application to time–frequency data see e.g., Obleser and Weisz, 2012; Henry and Obleser, 2012). On the first or individual level, massed independent-samples regression coefficient t tests with condition as dependent variable and contrast weights as independent variable (chosen correspondingly to our effects of interest, see below) were calculated. Uncorrected regression t values and betas were obtained for all time–frequency bins. According to our hypotheses, our effects of interest were a 'wordness' effect, namely a linear trend [pseudo > ambiguous > real], but also a stimulus-specific or 'ambiguity' effect [ambiguous > (pseudo, real)]. On the second or group level, the betas were tested against zero in a one-tailed dependent-samples t test. A Monte-Carlo non-parametric permutation method (1000 randomisations) as implemented in the Fieldtrip toolbox estimated type-I-error-controlled cluster significance probabilities (α < 0.05). To evaluate the influence of baseline correction, we repeated the first- and second-level statistics on absolute power estimates (skipping the single-trial baseline correction).
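The second-level logic, testing subject-level betas against zero via Monte-Carlo sign flipping, can be sketched for a single time–frequency bin. The cluster-forming step across neighbouring bins (which Fieldtrip performs) is deliberately omitted here, so this is an illustrative simplification rather than the full analysis.

```python
import numpy as np

def signflip_pvalue(betas, n_perm=1000, seed=0):
    """One-tailed Monte-Carlo sign-flip test of subject-level betas against
    zero, for one time-frequency bin. The clustering across neighbouring bins
    used in the Fieldtrip cluster permutation test is omitted (sketch only)."""
    betas = np.asarray(betas, dtype=float)
    rng = np.random.default_rng(seed)
    observed = betas.mean()
    # random sign per subject and permutation: shape (n_perm, n_subjects)
    flips = rng.choice([-1.0, 1.0], size=(n_perm, betas.size))
    null = (flips * betas).mean(axis=1)      # null distribution of group means
    return float((null >= observed).mean())  # one-tailed Monte-Carlo p value
```

Under the null hypothesis the sign of each subject's beta is exchangeable, so flipping signs at random generates the reference distribution against which the observed group mean is compared.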
3.2.7 Source localisation of time–frequency e↵ects
Source localisation for the resulting clusters followed the Fieldtrip protocol on source reconstruction using beamformer techniques (e.g., Medendorp et al., 2007; Haegens et al., 2010; Obleser et al., 2012; Obleser and Weisz, 2012). In short, an adaptive spatial filter (DICS: Dynamic Imaging of Coherent Sources; Gross et al., 2001) based on the cross-spectral density matrix was built by estimating the single-trial fast Fourier transform of the time windows and smoothed frequencies of interest (TOI and FOI) using a set of Slepian tapers (Mitra and Pesaran, 1999). TOI and FOI were determined according to the cluster results in sensor space, but computational considerations were also taken into account (more time and frequency smoothing allows better spatial estimation): for theta, estimates were centred around 4.5 Hz (± 2.5 Hz smoothing) and covered a 700 ms time window from 500 to 1200 ms; thus, three theta cycles and three tapers were used. For alpha (10 Hz, ± 2 Hz smoothing), a 700 ms time window was defined centred around 1000 ms, which covers approximately seven alpha cycles and results in two tapers. For source localisation, the individual EEG electrode locations were first co-registered to the surface of a standard MRI template (by applying rigid-body transformations using the ft_electroderealign function). By co-registering to this template, a realistically shaped three-layer boundary element model (BEM) provided by the Fieldtrip toolbox (Oostenveld et al., 2003) based on the same template could be used. We were then able to calculate individualised forward models (i.e., lead fields) based on individual electrode positions and a standard head model for a grid with 1 cm resolution. Using the cross-spectral density matrices and the individual lead fields, a spatial filter was constructed for each grid point, and the spatial distribution of power was estimated for each condition in each subject.
A common filter was constructed from all baseline and post-trigger segments (i.e., based on the cross-spectral density matrices of the combined conditions). Subject- and condition-specific solutions were spatially normalized to MNI space, averaged across subjects, and then displayed on an MNI template (using SPM8). Figure 3.2 (column 4) shows the result of cluster-based statistical tests (essentially the same tests as used for the electrode-level data before) that yielded voxel clusters for the covariation of source power with the alpha and theta effect, respectively. This was mainly done for illustration purposes, and unlike the tests for channel–time–frequency clusters in sensor space, no strict cluster-level thresholding was applied. We plotted t values on a standard MR template, and the MNI coordinates mentioned in the figure caption refer to brain structures that showed local maxima of activation. In order to visualise the specificity of the neural networks for either alpha- or theta-range oscillations, we calculated an index using the t values of the wordness (t_α) and the ambiguity (t_θ) effect, dividing their difference by the sum of their absolute values:

i_αθ = (t_θ − t_α) / (|t_θ| + |t_α|)   (3.1)

The index was calculated only for those grid points which exceeded the critical value of t(19) = 1.7291 in the source-space solution. As such, only areas are highlighted which show either an alpha (blue) or a theta (red) effect. This resulted in a descriptive source map as shown in Figure 3.3. Values around zero indicate non-dominance of either network (i.e. green in the figure).
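Equation (3.1) together with its critical-value masking is straightforward to compute per grid point; the numpy sketch below assumes flat arrays of t values over grid points (array layout and function name are illustrative).

```python
import numpy as np

def alpha_theta_index(t_theta, t_alpha, t_crit=1.7291):
    """Index i = (t_theta - t_alpha) / (|t_theta| + |t_alpha|) per grid point,
    as in Eq. (3.1); grid points where neither effect exceeds the critical
    t value are masked with NaN. Illustrative sketch."""
    t_theta = np.asarray(t_theta, dtype=float)
    t_alpha = np.asarray(t_alpha, dtype=float)
    idx = (t_theta - t_alpha) / (np.abs(t_theta) + np.abs(t_alpha))
    keep = (t_theta > t_crit) | (t_alpha > t_crit)   # supra-threshold points only
    return np.where(keep, idx, np.nan)
```

Positive values mark theta-dominant grid points, negative values alpha-dominant ones, and values near zero indicate comparable sensitivity to both effects.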
3.3 Results
3.3.1 Highly accurate performance
Performance on the lexical decision task revealed high accuracy overall (> 95% in each condition; see Fig. 3.1B). Nevertheless, an ANOVA with the three-level factor wordness was significant (F(2,38) = 28.54, p < 0.001), with the lowest accuracy for ambiguous pseudowords (ambiguous vs. real: t(19) = 4.16, p < 0.001; ambiguous vs. pseudo: t(19) = 8.01, p < 0.001). The highest accuracy was found for proper pseudowords (vs. real: t(19) = 2.18, p < 0.05), indicating some confusion of ambiguous pseudowords with real words. Since the response was prompted with a delay, effects on reaction time were neither expected nor found (F(2,38) = 1.582, p = 0.221; see Fig. 3.1C).
3.3.2 Sequential e↵ects of word-pseudoword discrimination in ERPs
Overall, the ERPs over midline electrodes show the typical pattern of an N1–P2 complex followed by a later N400-like deflection in all conditions (see Fig. 3.1D). Binning the ERP in 50 ms time windows with 25 ms overlap and testing for condition differences (repeated-measures ANOVA, three-level factor wordness) showed no differences in amplitude before 500 ms post stimulus onset: there were no differences in the N1 or P2 (F < 1). The repeated-measures ANOVA yielded significantly different amplitudes from 0.5 to 1.2 s (mean F(2,38) = 13.19, p < 0.01 after FDR correction). Furthermore, post-hoc t tests on the ERP amplitudes confirmed a regrouping of conditions over time: pseudowords differed from real words over the whole time course (pseudo > real from 0.5 to 1.125 s, mean t(19) = 4.62, mean p < 0.01); ambiguous stimuli initially differed from real words (ambiguous > real from 0.525 to 0.825 s, mean t(19) = 4.27, mean p < 0.01), but regrouped with real words later, differing from proper pseudowords (pseudo > ambiguous from 0.85 to 1.2 s, mean t(19) = 3.1, mean p < 0.01; Fig. 3.1D, grey-shaded inlay).
3.3.3 Differential signatures of wordness in time–frequency data
As seen in the grand-average TFRs in Figure 3.2 (top row), frequencies in the theta range (3–7 Hz) were enhanced, first phase-locked to stimulus onset around 200 ms and then, with markedly decreased phase-locking, from 400 to 1000 ms after stimulus onset. In contrast, alpha power (8–12 Hz) was suppressed during the whole time course of a trial, with the lowest power around 800 ms. For assessing relative power changes, a multi-level statistics approach was chosen as described in the methods section. A linear contrast was set on the first level for testing the wordness effect [real > ambiguous > pseudo]. On the second level, the cluster permutation test, testing the first-level betas against zero, revealed one positive cluster (T_sum = 8319.8, p < 0.05) covering mainly lower- and mid-alpha frequencies (peak at 9.3 Hz and 0.88 s; Fig. 3.2 bottom row). Broadly distributed in general, the cluster showed the largest statistical differences over the left frontal and the right and left central electrodes (Fig. 3.2 bottom row, fourth column). Power values extracted from the cluster (8–12 Hz, 0.88 ± 0.06 s) confirmed significant differences between all three conditions (post-hoc paired t tests: real vs. ambiguous: t(19) = 2.32, p < 0.05; real vs. pseudo: t(19) = 4.66, p < 0.01; ambiguous vs. pseudo: t(19) = 2.09, p < 0.05). When using absolute power, the positive cluster (T_sum = 39,928; p < 0.001) showed a similar distribution over frequency and time, with peak effects at 10.7 Hz and 0.9 s over left anterior electrodes. Interestingly, testing the ambiguity effect [ambiguous > (pseudo, real)] using the same statistical approach revealed one positive cluster (T_sum = 8134.6; p < 0.05) in the theta frequency range (peak at 5.2 Hz, 0.94 s; Fig. 3.2 middle row). Scalp topographies suggested two foci, one at the left-central anterior electrodes and the other at the parietal electrodes. Further, post-hoc paired t tests on power values extracted from the cluster (3–7 Hz, 0.88–1.1 s) confirmed that pseudowords and real words did not differ from each other (t(19) = 1.72, p < 0.1) in the theta frequency range. Testing the absolute theta power, a comparable positive cluster was identified (T_sum = 17,919; p < 0.01), with the highest effect size at 5.5 Hz and 0.92 s but with a slightly shifted topography that overlaps at the left anterior electrodes and additionally emphasizes right temporal areas.
3.3.4 Source localization of alpha and theta power changes
Figure 3.2: TFRs in sensor and source space. The top row shows the grand average TF power changes relative to a 500 ms pre-stimulus baseline over all electrodes for the three conditions separately: from left to right, clear pseudowords, ambiguous pseudowords, and real words. Black contours mark cluster boundaries. The middle row shows scalp topographies for relative power changes in the theta band (4.5 ± 2.5 Hz, 500–1200 ms, corresponding to the time and frequency window used for the source localisation) and, below, the source projection. The bottom row shows the same for relative alpha power changes (10 ± 2 Hz, 1000 ± 350 ms). The fourth column depicts statistical differences. The fifth column shows bar graphs extracted from source peaks in left IFG and right MTG for theta, and left VWFA and right aPFC for alpha, respectively.

Figure 3.3: Alpha–theta index. The index compares the theta effect (Fig. 3.2, middle row) and the alpha effect (Fig. 3.2, bottom row) per source space grid point. The index has been calculated only for grid points which exceeded the critical value of t19 = 1.7291, such that only areas are highlighted which show either an alpha (blue) or a theta (red) effect. Areas with index values around zero (green), e.g., left frontal regions, show equal sensitivity to both effects.

With respect to scalp topography (Fig. 3.2, bottom row), alpha oscillations appeared to be distributed broadly over the scalp with a central focus and exhibited less power with increasing wordness. Following from the single conditions' source projections, source estimation of the alpha-driven wordness effect revealed peak activation in BA 9, right dorsolateral prefrontal cortex (t19 = 3.04; MNI = [10, 57, 40]). The cluster (Tsum = 1,152.4; p < 0.05) extended into the right primary somatosensory areas (BA 3), premotor cortex (BA 6), and motor cortex (BA 4), but also into the bilateral ventral and dorsal anterior cingulate cortex (BA 24/32) and the right inferior prefrontal gyrus (BA 47), including pars triangularis (BA 45). A second peak was found in the left occipito-temporal cortex (t19 = 2.88; MNI = [–50, –79, 0]), extending into BA 37 (fusiform gyrus) and BA 20/21 (inferior and middle temporal gyrus). For theta power changes, the distribution of power change on the scalp (Fig. 3.2, middle row) suggested at least two generators: one of left frontal and one of right parietal origin, both showing the highest relative power increase for ambiguous stimuli. Accordingly, two peak activations were found in one trend-level cluster (Tsum = 341.9; p = 0.067) for the ambiguity effect in the theta range. The first peak activation was found left anteriorly in BA 44 (pars opercularis; t19 = 3.18; MNI = [–40, 19, 40]), extending into BA 9/46, left dorsolateral prefrontal cortex, and BA 6, premotor cortex.
The second local peak was found right posteriorly in the middle temporal gyrus (t19 = 3.01; MNI = [60, –39, –2]), extending into the inferior temporal gyrus (BA 20), fusiform gyrus (BA 37), supramarginal gyrus (BA 40), and posterior STG (BA 22).
3.3.5 Two separate networks disclosed by an alpha–theta index
Calculating the alpha–theta index as shown in Figure 3.3 reveals that three of the four identified source peaks are selective for either the alpha-indexed lexical integration or the theta-indexed ambiguity resolution. Notably, the left IFG shows equally strong effects of alpha and theta activity, as indicated by index values around zero.
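The behaviour of such an index can be illustrated with a small function. The exact formula is not reproduced in this excerpt, so the normalisation below is a hypothetical choice that merely captures the described properties: negative values for alpha-dominated grid points, positive values for theta-dominated ones, values near zero where both effects are equally strong, and masking where neither exceeds the critical t-value.

```python
import numpy as np

def alpha_theta_index(t_alpha, t_theta, t_crit=1.7291):
    """Hypothetical per-grid-point index: positive where the theta effect
    dominates, negative where the alpha effect dominates, near zero where
    both are equally strong. Grid points where neither t-value exceeds the
    critical value (here t_crit for df = 19) are masked with NaN."""
    t_alpha = np.asarray(t_alpha, dtype=float)
    t_theta = np.asarray(t_theta, dtype=float)
    idx = (t_theta - t_alpha) / (np.abs(t_theta) + np.abs(t_alpha))
    idx[(t_alpha < t_crit) & (t_theta < t_crit)] = np.nan
    return idx

# Four illustrative grid points: alpha-only, theta-only, both effects strong
# (cf. the left IFG), and neither effect above threshold.
idx = alpha_theta_index(np.array([3.0, 0.2, 3.0, 0.5]),
                        np.array([0.2, 3.0, 3.0, 0.5]))
```

The third grid point yields an index of zero despite two strong effects, which is exactly the "green" left-frontal case described in the Figure 3.3 caption.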
3.4 Discussion
In order to functionally dissociate slow neural oscillations contributing to speech processing, we set up an auditory EEG study using a well-established lexical decision paradigm. At the same time, the data speak to theoretical controversies concerning spoken word recognition models (e.g., McClelland and Elman, 1986) by applying time–frequency analysis and revealing parallel processes of lexical integration and ambiguity resolution. Notably, alpha suppression, scaling with wordness and hence more akin to the N400, can be considered a marker of ease of lexical integration, while theta enhancement marks the re-evaluation of the available sensory evidence. Generators of the alpha suppression effect were part of a left temporo-occipital and right frontal network. Conversely, generators of the theta effect were localized in left frontal and right middle temporal regions. As we discuss below in further detail, the analysis of different oscillatory frequency bands disclosed the parallel maintenance of lexical and prelexical word versus pseudoword features in different brain regions and frequency ranges. To this end, time–frequency analysis is an important tool to inform discussions on sequential versus parallel processes in word recognition (e.g., Marslen-Wilson, 1987; for discussion see Norris et al., 2000).
3.4.1 Wordness effect in the alpha band
In line with previous findings (Obleser et al., 2012), alpha power showed the greatest suppression for real words, compared to the lowest suppression (or even enhancement) for clear pseudowords. Interestingly, ambiguity leads to sub-optimal lexical integration (Friedrich et al., 2006; Proverbio and Adorni, 2008) and seems to be expressed in a state of intermediate alpha power. Two (related) theoretical framings are relevant for this effect of wordness observed in the alpha frequency range. On the one hand, it has been emphasized that parieto-occipital alpha power reflects an inhibitory mechanism, with particular relevance for working memory and selective attention tasks (Klimesch et al., 2007; Foxe et al., 1998). On the other hand, recent findings provide more direct evidence for an influence of alpha oscillations on the timing of neural processing: Haegens et al. (2011) showed that better discrimination performance can be traced back to neuronal spiking in sensorimotor regions, which depends on the alpha rhythm not only in terms of power (firing is highest during alpha suppression) but also in terms of phase (firing is highest at the trough of a cycle; see also Spaak et al., 2012). Supporting the view put forward by Hanslmayr et al. (2012; high alpha oscillatory power mirroring reduced Shannon entropy and flow of information), Haegens et al. (2011) also found low spike-firing rates during periods of strong alpha coherence, for example during the baseline as opposed to the stimulus period. Both frameworks converge on predicting that low alpha power can serve as a marker of successful lexical integration.
An open issue is the potential contribution of the visual "what"-pathway to the alpha effect observed here. Particularly the left temporo-occipital source localisation peak suggests involvement of visual fields. This might be due to the fact that we used concrete nouns, which are by definition more easily imaginable than the less imaginable pseudowords (for review see Binder et al., 2009). Note, however, that a previous fMRI study using highly similar manipulations (Raettig and Kotz, 2008) found no such effects, not even in the contrast of concrete versus abstract nouns. Nevertheless, the visual word form area has been found in auditory lexical decision tasks before and has been attributed to the literacy of participants (Dehaene et al., 2010; Dehaene and Cohen, 2011). Binder et al. (2006) gathered evidence that this area is especially sensitive to sublexical bigram frequency, a pivotal element of our study design. The argument laid out above, that suppressed alpha power allows lexical integration, would also hold for such a traditionally more reading-related brain area.
3.4.2 Ambiguity effect in the theta band
Contrary to a previous study by Obleser et al. (2012), theta power did not scale linearly with the difficulty of word processing (if defined as difficulty of lexical access). In particular, Obleser et al. (2012) found higher theta power for higher intelligibility, whereas in our case theta power was highest for the ambiguous (i.e., the most difficult) case. The data provided by Obleser et al. (2012) suggest that sufficient spectral information is needed to enable linguistic processes or lexical evaluation, which is reflected in increasing theta power. Our data extend this view by adding ambiguity on a lexical level, which requires additional lexical re-evaluation. Future research needs to clarify whether these two factors, spectral detail and lexical ambiguity, might interact. Nevertheless, both results together support our interpretation of theta oscillations subserving a language-related but task-dependent mechanism, and are in line with previous studies associating theta enhancement with lexico-semantic processing (Hagoort et al., 2004; Hald et al., 2006; Bastiaansen et al., 2008; Peña and Melloni, 2012). Interestingly, a recent opinion paper by Roux and Uhlhaas (2014) suggests that theta oscillations may be involved in the phonological loop (Baddeley, 2003). The link to the phonological loop as a concept of linguistic short-term memory speaks in favor of our interpretation where lexical re-evaluation is achieved by replay of sensory evidence (Fuentemilla et al., 2010). Furthermore, increased prefrontal theta power has been found in response to other types of ambiguous stimuli as well, and therefore might not be tied to the language domain. Specifically, increased mid-frontal theta activity has been reported in studies investigating ambiguity-induced response conflict (Hanslmayr et al., 2008; Cavanagh et al., 2009; Cohen and van Gaal, 2013) and episodic memory retrieval (Staudigl et al., 2010; Ferreira et al., 2014).
Although these studies differ markedly in several respects, they all share the need to process an ambiguous stimulus. It thus appears possible that enhanced theta oscillations during ambiguous word processing reflect enhanced conflict monitoring due to the co-activated real word ('Banene' co-activates 'Banane'). We localised the enhanced theta activity in a bilateral fronto-temporal network with peak activity in the left inferior frontal gyrus (IFG, BA 44) and the right middle temporal gyrus (MTG). Their specific contributions to the proposed replay interpretation, though, must remain speculative. Instructively, a right-hemispheric advantage in tracking spectral information has been shown (Zatorre and Belin, 2001; Obleser et al., 2012; Scott et al., 2009; for review see Price, 2012), which converges with the fact that vowel differences (our crucial manipulation) are primarily spectral differences. More specifically, Carreiras and Price (2008) found, in accordance with our results, increased activation of right-hemispheric areas when manipulating vowels. Combining both ideas, Zaehle et al. (2008) showed that the analysis of prelexical segments with respect to their spectral characteristics involved bilateral MTG activation. The left IFG, however, has been associated with a variety of linguistic processes (see Binder et al., 2009, for a meta-analysis). The unfortunate vagueness of EEG source localisation limits functional dissociations between subregions of the left IFG. Still, the left IFG as a whole plays a role in monitoring auditory input (e.g., Zatorre et al., 1996; Giraud et al., 2004; Obleser et al., 2012). Other terms such as "auditory search", "auditory attention", or "auditory short-term memory" have been used to describe this function. This speaks in favor of our interpretation of auditory re-evaluation.
One might argue that our task was too easy to require top-down or re-evaluative processes. This relates to the ongoing psycholinguistic discussion of whether replay or any feedback loop is really necessary in word recognition (Norris et al., 2000; McClelland et al., 2006). Since our stimuli were not phonetically ambiguous (see section 3.2.2), no perceptual confusion occurred that would have required replay (Ganong, 1980; Frauenfelder et al., 1990; Wurm and Samuel, 1997; Newman et al., 1997; Norris, 2006). However, stimuli were lexically ambiguous, which led to decisional conflicts and required ambiguity resolution processes. Recall that manipulations were introduced no earlier than the second syllable. The third (and final) syllable then either continued the wordness violation (clear pseudoword) or created a lexically ambiguous case by returning to the initially pre-activated cohort (ambiguity). Mattys (1997) summarizes evidence that retrograde information, i.e., information provided after the deviation point, can influence the decision on the identity of a stimulus. This may increase reaction times (Goodman and Huttenlocher, 1988; Taft and Hambly, 1986), implying some re-evaluative processes. We therefore suggest that prelexical information was maintained and replayed in order to resolve decisional ambiguity. In sum, we argue for a theta-tuned network co-activating the left IFG and the right MTG in order to replay lexico-semantic information for task-relevant ambiguity resolution.
3.4.3 Relationship of evoked potentials and induced oscillations
So far, studies analysing the ERP have related the N400 to effortful processing, for example when mapping the phonological form and meaning of pseudowords, compared to real words, onto a stored representation in the mental lexicon (Friedrich et al., 2006). Recent accounts more rooted in the predictive coding framework of cortical functional organization (e.g., Summerfield and Egner, 2009) may describe the N400 as a marker of the mismatch between what is predicted and what is perceived (Lau et al., 2009, 2013). While we cannot distinguish between these explanations in a context-free setup using single-word stimuli, our data more importantly show parallels between the pattern of N400 changes over midline electrodes and the pattern of alpha oscillatory changes. In contrast to the effort or predictive coding hypotheses, the inhibition theory for alpha oscillations would then imply that lexical processing takes place for real words and must be inhibited for pseudowords. Notably, analysing only the ERP would have led to the view that lexico-semantic integration in ambiguous pseudowords can be accomplished in the same way as for their real-word analogs. The regrouping of N400 deflections over time would have suggested a sequential change in processing strategy: at first, ambiguous stimuli were analysed in the same way as proper pseudowords, but from 850 ms onwards no difference between ambiguous and real-word stimuli was discernible. Thus, the conclusion derived from ERPs alone would have been a sequential process of lexical access. Time–frequency decompositions of the ERP as demonstrated here may help in the future to resolve inconsistencies in the N400 literature and its generating brain structures (Halgren et al., 2002; Khateb et al., 2010).
By additionally looking at oscillatory activity, which arguably constitutes the ERP activity to a large extent (Makeig et al., 2004; Mazaheri and Jensen, 2008; Min et al., 2007; Hanslmayr et al., 2007; Klimesch et al., 2007), parallel neural processes become discernible. The data suggest a combination of lexical integration and ambiguity resolution processes: wordness violations are detected (N400) and maintained (alpha power), but also re-evaluated by retrieving stimulus-specific information (i.e., enhanced theta power for ambiguous stimuli).
3.4.4 Conclusion
Time–frequency decomposition functionally separates parallel contributions of theta and alpha oscillations to speech processing, thereby fruitfully extending current frameworks based on evoked potentials. The data presented here provide evidence that lexical as well as prelexical information is maintained in spoken word recognition. The observed specificity, with theta bearing relevance to stimulus-specific, lexico-semantic processes and alpha reflecting more general inhibitory processes (thereby gating lexical integration), is a promising starting point for future studies on speech comprehension in more demanding circumstances such as peripheral hearing loss and/or noisy environments. The data furthermore shed light on the neural bases of the lexical decision task that has been in use for decades. In sum, this approach allows for a refinement of neural models describing the complex nature of spoken word recognition.

4 Alpha oscillations as a tool for auditory selective inhibition³
4.1 Introduction
In ecological listening situations, auditory signals are rarely perceived in quiet, due to the presence of different auditory maskers such as distracting background speech or environmental noise. Thus, sounds from different sources greatly overlap spectro-temporally at the level of the listener's ear. What are the neural correlates that facilitate selective listening to relevant target signals despite irrelevant auditory input (i.e., the "cocktail party problem"; Cherry, 1953)? At the central neural level, two complementary mechanisms of top-down control (i.e., regulation of subsidiary cognitive processes) should be considered: First, top-down selective attention to relevant information (Fritz et al., 2007) could facilitate target processing by enhancing the neural response to the attended stream (i.e., gain control; Lee et al., 2014). Second, top-down selective inhibition of maskers (Melara et al., 2002) could help to direct limited processing capacities away from irrelevant information (Desimone and Duncan, 1995), thereby avoiding full processing of distractors (Foxe and Snyder, 2011). In this regard, interference of auditory maskers might be the result of both insufficient attention to the target and poor inhibition of noise and distractors. In this perspective article we focus on the latter, that is, neural mechanisms of auditory selective inhibition. We propose that cortical alpha (~10 Hz) oscillations are an important tool for top-down control, as they regulate the inhibition of masker information during speech processing in challenging listening situations.
4.2 A framework to test auditory alpha inhibition
³ This chapter is adapted from parts of the published article by Strauß, Wöstmann, and Obleser (2014). Front Hum Neurosci 8, 350.

Figure 4.1: The proposed role of alpha activity for speech processing in noise. A. Average time–frequency representations (TFR) of 11 participants performing a lexical decision task on words in quiet (top) and in white noise (bottom). SNRs were titrated individually using a two-down-one-up staircase adaptive tracking procedure. The average SNR was -10.22 dB ± 1.95 (SD), such that participants performed about 71 % correct. Speech onset is indicated by the black vertical line at 0 s; average word length = 750 ms; EEG recorded from 61 scalp electrodes; time–frequency analysis using Morlet wavelets. Plots show measures of absolute power averaged over all scalp electrodes. The topography depicts the alpha power difference for speech in noise minus quiet. Data were SCD (source current density)-transformed before TFR estimation to improve spatial resolution. B. Inter-trial phase coherence (ITPC) as a measure of phase-locking of oscillations over trials. ITPC is bound between 0 and 1; higher ITPC values indicate stronger phase alignment across trials. C. A simple framework of alpha oscillations for speech processing in noise. Acoustic signals overlap energetically as they enter the ear. At the brain level, features of speech and noise are processed as far as possible in distinct processing channels (depicted here with arrows; for details see text). High alpha power inhibits channels processing noise features to allow for optimal task performance with minimised noise interference.

A common observation is a prominent increase in alpha power when participants listen to auditory materials presented against background noise (e.g., Wilsch et al., 2014). Figure 4.1A, for example, shows the grand average time–frequency representations of 11 participants during a lexical decision task on isolated words presented in quiet (data reported in Chapter 3) and in white noise. For words in quiet, alpha power at around 10 Hz did not increase considerably after word onset. However, when words were presented in noise, alpha power was increased during the first 500 milliseconds after word onset, corresponding to the first two thirds of the average word duration. This effect was strongest over temporal and occipital sites (topography in Fig. 4.1A), suggesting inhibition of the task-irrelevant visual modality but also compensatory mechanisms within speech-related areas.
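The two-down-one-up staircase mentioned in the Figure 4.1 caption converges on the SNR yielding roughly 70.7 % correct (Levitt's rule), consistent with the ~71 % reported there. Below is a self-contained simulation; the step size, starting SNR, and the listener's psychometric function are illustrative assumptions, not the parameters of the actual experiment.

```python
import math
import random

def two_down_one_up(p_correct, start_snr=0.0, step=2.0, n_trials=200, seed=1):
    """Adaptive SNR track: decrease SNR (harder) after two consecutive correct
    responses, increase it (easier) after every error. This rule converges on
    the SNR at which p_correct is about 0.707 (Levitt, 1971)."""
    rng = random.Random(seed)
    snr, streak, track = start_snr, 0, []
    for _ in range(n_trials):
        track.append(snr)
        if rng.random() < p_correct(snr):
            streak += 1
            if streak == 2:          # two down: make the task harder
                snr -= step
                streak = 0
        else:                        # one up: make the task easier
            snr += step
            streak = 0
    return track

def listener(snr, midpoint=-12.0, slope=2.0):
    """Hypothetical psychometric function of a simulated listener."""
    return 1.0 / (1.0 + math.exp(-(snr - midpoint) / slope))

track = two_down_one_up(listener)
```

After an initial descent from the easy starting SNR, the track oscillates around the 70.7 %-correct point of the simulated listener (here near -10 dB).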
Critically, the alpha power difference did not depend on ITPC (inter-trial phase coherence) differences, as indicated by the absence of stronger ITPC in noise compared to quiet (Fig. 4.1B). We therefore presume that induced (i.e., not strictly stimulus-locked; Freunberger et al., 2009) alpha power is crucial for speech processing in challenging listening conditions, as it suppresses irrelevant information. Figure 4.1C illustrates a tentative framework for how alpha oscillations could support auditory selective inhibition. Sounds arriving at the listener's ear must be further processed in the brain to extract task-relevant information. One way to think about the proposed mechanism is in terms of auditory object selection, which requires object formation in the first place (Shinn-Cunningham, 2008). An auditory object might be formed on the basis of common spectro-temporal features, harmonicity, simultaneous onsets, or spatial grouping (Griffiths and Warren, 2004; Bizley and Cohen, 2013). We refer to all these different features used to form auditory objects as "channels" of auditory information, represented by the arrows in Fig. 4.1C. The concept of channels has a long tradition (Broadbent, 1958) and is inspired by the clearest distinction of target and distractor used in many dichotic listening paradigms, where the left and right ear channels need to be separated. Nevertheless, channels in our framework should be conceived as functional auditory processing units rather than anatomical pathways. As soon as these channels are defined, attention or inhibition can be selectively applied, given attentionally flexible fields in the auditory cortices (Petkov et al., 2004).
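ITPC as used above is simply the length of the trial-averaged unit phase vector at each time–frequency point. A minimal numpy sketch on synthetic phase distributions (the trial counts and distributions are illustrative):

```python
import numpy as np

def itpc(phases, axis=0):
    """Inter-trial phase coherence: length of the mean unit phase vector
    across trials; 0 for uniformly scattered phases, 1 for perfect
    phase alignment."""
    return np.abs(np.mean(np.exp(1j * np.asarray(phases)), axis=axis))

# Synthetic single time-frequency point: 1000 trials of phase angles (rad).
rng = np.random.default_rng(0)
locked = rng.normal(0.0, 0.3, 1000)            # strongly phase-locked trials
scattered = rng.uniform(-np.pi, np.pi, 1000)   # no phase-locking
```

Tightly clustered phases yield ITPC near 1, uniformly scattered phases near 0; comparable ITPC for speech in noise and in quiet is what licenses the "induced, not evoked" interpretation of the alpha power difference.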
Note that even though claims about alpha oscillations in feature-based (Romei et al., 2012) and object-based (Kinsey et al., 2011) attention have been made in the visual modality, we do not make any assumption about this distinction in our framework and use the term "channels" for both features and objects, or early and late selection. If speech is presented in quiet (Fig. 4.1C, top panel), alpha power is low in channels processing features of the speech signal, supporting the processing of task-relevant information. Accordingly, the net resulting alpha power in the M/EEG would continue at baseline level (Fig. 4.1A) and decrease during word integration (>400 ms). If, however, speech is presented in the presence of maskers (e.g., environmental noise, distracting talkers; Fig. 4.1C, bottom panel), alpha power first needs to be up-regulated in those channels processing noise features before it is suppressed during word integration (Fig. 4.1A). Enhanced alpha activity inhibits the processing of noise and thereby "protects" (Klimesch, 1999; Roux and Uhlhaas, 2014) the task- or performance-relevant information in the speech signal from noise interference. Importantly, the up-regulation of alpha power in channels that process noise is not an automatic ("bottom-up") process but critically depends on "top-down" attentional control. For instance, in a multi-talker situation, target and distracting talker switch roles permanently as the listener decides to change the conversational partner. In such a situation, M/EEG alpha power would be constantly at a high level; however, the deployment of alpha power onto the different processing channels would be changing continuously.
4.3 A short review of auditory alpha inhibition
What is the functional role of high alpha activity for word processing in noise? To answer this question, it is essential to distinguish interpretations in which alpha activity is related to target processing from those related to noise processing. It is possible that the reduced intelligibility of words in noise leads to sub-optimal word processing and thus to less alpha suppression in brain areas relevant for speech processing (Strauß et al., 2014). The inverse mechanism, as we put forward in the current framework, is equally likely: alpha power is enhanced for temporarily irrelevant information and thereby compensates for perceived cognitive effort (increased when listening to speech in noise: Larsby et al., 2005; Helfer et al., 2010; Zekveld et al., 2011). In this regard, alpha would "protect" the lexical processes from noise interference. The challenge will be to experimentally dissect these (not mutually exclusive) mechanisms. We now review initial evidence for alpha's inhibitory role in audition. Currently, only few studies show alpha power modulations while participants simultaneously listen to two auditory streams, that is, one signal and one masker. In one study by Kerlin and colleagues (2010), participants listened simultaneously to two spatially separated speech streams. On each trial, an initial visual cue indicated whether they were supposed to attend the left or the right stream. During speech presentation, EEG alpha power was enhanced over the cerebral hemisphere contralateral to the masker, while alpha power was reduced contralateral to the to-be-attended stream. The authors concluded that this alpha lateralization indexes the direction of auditory attention to speech in space.
Importantly, this finding corroborates our view that enhanced alpha power in brain areas engaged in distractor processing decreases further processing of the distractor and hence facilitates processing of the target signal. However, two questions arise from this study: First, as the direction of auditory attention was cued visually, the alpha lateralization might indicate the allocation of supramodal rather than auditory selective attention (Farah et al., 1989). Second, spatial attention may play a special role, not least because auditory processing models suggest separate what- and where-pathways (Rauschecker and Scott, 2009). In three other recent studies, alpha power modulations were consistently found during the anticipation of auditory target signals from the left or right (Müller and Weisz, 2012; Banerjee et al., 2011; Ahveninen et al., 2013). In these studies, participants were cued to attend either the auditory event on the left or on the right, and to ignore the distractor on the other side. Alpha power was enhanced during the anticipation of auditory stimulation contralateral to the distractor. These results demonstrate alpha lateralization effects already during the preparation for an auditory selective listening task. This is in line with studies reporting high pre-stimulus alpha power when participants are about to miss a (visual) target (van Dijk et al., 2008; Busch et al., 2009; Romei et al., 2010). In terms of our framework (Fig. 4.1C), anticipatory high alpha power successfully blocks in-depth processing of sensory information that might lead to missing the target. However, the interpretations of these studies are of limited value for our model, since alpha power modulations were found only during the anticipation, but not during the actual processing, of competing auditory streams. More data are clearly needed on peri-stimulus alpha dynamics.
As the spatial resolution of M/EEG is limited, prospective experiments could induce alpha oscillations over specific brain areas using transcranial alternating current stimulation (tACS) to assess the influence of alpha modulations on listening success under adverse acoustic conditions. Moreover, future studies could record the electrocorticogram (ECoG) directly from the cortical surface to track alpha sources and reveal the interplay between frequency bands. Such higher spatial resolution would allow differentiating between alpha activity in brain regions associated with processing the masker or the signal. As of now, we are left to speculate how spatially specific alpha oscillations might operate, for example along a cochleotopic gradient in primary auditory cortex. The best data from which to infer stem from visual cortex, where, for example, Buffalo and colleagues recorded from attended versus non-attended receptive fields with two electrode tips less than a millimeter apart and report attention-dependent, opposing, deep-layer-specific alpha changes (expressed as alpha spike–field coherence; Buffalo et al., 2011). Comparable data are, to our knowledge, still missing for auditory areas.
4.4 Conclusion
We have presented a framework for studying alpha oscillations as a tool for auditory selective inhibition in challenging listening situations. The data provide initial evidence qualifying alpha oscillations as a pivotal mechanism affecting listening in multi-talker situations. Future studies could expand these findings and study the role of alpha oscillations during speech perception in ecologically valid listening situations.
5 Alpha phase determines successful lexical decision in noise⁴
5.1 Introduction
Human psychophysical performance in the detection and discrimination of low-level stimuli has been found to depend on slow pre-stimulus oscillatory brain states across domains (visual: Varela et al., 1981; Hanslmayr et al., 2007; van Dijk et al., 2008; Busch et al., 2009; Schubert et al., 2009; Cravo et al., 2013; Spaak et al., 2014; auditory: Lakatos et al., 2005; Henry and Obleser, 2012; audiovisual: Keil et al., 2014). These findings relate neural phase to fluctuations in neural excitability, such that performance is best for targets coinciding with the excitable phase of a neural oscillation and worst for targets coinciding with the inhibitory phase. Going beyond low-level perception, we ask here whether higher cognitive functions such as speech processing also depend on neural phase. Although recently proposed models would predict a dependence of speech processing on neural oscillatory phase (Ghitza, 2011; Gagnepain et al., 2012; Giraud and Poeppel, 2012), no experimental evidence has been gathered so far. One elegant task that can bridge psychophysical aspects of performance (detection or discrimination) with speech processing is the auditory lexical decision task (Marslen-Wilson, 1980): Listeners are presented with words as well as word-like stimuli (i.e., pseudowords) and have to judge whether they heard a meaningful word or not. In parallel to low-level discrimination studies, we made the lexical decision task "near-threshold" by embedding speech in individually titrated levels of white noise, which increased the difficulty of the task and, purposefully, the number of errors. We simultaneously recorded the electroencephalogram and hypothesized that a dependence of lexical decision accuracy on low-frequency neural oscillatory phase should be observed. Here, we were interested in the role of alpha (8–12 Hz) and theta (3–7 Hz) neural phase for lexical decision performance.
Instantaneous alpha phase has previously been linked to low-level detection and discrimination performance not only in the visual (Mathewson et al., 2009; Busch and VanRullen, 2010; Romei et al., 2010), but also in the auditory domain (Rice and Hagstrom, 1989; Neuling et al., 2012). Critically, alpha phase has been
4 This chapter is adapted from a manuscript by Strauß, Henry, Scharinger, and Obleser (in revision), The Journal of Neuroscience.
found to modulate neuronal firing and to determine the neural phase associated with best discrimination performance (Haegens et al., 2011). Discrimination performance in lexical decision may also depend on syllabic processing and thus potentially be indexed by oscillatory activity in the theta range (~4 Hz), with oscillation periods corresponding to the average syllable duration of around 250 ms (Ng et al., 2012; Peelle and Davis, 2012; Gross et al., 2013; Doelling et al., 2014; note that Busch et al., 2009, also reported a pre-stimulus phase bifurcation effect in the 7-Hz range). Similar to alpha, theta oscillations have been linked to neuronal firing (e.g., Kayser et al., 2012) and can impact auditory detection performance (Ng et al., 2013). Our data show that the accuracy of auditory lexical decision depends on the instantaneous phase of alpha oscillations: Stimuli that were later judged incorrectly fell into an alpha phase opposite to that of stimuli that were judged correctly, in a pre-stimulus as well as in a second, peri-stimulus time window.
5.2 Methods
5.2.1 Participants
Eleven participants (7 females; 25.1 ± 1.6 years, M ± SD) gave informed consent to take part in the experiment. All were native speakers of German, right-handed, with self-reported normal hearing abilities, and had no history of neurological or language-related problems. They received financial compensation for their participation. All procedures had ethical approval from the Ethics Committee of the University of Leipzig.
5.2.2 Stimuli
Stimuli were real words and their pseudoword counterparts (Raettig and Kotz, 2008; Strauß et al., 2014). Pseudowords were created as follows: From a list of 60 tri-syllabic concrete German nouns (‘real’ words, e.g., /banane/ [engl. banana]), two types of pseudowords were derived: ‘ambiguous’ pseudowords, by exchanging only the nucleus vowel of the second syllable (e.g., /banene/), and ‘opaque’ pseudowords, by scrambling syllables across words while keeping the position-in-word fixed (e.g., /bapossner/). Furthermore, 60 ‘abstract’ real words (e.g., /botanik/ [engl. botany]) served as fillers to ensure a balanced word–pseudoword ratio. In sum, the experimental corpus consisted of 240 lexical stimuli with a mean length of 754.2 ± 83.5 ms (M ± SD). In the following, opaque pseudowords and abstract real words were not analyzed, because we focused on the noise-induced vowel confusion between real words and ambiguous pseudowords that leads to “Yes” or “No” decisions about whether an item was a word or not. All words and pseudowords were spoken by a trained female speaker and digitized at 44.1 kHz. Post-processing included down-sampling to 22.05 kHz, cutting at the zero crossings closest to articulation on- and offsets, and root-mean-square (RMS) normalization.
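The ‘ambiguous’ derivation can be illustrated with a short sketch. This is purely illustrative Python (the actual stimuli were recorded speech, not manipulated strings), and the (onset, nucleus, coda) syllable representation is an assumption made for the example:

```python
def make_ambiguous(syllables, new_nucleus, pos=1):
    """Derive an 'ambiguous' pseudoword by exchanging the nucleus
    vowel of one syllable (default: the second), as in
    /banane/ -> /banene/. Syllables are (onset, nucleus, coda) triples."""
    onset, _, coda = syllables[pos]
    out = list(syllables)
    out[pos] = (onset, new_nucleus, coda)
    return "".join(o + n + c for o, n, c in out)

# /banane/ with the second-syllable nucleus /a/ replaced by /e/
print(make_ambiguous([("b", "a", ""), ("n", "a", ""), ("n", "e", "")], "e"))  # banene
```

The ‘opaque’ pseudowords would instead recombine syllables across different words while preserving each syllable's within-word position.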
5.2.3 Experimental procedure
Prior to the experimental EEG session, individual signal-to-noise ratios (SNRs) were determined by means of an adaptive tracking procedure. During adaptive tracking, participants were presented with the second syllables extracted from the real words and their ambiguous-pseudoword counterparts. On each trial, the participant heard two successive syllables embedded in white noise and indicated whether the vowels in the pair were “same” or “different”. The intensity of the white noise was adjusted according to a two-down-one-up staircase procedure that estimated the SNR targeting 70.7% accuracy (Levitt, 1971). The resulting average SNR was –10.22 ± 1.95 dB (M ± SD). Next, a short familiarization with the trial timing was provided, during which participants made lexical decisions in noise about 10 additional items from Raettig and Kotz (2008) that were not used in the present experiment. During the EEG experiment, participants heard words and pseudowords embedded in white noise and indicated via button press whether they had heard a real word or not (“Yes”/“No”). Button order (left/right for “Yes”/“No” responses) was counterbalanced across participants. On each trial, the white noise started 1 s before (pseudo)word onset, coincident with the appearance of a fixation cross, and lasted for 2.2 s in total (see Fig. 5.1A). After 2.2 s, the fixation cross changed to a question mark that prompted the lexical decision response. Trial timing was chosen based on a previous study in our lab using the same paradigm without noise (Strauß et al., 2014) and allowed artifact-free estimation of time–frequency representations (see Data analyses). Each participant listened to 240 stimuli (120 words, 120 pseudowords) in an individually pseudo-randomized sequence; that is, each participant heard both the ‘real’ and the ‘ambiguous’ version of each word.
The order of occurrence of a given real word and its pseudoword counterpart was counterbalanced across participants in order to control for potential interfering effects of previous exposure to the respective complementary item. For the same reason, the distance between a word and its pseudoword counterpart was maximized (i.e., on average 120 other items in between). Listeners paused after each block of 60 trials. The overall duration of the experimental procedure was about 30 minutes.
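The two-down-one-up tracking rule can be sketched as follows. This is a minimal illustration of Levitt's (1971) procedure, not the exact implementation used here; the step size of 2 dB is an assumption for the example:

```python
def staircase_update(snr_db, correct, n_correct_in_row, step_db=2.0):
    """One update of a two-down-one-up staircase (converges on ~70.7%
    correct): two consecutive correct responses make the task harder
    (lower SNR); a single error makes it easier (higher SNR).

    Returns (new_snr_db, new_n_correct_in_row)."""
    if correct:
        n_correct_in_row += 1
        if n_correct_in_row == 2:           # two-down: decrease SNR
            return snr_db - step_db, 0
        return snr_db, n_correct_in_row     # wait for a second correct
    return snr_db + step_db, 0              # one-up: increase SNR
```

Running this rule over the pre-experiment vowel-discrimination trials would drive the noise level toward each listener's individual ~70.7%-correct SNR.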
5.2.4 Data acquisition and preprocessing
The electroencephalogram (EEG) was recorded from 64 Ag–AgCl electrodes positioned according to the extended 10–20 standard system on an elastic cap, with a ground electrode mounted on the sternum. Bipolar horizontal and vertical electrooculograms (EOG) were recorded for ocular artifact-rejection purposes. All impedances were kept below 5 kΩ. Signals were referenced online against the left mastoid and digitized with a sampling rate of 500 Hz and a passband of DC to 140 Hz. Individual electrode positions were determined after EEG recording with the Polhemus FASTRAK electromagnetic motion tracker.
Figure 5.1: Trial design and behavioural measures. A. Trial design. Lexical stimuli were presented against a white-noise background. The distribution of critical vowel onsets is shown schematically in relation to the timing of the two alpha phase effects reported here. Average word length was 0.74 ± 0.08 s (M ± SD). The delayed lexical decision was prompted by a question mark. B. Analysis scheme. Individual signal-to-noise ratios targeted 70% correct. For the analysis, correct trials comprised trials on which participants responded “Yes” to a real word or “No” to its ambiguous counterpart, as illustrated by the cross-tabulation. C. Behavioural results. Participants performed better for real words than for ambiguous pseudowords; however, performance for both stimulus types was significantly above chance.
EEG preprocessing was done offline using the open-source Fieldtrip toolbox (Oostenveld et al., 2011) for Matlab (Mathworks). To avoid edge effects at low frequencies, broad epochs were defined, ranging from –700 ms (excluding ERPs due to noise onset) to 2100 ms relative to (pseudo)word onset. Data were band-pass filtered from 0.1 Hz to 100 Hz using a two-pass Butterworth filter and, for ERP analysis only, re-referenced to combined mastoids (time–frequency analyses, see below, used the average reference). To reject systematic artifacts, independent component analysis (ICA) was applied, and components comprising eye movement, heartbeat, and muscle artifacts were rejected according to the definitions provided by Debener et al. (2010). After ICA, an automatic artifact-rejection routine removed single trials for which the within-channel peak-to-peak range exceeded 120 µV. On average, 2.7 ± 3.0 (M ± SD) trials were rejected per participant. The resulting clean data were used for all subsequent analyses.
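The peak-to-peak rejection criterion is simple to state in code. A minimal sketch, assuming epoched data as a NumPy array in µV (the thesis used Fieldtrip's Matlab routines, so this is an illustrative translation only):

```python
import numpy as np

def reject_trials(data, threshold_uv=120.0):
    """Flag trials to keep, given data of shape (trials, channels, samples)
    in microvolts. A trial is rejected if the peak-to-peak range of ANY
    channel within the epoch exceeds the threshold."""
    ptp = data.max(axis=-1) - data.min(axis=-1)  # (trials, channels)
    return (ptp <= threshold_uv).all(axis=1)     # boolean mask over trials

# Example: two trials, the second has a 200-µV spike on channel 0
data = np.zeros((2, 3, 10))
data[1, 0, 5] = 200.0
print(reject_trials(data))  # [ True False]
```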
5.2.5 Data analyses
Phase analysis. Time–frequency representations (TFRs) were estimated from single-trial data so that we could assess the effects of phase and power on lexical decisions. Epoched, filtered, artifact-rejected time-domain data were re-referenced to the average reference (Strauß et al., 2014). Subsequently, single-trial TFRs were computed by convolving the data with Morlet wavelets in 20-ms steps, using a frequency-specific window width to account for the trade-off between higher frequency resolution at lower frequencies and higher time resolution at higher frequencies. Accordingly, for logarithmically spaced frequencies from 3 to 30 Hz, window widths increased linearly from 2 to 12 cycles. Phase and power values were then estimated at each channel × frequency × time point from the complex output of the wavelet convolution. For the analysis of phase data, we calculated the phase bifurcation index (BI) suggested by Busch et al. (2009). First, trials were split based on accuracy (i.e., correct versus incorrect responses) for each participant. Then, we calculated inter-trial phase coherence (ITPC; 0 ≤ ITPC ≤ 1) separately for correct trials, for incorrect trials, and for all trials taken together. Lastly, to compute the phase bifurcation index, the ITPC values for correct, incorrect, and all trials were combined according to the following formula:
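As a computational sketch (the analyses were run in Fieldtrip/Matlab; this is an illustrative Python translation, with `phases` assumed to hold single-trial phase angles in radians at one channel–frequency–time point), ITPC and the bifurcation index following the definition in Busch et al. (2009) amount to:

```python
import numpy as np

def itpc(phases):
    """Inter-trial phase coherence: length of the mean resultant vector
    of single-trial phase angles; 1 = identical phase on every trial,
    ~0 = uniformly scattered phases."""
    return np.abs(np.mean(np.exp(1j * np.asarray(phases))))

def bifurcation_index(phases_correct, phases_incorrect):
    """Phase bifurcation index: positive when correct and incorrect
    trials each cluster in phase, but at opposite phases, so that the
    pooled distribution is less coherent than either subset alone."""
    phases_all = np.concatenate([phases_correct, phases_incorrect])
    c = itpc(phases_correct) - itpc(phases_all)
    i = itpc(phases_incorrect) - itpc(phases_all)
    return c * i
```

With correct trials clustered at one phase and incorrect trials at the opposite phase, both subset ITPCs exceed the pooled ITPC and the index is positive, which is the pattern reported for alpha in this study.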