Cross-modal mechanisms: perceptual multistability in audition and vision

Dissertation approved by the Faculty of Natural Sciences of Technische Universität Chemnitz

for the award of the academic degree

doctor rerum naturalium

(Dr. rer. nat.)

submitted by M.Sc. Jan Grenzebach, born on 06.07.1988 in Rotenburg an der Fulda; submitted on 13.01.2021

Reviewers: Prof. Dr. rer. nat. habil. Alexandra Bendixen, Prof. Dr. sc. nat. Wolfgang Einhäuser-Treyer

Date of the defense: 04.05.2021. Published at: https://nbn-resolving.org/urn:nbn:de:bsz:ch1-qucosa2-749013


BIBLIOGRAPHIC DESCRIPTION
Author: Grenzebach, Jan
Title: “Cross-modal mechanisms: perceptual multistability in audition and vision”
Dissertation at the Faculty of Natural Sciences, Institute of Physics, Technische Universität Chemnitz, 2021
189 pages; 46 figures; 5 tables; 297 references; 13 digital media (online)

Abstract: Perceptual multistability is a phenomenon that has mostly been studied in each modality separately. The phenomenon reveals fundamental principles of the perceptual system in the formation of an emerging cognitive representation in consciousness. During stimulation with an ambiguous stimulus, the momentary perceptual organization switches between several alternative organizations, or percepts: The auditory streaming stimulus in audition and the moving plaids stimulus in vision each elicit at least two different percepts that dominate awareness exclusively for a random phase or dominance duration before an inevitable switch to another percept occurs. The similarity of the perceptual experience across modalities has led to the proposal of a global mechanism contributing to perceptual multistability crossmodally; conversely, the differences in the perceptual experience have led to the proposal of distributed, modality-specific mechanisms. The development of a hybrid model has synergized both approaches. We accumulate empirical evidence for the contribution of a global mechanism, albeit distributed mechanisms play an indispensable role in this cross-modal interplay. The overt report of the perceptual experience in our experiments is accompanied by the recording of objective, cognitive markers of consciousness: Reflexive eye responses, namely the dilation of the pupil and the optokinetic nystagmus, correlate with the unobservable perceptual switches and perceptual states, respectively, and have their neuronal rooting in the brainstem. We complement earlier findings on the sensitivity of the pupil to visual multistability: Two independent experiments showed that the pupil dilates at the time of reported perceptual switches in auditory multistability. A control condition guards the results against confounding effects from the reporting process. Endogenous perceptual switches, evoked internally by the unchanged stimulus ambiguity, and exogenous perceptual switches, evoked externally by changes in the physical properties of the stimulus, could be discriminated based on the maximal amplitude of the dilation. The effects that exogenous perceptual switches have on the pupil were captured in a report and a no-report task to detect confounding perceptual effects. In two additional studies, the moment-by-moment coupling, and its properties, of percepts between concurrent multistable processes in audition, evoked by auditory streaming, and in vision, evoked by moving plaids, was found crossmodally. In the last study, the externally induced percept in the visual multistable process was not relayed to the simultaneous auditory multistable process; still, the observed general coupling is fragile but existent. A prerequisite for investigating a moment-by-moment coupling of the multistable perceptual processes was the application of a no-report paradigm in vision: The visual stimulus evokes an optokinetic nystagmus whose properties differ, in a machine-learnable way, depending on which of the two percepts is being followed. In combination with the manually reported auditory percept, attentional bottlenecks due to a parallel report were circumvented. The two main findings, the dilation of the pupil along with reported auditory perceptual switches and the crossmodal coupling of percepts in bimodal audiovisual multistability, speak in favor of a partly global mechanism being involved in the control of perceptual multistability; the global mechanism is constrained by the partly independent, distributed competition of percepts at the modality level.
Potentially, supramodal attention-related modulations consolidate the outcome of locally distributed perceptual competition in all modalities. Keywords: perception, multistability, multimodal, pupil dilation, optokinetic nystagmus, hybrid mechanism, auditory streaming, moving plaids, no-report, support vector machine


ACKNOWLEDGEMENTS

Foremost, I want to thank both my supervisors, Prof. Dr. Alexandra Bendixen and Prof. Dr. Wolfgang Einhäuser-Treyer, equally for their enduring trust and guidance; their feedback gave me the opportunity to develop new skills, improve existing ones, and gain insights into their research methodology. In the modern scientific environment they gave me access to, I was able to unfold, learn, and contribute to open research questions in human perception. Thank you for providing me with what I could not ask for.

For his knowledgeable technical support and tireless effort in keeping the laboratory in excellent state, Thomas Baumann has my gratitude. My project partner and co-author in several studies, Thomas Wegner, must be thanked for his well-commented code. The gathered empirical data depended on the thoughtful instruction of participants in our experiments, which was executed superbly by Christiane Breitkreutz, Susann Lerche, and Fabian Parth. I appreciate our generous project funding from the German Research Foundation (DFG). For comments and discussions on earlier drafts of this Thesis, I thank Dr. Bahareh Kiani and Caroline Hübner. Lastly, Dr. Karl Kopiske and Dr. Jan Drewes have been a rich source of know-how in practical statistical inference and in the implementation of machine-learning algorithms, respectively.

The last three years would have been a less enjoyable journey without family, friends, and colleagues near and far: my sisters Lisa and Hannah, Alexander Jarka, Marius Bartholmai, Felix Beyer, Sophie Metz, Monique Michl, Daniel Walper, Daniel Backhaus, Max Theisen, Christiane Neubert and Tobias Müller… Thank you!

Chemnitz, 13th January, 2021, Jan Grenzebach


CONTENTS

COVER
BIBLIOGRAPHIC DESCRIPTION
ACKNOWLEDGEMENTS
CONTENTS

CHAPTER 1: Introduction
C1.1: Stability and uncertainty in perception
C1.2: Auditory, visual and audio-visual multistability
C1.3: Capturing the subjective perceptual experience
C1.4: Limitations of preceding studies, objectives, and outline of the Thesis

CHAPTER 2: Study 1 “Pupillometry in auditory multistability”
C2.1.1 Experiment 1: Introduction
C2.1.2 Experiment 1: Material and Methods
C2.1.3 Experiment 1: Data analysis
C2.1.4 Experiment 1: Results
C2.1.5 Experiment 1: Discussion
C2.2.1 Experiment 2: Introduction
C2.2.2 Experiment 2: Material and Methods
C2.2.3 Experiment 2: Data analysis
C2.2.4 Experiment 2: Results
C2.3 Experiment 1 & 2: Discussion
C2.4 Supplement Study 1

CHAPTER 3: Study 2 “Multimodal moment-by-moment coupling in perceptual bistability”
C3.1.1 Experiment 1: Introduction
C3.1.2 Experiment 1: Results
C3.1.3 Experiment 1: Discussion
C3.1.4 Experiment 1: Material and Methods
C3.1.5 Experiment 1: Data analysis
C3.2 Supplement Study 2

CHAPTER 4: Study 3 “Boundaries of bimodal coupling in perceptual bistability”
C4.1.1 Experiment 1: Introduction
C4.1.2 Experiment 1: Material and Methods
C4.1.3 Experiment 1: Data analysis
C4.1.4 Experiment 1: Results
C4.1.5 Experiment 1: Discussion
C4.2.1 Experiment 2: Introduction
C4.2.2 Experiment 2: Material and Methods
C4.2.3 Experiment 2: Data analysis
C4.2.4 Experiment 2: Results
C4.3 Experiment 1 & 2: Discussion
C4.4 Supplement Study 3

CHAPTER 5: General Discussion
C5.1 Significance for models of multistability and implications for the perceptual architecture
C5.2 Recommendations for future research
C5.3 Conclusion

REFERENCES
APPENDIX
A1: List of Figures
A2: List of Tables
A3: List of Abbreviations and Symbols


CHAPTER 1: Introduction

C1.1 Stability and uncertainty in perception

Interactions with the surrounding scenery rely on a coherent and meaningful cognitive representation of physical attributes, such as light or sound waves, emitted or reflected by distal surfaces. Sensory cells in the ear and the eye capture the radiated waves (pressure waves and electromagnetic waves, respectively), transduce them into neuronal signals, and forward the sensations to higher-level processing (Yoshioka & Sakakibara, 2013). The raw neuronal signals need to be organized to form a consciously available perceptual representation (Y. C. Chen & Spence, 2017). The formative process begins on low sensory levels and is carried upwards to larger populations of neurons that interact with each other. Rather than passively transmitting the information that stimulates the sensors through the neuronal structure, perception was discovered to be an active process. In the course of this process, the neuronal signals that encode the sensor stimulation are “shaped” to form an internal representation that complies with the given sensory input and predicts upcoming sensory information (Binder et al., 2004; Friston & Kiebel, 2009). Therefore, in perceptual processing, the “objectivity” provided by the raw sensory information is partly replaced by a “subjective” interpretation, an idiosyncratic representation of the environment. The principles describing how sensory signals are grouped into distinct perceptual objects attracted attention more than a century ago (Fechner, 1860). The potentially evolutionarily grounded heuristics are acquired through life-long perceptual learning, driven by constant exposure to and interaction with visual and auditory sensory information (Barlow, 1990). Nonetheless, even the fully developed perceptual system of a healthy adult is continuously challenged by unknowns inherent in the perceptual signal. Uncertainty can stem from insufficient prior knowledge about a specific object and from temporarily coarse sensory input (Millar, 2000; Pouget et al., 2013). In auditory perception, sound unfolds short-lived in time, and numerous time-scales play a crucial role in extracting the meaning (Bregman, 1990; Darwin, 1997): A single sound burst captured as a discrete sound event may signal a collision of two solid objects, whereas the repetition of very similar sounds signals the clapping of hands. Given that sounds emitted nearby are usually perceived as louder than sounds emitted from far away, a series of such distinct sounds of decreasing loudness might signal a person walking away. In vision, the interpretation can be similarly compelling: Small objects tend to be far away and increase in perceived size when approached. Of course, this perceived change is independent of the object's real magnitude, which is invariant (Biederman & Cooper, 1992).

The formation of an internal perceptual object that is representational and allows recognizing the shape and location of an external object is a process prone to many different sources of variability (Arieli et al., 1996), such as sensory fluctuations and varying cognitive states (Lutz et al., 2002; Wales & Fox, 1970). The selection of the most probable perceptual interpretation among all possible explanations is called a perceptual decision (Wohrer et al., 2013). In fact, the perceptual system is very decisive: It rejects numerous unlikely perceptual alternatives and, most of the time, supplies a single, coherent representation. The perceptual system overcomes challenges like noise at all levels of neuronal responses in the processing pathway, fluctuations in the available sensory qualities, and erratic or incompatible observations in the scene semantics (Biederman & Cooper, 1992; Kornmeier & Bach, 2012). Such factors add to the underlying stochastic process in perception and make perceptual decisions the result of probabilistic computations.

The structured study of human perception motivated psychologists, physicists, and philosophers in the late 19th century to begin investigating the relationship between external stimuli and internal perception. Gustav Theodor Fechner, Hermann von Helmholtz, and Max Wertheimer are commonly seen as the forefathers of modern, quantitative psychological science. They laid the foundation for what is nowadays considered basic equipment for studying the relation between psychological experience and stimulus properties (e.g., intensity). The challenge of how to approach and quantify subjective experience seemed for a long time an impossible endeavor (Sturm & Wunderlich, 2010). The application of mathematical language and measurement to formalize the functionality of the perceptual system was subsequently resolved by the first psychophysicists.

The outcome of perceptual probability computations was investigated with, e.g., signal detection theory, which measures the capability of a sensor to differentiate information from noise (Green & Swets, 1966). Similar threshold measurements are one continuation of the earliest theoretical approaches in psychophysics, like the Weber-Fechner law (Hecht, 1924). The law describes that a linear increase of the perceived stimulus strength is evoked by a logarithmic increase in the physical stimulus intensity. Weber and Fechner related physical intensities of stimuli (e.g., luminance, sound pressure level) and subjective perception (e.g., perceived brightness, loudness) in mathematical functions. Furthermore, the detection of just-noticeable differences, the discrimination of two stimuli (e.g., point of subjective equality), and their scaling were studied with, e.g., the ideal observer analysis (Cicchini et al., 2018; Geisler, 2011; Körding et al., 2007; Watamaniuk, 1993). The ideal observer analysis is a “theoretical device” that predicts optimal task performance, considering the available sensory signals and constraints (e.g., noise). This framework allows comparing the predicted performance against observed empirical or simulated data generated by the perceptual system in a specific task. Adding noise to ideal observer models can degrade their performance and thereby support the construction of more realistic models. One early application of the ideal observer model was in the visual detection of light intensity and how photon noise limits the detection rate of a human observer. The rationale behind the ideal observer analysis demonstrates the common practice in the analysis of psychophysical data: The system is theoretically disassembled into components of internal and external factors and described in probabilistic terms by defining the marginal conditions. Besides the capability to detect meaning in a sensory pattern that is confounded with random signals, it was recently discovered how noise (e.g., external noise: blinks; internal noise: neuronal noise) contributes as a functional feature to the build-up of stable perceptions (Gammaitoni et al., 2009; Parise & Ernst, 2017; Pisarchik et al., 2014; Taylor & Aldridge, 1974; Wild & Busey, 2004). Attractor models of perception assume that noise supports weak neuronal adaptation processes, which can emerge into consciousness through noise-driven fluctuations; see, e.g., stochastic resonance (Kim et al., 2006; Moreno-Bote et al., 2007; Shpiro et al., 2009). In the complex interplay of bottom-up factors (e.g., stimulus saliency, sensory dominance), top-down factors (e.g., attention, personality), and the respective noise sources, individual perceptions emerge, and subsequently, action can be planned (Braun & Mattia, 2010; Moreno-Bote et al., 2007; Parise & Ernst, 2017).
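For concreteness, the law can be stated compactly; this is the standard textbook formulation, not notation taken from this Thesis (S: perceived strength, I: physical intensity, I_0: detection threshold, k: modality-specific constant):

```latex
% Weber's law: the just-noticeable difference \Delta I is a constant
% fraction k of the reference intensity I (k is modality-specific):
\frac{\Delta I}{I} = k
% Integrating this relation yields the Weber-Fechner law: perceived
% strength S increases linearly while the physical intensity I
% increases logarithmically; I_0 is the absolute detection threshold.
S = k \, \ln\!\left(\frac{I}{I_0}\right)
```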

The observed stochastic dynamics of the perceptual system characterize a complex system that accommodates the best-fitting representation of the captured sensory evidence and, at the same time, allows reconsidering the momentary organization (Deco et al., 2009). In the case of changing sensory information or the discovery of a better-fitting, internally constructed perceptual representation, the momentary perception is suspended. Additional evidence for one representation can be sensory, knowledge-based, or stem from noise-driven fluctuations (Kersten et al., 2004). In this situation, the alternative representation renders the current interpretation of the sensory evidence unsuited. Consequently, the better-fitting perceptual representation is adopted rather than an unlikely but currently dominating configuration being stabilized. The perceptual system is always “attracted” to the representation that exposes the highest “evidence value” stemming from all available sources (Toppino & Long, 2015). The capacity of the perceptual system must be ambidextrous: It must simultaneously balance perceptual stability and sensitivity to new perceptual organizations. The “exploration/exploitation” dilemma, developed as a framework in reinforcement learning, describes the way living organisms, like humans, constantly adapt their behavior to the transitional properties of their environment (Sutton & Barto, 2018). Instead of choosing an extreme strategy (exploration-only or exploitation-only), the system optimizes perceptual decision-making by adjusting both strategies in favor of a fast and reliable representation of sudden changes in the environment (Aston-Jones & Cohen, 2005; Mathôt, 2018; Naber et al., 2011). The “sweet spot” in this trade-off is dynamic and can be systematically modulated in either direction (Einhäuser, 2017). On the one hand, in the exploration strategy, perception is least stable and most sensitive to the momentary sensory signal: The smallest sensory transitions are reflected on the perceptual level as prominently as a major change in the sensory information. This unilateral strategy does not allow extracting relevant information from the unstable perceptual representation and depletes attentional resources; the information reduction is insufficient. In the exploitation strategy, on the other hand, perception is entirely stable and insensitive to changes in the momentary sensory signal. This strategy relies strongly on the accumulated prior knowledge and impedes the on-time updating of the perceptual representation after sensory changes. Sudden changes would be translated into the perceptual representation only after a latency, and substantial environmental changes would find their way into the current representation only incrementally.
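To make the borrowed reinforcement-learning terminology concrete, the following toy sketch contrasts exploration-only, exploitation-only, and a mixed strategy in a minimal two-armed bandit; the function name, the epsilon parameter, and the payoff probabilities are our own illustrative choices in the spirit of Sutton and Barto (2018), not part of any experiment in this Thesis:

```python
import random

def run_bandit(epsilon, payoffs=(0.3, 0.7), trials=1000, seed=1):
    """Epsilon-greedy agent: with probability epsilon it explores (picks a
    random arm); otherwise it exploits the arm with the highest running-mean
    reward estimate. epsilon=1.0 is exploration-only, epsilon=0.0 is
    exploitation-only."""
    rng = random.Random(seed)
    estimates, counts, total = [0.0, 0.0], [0, 0], 0.0
    for _ in range(trials):
        if rng.random() < epsilon:
            arm = rng.randrange(2)                            # explore
        else:
            arm = max(range(2), key=lambda a: estimates[a])   # exploit
        reward = 1.0 if rng.random() < payoffs[arm] else 0.0  # noisy feedback
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total += reward
    return total / trials

# Exploration-only never settles; exploitation-only can lock onto an
# inferior arm; a mixed strategy approaches the better arm's payoff (~0.7).
for eps in (1.0, 0.0, 0.1):
    print(f"epsilon={eps:.1f}: mean reward {run_bandit(eps):.2f}")
```

The analogy to perception is loose but instructive: an epsilon near 0 corresponds to a percept that never updates, an epsilon near 1 to a percept that never stabilizes.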

The dynamic introduced above is equally applicable to the auditory and the visual perceptual subsystem. But the different sensory systems are not isolated from each other: Both influence, conflict with, and support one another to provide stable multimodal representations. Sensory information from the auditory channel, coding speech information, is naturally complemented by, often redundant, visual information coding the lip movements. The McGurk effect describes a phenomenon of cross-modal integration of conflicting visual and auditory information during the perception of speech (McGurk & MacDonald, 1976; van de Rijt et al., 2019): A video showing a person pronouncing the syllables “ba-ba” (visual information) is synced with a person saying “ga-ga” (auditory information), which is merged in the perception of many observers into “da-da”. The McGurk effect underlines the cross-modal interdependency of the perceptual system and its subsystems. Similarly, when a happy or a fearful voice (auditory information) is played along with artificially morphed happy-fearful faces (visual information), the judgment of the observed emotion is faster in the congruent compared to the incongruent combination (Dolan et al., 2001).


Multimodal perception has been investigated between many senses and has resulted in intriguing cross-modal findings on the integrational character of the perceptual system, whose subsystems might already interact on lower sensory levels (Sekuler et al., 1997). Increasingly powerful technical sensors (functional magnetic resonance imaging, positron emission tomography, magnetoencephalography, and electroencephalography) and computational analysis methods (simulations, modeling, machine learning) contributed to the neuroscientific understanding of the involved neuronal substrate, its location, interplay, and activation during different stages of the perceptual process on a macroscopic level (Andrews et al., 2002; Brascamp, Blake, et al., 2015; Calvert & Thesen, 2004; Molholm et al., 2002). On the microscopic level, the neurophysiological structure is similarly well understood. The signaling of neurons connected via synapses in increasingly larger assemblies of networks is the constituent of perceptual representations (Malashchenko et al., 2011). But the intermediate architecture, between the abstract, modular view of active brain regions and the relatively easy-to-grasp basic neuronal circuitry, that gives rise to the perceptual experience we encounter every day is in most parts a black box. So far, only simple relations between stimulus features, the subjective perceptual phenomena, and the evoked neuronal activity could be determined (see above). Numerous theories shed light on cognitive and attentional processes, but the explicit architecture of the formation from sensations to perceptions, uni- and crossmodally, remains elusive (Mesulam, 1998).

An intriguing phenomenon in perception exposes both stability and instability in the perceptual system: perceptual multistability (Leopold & Logothetis, 1999). Perceptual multistability can be evoked by a class of ambiguous stimuli that appear in their enduring form rather seldom in the wild and may have their conceptual roots in the arts (Carbon, 2014). In an ambiguous configuration of stimulus properties, subjective awareness switches between alternative interpretations (Attneave, 1971). The physical qualities of the ambiguous stimulus remain constant, yet the perceptual experience keeps changing between different perceptual organizations (called percepts or perceptual states) and never becomes entirely stable. The subjective experience is consciously accessible and, therefore, willingly reportable by the observer. A perceptual switch is located between two percepts and is short in relation to the duration of a percept (its phase duration). The switch rate is the count of perceptual switches divided by the stimulation time. The accumulated time during which one type of percept dominates awareness is called the perceptual dominance time and is commonly reported relative to the whole stimulation time (relative dominance time or percentage). Traditionally, experiments are split into several blocks of prolonged stimulation (ca. 1 to 10 min), and the measures of perceptual dominance are reported as an aggregate within and then across observers (aggregated measures of perceptual dominance). This approach has helped to retrace the statistical characteristics of perceptual inference. All evoked percepts are mutually exclusive and equally compatible with the sensory evidence of the ambiguous stimulus (Schwartz et al., 2012). At first, the perceptual states are explored (emerge for the first time) before they reoccur again and again. The alternation of different percepts can be observed with visual stimuli [moving plaids (Movshon et al., 1985; Wallach, 1935), duck-rabbit figure (Jastrow, 1900), Necker cube (Necker, 1832), Rubin’s vase-face illusion (E. Rubin, 1915), binocular rivalry (Wheatstone, 1838), structure-from-motion (Wallach & O’Connell, 1953)] and auditory stimuli [auditory streaming (van Noorden, 1975), verbal transformation (Warren & Gregory, 1958)]. The measures of dominance are flexible and can incorporate percept sequences from almost all multistable processes: e.g., even cross-paradigm comparisons of perceptual dominance evoked by static and dynamic ambiguous visual stimuli are possible. In the Necker cube, two parallel planes of a cube compete to be perceived in the foreground (percept 1: plane A; percept 2: plane B). In verbal transformation, repetitive listening to the word “life” can be perceived as “life” (percept 1) or “fly” (percept 2). Ambiguous stimuli are an attractive instrument for psychophysicists because perceptual decisions modulated by actual changes in sensory information are avoided. Instead, the organizational character of the active perceptual system becomes evident. The perceptual inference is “caught” in the ambiguous stimulus properties, which cannot be permanently disambiguated but reverse spontaneously in the subjective experience. Over a perceptual switch, only limited volitional control can be exercised: Prolonging or shortening the momentary percept is only possible until the next inevitable perceptual switch occurs (Kondo et al., 2018). The properties of ambiguous stimuli hit the sweet spot between exploration (e.g., a new perceptual organization) and exploitation (e.g., keeping the momentary perceptual organization stable) in the perceptual system. Subsequently, the perceptual system reacts by balancing sensitivity to perceptual changes against stability of the momentary percept (Kersten et al., 2004). The encounter of two percepts that are equally compatible with the sensory evidence leads to a conflicting situation in which the current perceptual organization is continuously scrutinized. The evoked perceptual dynamic allows us to observe the ongoing perceptual process: In natural scenes, several perceptual organizations are formed, but they are very similar because of the unambiguous properties of most scenes. In this case, the transition between alternative states and the perceptual decision for one of them is unobtrusive and not as distinct as in perceptual multistability. At what exact degree of ambiguity different perceptual organizations emerge from ecological stimuli is not yet answered (Frank, 2012). A perceptual switch between different organizations may only happen if several organizations are formed.


Perceptual multistability triggers a state in the perceptual system that might allow general conclusions about the processes that form the cognitive organization and about their determinants (Schwartz et al., 2012). The reorganization of the perceptual sensations tells us which conflicts in the sensory information are detected and how they are resolved to form a stable percept. One line of research deals with the question of what exactly competes for dominance in perception: the physical stimulus properties themselves or more abstract representations of the same. Conflicts in the sensory signal and the formation of the percept are inherent processes that happen with unambiguous sensory information as well. The perceptual system and its neuronal underpinning are thought to operate at all times near unstable representations, because the system can never be certain and needs to remain receptive to the sensory input just about to arrive (R. Cao et al., 2020). The research on perceptual multistability is motivated by the question of how perceptions are formed in that process. The perceptual process leaves a revealing signature in the individual perceptual experience that is accessible by introspection. The manual or verbal report of the perceptual states is classically seen as the most direct access to the perceptual experience. In many experiments, the percepts are reported via button press. A commonly reported perceptual sequence may look like this: button 1 pressed/percept 1 (3.6 s), no button pressed (0.5 s), button 2 pressed/percept 2 (10.1 s), buttons 1 and 2 pressed in parallel (0.8 s), button 1 pressed/percept 1 (7.8 s). However, the actual underlying perceptual sequence consists mostly of percept 1 and percept 2; transitional states usually play a minor role (but see Bakouie et al., 2017). The discrepancy between the actual and the reported perceptual experience may be attributed to, e.g., internal decision criteria, motor latencies, and forgetting. Through years of investigating such output, i.e., the sequence of consciously experienced and reported percepts (e.g., their phase durations, the switch rate in several modalities), the question was raised whether the perceptual process is modality-specific or modality-general [commonly coined the “global vs. distributed hypothesis”] (Denham et al., 2018). In other words: Are the principles that govern perceptual multistability in, e.g., audition and vision distinctively different, or are they the same or similar in all modalities? And, if there is some extent of similarity: Is there a partly shared cross-modal class of rules that updates the perception? Finally, researchers specialized in different modalities collaborate nowadays to answer the question of whether the shared class of rules is neuronally implemented in a specific location in the brain (Einhäuser et al., 2020). With the experiments presented in this Thesis, we want to contribute mainly to the global vs. distributed hypothesis: Does a shared mechanism contribute to perceptual reorganization similarly in vision, in audition, and audio-visually? Or do the perceptual subsystems address the conflicts independently? This allows us to understand at what stages and which ambiguities are crossmodally bound together.
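To make explicit how the aggregated measures defined earlier (switch rate, relative dominance time) follow from such a report protocol, here is a minimal sketch; the protocol is the hypothetical sequence from the paragraph above, and the computation is our own illustration rather than an analysis script from the studies:

```python
# Hypothetical report protocol from the text: (percept label, duration in s);
# None marks transitional intervals (no button, or both buttons, pressed).
report = [(1, 3.6), (None, 0.5), (2, 10.1), (None, 0.8), (1, 7.8)]

total_time = sum(d for _, d in report)                       # 22.8 s

# Phase durations per percept, ignoring the transitional states.
phases = [(p, d) for p, d in report if p is not None]

# A perceptual switch is a change between two successive reported percepts.
switches = sum(p1 != p2 for (p1, _), (p2, _) in zip(phases, phases[1:]))

print(f"{switches} switches, switch rate = {switches / total_time:.3f} per s")

# Relative dominance: accumulated time of one percept over total stimulation.
for percept in sorted({p for p, _ in phases}):
    dominance = sum(d for p, d in phases if p == percept) / total_time
    print(f"percept {percept}: relative dominance {dominance:.1%}")
```

For the sequence above this yields 2 switches and relative dominance of about 50% (percept 1) and 44% (percept 2); the transitional 1.3 s account for the remainder.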


A facilitatory question for the desiderata addressed in the course of the experiments is whether internal perceptual states and transitions can be read indirectly from properties of physiological markers, without a direct report of those perceptual states and transitions.

Outline of the Thesis

This Thesis contains five experiments on unimodal auditory, unimodal visual, and bimodal audiovisual multistability, with a particular focus on determining to what degree perceptual switching in the visual and the auditory domain is related. Chapter 1 recounts the literature bearing the theoretical foundations of perception and multistable perception and identifies the explanatory gaps addressed later. The applied experimental paradigms and the uniting research questions are defined as well. This propaedeutic chapter can be split into four sections: Section 1, the present section, laid out general findings and theories on subjective perception and conceptual research questions. Section 2, the following section, gives an in-depth overview of the findings, stimuli, and models of the perceptual multistability phenomenon. The interconnections between different forms of perceptual multistability within and between modalities are subsumed. Prominent models, potentially responsible for the empirical data, are addressed. The pairability of multistable stimuli in general, and of the particular ones selected for the following three studies (auditory streaming and moving plaids), is described. Section 3 deals with direct (self-report) and indirect measures (involuntary eye movements, no-report) of momentary perceptual dominance and transitions. The usefulness of and necessity for capturing perceptual switches with indirect correlates of perception in no-report paradigms are discussed (Overgaard & Fazekas, 2016). In Section 4, the final section of this chapter, the aims of this Thesis are presented. On this basis, the conjunctive features of all experiments are described. In Chapter 2, two experiments are presented that investigated pupil dilation at times of perceptual switching in auditory multistability (four different percepts). Chapter 3 comprises a bimodal experiment that evaluates the moment-by-moment coupling of corresponding percepts of two concurrent bistable processes in vision and audition (two different percepts). In Chapter 4, one experiment investigates how the internal perceptual state in visual bistability can be biased and, in a second experiment, whether such a bias is forwarded to the concurrent auditory bistable process. In the last chapter, Chapter 5 of this Thesis, the findings are discussed in light of the current literature, future experimental designs are recommended, and forthcoming questions are posed.


C1.2 Auditory, visual, and audio-visual multistability

Perceptually ambiguous stimuli evoke multistability in all senses (Schwartz et al., 2012). In visual multistability, e.g., the moving plaids stimulus consists of two overlaid gratings (Wallach, 1935). The motion of the gratings “behind” an annulus can be perceived as two gratings moving coherently in one direction at the same speed (integrated percept, Figure 1.1A1) or as two separate gratings moving in opposite directions (segregated percept, Figure 1.1A2). In audition, auditory streaming is a repeating sequence of short tones that differ in frequency (van Noorden, 1975). Depending on the momentary organization, the stimulus can be perceived as both tones interconnected in one stream of varying frequency (integrated percept, Figure 1.1B1, C1) or as two separate streams of tones with two distinct frequencies (segregated percept, Figure 1.1B2, C2). The percept space describes the set of perceptual organizations. The percept space can either subsume all theoretically possible percepts or, as in most experiments, be limited by instruction to facilitate some degree of consistency between participants. The percept space of the auditory streaming stimulus can vary: In Study 1, the percept space of the auditory streaming stimulus is set to four percepts, called auditory multistability. In Studies 2 and 3, the percept space of auditory streaming is set to two percepts, called auditory bistability. In both of these studies, the visual stimulus (moving plaids) is constructed and instructed to have two reportable percepts, called visual bistability. The switching rate in bistable stimuli is sometimes called the reversal rate because awareness can only alternate between two percepts. In the multistable case, the perceptual transitions are measured by the switching rate, and the switching direction must be considered. There are other studies in which the moving plaid stimulus is instructed to have three percepts, called visual tristability (Wegner et al., 2021).


[Figure 1.1 graphic: panels A1/A2 depict the moving plaid stimulus; panels B1/B2 (“ABAB”) and C1/C2 (“ABA_”) depict the auditory streaming stimulus as tone sequences plotted as frequency over time.]

Figure 1.1 Moving plaid stimulus containing a black and a grey grating that are either perceived to move in the same direction (A1, thick arrows) or in opposite directions (A2, thick arrows). Remarkably, during the dominance of either percept, the physical moving directions of the plaid in the foreground (black) and background (dark grey) remain exactly the same (A1 & A2, thin arrows). The auditory streaming stimulus in the “ABAB” (B) and the “ABA_” (C) configuration can be perceived as being bound together in one coherent stream (B1 & C1) or separated into two streams (B2 & C2).

The multistability phenomenon exposes the dissociation between raw sensory information and the processed, biased, and weighted cognitive representation available to conscious report (Leopold & Logothetis, 1999). The framework of ambiguous stimulus and subjective report uncovers hidden principles of the perceptual system: It can be observed how the perceptual system evaluates perceptual decisions and handles equally plausible interpretations (see the discussion in Gershman et al., 2012). Due to the variety of similarities in the reported switching patterns across modalities for ambiguous stimuli (Pressnitzer & Hupé, 2006), a “universal principle” or “shared mechanism” producing comparable percept patterns in several modalities is under discussion (Long & Toppino, 2004; Sterzer et al., 2009; Walker, 1978).

Cross-modal correlations

The analysis of correlations has a long tradition in psychological measurement: The relationship between two parameters (e.g., the numbers of perceptual switches produced by two ambiguous stimuli) can be measured directly and allows predicting the values of parameter A based on the observations made of parameter B. For example, the number of perceptual switches during a 2-min stimulation with a moving plaid stimulus (parameter A) is, in the same individual, accompanied by a certain number of perceptual switches during a 2-min stimulation with an auditory streaming stimulus (parameter B). The variance between several individuals allows describing the positive, negative, or absent relation between the parameters. A positive correlation between parameters of perceptual dominance is understood by most researchers as evidence in favor of a shared global or common underlying mechanism triggering perceptual switches (Pressnitzer & Hupé, 2006). The underlying partly deterministic and partly stochastic process (R. Cao et al., 2016; van Leeuwen et al., 1997), which might be responsible for the statistical commonalities of multistability phenomena, is encapsulated by further individual psychological and physiological components: personality traits (Farkas, Denham, Bendixen, Toth, et al., 2016), neurotransmitter concentrations (Kondo et al., 2017), and genes (Schmack et al., 2013). An individual who switches more frequently with visual ambiguous figures might, in general, also switch percepts more frequently with an auditory ambiguous stimulus. Enduring traits and passing states (like emotions; Jolij & Meurs, 2011) contribute to the inter-personal variability and might cover up inter-modal and inter-stimulus effects when compared directly. Correlative techniques can account for individual factors and interpersonal variance influencing the direct comparison of aggregated measures of perceptual dominance (e.g., number of switches, switching rate, relative dominance duration). A variety of unimodal visual, unimodal auditory, and bimodal audiovisual perceptual multistability phenomena have been investigated: In the visual modality, e.g., switch rates were compared across eleven different forms of multistability (spinning dancer, Lissajous figure, rotating cylinder, Necker cube, moving plaid, translating diamond, motion-induced blindness, biological motion, Rubin’s vase-face, rolling wheel, binocular rivalry) and correlated moderately intra-individually only for a fraction of the studied material, up to r = .46 within some clusters (T. Cao et al., 2018). Likewise, significant intra-individual correlations for the visual stimuli kinetic-depth display and Necker cube could only be identified within some pairs [both kinetic-depth displays] but not for other pairs of stimuli [kinetic-depth display and Necker cube] (Brascamp, Becker, et al., 2018; Pastukhov et al., 2019). In the auditory modality, the count of perceptual switches in auditory streaming and verbal transformation exhibited a moderate correlation of r = .40 (Kashino & Kondo, 2012). In the comparison of auditory and visual (both unimodal) multistability, the findings were as dispersed as between the unimodal visual stimuli but showed similarly moderate correlation strengths. The combination of the ambiguously leftward- or rightward-rotating structure-from-motion sphere [visual stimulus] and auditory streaming [auditory stimulus] elicited moderate correlations in the number of perceptual switches of r = .456 between modalities (Denham et al., 2018). Pressnitzer and Hupé (2006) found a moderate correlation of r = .40 between the number of switches in an auditory streaming [auditory stimulus] and a visual moving plaid task [visual stimulus]. For the same combination of stimuli (auditory streaming and moving plaids), Kondo et al. (2012) found significant correlations ranging from r = .30 to r = .58. Moreover, in their study, the variability for additional forms of multistability (Necker cube, Rubin’s vase-face, verbal transformation) altogether ranged between r = .11 and r = .75 (with a latent-variable visual-auditory correlation of r = .37). Kondo, Pressnitzer, Shimada, Kochiyama, and Kashino (2018) could not find correlations in the proportion of integrated percepts when separately presenting moving plaids [visual stimulus] and auditory streaming [auditory stimulus]. Conversely, for the same stimulus combination (with three instead of two percepts; tristability), Einhäuser, da Silva, and Bendixen (2019) could actually find cross-modal correlations for the dominance time of the integrated percept and cross-modal coupling between percepts on a moment-by-moment basis (bimodally stimulated). The overall moderate correlations indicate that some of the presented ambiguous stimuli “fit” better together than others.
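The correlative logic just reviewed reduces to a simple computation; in this sketch the per-observer switch counts are invented for illustration only (the cited r values come from the respective studies, not from these numbers):

```python
from statistics import correlation  # Pearson's r (Python >= 3.10)

# Invented switch counts per observer during one 2-min block per stimulus.
plaid_switches  = [12, 31, 22, 41, 17, 28, 35, 15]  # moving plaids (vision)
stream_switches = [10, 27, 25, 38, 12, 30, 29, 18]  # auditory streaming

# A positive r across observers (e.g., ~.40 in Pressnitzer & Hupé, 2006) is
# commonly read as evidence for a shared, global switching mechanism.
r = correlation(plaid_switches, stream_switches)
print(f"Pearson r across observers: r = {r:.2f}")
```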

Compatibility of multistability phenomena

From a conceptual perspective, different categories of visually ambiguous figures (dynamic, static, binocular rivalry, and whole-part figures) evoke multistability based on different aspects of stimulus properties (Strüber & Stadler, 1999). Different stimuli might trigger competing representations of the sensory information on lower sensory or on higher, more abstract semantic levels of the perceptual system, depending on where the ambiguity in question stems from (Lakatos et al., 2009; Meng & Tong, 2004; Stein & Stanford, 2008). Different cognitive processing levels might oscillate in different time regimes, which consequently generate distinct switching patterns in conscious perception (i.e., varying dominance durations). If switching patterns of several ambiguous figures are to be compared, it is necessary to consider ambiguous stimuli that compete on a similar level. Not only is the number of theoretically evoked percepts one factor limiting the potential pairing, but so are the dynamics inherent in a specific stimulus: Combining the static Rubin’s vase-face illusion (percept 1: vase; percept 2: face) with the static young-woman/old-woman illusion (percept 1: older woman; percept 2: younger woman) might “couple” on a similar level (both bistable, both static, ambiguity on the figural level). In contrast, combining either with a dynamic and abstract structure-from-motion stimulus (bistable, dynamic, ambiguity on a movement level) might not correlate unimodally or couple directly when presented simultaneously, even though all three examples separately evoke bistability (two percepts). In the upcoming three subsections, findings from parallel uni- and multimodal stimulation with ambiguous stimuli are introduced. One essential mechanism in scene analysis is the objectification by grouping different elements (e.g., tones or gratings) of the raw stimulus into one combined or several detached objects (Bregman, 1990; A. Treisman, 1982). The contextual contact points of ambiguity might directly influence whether certain multistable phenomena evoke aggregated measures of perceptual dominance that correlate more highly than others. The often-contrasted subjective experiences of the ambiguous stimuli moving plaids (for vision) and auditory streaming (for audition) play a key role in today’s multistability research. Both stimuli allow equally sized percept spaces (e.g., bistability), and the individual percepts theoretically share the number of resulting objects in either percept (integrated or grouped percept: one perceptual object; segregated or split percept: two perceptual objects).

Unimodal visual (parallel stimulation): Evidence for local binding within the modality

In this subsection, evidence from unimodal visual stimulation that relied on the manual or verbal report of the momentary percept or perceptual switches is presented. In the unimodal visual field, evidence was found in favor of a local binding of perceptual organizations. When several ambiguous figures are presented in parallel, percepts co-occur, which is associated with a global mechanism. But some degree of independence of perceptual states under unimodal parallel visual stimulation was identified as well, which is associated with a distributed mechanism. In 1913, John Carl Flügel discovered that the independence of percepts in the presentation of multiple ambiguous figures (e.g., two Necker cubes side-by-side in parallel) speaks against a globally configured “rhythm generator” affecting the whole modality at once. Instead, he proposed a “strictly localized” process within segregated fields of the sensory system. Additionally, in reversing after-images of rotated Necker cubes, Toppino and Long (1987) observed a similarly independent pattern between percepts and proposed a two-stage model moderating local fatigue of the neuronal channels (Hock et al., 1997) and global learning processes (Long et al., 1983). Adams and Haire (1959) found for two nested Necker cubes (a smaller one inside a bigger one) that if the cubes differed in their respective orientation, the reversal rate became discriminable between both cubes; that was not the case if the cubes shared the same orientation. The observation that parallel visual co-stimulation allows some degree of independence in the perceptual switches shows flexibility even on low sensory levels. Percepts in equal or similar ambiguous objects are not “glued” together; rather, the perceptual co-oscillation relies on the coherence of the ambiguities of the stimuli (Kubovy & Yu, 2012). The biological implementations of such localized models of perceptual multistability underline the leading role of lower-level, sensory-specific cross-talk, which might or might not be centrally modulated (Klink et al., 2008; Noest et al., 2007). However, these studies do not rule out a dedicated global component in the control of perceptual multistability. In contrast to the above theories, the global component of perceptual multistability contributes to perception via a common mechanism rather than an entirely selective process operating separately for each visual object or for the perceptual subsystem as a whole (Leopold & Logothetis, 1999).

Unimodal visual (parallel stimulation): Evidence for global binding within the modality

The global component might materialize itself in the statistical properties of similar subjective experiences evoked by various ambiguous figures at the same time. In this subsection, we present evidence from report studies during parallel visual stimulation where the binding of perceptual organizations was investigated: Ramachandran and Anstis (1983) observed identical oscillations in the “synchronized motion” of a matrix-like layout and a randomized-location layout of twelve bistable visual motion stimuli (exact copies). The “presence of global field-like effect for resolving ambiguity in apparent motion“ hints at a global component. Further evidence was found by Grossmann and Dobbins (2003) in pairs of visual multistable stimuli (kinetic dot radar dish, kinetic dot cubes, kinetic dot ellipsoids, Necker cube), which were presented in parallel. In their study, two sets of similar and dissimilar ambiguous figures elicited coupling of the percepts for on average more than 70% of the stimulation time in all pairs, which vanished if one stimulus was partly disambiguated. When Grossmann and Dobbins (2003) biased the ambiguity of one of the two concurrent stimuli (towards a disambiguated version), the common time both stimuli spent in the same percept decreased dramatically. In other words, the ambiguous stimulus decoupled from the disambiguated one, or the disambiguated stimulus could not control the percept in the ambiguous stimulus. The inter-stimulus effect (potentially backed by a global mechanism) faded proportionally to the degree of growing disambiguation. Grossmann and Dobbins (2003) propose that a top-down feedback system selects and stabilizes percepts. In contrast, Freeman and Driver (2006) highlight the sensitivity of the perceptual system within the constrained pairability. They demonstrated coupling of two structure-from-motion stimuli presented in parallel even if one was disambiguated. Freeman and Driver (2006) concluded that this kind of information-sharing is only a contextual effect that developed as a useful heuristic in natural scenes (e.g., for repetitive patterns). Compounded, these findings imply that there is some degree of global influence contributing to the control of perceptual switches and to the coupling of percepts if certain conditions are met (Kleinschmidt et al., 2012; Schwartz et al., 2012). One might speculate that the way contextual factors increase the co-occurrence of the percepts is grounded in the grouping principles formulated in Gestalt theory [e.g., proximity, similarity, common fate, belongingness, closure] (Wertheimer, 1923). The emergence of the percepts might rely on the stimulus-inherent formation of inter-object links (e.g., dynamics, shape), which consequently compete on different stages of the perceptual hierarchy. Potential unifying neuronal substrates involved in the global component comprise a central-oscillator theory (e.g., in the brainstem) or top-down streams (possibly attention-related processes) that tune the sensitivity of low-level sensory neurons for a particular percept collectively (Carter & Pettigrew, 2003; Devyatko et al., 2017; Einhäuser et al., 2008; Pettigrew & Carter, 2004). Considering the evidence from the unimodal visual field, a central selective process triggering perceptual switches in all senses together is rather not supported (Walker, 1978): The partial independence of the percepts speaks against a monolithic global unit governing visual multistability. Then again, the independence of the visual percepts does not completely rule out a global component at all times. Instead, visual multistability seems to be a reciprocation of localized, distributed, and global processes.

Multimodal (parallel stimulation): Binding between modalities

In contrast to unimodal stimulation, multimodal stimulation allows us to observe how multistable processes interact crossmodally. It has been proposed that the selection of percepts, even in the unimodal case, can be mediated globally by, e.g., attentional down-streams (Snyder et al., 2012). In the multimodal case, some form of crossmodal interaction may balance the localized competition of percepts in two modalities if percepts co-occur systematically. A few recent studies investigated the direct coupling of perceptual interpretations with both stimuli simultaneously ambiguous (V. Conrad et al., 2010; Einhäuser et al., 2020; Hupé et al., 2008). The moderate or sometimes absent correlations of perceptual dominance measures between unimodally presented ambiguous figures suggest that, even though functional commonalities can be determined, a locally independent process or a distributed competition on different levels might govern multimodal multistability (Denham et al., 2018; Kondo et al., 2012; Kondo & Kochiyama, 2018). For that reason, it is crucial to study simultaneous multistable processes and their respective coupling between modalities. Not all stimuli are theoretically plausible candidates for crossmodal coupling: Einhäuser et al. (2020) remarked that some stimulus combinations might be inadequate and that other statistical or instructional technicalities prevented earlier studies from detecting correlations as pronounced as their own: Their unimodal stimulation with auditory streaming and moving plaids separately exhibited strong cross-modal correlations (e.g., perceptual dominance of the integrated percept, r ≥ .612). Strong correlations are associated with a global mechanism. Additionally, they had a bimodal “mixed” condition where both stimuli were presented simultaneously. In the mixed condition, they found intraindividual moment-to-moment consistency of the percepts between modalities. They interpreted the results “as evidence for at least partially overlapping mechanisms for multistability in both modalities”. With very similar stimuli (auditory streaming and moving plaids in parallel), Hupé et al. (2008) investigated the direct interactions of perceptual decisions in bistability: They asked all eight right-handed observers to report the perceptual state (press to signal the integrated percept, release to signal the segregated percept) in vision (right button) and audition (left button) in parallel. The parallel manual report is seen as a notorious challenge for investigating perceptual multistability in several modalities at once: As a consequence, attention had to switch from one modality to the other at a self-paced rate to detect switches in both modalities (see the next section, C1.3). In the investigation of auditory streaming and moving plaid stimuli, they found that unimodal and bimodal stimulation did not affect the switching rate differently. Perceptual switches did not co-occur in both modalities together (within < 4 s). However, the “common time” (joint predominance of coherent percepts across modalities) was well above chance (up to 60%). Independence of the “cross-modal effects” (probability of the same cross-modal percepts) from the “long-term statistics” (switches and integrated percept) was observed. Along with a global component, they propose a distributed mechanism that might be responsible for “perceptual decision (…) at different stages of neural processing”. In the end, Hupé et al. (2008) argue in favor of a mixture of both theories: a “distributed but ubiquitous neuronal architecture”. In any case, the observation of concurrent multistable processes offers the opportunity to witness perceptual switches and perceptual consistency on a moment-by-moment basis.
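The “common time” measure and its chance level can be made concrete with a small sketch; the percept time courses below are randomly generated stand-ins (all names and numbers are ours, not data from Hupé et al., 2008):

```python
import random

random.seed(0)
# Hypothetical percept time courses, one sample per second over 120 s:
# True = integrated percept dominant, False = segregated (invented data).
vision   = [random.random() < 0.55 for _ in range(120)]
audition = [random.random() < 0.60 for _ in range(120)]

# Common time: fraction of samples in which both modalities hold the
# "same" percept (integrated with integrated, segregated with segregated).
common = sum(v == a for v, a in zip(vision, audition)) / len(vision)

# Chance level under independence: p(both integrated) + p(both segregated),
# computed from the marginal percept proportions of each modality.
p_v = sum(vision) / len(vision)
p_a = sum(audition) / len(audition)
chance = p_v * p_a + (1 - p_v) * (1 - p_a)

print(f"common time {common:.1%} vs. chance {chance:.1%}")
```

Common times that systematically exceed the chance level implied by the marginal percept proportions are what licenses the inference of crossmodal coupling.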

Cross-modal interferences: Biasing between perceptions. From the results above, it becomes evident that there is some degree of a global factor in perceptual multistability involved. The visual studies on parallel multistable figures indicated that the presentation of ambiguous and disambiguous stimuli together reduces the inter-stimuli coupling of percepts. Here, we take a closer look at cross-modal interferences in perception to understand the cross-talk of the modalities: When combining apparent motion rivalry in vision incongruently and congruently with multistable touch perception simultaneously, multisensory binding is expressed in bidirectional interactions (V. Conrad et al., 2012). Even though the multistable processes in the visual and tactile system are paced by different temporal parameters (or Bayesian priors), the spatially congruent visuotactile stimulation increases the dominance times and the bias towards the percept, which was dominant in the unimodal setting already and vice versa. The deceleration of the rivalry dynamics (increases of the dominance times and stabilization of the percepts) is interpreted as increasing sensory evidence for the momentary percept that is nurtured in the multimodal setting by both sensory channels and in the unimodal setting by only one. Remarkably, in this study, the authors report that ”subjects were not able to reliably track and report their visual and tactile percepts simultaneously”. Therefore, in the last experiment of their study, the observers were probed every 5.5 s to report their visual or tactile current percept verbally and alternatingly between modalities. The perceptual dynamics were withstanding the continuous cross-modal attentional switching because the coherent visuotactile stimulus provoked extended periods of sequential coupling. In a similar vein, it was investigated how incongruent or congruent or non-informative auditory sensory information interacted with paralleled visual motion percepts in binocular rivalry (V. Conrad et al., 2010). Again, even unattended auditory motion sensory information coupled to the visual percept and prolonged the coherent percept. Similar audiovisual interactions were observed when the visual bounce-pass stimuli are combined with a sound at the collision time of the two visual targets. Unimodal-visually, two targets (i.e., white dots on a black background) move towards each other, intersect and move apart, which is

The contextualizing sound biases this perception towards the bounce rather than the pass percept (Shimojo & Shams, 2001). Hence, feedback from “multisensory cells” might reduce the uncertainty of the ambiguous perceptual organization in one modality by integrating sensory information from another (Sekuler et al., 1997). The cross-modal phenomenon persists at higher, abstract levels of representation that rely on contextual semantic information: The auditory context (older/younger woman’s voice) modulates which person (older/younger woman) observers dominantly perceive in the visually ambiguous young/old-woman paradigm (Hsiao et al., 2012). Audiovisual congruency constrains the raw visual sensory processing by adding likelihood to the coherent percept. In day-to-day experience, visual and auditory information are fundamentally interwoven. The verbal transformation effect describes the re-grouping of words or phrases under prolonged looped stimulation; in this stimulus, the repetitively played word “trice” is perceived as “choice”, “Esther”, “toys”, “Joyce”, and many others (Warren, 1961). In another experiment by Sato et al. (2007), this ambiguous auditory stimulus was contextualized with incongruent and congruent visual articulatory gestures (lip and face movements), revealing “evidence for an intermodal effect”. In three experiments, Sato et al. (2007) found that the visual information influenced the verbal transformation in a “common representation space”. The stability of the auditory percept was decreased by the visual influence. Again, like Conrad et al. (2017), they found distinct priors for the pace of switching between modalities. Audiovisual incongruency or congruency would trigger perceptual switches, and the visual information would play a leading role in the detection of the perceptual transformations (called the “visual bootstrapping effect”). Still, audiovisual interactions in multisensory speech processing happen bidirectionally in many unambiguous contexts (Alsius & Munhall, 2013). Some researchers advocate that cross-modal disambiguation might be an artifact of attentional reallocations (“attention feed”) when one perceptual interpretation is more coherent with a percept in the second modality, or when the sensitivity for perceptual switches rises in all modalities in general (Van Ee et al., 2009). This question is hard to disentangle because the role and functionality of attention in multistable processes are not yet transparent (Shiffrin & Grantham, 1974). In this section, we presented findings that show how ambiguous visual representations are subject to influences from, and how disambiguated visual representations exert some limited control over, simultaneous multistable processes in touch, audition, and speech processing. The binding of ambiguous and disambiguated multisensory information highlights the integrative character of the perceptual system and is a prerequisite for a presumably shared mechanism in perceptual multistability.


A hybrid model of perceptual multistability The presented studies provide no conclusive evidence on the debate over the two main models (global vs. distributed processing). For that reason, efforts by various groups have brought up an alternative model (Toppino & Long, 2005). The hybrid model incorporates both sets of findings: It merges high- and low-level models of perceptual multistability. Taking the presented contextualized audiovisual interactions together, Klink et al. (2012) propose “an early cross-modal boost” that is complemented within a two-stage model by “high-level strengthening”, reviving the two-stage model of Long et al. (1983). In other words, in early perceptual processing stages, percepts compete for dominance and undergo neural adaptation (Ling et al., 2010). An assembly of neurons can inhibit a competing assembly only for some time before that competing assembly takes over (mutual inhibition; a minimal simulation of this dynamic is sketched at the end of this subsection). The high-level influence, however, intervenes in this local competition and delimits it from being a purely low-level feed. The hybrid model thus features the multi-level processing of perception in general. The high-level strengthening in this model might reflect the global component of the perceptual architecture, which can be directed by a supramodal mechanism. Many authors consider the hybrid model of perceptual multistability (Long & Toppino, 2004), which embodies both domain-specific (Meng & Tong, 2004; Stein & Stanford, 2008) and domain-independent streams (Lakatos et al., 2009) [feedback, feedforward] that converge on both sensory [e.g., phase-reset and modulation of local cortical excitability] and cognitive levels [attention, learning, volition]. In the hybrid perspective, the common mechanism of the perceptual system is expressed in the dynamics of the conscious percepts and in neural correlates, which broadcast down to primary sensory regions (Sanchez et al., 2017). As outlined above, we are interested in the functional similarities in the emergence of perceptual states across modalities. Several independent distributed mechanisms or one global mechanism might contribute to the perceptual decision in audition and vision. First, we want to investigate one universal neurophysiological marker of brainstem activity, which is responsible for the regulation of the central nervous system (see Section 1.3): The dilation of the pupil reacts similarly to reported perceptual switches in various forms of visual ambiguous figures. If it can be shown that reported perceptual switches in auditory ambiguous stimuli evoke a similar pupil dilation, a global mechanism in the brain underlying perceptual switches in audition and vision stands to reason. Subsequently, we draw provisional conclusions on a unifying neuronal substrate that is truly perception-related and triggers perceptual switches in a supramodal fashion. Experimentally, we go beyond the accumulation of evidence from unimodal experiments for distributed and global influences on the perceptual systems and evoke perceptual bistability in two modalities in parallel: Second, we want to investigate the direct coupling of percepts during audiovisual co-stimulation. For that purpose, we need to face an issue that has challenged all bimodal studies to date: the parallel manual or verbal report of the momentary perceptual state in both modalities.

To this end, we selected a specific visual stimulus that allows reading out the percept from recorded eye movements. That advancement allows us to query only the momentary auditory percept from the participant via manual report and to compare the percepts crossmodally on a moment-by-moment level.
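The low-level competition invoked by the hybrid model (mutual inhibition between two percept assemblies, each slowed down by its own adaptation) can be made concrete with a minimal numerical simulation in the spirit of classic two-unit models. All parameter values below are illustrative choices, not fitted to any experiment reported in this Thesis.

```python
import numpy as np

rng = np.random.default_rng(1)

dt, T = 1e-3, 60.0                     # time step and duration (s)
n = int(T / dt)
tau_a = 2.0                            # adaptation time constant (s)
w_inh, g_adapt, drive = 2.5, 3.0, 1.2  # illustrative model parameters

r = np.zeros((n, 2))                   # activities of the two percept units
a = np.zeros(2)                        # adaptation variables
for t in range(1, n):
    # each unit is driven by the stimulus, inhibited by its rival,
    # weakened by its own adaptation, and perturbed by noise
    inp = (drive - w_inh * r[t - 1, ::-1] - g_adapt * a
           + rng.normal(0.0, 0.05, 2))
    r[t] = np.maximum(inp, 0.0)        # threshold-linear activation
    a += dt / tau_a * (r[t] - a)       # adaptation tracks the dominant unit

# sample dominance every 50 ms to ignore flicker right at the crossings
dominant = np.argmax(r[::50], axis=1)
print(f"{np.count_nonzero(np.diff(dominant))} switches in {T:.0f} s")
```

The adaptation variable of the dominant unit slowly erodes its advantage until the suppressed unit escapes; the noise randomizes the dominance durations, qualitatively reproducing the stochastic switching described above.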

Ambiguous stimuli applied in this Thesis This subsection describes the properties of all applied stimuli, the percept spaces, and the similarities between the stimulus materials: In the unimodal Study 1, in both Experiments 1 and 2, an auditory streaming stimulus is administered to evoke auditory multistability with four perceptual states [two depicted in Figure 1.1B] (van Noorden, 1975). The stimulus contains a repeatedly played sequence of interleaved ‘A’ and ‘B’ tones (ABAB…: tone A, tone B, tone A, tone B, …) that differ from each other in frequency. This stimulus configuration can be perceived as originating from one source producing tones of different frequencies (the so-called integrated percept, Figure 1.1B1) or from two sound sources with distinct frequencies equally in the foreground (the segregated_both percept, Figure 1.1B2); additionally, when perception is segregated, either of the two streams can be perceived in the foreground (segregated_low, segregated_high; not depicted, see Study 1). At the reported perceptual switches between the four percepts, the response of a neurophysiological marker (dilation of the pupil, see Section 1.3) was compared to non-ambiguous perceptual switches and a motoric control condition. For the perceptual control condition, a novel disambiguated auditory stimulus (based on chirp sounds) was developed, mimicking all four percepts separately in a disambiguated replay condition. In this way, perceptual switches were externally induced, and the pupillary effects were compared to those of the internal perceptual switches in an ambiguous situation. The chirp stimulus was subsequently used in Studies 2 and 3 to verify the correct association of the percepts with the buttons (self-report, see Section 1.3) and that participants attended to the task-relevant modality. In Studies 2 and 3, we combined an adaptation of the auditory streaming stimulus (Figure 1.1C), namely two tones A and B in the cycling configuration “ABA_” [tone A, tone B, tone A, silence, tone A, …] (van Noorden, 1975), and moving plaids (Wallach, 1935; this specific version Wilbertz et al., 2018) [Figure 1.1A] in a bimodal and bistable experiment. The instructed perceptual states were reduced to the two cardinal perceptual organizations in both modalities: the integrated (Figure 1.1A1, C1) and the segregated percept (Figure 1.1A2, C2). The further split of the segregated percept into segregated_both, segregated_low, and segregated_high was omitted. A lower number of percepts is easier to recall and distinguish, which makes the decision for one perceptual organization smoother. The variability of the finally perceived percept space demonstrates to what extent auditory streaming and other multistability phenomena are prone to prior learning (Adams, 1954; Ammons, 1954) and expectations (Girgus et al., 1977).

Additionally, the frequency distance between tones A and B was adapted to evoke similar auditory dominance times for the integrated and segregated percepts. The moving plaid stimulus (specifically, the opening angle between the gratings) was individually tuned in Study 2 to evoke similar visual dominance times for the integrated and segregated percepts. By tuning the perceptual dominance to a similar range in both stimuli, it is possible to test the moment-by-moment coupling in perceptual bistability between vision and audition in Studies 2 and 3 statistically in an equitable way. In Study 3, the individual tailoring of the visual parameter was abandoned because perception was found to be stable across observers for this stimulus configuration and parameter space in Study 2. Table 1.1 gives an overview of the stimuli applied in the experiments of this Thesis; a schematic synthesis sketch of the “ABA_” sequence follows the table.

                                     visual stimulus      auditory stimulus
Study 1   Experiment 1 (unimodal)    –                    auditory streaming “ABAB”
          Experiment 2 (unimodal)    –                    auditory streaming “ABAB”
Study 2   Experiment 1 (bimodal)     moving plaid         auditory streaming “ABA_”
                                     (opening angle)      (frequency difference)
Study 3   Experiment 1 (unimodal)    moving plaid         –
          Experiment 2 (bimodal)     moving plaid         auditory streaming “ABA_”
                                                          (frequency difference)

Table 1.1 Stimuli used in the experiments. The individually tuned stimulus parameter is denoted in brackets, where applicable. Auditory, visual, and audiovisual demonstrations of the stimuli can be retrieved online from the Supplements of the respective Study.
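To make the stimulus construction concrete, the following sketch shows how such an “ABA_” triplet sequence could be synthesized. The sample rate, the tone duration, and the absence of inter-tone gaps are illustrative assumptions; the actual stimulus parameters are given in the respective Studies, and the frequency difference was tuned per listener.

```python
import numpy as np

fs = 44_100                        # sample rate (Hz); an assumed value
f_a, f_b = 400.0, 712.0            # illustrative tone frequencies
tone_dur = 0.120                   # placeholder tone duration (s)

def tone(freq, dur=tone_dur, ramp=0.005):
    """Sine tone with raised-cosine on/offset ramps against clicks."""
    t = np.arange(int(dur * fs)) / fs
    y = np.sin(2 * np.pi * freq * t)
    k = int(ramp * fs)
    env = np.ones_like(y)
    env[:k] = 0.5 * (1 - np.cos(np.pi * np.arange(k) / k))
    env[-k:] = env[:k][::-1]
    return y * env

# "ABA_": tone A, tone B, tone A, then a silent slot of one tone length
triplet = np.concatenate([tone(f_a), tone(f_b), tone(f_a),
                          np.zeros(int(tone_dur * fs))])
sequence = np.tile(triplet, 50)    # loop for prolonged ambiguous stimulation
```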

C1.3 Capturing the subjective perceptual experience

Direct self-report: Describing the conscious experience Traditionally, investigating the dynamics of perceptual experiences relies on the observer’s veridical and correct signaling of the momentary percept. Making the perceptual switch accessible to report requires introspection and the willingness to notify switches (Hanna & Thompson, 2003). Even in the absence of a report, brain activity changes markedly between the perceptual states. It was shown that subjective measures of perceptual dominance can be taken seriously for conceptual and empirical arguments (Overgaard & Fazekas, 2016; Overgaard & Mogensen, 2017; Tsuchiya et al., 2015). For understanding conscious experience and the subjective impression of the outer world, it is necessary to acquire subjective measures of the experience that accompany certain stimuli.

To date, only the individual can directly access and report the ongoing subjective, conscious perception. Structural limitations might lie in the reaction time and the correct memory recall of the reliably instructed percepts (Kleinschmidt et al., 2012); a problem that is commonly checked for by response verification (e.g., catch trials, see below). The approach is the continuation of the human-centered research methods of the early psychophysicists. The report of a perceptual switch itself involves decision-making, memory recall, and metacognitive evaluation (Deroy et al., 2016). If the report is captured in button presses, motor planning and motor execution are necessarily part of the cognitive task in perceptual multistability paradigms (Basirat et al., 2012). Previous studies of multistable perception built on observers’ subjective self-report of their current percept by pressing or releasing one or several buttons (e.g., Denham et al., 2013; Hupé et al., 2008). There is also merit in intuitive and ergonomic report interfaces (Denham et al., 2014), operated by mouse (Thomassen & Bendixen, 2017) or by position or verbal response (V. Conrad et al., 2010). Subsequently, the timing of perceptual switches, their count, and the relative dominance of percepts can be quantified in aggregated measures of perceptual dominance. In most cases, the reportable perceptual states are presented before the experiment in an instructional text and shape the priors (Sandberg & Overgaard, 2015); Sandberg and Overgaard (2015) measured the effects of a “good instruction” with a questionnaire. As explained above, moving plaids can be instructed to be perceived as bi- or tri-stable; most auditory streaming stimuli can be perceived as bistable (Studies 2 & 3) or multistable (more than two percepts, Study 1). The perceptual experience of ambiguous stimuli depends on the suggested perceptual alternatives learned from the instructions and stored in long-term memory. Perception is susceptible to the style of instruction (Overgaard et al., 2006): In some situations, the perceptual representation would remain constant, and only the classification of the percept would change. A priori learned perceptual organizations (e.g., integrated percept, segregated percept, etc.) shape the expected subjective experience and narrow down the theoretically possible, broader idiosyncratic switching pattern. Besides the processes facilitating a manual report, a cognitive decision criterion needs to be implicitly formulated, decided for, and kept consistent over the whole course of the experiment. If the currently perceived interpretation exceeds the decision criterion of the arbitrary “similarity scale” with any of the instructed states, the perceptual switch is reported (Koriat, 2011). Of course, those responses are burdened with latencies and variations in latency: e.g., some perceptual states can be easier to distinguish than others, leading to slower reaction times for certain switches (Panagiotaropoulos et al., 2020; Pelofi et al., 2017). In the forthcoming three studies, we compare the aggregated behavioral measures (button presses) of perceptual multistability for each type of percept separately.


Reports of subjectively switching perceptual representations have been externally verified by employing response-verification procedures and by testing different modes of introspection: In response verification, perceptual switches and the dominating percept are evoked by a stimulus mimicking the perceptual organization externally (Farkas, Denham, Bendixen, Toth, et al., 2016; Krug et al., 2008). If the hit rate (consistency between reported percept and mimicked percept) for a specific percept is significantly above chance, the association between the changing internal representation, the instructed percepts, and the externally changing stimulus exhibits some degree of consistency. Commonly, this is seen as evidence for the coherence between the report and the actual subjective perceptual phenomena. Even though there is some degree of coherence between the presented disambiguated stimuli and the perceptual organizations reported under ambiguous stimulation, the instrument still cannot be regarded as a perfect copy of the internal representation that emerges under ambiguity. We applied methods to verify the validity of the responses: In some phases of stimulus presentation in all studies, we used disambiguated versions of the visual and auditory stimuli. Whenever the percept of the disambiguated version was correctly reported, this served to verify introspection performance, which we refer to as catch trials (Study 1) and response verification (Studies 2 and 3).
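The “significantly above chance” criterion can be formalized with a binomial test on the per-percept hit rate, as in the following sketch. The counts are invented placeholders, and the appropriate chance level depends on the number of instructed percepts.

```python
from scipy.stats import binomtest

# Hypothetical response-verification tally for one participant and one
# mimicked percept: disambiguated segments reported correctly vs. total.
n_correct, n_total = 23, 28
chance = 0.25            # four instructed percepts -> 1/4 guessing rate

result = binomtest(n_correct, n_total, p=chance, alternative="greater")
print(f"hit rate = {n_correct / n_total:.2f}, p = {result.pvalue:.2g}")
```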

No-report paradigms: Accessing the consciousness Under certain conditions, perceptual states and switches can be inferred objectively, omitting self-report and meta-cognition. That is of special interest if two multistable processes need to be monitored simultaneously. The observer is, in some cases, overburdened with the correct and timely report of percepts in several modalities at once (V. Conrad et al., 2010, 2012; Hupé et al., 2008). As introduced above, the reporting process poses challenges in itself. To objectify the capture of the perceptual dynamics and to allow the observation of dual multistable processes, no-report paradigms were established. In the no-report paradigm, the perceptual state is inferred from neurophysiological markers (Pastukhov & Braun, 2007), pupillary responses (Tsuchiya et al., 2015), blinks (Leopold et al., 2002), or eye movements (Einhäuser, Thomassen, Methfessel, et al., 2017). No-report paradigms may have pitfalls and shortcomings of their own, as Tsuchiya et al. (2016) discuss: e.g., during no-report tracking of perceptual switches, the neuronal correlates of consciousness might be underestimated because some percepts can only be experienced when actively attended. One advantage of no-report paradigms is the omission of an arbitrary but implicit decision criterion (e.g., “I report the integrated percept if ca. 90% similarity with the instructed percept is given.”), which is replaced by an automatic response to perceptual switches following a “natural touchstone”. No-report paradigms have been utilized to study subjective experience in, e.g., perceptual multistability, in the absence of any manual or verbal report (Tsuchiya et al., 2015).

Originally, no-report paradigms were utilized in functional magnetic resonance imaging and electroencephalography studies to determine the true neural correlates of consciousness without confounding influences from cognitive components that would mislead to the wrong places in the neuronal architecture (Blake et al., 2014). No-report paradigms make it possible to abolish overt report and to reduce brain activity that might distract from the true perceptual processes. Such paradigms are not limited to brain-imaging and brain-electric studies but were recently extended to neurophysiological markers such as the dilation of the pupil and reflexive eye movements [e.g., the optokinetic nystagmus, abbr. OKN] (Frässle et al., 2014). The advantage of these paradigms is to spare the manual or verbal report process: The momentary percept is “automatically” selected, and volition might play a subordinate role.

In Study 1, we are confronted with exactly such confounding influences of the reporting process during the capture of the pupil dilation in response to manually reported perceptual switches in perceptual multistability. Therefore, we apply several control conditions to prevent over- or underestimation of the pupillary response. In Experiment 2 of Study 1, we applied a no-report paradigm for the detection of perceptual switches based on changing sensory evidence, to clean the pupillary signal of confounding cognitive processes such as novelty and surprise (Kloosterman, Meindertsma, van Loon, et al., 2015). In Studies 2 and 3, we recorded the eye movements during the observation of the moving plaid stimulus, which has a specific configuration to evoke distinguishable eye-movement properties during the perception of the integrated or the segregated percept. We can exploit the automatic and reflexive pattern called optokinetic nystagmus and retrieve the percept. Since we stimulate the participants in parallel (bimodally) with an auditory streaming stimulus, the reporting capacity (e.g., attention) can be focused on the auditory percept (integrated and segregated). The momentary visual percept is read out from eye-movement properties (OKN) induced by the stimulus. The technique is based on individually trained support-vector-machine (SVM) models that allow decoding the properties of the OKN into distinct classes (percept 1: integrated, percept 2: segregated). In the visual stimulus (moving plaid, Figure 1.1A), the OKN is sparked by the moving plaid stimulus with two distinct moving directions of the gaze position that can be identified by the SVM decoding procedure; a technique stemming from machine learning (Wilbertz et al., 2018). Moving plaids are two overlaid grating patterns viewed through a circular aperture. If the gratings move over each other with an opening angle other than 180° or 0°, observers commonly perceive the integrated (both gratings move in the same direction, Figure 1.1A1) or the segregated percept (the gratings move in different directions, Figure 1.1A2). In this specific case, the integrated percept is the union of the dark grey grating in the background with the black grating in the foreground. Both together are then perceived to be moving towards the right side of the stimulus.

The segregated percept is the black grating in the foreground, which moves towards the lower end of the stimulus, ignoring the dark grey grating in the background. The moving directions of the two percepts are nearly perpendicular to each other to optimally facilitate SVM performance. By introducing this indirect perceptual readout, the observer only needs to actively report (button-based) their momentary auditory percept, because the visual percept can be decoded from the OKN alone (which automatically follows the stimulus) in Studies 2 and 3.
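A minimal sketch of such a per-participant decoding pipeline is given below, assuming OKN features (e.g., slow-phase velocity and direction) have already been extracted per short time window. The feature set, the window definition, and the synthetic data are placeholders; the actual procedure of Wilbertz et al. (2018) may differ in detail.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in: rows are OKN feature vectors per time window,
# y is the percept reported in that window (0 = integrated, 1 = segregated).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 4)), rng.normal(1, 1, (200, 4))])
y = np.repeat([0, 1], 200)

clf = make_pipeline(StandardScaler(), SVC(kernel="linear"))
print("cross-validated accuracy:", cross_val_score(clf, X, y, cv=5).mean())

# Trained on report blocks, the model can then label unreported windows:
clf.fit(X, y)
decoded = clf.predict(X[:10])      # moment-by-moment percept labels
```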

Correlates of perceptual states: Optokinetic nystagmus When one sits in a moving train and fixates on a tree in the distant surroundings, the retinal image can only be stabilized for some time before the fixated location leaves the field of view. If one is absorbed by the experience, an automatic saccade forward to the relative position of the initially fixated location is executed again and again. Such a process of pursuing fixations (slow phases) and saccadic eye movements (fast phases) is called optokinetic nystagmus (Knapp et al., 2013). The two distinct types of phases alternate and produce a saw-tooth-like pattern in the gaze trace (Figure 1.2). Rotary movements of objects can be tracked by the OKN as well: In the slow phases, the velocity of the eye matches the viewed motion, and the eye resets in its orbit during the fast phases. The slow phase consists of two separate components: the initial rapid rise in eye velocity (mediated in the flocculus) and a velocity memory mechanism to maintain the velocity of the observed motion (mediated in the nucleus of the optic tract). The fast phases are connected to brainstem burst neurons in the reticular formation (Bense et al., 2006; C. Chen & Young, 2003). Many theories in psychology distinguish between processes that are automatic and inflexible and processes that are willed and flexible. Reflexive mechanisms like the OKN are well understood and have a relatively simple neural implementation. Consciously willed mechanisms, on the other hand, like reported perceptual switches, are highly adaptable and require complex neural interplay. The interwoven volitional and automatic processes can be observed in oculomotor control experiments (Hugrass & Crewther, 2012). The OKN can be induced by many different stimulus motions, e.g., in binocular rivalry (Enoksson, 1963) and moving plaids (Yo & Demer, 1992).
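To make the saw-tooth structure concrete, the following sketch separates fast and slow OKN phases in a horizontal gaze trace with a simple velocity criterion. The 50 deg/s threshold and the synthetic trace are illustrative assumptions; real pipelines typically smooth the trace before differentiation.

```python
import numpy as np

def okn_phases(gaze_x, fs=1000.0, vel_thresh=50.0):
    """Label samples of a horizontal gaze trace (degrees) as fast
    (saccadic reset) or slow (pursuit) OKN phases by velocity."""
    vel = np.gradient(gaze_x) * fs      # instantaneous velocity (deg/s)
    fast = np.abs(vel) > vel_thresh     # fast phases exceed the threshold
    return vel, fast

# Synthetic saw-tooth trace: 2 deg/s slow drift, abrupt reset every 0.5 s
t = np.arange(0, 5, 0.001)
gaze = 2.0 * (t % 0.5) - 0.5
vel, fast = okn_phases(gaze)
print(f"mean slow-phase velocity: {vel[~fast].mean():.2f} deg/s")
```

The mean slow-phase velocity (and its direction) is exactly the kind of feature that distinguishes the integrated from the segregated percept in the moving plaid stimulus.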

[Figure 1.2 legend: percepts integrated, segregated]

Figure 1.2. Raw eye movements and reported percepts (two): induced by the moving plaid stimulus, the eye movements follow a saw-tooth pattern that characterizes the optokinetic nystagmus. Markedly different eye-movement properties during the integrated and segregated percepts train the SVM models, which finally decode the percepts automatically (see Studies 2 and 3).


In Studies 2 and 3, the OKN is induced by the moving plaid stimulus, and the bistable process can be tracked without a report. For binocular rivalry (Einhäuser, Methfessel, et al., 2017; Frässle et al., 2014) and moving plaids (Wilbertz et al., 2018), a high consistency between the behavioral report and the stimulus-induced OKN could be determined. For that purpose, SVM models are trained with features extracted from OKN data recorded while the momentary percept is reported (Figure 1.2). Afterwards, the reflexive OKN is decoded on a moment-by-moment basis into the percept classes integrated and segregated without any report (Wilbertz et al., 2018). Together with the auditory percept (button press), the direct coupling of percepts in two modalities can be monitored on a moment-by-moment basis. Another reflexive process that allows tracking of the perceptual dynamics is the constriction (miosis) and dilation (mydriasis) of the pupil at times of the perceptual switch, which is investigated in Study 1.

Correlates of perceptual states: Pupil dilation Similarities in the empirical switching patterns (distribution of percept durations, switching direction) for multistable visual and auditory stimuli have nourished the presumption of a common mechanism (Hupé et al., 2008; Pressnitzer & Hupé, 2005, 2006), which leads to identical or similar physiological responses across different senses (Kleinschmidt et al., 2012). One easy-to-access psychophysiological indicator of neural activity in response to perceptual transitions in all senses has been the constriction and dilation of the pupil (Hess & Polt, 1964; Mathôt, 2018): The reported perceptual switch in perceptual multistability goes along with a momentary dilation of the pupil (Einhäuser et al., 2008). This coupling was shown for several forms of visual bistability (moving plaids, structure-from-motion, and Necker cube). In the same study, a similar effect was expected for the auditory modality (auditory streaming); the results were numerically in the expected direction but not significant. Here, we examine the pupillary responses to the perceptual switches during auditory multistability in two new experiments (Study 1). Compared to the earlier investigation, the number of percepts is increased from two to four (see subsection 1.2). If consistent pupil responses were found in both modalities (vision and audition), this would strongly argue in favor of a partly shared mechanism governing perceptual multistability. Event-related pupil dynamics (e.g., constriction, dilation, and oscillation) have been identified as a sensitive marker of numerous cognitive processes (Einhäuser, 2017); they can be tracked preconsciously, cannot be suppressed (Laeng et al., 2012), and are independent of illumination (Einhäuser et al., 2010). Cortical and subcortical projections, potentially along the noradrenergic locus coeruleus pathway (Murphy et al., 2014), play a critical role in the control of the pupil size affected by sensory information processing (Larsen & Waters, 2018; Samuels & Szabadi, 2008). Two pathways control the pupillary diameter: The dilation of the pupil can be achieved by activating the radial muscle and inhibiting the sphincter muscle (Lowenstein et al., 1964).


In reverse, inhibition of the radial muscle and activation of the sphincter muscle lead to the constriction of the pupil. The radial muscle dilating the pupil is directly regulated by the sympathetic nervous system (posterior hypothalamic nuclei). The sphincter muscle constricting the pupil is controlled by the parasympathetic nervous system [Edinger-Westphal complex of the midbrain] (Steinhauer et al., 2004). Both neural pathways are sensitive to neurotransmitters that either increase or decrease neuronal activity in the pathways. Eventually, acetylcholine leads to a constriction of the pupil and inhibits norepinephrine-related activity. The neurotransmitter norepinephrine is discharged by the locus coeruleus in the pons of the brainstem and acts on receptors that drive the activity of the dilator muscle. The release of norepinephrine causes the dilation of the pupil, and inhibition of norepinephrine activity (e.g., by acetylcholine release) subsequently leads to the constriction of the pupil. Activity in the sympathetic pathway (pupil dilation) can be caused by cognitive transitions like the modulation of uncertainty in a decision process (Steinhauer & Hakerem, 1992). Likewise, the inhibition of the parasympathetic pathway (weakening of pupil constriction) signals cognitive demand to a similar extent. Both the observable diameter of the pupil and its transitions seem to reflect the balance between the levels of norepinephrine [noradrenergic] and acetylcholine [cholinergic] (Beatty & Lucero-Wagoner, 2000). The deep neuronal sources in the brainstem, the cholinergic system, and the superior colliculus (Wang et al., 2015), which potentially contribute to attentional control structures, make the externally observable pupillary response a candidate for a universal marker of the fast-fluctuating neurophysiological processes that modulate the global brain state (Yu, 2012; Zekveld et al., 2018). Since the pupil was found to react to arousal (Bradley et al., 2008), memory recall (Kucewicz et al., 2018), decisions in general (Einhäuser et al., 2010), motor execution (Hupé et al., 2009), novelty (Naber et al., 2013), and surprise (Preuschoff et al., 2011), it is necessary to control experimental observations for task-related processes during perceptual decisions. The coupling of the pupil to phasic activations along the brainstem system is not limited to sensory input based on the physical environment (e.g., loud noises, emotional faces, pain) and decision-making (e.g., motor responses, calculation, memory recall). Internal and external perceptual switches and their detection evoke dilations of the pupil (Kloosterman, Meindertsma, van Loon, et al., 2015). The neuronal outcome of brainstem activity is potentially coupled to the early modulations of activity in the auditory cortex during perceptual switches, just as in the early visual cortex (Mazzucato et al., 2015). In Study 1, the effect the auditory perceptual switch has on the dilation of the pupil is compared in a report condition for four different percepts during ambiguous stimulation (for an example trace, see Figure 1.3). The pupillary response is cross-checked for external perceptual influences (actual sensory changes) and motoric influences (related to the manual report). Since the dilation and constriction of the pupil are thought to be grounded in central units of the brainstem (Hulse et al., 2017), the auditory perceptual switch is expected to have a global component, as was shown earlier for visual perceptual switches (Einhäuser et al., 2008; Laeng et al., 2012).

Together with corresponding findings in the visual modality, pupil dilation may function as a universal marker of perceptual decisions (Einhäuser et al., 2008) and a detector of supramodal transitional brain states in perception (DiNuzzo et al., 2019).

[Figure 1.3 legend: percepts integrated, segregated_both, segregated_high, segregated_low, none]

Figure 1.3. Raw pupil diameter trace and reported percepts (four): during stimulation with the ambiguous auditory streaming stimulus for a block duration of 240 s, the participant reported the four percepts. Around the beginning of each new percept, the pupil trace is extracted and aggregated (see Study 1).
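A sketch of the extraction-and-aggregation step mentioned in the caption, assuming a continuous pupil trace and the sample indices of reported switches. The window lengths, the baseline definition, and the 1000 Hz rate are illustrative choices, not the exact analysis parameters of Study 1.

```python
import numpy as np

def switch_locked_pupil(pupil, switch_idx, fs=1000, pre=1.0, post=3.0):
    """Cut pupil-diameter epochs around reported perceptual switches,
    baseline-correct each to its pre-switch mean, and average them."""
    n_pre, n_post = int(pre * fs), int(post * fs)
    epochs = []
    for i in switch_idx:
        if i - n_pre < 0 or i + n_post > len(pupil):
            continue                    # skip switches near block edges
        seg = pupil[i - n_pre:i + n_post].astype(float)
        seg -= seg[:n_pre].mean()       # subtract the pre-switch baseline
        epochs.append(seg)
    return np.mean(epochs, axis=0)      # average switch-locked response

# usage: avg = switch_locked_pupil(pupil_trace, reported_switch_samples)
```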

We investigate in two auditory (Study 1), one visual (Study 3), and two audiovisual experiments (Studies 2 & 3) how perceptual states and perceptual dynamics translate into reflexive eye movements (OKN) and automatic pupillary responses (Table 1.2). Since both the OKN and the pupil dilation are wired to brainstem regions and can signal perceptual transitions, this might hint at a shared global mechanism governing perceptual multistability. In Study 1, the effect of reported and unreported, internal and external perceptual switches on dilations of the pupil is re-examined. The perceptual switches either stem from physical changes in the sensory information (replay condition) or from spontaneous perceptual switches in the cognitive representation (multistable condition). We complement earlier findings that showed the sensitivity of the pupil dilation to visual perceptual switches. In Study 2, we explored the global vs. distributed hypothesis by inducing bistable perceptual switching in both modalities, audition and vision. If there is a global, higher-order cognitive function mediating perceptual multistability, the percepts should be consistent crossmodally on a moment-by-moment basis. In Study 3, finally, the strength and directionality of the coupling (over one shared cognitive structure) were examined: externally induced percept states in one modality should then translate into a consistent perceptual switch and state in the other modality. We tested this for the visual-on-auditory bistable configuration.


                                     direct behavioral measure:      indirect correlates:
                                     button-press                    reflexive eye-movements
Study 1   Experiment 1 (unimodal)    4 percepts                      pupil dilation
                                     (auditory percept)              (auditory percept switch)
          Experiment 2 (unimodal)    4 percepts                      pupil dilation
                                     (auditory percept)              (auditory percept switch)
Study 2   Experiment 1 (bimodal)     2 percepts                      OKN
                                     (auditory & visual percept)     (visual percept)
Study 3   Experiment 1 (unimodal)    2 percepts                      OKN
                                     (visual percept)                (visual percept)
          Experiment 2 (bimodal)     2 percepts                      OKN
                                     (auditory & visual percept)     (visual percept)

Table 1.2 Direct and indirect measures of perceptual dominance. The respective number of instructed percepts is denoted. For the indirect correlates, the tracked perceptual process (auditory or visual) is stated.

C1.4 Limitations of preceding studies, objectives, and outline of the Thesis

The main objective of the Thesis is to contribute to the ongoing discussion of whether perceptual multistability is controlled globally (domain-general; Horlitz & O’Leary, 1993; Kelso, 1995; Leopold & Logothetis, 1999) or in a distributed fashion (domain-specific; Borsellino et al., 1972; Köhler & Wallach, 1944). The evidence in favor of either hypothesis (global or distributed) is equivocal and was recently enriched in terms of theoretical complexity by the hybrid model integrating both sets of empirical findings (Long & Toppino, 2004; Toppino & Long, 2005). At an individual level, visual perceptual switching rates correlated moderately and showed similar higher-order statistical characteristics (Borsellino et al., 1972; Devyatko et al., 2017). It was hypothesized that they stem from a purely stochastic process (De Marco et al., 1977), which was subsequently refuted (van Leeuwen et al., 1997). Observers have only limited voluntary control over the current percept (e.g., prolonging or shortening it) across different forms of perceptual multistability, and the limitations of control varied between modalities; this was seen as evidence for a (partly) shared or top-down mechanism (high-level, global). This was contrasted with findings that allow a conclusion in favor of a rather distributed mechanism (Kornmeier et al., 2009): Some forms of visual multistability did not correlate and were sensitive to attentional relocations (Slotnick & Yantis, 2005). A unique neural substrate (e.g., brainstem) or a distinct network of isolated cortical areas accompanying perceptual switches has not yet been experimentally determined.

In a neighboring field, it was discovered that cross-modal binding of incongruent and congruent emotional voice (audition) and face (vision) stimuli modulated the activation of the amygdala (Dolan et al., 2001). The amygdala is a supramodal convergence point for emotional information, such as fear and pain, which goes beyond multimodal binding and encodes emotional states in arousal levels. The representation of perceptual objects is a complex give-and-take of higher (feedback to lower levels) and early sensory neuronal assemblies (feedforward competition) across the cortex (Pitts et al., 2007). The investigation of unimodal (separate auditory and visual) and bimodal (audiovisual) multistability has not generated conclusive evidence. The absence of individually strong correlations of perceptual switching in most studies and the disparate volitional control may indicate a set of shared functional principles that are implemented separately in each modality and only couple if contextual information is available (Klink et al., 2012; Ouhnana et al., 2017). Conversely, a common functional mechanism was found along with the correlations of perceptual switches evoked by separate stimulation with auditory streaming, verbal transformation, visual plaids, and reversible figures (Kondo et al., 2012). Dilations of the pupil to reported perceptual switches were found for the visual multistable stimuli structure-from-motion, moving plaids, and Necker cube, but not for auditory streaming, potentially for methodological reasons (Einhäuser et al., 2008). We want to address these opposing findings by re-visiting the effect of perceptual switches in auditory multistability on the pupil in Study 1 with a set of control conditions: The dilation of the pupil at the time of the perceptual switch for several forms of visual (and possibly auditory) multistability is a physiological marker of transitions in the norepinephrine level, reflected in phasic bursts of central neuronal structures during perceptual transitions. A similar response of the locus coeruleus (brainstem) to visual and auditory multistability might signal a shared mechanism for perceptual actualization in both modalities. We stimulated observers with an auditory streaming paradigm (configuration “ABAB…”) and recorded their pupil diameter. Participants reported the perceptual switches and their direction with buttons. A novel form of response verification was established to check the correct association between the instructed percepts (integrated, segregated_both, segregated_high, and segregated_low) and the four-way manual report.

In Study 2, auditory (auditory streaming, configuration “ABA_”) and visual (moving plaids) bistable processes were simultaneously triggered in both modalities. The auditory percept was directly signaled by button press, and the visual percept was indirectly decoded from the OKN by an SVM model trained on data (OKN and reported percept) of the same individual. Some recent studies explored the challenges of parallel audiovisual ambiguous stimulation and discovered certain pitfalls in the dual task: percept inference and manual report overstrain some observers and evoke unwanted attentional effects.

We tried to overcome those challenges by having only one manual report task (auditory percept) while the visual percept was read from the reflexive eye movements. In this fashion, the consistency of the auditory percept (integrated or segregated) and the visual percept (integrated or segregated) could be investigated on a moment-by-moment basis. To check the veridicality of the reported perceptual experience, a response-verification procedure was established for the visual and the auditory percept: a disambiguated stimulus mimicked the subjective experience of either percept, checking attendance to the instructed modality. The stimulus parameters of the auditory streaming (frequency difference of the tones) and the moving plaids (opening angle) were individually tuned to facilitate coupling and to allow fair statistical testing.

To test the properties (strength and directionality) of the multimodal coupling, we designed Study 3. Similar to Study 2, an individually tuned auditory streaming stimulus (“ABA_”) was combined with an untuned moving plaid stimulus (the opening angle remained fixed). Again, the visual percept was read out indirectly from the eye movements (OKN decoded into two percepts: integrated, segregated). The auditory percept was directly signaled by pressing either one of two buttons (integrated or segregated). At selected time points during the parallel stimulation with the ambiguous stimulus material, one specific percept was externally induced in the visual modality only. The auditory percept was monitored to track its coupling with the visual percept. If the coupling is sufficiently strong and bidirectional or visually led, a corresponding percept should be detected in the auditory modality.


CHAPTER 2 Study 1

“Pupillometry in auditory multistability”

This chapter has been submitted in similar form for publication as: Grenzebach, J., Wegner, T., Einhäuser, W., & Bendixen, A., „Pupillometry in auditory multistability“ (PLOS ONE, in revision).

C2.1.1 Experiment 1: Introduction

Humans collect information about their environment through their sensory systems in order to successfully navigate the world. Often, such sensory information contains ambiguities, such that the exact same sensory information can be interpreted in different ways (Attneave, 1971). In most cases, we do not take note of the ambiguity because our perceptual system decides in favor of the most probable or plausible interpretation, which then dominates perception (N. Rubin, 2003). However, when two or more alternative interpretations are about equally probable, one can observe the phenomenon of perceptual multistability (Leopold & Logothetis, 1999): During a prolonged phase of physically constant stimulation, perception alternates between the different interpretations. Experiencing one of the alternative interpretations is called a percept; the change from one percept to another is called a perceptual switch. The different percepts are mutually exclusive, but they are all individually compatible with the sensory information (i.e., each of them gives a valid explanation about the possible causes of the sensory information in the world). Following Leopold and Logothetis (1999), perceptual switching during multistability can be characterized by its randomness (with respect to percept duration, as well as with respect to switching direction in case of more than two alternatives), its exclusivity (only one percept is experienced at any given moment) and its inevitability (switching cannot be avoided). The switching from one percept to another has sparked great interest because it provides access to a case of changing perception without a corresponding physical change in the stimulus. The mechanisms that lead to perceptual switching have been studied from the perspective of computational modeling (Barniv & Nelken, 2015; Blake, 1989; Gigante et al., 2009; Mill et al., 2013; Rankin et al., 2015; Wilson, 2003) and in terms of the involved brain structures (Frässle et al., 2014; A. Kleinschmidt et al., 1998; Kleinschmidt et al., 2012; Knapen et al., 2011). One intriguing question is whether the same mechanisms contribute to perceptual multistability across different sensory modalities.

Indeed, multistability has been shown to occur in many sensory modalities (Schwartz et al., 2012) and has been studied in great detail in vision (e.g., Brascamp, Sterzer, et al., 2018) and audition (e.g., Snyder et al., 2012). Here we consider a psychophysiological indicator that might accompany perceptual switching in both modalities alike: namely, the constriction and dilation of the pupil. The perceptual switch in multistability is accompanied by a momentary dilation of the pupil (Einhäuser et al., 2008). Einhäuser et al. (2008) showed such coupling for three forms of visual bistability (moving plaid, structure-from-motion, and Necker cube). They found numerically consistent results for the auditory modality (auditory streaming), but the pupil dilation effect failed to reach significance. Pupil dilations for visual multistability were confirmed and further characterized by Hupé, Lamirel, & Lorenceau (2009). A more comprehensive test of pupil dilation effects in auditory multistability has not been conducted so far (Zekveld et al., 2018). Therefore, the current study was designed to examine pupillary dynamics during auditory multistability. We conducted two experiments combining a standard paradigm of auditory multistability, the auditory streaming paradigm, with pupillometry. The auditory streaming paradigm involves presenting a sequence of interleaved ‘A’ and ‘B’ tones (ABABABAB…) that differ from each other in frequency (Bregman, 1990; van Noorden, 1975). This stimulus configuration can be interpreted as originating from one sound source producing tones of two different frequencies (the integrated percept) or from two sound sources with distinct frequencies (the segregated percept). In the case of perceptual segregation, we additionally asked which of the sound sources – both sources, only the one producing the high sounds, or only the one producing the low sounds – are perceived in the foreground (Einhäuser, Thomassen, & Bendixen, 2017). The multistable nature of perception of various versions of this auditory stimulus has been well characterized (Denham et al., 2012, 2014; Pressnitzer & Hupé, 2006). Finding pupil dilation in response to a perceptual switch during auditory multistability (as observed before for visual multistability by Einhäuser et al., 2008) would speak in favor of partly overlapping mechanisms for multistability in vision and audition. Yet, because pupillary changes can be caused by numerous perceptual and cognitive processes (Einhäuser, 2017; Hupé et al., 2009; Laeng et al., 2012; Mathôt, 2018), it is important to rule out other factors that might cause a pupil dilation around the time of a perceptual switch. For this purpose, we developed control conditions to help us disentangle the various processes at stake during multistable perception. We are faced with the following situation: for observing a perceptual switch during unchanged physical stimulation, we have to rely on the participant’s overt report of that switch. Making the switch accessible to report requires introspection and metacognitive evaluation (Brascamp, Sterzer, et al., 2018), whereas the report as such involves decision-making, motor planning, and motor execution (Simpson & Paivio, 1968).


These processes evoke pupillary components themselves, which might be additive (Hoeks & Levelt, 1993). Therefore, a pupillary response at the time of an overtly reported perceptual switch might be contaminated or overshadowed by the flanking report processes (Wierda et al., 2012), rather than being driven by the perceptual switch itself. To prevent such misinterpretations, we applied control conditions that involve pressing a report button (like in the multistable condition) to compensate for the sensitivity of the pupil to motor actions (McCloy et al., 2017; Simpson, 1969). Specifically, in the random control condition, the participants press buttons at randomly chosen time points, without any temporal relation to perceptual switching (Farkas, Denham, Bendixen, & Winkler, 2016; Hupé et al., 2009). This gives a direct control for the multistable experimental condition, in which participants press buttons contingent upon their perceptual changes during ambiguous input. In a further control condition, the replay condition, participants likewise press buttons in response to perceptual changes, but this time the changes are exogenously elicited by stimulus changes, rather than caused by endogenous re-interpretations of physically unchanging input (see Brascamp, Becker, et al., 2018, for a detailed discussion of replay conditions in visual multistability). The stimulus changes constitute a direct replay of the participant’s behavioral switching dynamics in the multistable condition, transformed into disambiguated versions of the stimulus (Einhäuser et al., 2008). Experiment 1 involves these three conditions (multistable, replay, and random). In each of the conditions, we inspect the average pupillary response around the button press. If the pupil dilates specifically in response to the perceptual switch during multistable perception of ambiguous auditory input, we expect the following result pattern (see the sketch after this list for how such comparisons can be quantified):
• Hypothesis 1: The pupil dilates in response to, or slightly preceding, a perceptual switch in the multistable condition (Einhäuser et al., 2008).
• Hypothesis 2 (motor control): The pupil dilation to a perceptual switch (reported by button press) in the multistable condition is larger than the pupil dilation to a button press without perceptual switch in the random condition (Hupé et al., 2009).
• Hypothesis 3 (physical change control): The pupil dilation to an endogenous perceptual switch in the multistable condition is larger than to an exogenous perceptual switch in the replay condition.
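Hypotheses 2 and 3 amount to paired comparisons of dilation amplitudes across conditions, as in the following sketch for Hypothesis 2. The per-participant values are invented placeholders, and the actual analysis of Study 1 may differ.

```python
import numpy as np
from scipy.stats import ttest_rel

# Hypothetical per-participant peak dilations (arbitrary units)
multistable = np.array([0.09, 0.12, 0.07, 0.11, 0.10, 0.08])
random_ctrl = np.array([0.05, 0.07, 0.04, 0.06, 0.08, 0.05])

# One-sided paired test: dilation at reported switches should exceed
# dilation at perceptually unrelated button presses.
t, p = ttest_rel(multistable, random_ctrl, alternative="greater")
print(f"t = {t:.2f}, p = {p:.3g}")
```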

C2.1.2 Experiment 1: Material and Methods

Participants Twenty healthy volunteers (age [mean ± standard deviation]: 26.4 ± 3.33 years; 10 male, 10 female; 3 left-handed, 17 right-handed), who were recruited from the university community, gave their written consent to take part in this experiment after being informed about the experiment, yet staying naïve regarding the hypotheses.

Data collection for two additional volunteers was commenced but had to be cancelled for technical reasons. Participants received monetary compensation or course credit for the ca. 1.5-h-long experiment. All procedures conformed to the principles laid out in the Declaration of Helsinki and were determined by the applicable body (Ethikkommission der Fakultät für Human- und Sozialwissenschaften, TU Chemnitz) not to require in-depth ethics evaluation (case no. V-219-15-AB-Modalität-14082017).

Setup We conducted the experiment in a testing chamber (“Type 120a double-walled”, IAC Acoustics, UK), which is a dark room (no light source other than the monitor) with sound-absorbing walls [ambient noise level ≤ 16.9 dB(A)]. A 24-inch display (“VG248QE”, Asus, Taiwan) was positioned outside of the chamber and was visible through a multi-layer glass window. We measured the pupil diameter and gaze position of the right eye at a 1,000 Hz sampling rate with the “EyeLink-1000” camera/infrared system (SR Research, Canada) positioned inside the chamber in Desktop mount configuration. The participant rested with the head in a chin and forehead rest (for an optimal, stable viewing point of the eye-tracking camera). Participants gave manual responses via a four-button box, “Blackbox” (The Black Box ToolKit Ltd., UK), in front of them. The button box was connected over USB with the “Display PC” used for presenting the stimuli and over the parallel port with the PC used for recording the eye data (EyeLink’s “Host PC”), providing a high degree of synchrony with the pupil signal. The four buttons were positioned at the four corners of an isosceles trapezoid with the shorter “base side” towards the participant. The participants were seated with their fingers resting on the four buttons. They were instructed to use the index fingers of both hands for the two nearer buttons on the button box and the middle fingers to actuate the two farther buttons. The auditory stimuli were presented binaurally via “HD25-1 II” headphones (70 Ω; Sennheiser, Germany) over the soundcard “Sound Blaster X-Fi Titanium HD, SB1270” (Creative, USA) inside a high-performance stimulus PC running Arch Linux, MATLAB R2015b (MathWorks, Natick, MA) and the “Psychophysics Toolbox” (Brainard, 1997) with the EyeLink toolbox (Cornelissen et al., 2002). The Display PC was used for experiment control (e.g., instructions; EyeLink control), auditory and visual stimulus presentation, and behavioral response acquisition (button presses). The ambient noise level and experimental sound levels were externally verified with a “Type 2270” hand-held sound level meter and “Type 4101-A” binaural microphones (both by Bruel & Kjaer, Denmark). The timing of the sounds was externally verified with an oscilloscope and recorded internally with a jitter of the sound start time of less than 25 microseconds (observed over 1,200 played sounds).

Sound start and end were communicated from the Display PC to the EyeLink Host PC over the parallel port for redundant recording.

Auditory stimuli The auditory sequence was a combination of two sinusoidal sounds (A and B; duration 230 ms each), which were played alternatingly with a 20 ms break in between: sound A, 20 ms silence, sound B, 20 ms silence, and so on. This resulted in a stimulation rate of four sounds per second. The same stimuli were presented to both ears. Each sound was gated with 5-ms raised-cosine onset and offset ramps to avoid acoustic artifacts in the headphones evoked by a sudden deflection of the sound membrane. The frequency trajectories of A and B as well as their sound levels varied according to the condition. The multistable condition employed constant-frequency sounds (A: 400 Hz, B: 712 Hz, i.e., 10 semitones apart) at a level of 75 dB(A). At the end of each block of the multistable condition, the sound level of either A or B was lowered to 50 dB(A) in some segments to verify participants’ responses (see Procedure). The random condition employed the same sounds as the multistable condition without the response-verification segments at the end. The replay condition employed changing-frequency sounds (chirp sounds) with a middle frequency of 400 Hz (A) and 712 Hz (B). The chirp sounds consisted of three parts (see Figure 2.1A): first, a ramp (75 ms) starting at an “initial frequency” that increased linearly to the second (middle) frequency; second, a constant part in which the middle frequency was held for 80 ms; and third, a ramp (75 ms) in which the middle frequency decreased linearly back to the initial frequency. For replay stimuli designed to mimic the integrated percept (chirps 1 and 2, see Figure 2.1A), the initial frequency was the geometric mean between the two middle frequencies (534 Hz), such that the A and B tones both started at 534 Hz, changed towards 400 Hz (A) or 712 Hz (B), and changed back to 534 Hz (audio track in Supplement S5B). For replay stimuli designed to mimic the segregated percept (chirps 3 and 4, see Figure 2.1B, Supplement S5C), the initial frequency was 300 Hz for A tones and 950 Hz for B tones (i.e., 5 semitones below 400 Hz and 5 semitones above 712 Hz, respectively). Chirps 1 to 4 were presented at a level of 75 dB(A). For additionally mimicking foreground-background perception in the segregated case, chirps 3-soft and 4-soft (see Figure 2.1C-D) were identical to chirps 3 and 4 but presented at lower levels of 50 dB(A). Chirps 3 and 4-soft were thus combined to mimic segregated-foregroundA (Supplement S5D), chirps 3-soft and 4 were combined to mimic segregated-foregroundB (Supplement S5E), and chirps 3 and 4 were combined to mimic segregated-both in the foreground (Supplement S5C).
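Following this description, a single chirp can be synthesized by integrating its instantaneous frequency, as in the sketch below. It uses the stated 75–80–75 ms segments and 5-ms raised-cosine gates; the sample rate and the function name are assumptions, and level scaling is omitted.

```python
import numpy as np

fs = 44_100                             # sample rate (Hz); an assumed value

def chirp_tone(f_init, f_mid, ramp=0.075, plateau=0.080, gate=0.005):
    """Linear sweep f_init -> f_mid (75 ms), constant f_mid (80 ms),
    linear sweep back (75 ms); 5-ms raised-cosine gates against clicks."""
    f = np.concatenate([
        np.linspace(f_init, f_mid, int(ramp * fs)),
        np.full(int(plateau * fs), f_mid),
        np.linspace(f_mid, f_init, int(ramp * fs)),
    ])
    phase = 2 * np.pi * np.cumsum(f) / fs   # integrate instantaneous frequency
    y = np.sin(phase)
    n = int(gate * fs)
    env = np.ones_like(y)
    env[:n] = 0.5 * (1 - np.cos(np.pi * np.arange(n) / n))
    env[-n:] = env[:n][::-1]
    return y * env

# Chirps 1 and 2 (mimicking the integrated percept) start and end at the
# geometric mean of the two middle frequencies (~534 Hz):
f_gm = np.sqrt(400.0 * 712.0)
chirp1, chirp2 = chirp_tone(f_gm, 400.0), chirp_tone(f_gm, 712.0)
```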

40 C2.1.2 Experiment 1: Material and Methods

Figure 2.1. Sound design for the replay condition. A) Sounds mimicking the integrated percept (“int”), consisting of repeating chirps 1 [sound A] and 2 [sound B]; note the frequency drift within each sound. B-D) Sounds mimicking the segregated percepts. B) “seg_both” (chirps 3 [tone A] and 4 [tone B]); C) “seg_low” (chirps 3 [tone A] and 4-soft [tone B]); D) “seg_high” (chirps 3-soft [tone A] and 4 [tone B]). Note that the spacing is uniform on a log scale (i.e., the frequencies represent steps of 5 semitones), while the sweep itself is linear in frequency, hence the slight curvature in log space. Chirps with reduced level [50 dB(A)] are depicted in gray (chirps 3-soft and 4-soft).

Visual stimuli During each type of auditory stimulation, the display showed a light gray fixation square (luminance: 88.1 cd/m²; edge length: 3.1 degrees of visual angle) in the center of a black screen. Participants were asked to direct their gaze inside the fixation square at all times.

Procedure The experiment was divided into eight blocks: three blocks of the multistable condition (block numbers 2, 4 & 6; each 5 min), each of which was directly followed by a replay block (block numbers 3, 5 & 7; each 4 min). The first and the last block of the experiment belonged to the random condition (block numbers 1 & 8; each 4 min). Individual training blocks were administered before the first appearance of the multistable (2) and the replay (3) blocks, respectively. Before each block, the eye-tracker was calibrated with a 9-point calibration (covering the center part of the display). Any new block type was (re-)introduced by a written instruction on paper. Instructions (e.g., mapping of the buttons) were reiterated by the experimenter after participants had read the paper instruction, and were summarized on the display before the block start. The start of each block was under the control of the participant. After each block, the door of the chamber was opened, and the participant had the chance to pause.


Multistable condition. For a duration of 4 min, the constant-frequency sounds A and B were presented at a level of 75 dB(A), and participants were asked to indicate their subjective perception continuously (Supplement S5A, first 4 min). The instruction emphasized the subjective nature of perception and introduced the possible percepts and the corresponding buttons as follows. The integrated percept (abbreviation: “int”) was defined as both sounds forming one continuous stream (ABAB…, see panel A of Figure 2.2). The segregated percept was defined as the two sounds forming two separate streams, none of which is perceived as dominating (A_A_... and B_B_..., “seg_both”, panel B). The segregated percept with the low stream in the foreground (A_A_..., “seg_low”, panel C) and the segregated percept with the high stream in the foreground (B_B_..., “seg_high”, panel D) were defined as the two sounds forming two separate streams, of which either the low or the high stream is perceived as dominant. The percepts were illustrated with the pictograms shown in Figure 2.2.

Figure 2.2. Original depiction taken from the paper instruction; Dots: single sounds; Lines: perceptual organization (solid: foreground; dashed: background); A) Integrated percept (“int”); B) Segregated percept, both streams in the foreground (“seg_both”); C) Segregated percept, low stream in the foreground (“seg_low”); D) Segregated percept, high stream in the foreground (“seg_high”).

To rule out any effects of manual difficulty or finger preference, the assignment of the four perceptual alternatives to the four response buttons varied between participants. The “int” and “seg_both” percepts were always paired on one side (but switching on the lower/higher buttons) of the button box, and the “seg_low” and “seg_high” percepts were paired on the other side of the button box, with “seg_low” always being on the lower button for reasons of feature-response compatibility. This resulted in four unique combinations of button-response mappings, which were counterbalanced across participants. Participants were instructed to press and hold the button as soon and as long as they experienced the respective percept. They were asked to release the button when a perceptual switch occurred and to press the button corresponding to the new percept as soon as they had identified the percept. If they could not categorize their momentary perception into one of the four alternatives or if they were confused, they were asked to release all buttons until a percept mapping to one

of the alternatives would return. It was highlighted that perception of such sequences is a highly subjective process, and that they should not attempt to actively manipulate the process. After the 4-min ambiguous part of the multistable condition, a 1-minute response verification part was administered. In this part, segments with reduced level [50 dB(A)] of either one of the tones were presented in random succession with segments of full level [75 dB(A)] for both tones (Supplement S5A, last minute). The three types of segments alternated randomly and continuously with a varying duration of 5 to 9 s for each segment. Lowering the sound level of one tone type (A or B) was expected to lead to a segregated percept with the other tone type (B or A, respectively) in the foreground, and participants’ responses could thus be checked for whether the correct button-response mapping was employed. The recognition of tones A and B and their mapping to the correct button (A/B in foreground) was practiced in training blocks before the first block of the multistable condition (i.e., before block 2 of the experiment). The purpose of the training blocks for the multistable condition was to familiarize the participants with their multistable perception of the ambiguous sequence and to help them memorize the mapping of the buttons to the four different perceptual alternatives. The first training block (2 min duration) focused on training to distinguish and classify the low (A) and high (B) tones. For this purpose, segments with only one of the tone types present (i.e., only A or only B tones) were presented in alternation with ambiguous segments [both A and B at 75 dB(A)] with a varying length of 5 to 9 s for each segment. The second training block (3 min duration) was identical to the response verification segments of the experimental blocks: the “background” tone type was no longer absent but presented with reduced level. Again, three types of segments (reduced level of A, reduced level of B, or full level of both tones) alternated randomly with a varying length of 5 to 9 s for each segment. For online evaluation, training hit rate was based on segments with reduced level of A or B tones only (because any answer would have been correct for segments with full level of both tones, due to their ambiguity). Hit rate was defined as the summed time of exclusively pressing the correctly associated button (“seg_high” for reduced level of A, and “seg_low” for reduced level of B) divided by the total presentation time of segments with one tone type lowered in level. A maximum of 3 training blocks of each type were administered. If hit rate reached 80 % earlier, further training blocks were waived.

Replay condition. Each replay block played back, in a disambiguated manner, the participant’s own sequence of percepts in the ambiguous part of the immediately preceding multistable block. Disambiguation was implemented via chirp sounds (see Auditory Stimuli). Only valid button-press events (i.e., phases with the exclusive press of a single button, indicating a unique percept) of the multistable condition were replayed. No- or double-button presses were excluded; their cumulative times were distributed equally to the valid phases such that the overall duration of 4 min was

kept. The order of the segments was shuffled for the replay relative to the multistable condition. The replay instruction emphasized that in the current task, the segments could be identified correctly. Participants were instructed to press and hold the button as soon and as long as they were presented with the replay segment corresponding to that button. Identification of each replayed percept and mapping to the correct button was practiced in training blocks before the first block of the replay condition (i.e., before block 3 of the experiment). For a duration of 2 min, the four replay segments alternated randomly with a varying length of 5 to 9 s for each segment. Training hit rate was defined as the summed time of exclusively pressing the button that was correctly associated with the current replay segment, divided by the total presentation time of the respective replay segment type. A maximum of 3 training blocks was administered. If hit rate reached 80 % (separately for each of the four types of replay segments) earlier, further training blocks were waived. Before each training and experimental block, the four different replay segments were presented for 13 s each, along with a reminder of the corresponding label and button.
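The construction of a replay block from the preceding multistable block amounts to a small algorithm. The sketch below assumes one plausible reading of the redistribution rule (“distributed equally to the valid phases” taken as an equal absolute share per phase; a proportional redistribution would also be consistent with the wording); all names are hypothetical:

```python
import random

def build_replay_segments(valid_phases, total_s=240.0, rng=random):
    """valid_phases: list of (percept, duration_s) tuples from exclusive
    button presses of the preceding multistable block. Time lost to no-
    or double-button presses is redistributed equally across the valid
    phases so that the block again lasts total_s (4 min); the order of
    the segments is then shuffled."""
    covered = sum(d for _, d in valid_phases)
    bonus = (total_s - covered) / len(valid_phases)  # equal share per phase
    segments = [(percept, d + bonus) for percept, d in valid_phases]
    rng.shuffle(segments)
    return segments
```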

Random condition. In the first and the last block of the experiment, participants were asked to press the four buttons randomly while listening to the same sequences as in the ambiguous part of the multistable condition for a duration of 4 min. During the first block, participants had not received any instructions about the stimulus or the possible ways of perceiving it, whilst in the last block, they had such knowledge based on their experience with the multistable condition. The instruction for the random condition asked the participant to always hold only one button for several seconds before switching to one of the remaining three buttons, and to refrain from any planned order in the button sequence, as well as from any rhythm in the timing of presses and releases. The auditory sequence was explained to be irrelevant for the task.

C2.1.3 Experiment 1: Data analysis

Aggregation All dependent variables (pupil dilation, button-press duration, response verification and replay performance) were averaged in the following manner: All observations made for a participant and condition were aggregated from the different blocks, and then averaged within the participant. The within-participant averages were averaged across participants separately for each condition.
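As a minimal sketch of this two-stage aggregation (the data structure is a hypothetical mapping from participant to per-block observation arrays):

```python
import numpy as np

def grand_average(values_by_participant):
    """values_by_participant: dict mapping participant -> list of per-block
    observation arrays. Pool all blocks within a participant, average
    within the participant, then average the participant means."""
    participant_means = [np.mean(np.concatenate(blocks))
                         for blocks in values_by_participant.values()]
    return np.mean(participant_means)
```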


Response verification performance (multistable condition) Ambiguous segments were discarded from hit-rate analysis in the multistable condition because the answers could not meaningfully be classified as correct or incorrect. Further, unlike for online evaluation during training (see above), all segments with reduced level of one tone type (low or high) in which participants reported to perceive “int” or “seg_both” were discarded because participants’ post-experimental reports suggested that these responses corresponded validly to their subjective perception rather than being misclassifications. Thus, hit rate calculation in the response verification part was based on only those segments with reduced level of one tone type in which participants reported to perceive either “seg_low” or “seg_high”. For these cases, reporting “seg_high” when the high tones were in fact 25 dB lower in level than the low tones, or vice versa, was counted as a misclassification of the high and low tones. The hit rate thus was defined as the summed time of exclusively pressing the correctly associated button (“seg_high” for reduced level of A, and “seg_low” for reduced level of B) divided by the summed presentation time of segments with one tone type lowered in level in which either “seg_high” or “seg_low” was exclusively pressed. Participants’ data were excluded from all analyses when they did not reach 65 % on the thus calculated hit rate; this pertained to 1 out of 20 participants.
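A sketch of this hit-rate definition in Python follows; the segment data structure is hypothetical, and only the arithmetic mirrors the definition above (segments in which neither “seg_low” nor “seg_high” was pressed are discarded, as in the text):

```python
def verification_hit_rate(segments):
    """segments: one dict per reduced-level segment, e.g.
    {'reduced': 'A', 'duration': 7.0,
     'press': {'int': 0.0, 'seg_both': 0.0, 'seg_low': 0.4, 'seg_high': 5.1}}
    where 'press' holds exclusive press durations (s) within the segment."""
    correct = {'A': 'seg_high', 'B': 'seg_low'}  # lowered tone -> expected report
    hit_time, total_time = 0.0, 0.0
    for seg in segments:
        if seg['press']['seg_low'] + seg['press']['seg_high'] == 0:
            continue  # only "int"/"seg_both" reported: discarded (see text)
        hit_time += seg['press'][correct[seg['reduced']]]
        total_time += seg['duration']
    return hit_time / total_time if total_time else float('nan')
```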

Replay performance Hit rate in the replay condition was defined as the summed time of exclusively pressing the correctly associated button divided by the summed time of all replay segments in which any button was exclusively pressed (four buttons available, one correct, three incorrect). Note that the number of percepts that had to be identified in a replay block depended on the reported percepts in the preceding block of the multistable condition; in some cases, certain replay segment types were not played at all. Hit rates are thus calculated by aggregating all segments irrespective of their identity. Participants’ data were to be discarded when they did not reach a hit rate of 65 %; no participant was affected by this in Experiment 1.

Button-press characteristics All analyses were based on exclusive button presses, which means that no further button was pressed at the same time. The last response before the block end (in the replay and random conditions) or before the start of the response verification part (in the multistable condition) was discarded because it was interrupted by a physical change or cessation of the tone sequence. The response verification part of the multistable condition was not included in any of the analyses described below. The absolute duration of each response type (i.e., percept in the multistable condition, identified segment in the replay condition, chosen button in the random condition) was calculated by

summing up all exclusive press durations of the corresponding button. The percentage of each response type was then calculated by dividing its absolute duration by the sum of the absolute durations of all four response types. The median dominance duration per condition was calculated by taking the median of all exclusive button-press durations (aggregating all response types). The median durations were then averaged across participants. A repeated-measures ANOVA with the factor Condition (3 levels: multistable, replay, random) was conducted to compare the median dominance durations across conditions. Follow-up pair-wise comparisons between conditions were conducted via two-tailed, paired t-tests. Uncorrected p values are reported, and the Bonferroni-corrected alpha level is given. To quantify randomness of the button-press patterns, the coefficient of variation (CV) of all exclusive button-press durations was calculated separately for each condition. CV is defined as the durations’ standard deviation σ divided by the durations’ mean μ:

$$\mathrm{CV} = \frac{\sigma}{\mu}$$

A repeated-measures ANOVA with the factor Condition (3 levels: multistable, replay, random) was conducted to compare the CV across conditions. Follow-up pair-wise comparisons were conducted via two-tailed, paired t-tests with Bonferroni correction.
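These per-participant summaries reduce to a few lines of NumPy (press_durations below is a hypothetical mapping from condition to one participant’s pooled exclusive press durations):

```python
import numpy as np

def summarize(durations):
    """Median dominance duration and coefficient of variation (sigma / mu)
    of one participant's exclusive button-press durations."""
    d = np.asarray(durations, dtype=float)
    return {'median': np.median(d), 'cv': d.std() / d.mean()}

# press_durations = {'multistable': [...], 'replay': [...], 'random': [...]}
# per_condition = {cond: summarize(d) for cond, d in press_durations.items()}
```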

Pupil data aggregation The eye-tracking data provided us with 1 sample of pupil diameter (in arbitrary units) per millisecond. The eye-tracker’s built-in software detected blinks and saccades (threshold settings: 35°/s for velocity, 9,500°/s² for acceleration). These periods were marked as missing data in the trace of pupil diameters. We further discarded any pupil data obtained at gaze positions outside a central square with an edge length of 10° viewing angle. Moreover, we discarded the first 3 s after each block onset to remove the effects of sequence onset on the pupil diameter. No further filtering or smoothing was applied. For comparability reasons, the remaining pupil diameter (PD) data of each block was z-standardized. If μ is the mean and σ is the standard deviation over all included samples (PD(t), pupil diameter in arbitrary units) of the block, then the z-standardized pupil trace, zPD(t), can be calculated for all samples PD(t) as follows:

$$zPD(t) = \frac{PD(t) - \mu}{\sigma}$$
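A minimal sketch of this step, assuming missing samples (blinks/saccades, off-fixation gaze, block onset) are coded as NaN in a 1 kHz trace:

```python
import numpy as np

def z_standardize_block(pd_trace):
    """z-standardize one block's pupil-diameter trace; NaN samples are
    ignored when estimating the block mean and standard deviation."""
    mu = np.nanmean(pd_trace)
    sigma = np.nanstd(pd_trace)
    return (pd_trace - mu) / sigma
```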


We aligned the trace of z-standardized PD values along all recorded button presses in the multistable and random conditions. In the replay condition, we selected only those button presses that constituted the first press after a physical stimulus change, and that happened maximally 2 s after the change. This ensures that we capture the initial response to a replay segment, as opposed to later responses that might reflect additional processes, such as response corrections after detecting an error, or multistable perception occurring after prolonged exposure to a replay segment. For each included button press, we cut out 2,000 samples before and 2,000 samples after the time-point of the button press from the PD trace. To prevent using data twice, we used only data that was closer to the button press under consideration than to any other analyzed button press. The remainder was assigned to the other button press and therefore treated as missing data for the epoch under consideration. This procedure ensures that while avoiding double use of data, the usable portion of the PD data is maximized around the time-point of the button press. The resulting PD traces (4,001 samples around each button press) were averaged along the time dimension separately per participant and condition, ignoring missing data on a sample-by-sample basis. As outlined above, missing data could result from blinks/saccades, gaze position outside the 10° square, beginning-of-block, or from avoidance of double use of data (see Supplement S1); and the available data was additionally limited by the number of button presses issued by the participant in the respective condition. If an individual participant’s PD trace in any condition contained a single sample (out of the 4,001) with zero PD traces contributing, this participant’s data was excluded from all analyses. No participant was affected by this exclusion criterion in the multistable and replay conditions of Experiment 1 (see below for more general issues with the random condition). Average PD traces were baseline-corrected separately per participant and condition by subtracting the average of the zPD of the first 200 samples (i.e., from -2,000 to -1,801 ms relative to the button press) from all 4,001 samples. These z-normalized and baseline-corrected traces are referred to as PD traces hereafter.
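The epoching logic — cutting ±2,000 samples around each press while assigning every sample only to its nearest press — might look as follows. This is an illustrative sketch under the stated assumptions (1 sample/ms, press times in ms, NaN for missing data); it is not the original analysis code:

```python
import numpy as np

def average_epoch(zpd, press_times_ms, half_width=2000):
    """Average +/-2,000 ms epochs around button presses. Samples closer
    to a neighboring press than to the current one are treated as
    missing; block edges are NaN-padded so indexing never fails."""
    presses = np.sort(np.asarray(press_times_ms, dtype=int))
    pad = half_width
    z = np.concatenate([np.full(pad, np.nan), np.asarray(zpd, float),
                        np.full(pad, np.nan)])
    epochs = []
    for i, t in enumerate(presses):
        times = np.arange(t - half_width, t + half_width + 1)  # 4,001 samples
        epoch = z[times + pad].copy()
        if i > 0:                 # mask samples nearer the previous press
            epoch[times < (t + presses[i - 1]) / 2] = np.nan
        if i < len(presses) - 1:  # mask samples nearer the next press
            epoch[times > (t + presses[i + 1]) / 2] = np.nan
        epochs.append(epoch)
    # average over epochs, ignoring missing data sample by sample; a sample
    # with no contributing epoch yields NaN (the exclusion criterion above)
    avg = np.nanmean(np.vstack(epochs), axis=0)
    return avg - np.nanmean(avg[:200])  # baseline: -2,000 to -1,801 ms
```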

Maximal amplitude analysis To statistically compare the amount of pupil dilation around the button press across conditions, the maximal amplitude of the individual PD traces must be extracted per participant and condition. It is difficult to reliably assess the latency of the maximal PD amplitude, and thus the time-point for reading out the maximal PD amplitude (highest z value), from the single-participant traces due to their inherent noise level (especially for cases with low numbers of button presses or high proportions of missing data in some samples). To account for this difficulty, we used an approach from the field of event-related brain potential component analysis, which is faced with similar

signal-to-noise ratio issues: The jackknife approach has been identified as a suitable statistical resampling method that may show a better estimate of the peak amplitude than static averaging (Kiesel et al., 2008). By measuring the maximal amplitude on N sub-samples of N−1 participants (i.e., building averages by consecutively leaving one participant out), we obtained N maximal amplitudes per condition. Maximal amplitudes were compared across conditions as appropriate by a repeated-measures ANOVA with the factor Condition (3 levels: multistable, replay, random) or via two-tailed, paired t-tests for pair-wise comparisons. Both the F and the t values must be corrected to account for the jackknifing, as shown by Ulrich and Miller (2001):

$$F_{\mathrm{corrected}} = \frac{F}{(N-1)^2} \qquad \text{and} \qquad t_{\mathrm{corrected}} = \frac{t}{N-1}$$
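A sketch of the jackknife procedure and the Ulrich–Miller correction, assuming traces is a participants × samples array of baseline-corrected PD traces (SciPy is used for the paired t-test):

```python
import numpy as np
from scipy import stats

def jackknife_max_amplitudes(traces):
    """Leave-one-out grand averages: N sub-samples of N-1 participants,
    each reduced to the maximum of its average PD trace."""
    n = traces.shape[0]
    return np.array([np.nanmax(np.nanmean(np.delete(traces, i, axis=0),
                                          axis=0))
                     for i in range(n)])

def jackknife_corrected_t(amps_a, amps_b):
    """Paired t-test on jackknifed amplitudes; the raw statistic is shrunk
    by N-1 (Ulrich & Miller, 2001) before computing the p-value."""
    n = len(amps_a)
    t_raw, _ = stats.ttest_rel(amps_a, amps_b)
    t_corr = t_raw / (n - 1)
    p_corr = 2 * stats.t.sf(abs(t_corr), df=n - 1)
    return t_corr, p_corr
```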

Sample-wise analysis and adjustment for multiple comparisons In addition to the analysis of maximal amplitudes, the PD traces were examined on a sample-by-sample basis to characterize their temporal trajectories. This involved the comparison of PD traces (4,001 samples) against zero z-score (mean of the given data set) for each sampling point (1 ms) with a two-tailed t-test. The alpha-level correction to compensate for this repeated statistical testing was implemented with the false discovery rate (FDR) procedure introduced by Benjamini and Hochberg (1995). Based on the distribution of p-values in the given dataset, this procedure generates a corrected alpha level that lies between the uncorrected level of 0.05 and the Bonferroni adjustment, which would result in an overly conservative alpha level of 0.05/4,001. Throughout all statistical analyses, the expected FDR is set to 0.05. Significance is denoted whenever the p-value remains under the FDR-corrected alpha level, which is reported along with each analysis.
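A compact sketch of the Benjamini–Hochberg step as applied here (pd_traces is a hypothetical participants × 4,001 array):

```python
import numpy as np
from scipy import stats

def fdr_alpha(p_values, q=0.05):
    """Benjamini-Hochberg (1995): the corrected alpha level is the largest
    sorted p-value satisfying p(k) <= (k/m) * q, or 0 if none does."""
    p = np.sort(np.asarray(p_values))
    m = p.size
    ok = p <= (np.arange(1, m + 1) / m) * q
    return p[ok].max() if ok.any() else 0.0

# t_vals, p_vals = stats.ttest_1samp(pd_traces, 0.0, axis=0,
#                                    nan_policy='omit')
# alpha = fdr_alpha(p_vals)        # lies between 0.05/4,001 and 0.05
# significant = p_vals <= alpha    # sample-wise significance mask
```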

C2.1.4 Experiment 1: Results

Performance during response verification and replay We quantified participants’ ability to distinguish the percepts and to use the correct button-response mapping by their hit rates in the response verification part of the multistable condition and by their hit rates in the replay condition. In the response verification part, one participant showed a hit rate of 54.1 %, which was below the threshold of 65 %; therefore, the following analysis is limited to the remaining 19 (all hit rates at or above 80 %) out of 20 participants’ data. The average hit rate of the remaining 19 participants was high (M = 92.4 % with a standard deviation of 4.8 %). In the replay blocks, the hit-rate average was likewise high (M = 90.9 % with a standard deviation of 6.6 %).


Button-press characteristics Percentages of each response type are shown in Figure 2.3A. In the multistable condition (ambiguous part only), the reported percentages of each percept were 32.47 % for the integrated percept, 38.35 % for seg_both, 13.82 % for seg_low, and 15.36 % for seg_high. In the replay condition, the percentages of identified segments should be similar to those of the multistable condition, which is indeed the case, with 32.77 % for the integrated percept, 35.80 % for seg_both, 14.35 % for seg_low, and 17.08 % for seg_high. In the random condition, participants produced approximately equal pressing times for each button, which corresponds to the instruction (24.24 %, 25.15 %, 25.61 %, and 24.99 %). Median dominance durations collapsed across all response types are shown in Figure 2.3B. They are longest for the multistable condition (M = 15.3 s, SD = 13.1 s), somewhat shorter for the replay condition (M = 11.7 s, SD = 8.8 s), and much shorter in the random condition (M = 2.94 s, SD = 2.0 s). This was confirmed in the ANOVA on median dominance durations, which showed a significant main effect of Condition, F(2, 36) = 13.4, p < .001. Follow-up pair-wise comparisons showed significant differences between the multistable and random conditions, t(18) = 3.85, p = .001, as well as between the replay and random conditions, t(18) = 3.97, p < .001, but not (at a Bonferroni-corrected alpha level of 0.05/3) between the multistable and replay conditions, t(18) = 2.11, p = .049.

Figure 2.3. A) Percentages of exclusive integrated percept (bottom), exclusive segregated percept with both tones in the foreground (2nd from bottom), exclusive segregated percept with low tone in the foreground (2nd from top), exclusive segregated percept with high tone in the foreground (top). In the random condition, button assignments are arbitrary. B) Median dominance duration per participant and condition, mean and standard error of the mean across listeners. C) Coefficient of variation per listener; error bars depict mean and standard error of the mean across listeners.

The exclusive button-press durations showed a mean coefficient of variation (CV) of 0.82 (SD = 0.24) in the multistable, 0.91 (SD = 0.29) in the replay and 0.48 (SD = 0.31) in the

random condition. Randomness of the button-press patterns as quantified by CV was thus notably lower in the random condition, which was confirmed by a significant main effect of Condition in the ANOVA, F(2, 36) = 42.3, p < .001, and by follow-up pair-wise comparisons showing significant differences between the multistable and random conditions, t(18) = 7.14, p < .001, as well as between the replay and random conditions, t(18) = 6.98, p < .001. There was also a numerical difference between the CVs in the multistable and replay conditions, which just failed to fall below the Bonferroni-corrected alpha level, t(18) = 2.56, p = .02. A CV of 0 implies a perfectly regular pattern, whereas a CV approaching or exceeding 1 corresponds to independence between the timing of subsequent button presses. The low CV in the random condition indicates that participants’ button presses were more regular than instructed, and – contrary to what was intended by design – did not provide a good control for the button-press patterns in the multistable and replay conditions. This undermines the comparability of the pupil responses across conditions. In addition to the low variability, an even more problematic aspect was the unexpectedly high rate of button presses in the random condition: despite the instruction to wait for several seconds, the median duration of exclusively holding one button was only 2.9 s. At such a fast pace, the pupil response elicited by the button press might not show its full amplitude due to saturation effects, and it also becomes unfeasible to extract an interval of ±2 s around each button press without double use of data and without severe amounts of missing data at the edges (see Supplement S1). For these reasons, we excluded the random condition from the analysis of the pupil responses (for completeness, we refer the reader to Supplement S2 for a descriptive analysis of the pupil traces in the random condition).

Pupil dilation Pupil data exclusion based on blinks/saccades, gaze position outside the 10° square, and beginning-of-block led to an amount of missing pupil data of altogether 10.7 % (multistable), 12.5 % (replay), and 14.7 % (random condition). In the PD traces around the button press, the amount of missing data was about equally distributed in the multistable and replay conditions, with a small peak right after the button press (see Supplement S1). In the random condition, the amount of missing data in the 4-s window around the button press was considerably larger towards the beginning and end of the analysis window due to pruning because of too close button presses (Figure 2.S1C). Following the criterion that a participant’s data would be excluded from all analyses if an average PD trace in any condition contained a single sample (out of the 4,001) without pupil data, including the random condition in the analysis would have led to the exclusion of three further participants’ datasets. Moreover, average pupil traces of the remaining participants would have been based on very few data points towards the edges of the analysis interval. This confirmed our assessment based on the behavioral data that the extraction of robust

pupil traces from the random condition would be unfeasible, which led to the aforementioned decision to exclude the random condition from pupil data analysis. In the analysis of the pupil data from the multistable and replay conditions, a distinct widening of the PD around the time of the button press was observed in both conditions (Figure 2.4). This pupil dilation effect was confirmed by a sample-wise comparison of the PD traces against the baseline, zero z-score. In the multistable condition, PD was significantly different from zero continuously at an FDR-corrected alpha level (p ≤ 0.045) from 1642 ms before the button press until the end of the analysis window (2,000 ms after the button press). In the replay condition, PD was significantly different from zero continuously from 574 ms before the button press until the end of the analysis window (p ≤ 0.034, FDR-corrected alpha level). A surrogate analysis (see Supplement S4) shows that the pupil dilation effect is abolished when randomly re-positioning the button presses, which rules out that statistical artifacts underlie this finding.

Figure 2.4. Pupil diameter between 2 s before and 2 s after the button press, z-normalized and subtractively baseline-corrected to 0 mean between [-2 s and -1.8 s]; solid lines: mean, shaded areas: standard error of the mean. Colored segments denote periods in which pupil size is significantly different from 0 at an expected FDR of 0.05; adjusted alpha level is denoted on top of the segment.

The pupil dilation effects in the multistable and replay conditions were compared in terms of their maximal amplitude extracted via jackknifing (Figure 2.5). Contrary to the hypothesis, the maximal amplitude in the replay condition (M = 0.712, SD = 0.021) was significantly higher than in the multistable condition (M = 0.449, SD = 0.016). This was confirmed by a jackknifing-corrected t-test, tcorrected(18) = 3.80, p < .001.


Figure 2.5. A) Jackknifed traces (19 per condition) and B) maximal amplitudes of the jackknifed traces.

C2.1.5 Experiment 1: Discussion

We obtained evidence for a perceptual switch during auditory multistability to be accompanied by a temporary dilation of the pupil. This confirms hypothesis 1 and corresponds to findings by Einhäuser et al. (2008) and Hupé et al. (2009) for visual multistability. The pupil dilation commences more than 1.5 s before the button press with which participants indicate their perceptual switch, which is earlier than observed by Einhäuser et al. (2008) and Hupé et al. (2009) for vision. A plausible explanation is that the time from the start of the perceptual transition to its overt report is typically longer in auditory than in visual multistability (Einhäuser, Thomassen, & Bendixen, 2017). This is because evidence has to be accumulated over some period of time due to the discrete nature of the tone presentation (only 4 tones per second in the present paradigm). It is important to examine whether the pupil dilation effect indeed relates to the perceptual switch, or whether it is driven by other processes that accompany the report of the perceptual switch. For this purpose, we had included a random control condition (Farkas, Denham, Bendixen, & Winkler, 2016; Hupé et al., 2009) and a replay control condition (Brascamp, Sterzer, et al., 2018; Einhäuser et al., 2008). Unexpectedly, the behavioral result pattern in the random control condition was so different from that in the experimental (multistable) condition that we had to refrain from analyzing the accompanying pupil response because it could not have been interpreted in a meaningful way. This precluded a test of hypothesis 2; hence, we cannot infer whether the pupil dilation effect observed during auditory multistable perception is merely due to motor processes (i.e., pressing a report button to indicate the perceptual switch). To be able to test this hypothesis, we developed a slightly adapted approach in Experiment 2 (see below) that strives for a higher level of comparability between the button-press behaviors across conditions.

The behavioral measures in the multistable and replay condition were similar by design, since participants reproduced their button-press behavior from multistable perception with the disambiguated replay stimuli. Small numerical differences between these conditions were nevertheless observed in the behavioral measures. A plausible explanation for these differences is that participants needed some time to choose a correct answer in the replay condition and that they sometimes noticed a mistake and changed their answer during a segment. This shortens the duration of the button presses, and it adds another source of variability and thereby increases the CV. Nevertheless, the effects were modest (not surviving a Bonferroni-corrected statistical test), and thus the pupil dilation effects can meaningfully be compared to one another. Contrary to hypothesis 3, pupil dilation was larger in response to a perceptual switch caused by a physical change (i.e., in the replay condition) than to an endogenous re-interpretation of physically unchanging input (i.e., in the multistable condition). This result pattern was unexpected, and it is distinct from Hupé et al. (2009) who implemented a physical control condition (without the aspect of individual replay) and found no significant differences between this control condition and their (visual) multistability condition. Numerically, their effect was larger during multistability than during response to physical stimulus changes, as we had expected to find here. The higher pupil dilation during replay than during multistability observed here might be confounded by two different factors: First, it is possible that the pupil dilation to percept changes in the replay condition is contaminated by the pupil response elicited by the physical stimulus change and thereby enlarged when compared to a percept change during multistability with a physically unchanged stimulus. Indeed, distinct auditory stimulus changes have been shown to have a dilating influence on the pupil (Wetzel et al., 2016). Second, the pupil response in the replay condition might be larger than that in the multistable condition because it is less temporally smeared – that is, the button press might have a tighter (less variable) temporal relation with the actual perceptual change during replay than during multistability. The latter argument would be supported by the observation that the dilation starts more than a second later in the replay than in the multistable condition and then shows a steeper gradient towards the peak. It is important to note that our analysis of pupil responses is always fixed to the time point of the button press, not to the actual endogenous perceptual switch (multistable condition) or to the exogenously caused perceptual switch (replay condition) or to the decision to randomly press another button (random condition) – because these events are not overtly observable. Therefore, the explanation based on temporal smearing is difficult to test in a straightforward manner. We thus modified our paradigm to include an experimental test of the first explanation: By isolating the effect of the physical stimulus change on the pupil response in Experiment 2 (see below), we can correct the pupil response during replay for pupillary components evoked by this change.


C2.2.1 Experiment 2: Introduction

Experiment 2 reproduced Experiment 1 with a fresh set of participants and with two advancements in the experimental paradigm, as detailed below. We expected to replicate the pupil dilation in response to a perceptual switch during auditory multistability (hypothesis 1) and to characterize this dilation in more detail than Experiment 1 allowed us to. For testing hypotheses 2 and 3 more rigorously, we introduced two major improvements. First, by careful adaptation of the instructions, we sought to evoke a button-press pattern in the random condition that is more similar to the other conditions to be able to compare the pupil responses in a meaningful way. This would allow us to address hypothesis 2, namely that the pupil dilation during auditory multistability exceeds the one associated with a mere button press (random condition). Second, by introducing an additional control condition, we sought to compensate for a possible contamination of the pupil response by physical stimulus changes during stimulus replay. More specifically, in Experiment 2, we applied two different replay conditions: replay-active, which is identical to the replay condition of Experiment 1, and replay-passive, which involves the same stimulus presentation as replay-active but without any overt response required from the participants. The new replay-passive condition allows us to estimate the effect of physical stimulus changes on the pupil response, which can then be subtracted from the pupil responses in the replay-active condition. This leads to a re-examination of hypothesis 3: If the unexpected result pattern observed in the comparison of the multistable and replay conditions was due to physical stimulus change effects, pupil dilation in the corrected replay condition should no longer exceed pupil dilation during multistability – if, on the other hand, it was due to temporal smearing, the result should remain unchanged.

C2.2.2 Experiment 2: Material and Methods

All procedures, setup and stimulus details were identical to Experiment 1 unless otherwise denoted.

Participants 26 healthy volunteers (age: 22.3 ± 3.01 years; 8 male, 18 female; 5 left-handed, 20 right-handed, 1 both-handed) participated in Experiment 2. Data collection for one additional volunteer was commenced but had to be cancelled for technical reasons.


Setup To improve data quality, the eye-tracking camera and IR emitter were re-positioned relative to Experiment 1.

Visual stimuli During each type of auditory stimulation, the display showed a combination of bull’s eye and cross hair as recommended by Thaler, Schütz, Goodale, and Gegenfurtner (2013) for experiments requiring stable fixation. The fixation target was presented in black in the center of the screen (luminance: 0.1 cd/m²; size: 1.2 degrees of visual angle). Participants were asked to direct their gaze at the fixation target throughout. The screen’s background luminance (on which the fixation target was presented) was optimized separately for each participant by a procedure at the very beginning of the experiment. After EyeLink calibration (9 points covering the central area of the display), the whole display emitted one of 13 luminance levels (ascending order: 0.1, 12.5, 25, 37.5, 50, 62.5, 75, 87.5, 100, 112.5, 125, 137.5, 304.0 cd/m²) for 5 s each, while the participant fixated the fixation target in the middle of the screen. For each luminance level, the last second of the recorded pupil diameter was examined, and a luminance level was picked that was close to the midpoint of the pupil diameters evoked by the lowest and highest luminance presented. The aim of this procedure was to choose a luminance level for the experiment that would yield an intermediate pupil diameter, to prevent ceiling and floor effects in the pupil response.
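The selection rule of this calibration can be sketched as follows (a hypothetical illustration; function and variable names are assumptions):

```python
import numpy as np

LEVELS = [0.1, 12.5, 25, 37.5, 50, 62.5, 75, 87.5, 100,
          112.5, 125, 137.5, 304.0]  # cd/m^2, ascending order

def pick_background_luminance(mean_pd_per_level):
    """mean_pd_per_level: mean pupil diameter over the last second of each
    5-s presentation, in the ascending order of LEVELS. Returns the level
    whose steady-state pupil diameter is closest to the midpoint of the
    diameters at the lowest and highest luminance."""
    pd = np.asarray(mean_pd_per_level, dtype=float)
    target = (pd[0] + pd[-1]) / 2  # midpoint of the extreme responses
    return LEVELS[int(np.argmin(np.abs(pd - target)))]
```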

Procedure The experiment was divided into nine blocks. Depending on the parity (even or odd) of the participant number, participants passed through the experiment in one of two orders (I or II). Both orders contained the same number of each block type, but in a different succession to prevent order effects. The first (1) and last (9) blocks belonged to the random condition, and blocks 2, 5 and 8 were always of the multistable condition. The positioning of the blocks of the replay-passive and replay-active conditions differed between orders I and II. In order I, blocks 3 and 7 were replay-active blocks (and thus blocks 4 and 6 were replay-passive blocks); complementary to that, in order II, blocks 4 and 6 were replay-active blocks (and thus blocks 3 and 7 were replay-passive blocks). As in Experiment 1, training blocks were situated before the first appearance of the multistable condition (block 2) and the replay-active condition (block 3 or 4). The four different button mappings from Experiment 1 were systematically crossed with the alternating block orders (I and II) to counterbalance all unique combinations.


Random condition. The change of the random condition relative to Experiment 1 pertained to the instructions given to participants and to a check of their fulfillment in the first block of the experiment. With the purpose of yielding median button-press durations longer than in Experiment 1 (and long enough to enable an analysis window of 4 s width), the instructions were shortened to focus attention on the most important aspects, namely that participants should always hold only one button, and that they should hold it for several seconds before switching to another button. It was no longer mentioned that they should refrain from any planned order in the button sequence, nor from any rhythm in the timing of presses and releases, because these various requirements might have distracted them from the simple instruction not to change the buttons too quickly. The envisaged button-press duration of more than 4 s was additionally enforced by repeating the first random block once if the median button-press duration was below 4 s on the first attempt. In these cases, only the repeated block was used for further analysis. The last block of the experiment (i.e., the second block of the random condition) was not repeated, even if the criterion (median < 4 s) was violated.

Replay-active condition. The replay-active condition in Experiment 2 corresponded to the replay condition in Experiment 1 in all respects.

Replay-passive condition. The replay-passive condition was constructed in the same way as the replay-active condition, with independent random shuffling for the replay-passive and -active blocks that were based on the same block of the multistable condition. The core difference between replay-passive and -active pertained to task and instructions. In the replay-passive condition, participants were not asked to report the current segment. Instead, they were instructed to focus their attention on the auditory sequence and to maintain fixation. The physically different segments (int, seg_both, seg_low, seg_high) that would be played were introduced just as in the replay-active condition. The button box was placed out of the participants’ reach, and no further training (as for the multistable and replay-active conditions) was applied.

C2.2.3 Experiment 2: Data analysis

The data analysis was identical to Experiment 1, with an extension for the corrected replay condition.


Correction of the replay condition for effects of stimulus change The replay pupil response reported for Experiment 2 (denoted replay-corrected hereafter) is based on a subtraction of data from the replay-active and the replay-passive conditions. The PD traces in the replay-passive condition were aligned to the onset of the physical stimulus change. Segments from -2 s to +4 s relative to the stimulus change were extracted (ignoring missing data on a sample-by-sample basis as in the other conditions, see Experiment 1 for details), and were averaged within each participant to yield this participant’s average pupil response to a change in the auditory stimulus without an overt response. This single-subject average PD trace of 6,001 samples length was baseline-corrected by subtracting the average of the first 200 samples (i.e., from -2,000 to -1,801 ms relative to the stimulus change) from all 6,001 samples. The pupil response in the replay-active condition was extracted time-locked to the button press for each button press that met the criteria laid out for the replay condition in Experiment 1. Then the baseline-corrected average replay-passive trace was aligned with the stimulus change preceding this button press and subtracted from the replay-active trace. (This stimulus change falls – by the defined criteria – within the 2 s prior to the button press; thus the [-2 s to 4 s] interval of the replay-passive condition suffices for the correction.) The thus corrected traces were averaged and baseline-corrected as in the replay condition of Experiment 1. A comparison between uncorrected replay-active and replay-corrected is given in Supplement S3. This subtraction provided us with a corrected pupil trace for the replay condition, which was then analyzed as in Experiment 1 in terms of deviation from the baseline and in terms of comparison of the maximal amplitude (via jackknifing) between all three conditions (multistable, corrected replay, and random).
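The alignment arithmetic of this correction can be sketched as follows (an illustrative reading of the procedure above; names and the per-epoch granularity are assumptions):

```python
import numpy as np

def replay_corrected_epoch(active_epoch, passive_avg, change_to_press_ms):
    """Subtract the participant's average replay-passive response from one
    replay-active epoch. passive_avg spans -2 s..+4 s around the stimulus
    change (6,001 samples, baseline-corrected); active_epoch spans
    -2 s..+2 s around the button press (4,001 samples). change_to_press_ms
    is the lag from stimulus change to button press (0-2,000 ms by the
    inclusion criteria)."""
    # Sample 2,000 of passive_avg marks the stimulus change. The active
    # epoch starts (change_to_press_ms - 2,000) ms after the change, i.e.
    # at passive index 2000 + change_to_press_ms - 2000 = change_to_press_ms.
    start = int(change_to_press_ms)
    return active_epoch - passive_avg[start:start + 4001]
```

Because the button press follows the change by at most 2 s, the slice never runs past the +4 s end of the passive trace, which is presumably why the longer [-2 s, +4 s] window was chosen for the replay-passive segments.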

C2.2.4 Experiment 2: Results

Data exclusion Out of the 26 participants, 7 were excluded from further analysis because they met one or more exclusion criteria: one participant showed a hit rate of 21.6 % (well below the threshold of 65 %) in the response verification part of the multistable condition, four participants showed hit rates below 65 % in the replay condition (46.4 %, 52.4 %, 60.0 %, 64.4 %). For five participants, PD traces could not be averaged in at least one of the conditions because there was at least one single sample with zero PD traces contributing. The exclusion criteria were partly overlapping, leading to a loss of 7 datasets altogether.


Performance during response verification and replay The remaining 19 participants showed a high hit rate of 90.4 ± 5.5 % in the response verification part of the multistable condition and a likewise high hit rate of 91.3 ± 6.1 % in the replay-active condition.

Button-press characteristics Percentages of each response type are shown in Figure 2.6A. In the multistable condition, the reported percentages of each percept were 33.3 % for the integrated percept, 39.3 % for seg_both, 12.0 % for seg_low, and 15.3 % for seg_high. In the replay condition, the percentages of identified segments should be similar to those of the multistable condition, which is the case as in Experiment 1, with 33.3 % for the integrated percept, 35.4 % for seg_both, 10.6 % for seg_low, and 20.7 % for seg_high. In the random condition, participants again produced approximately equal pressing times for each button (26.2 %, 23.7 %, 26.2 % and 23.9 %). Median dominance durations collapsed across all response types are shown in Figure 2.6B. As in Experiment 1, median dominance durations were slightly longer in the multistable condition (M = 19.9 s, SD = 15.3 s) than in the replay condition (M = 16.2 s, SD = 9.2 s). They were again shorter in the random condition (M = 11.3 s, SD = 5.4 s), but the difference was not as extreme as in Experiment 1, where the median dominance duration in the random condition (M = 2.94 s) was almost four times shorter than in Experiment 2. Nevertheless, a significant difference between conditions was confirmed in the ANOVA on the median dominance durations observed in Experiment 2, F(2, 36) = 5.54, p = .008. Follow-up pair-wise comparisons showed differences between the multistable and random conditions, t(18) = 2.60, p = .02, as well as between the replay and random conditions, t(18) = 2.39, p = .03, both of which fall slightly above the Bonferroni-corrected alpha level. No significant difference between the median dominance durations in the multistable and replay conditions was observed, t(18) = 1.67, p = .11.


Figure 2.6. A) Percentages of exclusive integrated percept (bottom), exclusive segregated percept with both tones in the foreground (2nd from bottom), exclusive segregated percept with low tone in the foreground (2nd from top), exclusive segregated percept with high tone in the foreground (top). In the random condition, button assignments are arbitrary. B) Median dominance duration per participant and condition, mean and standard error of the mean across listeners. C) Coefficient of variation per listener; error bars depict mean and standard error of the mean across listeners.

The exclusive button-press durations showed a mean coefficient of variation (CV) of 0.76 (SD = 0.47) in the multistable, 0.85 (SD = 0.39) in the replay and 0.40 (SD = 0.20) in the random condition. Randomness of the button-press patterns as quantified by CV was thus again notably lower in the random condition, which was confirmed by a significant main effect of Condition in the ANOVA, F(2, 36) = 14.12, p < .001, and by follow-up pair-wise comparisons showing significant differences between the multistable and random conditions, t(18) = 3.45, p = .003, as well as between the replay and random conditions, t(18) = 5.48, p < .001, but not between the multistable and replay conditions, t(18) = 1.12, p = .28. It should be noted that the low CV in the random condition of Experiment 2 does not indicate a violation of instructions because – unlike for Experiment 1 – the instructions did not ask participants to refrain from any rhythm in the timing of presses and releases. Importantly, the median interval between any two consecutive button presses in the random condition was now in a range that made the analysis of pupil effects feasible.

Pupil dilation Pupil data exclusion based on blinks/saccades, gaze position outside the 10° square, and beginning-of-block led to an amount of missing pupil data of altogether 7.6 % (multistable condition), 8.1 % (replay-active), 7.6 % (replay-passive), and 9.2 % (random condition). In the PD traces around the button press or stimulus change, the amount of missing data was about equally distributed, with a small peak right after the button press in the three conditions with response requirement (see Supplement S1).


Figure 2.7 shows a distinct widening of the PD around the time of the button press in all three conditions (multistable, replay-corrected, and random). Replay-corrected refers to the correction of the pupil trace in the replay-active condition by the pupil trace in the replay-passive condition (see Supplement S3). The pupil dilation effect was confirmed in all three conditions by a sample-wise comparison of the PD traces against the baseline, zero z-score. In the multistable condition, PD was significantly different from zero (p ≤ 0.043, FDR-corrected alpha level) continuously from 1240 ms before the button press until 2,000 ms after it, corresponding to the end of the analysis window. In the corrected replay condition, PD was significantly different from zero continuously from 659 ms before the button press until 2,000 ms after it (p ≤ 0.030, FDR-corrected alpha level). In the random condition, PD was significantly different from zero continuously from 691 ms before the button press until 2,000 ms after it (p ≤ 0.037, FDR-corrected alpha level). A surrogate analysis (see Supplement S4) shows that pupil dilation is abolished when randomly re-positioning the button presses, which rules out that statistical artifacts underlie the effect.

Figure 2.7. Pupil diameter between 2 s before and 2 s after the button press, z-normalized and subtractively baseline-corrected to 0 mean between [-2 s and -1.8 s]; solid lines: mean, shaded areas: standard error of the mean. Colored segments denote periods in which pupil size is significantly different from 0 at an expected FDR of 0.05; adjusted alpha level is denoted on top of the segment.

The pupil dilation effects in the multistable, replay-corrected and random conditions were compared in terms of their maximal amplitude extracted via jackknifing (see Figure 2.8). Maximal amplitudes were 0.59 (SD = 0.014) in the multistable, 0.84 (SD = 0.021) in the replay-corrected and 0.52 (SD = 0.014) in the random condition. The jackknifing-corrected ANOVA confirms a significant effect of Condition on the maximal amplitudes, Fcorrected(2, 36) = 7.06, p = .003. Follow-up pair-wise comparisons with jackknifing-corrected t-tests showed a significant difference


between the replay-corrected and random conditions, tcorrected(18) = 3.32, p = .004, a difference that fails to meet the Bonferroni-corrected alpha level between the multistable and replay-corrected conditions, tcorrected(18) = 2.05, p = .03, and no significant difference between the multistable and random conditions, tcorrected(18) = 1.07, p = .30.

Figure 2.8. A) Jackknifed traces (19 per condition) and B) maximal amplitudes of the jackknifed traces.

C2.3 Experiment 1 & 2: Discussion

As in Experiment 1, we found evidence for a perceptual switch during auditory multistability to be accompanied by a temporary dilation of the pupil. This confirms hypothesis 1 and is in line with pupil dilation during visual multistability (Einhäuser et al., 2008; Hupé et al., 2009). Again, the pupil dilation commences more than a second before the button press with which participants indicate their perceptual switch, which might be due to prolonged evidence accumulation during auditory (as opposed to visual) perceptual decisions for such stimuli. In Experiment 2, we can relate the pupil dilation effect during auditory multistability to other processes that accompany the report of the perceptual switch. Specifically, pupil data from the random control condition (Hupé et al., 2009) could be meaningfully analyzed in Experiment 2 thanks to the longer durations of participants’ button presses (median 11.3 s as opposed to 2.94 s in Experiment 1). Analysis of the pupil data during random button pressing shows that reliable pupil dilation around the button press is elicited in this condition as well, and that the maximal amplitude of the pupil dilation does not differ from that observed during perceptual multistability. This disconfirms hypothesis 2 and is distinct from results of a similar comparison for visual multistability (Hupé et al., 2009), where the amplitude of pupil dilation during the motor-control condition amounted to only 70 % of the amplitude during multistability (i.e., there was a genuine component of the perceptual switch). In the worst case, the present result indicates that pupil dilation during auditory multistability is simply an artifact of pressing a response button. Alternatively, our random control condition might have been too demanding for participants, who had to choose between four possible response buttons (evidently making an effort to press each of them for the same amount of time) while observing the timing requirements and trying to produce random behavior. Such task demands would increase the pupil response at the time of the decision (Hess & Polt, 1964). Similarly, the decision as such may elicit a pupil dilation (Einhäuser et al., 2010; Simpson & Hale, 1969). In either case, the random condition would not be suitable for isolating the mere motor act but would come with its own deliberation and decision processes. Random button-press control conditions are not yet routinely applied in experiments involving auditory multistable perception (see Farkas et al., 2016, for an exception). In the present studies, we made an effort to keep the random (control) condition as similar as possible to the multistable (experimental) condition. Future studies should systematically compare different instructions for random button pressing with the aim of producing similar behavioral outcomes while avoiding the cost of involving too many cognitive processes on top of pressing a button. Another way to separate button-press effects from perception and decision processes would be to use a no-report paradigm that allows for continuous monitoring of participants’ momentary percept without the need for behavioral reporting (Frässle et al., 2014). Yet since no-report paradigms for visual (Fahle et al., 2011; Naber et al., 2011) and auditory (Einhäuser, Thomassen, & Bendixen, 2017) multistability often rest on specific patterns (such as the OKN) obtained via eye-tracking, this might be challenging to reconcile with undistorted pupil dilation measurements.
Importantly, in the present data, the significant difference in pupil dilation between the random and replay conditions clearly indicates that pupil dilation carries a genuine perceptual component for auditory stimuli. Behavioral measures in the replay and multistable conditions were statistically indistinguishable in Experiment 2. The comparison of the pupil data from these two conditions thus remains valid. Similar to Experiment 1, and again contrary to hypothesis 3, pupil dilation was larger in response to a perceptual switch caused by a physical change (i.e., during replay) than to an endogenous re-interpretation of physically unchanging input (i.e., during multistability). In Experiment 2, we corrected the pupil data of the replay condition for the pupil dilation caused by the physical stimulus change (through the replay-passive condition). Indeed, physical stimulus changes did produce a non-negligible confound on the pupil dilation amplitude (see Supplement S3), which we eliminated by our subtraction approach. The resulting replay-corrected trace reflects cognitive components that are shared with those reflected in the pupil trace from the multistable condition: namely, making a decision (i.e., classifying the new percept and identifying the button that is associated with it) and pressing a button. Since the multistable condition carries the additional component of an endogenous perceptual switch, we had expected to find higher pupil dilation

amplitudes here than during replay (such a data pattern was observed numerically, though without a significant difference, by Hupé et al. (2009) for visual multistability). Since we can rule out possible confounds by physical stimulus changes in Experiment 2, we are left with the possibility that the pupil response in the replay condition is larger than that in the multistable condition because it is less temporally smeared. In other words, the interval from the perceptual change to the execution of the button press might be less variable during replay than during multistability because the perceptual transition is more distinct. As in Experiment 1, the steeper gradient of the pupil dilation during replay than during multistability supports this possibility. Future studies should explicitly address effects of temporal smearing and their possible influence on the pupil response during auditory multistability and stimulus replay. One way to do this would be to compare listeners with different degrees of training in reporting their auditory multistable perception, based on the idea that temporal smearing diminishes with training (Denham et al., 2014; Einhäuser, Thomassen, & Bendixen, 2017). With increasing amounts of training, the button press should be better synchronized to the actual perceptual change. If the temporal smearing account is valid, the pupil dilation amplitude during multistability should thus increase with training. Another possibility to assess the effects of temporal smearing is to introduce gradual rather than abrupt changes during the replay condition to mimic the gradual nature of perceptual transitions during multistability. Here, the pupil dilation amplitude observed during replay should decrease. In both cases, if hypothesis 3 holds, the amplitude difference in the pupil dilation between multistability and replay should vanish and eventually reverse – which would indicate a genuine contribution of the perceptual switch to pupil dilation. As a side-note, the stimulus used for the replay conditions in Experiments 1 and 2 proved to be easily identifiable after a short amount of training for most participants (42 out of 46 across both studies). The disambiguation via chirp sounds might be an interesting option for other studies of auditory multistable perception using variants of the auditory streaming paradigm (van Noorden, 1975). Such studies are often in need of response validity checks or catch trials (Farkas, Denham, Bendixen, & Winkler, 2016). Catch trials are mostly implemented by extremely small or large frequency differences between the ‘A’ and ‘B’ sounds to promote integrated or segregated percepts, respectively. Yet such catch trials may not be advisable if the main experimental manipulation also includes a variation of the ‘A’-‘B’ frequency difference because they might be too strongly suggestive of seemingly ‘correct’ behavioral responses during multistability. The disambiguation used in the present studies has the advantage of leaving the frequency difference of the core part of the sound intact, which might be a useful alternative for future studies.

Taken together, we conducted two studies that demonstrate a transient pupil dilation accompanying perceptual switches during auditory multistability. This complements previous observations for visual multistability (Einhäuser et al., 2008; Hupé et al., 2009). Unlike for visual multistability (Hupé et al., 2009), it is not yet possible to decompose the pupil dilation effect during auditory multistability into different components reflecting perceptual versus motor processes. Based on our findings, we developed specific recommendations for future studies that would allow such a decomposition by using improved control conditions. Isolating an unequivocal perceptual component of pupil dilation during auditory multistability would be important for interpreting the pupil as a marker of shared mechanisms contributing to perceptual multistability across modalities (Einhäuser et al., 2008).


C2.4 Supplement Study 1

C2.4.1 Supplement S1: Missing data

The core analysis of the two studies reported here rests on the examination of pupil diameter (PD) changes in a window from 2 s before to 2 s after each button press. As outlined in the main text, several factors can lead to missing data around a button press. These include blinks/saccades, gaze position outside the 10° square, beginning-of-block effects, and pruning to avoid double use of data. To assess the robustness of the data underlying the average PD traces, the time course of the amount of missing data per condition was examined (Figure 2.S1). Although the amount of missing data is substantial (ca. 20 %), this is typical of paradigms with high task demands and a fixation instruction, and the consistency of the sample-wise statistical testing indicates that results were not systematically distorted. There is a small peak of missing data right after the button press in most conditions, which is likewise typical. The only condition with a severe and unexpected amount of missing PD data was the random condition of Experiment 1. The massive data loss towards the edges of the analysis window was due to pruning because of button presses spaced too closely (median interval of 2.9 s between button presses, relative to a 4 s analysis window). Consequently, the PD data from this condition were excluded from further analysis.
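
For illustration, the per-sample percentage of missing data shown in Figure 2.S1 can be computed roughly as follows (a minimal sketch in Python; the variable names valid and press_idx are assumptions for this example, and the epoch-level pruning described above is omitted):

    import numpy as np

    def missing_per_sample(valid, press_idx, fs=1000, win_s=2.0):
        # valid: boolean array over the recording, True where a usable
        # pupil sample exists; press_idx: sample indices of button presses
        half = int(win_s * fs)
        epochs = [valid[p - half:p + half + 1]
                  for p in press_idx
                  if p - half >= 0 and p + half < len(valid)]
        epochs = np.asarray(epochs)          # n_epochs x n_samples
        return 100.0 * (1.0 - epochs.mean(axis=0))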

C2.4.2 Supplement S2: Pupil traces in the random condition of Experiment 1

To verify that the exclusion of Experiment 1’s random condition does not distort the overall results, the PD trace in this condition was analyzed and compared with the multistable and replay conditions on a descriptive level (Figure 2.S2). It is important to note that PD data are available for all 19 participants only from -1.246 s to +1.246 s relative to the button press. The severe loss of data towards the edges of the analysis window in the random condition (see Figure 2.S1) leads to discontinuities in the PD trace that must not be interpreted as actual “jumps” in pupil size. Rather, they result from individual participants no longer contributing to the group-average PD trace, as well as from severe reductions in the number of epochs in those participants who do still contribute to the group average. The unavailability of PD data for some participants in the interval used for baseline correction (-2 s to -1.8 s) made a baseline correction unfeasible for the group-average PD trace of the random condition.


Figure 2.S1. Amount of missing pupil diameter data between 2 s before and 2 s after the button press (panels A-F) or stimulus change (panel G), separately for each condition in Experiment 1 (left column) and Experiment 2 (right column). Missing data refers to the percentage of individual pupil traces that had to be excluded from the pupil diameter analysis relative to the number of epochs that contributed at least one data point.


Figure 2.S2. Pupil diameter data in the random condition of Experiment 1 between 2 s before and 2 s after the button press (gray). Pupil diameter in the replay (red) and multistable (blue) conditions is plotted for comparison. Notation as in Figure 4 of the main text. For the replay and multistable conditions, every participant contributes PD data to each time point. For the random condition, some single-participant PD traces are discontinuous towards the edges of the analysis window, leading to distortions of the group-average PD trace and rendering baseline correction unfeasible in this condition.

At a descriptive level, the pupil response in the random condition is shallower than in the multistable and replay conditions. This appears to be in line with hypothesis 2, but should not be interpreted as evidence with respect to this hypothesis. In addition to the above-mentioned issues with PD data availability, the fast pace of the button presses might lead to saturation of the pupil response, as it leaves insufficient time for recovery to baseline. This, in turn, might result in a smaller apparent response of the pupil to any given button press and thus confound the analysis. Hence, we refrain from inferences about the pupil response in the random condition of Experiment 1. Based on these observations, we modified the instruction for and the design of the random condition of Experiment 2.

C2.4.3 Supplement S3: Correction of the replay condition in Experiment 2

To separate effects of physical stimulus changes from effects of perceiving and responding to a stimulus change in the replay-active condition of Experiment 2, a replay-passive condition without response requirements was added as a control. Pupil data from the replay-passive condition were point-wise subtracted from the pupil data of the replay-active condition to yield the replay-corrected pupil trajectory (Figure 2.S3A, see main text for details of the subtraction procedure).
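
In essence, the correction amounts to a point-wise subtraction of two baseline-corrected group traces. A minimal sketch in Python (function and variable names are illustrative, not taken from the analysis code; both traces are assumed to be sampled at 1000 Hz and aligned from -2 s to +2 s around the button press):

    import numpy as np

    def baseline_correct(trace, fs=1000, t_start=-2.0, base=(-2.0, -1.8)):
        # subtract the mean over the baseline interval from the whole trace
        i0 = int((base[0] - t_start) * fs)
        i1 = int((base[1] - t_start) * fs)
        return trace - trace[i0:i1].mean()

    replay_corrected = (baseline_correct(replay_active_mean)
                        - baseline_correct(replay_passive_mean))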


Figure 2.S3. Pupil diameter in the replay-active condition (purple) and after correction by the replay-passive condition (red, ‘replay-corrected’ as used in the main text). A) Pupil diameter between 2 s before and 2 s after the button press, z-normalized and subtractively baseline-corrected to a mean of 0 between -2 s and -1.8 s; solid lines: mean, shaded areas: standard error of the mean; the red line corresponds to Figure 7. B) Jackknifed traces (19 per condition), red traces correspond to Figure 8A. C) Maximal amplitudes of the jackknifed traces, red data points correspond to Figure 8B.

We compared the replay-active and replay-corrected pupil traces in terms of their maximal amplitude as extracted via jackknifing (Figure 2.S3B-C). This revealed a higher amplitude in the replay-active (M = 1.07, SD = 0.020) than in the replay-corrected condition (M = 0.84, SD = 0.021). This difference was significant as confirmed by a jackknifing-corrected t-test, tcorrected(18) = 3.93, p < .001. This difference between replay-active and replay-corrected shows that physical stimulus changes produce a non-negligible enhancement of the pupil dilation amplitude. Consequently, the correction applied in Experiment 2 for the analysis reported throughout the main text is necessary to render the replay condition an adequate comparison for the multistability condition, where there is no physical stimulus change.
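
The jackknifing-corrected t-test follows the standard logic for statistics computed on leave-one-out grand averages (e.g., Miller, Patterson, & Ulrich, 1998): because jackknifed values vary far less than individual values, the conventional t statistic is divided by n - 1. A minimal sketch in Python (array names are illustrative; amp_active and amp_corrected would hold the 19 jackknifed maximal amplitudes per condition):

    import numpy as np
    from scipy import stats

    def jackknife_corrected_ttest(a, b):
        # paired t-test on jackknifed (leave-one-out) estimates;
        # dividing by (n - 1) undoes the artificial variance reduction
        n = len(a)
        t_raw, _ = stats.ttest_rel(a, b)
        t_corr = t_raw / (n - 1)
        p = 2 * stats.t.sf(abs(t_corr), df=n - 1)
        return t_corr, p

    # t_corr, p = jackknife_corrected_ttest(amp_active, amp_corrected)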

C2.4.4 Supplement S4: Surrogate analysis

The key finding of the two studies reported here pertains to pupil dilations around the timepoint of button presses. It is important to make sure that these pupil dilations do not reflect statistical artifacts, such as regular patterns in the underlying pupil trace that are linked to regular button-press patterns. This was tested by surrogate data generation, for which we randomly shuffled the button presses within each block. The timepoint of the first sound acted as the reference point. The number and durations of button presses were kept, but their order was shuffled, and pupil diameter traces were cut out relative to these surrogate button-press times. The traces were then processed exactly like the actual pupil traces (see main text). The shuffling procedure was repeated 100 times for each condition. This surrogate analysis shows that the pupil dilation effect is abolished when randomly re-positioning the button presses (Figure 2.S4). This visual impression was quantified by examining whether any of the 100 surrogate traces contained samples whose pupil amplitude significantly differed from zero after FDR correction. The maximum number of traces in any condition with at least one significant sample was 4 out of 100, which can be regarded as spurious. Hence, we can rule out the possibility that statistical artifacts underlie the pupil effects reported in the main text.
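
One simple way to implement such within-block shuffling is to permute the intervals between successive press onsets, keeping the number of presses fixed and the first sound as the temporal reference (a sketch in Python; the procedure described above additionally preserves press durations, and all names are illustrative):

    import numpy as np

    rng = np.random.default_rng(seed=1)

    def surrogate_onsets(onsets, block_start=0.0):
        # permute inter-press intervals -> surrogate press times with
        # the same number of presses and the same interval distribution
        intervals = np.diff(np.r_[block_start, np.sort(onsets)])
        return block_start + np.cumsum(rng.permutation(intervals))

    # repeat 100 times, cut epochs at the surrogate times, and process
    # them exactly like the real data to obtain Figure 2.S4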

Figure 2.S4. Surrogate analysis for the pupil diameter data between 2 s before and 2 s after the button press, separately for each condition in Experiment 1 (left column, panels A and B) and Experiment 2 (right column, panels C through E). In each condition, the actual pupil effects (green traces, replotted from Figures 4 and 7) by far exceed the surrogate pupil effects (blue/red/black traces). This is true for every single one of the 100 surrogate traces.


C2.4.5 Supplement S5: Soundtrack of the auditory stimulus material

This supplement can be retrieved from: https://www.tu-chemnitz.de/physik/PHKP/multistable/DissertationGrenzebach/index.html.

S5A. 4-min ambiguous auditory stimuli followed by 1-min response verification, duration 5 min
URL: http://www.tu-chemnitz.de/physik/PHKP/multistable/DissertationGrenzebach/Studie1/S5A.mp4
File Format: mp4/MPEG-4-Audio
Size: 4.7 MB

S5B. Disambiguated auditory stimulus: int segment (chirps for replay), duration 1 min
URL: http://www.tu-chemnitz.de/physik/PHKP/multistable/DissertationGrenzebach/Studie1/S5B.mp4
File Format: mp4/AAC
Size: 1 MB

S5C. Disambiguated auditory stimulus: seg_both segment (chirps for replay), duration 1 min
URL: http://www.tu-chemnitz.de/physik/PHKP/multistable/DissertationGrenzebach/Studie1/S5C.mp4
File Format: mp4/AAC
Size: 1 MB

S5D. Disambiguated auditory stimulus: seg_low segment (chirps for replay), duration 1 min
URL: http://www.tu-chemnitz.de/physik/PHKP/multistable/DissertationGrenzebach/Studie1/S5D.mp4
File Format: mp4/AAC
Size: 1 MB

S5E. Disambiguated auditory stimulus: seg_high segment (chirps for replay), duration 1 min
URL: http://www.tu-chemnitz.de/physik/PHKP/multistable/DissertationGrenzebach/Studie1/S5E.mp4
File Format: mp4/AAC
Size: 1 MB


CHAPTER 3

Study 2

“Multimodal moment-by-moment coupling in perceptual bistability”

This chapter will be submitted in similar form for publication as: Grenzebach, J., Wegner, T., Einhäuser, W., & Bendixen, A., “Multimodal moment-by-moment coupling in perceptual bistability” (manuscript in preparation).

C3.1.1 Experiment 1: Introduction

Ambiguous stimuli can evoke bistable or multistable perception in all sensory modalities (Schwartz et al., 2012). Visual examples include dynamic moving plaids (Wallach, 1935) and static figures like the Necker cube (Necker, 1832); the most prominent auditory examples are verbal transformations (Warren, 1961; Warren & Gregory, 1958) and auditory streaming (van Noorden, 1975). Under prolonged stimulation with such physically constant, yet ambiguous sensory information, two (bistability) or more (multistability) perceptual interpretations alternate in awareness. The change from one dominating perceptual interpretation (percept) to another is called a transition or switch. The phenomenon of bi- or multistability opens a fascinating window into the interpretational character of perception: we do not simply process “raw” sensory information, but we constantly evaluate and (re-)evaluate it (Leopold & Logothetis, 1999). Due to a variety of similarities in the reported switching patterns across modalities (Pressnitzer & Hupé, 2006), a universal principle or shared mechanism producing multistable perception across modalities is under consideration (Long & Toppino, 2004; Sterzer et al., 2009; Walker, 1978). Here we look at how similar bistable perception is across modalities from a new perspective: we study moment-by-moment coupling of perceptual interpretations across modalities by asking participants to directly report their perception of an auditory bistable stimulus, while their perception of a visual bistable stimulus is indirectly measured via their eye movements. This allows us to investigate perceptual multimodal moment-by-moment coupling without response interference. Previous research linking several types of bi- or multistability within one modality or across modalities has often made use of inter-individual variation. There are huge inter-individual differences in certain properties of multistable perception, such as the rate of perceptual switching, the average duration of experiencing a percept, or the preference of one percept over another (Denham et al., 2014). A number of studies have asked whether these properties are consistent when exposing one individual to several different types of multistability. By means of correlation techniques, they looked at whether the relative position of an individual (i.e., being a fast or a slow switcher compared to a reference sample) reproduces across a range of multistable phenomena. The outcome of these studies has been somewhat mixed (Denham et al., 2018; Kondo et al., 2018; Pressnitzer & Hupé, 2006), depending, amongst other factors, on the type of stimuli being compared and on statistical power. Yet, more and more studies point towards the existence of such correlations (Brascamp, Becker, et al., 2018; T. Cao et al., 2018; Einhäuser et al., 2020; Kondo et al., 2012; Pastukhov et al., 2019). Inter-individual correlations in perceptual bi- and multistability indicate that there are common factors affecting perceptual interpretations across stimuli and modalities. Yet, such common factors can be quite remote from the process of perceptual (re-)interpretation (Kleinschmidt et al., 2012): individual consistencies in multistable perception have been linked to personality traits (Farkas, Denham, Bendixen, Toth, et al., 2016), neurotransmitter concentrations (Kashino & Kondo, 2012; Kondo et al., 2017), and genetic factors (Schmack et al., 2013).
The observed correlations might thus reflect more general perceptual or decision tendencies of a person; they are not necessarily indicative of a specific perceptual mechanism that governs the switching process on a moment-by-moment basis, and that might or might not be shared across phenomena and modalities (Hupé et al., 2008). Only a few studies specifically address whether there is a common mechanism of perceptual switching on a moment-by-moment basis (Hupé et al., 2008). Such studies usually present two or more multistable stimuli at the same time and ask participants to respond to both (or all) of them in parallel. Participants’ reports about their perception of each multistable stimulus are then screened for an above-chance co-occurrence of switches or of corresponding percepts (Adams & Haire, 1959; Flügel, 1913; Grossmann & Dobbins, 2003; Hupé et al., 2008; Ramachandran & Anstis, 1983; Toppino & Long, 1987). A typical observation is that co-occurrences do emerge, but at a frequency much lower than theoretically possible, which speaks against a strong coupling induced by a strictly common mechanism (Kubovy & Yu, 2012). These prior studies all face the challenge of posing a dual-task requirement: participants must monitor and report their perception of several stimuli at once. This involves interference at various processing levels, including non-perceptual stages such as decision and response interference. Another challenge lies in the difficulty of distinguishing between joint perceptual switching driven by a common mechanism on the one hand, and perceptual grouping phenomena acting after a switch for one of the stimuli on the other hand (V. Conrad et al., 2012; Hupé et al., 2008). Further, the power to find moment-by-moment coupling between percepts in two unrelated multistable stimuli is limited by independent differences in individual tendencies towards one or the other perceptual interpretation of either stimulus.

In the current study, we overcome these challenges by simultaneously presenting two bistable stimuli (a visual and an auditory one) but asking for behavioral responses on only the auditory stimulus, while measuring perception of the visual stimulus via eye tracking. We chose a visual stimulus whose perceptual alternatives are accompanied by markedly different eye movements, specifically by the optokinetic nystagmus (OKN). The OKN is characterized by a sawtooth-like pattern of alternating slow and fast eye movements, where the eye in the slow phases follows the currently perceived direction of motion. The employed visual stimulus was a visual pattern-component rivalry stimulus (Movshon et al., 1985; Wallach, 1935; this specific version similar to Wilbertz et al., 2018), in which two overlaid drifting gratings are presented that can either be perceived as one passing in front of the other (“component” or segregated percept) or as a unitary plaid (“pattern” or integrated percept) (Figure 3.1B, movie in Supplement T1A). The motion direction of the components is distinctly different from the motion direction of the plaid, such that the OKN gives an unobtrusive, en passant readout of the current percept (Frässle et al., 2014). We combined this visual stimulus, which lends itself to such a readout, with a support vector machine (SVM)-based classifier that allows for a moment-by-moment classification of the current percept from the eye movements (Wilbertz et al., 2018). The simultaneously presented auditory stimulus was chosen according to the classical auditory streaming paradigm (van Noorden, 1975). This involves two tones (A and B) arranged in a repeating “ABA_” pattern, with “_” denoting a silent gap equivalent to tone B in duration (Figure 3.1A). This arrangement can be perceived as coming from one sound source producing both the A and B tones (integrated percept, ABA_ABA_...), or from two sound sources producing the A and B tones in an interleaved manner (segregated percept, A_A_A_... and _B___B__...). Thus, in both modalities, there was an integrated (one visual plaid / one sound source) and a segregated (two visual components / two sound sources) interpretation. If a common mechanism of perceptual (re-)evaluation exists on a moment-by-moment basis, we should find co-occurrences of integrated and segregated percepts across modalities (Einhäuser et al., 2020; Pressnitzer & Hupé, 2006). To have enough room for finding such co-occurrences, it is important to rule out asymmetric distributions of the percepts between modalities. If the proportion of integrated/segregated percepts differs between the auditory and visual modality, this puts an upper limit on the time during which co-occurring percepts can be observed – even under ideal coupling, perfect (100 %) correspondence cannot be reached. To leave more room and thus increase statistical power for finding moment-by-moment coupling if there is any, we carefully pre-adjusted the stimulus parameters to yield a 50/50 % distribution of integrated and segregated percepts, separately for each individual and in each modality. The adjustments were based on changing the frequency difference between the A and B tones (Denham et al., 2013a) and on changing the enclosed angle (“opening angle”) between the gratings (Carter et al., 2014; Hedges et al., 2011; Hupé & Pressnitzer, 2012; Wegner et al., 2021). These parameters have been shown to affect the relative proportion of integrated and segregated percepts. With this stimulus configuration, which is tailored towards a fair test of multimodal moment-by-moment coupling in perceptual bistability, we investigate the hypothesis that integrated percepts and segregated percepts in the two modalities tend to co-occur. This, in turn, would be evidence for a partly common, supra-modal mechanism in control of perceptual bistability.

C3.1.2 Experiment 1: Results

In the main part of the experiment, participants (N = 25) were simultaneously presented with a multistable auditory streaming stimulus (Figure 3.1A) and a multistable visual pattern-component stimulus (Figure 3.1B, Supplement T1A). During the eight blocks of this bimodal condition, participants reported their auditory perception as either integrated or segregated (Figure 3.1C, D), while merely looking at the visual stimulus. Their visual perception was decoded from their eye movements with an SVM classifier, using a slightly adapted version of the procedure introduced by Wilbertz et al. (2018). To train the SVM classifier and to evaluate its ability to decode the visual percept, participants completed a unimodal visual condition three times: prior to the bimodal condition, after the fourth block of the bimodal condition, and after the bimodal condition. In this condition, only the visual stimulus was presented and participants continuously reported their visual percept as integrated (plaid) or segregated (gratings). To re-accustom participants to reporting their auditory percept and to assess the effects of bimodal presentation, unimodal auditory blocks either preceded or followed the first and the last unimodal visual block.


Figure 3.1. Stimuli and paradigm. A) Auditory streaming stimulus; the frequency distance between tones A and B was adjusted individually between 1 and 14 semitones; fm denotes their geometric mean (534 Hz in all participants). B) Multistable visual stimulus. The opening angle was adjusted individually to a value between 35° and 55° (45° shown). C) Schematic of the two perceptual interpretations of the auditory streaming stimulus in A, as used for instructing participants. D) Schematic of the two perceptual interpretations of the visual stimulus, as used for instructing participants. E) Disambiguated auditory stimuli; “fA”, “fB” and “fm” match the individually adjusted tones in the multistable version (panel A); fA-Δ and fB+Δ are chosen such that the distance in semitones (i.e., in log space) is identical, resulting in constant ratios in linear frequency space: fA/fA-Δ = fm/fA = fB/fm = fB+Δ/fB. F) Disambiguated visual stimuli. G) Order of blocks (the order of unimodal blocks on day 2 is reversed in half of the participants). H) Scheme of SVM training and testing; shaded boxes refer to reported results.


In all conditions, the multistable presentation lasted 180 s (“multistable part”) and was immediately and seamlessly followed by a “disambiguated part” of about 30 s. In this part, participants continued to report as in the multistable part (i.e., their visual perception in the unimodal visual blocks, their auditory perception in all other blocks), but one of the stimuli was disambiguated: the visual stimulus (Figure 3.1F, movie in Supplement T1B) in the unimodal visual blocks and half of the bimodal blocks, and the auditory stimulus (Figure 3.1E, audio tracks in Supplement T2A, B) in the unimodal auditory blocks and the other half of the bimodal blocks. The disambiguated stimulus alternated between suggesting an integrated perception and suggesting a segregated perception, such that in each disambiguated part each percept was suggested twice (for about 7.5 s each). To ensure a roughly equal amount of segregated and integrated percepts in the multistable parts in both modalities, the parameters of the visual and auditory stimuli had been adjusted for each individual in a separate session prior to the main experiment (for details see section Individual Adjustment of Stimuli; Figure 3.1G for an overview of the timeline).

Unimodal blocks: Response verification

For multistable stimuli, by definition, the reported percept is subjective and there is no “correct” or “incorrect” response. For the disambiguated parts, in contrast, correctness is meaningful and responses can be validated whenever the participant reports on the disambiguated modality. The disambiguated parts in unimodal visual blocks and unimodal auditory blocks thus served for response verification. To be included in the main analyses, a participant had to report the correct percept for at least 70 % of the time during which they reported one percept exclusively in the response verification part (“hit rate”). Of the 25 participants, all fulfilled this criterion for the visual unimodal blocks (mean hit rate: M = 91.5 % (standard deviation, SD = 2.6 %), range 85.4 % - 95.4 %). In the auditory unimodal blocks, seven participants failed to reach this criterion, while the remaining 18 were above the 70 % threshold (mean hit rate for N = 25: 76.4 % (SD = 14.0 %), range 46.4 % - 89.8 %; mean hit rate after exclusion, i.e., for N = 18: 84.0 % (SD = 5.2 %)).
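
Computationally, the hit rate is simply the fraction of exclusively-reported time during the disambiguated part in which the report matches the suggested percept. A minimal sketch in Python (all array names are illustrative; the three inputs are sample-wise arrays over the disambiguated part):

    import numpy as np

    def hit_rate(report, suggested, exclusive):
        # report/suggested: percept codes per sample (e.g., 0/1);
        # exclusive: True where exactly one response button was held
        report, suggested, exclusive = map(np.asarray,
                                           (report, suggested, exclusive))
        return (report == suggested)[exclusive].mean()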

Unimodal visual blocks: Decoding accuracy

Each participant performed three unimodal visual blocks. Following Wilbertz et al. (2018), one of the three blocks was used to train the SVM classifier with different cost-parameter settings (11 models per block and participant), another to choose the optimal cost parameter, and the third to evaluate the performance of the model with the thus chosen cost parameter. By this setting, the training set, the set for choosing the cost parameter, and the set used for evaluation are independent. The role of the three unimodal visual blocks was permuted within a participant, such that six different models per participant were retained for further analysis (Figure 3.1H). For each individual, we used each of these six SVM classifiers to predict the percept in the unimodal visual block the specific classifier had neither been trained nor optimized on. We then defined a class-adjusted classification accuracy, which first computes the accuracy for both classes (i.e., percepts) separately and then averages them, making the measure insensitive to bias in the ground truth (i.e., the reported percepts). When averaging over the six models per individual, mean accuracy across all participants was 82.9 % (SD = 9.9 %) and thus well above chance [t(24) = 16.67, p < .001; one-sample t-test against 50 %, Figure 3.2]. This conceptually replicates the finding that this SVM method can be applied to robustly assess the percept of a multistable plaid-grating stimulus on a moment-by-moment basis (Wilbertz et al., 2018). While all participants showed decoding accuracies above chance (Figure 3.2), four participants remained below the a priori defined inclusion criterion of at least 70 % in this accuracy measure and were excluded from further analysis. Two of these had not met the 70 % criterion on hit rates either. Hence, 16 participants were retained for further analysis.
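
The class-adjusted accuracy corresponds to what is often called balanced accuracy: the per-class accuracies are computed first and averaged afterwards, so that an unequal base rate of the two percepts cannot inflate the score. A minimal sketch in Python (names illustrative; scikit-learn's balanced_accuracy_score implements the same idea):

    import numpy as np

    def class_adjusted_accuracy(y_true, y_pred):
        # mean of the per-class accuracies; chance level stays at 50 %
        # for two classes even if one percept dominates the labels
        y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
        per_class = [np.mean(y_pred[y_true == c] == c)
                     for c in np.unique(y_true)]
        return float(np.mean(per_class))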

Figure 3.2. Individual decoding accuracy in unimodal visual blocks and exclusion criteria. Bars denote the mean over the six models, black circles the individual models. Data are sorted by mean accuracy; horizontal lines indicate chance level (50 %) and the a priori exclusion criterion (70 %). Dark grey bars: included participants; open bars: participants excluded based on decoding accuracy; light grey bars: individuals excluded based on hit rate; hatched bars: exclusion based on both criteria.

Unimodal blocks: Behavioral data

In the unimodal visual blocks, participants reported one of the two percepts exclusively for 97.2 % (SD = 2.7 %) of the multistable part. On average, the percentage of segregated percepts was about half (48.2 %, SD = 6.4 %) and not significantly different from 50 %, t(15) = 1.10, p = .29, with individuals ranging from 35.6 % to 61.7 % segregated (i.e., 38.3 % to 64.4 % integrated) reports. In the unimodal auditory blocks, participants reported one of the two percepts exclusively for 97.1 % (SD = 3.0 %) of the multistable part. On average, the percentage of segregated percepts was 59.7 % (SD = 18.3 %) and not significantly different from 50 %, t(15) = 2.12, p = .051, with individuals ranging from 32.8 % to 90.1 % segregated reports. Together, this confirms that the adjustment procedure to yield about equal numbers of segregated and integrated percepts was successful in both modalities, though in the auditory modality there was a trend towards more segregation.

Bimodal blocks: Behavioral data

In all bimodal blocks, participants reported their auditory percept throughout. In half of the blocks, the disambiguation in the last 30 s pertained to the auditory stimulus. Hence, these four parts can serve as additional response verification parts. Hit rate was 87.1 % (SD = 3.2 %) on average, which was slightly larger than the hit rate in the unimodal auditory condition for the 16 included participants (M = 84.1 %, SD = 5.4 %), t(15) = 2.38, p = .031, paired t-test. This indicates that the presence of the visual stimulus does not disrupt the ability to report the auditory percept; if anything, there is a slight improvement. In the multistable parts, neither the proportion of segregated reports (M = 59.6 %, SD = 15.8 % compared to M = 59.7 %, SD = 18.3 %, t(15) = 0.06, p = .96) nor the median duration of each perceptual phase differed between unimodal auditory blocks (M = 7.99 s, SD = 5.38 s) and bimodal blocks (M = 8.36 s, SD = 6.37 s), t(15) = 0.35, p = .73. This indicates that the presence of a visual stimulus does not alter auditory multistability on average, but leaves open whether there is a moment-by-moment influence of one modality on the other.

Bimodal blocks: Relating the decoded visual percept to the auditory report

In the bimodal blocks, participants reported their auditory perception while their visual perception was decoded using the SVM classifiers trained on the unimodal visual blocks. To arrive at a measure of moment-by-moment consistency between visual and auditory perception, we used the SVM classifiers to “decode” the auditory percept. That is, we used the auditory reports as labels for testing the SVM, as if they were referring to the corresponding visual percept – visual integration (plaid) to auditory stream integration, visual segregation (gratings) to auditory stream segregation. If the auditory and the visual percept were perfectly identical (whenever a participant reports auditory integration they perceive a plaid, whenever they report auditory segregation they perceive the gratings), this measure would be equal to the decoding accuracy in the unimodal visual blocks (Figure 3.2; Figure 3.3A, black lines). Conversely, if the visual and the auditory percept were entirely unrelated on a moment-by-moment basis, this measure would be at chance (50 %; here it is especially critical that the accuracies are first determined by class [percept] and then averaged, as otherwise the chance level would be inflated if an individual has common biases in both modalities). We computed the consistency between reported auditory and decoded visual percept for each of the six models and eight bimodal blocks and then averaged these 48 values within a participant. The resulting mean consistency across observers (53.5 %, SD = 6.1 %, range: 46.2 % to 65.7 %) was significantly above chance, t(15) = 2.33, p = .035. This demonstrates a moment-by-moment coupling between visual and auditory multistability.

Figure 3.3. “Decoding” of auditory percept from visual data. A) Bars: consistency between the reported auditory percept and the decoded visual percept, as measured by the “decoding” of the auditory percept using the visual model. Individuals are sorted as in Figure 3.2. Black line: decoding accuracy of the same individual in the unimodal visual blocks (cf. Figure 3.2); blue line: hit rates of the individual in the response verification part of the multimodal blocks in which the auditory report was probed; red line: maximal consistency expected from the combination of decoding accuracy in unimodal visual blocks and hit rates in unimodal auditory blocks (noise ceiling). B) Data of panel A plotted as fraction of the noise ceiling.

While the consistency between decoded visual and reported auditory percept is clearly above chance, its absolute value may seem comparably small. It should be noted, however, that the theoretical maximum for this value is considerably below 100 %. To compute this maximum, we make two assumptions: (i) errors in reporting the auditory percept are independent of decoding errors of the visual percept (Figure 3.2 and Figure 3.3A, black), and (ii) the hit rate in the auditory response verification parts of the multimodal blocks (Figure 3.3A, blue) is representative of an individual’s ability to faithfully report their auditory percept during multimodal stimulation. Under these assumptions, we computed the theoretical maximal consistency (“noise ceiling”, Figure 3.3A, red). When expressing the consistency between decoded visual and reported auditory percept as a fraction of this theoretical maximum – that is, using a linear mapping to the interval between chance (50 %, mapped to zero) and the individual noise ceiling (mapped to 100 %) – we find values between -24.3 % and 45.5 % (Figure 3.3B). That is, when taking into account the noise resulting from imperfect report of the auditory percept and imperfect decoding of the visual percept, the consistency between auditory and visual percept reaches up to 45 % of the expected theoretical maximum.
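
Under assumptions (i) and (ii), one plausible way to formalize the ceiling is that an agreement is observable when decoding and report are both correct or both wrong; the normalized consistency is then a linear rescaling between chance and this ceiling. A minimal sketch in Python (this formalization is our illustration of the stated assumptions, not necessarily the exact computation underlying Figure 3.3):

    def noise_ceiling(decode_acc, hit_rate):
        # maximal observable consistency for perfectly coupled percepts:
        # visual decoding and auditory report both correct, or both wrong
        return decode_acc * hit_rate + (1 - decode_acc) * (1 - hit_rate)

    def normalized_consistency(consistency, ceiling, chance=0.5):
        # linear mapping: chance -> 0, individual noise ceiling -> 1
        return (consistency - chance) / (ceiling - chance)

    # example with round numbers: noise_ceiling(0.83, 0.87) ~= 0.74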

C3.1.3 Experiment 1: Discussion

We studied moment-by-moment correspondence of perceptual interpretations across modalities by having participants directly report their perception of an auditory bistable stimulus (auditory streaming; van Noorden, 1975), while measuring their perception of a visual bistable stimulus (pattern-component rivalry) indirectly via their eye movements (Frässle et al., 2014; Wilbertz et al., 2018). In line with the hypothesis of a shared mechanism contributing to multistable perception across modalities, our results demonstrate moment-by-moment coupling between visual and auditory bistability.

The most important advancement relative to previous studies on moment-by-moment coupling in bistability (Adams & Haire, 1959; Flügel, 1913; Grossmann & Dobbins, 2003; Hupé et al., 2008; Ramachandran & Anstis, 1983; Toppino & Long, 1987) is the elimination of response interference. This was possible in the current study by measuring only one of the percepts behaviorally. With behavioral measurements of two or more percepts in parallel, it remains unclear whether the observed coupling is an effect of perception per se, or rather related to monitoring the percepts as per task instruction. This distinction is critical, as distinct neural substrates subserve either function (Frässle et al., 2014). Here, in the main (multimodal) condition, only one modality (audition) was task-relevant and reported, while the percept in the other modality (vision) was inferred from eye movements. Besides this distinction at the level of mechanisms (perception per se versus monitoring of perception), the indirect measurement of the visual percept is also advantageous from the perspective of task design. Giving behavioral reports for two or more stimuli in parallel is challenging for the participants and might undermine the reliability of their reports (V. Conrad et al., 2012). In addition to the challenge of a dual task, response mapping must be carefully chosen to avoid compatibility effects at the level of the response effector or response action (button press vs. release). Our main behavioral measurements involved reporting the auditory percept only. In additional unimodal visual blocks, we had participants report their visual percept (for choosing suitable stimulus parameters and for training the SVM). These visual blocks constituted a small fraction of the experiment (3 out of 13 blocks in the main part, i.e., on the 2nd day), and the response buttons were different from those for reporting the auditory percepts. Moreover, the relative mapping of the buttons to integrated and segregated auditory percepts was counterbalanced across participants to avoid any systematic interference effects between visual and auditory response mapping.

One might argue that taking behavioral reports for different (auditory/visual) stimuli in separate blocks, with the main (multimodal) condition involving the simultaneous presentation of auditory and visual stimuli, might be confusing for participants and thus not necessarily constitute an advantage over dual-task requirements (Hupé et al., 2008). Indeed, if participants had erroneously responded to the visual rather than the auditory stimulus during multimodal blocks, this might produce an apparent cross-modal coupling of percepts that would in fact represent a misunderstanding of task requirements. To make sure that participants responded to the auditory (not the visual) stimulus during multimodal blocks, we appended response verification segments with disambiguated versions of either the auditory or the visual stimulus. Unlike the hit rate for the auditory disambiguated phases (84.1 % on average for the included N = 16), the “pseudo” hit rate for the visual disambiguated phases (i.e., the hit rate under the assumption that participants wrongfully reported their visual percept) was, at 51.1 %, not different from chance, t(15) = 0.73, p = .48. This clearly indicates that perceptual reports were made in response to the auditory stimulus, that is, participants complied with the instruction.

Having one modality task-irrelevant (which is possible through the visual no-report measure, Frässle et al., 2014) also prevents participants from applying certain response strategies that are difficult to avoid in the dual-task case. Such response strategies might artificially increase (e.g., deliberate switching at the same time) or decrease (e.g., deliberate abstention from switching at the same time) the probability of finding moment-by-moment coupling. Eliminating response strategies and response criteria (i.e., when to transform the constantly fluctuating perceptual information into a change of the binary integrated/segregated response) is notoriously difficult (Brascamp, Becker, et al., 2018). In the current study, using the OKN to assess visual perception allows for a criterion-free measurement, since the OKN is a reflex under little to no volitional control. Response criteria in the auditory reports remain, but they can no longer increase the measure of multimodal coupling – thus, if anything, they act against the hypothesis.

The advantage of measuring visual perception indirectly comes at the cost of a possibly imperfect moment-by-moment readout. We adopted a recently proposed procedure using an SVM (Wilbertz et al., 2018) for classifying the visual percept. The mean accuracy across all participants was 82.9 %, which is of the same order as the 87.66 % reported by Wilbertz, Ketkar, Guggenmos, and Sterzer (2018) and thus conceptually replicates their work. With >80 %, the classification is sufficiently robust for the current analyses. The main feature that allows this robust classification is the design of the specific pattern-component rivalry stimulus (Wilbertz et al., 2018), which is constructed in a way that the segregated (component) percept comes with a clear separation into foreground and background grating, and thus with an unambiguous motion direction. Relative to other versions of pattern-component rivalry (Einhäuser et al., 2020; Wegner et al., 2021), where the segregated percept consists of two equally strong components and thus two simultaneously perceived motion directions, this gives a much better OKN distinction between integrated (plaid/pattern) and segregated (component) percepts. Together with the SVM-based classifier (Wilbertz et al., 2018), this led to a robustness of moment-by-moment readout that allowed us to go substantially beyond our own previous investigations: in Einhäuser, da Silva, and Bendixen (2019), we had likewise taken direct measures of auditory perception and OKN-based indirect measures of visual perception. We showed that the average OKN gain in the motion direction of the integrated visual percept was higher during integrated than during segregated auditory percepts. We were, however, not able to show this coupling on a moment-by-moment level, and we were not able to show any coupling for the converse case (i.e., in the motion direction of the segregated visual percept, because there were two motion directions as denoted above). The moment-by-moment coupling observed in the current study thus considerably strengthens the evidence in favor of a partly supra-modal mechanism of perceptual bistability.

Another important difference between the current study and Einhäuser, da Silva, and Bendixen (2019) is the individual adjustment of stimulus parameters (frequency difference of the tones in audition, opening angle of the gratings in vision) such that integrated and segregated percepts would be equally likely in each participant and modality. The adjustment procedure produced a balanced distribution of visual percepts, while it was not entirely successful for auditory bistability. Future studies should improve the auditory parameter adjustment; the adjustment part most likely needs to be prolonged to allow for bistability to stabilize (Denham et al., 2014). Importantly, the balanced results for visual bistability successfully abolished any correlations between the proportions of integrated/segregated percepts in the visual and auditory modalities across participants (r < .1, p > .76, N = 16). This rules out that participants who show consistent biases in both modalities dominate the overall measure of coupling (though such asymmetries could also be accounted for in the analysis; see Hupé et al., 2008). The current outcome that visual perception was symmetric while auditory perception tended to be asymmetric in many participants in fact reduced the chance of finding crossmodal coupling. This – in addition to limitations in auditory report quality and in the accuracy of visual decoding – put an upper limit on the moment-by-moment coupling that could be obtained (i.e., the expectable maximum was far below 100 %). It is remarkable that above-chance coupling was nevertheless observed in the majority of participants. Future studies should mainly improve on the symmetry of auditory percept distributions (via suitable parameter choices) and on the quality of auditory report, both of which can be achieved by a longer adjustment and training procedure (Denham et al., 2014; Einhäuser, Thomassen, & Bendixen, 2017). Here, we deliberately invited participants with little to no experience in such paradigms to avoid effects of excessive training. We predict that with improvements in terms of a controlled amount of training, even higher degrees of cross-modal coupling will be observable, which will be beneficial for studying the involved mechanisms in more detail.

The degree of cross-modal coupling observed herein is highly similar to the “common time” of corresponding percepts found by Hupé et al. (2008) for their combination of auditory streaming and pattern-component rivalry (see their Figure 3.5A, “plaid” condition). The current study transfers this from a dual-task situation (Hupé et al., 2008) to a measurement without response interference, which indicates that true perceptual effects underlie the co-occurrence of corresponding percepts.

It is tempting to speculate on the mechanisms of this multimodal coupling. Although there are corresponding perceptual interpretations (integrated / segregated), the auditory and visual stimuli were not so similar to each other as to involve strong perceptual grouping effects. In particular, the motions of the plaid or gratings were not in any way coupled to the timescales of the sounds (unlike, for instance, in Einhäuser, Thomassen, and Bendixen (2017), where we explicitly induced such temporal coupling). We therefore consider it unlikely that the two stimuli were bound into one perceptual object, as if the plaid were emitting the ABA_ sequence and the gratings were emitting the separate A and B sequences. Previous studies explicitly inducing perceptual grouping (e.g., Conrad et al., 2012; the same-orientation condition of Adams & Haire, 1959; the “motion” condition of Hupé et al., 2008) found a much higher degree of co-occurrence between the percepts. Here, we deliberately used percepts that are compatible (one object / two objects in each modality), but do not strongly group together. That we nevertheless observe cross-modal coupling indicates a more global tendency to perceive corresponding percepts when presented with bistable stimuli in two modalities simultaneously.

Finally, although we clearly demonstrate moment-by-moment coupling (i.e., co-occurrence of corresponding percepts), we do not investigate any causal effects, that is, whether the perceptual switches in both modalities happen at the same time, and if so, which modality takes the lead in such switches. Future studies can assess more specifically whether an induced perceptual change in one modality directly leads to a re-interpretation in the other modality, using the stimuli and methods employed herein. This will link our interference-free measurement even more closely with previous studies investigating similar questions in a dual-task setting (Hupé et al., 2008).

In conclusion, we have reported evidence for multimodal moment-by-moment coupling in perceptual bistability. The coupling is not perfect: there is some degree of independence in perceptual decisions across modalities. This is consistent with the results of previous studies investigating such coupling in dual-task settings, and rules out response interference as a possible confounding factor. The current results are well in line with models postulating a combination of local and global mechanisms to account for bistable perception (Carter & Pettigrew, 2003; Klink et al., 2008; Long & Toppino, 2004; Noest et al., 2007; Sterzer et al., 2009; Walker, 1978).

C3.1.4 Experiment 1: Material and Methods

Participants

A total of 25 volunteers (9 men, 16 women, age: 20-31, mean age: 23.4 years) were recruited from the TU Chemnitz community and participated in the experiment. After applying the a priori defined inclusion criteria (at least 70 % hit rate in the unimodal blocks and at least 70 % classification accuracy across the unimodal visual blocks), data of 16 participants (5 men, 11 women, age: 20-31, mean age: 23.6 years) were included in the main analysis.

Number of participants, inclusion and stopping criteria

The number of participants to be included in the analysis had been set to 16 prior to the experiment. Directly after the second session of each individual, we determined hit rates in the unimodal blocks and classification accuracy in the unimodal visual blocks, while remaining blind to all other data (except for a visual inspection of the eye-tracking data quality). If a participant did not meet the 70 % criterion for either of these measures, a replacement was added until 16 participants met all inclusion criteria.

Setup

Experiments were conducted in a sound-attenuated room with no light source other than the visual display. All visual and auditory stimuli were presented using a 24-inch “ViewPixx/3D Full” LCD display (VPixx Technologies Inc., Saint-Bruno, QC, Canada), which was located 55 cm from the participant’s head. The monitor operated at 120 Hz and 1920 x 1080 pixels for the visual display and at 48 kHz for its auditory output. The auditory stimuli were presented through “HD25-1-II”, 70 Ω, headphones (Sennheiser, Germany). Responses were recorded by the display device using the manufacturer’s “ResponsePixx” 5-button box, whose buttons are arranged in a “plus”-shaped pattern (one in the middle, one in each direction). Gaze direction of the right eye was recorded at 1000 Hz by an “EyeLink-1000 Plus” eye-tracking system (SR Research, Ottawa, ON, Canada) in “tower mount” configuration. The eye tracker was calibrated and the calibration validated prior to each visual block on day 1 and prior to each block on day 2. Participants rested their head on a padded chin rest and a padded forehead rest to minimize head movements during the experiment. For control of the experiment, MATLAB R2016a (MathWorks, Natick, MA) with its Psychophysics toolbox (Brainard, 1997; Kleiner et al., 2007; Pelli, 1997; see http://psychtoolbox.org) and EyeLink toolbox (Cornelissen et al., 2002) was used.

Stimuli

Auditory stimuli. As ambiguous stimulus, we used a variant of the auditory ABA_ streaming stimulus (van Noorden, 1975). Pure tones of two different frequencies (fA and fB) were presented in alternation for 130 ms each, followed by a 20 ms silence (i.e., at a stimulus onset asynchrony [SOA] of 150 ms), where every other “B” tone was replaced by silence (Figure 3.1A). Tones had an A-weighted sound-pressure level of 70 dB(A), as measured by a hand-held “Type 2270” sound level meter and “Type 4101-A” binaural microphones (Bruel & Kjaer, Denmark). Tones started and ended with 5-ms raised-cosine amplitude ramps. Frequencies were adjusted per individual, but the center frequency – the geometric mean fm of fA and fB – was fixed at 534 Hz. That is, the frequencies had equal ratios on a linear scale (fm/fA = fB/fm), corresponding to equal distances on a log scale (log fm – log fA = log fB – log fm). For the response verification parts that checked the validity of participants’ reports, we constructed disambiguated versions of the auditory stimulus (as in Study 1). To simulate an integrated percept, the 130 ms tones were replaced as follows (Figure 3.1E, left; audio track in Supplement T2A): tone A was replaced by a chirp that started at the center frequency fm, descended linearly to fA over 30 ms, remained at fA for 70 ms, and then rose again to fm over 30 ms; tone B was replaced by the reverse pattern (30 ms rise from fm to fB, 70 ms at fB, 30 ms drop to fm). To simulate a segregated percept, the pattern was reversed (Figure 3.1E, right; Supplement T2B): tone A was raised from a frequency fA-Δ to fA and dropped back to fA-Δ, while tone B started at fB+Δ, dropped to fB, and then rose again to fB+Δ. In both cases, fA, fm and fB matched the ambiguous case; fA-Δ and fB+Δ were chosen such that the distance in log frequency (semitones) between all subsequent frequency levels was identical, i.e.,

log fA – log fA-Δ = log fm – log fA = log fB – log fm = log fB+Δ – log fB,

which is equivalent to

fA/fA-Δ = fm/fA = fB/fm = fB+Δ/fB

in linear space.
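
For concreteness, all frequency levels follow from the geometric-mean constraint: with an individually adjusted A-B distance of df semitones, each adjacent pair of levels is separated by df/2 semitones. A minimal sketch in Python (df = 7 is just an example value):

    fm = 534.0              # geometric mean of fA and fB, in Hz
    df = 7                  # A-B frequency distance in semitones (example)
    r = 2 ** (df / 24)      # linear ratio corresponding to df/2 semitones

    fA = fm / r             # lower tone
    fB = fm * r             # upper tone
    fA_low = fA / r         # chirp endpoint fA-Delta below fA
    fB_high = fB * r        # chirp endpoint fB+Delta above fB

    # equal ratios throughout:
    # fA/fA_low == fm/fA == fB/fm == fB_high/fB == r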

Visual stimuli. As multistable visual stimulus, we used two overlapping drifting square-wave gratings (Figure 3.1B, movie in Supplement T1A) akin to those used in Wilbertz et al. (2018), except that the drift direction was rotated by 180° and that the angle between the gratings (“opening angle”) was adjusted per individual. The orientation of the pattern was such that the plaid formed by the gratings always drifted horizontally to the right; that is, both gratings formed the same angle with the horizontal, but with opposite signs. The gratings were black (<0.1 cd/m2) and dark grey (6 cd/m2) on a light grey (12 cd/m2) background and presented in a circular aperture with a diameter of 7.5 degrees of visual angle. Gratings had a duty cycle of 0.28 and a spatial frequency of 0.8 cycles per degree, and drifted at 1.9 degrees per second perpendicular to their orientation, the black grating downward and the grey grating upward. Intersections between the gratings were colored black, such that the black grating is typically perceived to drift on top of the grey grating (i.e., in the foreground). Disambiguated visual stimuli (Figure 3.1F) were formed either by coloring the intersections white (24 cd/m2) to induce an integrated (plaid) percept, or by removing the grey grating to induce a segregated (component) percept (Supplement T1B). During unimodal auditory blocks, a dark grey disc was presented in place of the aperture on a light grey background.

Experimental Procedure

Day 1 – Instruction and stimulus adjustment. The experiment was split over two sessions that were conducted on separate days with, on average, 2.5 days in between (Figure 3.1G). The first session started with extreme cases of the auditory stimulus to demonstrate the possible percepts: a frequency difference of 1 semitone for the integrated percept, 13 semitones for the segregated percept, and 7 semitones for multistability. Afterwards, participants were instructed about their response mapping. For 9 of the 16 included participants, the left button mapped to the segregated percept and the middle button to the integrated percept; for the remaining 7, the mapping was reversed. Participants then practiced the response mapping with the extreme examples. Once they and the experimenter were confident that the percepts and the mapping were understood, they proceeded to the individual adjustment of the auditory multistable stimulus. After this adjustment, the visual stimuli were introduced, and the response mapping – bottom button for the segregated (component) percept moving downwards, right button for the integrated (plaid) percept moving rightward – was practiced with the multistable and the two disambiguated examples, each at a 45° opening angle. As the relative location of the button corresponded to the perceived direction of motion, the response mapping was kept identical for all participants to avoid any form of compatibility effects. After instruction and familiarization with the response mapping, the individual adjustment of the visual stimuli commenced.

Individual adjustments of stimuli. Separately for the visual and the auditory stimulus, we intended to achieve a balance between integrated and segregated percepts, that is, to prevent strong biases towards either percept. Since the tendency towards integration or segregation shows large inter-individual variability (Denham et al., 2014), this adjustment has to be conducted on an individual level. To this end, we presented different versions of the multistable stimuli in 2-min blocks. For the auditory stimulus, we started at a frequency difference of 7 semitones. If there was an abundance of integrated percepts, the frequency difference was increased in the subsequent block; if there was an abundance of segregated percepts, the frequency difference was decreased. The degree of increase or decrease depended on the strength of the observed bias. This was repeated for 4 blocks; then a final setting was decided upon and one more block with this frequency distance was run. This frequency distance was then also used throughout day 2. For the 16 included participants, these frequency distances ranged between 2 and 10 semitones. Similarly, we adjusted the visual stimulus. Here, we presented the 45° opening-angle stimulus for two 2-min blocks and the 35° and the 55° stimuli for one 2-min block each. The opening angle with the best balance between integration and segregation was then picked and used throughout day 2.
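
The text does not specify the exact update rule, but the logic of the auditory adjustment can be illustrated as follows (a hypothetical sketch in Python; the step-size gain and the clipping bounds are invented for illustration):

    def next_semitone_distance(df, p_segregated, gain=4.0):
        # p_segregated: proportion of segregated reports in the last block.
        # Too much integration (p < 0.5) -> increase the A-B distance;
        # too much segregation (p > 0.5) -> decrease it, with a step
        # proportional to the strength of the observed bias.
        step = gain * (0.5 - p_segregated)
        return min(14.0, max(1.0, df + step))

    # e.g., next_semitone_distance(7, 0.2) -> 8.2 (integration dominated)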

Day 2 – Main experiment. The main part of the experiment was run on day 2 and consisted of two unimodal auditory blocks, three unimodal visual blocks and eight bimodal blocks, each consisting of a 180 s multistable part and a 30 s disambiguated part as described in the results section. In half of the participants, the order was visual-unimodal, auditory-unimodal, 4 bimodal blocks, visual-unimodal, 4 bimodal blocks, auditory-unimodal, visual-unimodal; in the other half, the order of visual-unimodal and auditory-unimodal at the beginning and end was reversed (Figure 3.1G).

Ethics statement. All experimental procedures were evaluated by the applicable ethics board (Ethikkommission der Fakultät für Human- und Sozialwissenschaften) at TU Chemnitz who waived an in-depth evaluation (case no. V-219-15-AB-Modalität-14082017).

C3.1.5 Experiment 1: Data analysis

As described in more detail below, our procedures for the extraction of features and the training of the SVM classifier closely followed the approach proposed by Wilbertz et al. (2018). The key underlying idea is that when observing a plaid-grating stimulus, the slow phase of the OKN aligns with the perceived motion. As the horizontal motion component for a horizontally moving plaid is larger than for a grating that drifts mostly vertically, the speed of the slow phase, and to some extent its duration, depend on the percept. Since we, unlike Wilbertz et al. (2018), were not limited by the constraints of an fMRI-compatible eye-tracking setup, we decided to deviate from their feature definitions where we could make use of the higher temporal resolution of our eye-tracking device. Since we had three independent unimodal visual blocks available, we could also replace their leave-one-out (10-fold cross-validation) procedure and instead use independent blocks as training set, set for parameter optimization, and set for evaluation.

Feature extraction from eye-movement data
As Wilbertz et al. (2018), we extracted the time courses of three variables from the raw horizontal eye-position data (Figure 3.4). As a first step, periods of blinks, as detected by the EyeLink software, were marked as missing data for all variables. For technical reasons, the start and end of a blink event are surrounded by the start and end of an apparent saccade; for the purpose of analysis, the apparent saccade’s start and end were used as the start and end of the blink.
• For the first variable (horizontal velocity), we treated all detected saccades (using the EyeLink software with thresholds of 35°/s for velocity and 9500°/s² for acceleration) as missing data, as they typically correspond to OKN fast phases (see the sketch after this list). We then determined the horizontal velocity at any given timepoint as the difference in position between this timepoint and the next. We averaged these differences in an interval of ±500 ms around each timepoint (1,001 data points), where missing data were ignored (i.e., if there are k missing values, the average is taken over 1,001-k data points).
• As second variable, we used the duration of consecutive data points moving in the direction opposing the plaid drift direction (i.e., going from right to left). Including saccades but excluding blinks (and surrounding apparent saccades), we identified the direction of motion between subsequent time points (left or right). We then identified chunks of time points with leftward velocity and assigned the duration of the chunk to each data point in this chunk. For example, if the eye moved leftward for 50 ms, all points between the start and end of this period were assigned a value of 50 ms. This approach benefits from the low noise levels of the recording and characterizes fast-phase durations. The trace of durations was then averaged as above, treating blinks as missing data.
• As third variable, we used the same leftward chunks as for the second variable and measured the horizontal distance covered within each chunk, again averaged with a 1 s (±500 ms) window. This measure relates to the spatial length of OKN fast phases.
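To make the first variable concrete, the following minimal MATLAB sketch computes the sliding average while ignoring missing samples; it assumes a column vector x of horizontal eye positions at 1 kHz in which blinks and saccades have already been set to NaN (all variable names are illustrative, not taken from the original analysis code).

    % Sliding-average horizontal velocity (first variable), ignoring
    % missing samples: x is the horizontal eye position at 1 kHz, with
    % blinks and saccades already set to NaN (column vector).
    win    = 1001;                     % 1,001 samples = +/-500 ms at 1 kHz
    v      = [diff(x); NaN];           % velocity: difference to next sample
    valid  = ~isnan(v);
    vz     = v;  vz(~valid) = 0;       % zero-out missing data for the sums
    sums   = conv(vz, ones(win, 1), 'same');            % sum of valid samples
    counts = conv(double(valid), ones(win, 1), 'same'); % number of valid samples
    vSmooth = sums ./ counts;          % average over the 1,001-k valid points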


Figure 3.4. Eye-movement variables to compute SVM features. A) First minute of horizontal eye position during a representative unimodal visual block. Black: slow phases, grey: fast phases (as determined by the saccade-detection algorithm), red: blink detection around missing data. B) Grey: velocity computed as the difference between subsequent data points (scaled down by a factor of 10 relative to the scale on the y-axis) after removing fast phases and blinks from the trace in panel A; black: sliding average of the velocity data using a 1 s time window. C) Grey: number of successive samples in the direction of the fast phase; black: sliding average over the grey data. D) Grey: distance covered by successive samples in the direction of the OKN fast phase; black: sliding average over the grey data. In all panels, periods in which the participant reported an integrated percept are marked in cyan, segregated reports in magenta, and periods when no or both buttons were pressed are left unmarked.

This procedure results in three variables as a function of time. For each timepoint, each of these three variables was sampled at 21 timepoints relative to the current one (at -2.0 s, -1.9 s, …, -0.2 s, -0.1 s, and 0 s), yielding 63 features (21×3) per timepoint. Consequently, for all timepoints, we obtained a 63-dimensional feature vector that characterized eye movements in a time window from -2.5 s to 0.5 s (2 s of samples, 1 s averaging window) relative to the timepoint of consideration, as well as a label (the reported percept). Timepoints for which any feature or the label was missing (because neither of the two percepts was exclusively reported) were not included in training or testing the SVM. For training the SVM, we subsampled this training data at 100 Hz for computational efficiency¹; for testing the SVM, the full 1 kHz sampling was used.
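As an illustration of this feature assembly, a short MATLAB sketch follows; V is a hypothetical n × 3 matrix holding the three smoothed variables at 1 kHz.

    % Assemble the 63-dimensional feature vector: 21 time-shifted copies
    % (0, -100 ms, ..., -2,000 ms) of the 3 smoothed variables.
    shifts = 0:100:2000;                     % lags in samples (at 1 kHz)
    n = size(V, 1);
    F = NaN(n, 3 * numel(shifts));           % n timepoints x 63 features
    for k = 1:numel(shifts)
        s = shifts(k);
        F(s+1:end, 3*k-2:3*k) = V(1:end-s, :);   % value s ms in the past
    end
    % Rows with any missing feature, or without a valid label, would be
    % excluded before training or testing the SVM.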

SVM classifier
Using the 63-dimensional feature vector at 100 Hz sampling, we trained an SVM classifier using the libSVM implementation for MATLAB (http://www.csie.ntu.edu.tw/~cjlin/libsvm) in version 3.23 (Chang & Lin, 2011). We used a linear kernel and the same 11 settings for the cost parameter (0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1, 1, 2, 5, 10, 100) as Wilbertz et al. (2018). For each participant, we trained 33 models (3 unimodal visual blocks times 11 cost parameters), selected the model(s) that predicted each of the two other unimodal visual blocks best (i.e., showed the highest decoding accuracy), and evaluated their performance (in terms of decoding accuracy) on the remaining unimodal visual block (Figure 3.2). This resulted in 6 models per individual that were retained for further use² (Figure 3.1H).
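For illustration, a minimal MATLAB sketch of this train-select-evaluate scheme using the libSVM interface; Ftrain/Fval/Ftest and the label vectors yTrain/yVal/yTest are hypothetical names for the three unimodal visual blocks, and balancedAccuracy denotes the adjusted decoding accuracy defined in the next subsection (a sketch of that function follows its formula).

    % Train one linear-kernel SVM per cost setting on one unimodal visual
    % block, select the cost by decoding accuracy on a second block, and
    % evaluate the selected model on the held-out third block.
    costs = [0.0001 0.0005 0.001 0.005 0.01 0.1 1 2 5 10 100];
    bestAcc = -Inf;
    for c = costs
        model = svmtrain(yTrain, Ftrain, sprintf('-t 0 -c %g -q', c));
        pred  = svmpredict(yVal, Fval, model, '-q');
        a     = balancedAccuracy(yVal, pred);   % adjusted accuracy (below)
        if a > bestAcc, bestAcc = a; bestModel = model; end
    end
    predTest = svmpredict(yTest, Ftest, bestModel, '-q');  % evaluation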

Decoding accuracy
Throughout this study, the term “decoding accuracy” refers to an adjusted accuracy that takes biases in the ground truth (i.e., the reported percepts) into account. It is computed as the average accuracy of the two percepts:

accuracy_integrated = (#timepoints where reported integrated was classified as integrated) / (#timepoints reported as integrated)

accuracy_segregated = (#timepoints where reported segregated was classified as segregated) / (#timepoints reported as segregated)

accuracy = (accuracy_integrated + accuracy_segregated) / 2

This measure is insensitive to biases in the reported percepts; thus, chance level (the expected accuracy if the model’s predictions were based on random guessing) is always at 50 %. The term “predict” is to be understood in a technical machine-learning sense here, not in a literal sense of predicting the future (i.e., other fields would use the term “classify”).
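A possible MATLAB implementation of this adjusted accuracy (a sketch under our naming; the original analysis code may differ):

    function acc = balancedAccuracy(reported, predicted)
    % Adjusted decoding accuracy: the per-percept accuracies are averaged,
    % so chance level stays at 50 % regardless of biases in the reports.
    % reported/predicted: label vectors (e.g., 1 = integrated,
    % 2 = segregated); invalid timepoints must be removed beforehand.
    classes = unique(reported);
    perClass = zeros(numel(classes), 1);
    for k = 1:numel(classes)
        idx = (reported == classes(k));              % ground truth is percept k
        perClass(k) = mean(predicted(idx) == classes(k));
    end
    acc = mean(perClass);                            % average over percepts
    end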

¹ Unsurprisingly, given the redundancy in the feature vector as a consequence of smoothing with a 1-s window, the improvement of this subsampling relative to a 10-times coarser subsampling (10 Hz) is already very subtle when tested on the unimodal visual blocks, such that no relevant further improvement is expected from sampling at the full 1 kHz.
² In principle, the same model can be selected twice, if both “predicted” visual unimodal blocks yield the same optimal cost parameter. This occurred in a total of 8 cases (out of 48 [16×3] pairs of models trained on the same block). Since such a model should not be down-weighted, this is not corrected; i.e., the model is used twice, ignoring this redundancy. As there is no independence assumption when using the selected models, this is uncritical for the statistical analysis.

Consistency in bimodal blocks is defined analogously, with the difference that, while it is still the visual stimulus that is decoded (“predicted” by the SVM), the labels stem from the auditory report.

Noise ceiling
The expected maximum for the consistency between reported auditory and decoded visual percept in bimodal blocks is limited by two factors: the accuracy of the visual decoding and the ability to faithfully report one’s auditory percept. The former was estimated from the decoding accuracy in the unimodal visual blocks, the latter from the hit rate in the auditory response-verification parts of the bimodal blocks. The maximum is reached when the true auditory percept perfectly matches the true visual percept. In this case, the visual decoding will be consistent with the auditory report if either both are correct or both are incorrect (as there are only two classes, i.e., percepts). Under the reasonable assumption that at any given timepoint the ability of the SVM to decode correctly is independent of the ability of the participant to report their percept correctly, the consistency would then reach the sum of (probability visual decoding is correct) × (probability auditory report is correct) and (probability visual decoding is incorrect) × (probability auditory report is incorrect). Hence:

(noise ceiling) = (decoding accuracy) × (hit rate) + (1 - decoding accuracy) × (1 - hit rate)

Importantly, this measure is 0.5 (50 %) if either hit rate or decoding accuracy is at chance (50 %) and 1 (100 %) if both measures are perfect (or both are always wrong). We refer to this value as “noise ceiling”, in analogy to the use of this term in the computational literature. It should be noted that the noise ceiling can also be computed per class (percept) and averaged across classes, akin to the accuracy. In practice, the difference between these definitions is minute for our data, amounting to a maximum of 0.6 % (percentage points) between the definitions (0.1 % on average for the N = 16).
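In MATLAB terms, the computation reduces to a one-liner; the two input values below are made-up examples, not results from this study:

    % Noise ceiling for the consistency between decoded visual and
    % reported auditory percept; both inputs are proportions in [0, 1].
    decodingAccuracy = 0.86;   % e.g., from the unimodal visual blocks (example)
    hitRate          = 0.90;   % e.g., from response verification (example)
    noiseCeiling = decodingAccuracy * hitRate ...
                 + (1 - decodingAccuracy) * (1 - hitRate);
    % = 0.5 if either input is at chance; = 1 if both are perfect.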

Statistical analysis
To test whether a population mean is significantly different from chance, we employed one-sample t-tests. To compare two conditions across observers, we used t-tests for dependent samples (paired t-tests). In all cases, two-tailed tests at an alpha level of 5 % were used.


C3.2 Supplement Study 2

This supplement can be retrieved from: https://www.tu-chemnitz.de/physik/PHKP/multistable/DissertationGrenzebach/index.html.

C3.2.1 Supplement T1: Movie of the visual stimulus material
T1A. Audiovisual ambiguous segment, duration 15 s, parameter: 35°
URL: https://www.tu-chemnitz.de/physik/PHKP/multistable/DissertationGrenzebach/Studie2/T1A.mp4
File Format: mp4/H.264
Size: 22.7 MB

T1B. Visual catch trial (int-seg-int-seg, four segments of 7.5 s each) without ambiguous auditory stimulus, duration 30 s, parameter: 35°
URL: https://www.tu-chemnitz.de/physik/PHKP/multistable/DissertationGrenzebach/Studie2/T1B.m4v
File Format: M4V/MPEG-4 Video
Size: 34.3 MB

C3.2.2 Supplement T2: Soundtrack of the auditory stimulus material
T2A. Disambiguated auditory stimulus: integrated segment, duration 7 s, parameter: 7 semitones
URL: https://www.tu-chemnitz.de/physik/PHKP/multistable/DissertationGrenzebach/Studie2/T2A.m4a
File Format: m4a/MPEG-4-Audio
Size: 38 KB

T2B. Disambiguated auditory stimulus: segregated segment, duration 7 s, parameter: 7 semitones
URL: https://www.tu-chemnitz.de/physik/PHKP/multistable/DissertationGrenzebach/Studie2/T2B.m4a
File Format: m4a/MPEG-4-Audio
Size: 40 KB


CHAPTER 4 Study 3

“Boundaries of bimodal coupling in perceptual bistability”

This chapter will be submitted in similar form for publication as: Grenzebach, J., Einhäuser, W., & Bendixen, A., “Boundaries of bimodal coupling in perceptual bistability” (manuscript in preparation).

C4.1.1 Experiment 1: Introduction

Multistability is a phenomenon that appears in the perceptual processes of all modalities (Schwartz et al., 2012): Exposed to an ambiguous stimulus for a prolonged duration, the respective perceptual system forms several interpretations (percepts) that alternatingly dominate awareness. The sequence of alternating percepts is called “percept trace” (e.g., percept 1 lasting 3.4 s, percept 2 (5.1 s), percept 1 (4.8 s), …), and the transition between two percepts is called a “perceptual switch”. Depending on the number of percepts reported, bi- or multistable stimuli are distinguished. The physical stimulus remains unchanged, but subjectively the dominant percept switches every few seconds in a seemingly random fashion. Perceptual switches in auditory and visual multistable processes were shown to go along with similarly shaped dilations of the pupil, which is regarded as a central cognitive marker for perceptual decisions (Einhäuser et al., 2008, Study 1). The perceptual interpretations are mutually exclusive and each equally compatible with the raw sensory evidence (Leopold & Logothetis, 1999). The dissociation of the physical stimulus, the sensory signals, and the perceptual organization allows conclusions about the interpretational character of the perceptual system in the arrangement of raw sensory information (Sterzer & Kleinschmidt, 2007). Established examples of perceptual multistability are auditory streaming (van Noorden, 1975) and verbal transformation (Warren, 1961; Warren & Gregory, 1958) in auditory perception, and moving plaids (Movshon et al., 1985; Wallach, 1935) and binocular rivalry (Brascamp, Klink, et al., 2015) in visual perception.

The switching pattern exhibits a range of similarities across several types of perceptual multistability in different modalities (Schwartz et al., 2012). Classically, the individual percept trace is investigated regarding the number of perceptual switches evoked and, in addition, separately for each type of percept, the average dominance duration (phase duration) and the dominance time relative to the whole stimulation time (percentage). Such variables are named “aggregated measures of multistability” and allow comparing the reported pattern of perception across different forms of multistability. Contrasting aggregated measures elicited separately by auditory and visual ambiguous stimuli revealed resemblances (Pressnitzer & Hupé, 2006). This observation pointed to a universal principle or a shared mechanism contributing to perceptual multistability (T. Cao et al., 2018; Long & Toppino, 2004; Pastukhov et al., 2019). To investigate the properties of the potential common mechanism further, the simultaneous presentation of two multistable phenomena to two sensory modalities at once was addressed by several recent studies (V. Conrad et al., 2010; Hupé et al., 2008). Indeed, a moment-by-moment coupling of corresponding percepts across modalities was reported (Study 2). Considering evidence stemming from several sensory sources on a perceptual object may help to resolve ambiguities, especially when only one of several senses signals ambiguous features (for an audio-visual review, see Koelewijn et al., 2010; McGurk & MacDonald, 1976). Such cross-modal interactions are well-studied, in contrast to situations where both modalities receive ambiguous information simultaneously.

Until now, it has remained elusive how two parallel multistable processes reach out globally to one another, be it in a bidirectional or in a unidirectional manner. In the bidirectional case, a perceptual switch in either multistable process would be followed in the respective other modality by a perceptual switch to the corresponding percept (pari passu); in the unidirectional case, whether the effect is auditory-lead or visual-lead, the coupling would be one-way, and the other modality would react subordinately. In this study, we probe the strength of the moment-to-moment coupling for the unidirectional visual-lead case. First, perceptual bistability in vision and audition is elicited in parallel, and percepts may couple crossmodally on a moment-by-moment basis, replicating earlier findings (Einhäuser et al., 2020, Study 2). Second, in the ongoing visual bistability process, a percept is externally induced. In the third step, the response of the concurrent auditory percept is monitored. If the induced visual percept is accompanied by a matching auditory percept, the coupling has a certain robustness and is visual-lead. Altogether, this allows us to characterize the properties of the crossmodal coupling in perceptual multistability. If the biased visual bistable process in fact influences the percepts in the auditory bistable process, the underlying coupling can be either visual-lead or bidirectional. It must subsequently be untangled in future studies (biasing the auditory bistable process and monitoring the visual bistable response) whether the coupling is bidirectional. If no coupling were found in the current visual-on-auditory study, the coupling could still be auditory-lead. Despite that, the correspondence of the percepts in an

earlier audio-visual study (Study 2) argues against an entirely distributed mechanism governing perceptual multistability (Kubovy & Yu, 2012).

The compatibility of different forms of multistability phenomena and the correspondence of their respective percepts is based on earlier studies (Pressnitzer & Hupé, 2006). The similarities in multistable perception are not limited to the general switching pattern across modalities (averages, distributions) but can be found in inter-individual variations as well. Correlations within individually aggregated measures (switching rate, dominance duration) uncover idiosyncratic consistency in the underlying multistable properties (Denham et al., 2018; Kondo & Kochiyama, 2018). Such techniques are somewhat sensitive to the types of multistability being compared: e.g., the evoked perceptual space (the number of percepts) needs to have the same size in both modalities. Additionally, common factors affecting perceptual interpretations across different forms of multistability in all modalities are indicated by correlations but are not necessarily perception-related. Potentially, cognitive layers encapsulating the perceptual processing (“switcher” type vs. “remainer” type) limit the explanatory power of the observations (Hupé et al., 2008; Kleinschmidt et al., 2012). Factors like personality traits, neurotransmitter concentrations, and genetic factors might play a role in the general decision criteria to (re)evaluate the momentary perceptual representation (Farkas, Denham, Bendixen, Toth, et al., 2016; Kashino & Kondo, 2012; Kondo et al., 2017; Schmack et al., 2013). Such limitations can be ruled out by a moment-by-moment analysis of the percepts: the above-chance co-occurrence of perceptual switches in both modalities and the momentary correspondence of the respective percepts are explainable by a common mechanism across modalities (Einhäuser et al., 2020; Pressnitzer & Hupé, 2006). Until now, the findings have been somewhat equivocal (V. Conrad et al., 2010; Hupé et al., 2008), which is most likely the result of limitations stemming from the participants’ task to monitor and report the momentary percept in two modalities at the same time (task 1: monitor the visual percept, task 2: monitor the auditory percept).

The appropriate selection of the simultaneously presented ambiguous stimuli is critical for testing the properties of the crossmodal coupling. Corresponding percepts have been identified by presenting the bistable stimuli auditory streaming and, for vision, pattern-component rivalry, i.e., the moving plaids stimulus (specific stimulus version similar to Wilbertz et al., 2018). In their specific bistable configuration, both stimuli share the same percept space: either the integrated percept (forming one perceptual object) or the segregated percept (forming two separate perceptual objects) is perceived. For the visual moving plaids stimulus, this concept materializes in two overlaid drifting gratings that are perceivable as one passing in front of the other (segregated percept, called “component” percept) or as a unitary plaid (integrated percept, called “pattern” percept) moving

together in the same direction. The motion directions of the two percepts are distinctly different and evoke a specific eye-movement pattern called OKN (Frässle et al., 2014). The OKN is characterized by alternating phases of slow and fast eye movements, resulting in a sawtooth-like pattern that has markedly different properties when following either one of the two percepts. A computational tool, namely an SVM-based classifier (Wilbertz et al., 2018), was applied to read out the momentary percept from the OKN, just as in Study 2. By this indirect response measure, the simultaneous manual report of two bistable processes is avoided, which poses particular challenges in itself. For instance, the report of a perceptual switch in one modality might impede the on-time report of a perceptual switch in the second modality; the same might apply to the allocation of attention. Consequently, the visual percept can be read out directly and unobtrusively while the auditory percept is reported via manual button-press in the bimodal setting. This approach allows us to observe the true moment-by-moment coupling of percepts crossmodally.

To test the strength of the coupling, we introduce a subtle visual disambiguation in a proportion of the otherwise ambiguous visual stimuli and investigate the perceptual response. For that purpose, in Experiment 1 (unimodal visual study), the functionality of the so-called inducer is explored. The inducer removes the ambiguity of the visual stimulus only for a brief moment and thus biases the current perceptual state towards the integrated percept (pattern). Depending on the visual percept at the time the inducer sets in, either the integrated percept is prolonged, or the segregated percept is shortened and a perceptual switch is triggered. Contrasted against observations in the undisturbed percept traces without inducer (“no-inducer”), the effect the inducer has on the percept can be detected. In the following analysis, the so-called “difference wave” captures those perceptual responses: namely, the difference in the local perceptual dynamics between the visual “inducer” condition and the baseline condition without inducer (“no-inducer”).

To be able to investigate the strength of the perceptual coupling crossmodally (Experiment 2, bimodal study), the perceptual dynamics represented in the difference waves can be generated separately from different channels of percept traces: first, the SVM-based read-out of the eye movements and, second, the button-based reports. Both channels signal the perceptual state at any moment (either integrated or segregated). In Experiment 1, we investigate (1) whether in both channels (button- and SVM-based) the proportion of the perceived integrated percept locally increases (if it was segregated before) or remains elevated (if it was integrated before) after the visual inducer has been briefly displayed. (2) In contrast to such expected local effects on the perceptual response, the inducer should not have a substantial influence on the perceptual dominance overall, as quantified in the classical aggregated measures of multistability (percentage percept and phase duration). The expected absence of an overall effect would corroborate that the impact of the inducer is limited to a local effect and still allows for autonomous perceptual switching in between. (3) The decoding quality of the SVM classifier must withstand the momentary disambiguation by the inducer and match the level of decoding quality reported earlier (Study 2). (4) If the decoding quality of the SVM (with and without inducer) is sufficiently high, we might capture the visual percept without overt report. In Experiment 2, the visual percept is only indirectly decoded from the reflexive OKN. Based on the insights gathered in the exploratory analysis of the data of Experiment 1 and the demonstrated functionality of the inducer paradigm, the experimental hypotheses and statistical analyses for Experiment 2 were developed a priori (see introduction of Experiment 2). There, the difference waves capture the visual (SVM-based) and the auditory (button-based) percept simultaneously.

For Experiment 2, the bimodal study, the moving plaid stimulus, with and without inducer, is simultaneously accompanied by the auditory streaming stimulus (Bregman, 1990; van Noorden, 1975). The stimulus consists of a sequence of short ‘A’ and ‘B’ tones (ABAB…) that differ from each other in frequency. A variation of the classical auditory streaming paradigm, skipping every second B tone, was applied here: the “ABA_” triplet (see methods of Experiment 2). Such a stimulus configuration (ABA_ABA_…) can be perceived as originating from one sound source (the integrated percept, a galloping stream, ABA_ABA_…) or from two sound sources (the segregated percept, two metronome-like streams, A_A_A_A_… & _B___B__…). One parameter of the auditory stimulus, the frequency difference between the A and B tones (Denham et al., 2013b), was adjusted individually for each participant to obtain a roughly equal distribution of time spent in the integrated and segregated percepts, thereby facilitating crossmodal co-occurrence.
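For readers unfamiliar with the paradigm, a minimal MATLAB sketch of such an ABA_ sequence; all parameter values (tone frequency, tone duration, number of repetitions) are hypothetical examples and not the settings used in the experiment:

    % Illustrative ABA_ triplet sequence: pure tones A and B separated
    % by df semitones; parameter values are hypothetical examples.
    fs  = 44100;                        % audio sampling rate (Hz)
    fA  = 440;  df = 5;                 % A frequency; semitone difference
    fB  = fA * 2^(df / 12);             % B frequency from semitone distance
    dur = 0.125;                        % tone duration (s), example value
    tone = @(f) sin(2 * pi * f * (0 : 1/fs : dur - 1/fs));
    gap  = zeros(1, round(fs * dur));   % silent slot: the "_" in ABA_
    triplet  = [tone(fA), tone(fB), tone(fA), gap];
    sequence = repmat(triplet, 1, 60);  % ABA_ABA_... (~30 s)
    % sound(sequence, fs);              % uncomment to play the sequence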

If the moment-to-moment audio-visual coupling is sufficiently strong and the auditory bistable process follows the perceptual dominance of the current visual bistable process (visual-lead or bidirectional), the induced visual perceptual switch should be paralleled by a coherent auditory percept. Since the implemented inducer evokes the integrated percept visually, either the auditory percept switches (if it was segregated before) or the current integrated auditory percept is prolonged (if it was integrated before). Thus, if an integrated percept is induced in the visual modality, i.e., the proportion of perceived integrated visual percepts after the inducer increases, the proportion of perceived integrated auditory percepts should increase simultaneously.


C4.1.2 Experiment 1: Material and Methods

Participants
Volunteers were recruited from the student community of TU Chemnitz via email advertisement. We collected data from a maximum of 12 participants but stopped when 8 participants exhibited an average SVM decoding quality (ACC, see below) of >70 %. When the a priori set sample size of eight participants was reached, the sample exhibited the following properties: the mean age of the participants was 20.88 years; 2 males, 8 females; 1 left-handed, 7 right-handed. All participants reported normal or corrected-to-normal vision and hearing in a pre-experimental questionnaire. For technical reasons, the experiment was aborted for one participant, and the recorded data were excluded from the following analysis.

Ethics statement
The conduct of the experiments (1 and 2) conformed to the principles laid out in the Declaration of Helsinki and was determined by the applicable body (Ethikkommission der Fakultät für Human- und Sozialwissenschaften, TU Chemnitz) not to require an in-depth ethics evaluation (case no. V-219-15-AB-Modalität-14082017).

Setup
All experiments were conducted in a sound-attenuated room with no light source other than the display. Visual stimuli were displayed on a 24-inch LCD display “VIEWPixx/3D Full” (VPixx Technologies Inc., Saint-Bruno, QC, Canada; 120 Hz, resolution: 1920 × 1080 pixels, 52.5 × 29.7 cm display size, 55 cm distance to the head-rest). The gaze position (right eye for 7 participants, left eye for 1) was captured at 1,000 samples per second by an “EyeLink-1000 Plus” eye-tracking camera system mounted overhead (SR Research, Ottawa, ON, Canada), used with a head-rest and the EyeLink toolbox (Cornelissen et al., 2002) to control the recording and handling of the eye-tracking data. Button-presses were recorded using the integrated “ResponsePixx Handheld 5-button response box” attached to the display system (five buttons oriented in a plus-shaped layout: one in the middle, one in each direction). The participant sat on an adjustable chair; before the start of an experimental block, the fingers were positioned on the task-relevant buttons of the button-box. In this way, the necessity for fixations outside the display area was minimized. The head rested on the chin-rest and leaned against a cushioned forehead-mount to reduce head movements. For the control of the presentation of the auditory and visual stimuli, we used commercial and open-source software and packages: MATLAB R2016a (MathWorks, Natick, MA) with the Psychophysics toolbox (Brainard, 1997; Kleiner et al., 2007; Pelli, 1997; see http://psychtoolbox.org).


Visual stimulus
The grey-scaled visual stimulus shown on the display was either ambiguous (inducing bistability: “integrated” [abbr. “int”] and “segregated” [abbr. “seg”], Figure 4.1) or the disambiguated version thereof, used to induce the integrated percept (Figure 4.3, “inducer”). Both visual stimuli were presented on a light grey background (12 cd/m²). The stimulus parameters closely matched those selected by Wilbertz, Ketkar, Guggenmos, and Sterzer (2018), but the whole stimulus was rotated by 180°. The reason for selecting a similar stimulus lies in the subsequent SVM analysis, which exploits the near-orthogonality of the eye movements (OKN) evoked by the two percepts. The moving direction of the “pattern”/integrated percept was from left to right, and that of the “component”/segregated percept was from top to bottom. The participant was instructed that the eyes could be moved freely within the moving plaids stimulus.


Figure 4.1. Excerpt of the instruction: the no-report block instruction (solid line) shows the visual stimulus only; the report block instruction (dashed line) additionally shows the percepts (perceived moving direction). The black arrows printed on the instruction indicate the main moving directions of the integrated (rightwards, A) and segregated (downwards, B) percepts.

Ambiguous stimulus. By superimposing two square-wave gratings, a bistable moving plaid stimulus was formed (Figure 4.1). The gratings had a duty cycle of 0.29 and a spatial frequency of 0.76 cycles per degree (centrally), and they drifted orthogonally to their respective orientations. On the light grey background (12 cd/m²), one dark grey grating (6 cd/m²) was oriented at 22.5° relative to the horizontal meridian of the display. On top of this dark grey grating, a black grating (<0.1 cd/m²) was shown, tilted counter-clockwise at -22.5° relative to the horizontal meridian. The intersections of the two gratings (between the black and the dark grey grating, resulting in an opening angle of 45°) were dyed black (<0.1 cd/m²) in the ambiguous stimulus. The speed of the unified gratings (coherent movement of both gratings) was 3.42° visual angle per second (rightwards) and 1.98° visual angle per second for the single gratings (different directions). The two gratings were overlaid by an annulus with a diameter of 7.91° visual angle. This stimulus configuration commonly induces bistability: the integrated percept, with both gratings moving coherently (with the same speed and direction) rightwards (Figure 4.1A); and the segregated percept, with the black grating in the foreground moving towards the lower end of the stimulus (Figure 4.1B). A movie of this stimulus can be found in Supplement U3A.

Disambiguated stimulus (inducer). To control the momentary visual percept, the original ambiguous stimulus (Figure 4.3, “no-inducer”) was disambiguated (Figure 4.3, “inducer”); thereby, the rightward moving direction is expected to become dominant, increasing the probability of perceiving the integrated percept. For the inducer, the intersections of the two gratings (diamond-shaped, between the black and the dark grey grating) were colored in the light grey background color (12 cd/m²). A movie of this stimulus can be found in Supplement U3B.

Experimental Procedure
Thirteen blocks with a duration of around 210 s each had to be completed by each participant (Figure 4.2). The task was (1) to focus on the visual stimulus and allow the stimulus to induce either percept spontaneously, or (2) to additionally report the momentary subjective percept manually via button-press. For technical reasons, one participant repeated one block. The eye tracker was calibrated and validated prior to each visual block. The identity of the 13 blocks differs in two dimensions: report-type (no-report: 6 blocks, report: 7 blocks) and stimulus-type (inducer: 8 blocks, no-inducer: 5 blocks). Thus, the conditions were “report & inducer” (4 repetitions), “report & no-inducer” (3 repetitions), “no-report & inducer” (4 repetitions), and “no-report & no-inducer” (2 repetitions).

[Figure 4.2: schematic of the block sequence 1-13, showing the no-report/report and inducer/no-inducer conditions.]

Figure 4.2. Block identity and order during the experiment: no-inducer (light grey filling) and inducer (dark grey filling); no-report (white contour) and report (black contour); black arrows indicate that the inducer time points of the respective block were transferred to the no-inducer block; (*) indicates that the block was used to train one of the three individual SVM-models.

No-report blocks. The first and last three blocks (blocks 1-3 and 11-13, no-report blocks) were completed without manual report of the percept. Ahead of blocks 1 and 11, the printed no-report instruction was to be read by the participant and discussed with the experimenter to ensure understanding. The instruction introduced the general appearance of the stimulus (the ambiguous stimulus) and the perceptual switching aspect, without revealing the inducer (the disambiguated stimulus).


The participant rested their arms while focusing visually on the displayed stimulus. In this case, the block started automatically after the eye-tracker calibration and validation procedure, following a message to get ready for the upcoming block (2.5 s).

Report blocks. The participant was requested to manually report the current percept by pressing one of two buttons (blocks 4-10, report blocks). The instruction was given ahead of block 4, followed by two short (30 s) demo and training blocks, confirming the participants’ ability to exclusively press the button matching their subjective visual percept: the lower key when the segregated percept dominated, and the rightward key when the integrated percept dominated (Figure 4.1). Because the gaze should be directed at the visual stimulus at all times, the two buttons had to be operated blindly. Hence, the index finger of the left hand rested on the lower button, and the index finger of the right hand rested on the rightward button over the whole course of the experiment. In this configuration, the participants were able to press and hold down the button without looking, as long as one of the two percepts dominated. A perceptual switch (from one percept to the other) was signaled by releasing the momentarily pressed button and pressing and holding the other one.

No-inducer blocks. In the no-inducer blocks, the ambiguous visual stimulus (Figure 4.3A) was presented for the whole block duration of 210 s. The participant typically perceives the gratings as streaming “endlessly” in front of their eyes, evoking the two percepts alternatingly.

Figure 4.3. The design of the no-inducer (A) and the inducer block (B). The figure shows the ambiguous version (identical in both; 210 s in A, segments of 10-20 s in B) and the disambiguated version (inducer, 250 ms) of the visual stimulus that are displayed alternatingly (inducer block duration ~214 s).

Inducer blocks. In contrast to the no-inducer blocks, the ambiguous visual stimulus was interleaved with the disambiguated visual stimulus (Figure 4.3B) for a duration of 250 ms. The other stimulus parameters remained unchanged. In this fashion, the stimulus retains its original endless-stream character, but the intersections change from black to light grey (250 ms) and back to black on the move. Technically, segments of ambiguous (durations pseudo-randomly drawn from uniformly distributed integers, 10-20 s, M = 15 s) and disambiguated stimuli (250 ms) were combined 14 times prior to the experiment, which resulted in individually varying block durations (on average 15.25 s × 14 = 213.5 s). Only the first 12 inducers are analyzed. The inducer is intended to increase the probability of perceiving the integrated percept, which is detected in the difference wave analysis. An inducer for the segregated percept was not applied.
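A minimal MATLAB sketch of this timing scheme (names and the random-number handling are ours, not from the original experiment code):

    % Pseudo-random inducer timing: 14 ambiguous segments with integer
    % durations drawn uniformly from 10-20 s, each followed by a 250 ms
    % inducer.
    ambigDur     = randi([10 20], 1, 14);          % segment durations (s)
    inducerOnset = cumsum(ambigDur + 0.25) - 0.25; % inducer onsets within block (s)
    blockDur     = sum(ambigDur) + 14 * 0.25;      % = 213.5 s on average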

C4.1.3 Experiment 1: Data analysis

We recorded two types of data: the momentary gaze position of one eye relative to the visual stimulus (position on the screen) and the momentary percept (integrated/segregated/undefined) the participant reported to be in via button-press (behavioral data). In the no-report blocks, only the gaze position was recorded, and the participant did not report their current percept manually. In the button-based percept traces, the undefined state covered all button states except the exclusive report of the integrated or the segregated percept/button. In the SVM-based percept traces, the undefined state covered times when the position of the eye was not recorded (e.g., blinks). The analysis was based on both data components, from a local and an overall perspective: the local effect relied on the perceptual dynamics at the inducer (SVM- and button-based), aggregated in the difference waves; the overall effect is based on two aggregated measures of perceptual dominance, percentage percept and phase duration, derived from the button-based data only.

Support Vector Machine
The procedures for extracting features of the OKN and training the SVM classifier followed in most parts the approach proposed by Wilbertz et al. (2018) and differed only minimally from those in Study 2. The classification was based on an SVM model that was trained on labeled (button-based labels: integrated, segregated) and preprocessed eye-movement data of the same participant. The eye movement, namely the OKN, was influenced by the visual percept currently followed (int/seg): with the application of the SVM methodology, it was possible to classify samples of the OKN into two classes, even in the absence of a report. Data without a label (“undefined”, i.e., no button was exclusively pressed) or with missing eye data (i.e., blinks, eye movements outside the central area [square, 290 px side length (8.25° viewing angle), positioned in the center of the display]) were necessarily ignored by the SVM and for the subsequent measures. In the blink detection applied by our eye-tracker, blinks started and ended with an apparent saccade, which is regarded as part of the blink by the algorithm (EyeLink software with thresholds of 35°/s for velocity and 9500°/s² for acceleration) and marked as missing data.


Feature extraction from the eye-movement data (OKN). Three features were derived from the raw horizontal eye position that characterize the epochs in which the participant reported either the integrated or the segregated percept:
1. Horizontal velocity (Feature 1): The velocity of the slow phases (in the direction of the integrated percept; rightwards) was computed as the difference between two consecutive gaze positions. Fast phases were removed by an individual threshold; this threshold was based on the median of all fast-phase velocities opposite to the integrated percept during which the integrated percept was reported.
Additionally, eye movements contrary to the integrated percept (i.e., going from right to left instead of from left to right) were coded in features 2 and 3 (fast phases; see the sketch below):
2. Duration of consecutive data points (Feature 2): We computed the velocity (left or right) for each sample of the gaze position and identified chunks where the gaze drifted from right to left. The chunk duration (e.g., 50 ms) was assigned to all time points in the chunk. The temporal length of the OKN fast phases (micro-saccades and saccades) is represented in this measure.
3. Horizontal distance covered by leftward chunks (Feature 3): Based on the second variable, with the distance covered by each chunk measured. The spatial length of the OKN fast phases (micro-saccades and saccades) is represented in this measure.
These three dimensions were each replicated at 21 time-shifts relative to the original data (-2 s, -1.9 s, -1.8 s, …, -0.2 s, -0.1 s, 0 s) to optimize the alignment of the labels to the eye movements and hence to account for the unknown latency between the two. In this way, we obtained a feature matrix whose length represents the block duration m with a resolution of 1 ms and whose width represents the 63 features. A vector contains the assigned labels and has the exact same length m. All features were smoothed (moving average with a window size of 1,001 ms) and soft-normalized (M = 0, SD = 1). The feature extraction procedure was applied to all 13 blocks of this experiment.
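To illustrate features 2 and 3, a minimal MATLAB sketch of the chunk extraction; x is a column vector of horizontal eye positions at 1 kHz with missing samples coded as NaN (all names are illustrative):

    % Features 2 and 3: duration of, and distance covered by, chunks of
    % consecutive leftward samples (the direction of the OKN fast phases).
    v    = [diff(x); NaN];                    % sample-to-sample velocity
    left = v < 0;                             % sample moves right-to-left
    d      = diff([0; left; 0]);              % chunk boundaries (+1 / -1)
    starts = find(d == 1);
    stops  = find(d == -1) - 1;
    chunkDur  = zeros(size(x));               % Feature 2 (ms, at 1 kHz)
    chunkDist = zeros(size(x));               % Feature 3 (same unit as x)
    for k = 1:numel(starts)
        idx = starts(k):stops(k);
        chunkDur(idx)  = numel(idx);                        % temporal length
        chunkDist(idx) = abs(x(stops(k)+1) - x(starts(k))); % spatial length
    end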

SVM classifier. Only the feature matrices extracted from blocks 4, 7, and 10 were used for model training (see Figure 4.2). For each of the three blocks (down-sampled to 5.71 Hz), we passed the 63 × m feature matrix (and the equally long vector of labels: int/seg/undefined), counter-balanced for the different frequencies of samples in the two classes, to the linear-kernel implementation of LIBSVM (Chang & Lin, 2011, see http://www.csie.ntu.edu.tw/~cjlin/libsvm) in a MATLAB environment (MathWorks, Natick, MA). The nested cross-validation is based on the 3 blocks (block 4 for model 1, block 7 for model 2, and block 10 for model 3): we optimized the cost parameter based on the adjusted decoding accuracy (ACC, see below). First, eleven models were trained with all cost parameters (0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1, 1, 2, 5, 10, 100) on the data of one block. Second, the model that exhibits the highest adjusted accuracy (ACC, see below) in decoding the second block is picked (cmax). Finally, the third block is decoded by the model trained with parameter cmax. In all SVM decoding, we use the original eye-tracker sampling rate of 1,000 Hz. For the comparison of the decoded with the actual labels (button-based ground truth), we calculate the decoding accuracy ACC between the two variables in the following manner (as in Study 2):

accuracy_integrated = (#time points where reported integrated was decoded as integrated) / (#time points reported as integrated)

accuracy_segregated = (#time points where reported segregated was decoded as segregated) / (#time points reported as segregated)

decoding accuracy (ACC) = (accuracy_integrated + accuracy_segregated) / 2

The chance level of the ACC measure is 50 % because it is insensitive to imbalances in the reported percepts.

SVM decoding. We obtained three SVM models per participant, each of which exhibited the highest ACC in decoding one of the other two training blocks. Each model was used to decode the remaining twelve blocks of this experiment (individually). Consequently, for all decoded blocks, we had three sets of decoded labels (int/seg), based on the classification of the OKN by models 1, 2, and 3. The decoded labels represent the momentary visual percept. Additionally, we had the recorded button-based ground truth for the report blocks only, which allowed us to compute the ACC and to compare the SVM-based percept trace at the inducer with the button-based percept trace at the inducer (local effects, see below). For the no-report blocks, only the SVM-based percept trace was investigated at the inducer (local effect). For the local effects at the inducer, we compared the perceptual dynamics with a no-inducer baseline (blocks 1 and 13 for the no-report condition, and blocks 4, 7, and 10 for the report condition).

General decoding accuracy (ACC). The ACC served several purposes in our analysis: (1) model selection in the SVM-model training procedure (report & no-inducer blocks), (2) identifying individual datasets that might be more “decodable” than others (inclusion criterion: >70 %; report & no-inducer blocks), and (3) quantifying the “decodability” of the blocks in the presence of the potentially perturbing inducer (report & inducer blocks).


Decodability with and without inducer. The ACC quantifies the general consistency between button-based reported and SVM-based decoded percepts and is computed for report blocks with inducers (ACCinducer, mean over 12 ACCs) and without inducers (ACCno-inducer, mean over 6 ACCs). To test whether ACCinducer is as high as ACCno-inducer across observers, we computed a t-test for dependent samples. If not stated differently, all following t-tests were paired, two-tailed, and computed based on an alpha level of 5 %.

Local effects of the inducer (SVM-based and button-based). We are interested in the local influences of the inducer on the percepts (int/seg) in the different channels (button- and SVM-based, separately). The inducer is designed to evoke the integrated percept. For that reason, we expect different perceptual responses depending on the percept the participant was in just before the inducer:
• If the integrated percept is dominant at the time of the inducer start, we expect that the integrated percept is stabilized and prolonged, compared to a pristine integrated percept (no-inducer).
• If the segregated percept is interrupted by the inducer, we expect that an integrated percept is triggered promptly and appears earlier than it would have without the inducer (no-inducer).
For that purpose, we compare percept traces that are in the integrated percept at the time of the inducer start with percept traces that are in the integrated state [“before: integrated”] but unexposed to the inducer (no-inducer). The same logic applies to percept traces that are in the segregated state at the inducer start, which are compared to percept traces that are in the segregated state [“before: segregated”] but unexposed to the inducer (no-inducer). The difference between the percept traces should only become observable after the inducer start: i.e., in both difference waves (before: int and before: seg, separately; see below), the probability of perceiving the integrated percept should be increased.

Difference wave
The difference wave captured the millisecond-wise difference in the averaged percept traces between the inducer and the no-inducer condition within participants. The same analysis was carried out four times in Experiment 1 and five times in Experiment 2, with a statistical test based on the numerical insights of the foregoing study. In the following section, the difference wave procedure is described in general terms. The procedure relied on extracting the specific percept trace around inducers (inducer block) and contrasting it with the baseline, the “no-inducer” (no-inducer block). The percept trace is a millisecond-wise sequence of perceptual states, which can be SVM- or button-based. The number of inducer blocks and no-inducer blocks processed in a difference wave must be equal because the exact time points of the inducers (relative to the block start) in the inducer blocks are transferred to the assigned no-inducer block (see Figure 4.2). There, the percept trace was extracted once more (around the time points of the inducers of the original inducer block). The percept trace at all inducers within the block (12 inducers) was obtained millisecond-wise (-1,000 ms before inducer start [at 0 s] until +3,000 ms after). The window size was decided upon after inspecting and visualizing the data of Experiment 1. Each sample in the percept trace can be in the state integrated, segregated, or undefined. If a block contains, for example, 12 inducers, 12 epochs are read out from the inducer block, and all 12 epochs are categorized according to the perceptual state each epoch was in 1 ms before the no-/inducer start (categories: before: int, before: seg, or before: undefined). If the percept trace is button-based, the categorization depends on the percept trace itself. If the percept trace is SVM-based, this categorization may depend on the percept trace itself (SVM-based) or on the button-based percept trace of the same block. In the latter case, the categorization is based on the button-based percept trace; however, the extracted epoch stems from the SVM-decoded percept traces. For example, three percept traces may be identified in the category before: int, eight in the category before: seg, and one in the category before: undefined. The very same procedure is applied to the no-inducer blocks: at the 12 time points of the assigned inducer block, 12 epochs are read out from the no-inducer block, and all 12 epochs are categorized according to the percept each was in 1 ms before inducer start (categories: before: int, before: seg, or before: undefined). For example, five percept traces may be identified in the category before: int, four in the category before: seg, and three in the category before: undefined. The last category (before: undefined) was discarded in both cases (no-/inducer). The number of epochs for both remaining categories (before: int and before: seg) was counted (once for the inducer and once for the no-inducer condition). A minimum number of epochs for both categories (before: int and before: seg) and both conditions (no-/inducer) within the button-based difference wave (difference wave A1, see below) was set a priori for all participants to prevent an elevated noise level. If the necessary minimum of epochs was not met, the individual contribution was excluded from all difference waves. In Experiment 1, four difference waves (A1 - A4) and, in Experiment 2, five difference waves (B1 - B5) were retrieved. The individual contributions to the “difference wave: int” and the “difference wave: seg” are aggregated as follows: for both categories (before: int and before: seg) and both conditions (no-/inducer), four average traces per participant are computed separately. For every single one of the 4,001 samples, the proportion of epochs in the integrated percept relative to the epochs in the integrated or segregated percept at this very sample was computed (%). The averaged

individual percept traces, inducer and no-inducer (both before: int), form the individual contribution to the “difference wave: int”, which was calculated by sample-wise subtraction:

difference wave(t) = inducer(t) – no-inducer(t).

The complementary procedure was applied to obtain the “difference wave: seg”: the averaged individual percept traces, inducer and no-inducer (both before: seg), form the individual contribution to the difference wave: seg by sample-wise subtraction. All individual contributions are averaged along the time dimension (mean) and finally form the aggregated percept traces difference wave: int and difference wave: seg. In Experiment 2, a sample-by-sample statistical test is applied to each difference wave B1 - B5.
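Conceptually, the aggregation reduces to a few lines of MATLAB; Einducer and EnoInducer are hypothetical epochs × 4,001 matrices with the percept coded as 1 (integrated), 0 (segregated), or NaN (undefined):

    % Difference wave for one participant and one category (before: int
    % or before: seg): sample-wise percentage of integrated percepts in
    % the inducer epochs minus that in the matched no-inducer epochs.
    pInducer   = 100 * mean(Einducer,   1, 'omitnan');  % % integrated
    pNoInducer = 100 * mean(EnoInducer, 1, 'omitnan');
    diffWave   = pInducer - pNoInducer;   % 1 x 4001 individual contribution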

Difference wave A1 “Report condition: The button-based percept trace”. The button-based percept traces for A1 were taken from the data preprocessed for the SVM procedure (label vector). The percept traces consisted only of the states integrated, segregated, and undefined. Blocks 5, 8, and 9 (inducer) were paired, respectively, with blocks 4, 7, and 10 (no-inducer); all time points of the inducer starts (12) per block were transferred from block 5 to 4, 8 to 7, and 9 to 10, respectively (see Figure 4.2). In sum, 36 inducers/epochs were extracted. The minimum number of epochs was set to 10: if a participant had fewer than 10 epochs in the constitution of the difference waves A1 (int or seg), the individual contribution was excluded from all difference waves A1 - A4.

Difference wave A2 “Report condition: The SVM-based percept trace. Categorization: SVM-based”. The SVM-based percept traces for A2 were taken from decoded data based on the preprocessed eye-movement data. The categorization (into before: int and before: seg at the inducer) was based on the SVM-based percept trace itself. Each block of preprocessed eye-movement data was decoded by the 3 SVM models. The percept traces then consisted only of the states integrated, segregated, and undefined. Blocks 5, 8, and 9 (inducer) were paired, respectively, with blocks 4, 7, and 10 (no-inducer); all time points of the inducer starts (12) within a block were transferred from block 5 to 4, 8 to 7, and 9 to 10 (see Figure 4.2, arrows). In sum, a total of 36 inducers times 3 SVM models, i.e., 108 epochs, were extracted from the blocks for each participant individually.

Difference wave A3 “Report condition: The SVM-based percept trace. Categorization: button-based”. The underlying analysis was identical to that of difference wave A2, except for the categorization, which here depended on the button-based percept trace.


Difference wave A4 “No-report condition: The SVM-based percept trace”. The SVM-based percept traces for A4 were taken from decoded data based on the preprocessed eye-movement data. The categorization (into before: int and before: seg at the inducer) could only be realized on the SVM-based percept trace itself, because the button-based percept trace is absent. Each block of preprocessed eye-movement data was decoded by the 3 SVM models. Hence, the percept traces then consisted only of the states integrated, segregated, and undefined. Blocks 2 and 12 (inducer) were paired, respectively, with blocks 1 and 13 (no-inducer); all starting time points of the inducers (12) per block were transferred from block 2 to 1 and from 12 to 13 (see Figure 4.2, arrows). In sum, a total of 24 inducers times 3 SVM models, i.e., 72 epochs, were extracted from the blocks for each participant individually.

Overall effect of the inducer
We investigated the percentage percept and the phase duration in the no-/inducer conditions, separately for the integrated and the segregated percept. For each report block, the percentage (relative to the respective block duration) and the median phase duration were measured separately for the integrated and the segregated percept for each participant. Only exclusive percept time was considered; i.e., overlapping times were removed from the analysis. As a result, we retained four values per block: percentage percept and median phase duration of the integrated and the segregated percept. We calculated the mean across all participants for the inducer (4) and, separately, for the no-inducer (3) blocks. A t-test was conducted to examine the difference of the means of the variable “percentage percept” for the integrated percept only. The complementary segregated percept remained untested because independence of the levels int and seg was not given (the percept percentages complement each other to nearly 100 %). A repeated-measures two-way ANOVA with the factors Percept (levels: integrated, segregated) and Stimulus-type (levels: inducer, no-inducer) was conducted to compare the means of the phase durations in the different conditions (no-/inducer).
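A minimal MATLAB sketch of these aggregated measures; the struct array phases and the scalar blockDur are illustrative names, assuming exclusive (non-overlapping) phases have already been extracted:

    % Aggregated measures for one block: percentage percept and median
    % phase duration per percept; phases has fields .percept ('int'/'seg')
    % and .dur (duration in s).
    isInt  = strcmp({phases.percept}, 'int');
    durs   = [phases.dur];
    pctInt = 100 * sum(durs(isInt))  / blockDur;   % percentage integrated
    pctSeg = 100 * sum(durs(~isInt)) / blockDur;   % percentage segregated
    medInt = median(durs(isInt));                  % median phase duration (int)
    medSeg = median(durs(~isInt));                 % median phase duration (seg)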

C4.1.4 Experiment 1: Results

General decoding accuracy with and without inducer
Each of the three SVM models was applied to decode the remaining two report & no-inducer blocks.

In the end, the six ACCno-inducer values were averaged within participants. All participants exhibited an ACCno-inducer above the a priori set inclusion criterion of 70 % decoding accuracy (range: 80.37 % - 91.31 %, M = 86.47 %, SD = 4.03 %, Figure 4.4).


The model with the highest decoding accuracy with respect to decoding the other two blocks was employed to decode all four report & inducer blocks. The mean over 12 decoding accuracies was computed as ACCinducer (Figure 4.4). The average ACCinducer across all participants was M = 83.59 % (SD = 7.77 %, range: 66.27 % - 89.76 %) and therefore statistically indistinguishable from ACCno-inducer, t(7) = 1.63, p = .14.

Figure 4.4. Decoding accuracy: ACCno-inducer (grey bar); ACCinducer (black dots); a priori inclusion criterion (dotted line); chance level is at 50 %.

Local effects of the inducer (SVM-based & button-based)
We computed the difference waves to separately understand the local effects of the inducer on the button-based (difference wave A1) and the SVM-based percepts (A2 - A4). Difference waves A1 - A3 relied on the same blocks, with different channels being analyzed (button- or SVM-based percept traces). The percept in the no-report condition was captured in difference wave A4, which is solely based on the percept decoded from the eye movements (SVM-based). To illustrate the process of converting the raw percept traces into the final difference wave, we show in Figure 4.5A-D the intermediate steps (difference wave = inducer - no-inducer). The first column of the figure (Figure 4.5A, B) represents the across-participant millisecond-wise mean of epochs around inducers. The first-row plots (Figure 4.5A, C, E) are all solely based on epochs categorized as integrated, and the second row (Figure 4.5B, D, F) on epochs categorized as segregated. The second column (Figure 4.5C, D) contains plots based on extractions around no-inducers. The time points at which the percept traces (no-inducer) were extracted stem from the assigned inducer blocks, respectively. The third column (Figure 4.5E, F) shows the millisecond-wise difference between the averaged inducer (experimental condition) and no-inducer (control condition) percept traces.

Difference wave A1 ”Report condition: The button-based percept trace” (Experiment 1). The difference wave A1 (Figure 4.5E and F) was based on the extraction of button-based percept traces. The extracted epochs were categorized on the percept traces themselves. Per participant, three inducer blocks with twelve inducers each were analyzed. A total of 36 epochs per participant were therefore split into the categories “before: int”, “before: seg” and “before: undefined” in both the inducer and the no-inducer condition (Table 4.1):

              (A) inducer                          (B) no-inducer
  Before: int  Before: seg  Before: und.   Before: int  Before: seg  Before: und.
      18           18            0             19           16            1
      16           20            0             15           21            0
      22           14            0             17           18            1
      24           11            1             24           12            0
      16           18            2             20           14            2
      19           17            0             25           11            0
      29            7            0             27            9            0
      16           20            0             15           21            0

Table 4.1. Number of epochs constituting the difference wave A1 (Figure 4.5) with (A) and without (B) inducer. Each row represents one participant.

One participant had an asymmetrical balance of epochs (fewer than 10 epochs for the seg-percept in both the no-inducer and the inducer condition), thus did not meet the a priori criterion (minimum 10 epochs), and therefore could not contribute to the difference waves A1 to A4.


Figure 4.5. The composition of the temporal trajectories in difference wave A1 ”Report condition: button-based percept trace” (right column, E & F). Y-axis: momentary proportion of the int-percept (%) among all valid percepts (int and seg). The across-participant mean is represented by the solid white line within the red (int, first row) or blue (seg, second row) tunnel, which symbolizes the standard error of the mean. Light grey bar in the background: no-/inducer (250 ms), starting at 0 s. The left and middle columns show intermediate steps in the composition of the difference wave: The difference wave is the millisecond-wise individual difference between the inducer (left column, A & B) and the no-inducer (middle column, C & D) condition, averaged across participants.

Difference wave A2 “Report condition: The SVM-based percept trace. Categorization: SVM-based” (Experiment 1). The underlying percept traces (Figure 4.6) were decoded from the eye-movements (SVM-based). The categorization of the extracted epochs depended on the SVM-based percept traces themselves. Still, in this condition, the participant signaled the momentary percept via button-press. All report blocks were decoded by all three SVM-models, summing up to 108 categorizable epochs per individual.



Figure 4.6. Difference wave A2: “Report condition: The SVM-based percept trace. Categorization: SVM-based”. Description details can be found in Figure 4.5.

Difference wave A3 “Report condition: The SVM-based percept trace. Categorization: button-based” (Experiment 1). Difference wave A3 (Figure 4.7) relies on the same extracted epochs as difference wave A2. However, whether an epoch was categorized as before: integrated, segregated, or undefined relied on the button-based percept trace (A1). In this case, the proportion of individual percept traces one millisecond before inducer start is not necessarily 0 % (for the category before: seg) or 100 % (for the category before: int) at the time of the inducer. The deviation reflects a certain degree of general decoupling between the SVM-based and the button-based percept traces.


Figure 4.7. Difference wave A3: “Report condition: The SVM-based percept trace. Categorization: button-based”. Description details can be found in Figure 4.5.

Difference wave A4 “No-report condition: The SVM-based percept trace” (Experiment 1). In this condition, the report was absent, but the participant was instructed to focus on the visual stimuli.


The difference wave A4 (Figure 4.8) is based on 72 inducers stemming from two blocks with 12 inducers each (each block is decoded by all three SVM-models).


Figure 4.8. Difference wave A4: “No-report condition: The SVM-based percept trace”. Description details can be found in Figure 4.5.

All difference waves A1 – A4 are depicted in one plot (Figure 4.9).


Figure 4.9. Collapsed view of difference waves A1 – A4 (mean traces only): difference wave: int (A) and difference wave: seg (B).

Overall effect of the inducer The overall perceptual response to the inducer was quantified as the percentage percept (%) and the phase duration (s). Both measures were based on exclusive button-press time of the report blocks with and without inducer. For the percentage percept (int/seg), the exclusive button-press duration was divided by the block duration and averaged over all blocks within participants for the respective condition (no-/inducer). Across all eight participants (Figure 4.10A), the percentage of the integrated percept (M = 54.24 %, SD = 3.61 %, no-inducer) did not differ significantly from the inducer blocks (M = 59.19 %, SD = 2.61 %), t(7) = -1.68, p = .14. The percentage of the segregated percept in the no-inducer condition (M = 44.13 %, SD = 3.55 %) remained untested against the inducer condition (M = 39.03 %, SD = 2.54 %) due to the lack of statistical independence from the integrated percept. For the phase duration, the exclusive button events were collected separately for the integrated and the segregated percept from all four (no-inducer) or three (inducer) blocks, respectively, first aggregated within blocks (median) and then within participants over the respective blocks (mean). Across participants (mean), the phase duration of the integrated percept was on average 4.85 s (SD = 0.39 s, no-inducer, Figure 4.10B) and 5.07 s (SD = 0.48 s) in the inducer blocks. The average phase duration of the segregated percept was 4.38 s (SD = 0.75 s) in the no-inducer condition and 3.70 s (SD = 0.38 s) in the inducer condition. The interaction (Percept x Stimulus-type) of the repeated-measures ANOVA was not significant, F(1, 7) = 1.22, p = .305. Similarly, the factor Stimulus-type was not significant, F(1, 7) = 1.08, p = .334. The factor Percept was not significant either, i.e., there was no significant difference between its levels, F(1, 7) = 3.38, p = .108.


Figure 4.10. Percentage percept (A) and phase duration (B) in the no-/inducer blocks; for integrated (red) and for segregated (blue) percept separately. Error bars represent the standard error of the mean across the participants.

C4.1.5 Experiment 1: Discussion

In Experiment 1, we investigated how the visual perceptual dynamics can be biased by the inducer. During the run, a disambiguated version of the otherwise ambiguous moving-plaid stimulus was interleaved for 250 ms to increase the probability of perceiving the integrated percept after the inducer’s appearance. Depending on the perceptual state right before the inducer (int/seg), the integrated percept was hypothesized to be stabilized (int) or switched to (seg), resulting in both conditions in an elevated probability of perceiving the integrated percept just after the inducer.

All difference waves exhibited a similar time course and were only investigated numerically in Experiment 1. The probability of perceiving the integrated percept peaked shortly (ca. 1.5 s) after the end of the inducer. At all other time points (e.g., pre-inducer), the probability of being in the integrated percept was at an equal level for both conditions (no-/inducer): numerically around baseline. The inducer effect was independent of the percept before the inducer (int/seg). Numerically, the peak was slightly more pronounced if the participant was in the segregated state before the inducer. Possibly, the inducer evoking the integrated percept was more effective when disrupting the segregated percept than when stabilizing the integrated percept. We cannot provide statistical evidence for this observation at this point. However, in Experiment 2, the findings in the difference waves are statistically tested, based on what was learned in Experiment 1. In sum, the inducer proved capable of biasing the visual percept independently of the momentary perceptual state.

In the report condition, the percept traces read out from the eye-movements (SVM-based) were flanked by the manual report (button-based). Numerically, the perceptual dynamics at the inducer, extracted from button- or SVM-based percept traces, showed a very similar time course. The SVM-based percept traces depended on the OKN, an involuntary eye-movement evoked by the bistable stimulus. The numerical consistency between the SVM- and button-based percept traces thus underlines the capability of the participants to report their momentary subjective percept correctly. The report-less decoding of the perceptual state from the eye-movements can therefore be regarded as an objective measure of perceptual multistability (Einhäuser et al., 2020). Since both difference waves (no-/report) display numerically similar time courses, they might represent a similar process. Potentially, the impact of the reporting process was of negligible influence, because the direct report and the indirect read-out produce numerically similar difference waves. A technical detail that is of importance for Experiment 2: Whether the classification of the SVM-based percept epochs into the two difference waves (int/seg) was tied to the button- or the SVM-based percept trace at the time of the inducer had numerically no influence on the time course.

Perceptual multistability studies often rely on the report of the subjectively experienced percept, which is circumvented in our approach. We found numerical evidence that the manual report of the percept was not a prerequisite for reading percepts from the eye-movements. Even in the absence of the manual report (i.e., the no-report condition) and in the absence of the (suggestive) instruction, the probability of perceiving the integrated percept increased after the inducer. In the no-report condition, the difference wave shows a numerical trend towards an increasingly integrated percept in both difference waves (int/seg), but a distinct peak is absent (difference wave A4, SVM-based). In the following bimodal setting of Experiment 2, it is necessary that the visual percept can be read out from the eye-movements without a parallel button-based report, which is reserved there for the auditory report.

In Experiment 1, an overall effect of the inducer on the aggregated measures of perceptual dominance (percentage and phase duration) was not found. A significant overall effect would have proven the effectiveness of the inducer in general but would, at the same time, have complicated the interpretation of the local effects in the difference waves: The perceptual dynamics before the inducer could be overshadowed by overall effects, and after the inducer, the origin of the effects would be unclear. However, no such effect was found, which is promising for the investigation in Experiment 2.

At last, we demonstrated that the SVM-decoding procedure, which is crucial for the local effects (SVM-based), functions robustly in the context of the partly disambiguated stimulus: The decoding accuracy (ACC) in the inducer blocks was as high as in the no-inducer blocks and similar to earlier reported levels of the ACC (Study 2). We therefore suspect that the inducer is not capable of abolishing the previously found audio-visual coupling.

In the bimodal Experiment 2, the directionality and the strength of the multimodal coupling are investigated by applying the visual inducer inspected above. If the bimodal coupling is strong and bidirectional or visually led, the impact of the inducer on the auditory percept should be similar to its effect on the visual percept. In this case, the perceptual dynamics at the inducer should respond analogously. The following hypotheses and statistical tests were developed in light of the above analysis and interpretation of the data observed in Experiment 1.

C4.2.1 Experiment 2: Introduction

In the light of the results of Experiment 1, we compiled the following set of hypotheses to inquire how biasing the visual bistable process influences a concurrent auditory bistable process. At the heart of the investigation is the behavioral report (button-based) of the auditory percept at the time of the visual inducer. Control conditions are applied.


Confirmatory research questions
(1) “Does the SVM-decoding function in the presence of inducers in the bimodal setting?”
• Hypothesis 1A: The visual-auditory consistency (VAC) in the bimodal condition is above chance level (analysis 1A, abbr. Ana1A).
• Hypothesis 1B: The VAC in the bimodal condition with and without inducer is similar (Ana1B).

(2) “Does the visual inducer have an effect on the visual percept (unimodal setting) and on the auditory percept (bimodal setting)?”
• Hypothesis 2A (unimodal visual, overall effect): The visual inducer has no influence on the perceptual dominance (percentage percept & phase duration) of the unimodal visual percept (Ana2A). Thus, the effect of the visual inducer on the visual percept is local.
• Hypothesis 2B (unimodal visual, local effect): The visual inducer increases the probability of the integrated percept being perceived visually (Ana2B).

• Hypothesis 3A (bimodal-auditory, overall effect): The visual inducer has no influence on the perceptual dominance (percentage percept & phase duration) of the bimodal-auditory percept (Ana3A). Thus, the effect of the visual inducer on the auditory percept is only local.
• Hypothesis 3B (bimodal-auditory, local effect): The visual inducer increases the probability of the integrated percept being perceived auditorily (Ana3B).

(3) “Is the effect of the visual inducer on the auditory percept mediated by the passively attended visual modality?”
• Hypothesis 4 (bimodal-visual, local effect): The visual inducer increases the probability of the integrated percept being perceived visually in the bimodal setting (Ana4).

All following hypotheses are based on local effects (i.e., perceptual responses in difference waves). The time courses of the perceptual response to the inducer are compared (i) on a moment-by-moment basis or (ii) with three parameters derived from the temporal trajectory of the difference wave: Parameter 1, the latency of the maximal amplitude; Parameter 2, the maximal amplitude; and Parameter 3, the mean amplitude.

(4) “In which order does the inducer affect the perceptual responses?” One prerequisite for inquiring into the order of the inducer effect is independence from the channel the percept is derived from (SVM- or button-based):


• Hypothesis 5A: The visual perceptual response is equally represented in the different channels (SVM- and button-based), Ana5A.
Parameter-based comparisons:
• Hypothesis 5B: If the auditory perceptual response to the visual inducer happened earlier [latency] and were stronger [amplitude] than the visual one, the expected effect direction would be in question.
o The active attendance (SVM-based, unimodal) to the visual percept leads to shorter or equally short latencies compared to the passive attendance (SVM-based, bimodal). Similarly, the mean and the maximal amplitude are larger or equally large for the actively attended compared to the passively attended visual percept (Ana5B1).
o The active attendance (SVM-based, unimodal) to the visual percept leads to shorter or equally short latencies compared to the actively attended auditory percept (button-based, bimodal). Similarly, the mean and the maximal amplitude are larger or equally large for the actively attended visual percept compared to the actively attended auditory percept (Ana5B2).
o The passive attendance (SVM-based, bimodal) to the visual percept leads to shorter or equally short latencies compared to the actively attended auditory percept (button-based, bimodal). Similarly, the mean and the maximal amplitude are larger or equally large for the passively attended visual percept compared to the actively attended auditory percept (Ana5B3).

Explorative research question
“Does the effectiveness of the inducer on the visual percept predict the effectiveness of the visual inducer on the auditory percept?”
The prerequisite for inquiring into correlations between the parameters in the bimodal setting is that the extraction type (SVM- or button-based) in the unimodal setting already produces meaningful correlations:
• Hypothesis 6A: The parameters of the visual perceptual response correlate between the different channels (SVM- and button-based, uni-visual-active, Ana6A).
In a second step, the parameters derived from difference waves stemming from the concurrent bimodal stimulation (auditory, visual) are correlated. If the prerequisites in Ana6A are not fully met, mixed correlations between uni- and bimodal parameters are investigated instead.
• Hypothesis 6B: The latency and the strength (amplitude) evoked by the visual inducer on the visual percept (bimodal) predict the latency and the strength on the auditory percept, as derived from the respective difference waves (bimodal, Ana6B).


C4.2.2 Experiment 2: Material and Methods

The setup and the visual stimulus were identical to Experiment 1. Changes in the general procedure (e.g., the two-day session), the auditory setup, the audio-visual stimuli, and the bimodal setting are described below.

Participants Seventeen individuals, none of whom had participated in Experiment 1, took part in Experiment 2 (one participant could not appear for the experimental session and was not included in the following analysis). The remaining sample (N = 16) exhibited the following demographic properties: a mean age of 22.5 years; 3 identified as male, 13 as female; two were left-handed, 14 right-handed. The data collection was disrupted by the measures (“lockdown”) taken to prevent the further spread of SARS-CoV-2 in Germany: Originally, it was planned to collect data from 24 participants.

Setup For fifteen participants, the right eye was recorded; for one participant, the left eye was recorded for technical reasons. In addition to the setup described for Experiment 1, auditory equipment was installed for Experiment 2. The auditory stimulus was presented over “HD25-1-II” headphones, 70 Ω (Sennheiser, Germany), which were directly controlled through the soundcard built into the “ViewPixx” display system. The sound-pressure level was externally verified with a “Type 2270” hand-held sound level meter and “Type 4101-A” binaural microphones (both by Bruel & Kjaer, Denmark). The timing of the tones was externally verified with an oscilloscope and recorded internally; the jitter of the tone start times was on average 7.78 microseconds (SD = 6.62 microseconds).

Auditory stimulus The auditory stimulus was ternary: either ambiguous to induce bistability, disambiguated to mimic the two percepts (for the response verification), or silent for the unimodal visual blocks. All tones (pure and chirp) were windowed at their start and end by a 5 ms interval in which the A-weighted sound pressure level was controlled by a raised-cosine ramp from 0 dB(A) to 70 dB(A); at the end of the tone, vice versa from 70 dB(A) to 0 dB(A), to avoid acoustic artifacts in the headphone evoked by an abrupt deflection of the sound membrane. All tones had an average sound-pressure level of 70 dB(A).
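For illustration, a minimal sketch of such a ramped tone, assuming a raised-cosine window (sample rate and function name are illustrative, not from the original experiment code):

import numpy as np

def ramped_tone(freq_hz, dur_s=0.130, ramp_s=0.005, fs=44100):
    t = np.arange(int(dur_s * fs)) / fs
    tone = np.sin(2 * np.pi * freq_hz * t)
    n = int(ramp_s * fs)
    ramp = 0.5 * (1 - np.cos(np.pi * np.arange(n) / n))  # raised cosine, 0 -> 1
    tone[:n] *= ramp          # 5 ms fade-in
    tone[-n:] *= ramp[::-1]   # 5 ms fade-out
    return tone

tone_a = ramped_tone(400.0)   # e.g., tone A at 400 Hz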

Ambiguous auditory stimulus (pure tone). The auditory streaming paradigm implemented for Experiment 2 was a combination of two sinusoidal tones (tone A and tone B; duration 130 ms each, SOA: 150 ms) arranged in a repeating ABA_ triplet: tone A, 20 ms silence, tone B, 20 ms silence, tone A, 20 ms silence, 150 ms silence, with this cycle repeating until the end of the block. Hence, 7 tones per 1,050 ms were played. The stimulus was presented identically to both ears. One premise for the crossmodal coupling of auditory and visual bistability is that, already under unimodal stimulation, the general probability of perceiving either percept (percentage integrated or segregated) should be similar. For example, a coupling of multistability phenomena at parameters evoking unimodally divergent perceptual proportions (e.g., visually: 20 % integrated, 80 % segregated; auditorily: 70 % int, 30 % seg) is unlikely. To avoid this fallacy, the stimulus parameters were controlled to adapt the perceptual dominance across modalities to a similar level. In an earlier experiment (Study 2), the participants had been rather insensitive to the individual adaptation of the opening angle in the specific visual stimulus. Since the exact visual stimulus was reproduced here, the adaptation procedure was limited to the auditory stimulus, which had shown a higher susceptibility to parameter changes in Study 2. The proportions of the integrated and the segregated percept in auditory streaming can be influenced by the frequency difference between tones A and B (Denham et al., 2013b): If the frequencies of the two tones are close (small difference), the integrated percept becomes more probable; if the frequency difference is larger, the segregated percept becomes dominant. The goal of the manipulation was a perceptual distribution of the individual perception of the auditory stimulus that is symmetric between the percepts within a block.

Frequencies were adjusted per individual around the center frequency: the geometric mean fm of fA and fB was fixed at 534 Hz (Figure 4.11A). That is, the tones had equal frequency ratios (fm / fA = fB / fm); for example, a 10-semitone frequency difference corresponds to fA = 400 Hz and fB = 712 Hz. With a range of 1 to 14 semitones (1 semitone [fA = 518 Hz and fB = 549 Hz], 2 semitones, 3 semitones, …, 14 semitones), it was possible to bring the dominating percepts to an equal level in most participants. A sample soundtrack can be found in Supplement U4A.
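The mapping from a chosen semitone distance to the individual tone frequencies around the fixed geometric mean fm can be sketched as follows (function name illustrative):

def ab_frequencies(semitone_distance, fm=534.0):
    # Place fA and fB half the distance below/above fm so that fm / fA = fB / fm.
    half = semitone_distance / 2.0
    f_a = fm * 2.0 ** (-half / 12.0)
    f_b = fm * 2.0 ** (half / 12.0)
    return f_a, f_b

print(ab_frequencies(10))  # ~ (400 Hz, 713 Hz), the 10-semitone example
print(ab_frequencies(1))   # ~ (519 Hz, 550 Hz), close to the 1-semitone example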

Disambiguated auditory stimulus (chirp). As a validation procedure for both perceptual auditory states, a response verification was applied. At the end of four out of eight bimodal blocks of the experimental session, the response verification was formed by chirp-triplets: In four episodes consisting of disambiguated epochs of the integrated and the segregated percept, the participant could demonstrate the capacity to report the currently dominating percept correctly and instantaneously. In the disambiguated episodes, tones A and B were cyclically played in the well-known repeating “ABA_” pattern. However, tones A and B were no longer of constant frequency but had a dynamic range to imitate a particular percept (integrated or segregated). The chirps changed the frequency within the original tone duration (130 ms, SOA: 150 ms) in such a way that the frequencies within the tones nearly faded into each other (integrated percept, tone A: chirp up, tone B: chirp down) or pointed symmetrically away from each other (segregated percept, tone A: chirp down, tone B: chirp up) [Figure 4.11B]. For most participants, it proved easy to distinguish between both stimuli and associate them with their inner model of the percepts (Study 1). The chirp tone consisted of three parts: a ramp (30 ms) starting at an “initial frequency”, which changed linearly to the center frequency fA or fB (distance set individually) and then remained constant for 70 ms; the chirp ended with a ramp (30 ms) in which the frequency returned linearly to the initial frequency (Figure 4.11B). The integrated percept consisted of the chirp up for the lower-frequency tone A and the chirp down for the higher-frequency tone B; the segregated percept consisted of the chirp down for the lower-frequency tone A and the chirp up for the higher-frequency tone B. In both cases, fA, fm, and fB matched the ambiguous case; fA+Δ and fB+Δ were chosen such that the relative distances between all subsequent frequency levels were identical, fA / fA+Δ = fm / fA = fB / fm = fB+Δ / fB. The procedure was applied to all 14 pairs of A and B tones to accommodate the individual semitone difference used in the ambiguous stimulus. A sample soundtrack can be found in Supplement U4B.
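A hedged sketch of this three-part chirp (30 ms linear glide from the initial to the center frequency, 70 ms constant, 30 ms glide back); whether a chirp counts as “up” or “down” depends on whether the initial frequency lies below or above the center frequency, and the numbers here are illustrative:

import numpy as np

def chirp_tone(f_center, f_initial, fs=44100):
    glide = lambda n, f0, f1: np.linspace(f0, f1, n, endpoint=False)
    n_ramp, n_flat = int(0.030 * fs), int(0.070 * fs)
    freq = np.concatenate([glide(n_ramp, f_initial, f_center),   # 30 ms glide in
                           np.full(n_flat, f_center),            # 70 ms constant
                           glide(n_ramp, f_center, f_initial)])  # 30 ms glide out
    phase = 2.0 * np.pi * np.cumsum(freq) / fs                   # integrate frequency
    return np.sin(phase)

up_a = chirp_tone(400.0, 380.0)  # glides up toward fA; the 380 Hz start is illustrative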

[Figure 4.11 depicts the frequency levels fA-Δ, fA, fm, fB, and fB+Δ over the “ABA_” sequence (tone/chirp types; 130 ms tones separated by 20 ms gaps).]
Figure 4.11. Auditory stimulus. A) Ambiguous auditory streaming triplet “ABA_”-sequence. B) Disambiguated “ABA_”-sequence, composed for the integrated percept of the chirp-sequence “up-down-up” and for the segregated percept of the chirp-sequence “down-up-down”.

Silence. Over the whole course of the experiment, the participants wore the headphones, even when no sound was played. In this way, participants could adapt to the tactile pressure of the headphones on their ears and head. This applied to the unimodal visual blocks of both the adaptive and the experimental session.


Experimental Procedure The experiment was split into two sessions, taking place on two days (on average 1.25 days between sessions, depending on the individual agreement with the participant). The adaptive session (day 1) consisted of the instructions (introducing the concept) and the auditory stimulus adaptation procedure (changing the auditory parameter). The experimental session (day 2) was the main experiment (containing three unimodal visual, two unimodal auditory, and eight bimodal blocks). In all blocks of Experiment 2, the participants had to report their momentary percept (integrated/segregated) through the button-box. Between the auditory and the visual tasks, the mapping of the buttons to the percepts changed for all participants.

Adaptive session: Auditory unimodal blocks. The adaptive session had two objectives: instructing the participant on the paradigm and adapting the auditory stimulus towards equal perceptual dominance of both percepts. After reading the paper instruction for the auditory stimulus, an auditory demonstration was presented to the participant and discussed with the experimenter. For 10 s, the ABA_ triplet was played with a distance of 1 semitone (most likely triggering the integrated percept, Figure 4.12A), then for 10 s with a distance of 13 semitones (most likely triggering the segregated percept, Figure 4.12B), and, finally, for 30 s with a distance of 7 semitones (most likely triggering ambiguity, leading to switching percepts). If the participant felt unsure about the relation between the theoretical explanation (subjectivity, etc.) and the tones, the procedure was repeated until the participant grasped the difference. Importantly, the mapping of the auditory percepts differed for one half of the participants: The left button (operated by the left-hand index finger) signaling the integrated percept and the right button (right-hand index finger) signaling the segregated percept were swapped with regard to the percepts for participants 1, 3, 5, 7, 9, 11, 14, and 16.

Figure 4.12. Excerpt from the auditory instruction. Visualization of possible perceptual interpretations: A) integrated and B) segregated.

After successful instruction, the auditory adaptation procedure followed. The procedure contained six unimodal auditory blocks with a duration of 120 s each. The experimenter could choose the frequency-distance (in semitone steps) before each block start. The frequency-distance of the first block was fixed to 7 semitones. For the following four blocks (blocks 2 to 5), the experimenter reviewed the proportions of the percepts (integrated & segregated) the participant had reported by button-press in the preceding block. Accordingly, a frequency-distance expected to evoke equal percept dominance was chosen for the next block. The percept proportions were shown to the experimenter after each block for all passed blocks, as exclusive button-press time (int/seg) relative to the block duration. Before the last block (block 6) of the auditory adaptation procedure, the experimenter chose the frequency-distance that had shown, in general, the smallest deviation from 50 % integrated / 50 % segregated percept. The final parameter was applied to the auditory stimulus throughout the whole experimental session (pure tones & chirps). During the unimodal auditory stimulation, a dark-grey filled circle (luminance as the dark-grey grating in the background, 6 cd/m2) of the size of the visual stimulus was displayed, and participants were instructed to look at it at all times.

Adaptive session: Visual unimodal blocks. After the auditory adaptation procedure, the participant was instructed for the visual task. The instruction and the button-mapping to the percepts read exactly as the report instruction of Experiment 1. The button-pairs used to report the percepts did not overlap crossmodally. Again, open questions were discussed, and two 30 s demonstrations of the ambiguous visual stimulus were played to the participant. If necessary, the visual demo was repeated. From this point until the end of the experimental session, eye-tracking was recorded. Before all blocks, the EyeLink calibration and validation procedure was executed. Ultimately, four unimodal visual blocks (no-inducer, inducer, inducer, no-inducer) were presented to the participant to close the adaptive session. The first and the last unimodal visual block were no-inducer blocks with a duration of 195 s of ambiguous visual stimulus, as introduced in Experiment 1. Equally, the second and third unimodal blocks were inducer blocks. In Experiment 2, 12 inducers interrupted the ambiguous visual stimulus every 10 – 20 s (M = 15 s) with the disambiguated version of the visual stimulus for 250 ms. Only 11 inducers were analyzed per block. This procedure led to block durations that varied randomly around 195 s.
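The inducer schedule can be sketched as follows (a toy schedule, not the original trial-generation code):

import numpy as np

rng = np.random.default_rng(0)
gaps = rng.uniform(10.0, 20.0, size=12)  # ambiguous interval before each inducer (s)
onsets = np.cumsum(gaps)                 # 12 inducer start times, ca. every 15 s
block_duration = onsets[-1] + 0.250      # each inducer shows 250 ms of disambiguation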

Experimental session: overview. Half of all participants passed the experimental session in the order: 1. unimodal visual, 2. unimodal auditory, 3.-6. bimodal, 7. unimodal visual, 8.-11. bimodal, 12. unimodal auditory, 13. unimodal visual. For participants 2, 4, 6, 7, 9, 11, 14, and 16, block 1 was swapped with block 2, and block 12 with block 13. Bimodal blocks could be with or without auditory response verification and with or without visual inducers. For the first group of bimodal blocks (blocks 3 – 6), in a predefined manner for equally sized subsets of participants, two blocks were with (inducer) and two blocks without inducer (no-inducer).


All six possible sequences were covered, e.g., no-inducer block, no-inducer block, inducer block, inducer block. For the second group of four bimodal blocks (blocks 8 – 11), the inducer identity of the first group of bimodal blocks was inverted, e.g., inducer block, inducer block, no-inducer block, no-inducer block. For the first and the second group of blocks separately, the auditory response verification was assigned randomly once to one of the two inducer and once to one of the two no-inducer blocks. In this way, the block durations varied around 3 – 3.5 minutes (min).

Experimental session: Visual and auditory unimodal blocks. Before each unimodal block, the respective instruction (auditory or visual), including the button-mapping, was recounted on the screen. The unimodal blocks were ambiguous in their respective modality for the first 195 s (Figure 4.13A, B). The unimodal auditory blocks were seamlessly followed by the response verification (Figure 4.13B): The 30 s auditory response verification was split into four episodes lasting 7.5 s each. Importantly, the disambiguated auditory stimulus remained unheard by the participant until the first unimodal auditory block of the experimental session and was not further explained or discussed. The hit rate in the response verification is one measure to identify data-sets adequate for the main analysis: It quantifies the participants' understanding of the auditory task and the correct assignment of the two buttons to the two percepts. In the unimodal auditory blocks, the participants were instructed to move their eyes freely within the empty circle stimulus displayed on the screen. In the unimodal visual blocks, the participants heard silence over the headphones for the whole block duration (195 s). A visual response verification was not applied.

Figure 4.13. The visual and auditory unimodal blocks of the experimental session. A) The unimodal visual blocks always come without inducer in the experimental session (195 s). B) The unimodal auditory blocks always come with an auditory response verification (195 s ambiguous, seamlessly followed by 30 s of disambiguated episodes: integrated/segregated, 7.5 s each, or inverted).

Experimental session: Bimodal blocks. Before each bimodal block, the auditory instruction was presented to emphasize the report of the auditory percept. The bimodal blocks could be with or without inducer and with or without auditory response verification. All four combinations were realized within the first and the second group of four bimodal blocks, in individually differing order:


(1) Bimodal blocks with visual inducer alternated 12 times between the ambiguous and the disambiguated version of the moving plaid. For the same, varying duration, the ambiguous auditory stimulus was played. The block ended after the last, the 12th, inducer (Figure 4.14A).
(2) Bimodal blocks with visual inducer and with auditory response verification were shaped as (1) but seamlessly followed by 30 s of auditory response verification, accompanied by 30 s of the ambiguous visual stimulus (Figure 4.14A, C).
(3) Bimodal blocks without visual inducer and without auditory response verification consisted of the ambiguous visual stimulus with the ambiguous auditory stimulus played along. The block ended exactly after 195 s (Figure 4.14B).
(4) Bimodal blocks without visual inducer and with auditory response verification were shaped as (3) but seamlessly followed by 30 s of auditory response verification, accompanied by 30 s of the ambiguous visual stimulus (Figure 4.14B, C).

Figure 4.14. The bimodal blocks of the experimental session came in different versions. The first ca. 195 s always contain an ambiguous auditory stimulus accompanied by an ambiguous visual stimulus either with inducer (A) or without inducer (B). Half of all bimodal blocks (one no-inducer and one inducer block per group) are seamlessly followed by an auditory response verification (C), which is accompanied by an ambiguous visual stimulus for its duration.

C4.2.3 Experiment 2: Data analysis

Criterion-based data-set inclusion Data of 16 participants were collected with the above-described procedure. (1) If the hit rate was below 70 %, the individual data-set was excluded from the whole analysis. (2) If the ACCno-inducer in the unimodal visual blocks was below 70 %, the data-set was excluded from the SVM-based analyses (VAC and SVM-based difference waves, Analyses 1, 5, and 6) but not from the button-based analyses (overall effects, button-based difference waves, Analyses 2 - 4).


Hit rate in the response verification. The hit rate was based on the auditory disambiguated episodes at the end of four bimodal blocks and of two unimodal auditory blocks (blocks 1 and 9, or blocks 2 and 7 of the experimental session, depending on the individual block order). The first and the last four bimodal blocks each contained two bimodal blocks that incorporated an auditory response verification at a random position. All exclusive time during which the correct button, associated with the momentarily played disambiguated auditory stimulus, was pressed was summed up within a participant and divided by the total duration of the response verification. For technical reasons, in five out of 60 blocks, the last inducer fell into the auditory response verification; only the subsequent intact episodes were considered in the hit-rate analysis. The resulting six percentages were averaged within participants.

From ACC (unimodal) to VAC (bimodal) General decoding accuracy: ACC. The SVM-model training (three SVM-models), the model evaluation, and the ACC computation were exactly as in Experiment 1, based on the three unimodal visual blocks without inducer (no-inducer blocks 1, 7, 13, or blocks 2, 7, 12 of the experimental session, depending on the individual block order). In this vein, three SVM-models were trained, generating two ACCno-inducer each. Data-sets with an average general decoding accuracy (mean over 6 ACCno-inducer) lower than 70 % were excluded from all SVM-based analyses. ACCinducer was not computed for Experiment 2.

Active and passive task relevance. Each SVM-model (1 - 3) was applied to decode the bimodal blocks of the experimental session and the last four unimodal visual blocks of the adaptive session. For all following analyses, it is important to note that in the adaptive session, the stimulation was only visual, and the momentary visual percept was reported (active task: visual). At the same time, the momentary visual percept was read out from the eye-movements by the SVM-decoding, as the eye-movements followed distinguishable patterns depending on the dominant percept. This setting (unimodal visual, report, no-/inducer) was the same as in the unimodal visual blocks (unimodal visual, report, all no-inducer) of the experimental session on which the SVM-models were trained (active task: visual). In the bimodal blocks, however, the visual percept had only passive task relevance: The momentary auditory percept was reported via button-press (active task: auditory), and the momentary visual percept was read out by the SVM-decoding from the eye-movements. The visual passive task in the bimodal setting was specified in the instruction process as to only “keep the eyes on the visual stimulus at all times and let them move freely”.

The visual-auditory consistency: VAC. The three SVM-models decoded the eye-movements in all eight bimodal blocks, in which the participant reported the momentary auditory percept (button-based). By comparing the decoded labels (visual percept, SVM-based) with the actual labels (auditory percept, button-based), it is possible to compute the visual-auditory consistency, analogous to the “consistency” in Study 2. The VAC exploited the same variables as the ACC but was based on the bimodal blocks: The ACC is a metric comparing the decoded with the actual labels (button-based) in the unimodal visual blocks. The VAC was calculated just like the ACC, with the difference that the “actual” labels reflect the auditory percept, while the decoded labels reflect the visual percept. The VAC underwent two statistical tests, replicating earlier findings on the general visual-auditory coupling (Study 2). For each of the eight blocks of a participant, the VAC was calculated. For the measure VACALL, all eight VACs were averaged within participants and t-tested (two-sided, expected alpha error: 5 %) against chance (50 %), which was expected to be exceeded positively (Ana1A). For the measures VACinducer and VACno-inducer, the respective four blocks were averaged within participants, and subsequently both measures were t-tested against each other, expecting the VAC to be independent of the presence of the inducer (Ana1B).
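Both ACC and VAC reduce to the same label-consistency computation, sketched below under the assumption of per-millisecond labels (undefined samples excluded; names are illustrative):

import numpy as np

def label_consistency(decoded, actual):
    # decoded: SVM-based labels (visual percept); actual: button-based labels
    # (visual report for the ACC, auditory report for the VAC).
    decoded, actual = np.asarray(decoded), np.asarray(actual)
    valid = (decoded != "und") & (actual != "und")
    return 100.0 * np.mean(decoded[valid] == actual[valid])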

Confirmative analysis: The impact of the visual inducer The numerical findings of Experiment 1 regarding the susceptibility of the visual percept (SVM-based and button-based) to the visual inducer were statistically tested in Experiment 2. The parameters to be statistically investigated were selected based on Experiment 1. In this way, the effect of the inducer on the visual and the auditory percept can be quantified.

Overall effect of the inducer. To capture the overall effect, aggregated measures of multistability were investigated, implemented just as in Experiment 1.

Local effect of the inducer (auditory and visual): Difference wave. In Experiment 2, the difference waves captured the effect the inducer has…
• …when the visual percept (SVM- & button-based, B1 - B3) was the active task,
• …when the auditory percept (button-based, B4) was the active task,
• …when the visual percept (SVM-based, B5) was the passive task.
The difference waves of Experiment 2 carry the badge “B_” (in Experiment 1 “A_”). The badge, e.g., “difference wave B1”, is followed by a short characterization, e.g., “difference wave B1 uni-visual-active-beh/beh”. “uni” can also be “bi” and refers to the data this particular difference wave was derived from (unimodal or bimodal). “visual” can also be “auditory” and refers to the modality the percept trace (SVM-/button-based) stemmed from. The visual percept could be either “active” (SVM- & button-based, adaptive session) or “passive” (SVM-based, experimental session); the auditory percept could only be “active” (button-based). The first “beh” stands for a behavioral (button-based) percept trace but can also be “svm” (an SVM-based percept trace), for the visual percept only. The second “beh” codes whether the categorization decision for the epochs was based on behavioral data (button-based, “/beh”) or on decoded labels (SVM-based, “/svm”). The following difference waves were implemented:
1. Difference wave B1 uni-visual-active-beh/beh,
2. Difference wave B2 uni-visual-active-svm/svm,
3. Difference wave B3 uni-visual-active-svm/beh,
4. Difference wave B4 bi-auditory-active-beh/beh, and
5. Difference wave B5 bi-visual-passive-svm/svm.
Later, three parameters (latency of the maximal amplitude, the maximal amplitude itself, and the mean amplitude) were derived from the temporal trajectories of the difference waves (int/seg separately) to explore the commonalities and differences of the inducer's effect on the perceptual dynamics captured from varying sources (B1 – B5).

Statistics on the difference waves. The difference waves (int/seg separately) were examined on a sample-by-sample basis (millisecond-wise) to characterize their temporal trajectories. This involved comparing the difference waves (4,001 samples) against zero at each sampling point (1 ms) by t-tests. The temporal trajectory approaches zero if there is no difference between the experimental (inducer) and the control condition (no-inducer). An alpha-level correction was implemented to compensate for the repeated statistical testing (4,001 t-tests), using the false discovery rate (FDR) procedure introduced by Benjamini and Hochberg (1995). Based on the distribution of p-values in a given trace, this procedure generates a corrected alpha level that lies between the uncorrected alpha level of 5 % and the Bonferroni adjustment, which would result in an overly conservative alpha level of 5 %/4,001. Significance is denoted whenever the p-value remained under the FDR-corrected alpha level, which is reported along with each analysis. The longest chain of uninterrupted significant samples is reported with start point, duration, and end point. The effectiveness of the inducer on the perceptual response was considered proven if, in either difference wave: int or difference wave: seg, a substantially long chain of significant samples with positive amplitude occurred after inducer start.
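A minimal sketch of the Benjamini-Hochberg step-up procedure as applied here, returning the corrected alpha level for the 4,001 sample-wise p-values (names are illustrative):

import numpy as np

def fdr_alpha(p_values, q=0.05):
    # Largest p(k) with p(k) <= (k / m) * q; returns 0.0 if no p-value qualifies.
    p = np.sort(np.asarray(p_values))
    m = len(p)
    below = p <= (np.arange(1, m + 1) / m) * q
    return float(p[below].max()) if below.any() else 0.0

# Samples with p <= fdr_alpha(p_values) count as significant; the longest
# uninterrupted chain of such samples is then reported.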

Unimodal: The effect of the visual inducer on the visual percept Adaptive session. The following analysis was based on the four unimodal visual blocks of the adaptive session (2x no-/ 2x inducer).

Ana2A. Overall effect on the visual percept (unimodal). The overall effect of the visual inducer on the aggregated measures of the visual percept, the phase duration (s) and the percentage percept (%), was investigated in the four unimodal visual no-inducer (2x)/inducer (2x) blocks of the adaptive session. The exact procedure of Experiment 1 was administered.

Ana2B. Local effect on the visual percept (unimodal). The visual perceptual dynamics around the inducer were analyzed just as in Experiment 1 (SVM- & button-based difference waves) and statistically tested.

Difference wave B1 uni-visual-active-beh/beh ”Unimodal: The button-based percept trace” (Experiment 2). The button-based percept traces for B1 were taken from the preprocessed label-data of the four unimodal visual blocks (first “beh/”). The categorization (into before: int and before: seg at the inducer) was determined on the button-based percept trace itself (second “/beh”). The percept traces consisted of the states integrated, segregated, and undefined. The inducer blocks (2x) were paired, respectively, with the no-inducer blocks (2x); all starting time points of the inducers (11) per inducer block were transferred to a no-inducer block: i.e., the time points of the inducers in the first, second, … block with inducers were transferred to the first, second, … block without inducers, respectively. In sum, a total of 22 inducers were analyzed per participant. The minimum number of epochs was set to 5: If a participant had fewer than 5 epochs in the constitution of difference wave B1, the individual contribution was excluded from difference waves B1, B2, and B3.

Difference wave B2 uni-visual-active-svm/svm “Unimodal: The SVM-based percept trace. Categorization: SVM-based” (Experiment 2). The SVM-based percept traces for difference wave B2 were taken from the decoded label-data based on the preprocessed eye-movement data (first “svm/”). The categorization (into before: int and before: seg at the inducer) was determined on the SVM-based percept trace itself (second “/svm”). Each block of preprocessed eye-movement data was decoded by the 3 SVM-models. The percept traces consisted of the states integrated, segregated, and undefined. The inducer blocks (2x) were paired, respectively, with the no-inducer blocks (2x); all starting time points of the inducers (11) per inducer block were transferred to a no-inducer block. A total of 33 inducers were analyzed per participant.

Difference wave B3 uni-visual-active-svm/beh “Unimodal: The SVM-based percept trace. Categorization: button-based” (Experiment 2). Exactly as B2, except for the categorization, which depended on the button-based percept trace (“/beh”). Difference waves B2 and B3 were relevant for Ana5 and Ana6 (see below).

Bimodal: The effect of the visual inducer on the auditory percept


Experimental session. The following analysis is based on the eight bimodal audio-visual blocks of the experimental session (4x no-/ 4x inducer).

Ana3A. Overall effect on the auditory percept (bimodal). The overall effect of the visual inducer on the aggregated measures of the auditory percept, the phase duration (s) and the percentage percept (%), was investigated in the eight bimodal no-inducer (4x)/inducer (4x) blocks of the experimental session. The exact procedure of Experiment 1 was administered.

Ana3B. Local effect on the auditory percept (bimodal). The auditory perceptual dynamics around the visual inducer were based on the button-based data of the bimodal blocks, which reflected the auditory percept (active task relevance).

Difference wave B4 bi-auditory-active-beh/beh ”Bimodal: The button-based percept trace” (Experiment 2). The button-based percept traces for B4 were taken from the preprocessed label-data of the eight bimodal blocks (first “beh/”). The categorization (into before: int and before: seg at the inducer) was determined on the button-based percept trace itself (second “/beh”). The derived percept traces consisted of the states integrated, segregated, and undefined. The inducer blocks (4x) were paired, in the order of their individual appearance, with the no-inducer blocks (4x), likewise in the order of their individual appearance; all starting time points of the inducers (11) per inducer block were transferred to a no-inducer block. In sum, a total of 44 inducers were analyzed per participant. The minimum number of epochs was set to 10: If a participant had fewer than 10 epochs in the constitution of difference wave B4, the individual contribution was excluded from difference waves B4 and B5.

Ana4. Local effect on the visual percept (bimodal). The visual perceptual dynamics around the visual inducer were based on the SVM-based decoded percept traces of the bimodal blocks, which reflect the visual percept (passive task relevance).

Difference wave B5 bi-visual-passive-svm/svm “Bimodal: The SVM-based percept trace. Categorization: SVM-based” (Experiment 2). The SVM-based percept traces for B5 were taken from the decoded label-data based on the preprocessed eye-movement data (first “svm/”). The categorization (into before: int and before: seg at the inducer) was determined on the SVM-based percept trace itself (second “/svm”). Each block of preprocessed eye-movement data was decoded by the 3 SVM-models. The percept traces consisted of the states integrated, segregated, and undefined. The inducer blocks (4x) were paired, respectively, with the no-inducer blocks (4x); all starting time points of the inducers (11) per inducer block were transferred to a no-inducer block, respectively. A total of 132 inducers per participant were investigated in the following analysis.

Bi- and unimodal: Parameter-based comparisons of the difference waves Ana5A. Prerequisites: The analysis of the parameters (latency of the maximal amplitude, maximal amplitude, and mean amplitude) required that the underlying percept traces (SVM- or button-based) and the categorization decision (SVM- or button-based) have only a negligible influence; this information was available for the unimodal visual difference waves B1, B2, and B3. Otherwise, the comparison in the next step may not be robust. For that purpose, the following difference waves were compared, for difference wave: int and difference wave: seg separately, between (Ana5A1) …
• difference wave B1 uni-visual-active-beh/beh, and
• difference wave B3 uni-visual-active-svm/beh,
and, separately (Ana5A2),
• difference wave B2 uni-visual-active-svm/svm, and
• difference wave B3 uni-visual-active-svm/beh.
In all four combinations, the temporal trajectories of the difference waves were examined regarding their statistical difference, tested on a sample-by-sample basis by a t-test against the baseline (4,001 samples, FDR-corrected, expected alpha level: 5 %).

Ana5B. Parameter-based comparisons of the difference waves B1, B2, B4, and B5.
1. Parameter 1: Latency of the maximal amplitude (“Latency”)
2. Parameter 2: Maximal amplitude (in the window 0 s to 3 s relative to inducer start)
3. Parameter 3: Mean amplitude
Parameter 3, the mean amplitude, was calculated individually for each participant in a window with a size of 501 ms/samples. The window's center position (at 251 ms relative to the window start) was defined for all participants by the maximal amplitude within the longest chain of significant FDR-corrected samples on the respective group-mean difference wave. For Parameters 1 and 2 only, the appropriate correction algorithm for the resampling method used was applied, resulting in the corrected p-value, pc: Parameters 1 and 2 were derived by a delete-1 procedure (jackknifing) from the group mean of the temporal trajectory. The jackknife resampling technique may yield a better estimate of the peak amplitude than simple averaging (Kiesel et al., 2008). The parameter comparison on Parameters 1 and 2 must be corrected for the applied resampling method, as shown by Ulrich and Miller (2001), with

tc = t / (n − 1),

with n being the number of participants entering the jackknife; the corrected p-value pc was derived from tc.
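A hedged sketch of the delete-1 (jackknife) parameter extraction with this correction, assuming individual difference waves in a participants x samples array and using scipy for the t-test (names are illustrative):

import numpy as np
from scipy import stats

def jackknife_latencies(waves, fs=1000):
    # Latency (s) of the maximal amplitude in each delete-1 grand mean;
    # waves: (n_participants, n_samples) array of individual difference waves.
    n = waves.shape[0]
    return np.array([np.argmax(np.delete(waves, i, axis=0).mean(axis=0)) / fs
                     for i in range(n)])

def corrected_t(param_a, param_b):
    # Paired t-test on jackknifed estimates, corrected with tc = t / (n - 1).
    t, _ = stats.ttest_rel(param_a, param_b)
    n = len(param_a)
    tc = t / (n - 1)
    pc = 2.0 * stats.t.sf(abs(tc), df=n - 1)
    return tc, pc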


In the following comparisons (Ana5B1, Ana5B2, and Ana5B3), the means of the parameters (latency, maximal amplitude, and mean amplitude) were compared between the difference waves: int and the difference waves: seg separately. The following three analyses relate to the expected order of the effects the inducer has on the visual and the auditory percept:
Ana5B1. “How does the task relevance (active or passive) influence the effect on the visual percept?” For this analysis, the parameters derived from
• difference wave B2 uni-visual-active-svm/svm and
• difference wave B5 bi-visual-passive-svm/svm
are compared (t-tests).
Ana5B2. “Does the inducer affect the visual and the auditory percept differently?” For this analysis, the parameters derived from
• difference wave B1 uni-visual-active-beh/beh and
• difference wave B4 bi-auditory-active-beh/beh
are compared (t-tests).
Ana5B3. “How do the effects on the visual and the auditory percept differ in the bimodal setting?” For this analysis, the parameters derived from
• difference wave B4 bi-auditory-active-beh/beh and
• difference wave B5 bi-visual-passive-svm/svm
are compared (t-tests).

Bi- and unimodal: Parameter-based correlations of the difference waves The parameters derived from the difference waves quantified the strength of the inducer effect for the respective channel (auditory, visual) and the origin of the percept trace (SVM-/button-based). Participants whose perception was interindividually more strongly influenced in one channel might experience a stronger response in the complementary channel crossmodally. Ana6A. Prerequisites: The influence of the channel the percept traces were captured from was decisive for the following correlative analysis. To assess the intraindividual consistency across parameter extractions (SVM- & button-based), the Pearson correlation coefficient was computed between each pair of parameters (SVM-/button-based x int/seg) and tested, based on a t-distribution, against the H0 of zero correlation. For that purpose, the above three parameters (latency, maximal amplitude, and mean amplitude) were derived from difference wave: int and difference wave: seg separately, and the Pearson correlation coefficient was computed between each pair (Ana6A1) of …
• difference wave B1 uni-visual-active-beh/beh, and
• difference wave B3 uni-visual-active-svm/beh (separately for int and seg).


And, separately (Ana6A2), between
• difference wave B2 uni-visual-active-svm/svm, and
• difference wave B3 uni-visual-active-svm/beh (separately for int and seg).
Ana6B1. If all pairwise correlations exceeded the significance thresholds in Ana6A, all three parameters were pair-wisely correlated between
• difference wave B4 bi-auditory-active-beh/beh and
• difference wave B5 bi-visual-passive-svm/svm (separately for int and seg).
Ana6B2. If some pairwise correlations did not exceed the significance thresholds in Ana6A, all parameters were pair-wisely correlated between
• difference wave B1 uni-visual-active-beh/beh and
• difference wave B4 bi-auditory-active-beh/beh (separately for int and seg),
and, separately, between
• difference wave B2 uni-visual-active-svm/svm and
• difference wave B5 bi-visual-passive-svm/svm (separately for int and seg).
Significant correlations regarding the latency, the maximal, and the mean amplitude support the crossmodal effects.

C4.2.4 Experiment 2: Results

Hit rate in the response verification. Only one participant fell short of the hit rate threshold (>70 %). The remaining sample (15 data-sets) included in the following analysis had an average hit rate of 86.19 % (SD = 7.07 %, Figure 4.15).

ACC. Only one participant fell short of the ACC threshold (>70 %) and was subsequently excluded from Ana1, Ana5, and Ana6. The remaining 14 data-sets showed sufficient decoding accuracy, with an average ACCno-inducer of M = 84.81 % (SD = 5.58 %, Figure 4.15).

Ana1A. The visual-auditory consistency across all eight bimodal blocks (24 VACs per participant), the VACALL, was, at 52.61 % (SD = 4.05 %), significantly higher across participants than chance (50 %), t(13) = 2.41, p = .031.

Ana1B. VACNo-Inducer (M = 51.01 %, SD = 4.33 %) and VACInducer (M = 53.2 %, SD = 4.33 %) did not differ significantly, t(13) = -1.44, p = .173 (Figure 4.15).
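For illustration, the VAC tests in Ana1A and Ana1B reduce to a one-sample and a paired t-test. The following is a minimal sketch, assuming per-participant VAC values in percent; the simulated numbers merely mimic the reported means.

```python
# Sketch of Ana1A/Ana1B (assumed data layout): per-participant VAC values,
# tested against chance (50 %) and compared between conditions.
import numpy as np
from scipy.stats import ttest_1samp, ttest_rel

rng = np.random.default_rng(1)
vac_all = rng.normal(52.6, 4.0, 14)         # illustrative VAC_ALL per participant
vac_no_inducer = rng.normal(51.0, 4.3, 14)  # illustrative condition values
vac_inducer = rng.normal(53.2, 4.3, 14)

t_all, p_all = ttest_1samp(vac_all, 50.0)              # Ana1A: VAC_ALL vs. chance
t_cmp, p_cmp = ttest_rel(vac_no_inducer, vac_inducer)  # Ana1B: condition contrast
print(f"VAC_ALL vs. 50 %: t = {t_all:.2f}, p = {p_all:.3f}")
print(f"No-inducer vs. inducer: t = {t_cmp:.2f}, p = {p_cmp:.3f}")
```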


Figure 4.15. The height of the bar plot represents the across-participants' mean for each measure. The error bar represents the standard error of the mean. Respective thresholds are dotted horizontal lines. VACALL was tested statistically against chance level (50 %); VACNo-Ind was tested against VACInd.

Unimodal: The effect of the visual inducer on the visual percept
Ana2A: Overall effect on the visual percept (unimodal). Across participants (Figure 4.16A), the percentage of the integrated percept (M = 55.33 %, SD = 7.38 %, no-inducer) did not change significantly in the inducer-blocks (M = 57.47 %, SD = 8.36 %), t(14) = -1.29, p = .217. Similarly, the percentage of the segregated percept (M = 41.84 %, SD = 7.73 %) remained numerically stable (M = 40.14 %, SD = 8.22 %), but was not tested for dependence with the integrated percept. For the percept duration across participants (mean, Figure 4.16B), the interaction (factor Percept x factor Stimulus-type) of the repeated-measures ANOVA was not significant, F(1, 14) = 0.74, p = .406; neither was the factor Stimulus-type, F(1, 14) = 0.21, p = .602. However, the factor Percept differed significantly between the levels integrated and segregated, F(1, 14) = 8.37, p = .011. The percept duration of the integrated percept (M = 4.33 s, SD = 1.82 s, no-inducer) was not significantly longer in the inducer-blocks (M = 4.36 s, SD = 1.4 s). The percept duration of the segregated percept in the no-inducer condition (M = 3.5 s, SD = 1.1 s) was not significantly shorter in the inducer condition (M = 3.23 s, SD = 0.61 s).
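The 2 x 2 repeated-measures ANOVA used for the percept durations can be sketched as follows; the long-format data layout and the simulated durations are assumptions for illustration only.

```python
# Sketch of the 2 x 2 repeated-measures ANOVA (factors Percept x Stimulus-type)
# on per-participant median percept durations, using statsmodels' AnovaRM.
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

rng = np.random.default_rng(2)
rows = []
for subject in range(15):
    for percept in ("integrated", "segregated"):
        for stim in ("no-inducer", "inducer"):
            base = 4.3 if percept == "integrated" else 3.4  # illustrative means
            rows.append({"subject": subject, "percept": percept,
                         "stimulus_type": stim,
                         "duration": base + rng.normal(0, 1.0)})
df = pd.DataFrame(rows)  # one observation per subject and cell (balanced design)

res = AnovaRM(df, depvar="duration", subject="subject",
              within=["percept", "stimulus_type"]).fit()
print(res.anova_table)   # F and p for both main effects and the interaction
```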



Figure 4.16. Percentage percept (A) and phase duration (B) in the no-/inducer blocks; for integrated (red) and for segregated (blue) percept separately. Error bars represent the standard error of the mean across the participants.

Ana2B: Local effect on the visual percept (unimodal). A low noise level was maintained by removing participant data with less than five epochs constituting the individual contribution (int and seg) to the difference wave B1 (Table 4.2): On this basis, the individual contribution of one participant was excluded from difference waves B1, B2, and B3.


         A: Inducer          B: No-inducer
Before:  int   seg   und.  |  int   seg   und.
          14     8     0   |   11    11     0
          13     6     3   |   17     4     1
          10    11     1   |   15     6     1
          12    10     0   |   14     8     0
          11    11     0   |   14     8     0
          12    10     0   |   10    11     1
           7    14     1   |    8    12     2
          13     9     0   |    8    11     3
           9    13     0   |   10    12     0
           9    13     0   |   10    12     0
          14     8     0   |    9    13     0
           9    12     1   |   11    11     0
          14     8     0   |   13     8     1
          16     6     0   |   12    10     0
          14     8     0   |   12    10     0

Table 4.2. The individually observed number of epochs that have been categorized into the three categories (int/seg/undefined) of difference wave B1 with (A) and without (B) inducer. Each row represents one participant.

The remaining data-sets of 14 participants constituted the difference waves. All three difference waves B1, B2, and B3 were obtained in the unimodal visual setting. The difference wave B1 uni-visual-active-beh/beh represents the visual percept, which is manually reported, while the visual percept is simultaneously read out from the eye-movements (SVM-based). The SVM-based percept traces were categorized once on the SVM-based percept traces themselves (difference wave B2 uni-visual-active-svm/svm) and once on the button-based percept trace (difference wave B3 uni-visual-active-svm/beh). The longest chain of significant samples (FDR-corrected) for all three difference waves started just after the end of the visual inducer and, in most cases, lasted (at least) to 3 s after inducer start. Numerically, in all three unimodal difference waves: seg, the significant chain starts earlier than in the respective difference wave: int. The perceptual dynamic expressed in the temporal trajectory is similar across all six difference waves: In each case, the difference wave fluctuates at baseline level (isolated significant samples), peaks after the inducer end (start of the longest chain of consecutive significant samples), and declines back towards the baseline. The FDR-corrected longest chain of significant samples in difference wave: int B1 started at 1,130 ms and ended at 2,786 ms relative to the inducer start, p < .024 (Figure 4.17A). The longest significant chain of samples in difference wave: seg B1 started at 563 ms, p < .032, and stopped at 3,000 ms (potentially lasting longer) relative to the inducer start (Figure 4.17B).
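The chain statistic can be illustrated with a minimal sketch: per-sample t-tests against zero across participants, Benjamini-Hochberg FDR-correction, and a scan for the longest run of surviving samples. The simulated difference waves are placeholders, not the recorded data.

```python
# Sketch of the sample-wise significance test used for the difference waves.
import numpy as np
from scipy.stats import ttest_1samp
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(3)
n_participants, n_samples = 14, 4001           # -1 s to +3 s at 1,000 Hz, assumed
waves = rng.normal(0, 10, (n_participants, n_samples))
waves[:, 1500:2800] += 20                      # an induced deflection, illustrative

_, p_values = ttest_1samp(waves, 0.0, axis=0)  # one t-test per sample
significant = multipletests(p_values, alpha=0.05, method="fdr_bh")[0]

# Longest run of consecutive FDR-surviving samples
best_start, best_len, run_start = 0, 0, None
for i, sig in enumerate(np.append(significant, False)):  # trailing False closes runs
    if sig and run_start is None:
        run_start = i
    elif not sig and run_start is not None:
        if i - run_start > best_len:
            best_start, best_len = run_start, i - run_start
        run_start = None
print(f"Longest chain: samples {best_start}..{best_start + best_len - 1}")
```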

Figure 4.17. Difference wave B1. The group-mean (white) is embedded in a colored tunnel representing the standard error of the mean (either red: int or blue: seg). The inducer position and duration are marked in grey. Samples surviving the FDR-correction are denoted as a thick black bar on top of the plots. The FDR-threshold (pFDR) is individually displayed (surviving p-values are smaller). If no sample falls below the FDR-corrected threshold, the smallest p-value is denoted. The longest chain of significant samples is colored green (start- and endpoint relative to inducer start and chain-duration are noted).

In Ana5, percept traces were compared, and parameters were derived from the introduced unimodal visual difference waves B1, B2, and B3, each based on only 13 data-sets (too low ACC of one participant). The longest significant chain of samples in difference wave: int B2 started at 1,483 ms and ended at 2,853 ms relative to the inducer start, p < .017 (Figure 4.18A). In difference wave: seg B2, the chain started at 792 ms and ended at 3,000 ms relative to the inducer start, p < .026 (Figure 4.18B).


Figure 4.18. Difference wave B2. For details, see Figure 4.17.

The longest significant chain of samples in difference wave: int B3 started at 1,357 ms and ended at 3,000 ms relative to the inducer start, p < .022 (Figure 4.19A). In difference wave: seg B3 the chain started at 898 ms, p < .025, and ended at 3,000 ms relative to the inducer start (Figure 4.19B).

Figure 4.19. Difference wave B3. For details, see Figure 4.17.

Ana3A: Overall effect on the auditory percept (bimodal). For the percentage of the percept (int/seg), the exclusive button-press duration was divided by the block duration and averaged (mean) over blocks within-participant in the respective condition (no-/inducer). Numerically, across participants (Figure 4.20A), the percentage of the integrated percept (M = 40.53 %, SD = 10.18 %, no-inducer) was lowered in the inducer-blocks (M = 40 %, SD = 9.36 %). Conversely, the percentage of the segregated percept (M = 54.29 %, SD = 9.70 %) was numerically elevated (M = 55.22 %, SD = 9.48 %). The difference was not significant, t(14) = 0.51, p = .615. For the percept duration, the exclusive button events were collected separately for the integrated and segregated percept from the bimodal blocks and averaged within-block (median) and thereafter within-participant (mean). Numerically, across participants (mean, Figure 4.20B), the percept duration of the integrated percept (no-inducer) was prolonged in the inducer-blocks, whereas the percept duration of the segregated percept was numerically shortened. The interaction (factor Percept x factor Stimulus-type) of the repeated-measures ANOVA was not statistically significant, F(1, 14) = 0.17, p = .682, nor was the factor Stimulus-type, F(1, 14) = 0.01, p = .918. However, the factor Percept differed significantly between the levels int and seg, F(1, 14) = 5.26, p = .038.
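A minimal sketch of how the two overall measures could be derived from exclusive button events; the event list and block duration below are illustrative assumptions, not recorded data.

```python
# Sketch of the overall measures: percentage percept is the exclusive press time
# per percept divided by the block duration; phase duration is the within-block
# median of the exclusive press durations.
import numpy as np

block_duration = 240.0  # block length in s, illustrative
# (start, end, percept) of exclusive (non-overlapping) presses, illustrative
presses = [(2.1, 6.4, "int"), (6.9, 10.2, "seg"), (11.0, 16.5, "int"),
           (17.3, 20.1, "seg"), (21.0, 25.7, "int")]

for percept in ("int", "seg"):
    durations = [end - start for start, end, p in presses if p == percept]
    percentage = 100.0 * sum(durations) / block_duration
    print(f"{percept}: {percentage:.1f} % of block, "
          f"median phase duration {np.median(durations):.2f} s")
```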


Figure 4.20. Percentage percept (A) and phase duration (B) in the no-/inducer blocks; for integrated (red) and for segregated (blue) percept separately. Error bars represent the standard error of the mean across the participants.

Ana3B: Local effect on the auditory percept (bimodal). The low noise level was, as in Ana2B, maintained by removing participant data with less than 10 epochs constituting the individual contribution to the difference wave B4 (int/seg). Because they fell below this criterion, the individual contributions (averaged percept traces) of two participants were excluded from difference waves B4 and B5 (Table 4.3).


         A: Inducer          B: No-inducer
Before:  int   seg   und.  |  int   seg   und.
          18    25     1   |   17    27     0
          10    33     1   |   19    24     1
          11    33     0   |    6    35     3
          22    22     0   |   31    13     0
          19    25     0   |   19    25     0
          20    24     0   |   22    21     1
          21    23     0   |   16    27     1
          16    18    10   |   22    14     8
          15    27     2   |   21    21     2
          22    22     0   |   17    26     1
          16    28     0   |   13    31     0
          17    27     0   |   21    21     2
          32    10     2   |   28    15     1
          19    25     0   |   17    27     0
           5    39     0   |    5    38     1

Table 4.3. The individually observed number of epochs that have been categorized into the three categories (int/seg/undefined) of difference wave B4 with (A) and without (B) inducer. Each row represents one participant.

The remaining data-sets of 13 participants constituted the difference waves B4 (Figure 4.21). The difference waves B4 and B5 were obtained in the bimodal setting. The difference wave B4 represents the auditory percept, which is manually reported (button-based), while in parallel, the visual percept is read out from the eye-movements (difference wave B5, SVM-based). In difference wave B4, no sample reached significance (FDR-corrected), neither for int nor for seg (Figure 4.21). The shape of difference wave B4 remains flat compared to difference waves B1-B3 and B5: A distinct peak is absent in difference wave B4. The mean amplitude could not be extracted from difference wave B4 because the procedure relies on at least one significant sample (FDR-corrected) in the percept trace. With the main effect absent, none of the three parameters characterizing the time course of the difference wave could be derived from B4. For this reason, Ana6, which in large parts would have further corroborated the proposed theoretical mechanism, was omitted. Ana5 was only partially executed: Ana5A and Ana5B1 are reported.



Figure 4.21. Difference wave B4. For details, see Figure 4.17. Additionally, the inducer (solid black line) and no-inducer (dashed black line) percept traces are depicted.

Ana4: Local effect on the visual percept (bimodal). The difference wave B5 showed similarity to difference waves B1-B3 (Figure 4.22). The longest chain of significant samples (FDR-corrected) started earlier in seg (424 ms after inducer start) than in int (836 ms). Both percept traces peaked between 1 s and 2 s after inducer start and declined towards baseline: The longest significant chain stopped at 2,661 ms for difference wave: int and at 2,967 ms for difference wave: seg. All parameters could be derived from difference wave B5, which was based on 12 data-sets (too low ACC of one participant).

Figure 4.22. Difference wave B5. For details, see Figure 4.17.


Bi- and unimodal: Parameter-based comparison of the difference waves
First, the difference waves of the unimodal visual setting were inspected for differences caused by the observation method (SVM- vs. button-based, across difference waves B1-B3). In the remaining analyses, it was investigated how the task relevance (active/passive) influences the SVM-based visual percept trace (bi- vs. unimodal).

Ana5A1. Prerequisites: Comparison of the temporal trajectories (difference waves B1 and B3). Difference wave: int B1 was compared millisecond-wise with difference wave: int B3 (Figure 4.23A), and difference wave: seg B1 was compared millisecond-wise with difference wave: seg B3 (Figure 4.23B). In both cases, no sample reached the FDR-corrected p-threshold (all p > .006 for int, and all p > .001 for seg).

Legend: dotted line: difference wave B1; dashed line: difference wave B3.

Figure 4.23. Comparison of the temporal trajectories of B1 and B3. For an individual inspection, Figure 4.17 and 4.19 are recommended. No sample (of 4,001) survives the FDR-correction: Only the smallest p-value is reported (which is above the FDR-corrected threshold). Left/red: difference wave: int (A); Right/blue: difference wave: seg (B).

Ana5A2. Prerequisites: Comparison of the temporal trajectories (B2 and B3). The same held for the second comparison (Figure 4.24): Neither in difference wave: int nor in difference wave: seg (comparing B2 and B3) could a single significant sample (FDR-corrected) be found (all p > .1 for int, and all p > .114 for seg).
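The millisecond-wise trajectory comparisons of Ana5A can be sketched as per-sample paired t-tests with the same FDR-correction as before; the simulated waves merely stand in for the compared difference waves (e.g., B1 and B3).

```python
# Sketch of the millisecond-wise comparison between two difference waves:
# one paired t-test per sample across participants, FDR-corrected.
import numpy as np
from scipy.stats import ttest_rel
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(7)
wave_a = rng.normal(0, 10, (13, 4001))            # illustrative wave (e.g., B1)
wave_b = wave_a + rng.normal(0, 5, (13, 4001))    # similar trajectory (e.g., B3)

_, p = ttest_rel(wave_a, wave_b, axis=0)          # one p-value per sample
reject = multipletests(p, alpha=0.05, method="fdr_bh")[0]
print(f"FDR-surviving samples: {reject.sum()} of {reject.size}")
```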


Legend: dotted line: difference wave B2; dashed line: difference wave B3.

Figure 4.24. Comparison of the temporal trajectories of B2 and B3. For an individual inspection, Figure 4.18 and 4.19 are recommended. No sample (of 4,001) survives the FDR-correction: Only the smallest p-value is reported (which is above the FDR-corrected threshold). Left/red: difference wave: int (A); Right/blue: difference wave: seg (B).

Ana5B1. Influence of the task relevance (active/passive) on the visual percept: In this analysis, the means of Parameters 1-3 were compared (Figure 4.25). Parameters 1-3 were derived from difference waves in which the task relevance was either active (difference wave B2 uni-visual-active-svm/svm) or passive (difference wave B5 bi-visual-passive-svm/svm). Participants falling below the minimum number of epochs constituting the individual contribution to the difference waves (5 for difference wave B1 and 10 for difference wave B4; Tables 4.2 and 4.3) were mutually discarded from the parameter data investigated here (N = 11 remain). No significant differences emerged in any of the tested parameters (difference waves B2 and B5): In detail, for the integrated state at inducer (difference wave: int; Figure 4.25A, C, E), the maximal amplitude, M = 25.66 vs. M = 20.94, t(10) = 0.61, pc = .555, the latency, M = 1920.55 vs. M = 1882.55, t(10) = 0.11, pc = .915, and the mean amplitude, M = 22.97 vs. M = 21.86, t(10) = 0.12, pc = .904, exhibited no significant differences. For the segregated state at inducer (difference wave: seg; Figure 4.25B, D, F), the maximal amplitude, M = 43.35 vs. M = 40.38, t(10) = 0.55, pc = .593, the latency, M = 1691.45 vs. M = 1503, t(10) = 0.63, pc = .54, and the mean amplitude, M = 46.04 vs. M = 40.25, t(10) = 1.02, pc = .331, likewise exhibited no significant differences.
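A sketch of the parameter comparison in Ana5B1, assuming one value per participant and parameter for the active (B2) and passive (B5) waves; the simulated values only mimic the reported magnitudes and are not the actual data.

```python
# Sketch of Ana5B1: paired t-tests on the three parameters between the active
# (B2) and passive (B5) SVM-based difference waves, after mutually discarding
# participants excluded from either wave.
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(4)
n = 11  # participants present in both B2 and B5 after the epoch-count exclusions
for name, loc in (("max amplitude", 25.0), ("latency", 1900.0),
                  ("mean amplitude", 22.0)):
    b2 = rng.normal(loc, loc * 0.2, n)        # active values, illustrative
    b5 = b2 + rng.normal(0, loc * 0.1, n)     # passive values, correlated
    t, p = ttest_rel(b2, b5)
    print(f"{name}: t({n - 1}) = {t:.2f}, p = {p:.3f}")
```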


Figure 4.25. Comparison between the group means of the parameters derived from difference waves B2 and B5 (left column [A & B]: maximal amplitude, middle column [C & D]: latency of maximal amplitude, right column [E & F]: mean amplitude). The comparison is split for difference wave: int (red, first row) and difference wave: seg (blue, second row). The white horizontal line represents the across-participants mean, and the thin black vertical lines the standard error of the mean.

Ana5B2, Ana5B3, and Ana6. The parameter-based comparative analyses Ana5B2 and Ana5B3 and the parameter-based correlative analyses in Ana6 were necessarily omitted: All of these tests rely on parameters derived from difference wave B4 (Figure 4.21). Since difference wave B4 did not exhibit a single significant sample (no FDR-corrected survivors, all p > .001 for int, all p > .058 for seg) in its percept trace, the planned tests could not be conducted.

C4.3 Experiment 1 & 2: Discussion

In Experiment 1 (unimodal visual), we developed an analysis to detect the visual inducer effect on the visual perceptual response. When the analysis and the visual inducer were transferred to Experiment 2, a similar peak in the probability of perceiving the integrated percept after the inducer's appearance was replicated in the visual perceptual response (uni- and bimodal setting), but not in a simultaneous auditory perceptual response (bimodal setting).

However, the general audio-visual crossmodal coupling on a millisecond-wise level was present (hypothesis 1A), in line with the earlier findings highlighting the crossmodal interplay of concurrent bistable processes mediated by a global mechanism (Study 2). The inducer was developed as an instrument to evaluate the strength and directionality of the crossmodal coupling. The general crossmodal coupling itself was unaffected by the presence or absence of the visual inducer (hypothesis 1B). The general pattern of visual dominance (overall effect) remained unaffected by the visual inducer in our data (hypothesis 2A). Both findings, the general coupling and the visual dominance, facilitate the interpretation of the local effect that the inducer evoked on the visual percept (hypothesis 2B). We found susceptibility of the visual percept (unimodal) to the visual inducer in both percepts (integrated as well as segregated). The visual inducer evoked exogenous perceptual switches in the visual modality in unimodal (visual) and bimodal (audio-visual, hypothesis 4) settings. Notably, the exogenous perceptual switches were triggered against a background of interspersed endogenous perceptual switches. Different channels (SVM- or button-based) tracking the visual percept produced similar, numerically congruent time courses of the perceptual dynamics at the inducer. The categorization of the SVM-based percept trace on the button-based percept traces did not yield any substantial differences. Taken together, the tracking methodology for the visual percept is appropriate because the read-out percept (SVM-based) is in accordance with the overtly reported percept (button-based) of the participants (hypothesis 5A). Vice versa, these findings provide evidence for the participants' ability to report their exogenously biased percepts correctly and on time. In sum, both findings reciprocally validate the coherence between the directly reported subjective perception and the indirect read-out of the percept. This insight might be transferable to endogenous perceptual switches as well. The task relevance (active/passive) of the visual percept might not systematically influence the effect on the visual percept; the time courses share a numerically similarly shaped trajectory, and the parameters show, indeed, no significant difference when compared between active and passive attendance. Taken together, these findings suggest that the visual perceptual response might be independent of the task relevance of the visual percept (hypothesis 5). The later parts of hypothesis 5 would have considered the effect on the auditory percept in the bimodal setting but remained untested (as did hypothesis 6) because the perceptual response on the auditory percept of the bimodal setting was absent. In the bimodal setting, the evoked visual perceptual switch failed to translate into a similarly shaped effect in the auditory modality. The effect is coherently absent in the integrated and segregated percept. Therefore, hypothesis 3B is rejected. Consequently, the visual inducer would not affect the general auditory perceptual dominance (hypothesis 3A). The auditory perceptual dynamic at the visual inducer showed no response.

Initially, the scope of Experiment 2 was to reveal the specific local coupling at fixed time points, where a perceptual state is externally induced, to allow conclusive statements on the directionality of the shared mechanism and on the strength of the coupling. In the following section, we discuss possible reasons for the absence of the local coupling: Either the actual properties of the shared mechanism limit the coupling on the local level, or our experimental design is incapable of uncovering an existent crossmodal coupling. First, we assume that the observed coupling of the applied bistable processes is reasonably stereotypical. In such a case, the universal properties of the shared perceptual mechanism (directionality and strength) would prohibit the detection. We also elaborate on how the data relate to the competing models of global or distributed mechanisms for the control of multistable processes in general. Second, due to the selection of the stimulus pair and the individual adaptation of only the auditory stimulus parameter, the applied bistable processes may only be weakly coupled, which would, in any case, prevent us from finding evidence for an existing shared mechanism on a local level.

The shared crossmodal mechanism
This experiment could not accumulate evidence for a bidirectional or visually-led shared perceptual mechanism governing two simultaneous bistable processes. Still, the mechanism could be auditorily-led, which we did not investigate here. Potentially, the shared mechanism is, in fact, neither bidirectional nor visually-led. The literature likewise reports, in certain cases, "visual dominance" in experiments in which visual capture materializes, e.g., the ventriloquism effect (Alais & Burr, 2004). Besides the general flexibility of the perceptual system in resolving audio-visual conflicts, bimodal experiments have shown that the visual modality can, in some cases, dominate the bimodal perception, uncoupling from an ongoing auditory task (Colavita, 1974; Spence et al., 2013). On the contrary, "auditory dominance" can be found in audio-visual experiments on synesthetes in the double-flash illusion paradigm (Neufeld et al., 2012; Shams et al., 2002). Most likely, neither of the two modalities completely dominates perception at all times. Instead, the task relevance of the respective modality attracts attentional resources more strongly to one modality (task-relevant vs. task-irrelevant). Instead of splitting attentional resources equally between both modalities (which we did not instruct), one task may dominate awareness. Our results cannot be exclusively explained by a momentary decoupling of the percepts at the visual inducer. To be consistent with the observed general coupling (VAC), the percepts must re-couple between most inducers. The absence of an auditory response and the presence of a simultaneous visual response could imply that the induced perceptual switch might cue selective attention on the visual percept.

At this stage, one can speculate that the sensitivity and attendance to a factually occurring switch in the auditory percept might have been reduced.

Assuming that a bidirectional mechanism contributes to perceptual multistability across different sensory modalities, the absence of the effect in our data might signal that the coupling coefficient of the bistable processes is weak (or slow) and cannot be detected with the proposed analysis. The bimodal resonance capturing the synchronicity of the two oscillating bistable processes might only develop on a longer time-scale (as captured in the VAC). The visual inducer might evoke a switching between the percepts in the processed auditory percept as well, which is merely not reported. Instead, the observer "waits" until consistency between both percepts is reached again (the average phase duration is only ca. 3.5 s) and adopts the reported perceptual state only when the consistency between the auditory and visual percept is restored. This challenge could be met with a stimulus set that evokes much longer phase durations (e.g., by slowing down the motion in the moving plaids stimulus, and similarly for the auditory stimulus).

Alternatively, one might speculate that the mechanism linking the oscillatory bistable processes crossmodally is susceptible to decoupling if the sensory information distinctively changes in one modality (disambiguated by the inducer). Selective attention, triggered by the inducer, might guide all attentional resources to only the visual percept, neglecting the auditory percept. The inhibited (unconscious) perceptual state of the auditory stimulus is no longer processed. Only the currently dominating and reported perceptual state "survives". Potentially, the mismatching sensory information is classified as noise, which in turn decouples the percepts' co-activation (J. Xu et al., 2014). Because the (visual) sensory information change is short, the bistable (visual) process can be re-initiated but needs some time to "swing" back into the common resonance scheme with the auditory bistable process.

Another compatible model is a non-interactive mediating structure: The cross-modal coupling would be of an emitting but not mediating (and receiving) character. In such a case, the shared mechanism would not receive information on the perceptual state the multistable processes are in at that moment. Still, the mechanism would send "updating requests" (switch "to int" or "to seg") to the multistable processes to "renew" the momentary perceptual state (Leopold & Logothetis, 1999). Theories of the internal clock or central pacemaker can explain why the undisturbed multistable processes oscillate coherently (M. Treisman et al., 1990). A fully distributed mechanism that would result in the independence of the multistable processes is ruled out because a general coupling was found repeatedly and maintained in the VAC.


In the experimental design, we decided for the inducer, which exogenously disrupts the current bistable coupling before it returns after some unknown latency. The decision in favor of the inducer was supported by the distinctive effect on the visual percept in Experiment 1 and by the benefit for the analysis of being equipped with external reference points for inspecting the perceptual dynamic during the block. To accommodate a potentially fragile coupling, we might inspect only the no-inducer blocks at times of endogenous auditory perceptual switches: In an ex-post analysis (see Supplement U1) of the auditory perceptual switches (endogenous, button-based) during the bistable no-inducer condition, a non-significant, but looming, peak in the visual percept (SVM-based) was detected (Figure 4.U1A, p > .008, and B, p > .125). Such endogenous perceptual switches are different from the exogenously induced perceptual switches but might hint that the coupling of undisturbed bistable processes is actually possible. The requisite might be a stimulation in which both modalities are in a continuous, uninterrupted multistable state. In contrast, here, the perceptual bistability was continuously interrupted by new sensory information, evoking exogenous perceptual switches.

Methodological fallacies
However, there might be attention-unrelated factors connected to the specific stimulus configuration that prevented robust crossmodal coupling between the percepts down to the expected local level: Numerically, the percentage percept (int/seg) is nearly inverted between the unimodal visual (int & no-inducer: M = 53.33 %, SD = 7.38 %; seg & no-inducer: M = 41.84 %, SD = 7.73 %; Figure 4.16A) and the bimodal-auditory percepts (int & no-inducer: M = 40.53 %, SD = 10.18 %; seg & no-inducer: M = 54.29 %, SD = 9.7 %; Figure 4.20A), with a similar pattern in the inducer condition. Numerically, only the phase duration of the integrated percept remains crossmodally somewhat stable: unimodal visual (M = 4.33 s, SD = 1.82 s; Figure 4.16) and bimodal-auditory (M = 5.08 s, SD = 3.47 s; Figure 4.20), independent of the appearance of an inducer (no-/inducer with a similar pattern). On the contrary, the phase duration of the segregated percept inflated, numerically, from the unimodal visual (M = 3.5 s, SD = 1.1 s; Figure 4.16) to the bimodal-auditory (M = 7.42 s, SD = 5.94 s; Figure 4.20) condition, again independent of the appearance of an inducer (no-/inducer with a similar, inflated, pattern). This observation might hint that the coupling is hindered by the unequal, idiosyncratic dominance of the percepts crossmodally. Even though the auditory stimulus was individually adapted (unimodally) so that the percentage percept would reach a level of 50 % integrated and 50 % segregated percept, this did not materialize in the bimodal or in the unimodal auditory setting (see Supplement U2). The un-adapted visual stimulus (in this study, the opening angle was not individually tuned to produce 50 % integrated and 50 % segregated percept, as it was in Study 2) did, as well, not level off at 50 % integrated and 50 % segregated (Figure 4.16).

The asymmetrical distribution persists robustly in single (unimodal) and parallel (bimodal) stimulation for the auditory percept (Figure 4.U2A & B). The levels of percentage percept and phase duration have been reproduced several times, in Experiment 1, Experiment 2, and the adaptive and experimental sessions, and are numerically coherent. The overall effects (percentage percept and phase duration) for the visual percept have never been captured in a bimodal setting because the SVM-based percept trace fluctuations would only allow dominance measures based on arbitrarily set thresholds. The inversion of the asymmetrical distributions of the overall measures from visual to auditory perception can be found in both settings: in the unimodal visual data of the adaptive (Figure 4.16) and experimental session (Figure 4.U2A, C), which are numerically coherent. In the unimodal auditory data (Figure 4.U2B, D) and in the bimodal-auditory data, both stemming from the experimental session (Figure 4.20), the overall effects are numerically coherent as well. The inversion can also be found when comparing the unimodal auditory and visual percentage and phase duration. This may explain the lowered probability of coupling of the percepts, because the perceptual dominance gapes crossmodally. Alternatively, the growing proportion of the segregated percept can be the result of the sequence of presentation (first unimodal, second bimodal) and the successively prolonged stimulation. Still, the asymmetrical distribution of dominance times is not only limiting the general coupling (just above chance level: M = 52.61 %, SD = 4.05 %) but also inhibits gradually, with growing asymmetry, the detection of local effects, which depend directly on the general consistency. One possible solution is the replacement of one or both of the paired stimuli with a "better" fitting alternative. Another is the introduction of an adaptation procedure for the visual stimulus as well (i.e., the opening angle), which was insensitive in Study 2. Otherwise, the auditory perceptual dominance can be adapted to meet the individually observed visual perceptual dominance. It proved challenging to control the auditory perceptual dominance in a stable, enduring manner between two days. But it might be promising to adapt the auditory stimulus (frequency difference between tones A and B, see procedure in Study 2) incrementally between the blocks of the experimental session to the visual dominance pattern. Both approaches' outcome is similar: The improved fit of the visual dominance to the auditory dominance, and vice versa, might better facilitate the coupling and allow fair testing.

Another impeding factor may be an insufficient sample size. If the coupling is rather weak and the forwarding effect on the auditory percept per se brittle, the small sample size (N = 17) could have played a role in the absence of the effect. The original sample size of 24 was reduced due to the SARS-CoV-2 pandemic for this dissertation and will be completed at a later time point for a scientific publication. If the coupling effect is neither bidirectional nor visually-led, increasing the sample size will not contribute to detecting an effect in the experiment at hand.

In light of the available data (flat-line results in Figure 4.21), it seems less likely that an increase in sample size will reveal the prospected effect.

We conducted two experiments: First, we demonstrated that a certain visual percept can be evoked externally in an ambiguous context. The induced percept remains dominant after the external manipulation has disappeared. Second, we replicated earlier findings on the general coupling of the auditory and visual percepts but did not find evidence that the coupling is either bidirectional or visually-led. For a future study, we recommend cross-checking the mechanism in an auditory-led paradigm by inducing a certain percept in the auditory modality. If, as here, the induced effect is not forwarded to the visual percept, that might speak for a rather fragile coupling. The coupling of the two oscillatory bistable processes might function on a slower time-scale, incapable of forwarding exogenously evoked perceptual switches from one modality to the other. Potentially, attentional processes impede such links. Also, the cross-modal coupling of purely endogenous perceptual switches should be investigated further. Another methodological challenge that needs to be addressed when pairing the ambiguous figures presented here is the asymmetry in the relative dominance distribution of the percepts between the modalities. For optimal general coupling and fair testing of local effects, the stimulus parameters (opening angle for the visual stimulus and frequency difference for the auditory stimulus) should be adapted in both modalities in a way that results in similar dominance times for both percepts: integrated and segregated.


C4.4 Supplement Study 3

C4.4.1 Supplement U1: Local coupling of the percepts at endogenous auditory perceptual switches
This analysis was designed after seeing all results of Experiments 1 and 2.

The analysis U1 is based on the multimodal blocks without inducers. As for the difference waves, the visual percept traces (SVM-based) are extracted. Instead of the inducers, the button-presses (reported auditory perceptual switches) serve as reference points. Contrary to the difference waves, no baseline correction was applied here. Within-participants, at all auditory perceptual switches "to int" and "to seg", the visual percept trace is extracted from -1 s to +3 s relative to the button-press. The obtained epochs are averaged along the time dimension within-participant and thereafter across-participants (N = 14, too low ACC of one participant). Because of the generally "weak" coupling, the final visual percept trace fluctuates close to the chance level of 50 %. In both cases (Figure 4.U1A and B), the averaged percept trace is indistinguishable from chance level, all p > .008 (Figure 4.U1A, FDR-corrected) and all p > .125 (Figure 4.U1B, FDR-corrected). Numerically, at both auditory perceptual switches (to int/to seg), a subtle peak of the visual percept in the direction of the respective auditory perceptual switch is observed.
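The epoching in U1 can be sketched as follows, assuming a continuous percept trace sampled at 1,000 Hz; the sampling rate, the trace, and the switch times below are illustrative assumptions, not the recorded data.

```python
# Sketch of the U1 epoching: cut the SVM-based visual percept trace from -1 s
# to +3 s around every reported auditory switch, then average over epochs.
import numpy as np

fs = 1000                                   # sampling rate in Hz, assumed
rng = np.random.default_rng(5)
trace = rng.uniform(0, 1, 240 * fs)         # one block of visual percept probability
switch_samples = [30_000, 95_000, 170_000]  # button-press times in samples, illustrative

epochs = [trace[s - fs : s + 3 * fs]        # -1 s to +3 s relative to the press
          for s in switch_samples
          if s - fs >= 0 and s + 3 * fs <= trace.size]
mean_trace = np.mean(epochs, axis=0)        # within-participant average; no baseline
print(mean_trace.shape)                     # (4000,) samples spanning -1 s .. +3 s
```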

Figure 4.U1. Visual percept traces at the time of the auditory perceptual switches. A) perceptual switches to int B) perceptual switches to seg. Samples surviving the FDR-correction are denoted as a thick black bar on top of the plots. The FDR-threshold (pFDR) is displayed individually (surviving p-values are smaller). If no sample falls below the FDR-corrected threshold, the smallest p-value is denoted.


C4.4.2 Supplement U2: Overall measures of perceptual dominance in the unimodal conditions (auditory/visual)
As in Experiments 1 and 2, the overall measures (percentage percept & phase duration) were computed within-block (median for the phase durations) and thereafter across-participants (mean, N = 15). This analysis incorporates only data from the experimental session: two unimodal auditory blocks (either blocks 16 & 18 or blocks 17 & 27, depending on the participant number, without response verification) and three unimodal visual blocks (either blocks 17, 22 & 27 or blocks 18, 22 & 28, depending on the individual participant number). For each unimodal block, the percentage (relative to the respective block duration) and the median phase duration are measured separately for the integrated and the segregated percept. Only exclusive percept times are considered; i.e., overlapping percepts are discarded. As a result, we preserve four values per block: the percentage percept and the median phase duration of each, the integrated and the segregated percept. We report the mean, standard error of the mean, and standard deviation across participants for all four values, separately for the percentage percept and the phase duration. Finally, a two-factor, two-level repeated-measures ANOVA was computed on the factors Percept (integrated, segregated) and Modality (auditory, visual), with follow-up pairwise, two-tailed t-tests where appropriate. The interaction (factor Percept x factor Modality) of the repeated-measures ANOVA was statistically significant, F(1, 14) = 30.06, p < .0001. The factor Modality was also statistically significant, F(1, 14) = 10.68, p = .006, whereas the factor Percept was not, F(1, 14) = 0.123, p = .73. Follow-up paired two-sided t-tests on the percepts within each modality revealed a significant difference between percepts in vision, t(14) = 3.36, p = .004, and in audition, t(14) = -3.45, p = .003.

Figure 4.U2. Overall measures of dominance in unimodal bistability (experimental session). The properties (percentage percept, %, and phase duration, s) of the percepts (int/seg) in the conditions visual (A & C) and auditory (B & D). The height of the bar represents the mean across-participants and the error bar the standard error of the mean.


C4.4.3 Supplement U3: Video of the visual stimulus material This supplement can be retrieved from: https://www.tu-chemnitz.de/physik/PHKP/multistable/DissertationGrenzebach/index.html.

U3A. Ambiguous segment, duration 11 s. URL: https://www.tu-chemnitz.de/physik/PHKP/multistable/DissertationGrenzebach/Studie3/U3A.m4v File Format: m4v Size: 12.8 MB

U3B. Ambiguous with 1 inducer segment, duration 11 s. URL: https://www.tu-chemnitz.de/physik/PHKP/multistable/DissertationGrenzebach/Studie3/U3B.m4v File Format: m4v Size: 12.8 MB

C4.4.4 Supplement U4: Soundtrack of the auditory stimulus material This supplement can be retrieved from: https://www.tu-chemnitz.de/physik/PHKP/multistable/DissertationGrenzebach/index.html.

U4A. Ambiguous segment, duration 11 s, parameter: 7 semitones URL: https://www.tu-chemnitz.de/physik/PHKP/multistable/DissertationGrenzebach/Studie3/U4A.wav File Format: wav Size: 1 MB

U4B. Response verification (int-seg-int-seg, four segments each 6.5 s), duration 26 s, parameter: 7 semitones URL: https://www.tu-chemnitz.de/physik/PHKP/multistable/DissertationGrenzebach/Studie3/U4B.wav File Format: wav Size: 2.5 MB


CHAPTER 5 General Discussion

C5 General Discussion
The goal of this Thesis was to identify empirical evidence for local and global factors contributing to the emergence of perceptual states. These factors and their interaction are thought to govern spontaneous alternations in the perception of ambiguous stimuli. It has been observed that the perceptual switching behavior (e.g., number of switches, average dominance duration, percentages) in auditory and visual multistability correlates under certain conditions (Denham et al., 2018; Einhäuser et al., 2020; Kondo et al., 2012) and does not correlate under others (Hupé et al., 2008; Pressnitzer & Hupé, 2006). The contrariness of these and other findings raised the question of whether perceptual multistability is controlled by a distributed mechanism (modality-specific and local), a common mechanism (modality-independent and global), or some mixture of both: a hybrid mechanism (conversing stages of global and distributed processing). Furthermore, it was an aim of this Thesis to examine the behavioral similarities in the switching pattern of one pair of ambiguous auditory and visual stimuli. Potentially, functionally similar principles in separate mechanisms are, in fact, coupled crossmodally in a single process. For that purpose, we did not rely solely on the direct manual report of the current percept monitored through vigilant introspection but utilized two reflexive, brainstem-connected physiological markers in the eye, namely the pupil dilation and the OKN. Such markers allow the indirect tracking of perceptual transitions (pupil dilation) and of the momentary perceptual states (OKN). The OKN validated the reported perceptual sequence and enabled us to concurrently capture the bistable perception in a second modality, audition, via manual report. An attentional bottleneck due to a parallel dual report was circumvented. As stimulus material, we applied auditory streaming to the auditory modality (pupil dilation) and moving plaids to the visual modality (evoking the OKN).

In Study 1, we complemented an experiment on the pupil dynamics during perceptual switches in visual bistability (Einhäuser et al., 2008). For that purpose, we conducted two similar experiments in Study 1: In the first experiment, we recorded the pupil during perceptual switches evoked by an ambiguous auditory streaming stimulus. The number of instructed perceptual states, the percept space, was increased to four, instead of two in the foregoing study (Einhäuser et al., 2008). Indeed, as observed for the reported perceptual switches in visually ambiguous stimuli (structure from motion, Necker cube, moving plaids; Einhäuser et al., 2008), reported perceptual switches in auditory multistability go along with a dilation of the pupil at the time of the reported switch. We distinguished experimentally between endogenous (internal, perception-driven) and exogenous (external, stimulus-driven) perceptual switches. Exogenous perceptual switches were evoked by replaying the randomized dominance pattern of the percepts of a preceding trial to the participant with a special stimulus. In such trials (replay-active), the stimulus was a subtly disambiguated version of the auditory streaming stimulus (chirp sounds) that mimicked the four percepts. The pupil dilated more strongly for exogenous than for endogenous perceptual switches. In the second experiment of Study 1, the dilation of the pupil in response to endogenous perceptual switches was replicated and counter-checked against the motoric execution functions and the button-press procedure that facilitated the report of perceptual switches. In the so-called random condition, the random pressing of the four buttons, which were otherwise used to signal the four percepts, evoked a pupil dilation similar to that of the reported endogenous perceptual switches. That might hint that the pupil dilation in response to reported endogenous perceptual switches is confounded with a response to the report rather than to the perceptual switch. The replay condition was extended in Experiment 2: The removal of the contaminations of the pupil dilation by the physical changes of the stimulus in the exogenous perceptual switches (replay-corrected) did not substantially change the originally found pattern. Externally evoked perceptual switches would still go along with a stronger dilation of the pupil compared to internally evoked (endogenous) perceptual switches. The pupil dilation in response to the unreported exogenous perceptual switches, which was used in a subtraction approach to remove contaminations from the original replay condition (replay-active with report), signaled that the pupil dilates for purely perceptual decisions (with detection and classification, but without report). Likewise, the pupil dilated more strongly in response to exogenous perceptual switches than to random button-presses, which is seen as a perceptual component. The significant pupil dilation that Einhäuser et al. (2008) found for visual bistable stimuli, and only numerically for auditory bistable stimuli, is now complemented by our findings in Study 1 with a significant pupil dilation in response to perceptual switches during auditory multistable stimulation. Whether that is the result of the increased number of percepts or of the methodological advances applied is not yet retraceable and motivates future experiments.
It is conceivable that the pupil dilation in response to perceptual switches is significant in both experiments of Study 1 because the uncertainty in the four-alternative choice task might be larger than in the two-alternative choice task. The increased uncertainty might evoke the pupil dilation at the time of the report, so that it is eventually statistically easier to capture.

That testable hypothesis should be addressed in future studies. The pupil dilation in response to a random pressing task that is indistinguishable from the pupil dilation in response to endogenous perceptual switches, and the experimental inconsistency (increased number of percepts) with earlier studies, limit the explanatory power of Study 1 in the bigger picture. Assuming that the identified pupil dilation is, in fact, perception-driven, the pupillary response to perceptual switches in ambiguous auditory stimuli complements the findings on ambiguous visual stimuli. Such a multimodal sensitivity of the pupil to perceptual decision and selection in multistable situations is seen as evidence for the involvement of the locus coeruleus (brainstem) in multimodal multistability. The neural underpinnings, like the locus coeruleus, are interpreted as a partly common mechanism governing perception in a suspected top-down manner (Scocchia et al., 2014). If only modality-specific local sensory processing levels were involved in the perceptual switching, such autonomy would argue in favor of a distributed mechanism (Hupé et al., 2008). Such independence would be regarded as likely if only either auditory or visual multistability had a measurable effect on the dilation of the pupil at the time of the reported perceptual switch. Since the pupil is sensitive to numerous cognitive processes, it would have been important that the effect of the reported endogenous perceptual switch be distinguishable from the pupil dilation in the control condition for the report, as it was for visual multistability in the study of Einhäuser et al. (2008). Besides the involvement of the locus coeruleus, which is equally active in reported perceptual switches in vision and audition, other brain regions are candidates for being involved in perceptual multistability globally and have been studied in brain-imaging and electrophysiological studies: The frontoparietal cortex is active for perceptual switches in vision (Başar-Eroglu et al., 1996; Frässle et al., 2014; Strüber et al., 2000; Weilnhammer et al., 2013), but not necessarily in audition (Kondo et al., 2018; Sanders et al., 2018). Instead, auditory bi- and multistable conscious perception may correlate with neuronal activity in the frontotemporal network (Brancucci et al., 2016), in the intraparietal sulcus (Cusack, 2005), and with the neurotransmitter concentration in the auditory cortex and the inferior frontal cortex (Kondo et al., 2017). Such a convolute of results mirrors the widespread, distributed brain regions that are activated during perceptual tasks and can by no means be called contradictory. Instead, the findings depend on the specific perceptual task and highlight the parallel distributed processing of stimulus- and goal-related aspects of such experimental situations (McClelland et al., 1986).

If perceptual multistability in vision and audition evokes similar pupillary and neuronal responses, that does not necessarily mean that perceptual switches are triggered by a common mechanism or a unifying neuronal substrate. Instead, task-related processes (e.g., attention, response generation, decision criteria) of the administered paradigm, involved in the detection of switches, the classification of the percept, and in many cases the active report, might evoke similar neuronal activity that might be detected by researchers in search of the "center for perceptual switches".

This fallacy is not limited to pupillary responses but can also be found in other studies investigating the neuronal substrate (Hess & Polt, 1964; Hoeks & Levelt, 1993). In some studies, it is discussed that the frontal activity that is time-locked to reported perceptual switches emerges from report processes rather than from percept selection itself (Brascamp, Sterzer, et al., 2018; Frässle et al., 2014). More recent observations of perceptual transitions without report suggest that high-level functions (in the dorsolateral prefrontal cortex) are not involved in the percept competition but only in the reporting process (Knapen et al., 2011; Zou et al., 2016). Taken together, we are only at the beginning of identifying the neuronal substrate exclusively involved in the perceptual alternations.

What was still under-represented in the research is the parallel ambiguous stimulation of several modalities at once. Until now, three main streams of research have tried to approach the problem: First, it was tried to detect the similarity in timing and the spatial overlap in the neuronal activity (Brancucci et al., 2016). Second, the congruency of reported distributions of dominance durations was measured (e.g., well fitted with a lognormal distribution for some visual stimuli, Zhou et al., 2004; or, for some other visual stimuli, by a gamma distribution, Carter & Pettigrew, 2003; Devyatko et al., 2017; for a broader discussion see Mill et al., 2013). And third, the significance of correlations of perceptual switches (often depending on the sample size and the stimulus duration, Einhäuser et al., 2020) across modalities between several different forms of perceptual multistability was explored. Often in such experiments, participants are stimulated unimodally with ambiguous stimuli, and the measures of perceptual dominance are compared afterward; only few exceptions can be found in the literature (V. Conrad et al., 2010; Hupé et al., 2008). The unimodal observation with post-hoc analysis can, in my opinion, contribute only to some extent to the central questions of this Thesis: Do local processing stages trigger perceptual changes in the formation of conscious percepts? Or is there a global factor in the perceptual system contributing to switches, stemming from, e.g., a central switch generator (e.g., in the brainstem) that is modality-general? Or how do both processing levels collaborate depending on the specific stimulus ambiguity?

If multistable processes in two modalities are triggered at the same time, and the moment-to-moment binding of percepts or switches is evaluated, the central, the distributed, and the hybrid models proposed for perceptual multistability can be investigated in greater detail (i.e., Conrad et al., 2017; Hupé et al., 2008). We did exactly that in Studies 2 and 3, for an audiovisual stimulus pair, to examine the dependence of percepts crossmodally on a moment-by-moment basis.

The observation of the momentary binding and unbinding of percepts crossmodally allows one to investigate the hidden mechanism. In the distributed model, two completely separate mechanisms (modality-specific for audition and vision) would not systematically bind percepts between two modalities. On the contrary, in the central model of perceptual multistability (modality-general), a central unifying process would moderate the emergence of percepts in vision and audition at once. The perceptual switches would subsequently happen close to each other in both modalities. An even tighter coupling can be assumed if emerging percepts in both modalities exhibit consistency because the "cause" of one percept in both modalities is crossmodally fixed: A cross-modal coupling of percepts is associated with a global mechanism. As introduced in Chapter 1, the binding of percepts might follow certain rules that can stem from the similarities in the satisfaction of the perceptual grouping principles in both modalities (Kubovy & Yu, 2012). Originally formulated for the organization of elements of raw sensory information (Wertheimer, 1938), similar rules of segmentation were found for rhythmic patterns in auditory streams (Bregman, 1990). Ecologically, the coherence of visual and auditory information is often given (e.g., speech in face-to-face conversation). In the multistable process, the global component of the perceptual system might accumulate evidence from other sensory channels to stabilize the momentary percept and reduce the instability. We investigated that question in Studies 2 and 3 with a well-studied set of concurrently presented ambiguous stimuli, namely the auditory streaming stimulus (similar to the auditory stimulus in Study 1) for the auditory modality and the moving plaid stimulus for the visual sense. The moving plaid stimulus automatically evokes a reflexive eye-movement. The OKN has markedly different properties depending on the momentarily dominating percept. Both stimuli are unimodally capable of evoking bistability (two percepts: integrated and segregated, see Studies 2 and 3). The aim was to observe the moment-by-moment coupling of the percepts in both modalities (percept "integrated" in both audition and vision, percept "segregated" in both audition and vision). In the framework of a central (a shared global) or a distributed (several distributed local mechanisms) model, the coupling of the percepts signals the tendency of the perceptual system towards cross-modal interaction, which is potentially controlled by a global mechanism (e.g., multi- or supramodal processing stages). The uncoupling of the percepts would rather speak for a localized process that is not communicating crossmodally. From research on emotions, it is known that the integration of sensory information (e.g., audio and visual) is realized neither in unimodal nor in multimodal processing stages of the perceptual hierarchy, but in supramodal brain areas like the ventral posterior cingulate and the amygdala (Cauller, 1995; Klasen et al., 2011). Whether a similar global mechanism (at a different location) supports the emergence of conscious percepts is an ongoing debate (e.g., Grossberg et al., 2004; Kloosterman et al., 2015; Kornmeier et al., 2009; Scocchia et al., 2014; Toppino & Long, 2005).

We hope to contribute to this discourse with two studies on bimodal bistability: In Study 2, the evidence in favor of a global mechanism stems from the concurrent consistency between the percepts. The integrated and the segregated percept in vision were accompanied by the integrated and the segregated percept in audition, respectively. The coupling was not ideal, but significantly above chance (M = 53.5 %, SD = 6.1 %). We implemented a procedure adapting stimulus parameters of the moving plaid (opening angle between the plaids) and the auditory streaming (frequency difference between sounds A and B) stimulus to facilitate the cross-modal coupling and permit fair testing of the coupling hypothesis. The imperfect coupling might stem from different sources of noise and does not exclusively need to be caused by the perceptual system itself but, e.g., by low decoding accuracy, insufficient parameter adaptation, report errors, and the asynchrony of the stimulus cycles: In audition, one cycle of unique sensory information was 0.6 s, and in vision, 1.3 s. With the aid of a machine learning approach (i.e., SVM), the momentary visual percept was decoded from the high-dimensional features of the OKN. All the attentional resources of the participants could thus be focused on their momentary auditory percept. In this vein, attentional biases and an attentional bottleneck in the concurrent report of percepts can be ruled out. Challenges originating from a parallel report of the current percept in two modalities (dual-task) have hampered earlier audiovisual tests of cross-modal binding (V. Conrad et al., 2010; Hupé et al., 2008). With the applied no-report paradigm (OKN read-out), we might overcome most of these limitations. The direct perceptual binding between multistable processes is most likely linked, in parts, by a global structure (Horlitz & O'Leary, 1993; Leopold & Logothetis, 1999; Schwartz et al., 2012). The exact properties and the location of the neuronal substrate of the global structure are subject to controversy, but the direct perceptual coupling recorded in our study can be explained by a unifying global structure. These results are contrasted against findings that advocated merely local processing of the sensory modalities while handling multistable perceptual information. The claimed independence stems either from unimodal research on parallel ambiguous figures (Toppino & Long, 1987) or from the limitations of the bimodal dual-task introduced above.
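The OKN read-out idea can be illustrated with a generic decoding sketch; this is not the thesis' actual pipeline, and the feature set, labels, and classifier settings are assumptions made only for illustration.

```python
# Generic sketch of an SVM-based no-report decoder: a support vector machine is
# trained on eye-movement feature vectors labeled with the reported percept and
# then used to decode the momentary visual percept without manual report.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(6)
# Illustrative feature matrix: one row per time window, columns standing in for
# OKN properties such as slow-phase velocity, direction, and saccade rate.
X = rng.normal(0, 1, (400, 12))
y = rng.integers(0, 2, 400)        # 0 = integrated, 1 = segregated (reported labels)
X[y == 1, 0] += 1.5                # make one feature informative for the demo

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
acc = cross_val_score(clf, X, y, cv=5).mean()  # cross-validated decoding accuracy
print(f"Decoding accuracy (ACC): {acc:.2%}")
```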

In a second study on bimodal bistability, Study 3 of this Thesis, the properties of the cross-modal binding were further examined. The general binding of the percepts between the auditory and the visual modality, on a moment-by-moment basis, persisted and was replicated. However, the expected transfer of an externally induced percept in the visual bistable process to the auditory percept in a parallel, ongoing auditory bistable process was not observed. The absence of this local binding is interpreted as a weak or slow coupling between multistable processes in different modalities, or as a unidirectional coupling of the two bistable processes, which might be auditory-led; in sum, this needs further investigation. A third alternative (influencing the coupling) might be a dependence on which modality has to be reported. In the likely case of a weak (and maybe bidirectional) coupling, the external disruption of one bistable process with disambiguated sensory information leads to a momentary decoupling from the parallel bistable process. These open questions (coupling strength and direction) need to be addressed systematically in future studies. A starting point is to induce a certain auditory percept externally (e.g., by subtle disambiguation with the chirp sounds) and monitor the response of the visual percept. In this way, the directionality and the strength of the cross-modal coupling can be clarified beyond the scope of the studies at hand. Even though cross-modal integration can be observed to resolve ambiguities in either modality (Shams & Kim, 2010), it seems less likely that such an obligatory coupling is at work here. As discussed in depth in Study 3, the flexibility of the perceptual system makes it more likely that the conflict between modalities (inconsistent percepts) is momentarily tolerated. The perceptual system is too flexible to simply lock the percepts together consistently at all times.

In this Thesis, we presented evidence for the responses of the pupil to auditory perceptual switches and of reflexive eye-movements (OKN) to visual bistability. The dilation of the pupil signals transitions of the activity in the locus coeruleus, a nucleus located in the brainstem (De Gee et al., 2014; Murphy et al., 2014). In vision, namely in binocular rivalry (competition of sensory information in dichoptic displays), the brainstem is thought to synchronize the activity of competing neuron populations (local) and to mediate the global output inter-hemispherically (Miller et al., 2000; Pettigrew, 2001). In auditory perception, the electrical brain responses to the onset of auditory events likewise originate in the brainstem (Itoh et al., 2001; Mill et al., 2013). The brainstem system is known for its involvement in consciousness, multisensory integration, and arousal (Basura et al., 2012; Jaschke, 2019; Stein, 2012), and for the perceptual deficits arising from damage to it (J. Conrad et al., 2020). The reflexive OKN evoked by the moving plaids and its percept-dependent properties similarly rely on the functionality of the brainstem (Bense et al., 2006; Strupp et al., 2011). In the course of this Thesis, evidence was presented for the involvement of neuromodulatory brainstem regions in auditory (pupil dilation, Study 1) and visual (OKN, Studies 2 and 3) bi- and multistability. The findings are substantiated by the moment-by-moment coupling of consistent percepts in two audiovisual bistability studies that relied on behavioral measures of perceptual dominance. If a shared mechanism is indeed contributing to perceptual switches, such a mechanism might be located in the brainstem: The arousal system of the locus coeruleus modulates noradrenergic levels and has proven to be a crucial control center for post-decisional consolidation in perception (Einhäuser et al., 2010; Nieuwenhuis et al., 2005). Consolidation refers to the selection of one outcome of a cognitive decision (Aston-Jones & Cohen, 2005).

In the perceptual decision process on ambiguous stimuli, several percepts compete for momentary dominance of consciousness. All percepts are individually compatible with the sensory evidence and conflict because they exclude each other (Drew et al., 2019; Szmalec et al., 2008). The task of the perceptual system is to decide on one percept. Conflict monitoring and response resolution (in a pupil dilation paradigm) were recently attributed to the locus coeruleus noradrenergic arousal system (Grueschow et al., 2020). Conflict monitoring is a concept stemming from the cognitive control framework (Botvinick et al., 2001). Cognitive control describes the complex cognition responsible for directing goal-oriented adaptive behavior to situation-specific requirements (Kan et al., 2013); similar behavior can be found in all our studies, e.g., “switch the report button if percept and button are no longer coherent”. In a series of experiments with perceptual (Necker cube) and verbal stimuli (Stroop), Kan et al. (2013) identified cross-task conflict adaptation. They used different tasks, like the Stroop task, ambiguous and unambiguous sentence reading, and ambiguous and disambiguated Necker cubes, and sequentially analyzed accuracy and reaction times. Conflict adaptation refers to a conflict-detection mechanism triggered in one domain that enhances conflict resolution in a different, subsequently presented domain. Classically, in the Stroop task, two conditions are presented. In the congruent case, the shown word (e.g., “red”) matches the font color (e.g., red); in the incongruent case, the font color differs from the word (e.g., “red” printed in blue). The Stroop task (besides many others, e.g., the flanker task) is relevant to cognitive control theory because conflict, and the adaptation to conflict, in the font color needs to be monitored, detected, and resolved. Even though the ambiguity does not persist as it does in stimuli like auditory streaming or the moving plaids, the incongruency between the word and the font color constitutes a momentary ambiguity. Conceptually, the word refers to a color name stored in memory that is not congruent with the momentarily perceived color information. Cognitive control theory might be applicable to perceptual ambiguity because percepts conflict, and this conflict is renewed continuously by the stimulus ambiguity. The findings of Kan et al. (2013) suggest a domain-general mechanism because the adaptation performance between several tasks (syntactic to non-syntactic domain and perceptual to verbal domain) presented in a sequence was improved. The study cannot provide evidence for a single cognitive control mechanism for all domains, but the authors propose that representational conflicts are resolved across domains by the same mechanism. Neuronally, different demands in the cognitive control of a manual Stroop task could modulate the activity of the locus coeruleus and further brain regions in healthy individuals (S. Köhler et al., 2016); a finding that could not be replicated for medicated individuals suffering from a dysfunctional neurotransmitter system (S. Köhler et al., 2019). In an auditory Stroop task with different behavioral response types (two verbal and one manual), the fronto-central negative-polarity wave was similarly enhanced as in the visual Stroop task, advocating for a supramodal general mechanism (Donohue et al., 2012).

Similar domain-independent resources for cognitive control can be found for unambiguous stimuli in a series of tasks (attention shifting and categorization switches) in brain-imaging studies outside the brainstem (Chiu & Yantis, 2009). Cognitive control and conflict monitoring theory fit into the hybrid model of perceptual multistability as the high-level component which consolidates decisions on the low-level sensory representations (Long & Toppino, 2004; Toppino & Long, 2005). In this framework, the conflict resolution might have a supramodal source (e.g., the locus-coeruleus norepinephrine system) that signals the consolidation of a perceptual decision to the low-level sensory stages or enables, in a top-down manner, the consolidation of the perceptual competition in favor of one percept (van der Wel & van Steenbergen, 2018).

C5.1 Significance for models of perceptual multistability and implications for the perceptual architecture

The perceptual processing hierarchy

The perceptual system is constantly faced with the task of providing a meaningful representation of what is happening in the surrounding visual and auditory environment. The representational process is not a mere camera- or microphone-like, hard-wired neuronal circuit that provides inter-subjectively identical, one-to-one conscious perceptions. Instead, the perceptual organization is subjective, trained, learned, and shaped by attention, expectation, mood, and other dynamic factors (Gibson, 2014). Some individuals may focus on details; others might take a more holistic stance (Nisbett & Miyamoto, 2005). Naturally, both perspectives are useful to fulfill markedly different tasks, e.g., the detection of breast cancer in a mammogram (Kundel et al., 2007) or face recognition (Rossion, 2013). Classically, the perceptual hierarchy is split into several theoretical processing stages, which are complex to discriminate (Mechelli et al., 2004). First, the raw sensory information stimulates the sensor. The sensory information is forwarded in synaptic networks from low-level, early sensory stages to higher cortical areas that represent complex and abstract cognitive conscious functions (e.g., selective attention). Low-level perception, i.e., in the primary visual and auditory cortex, is built on neural computations that infer sensory information in unimodal pathways (Hudetz, 2006). Early sensory processing is driven by stimulus properties (e.g., saliency) rather than by awareness and happens autonomously (Dehaene & Changeux, 2011) before the information streams converge through functional connectivity in cross-modal processing and form a multimodal representation in consciousness. Such low-level coaction in perception is complemented by high-level cortices (e.g., modulations of the frontoparietal cortex). The interplay between low-level and high-level networks is classically split into bottom-up (salience of stimulus features) and top-down (goal-driven) factors. Both levels communicate with each other in a distributed, multilayered feedforward hierarchy and in cortico-cortical feedback processes (Frégnac & Bathellier, 2015; X. Xu et al., 2020). Further decomposing the functionality of the perceptual system might enable us to understand how a stable and meaningful perceptual representation reaches consciousness through the perceptual hierarchy. In the orchestra of neuronal assemblies that represent the increasingly “objectified” sensory signals, ambiguous figures are seen as a tool to put the perceptual system into a “stable, unstable” state (Zeki, 2004). Perceptual multistability is rarely perceivable in day-to-day life but allows conclusions on the complex give-and-take beyond the reaction to a physical change in the sensory information. Instead of responses to exogenous transitions (stimulus-driven), evoked perceptual multistability exposes perceptual dynamics that are driven foremost by the hidden architecture itself and less by modulations in the sensory signal.

Implications for models of perceptual multistability

As introduced in Chapter 1, one prominent local model of perceptual multistability is based on reciprocal inhibition and adaptation of neuronal groups (Blake, 1989). Each percept is coded for by a different set of low-level sensory neuronal groups that dominate as long as they inhibit the competing group. At the same time, the neurons of the currently dominating percept adapt until the neurons of the other percept take over. Subsequently, the just-suppressed neuronal group inhibits the newly dominant percept. In this way, the cognitive representation switches again and again. Another ingredient in the produced percept sequence is noise, which enters the activity of the percept neurons and shapes the variability in perception. Noise can stem from different sources, e.g., neuronal noise (R. Cao et al., 2014, 2016). The three building blocks (noise, inhibition, and adaptation; see the sketch below) find support in imaging, modeling, and psychophysical studies across modalities (Denham et al., 2018). If the same universal mechanism is found in all senses separately, distributed on lower levels of the sensory pathways, researchers can be misled into interpreting the similarities in the reported perceptual experience (e.g., by introspection) and in no-report correlates of consciousness as evidence for one unifying global process. This question, we propose, can be better addressed by parallel stimulation with ambiguous stimuli in several modalities: in Studies 2 and 3, we observed a general coupling of percepts between the visual and the auditory bistable processes. Low-level sensory stages facilitate the emergence of the percepts, but a distinct global process that integrates sensory information to select and stabilize percepts crossmodally seems to be involved as well. The global mechanism can be implemented either by top-down streams or by early intermodal links (e.g., phase reset) (Driver & Noesselt, 2008; Kayser et al., 2008; Palva et al., 2005).
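A minimal rate-model sketch of this class of models is given below (Python; the gain function, parameters, and noise term are our illustrative choices rather than Blake's original formulation). Two percept populations mutually inhibit each other, the winner slowly adapts, and input noise perturbs the dynamics, together yielding an alternating dominance sequence:

```python
import numpy as np

def simulate_rivalry(T=60.0, dt=1e-3, I=1.0, beta=1.1, phi=0.8,
                     tau=0.01, tau_a=0.9, sigma=0.05, seed=0):
    """Two percept populations with mutual inhibition (beta),
    slow self-adaptation (phi, tau_a), and additive input noise (sigma)."""
    rng = np.random.default_rng(seed)
    n = int(T / dt)
    r = np.zeros((n, 2))   # firing rates of the two percept populations
    a = np.zeros(2)        # slow adaptation variables
    f = lambda x: 1.0 / (1.0 + np.exp(-(x - 0.2) / 0.1))  # sigmoid gain
    for t in range(1, n):
        noise = sigma * rng.standard_normal(2)
        drive = I - beta * r[t - 1, ::-1] - phi * a + noise  # cross-inhibition
        r[t] = r[t - 1] + dt / tau * (-r[t - 1] + f(drive))
        a += dt / tau_a * (-a + r[t])                        # winner fatigues
    return r

r = simulate_rivalry()
dominant = np.argmax(r, axis=1)               # momentary winning percept
switches = np.flatnonzero(np.diff(dominant))  # sample indices of switches
durations = np.diff(switches) * 1e-3          # dominance durations in s (dt = 1 ms)
print(f"{len(switches)} switches, mean dominance {durations.mean():.2f} s")
```

No debouncing is applied, so noise-induced chattering around the crossings inflates the switch count slightly; the qualitative point is that inhibition produces exclusivity, adaptation produces inevitability, and noise produces the variable dominance durations.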

Unimodal “top-down modulations” are suspected in visual and auditory multistable perception equally (Einhäuser et al., 2020). The prevalence of task-related effects on the cross-modal correlations in perceptual dominance, like response strategies (personality: switching-type, stable-type) and idiosyncratically varying decision thresholds (early-response, late-response), cannot be completely ruled out (Gallagher & Arnold, 2014). The purpose of the top-down signals can be twofold and is subject to controversy: either they stabilize the just-emerged percept (Jazayeri & Movshon, 2007), and/or they destabilize the currently dominating percept (Nienborg & Cumming, 2009). Additionally, voluntary control can be exerted across modalities to a limited extent (Denham et al., 2018; Kondo et al., 2018). Different degrees of instructed, voluntary control (bias prolongation, “prolong the current percept”; bias shortening, “interrupt the current percept”) that can be applied equally to auditory and visual multistable processes could not be completely explained by attention-related top-down streams. Since volitional control could be demonstrated for both modalities, one might assume that the correlations stem from an attentional bias that can influence both modalities in a top-down manner. However, the crossmodal correlations were found similarly in the condition where neither of the two biases was instructed (i.e., the neutral condition). These experimental findings are in line with results of computational modeling, which attribute some limited biasing power to attention, a power that cannot explain the competition for perceptual dominance as a whole (Koene, 2006; Van Ee et al., 2009).

A more hybridized perspective on the phenomenon is offered by the predictive coding account of perceptual multistability (Friston & Kiebel, 2009; Kleinschmidt et al., 2012; Weilnhammer et al., 2017; Winkler et al., 2012). The idea behind predictive coding is that the brain is dominated by an inferential process that tries to make “sense” of, and detect the “causes” of, the continuously received sensory input. The stable sensory information in perceptual multistability triggers the brain to constantly come up with new ways of “seeing” or “hearing” the current sensory information at different stages of the perceptual process, which is thought to be iterative. The unconsciously computed perceptual inferences on an ambiguous stimulus lead to equally high probabilities for different percepts, which subsequently switch. The mechanism would be steered by a step-wise, higher-level process that tries to single out which representation could be the cause of the ambiguous sensory information. In predictive coding, feedforward and feedback streams between low- and high-level brain regions communicate about the coherence between the sensory information and the momentary perceptual representation. The prediction error is the difference between the top-down prediction and the bottom-up sensory input. Feedback of the error is considered in updating the next prediction. The continuous stimulation with an ambiguous stimulus constantly questions the current prediction or mental representation. This model allows predictions that are empirically testable and is conceptually related to the hybrid model (Toppino & Long, 2005).

Here, perceptual switches are dominantly driven by the global process computing probabilities, and the sensory level facilitates the constant competition. Switching between the perceptual alternatives is reflected in only a small fraction of sensory cells in the early cortical pathway, whereas the majority of percept-related changes is found in later stages (e.g., for binocular rivalry: Lehky & Maunsell, 1996). Whether such minimal low-level sensory activity is compatible with a high-level, top-down stream that selects percepts and modulates early cortical areas is an ongoing debate (Kornmeier & Bach, 2009).
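The inferential loop can be caricatured in a few lines (a toy sketch under our own simplifying assumptions; the templates, leak, and fatigue terms are illustrative devices and not taken from any of the cited models). Each hypothesis issues a top-down prediction, the bottom-up input yields a prediction error per hypothesis, and leaky evidence accumulation plus an adaptation penalty on the winning hypothesis produces alternations under a stimulus that both hypotheses explain equally well:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two internal "causes" (percepts) whose predictions fit the ambiguous
# input equally well on average -- templates are illustrative assumptions.
templates = np.array([[1.0, 0.0], [0.0, 1.0]])   # top-down predictions
ambiguous_input = np.array([0.5, 0.5])           # equally consistent with both

log_evidence = np.zeros(2)   # leaky accumulated fit per hypothesis
fatigue = np.zeros(2)        # slow penalty on the currently winning cause
percepts = []

for t in range(5000):
    sample = ambiguous_input + 0.2 * rng.standard_normal(2)  # noisy input
    for h in range(2):
        prediction_error = sample - templates[h]   # bottom-up minus top-down
        # the better-fitting hypothesis accrues evidence (smaller error)
        log_evidence[h] += -0.5 * np.sum(prediction_error ** 2) - fatigue[h]
    winner = int(np.argmax(log_evidence))
    fatigue[winner] += 1e-4   # adaptation gradually destabilizes the winner
    fatigue *= 0.999          # ...and slowly recovers for the loser
    log_evidence *= 0.98      # leak keeps the evidence race bounded
    percepts.append(winner)

switches = int(np.sum(np.diff(percepts) != 0))
print(f"{switches} perceptual switches over {len(percepts)} updates")
```

Because the ambiguous input fits both templates equally in expectation, only the noise in the prediction errors and the fatigue on the winner break the tie, which is the sense in which the dynamics are driven by the architecture rather than by the stimulus.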

The complex interaction between bottom-up and top-down components has motivated a unifying theoretical framework: the hybrid model (Toppino & Long, 2005). The hybrid model (of figural reversal) mostly deals with the visual subsystem but may allow similar explanations for the auditory subsystem or the perceptual system as a whole because it incorporates non-sensory cortical processing levels. The model (Proposal 2, p. 761) combines bottom-up (stimulus-driven) and top-down (concept-driven) factors that influence each other through recurrent connections (see predictive coding). The localized neuronal structure (bottom-up) produces percepts that interact with high-level cognitive stages on intermediate stages to resolve the momentary perceptual competition. The intermediate stage might be a representational level that allows the “antagonistic relationship between competing neural networks” (p. 762). The loose binding of the percepts between an auditory and a visual bistable process is evidence in favor of a hybrid model (Study 2). We can also see in the momentary decoupling of percepts at fixed time points (inducers) in Study 3 that the coupling of the bistable processes is not as robust (or as bidirectional) as a single, monolithic global mechanism would enforce. Externally induced percepts in multistability were not forwarded from visual to auditory perception. Whether the mediation of cross-modal integration happens on multimodal or supramodal cognitive levels remains elusive and needs further investigation. The momentary independence between modalities, as found in Study 3, is still compatible with the hybrid model, which accounts for the bottom-up autonomy and the flexibility of the perceptual system to handle sensory information that has no common source at all: e.g., imagine a mundane situation where a person listens to a recorded conversation over headphones while visually observing two strangers having a conversation. The global mechanism does not rule out flexibility and openness to low-level sensory evidence.

The discoveries produced by research on perceptual multistability have contributed greatly to the understanding of the perceptual system (Leopold & Logothetis, 1999). The exchange between local sensory processing stages and high-level, global, cognitive factors during perceptual multistability generates insightful research questions on the perceptual process, which have inspired this Thesis.

The hybrid model is an integrative approach embracing both distributed and global models of perceptual multistability (Toppino & Long, 2005). At the same time, on a critical note, the hybrid model is not able to specify the exact coordination of local and global factors on the intermediate stages of perceptual processing (potentially at a representational level, as Toppino & Long proposed in 2005), but it overcomes the bottom-up vs. top-down dichotomy that some researchers call “fuzzy” or a “failed theoretical dichotomy” (Awh et al., 2012; Rauss & Pourtois, 2013; Theeuwes, 2010). The reasoning behind these criticisms addresses the sometimes arbitrary, sometimes impossible attribution of effects as “saliency-driven” (bottom-up) or “goal-driven” (top-down), which is an especially intricate case in multistable perception (Kornmeier et al., 2009).

C5.2 Recommendations for future research

Following the hybrid model

We have made study-specific suggestions in Chapters 2 to 4 on how to address issues that emerged in capturing the true perceptual effects on pupil dilation in auditory multistability and on how to test the cross-modal coupling in audiovisual bistability further. The question of whether perceptual multistability rests predominantly on a locally distributed mechanism that is independent of global structures, or on a global mechanism that exerts modulatory, de- and stabilizing, and selecting influence on low-level sensory stages, cannot be answered within a dichotomous framework but should be addressed within a multimodal hybrid model, as I discussed in the foregoing subsection. That said, it will not become any easier to decipher the integrative perceptual process, because findings can no longer be simply attributed to either end of the spectrum. The task-specific implications for the supposed model need to be considered more rigorously. When constructing experimental paradigms, it should be ensured that each component, and their respective interaction, has the chance to emerge in the resulting pattern.

No-report paradigms: friend or foe?

No-report paradigms allow removing confounding influences that executive response functions might have on the different indirect (e.g., neuronal and pupillary) correlates of perceptual dynamics (Tsuchiya et al., 2015). Classically, the subjective experience is reported directly via button-press or verbalization, but this has numerous limitations (e.g., cognitive functions get activated that are unconnected to perception). For example, the pupil dilation is susceptible to the executive function controlling the manual report of the percept by button-pressing (Hess & Polt, 1964; Hupé et al., 2009). Contaminations of the pupil dilation at perceptual switches in multistability due to the reporting task do exist.

Similar contaminations can be found for other neuronal markers and might even dominate the measured signal. As discussed above, the activation in the frontoparietal cortex was attributed predominantly to reporting tasks, and the perceptual switches might have happened in the absence of these executive processes (Brascamp, Blake, et al., 2015; Brascamp, Sterzer, et al., 2018). In contrast, no-report paradigms allow removing a substantial executive component of cognitive control that might overshadow true cross-modal correlations in the perceptual, behavioral dominance measures with other general cognitive factors (e.g., under-/over-reporters: somebody who, by personality, reports too little or too much). We adopted a no-report paradigm throughout all studies: in Study 1, we captured the pupil response to exogenous perceptual switches (which is still impossible for endogenous perceptual switches), and in Studies 2 and 3, we derived the momentary percept from the eye-movements. However, the “aware” monitoring and classification (e.g., memory, decision criteria) of the instructed percept remains involved in no-report paradigms, and those processes themselves are not immune to confounding perception (Overgaard & Fazekas, 2016). No-report-related cognitive processes might leave a similar or even more pronounced disguising signature on the neuronal correlates and, potentially, on the pupil dilation as well (Vandenbroucke et al., 2014). How far do we need to go in stripping down the experimental task until we remain only with the “true perceptual component” of the hidden process? Conscious experience, attention, and active introspection belong, by definition, to what is generally considered the top-down component of the perceptual process. By removing high-level executive functions, one must be careful not to systematically suspend cognitive components that account for, or are involved in, the very global mechanisms the researcher is interested in. In perceptual multistability, the switches between different perceptual states are inevitable. Consciousness seems to be subordinate in perceptually ambiguous situations; an undisguised view of lower processing stages is thought to be possible. Fast and automatic processing is often attributed more to bottom-up processing (Dehaene & Naccache, 2001; Moors & De Houwer, 2006). Since the percepts alternate in consciousness and only limited volitional control is applicable, the global process binding perceptual switches and percepts might be located in the middle ground between the ambiguous low-level unimodal representations and the inadvertent switching between abstract representations in awareness. Therefore, I believe it is time to take a closer look at those intermediate stages by applying multiple no-report paradigms at once, without further instructions. Instructions influence and shape the expectations of the participant. However, perception still functions without an instructed “task”, which might allow keeping track of the covert processes. Instructions allow narrowing down the possibly wider perceptual experience to a percept space that can be handled more easily. If participants decided themselves how many perceptual states they individually perceive, the comparison of the individual percept patterns would be difficult or even impossible.

Here, psycho-philosophical concepts such as intentionality also become relevant (Tschacher & Haken, 2007).

The next question is to further understand the dynamic balance of local and global factors in the perceptual processing hierarchy by mapping the commonalities and differences across a wider spectrum of ambiguous stimuli. Are our findings universal or limited to the specific set of ambiguous stimuli applied? What characterizes these specific sets compared to other stimuli? The reason for the coupling of certain percepts crossmodally should be better understood to allow a conclusive statement on the global process (for a discussion, see Kubovy & Yu, 2012). Potentially, the mechanism leading to perceptual switches is implemented on locally distributed, sensory-specific stages of processing as well as on global stages of the perceptual system. In this way, the perceptual system demonstrates its openness to the infinite variety of ecological stimuli, which can thus be captured adequately (e.g., low-level and high-level ambiguities).

C5.3 Conclusion

We conducted three experimental studies tracking reflexive pupillometric, oculomotoric, and behavioral responses. We found pupillary responses to switches in auditory perceptual multistability (unimodal) similar to those found earlier by Einhäuser et al. (2008) for visual bistability (unimodal). For that reason, we suppose that the mechanism governing perceptual multistability is not only locally distributed for every modality independently, but that a supramodal, unifying stage in perceptual processing is involved in the updating of the momentary percept. A low-level mechanism may well exist; however, it is accompanied by a partly global mechanism. Whether that shared mechanism is a direct correspondence between low-level sensory networks or is modulated in a top-down cognitive regime that consolidates perceptual decisions crossmodally on a representational level cannot be answered with certainty within this Thesis. The pupillary responses are driven by the neurotransmitter balances in brainstem areas, which are supposedly triggered by supramodal perceptual events (Walz et al., 2013). However, this balance could merely capture signals from the low-level, distributed, and local competition on the sensory level, instead of influencing the perceptual switches themselves. Further, the evidence for a bimodal cross-talk of bistable processes hints that a global process is shared between modalities, evoking perceptual switches and dominant percepts at the same time. The observed coupling is not strong enough (or lacks bidirectionality) to forward the externally induced percept from the visual to the auditory bistable process. This shows how a global process can be engaged in an audiovisual coupling while perceptual multistability still persists separately and simultaneously in the modalities, without being dependent on one single global factor at all times.

Besides the contributions to the understanding of perceptual multistability summarized above, we developed several methodological advances in the course of the experiments: First, we developed an easy-to-distinguish disambiguated version of the auditory streaming paradigm (Study 1, chirp sounds). We used this stimulus to evoke exogenous perceptual switches and to verify the understanding and cooperation of the participants for the auditory modality in the subsequent studies (Studies 2 and 3). Second, we individually adapted the display luminance to evoke a mid-level pupil diameter in a pre-experiment (Study 1, Experiment 2). The aim of this procedure was to pick a luminance level in the visual setup that evoked an intermediate pupil diameter, to prevent ceiling and floor effects during the experiment. We calculated the theoretical midpoint between the pupil diameters induced by the lowest and the highest luminance presented on the display and accordingly chose the luminance level whose (average) pupil diameter exhibited the smallest difference to this midpoint (see the sketch below). Third, we replicated and improved a strategy to read out the momentary perceptual state during the bistable perception of a moving plaid stimulus from the induced eye-movements with the help of a machine learning approach (Studies 2 and 3). We successfully used the procedure proposed by Wilbertz et al. (2018) as a no-report instrument to capture the visual percept without manual report. Simultaneously, the current percept was manually reported for the auditory bistability evoked by an auditory streaming stimulus. In this vein, we could observe the moment-by-moment coupling of perceptual states between both modalities. We thereby overcame earlier, potentially split-attention-related restrictions that emerge along with a dual-task approach when evoking multistability in several modalities at once (V. Conrad et al., 2010; Hupé et al., 2008). These methodological advances will hopefully improve and inspire future studies on multimodal multistability and thereby contribute to our understanding of this complex phenomenon.
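The second advance, the luminance calibration, amounts to a simple computation; a sketch with hypothetical calibration values (units and numbers assumed purely for illustration) is:

```python
import numpy as np

# Hypothetical calibration data: mean pupil diameter (mm) measured while
# the participant views each candidate background luminance.
luminances = np.array([5, 20, 40, 60, 80, 100])            # cd/m^2 (assumed)
mean_diameter = np.array([6.1, 5.2, 4.6, 4.0, 3.5, 3.1])   # per luminance

# Theoretical midpoint between the diameters at the lowest and the
# highest luminance actually presented on the display.
midpoint = (mean_diameter[luminances.argmin()]
            + mean_diameter[luminances.argmax()]) / 2.0

# Choose the luminance whose average diameter is closest to the midpoint,
# leaving headroom for dilation and constriction during the experiment.
best = luminances[np.abs(mean_diameter - midpoint).argmin()]
print(f"midpoint {midpoint:.2f} mm -> background luminance {best} cd/m^2")
```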


REFERENCES

Adams, P. A. (1954). The effect of past experience on the perspective reversal of a tridimensional figure. The American Journal of Psychology, 67(4), 708–710. Adams, P. A., & Haire, M. (1959). The effect of orientation on the reversal of one cube inscribed in another. The American Journal of Psychology, 72(2), 296–299. Alais, D., & Burr, D. (2004). The Ventriloquist Effect Results from Near-Optimal Bimodal Integration. Current Biology, 14(3), 257–262. Alsius, A., & Munhall, K. G. (2013). Detection of audiovisual speech correspondences without visual awareness. Psychological Science, 24(4), 423–431. Ammons, R. B. (1954). Experiential factors in visual form perception: I. Review and formulation of problems. Journal of Genetic Psychology, 84(1), 3–25. Andrews, T. J., Schluppeck, D., Homfray, D., Matthews, P., & Blakemore, C. (2002). Activity in the fusiform gyrus predicts conscious perception of Rubin’s vase-face illusion. NeuroImage, 17(2), 890–901. Arieli, A., Sterkin, A., Grinvald, A., & Aertsen, A. D. (1996). Dynamics of ongoing activity: explanation of the large variability in evoked cortical responses. Science, 273(5283), 1868–1871. Aston-Jones, G., & Cohen, J. D. (2005). An Integrative Theory of Locus Coeruleus-Norepinephrine Function: Adaptive Gain and Optimal Performance. Annual Review of Neuroscience, 28(1), 403–450. Attneave, F. (1971). Multistability in perception. Scientific American, 225(6), 62–71. Awh, E., Belopolsky, A. V., & Theeuwes, J. (2012). Top-down versus bottom-up attentional control: A failed theoretical dichotomy. Trends in Cognitive Sciences, 16(8), 437–443. Bakouie, F., Pishnamazi, M., Zeraati, R., & Gharibzadeh, S. (2017). Scale-freeness of dominant and piecemeal perceptions during binocular rivalry. Cognitive Neurodynamics, 11(4), 319–326. Barlow, H. (1990). Conditions for versatile learning, Helmholtz’s unconscious inference, and the task of perception. Vision Research, 30(11), 1561–1571. Barniv, D., & Nelken, I. (2015). Auditory streaming as an online classification process with evidence accumulation. PLOS ONE, 10(12), Article e0144788. Başar-Eroglu, C., Strüber, D., Kruse, P., Başar, E., & Stadler, M. (1996). Frontal gamma-band enhancement during multistable visual perception. International Journal of Psychophysiology, 24(1–2), 113–125. Basirat, A., Schwartz, J. L., & Sato, M. (2012). Perceptuo-motor interactions in the perceptual organization of speech: Evidence from the verbal transformation effect. Philosophical Transactions of the Royal Society B: Biological Sciences, 367(1591), 965–976. Basura, G. J., Koehler, S. D., & Shore, S. E. (2012). Multi-sensory integration in brainstem and auditory cortex. Brain Research, 1485, 95–107. Beatty, J., & Lucero-Wagoner, B. (2000). The pupillary system. In J. T. Cacioppo, L. G. Tassinary, & G. G. Berntson (Eds.), Handbook of psychophysiology (pp. 142–162). Cambridge University Press. Benjamini, Y., & Hochberg, Y. (1995). Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society Series B (Statistical Methodology), 57(1), 289–300.


Bense, S., Janusch, B., Vucurevic, G., Bauermann, T., Schlindwein, P., Brandt, T., Stoeter, P., & Dieterich, M. (2006). Brainstem and cerebellar fMRI-activation during horizontal and vertical optokinetic stimulation. Experimental Brain Research, 174(2), 312–323. Biederman, I., & Cooper, E. E. (1992). Size invariance in visual object priming. Journal of Experimental Psychology: Human Perception and Performance, 18(1), 121–133. Binder, J. R., Liebenthal, E., Possing, E. T., Medler, D. A., & Ward, B. D. (2004). Neural correlates of sensory and decision processes in auditory object identification. Nature Neuroscience, 7(3), 295–301. Blake, R. (1989). A Neural Theory of Binocular Rivalry. Psychological Review, 96(1), 145–167. Blake, R., Brascamp, J. W., & Heeger, D. J. (2014). Can binocular rivalry reveal neural correlates of consciousness? Philosophical Transactions of the Royal Society B: Biological Sciences, 369(1641), Article 20130211. Borsellino, A., De Marco, A., Allazetta, A., Rinesi, S., & Bartolini, B. (1972). Reversal time distribution in the perception of visual ambiguous stimuli. Kybernetik, 10(3), 139–144. Botvinick, M. M., Braver, T. S., Barch, D. M., Carter, C. S., & Cohen, J. D. (2001). Conflict monitoring and cognitive control. Psychological Review, 108(3), 624–652. Bradley, M. M., Miccoli, L., Escrig, M. A., & Lang, P. J. (2008). The pupil as a measure of emotional arousal and autonomic activation. Psychophysiology, 45(4), 602–607. Brainard, D. H. (1997). The Psychophysics Toolbox. Spatial Vision, 10(4), 433–436. Brancucci, A., Lugli, V., Perrucci, M. G., Del Gratta, C., & Tommasi, L. (2016). A frontal but not parietal neural correlate of auditory consciousness. Brain Structure and Function, 221(1), 463–472. Brascamp, J. W., Becker, M. W., & Hambrick, D. Z. (2018). Revisiting individual differences in the time course of binocular rivalry. Journal of Vision, 18(7), Article 3. Brascamp, J. W., Blake, R., & Knapen, T. (2015). Negligible fronto-parietal BOLD activity accompanying unreportable switches in bistable perception. Nature Neuroscience, 18(11), 1672–1678. Brascamp, J. W., Klink, P. C., & Levelt, W. J. M. (2015). The “laws” of binocular rivalry: 50 years of Levelt’s propositions. Vision Research, 109, 20–37. Brascamp, J. W., Sterzer, P., Blake, R., & Knapen, T. (2018). Multistable Perception and the Role of the Frontoparietal Cortex in Perceptual Inference. Annual Review of Psychology, 69, 77–103. Braun, J., & Mattia, M. (2010). Attractors and noise: Twin drivers of decisions and multistability. NeuroImage, 52(3), 740–751. Bregman, A. S. (1990). Auditory scene analysis: the perceptual organisation of sound. MIT Press. Calvert, G. A., & Thesen, T. (2004). Multisensory integration: Methodological approaches and emerging principles in the human brain. Journal of Physiology, 98(1–3), 191–205. Cao, R., Braun, J., & Mattia, M. (2014). Stochastic accumulation by cortical columns may explain the scalar property of multistable perception. Physical Review Letters, 113(9), Article 098103. Cao, R., Pastukhov, A., Aleshin, S., Mattia, M., & Braun, J. (2020). Instability with a purpose: how the visual brain makes decisions in a volatile world. BioRxiv, https://doi.org/10.1101/2020.06.09.142497. Cao, R., Pastukhov, A., Mattia, M., & Braun, J. (2016). Collective activity of many bistable assemblies reproduces characteristic dynamics of multistable perception. Journal of Neuroscience, 36(26), 6957–6972.


Cao, T., Wang, L., Sun, Z., Engel, S. A., & He, S. (2018). The independent and shared mechanisms of intrinsic brain dynamics: Insights from bistable perception. Frontiers in Psychology, 9, Article 589. Carbon, C. C. (2014). Understanding human perception by human-made illusions. Frontiers in Human Neuroscience, 8, Article 566. Carter, O. L., & Pettigrew, J. D. (2003). A common oscillator for perceptual rivalries? Perception, 32(3), 295–305. Carter, O. L., Snyder, J. S., Fung, S., & Rubin, N. (2014). Using ambiguous plaid stimuli to investigate the influence of immediate prior experience on perception. Attention, Perception, and Psychophysics, 76(1), 133–147. Cauller, L. (1995). Layer I of primary sensory neocortex: where top-down converges upon bottom- up. Behavioural Brain Research, 71(1–2), 163–170. Chang, C. C., & Lin, C. J. (2011). LIBSVM: A Library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3), Article 27. Chen, C., & Young, Y. (2003). Vestibular evoked myogenic potentials in brainstem stroke. Laryngoscope, 113(6), 990–993. Chen, Y. C., & Spence, C. (2017). Assessing the role of the “unity assumption” on multisensory integration: A review. Frontiers in Psychology, 8, Article 445. Chiu, Y. C., & Yantis, S. (2009). A domain-independent source of cognitive control for task sets: Shifting spatial attention and switching categorization rules. Journal of Neuroscience, 29(12), 3930–3938. Cicchini, G. M., Mikellidou, K., & Burr, D. C. (2018). The functional role of serial dependence. Proceedings of the Royal Society B: Biological Sciences, 285(1890), Article 20181722. Colavita, F. B. (1974). Human sensory dominance. Perception & Psychophysics, 16(2), 409–412. Conrad, J., Habs, M., Boegle, R., Ertl, M., Kirsch, V., Stefanova-Brostek, I., Eren, O., Becker- Bense, S., Stephan, T., Wollenweber, F., Duering, M., zu Eulenburg, P., & Dieterich, M. (2020). Global multisensory reorganization after vestibular brain stem stroke. Annals of Clinical and Translational Neurology, 7(10), 1788–1801. Conrad, V., Bartels, A., & Noppeney, U. (2010). Audiovisual interactions in binocular rivalry. Journal of Vision, 10(10), Article 27. Conrad, V., Vitello, M. P., & Noppeney, U. (2012). Interactions Between Apparent Motion Rivalry in Vision and Touch. Psychological Science, 23(8), 940–948. Cornelissen, F. W., Peters, E. M., & Palmer, J. (2002). The Eyelink Toolbox: Eye tracking with MATLAB and the Psychophysics Toolbox. Behavior Research Methods, Instruments, and Computers, 34(4), 613–617. Cusack, R. (2005). The Intraparietal Sulcus and Perceptual Organization. Journal of Cognitive Neuroscience, 17(4), 641–651. Darwin, C. J. (1997). Auditory grouping. Trends in Cognitive Sciences, 1(9), 327–333. De Gee, J. W., Knapen, T., & Donner, T. H. (2014). Decision-related pupil dilation reflects upcoming choice and individual bias. Proceedings of the National Academy of Sciences of the United States of America, 111(5), E618–E625. De Marco, A., Penengo, P., Trabucco, A., Borsellino, A., Carlini, F., Riani, M., & Tuccio, M. T. (1977). Stochastic models and fluctuations in reversal time of ambiguous figures. Perception, 6(6), 645–656. Deco, G., Rolls, E. T., & Romo, R. (2009). Stochastic dynamics as a principle of brain function. Progress in Neurobiology, 88(1), 1–16.


Dehaene, S., & Changeux, J. P. (2011). Experimental and Theoretical Approaches to Conscious Processing. Neuron, 70(2), 200–227. Dehaene, S., & Naccache, L. (2001). Towards a cognitive neuroscience of consciousness: Basic evidence and a workspace framework. Cognition, 79(1–2), 1–37. Denham, S. L., Bendixen, A., Mill, R., Tóth, D., Wennekers, T., Coath, M., Bohm, T., Szalardy, O., & Winkler, I. (2012). Characterising switching behaviour in perceptual multi-stability. Journal of Neuroscience Methods, 210(1), 79–92. Denham, S. L., Bõhm, T. M., Bendixen, A., Szalárdy, O., Kocsis, Z., Mill, R., & Winkler, I. (2014). Stable individual characteristics in the perception of multiple embedded patterns in multistable auditory stimuli. Frontiers in Neuroscience, 8, Article 25. Denham, S. L., Farkas, D., Van Ee, R., Taranu, M., Kocsis, Z., Wimmer, M., Carmel, D., & Winkler, I. (2018). Similar but separate systems underlie perceptual bistability in vision and audition. Scientific Reports, 8, Article 7106. Denham, S. L., Gyimesi, K., Stefanics, G., & Winkler, I. (2013a). Multistability in auditory stream segregation: the role of stimulus features in perceptual organisation. Learning and Perception, 2, 73–100. Denham, S. L., Gyimesi, K., Stefanics, G., & Winkler, I. (2013b). Perceptual bistability in auditory streaming: How much do stimulus features matter? Learning and Perception, 5(2), 73–100. Deroy, O., Spence, C., & Noppeney, U. (2016). Metacognition in multisensory perception. Trends in Cognitive Sciences, 20(10), 736–747. Devyatko, D., Appelbaum, L. G., & Mitroff, S. R. (2017). A Common Mechanism for Perceptual Reversals in Motion-Induced Blindness, the Troxler Effect, and Perceptual Filling-In. Perception, 46(1), 50–77. DiNuzzo, M., Mascali, D., Moraschi, M., Bussu, G., Maugeri, L., Mangini, F., Fratini, M., & Giove, F. (2019). Brain Networks Underlying Eye’s Pupil Dynamics. Frontiers in Neuroscience, 13, Article 965. Dolan, R. J., Morris, J. S., & De Gelder, B. (2001). Crossmodal binding of fear in voice and face. Proceedings of the National Academy of Sciences of the United States of America, 98(17), 10006–10010. Donohue, S. E., Liotti, M., Perez, R., & Woldorff, M. G. (2012). Is conflict monitoring supramodal? Spatiotemporal dynamics of cognitive control processes in an auditory Stroop task. Cognitive, Affective and Behavioral Neuroscience, 12(1), 1–15. Drew, A., Torralba, M., Ruzzoli, M., Fernández, L. M., Sabaté, A., Pápai, M. S., Soto-Faraco, S., Morís Fernández, L., Sabaté, A., Szabina Pápai, M., & Soto-Faraco, S. (2019). Neural evidence of cognitive conflict during binocular rivalry. BioRxiv, https://doi.org/10.1101/2019.12.19.873141. Driver, J., & Noesselt, T. (2008). Multisensory interplay reveals crossmodal influences on “sensory-specific” brain regions, neural responses, and judgments. Neuron, 57(1), 11–23. Einhäuser, W. (2017). The Pupil as Marker of Cognitive Processes. In Q. Zhao (Ed.), Computational and Cognitive Neuroscience of Vision (pp. 141–169). Springer. Einhäuser, W., da Silva, L. F. O., & Bendixen, A. (2020). Intraindividual Consistency Between Auditory and Visual Multistability. Perception, 49(2), 119–138. Einhäuser, W., Koch, C., & Carter, O. L. (2010). Pupil dilation betrays the timing of decisions. Frontiers in Human Neuroscience, 4, Article 18. Einhäuser, W., Methfessel, P., & Bendixen, A. (2017). Newly acquired audio-visual associations bias perception in binocular rivalry. Vision Research, 133, 121–129.


Einhäuser, W., Stout, J., Koch, C., & Carter, O. L. (2008). Pupil dilation reflects perceptual selection and predicts subsequent stability in perceptual rivalry. Proceedings of the National Academy of Sciences of the United States of America, 105(5), 1704–1709. Einhäuser, W., Thomassen, S., & Bendixen, A. (2017). Using binocular rivalry to tag foreground sounds: towards an objective visual measure for auditory multistability. Journal of Vision, 17(1), Article 34. Einhäuser, W., Thomassen, S., Methfessel, P., & Bendixen, A. (2017). Audio-visual Interactions in Multistable Perception: Evidence from No-report Paradigms. Journal of Vision, 17(10), Article 1215. Enoksson, P. (1963). Binocular rivalry and monocular dominance studied with optokinetic nystagmus. Acta Ophthalmologica, 41(5), 544–563. Fahle, M. W., Stemmler, T., & Spang, K. M. (2011). How much of the “unconscious” is just pre - threshold? Frontiers in Human Neuroscience, 5, Article 120. Farkas, D., Denham, S. L., Bendixen, A., Toth, D., Kondo, H. M., & Winkler, I. I. (2016). Auditory multi-stability: Idiosyncratic perceptual switching patterns, executive functions and personality traits. PLOS ONE, 11(5), Article e0154810. Farkas, D., Denham, S. L., Bendixen, A., & Winkler, I. (2016). Assessing the Validity of Subjective Reports in the Auditory Streaming Paradigm. The Journal of the Acoustical Society of America, 139(4), 1762–1772. Fechner, G. T. (1860). Elemente der Psychophysik (Vol. 2). Breitkopf und Härtel. Flügel, J. C. (1913). The influence of attention in illusions aof reversible perspective. British Journal of Psychology, 5(4), 357–397. Frank, T. D. (2012). Multistable Pattern Formation Systems: Candidates for Physical Intelligence? Ecological Psychology, 24(3), 220–240. Frässle, S., Sommer, J., Jansen, A., Naber, M., & Einhäuser, W. (2014). Binocular Rivalry: Frontal Activity Relates to Introspection and Action But Not to Perception. Journal of Neuroscience, 34(5), 1738–1747. Freeman, E. D., & Driver, J. (2006). Subjective appearance of ambiguous structure-from-motion can be driven by objective switches of a separate less ambiguous context. Vision Research, 46(23), 4007–4023. Frégnac, Y., & Bathellier, B. (2015). Cortical Correlates of Low-Level Perception: From Neural Circuits to Percepts. Neuron, 88(1), 110–126. Friston, K., & Kiebel, S. (2009). Predictive coding under the free-energy principle. Philosophical Transactions of the Royal Society B: Biological Sciences, 364(1521), 1211–1221. Gallagher, R. M., & Arnold, D. H. (2014). Interpreting the temporal dynamics of perceptual rivalries. Perception, 43(11), 1239–1248. Gammaitoni, L., Hänggi, P., Jung, P., & Marchesoni, F. (2009). Stochastic resonance: A remarkable idea that changed our perception of noise. The European Physical Journal B, 69(1), 1–3. Geisler, W. S. (2011). Contributions of ideal observer theory to vision research. Vision Research, 51(7), 771–781. Gershman, S. J., Vul, E., Tenenbaum, J. B., Samuel, J., Vul, E., Tenenbaum, J. B., & Gershman, S. J. (2012). Multistability and Perceptual Inference. Neural Computation, 24(1), 1–24. Gibson, J. J. (2014). The ecological approach to visual perception. Psychology Press. Gigante, G., Mattia, M., Braun, J., & Del Giudice, P. (2009). Bistable perception modeled as competing stochastic integrations at two levels. PLOS Computational Biology, 5(7), Article e1000430.


Girgus, J. J., Rock, I., & Egatz, R. (1977). The effect of knowledge of reversibility on the reversibility of ambiguous figures. Perception & Psychophysics, 22(6), 550–556. Green, D. M., & Swets, J. A. (1966). Signal detection theory and psychophysics. Wiley. Grossberg, S., Govindarajan, K. K., Wyse, L. L., & Cohen, M. A. (2004). ARTSTREAM: a neural network model of auditory scene analysis and source segregation. Neural Networks, 17(4), 511–536. Grossmann, J. K., & Dobbins, A. C. (2003). Differential ambiguity reduces grouping of metastable objects. Journal of Vision, 43(4), 359–369. Grueschow, M., Kleim, B., & Ruff, C. C. (2020). Role of the locus coeruleus arousal system in cognitive control. Journal of Neuroendocrinology, Article e12890. Hanna, R., & Thompson, E. (2003). Neurophenomenology and the spontaneity of consciousness. Canadian Journal of Philosophy Supplementary Volume, 29, 133–162. Hecht, S. (1924). The visual discrimination of intensity and the weber-fechner law. Journal of General Physiology, 7(2), 235–267. Hedges, J. H., Stocker, A. A., & Simoncelli, E. P. (2011). Optimal inference explains the perceptual coherence of visual motion stimuli. Journal of Vision, 11(6), Article 14. Hess, E. H., & Polt, J. M. (1964). Pupil size in relation to mental activity during simple problem- solving. Science, 143(3611), 1190–1192. Hock, H. S., Kogan, K., & Espinoza, J. K. (1997). Dynamic, state-dependent thresholds for the perception of single-element apparent motion: Bistability from local cooperativity. Perception & Psychophysics, 59(7), 1077–1088. Hoeks, B., & Levelt, W. J. M. (1993). Pupillary dilation as a measure of attention: a quantitative system analysis. Behavior Research Methods, Instruments, & Computers, 25(1), 16–26. Horlitz, K. L., & O’leary, A. (1993). Satiation or availability? Effects of attention, memory, and imagery on the perception of ambiguous figures. Perception & Psychophysics, 53(6), 668– 681. Hsiao, J. Y., Chen, Y. C., Spence, C., & Yeh, S. L. (2012). Assessing the effects of audiovisual semantic congruency on the perception of a bistable figure. Consciousness and Cognition, 21(2), 775–787. Hudetz, A. G. (2006). Suppressing consciousness: Mechanisms of general anesthesia. Seminars in Anesthesia, Perioperative Medicine and Pain, 25(4), 196–204. Hugrass, L., & Crewther, D. (2012). Willpower and conscious percept: volitional switching in binocular rivalry. PLOS ONE, 7(4), Article e35963. Hulse, B. K., Lubenov, E. V, & Siapas, A. G. (2017). Brain state dependence of hippocampal subthreshold activity in awake mice. Cell Reports, 18(1), 136–147. Hupé, J.-M., Joffo, L.-M., & Pressnitzer, D. (2008). Bistability for audiovisual stimuli: Perceptual decision is modality specific. Journal of Vision, 8(7), Article 1. Hupé, J.-M., Lamirel, C., & Lorenceau, J. (2009). Pupil dynamics during bistable motion perception. Journal of Vision, 9(7), Article 10. Hupé, J.-M., & Pressnitzer, D. (2012). The initial phase of auditory and visual scene analysis. Philosophical Transactions of the Royal Society B: Biological Sciences, 367(1591), 942–953. Itoh, A., Young Soon Kim, Yoshioka, K., Kanaya, M., Enomoto, H., Hiraiwa, F., & Mizuno, M. (2001). Clinical Study of Vestibular-evoked Myogenic Potentials and Auditory Brainstem Responses in Patients with Brainstem Lesions. Acta Oto-Laryngologica, 121(545), 116–119. Jaschke, A. (2019). Music, maestro, please: thalamic multisensory integration in music perception, processing and production. Music and Medicine, 11(2), 98–107. Jastrow, J. (1900). 
Fact and fable in psychology. Houghton, Mifflin and Company.


Jazayeri, M., & Movshon, J. A. (2007). A new perceptual illusion reveals mechanisms of sensory decoding. Nature, 446(7138), 912–915. Jolij, J., & Meurs, M. (2011). Music alters visual perception. PLOS ONE, 6(4), Article e18861. Kan, I. P., Teubner-Rhodes, S., Drummey, A. B., Nutile, L., Krupa, L., & Novick, J. M. (2013). To adapt or not to adapt: The question of domain-general cognitive control. Cognition, 129(3), 637–651. Kashino, M., & Kondo, H. M. (2012). Functional brain networks underlying perceptual switching: Auditory streaming and verbal transformations. Philosophical Transactions of the Royal Society B: Biological Sciences, 367(1591), 977–987. Kayser, C., Petkov, C. I., & Logothetis, N. K. (2008). Visual modulation of neurons in auditory cortex. Cerebral Cortex, 18(7), 1560–1574. Kelso, J. A. S. (1996). Dynamic patterns: The self-organization of brain and behavior. MIT Press. Kersten, D., Mamassian, P., & Yuille, A. (2004). Object Perception as Bayesian Inference. Annual Review of Psychology, 55(1), 271–304. Kiesel, A., Miller, J., Jolicœur, P., & Brisson, B. (2008). Measurement of ERP latency differences: A comparison of single-participant and jackknife-based scoring methods. Psychophysiology, 45(2), 250–274. Kim, Y. J., Grabowecky, M., & Suzuki, S. (2006). Stochastic resonance in binocular rivalry. Vision Research, 46(3), 392–406. Klasen, M., Kenworthy, C. A., Mathiak, K. A., Kircher, T. T. J., & Mathiak, K. (2011). Supramodal representation of emotions. Journal of Neuroscience, 31(38), 13635–13643. Kleiner, M., Brainard, D., & Pelli, D. (2007). What’s new in Psychtoolbox-3? Perception, 36(ECVP Abstract Supplement). Kleinschmidt, A., Büchel, C., Zeki, S., & Frackowiak, R. S. J. (1998). Human brain activity during spontaneously reversing perception of ambiguous figures. Proceedings of the Royal Society of London. Series B: Biological Sciences, 265(1413), 2427–2433. Kleinschmidt, A., Sterzer, P., & Rees, G. (2012). Variability of perceptual multistability: From brain state to individual trait. Philosophical Transactions of the Royal Society B: Biological Sciences, 367(1591), 988–1000. Klink, P. C., Van Ee, R., Nijs, M. M., Brouwer, G. J., Noest, A. J., & Van Wezel, R. J. A. (2008). Early interactions between neuronal adaptation and voluntary control determine perceptual choices in bistable vision. Journal of Vision, 8(5), Article 16. Klink, P. C., van Wezel, R. J. A., & van Ee, R. (2012). United we sense, divided we fail: Context- driven perception of ambiguous visual stimuli. Philosophical Transactions of the Royal Society B: Biological Sciences, 367(1591), 932–941. Kloosterman, N. A., Meindertsma, T., Hillebrand, A., van Dijk, B. W., Lamme, V. A. F., & Donner, T. H. (2015). Top-down modulation in human visual cortex predicts the stability of a perceptual illusion. Journal of Neurophysiology, 113(4), 1063–1076. Kloosterman, N. A., Meindertsma, T., van Loon, A. M., Lamme, V. A. F., Bonneh, Y. S., & Donner, T. H. (2015). Pupil size tracks perceptual content and surprise. European Journal of Neuroscience, 41(8), 1068–1078. Knapen, T., Brascamp, J. W., Pearson, J., van Ee, R., & Blake, R. (2011). The Role of Frontal and Parietal Brain Areas in Bistable Perception. Journal of Neuroscience, 31(28), 10293– 10301. Knapp, C. M., Proudlock, F. A., & Gottlob, I. (2013). OKN asymmetry in human subjects: a literature review. Strabismus, 21(1), 37–49.


Koelewijn, T., Bronkhorst, A., & Theeuwes, J. (2010). Attention and the multiple stages of multisensory integration: A review of audiovisual studies. Acta Psychologica, 134(3), 372– 384. Koene, A. R. (2006). A model for perceptual averaging and stochastic bistable behavior and the role of voluntary control. Neural Computation, 18(12), 3069–3096. Köhler, S., Bär, K. J., & Wagner, G. (2016). Differential involvement of brainstem noradrenergic and midbrain dopaminergic nuclei in cognitive control. Human Brain Mapping, 37(6), 2305– 2318. Köhler, S., Wagner, G., & Bär, K. J. (2019). Activation of brainstem and midbrain nuclei during cognitive control in medicated patients with schizophrenia. Human Brain Mapping, 40(1), 202–213. Köhler, W., & Wallach, H. (1944). Figural after-effects. An investigation of visual processes. Proceedings of the American Philosophical Society, 88(4), 269–357. Kondo, H. M., Farkas, D., Denham, S. L., Asai, T., & Winkler, I. (2017). Auditory multistability and neurotransmitter concentrations in the human brain. Philosophical Transactions of the Royal Society B: Biological Sciences, 372(1714), Article 20160110. Kondo, H. M., Kitagawa, N., Kitamura, M. S., Koizumi, A., Nomura, M., & Kashino, M. (2012). Separability and commonality of auditory and visual bistable perception. Cerebral Cortex, 22(8), 1915–1922. Kondo, H. M., & Kochiyama, T. (2018). Normal Aging Slows Spontaneous Switching in Auditory and Visual Bistability. Neuroscience, 389, 152–160. Kondo, H. M., Pressnitzer, D., Shimada, Y., Kochiyama, T., & Kashino, M. (2018). Inhibition- excitation balance in the parietal cortex modulates volitional control for auditory and visual multistability. Scientific Reports, 8, Article 14548. Körding, K. P., Beierholm, U., Ma, W. J., Quartz, S., Tenenbaum, J. B., & Shams, L. (2007). Causal inference in multisensory perception. PLOS ONE, 2(9), Article e943. Koriat, A. (2011). Subjective Confidence in Perceptual Judgments: A Test of the Self-Consistency Model. Journal of Experimental Psychology, 140(1), 117–139. Kornmeier, J., & Bach, M. (2009). Object perception: When our brain is impressed but we dot not notice it. Journal of Vision, 9(1), 1–10. Kornmeier, J., & Bach, M. (2012). Ambiguous figures - what happens in the brain when perception changes but not the stimulus. Frontiers in Human Neuroscience, 6, Article 51. Kornmeier, J., Hein, C. M., & Bach, M. (2009). Multistable perception: When bottom-up and top-down coincide. Brain and Cognition, 69(1), 138–147. Krug, K., Brunskill, E., Scarna, A., Goodwin, G. M., & Parker, A. J. (2008). Perceptual switch rates with ambiguous structure-from-motion figures in bipolar disorder. Proceedings of the Royal Society B: Biological Sciences, 275(1645), 1839–1848. Kubovy, M., & Yu, M. (2012). Multistability, cross-modal binding and the additivity of conjoined grouping principles. Philosophical Transactions of the Royal Society B: Biological Sciences, 367(1591), 954–964. Kucewicz, M. T., Dolezal, J., Kremen, V., Berry, B. M., Miller, L. R., Magee, A. L., Fabian, V., & Worrell, G. A. (2018). Pupil size reflects successful encoding and recall of memory in humans. Scientific Reports, 8, Article 4949. Kundel, H. L., Nodine, C. F., Conant, E. F., & Weinstein, S. P. (2007). Holistic component of image perception in mammogram interpretation: gaze-tracking study. Radiology, 242(2), 396–402.


Laeng, B., Sirois, S., & Gredebäck, G. (2012). Pupillometry: A window to the preconscious? Perspectives on Psychological Science, 7(1), 18–27.
Lakatos, P., O’Connell, M. N., Barczak, A., Mills, A., Javitt, D. C., & Schroeder, C. E. (2009). The leading sense: Supramodal control of neurophysiological context by attention. Neuron, 64(3), 419–430.
Larsen, R. S., & Waters, J. (2018). Neuromodulatory correlates of pupil dilation. Frontiers in Neural Circuits, 12, Article 21.
Lehky, S. R., & Maunsell, J. H. R. (1996). No binocular rivalry in the LGN of alert macaque monkeys. Vision Research, 36(9), 1225–1234.
Leopold, D. A., & Logothetis, N. K. (1999). Multistable phenomena: Changing views in perception. Trends in Cognitive Sciences, 3(7), 254–264.
Leopold, D. A., Wilke, M., Maier, A., & Logothetis, N. K. (2002). Stable perception of visually ambiguous patterns. Nature Neuroscience, 5(6), 605–609.
Ling, S., Hubert-Wallander, B., & Blake, R. (2010). Detecting contrast changes in invisible patterns during binocular rivalry. Vision Research, 50(23), 2421–2429.
Long, G. M., & Toppino, T. C. (2004). Enduring interest in perceptual ambiguity: Alternating views of reversible figures. Psychological Bulletin, 130(5), 748–768.
Long, G. M., Toppino, T. C., & Kostenbauder, J. F. (1983). As the cube turns: Evidence for two processes in the perception of a dynamic reversible figure. Perception & Psychophysics, 34(1), 29–38.
Lowenstein, O., Kawabata, H., & Loewenfeld, I. E. (1964). The pupil as indicator of retinal activity. American Journal of Ophthalmology, 57(4), 569–596.
Lutz, A., Lachaux, J.-P., Martinerie, J., & Varela, F. J. (2002). Guiding the study of brain dynamics by using first-person data: Synchrony patterns correlate with ongoing conscious states during a simple visual task. Proceedings of the National Academy of Sciences, 99(3), 1586–1591.
Malashchenko, T., Shilnikov, A., & Cymbalyuk, G. (2011). Six types of multistability in a neuronal model based on slow calcium current. PLOS ONE, 6(7), Article e21782.
Mathôt, S. (2018). Pupillometry: Psychology, physiology, and function. Journal of Cognition, 1(1), Article 16.
Mazzucato, L., Fontanini, A., & La Camera, G. (2015). Dynamics of multistable states during ongoing and evoked cortical activity. Journal of Neuroscience, 35(21), 8214–8231.
McClelland, J. L., Rumelhart, D. E., & the PDP Research Group (1986). Parallel distributed processing: Explorations in the microstructure of cognition (Vol. 2). MIT Press.
McCloy, D. R., Lau, B. K., Larson, E., Pratt, K. A. I., & Lee, A. K. C. (2017). Pupillometry shows the effort of auditory attention switching. The Journal of the Acoustical Society of America, 141(4), 2440–2451.
McGurk, H., & MacDonald, J. (1976). Hearing lips and seeing voices. Nature, 264(5588), 746–748.
Mechelli, A., Price, C. J., Friston, K. J., & Ishai, A. (2004). Where bottom-up meets top-down: Neuronal interactions during perception and imagery. Cerebral Cortex, 14(11), 1256–1265.
Meng, M., & Tong, F. (2004). Can attention selectively bias bistable perception? Differences between binocular rivalry and ambiguous figures. Journal of Vision, 4(7), 539–551.
Mesulam, M.-M. (1998). From sensation to cognition. Brain: A Journal of Neurology, 121(6), 1013–1052.
Mill, R. W., Bőhm, T. M., Bendixen, A., Winkler, I., & Denham, S. L. (2013). Modelling the emergence and dynamics of perceptual organisation in auditory streaming. PLOS Computational Biology, 9(3), Article e1002925.


Millar, A. (2000). The scope of perceptual knowledge. Philosophy, 75(291), 73–88.
Miller, S. M., Liu, G. B., Ngo, T. T., Hooper, G., Riek, S., Carson, R. G., & Pettigrew, J. D. (2000). Interhemispheric switching mediates perceptual rivalry. Current Biology, 10(7), 383–392.
Molholm, S., Ritter, W., Murray, M. M., Javitt, D. C., Schroeder, C. E., & Foxe, J. J. (2002). Multisensory auditory-visual interactions during early sensory processing in humans: A high-density electrical mapping study. Cognitive Brain Research, 14(1), 115–128.
Moors, A., & De Houwer, J. (2006). Automaticity: A theoretical and conceptual analysis. Psychological Bulletin, 132(2), 297–326.
Moreno-Bote, R., Rinzel, J., & Rubin, N. (2007). Noise-induced alternations in an attractor network model of perceptual bistability. Journal of Neurophysiology, 98(3), 1125–1139.
Movshon, J. A., Adelson, E. H., Gizzi, M. S., & Newsome, W. T. (1985). The analysis of moving visual patterns. In C. Chagas, R. Gattass, & C. Gross (Eds.), Pattern recognition mechanisms (Vol. 54, pp. 117–151). Vatican Press.
Murphy, P. R., O’Connell, R. G., O’Sullivan, M., Robertson, I. H., & Balsters, J. H. (2014). Pupil diameter covaries with BOLD activity in human locus coeruleus. Human Brain Mapping, 35(8), 4140–4154.
Naber, M., Frässle, S., & Einhäuser, W. (2011). Perceptual rivalry: Reflexes reveal the gradual nature of visual awareness. PLOS ONE, 6(6), Article e20910.
Naber, M., Frässle, S., Rutishauser, U., & Einhäuser, W. (2013). Pupil size signals novelty and predicts later retrieval success for declarative memories of natural scenes. Journal of Vision, 13(2), Article 11.
Necker, L. A. (1832). LXI. Observations on some remarkable optical phænomena seen in Switzerland; and on an optical phænomenon which occurs on viewing a figure of a crystal or geometrical solid. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 1(5), 329–337.
Neufeld, J., Sinke, C., Zedler, M., Emrich, H. M., & Szycik, G. R. (2012). Reduced audio–visual integration in synaesthetes indicated by the double-flash illusion. Brain Research, 1473, 78–86.
Nienborg, H., & Cumming, B. G. (2009). Decision-related activity in sensory neurons reflects more than a neuron’s causal effect. Nature, 459(7243), 89–92.
Nieuwenhuis, S., Aston-Jones, G., & Cohen, J. D. (2005). Decision making, the P3, and the locus coeruleus-norepinephrine system. Psychological Bulletin, 131(4), 510–532.
Nisbett, R. E., & Miyamoto, Y. (2005). The influence of culture: Holistic versus analytic perception. Trends in Cognitive Sciences, 9(10), 467–473.
Noest, A. J., van Ee, R., Nijs, M. M., & van Wezel, R. J. A. (2007). Percept-choice sequences driven by interrupted ambiguous stimuli: A low-level neural model. Journal of Vision, 7(8), Article 10.
Ouhnana, M., Jennings, B. J., & Kingdom, F. A. A. (2017). Common contextual influences in ambiguous and rivalrous figures. PLOS ONE, 12(5), Article e0176842.
Overgaard, M., & Fazekas, P. (2016). Can no-report paradigms extract true correlates of consciousness? Trends in Cognitive Sciences, 20(4), 241–242.
Overgaard, M., Koivisto, M., Sørensen, T. A., Vangkilde, S., & Revonsuo, A. (2006). The electrophysiology of introspection. Consciousness and Cognition, 15(4), 662–672.
Overgaard, M., & Mogensen, J. (2017). An integrative view on consciousness and introspection. Review of Philosophy and Psychology, 8(1), 129–141.


Palva, S., Linkenkaer-Hansen, K., Näätänen, R., & Palva, J. M. (2005). Early neural correlates of conscious somatosensory perception. Journal of Neuroscience, 25(21), 5248–5258.
Panagiotaropoulos, T. I., Wang, L., & Dehaene, S. (2020). Hierarchical architecture of conscious processing and subjective experience. Cognitive Neuropsychology, 37(3–4), 180–183.
Parise, C. V., & Ernst, M. O. (2017). Noise, multisensory integration, and previous response in perceptual disambiguation. PLOS Computational Biology, 13(7), Article e1005546.
Pastukhov, A., & Braun, J. (2007). Perceptual reversals need no prompting by attention. Journal of Vision, 7(10), Article 5.
Pastukhov, A., Kastrup, P., Abs, I. F., & Carbon, C. C. (2019). Switch rates for orthogonally oriented kinetic-depth displays are correlated across observers. Journal of Vision, 19(6), Article 1.
Pelli, D. G. (1997). The VideoToolbox software for visual psychophysics: Transforming numbers into movies. Spatial Vision, 10(4), 437–442.
Pelofi, C., De Gardelle, V., Egré, P., & Pressnitzer, D. (2017). Interindividual variability in auditory scene analysis revealed by confidence judgements. Philosophical Transactions of the Royal Society B: Biological Sciences, 372(1714), Article 20160107.
Pettigrew, J. D. (2001). Searching for the switch: Neural bases for perceptual rivalry alternations. Brain and Mind, 2(1), 85–118.
Pettigrew, J. D., & Carter, O. L. (2004). Perceptual rivalry as an ultradian oscillation. In D. Alais & R. Blake (Eds.), Binocular rivalry (pp. 283–300). MIT Press.
Pisarchik, A. N., Jaimes-Reátegui, R., Magallón-García, C. D. A., & Castillo-Morales, C. O. (2014). Critical slowing down and noise-induced intermittency in bistable perception: Bifurcation analysis. Biological Cybernetics, 108(4), 397–404.
Pitts, M. A., Nerger, J. L., & Davis, T. J. R. (2007). Electrophysiological correlates of perceptual reversals for three different types of multistable images. Journal of Vision, 7(1), Article 6.
Pouget, A., Beck, J. M., Ma, W. J., & Latham, P. E. (2013). Probabilistic brains: Knowns and unknowns. Nature Neuroscience, 16(9), 1170–1178.
Pressnitzer, D., & Hupé, J.-M. (2005). Is auditory streaming a bistable percept? Proceedings of Forum Acusticum, 1557–1561.
Pressnitzer, D., & Hupé, J.-M. (2006). Temporal dynamics of auditory and visual bistability reveal common principles of perceptual organization. Current Biology, 16(13), 1351–1357.
Preuschoff, K., ’t Hart, B. M., & Einhäuser, W. (2011). Pupil dilation signals surprise: Evidence for noradrenaline’s role in decision making. Frontiers in Neuroscience, 5, Article 115.
Ramachandran, V. S., & Anstis, S. M. (1983). Perceptual organization in moving patterns. Nature, 304(5926), 529–531.
Rankin, J., Sussman, E., & Rinzel, J. (2015). Neuromechanistic model of auditory bistability. PLOS Computational Biology, 11(11), Article e1004555.
Rauss, K., & Pourtois, G. (2013). What is bottom-up and what is top-down in predictive coding? Frontiers in Psychology, 4, Article 276.
Rossion, B. (2013). The composite face illusion: A whole window into our understanding of holistic face perception. Visual Cognition, 21(2), 139–253.
Rubin, E. (1915). Synsoplevede Figurer. Gyldendalske Boghandel.
Rubin, N. (2003). Binocular rivalry and perceptual multi-stability. Trends in Neurosciences, 26(6), 289–291.


Samuels, E., & Szabadi, E. (2008). Functional neuroanatomy of the noradrenergic locus coeruleus: Its roles in the regulation of arousal and autonomic function. Part II: Physiological and pharmacological manipulations and pathological alterations of locus coeruleus activity in humans. Current Neuropharmacology, 6(3), 254–285.
Sanchez, G., Frey, J. N., Fuscà, M., & Weisz, N. (2020). Decoding across sensory modalities reveals common supramodal signatures of conscious perception. Proceedings of the National Academy of Sciences, 117(13), 7437–7446.
Sandberg, K., & Overgaard, M. (2015). Using the perceptual awareness scale (PAS). In Behavioural methods in consciousness research (pp. 181–195). Oxford University Press.
Sanders, R. D., Winston, J. S., Barnes, G. R., & Rees, G. (2018). Magnetoencephalographic correlates of perceptual state during auditory bistability. Scientific Reports, 8, Article 976.
Sato, M., Basirat, A., & Schwartz, J. L. (2007). Visual contribution to the multistable perception of speech. Perception & Psychophysics, 69(8), 1360–1372.
Schmack, K., Sekutowicz, M., Rössler, H., Brandl, E. J., Müller, D. J., & Sterzer, P. (2013). The influence of dopamine-related genes on perceptual stability. European Journal of Neuroscience, 38(9), 3378–3383.
Schwartz, J.-L., Grimault, N., Hupé, J.-M., Moore, B. C. J., & Pressnitzer, D. (2012). Multistability in perception: Binding sensory modalities, an overview. Philosophical Transactions of the Royal Society B: Biological Sciences, 367(1591), 896–905.
Scocchia, L., Valsecchi, M., & Triesch, J. (2014). Top-down influences on ambiguous perception: The role of stable and transient states of the observer. Frontiers in Human Neuroscience, 8, Article 979.
Sekuler, R., Sekuler, A. B., & Lau, R. (1997). Sound alters visual motion perception. Nature, 385(6614), 308.
Shams, L., Kamitani, Y., & Shimojo, S. (2002). Visual illusion induced by sound. Cognitive Brain Research, 14(1), 147–152.
Shams, L., & Kim, R. (2010). Crossmodal influences on visual perception. Physics of Life Reviews, 7(3), 269–284.
Shiffrin, R. M., & Grantham, D. W. (1974). Can attention be allocated to sensory modalities? Perception & Psychophysics, 15(3), 460–474.
Shimojo, S., & Shams, L. (2001). Sensory modalities are not separate modalities: Plasticity and interactions. Current Opinion in Neurobiology, 11(4), 505–509.
Shpiro, A., Moreno-Bote, R., Rubin, N., & Rinzel, J. (2009). Balance between noise and adaptation in competition models of perceptual bistability. Journal of Computational Neuroscience, 27(1), 37–54.
Simpson, H. M. (1969). Effects of a task-relevant response on pupil size. Psychophysiology, 6(2), 115–121.
Simpson, H. M., & Hale, S. M. (1969). Pupillary changes during a decision-making task. Perceptual and Motor Skills, 29(2), 495–498.
Simpson, H. M., & Paivio, A. (1968). Effects on pupil size of manual and verbal indicators of cognitive task fulfillment. Perception & Psychophysics, 3(3), 185–190.
Slotnick, S. D., & Yantis, S. (2005). Common neural substrates for the control and effects of visual attention and perceptual bistability. Cognitive Brain Research, 24(1), 97–108.
Snyder, J. S., Gregg, M. K., Weintraub, D. M., & Alain, C. (2012). Attention, awareness, and the perception of auditory scenes. Frontiers in Psychology, 3, Article 15.
Spence, C., Parise, C. V., & Chen, Y. C. (2013). The Colavita visual dominance effect. In The neural bases of multisensory processes. CRC Press/Taylor & Francis.


Stein, B. E. (Ed.). (2012). The new handbook of multisensory processing. MIT Press.
Stein, B. E., & Stanford, T. R. (2008). Multisensory integration: Current issues from the perspective of the single neuron. Nature Reviews Neuroscience, 9(4), 255–266.
Steinhauer, S. R., & Hakerem, G. (1992). The pupillary response in cognitive psychophysiology and schizophrenia. Annals of the New York Academy of Sciences, 658(1), 182–204.
Steinhauer, S. R., Siegle, G. J., Condray, R., & Pless, M. (2004). Sympathetic and parasympathetic innervation of pupillary dilation during sustained processing. International Journal of Psychophysiology, 52(1), 77–86.
Sterzer, P., & Kleinschmidt, A. (2007). A neural basis for inference in perceptual ambiguity. Proceedings of the National Academy of Sciences of the United States of America, 104(1), 323–328.
Sterzer, P., Kleinschmidt, A., & Rees, G. (2009). The neural bases of multistable perception. Trends in Cognitive Sciences, 13(7), 310–318.
Strüber, D., Basar-Eroglu, C., Hoff, E., & Stadler, M. (2000). Reversal-rate dependent differences in the EEG gamma-band during multistable visual perception. International Journal of Psychophysiology, 38(3), 243–252.
Strüber, D., & Stadler, M. (1999). Differences in top-down influences on the reversal rate of different categories of reversible figures. Perception, 28(10), 1185–1196.
Strupp, M., Hufner, K., Sandmann, R., Zwergal, A., Dieterich, M., Jahn, K., & Brandt, T. (2011). Central oculomotor disturbances and nystagmus: A window into the brainstem and cerebellum. Deutsches Ärzteblatt International, 108(12), 197–204.
Sturm, T., & Wunderlich, F. (2010). Kant and the scientific study of consciousness. History of the Human Sciences, 23(3), 48–71.
Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction. MIT Press.
Szmalec, A., Verbruggen, F., Vandierendonck, A., De Baene, W., Verguts, T., & Notebaert, W. (2008). Stimulus ambiguity elicits response conflict. Neuroscience Letters, 435(2), 158–162.
Taylor, M. M., & Aldridge, K. D. (1974). Stochastic processes in reversing figure perception. Perception & Psychophysics, 16(1), 9–25.
Thaler, L., Schütz, A. C., Goodale, M. A., & Gegenfurtner, K. R. (2013). What is the best fixation target? The effect of target shape on stability of fixational eye movements. Vision Research, 76, 31–42.
Theeuwes, J. (2010). Top-down and bottom-up control of visual selection. Acta Psychologica, 135(2), 77–99.
Thomassen, S., & Bendixen, A. (2017). Subjective perceptual organization of a complex auditory scene. Journal of the Acoustical Society of America, 141(1), 265–276.
Toppino, T. C., & Long, G. M. (1987). Selective adaptation with reversible figures: Don’t change that channel. Perception & Psychophysics, 42(1), 37–48.
Toppino, T. C., & Long, G. M. (2005). Top-down and bottom-up processes in the perception of reversible figures: Toward a hybrid model. In N. Ohta, C. M. MacLeod, & B. Uttl (Eds.), Dynamic cognitive processes (pp. 37–58). Springer.
Toppino, T. C., & Long, G. M. (2015). Time for a change: What dominance durations reveal about adaptation effects in the perception of a bi-stable reversible figure. Attention, Perception, & Psychophysics, 77(3), 867–882.
Treisman, A. (1982). Perceptual grouping and attention in visual search for features and for objects. Journal of Experimental Psychology: Human Perception and Performance, 8(2), 194–214.


Treisman, M., Faulkner, A., Naish, P. L. N., & Brogan, D. (1990). The internal clock: Evidence for a temporal oscillator underlying time perception with some estimates of its characteristic frequency. Perception, 19(6), 705–743.
Tschacher, W., & Haken, H. (2007). Intentionality in non-equilibrium systems? The functional aspects of self-organized pattern formation. New Ideas in Psychology, 25(1), 1–15.
Tsuchiya, N., Wilke, M., Frässle, S., & Lamme, V. A. F. (2015). No-report paradigms: Extracting the true neural correlates of consciousness. Trends in Cognitive Sciences, 19(12), 757–770.
Tsuchiya, N., Wilke, M., Frässle, S., & Lamme, V. A. F. (2016). No-report and report-based paradigms jointly unravel the NCC: Response to Overgaard and Fazekas. Trends in Cognitive Sciences, 20(4), 242–243.
Ulrich, R., & Miller, J. (2001). Using the jackknife-based scoring method for measuring LRP onset effects in factorial designs. Psychophysiology, 38(5), 816–827.
van de Rijt, L. P. H., Roye, A., Mylanus, E. A. M., van Opstal, A. J., & van Wanrooij, M. M. (2019). The principle of inverse effectiveness in audiovisual speech perception. Frontiers in Human Neuroscience, 13, Article 335.
van der Wel, P., & van Steenbergen, H. (2018). Pupil dilation as an index of effort in cognitive control tasks: A review. Psychonomic Bulletin & Review, 25(6), 2005–2015.
van Ee, R., van Boxtel, J. J. A., Parker, A. L., & Alais, D. (2009). Multisensory congruency as a mechanism for attentional control over perceptual selection. Journal of Neuroscience, 29(37), 11641–11649.
van Leeuwen, C., Steyvers, M., & Nooter, M. (1997). Stability and intermittency in large-scale coupled oscillator models for perceptual segmentation. Journal of Mathematical Psychology, 41(4), 319–344.
van Noorden, L. P. A. S. (1975). Temporal coherence in the perception of tone sequences [Doctoral dissertation, Technische Hogeschool Eindhoven].
Vandenbroucke, A. R. E., Fahrenfort, J. J., Sligte, I. G., & Lamme, V. A. F. (2014). Seeing without knowing: Neural signatures of perceptual inference in the absence of report. Journal of Cognitive Neuroscience, 26(5), 955–969.
Wales, R., & Fox, R. (1970). Increment detection thresholds during binocular rivalry suppression. Perception & Psychophysics, 8(2), 90–94.
Walker, P. (1978). Binocular rivalry: Central or peripheral selective processes? Psychological Bulletin, 85(2), 376–389.
Wallach, H. (1935). Über visuell wahrgenommene Bewegungsrichtung. Psychologische Forschung, 20(1), 325–380.
Wallach, H., & O’Connell, D. N. (1953). The kinetic depth effect. Journal of Experimental Psychology, 45(4), 205–217.
Walz, J. M., Goldman, R. I., Carapezza, M., Muraskin, J., Brown, T. R., & Sajda, P. (2013). Simultaneous EEG-fMRI reveals temporal evolution of coupling between supramodal cortical attention networks and the brainstem. Journal of Neuroscience, 33(49), 19212–19222.
Wang, C. A., Brien, D. C., & Munoz, D. P. (2015). Pupil size reveals preparatory processes in the generation of pro-saccades and anti-saccades. European Journal of Neuroscience, 41(8), 1102–1110.
Warren, R. M. (1961). Illusory changes of distinct speech upon repetition – the verbal transformation effect. British Journal of Psychology, 52(3), 249–258.
Warren, R. M., & Gregory, R. L. (1958). An auditory analogue of the visual reversible figure. The American Journal of Psychology, 71(3), 612–613.


Watamaniuk, S. N. J. (1993). Ideal observer for discrimination of the global direction of dynamic random-dot stimuli. Journal of the Optical Society of America A, 10(1), 16–28.
Wegner, T. G. G., Grenzebach, J., Bendixen, A., & Einhäuser, W. (2021). Parameter dependence in visual pattern-component rivalry at onset and during prolonged viewing. Vision Research, 182, 69–88.
Weilnhammer, V. A., Ludwig, K., Hesselmann, G., & Sterzer, P. (2013). Frontoparietal cortex mediates perceptual transitions in bistable perception. Journal of Neuroscience, 33(40), 16009–16015.
Weilnhammer, V. A., Stuke, H., Hesselmann, G., Sterzer, P., & Schmack, K. (2017). A predictive coding account of bistable perception – a model-based fMRI study. PLOS Computational Biology, 13(5), Article e1005536.
Wertheimer, M. (1938). Laws of organization in perceptual forms. In W. D. Ellis (Ed.), A source book of Gestalt psychology (pp. 71–88). Kegan Paul, Trench, Trubner & Company.
Wetzel, N., Buttelmann, D., Schieler, A., & Widmann, A. (2016). Infant and adult pupil dilation in response to unexpected sounds. Developmental Psychobiology, 58(3), 382–392.
Wheatstone, C. (1838). On some remarkable, and hitherto unobserved, phenomena of binocular vision. Philosophical Transactions of the Royal Society of London, 128, 371–394.
Wierda, S. M., van Rijn, H., Taatgen, N. A., & Martens, S. (2012). Pupil dilation deconvolution reveals the dynamics of attention at high temporal resolution. Proceedings of the National Academy of Sciences, 109(22), 8456–8460.
Wilbertz, G., Ketkar, M., Guggenmos, M., & Sterzer, P. (2018). Combined fMRI- and eye movement-based decoding of bistable plaid motion perception. NeuroImage, 171, 190–198.
Wild, H. A., & Busey, T. A. (2004). Seeing faces in the noise: Stochastic activity in perceptual regions of the brain may influence the perception of ambiguous stimuli. Psychonomic Bulletin & Review, 11(3), 475–481.
Wilson, H. R. (2003). Computational evidence for a rivalry hierarchy in vision. Proceedings of the National Academy of Sciences of the United States of America, 100(24), 14499–14503.
Winkler, I., Denham, S. L., Mill, R., Bőhm, T. M., & Bendixen, A. (2012). Multistability in auditory stream segregation: A predictive coding view. Philosophical Transactions of the Royal Society B: Biological Sciences, 367(1591), 1001–1012.
Wohrer, A., Humphries, M. D., & Machens, C. K. (2013). Population-wide distributions of neural activity during perceptual decision-making. Progress in Neurobiology, 103, 156–193.
Xu, J., Yu, L., Rowland, B. A., Stanford, T. R., & Stein, B. E. (2014). Noise-rearing disrupts the maturation of multisensory integration. European Journal of Neuroscience, 39(4), 602–613.
Xu, X., Hanganu-Opatz, I. L., & Bieler, M. (2020). Cross-talk of low-level sensory and high-level cognitive processing: Development, mechanisms, and relevance for cross-modal abilities of the brain. Frontiers in Neurorobotics, 14, Article 7.
Yo, C., & Demer, J. L. (1992). Two-dimensional optokinetic nystagmus induced by moving plaids and texture boundaries: Evidence for multiple visual pathways. Investigative Ophthalmology and Visual Science, 33(8), 2490–2500.
Yoshioka, T., & Sakakibara, M. (2013). Physical aspects of sensory transduction on seeing, hearing and smelling. Biophysics, 9, 183–191.
Yu, A. J. (2012). Change is in the eye of the beholder. Nature Neuroscience, 15(7), 933–935.
Zeki, S. (2004). The neurology of ambiguity. Consciousness and Cognition, 13(1), 173–196.
Zekveld, A. A., Koelewijn, T., & Kramer, S. E. (2018). The pupil dilation response to auditory stimuli: Current state of knowledge. Trends in Hearing, 22, Article 2331216518777174.


Zhou, Y. H., Gao, J. B., White, K. D., Merk, I., & Yao, K. (2004). Perceptual dominance time distributions in multistable visual perception. Biological Cybernetics, 90(4), 256–263.
Zou, J., He, S., & Zhang, P. (2016). Binocular rivalry from invisible patterns. Proceedings of the National Academy of Sciences, 113(30), 8408–8413.


APPENDIX

A1 List of Figures

Introduction:
1.1 Auditory and visual stimulus material of all studies and their respective percepts ...... 15
1.2 Exemplary 60 s excerpt from eye-movement (gaze) recordings in Study 2 ...... 29
1.3 Exemplary 240 s excerpt from eye-movement (pupil dilation) recordings in Study 1 ...... 32

Study 1:
2.1 Sound design of the chirp sounds ...... 41
2.2 Original visualization of the four auditory percepts instructed ...... 42

Study 1, Experiment 1:
2.3 Behavioral measures of perceptual dominance (percentage of the dominating percept, dominance duration, and CV of the dominance duration) ...... 49
2.4 Z-standardized and baseline-corrected pupil diameter at the time of the reported perceptual switch (endogenous/multistable and exogenous/replay) in auditory multistability ...... 51
2.5 Jack-knifed traces of Figure 2.4 and the derived maximal amplitudes ...... 52

Study 1, Experiment 2:
2.6 Behavioral measures of perceptual dominance (percentage of the dominating percept, dominance duration, and CV of the dominance duration) ...... 59
2.7 Z-standardized and baseline-corrected pupil diameter at the time of the reported perceptual switch (endogenous/multistable and exogenous/replay) in auditory multistability; additional condition: random button-pressing task ...... 60
2.8 Jack-knifed traces of Figure 2.7 and the derived maximal amplitudes ...... 61

Study 1, Supplement:
2.S1 Missing pupil diameter data of all experiments ...... 66
2.S2 Experiment 1: Z-standardized and baseline-corrected pupil diameter at the time of the reported perceptual switch (endogenous/multistable and exogenous/replay) in auditory multistability; additional condition: distorted random button-pressing task ...... 67
2.S3 Experiment 2: Z-standardized and baseline-corrected pupil diameter at the time of the reported exogenous perceptual switch (active and after correction), Jack-knifed traces and derived maximal amplitudes ...... 68
2.S4 Surrogate analysis of the pupil diameter in Experiments 1 and 2 ...... 69

Study 2, Experiment 1:
3.1 Stimuli and paradigm ...... 75
3.2 Individual decoding accuracy ...... 77
3.3 Consistency between decoded visual and reported auditory percept, relative to the noise ceiling ...... 79
3.4 Eye-movement variables to compute the SVM-features ...... 89

Study 3, Experiment 1:
4.1 Original excerpt from the instruction of the two visual percepts instructed ...... 99
4.2 Experiment design: block order ...... 100
4.3 Visual inducer ...... 101
4.4 Individual decoding accuracy ...... 109
4.5 Difference wave (report) A1: button-based percept trace ...... 111
4.6 Difference wave (report) A2: SVM-based percept trace; categorization: SVM-based ...... 112
4.7 Difference wave (report) A3: SVM-based percept trace; categorization: button-based ...... 112
4.8 Difference wave (no-report) A4: SVM-based percept trace ...... 113
4.9 Collapsed view for Figures 4.5–4.9 ...... 113
4.10 Behavioral measures of perceptual dominance (percentage of the dominating percept, phase duration) ...... 114

Study 3, Experiment 2:
4.11 Auditory stimulus ...... 121
4.12 Original visualization of the two auditory percepts instructed ...... 122
4.13 Design of unimodal blocks ...... 124
4.14 Design of bimodal blocks ...... 125
4.15 Across the board: hit-rate, decoding accuracy, consistency of auditory-visual percept ...... 134
4.16 Behavioral measures of visual perceptual dominance (percentage of the dominating percept, phase duration) ...... 135
4.17 Difference wave (report: visual) B1: button-based percept trace ...... 137
4.18 Difference wave (report: visual) B2: SVM-based percept trace; categorization: SVM-based ...... 138
4.19 Difference wave (report: visual) B3: SVM-based percept trace; categorization: button-based ...... 138
4.20 Behavioral measures of auditory perceptual dominance (percentage of the dominating percept, phase duration) ...... 139
4.21 Difference wave (report: auditory) B4: button-based percept trace ...... 141
4.22 Difference wave (no-report: visual) B5: SVM-based percept trace; categorization: SVM-based ...... 141
4.23 Comparison of difference waves B1 and B3 ...... 142
4.24 Comparison of difference waves B2 and B3 ...... 143
4.25 Comparison of the parameters derived from difference waves B2 and B5 ...... 144

Study 3, Supplement:
4.U1 Visual percept traces at the time of the auditory perceptual switches ...... 151
4.U2 Global measures of dominance in unimodal multistability ...... 152


A2 List of Tables

1.1 Stimuli used in the Experiments of all Studies ...... 25
1.2 Number of percepts, indirect tracking of the perceptual state across the Experiments of all Studies ...... 33

4.1 Number of percept epochs constituting the difference wave A1 ...... 110
4.2 Number of percept epochs constituting the difference wave B1 ...... 136
4.3 Number of percept epochs constituting the difference wave B4 ...... 139


A3 List of Abbreviations and Symbols

ACC Adjusted accuracy, decoding accuracy of the SVM
Ana Analysis
ANOVA Analysis of variance
beh Behavioral
bi Bimodal stimulation
c Cost-parameter, see SVM
ca. circa
cm Centimeter, 10⁻² meter
cd Candela
CV Coefficient of variation
dB(A) A-weighted decibels
FDR False discovery rate, statistics
Hz Hertz
ind Inducer condition
int Integrated percept
kHz Kilohertz, 10³ Hertz
m Meter
M Mean, μ
max Maximum
min Minute
mini Minimum
ms Millisecond
no-ind No-inducer condition
OKN Optokinetic nystagmus
PD Pupil diameter
px / pix Pixel
s Second
SD Standard deviation, σ
SEM Standard error of the mean
seg Segregated percept
seg_both Segregated percept with both streams in foreground
seg_high Segregated percept with high-frequency stream in foreground
seg_low Segregated percept with low-frequency stream in foreground
SOA Stimulus onset asynchrony
SVM Support vector machine
svm SVM-based analysis
uni Unimodal stimulation
und Undefined percept, see int and seg percept
VAC Visual-auditory consistency, calculated as the ACC
zPD Z-standardized pupil trace
% Percent
Δ Delta, here difference in the frequency
Ω Ohm, here impedance
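Several of the abbreviations above denote simple computed quantities. The sketch below illustrates, in Python, how CV and SEM of dominance durations, a z-standardized and baseline-corrected pupil trace (zPD), and an SVM decoding accuracy might be computed. It is a minimal illustration only: all values, the sampling rate, the 1-s baseline window, the linear kernel, and the use of balanced accuracy as a stand-in for the adjusted accuracy (ACC) are assumptions, not the parameters or code of the studies themselves.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import balanced_accuracy_score

rng = np.random.default_rng(0)

# CV and SEM of a set of dominance durations (hypothetical values, seconds).
durations = np.array([4.2, 7.9, 3.1, 12.4, 5.6, 8.8])
M = durations.mean()                    # M: mean (mu)
SD = durations.std(ddof=1)              # SD: standard deviation (sigma)
CV = SD / M                             # CV: coefficient of variation
SEM = SD / np.sqrt(durations.size)      # SEM: standard error of the mean

# zPD: z-standardized pupil-diameter trace, baseline-corrected around a
# reported perceptual switch (sampling rate and window lengths assumed).
fs = 500                                # assumed sampling rate in Hz
pd_raw = rng.normal(4.0, 0.3, 10 * fs)  # placeholder pupil trace in mm
zpd = (pd_raw - pd_raw.mean()) / pd_raw.std(ddof=1)
switch = 5 * fs                         # sample index of a reported switch
baseline = zpd[switch - fs:switch].mean()            # 1-s pre-switch baseline
epoch = zpd[switch - fs:switch + 2 * fs] - baseline  # baseline-corrected epoch

# Decoding two percepts from eye-movement features with an SVM; balanced
# accuracy is used here as a stand-in for the adjusted accuracy (ACC).
X = rng.normal(size=(200, 4))           # placeholder OKN-derived features
y = rng.integers(0, 2, size=200)        # percept labels (e.g., int vs. seg)
clf = SVC(kernel="linear", C=1.0)       # C corresponds to the cost-parameter c
clf.fit(X[:150], y[:150])
acc = balanced_accuracy_score(y[150:], clf.predict(X[150:]))

print(f"M = {M:.2f} s, SD = {SD:.2f} s, CV = {CV:.2f}, SEM = {SEM:.2f} s")
print(f"Maximal baseline-corrected zPD in the epoch: {epoch.max():.2f}")
print(f"Decoding accuracy (balanced): {acc:.2f}")
```

With the random placeholder features, the decoding accuracy hovers around chance (0.5); with features that genuinely differ between the two percepts, it would exceed the chance level.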
