bioRxiv preprint doi: https://doi.org/10.1101/214247; this version posted November 5, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Variational Autoencoder: An Unsupervised Model for

Modeling and Decoding fMRI Activity in Visual Cortex

Kuan Han2,3, Haiguang Wen2,3, Junxing Shi2,3, Kun-Han Lu2,3, Yizhen Zhang2,3,

Zhongming Liu*1,2,3

1Weldon School of Biomedical Engineering

2School of Electrical and Computer Engineering

3Purdue Institute for Integrative Neuroscience

Purdue University, West Lafayette, Indiana, 47906, USA

*Correspondence

Zhongming Liu, PhD

Assistant Professor of Biomedical Engineering

Assistant Professor of Electrical and Computer Engineering

College of Engineering, Purdue University

206 S. Martin Jischke Dr.

West Lafayette, IN 47907, USA

Phone: +1 765 496 1872

Fax: +1 765 496 1459

Email: [email protected]


Abstract

Goal-driven convolutional neural networks (CNN) have been shown to predict and decode cortical responses to natural images or videos. Here, we explored an alternative deep neural network, the variational auto-encoder (VAE), as a computational model of the visual cortex. We trained a VAE with a five-layer encoder and a five-layer decoder to learn visual representations from a diverse set of unlabeled images. Inspired by the "free-energy principle" in neuroscience, we modeled the brain's bottom-up and top-down pathways using the VAE's encoder and decoder, respectively. Following such conceptual relationships, we found that the VAE was able to predict cortical activities observed with functional magnetic resonance imaging (fMRI) from three human subjects watching natural videos. Compared to CNN, VAE resulted in relatively lower prediction accuracies, especially for higher-order ventral visual areas. On the other hand, fMRI responses could be decoded to estimate the VAE's latent variables, which in turn could reconstruct the visual input through the VAE's decoder. This decoding strategy was more advantageous than alternative decoding methods based on partial least squares regression. This study supports the notion that the brain, at least in part, bears a generative model of the visual world.

Keywords: neural encoding, variational autoencoder, generative model, visual reconstruction


Introduction

Humans readily make sense of the visual world through complex neuronal circuits. Understanding the human visual system requires not only measurements of brain activity but also computational models with built-in hypotheses about neural computation and learning (Kietzmann, McClure and Kriegeskorte 2017). Models that truly reflect the brain's working in natural vision should be able to explain brain activity given any visual input (encoding), and decode brain activity to infer visual input (decoding) (Naselaris et al. 2011). Therefore, evaluating a model's encoding and decoding performance serves to test and compare hypotheses about how the brain learns and organizes visual representations (Wu, David and Gallant 2006).

In one class of hypotheses, the visual system consists of feature detectors that progressively extract and integrate features for visual recognition. For example, Gabor and wavelet filters allow extraction of such low-level features as edges and motion (Hubel and Wiesel 1962, van Hateren and van der Schaaf 1998), and explain brain responses in early visual areas (Kay et al. 2008, Nishimoto et al. 2011). More recently, convolutional neural networks (CNNs) encode multiple layers of features in a brain-inspired feedforward network (LeCun, Bengio and Hinton 2015), and support human-like performance in image recognition (Simonyan and Zisserman 2014, He et al. 2016, Krizhevsky, Sutskever and Hinton 2012). Such models bear hierarchically organized representations similar to those in the brain itself (Khaligh-Razavi and Kriegeskorte 2014, Cichy et al. 2016), and shed light on neural encoding and decoding during natural vision (Yamins et al. 2014, Horikawa and Kamitani 2017, Eickenberg et al. 2017, Guclu and van Gerven 2015, Wen et al. 2017b). For these reasons, goal-driven CNNs are gaining attention as favorable models of the visual system (Kriegeskorte 2015, Yamins and DiCarlo 2016). Nevertheless, biological learning is not entirely goal-driven but often unsupervised (Barlow 1989), and the visual system has not only feedforward (bottom-up) but also feedback (top-down) connections (Salin and Bullier 1995, Bastos et al. 2012).

In another class of hypotheses, the visual world is viewed as the outcome of a probabilistic and generative process (Fig. 1A): any visual input results from a generative model that samples the hidden "causes" of the input from their probability distributions (Friston 2010). In light of this view, the brain behaves as an inference machine: recognizing and predicting visual input through "analysis by synthesis" (Yuille and Kersten 2006). The brain's bottom-up process analyzes visual input to infer the "cause" of the input, and its top-down process predicts the input from the brain's internal representations. Both processes are optimized by learning from visual experience in order to avoid the "surprise" or error of prediction (Rao and Ballard 1999, Friston and Kiebel 2009). This hypothesis attempts to account for both feedforward and feedback connections in the brain, aligns with humans' ability to construct mental images, and offers a basis for unsupervised learning. Thus, it is compelling for both computational neuroscience (Friston 2010, Rao and Ballard 1999, Bastos et al. 2012, Yuille and Kersten 2006) and artificial intelligence (Hinton et al. 1995, Lotter, Kreiman and Cox 2016, Mirza, Courville and Bengio 2016).
In line with this notion is the so-called variational autoencoder (VAE) (Kingma and Welling 2013). VAE uses independently distributed "latent" random variables to code the causes of the visual world. VAE learns the latent variables from images via an encoder, and samples the latent variables to generate new images via a decoder, where the encoder and decoder are both neural networks that can be trained from a large dataset without supervision (Doersch 2016). Hypothetically, VAE offers a potential model of the brain's visual system, if the brain also captures the causal structure of the visual world. As such, the latent variables in VAE should match (up to linear projection) neural representations given naturalistic visual input; the generative component in VAE should enable an effective and generalizable way to decode brain activity during either visual perception or imagery (Du, Du and He 2017, Güçlütürk et al. 2017, van Gerven, de Lange and Heskes 2010a).

Figure 1 | The unsupervised inference model of vision. (A) Regarding the brain as an inference machine. The brain analyzes sensory data by approximating the causes that generate the sensations. Its bottom-up process maps from the input to the sensory causes, and its top-down process predicts the input based on the inferred causes. (B) Encoding and decoding cortical activities with a variational autoencoder. For encoding, cortical activities are predicted from VAE unit responses given the visual stimuli; for decoding, latent variables are estimated from fMRI measurements and mapped to a visual reconstruction through the VAE decoder.

This led us to investigate VAE as a candidate model of the visual cortex for unsupervised learning of visual representations (Fig. 1B). In this study, we addressed the relationship between VAE and the brain's "free-energy" principle (Friston 2010), and then tested the degree to which VAE could be used to predict and decode cortical responses observed with functional magnetic resonance imaging (fMRI) during naturalistic movie stimuli.


Methods and Materials

Theory: variational autoencoder

In general, a variational autoencoder (VAE) is a type of deep neural network that learns representations from complex data without supervision (Kingma and Welling 2013). A VAE includes an encoder and a decoder, both of which are implemented as neural nets. The encoder learns latent variables from the input data, and the decoder learns to generate similar input data from samples of the latent variables. Given large input datasets, the encoder and the decoder are trained together by minimizing the reconstruction loss and the Kullback-Leibler (KL) divergence between the distributions of the latent variables and independent standard normal distributions (Doersch 2016). When the input data are natural images, the latent variables represent the causes of the images. The encoder serves as an inference model that attempts to infer the latent causes of any input image; the decoder serves as a generative model that generates new images by sampling latent variables.

Mathematically, let $\mathbf{z}$ be the latent variables and $\mathbf{x}$ be images. The encoder parameterized with $\boldsymbol{\varphi}$ infers $\mathbf{z}$ from $\mathbf{x}$, and the decoder parameterized with $\boldsymbol{\theta}$ generates $\mathbf{x}$ from $\mathbf{z}$. In VAE, both $\mathbf{z}$ and $\mathbf{x}$ are random variables. The likelihood of $\mathbf{x}$ given $\mathbf{z}$ under the generative model $\boldsymbol{\theta}$ is denoted as $p_{\boldsymbol{\theta}}(\mathbf{x}|\mathbf{z})$. The probability of $\mathbf{z}$ given $\mathbf{x}$ under the inference model $\boldsymbol{\varphi}$ is denoted as $q_{\boldsymbol{\varphi}}(\mathbf{z}|\mathbf{x})$. The marginal likelihood of the data can be written in the following form:

$$\log p_{\boldsymbol{\theta}}(\mathbf{x}) = D_{KL}\left[q_{\boldsymbol{\varphi}}(\mathbf{z}|\mathbf{x})\,\|\,p_{\boldsymbol{\theta}}(\mathbf{z}|\mathbf{x})\right] + L(\boldsymbol{\theta}, \boldsymbol{\varphi}; \mathbf{x}) \quad (1)$$

Since the Kullback-Leibler divergence in Eq. (1) is non-negative, $L(\boldsymbol{\theta}, \boldsymbol{\varphi}; \mathbf{x})$ can be regarded as a lower bound of the data log-likelihood and rewritten as Eq. (2). For VAE, the learning rule is to optimize $\boldsymbol{\theta}$ and $\boldsymbol{\varphi}$ by maximizing the following function given the training samples of $\mathbf{x}$:

$$L(\boldsymbol{\theta}, \boldsymbol{\varphi}; \mathbf{x}) = -D_{KL}\left[q_{\boldsymbol{\varphi}}(\mathbf{z}|\mathbf{x})\,\|\,p_{\boldsymbol{\theta}}(\mathbf{z})\right] + E_{\mathbf{z}\sim q_{\boldsymbol{\varphi}}(\mathbf{z}|\mathbf{x})}\left[\log p_{\boldsymbol{\theta}}(\mathbf{x}|\mathbf{z})\right] \quad (2)$$

In this objective function, the first term is the KL divergence between the distribution of $\mathbf{z}$ inferred from $\mathbf{x}$ and the prior distribution of $\mathbf{z}$, both of which are assumed to follow a multivariate normal distribution. The second term is the expectation of the log-likelihood that the input image can be generated by the sampled values of $\mathbf{z}$ from the inferred distribution $q_{\boldsymbol{\varphi}}(\mathbf{z}|\mathbf{x})$. When $q_{\boldsymbol{\varphi}}(\mathbf{z}|\mathbf{x})$ is a multivariate normal distribution with unknown expectations $\boldsymbol{\mu}$ and variances $\boldsymbol{\sigma}^2$, the objective function is differentiable with respect to $(\boldsymbol{\theta}, \boldsymbol{\varphi}, \boldsymbol{\mu}, \boldsymbol{\sigma})$, and can therefore be optimized iteratively through gradient-descent algorithms (Kingma and Welling 2013).

Similar concepts have been put forth in computational neuroscience theories, for example the free-energy principle (Friston 2010). In the free-energy principle, the brain's perceptual system includes bottom-up and top-down pathways. Like the encoder in VAE, the bottom-up pathway maps from sensory data to their inferred causes as probabilistic representations. Like the decoder in VAE, the top-down pathway predicts the sensation from what the brain infers as the causes of sensations. The bottom-up and top-down pathways are optimized together, such that the brain infers the causes of the sensory input and generates a sensory prediction that matches the input with minimal error or surprise. Mathematically, the learning objectives in VAE and the free-energy principle are similar: both aim to minimize the difference between the inferred and hidden causes of sensory data (measured by the Kullback-Leibler divergence) while maximizing the likelihood of the sensory data given the inferred causes.
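As a concrete illustration of Eq. (2), below is a minimal PyTorch sketch of the negative lower bound for the common case of a Gaussian posterior $q_{\boldsymbol{\varphi}}(\mathbf{z}|\mathbf{x})$ and a Bernoulli likelihood over normalized pixel intensities; the function name and interface are ours, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def vae_loss(x, x_recon, mu, logvar):
    """Negative lower bound -L(theta, phi; x) from Eq. (2).

    x         : input images with pixel values in [0, 1]
    x_recon   : decoder output, same shape as x
    mu, logvar: mean and log-variance of q_phi(z|x)
    """
    # Expected log-likelihood term: a Bernoulli likelihood treats the
    # normalized pixel intensity as the probability of color emission.
    recon = F.binary_cross_entropy(x_recon, x, reduction='sum')
    # KL divergence between q_phi(z|x) = N(mu, sigma^2) and the prior N(0, I),
    # in closed form: -0.5 * sum(1 + log(sigma^2) - mu^2 - sigma^2).
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```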

Figure 2 | VAE architecture and reconstruction examples. (A) VAE architecture. The encoder and the decoder both contained 5 layers. The dimension of the latent variables was 1,024. Operations were defined as: *1 convolution (kernel size=4, stride=2, padding=1), *2 rectified nonlinearity, *3 fully connected layer, *4 re-parameterization trick, *5 transposed convolution (kernel size=4, stride=2, padding=1), *6 sigmoid nonlinearity. (B) Examples of VAE reconstruction. Each testing image was coded as 1,024-dimensional latent variables with the VAE encoder. The reconstruction from the VAE decoder (blurred images on the right) retained basic, condensed information similar to the original input (clear images on the left).

Training VAE with diverse natural images

We designed a VAE with 1,024 latent variables; the encoder and the decoder were both convolutional neural networks with five hidden layers (Fig. 2A). Each convolutional layer included nonlinear units with a Rectified Linear Unit (ReLU) function (Nair and Hinton 2010), except the last layer in the decoder, where a sigmoid function was used to generate normalized pixel values between 0 and 1. The model was trained on the ImageNet ILSVRC2012 dataset (Russakovsky et al. 2015). Training images were resized to 128×128×3. To enlarge the amount of training data, the original training images were randomly flipped in the horizontal direction to yield additional training images, resulting in >2 million training samples in total. The training data were divided into mini-batches with a batch size of 200. For each training example, the pixel intensities were normalized to a range from 0 to 1, such that the normalized intensity could be interpreted as the probability of color emission (Gregor et al. 2015). To train the VAE, the Adam optimizer (Kingma and Ba 2014) was used with a learning rate of 1e-4. The VAE was implemented in PyTorch (http://pytorch.org/).
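The following PyTorch sketch illustrates the architecture summarized in Fig. 2A (five convolutional layers per pathway with kernel size 4, stride 2 and padding 1, ReLU nonlinearities, fully connected layers to 1,024 latent variables, the re-parameterization trick, and a sigmoid output). The channel widths are our assumption, since they are not specified in the text; training would combine this module with the loss above and the stated optimizer settings.

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    """Convolutional VAE for 128x128x3 images with 1,024 latent variables."""
    def __init__(self, z_dim=1024, ch=(64, 128, 256, 512, 512)):
        super().__init__()
        enc, c_in = [], 3
        for c_out in ch:                              # five conv layers: 128 -> 4
            enc += [nn.Conv2d(c_in, c_out, 4, 2, 1), nn.ReLU(inplace=True)]
            c_in = c_out
        self.encoder = nn.Sequential(*enc)
        self.fc_mu = nn.Linear(ch[-1] * 4 * 4, z_dim)       # *3 in Fig. 2A
        self.fc_logvar = nn.Linear(ch[-1] * 4 * 4, z_dim)
        self.fc_dec = nn.Linear(z_dim, ch[-1] * 4 * 4)
        dec = []
        for c_out in ch[-2::-1]:                      # mirror the encoder widths
            dec += [nn.ConvTranspose2d(c_in, c_out, 4, 2, 1), nn.ReLU(inplace=True)]
            c_in = c_out
        dec += [nn.ConvTranspose2d(c_in, 3, 4, 2, 1), nn.Sigmoid()]   # *6 in Fig. 2A
        self.decoder = nn.Sequential(*dec)

    def reparameterize(self, mu, logvar):
        # *4 in Fig. 2A: z = mu + sigma * eps, with eps ~ N(0, I)
        return mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)

    def decode(self, z):
        h = self.fc_dec(z).view(z.size(0), -1, 4, 4)
        return self.decoder(h)

    def forward(self, x):
        h = self.encoder(x).flatten(1)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z = self.reparameterize(mu, logvar)
        return self.decode(z), mu, logvar
```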

Experimental data

Three healthy volunteers (all female, age: 23-26) participated in this study with informed written consent according to a research protocol approved by the Institutional Review Board at Purdue University. All experiments were performed in accordance with relevant guidelines and regulations as described in this approved protocol. The experimental design and data have been described in detail elsewhere (Wen et al. 2017b) and are briefly summarized below. Each subject watched a diverse set of natural videos for a total length of up to 13.3 hours. The videos were selected and downloaded from Videoblocks and YouTube, and then separated into two independent sets. One set was for training the models to predict the fMRI responses based on the input video (i.e. the encoding models) or to reconstruct the input video based on the measured fMRI responses (i.e. the decoding models). The other set was for testing the trained encoding or decoding models. The videos in the training and testing datasets were entirely distinct to ensure unbiased model evaluation. Both the training and testing movies were split into 8-min segments. Each segment was used as visual stimulation (20.3°×20.3°) along with a central fixation cross (0.8°×0.8°) presented via an MRI-compatible binocular goggle during a single fMRI session. The training movie included 98 segments (13.1 hours) for Subject 1, and 18 segments (1.6 hours) for Subjects 2 and 3. The testing movie included 5 segments (40 min in total). Each subject watched the testing movie 10 times. All five testing movies were used to test the encoding model. One of the testing movies, which contained video clips that were continuous over long periods (mean ± std: 13.3 ± 4.8 s), was used to test the decoding model for visual reconstruction.

MRI/fMRI data were collected using a 3-T MRI system, including anatomical MRI (T1 and T2 weighted) with 1-mm isotropic spatial resolution, and BOLD functional MRI with 2-s temporal resolution and 3.5-mm isotropic spatial resolution. The fMRI data were registered onto the anatomical MRI data, and were further co-registered onto a cortical surface template (Glasser et al. 2013). The fMRI data were preprocessed with the minimal preprocessing pipeline released for the Human Connectome Project (Glasser et al. 2013).

VAE-based encoding models

The trained VAE extracted the latent representations of any video by a feed-forward pass of every video frame through the encoder, and generated the reconstruction by a feed-back pass through the decoder. To predict cortical fMRI responses to the video stimuli, an encoding model was defined separately for each voxel as a linear regression model, through which the voxel-wise fMRI signal was estimated as a function of VAE model activity (including both the encoder and the decoder) given the input video. Every unit activity in VAE was convolved with a canonical hemodynamic response function (HRF). For dimension reduction, PCA was applied to the HRF-convolved unit activity for each layer, keeping 99% of the variance of the layer-specific activity given the training movies. Then, the layer-wise activity was concatenated across layers, and PCA was applied to the concatenated activity to keep 99% of the variance of the activity from all layers given the training movies. See more details in our earlier paper (Wen et al. 2017a). Following the dimension reduction, the principal components of unit activity were down-sampled to match the sampling frequency of fMRI and used as the regressors to predict the fMRI signal at each voxel through the linear regression model specific to the voxel.

The voxel-wise regression model was trained based on the data during the training movie. Mathematically, for any training sample, let $\mathbf{x}^{(j)}$ be the visual stimuli at the $j$-th time point, $y_i^{(j)}$ be the fMRI response at the $i$-th voxel, and $\mathbf{z}^{(j)}$ be a vector representing the predictors for the fMRI signal derived from $\mathbf{x}^{(j)}$ through VAE, as described above. The voxel-wise regression model is expressed as Eq. (4):

$$y_i^{(j)} = \mathbf{w}_i^T \mathbf{z}^{(j)} + b_i + \epsilon_i \quad (4)$$

where $\mathbf{w}_i$ is a column vector representing the regression coefficients, $b_i$ is the bias term, and $\epsilon_i$ is the error unexplained by the model. The linear regression coefficients were estimated using least-squares estimation with L2-norm regularization, i.e. by minimizing the loss function defined in Eq. (5):

$$\langle \hat{\mathbf{w}}_i, \hat{b}_i \rangle = \underset{\mathbf{w}_i, b_i}{\arg\min}\ \frac{1}{N}\sum_{j=1}^{N}\left(y_i^{(j)} - \mathbf{w}_i^T \mathbf{z}^{(j)} - b_i\right)^2 + \lambda_i \|\mathbf{w}_i\|_2^2 \quad (5)$$

where $N$ is the number of training samples. The regularization parameter $\lambda_i$ was optimized for each voxel to minimize the loss in three-fold cross-validation. Once the optimal parameter $\lambda_i$ was determined, the model was refitted with the entire training set and the optimal regularization parameter to finalize the model.

After the above model training, we tested the voxel-wise encoding models with the testing movies. The testing movies were independent of the training movies to ensure unbiased model evaluation. The model performance was evaluated separately for each voxel as the correlation between the measured and predicted fMRI responses to the testing movie. The significance of the correlation was assessed by using a block permutation test (Adolf et al. 2014) with a block size of 24 s and 30,000 permutations, corrected at a false discovery rate (FDR) of $q < 0.05$.
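A minimal sketch of the voxel-wise encoding model in Eq. (4)-(5) is given below, using scikit-learn's ridge regression with three-fold cross-validation; the alpha grid and function names are our assumptions, and Z_train stands for the HRF-convolved, PCA-reduced and down-sampled VAE activity described above.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

def fit_voxelwise_encoding(Z_train, Y_train, alphas=(0.1, 1.0, 10.0, 100.0, 1000.0)):
    """Fit Eq. (4)-(5): one L2-regularized linear model per voxel, with the
    regularization strength selected by three-fold cross-validation.

    Z_train : (n_timepoints, n_components) regressors derived from VAE activity
    Y_train : (n_timepoints, n_voxels) fMRI signals during the training movie
    """
    models = []
    for i in range(Y_train.shape[1]):
        ridge = RidgeCV(alphas=alphas, cv=3)     # lambda_i chosen per voxel
        ridge.fit(Z_train, Y_train[:, i])        # refit on the full training set
        models.append(ridge)
    return models

def evaluate_encoding(models, Z_test, Y_test):
    """Voxel-wise Pearson correlation between predicted and measured responses."""
    r = np.empty(len(models))
    for i, ridge in enumerate(models):
        pred = ridge.predict(Z_test)
        r[i] = np.corrcoef(pred, Y_test[:, i])[0, 1]
    return r
```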

Encoding model comparison with supervised CNN

We also built a control model to predict cortical responses with supervised information. An 18-layer pretrained residual network (ResNet-18) was used as a benchmark against which to compare the encoding performance of the VAE with that of a supervised CNN. The model architecture is elaborated in the original ResNet paper (He et al. 2016). Briefly, ResNet-18 consisted of 6 layers: the first layer was a convolutional layer followed by max-pooling, yielding location- and orientation-selective features; the last layer was a fully connected layer, yielding the classification output; the 4 intermediate layers consisted of stacked residual blocks, yielding progressive abstraction from low-level features to semantic features. The output units of the first five layers in ResNet-18 were extracted as model activities given the input video. Then the unit activities were processed in the same way as the preprocessing steps for the VAE.
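For reference, extracting the output units of the first five layers of a pretrained ResNet-18 can be sketched with torchvision forward hooks as below; the input size of 224×224 and the exact layer grouping are our assumptions, and the argument for loading pretrained weights depends on the torchvision version.

```python
import torch
import torchvision.models as models

def extract_resnet18_features(frames):
    """Return activities of the first five 'layers' of ResNet-18:
    the convolutional stem (after max-pooling) and the four residual stages.

    frames : (n_frames, 3, 224, 224) tensor of preprocessed video frames
    """
    net = models.resnet18(pretrained=True).eval()
    feats = []
    hooks = [layer.register_forward_hook(lambda m, i, o: feats.append(o.detach()))
             for layer in (net.maxpool, net.layer1, net.layer2, net.layer3, net.layer4)]
    with torch.no_grad():
        net(frames)
    for h in hooks:
        h.remove()
    return feats  # list of five feature tensors, one per layer
```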


The encoding accuracy of the VAE was compared with that of the supervised CNN in two ways. First, regions of interest (ROIs) were selected from various levels of the visual hierarchy: V1, V2, V3, V4, lateral occipital (LO), middle temporal (MT), fusiform face area (FFA), para-hippocampal place area (PPA) and temporo-parietal junction (TPJ). In each ROI, the correlation coefficient of each voxel was corrected by the noise ceiling, and then averaged over all voxels and all subjects. The averaged prediction accuracies of each ROI were compared between VAE and ResNet-18. Second, for each voxel, the voxel-wise correlation coefficient was transformed to a z score through the Fisher transformation for both VAE and ResNet-18. The difference between the two models was then calculated by subtracting the z score of VAE from the z score of ResNet-18, yielding the voxel-wise difference in prediction accuracy.

The process of calculating the noise ceiling was elaborated in a previous study (Kay et al. 2013). Briefly, for each subject, the noise was assumed to be zero-mean, and the variance of the noise was estimated as the mean square of the standard errors in the testing data across 10 repetitions. The variance of the response was taken as the difference between the variance of the data and the variance of the noise. From the estimated signal and noise distributions, samples of response and noise were drawn by Monte Carlo simulation. The simulated $R^2$ value between the simulated response and the noisy data was calculated over 1,000 repetitions, yielding the distribution of the noise ceiling for each voxel.

VAE-based decoding of fMRI for visual reconstruction

We trained and tested the decoding model for reconstructing visual input from distributed fMRI responses. The model contained two steps: 1) transforming the spatial pattern of the fMRI response to latent variables through a linear regression model, and 2) transforming the latent variables to pixel patterns through the VAE's decoder.

Mathematically, let $\mathbf{Y}^{(j)}$ be a column vector representing the measured fMRI map at the $j$-th time point, and $\mathbf{z}^{(j)}$ be a column vector representing the latent variables given the unknown visual input $\mathbf{x}^{(j)}$. As in Eq. (6), a multivariate linear regression model was defined to predict $\mathbf{z}^{(j)}$ given $\mathbf{Y}^{(j)}$:

$$\mathbf{z}^{(j)} = \mathbf{U}\mathbf{Y}^{(j)} + \mathbf{c} + \boldsymbol{\varepsilon} \quad (6)$$

where $\mathbf{U}$ is a weighting matrix representing the regression coefficients that transform the fMRI map to the latent variables, $\mathbf{c}$ is the bias term, and $\boldsymbol{\varepsilon}$ is the error term unexplained by the model. This model was estimated based on the data during the training movie by using L1-norm regularized least-squares estimation. Specifically, to estimate the parameters of the decoding model, we minimized the objective function in Eq. (7) with L1-regularized least-squares estimation to prevent over-fitting:

$$\langle \hat{\mathbf{U}}, \hat{\mathbf{c}} \rangle = \underset{\mathbf{U}, \mathbf{c}}{\arg\min}\ \frac{1}{N}\sum_{j=1}^{N}\left\|\mathbf{z}^{(j)} - \mathbf{U}\mathbf{Y}^{(j)} - \mathbf{c}\right\|_2^2 + \lambda \|\mathbf{U}\|_1 \quad (7)$$

where $N$ is the number of data samples used for training the model. The regularization parameter $\lambda$ was optimized to minimize the loss in three-fold cross-validation. To solve Eq. (7), we used a stochastic gradient-descent algorithm with a batch size of 100 and a learning rate of 1e-7.
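A minimal PyTorch sketch of the L1-regularized regression in Eq. (6)-(7), solved by stochastic gradient descent with the stated batch size (100) and learning rate (1e-7), is given below; the number of epochs and the variable names are our assumptions.

```python
import torch

def fit_decoder_regression(Y, Z, lam, n_epochs=50, batch_size=100, lr=1e-7):
    """Estimate U and c in Eq. (6)-(7) by SGD with an L1 penalty on U.

    Y  : (n_samples, n_voxels) fMRI maps during the training movie (tensor)
    Z  : (n_samples, n_latent) VAE latent variables of the training movie (tensor)
    lam: L1 regularization weight (chosen by three-fold cross-validation)
    """
    n = Y.shape[0]
    U = torch.zeros(Z.shape[1], Y.shape[1], requires_grad=True)
    c = torch.zeros(Z.shape[1], requires_grad=True)
    opt = torch.optim.SGD([U, c], lr=lr)
    for _ in range(n_epochs):
        perm = torch.randperm(n)
        for i in range(0, n, batch_size):
            idx = perm[i:i + batch_size]
            pred = Y[idx] @ U.t() + c                     # predicted latent variables
            # mean squared error plus the L1 penalty on the weighting matrix
            loss = ((Z[idx] - pred) ** 2).mean() + lam * U.abs().sum()
            opt.zero_grad()
            loss.backward()
            opt.step()
    return U.detach(), c.detach()
```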


After the estimation of the latent variables, the visual input in the testing movie was reconstructed by passing the estimated latent variables through the decoder in VAE, as expressed by Eq. (8):

$$\hat{\mathbf{x}}^{(j)} = \Theta(\hat{\mathbf{z}}^{(j)}) = \Theta(\hat{\mathbf{U}}\mathbf{Y}^{(j)} + \hat{\mathbf{c}}) \quad (8)$$

To evaluate the decoding performance in visual reconstruction, we calculated the structural similarity index (SSIM) (Wang et al. 2004) between every video frame of the testing movie and the corresponding frame reconstructed on the basis of the decoded fMRI signals. We averaged the SSIM over all frames in the testing movie (resampled by TR), as a measure of the spatial similarity between the reconstructed and actual movie stimuli. In addition, we further evaluated the degree to which color information was preserved in the movie reconstructed from fMRI data. For this purpose, the color information at each pixel was converted from the RGB values to a single hue value. The hue maps of the reconstructed movie frames were compared with those of the original movie frames. Their similarity was evaluated in terms of the circular correlation (Berens 2009, Jammalamadaka and Sengupta 2001). Hue values in the same frame were flattened into a vector, and the hue vectors of different frames were concatenated sequentially into a monolithic vector including all testing frames. Then the circular correlation of the monolithic hue vectors between the original and reconstructed frames was calculated for each subject. The statistical significance of the circular correlation between the reconstructed and original color was tested by using the block permutation test with a 24-s block size and 30,000 permutations (Adolf et al. 2014).
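The evaluation described above can be sketched as follows, assuming scikit-image (version 0.19 or later) for SSIM and matplotlib for the RGB-to-hue conversion; the circular correlation follows the Jammalamadaka and Sengupta (2001) definition, and the function names are ours.

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim
from matplotlib.colors import rgb_to_hsv

def mean_ssim(frames_true, frames_recon):
    """Average SSIM over frames; inputs are (n_frames, H, W, 3) arrays in [0, 1]."""
    scores = [ssim(t, r, channel_axis=-1, data_range=1.0)
              for t, r in zip(frames_true, frames_recon)]
    return np.mean(scores)

def circular_corr(hue_a, hue_b):
    """Circular correlation between two hue vectors, with hue in [0, 1]
    mapped to angles in [0, 2*pi)."""
    a, b = 2 * np.pi * np.ravel(hue_a), 2 * np.pi * np.ravel(hue_b)
    a0 = np.arctan2(np.sin(a).sum(), np.cos(a).sum())   # circular mean of a
    b0 = np.arctan2(np.sin(b).sum(), np.cos(b).sum())   # circular mean of b
    num = np.sum(np.sin(a - a0) * np.sin(b - b0))
    den = np.sqrt(np.sum(np.sin(a - a0) ** 2) * np.sum(np.sin(b - b0) ** 2))
    return num / den

# Hue maps are obtained from the RGB frames, e.g.:
# hue_true = rgb_to_hsv(frames_true)[..., 0]
# hue_recon = rgb_to_hsv(frames_recon)[..., 0]
```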

Figure 3 | Prediction accuracy with VAE-based encoding model. The accuracy was measured by the average Pearson’s correlation coefficient (r) between the predicted and the observed fMRI responses over five testing movies (q<0.05, FDR correction). The result of subject 1 was shown with flat and stereoscopic views (top and bottom-left). Results of other subjects were shown with stereoscopic views.


We also compared the performance of the VAE-based decoding method with a previously published decoding method (Cowen, Chun and Kuhl 2014). For this alternative method (Cowen et al. 2014), we applied PCA to the training movie and obtained its principal components (or eigen-images), which explained 99% of its variance. Partial least squares regression (PLSR) (Tenenhaus et al. 2005) was used to estimate the linear transformation from fMRI maps to eigen-images given the training movie. Using the estimated PLSR model, the fMRI data during the testing movie were first converted to representations in eigen-images, which in turn were recombined to reconstruct the visual stimuli (Cowen et al. 2014). As a variation of this PLSR-based model, we also explored the use of L1-norm regularized optimization to estimate the linear transform from fMRI maps to eigen-images, in a way similar to that used for training our decoding model (see Eq. (7)); a sketch of the PLSR baseline is given below.

We also explored whether the VAE-based decoding models could be generalized across subjects. For this purpose, we used the decoding model trained from one subject to decode the fMRI data observed from the other subjects while watching the testing movie.
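A minimal sketch of the PLSR-based control model described above, using scikit-learn; the number of PLS components is our assumption.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cross_decomposition import PLSRegression

def plsr_baseline(frames_train, Y_train, Y_test, n_pls=20):
    """PLSR control model: PCA eigen-images plus partial least squares
    regression from fMRI maps to eigen-image coefficients.

    frames_train  : (n_samples, n_pixels) flattened training-movie frames
    Y_train/Y_test: (n_samples, n_voxels) fMRI maps
    n_pls         : number of PLS components (assumed)
    """
    pca = PCA(n_components=0.99)                # keep 99% of the variance
    coeff_train = pca.fit_transform(frames_train)
    pls = PLSRegression(n_components=n_pls)
    pls.fit(Y_train, coeff_train)               # fMRI -> eigen-image coefficients
    coeff_test = pls.predict(Y_test)
    return pca.inverse_transform(coeff_test)    # recombine into pixel frames
```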

Results

VAE provided vector representations of natural images

By design, the VAE aimed to form a compressed and generalized vector representation of any natural image. In VAE, the encoder converted any natural image into 1,024 independent latent variables; the decoder reconstructed the image from the latent variables (Fig. 2A). After training it with >2 million natural images in a wide range of categories, the VAE could regenerate natural images without significant loss in image content, structure and color, albeit with blurred details (Fig. 2B). Note that the VAE-generated images showed comparable quality for different types of input images (Fig. 2B). In this sense, the latent representations in the VAE were generalizable across various types of visual objects, or their combinations.

Figure 4 | Model comparison with group prediction accuracies averaged within different ROIs. Each bar represented the mean±SE of the prediction accuracy (noise-corrected Pearson's correlation coefficient averaged over the 5 testing movies) across voxels within an ROI and across subjects. Results from VAE were shown in light color, with accuracies gradually decreasing from low-level to high-level areas. Results from ResNet-18 were shown in dark color, with accuracies in high-level ROIs generally higher than those in low-level ROIs.


VAE predicted movie-induced cortical responses in visual cortex

Given natural movies as visual input, we further asked whether and to what extent the model dynamics in VAE could be used to model and predict the movie-induced cortical responses. Specifically, a linear regression model was trained separately for each voxel by optimally fitting the voxel response time series to a training movie as a linear combination of the VAE's unit responses to the movie. Then, the trained voxel-wise encoding model was tested with a new testing movie (not used for training) to evaluate its prediction accuracy (i.e. the correlation between the predicted and measured fMRI responses). We found significantly reliable predictions (q<0.05, FDR correction) based on the VAE encoding model in a reasonable fraction of cortical areas (Fig. 3). The primary visual area (V1) showed the highest prediction accuracy; intermediate visual areas (V2/V3/V4) were also predictable, but the prediction accuracy decreased gradually along either the ventral or dorsal stream (Fig. 3). The VAE-predictable areas extended over a relatively larger scope when more data (~12-hour movie) were used for training the encoding models in Subject 1 than in Subjects 2 and 3, for whom less training data (2.5-hour movie) were available.

Figure 5 | Comparing the encoding prediction accuracy of VAE and ResNet-18. The voxel-wise correlation coefficient was transformed to a z score through the Fisher transformation. The z scores of the VAE-based encoding model ($z_{VAE}$), the ResNet-18-based encoding model ($z_{Res}$) and their difference ($\Delta z = z_{Res} - z_{VAE}$) were shown.


Comparing prediction accuracies between VAE and ResNet-18 based encoding models

The VAE-based encoding model explained visual area activities with significant accuracy. We further compared VAE and ResNet-18 to address their relative prediction performance. Fig. 4 shows the prediction accuracies of different ROIs based on the two models. VAE showed the highest prediction accuracy for V1, and accuracies gradually decreased from intermediate visual levels (V2/V3) to higher levels. In contrast, ResNet-18 showed higher accuracies for high-level ROIs (LO/MT/FFA/PPA/TPJ) than for early visual areas. As for the accuracy difference, VAE provided performance similar to ResNet-18 in V1, but was outperformed by ResNet-18 to an increasingly significant extent in high-level ROIs.

We also tested the voxel-wise accuracy difference between the two models. For this purpose, the correlation coefficient of each voxel was transformed to a z score, and the differences between the VAE-based z scores and the ResNet-18-based z scores were calculated. Generally, ResNet-18 gave better predictions than VAE for most voxels. Fig. 5 indicates a gradient in the z-score difference: for early visual areas, the prediction performance of VAE was similar to or slightly lower than that of ResNet-18; for high-level visual areas, a stronger performance gap was found, suggesting that features from an unsupervised model might not be enough to explain voxel dynamics, especially as the cortical level goes higher.

Figure 6 | Reconstruction results and evaluation. Reconstruction results of 6 different movie clips were shown. In each group, the first row showed the original movie frames and the second row showed the within-subject reconstruction. For cross-subject reconstruction, the decoding model was trained from Subject 1, and results of Subject 2 (clips 1, 2 & 3) and Subject 3 (clips 4, 5 & 6) were shown in the third row.


Direct visual reconstruction through decoding latent representations from the brain

Another way to assess how well the VAE model matched the brain was to evaluate how well the latent variables predicted from fMRI signals fit the generative model of the VAE and reconstructed the visual input. We estimated the latent variables of the visual stimuli from fMRI signals and fed the latent representations to the VAE decoder. The reconstructed results generally reflected the structural information of the original input, including the relative position and shape of objects in the scene (Fig. 6). During transitions between consecutive movie clips, the reconstructed frames tended to be unclear and affected by the content of both neighboring clips. Within a continuous clip, the results looked more stable and clearer. For the entire reconstructed movie frames of all 3 subjects, please see the enclosed video.

Figure 7 | Quantitative evaluation of reconstruction. (A) Comparison of structural similarity index (SSIM). SSIM scores of VAE-based decoding model and control models were compared for all 3 subjects. Each bar represented mean±SE SSIM score over all frames in the testing movie. (B) Color component correlation. The circular correlation coefficient between original and reconstructed hue components was calculated and the p-value was given through permutation test (*, p<0.001).

We also provided a quantitative measurement and comparison of the model's performance on structural similarity. Given the decoding model and the two control models, the structural similarity between reconstructed frames and visual inputs was evaluated using the structural similarity index (SSIM) (Wang et al. 2004). Fig. 7A indicates that the mean SSIM score of the decoding model was higher than that of the control models (paired-sample t-test, p < 0.001) for each subject, suggesting that the decoding model achieved a higher group-averaged SSIM score and reconstructed with better structural similarity. The visualization of the reconstruction results intuitively showed the retained color information of the original frames. For example, the reconstructed frames of the jellyfish (clip 4 in Fig. 6) showed the orange, blue and cyan colors corresponding to the true frames. To give a quantitative measurement, we evaluated the circular correlation of the hue components with a permutation test (Fig. 7B). For Subjects 1, 2 and 3, the correlation coefficient values for all frames were 0.2657, 0.2544 and 0.2392, respectively (permutation test, p < 0.001 for all three subjects, no correction), suggesting that color variations within or across reconstructed frames reflected the color fluctuations shown by the visual stimuli.


The decoding model was transferable among subjects. We decoded the fMRI signals of Subjects 2 and 3 with the decoding model trained from Subject 1 (Fig. 6). The reconstruction results still retained information about the structure, shape and color variation of objects and scenes in the original frames.

Discussion

Extending learning rules for modeling visual cortex

Recently, a growing body of literature has used supervised deep-learning approaches to model and understand cortical representations during natural vision, yielding state-of-the-art performance for neural encoding and representational modeling (Yamins and DiCarlo 2016, Guclu and van Gerven 2015, Eickenberg et al. 2017, Cichy et al. 2016, Wen et al. 2017b, Shi et al. 2017), experimental paradigm simulation (Wen et al. 2017a, Eickenberg et al. 2017), visualizing single-voxel representations (Wen et al. 2017b) and neural decoding (Horikawa and Kamitani 2017, Wen et al. 2017b). This study extends previous findings by using an unsupervised inference model to explain cortical activities, focusing on modeling the unsupervised inference attribute of the human brain.

In this study, the VAE-predictable voxels cover a large portion of the visual cortex (Fig. 3), suggesting that unsupervised inference is a compelling learning principle for driving brain-explanatory computational models. Beyond that, one interesting finding of this study is the contrast in encoding accuracy between VAE and the supervised CNN. While the supervised CNN explains substantial variance over nearly the entire visual cortex, VAE shows comparable encoding performance only in the primary visual area. This is consistent with a previous finding that explaining inferior temporal (IT) cortex requires computational features trained through supervised learning (Khaligh-Razavi and Kriegeskorte 2014). Our result further suggests that the unsupervised model better matches low-level visual areas than high-level visual areas, while the CNN with supervised training matches both well and is more exclusively important for explaining high-level visual areas.

Accumulating evidence for the analysis-by-synthesis hypothesis with natural stimuli

In the hypothesized "analysis by synthesis" theory of visual processing, the top-down generative component supports the bottom-up processing for inferring the causes of visual input (Yuille and Kersten 2006, Friston 2010). Previous experimental studies have been accumulating evidence for the "analysis by synthesis" hypothesis, including single-cell-like receptive fields (Rao and Ballard 1999), neural suppression due to stimulus predictability (Summerfield et al. 2008, Alink et al. 2010) and visual mismatch negativity (Wacongne, Changeux and Dehaene 2012, Garrido et al. 2009). These are all based on artificial visual stimuli instead of natural visual stimuli. However, it has been suggested that one important aspect of real inference systems is to deal with the complexities and ambiguities of visual information in the natural world (Yuille and Kersten 2006). Therefore, this study attempts to evaluate the VAE-based encoding and decoding models especially with natural visual stimuli.

Based on the VAE computational model, we evaluate the inference property of natural vision in two respects: 1) while the model is driven by unsupervised inference, the unit activities of the whole model can jointly model and predict cortical dynamics, yielding similar or related learning rules in the significantly predicted areas; 2) the inference result of the computational model, the latent variables which represent inferred causes, can be directly extracted from cortical activities and tested for visual reconstruction. The regression techniques for both 1) and 2) were chosen to be linear because we want to guarantee that the encoding and decoding performances are contributed by the similarity of the model's representations to the brain's, instead of by the added regression. The encoding results match well to early visual areas (Fig. 5), suggesting that similar visual representations are shared by the model dynamics and cortical responses. The decoding results show that the latent representations extracted from the brain can support visual reconstruction (Fig. 6), with performance better than linear reconstruction methods (Fig. 7A), and can likely retain the color information of the original visual input (Fig. 7B). One surprising finding is that the decoding model kept comparable performance after being transferred across subjects, suggesting that different individuals organize cortical representations of visual cause information in a similar way. This similarity is likely to result from unsupervised inference representations, because different individuals have different supervised learning experiences and supervision knowledge, but the factors generating the visual world might be similar over diverse circumstances.

Towards generalizable visual stimuli reconstruction

For decoding cortical activities, the built-in decoder of VAE enables a clear, independent and generalizable way to reconstruct visual input. Previous brain decoding studies have mostly generated reconstructions in a certain domain rather than for diverse natural input, including edge orientation (Kamitani and Tong 2005), faces (Cowen et al. 2014, Nestor, Plaut and Behrmann 2016, Güçlütürk et al. 2017), contrast patterns (Miyawaki et al. 2008), digits (van Gerven, de Lange and Heskes 2010b, Du et al. 2017) and words (Schoenmakers et al. 2013). As for natural visual reconstruction, a Bayesian method has been used to combine a structure encoding model, a semantic encoding model and image prior information to produce accurate reconstructions of natural images (Naselaris et al. 2009). The Bayesian model was further extended with motion-energy filters to reconstruct natural movies (Nishimoto et al. 2011). However, using image prior information requires sampling from a large dataset of natural images, and the reconstruction is likely to correspond to an image that is already in the dataset. Therefore, the method's potential is limited if the purview of the content of the reconstruction bank is unknown. The VAE-based decoding method does not rely on prior samples, because the nonlinear mapping from latent variables to the input is provided by the VAE decoder generalized from millions of training samples, without requiring a best-matching image or other kinds of prior information.

Some recent studies have proposed generative methods for visual reconstruction. A deep belief network has been used to model hierarchical binary latent causes (van Gerven et al. 2010a); the decoding process then consists of a conditional sampling step to estimate the top-level associative states from cortical activities, and an unconditional sampling step to propagate expectations back to the input layer. One study uses a deep generative multi-view model to analyze brain responses and visual stimuli together; in this regard, visual reconstruction becomes Bayesian inference of the missing view in a multi-view latent variable model (Du et al. 2017). Another study introduces perceptual similarity metrics to face reconstruction (Güçlütürk et al. 2017), proposing adversarial training for neural decoding. Our study is complementary to previous generative reconstruction methods in both the application scenario and the model principle. On the application side, we showed the feasibility of using a generative model to reconstruct dynamic natural movies, which is also promising for natural image reconstruction; on the theoretical modeling side, the VAE framework can extend to a wide range of model architectures and is easier to keep tractable, compared with deep belief networks (Goodfellow, Bengio and Courville 2016). Generally, VAE offers a light and generalizable framework for reconstructing diverse natural movies, and could potentially be extended with other frameworks. We hypothesize that the integration of multiple generative techniques (especially the techniques mentioned above, including adversarial training and extended Bayesian inference methods) would combine the merits of different models and open a new avenue towards generalizable reconstruction of diverse natural sensations.

Future work

This work extends computational modeling of movie-induced cortical responses from supervised CNNs to an unsupervised inference model, and reconstructs dynamic natural vision with a generalizable framework. As for model architecture, biological inference systems are likely to have hierarchically organized sensory causes (Friston 2010) and updating schemes with temporal information (Friston and Kiebel 2009, Huang and Rao 2011). VAE processes static images and can only serve as a spatiotemporally flattened inference system with an end-to-end model architecture (Kingma and Welling 2013). Therefore, an improved inference model with spatial and temporal hierarchy would be needed for neural coding that is more plausible under inference theory. As for visual reconstruction, previous studies have suggested that feature maps of supervised CNNs can be predicted from brain activities (Horikawa and Kamitani 2017, Wen et al. 2017b). Since latent variables and feature maps both retain information about the visual input, it would be interesting to integrate feature estimation methods with generative models to augment visual reconstruction, and even to enable the model to decode mental states, imagery or dreams. Besides, another property of the variational autoencoder is feature disentangling (Kingma and Welling 2013, Higgins et al. 2016). However, to our knowledge, there is no report in the current AI literature on disentangling the factors of natural images by using VAE. A technical improvement on this would provide potential avenues to analyze the feature-disentangling property inside the human brain (DiCarlo, Zoccolan and Rust 2012).

References

Adolf, D., S. Weston, S. Baecke, M. Luchtmann, J. Bernarding & S. Kropf (2014) Increasing the reliability of data analysis of functional magnetic resonance imaging by applying a new blockwise permutation method. Frontiers in Neuroinformatics, 8, 72.
Alink, A., C. M. Schwiedrzik, A. Kohler, W. Singer & L. Muckli (2010) Stimulus predictability reduces responses in primary visual cortex. Journal of Neuroscience, 30, 2960-2966.
Barlow, H. B. (1989) Unsupervised learning. Neural Computation, 1, 295-311.
Bastos, A. M., W. M. Usrey, R. A. Adams, G. R. Mangun, P. Fries & K. J. Friston (2012) Canonical microcircuits for predictive coding. Neuron, 76, 695-711.
Berens, P. (2009) CircStat: a MATLAB toolbox for circular statistics. J Stat Softw, 31, 1-21.
Cichy, R. M., A. Khosla, D. Pantazis, A. Torralba & A. Oliva (2016) Comparison of deep neural networks to spatio-temporal cortical dynamics of human visual object recognition reveals hierarchical correspondence. Sci Rep, 6, 27755.
Cowen, A. S., M. M. Chun & B. A. Kuhl (2014) Neural portraits of perception: reconstructing face images from evoked brain activity. Neuroimage, 94, 12-22.
DiCarlo, J. J., D. Zoccolan & N. C. Rust (2012) How does the brain solve visual object recognition? Neuron, 73, 415-434.


Doersch, C. (2016) Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908.
Du, C., C. Du & H. He (2017) Sharing deep generative representation for perceived image reconstruction from human brain activity. arXiv preprint arXiv:1704.07575.
Eickenberg, M., A. Gramfort, G. Varoquaux & B. Thirion (2017) Seeing it all: Convolutional network layers map the function of the human visual system. Neuroimage, 152, 184-194.
Friston, K. (2010) The free-energy principle: a unified brain theory? Nat Rev Neurosci, 11, 127-38.
Friston, K. & S. Kiebel (2009) Predictive coding under the free-energy principle. Philos Trans R Soc Lond B Biol Sci, 364, 1211-21.
Garrido, M. I., J. M. Kilner, K. E. Stephan & K. J. Friston (2009) The mismatch negativity: a review of underlying mechanisms. Clinical Neurophysiology, 120, 453-463.
Glasser, M. F., S. N. Sotiropoulos, J. A. Wilson, T. S. Coalson, B. Fischl, J. L. Andersson, J. Xu, S. Jbabdi, M. Webster & J. R. Polimeni (2013) The minimal preprocessing pipelines for the Human Connectome Project. Neuroimage, 80, 105-124.
Goodfellow, I., Y. Bengio & A. Courville (2016) Deep Learning. MIT Press.
Gregor, K., I. Danihelka, A. Graves, D. J. Rezende & D. Wierstra (2015) DRAW: A recurrent neural network for image generation. arXiv preprint arXiv:1502.04623.
Guclu, U. & M. A. van Gerven (2015) Deep Neural Networks Reveal a Gradient in the Complexity of Neural Representations across the Ventral Stream. J Neurosci, 35, 10005-14.
Güçlütürk, Y., U. Güçlü, K. Seeliger, S. Bosch, R. van Lier & M. van Gerven (2017) Deep adversarial neural decoding. arXiv preprint arXiv:1705.07109.
He, K., X. Zhang, S. Ren & J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770-778.
Higgins, I., L. Matthey, X. Glorot, A. Pal, B. Uria, C. Blundell, S. Mohamed & A. Lerchner (2016) Early visual concept learning with unsupervised deep learning. arXiv preprint arXiv:1606.05579.
Hinton, G. E., P. Dayan, B. J. Frey & R. M. Neal (1995) The Wake-Sleep Algorithm for Unsupervised Neural Networks. Science, 268, 1158-1161.
Horikawa, T. & Y. Kamitani (2017) Generic decoding of seen and imagined objects using hierarchical visual features. Nat Commun, 8, 15037.
Huang, Y. & R. P. N. Rao (2011) Predictive coding. Wiley Interdisciplinary Reviews: Cognitive Science, 2, 580-593.
Hubel, D. H. & T. N. Wiesel (1962) Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. J Physiol, 160, 106-54.
Jammalamadaka, S. R. & A. Sengupta (2001) Topics in Circular Statistics. World Scientific.
Kamitani, Y. & F. Tong (2005) Decoding the visual and subjective contents of the human brain. Nature Neuroscience, 8, 679-685.
Kay, K. N., T. Naselaris, R. J. Prenger & J. L. Gallant (2008) Identifying natural images from human brain activity. Nature, 452, 352-5.
Kay, K. N., J. Winawer, A. Mezer & B. A. Wandell (2013) Compressive spatial summation in human visual cortex. Journal of Neurophysiology, 110, 481-494.
Khaligh-Razavi, S.-M. & N. Kriegeskorte (2014) Deep supervised, but not unsupervised, models may explain IT cortical representation. PLoS Computational Biology, 10, e1003915.
Kietzmann, T. C., P. McClure & N. Kriegeskorte (2017) Deep Neural Networks in Computational Neuroscience. bioRxiv, 133504.
Kingma, D. & J. Ba (2014) Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Kingma, D. P. & M. Welling (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.
Kriegeskorte, N. (2015) Deep Neural Networks: A New Framework for Modeling Biological Vision and Brain Information Processing. Annu Rev Vis Sci, 1, 417-446.


Krizhevsky, A., I. Sutskever & G. E. Hinton (2012) ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems, 1097-1105.
LeCun, Y., Y. Bengio & G. Hinton (2015) Deep learning. Nature, 521, 436-44.
Lotter, W., G. Kreiman & D. Cox (2016) Deep predictive coding networks for video prediction and unsupervised learning. arXiv preprint arXiv:1605.08104.
Mirza, M., A. Courville & Y. Bengio (2016) Generalizable Features From Unsupervised Learning. arXiv preprint arXiv:1612.03809.
Miyawaki, Y., H. Uchida, O. Yamashita, M.-a. Sato, Y. Morito, H. C. Tanabe, N. Sadato & Y. Kamitani (2008) Visual image reconstruction from human brain activity using a combination of multiscale local image decoders. Neuron, 60, 915-929.
Nair, V. & G. E. Hinton (2010) Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), 807-814.
Naselaris, T., K. N. Kay, S. Nishimoto & J. L. Gallant (2011) Encoding and decoding in fMRI. Neuroimage, 56, 400-10.
Naselaris, T., R. J. Prenger, K. N. Kay, M. Oliver & J. L. Gallant (2009) Bayesian reconstruction of natural images from human brain activity. Neuron, 63, 902-915.
Nestor, A., D. C. Plaut & M. Behrmann (2016) Feature-based face representations and image reconstruction from behavioral and neural data. Proceedings of the National Academy of Sciences, 113, 416-421.
Nishimoto, S., A. T. Vu, T. Naselaris, Y. Benjamini, B. Yu & J. L. Gallant (2011) Reconstructing visual experiences from brain activity evoked by natural movies. Curr Biol, 21, 1641-6.
Rao, R. P. N. & D. H. Ballard (1999) Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects. Nature Neuroscience, 2, 79-87.
Russakovsky, O., J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla & M. Bernstein (2015) Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115, 211-252.
Salin, P. A. & J. Bullier (1995) Corticocortical connections in the visual system: structure and function. Physiol Rev, 75, 107-54.
Schoenmakers, S., M. Barth, T. Heskes & M. van Gerven (2013) Linear reconstruction of perceived images from human brain activity. NeuroImage, 83, 951-961.
Shi, J., H. Wen, Y. Zhang, K. Han & Z. Liu (2017) Deep Recurrent Neural Network Reveals a Hierarchy of Process Memory during Dynamic Natural Vision. bioRxiv, 177196.
Simonyan, K. & A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
Summerfield, C., E. H. Trittschuh, J. M. Monti, M.-M. Mesulam & T. Egner (2008) Neural repetition suppression reflects fulfilled perceptual expectations. Nature Neuroscience, 11, 1004-1006.
Tenenhaus, M., V. E. Vinzi, Y.-M. Chatelin & C. Lauro (2005) PLS path modeling. Computational Statistics & Data Analysis, 48, 159-205.
van Gerven, M. A., F. P. de Lange & T. Heskes (2010a) Neural decoding with hierarchical generative models. Neural Comput, 22, 3127-42.
van Gerven, M. A. J., F. P. de Lange & T. Heskes (2010b) Neural decoding with hierarchical generative models. Neural Computation, 22, 3127-3142.
van Hateren, J. H. & A. van der Schaaf (1998) Independent component filters of natural images compared with simple cells in primary visual cortex. Proc Biol Sci, 265, 359-66.
Wacongne, C., J.-P. Changeux & S. Dehaene (2012) A neuronal model of predictive coding accounting for the mismatch negativity. Journal of Neuroscience, 32, 3665-3678.
Wang, Z., A. C. Bovik, H. R. Sheikh & E. P. Simoncelli (2004) Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13, 600-612.


Wen, H., J. Shi, W. Chen & Z. Liu (2017a) Deep Residual Network Reveals a Nested Hierarchy of Distributed Cortical Representation for Visual Categorization. bioRxiv, 151142.
Wen, H., J. Shi, Y. Zhang, K.-H. Lu, J. Cao & Z. Liu (2017b) Neural Encoding and Decoding with Deep Learning for Dynamic Natural Vision. Cerebral Cortex.
Wu, M. C., S. V. David & J. L. Gallant (2006) Complete functional characterization of sensory neurons by system identification. Annu Rev Neurosci, 29, 477-505.
Yamins, D. L., H. Hong, C. F. Cadieu, E. A. Solomon, D. Seibert & J. J. DiCarlo (2014) Performance-optimized hierarchical models predict neural responses in higher visual cortex. Proc Natl Acad Sci U S A, 111, 8619-24.
Yamins, D. L. K. & J. J. DiCarlo (2016) Using goal-driven deep learning models to understand sensory cortex. Nature Neuroscience, 19, 356-365.
Yuille, A. & D. Kersten (2006) Vision as Bayesian inference: analysis by synthesis? Trends Cogn Sci, 10, 301-8.
