Neural encoding with visual attention

Meenakshi Khosla1,*, Gia H. Ngo1, Keith Jamison3, Amy Kuceyeski3,4 and Mert R. Sabuncu1,2,3

1 School of Electrical and Computer Engineering, Cornell University, Ithaca, NY 14853
2 Nancy E. and Peter C. Meinig School of Biomedical Engineering, Cornell University, Ithaca, NY 14853
3 Radiology, Weill Cornell Medicine, New York, NY 10065
4 Brain and Mind Research Institute, Weill Cornell Medicine, New York, NY 10065
* Correspondence: [email protected]

Abstract

Visual perception is critically influenced by the focus of attention. Due to limited resources, it is well known that neural representations are biased in favor of attended locations. Using concurrent eye-tracking and functional Magnetic Resonance Imaging (fMRI) recordings from a large cohort of human subjects watching movies, we first demonstrate that leveraging gaze information, in the form of attentional masking, can significantly improve brain response prediction accuracy in a neural encoding model. Next, we propose a novel approach to neural encoding that includes a trainable soft-attention module. Using our new approach, we demonstrate that it is possible to learn visual attention policies by end-to-end learning on fMRI response data alone, without relying on any eye-tracking. Interestingly, we find that attention locations estimated by the model on independent data agree well with the corresponding eye fixation patterns, despite no explicit supervision to do so. Together, these findings suggest that attention modules can be instrumental in neural encoding models of visual stimuli.1

1 Our code is available at https://github.com/mk2299/encoding_attention.

1 Introduction

Developing accurate population-wide neural encoding models that predict the evoked brain response directly from sensory stimuli has been an important goal in computational neuroscience. Modeling neural responses to naturalistic stimuli, in particular stimuli that reflect the complexity of real-world scenes (e.g., movies), offers significant promise for understanding the human brain as it functions in everyday life [1]. Much of the recent success in predictive modeling of neural responses is driven by deep neural networks trained on tasks of behavioral relevance. For example, features extracted from deep neural networks trained on image or auditory recognition tasks are currently the best predictors of neural responses across visual and auditory brain regions, respectively [2, 3, 4]. While this success is promising, the unexplained variance is still large enough to prompt novel efforts in model development for this task.

One aspect that is often overlooked in existing neural encoding models of vision is visual attention. Natural scenes are highly complex and cluttered, typically containing a myriad of objects. What we perceive upon viewing complex, naturalistic stimuli depends significantly on where we direct our attention. It is well known that multiple objects in natural scenes compete for neural resources, and attentional guidance helps to resolve the ensuing competition [5]. Due to the limited information processing capacity of the visual system, neural activity is biased in favor of the attended location [6, 7]. Hence, more salient objects tend to be more strongly and robustly represented in our brains. Further, several theories have postulated that higher regions of the visual stream encode increasingly
shift- and scale-invariant representations of attended objects after filtering out interference from surrounding clutter [8, 9]. These studies suggest that the deployment of attention results in an information bottleneck, permitting only the most salient objects to be represented in higher visual areas such as the inferotemporal (IT) cortex of the ventral visual stream, which encodes object identity. Together, these findings indicate that visual attention mechanisms can be crucial for modeling neural responses of the higher visual system.

Visual attention and eye movements are tightly interlinked. Where we direct our gaze often quite accurately signals the focus of our attention [10]. This form of attention, known as overt spatial attention, can be directly measured by eye-tracking. Recent work has shown that fMRI activity can be used to directly predict fixation maps or eye movement patterns under free-viewing of natural scenes, suggesting a strong link between neural representations and eye movements [11]. In a similar vein, Sinz et al. [12] demonstrated that gaze shifts, as estimated from pupil locations and behavioral states, can be very useful in modeling the spiking activity of mouse V1 neurons. More recent large-scale data collection efforts on humans, such as the Human Connectome Project (HCP) [13], which simultaneously record fMRI and eye-tracking measurements on a large population under free-viewing of movies, present a novel opportunity to probe the potential role of attention in neural encoding models of ecological stimuli.

Our contributions in this study are as follows:

• We demonstrate that leveraging information about attended locations in an input image can be helpful in predicting the evoked neural response. In particular, we show that attentional masking of high-level stimulus representations based on human fixation maps can dramatically improve neural response prediction accuracy for naturalistic stimuli across large parts of the cortex.

• We show that it is possible to train a visual attention network using supervision from neural response prediction alone. This training strategy encourages only the salient parts of the image to dominate the prediction of the neural response. We find that the neural encoding model with this trained attention module outperforms encoding models with no attention or fixed attention.

• Interestingly, we find that despite not being explicitly trained to predict fixations, the attention network within the neural encoding model compares favorably against saliency prediction models that aim to directly predict likely human fixation locations given an input image. This suggests that neural response prediction can be a powerful supervision signal for learning where humans attend in cluttered scenes with multiple objects, and signals a novel opportunity for using functional brain recordings during free-viewing to understand visual attention.

2 Methods

Neural encoding models comprise two major components: a representation (feature extraction) module that extracts relevant representations from the raw stimuli, and a response model that predicts neural activation patterns from the resulting feature space. We propose to integrate a trainable soft-attention module on top of the representation network to learn attention schemes that guide the prediction of whole-brain neural responses. Our proposed methodology is illustrated in Figure 1.
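To make this two-component design concrete, below is a minimal PyTorch sketch of the overall pipeline: a frozen ImageNet-pretrained ResNet-50 backbone, a trainable soft-attention module (sketched in the next section), and a readout to voxel responses. This is our own illustrative reconstruction, not the authors' implementation: the class names are hypothetical, and the single linear readout stands in for the paper's convolutional response model, whose details are not reproduced here.

```python
import torch.nn as nn
from torchvision import models

class NeuralEncoder(nn.Module):
    def __init__(self, n_voxels, feat_channels=2048):
        super().__init__()
        # Representation module: ImageNet-pretrained ResNet-50 truncated after
        # the last residual block (res5); weights are frozen during training.
        backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
        self.features = nn.Sequential(*list(backbone.children())[:-2])
        for p in self.features.parameters():
            p.requires_grad = False
        # Trainable soft-attention module (see the sketch in the next section).
        self.attention = SoftAttention(feat_channels)
        # Response model: a simple linear readout here, standing in for the
        # paper's convolutional response model (an assumption of this sketch).
        self.readout = nn.Linear(feat_channels, n_voxels)

    def forward(self, x):
        f = self.features(x)                 # (B, 2048, H, W) res5 features
        a = self.attention(f)                # (B, 1, H, W) attention map
        pooled = (f * a).flatten(2).sum(-1)  # attention-weighted spatial pooling
        return self.readout(pooled)          # predicted voxel responses
```

Because the attention map sums to one over space, the attention-weighted pooling reduces the feature map to a single 2048-dimensional vector in which salient locations dominate.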
Figure 1: Proposed method. A trainable soft-attention module is implemented on top of a pre-trained representation network to rescale features based on their salience. The rescaled features are spatially pooled and fed into a convolutional response model to predict whole-brain neural response. We assess the value of the trained attention network by comparing it with neural encoding methods employing (i) stimulus-dependent attention maps derived from human fixations ($A_G$), (ii) a stimulus-independent attention map derived from all fixations in the training set that reflects the center-weighted bias of our dataset ($A_C$), as well as (iii) a no-attention model that spatially pools the features directly with no scaling.

Feature extraction network  We employ the state-of-the-art ResNet-50 [14] architecture, pre-trained for object recognition on ImageNet [15], as the representation network to extract semantically rich features from raw input images. In this study, we focus on improving neural response prediction in higher-order regions of the visual pathway, where receptive fields are larger and not limited to a single hemi-field. Prior evidence suggests that these regions are likely best modeled by deeper layers of object recognition networks [3, 16]. Thus, we extract the output of the last "residual block", namely res5 (after addition, before the global pooling operation), to encode all images into a 2048-channel high-level feature representation (of spatial size 23 × 32 in our experiments), denoted $F_{rep}$. All pre-trained weights are kept frozen during training of the neural encoding models.

Attention network  The attention network operates on the 2048-channel feature representation $F_{rep}$. For simplicity, we employed a single convolutional layer with a trainable 5 × 5 filter $V_{att} \in \mathbb{R}^{5 \times 5 \times 2048 \times 1}$, which constructs the saliency map as
$$S = G_\sigma * [V_{att} * F_{rep}]_+ ,$$
where $[\,\cdot\,]_+$ denotes the ReLU operation and $G_\sigma *$ indicates blurring with a 5 × 5 Gaussian kernel with $\sigma = 1$. The attention scores for each pixel are finally computed from the saliency map by normalizing with the spatial softmax operation,
$$A^{(i)} = \frac{\exp S^{(i)}}{\sum_{j=1}^{n} \exp S^{(j)}}, \qquad i \in \{1, \ldots, n\}. \tag{1}$$
Here, the superscript $i$ indexes the $n = 23 \times 32$ spatial locations in the feature map $F_{rep}$. We note that existing literature on selective visual attention suggests a hierarchical winner-take-all mechanism for saliency computation, where only the attended subset of the input image is consequently represented in higher visual systems [7]. The softmax operation can be construed as approximating this winner-take-all mechanism. The attention map is consequently applied as an element-wise rescaling of the features, which are then spatially pooled and fed into the response model (Figure 1).
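As a concrete illustration, here is a sketch of how the attention module above might be implemented in PyTorch: a single trainable 5 × 5 convolution plays the role of $V_{att}$, a fixed Gaussian kernel implements $G_\sigma$, and a spatial softmax implements Eq. (1). The module and function names are ours, and details such as "same" padding are assumptions not specified in the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def gaussian_kernel(size=5, sigma=1.0):
    """Normalized 2-D Gaussian kernel, shaped (1, 1, size, size) for F.conv2d."""
    ax = torch.arange(size, dtype=torch.float32) - (size - 1) / 2.0
    g = torch.exp(-(ax ** 2) / (2 * sigma ** 2))
    k = torch.outer(g, g)
    return (k / k.sum()).view(1, 1, size, size)

class SoftAttention(nn.Module):
    def __init__(self, in_channels=2048):
        super().__init__()
        # Trainable 5x5 filter V_att: 2048 input channels -> 1 saliency channel.
        # "same" padding is an assumption; the text does not specify it.
        self.v_att = nn.Conv2d(in_channels, 1, kernel_size=5, padding=2)
        # Fixed 5x5 Gaussian blur kernel G_sigma with sigma = 1 (not trained).
        self.register_buffer("blur", gaussian_kernel(size=5, sigma=1.0))

    def forward(self, f_rep):
        # Saliency map: S = G_sigma * [V_att * F_rep]_+
        s = F.relu(self.v_att(f_rep))
        s = F.conv2d(s, self.blur, padding=2)
        # Spatial softmax over all n = H x W locations, as in Eq. (1).
        b, _, h, w = s.shape
        a = F.softmax(s.view(b, -1), dim=-1)
        return a.view(b, 1, h, w)  # attention map A; sums to 1 over space
```

With the 23 × 32 res5 feature maps used here, the softmax normalizes over n = 736 locations; multiplying $F_{rep}$ by the resulting map and summing over space yields the attention-weighted pooled features consumed by the response model.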