
Self-supervised Neural Network Models of Higher Visual Cortex Development

Chengxu Zhuang (1), Siming Yan (2), Aran Nayebi (1), Daniel Yamins (1)
1: Stanford University, 2: Peking University

Abstract

Deep convolutional neural networks (DCNNs) optimized for visual object categorization have achieved success in modeling neural responses in the ventral visual pathway of adult primates. However, training DCNNs has long required large-scale labelled datasets, in stark contrast to the actual process of primate visual development. Here we present a network training curriculum, based on recent state-of-the-art self-supervised training algorithms, that achieves high levels of task performance without the need for unrealistically many labels. We then compare the DCNN as it evolves during training both to neural data recorded from macaque visual cortex, and to detailed metrics of visual behavior patterns. We find that the self-supervised DCNN curriculum not only serves as a candidate hypothesis for the visual development trajectory, but also produces a final network that accurately models neural and behavioral responses.

Keywords: Ventral visual stream; visual development; deep neural network; self-supervised; semi-supervised

Introduction

Deep Convolutional Neural Networks (DCNNs) trained to solve high-variation object recognition tasks have been shown to be quantitatively accurate models of neural responses in the ventral visual stream of adult primates (Yamins et al., 2014; Kriegeskorte, 2015; Cadena et al., 2019). These same networks also generate visual behavior patterns that are consistent with those of humans and macaques (Rajalingham et al., 2018). Though this progress at the intersection of machine learning and computational neuroscience is intriguing, there is a fundamental problem confronting these approaches: training DCNNs has long used heavily supervised methods involving huge numbers of high-level semantic labels. For example, state-of-the-art DCNNs for visual object recognition are typically trained on ImageNet (Deng et al., 2009), comprising 1.2 million labeled images. When viewed as mere technical tools for tuning network parameters, such procedures can be acceptable, although they limit the purview of the methods to situations with large existing labelled datasets. However, as real models of biological learning, they are highly unrealistic, since humans and primates develop their visual systems with very few explicit labels (Braddick & Atkinson, 2011; Atkinson, 2002; Harwerth, Smith, Duncan, Crawford, & Von Noorden, 1986; Bourne & Rosa, 2005). This developmental difference significantly undermines the effectiveness of DCNNs as models of visual learning.

Motivated by the need for more label-efficient training procedures, deep learning researchers have been actively developing self- and semi-supervised learning algorithms that train DCNNs with very few or no labels (Caron et al., 2018; Wu et al., 2018; Doersch et al., 2015; Zhang et al., 2016; Zhuang, Zhai, & Yamins, 2019; Tarvainen & Valpola, 2017; Zhuang, Ding, et al., 2019). Although the performance of these models is still below that of their supervised counterparts, recent progress with deep embedding methods has shown substantial promise in bridging this gap (Caron et al., 2018; Wu et al., 2018; Zhuang, Zhai, & Yamins, 2019; Zhuang, Ding, et al., 2019).

In this work, we show that the feature representations learned by these state-of-the-art self- and semi-supervised DCNNs model the adult ventral visual stream just as effectively as those of category-supervised networks. This, in turn, allows us to design a training curriculum that better simulates the real developmental trajectory of the visual system. More specifically, we first train DCNNs on self-supervised visual tasks, simulating the learning in early infancy when supervision is entirely absent (Cooper & Aslin, 1989). We find that the best of these "self-DCNNs", based on the state-of-the-art Local Aggregation approach to training deep embeddings (Zhuang, Zhai, & Yamins, 2019), predicts neural responses in primate V4 and IT areas with high accuracy, mapping these brain areas to anatomically reasonable intermediate and late hidden layers respectively. In fact, this self-DCNN has somewhat better V4 and IT neural prediction performance than its supervised counterpart. Unsurprisingly, however, the visual categorization behavior produced by the output layer of the task-generic self-DCNN is somewhat less consistent with that of humans and monkeys than that of networks trained on categorization tasks. To address this gap, we then train the self-DCNNs using semi-supervised algorithms with a very small number of labelled data points, simulating late infancy when a modest amount of supervision occurs (Cooper & Aslin, 1989; Topping, Dekhinet, & Zeedyk, 2013). The resulting "semi-DCNNs" produce behavioral outputs that are substantially more similar to those of humans and monkeys than those of self-DCNNs, approaching the behavioral consistency levels of supervised networks. Taken together, our results suggest that this "self-then-semi-supervised" curriculum not only produces accurate models of the ventral visual stream, but also may serve as a candidate quantitative hypothesis for neural changes during visual development.

Results

Our curriculum approximates visual development using two stages: a self-supervised stage and a semi-supervised stage (see Fig. 1A). To evaluate the effectiveness of the trained DCNNs in modeling the ventral visual stream, we report their neural prediction performance for neurons in V4 and IT cortical areas (Schrimpf et al., 2018) (see Fig. 1B) and their visual categorization behavioral consistency with humans and monkeys (see Fig. 1C).
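To make the neural prediction metric concrete, here is a minimal NumPy sketch: closed-form ridge regression stands in for the regularized linear regression of Klindt et al. (2017), and synthetic arrays stand in for DCNN layer activations and neural recordings. The function name `neural_predictivity` and all hyperparameters are our own illustrative choices, not the authors' code.

```python
import numpy as np

def neural_predictivity(feats, responses, train_frac=0.8, alpha=1.0, seed=0):
    """Median held-out Pearson correlation between ridge-predicted and
    recorded responses, computed per neuron (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(feats.shape[0])
    n_train = int(train_frac * feats.shape[0])
    tr, te = idx[:n_train], idx[n_train:]

    # Ridge regression, closed form: W = (X^T X + alpha I)^-1 X^T Y
    X, Y = feats[tr], responses[tr]
    W = np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ Y)
    pred = feats[te] @ W

    # Pearson correlation for each neuron on held-out validation images
    rs = [np.corrcoef(pred[:, j], responses[te, j])[0, 1]
          for j in range(responses.shape[1])]
    return float(np.median(rs))

# Synthetic demo: "responses" are a noisy linear readout of the "features"
rng = np.random.default_rng(1)
feats = rng.normal(size=(500, 64))        # images x model units
readout = rng.normal(size=(64, 20))       # units x neurons
responses = feats @ readout + 0.1 * rng.normal(size=(500, 20))
print(round(neural_predictivity(feats, responses), 3))
```

Because the synthetic responses really are a linear function of the features plus small noise, the median held-out correlation here is close to 1; real DCNN-to-neuron fits are far lower.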
Figure 1: Illustration of the "self-then-semi-supervised" DCNN training curriculum. (A) Training curriculum. Gray dots represent unlabeled visual inputs, red dots labeled visual inputs. (B, C) Evaluation metrics for modeling the ventral visual stream. (B) To compute neural prediction performance for the V4 and IT cortical areas, we first run the DCNNs on the stimuli for which neural responses were collected. DCNN unit activations from each convolutional layer are then used to predict the recorded responses of V4 and IT neurons (Majaj et al., 2015) with regularized linear regression (Klindt et al., 2017). The Pearson correlations between the DCNN-predicted and the recorded responses on held-out validation images are computed for each neuron, and the medians across the neurons in each brain area are reported as that layer's neural prediction performance. (C) To measure behavioral consistency, we follow the method of (Rajalingham et al., 2018; Schrimpf et al., 2018): linear classifiers trained from the model's penultimate layer perform a 24-way alternative forced choice task over common classes ("dog", "wrench", "pen", &c). The resultant image-by-category confusion matrix is compared to data from humans and monkeys performing the same task on the same image set, and the Pearson correlation between the summary statistics of the two is reported as the "behavioral predictivity" metric.

Self-supervised stage

In this stage, we compare several representative self-DCNNs, which represent different hypotheses for visual learning in early infancy. These algorithms can be organized into two families: Single-Image Statistics (SIS) algorithms and Multi-Image Distribution (MID) algorithms (Fig. 2A, B). SIS algorithms, including relative position (Doersch et al., 2015), image colorization (Zhang et al., 2016), and surface-normals/depth estimation (Laina et al., 2016), supervise models with statistics either directly extracted from the visual inputs or available from other biological sensory signals. MID algorithms instead optimize visual representations so that the feature distribution across many inputs meets some specific requirement. We find that SIS-based self-DCNNs accurately predict neural responses for V4 neurons (but not for IT neurons), while MID-based self-DCNNs show high levels of neural prediction performance for both V4 and IT neurons (see Fig. 2C). The self-DCNN based on Local Aggregation (LA), which has recently been shown to achieve state-of-the-art unsupervised visual recognition performance (Zhuang, Zhai, & Yamins, 2019), also achieves optimal neural prediction performance. All self-DCNNs show a similar brain mapping correspondence, with mid-level representations best predicting V4 neural responses and high-level representations best predicting IT neural responses (see Fig. 2D).

Semi-supervised stage

Although the internal layers of the LA-DCNN show high accuracy in predicting V4 and IT neural responses, its final-layer outputs are less consistent with fine-grained metrics of the visual categorization behavior of humans and monkeys than those of category-supervised DCNNs (see Fig. 3C). To fill this gap, we further train the LA-DCNN using semi-supervision with only 3% of the labels in the ImageNet dataset (36k distinct labelled datapoints), a number that is consistent with developmental measurements (e.g., Topping et al., 2013). We test two recent high-performing semi-supervised algorithms: Mean Teacher (MT; Tarvainen & Valpola, 2017; see Fig. 3A) and Local Label Propagation (LLP; Zhuang, Ding, et al., 2019; see Fig. 3B). We find that LLP is significantly more behaviorally consistent than MT and than all the unsupervised networks (see Fig. 3C). This result is consistent with the fact that LLP substantially outperforms MT on visual tasks, especially in the low-label regime (Zhuang, Ding, et al., 2019).

Discussion and Future Directions

While our results represent progress toward a plausible account of how unsupervised and semi-supervised visual experience may shape higher visual representations, there is still a significant behavioral consistency gap between the LLP-DCNN and the best supervised DCNNs (Rajalingham et al., 2018). We hope future semi-supervised algorithms can help bridge this gap.
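The Mean Teacher update used in the semi-supervised stage can be sketched for a tiny linear softmax classifier. This is a deliberate simplification: the consistency penalty below acts on logits rather than on softmax outputs, the model is linear rather than a DCNN, and all names and hyperparameters are our own assumptions rather than the authors' settings.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def mean_teacher_step(W_s, W_t, x_lab, y_lab, x_unlab, lr=0.1, ema=0.99, lam=1.0):
    """One Mean-Teacher-style update for a linear softmax classifier.
    The student fits the few labeled points; a consistency penalty keeps
    its predictions on unlabeled points close to the EMA teacher's."""
    onehot = np.eye(W_s.shape[1])[y_lab]
    # Cross-entropy gradient on the labeled minibatch
    g_ce = x_lab.T @ (softmax(x_lab @ W_s) - onehot) / len(x_lab)
    # Consistency gradient: pull student logits toward teacher logits
    g_con = x_unlab.T @ (x_unlab @ (W_s - W_t)) / len(x_unlab)
    W_s = W_s - lr * (g_ce + lam * g_con)
    # Teacher weights: exponential moving average of the student's
    W_t = ema * W_t + (1 - ema) * W_s
    return W_s, W_t

# Demo: 4 labeled points, 200 unlabeled points, two Gaussian classes
rng = np.random.default_rng(0)
x_unlab = np.vstack([rng.normal(-2, 0.5, (100, 2)), rng.normal(2, 0.5, (100, 2))])
y_true = np.repeat([0, 1], 100)
x_lab = np.array([[-2.0, -2.0], [2.0, 2.0], [-1.5, -2.5], [2.5, 1.5]])
y_lab = np.array([0, 1, 0, 1])
W_s = W_t = np.zeros((2, 2))
for _ in range(200):
    W_s, W_t = mean_teacher_step(W_s, W_t, x_lab, y_lab, x_unlab)
acc = (softmax(x_unlab @ W_t).argmax(axis=1) == y_true).mean()
print(f"teacher accuracy on unlabeled data: {acc:.2f}")
```

Even with only four labels, the averaged teacher ends up classifying the unlabeled cloud well, which is the intuition behind using such schemes when labeled data is scarce.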

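At its core, the behavioral consistency comparison of Fig. 1C reduces to correlating a model's confusion pattern with the primates' confusion pattern. Here is a minimal sketch that uses plain Pearson correlation on flattened confusion matrices; the actual metric of Rajalingham et al. (2018) additionally works with per-image d' statistics and noise-corrects the correlation, and the variable names below are our own.

```python
import numpy as np

def behavioral_consistency(conf_model, conf_primate):
    """Pearson correlation between two flattened image-by-category
    confusion patterns (a simplified stand-in for behavioral predictivity)."""
    a = conf_model.ravel() - conf_model.mean()
    b = conf_primate.ravel() - conf_primate.mean()
    return float(a @ b / np.sqrt((a @ a) * (b @ b)))

# Demo with a made-up 24-way confusion structure
rng = np.random.default_rng(0)
n_cat = 24
primate = np.eye(n_cat) * 0.7 + rng.random((n_cat, n_cat)) * 0.05
model_close = primate + rng.normal(0, 0.01, primate.shape)  # similar error pattern
model_far = rng.random((n_cat, n_cat))                      # unrelated error pattern
print(behavioral_consistency(model_close, primate) >
      behavioral_consistency(model_far, primate))
```

A model that merely matches overall accuracy can still score poorly here: the metric rewards making the same pattern of confusions that the primates make.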
Figure 2: Self-supervised visual tasks and neural prediction. (A) Single-Image Statistics (SIS) tasks. For Relative Position (RP) (Doersch et al., 2015), the DCNN receives two patches from one image and classifies their spatial organization. For Colorization (Col) (Zhang et al., 2016), the DCNN receives a gray-scale image and predicts its per-pixel RGB values. For Depth Prediction (DP), the DCNN receives an RGB image and predicts the per-pixel relative depth generated by standardizing the depth across the whole image; the implementation follows (Laina et al., 2016). (B) Schematic for Multi-Image Distribution (MID) algorithms. All images are first embedded into a lower-dimensional space using DCNNs. For Deep Cluster (DC) (Caron et al., 2018), K-means is applied to the embeddings and the cluster labels are used as category labels to train the DCNN. For Instance Recognition (IR) (Wu et al., 2018), the DCNN is optimized to maximize the distances between the embedding of the current input (red dot) and the embeddings of all the other images (gray dots). For Local Aggregation (LA) (Zhuang, Zhai, & Yamins, 2019), the DCNN is optimized to minimize the distance to "close" embedding points (blue dots) and to maximize the distance to the "further" embedding points (black dots) for the current input (red dot). (C) V4 and IT neural prediction performance of the best layer in DCNNs trained using different self-supervised tasks. "UT" represents an untrained ResNet-18. "Cat" represents a ResNet-18 trained on the ImageNet categorization task. Error bars represent standard deviation computed from 5 independent train-validation splits. (D) V4 and IT neural prediction performance of all layers of the LA-trained DCNN. The x-axis represents the number of convolutional layers away from the inputs.
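The Local Aggregation idea in panel B can be caricatured for a single sample as a contrast between affinities to close and background embeddings. The sketch below is a simplified illustration only: it omits the memory bank and the cluster-based identification of close neighbors used by the actual LA algorithm, and the function names are our own.

```python
import numpy as np

def la_style_loss(v, close, background, temp=0.1):
    """Simplified Local-Aggregation-style objective for one embedding v:
    raise the affinity to "close" neighbor embeddings relative to
    "further" background embeddings (all rows assumed L2-normalized)."""
    aff_close = np.exp(close @ v / temp).sum()
    aff_all = aff_close + np.exp(background @ v / temp).sum()
    return float(-np.log(aff_close / aff_all))

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Loss is low when v sits among its close neighbors,
# and high when it drifts toward the background points
close = l2norm(np.array([[1.0, 0.1], [1.0, -0.1]]))
background = l2norm(np.array([[-1.0, 0.2], [-1.0, -0.2]]))
near = la_style_loss(l2norm(np.array([1.0, 0.0])), close, background)
far = la_style_loss(l2norm(np.array([-1.0, 0.0])), close, background)
print(near < far)
```

Minimizing this kind of objective over many samples pulls locally similar images together while pushing dissimilar ones apart, which is the embedding-space behavior sketched for LA in Fig. 2B.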
Moreover, the current DCNN training uses ImageNet images, whose statistics are likely to be meaningfully different from the visual inputs infants perceive during visual development. For example, DCNNs receive static and independent images during training, whereas infants receive a continuous stream of inputs that are temporally correlated and object-centric (Bambach, Crandall, Smith, & Yu, 2017). We hope to incorporate these ideas, and tighten the connection to developmental data, in future work.

References

Atkinson, J. (2002). The developing visual brain.

Bambach, S., Crandall, D. J., Smith, L. B., & Yu, C. (2017). An egocentric perspective on active vision and visual object learning in toddlers. In 2017 ICDL-EpiRob (pp. 290–295).

Bourne, J. A., & Rosa, M. G. (2005). Hierarchical development of the primate visual cortex, as revealed by neurofilament immunoreactivity: early maturation of the middle temporal area (MT). Cerebral Cortex, 16(3), 405–414.

Braddick, O., & Atkinson, J. (2011). Development of human visual function. Vision Research, 51(13), 1588–1609.

Cadena, S. A., Denfield, G. H., Walker, E. Y., Gatys, L. A., Tolias, A. S., Bethge, M., & Ecker, A. S. (2019). Deep convolutional models improve predictions of macaque V1 responses to natural images. PLoS Computational Biology, 15(4), e1006897.

Caron, M., Bojanowski, P., Joulin, A., & Douze, M. (2018). Deep clustering for unsupervised learning of visual features. In ECCV (pp. 132–149).

Cooper, R. P., & Aslin, R. N. (1989). The language environment of the young infant: Implications for early perceptual development. Canadian Journal of Psychology/Revue canadienne de psychologie, 43(2), 247.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In CVPR.

Doersch, C., Gupta, A., & Efros, A. A. (2015). Unsupervised visual representation learning by context prediction. In ICCV (pp. 1422–1430).

Harwerth, R. S., Smith, E. L., Duncan, G. C., Crawford, M., & Von Noorden, G. K. (1986). Multiple sensitive periods in the development of the primate visual system. Science, 232(4747), 235–238.

Klindt, D., Ecker, A. S., Euler, T., & Bethge, M. (2017). Neural system identification for large populations separating "what" and "where". In Advances in Neural Information Processing Systems (pp. 3506–3516).

Kriegeskorte, N. (2015). Deep neural networks: a new framework for modeling biological vision and brain information processing. Annual Review of Vision Science, 1, 417–446.

Laina, I., Rupprecht, C., Belagiannis, V., Tombari, F., & Navab, N. (2016). Deeper depth prediction with fully convolutional residual networks. In 2016 Fourth International Conference on 3D Vision (3DV) (pp. 239–248).

Majaj, N. J., Hong, H., Solomon, E. A., & DiCarlo, J. J. (2015). Simple learned weighted sums of inferior temporal neuronal firing rates accurately predict human core object recognition performance. Journal of Neuroscience, 35(39), 13402–13418.

Rajalingham, R., Issa, E. B., Bashivan, P., Kar, K., Schmidt, K., & DiCarlo, J. J. (2018). Large-scale, high-resolution comparison of the core visual object recognition behavior of humans, monkeys, and state-of-the-art deep artificial neural networks. Journal of Neuroscience, 38(33), 7255–7269.

Schrimpf, M., Kubilius, J., Hong, H., Majaj, N. J., Rajalingham, R., Issa, E. B., ... DiCarlo, J. J. (2018). Brain-Score: Which artificial neural network for object recognition is most brain-like? bioRxiv preprint.

Tarvainen, A., & Valpola, H. (2017). Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in Neural Information Processing Systems (pp. 1195–1204).

Topping, K., Dekhinet, R., & Zeedyk, S. (2013). Parent–infant interaction and children's language development. Educational Psychology, 33(4), 391–426.

Wu, Z., Xiong, Y., Yu, S. X., & Lin, D. (2018). Unsupervised feature learning via non-parametric instance discrimination. In CVPR (pp. 3733–3742).

Yamins, D. L., Hong, H., Cadieu, C. F., Solomon, E. A., Seibert, D., & DiCarlo, J. J. (2014). Performance-optimized hierarchical models predict neural responses in higher visual cortex. Proceedings of the National Academy of Sciences, 111(23), 8619–8624.

Zhang, R., Isola, P., & Efros, A. A. (2016). Colorful image colorization. In ECCV (pp. 649–666).

Zhuang, C., Ding, X., Murli, D., & Yamins, D. (2019). Local label propagation for large-scale semi-supervised learning. arXiv preprint arXiv:1905.11581.

Zhuang, C., Zhai, A. L., & Yamins, D. (2019). Local aggregation for unsupervised learning of visual embeddings. arXiv preprint arXiv:1903.12355.

Figure 3: Semi-supervision algorithms and behavioral consistency. (A) Schematic for the Mean Teacher (MT) algorithm (Tarvainen & Valpola, 2017). During training, the "student DCNN" is optimized to correctly categorize labeled images and to produce similar predictions to another "teacher DCNN", whose weights are exponential averages of the student DCNN's weights. (B) Schematic for Local Label Propagation (LLP) (Zhuang, Ding, et al., 2019). All images are first embedded into a lower-dimensional space (middle panel; colored points represent labeled images and unfilled points represent unlabeled images). For each unlabeled image (highlighted unfilled point in the middle panel), a group of labeled images nearby in embedding space is identified (highlighted colored points), and a pseudolabel is inferred by weighting their labels according to their distances to the unfilled point as well as their local neighbor densities. The inferred pseudolabels are then used to optimize the DCNN. (C) Behavioral consistency of DCNNs trained by different tasks. "3% Cat" represents a ResNet-18 trained on only the 3% of ImageNet images that are labeled. The standard deviation of these measures across multiple runs is typically around 0.001.
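The pseudolabel inference at the heart of LLP (Fig. 3B) can be sketched as a distance-weighted vote over nearby labeled embeddings. This is an illustration of the idea only: the full algorithm also weights votes by local neighbor density and then trains the DCNN on the inferred labels, and the function name and parameters below are our own.

```python
import numpy as np

def propagate_pseudolabel(z_u, z_lab, y_lab, n_classes, temp=0.5):
    """LLP-style pseudolabel for one unlabeled embedding z_u:
    a distance-weighted vote over nearby labeled embeddings."""
    dists = np.linalg.norm(z_lab - z_u, axis=1)
    weights = np.exp(-dists / temp)           # closer labeled points count more
    votes = np.zeros(n_classes)
    np.add.at(votes, y_lab, weights)          # accumulate per-class evidence
    return int(votes.argmax()), votes / votes.sum()

# Demo: two labeled clusters in a 2-D embedding space
z_lab = np.array([[0.0, 0.0], [0.2, 0.1], [3.0, 3.0], [3.1, 2.9]])
y_lab = np.array([0, 0, 1, 1])
label, probs = propagate_pseudolabel(np.array([2.8, 3.2]), z_lab, y_lab, 2)
print(label)  # → 1
```

The returned vote distribution also gives a natural confidence estimate, which is one reason propagation-style schemes can use a handful of labels to supervise a much larger unlabeled set.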