Fei-Fei, Imagenet: a Large-Scale Hierarchical Image Database
Total Page:16
File Type:pdf, Size:1020Kb
Where have we been? Where are we going? LI F E I – F EI The Beginning: CVPR 2009 J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li and L. Fei-Fei, ImageNet: A Large-Scale Hierarchical Image Database. IEEE Computer Vision and Pattern Recognition (CVPR), 2009. The Impact of on Google Scholar 4,386 Citations 2,847 Citations …and many more. From Challenge Contestants to Startups A Revolution in Deep Learning Why Deep Learning is Suddenly Changing Your Life By Roger Parloff The Great Artificial Intelligence Awakening By Gideon Lewis-KrausThe data that transformed AI research—and possibly the world By Dave Gershgorn “The of x” SpaceNet MusicNet Medical ImageNet DigitalGlobe, CosmiQ Works, NVIDIA J. Thickstun et al, 2017 Stanford Radiology, 2017 ShapeNet EventNet ActivityNet A.Chang et al, 2015 G. Ye et al, 2015 F. Heilbron et al, 2015 An Explosion of Datasets 1627 276 1919 1MM 4MM Hosted Datasets Commercial Student Data Scientists ML Models Competitions Competitions Submitted “Datasets—not algorithms—might be the key limiting factor to development of human-level artificial intelligence.” A L E X A N D E R W I S S N E R - G R O S S Edge.org, 2016 The Untold History of Hardly the First Image Dataset Segmentation (2001) CMU/VASC Faces (1998) FERET Faces (1998) COIL Objects (1996) MNIST digits (1998-10) D. Martin, C. Fowlkes, D. Tal, J. Malik. H. Rowley, S. Baluja, T. Kanade P. Phillips, H. Wechsler, J. S. Nene, S. Nayar, H. Murase Y LeCun & C. Cortes Huang, P. Raus KTH human action (2004) Sign Language (2008) UIUC Cars (2004) 3D Textures (2005) CuRRET Textures (1999) I. Leptev & B. Caputo P. Buehler, M. Everingham, A. S. Agarwal, A. Awan, D. Roth S. Lazebnik, C. Schmid, J. Ponce K. Dana B. Van Ginneken S. Nayar Zisserman J. Koenderink CAVIAR Tracking (2005) Middlebury Stereo (2002) CalTech 101/256 (2005) LabelMe (2005) ESP (2006) R. Fisher, J. Santos-Victor J. Crowley D. Scharstein R. Szeliski Fei-Fei et al, 2004 Russell et al, 2005 Ahn et al, 2006 GriffIn et al, 2007 MSRC (2006) PASCAL (2007) Lotus Hill TinyImage (2008) Shotton et al. 2006 Everingham et al, 2009 (2007) Torralba et al. 2008 Yao et al, 2007 A Profound Machine Learning Problem Within Visual Learning Machine Learning 101: Complexity, Generalization, Overfitting Error Underfitting Overfitting Zone Zone Generalization Error Training Generalization Error Gap Optimal Capacity Capacity One-Shot Learning Fei-Fei et al, 2003, 2004 Fei-Fei et al, 2003, 2004 How Children Learn to See Error Underfitting Overfitting Zone Zone Generalization Error Training Generalization Error Gap Optimal Capacity Capacity A new way of thinking… To shift the focus of Machine Learning for visual recognition from …to data. modeling… Lots of data. Internet Data Growth 1990-2010 15,000 11,250 7,500 3,750 Global Data Traffic (PB/month) Source: Cisco What is WordNet? Establishes Organizes over ontological and 150,000 words into lexical relationships 117,000 categories Original paper by in NLP and related called synsets. [George Miller, et tasks. al 1990] cited over 5,000 times Christiane Fellbaum Senior Research Scholar Computer Science Department, Princeton President, Global WordNet Consortium Individually Illustrated WordNet Nodes jacket: a short coat German shepherd: breed of large shepherd dogs used in A massive ontology of police work and as a guide for the blind. images to transform computer vision microwave: kitchen appliance that cooks food by passing an electromagnetic wave through it. mountain: a land mass that projects well above its surroundings; higher than a hill. Comrades Prof. Kai Li Jia Deng Princeton 1st Ph.D. student Princeton Entity Step 1: Ontological Mammal structure based on WordNet Dog German Shepherd Dog German Step 2: Populate categories Shepherd with thousands of images from the Internet Dog German Step 3: Clean results by Shepherd hand Three Attempts at Launching 1st Attempt: The Psychophysics Experiment ImageNet PhD Students Miserable Undergrads 1st Attempt: The Psychophysics Experiment • # of synsets: 40,000 (subject to: imageability analysis) • # of candidate images to label per synset: 10,000 • # of people needed to verify: 2-5 • Speed of human labeling: 2 images/sec (one fixation: ~200msec) • Massive parallelism (N ~ 10^2-3) ≈ 19 years 40,000 × 10,000 × 3 / 2 = 6000,000,000 sec N 2nd Attempt: Human-in-the-Loop Solutions 2nd Attempt: Human-in-the-Loop Solutions Machine-generated Human-generated datasets can only match datasets transcend the best algorithms of algorithmic limitations, the time. leading to better machine perception. 3rd Attempt: A Godsend Emerges ImageNet PhD Students Crowdsourced Labor 49k Workers from 167 Countries 2007-2010 The Result: Goes Live in 2009 What We Did Right While Others Targeted Detail… LabelMe Lotus Hill Per-Object Regions and Labels Hand-Traced Parse Trees Russell et al, 2005 Yao et al, 2007 …We Targeted Scale SUN, 131K [Xiao et al. ‘10] LabelMe, 37K [Russell et al. ’07] 15M PASCAL VOC, 30K [Deng et al. ’09] [Everingham et al. ’06-’12] Caltech101, 9K [Fei-Fei, Fergus, Perona, ‘03] Additional Goals Carnivore - Canine - Dog - Working Dog - Husky High High-Quality Free of Resolution Annotation Charge To better replicate human visual To create a benchmarking dataset To ensure immediate application and acuity and advance the state of machine a sense of community perception, not merely reflect it An Emphasis on Community and Achievement Large Scale Visual Recognition Challenge (ILSVRC 2010-2017) ILSVRC Contributors Alex Berg Jia Deng Zhiheng Huang Aditya Khosla Jonathan Krause Fei-Fei Li UNC Chapel Hill Univ. of Michigan Stanford Stanford Stanford Stanford Wei Liu Sean Ma Eunbyung Park Olga Russakovsky Sanjeev Satheesh Hao Su UNC Chapel Hill Stanford UNC Chapel Hill Stanford Stanford Stanford Our Inspiration: PASCAL VOC 2005-2012 Our Inspiration: PASCAL VOC Mark Everingham Prize @ ECCV 2016 Mark Everingham 1973-2012 Alex Berg, Jia Deng, Fei-Fei Li, Wei Liu, Olga Russakovsky Participation and Performance 172 157 123 81 35 29 2010 2011 2012 2013 2014 2015 2016 Number of Entries Participation and Performance 0.28 172 157 123 81 0.03 35 29 2010 2011 2012 2013 2014 2015 2016 Number of Classification Entries Errors (top-5) Participation and Performance 0.28 0.66 172 157 123 81 0.03 0.23 35 29 2010 2011 2012 2013 2014 2015 2016 Number of Classification Average Precision Entries Errors (top-5) For Object Detection What we did to make better Lack of Details Lack of Details…ILSVRC Detection Challenge PASCAL ILSVRC Statistics VOC 2012 2013 Object classes 20 10x 200 Images 5.7K 70x 395K Training Objects 13.6K 25x 345K Evaluation of ILSVRC Detection Need to annotate the presence of all classes (to penalize false detections) Table Chair Horse Dog Cat Bird # images: 400K # classes: 200 + + - - - - # annotations = 80M! + - - - + - + + - - - - Evaluation of ILSVRC Detection Hierarchical annotation J. Deng, O. Russakovsky, J. Krause, M. Bernstein, A. Berg, & L. Fei-Fei. CHI, 2014 What does classifying 10K+ classes tell us? J. Deng, A. Berg & L. Fei-Fei, ECCV, 2010 Fine-Grained Recognition “Cardigan Welsh Corgi” “Pembroke Welsh Corgi” Fine-Grained Recognition cars [Gebru, Krause, Deng, Fei-Fei, CHI 2017] 2567 classes 700k images Expected Outcomes Machine learning ImageNet becomes a Breakthroughs in advances and changes benchmark object recognition dramatically Unexpected Outcomes Neural Nets are Cool Again! 13,259 Citations Krizhevsky, Sutskever & Hinton, NIPS 2012 …And Cooler and Cooler “AlexNet” “GoogLeNet” “VGG Net” “ResNet” [Krizhevsky et al. NIPS 2012] [Szegedy et al. CVPR 2015] [Simonyan & Zisserman, [He et al. CVPR 2016] ICLR 2015] Neural Nets A Deep Learning Revolution GPUs Ontological Structure Structure Not Used as Much Thing is a Animalia Chordate Arthropoda Mammal Insect Wombat is a Primate Carnivora Diptera Marsupial Hominidae Pongidae Felidae Muscidae Homo Pan Felis Musca is a Sapiens Troglodytes Domestica Leo Domestica Wombat Human Chimpanzee House Cat Lion Housefly Deng, Krause, Berg & Fei-Fei, CVPR 2012 Thing Thing is a Animalia Animal Chordate Arthropoda Mammal Insect Mammal Wombat is a Primate Carnivora Diptera Marsupial Hominidae Pongidae Felidae Muscidae Marsupial Homo Pan Felis Musca is a Sapiens Troglodytes Domestica Leo Domestica Wombat Human Chimpanzee House Cat Lion Housefly Wombat Deng, Krause, Berg & Fei-Fei, CVPR 2012 Thing is a Animalia Chordate Arthropoda Mammal Insect WombatMaximize Specificity ( f ) is a Primate Carnivora Diptera Subject to Accuracy ( f ) ≥ 1 - ε Marsupial Hominidae Pongidae Felidae Muscidae Homo Pan Felis Musca is a Sapiens Troglodytes Domestica Leo Domestica Wombat Human Chimpanzee House Cat Lion Housefly Deng, Krause, Berg & Fei-Fei, CVPR 2012 Optimizing with a Knowledge Ontology Results in Big Gains in Information at Arbitrary Accuracy Our Model Deng, Krause, Berg & Fei-Fei, CVPR 2012 Relatively Few Works Have Used Ontology ECCV 2012 Best paper Award Kuettel, Guillaumin, Ferrari. Segmentation Propagation in ImageNet. ECCV 2012 Most works still use 1M images to do pre-training 1M Images 15M Images Total “First, we find that the performance on vision tasks still increases linearly with orders of magnitude of training data size.” C. Sun et al, 2017 How Humans Compare Andrej Karpathy. http://karpathy.github.io/2014/09/02/what-i-learned-from-competing-against-a-convnet-on-imagenet/ How Humans Compare Human GoogLeNet 5.1% 6.8% Top-5 error rate Top-5 error rate Susceptible to: Susceptible to: • Fine-grained recognition • Small, thin objects • Class unawareness • Image filters • Insufficient training data • Abstract representations • Miscellaneous sources Andrej Karpathy. http://karpathy.github.io/2014/09/02/what-i-learned-from-competing-against-a-convnet-on-imagenet/ What Lies Ahead Moving from object recognition… room person person person person person scale …to human-level understanding. room person Watching and laughing person Wants Standing on to weigh person himself scale Stepping on Wants to play a prank Stepping on a scale adds weight and ups the reading.