Where have we been? Where are we going?
Fei-Fei Li
The Beginning: CVPR 2009
J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li and L. Fei-Fei, ImageNet: A Large-Scale Hierarchical Image Database. IEEE Computer Vision and Pattern Recognition (CVPR), 2009.
The Impact of ImageNet on Google Scholar
4,386 Citations
2,847 Citations
…and many more.
From Challenge Contestants to Startups
A Revolution in Deep Learning
Why Deep Learning is Suddenly Changing Your Life
By Roger Parloff
The Great Artificial Intelligence Awakening
By Gideon Lewis-Kraus
The data that transformed AI research—and possibly the world
By Dave Gershgorn
“The ImageNet of x”
SpaceNet: DigitalGlobe, CosmiQ Works, NVIDIA
MusicNet: J. Thickstun et al, 2017
Medical ImageNet: Stanford Radiology, 2017
ShapeNet: A. Chang et al, 2015
EventNet: G. Ye et al, 2015
ActivityNet: F. Heilbron et al, 2015
An Explosion of Datasets
1627 Hosted Datasets
276 Commercial Competitions
1919 Student Competitions
1MM Data Scientists
4MM ML Models Submitted
“Datasets—not algorithms—might be the key limiting factor to development of human-level artificial intelligence.”
Alexander Wissner-Gross, Edge.org, 2016
The Untold History of ImageNet
Hardly the First Image Dataset
Segmentation (2001): D. Martin, C. Fowlkes, D. Tal, J. Malik
CMU/VASC Faces (1998): H. Rowley, S. Baluja, T. Kanade
FERET Faces (1998): P. Phillips, H. Wechsler, J. Huang, P. Raus
COIL Objects (1996): S. Nene, S. Nayar, H. Murase
MNIST digits (1998-10): Y. LeCun & C. Cortes
KTH human action (2004): I. Laptev & B. Caputo
Sign Language (2008): P. Buehler, M. Everingham, A. Zisserman
UIUC Cars (2004): S. Agarwal, A. Awan, D. Roth
3D Textures (2005): S. Lazebnik, C. Schmid, J. Ponce
CUReT Textures (1999): K. Dana, B. Van Ginneken, S. Nayar, J. Koenderink
CAVIAR Tracking (2005): R. Fisher, J. Santos-Victor, J. Crowley
Middlebury Stereo (2002): D. Scharstein, R. Szeliski
CalTech 101/256 (2005): Fei-Fei et al, 2004; Griffin et al, 2007
LabelMe (2005): Russell et al, 2005
ESP (2006): Ahn et al, 2006
MSRC (2006): Shotton et al. 2006
PASCAL (2007): Everingham et al, 2009
Lotus Hill (2007): Yao et al, 2007
TinyImage (2008): Torralba et al. 2008
A Profound Machine Learning Problem Within Visual Learning
Machine Learning 101: Complexity, Generalization, Overfitting
[Figure: error vs. capacity, showing training error and generalization error, the underfitting and overfitting zones, the generalization gap, and the optimal capacity]
One-Shot Learning
Fei-Fei et al, 2003, 2004
How Children Learn to See
[Figure repeated: error vs. capacity, with underfitting and overfitting zones, generalization gap, and optimal capacity]
A new way of thinking…
To shift the focus of Machine Learning for visual recognition
from modeling…
…to data. Lots of data.
Internet Data Growth 1990-2010
[Figure: Global Data Traffic (PB/month), 1990-2010, y-axis from 3,750 to 15,000. Source: Cisco]
What is WordNet?
Establishes ontological and lexical relationships in NLP and related tasks.
Organizes over 150,000 words into 117,000 categories called synsets.
Original paper by [George Miller, et al 1990] cited over 5,000 times
Christiane Fellbaum
Senior Research Scholar, Computer Science Department, Princeton
President, Global WordNet Consortium
Individually Illustrated WordNet Nodes
jacket: a short coat
German shepherd: breed of large shepherd dogs used in police work and as a guide for the blind.
microwave: kitchen appliance that cooks food by passing an electromagnetic wave through it.
mountain: a land mass that projects well above its surroundings; higher than a hill.
A massive ontology of images to transform computer vision
Comrades
Prof. Kai Li, Princeton
Jia Deng, 1st Ph.D. student, Princeton
Step 1: Ontological structure based on WordNet (Entity > Mammal > Dog > German Shepherd)
Step 2: Populate categories with thousands of images from the Internet
Step 3: Clean results by hand
Three Attempts at Launching ImageNet
1st Attempt: The Psychophysics Experiment
ImageNet PhD Students
Miserable Undergrads
• # of synsets: 40,000 (subject to: imageability analysis)
• # of candidate images to label per synset: 10,000
• # of people needed to verify: 2-5
• Speed of human labeling: 2 images/sec (one fixation: ~200msec)
• Massive parallelism (N ~ 10^2 to 10^3)
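As a rough check on what these figures imply, the sketch below multiplies them out. It assumes 3 verifiers per image (the middle of the 2-5 range) and a 365.25-day year; the function and parameter names are illustrative, not from the talk.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60


def labeling_seconds(num_synsets, images_per_synset, verifiers_per_image,
                     images_per_second, num_workers=1):
    """Wall-clock seconds needed to verify every candidate image."""
    total_judgments = num_synsets * images_per_synset * verifiers_per_image
    return total_judgments / images_per_second / num_workers


# One labeler working non-stop (assumption: 3 verifications per image).
serial = labeling_seconds(40_000, 10_000, 3, 2)
print(f"{serial:,.0f} sec = {serial / SECONDS_PER_YEAR:.1f} years")

# The same workload spread over a hypothetical N = 100 parallel labelers.
parallel = labeling_seconds(40_000, 10_000, 3, 2, num_workers=100)
print(f"{parallel / 86_400:.0f} days with 100 workers")
```

With a single labeler the total comes to 600,000,000 seconds, about 19 years; the parallel case shows why N matters so much to the estimate.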
40,000 × 10,000 × 3 / 2 = 600,000,000 sec ≈ 19 years for a single labeler (divide by N for N parallel labelers)
2nd Attempt: Human-in-the-Loop Solutions
Machine-generated datasets can only match the best algorithms of the time.
Human-generated datasets transcend algorithmic limitations, leading to better machine perception.
3rd Attempt: A Godsend Emerges
ImageNet PhD Students
Crowdsourced Labor
49k Workers from 167 Countries, 2007-2010
The Result: ImageNet Goes Live in 2009
What We Did Right
While Others Targeted Detail…
LabelMe: Per-Object Regions and Labels (Russell et al, 2005)
Lotus Hill: Hand-Traced Parse Trees (Yao et al, 2007)
…We Targeted Scale
SUN, 131K [Xiao et al. ’10]
LabelMe, 37K [Russell et al. ’07]
PASCAL VOC, 30K [Everingham et al. ’06-’12]
Caltech101, 9K [Fei-Fei, Fergus, Perona, ’03]
ImageNet, 15M [Deng et al. ’09]
Additional Goals
Carnivore - Canine - Dog - Working Dog - Husky
High Resolution: To better replicate human visual acuity
High-Quality Annotation: To create a benchmarking dataset and advance the state of machine perception, not merely reflect it
Free of Charge: To ensure immediate application and a sense of community
An Emphasis on Community and Achievement
Large Scale Visual Recognition Challenge (ILSVRC 2010-2017)
ILSVRC Contributors
Alex Berg (UNC Chapel Hill)
Jia Deng (Univ. of Michigan)
Zhiheng Huang (Stanford)
Aditya Khosla (Stanford)
Jonathan Krause (Stanford)
Fei-Fei Li (Stanford)
Wei Liu (UNC Chapel Hill)
Sean Ma (Stanford)
Eunbyung Park (UNC Chapel Hill)
Olga Russakovsky (Stanford)
Sanjeev Satheesh (Stanford)
Hao Su (Stanford)
Our Inspiration: PASCAL VOC
2005-2012
Mark Everingham, 1973-2012
Mark Everingham Prize @ ECCV 2016: Alex Berg, Jia Deng, Fei-Fei Li, Wei Liu, Olga Russakovsky
Participation and Performance
[Figure: ILSVRC 2010-2016. Number of entries per year (values shown: 29, 35, 81, 123, 157, 172); classification error (top-5) falling from 0.28 to 0.03; average precision for object detection rising from 0.23 to 0.66]
What we did to make ImageNet better
Lack of Details
Lack of Details… ILSVRC Detection Challenge
Statistics          PASCAL VOC 2012   ILSVRC 2013
Object classes      20                200 (10x)
Training images     5.7K              395K (70x)
Training objects    13.6K             345K (25x)
Evaluation of ILSVRC Detection
Need to annotate the presence of all classes (to penalize false detections)
           Table  Chair  Horse  Dog  Cat  Bird
Image 1    +      +      -      -    -    -
Image 2    +      -      -      -    +    -
Image 3    +      +      -      -    -    -
# images: 400K
# classes: 200
# annotations = 80M!
Hierarchical annotation
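The 80M figure is simply images times classes. The sketch below reproduces that count and contrasts it with one way a hierarchy of yes/no questions could reduce the work: ask about a parent category first, and ask about its children only when the parent is present. The "furniture" and "animal" groupings, the function names, and the three example images (taken from the +/- rows above) are illustrative assumptions, not the procedure from the CHI 2014 paper.

```python
# Hypothetical two-level hierarchy: parent question -> child classes.
HIERARCHY = {
    "furniture": ["table", "chair"],
    "animal": ["horse", "dog", "cat", "bird"],
}

# Classes present in each example image (the three +/- rows on the slide).
EXAMPLE_IMAGES = [
    {"table", "chair"},
    {"table", "cat"},
    {"table", "chair"},
]


def exhaustive_questions(num_images, num_classes):
    """One yes/no question per (image, class) pair."""
    return num_images * num_classes


def hierarchical_questions(image_labels, hierarchy):
    """Count questions when children are asked only if the parent is present."""
    total = 0
    for labels in image_labels:
        for children in hierarchy.values():
            total += 1  # the parent-level question is always asked
            if any(child in labels for child in children):
                total += len(children)  # drill down into this group
    return total


print(exhaustive_questions(400_000, 200))                 # 80000000
print(exhaustive_questions(len(EXAMPLE_IMAGES), 6))       # 18
print(hierarchical_questions(EXAMPLE_IMAGES, HIERARCHY))  # 16
```

On this toy example the saving is small (16 questions instead of 18); the benefit grows when there are many classes and most of them are absent from any given image.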
J. Deng, O. Russakovsky, J. Krause, M. Bernstein, A. Berg, & L. Fei-Fei. CHI, 2014
What does classifying 10K+ classes tell us?
J. Deng, A. Berg & L. Fei-Fei, ECCV, 2010
Fine-Grained Recognition
“Cardigan Welsh Corgi” “Pembroke Welsh Corgi”
cars [Gebru, Krause, Deng, Fei-Fei, CHI 2017]
2567 classes, 700k images
Expected Outcomes
ImageNet becomes a benchmark
Breakthroughs in object recognition
Machine learning advances and changes dramatically
Unexpected Outcomes
Neural Nets are Cool Again!
13,259 Citations
Krizhevsky, Sutskever & Hinton, NIPS 2012
…And Cooler and Cooler
“AlexNet” [Krizhevsky et al. NIPS 2012]
“GoogLeNet” [Szegedy et al. CVPR 2015]
“VGG Net” [Simonyan & Zisserman, ICLR 2015]
“ResNet” [He et al. CVPR 2016]
A Deep Learning Revolution
Neural Nets
GPUs
Ontological Structure
Structure Not Used as Much
[Figure: "is a" taxonomy rooted at Thing, descending through Animalia, Chordate and Arthropoda, Mammal and Insect, then orders, families, genera and species, to the leaf classes Wombat, Human, Chimpanzee, House Cat, Lion, Housefly]
Deng, Krause, Berg & Fei-Fei, CVPR 2012
[Figure: the same taxonomy, shown beside the simplified path Thing > Animal > Mammal > Marsupial > Wombat]
Deng, Krause, Berg & Fei-Fei, CVPR 2012
[Figure: the same taxonomy, annotated with the objective below]
Maximize Specificity(f)
Subject to Accuracy(f) ≥ 1 - ε
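To make the specificity-versus-accuracy trade-off concrete, here is a minimal sketch of a simple confidence-threshold heuristic: sum the classifier's leaf probabilities up the tree and return the deepest node whose accumulated probability is at least 1 - ε. This is a stand-in for illustration, not the optimization from the CVPR 2012 paper; the simplified tree, the use of depth as a proxy for specificity, and all names are assumptions.

```python
# Simplified child -> parent map (assumed for illustration).
PARENT = {
    "Wombat": "Marsupial", "Marsupial": "Mammal",
    "Human": "Primate", "Chimpanzee": "Primate", "Primate": "Mammal",
    "House Cat": "Carnivora", "Lion": "Carnivora", "Carnivora": "Mammal",
    "Mammal": "Animal",
    "Housefly": "Insect", "Insect": "Animal",
    "Animal": "Thing",
}


def path_to_root(node):
    """Return [node, parent, ..., root]."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def hedged_label(leaf_probs, epsilon):
    """Deepest node whose accumulated probability is at least 1 - epsilon."""
    mass = {}
    for leaf, prob in leaf_probs.items():
        for node in path_to_root(leaf):
            mass[node] = mass.get(node, 0.0) + prob
    # Small tolerance so floating-point sums just under 1.0 still qualify.
    confident = [n for n, m in mass.items() if m >= 1.0 - epsilon - 1e-9]
    if not confident:
        return "Thing"  # fall back to the root if nothing qualifies
    # Depth (distance from the root) stands in for specificity here.
    return max(confident, key=lambda n: (len(path_to_root(n)), mass[n]))


probs = {"Wombat": 0.6, "Human": 0.3, "Housefly": 0.1}
print(hedged_label(probs, 0.5))   # Wombat
print(hedged_label(probs, 0.2))   # Mammal
print(hedged_label(probs, 0.05))  # Animal
```

As ε shrinks, the answer backs off from the leaf to progressively more general nodes, which is the behavior the objective above is meant to capture.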
Deng, Krause, Berg & Fei-Fei, CVPR 2012
Optimizing with a Knowledge Ontology Results in Big Gains in Information at Arbitrary Accuracy
Our Model
Deng, Krause, Berg & Fei-Fei, CVPR 2012
Relatively Few Works Have Used Ontology
ECCV 2012 Best Paper Award
Kuettel, Guillaumin, Ferrari. Segmentation Propagation in ImageNet. ECCV 2012
Most works still use 1M images to do pre-training
1M Images
15M Images Total
“First, we find that the performance on vision tasks still increases linearly with orders of magnitude of training data size.”
C. Sun et al, 2017
How Humans Compare
Andrej Karpathy. http://karpathy.github.io/2014/09/02/what-i-learned-from-competing-against-a-convnet-on-imagenet/
Human: 5.1% top-5 error rate
Susceptible to:
• Fine-grained recognition
• Class unawareness
• Insufficient training data
GoogLeNet: 6.8% top-5 error rate
Susceptible to:
• Small, thin objects
• Image filters
• Abstract representations
• Miscellaneous sources
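Both figures are top-5 error rates: a prediction counts as correct if the true label appears among the five highest-ranked guesses. A minimal sketch of the metric follows; the function name `top5_error` and the example labels are made up for illustration.

```python
def top5_error(ranked_predictions, true_labels):
    """Fraction of examples whose true label is not in the top five guesses.

    ranked_predictions: one list of labels per example, best guess first.
    true_labels: the correct label for each example.
    """
    if not true_labels or len(ranked_predictions) != len(true_labels):
        raise ValueError("need one non-empty prediction list per label")
    misses = sum(
        truth not in guesses[:5]
        for guesses, truth in zip(ranked_predictions, true_labels)
    )
    return misses / len(true_labels)


ranked = [
    ["husky", "malamute", "wolf", "samoyed", "collie"],      # correct at rank 1
    ["lion", "tiger", "cougar", "lynx", "house cat"],        # correct at rank 5
    ["chair", "table", "desk", "bench", "stool", "wombat"],  # rank 6: a miss
    ["housefly", "bee", "wasp", "ant", "beetle"],            # correct at rank 1
]
truth = ["husky", "house cat", "wombat", "housefly"]
print(top5_error(ranked, truth))  # 0.25
```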
What Lies Ahead
Moving from object recognition…
[Image labels: room, person (×5), scale]
…to human-level understanding.
[Image annotations: room; person: watching and laughing; person: wants to weigh himself, standing on scale; person: wants to play a prank, stepping on scale]
Stepping on a scale adds weight and ups the reading.
Inverse Graphics
Image credit: https://www.youtube.com/watch?v=ip-KIzQmcBo (Oliver Villar)
lady
ImageNet: Deng et al. 2009; COCO: Lin et al. 2014
sky, snow, building, leaves, vest, ski, bag, sunglasses, jacket, head, lady, pole, hat, tree, glove, equipment, boots, coat, …
“A lady in pink dress is skiing.”
“A man standing.”
“A clear blue sky at a ski resort.”
“A snowy hill is in front of pine trees.”
“There are several pine trees.”
“A group of people getting ready to ski.”
Q: What is the man in the center doing? A: Standing on a ski.
Q: What is the color of the sky? A: Blue
Q: Where are the pine trees? A: Behind the hill.
entire universe of images
[Johnson et al., CVPR 2015]
Visual Genome Dataset
A dataset, a knowledge base, an ongoing effort to connect structural image concepts to language.
Specs
• 108,249 images (COCO images)
• 4.2M image descriptions
• 1.8M Visual QA (7W)
• 1.4M objects, 75.7K obj. classes
• 1.5M relationships, 40.5K rel. classes
• 1.7M attributes, 40.5K attr. classes
• Vision and language correspondences
• Everything mapped to WordNet Synset
Goals
• Beyond nouns
  – Objects, verbs, attributes
• Beyond object classification
  – Relationships and contexts
• Sentences and QAs
• From Perception to Cognition
Krishna et al. IJCV 2016
DenseCap & Paragraph Generation: Karpathy et al. CVPR’16; Krause et al. CVPR’17
Relationship Prediction: Krishna et al. ECCV’16; Xu et al. CVPR’17
Image Retrieval w/ Scene Graphs: Johnson et al. CVPR’15
Visual Q&A: Zhu et al. CVPR’16
Q: What is the person sitting on the right of the elephant wearing? A: A blue shirt.
Workshop on Visual Understanding by Learning from Web Data 2017
26 July 2017 | Honolulu, Hawaii, in conjunction with CVPR 2017
http://www.vision.ee.ethz.ch/webvision/workshop.html
The Future of Vision and Intelligence
Vision
Language
Understanding
Action
Agency: The integration of perception, understanding and action
Eight Years of Competitions
2010-2017
10× reduction of image classification error
3× improvement of detection precision
What Happens Now?
We’re passing the baton to Kaggle: a community of more than 1M data scientists.
Why? Democratizing data is vital to democratizing AI.
image-net.org remains live at Stanford.
ImageNet Object Localization Challenge
ImageNet Object Detection Challenge
ImageNet Object Detection from Video Challenge
Contributors/Friends/Advisors
Alex Berg, Michael Bernstein, Edward Chang, Brendan Collins, Jia Deng, Minh Do, Wei Dong, Alexei Efros, Mark Everingham, Christiane Fellbaum, Adam Finkelstein, Thomas Funkhouser, Timnit Gebru, Derek Hoiem, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Jonathan Krause, Fei-Fei Li, Kai Li, Li-Jia Li, Wei Liu, Sean Ma, Xiaojuan Ma, Jitendra Malik, Dan Osherson, Eunbyung Park, Chuck Rosenberg, Olga Russakovsky, Sanjeev Satheesh, Richard Socher, Hao Su, Zhe Wang, Andrew Zisserman
49k Amazon Mechanical Turk Workers
“This is not the end. It is not even the beginning of the end. But it is, perhaps, the end of the beginning.”
Winston Churchill