Where have we been? Where are we going?

LI F E I – F EI The Beginning: CVPR 2009

J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li and L. Fei-Fei, ImageNet: A Large-Scale Hierarchical Image Database. IEEE and Pattern Recognition (CVPR), 2009. The Impact of on Scholar

4,386 Citations

2,847 Citations

…and many more. From Challenge Contestants to Startups A Revolution in Deep Learning

Why Deep Learning is Suddenly Changing Your Life

By Roger Parloff The Great Awakening

By Gideon Lewis-KrausThe data that transformed AI research—and possibly the world

By Dave Gershgorn “The of x”

SpaceNet MusicNet Medical ImageNet DigitalGlobe, CosmiQ Works, NVIDIA J. Thickstun et al, 2017 Stanford Radiology, 2017

ShapeNet EventNet ActivityNet A.Chang et al, 2015 G. Ye et al, 2015 F. Heilbron et al, 2015 An Explosion of Datasets

1627 276 1919 1MM 4MM Hosted Datasets Commercial Student Data Scientists ML Models Competitions Competitions Submitted “Datasets—not algorithms—might be the key limiting factor to development of human-level artificial intelligence.”

A L E X A N D E R W I S S N E R - G R O S S Edge.org, 2016 The Untold History of Hardly the First Image Dataset

Segmentation (2001) CMU/VASC Faces (1998) FERET Faces (1998) COIL Objects (1996) MNIST digits (1998-10) D. Martin, C. Fowlkes, D. Tal, J. Malik. H. Rowley, S. Baluja, T. Kanade P. Phillips, H. Wechsler, J. S. Nene, S. Nayar, H. Murase Y LeCun & C. Cortes Huang, P. Raus

KTH human action (2004) Sign Language (2008) UIUC Cars (2004) 3D Textures (2005) CuRRET Textures (1999) I. Leptev & B. Caputo P. Buehler, M. Everingham, A. S. Agarwal, A. Awan, D. Roth S. Lazebnik, C. Schmid, J. Ponce K. Dana B. Van Ginneken S. Nayar Zisserman J. Koenderink

CAVIAR Tracking (2005) Middlebury Stereo (2002) CalTech 101/256 (2005) LabelMe (2005) ESP (2006) R. Fisher, J. Santos-Victor J. Crowley D. Scharstein R. Szeliski Fei-Fei et al, 2004 Russell et al, 2005 Ahn et al, 2006 GriffIn et al, 2007

MSRC (2006) PASCAL (2007) Lotus Hill TinyImage (2008) Shotton et al. 2006 Everingham et al, 2009 (2007) Torralba et al. 2008 Yao et al, 2007 A Profound Problem Within Visual Learning Machine Learning 101: Complexity, Generalization, Overfitting

Error Underfitting Overfitting Zone Zone

Generalization Error Training Generalization Error Gap

Optimal Capacity Capacity One-Shot Learning

Fei-Fei et al, 2003, 2004

Fei-Fei et al, 2003, 2004 How Children Learn to See Error Underfitting Overfitting Zone Zone

Generalization Error Training Generalization Error Gap

Optimal Capacity Capacity A new way of thinking…

To shift the focus of Machine Learning for visual recognition

from …to data. modeling… Lots of data. Internet Data Growth 1990-2010

15,000

11,250

7,500

3,750

Global Data Traffic (PB/month)

Source: Cisco What is WordNet?

Establishes Organizes over ontological and 150,000 words into lexical relationships 117,000 categories Original paper by in NLP and related called synsets. [George Miller, et tasks. al 1990] cited over 5,000 times Christiane Fellbaum Senior Research Scholar Computer Science Department, Princeton President, Global WordNet Consortium Individually Illustrated WordNet Nodes

jacket: a short coat

German shepherd: breed of large shepherd dogs used in A massive ontology of police work and as a guide for the blind. images to transform computer vision

microwave: kitchen appliance that cooks food by passing an electromagnetic wave through it.

mountain: a land mass that projects well above its surroundings; higher than a hill. Comrades

Prof. Kai Li Jia Deng Princeton 1st Ph.D. student Princeton Entity

Step 1: Ontological Mammal structure based on WordNet Dog

German Shepherd Dog

German Step 2: Populate categories Shepherd with thousands of images from the Internet Dog

German Step 3: Clean results by Shepherd hand Three Attempts at Launching 1st Attempt: The Psychophysics Experiment

ImageNet PhD Students

Miserable Undergrads 1st Attempt: The Psychophysics Experiment

• # of synsets: 40,000 (subject to: imageability analysis) • # of candidate images to label per synset: 10,000 • # of people needed to verify: 2-5

• Speed of human labeling: 2 images/sec (one fixation: ~200msec) • Massive parallelism (N ~ 10^2-3)

≈ 19 years 40,000 × 10,000 × 3 / 2 = 6000,000,000 sec N 2nd Attempt: Human-in-the-Loop Solutions 2nd Attempt: Human-in-the-Loop Solutions

Machine-generated Human-generated datasets can only match datasets transcend the best algorithms of algorithmic limitations, the time. leading to better machine perception. 3rd Attempt: A Godsend Emerges

ImageNet PhD Students

Crowdsourced Labor 49k Workers from 167 Countries 2007-2010 The Result: Goes Live in 2009

What We Did Right While Others Targeted Detail…

LabelMe Lotus Hill Per-Object Regions and Labels Hand-Traced Parse Trees Russell et al, 2005 Yao et al, 2007 …We Targeted Scale

SUN, 131K [Xiao et al. ‘10]

LabelMe, 37K [Russell et al. ’07] 15M PASCAL VOC, 30K [Deng et al. ’09] [Everingham et al. ’06-’12]

Caltech101, 9K [Fei-Fei, Fergus, Perona, ‘03] Additional Goals

Carnivore - Canine - Dog - Working Dog - Husky

High High-Quality Free of Resolution Annotation Charge To better replicate human visual To create a benchmarking dataset To ensure immediate application and acuity and advance the state of machine a sense of community perception, not merely reflect it An Emphasis on Community and Achievement

Large Scale Visual Recognition Challenge (ILSVRC 2010-2017) ILSVRC Contributors

Alex Berg Jia Deng Zhiheng Huang Aditya Khosla Jonathan Krause Fei-Fei Li UNC Chapel Hill Univ. of Michigan Stanford Stanford Stanford Stanford

Wei Liu Sean Ma Eunbyung Park Olga Russakovsky Sanjeev Satheesh Hao Su UNC Chapel Hill Stanford UNC Chapel Hill Stanford Stanford Stanford Our Inspiration: PASCAL VOC

2005-2012 Our Inspiration: PASCAL VOC

Mark Everingham Prize @ ECCV 2016 Mark Everingham 1973-2012 Alex Berg, Jia Deng, Fei-Fei Li, Wei Liu, Olga Russakovsky Participation and Performance

172 157 123

81

35 29

2010 2011 2012 2013 2014 2015 2016

Number of Entries Participation and Performance

0.28

172 157 123

81

0.03 35 29

2010 2011 2012 2013 2014 2015 2016

Number of Classification Entries Errors (top-5) Participation and Performance

0.28 0.66

172 157 123

81

0.03 0.23 35 29

2010 2011 2012 2013 2014 2015 2016

Number of Classification Average Precision Entries Errors (top-5) For Object Detection What we did to make better Lack of Details Lack of Details…ILSVRC Detection Challenge

PASCAL ILSVRC Statistics VOC 2012 2013

Object classes 20 10x 200

Images 5.7K 70x 395K

Training

Objects 13.6K 25x 345K Evaluation of ILSVRC Detection Need to annotate the presence of all classes (to penalize false detections)

Table Chair Horse Dog Cat Bird # images: 400K # classes: 200 + + - - - - # annotations = 80M! + - - - + - + + - - - - Evaluation of ILSVRC Detection Hierarchical annotation

J. Deng, O. Russakovsky, J. Krause, M. Bernstein, A. Berg, & L. Fei-Fei. CHI, 2014 What does classifying 10K+ classes tell us?

J. Deng, A. Berg & L. Fei-Fei, ECCV, 2010

Fine-Grained Recognition

“Cardigan Welsh Corgi” “Pembroke Welsh Corgi” Fine-Grained Recognition

cars [Gebru, Krause, Deng, Fei-Fei, CHI 2017]

2567 classes 700k images Expected Outcomes

Machine learning ImageNet becomes a Breakthroughs in advances and changes benchmark object recognition dramatically Unexpected Outcomes Neural Nets are Cool Again!

13,259 Citations

Krizhevsky, Sutskever & Hinton, NIPS 2012 …And Cooler and Cooler 

“AlexNet” “GoogLeNet” “VGG Net” “ResNet”

[Krizhevsky et al. NIPS 2012] [Szegedy et al. CVPR 2015] [Simonyan & Zisserman, [He et al. CVPR 2016] ICLR 2015] Neural Nets

A Deep Learning Revolution

GPUs Ontological Structure Structure Not Used as Much Thing

is a

Animalia

Chordate Arthropoda

Mammal Insect Wombat is a Primate Carnivora Diptera

Marsupial Hominidae Pongidae Felidae Muscidae

Homo Pan Felis Musca

is a

Sapiens Troglodytes Domestica Leo Domestica

Wombat Human Chimpanzee House Cat Lion Housefly

Deng, Krause, Berg & Fei-Fei, CVPR 2012 Thing Thing

is a

Animalia Animal

Chordate Arthropoda

Mammal Insect Mammal Wombat is a Primate Carnivora Diptera

Marsupial Hominidae Pongidae Felidae Muscidae Marsupial

Homo Pan Felis Musca

is a

Sapiens Troglodytes Domestica Leo Domestica

Wombat Human Chimpanzee House Cat Lion Housefly Wombat

Deng, Krause, Berg & Fei-Fei, CVPR 2012 Thing

is a

Animalia

Chordate Arthropoda

Mammal Insect WombatMaximize Specificity ( f ) is a Primate Carnivora Diptera Subject to Accuracy ( f ) ≥ 1 - ε Marsupial Hominidae Pongidae Felidae Muscidae

Homo Pan Felis Musca

is a

Sapiens Troglodytes Domestica Leo Domestica

Wombat Human Chimpanzee House Cat Lion Housefly

Deng, Krause, Berg & Fei-Fei, CVPR 2012 Optimizing with a Knowledge Ontology Results in Big Gains in Information at Arbitrary Accuracy

Our Model

Deng, Krause, Berg & Fei-Fei, CVPR 2012 Relatively Few Works Have Used Ontology

ECCV 2012 Best paper Award

Kuettel, Guillaumin, Ferrari. Segmentation Propagation in ImageNet. ECCV 2012 Most works still use 1M images to do pre-training

1M Images

15M Images Total “First, we find that the performance on vision tasks still increases linearly with orders of magnitude of training data size.”

C. Sun et al, 2017 How Humans Compare

Andrej Karpathy. http://karpathy.github.io/2014/09/02/what-i-learned-from-competing-against-a-convnet-on-/ How Humans Compare

Human GoogLeNet

5.1% 6.8% Top-5 error rate Top-5 error rate

Susceptible to: Susceptible to: • Fine-grained recognition • Small, thin objects • Class unawareness • Image filters • Insufficient training data • Abstract representations • Miscellaneous sources

Andrej Karpathy. http://karpathy.github.io/2014/09/02/what-i-learned-from-competing-against-a-convnet-on-imagenet/ What Lies Ahead Moving from object recognition…

room

person

person person person person scale …to human-level understanding.

room

person

Watching and laughing person Wants Standing on to weigh person himself scale Stepping on

Wants to play a prank

Stepping on a scale adds weight and ups the reading. Inverse Graphics

Image credit: https://www.youtube.com/watch?v=ip-KIzQmcBo (Oliver Villar) lady

ImageNet: Deng et al. 2009; COCO: Lin et al. 2014 sky snow building leaves vest ski bag sunglasses jacket head lady pole hat tree glove equipment boots coat …

“A lady in pink dress is skiing.”

COCO: Lin et al. 2014 sky snow building leaves vest ski bag sunglasses jacket head lady pole hat tree glove equipment boots coat … “A lady in pink dress is skiing.”

“A man standing.” “A clear blue sky at a ski resort.” “A snowy hill is in front of pine trees.” “There are several pine trees.” “A group of people getting ready to ski.”

Q: What is the man in the center doing? A: Standing on a ski. Q: What is the color of the sky? A: Blue Q: Where are the pine trees? A: Behind the hill.

entire universe of images

[Johnson et al., CVPR 2015] Visual Genome Dataset A dataset, a knowledge base, an ongoing effort to connect structural image concepts to language.

Specs Goals • 108,249 images (COCO images) • Beyond nouns • 4.2M image descriptions – Objects, verbs, attibutes • 1.8M Visual QA (7W) • Beyond object classification • 1.4M objects, 75.7K obj. classes – Relationships and contexts • 1.5M relationships, 40.5K rel. classes • Sentences and QAs • 1.7M attributes, 40.5K attr. classes • From Perception to Cognition • Vision and language correspondences • Everything mapped to WordNet Synset

Krishna et al. IJCV 2016 Visual Genome Dataset A dataset, a knowledge base, an ongoing effort to connect structural image concepts to language.

DenseCap & Relationship Image Visual Q&A Paragraph Prediction Retrieval w/ Zhu et al. CVPR’16 Generation Krishna et al. Scene Graphs Karpathy et al. CVPR’16 ECCV’16 Johnson et al. Krause et al. CVPR’17 CVPR’15 Xu et al. CVPR’17

Q: What is the person sitting on the right of the elephant wearing? A: A blue shirt.

Krishna et al. IJCV 2016 Visual Genome Dataset A dataset, a knowledge base, an ongoing effort to connect structural image concepts to language.

DenseCap & Relationship Image Visual Q&A WorkshopParagraph on VisualPrediction UnderstandingRetrieval w/ by LearningZhu et al. CVPR’16 from Generation Krishna et al. Scene Graphs Karpathy et al. CVPR’16 ECCV’16 Web DataJohnson 2017 et al. Krause et al. CVPR’17 CVPR’15 Xu et al. CVPR’17 26 July 2017 | Honolulu, Hawaii in conjunction with CVPR 2017

http://www.vision.ee.ethz.ch/webvision/workshop.html

Q: What is the person sitting on the right of the elephant wearing? A: A blue shirt.

Krishna et al. IJCV 2016 The Future of Vision and Intelligence

Vision

Language Agency: The integration of perception, understanding Understanding and action

Action

81 Eight Years of Competitions

10× 3× 2010-2017 reduction of image improvement of classification error detection precision What Happens Now?

We’re passing the Why? baton to Kaggle: a image-net.org democratizing community of more remains live at data is vital to than 1M data Stanford. democratizing AI. scientists. What Happens Now?

ImageNet Object Localization Challenge ImageNet Object Detection Challenge ImageNet Object Detection from Video Challenge Contributors/Friends/Advisors Alex Berg Derek Hoiem Eunbyung Park Michael Bernstein Zhiheng Huang Chuck Rosenberg Edward Chang Andrej Karpathy Olga Russakovksy Brendan Collins Aditya Khosla Sanjeev Satheesh Jia Deng Jonathan Krause Richard Socher Minh Do Fei-Fei Li Hao Su Wei Dong Kai Li Zhe Wang Alexei Efros Li-Jia Li Andrew Zisserman Mark Everingham Wei Liu Christiane Fellbaum Sean Ma 49k Amazon Mechanical Turk Workers Adam Finkelstein Xiaojuan Ma Thomas Funkhouser Jitendra Malik Timnit Gebru Dan Osherson “This is not the end. It is not even the beginning of the end. But it is, perhaps, the end of the beginning.”

W I N S T O N C H U R C H I L L