
Machine Learning in Vision

Fei-Fei Li

What is (computer) vision?

• When we “see” something, what does it involve?
• Take a picture with a camera: it is just a bunch of colored dots (pixels)
• We want to make computers understand images
• Looks easy, but not really…

Image (or video) → Sensing device → Interpreting device → Interpretations

Corn / mature corn in a cornfield / plant / blue sky in the background / etc.

What is it related to?

[Diagram: fields related to computer vision: Biology, Psychology, Neuroscience, Cognitive sciences, Information, Computer Science (Computer Vision, Speech, Information retrieval, Machine learning), Physics, Maths]

Quiz? What about this?

A picture is worth a thousand words. --- Confucius or Printers’ Ink Ad (1921)

[Image tags: horizontal lines, textured, blue on the top, vertical, white, shadow to the left, porous, oblique, large green patches]

[Image tags: clear sky, autumn leaves, building, courtyard, bicycles, trees, campus, day time, talking, outdoor, people]

Today: machine learning methods for object recognition

Outline

• Intro to object categorization
• Brief overview
  – Generative
  – Discriminative
• Generative models
• Discriminative models

How many object categories are there?

Biederman 1987

Challenges 1: viewpoint variation

Michelangelo 1475-1564

Challenges 2: illumination

slide credit: S. Ullman

Challenges 3: occlusion

Magritte, 1957

Challenges 4: scale

Challenges 5: deformation

Xu, Beihong 1943

Challenges 6: background clutter

Klimt, 1913

History: single object recognition

• Lowe, et al. 1999, 2003
• Mahamud and Hebert, 2000
• Ferrari, Tuytelaars, and Van Gool, 2004
• Rothganger, Lazebnik, and Ponce, 2004
• Moreels and Perona, 2005
• …

Challenges 7: intra-class variation

Object categorization: the statistical viewpoint

p(zebra | image) vs. p(no zebra | image)

• Bayes rule:

p(zebra | image) / p(no zebra | image) = [p(image | zebra) / p(image | no zebra)] · [p(zebra) / p(no zebra)]

posterior ratio = likelihood ratio · prior ratio
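A tiny numeric sketch of this decision rule; the likelihood and prior values below are made up purely for illustration.

# Illustrative only: made-up likelihoods and priors plugged into Bayes' rule.
def posterior_ratio(lik_zebra, lik_no_zebra, prior_zebra=0.01):
    likelihood_ratio = lik_zebra / lik_no_zebra
    prior_ratio = prior_zebra / (1.0 - prior_zebra)
    return likelihood_ratio * prior_ratio  # > 1 means "zebra" wins

# The image is 50x more likely under the zebra model, but zebras are rare
# a priori, so the posterior ratio stays below 1: decide "no zebra".
print(posterior_ratio(lik_zebra=0.05, lik_no_zebra=0.001))  # ~0.505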

• Discriminative methods model posterior

• Generative methods model likelihood and prior

Discriminative

• Direct modeling of p(zebra | image) / p(no zebra | image)

[Illustration: decision boundary separating zebra from non-zebra]

Generative

• Model p(image | zebra) and p(image | no zebra)

[Table: example images scored by p(image | zebra) and p(image | no zebra); values span low, middle, high, and middle→low]

Three main issues

• Representation – How to represent an object category

• Learning – How to form the classifier, given training data

• Recognition
– How the classifier is to be used on novel data

Representation

– Generative / discriminative / hybrid
– Appearance only or location and appearance
– Invariances
  • Viewpoint
  • Illumination
  • Occlusion
  • Scale
  • Deformation
  • Clutter
  • etc.
– Part-based or global w/ sub-window
– Use set of features or each pixel in image

Learning

– Unclear how to model categories, so we learn what distinguishes them rather than manually specify the difference -- hence the current interest in machine learning
– Methods of training: generative vs. discriminative

[Plots: class-conditional densities p(x|C1), p(x|C2) and posterior probabilities p(C1|x), p(C2|x) as functions of x]

– What are you maximizing? Likelihood (generative) or performance on a train/validation set (discriminative)
– Level of supervision
  • Manual segmentation; bounding box; image labels; noisy labels (“contains a motorbike”)
– Batch / incremental (on category and image level; user feedback)
– Training images:
  • Issue of overfitting
  • Negative images for discriminative methods
– Priors

Recognition

– Scale / orientation range to search over
– Speed

Bag-of-words models

Object → Bag of ‘words’

Analogy to documents

Of all the sensory impressions proceeding to the brain, the visual experiences are the dominant ones. Our perception of the world around us is based essentially on the messages that reach the brain from our eyes. For a long time it was thought that the retinal image was transmitted point by point to visual centers in the brain; the cerebral cortex was a movie screen, so to speak, upon which the image in the eye was projected. Through the discoveries of Hubel and Wiesel we now know that behind the origin of the visual perception in the brain there is a considerably more complicated course of events. By following the visual impulses along their path to the various cell layers of the optical cortex, Hubel and Wiesel have been able to demonstrate that the message about the image falling on the retina undergoes a step-wise analysis in a system of nerve cells stored in columns. In this system each cell has its specific function and is responsible for a specific detail in the pattern of the retinal image.

[Bag of words: sensory, brain, visual, perception, retinal, cerebral cortex, eye, cell, optical, nerve, image, Hubel, Wiesel]

China is forecasting a trade surplus of $90bn (£51bn) to $100bn this year, a threefold increase on 2004's $32bn. The Commerce Ministry said the surplus would be created by a predicted 30% jump in exports to $750bn, compared with an 18% rise in imports to $660bn. The figures are likely to further annoy the US, which has long argued that China's exports are unfairly helped by a deliberately undervalued yuan. Beijing agrees the surplus is too high, but says the yuan is only one factor. Bank of China governor Zhou Xiaochuan said the country also needed to do more to boost domestic demand so more goods stayed within the country. China increased the value of the yuan against the dollar by 2.1% in July and permitted it to trade within a narrow band, but the US wants the yuan to be allowed to trade freely. However, Beijing has made it clear that it will take its time and tread carefully before allowing the yuan to rise further in value.

[Bag of words: China, trade, surplus, commerce, exports, imports, US, yuan, bank, domestic, foreign, increase, value]

learning / recognition

[Pipeline: feature detection & representation → codewords dictionary → image representation → category models (and/or) classifiers → category decision]

Representation

[Pipeline: 1. feature detection & representation → 2. codewords dictionary → 3. image representation]

1. Feature detection and representation

Detect patches [Mikolajczyk and Schmid ’02] [Matas et al. ’02] [Sivic et al. ’03] → normalize patch → compute SIFT descriptor [Lowe ’99]

Slide credit: Josef Sivic
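A minimal sketch of this detect-and-describe step, assuming an OpenCV build that includes SIFT (4.4+); the filename is a placeholder.

import cv2

img = cv2.imread("image.jpg", cv2.IMREAD_GRAYSCALE)  # placeholder filename
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(img, None)
# One 128-D SIFT descriptor per detected, normalized patch
print(descriptors.shape)  # (num_patches, 128)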

2. Codewords dictionary formation

Vector quantization

Slide credit: Josef Sivic

Fei-Fei et al. 2005

3. Image representation

[Histogram: frequency of each codeword in the image]
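A minimal sketch of steps 2 and 3, assuming SIFT descriptors as input and k-means (scikit-learn) for the vector quantization; the dictionary size K is a free choice.

import numpy as np
from sklearn.cluster import KMeans

K = 300  # dictionary size (assumption; typically a few hundred codewords)

def build_dictionary(all_descriptors):
    # all_descriptors: (total_patches, 128), stacked over the training set
    return KMeans(n_clusters=K, n_init=10).fit(all_descriptors)

def bag_of_words(kmeans, descriptors):
    words = kmeans.predict(descriptors)      # nearest codeword for each patch
    hist = np.bincount(words, minlength=K)   # codeword frequencies
    return hist / hist.sum()                 # the image representation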

Representation

[Pipeline recap: 1. feature detection & representation → 2. codewords dictionary → 3. image representation]

Learning and Recognition

[Diagram: codewords dictionary → category models (and/or) classifiers → category decision]

2 case studies

1. Naïve Bayes classifier – Csurka et al. 2004

2. Hierarchical Bayesian text models (pLSA and LDA)
– Background: Hofmann 2001, Blei et al. 2004
– Object categorization: Sivic et al. 2005, Sudderth et al. 2005
– Natural scene categorization: Fei-Fei et al. 2005

First, some notation

• wn: the n-th patch in an image
  – wn = [0, 0, …, 1, …, 0, 0]^T
• w: the collection of all N patches in an image
  – w = [w1, w2, …, wN]
• dj: the j-th image in an image collection
• c: category of the image
• z: theme or topic of the patch

Case #1: the Naïve Bayes model

[Graphical model: c → w, plate over the N patches]

c* = argmax_c p(c|w) ∝ p(c) p(w|c) = p(c) ∏_{n=1}^{N} p(wn|c)

(object class decision: prior probability of the object class × image likelihood given the class)
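A minimal sketch of this decision rule in log space; the probability tables are assumed to have been estimated (with smoothing) from training histograms.

import numpy as np

def classify(word_counts, log_prior, log_cond):
    # word_counts: length-K codeword histogram of the test image
    # log_prior:   (C,) array of log p(c)           (assumed pre-estimated)
    # log_cond:    (C, K) array of log p(word | c)  (assumed pre-estimated)
    scores = log_prior + log_cond @ word_counts  # log p(c) + sum_n log p(wn|c)
    return np.argmax(scores)                     # c*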

Csurka et al. 2004

Case #2: Hierarchical Bayesian text models

Probabilistic Latent Semantic Analysis (pLSA)

[Graphical model: d → z → w, plates N and D] Hofmann, 2001

Latent Dirichlet Allocation (LDA)

[Graphical model: c, π → z → w, plates N and D] Blei et al., 2001
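A hedged sketch of fitting a topic model to bag-of-words image representations, using scikit-learn's LatentDirichletAllocation as a stand-in for the models cited; random counts stand in for real codeword histograms.

import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(100, 300))  # stand-in: 100 images x 300 codewords
lda = LatentDirichletAllocation(n_components=10, random_state=0)
theta = lda.fit_transform(X)           # per-image mixture over themes/topics z
print(theta.shape)                     # (100, 10)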

Probabilistic Latent Semantic Analysis (pLSA)

[Graphical model: d → z → w, plates N and D]

“face”

Sivic et al. ICCV 2005

“beach”

Latent Dirichlet Allocation (LDA)

[Graphical model: c, π → z → w, plates N and D]

Fei-Fei et al. ICCV 2005

Another application

• Human action classification

Invariance issues

• Scale and rotation
  – Implicit
  – Detectors and descriptors

Kadir and Brady, 2003

• Occlusion
  – Implicit in the models
  – Codeword distribution: small variations
  – (In theory) Theme (z) distribution: different occlusion patterns
• Translation
  – Encode (relative) location information

Sudderth et al. 2005

• Viewpoint (in theory)
  – Codewords: detector and descriptor
  – Theme distributions: different viewpoints

Fergus et al. 2005

Model properties

• Intuitive
  – Analogy to documents (the example text and its bag of words shown earlier)

Model properties

• (Could use) generative models
  – Convenient for weakly- or un-supervised training
  – Prior information
  – Hierarchical Bayesian framework (Sivic et al., 2005; Sudderth et al., 2005)

Model properties

• Learning and recognition relatively fast
  – Compared to other methods

Weakness of the model

• No rigorous geometric information about the object components
• It’s intuitive to most of us that objects are made of parts, but the model carries no such information
• Not extensively tested yet for
  – Viewpoint invariance
• Segmentation and localization unclear

Part-based models

Slides courtesy of Rob Fergus for “part-based models”

One-shot learning of object categories
Fei-Fei et al. ‘03, ‘04, ‘06

P. Bruegel, 1562

One-shot learning of object categories
Fei-Fei et al. ‘03, ‘04, ‘06

model representation

One-shot learning of object categories
Fei-Fei et al. ‘03, ‘04, ‘06

X (location): (x,y) coordinates of the region center
A (appearance): normalize the 11×11 patch, then project onto a PCA basis → coefficients c1, c2, …, c10

The Generative Model

[Graphical model: parameters μX, ΓX, μA, ΓA → hidden variable h → observed variables X (location) and A (appearance), taken from image I]

The Generative Model

μX, ΓX, μA, ΓA: parameters
h: hidden variable
X, A: observed variables (from image I)

The Generative Model

ML/MAP: point estimates θ1, θ2, …, θn, where θ = {μX, ΓX, μA, ΓA}

Weber et al. ’98 ’00, Fergus et al. ’03

The Generative Model

ML/MAP: μX, ΓX form the shape model and μA, ΓA the appearance model, with θ = {μX, ΓX, μA, ΓA}

The Generative Model


Bayesian

[Graphical model: hyperparameters m0X, β0X, a0X, B0X and m0A, β0A, a0A, B0A placed as priors over μX, ΓX, μA, ΓA; hidden variable h; observed X, A from image I]

Parameters to estimate: {mX, βX, aX, BX, mA, βA, aA, BA}, i.e. the parameters of the Normal-Wishart distribution
Fei-Fei et al. ‘03, ‘04, ‘06

The Generative Model

[Graphical model: priors (m0, β0, a0, B0 for X and A) → parameters (μX, ΓX, μA, ΓA) → hidden variable h → observed X, A from image I; P denotes the prior distribution]

Fei-Fei et al. ‘03, ‘04, ‘06

One-shot learning:
1. human vision
2. model of object categories (representation)
3. learning & inferences
4. evaluation & dataset & application

learning & inferences

No labeling No segmentation No alignment

One-shot learning of object categories
Fei-Fei et al. 2003, 2004, 2006

learning & inferences

Bayesian

[Diagram: Bayesian learning with a prior over θ and category models θ1, θ2, …, θn]

One-shot learning of object categories
Fei-Fei et al. 2003, 2004, 2006

Variational EM

Random initialization → iterate M-Step and E-Step → new estimate of p(θ|train), incorporating prior knowledge of p(θ)
Attias, Jordan, Hinton etc.
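These are not the one-shot learning updates themselves (those involve Normal-Wishart posteriors); the loop below is a minimal runnable EM for a two-Gaussian mixture with unit variances, just to show the E-step / M-step alternation from a random initialization.

import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 200)])
mu = rng.normal(size=2)  # random initialization of the two means

for _ in range(50):
    # E-step: responsibility of each component for each data point
    logp = -0.5 * (x[:, None] - mu[None, :]) ** 2
    r = np.exp(logp - logp.max(axis=1, keepdims=True))
    r /= r.sum(axis=1, keepdims=True)
    # M-step: re-estimate the means from the responsibility-weighted data
    mu = (r * x[:, None]).sum(axis=0) / r.sum(axis=0)

print(mu)  # close to [-2, 3] (possibly in swapped order)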

evaluation & dataset

One-shot learning of object categories
Fei-Fei et al. 2004, 2006a, 2006b

evaluation & dataset -- Caltech 101 Dataset


Part 3: discriminative methods

Recognition is formulated as a classification problem: the image is partitioned into a set of overlapping windows, and a decision is made at each window about whether or not it contains the target object (a sketch follows).
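A minimal sketch of that formulation; `classify` stands for any trained patch scorer (an assumption here), the image is assumed to be a NumPy array, and the window size, stride, and threshold are free parameters.

def detect(image, classify, win=64, step=16, thresh=0.0):
    # Scan overlapping windows; keep those the classifier scores as object.
    H, W = image.shape[:2]
    detections = []
    for y in range(0, H - win + 1, step):
        for x in range(0, W - win + 1, step):
            score = classify(image[y:y + win, x:x + win])
            if score > thresh:
                detections.append((x, y, win, score))
    return detections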

[Illustration: where are the screens? A bag of image patches mapped into some feature space, with a decision boundary separating ‘computer screen’ from ‘background’]

Discriminative vs. generative

• Generative model (“the artist”): models p(x | class) over the data x
• Discriminative model (“the lousy painter”): models p(class | x)
• Classification function: maps x directly to a label in {−1, +1}

[Plots: the three quantities as functions of x = data]

Discriminative methods

Nearest neighbor (10^6 examples): Shakhnarovich, Viola, Darrell 2003; Berg, Berg, Malik 2005; …
Neural networks: LeCun, Bottou, Bengio, Haffner 1998; Rowley, Baluja, Kanade 1998; …
Support Vector Machines and Kernels: Guyon, Vapnik; Heisele, Serre, Poggio, 2001; …
Conditional Random Fields: McCallum, Freitag, Pereira 2000; Kumar, Hebert 2003; …

Formulation

• Formulation: binary classification

Features: x = x1, x2, …, xN (training) and xN+1, xN+2, …, xN+M (test)
Labels: y = −1, +1, −1, −1, … (training) and ?, ?, … (test)

Training data: each image patch is labeled as containing the object or background. Test data: unlabeled.

• Classification function

ŷ = f(x), where f belongs to some family of functions

• Minimize the misclassification error
(Not that simple: we need some guarantees that there will be generalization)
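One common choice for the family of functions f is a linear SVM; a hedged sketch on synthetic patch features (a real system would use the image features discussed below).

import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 50))          # stand-in patch features x1..xN
y_train = np.where(X_train[:, 0] > 0, 1, -1)  # stand-in labels in {-1, +1}

f = LinearSVC().fit(X_train, y_train)         # f from a linear family
print(f.predict(rng.normal(size=(3, 50))))    # labels for novel patches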

Overview of section

• Object detection with classifiers

• Boosting
  – Gentle boosting
  – Weak detectors
  – Object model
  – Object detection

• Multiclass object detection

Why boosting?

• A simple algorithm for learning robust classifiers
  – Freund & Schapire, 1995
  – Friedman, Hastie, Tibshirani, 1998

• Provides an efficient algorithm for sparse visual feature selection
  – Tieu & Viola, 2000
  – Viola & Jones, 2003

• Easy to implement; does not require external optimization tools.

Boosting

• Defines a classifier using an additive model: H(x) = α1 h1(x) + α2 h2(x) + α3 h3(x) + …

(H: strong classifier; ht: weak classifiers; αt: weights; x: feature vector)

• We need to define a family of weak classifiers

Boosting

• It is a sequential procedure:

Each data point xt has a class label yt ∈ {+1, −1}

and a weight wt = 1

Toy example

Weak learners from the family of lines

Each data point has a class label yt ∈ {+1, −1} and a weight wt = 1.

A weak learner h with p(error) = 0.5 is at chance.

Toy example

This one seems to be the best. It is a ‘weak classifier’: it performs slightly better than chance.

Toy example

We update the weights: wt ← wt exp{−yt Ht(xt)}, and set a new problem for which the previous weak classifier performs at chance again. This re-weighting step is repeated at each iteration (see the sketch below).
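A compact sketch of this sequential procedure: discrete AdaBoost with decision stumps on small feature matrices. The weight update matches the slides up to the per-round weak-classifier weight α; the exhaustive stump search is for clarity, not speed.

import numpy as np

def stump_predict(X, j, t, s):
    return s * np.where(X[:, j] > t, 1.0, -1.0)

def adaboost(X, y, rounds=10):
    # X: (n, d) features; y: (n,) labels in {-1, +1}
    n = len(y)
    w = np.ones(n) / n                      # all points start equally weighted
    H = []
    for _ in range(rounds):
        best = None
        for j in range(X.shape[1]):         # exhaustively pick the best stump
            for t in np.unique(X[:, j]):
                for s in (1.0, -1.0):
                    err = w[stump_predict(X, j, t, s) != y].sum()
                    if best is None or err < best[0]:
                        best = (err, j, t, s)
        err, j, t, s = best
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))
        w *= np.exp(-alpha * y * stump_predict(X, j, t, s))  # re-weight data
        w /= w.sum()                        # chosen stump is now at chance
        H.append((alpha, j, t, s))
    return H

def strong_classify(H, X):
    # The strong classifier: sign of the weighted sum of weak classifiers
    return np.sign(sum(a * stump_predict(X, j, t, s) for a, j, t, s in H))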

[Illustration: weak linear classifiers f1, f2, f3, f4]

The strong (non-linear) classifier is built as the combination of all the weak (linear) classifiers.

From images to features: weak detectors

We will now define a family of visual features that can be used as weak classifiers (“weak detectors”).

Each takes an image as input and produces a binary response: the output of a weak detector.

Weak detectors: textures of textures
Tieu and Viola, CVPR 2000

Every combination of three filters generates a different feature

This gives thousands of features. Boosting selects a sparse subset, so computation at test time is very efficient. Boosting also avoids overfitting to some extent.

Weak detectors: Haar filters and integral image
Viola and Jones, ICCV 2001

The average intensity in the block is computed with four sums, independently of the block size.
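A minimal sketch of the integral-image trick; the small example at the end verifies the four-lookup block sum.

import numpy as np

def integral_image(img):
    ii = img.cumsum(axis=0).cumsum(axis=1)
    return np.pad(ii, ((1, 0), (1, 0)))  # zero row/column so edges work

def block_sum(ii, y0, x0, y1, x1):
    # Sum of img[y0:y1, x0:x1] from four lookups, whatever the block size
    return ii[y1, x1] - ii[y0, x1] - ii[y1, x0] + ii[y0, x0]

img = np.arange(16.0).reshape(4, 4)
ii = integral_image(img)
print(block_sum(ii, 1, 1, 3, 3), img[1:3, 1:3].sum())  # both 30.0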

Weak detectors

Other weak detectors:
• Carmichael, Hebert 2004
• Yuille, Snow, Nitzberg, 1998
• Amit, Geman 1998
• Papageorgiou, Poggio, 2000
• Heisele, Serre, Poggio, 2001
• Agarwal, Awan, Roth, 2004
• Schneiderman, Kanade 2004
• …

Weak detectors

Part-based: similar to part-based generative models. We create weak detectors by using parts and voting for the object center location.

[Illustrations: car model, screen model]

These features are used for the detector on the course web site.

Weak detectors

First we collect a set of part templates from a set of training objects. Vidal-Naquet, Ullman (2003)

…

Weak detectors

We now define a family of “weak detectors” as:

[weak detector = part template correlated with the image, voting for the object center]

Better than chance.

Weak detectors

We can do a better job using filtered images:

[weak detector = part template correlated with a filtered version of the image]

Still a weak detector, but better than before.

Training

First we evaluate all of the N features on all the training images.

Then, we sample the feature outputs at the object center and at random locations in the background.
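A hedged sketch of one such correlation-based weak detector whose outputs would be evaluated in this way; the template and the threshold are stand-ins, and the normalization is deliberately crude.

import numpy as np
from scipy.signal import correlate2d

def weak_detector(image, template, thresh=0.6):
    # Stand-in template matcher: zero-mean, unit-variance part template
    t = (template - template.mean()) / (template.std() + 1e-8)
    response = correlate2d(image, t, mode="same")
    response /= np.abs(response).max() + 1e-8  # crude normalization (assumption)
    return response > thresh                   # binary response map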

Representation and object model

Selected features for the screen detector: …

(features 1, 2, 3, 4, 10, 100)

Representation and object model

Selected features for the car detector: …

(features 1, 2, 3, 4, 10, 100)

Overview of section

• Object detection with classifiers

• Boosting
  – Gentle boosting
  – Weak detectors
  – Object model
  – Object detection

• Multiclass object detection

Example: screen detection

[Feature output → thresholded output] The first weak ‘detector’ produces many false alarms.

[Feature output → thresholded output → strong classifier at iteration 1] The second weak ‘detector’ produces a different set of false alarms.

[Adding them gives the strong classifier at iteration 2; adding more features gives the strong classifier at iteration 10, and the final classification comes from the strong classifier at iteration 200.]

Applications

• Document analysis: digit recognition, AT&T Labs (http://www.research.att.com/~yann/)
• Medical
• Robotics
• Toys
• Fingerprints
• Security
• Searching the web