Fundamentals of Media Processing

Lecturer: 池畑 諭(Prof. IKEHATA Satoshi) 児玉 和也(Prof. KODAMA Kazuya)

Support: 佐藤 真一(Prof. SATO Shinichi) 孟 洋(Prof. MO Hiroshi)

Course Overview (15 classes in total)

Classes 1-10: Machine Learning by Prof. Satoshi Ikehata
Classes 11-15: Signal Processing by Prof. Kazuya Kodama

Grading will be based on the final report.

Fundamentals of Media Processing, Deep Learning (Chapters 1-9 out of 20)

An introduction to a broad range of topics in deep learning, covering mathematical and conceptual background, deep learning techniques used in industry, and research perspectives.

• Due to my background, I will mainly talk about “images”
• I will introduce some applications beyond this book

Fundamentals of Media Processing, Deep Learning 10/16 (Today) Introduction Chap. 1

Basics of Machine Learning (maybe for beginners)

10/23 Basic mathematics (1) (Linear algebra, probability, numerical computation) Chap. 2,3,4

10/30 Basic mathematics (2) (Linear algebra, probability, numerical computation) Chap. 2,3,4

11/6 Machine Learning Basics (1) Chap. 5

11/13 Machine Learning Basics (2) Chap. 5

Basics of Deep Learning

11/20 Deep Feedforward Networks Chap. 6

11/27 Regularization and Deep Learning Chap. 7

12/4 Optimization for Training Deep Models Chap. 8

CNN and its Applications
12/11 Convolutional Neural Networks and Their Applications (1) Chap. 9 and more
12/18 Convolutional Neural Networks and Their Applications (2) Chap. 9 and more

Fundamentals of Media Processing, Deep Learning Linear Algebra

Fundamentals of Media Processing, Deep Learning Scalars, Vectors, Matrices and Tensors

Scalar: $a = 1$

Vector (1-D): $\boldsymbol{a} = \begin{bmatrix} 0 \\ 1 \end{bmatrix} = (0, 1)^T$

Matrix (2-D): $A = \begin{bmatrix} 0 & 1 \\ 2 & 3 \end{bmatrix}$, $A^T = \begin{bmatrix} 0 & 2 \\ 1 & 3 \end{bmatrix}$, $A \in \mathbb{R}^{2 \times 2}$

$A_{1,1} = 0,\; A_{1,2} = 1,\; A_{2,1} = 2,\; A_{2,2} = 3$

Tensor (N-D): $A_{i,j,k,\dots}$

Fundamentals of Media Processing, Deep Learning Multiplying Matrices and Vectors

$\boldsymbol{x}^T\boldsymbol{y} = \boldsymbol{y}^T\boldsymbol{x}$,  $A(B + C) = AB + AC$,  $(AB)^T = B^T A^T$

$\boldsymbol{x} \cdot \boldsymbol{x} = \boldsymbol{x}^T\boldsymbol{x} = \|\boldsymbol{x}\|^2$

Unit vector: $\|\boldsymbol{x}\|_2 = 1$;  Orthogonal vectors: $\boldsymbol{x}^T\boldsymbol{y} = 0$;  Symmetric matrix: $A = A^T$;  Inverse matrix: $A A^{-1} = I$

Orthogonal matrix: $A^T A = A A^T = I$;  Trace operator: $\operatorname{Tr}(A) = \sum_i A_{i,i}$

L1 norm: $\|\boldsymbol{x}\|_1 = \sum_i |x_i|$;  L2 norm: $\|\boldsymbol{x}\|_2 = \left(\sum_i |x_i|^2\right)^{1/2}$;  L∞ norm: $\|\boldsymbol{x}\|_\infty = \max_i |x_i|$

L0 “norm”: $\|\boldsymbol{x}\|_0$ = number of nonzero entries

Frobenius norm: $\|A\|_F = \sqrt{\sum_{i,j} A_{i,j}^2} = \sqrt{\operatorname{Tr}(A A^T)}$
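As a quick sanity check, here is a minimal NumPy sketch (not part of the original slides) that evaluates the norms above; the vector and matrix values are arbitrary examples.

```python
import numpy as np

x = np.array([3.0, 0.0, -4.0])
A = np.array([[0.0, 1.0],
              [2.0, 3.0]])

l1   = np.sum(np.abs(x))           # L1 norm
l2   = np.sqrt(np.sum(x**2))       # L2 norm
linf = np.max(np.abs(x))           # L-infinity norm
l0   = np.count_nonzero(x)         # "L0 norm": number of nonzero entries

fro           = np.sqrt(np.sum(A**2))          # Frobenius norm
fro_via_trace = np.sqrt(np.trace(A @ A.T))     # same value via Tr(A A^T)

print(l1, l2, linf, l0)                        # 7.0 5.0 4.0 2
print(np.isclose(fro, fro_via_trace))          # True
print(np.isclose(fro, np.linalg.norm(A)))      # np.linalg.norm defaults to Frobenius
```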

Fundamentals of Media Processing, Deep Learning Matrix Decomposition

◼ Decomposition of matrices shows us information about their functional properties that is not obvious from the representation of the matrix

⚫ Eigendecomposition
⚫ Singular value decomposition
⚫ LU decomposition
⚫ QR decomposition
⚫ Jordan decomposition
⚫ Cholesky decomposition
⚫ Schur decomposition
⚫ Rank factorization
⚫ And more…

These differ in:
◼ Applicable matrices
◼ Decomposition type
◼ Applications

Fundamentals of Media Processing, Deep Learning Eigendecomposition

⚫ Eigenvector and eigenvalue: $A\boldsymbol{v} = \lambda\boldsymbol{v}$  ($\boldsymbol{v}$: eigenvector, $\lambda$: eigenvalue)

⚫ Eigendecomposition of a diagonalizable matrix: $A = V \operatorname{diag}(\boldsymbol{\lambda})\, V^{-1}$  ($V$: a set of eigenvectors, $\boldsymbol{\lambda}$: a set of eigenvalues)

⚫ Every real symmetric matrix has a real, orthogonal eigendecomposition $A = Q \Lambda Q^T$

⚫ A matrix whose eigenvalues are all positive is positive definite; all positive or zero, positive semidefinite

Fundamentals of Media Processing, Deep Learning What Eigenvalues and Eigenvectors are?

◼ Eigenvalues and eigenvectors feature prominently in the analysis of linear transformations. The prefix eigen- is adopted from the German word eigen for "proper", "characteristic". Originally utilized to study principal axes of the rotational motion of rigid bodies, eigenvalues and eigenvectors have a wide range of applications, for example in stability analysis, vibration analysis, atomic orbitals, facial recognition, and matrix diagonalization.

In this shear mapping the red arrow changes direction but the blue arrow does not. The blue arrow is an eigenvector of this shear mapping because it does not change direction, and since its length is unchanged, its eigenvalue is 1.

https://en.wikipedia.org/wiki/Eigenvalues_and_eigenvectors

Fundamentals of Media Processing, Deep Learning Calculation of Eigendecomposition

⚫ Simply solve a linear equation

$A\boldsymbol{v} = \lambda\boldsymbol{v} \;\rightarrow\; (A - \lambda I)\boldsymbol{v} = 0$

⚫ $(A - \lambda I)$ must not be invertible; otherwise the only solution is $\boldsymbol{v} = 0$

$\det(A - \lambda I) = 0 \;\rightarrow\; p(\lambda) = (\lambda - \lambda_1)^{n_1}(\lambda - \lambda_2)^{n_2}(\lambda - \lambda_3)^{n_3}\cdots$

⚫ Once the $\lambda$s are calculated, the eigenvectors are obtained by solving:

$(A - \lambda I)\boldsymbol{v} = 0$
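A minimal NumPy sketch (not part of the original slides): np.linalg.eig carries out exactly this calculation, and the decomposition $A = V \operatorname{diag}(\boldsymbol{\lambda}) V^{-1}$ can be verified directly. The example matrices are arbitrary.

```python
import numpy as np

A = np.array([[0.0, 1.0],
              [2.0, 3.0]])

eigvals, V = np.linalg.eig(A)                 # columns of V are the eigenvectors
A_rebuilt = V @ np.diag(eigvals) @ np.linalg.inv(V)

print(np.allclose(A, A_rebuilt))                          # True: A = V diag(lambda) V^{-1}
print(np.allclose(A @ V[:, 0], eigvals[0] * V[:, 0]))     # Av = lambda v for the first pair

# For symmetric matrices, np.linalg.eigh returns real eigenvalues and an orthogonal Q
# so that A = Q Lambda Q^T, as stated on the previous slide.
S = np.array([[2.0, 1.0],
              [1.0, 2.0]])
w, Q = np.linalg.eigh(S)
print(np.allclose(S, Q @ np.diag(w) @ Q.T))               # True
```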

Fundamentals of Media Processing, Deep Learning Singular Value Decomposition (SVD)

⚫ Similar to eigendecomposition, but the matrix need not be square. The singular value decomposition of a real matrix A is

$A = U D V^T, \quad A\boldsymbol{v} = \sigma\boldsymbol{u}, \quad A^T\boldsymbol{u} = \sigma\boldsymbol{v}$

⚫ The left singular vectors of $A$ (the columns of $U$) are the eigenvectors of $AA^T$

⚫ The right singular vectors of $A$ (the columns of $V$) are the eigenvectors of $A^TA$

⚫ SVD is useful for inverting non-square matrices (the pseudoinverse matrix), for low-rank approximation (e.g., finding the nearest matrix whose rank is k), and for solving homogeneous systems

Fundamentals of Media Processing, Deep Learning Solving Linear Systems

◼ Number of equations = number of unknowns (determined)

$a + 2b + 3c = 3,\quad 2a - b + c = 2,\quad 5a + b - 2c = 1$

$\begin{bmatrix} 1 & 2 & 3 \\ 2 & -1 & 1 \\ 5 & 1 & -2 \end{bmatrix} \begin{bmatrix} a \\ b \\ c \end{bmatrix} = \begin{bmatrix} 3 \\ 2 \\ 1 \end{bmatrix}$, i.e. $A\boldsymbol{x} = \boldsymbol{y}$, so $\boldsymbol{x} = A^{-1}\boldsymbol{y}$

◼ Number of equations < number of unknowns (underdetermined)

$a + 2b + 3c = 3,\quad 2a - b + c = 2$

$\begin{bmatrix} 1 & 2 & 3 \\ 2 & -1 & 1 \end{bmatrix} \begin{bmatrix} a \\ b \\ c \end{bmatrix} = \begin{bmatrix} 3 \\ 2 \end{bmatrix}$, $\boldsymbol{x} = ?$

◼ Number of equations > number of unknowns (overdetermined)

$a + 2b + 3c = 3,\quad 2a - b + c = 2,\quad 5a + b - 2c = 1,\quad 8a - b + 6c = 0$

$\begin{bmatrix} 1 & 2 & 3 \\ 2 & -1 & 1 \\ 5 & 1 & -2 \\ 8 & -1 & 6 \end{bmatrix} \begin{bmatrix} a \\ b \\ c \end{bmatrix} = \begin{bmatrix} 3 \\ 2 \\ 1 \\ 0 \end{bmatrix}$, $\boldsymbol{x} = ?$

Fundamentals of Media Processing, Deep Learning Moore-Penrose Pseudoinverse

◼ SVD gives the pseudoinverse of A as

$A^+ = V D^+ U^T$  (where $A = U D V^T$ and $D^+_{ii} = 1 / D_{ii}$ for the nonzero entries of D)

◼ The solution to any of the above linear systems is given by $\boldsymbol{x} = A^+ \boldsymbol{y}$

◼ If the linear system is:

⚫ Determined: this is the same as the inverse

⚫ Underdetermined: this gives us the solution with the smallest norm of $\boldsymbol{x}$

⚫ Overdetermined: this gives us the solution with the smallest error $\|A\boldsymbol{x} - \boldsymbol{y}\|$
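A minimal NumPy sketch (not part of the original slides) of $\boldsymbol{x} = A^+\boldsymbol{y}$ built from the SVD; the small singular-value threshold (1e-12) and the example system are illustrative assumptions, and the result is compared with NumPy's built-in pseudoinverse.

```python
import numpy as np

def pinv_solve(A, y):
    """Solve A x ~= y using the pseudoinverse A^+ = V D^+ U^T from the SVD A = U D V^T."""
    U, d, Vt = np.linalg.svd(A, full_matrices=False)
    d_plus = np.where(d > 1e-12, 1.0 / d, 0.0)   # invert only the nonzero singular values
    return Vt.T @ (d_plus * (U.T @ y))           # A^+ y

# Overdetermined system from the slide (4 equations, 3 unknowns): least-squares solution.
A = np.array([[1., 2., 3.], [2., -1., 1.], [5., 1., -2.], [8., -1., 6.]])
y = np.array([3., 2., 1., 0.])
x = pinv_solve(A, y)

print(x)
print(np.allclose(x, np.linalg.pinv(A) @ y))     # matches NumPy's built-in pseudoinverse
```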

Fundamentals of Media Processing, Deep Learning The Homogeneous System

◼ We may want to solve the homogeneous system, defined as

퐴풙 = 0

◼ If $\det(A^TA) = 0$, numerical solutions to the homogeneous system can be found with SVD or eigendecomposition

$A\boldsymbol{x} = 0 \;\rightarrow\; A^TA\boldsymbol{x} = 0, \quad A^TA\boldsymbol{v} = \lambda\boldsymbol{v}$

⚫ Take the eigenvector of $A^TA$ corresponding to the smallest eigenvalue ($\approx 0$)
⚫ Or, equivalently, the column of $V$ corresponding to the smallest singular value in $A = U D V^T$
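A minimal NumPy sketch (not part of the original slides): the nontrivial solution of $A\boldsymbol{x}=0$ is the right singular vector associated with the smallest singular value. The synthetic construction of A (rows made orthogonal to a known vector) is an assumption used only to make the example verifiable.

```python
import numpy as np

def solve_homogeneous(A):
    _, _, Vt = np.linalg.svd(A)
    return Vt[-1]                          # column of V for the smallest singular value

# Build a matrix whose rows are all orthogonal to x_true, so Ax = 0 has a 1-D solution space.
x_true = np.array([1.0, -2.0, 0.5])
x_true /= np.linalg.norm(x_true)
rng = np.random.default_rng(0)
B = rng.normal(size=(5, 3))
A = B - np.outer(B @ x_true, x_true)       # remove the x_true component from every row

x = solve_homogeneous(A)
print(np.allclose(A @ x, 0.0, atol=1e-10))   # True: x solves the homogeneous system
print(np.isclose(abs(x @ x_true), 1.0))      # recovered up to sign
```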

Fundamentals of Media Processing, Deep Learning Example: Structure-from-Motion

https://www.youtube.com/watch?v=i7ierVkXYa8

Fundamentals of Media Processing, Deep Learning Example: Structure-from-Motion

• Relative camera poses can be computed from point correspondences across images
• Given the camera poses (R, t) and the camera intrinsics (K), the 3D point X is obtained by triangulation

$\boldsymbol{x}_1 = K_1 [R_1 \mid T_1]\, \tilde{X}$

$\boldsymbol{x}_2 = K_2 [R_2 \mid T_2]\, \tilde{X}$

$A X = 0$  (a homogeneous system)

Fundamentals of Media Processing, Deep Learning Probability and Information Theory

Fundamentals of Media Processing, Deep Learning Why probability?

Most real problems are not deterministic.

Is this picture of a dog or a cat?

Fundamentals of Media Processing, Deep Learning Three possible sources of uncertainty

⚫ Inherent stochasticity in the system

e.g., Randomly shuffled card

⚫ Incomplete observability e.g., Monty Hall problem

⚫ Incomplete modeling

e.g., A robot that only sees the discretized space

⚫ A random variable is a variable that can take on different values randomly; e.g., $x$ denotes a value of the random variable $\mathrm{x}$ ($x \in \mathrm{x}$)

Fundamentals of Media Processing, Deep Learning Monty Hall problem

Fundamentals of Media Processing, Deep Learning Discrete case: Probability Mass Function

$P(x)$

⚫ The domain of P must be the set of all possible states of x

• $\forall x \in \mathrm{x},\; 0 \le P(x) \le 1$. An impossible event has probability 0, and no state can be less probable than that. Likewise, an event that is guaranteed to happen has probability 1, and no state can have a greater chance of occurring

• $\sum_{x \in \mathrm{x}} P(x) = 1$. We refer to this property as being normalized. Without this property, we could obtain probabilities greater than one by computing the probability of one of many events occurring

⚫ Example
• A fair die: $P(x) = 1/6$ for each outcome $x \in \{1, 2, 3, 4, 5, 6\}$

Fundamentals of Media Processing, Deep Learning Continuous case: Probability Density Function

$p(x)$

⚫ The domain of $p$ must be the set of all possible states of x

• $\forall x \in \mathrm{x},\; p(x) \ge 0$. Note that we do not require $p(x) \le 1$

• $\int p(x)\, dx = 1$
• A density does not give the probability of a specific state directly; e.g., naively summing $p(0.0001) + p(0.0002) + \dots$ can exceed 100%!

⚫ Example
• What is the probability that a randomly selected value within [0, 1] is more than 0.5?

Fundamentals of Media Processing, Deep Learning Marginal Probability

◼ The probability distribution over the subset

⚫ $\forall x \in \mathrm{x},\; P(x) = \sum_y P(x, y)$  (discrete)
⚫ $p(x) = \int p(x, y)\, dy$  (continuous)

Table: Statistics about the students

                 Male    Female
Tokyo            0.4     0.3
Outside Tokyo    0.1     0.2

Q. How often do students come from Tokyo?  A. $P(\text{Tokyo}) = 0.4 + 0.3 = 0.7$
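A minimal NumPy sketch (not part of the original slides) of the sum rule applied to the joint table above.

```python
import numpy as np

# Joint distribution P(origin, gender), rows: [Tokyo, Outside Tokyo], cols: [Male, Female].
P_joint = np.array([[0.4, 0.3],
                    [0.1, 0.2]])

P_origin = P_joint.sum(axis=1)   # marginal over gender: P(Tokyo), P(Outside Tokyo)
P_gender = P_joint.sum(axis=0)   # marginal over origin: P(Male), P(Female)

print(P_origin)                          # [0.7 0.3] -> 70% of students come from Tokyo
print(P_gender)                          # [0.5 0.5]
print(np.isclose(P_joint.sum(), 1.0))    # the joint distribution is normalized
```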

Fundamentals of Media Processing, Deep Learning Conditional Probability

◼ The probability of an event, given that some other event has happened

⚫ $P(y \mid x) = P(y, x) / P(x)$

⚫ $P(y, x) = P(y \mid x)\, P(x)$

◼ Example: The boy and girl problem

Mr. Jones has two children. One is a girl. What is the probability that the other is a boy?
• Each child is either male or female.
• Each child has the same chance of being male as of being female.
• The sex of each child is independent of the sex of the other.

Fundamentals of Media Processing, Deep Learning Conditional Probability

Possible combinations: [Boy/Boy, Boy/Girl, Girl/Boy, Girl/Girl]
⚫ $P(y)$: the probability that “the other is a boy”
⚫ $P(x)$: the probability that “one is a girl”
⚫ $P(y \mid x)$: the probability that “the other is a boy” given that “one is a girl”
⚫ $P(y, x)$: the probability that “one is a girl and the other is a boy”

$P(y \mid x) = \dfrac{P(y, x)}{P(x)} = \dfrac{2/4}{3/4} = \dfrac{2}{3}$

Fundamentals of Media Processing, Deep Learning Chain rule of Probability

⚫ $P(x^{(1)}, \dots, x^{(n)}) = P(x^{(1)}) \prod_{i=2}^{n} P(x^{(i)} \mid x^{(1)}, \dots, x^{(i-1)})$

⚫ $P(a, b, c) = P(a \mid b, c)\, P(b, c)$

⚫ $P(b, c) = P(b \mid c)\, P(c)$

⚫ $P(a, b, c) = P(a \mid b, c)\, P(b \mid c)\, P(c)$

Fundamentals of Media Processing, Deep Learning Independence and Conditional Independence

◼ The random variables x and y are independent if their probability distribution can be expressed as a product of two independent factors
⚫ $\forall x \in \mathrm{x}, y \in \mathrm{y}:\; p(x, y) = p(x)\, p(y)$
◼ Two random variables x and y are conditionally independent given a random variable z if the conditional probability distribution over x and y factorizes in this way for every value of z

⚫ $\forall x \in \mathrm{x}, y \in \mathrm{y}, z \in \mathrm{z}:\; p(x, y \mid z) = p(x \mid z)\, p(y \mid z)$

$P(\text{Red} \cap \text{Blue} \mid \text{Yellow}) = P(R \mid Y)\, P(B \mid Y)$

Fundamentals of Media Processing, Deep Learning Expectation, Variance and Covariance

◼ The expectation of some function f(x) with respect to P(x) or p(x) is the mean value that f takes on when x is drawn from P or p

• $E_{x \sim P}[f(x)] = \sum_x P(x)\, f(x)$

• $E_{x \sim p}[f(x)] = \int p(x)\, f(x)\, dx$

◼ The variance gives how much the values of a function of a random variable x vary as we sample different values of x from its probability distribution

• $\operatorname{Var}[f(x)] = E\big[(f(x) - E[f(x)])^2\big]$

◼ The covariance gives some sense of how much two values are linearly related to each other, as well as the scale of these variables:

• $\operatorname{Cov}[f(x), g(y)] = E\big[(f(x) - E[f(x)])\,(g(y) - E[g(y)])\big]$

• $\operatorname{Cov}(\boldsymbol{x})_{i,j} = \operatorname{Cov}(x_i, x_j)$: the covariance matrix of an n-dimensional random vector is $n \times n$

Fundamentals of Media Processing, Deep Learning Common Probability Distribution (1)

◼ Bernoulli distribution: $P(x = 1) = \phi$, $P(x = 0) = 1 - \phi$, $P(x) = \phi^x (1 - \phi)^{1 - x}$

◼ Gaussian (Normal) distribution

$\mathcal{N}(x; \mu, \sigma^2) = \sqrt{\dfrac{1}{2\pi\sigma^2}} \exp\left(-\dfrac{1}{2\sigma^2}(x - \mu)^2\right)$

The central limit theorem: The sum of many independent random variables is approximately normally distributed

Fundamentals of Media Processing, Deep Learning Common Probability Distribution (2)

◼ Multivariate normal distribution

$\mathcal{N}(\boldsymbol{x}; \boldsymbol{\mu}, \Sigma) = \sqrt{\dfrac{1}{(2\pi)^n \det(\Sigma)}} \exp\left(-\dfrac{1}{2}(\boldsymbol{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\boldsymbol{x} - \boldsymbol{\mu})\right)$

https://notesonml.wordpress.com/2015/06/30/chapter-14-anomaly-detection-part-2- multivariate-gaussian-distribution/

Fundamentals of Media Processing, Deep Learning Common Probability Distribution (3)

◼ Exponential distribution

$p(x; \lambda) = \lambda\, \mathbf{1}_{x \ge 0} \exp(-\lambda x)$

$\mathbf{1}_{x \ge 0}$ assigns zero probability to negative values of x

◼ Laplace distribution

$\operatorname{Laplace}(x; \mu, \gamma) = \dfrac{1}{2\gamma} \exp\left(-\dfrac{|x - \mu|}{\gamma}\right)$

Fundamentals of Media Processing, Deep Learning Mixtures of Distributions

◼ Empirical Distribution

$\hat{p}(x) = \dfrac{1}{m} \sum_{i=1}^{m} \delta(x - x^{(i)})$

⚫ $\delta$ is the Dirac delta function

◼ Gaussian Mixture Model

$p(x) = \sum_i \phi_i\, \mathcal{N}(x \mid \mu_i, \sigma_i^2)$,  $\phi_i$: latent variable (weight of the i-th Gaussian)

⚫ A GMM is a universal approximator of densities: any smooth density can be approximated to arbitrary accuracy with enough components
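A minimal NumPy sketch (not part of the original slides) that evaluates a 1-D Gaussian mixture density $p(x) = \sum_i \phi_i\, \mathcal{N}(x \mid \mu_i, \sigma_i^2)$; the component parameters are arbitrary.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def gmm_pdf(x, weights, mus, sigmas):
    x = np.asarray(x, dtype=float)
    return sum(w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, mus, sigmas))

weights = [0.3, 0.7]        # mixture weights phi_i (must sum to 1)
mus     = [-2.0, 1.5]
sigmas  = [0.5, 1.0]

xs = np.linspace(-5, 5, 1001)
p = gmm_pdf(xs, weights, mus, sigmas)
print(np.trapz(p, xs))      # ~1.0: the mixture density integrates to one
```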

Fundamentals of Media Processing, Deep Learning Application of GMM in Computer Vision

Background subtraction by GMM: https://www.youtube.com/watch?v=KGal_NvwI7A

Fundamentals of Media Processing, Deep Learning Useful Properties of Common Functions

◼ Logistic Sigmoid

$\sigma(x) = \dfrac{1}{1 + \exp(-x)}$

⚫ Good for producing values in the range [0, 1]

◼ Softplus function

$\zeta(x) = \log(1 + \exp(x))$

⚫ Good for producing values in the range (0, ∞)
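A minimal NumPy sketch (not part of the original slides) of the two functions above. The tanh identity for the sigmoid and the log1p rewriting of the softplus are standard numerically stable forms; they are implementation choices of this sketch, not something stated in the lecture.

```python
import numpy as np

def sigmoid(x):
    # sigmoid(x) = 0.5 * (1 + tanh(x / 2)), stable for large |x|
    return 0.5 * (1.0 + np.tanh(0.5 * np.asarray(x, dtype=float)))

def softplus(x):
    x = np.asarray(x, dtype=float)
    # log(1 + exp(x)) = max(x, 0) + log1p(exp(-|x|)), avoids overflow for large |x|
    return np.maximum(x, 0.0) + np.log1p(np.exp(-np.abs(x)))

xs = np.array([-100.0, -1.0, 0.0, 1.0, 100.0])
print(sigmoid(xs))    # values between 0 and 1
print(softplus(xs))   # positive values; for large x, softplus(x) ~ x
```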

Fundamentals of Media Processing, Deep Learning Bayes’ Rule

Posterior probability = (prior probability × likelihood) / evidence:

$P(x \mid y) = \dfrac{P(x)\, P(y \mid x)}{P(y)} = \dfrac{P(x, y)}{P(y)}$

Factory Problem: The entire output of a factory is produced on three machines. The three machines account for 20%, 30%, and 50% of the factory output. The fraction of defective items produced is 5% for the first machine, 3% for the second machine, and 1% for the third machine. If an item is chosen at random from the total output and is found to be defective, what is the probability that it was produced by the third machine?

Fundamentals of Media Processing, Deep Learning Factory Problem

The entire output of a factory is produced on three machines. The three machines account for 20%, 30%, and 50% of the factory output. The fraction of defective items produced is 5% for the first machine; 3% for the second machine; and 1% for the third machine. If an item is chosen at random from the total output and is found to be defective, what is the probability that it was produced by the third machine?

$P(X_A) = 0.2,\; P(X_B) = 0.3,\; P(X_C) = 0.5$

$P(Y \mid X_A) = 0.05,\; P(Y \mid X_B) = 0.03,\; P(Y \mid X_C) = 0.01$

$P(Y) = P(Y \mid X_A) P(X_A) + P(Y \mid X_B) P(X_B) + P(Y \mid X_C) P(X_C) = 0.024$

$P(X_C \mid Y) = \dfrac{P(Y \mid X_C)\, P(X_C)}{P(Y)} = \dfrac{0.005}{0.024} = \dfrac{5}{24}$
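A minimal Python sketch (not part of the original slides) of the same calculation, with the machines labeled A, B, C as above.

```python
# Factory problem with Bayes' rule.
priors      = {"A": 0.2, "B": 0.3, "C": 0.5}     # P(machine)
likelihoods = {"A": 0.05, "B": 0.03, "C": 0.01}  # P(defective | machine)

# Evidence: total probability of drawing a defective item.
p_defective = sum(priors[m] * likelihoods[m] for m in priors)

# Posterior for each machine given that the item is defective.
posteriors = {m: priors[m] * likelihoods[m] / p_defective for m in priors}

print(p_defective)        # 0.024
print(posteriors["C"])    # 0.2083... = 5/24
```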

Fundamentals of Media Processing, Deep Learning Fundamentals of Media Processing

Lecturer: 池畑 諭(Prof. IKEHATA Satoshi) 児玉 和也(Prof. KODAMA Kazuya)

Support: 佐藤 真一(Prof. SATO Shinichi) 孟 洋(Prof. MO Hiroshi)

10/16 (Today) Introduction Chap. 1

Basics of Machine Learning (maybe for beginners)

10/23 Basic mathematics (1) (Linear algebra, probability, numerical computation) Chap. 2,3,4

10/30 Basic mathematics (2) (Linear algebra, probability, numerical computation) Chap. 2,3,4

11/6 Machine Learning Basics (1) Chap. 5

11/13 Machine Learning Basics (2) Chap. 5

Basics of Deep Learning

11/20 Deep Feedforward Networks Chap. 6

11/27 Regularization and Deep Learning Chap. 7

12/4 Optimization for Training Deep Models Chap. 8

CNN and its Applications
12/11 Convolutional Neural Networks and Their Applications (1) Chap. 9 and more
12/18 Convolutional Neural Networks and Their Applications (2) Chap. 9 and more

Fundamentals of Media Processing, Deep Learning Last Week

◼ Linear Algebra
⚫ Matrix decomposition (eigendecomposition, SVD)
⚫ Solving linear systems

◼ Probability
⚫ Random variable $x$, probability mass function $P(x)$, probability density function $p(x)$
⚫ Marginal probability, conditional probability
⚫ Expectation, variance, covariance
⚫ Common probability distributions (e.g., normal distribution, Laplace distribution)
⚫ Mixtures of Gaussians
⚫ Bayes' rule

Fundamentals of Media Processing, Deep Learning Information Theory

Information theory studies the quantification, storage, and communication of information. It was originally proposed by Claude E. Shannon in 1948 to find fundamental limits on signal processing and communication operations such as data compression, in a landmark paper entitled “A Mathematical Theory of Communication”.

A key measure in information theory is entropy. Entropy quantifies the amount of uncertainty involved in the value of a random variable or the outcome of a random process. Some other important measures in information theory are mutual information, channel capacity, error exponents, and relative entropy.

The field is at the intersection of mathematics, statistics, computer science, physics, neurobiology, information engineering, and electrical engineering. The theory has also found applications in other areas, including statistical inference, natural language processing, cryptography, neurobiology, human vision, the evolution and function of molecular codes (bioinformatics), model selection in statistics, thermal physics, quantum computing, linguistics, plagiarism detection, pattern recognition, and anomaly detection.

https://en.wikipedia.org/wiki/Information_theory

Fundamentals of Media Processing, Deep Learning Information Theory

◼ Self-information (for a single outcome)
⚫ A likely event has low information; a less likely event has higher information

$I(x) = -\log P(x)$

In units of nats or bits: the amount of information gained by observing an event of probability 1/e (nats) or 1/2 (bits). For example, identifying the outcome of a fair coin flip (two equally likely outcomes) provides less information (lower entropy) than specifying the outcome of a roll of a die (six equally likely outcomes).

◼ Shannon entropy (the amount of uncertainty in an entire probability distribution)

$H(x) = E_{x \sim P}[I(x)] = -E_{x \sim P}[\log P(x)]$

⚫ Known as differential entropy when computed for a density p(x)

Fundamentals of Media Processing, Deep Learning Example

$H(x) = E_{x \sim P}[I(x)] = -\sum_i p_i \log p_i$

(Figure: example entropies of three distributions: 0 for a Dirac delta, 174 for a Gaussian, and 431 for a uniform distribution.)
https://medium.com/swlh/shannon-entropy-in-the-context-of-machine-learning-and-ai-24aee2709e32
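A minimal NumPy sketch (not part of the original slides) that evaluates $H = -\sum_i p_i \log_2 p_i$ in bits for a few hand-picked distributions, including the coin example used later in the cross-entropy slide.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits; 0 * log 0 is taken to be 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p)) + 0.0   # + 0.0 normalizes -0.0 to 0.0

print(entropy([1.0, 0.0]))                  # 0.0 bits: a certain outcome carries no information
print(entropy([0.5, 0.5]))                  # 1.0 bit : fair coin
print(entropy([1/6] * 6))                   # ~2.585 bits: fair die (more uncertain than the coin)
print(entropy([0.5, 0.25, 0.125, 0.125]))   # 1.75 bits: the skewed distribution used below
```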

Fundamentals of Media Processing, Deep Learning Kullback-Leibler (KL) divergence

$D_{KL}(P \| Q) = E_{x \sim P}\left[\log \dfrac{P(x)}{Q(x)}\right] = E_{x \sim P}[\log P(x) - \log Q(x)]$

◼ A measure of the difference between two distributions (higher means more different)
• The KL divergence is non-negative, and it is zero if and only if P and Q are the same distribution
• Often used for model fitting (e.g., fitting a GMM Q(x) to P(x))
• An asymmetric measure: $D_{KL}(P \| Q) \neq D_{KL}(Q \| P)$

Suppose we have a distribution 푝(푥) and want to approximate it with 푞(푥):

$D_{KL}(P \| Q)$: q is placed high wherever p is high
$\min_q \int p(x)\,\big(\log p(x) - \log q(x)\big)\, dx$

$D_{KL}(Q \| P)$: q is placed low wherever p is low
$\min_q \int q(x)\,\big(\log q(x) - \log p(x)\big)\, dx$

Fundamentals of Media Processing, Deep Learning Cross-entropy

• $H(P, Q) = H(P) + D_{KL}(P \| Q) = -E_{x \sim P}[\log Q(x)]$

⚫ The average number of bits needed to identify an event drawn from the set, if a coding scheme is used that is optimized for an “artificial” probability distribution Q, not true distribution P

⚫ Minimizing the cross-entropy with respect to Q is equivalent to minimizing the KL divergence

⚫ In classification problems, the commonly used cross-entropy loss measures the cross entropy between the empirical distribution of the labels (given the inputs) and the distribution predicted by the classifier
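A minimal NumPy sketch (not part of the original slides) that evaluates $H(P, Q) = H(P) + D_{KL}(P \| Q)$ in bits, using the skewed distribution P and the uniform model Q from the example on the next slide.

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    return -np.sum(np.where(p > 0, p * np.log2(p), 0.0))

def cross_entropy(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return -np.sum(np.where(p > 0, p * np.log2(q), 0.0))

def kl_divergence(p, q):
    return cross_entropy(p, q) - entropy(p)

P = np.array([0.5, 0.25, 0.125, 0.125])   # true distribution
Q = np.array([0.25, 0.25, 0.25, 0.25])    # model that assumes all outcomes are equally likely

print(entropy(P))           # 1.75 bits: optimal average number of questions (the entropy)
print(cross_entropy(P, Q))  # 2.0 bits : cost of the strategy optimized for Q
print(kl_divergence(P, Q))  # 0.25 bits: the penalty for using Q instead of P
```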

Fundamentals of Media Processing, Deep Learning Example

Correct probability distribution P(x): 50%, 25%, 12.5%, 12.5%
Incorrect probability distribution Q(x): 25%, 25%, 25%, 25%

Strategy 1 and Strategy 2 (see the linked figure): the strategy whose questions assume the incorrect distribution Q(x) needs 2 questions on average to guess the coin, while the strategy that exploits the correct distribution P(x) needs only 1.75 < 2.

$-E_{x \sim P}[\log Q(x)] = -\sum P \log Q$

Thus, the cross entropy for a given strategy is simply the expected number of questions to guess the color under that strategy. For a given setup, the better the strategy is, the lower the cross entropy is. The lowest cross entropy is that of the optimal strategy, which is just the entropy defined above.

https://www.quora.com/Whats-an-intuitive-way-to-think-of-cross-entropy

◼ In machine learning, it is inefficient to represent an entire probability distribution with a single function. To reduce the number of parameters, the function is often factorized
◼ Suppose $a$ influences $b$, and $b$ influences $c$, but $a$ and $c$ are independent given $b$:

$p(a, b, c) = p(a)\, p(b \mid a)\, p(c \mid b)$

◼ We can describe these factorizations using graphs (graphical models), with $\mathcal{G}$: graph, $\mathcal{C}_i$: clique, $\phi^{(i)}(\mathcal{C}_i)$: factor

• Directed model: $p(a, b, c, d, e) = p(a)\, p(b \mid a)\, p(c \mid a, b)\, p(d \mid b)\, p(e \mid c)$
• Undirected model (factor graph): $p(a, b, c, d, e) = \dfrac{1}{Z}\, \phi^{(1)}(a, b, c)\, \phi^{(2)}(b, d)\, \phi^{(3)}(c, e)$

Fundamentals of Media Processing, Deep Learning Markov Random Field (MRF)

◼ In computer vision algorithms, the most common graphical model may be the Markov Random Field (MRF), whose log-likelihood can be described using local neighborhood interaction (or penalty) terms.
◼ MRF models can be defined over discrete variables, such as image labels (e.g., image restoration)

Likelihood term plus penalty term (pairwise smoothness):

$E(\boldsymbol{x}, \boldsymbol{y}) = E_d(\boldsymbol{x}, \boldsymbol{y}) + E_p(\boldsymbol{x})$

$E_p(\boldsymbol{x}) = \sum_{\{(i,j),(k,l)\} \in \mathcal{N}} V_{i,j,k,l}\big(f(i, j),\, f(k, l)\big)$

풩4 and 풩8 neighborhood system

Fundamentals of Media Processing, Deep Learning STEREO MATCHING

STEREO as PIXEL-LABELING PROBLEM

Assign a disparity label to each pixel

(Figure: left and right input images with an epipolar line, and the output disparity label map)

Fundamentals of Media Processing, Deep Learning CLASSIC: WINNER-TAKES-ALL

Assign labels with lowest stereo matching cost

(Figure: left image, right image, and the resulting cost map along the epipolar line)

Cost map: $C_{i,l} = |I_{i+l} - I_i|$, the color difference between the left and right images in RGB space
- Other matching costs: SSD, SAD, NCC, …
- Computed as a series of “block matching” operations
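A minimal NumPy sketch (not the lecture's implementation) of winner-takes-all block matching: for each pixel of the left image, the disparity with the lowest sum-of-absolute-differences cost over a small window is selected. Grayscale images, a fixed disparity range, and SAD aggregation are assumptions of this sketch.

```python
import numpy as np

def wta_block_matching(left, right, max_disp=16, radius=2):
    h, w = left.shape
    win = 2 * radius + 1
    cost = np.full((h, w, max_disp), np.inf)
    for d in range(max_disp):
        # Per-pixel absolute difference between the left image and the right image shifted by d.
        diff = np.abs(left[:, d:] - right[:, :w - d])
        # Aggregate over a win x win window using a summed-area table (box filter).
        pad = np.pad(diff, radius, mode="edge")
        S = np.zeros((pad.shape[0] + 1, pad.shape[1] + 1))
        S[1:, 1:] = np.cumsum(np.cumsum(pad, axis=0), axis=1)
        agg = S[win:, win:] - S[:-win, win:] - S[win:, :-win] + S[:-win, :-win]
        cost[:, d:, d] = agg
    return np.argmin(cost, axis=2)     # winner-takes-all: lowest-cost disparity per pixel

# Tiny synthetic example: the right image is the left image shifted by 3 pixels.
rng = np.random.default_rng(0)
left = rng.random((40, 60))
right = np.roll(left, -3, axis=1)
disp = wta_block_matching(left, right, max_disp=8)
print(np.median(disp))                 # ~3 away from the image borders
```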

Fundamentals of Media Processing, Deep Learning MRF Modeling For Stereo Matching

◼ Markov Random Field (MRF)
- Many computer vision problems are formulated on a graph
- MRF: a graph structure where each node is only affected by its neighbors

The image is treated as a graph of uniform grid nodes; the pairwise term encodes the “smoothness prior”

Unary term plus pairwise term:

$\min_{\boldsymbol{l}} \sum_i C_{i, l_i} + \sum_i \sum_{j \in N(i)} S_{l_i, l_j}$

$C_{i,l} = |I_{i+l} - I_i|$ (unary: left-right consistency),  $S_{l_i, l_j} = w_{ij}\, |l_i - l_j|$ (pairwise: label smoothness)

Why MRF?: Convenient optimization methods are available
- Belief Propagation (BP), local optimum (Freeman2000, Sun2003)
- Graph Cuts (GC), global optimum (Kolmogorov and Zabih2001)

Fundamentals of Media Processing, Deep Learning WTA vs. MRF

(Figure: left/right input images; the original cost map and its minimum-cost disparity (WTA); the MRF-smoothed cost map and its minimum-cost disparity)

Fundamentals of Media Processing, Deep Learning Computer Vision LOVES MRF

Foreground / background segmentation and denoising over 0-255 labels (Boykov2006, Szeliski2008)

Accuracy is most important! Better cost functions and optimization techniques!

Multi-label MRFs, higher-order regularization, semantic segmentation (He2004)

Fundamentals of Media Processing, Deep Learning Conditional Random Field (CRF)

◼ In the classical Bayes model, the prior p(x) is independent of the observation y:

$p(\boldsymbol{x} \mid \boldsymbol{y}) \propto p(\boldsymbol{y} \mid \boldsymbol{x})\, p(\boldsymbol{x})$

◼ However, it is often helpful to condition the prior on the observation; the pairwise term then depends on y as well as x

$E(\boldsymbol{x} \mid \boldsymbol{y}) = E_d(\boldsymbol{x}, \boldsymbol{y}) + E_s(\boldsymbol{x}, \boldsymbol{y}) = \sum_p V_p(\boldsymbol{x}_p, \boldsymbol{y}) + \sum_{p,q} V_{p,q}(\boldsymbol{x}_p, \boldsymbol{x}_q, \boldsymbol{y})$

Fundamentals of Media Processing, Deep Learning Numerical Computation

Fundamentals of Media Processing, Deep Learning Numerical Concerns for Implementations of Deep Learning Algorithms

◼ Algorithms are often specified in terms of real numbers, but real numbers cannot be represented exactly on a finite computer
• Does the algorithm still work when implemented with a finite number of bits?
◼ Do small changes in the input to a function cause large changes in the output?
• Rounding errors, noise, and measurement errors can cause large changes
• Iterative search for the best input becomes difficult

(Figure: examples of underflow and overflow)

Fundamentals of Media Processing, Deep Learning Poor Conditioning

◼ Conditioning refers to how rapidly a function changes with respect to small changes in its inputs

◼ We can evaluate conditioning with a condition number
⚫ The sensitivity is an intrinsic property of the function, not of computational error
⚫ For example, the condition number of $f(\boldsymbol{x}) = A^{-1}\boldsymbol{x}$, where A is a positive semidefinite matrix, is

$\max_{i,j} \left| \dfrac{\lambda_i}{\lambda_j} \right|$, where the $\lambda$s are the eigenvalues of A
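A minimal NumPy sketch (not part of the original slides) that computes this ratio for a symmetric example matrix and compares it with NumPy's built-in condition number estimate; the matrix entries are arbitrary.

```python
import numpy as np

A = np.array([[1.0, 0.0],
              [0.0, 1e-4]])          # eigenvalues 1 and 1e-4: poorly conditioned

lam = np.linalg.eigvalsh(A)          # eigenvalues of a symmetric matrix
cond_from_eigs = np.max(np.abs(lam)) / np.min(np.abs(lam))

print(cond_from_eigs)                # 10000.0
print(np.linalg.cond(A, 2))          # same value: ratio of largest to smallest singular value

# Consequence: solving Ax = b can amplify small perturbations of b by up to this factor.
```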

Fundamentals of Media Processing, Deep Learning Gradient-Based Optimization

• Objective function: the function we want to minimize
• Also called the criterion, cost function, loss function, or error function
• $\boldsymbol{x}^* = \operatorname{argmin} f(\boldsymbol{x})$
• The derivative of $f(x)$ is denoted $f'(x)$ or $df/dx$
• Gradient descent reduces $f(x)$ by moving $x$ in small steps with the opposite sign of the derivative
• Stationary points: local minima or maxima, where $f'(x) = 0$

(Figures: gradient descent; local minima and the global minimum)

Fundamentals of Media Processing, Deep Learning Partial/Directional Derivatives for multiple inputs

$z = f(\boldsymbol{x})$;  partial derivatives $\dfrac{\partial f}{\partial x_i}$;  gradient $\nabla_{\boldsymbol{x}} f(\boldsymbol{x}) = \left[\dfrac{\partial f}{\partial x_1}, \cdots, \dfrac{\partial f}{\partial x_m}\right]^T$

◼ The directional derivative in direction $\boldsymbol{u}$ is the slope of the function $f$ in direction $\boldsymbol{u}$. To find the “steepest” descent direction,

$\min_{\boldsymbol{u}} \boldsymbol{u}^T \nabla_{\boldsymbol{x}} f(\boldsymbol{x}) = \min_{\boldsymbol{u}} \|\boldsymbol{u}\|_2 \|\nabla_{\boldsymbol{x}} f(\boldsymbol{x})\|_2 \cos\theta \;\cong\; \min_{\theta} \cos\theta$

i.e., the direction opposite to the gradient $\nabla_{\boldsymbol{x}} f(\boldsymbol{x})$

• Gradient descent for multiple inputs: $\boldsymbol{x}^{t+1} = \boldsymbol{x}^{t} - \epsilon \nabla_{\boldsymbol{x}} f(\boldsymbol{x}^{t})$
• $\epsilon$ (the learning rate) is fixed or adaptively selected (line search)

https://en.wikipedia.org/wiki/Directional_derivative
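A minimal NumPy sketch (not part of the original slides) of the update $\boldsymbol{x}^{t+1} = \boldsymbol{x}^{t} - \epsilon \nabla_{\boldsymbol{x}} f(\boldsymbol{x}^{t})$ with a fixed learning rate; the quadratic objective is an illustrative assumption chosen so that the answer can be checked in closed form.

```python
import numpy as np

# f(x) = 0.5 * x^T A x - b^T x, whose gradient is A x - b (A symmetric positive definite).
A = np.array([[3.0, 0.5],
              [0.5, 1.0]])
b = np.array([1.0, -2.0])

def grad_f(x):
    return A @ x - b

x = np.zeros(2)
eps = 0.1                        # fixed learning rate
for _ in range(200):
    x = x - eps * grad_f(x)      # step in the direction opposite to the gradient

print(x)                         # converges toward the minimizer A^{-1} b
print(np.linalg.solve(A, b))     # closed-form solution for comparison
```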

For $\boldsymbol{f}: \mathbb{R}^m \rightarrow \mathbb{R}^n$, the Jacobian matrix is

$\nabla_{\boldsymbol{x}} \boldsymbol{f}(\boldsymbol{x}) = \begin{bmatrix} \dfrac{\partial f_1}{\partial x_1} & \cdots & \dfrac{\partial f_1}{\partial x_m} \\ \vdots & & \vdots \\ \dfrac{\partial f_n}{\partial x_1} & \cdots & \dfrac{\partial f_n}{\partial x_m} \end{bmatrix}$

and the Hessian matrix is $H(f)(\boldsymbol{x})_{ij} = \dfrac{\partial^2}{\partial x_i \partial x_j} f(\boldsymbol{x})$

◼ When the second partial derivatives are continuous, $H(f)(\boldsymbol{x})_{ij} = H(f)(\boldsymbol{x})_{ji}$
◼ A real symmetric Hessian matrix has an eigendecomposition. At a critical point:
⚫ When the Hessian is positive definite, the point is a local minimum
⚫ When the Hessian is negative definite, the point is a local maximum
⚫ When the Hessian has both positive and negative eigenvalues, the point is a saddle point

Fundamentals of Media Processing, Deep Learning Beyond the Gradient: Jacobian and Hessian Matrices

◼ The second derivative in a specific direction represented by a unit vector $\boldsymbol{d}$ is $\boldsymbol{d}^T H \boldsymbol{d}$

◼ Second-order Taylor approximation around the current point $\boldsymbol{x}^{(0)}$, with gradient $\boldsymbol{g}$ and Hessian $H$ evaluated at $\boldsymbol{x}^{(0)}$:

$f(\boldsymbol{x}) \approx f(\boldsymbol{x}^{(0)}) + (\boldsymbol{x} - \boldsymbol{x}^{(0)})^T \boldsymbol{g} + \dfrac{1}{2}(\boldsymbol{x} - \boldsymbol{x}^{(0)})^T H (\boldsymbol{x} - \boldsymbol{x}^{(0)})$

◼ The new point given by a gradient step, $\boldsymbol{x}^{(0)} - \epsilon\boldsymbol{g}$, then satisfies

$f(\boldsymbol{x}^{(0)} - \epsilon\boldsymbol{g}) \approx f(\boldsymbol{x}^{(0)}) - \epsilon\, \boldsymbol{g}^T\boldsymbol{g} + \dfrac{1}{2}\epsilon^2\, \boldsymbol{g}^T H \boldsymbol{g}$

◼ When $\boldsymbol{g}^T H \boldsymbol{g}$ is positive, the optimal learning rate that decreases this approximation the most is

$\epsilon^* = \dfrac{\boldsymbol{g}^T\boldsymbol{g}}{\boldsymbol{g}^T H \boldsymbol{g}}$

◼ In gradient descent, the step size must be small enough
◼ Newton's method is based on a first-order or second-order Taylor expansion

$f(\boldsymbol{x}) \approx f(\boldsymbol{x}^{(0)}) + (\boldsymbol{x} - \boldsymbol{x}^{(0)})^T \boldsymbol{g} + \dfrac{1}{2}(\boldsymbol{x} - \boldsymbol{x}^{(0)})^T H (\boldsymbol{x} - \boldsymbol{x}^{(0)})$

- The critical point ($\nabla f(\boldsymbol{x}^*) = \boldsymbol{0}$) is estimated as $x^{(0)} - f(x^{(0)})/f'(x^{(0)})$ (1st order: univariate root finding) or $\boldsymbol{x}^{(0)} - H^{-1}\boldsymbol{g}$ (2nd order)

• When f is a positive definite quadratic function, Newton's method jumps directly to the minimum of the function in a single step.
• When f is not truly quadratic but can be locally approximated as a positive definite quadratic, Newton's method consists of applying such jumps repeatedly.
• Jumping to the minimum of the approximation can reach the critical point much faster than gradient descent would.
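A minimal Python sketch (not part of the original slides) of Newton's method on a univariate function, iterating $x \leftarrow x - f'(x)/f''(x)$ (the Hessian reduces to a scalar here); the polynomial f is an arbitrary example.

```python
def f(x):        return x**4 - 3.0 * x**2 + 2.0 * x
def f_prime(x):  return 4.0 * x**3 - 6.0 * x + 2.0
def f_second(x): return 12.0 * x**2 - 6.0

x = 2.0                                   # starting point
for _ in range(20):
    x = x - f_prime(x) / f_second(x)      # Newton step toward a critical point of f

print(x, f_prime(x))                      # f'(x) ~ 0; here x ~ 1.0, a local minimum (f''(1) > 0)
```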

Fundamentals of Media Processing, Deep Learning Example (For univariate function: 1st order case)

Fundamentals of Media Processing, Deep Learning Constrained Optimization

◼ Constraint optimization problem is generally written as

$\min_{\boldsymbol{x}} f(\boldsymbol{x}) \;\; \text{s.t.} \;\; g_i(\boldsymbol{x}) = 0 \;\text{(equality constraints)}, \quad h_i(\boldsymbol{x}) \le 0 \;\text{(inequality constraints)}$

◼ Karush-Kuhn-Tucker (KKT) Multiplier (Generalization of the Lagrange Multipliers)

$\min_{x} \max_{\lambda} \max_{\mu \ge 0} L(x, \lambda, \mu) = \min_{x} \max_{\lambda} \max_{\mu \ge 0} \Big[ f(x) + \sum_i \lambda_i g_i(x) + \sum_i \mu_i h_i(x) \Big]$  (the generalized Lagrangian)

Fundamentals of Media Processing, Deep Learning Karush-Kuhn-Tucker Condition

◼ KKT Conditions: For a point to be optimal,

1. The gradient of the generalized Lagrangian is zero
2. All constraints on both x and the KKT multipliers are satisfied
3. The inequality constraints exhibit “complementary slackness”: $\boldsymbol{\mu} \odot \boldsymbol{h}(\boldsymbol{x}) = 0$

Fundamentals of Media Processing, Deep Learning Example: Linear Least Squares

◼ Consider the unconstrained problem

$f(\boldsymbol{x}) = \dfrac{1}{2} \|A\boldsymbol{x} - \boldsymbol{b}\|_2^2$

$\nabla_{\boldsymbol{x}} f(\boldsymbol{x}) = A^T (A\boldsymbol{x} - \boldsymbol{b}) = A^T A \boldsymbol{x} - A^T \boldsymbol{b}$

$\boldsymbol{x}^{n+1} \leftarrow \boldsymbol{x}^{n} - \epsilon (A^T A \boldsymbol{x}^{n} - A^T \boldsymbol{b})$  (gradient descent)

◼ If subject to $\boldsymbol{x}^T\boldsymbol{x} \le 1$, introduce the Lagrangian and solve

$L(\boldsymbol{x}, \lambda) = f(\boldsymbol{x}) + \lambda(\boldsymbol{x}^T\boldsymbol{x} - 1), \quad \min_{\boldsymbol{x}} \max_{\lambda \ge 0} L(\boldsymbol{x}, \lambda)$

$\nabla_{\boldsymbol{x}} L(\boldsymbol{x}, \lambda) = A^T A \boldsymbol{x} - A^T \boldsymbol{b} + 2\lambda\boldsymbol{x} = 0 \;\Rightarrow\; \boldsymbol{x} = (A^T A + 2\lambda I)^{-1} A^T \boldsymbol{b}$

◼ The magnitude of $\lambda$ must be chosen so that the constraint is obeyed:

$\nabla_{\lambda} L(\boldsymbol{x}, \lambda) = \boldsymbol{x}^T\boldsymbol{x} - 1$

- Increase $\lambda$ ($\ge 0$) by gradient ascent until $\nabla_{\lambda} L(\boldsymbol{x}, \lambda) = \boldsymbol{x}^T\boldsymbol{x} - 1$ becomes zero (or the constraint becomes inactive)
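A minimal NumPy sketch (not part of the original slides) of the constrained least-squares example: $\boldsymbol{x}(\lambda) = (A^TA + 2\lambda I)^{-1}A^T\boldsymbol{b}$, with $\lambda \ge 0$ increased until $\boldsymbol{x}^T\boldsymbol{x} \le 1$. Instead of the slide's gradient ascent on $\lambda$, this sketch uses bisection on $\lambda$ (a swapped-in technique that exploits the fact that $\|\boldsymbol{x}(\lambda)\|$ decreases as $\lambda$ grows); the random A and b are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(20, 3))
b = 5.0 * rng.normal(size=20)        # scaled so the unconstrained solution likely violates ||x|| <= 1

def x_of_lambda(lam):
    return np.linalg.solve(A.T @ A + 2.0 * lam * np.eye(3), A.T @ b)

x = x_of_lambda(0.0)                 # unconstrained least-squares solution (lambda = 0)
if x @ x > 1.0:                      # constraint active: find lambda > 0 with ||x(lambda)|| = 1
    lo, hi = 0.0, 1e6
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if x_of_lambda(mid) @ x_of_lambda(mid) > 1.0:
            lo = mid                 # constraint still violated: lambda must grow
        else:
            hi = mid
    x = x_of_lambda(hi)

print(np.linalg.norm(x))             # ~1.0 when the constraint is active
```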

Fundamentals of Media Processing, Deep Learning Class material is available at https://satoshi-ikehata.github.io/mediaprocessing.html

Fundamentals of Media Processing, Deep Learning