
Optimal mass transport as a distance measure between images

Axel Ringh1

1Department of Mathematics, KTH Royal Institute of Technology, Stockholm, Sweden.

21st of June 2018, INRIA, Sophia-Antipolis, France

Acknowledgements

This is based on joint work with Johan Karlsson1.

[1] J. Karlsson and A. Ringh. Generalized Sinkhorn iterations for regularizing inverse problems using optimal mass transport. SIAM Journal on Imaging Sciences, 10(4), 1935-1962, 2017.

I acknowledge financial support from the Swedish Research Council (VR) and the Swedish Foundation for Strategic Research (SSF).

Code: https://github.com/aringh/Generalized-Sinkhorn-and-tomography
The code is based on ODL: https://github.com/odlgroup/odl


Outline

Background
  Inverse problems
  Optimal mass transport
Sinkhorn iterations - solving discretized optimal transport problems
Sinkhorn iterations as dual coordinate ascent
Inverse problems with optimal mass transport priors
Example in computerized tomography

Background Inverse problems

Consider the problem of recovering f ∈ X from data g ∈ Y, given by

    g = A(f) + 'noise'.

Notation:
  X is called the reconstruction space.
  Y is called the data space.
  A : X → Y is the forward operator.
  A∗ : Y → X denotes the adjoint operator.

The problems of interest are ill-posed inverse problems: a solution might not exist, the solution might not be unique, and the solution does not depend continuously on data. Simply put: A⁻¹ does not exist as a continuous bijection!

It comes down to finding an approximate inverse A† such that

    g = A(f) + 'noise'  =⇒  A†(g) ≈ f.

Background Variational regularization

A common technique for solving ill-posed inverse problems is variational regularization:

    arg min_{f ∈ X}  G(A(f), g) + λ F(f)

  G : Y × Y → R is the data discrepancy functional.
  F : X → R is the regularization functional.
  λ is the regularization parameter; it controls the trade-off between data matching and regularization.

A common example in imaging is total variation (TV) regularization:

    G(h, g) = ‖h − g‖₂²,   F(f) = ‖∇f‖₁.

If A is linear this is a convex problem!
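To make the roles of G, F, and λ concrete, here is a small, purely illustrative sketch that evaluates the TV-regularized objective G(A(f), g) + λ F(f) for a candidate f. The toy forward operator (a column-sum "projection"), the anisotropic form of the total variation, and all parameter values are assumptions made only for this example; this is not the solver used later in the talk.

```python
import numpy as np

def tv_objective(f, g, forward_op, lam):
    """Evaluate G(A(f), g) + lam * F(f) for the example above:
    G(h, g) = ||h - g||_2^2 and F(f) = ||grad f||_1 (anisotropic, forward differences)."""
    residual = forward_op(f) - g
    data_term = np.sum(residual ** 2)
    tv = np.abs(np.diff(f, axis=0)).sum() + np.abs(np.diff(f, axis=1)).sum()
    return data_term + lam * tv

# Toy usage: a hypothetical forward operator that sums the image along columns.
f = np.zeros((32, 32)); f[8:24, 8:24] = 1.0
A = lambda x: x.sum(axis=0)
g = A(f) + 0.05 * np.random.default_rng(1).standard_normal(32)
print(tv_objective(f, g, A, lam=0.1))
```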

Background Incorporating prior information in variational schemes

How can one incorporate prior information in such a scheme?

One way: consider

    arg min_{f ∈ X}  G(A(f), g) + λ F(f) + γ H(f̃, f),

where f̃ is a prior/template and H defines "closeness" to f̃. What is a good choice for H?

Scenarios where this is potentially of interest:
  incomplete measurements, e.g., limited angle tomography;
  spatiotemporal imaging: the data is a time-series of data sets {g_t}_{t=0}^T, where for each set the underlying image has undergone a deformation, and each data set g_t normally "contains less information", so A†(g_t) alone gives a poor reconstruction.

Approach: solve the coupled inverse problems

    arg min_{f0,...,fT ∈ X}  Σ_{j=0}^{T} [ G(A(fj), gj) + λ F(fj) ] + γ Σ_{j=1}^{T} H(f_{j−1}, fj).

Background Measuring distances between functions: the Lp metrics

Given two functions f0(x) and f1(x), what is a suitable way to measure the distance between them?

One suggestion: measure it pointwise, e.g., using an Lp metric,

    ‖f0 − f1‖_p = ( ∫_D |f0(x) − f1(x)|^p dx )^{1/p}.

Drawback: it is, for example, insensitive to shifts.

Example: it gives the same distance from f0 to f1 as from f0 to f2: ‖f0 − f1‖₁ = ‖f0 − f2‖₁ = 8.
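A small numerical illustration of this drawback (the discretized box functions and shift sizes below are made up for the example and differ from the numbers on the slide): once the supports of two functions are disjoint, the L1 distance no longer reflects how far apart they are. For contrast, the last two lines use the fact that in 1D, for equal total mass and cost |x − y|, the optimal transport cost equals the L1 distance between the cumulative sums; transport-type distances, introduced next, do distinguish the shifts.

```python
import numpy as np

# Discretized "box" function and two shifted copies (shift sizes are made up for this example).
x = np.arange(100)
f0 = np.where((x >= 10) & (x < 20), 1.0, 0.0)
f1 = np.roll(f0, 15)   # small shift: supports of f0 and f1 are already disjoint
f2 = np.roll(f0, 60)   # much larger shift

# The L1 distance saturates as soon as the supports are disjoint:
print(np.abs(f0 - f1).sum(), np.abs(f0 - f2).sum())   # 20.0 and 20.0 -- it cannot tell the shifts apart

# In 1D, for equal total mass and cost |x - y|, the transport cost is the L1 distance of cumulative sums:
print(np.abs(np.cumsum(f0) - np.cumsum(f1)).sum())    # 150.0  (mass 10 moved 15 steps)
print(np.abs(np.cumsum(f0) - np.cumsum(f2)).sum())    # 600.0  (mass 10 moved 60 steps)
```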

Background Optimal mass transport - Monge formulation

Gaspard Monge formulated optimal mass transport in 1781, for the optimal transport of soil in the construction of forts and roads.

Let c(x0, x1) : X × X → R+ describe the cost of transporting a unit mass from location x0 to location x1.

Given two functions f0, f1 : X → R+, find the map φ : X → X minimizing the transport cost

    ∫_X c(x, φ(x)) f0(x) dx,

where φ is a mass preserving map from f0 to f1:

    ∫_{x ∈ A} f1(x) dx = ∫_{φ(x) ∈ A} f0(x) dx   for all A ⊂ X.

This is a nonconvex problem!

[Figures: portrait of Gaspard Monge; illustration of two densities f0 and f1 and the transport between them.]

Background Optimal mass transport - Kantorovich formulation

Leonid Kantorovich: convex formulation and duality theory in 1942. Nobel Memorial Prize in Economics 1975 for "contributions to the theory of optimum allocation of resources."

Again, let c(x0, x1) denote the cost of transporting a unit mass from the point x0 to the point x1.

Given two functions f0, f1 : X → R+, find a transport plan M : X × X → R+, where M(x0, x1) is the amount of mass moved from x0 to x1. Look for the M that minimizes the transportation cost:

    min_{M ≥ 0}  ∫_{X×X} c(x0, x1) M(x0, x1) dx0 dx1
    s.t.  f0(x0) = ∫_X M(x0, x1) dx1,  x0 ∈ X,
          f1(x1) = ∫_X M(x0, x1) dx0,  x1 ∈ X.

[Figures: portrait of Leonid Kantorovich; illustration of a transportation plan M between two densities f0 and f1.]

Background Measuring distances between functions: optimal mass transport

Define a distance between two functions f0(x) and f1(x) using optimal transport:

    T(f0, f1) :=  min_{M ≥ 0}  ∫_{X×X} c(x0, x1) M(x0, x1) dx0 dx1
                  s.t.  f0(x0) = ∫_X M(x0, x1) dx1,  x0 ∈ X,
                        f1(x1) = ∫_X M(x0, x1) dx0,  x1 ∈ X.

If d(x, y) is a metric on X and c(x, y) = d(x, y)^p, then T(f0, f1)^{1/p} is a metric on the space of measures.

Example revisited: let c(x, y) = ‖x − y‖₂². Then

    T(f0, f1) = 4 · (√2)² = 8,
    T(f0, f2) = 4 · 5² = 100.

This indicates that optimal transport is a more natural distance between two images than Lp, at least if one is a deformation of the other.

Background Optimal mass transport - discrete formulation

How do we solve the optimal transport problem? Here: "discretize, then optimize".

This gives a linear programming problem: for vectors f0 ∈ R^n and f1 ∈ R^m and a cost matrix C = [cij] ∈ R^{n×m}, where cij is the transportation cost between pixels xi and xj, find the transport plan M = [mij] ∈ R^{n×m}, where mij is the mass transported between pixels xi and xj, such that

    min_{mij ≥ 0}  Σ_{i=1}^{n} Σ_{j=1}^{m} cij mij
    s.t.  Σ_{j=1}^{m} mij = f0(i),  i = 1, ..., n,
          Σ_{i=1}^{n} mij = f1(j),  j = 1, ..., m,

or equivalently, in matrix form,

    min_{M ≥ 0}  trace(C^T M)
    s.t.  M 1_m = f0,
          M^T 1_n = f1.

The number of variables is n · m. To compare the distance between two images of size n = m = 256 × 256:

    M ∈ R^{256² × 256²}  =⇒  approximately 4 · 10⁹ variables. Prohibitively large!
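For small instances the linear program above can be handed directly to a generic LP solver; the sketch below uses SciPy's linprog on a toy 5-point example (the grid and the |i − j| cost are arbitrary choices for illustration). This of course does not scale to 256 × 256 images, which is exactly the motivation for the Sinkhorn approach on the following slides.

```python
import numpy as np
from scipy.optimize import linprog

def discrete_ot(C, f0, f1):
    """Solve min <C, M> s.t. M 1 = f0, M^T 1 = f1, M >= 0, via a generic LP solver.
    The plan M is vectorized row-wise; only feasible for small n and m."""
    n, m = C.shape
    A_rows = np.kron(np.eye(n), np.ones((1, m)))   # row sums:    sum_j m_ij = f0[i]
    A_cols = np.kron(np.ones((1, n)), np.eye(m))   # column sums: sum_i m_ij = f1[j]
    res = linprog(C.ravel(), A_eq=np.vstack([A_rows, A_cols]),
                  b_eq=np.concatenate([f0, f1]), bounds=(0, None), method="highs")
    return res.fun, res.x.reshape(n, m)

# Toy example on 5 grid points with cost |i - j|: two unit masses each move two steps.
x = np.arange(5)
C = np.abs(x[:, None] - x[None, :]).astype(float)
f0 = np.array([1.0, 1.0, 0.0, 0.0, 0.0])
f1 = np.array([0.0, 0.0, 1.0, 1.0, 0.0])
cost, M = discrete_ot(C, f0, f1)
print(cost)   # 4.0
```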


Background Optimal mass transport - Sinkhorn iterations

The original problem is too computationally demanding. Solution: introduce an entropy barrier/regularization term D(M) = Σ_{i,j} (mij log(mij) − mij + 1) [1],

    T_ε(f0, f1) := min_{M ≥ 0}  trace(C^T M) + ε D(M)
                   s.t.  f0 = M 1,
                         f1 = M^T 1.

Using Lagrangian relaxation gives

    L(M, λ0, λ1) = trace(C^T M) + ε D(M) + λ0^T (f0 − M 1) + λ1^T (f1 − M^T 1).

Given dual variables λ0 and λ1, the minimizing mij satisfies

    0 = ∂L(M, λ0, λ1)/∂mij = cij + ε log(mij) − λ0(i) − λ1(j),

which can be expressed as mij = e^{λ0(i)/ε} e^{−cij/ε} e^{λ1(j)/ε}, i.e.,

    M = diag(e^{λ0/ε}) K diag(e^{λ1/ε}),

where K = exp(−C/ε). Here and in what follows exp(·), log(·), and ./ denote elementwise functions. The solution M = diag(e^{λ0/ε}) K diag(e^{λ1/ε}) is thus specified by only n + m unknowns!

[1] M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, pages 2292-2300, 2013.

Background Optimal mass transport - Sinkhorn iterations

How do we find the values of λ0 and λ1?

Let u0 = exp(λ0/ε) and u1 = exp(λ1/ε). The optimal solution M = diag(u0) K diag(u1) needs to satisfy

    diag(u0) K diag(u1) 1 = f0   and   diag(u1) K^T diag(u0) 1 = f1.

Theorem (Sinkhorn iterations [2])
For any matrix K with positive elements there are diagonal matrices diag(u0) and diag(u1) such that M = diag(u0) K diag(u1) has prescribed row and column sums f0 and f1. The vectors u0 and u1 can be obtained by alternating marginalization:

    u0 = f0 ./ (K u1)
    u1 = f1 ./ (K^T u0).

Each iteration only requires the matrix-vector products K u1 and K^T u0; this is the bottleneck. The convergence rate is linear. The method is thus highly computationally efficient, allowing large problems to be solved.

[2] R. Sinkhorn. Diagonal equivalence to matrices with prescribed row and column sums. The American Mathematical Monthly, 74(4), 402-405, 1967.
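As a concrete illustration of the theorem, here is a minimal NumPy sketch of the Sinkhorn iterations for the entropy-regularized problem. The Gaussian test histograms, the value of ε, and the fixed iteration count are arbitrary choices for the example; no stopping criterion or log-domain stabilization is included.

```python
import numpy as np

def sinkhorn(C, f0, f1, eps=1e-2, n_iter=500):
    """Entropy-regularized optimal transport between histograms f0 and f1 by Sinkhorn
    iterations (alternating marginalization). Returns the plan M and the cost <C, M>."""
    K = np.exp(-C / eps)
    u1 = np.ones(C.shape[1])
    for _ in range(n_iter):
        u0 = f0 / (K @ u1)       # enforce the row sums:    M 1   = f0
        u1 = f1 / (K.T @ u0)     # enforce the column sums: M^T 1 = f1
    M = u0[:, None] * K * u1[None, :]
    return M, np.sum(C * M)

# 1D example: two Gaussian-like histograms, quadratic cost.
x = np.linspace(0, 1, 50)
f0 = np.exp(-(x - 0.3) ** 2 / 0.01); f0 /= f0.sum()
f1 = np.exp(-(x - 0.7) ** 2 / 0.01); f1 /= f1.sum()
C = (x[:, None] - x[None, :]) ** 2
M, cost = sinkhorn(C, f0, f1)
print(np.allclose(M.sum(axis=0), f1), np.allclose(M.sum(axis=1), f0, atol=1e-6))
print(cost)   # roughly 0.16, i.e. about the squared distance between the two means
```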

Sinkhorn iterations as dual coordinate ascent

There are several ways to motivate the Sinkhorn iterations [1]:
  diagonal matrix scaling,
  Bregman projections,
  Dykstra's algorithm.

Here we introduce yet another interpretation:
  use a dual formulation;
  the Sinkhorn iteration corresponds to dual coordinate ascent;
  this allows us to generalize the Sinkhorn iterations;
  it gives an approach for addressing inverse problems with an optimal transport term.

[1] J.D. Benamou, G. Carlier, M. Cuturi, L. Nenna, and G. Peyré. Iterative Bregman projections for regularized transportation problems. SIAM Journal on Scientific Computing, 37(2), A1111-A1138, 2015.

Sinkhorn iterations as dual coordinate ascent

Lagrangian relaxation gave the optimal form of the primal variable, M∗ = diag(u0) K diag(u1). The Lagrangian dual function is

    ϕ(u0, u1) := min_{M ≥ 0} L(M, u0, u1) = L(M∗, u0, u1) = ...
               = ε log(u0)^T f0 + ε log(u1)^T f1 − ε u0^T K u1 + ε nm.

The dual problem is thus

    max_{u0, u1}  ϕ(u0, u1).

Setting the gradient with respect to u0 equal to zero gives

    0 = f0 ./ u0 − K u1,

and with respect to u1

    0 = f1 ./ u1 − K^T u0.

These are the Sinkhorn iterations!

Sinkhorn iterations as dual coordinate ascent

For g proper, convex, and lower semicontinuous we can now consider problems of the form

    min_{f1}  T_ε(f0, f1) + g(f1)  =  min_{M ≥ 0, f1}  trace(C^T M) + ε D(M) + g(f1)
                                      s.t.  f0 = M 1,
                                            f1 = M^T 1.

This can be solved by dual coordinate ascent,

    u0 = f0 ./ (K u1)
    0 ∈ ∂g*(−ε log(u1)) ./ u1 − K^T u0,

if the second inclusion can be solved efficiently. This can be the case when ∂g*(·) acts component-wise. Examples of such cases:
  g(·) = I_{f̃}(·), the indicator function of {f̃}, which recovers the original optimal transport problem;
  g(·) = ‖ · ‖₂².


Inverse problems with optimal mass transport priors

Consider the inverse problem

    min_{f1 ≥ 0}  ‖∇f1‖₁ + γ T_ε(f0, f1)
    s.t.  ‖A f1 − g‖₂ ≤ κ.

  TV-regularization term: ‖∇f1‖₁.
  Forward model A, data g, and data mismatch term: ‖A f1 − g‖₂.
  Prior f0, entering through the transport term γ T_ε(f0, f1).

This is a large convex optimization problem with several terms, so variable splitting is common: ADMM, the primal-dual hybrid gradient algorithm, the primal-dual Douglas-Rachford algorithm.

A common tool in these algorithms is the proximal operator [1] of the involved functionals F:

    Prox_F^σ(h) = argmin_f  F(f) + (1/(2σ)) ‖f − h‖₂².

[1] R.T. Rockafellar. Monotone operators and the proximal point algorithm. SIAM Journal on Control and Optimization, 14(5), 877-898, 1976.
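As a simple illustration of the proximal operator, the sketch below computes Prox_F^σ for F(f) = ‖f‖₁, where a closed form (soft-thresholding) is available, and checks it against brute-force numerical minimization. The test vector and σ are arbitrary; this is only to make the definition tangible before the transport case on the next slide.

```python
import numpy as np
from scipy.optimize import minimize

def prox_l1(h, sigma):
    """Proximal operator of F(f) = ||f||_1: argmin_f ||f||_1 + ||f - h||_2^2 / (2*sigma).
    The closed form is soft-thresholding with threshold sigma."""
    return np.sign(h) * np.maximum(np.abs(h) - sigma, 0.0)

# Check the closed form against brute-force numerical minimization of the same objective.
h, sigma = np.array([3.0, -0.2, 1.5, -4.0]), 1.0
objective = lambda f: np.abs(f).sum() + np.sum((f - h) ** 2) / (2 * sigma)
print(prox_l1(h, sigma))                                   # [ 2.  -0.   0.5 -3. ]
print(minimize(objective, x0=h, method="Nelder-Mead").x)   # approximately the same
```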


Inverse problems with optimal mass transport priors Generalized Sinkhorn iterations

We want to compute the proximal operator of T_ε(f0, ·), given by

    Prox_{T_ε(f0,·)}^σ(h) = argmin_{f1}  T_ε(f0, f1) + (1/(2σ)) ‖f1 − h‖₂².

Thus we want to solve

    min_{M ≥ 0, f1}  trace(C^T M) + ε D(M) + (1/(2σ)) ‖f1 − h‖₂²
    s.t.  f0 = M 1,
          f1 = M^T 1.

Using dual coordinate ascent with g(·) = (1/(2σ)) ‖ · − h‖₂² we get the algorithm

    u0 = f0 ./ (K u1)
    u1 = exp( h/(εσ) − ω( h/(εσ) + log(K^T u0) − log(εσ) ) ),

which can be compared with the standard Sinkhorn iterations

    u0 = f0 ./ (K u1)
    u1 = f1 ./ (K^T u0).

Here ω denotes the (elementwise) Wright omega function, i.e., x = log(ω(x)) + ω(x). The update for u1 is solved elementwise; the bottleneck is still the computation of K u1 and K^T u0.

Theorem
The algorithm is globally convergent, with a linear convergence rate.
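A minimal NumPy/SciPy sketch of the generalized Sinkhorn iteration above, using scipy.special.wrightomega for the elementwise u1-update. The sign convention inside ω follows the derivation above (it solves h − εσ log(u1) = u1 · (K^T u0) elementwise); the test densities and the values of ε and σ are arbitrary choices for the example, and no stopping criterion is included.

```python
import numpy as np
from scipy.special import wrightomega

def prox_transport(C, f0, h, eps, sigma, n_iter=500):
    """Generalized Sinkhorn sketch for the proximal operator of the entropy-regularized
    transport cost, i.e. argmin_{f1} T_eps(f0, f1) + ||f1 - h||_2^2 / (2*sigma)."""
    K = np.exp(-C / eps)
    u1 = np.ones(C.shape[1])
    a = h / (eps * sigma)
    for _ in range(n_iter):
        u0 = f0 / (K @ u1)
        # Elementwise solve of  h - eps*sigma*log(u1) = u1 * (K^T u0)  via the Wright omega function.
        u1 = np.exp(a - np.real(wrightomega(a + np.log(K.T @ u0) - np.log(eps * sigma))))
    return u1 * (K.T @ u0)       # f1 = M^T 1, the proximal point

# Toy check: f0 and h are two bumps; the result keeps (approximately) the mass of f0.
x = np.linspace(0, 1, 40)
f0 = np.exp(-(x - 0.3) ** 2 / 0.01); f0 /= f0.sum()
h  = np.exp(-(x - 0.6) ** 2 / 0.01); h  /= h.sum()
C = (x[:, None] - x[None, :]) ** 2
print(prox_transport(C, f0, h, eps=1e-2, sigma=10.0).sum())   # close to 1.0
```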

Example in computerized tomography

Computerized Tomography (CT): an imaging modality used in many areas, e.g., medicine. The object is probed with X-rays. Different materials attenuate X-rays differently =⇒ the incoming and outgoing intensities give information about the object.

Simplest model:

    ∫_{L_{r,θ}} f(x) dx = log(I0 / I),

where
  f(x) is the attenuation in the point x, which is what we want to reconstruct,
  L_{r,θ} is the line along which the X-ray beam travels,
  I0 and I are the incoming and outgoing intensities.

[Illustration from Wikipedia]
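A tiny numerical illustration of this model under the Beer-Lambert law (the attenuation map, the ray, and the intensities are made-up toy values): integrating the attenuation along one line reproduces the measured log(I0/I).

```python
import numpy as np

# Toy attenuation image f on a 64 x 64 grid with pixel side length dx.
f = np.zeros((64, 64)); f[24:40, 16:48] = 0.02   # a block of attenuating material
dx = 1.0

# Beer-Lambert law along one horizontal ray (row 32): I = I0 * exp(-line integral of f).
I0 = 1.0
line_integral = f[32, :].sum() * dx
I = I0 * np.exp(-line_integral)

# The measurement log(I0 / I) recovers the line integral of the attenuation along the ray.
print(line_integral, np.log(I0 / I))   # both 0.64
```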


Example in computerized tomography

Parallel beam 2D CT example:
  Reconstruction space: 256 × 256 pixels
  Angles: 30 in [π/4, 3π/4] (limited angle)
  Detector partition: uniform, 350 bins
  Noise level: 5%

TV-regularization and ℓ₂² prior:

    min_{f1}  γ ‖f0 − f1‖₂² + ‖∇f1‖₁
    s.t.  ‖A f1 − w‖₂ ≤ κ.

TV-regularization and optimal transport prior:

    min_{f1}  γ T_ε(f0, f1) + ‖∇f1‖₁
    s.t.  ‖A f1 − w‖₂ ≤ κ.

[Figures: (a) Shepp-Logan phantom, (b) prior, and reconstructions with (c) TV-regularization, (d) TV-regularization and ℓ₂² prior (γ = 10), (e) TV-regularization and optimal transport prior (γ = 4).]

Example in computerized tomography

Comparing different regularization parameters for the problem with the ℓ₂² prior:

    min_{f1}  γ ‖f0 − f1‖₂² + ‖∇f1‖₁
    s.t.  ‖A f1 − w‖₂ ≤ κ.

[Figure: reconstructions using the ℓ₂² prior with different regularization parameters γ = 1, 10, 100, 1000, 10 000.]


Example in computerized tomography

Parallel beam 2D CT example:
  Reconstruction space: 256 × 256 pixels
  Angles: 30 in [0, π]
  Detector partition: uniform, 350 bins
  Noise level: 3%

[Figures: phantom, prior, and reconstructions with TV-regularization, TV-regularization and ℓ₂² prior (γ = 10), and TV-regularization and optimal transport prior (γ = 4).]

Conclusions and further work

Conclusions:
  Optimal mass transport - a viable framework for imaging applications.
  Generalized Sinkhorn iterations for computing the proximal operator of the optimal transport cost.
  Variable splitting for solving the inverse problem.
  Application to CT reconstruction using optimal transport priors.

Potential future directions:
  Application to spatiotemporal image reconstruction:

      arg min_{f0,...,fT ∈ X}  Σ_{j=0}^{T} [ G(A(fj), gj) + λ F(fj) ] + γ Σ_{j=1}^{T} H(f_{j−1}, fj)

  More efficient ways of solving the dual problem?
  Learning for inverse problems using optimal transport as a loss function.

23 / 24 Thank you for your attention!

Questions?
