I fell a little bit behind on reporting on my new preprints. In this posts I’ll blog on two closely related ones; one of them already a bit old, the other one quite recent:

The papers are

As clear from the titles, both papers treat a similar method. The first paper contains all the theory and the second one has few particularly interesting applications.

In the first paper we propose to view several known algorithms such as the linearized Bregman method, the Kaczmarz method or the Landweber method from a different angle from which they all are special cases of another algorithm. To start with, consider a linear system

\displaystyle Ax=b

with {A\in{\mathbb R}^{m\times n}}. A fairly simple and old method to solve this, is the Landweber iteration which is

\displaystyle x^{k+1} = x^k - t_k A^T(Ax^k-b).

Obviously, this is nothing else than a gradient descent for the functional {\|Ax-b\|_2^2} and indeed converges to a minimizer of this functional (i.e. a least squares solution) if the stepsizes {t_k} fulfill {\epsilon\leq t_k\leq 2\|A\|^{-2} - \epsilon} for some {\epsilon>0}. If one initializes the method with {x^0=0} it converges to the least squares solution with minimal norm, i.e. to {A^\dag b} (with the pseudo-inverse {A^\dag}).

A totally different method is even older: The Kaczmarz method. Denoting by {a_k} the {k}-th row of {A} and {b_k} the {k}-th entry of {b} the method reads as

\displaystyle x^{k+1} = x^k - a_{r(k)}^T\frac{a_{r(k)}\cdot x^k - b_k}{\|a_{r(k)}\|_2^2}

where {r(k) = (k\mod m) +1} or any other “control sequence” that picks up every index infinitely often. This method also has a simple interpretation: Each equation {a_k\cdot x = b_k} describes a hyperplane in {{\mathbb R}^n}. The method does nothing else than projecting the iterates orthogonally onto the hyperplanes in an iterative manner. In the case that the system has a solution, the method converges to one, and if it is initialized with {x^0=0} we have again convergence to the minimum norm solution {A^\dag b}.

There is yet another method that solves {Ax=b} (but now it’s a bit more recent): The iteration produces two sequences of iterates

\displaystyle \begin{array}{rcl} z^{k+1} & = &z^k - t_k A^T(Ax^k - b)\\ x^{k+1} & = &S_\lambda(z^{k+1}) \end{array}

for some {\lambda>0}, the soft-thresholding function {S_\lambda(x) = \max(|x|-\lambda,0)\mathrm{sgn}(x)} and some stepsize {t_k}. For reasons I will not detail here, this is called the linearized Bregman method. It also converges to a solution of the system. The method is remarkably similar, but different from, the Landweber iteration (if the soft-thresholding function wouldn’t be there, both would be the same). It converges to the solution of {Ax=b} that has the minimum value for the functional {J(x) = \lambda\|x\|_1 + \tfrac12\|x\|_2^2}. Since this solution of close, and for {\lambda} large enough identical, to the minimum {\|\cdot\|_1} solution, the linearized Bregman method is a method for sparse reconstruction and applied in compressed sensing.

Now we put all three methods in a joint framework, and this is the framework of split feasibility problems (SFP). An SFP is a special case of a convex feasibility problems where one wants to find a point {x} in the intersection of multiple simple convex sets. In an SFP one has two different kinds of convex constraints (which I will call “simple” and “difficult” in the following):

  1. Constraints that just demand that {x\in C_i} for some convex sets {C_i}. I call these constraints “simple” because we assume that the projection onto each {C_i} is simple to obtain.
  2. Constraints that demand {A_ix\in Q_i} for some matrices {A_i} and simple convex sets {Q_i}. Although we assume that projections onto the {Q_i} are easy, these constraints are “difficult” because of the presence of the matrices {A_i}.

If there were only simple constraints a very basic method to solve the problem is the methods of alternating projections, also known as POCS (projection onto convex sets): Simply project onto all the sets {C_i} in an iterative manner. For difficult constraints, one can do the following: Construct a hyperplane {H_k} that separates the current iterate {x^k} from the set defined by the constraint {Ax\in Q} and project onto the hyperplane. Since projections onto hyperplanes are simple and since the hyperplane separates we move closer to the constraint set and this is a reasonable step to take. One such separating hyperplane is given as follows: For {x^k} compute {w^k = Ax^k-P_Q(Ax^k)} (with the orthogonal projection {P_Q}) and define

\displaystyle H_k = \{x\ : (A^Tw^k)^T\cdot x \leq (A^Tw^k)^T\cdot x^k - \|w^k\|_2^2\}.

Illustration of projections onto convex sets and separating hyperplanes

Illustration of projections onto convex sets and separating hyperplanes

Now we already can unite the Landweber iteration and the Kaczmarz method as follows: Consider the system {Ax=b} as a split feasibility problem in two different ways:

  1. Treat {Ax=b} as one single difficult constraint (i.e. set {Q=\{b\}}). Some calculations show that the above proposed method leads to the Landweber iteration (with a special stepsize).
  2. Treat {Ax=b} as {m} simple constraints {a_i\cdot x = b_i}. Again, some calculations show that this gives the Kaczmarz method.

Of course, one could also work “block-wise” and consider groups of equations as difficult constraints to obtain “block-Kaczmarz methods”.

Now comes the last twist: By adapting the term of “projection” one gets more methods. Particularly interesting is the notion of Bregman projections which comes from Bregman distances. I will not go into detail here, but Bregman distances are associated to convex functionals {J} and by replacing “projection onto {C_i} or hyperplanes” by respective Bregman projections, one gets another method for split feasibility problems. The two things I found remarkable:

  • The Bregman projection onto hyperplanes is pretty simple. To project some {x^k} onto the hyperplane {H = \{x\ :\ a^T\cdot x\leq \beta\}}, one needs a subgradient {z^k\in\partial J(x^k)} (in fact an “admissible one” but for that detail see the paper) and then performs

    \displaystyle x^{k+1} = \nabla J^*(z^k - t_k a)

    ({J^*} is the convex dual of {J}) with some appropriate stepsize {t_k} (which is the solution of a one-dimensional convex minimization problem). Moreover, {z^{k+1} = z^k - t_k a} is a new admissible subgradient at {x^{k+1}}.

  • If one has a problem with a constraint {Ax=b} (formulated as an SFP in one way or another) the method converges to the minimum-{J} solution of the equation if {J} is strongly convex.

Note that strong convexity of {J} implies differentiability of {J^*} and Lipschitz continuity of {\nabla J} and hence, the Bregman projection can indeed be carried out.

Now one already sees how this relates to the linearized Bregman method: Setting {J(x) = \lambda\|x\|_1 + \tfrac12\|x\|_2^2}, a little calculation shows that

\displaystyle \nabla J^*(z) = S_\lambda(z).

Hence, using the formulation with a “single difficult constraint” leads to the linearized Bregman method with a specific stepsize. It turns out that this stepsize is a pretty good one but also that one can show that a constant stepsize also works as long as it is positive and smaller that {2\|A\|^{-2}}.

In the paper we present several examples how one can use the framework. I see one strengths of this approach that one can add convex constraints to a given problem without getting into any trouble with the algorithmic framework.

The second paper extends a remark that we make in the first one: If one applies the framework of the linearized Bregman method to the case in which one considers the system {Ax=b} as {m} simple (hyperplane-)constraints one obtains a sparse Kaczmarz solver. Indeed one can use the simple iteration

\displaystyle \begin{array}{rcl} z^{k+1} & = &z^k - a_{r(k)}^T\frac{a_{r(k)}\cdot x^k - b_k}{\|a_{r(k)}\|_2^2}\\ x^{k+1} & = &S_\lambda(z^{k+1}) \end{array}

and will converge to the same sparse solution as the linearized Bregman method.

This method has a nice application to “online compressed sensing”: We illustrate this in the paper with an example from radio interferometry. There, large arrays of radio telescopes collect radio emissions from the sky. Each pair of telescopes lead to a single measurement of the Fourier transform of the quantity of interest. Hence, for {k} telescopes, each measurement gives {k(k-1)/2} samples in the Fourier domain. In our example we used data from the Very Large Array telescope which has 27 telescopes leading to 351 Fourier samples. That’s not much, if one want a picture of the emission with several ten thousands of pixels. But the good thing is that the Earth rotates (that’s good for several reasons): When the Earth rotates relative to the sky, the sampling pattern also rotates. Hence, one waits a small amount of time and makes another measurement. Commonly, this is done until the earth has made a half rotation, i.e. one complete measurement takes 12 hours. With the “online compressed sensing” framework we proposed, one can start reconstructing the image as soon the first measurements have arrived. Interestingly, one observes the following behavior: If one monitors the residual of the equation, it goes down during iterations and jumps up when new measurements arrive. But from some point on, the residual stays small! This says that the new measurements do not contradict the previous ones and more interestingly this happened precisely when the reconstruction error dropped down such that “exact reconstruction” in the sense of compressed sensing has happened. In the example of radio interferometry, this happened after 2.5 hours!

Reconstruction by online compressed sensing

Reconstruction by online compressed sensing

You can find slides of a talk I gave at the Sparse Tomo Days here.

Today I’d like to collect some comments one a few papers I stumbled upon recently on the arXiv.

1. TGV minimizers in 1D

First, about a month ago two very similar paper appeared in the same week:

Both papers treat the recently proposed “total generalized variation” model (which is a somehow-but-not-really-higher-order generalization of total variation). The total variation of a function {u\in L^1(\Omega)} ({\Omega\subset{\mathbb R}^d}) is defined by duality

\displaystyle  TV(u) = \sup\Big\{\int_\Omega \mathrm{div} \phi\, u\,dx\ :\ \phi\in C^\infty_c(\Omega,{\mathbb R}^d), |\phi|\leq 1\Big\}.

(Note that the demanded high regularity of the test functions {\phi} is not essential here, as we take a supremum over all these functions under the only, but important, requirement that the functions are bounded. Test functions from {C^1_c(\Omega,{\mathbb R}^d)} would also do.)

Several possibilities for extensions and generalization of the total variation exist by somehow including higher order derivatives. The “total generalized variation” is a particular successful approach which reads as (now using two non-negative parameter {\alpha,\beta} which do a weighting):

\displaystyle  TGV_{\beta,\alpha}^2(u) = \sup\Big\{\int_\Omega \mathrm{div}^2 \phi\, u\,dx\ :\ \phi\in C^\infty_c(\Omega,S^{d\times d}),\ |\phi|\leq \beta,\ |\mathrm{div}\phi|\leq \alpha\Big\}.

To clarify some notation: {S^{d\times d}} are the symmetric {d\times d} matrices, {\mathrm{div}^n} is the negative adjoint of {\nabla^n} which is the differential operator that collects all partial derivatives up to the {n}-th order in a {d\times\cdots\times d}-tensor. Moreover {|\phi|} is some matrix norm (e.g. the Frobenius norm) and {|\mathrm{div}\phi|} is some vector norm (e.g. the 2-norm).

Both papers investigate so called denoising problems with TGV penalty and {L^2} discrepancy, i.e. minimization problems

\displaystyle  \min_u \frac12\int_\Omega(u-u^0)^2\, dx + TGV_{\alpha,\beta}^2(u)

for a given {u^0}. Moreover, both papers treat the one dimensional case and investigate very special cases in which they calculate minimizers analytically. In one dimension the definition of {TGV^2} becomes a little more familiar:

\displaystyle  TGV_{\beta,\alpha}^2(u) = \sup\Big\{\int_\Omega \phi''\, u\,dx\ :\ \phi\in C^\infty_c(\Omega,{\mathbb R}),\ |\phi|\leq \beta,\ |\phi'|\leq \alpha\Big\}.

Some images of both papar are really similar: This one from Papafitsoros and Bredies


and this one from Pöschl and Scherzer

Although both paper have a very similar scopes it is worth to read both. The calculations are tedious but both paper try to make them accessible and try hard (and did a good job) to provide helpful illustrations. Curiously, the earlier paper cites the later one but not conversely…

2. Generalized conditional gradient methods

Another paper I found very interesting was

This paper shows a nice duality which I haven’t been aware of, namely the one between the subgradient descent methods and conditional gradient methods. In fact the conditional gradient method which is treated is a generalization of the conditional gradient method which Kristian and I also proposed a while ago in the context of {\ell^1}-minimization in the paper Iterated hard shrinkage for minimization problems with sparsity constraints: To minimize the sum

\displaystyle  F(u) + \Phi(u)

with a differentiable {F} and a convex {\Phi} for which the subgradient of {\Phi} is easily invertible (or, put differently, for which you can minimize {\langle u,a\rangle + \Phi(u)} easily), perform the following iteration:

  1. At iterate {u^n} linearize {F} but not {\Phi} and calculate a new point {v^n} by

    \displaystyle  v^n = \mathrm{argmin}_v \langle F'(u^n),v\rangle + \Phi(v)

  2. Choose a stepsize {s^n\in [0,1]} and set the next iterate as a convex combination of {u^n} and {v^n}

    \displaystyle  u^{n+1} = u^n + s_n(v^n - u^n).

Note that for and indicator function

\displaystyle  \Phi(u) = \begin{cases} 0 & u\in C\\ \infty & \text{else} \end{cases}

you obtain the conditional gradient method (also known as Frank-Wolfe method). While Kristian and I derived convergence with an asymptotic rate for the case of {F(u) = \tfrac12\|Ku-f\|^2} and {\Phi} strongly coercive, Francis uses the formulation {F(u) = f(Au)} the assumption that the dual {f^*} of {f} has a bounded effective domain (which say that {f} has linear growth in all directions). With this assumption he obtains explicit constants and rates also for the primal-dual gap. It was great to see that eventually somebody really took the idea from the paper Iterated hard shrinkage for minimization problems with sparsity constraints (and does not think that we do heuristics for {\ell^0} minimization…).

I found this draft of a post in my backlog and decided that it would be better to finish it than to leave it unpublished. Actually, the story happened already over a year ago.

Some month ago I stumbled upon this piece of shadow art

by Fred Eerdekens and a few days later I received the information that my university was going to celebrate his yearly “open house event” this year as “TU Night”. The theme of the TU Night was “Night, Light, Energy” and all member were invited to submit ideas for talks, experiments and exhibits.

The piece of shadow art prompted the question “If this weired piece of metal can cast this sentence as a shadow, wouldn’t it be possible to produce another piece of metal that can produce two different sentences, when illuminated from different directions” Together with the announcement of the upcoming TU Night I thought if one could even produce an exhibit like this.

Since I am by no means an artists, I looked around at my university and found that there is a Department of Architecture. Since architects are much closer to being artist than I am, I contacted the department and proposed a collaboration and well, Henri Greil proposed to have a joint seminar on this topic. Hence, this summer term I made the experience and worked with students of architecture.

In the end, the student produced very nice pieces of shadow art:




Although the exhibits produced interesting and unexpected shadows, no group of students could make it and produce two different shadows out of the same object.

However, some nice effects can be produced pretty easy:

The basic idea is that moving one object around will move around both shadows rather independently. Well this is not totally true but what you can do is to “zoom” one shadow while moving the other sideways (just move the object straight towards one light source). See this movie for a small illustration:

I also did my best to produce a more complex object. While it is theoretically not very difficult to see that some given shadows are possible in some given projection geometry, it is not at all straight forward to produce the object theoretically (not to speak of the real world problems while building the piece). It tried hard but I could not do better than this:

In this post I will explore a bit the question on how to calculate the discrete gradient and the discrete divergence of an image and a vector field, respectively.

Let {u_0\in{\mathbb R}^{N\times M}} be a discrete image, i.e. {u_0(i,j)} denotes the gray value at the {i,j}-th pixel. The famous total variation denoising amounts to minimizing the functional

\displaystyle \Phi(u) = \tfrac12 \int (u-u_0)^2 + \lambda\int|\nabla u|

where the integral shall be understood as summation, {\nabla} stand for the gradient, i.e. the vector of the partial derivatives, and {|\cdot|} stands for the euclidean absolute value.

When using primal-dual methods, it is of crucial importance, that the used operators for the gradient and the divergence are numerically adjoint (up to the minus sign). That is, the numerical operations are adjoint in the following sense: If grad is the operation for the gradient and div is the operation for the divergence, then it should hold for any variables {u} and {v} of suitable size and with gradu = grad(u), divv = div(v) that sum(gradu(:).*v(:)) and -sum(u(:).*divv(:)) are equal up to numerical precision. Due to the boundary treatment, the internal MATLAB operations gradient and divergence do not fulfill this requirement.

The most common discretization of the gradient uses discrete forward differences and a constant padding at the boundary (which means that Neumann boundary values are applied). In formula, this reads as

\displaystyle (D_xu)_{i,j} = \begin{cases} u_{i+1,j} - u_{i,j} & i<N\\ 0 & i=N \end{cases} \qquad (D_yu)_{i,j} = \begin{cases} u_{i,j+1} - u_{i,j} & j<M\\ 0 & j=M \end{cases}

The respective adjoints are backward differences with zero boundary treatment (check it!). Apparently, there are many different ways to implement this routines (and the respective adjoints) in MATLAB. Here are four of them:

  1. For loops: Run through all pixels in two for-loops and assign the difference to the output. Of course, one should preallocate the output prior to the loop. But you may probably know the first two rules of MATLAB coding? If not, here they are: 1. Avoid for-loops. 2. Avoid for-loops, seriously. I put the routines into extra functions and created anonymous function to call the gradient and the divergence as
    grad = @(u) cat(3,dxp(u),dyp(u));
    div = @(V) dxm(V(:,:,1)) + dym(V(:,:,2));
  2. Shift and subtract: MATLAB is great in using vectors and matrices. And to avoid the for-loop one could also implement the forward difference in {x}-direction by shifting the matrix and subtract the original one, i.e. [u(:,2:end) u(:,end)] - u (and similarly for the other differences). Again, I wrote extra functions and used anonymous function as above.
  3. Small sparse matrices from the left and from the right: MATLAB is also pretty good with sparse matrices. Since the derivatives in {x}-direction only involve the subtraction of two elements in the same column, one can realize this by multiplying an image from the left with a sparse diagonal matrix with just two non-zero diagonals. Similarly, the derivative in {y}-direction can be realized by multiplying from the right with a suitable matrix. More precisely, this approach is realized by
    Dy = spdiags([-ones(M,1) ones(M,1)],[0 1],M,M);
    Dy(M,:) = 0;
    Dx = spdiags([-ones(N,1) ones(N,1)],[0 1],N,N);
    Dy(N,:) = 0;
    Dxu = Dx*u;
    Dyu = u*Dy';

    (check it). Note that the adjoint of {x}-derivative if simple the operation Dx'*u and the operation of the {y}-derivative is u*Dy. Together, the calculation of the gradient and the divergence was done by the anonymous functions

    grad = @(u) cat(3,u*Dx',Dy*u);
    div = @(V) V(:,:,1)*Dx + Dy'*V(:,:,2);
  4. Large sparse matrices: One could think about the following: Vectorize the image by U = u(:) (which amounts to stacking the columns above each other). Then assemble a large {NM\times NM} sparse matrix which has just two non-zero diagonals to do the forward (and other) differences. More precisely this can be done by
    (with Dx and Dy from above)

    DX = kron(Dx,speye(M));
    DY = kron(speye(N),Dy);
    DxU = DX*U;
    DyU = DY*U;

    Here, it is clear that the respective adjoint are just the multiplication with the transposed matrices. Here, the anonymous functions are

    grad = @(u) [DX*u DY*u];
    div = @(V) DX'*V(:,1) + DY'*V(:,2);

The different approaches have different pros and cons. Well, the for-loop only has cons: It is presumably slow and uses a lot indexing which easily leads to bugs. The shift-and-subtract method should go into an extra function to make it easy to use – but this is not necessarily a drawback. For the multiplication with the large matrix, one has to vectorize the image first and every time one wants to look at the result and need to do reshape(U,N,M). But let’s focus on speed: I implemented all methods, and let the run on square images of different sizes for 50 times (after a warmup) and measures the execution times with tic and toc. The assembly of the matrices did not enter the timing. Moreover, no parallel computation was used – just a single core. Finally, memory was not an issue since the larges matrices (of size {2500\times 2500\times 2}) only need roughly 100MB of memory.

Here is the table with the average times for one calculation of the gradient (in seconds):

{N} For-loop Shift-subtract left-right left
100 0.0004 0.0002 0.0001 0.0003
200 0.0018 0.0010 0.0005 0.0015
300 0.0057 0.0011 0.0014 0.0020
400 0.0096 0.0031 0.0022 0.0035
500 0.0178 0.0035 0.0030 0.0054
750 0.0449 0.0114 0.0097 0.0123
1000 0.0737 0.0189 0.0128 0.0212
1500 0.2055 0.0576 0.0379 0.0601
2000 0.3942 0.0915 0.0671 0.1136
2500 0.6719 0.1571 0.1068 0.1788

and here is the one for the divergences:

{N} For-loop Shift-subtract left-right left
100 0.0004 0.0003 0.0002 0.0002
200 0.0018 0.0015 0.0005 0.0008
300 0.0048 0.0016 0.0015 0.0012
400 0.0090 0.0036 0.0020 0.0022
500 0.0158 0.0057 0.0027 0.0035
750 0.0409 0.0132 0.0073 0.0069
1000 0.0708 0.0238 0.0130 0.0125
1500 0.2008 0.0654 0.0344 0.0370
2000 0.3886 0.1285 0.0622 0.0671
2500 0.6627 0.2512 0.1084 0.1361

As expected, the for-loop is clearly slower and also as expected, all methods basically scale quadratically (doubling {N} amounts to multiplying the running time by four) since the work per pixel is constant. A little surprisingly, the multiplication from the left and from the right is fastest and also consistently a little faster then multiplication from the left with the larger sparse matrix. I don’t know, why the results are that different for the gradient and for the divergence. Maybe this is related to my use on anonymous functions or the allocation of memory?

The second day of SSVM started with an invited lecture of Tony Lindeberg, who has written one very influential and very early book about scale space theory. His talk was both a tour through scale space and a recap of the recent developments in the field. Especially he show how the time aspect could be incorporated into scale space analysis by a close inspection of how receptive fiels are working. There were more talks but I only took notes from the talk of Jan Lellmann who talked about the problem of generating an elevation map from a few given level lines. One application of this could be to observe coastlines at different tides and then trying the reconstruct the full height map at the coast. One specific feature here is that the surface one looks for may have ridges which stem from kinks in the level lines and these ridges are important features of the surface. He argued that a pure convex regularization will not work and proposed to use more input namely a vector field which is derived from the contour lines such that the vector somehow “follows the ridges”, i.e. it connects to level lines in a correct way.

Finally another observation I had today: Well, this is not a trend, but a notion which I heard for the first time here but which sounds very natural is the informal classification of data terms in variational models as “weak” or “strong”. For example, a denoising data term {\|u-u^0\|^2_{L^2(\Omega)}} is a strong data term because it gives tight information on the whole set {\Omega}. On the other hand, an inpainting data term {u|_{\Omega\setminus\Omega'} = u^0|_{\Omega\setminus\Omega'}} is a weak data term because it basically tell nothing within the region {\Omega'}.

For afternoon the whole conference has been on tour to three amazing places:

  • the Riegersburg, which is not only an impressive castle but also features interesting exhibitions about old arm and witches,
  • the Zotter chocolate factory where they make amazing chocolate in mind-boggling varieties,
  • and to Schloss Kronberg for the conference dinner (although it was pretty tough to start eating the dinner after visiting Zotter…).

I am at the SSVM 13 and the first day is over. The conference is in a very cozy place in Austria, namely the Schloss Seggau which is located on a small hill near the small town Leibnitz in Styria (which is in Austria but totally unaffected by the floods). The conference is also not very large but there is a good crowd of interesting people here and the program reads interesting.

Schloss Seggau

The time schedule is tight and I don’t know if I can make it to post something everyday. Especially, I will not be able to blog about every talk.

From the first day I have the impression that two things have appeared frequently at different places:

  • Differential geometry: Tools from differential geometry like the notion of geodesics or Riemannian metrics have not only been used by Martin Rumpf to model different types of spaces of shapes but also to interpolate between textures.
  • Primal dual methods: Variational models should be convex these days and if they are not you should make them convex somehow. Then you can use primal dual methods. Somehow the magic of primal dual methods is that they are incredibly flexible. Also they work robustly and reasonably fast.

Probably, there are more trends to see in the next days.

Another few notes to myself:

This term I am particularly busy as I am writing a book about variational methods in imaging. The book will be a textbook and I am writing it parallel to a lecture I am currently teaching. (And it is planned that Kristian Bredies, the co-author, teaches the same stuff next term – then there will be another few month of editing, so it will be at least a year until publishing.)

In the book we will treat variational approaches to a variety of basic imaging problems. Of course we treat denoising and deblurring but there will also be sections about image interpolation, segmentation and optical flow. In the first part of the book, we present the variational problem and model them properly in Lebesgue and Sobolev spaces and of course in the space {BV}. Some effort goes into the analysis of the models and the first step is usually to establish existence of solutions, i.e. minimizers of the respective minimization problems. The work horse is the direct method in the calculus of variations and we mainly use the method for convex functionals in Banach spaces.

When I started the section on optical flow I noticed that I hadn’t thought about existence of minimizers before and moreover, most papers and books do not treat this issue. Let’s recall the method of Horn and Schunck to calculate the optical flow:

For two images {u_0}, {u_1} defined on a domain {\Omega\subset{\mathbb R}^2} one seeks a flow field {V:\Omega\rightarrow{\mathbb R}^2} such that {V} the describes the apparent motion that has happened between both images. Assuming that the points keep their gray value during motion (an assumption known as the brightness constancy constraint) and linearizing this assumption one arrives at the condition

\displaystyle  \frac{u_1-u_0}{dt} + V\cdot\nabla u_0 = 0

(where {dt} is the time between the images {u_0} and {u_1}). First, this does not give enough equations to determine {V} and secondly, points with {\nabla u_0=0} are problematic.

Horn and Schunck proposed to loose the constraint and to enforce some smoothness of the flow field {V}: Their model was to minimize

\displaystyle  F(u) = \int_\Omega\big(\frac{u_1-u_0}{dt} + V\cdot\nabla u_0\big)^2{\mathrm d}{x} + \lambda\int_\Omega|\nabla V|^2{\mathrm d}{x}

for some parameter {\lambda} weighting smoothness of {V} (large {\lambda}) against the brightness constancy constraint (small {\lambda}). A little bit more general one could choose exponents {p} and {q} and minimize

\displaystyle  F(u) = \int_\Omega\big|\frac{u_1-u_0}{dt} + V\cdot\nabla u_0\big|^q{\mathrm d}{x} + \lambda\int_\Omega|\nabla V|^p{\mathrm d}{x}.

To apply the direct method to obtain existence of minimizers of {F} one ensures

  1. properness, i.e. there is some {V} such that {F} is finite,
  2. convexity of {F},
  3. lower semi-continuiuty of {F} and
  4. coercivity of {F}.

To check these things one has to choose an appropriate space to work in. It seems reasonable to choose {V\in L^{q}(\Omega,{\mathbb R}^2)}. Then properness of {F} is easy (consider {V=0}, of course assuming that {u_1-u_0\in L^q(\Omega)}). Convexity is also clear and for lower semi-continuity one has to work a little more, but that is possible if, e.g., {\nabla u_0} is bounded. Coercivity is not that clear and in fact {F} is not coercive in general.

Example 1 (Non-coercivity of the Horn-and-Schunck-model) Simply consider {u_0(x,y) = ax + by} for some {a,b\in{\mathbb R}}. Then {\nabla u(x,y) \equiv [a\ b]^T}. Set {V_n(x,y) \equiv [-nb\ na]^T} and note that {\|V^n\|_q\rightarrow\infty} while {F(V^n)} stays bounded (in fact constant).

I just checked the book “Mathematical problems in Imaging” by Gilles Aubert and Pierre Kornprobst and in Section 5.3.2 they mention that the Horn and Schunck model is not coercive. They add another term to {F} which is roughly a weighted norm of {V} which ensures coercivity. However, it turns out that coercivity of {F} is true under a mild assumption of {u_0}. The idea can be found in a pretty old paper by Christoph Schnörr which is called “ Determining Optical Flow for Irregular Domains by Minimizing Quadratic Functionals of a Certain Class” (Int. J. of Comp. Vision, 6(1):25–38, 1991). His argument works for {q=2}:

Theorem 1 Let {\Omega\subset{\mathbb R}^2} be a bounded Lipschitz domain, {u_0,u_1\in L^2(\Omega)} with {\nabla u_0\in L^\infty(\Omega)} such that {\partial_x u_0} and {\partial_y u_0} are linearly independent in {L^2(\Omega)} and let {1<p<\infty}. Then it holds that {F:L^2(\Omega)\rightarrow {\mathbb R}\cup\{\infty\}} defined by

\displaystyle  F(u) = \int_\Omega\big(\frac{u_1-u_0}{dt} + V\cdot\nabla u_0\big)^2{\mathrm d}{x} + \lambda\int_\Omega|\nabla V|^2{\mathrm d}{x}

is coercive.

Proof: Now consider {V^n} such that {\|V^n\|_2\rightarrow\infty}. Now we decompose the components of {V} into the constant parts {QV^n_x} and {QV^n_y} and the “zero-mean”-part {PV^n_x = V^n_x - QV^n_x} and {PV^n_y = V^n_y - QV^n_y}. First consider that {PV^n} is unbounded, i.e. there is subsequence (also denoted by {V^n}) such that {\|PV^n\|_2\rightarrow\infty}. By Sobolev embedding and the \href{ inequality}, we get that {\int_\Omega|\nabla V^n|^p{\mathrm d}{x}\rightarrow\infty}.

Now consider bounded {PV^n} and hence, unbounded mean values {QV^n}. Using a subsequence, we assume that {QV^n\rightarrow\infty}. Now we use

\displaystyle   \Big\|\frac{u_1 - u_0}{\Delta t} + V\cdot \nabla u_0\Big\|_2 \geq \Big\|QV\cdot\nabla u_0\Big\|_2 - \Big\|\frac{u_1 - u_0}{\Delta t} + PV\cdot \nabla u_0\Big\|_2 \ \ \ \ \ (1)

and estimate the first term from below, noticing that {QV_x} and {QV_y} are constants, by

\displaystyle  \begin{array}{rcl}  \|QV\cdot\nabla u_0\|_2^2 & = &\|QV_x\,\partial_x u_0 + QV_y\,\partial_y u_0\|_2^2\\ & = & \|QV_x\,\partial_x u_0\|_2^2 + \|QV_y\,\partial_y u_0\|_2^2 + 2\langle QV_x\,\partial_x u_0,QV_y\,\partial_y u_0\rangle\\ & \geq &|QV_x|^2\|\partial_x u_0\|_2^2 + |QV_y|^2\|\partial_y u_0\|_2^2\\ &&\qquad - \|QV_x\,\partial_xu_0\|_2\|QV_y\,\partial_yu_0\|_2\,2\frac{|\langle \partial_x u_0,\partial_y u_0\rangle|}{\|\partial_xu_0\|_2\|\partial_y u_0\|_2}\\ & \geq &(|QV_x|^2\|\partial_x u_0\|_2^2 + |QV_y|^2\|\partial_y u_0\|_2^2) \Big(1 - \frac{|\langle \partial_x u_0,\partial_y u_0\rangle|}{\|\partial_xu_0\|_2\|\partial_y u_0\|_2}\Big). \end{array}

Since {\partial_x u_0} and {\partial_y u_0} are linearly independent, it holds that {1 - \frac{|\langle \partial_x u_0,\partial_y u_0\rangle|}{\|\partial_xu_0\|_2\|\partial_y u_0\|_2}>0} and we conclude that {\|QV^{n_k}\|_2\rightarrow\infty} implies that {\|QV^{n_k}\cdot\nabla u_0\|_2^2\rightarrow\infty}. Together with~(1) and boundedness of {PV^{n_k}} we obtain that {F(V^{n_k})\rightarrow\infty}. Since for every subsequence of {V^n} we get another subsequence {V^{n_k}} such that {F(V^{n_k})\rightarrow\infty}, the same conclusion holds for the whole sequence, showing coercivity of {F}. \Box

Basically the same arguments works for {TV} optical flow, i.e. coercivity of

\displaystyle  F(u) = \int_\Omega\big(\frac{u_1-u_0}{dt} + V\cdot\nabla u_0\big)^2{\mathrm d}{x} + \lambda TV(V).

However, I do not know yet what happens for {q\neq 2} and if the result on coercivity is “sharp” in the sense that linear independence of {\partial_x u_0} and {\partial_y u_0} is necessary. Also, I don’t know yet what is true in dimensions higher than {2}.

Here I continue my previous post on methods to compare shapes. In that post we started with different metrics between two probability measures {\mu_1} and {\mu_2} defined on the same set {X}, namely the Prokhorov metric and the Wasserstein metric. Then we also had a look on the Hausdorff metric which measures the distance between two compact subsets {A} and {B} of a metric space {(X,d)} and finally considered the Gromov-Hausdorff metric between two metric spaces {(X,d_X)} and {(Y,d_Y)}. Our tools have been

  • set couplings of {X} and {Y}, i.e. subsets {R\subset X\times Y} such that for all {x\in X} there {y\in Y} such that {(x,y)\in R} and for all {y\in Y} there is {x\in X} such that {(x,y)\in R},
  • metric couplings of {d_X} and {d_Y}, i.e. metrics {d} on the disjoint union of {X} and {Y} such that {d(x,x') = d_X(x,x')} if {x} and {x'} are in {X} and {d(y,y') = d_Y(y,y')} if {y} and {y'} are in {Y}. In fact, one could also work with semi-metrics {d}, i.e. they do not need to be positive definite, and
  • measure couplings of {\mu_X} and {\mu_Y}, i.e. measures {\nu} on {X\times Y} such that {\nu(A\times Y) = \mu_X(A)} and {\nu(X\times B) = \mu_Y(B)} for all {\mu_X}-/{\mu_Y}-measurable set {A} and {B}, respectively.

Now we make the next (and final) step and compare metric spaces {(X,d_X)} and {(Y,d_Y)} which are both equipped with a measure. These objects are known as metric measure spaces or mm-spaces and are formally defined as follows:

Definition 1 (mm-space) A metric measure space is a tripel {(X,d_X,\mu_X)} consisting a compact metric space {(X,d_X)} and a Borel probability measure {\mu_X} on {X}.

Note that sometimes it is included that {\mu_X} has full support (i.e., equal to {X}) in the definition of an mm-space, but it seems that not everybody does it like that.

1. Comparing mm-spaces: Gromov-Wasserstein

Our question is: How to we compare two mm-spaces {(X,d_X,\mu_X)} and {(Y,d_Y,\mu_Y)}? The plan is simply to augment the previous versions of the Gromov-Hausdorff distances defined in my previos post here and here by something which takes the measures on the respective metric spaces into account. We recall both formulations of the Gromov-Hausdorff distance: The first is

\displaystyle   d_{GH}(X,Y) = \inf_{R,d} \sup_{(x,y)\in R} d(x,y) \ \ \ \ \ (1)

where the infimum is taken over all set couplings {R} of {X} and {Y} and metric couplings {d} of {d_X} and {d_Y}, and the second is

\displaystyle   d_{GH}(X,Y) = \tfrac12\inf_R \sup_{\overset{\overset{x_{1/2}\in X}{y_{1/2}\in Y}}{(x_i,y_i)\in R}}\big| d_X(x_1,x_2) - d_Y(y_1,y_2)\big| \ \ \ \ \ (2)

where the infimum is also taken over all set couplings {R} of {X} and {Y}.

Basically, we have already seen what we should do to build a metric between mm-spaces. The idea is: if there were some “natural measure” {\mu_R} for any set coupling {R} of {X} and {Y}, then we would simply define the distance between {(X,d_X,\mu_X)} and {(Y,d_Y,\mu_Y)} as

\displaystyle  d_{GH}(X,Y) = \inf_{R,d} \Big(\int d(x,y)^pd\mu_R\Big)^{1/p}

(as a generalization of (1)) and

\displaystyle  d_{GH}(X,Y) = \tfrac12\inf_R \Big(\int_{R\times R}\big| d_X(x_1,x_2) - d_Y(y_1,y_2)\big|^p d\mu_R(x_1,y_1)d\mu_R(x_2,y_2)\Big)^{1/p}

(as a generalization of (2)). In both cases we can have {1\leq p<\infty}. Note that the obvious modification for {p=\infty} leads to something very similar to the Gromov-Hausdorff metrics.

But indeed there are “natural measures” on the set couplings of {X} and {Y}: At least for the full coupling {R= X\times Y} there are the measure couplings {\nu} of {\mu_X} and {\mu_Y}! (One does not need to consider smaller set couplings {R\subsetneq X\times Y} since this can be taken into account by the measure couplings; they do not need to have full support anyway.)

Applying this idea to the version (1) of the Gromov-Hausdorff metric we arrive at the following expression, which can be called Gromov-Wasserstein metric,

\displaystyle   d_{Gw}^1(X,Y) = \inf_{\nu,d} \Big(\int_{X\times Y} d(x,y)^pd\nu(x,y)\Big)^{1/p} \ \ \ \ \ (3)

where the infimum is taken over all measure couplings {\nu} of {\mu_X} and {\mu_Y} and all metric couplings {d} of {d_X} and {d_Y}.

Starting from the version (2) of the Gromov-Hausdorff metric we arrive at another formulation:

\displaystyle   d_{GW}^2(X,Y) = \tfrac12\inf_\nu \Big(\int_{X\times Y}\int_{X\times Y}\big| d_X(x_1,x_2) - d_Y(y_1,y_2)\big|^p d\nu(x_1,y_1)d\nu(x_2,y_2)\Big)^{1/p} \ \ \ \ \ (4)

where the infimum is taken over all measure couplings {\nu} of {\mu_X} and {\mu_Y}.

While both versions of the Gromov-Hausdorff metric for compact metric spaces where equal, the same is not true for both generalizations to mm-spaces: In his paper Memoli proves that {d_{GW}^2\leq d_{GW}^1} and gives an example (right after Remark 5.14) where strict inequality holds.

2. Comparing mm-spaces: Gromov-Prokhorov

Instead of starting from the Gromov-Hausdorff distance between metric spaces and augmenting their definition with something that takes the measures into account, we could also start from the “>Prokhorov metric between probability measures and augment the definition with something that takes the metric into account. In fact there are also two possibilities to do so: In the appendix of his paper, Memoli quotes this version (from this paper by Greven, Pfaffelhuber and Winter) of a metric between mm-spaces which we call Gromov-Prokhorov metric

\displaystyle   d_{GP}^1(X,Y) = \inf_{\nu,d}\Big\{\epsilon>0\ :\ \nu\{(x,y)\ :\ d(x,y)\geq\epsilon\}\leq\epsilon\Big\} \ \ \ \ \ (5)

where the infimum is taken over all measure couplings {\nu} of {\mu_X} and {\mu_Y} and all metric couplings {d} of {d_X} and {d_Y}.

The next version (also from the same paper by Greven, Pfaffelhuber and Winter where it was called Eurandom metric) is

\displaystyle   d_{GP}^2(X,Y) = \inf_\nu\Big\{\epsilon>0\ :\ \nu\otimes\nu\{(x_1,y_1,x_2,y_2)\ :\ |d_X(x_1,x_2) - d_Y(y_1,y_2)|\geq\epsilon\}\leq\epsilon\Big\} \ \ \ \ \ (6)

where the infimum is taken over all measure couplings {\nu} only.

3. A very simple example

The calculation of the proposed metrics by hand can be quite cumbersome. Let’s look at the simplest example.

We consider metric spaces {X,Y\subset{\mathbb R}^d} (with the euclidean metric) accompanied with the measures {\mu_X=\delta_{x_0}} and {\mu_Y = \delta_{y_0}} for some points {x_0\in X} and {y_0\in Y}. In this case the is only one measure coupling of {\mu_X} and {\mu_Y}, namely

\displaystyle  \nu = \delta_{(x_0,y_0)}.

Now it is easy to calculate the variant (4) of the Gromov-Wasserstein metric:

\displaystyle  \begin{array}{rcl}  d_{GW}^2(X,Y) &=& \tfrac12\Big(\int_{X\times Y}\int_{X\times Y}\big| |x_1-x_2| - |y_1 -y_2|\big|^p d\delta_{(x_0,y_0)}(x_1,y_1)d\delta_{(x_0,y_0)}\Big)^{1/p} \\ &=& \tfrac12\Big(\int_{X\times Y}\big| |x_0-x_2| - |y_0 -y_2|\big|^pd\delta_{(x_0,y_0)}\Big)^{1/p} \\ &=& 0. \end{array}

Let’s have a look at the variant (3): Since there is only one measure coupling, the metric is

\displaystyle  \begin{array}{rcl}  d_{GW}^1(X,Y) & = &\inf_d\Big(\int_{X\times Y} d(x,y)^pd\delta_{(x_0,y_0)}(x,y)\Big)^{1/p} \\ & = &\inf_d d(x_0,y_0). \end{array}

As we have learned in Example 4 in the previous post, we can find a metric coupling of {X} and {Y} that brings the points {x_0} in {X} and {y_0} in {Y} arbitrarily close together (by embedding both {X} and {Y} into some {{\mathbb R}^n} such that these points are only {\epsilon}-far away from each other). Hence, we see that we have

\displaystyle  d_{GW}^1(X,Y) = 0

similarly to {d_{GW}^2}.

Now let’s look at the Gromov-Prokhorov metric from (5). Again we only have one measure coupling and we get

\displaystyle  d_{GP}^1(X,Y) = \inf\Big\{\epsilon>0\ :\ \exists d\ \text{s.t}\ \delta_{(x_0,y_0)}\{(x,y)\ :\ d(x,y)\geq\epsilon\}\leq\epsilon\Big\}.

Since the measure coupling is a Dirac, we can evaluate

\displaystyle  \delta_{(x_0,y_0)}\{(x,y)\ :\ d(x,y)\geq\epsilon\} = \begin{cases} 1 & d(x_0,y_0)\geq \epsilon\\ 0 & d(x_0,y_0)<\epsilon. \end{cases}

As observed previously, there are metric couplings which bring the points {x_0} and {y_0} arbitrarily close together, and hence for any {\epsilon>0} there is a metric coupling such that {\delta_{(x_0,y_0)}\{(x,y)\ :\ d(x,y)\geq\epsilon\} = 0} which shows that

\displaystyle  d_{GP}^1(X,Y) = 0.

Finally, consider the second variant of the Gromov-Prokhorov metric from (6). Since we only have the one measure coupling {\nu = \delta_{(x_0,y_0)}} we have

\displaystyle  d_{GP}^2(X,Y) = \inf\Big\{\epsilon>0\ :\ \delta_{(x_0,y_0)}\otimes\delta_{(x_0,y_0)}\{(x_1,y_1,x_2,y_2)\ :\ \big| |x_1-x_2| - |y_1-y_2|\big|\geq\epsilon\}<\epsilon\Big\}.

Evaluating the tensored Dirac delta is easy: We have that {\delta_{(x_0,y_0)}\otimes\delta_{(x_0,y_0)}\{(x_1,y_1,x_2,y_2)\ :\ \big| |x_1-x_2| - |y_1-y_2|\big|\geq\epsilon\}} is either one or zero and it is one if and only if the point {(x_0,y_0,x_0,y_0)} is in the set that is measured. However, we have that {\big| |x_0-x_0| - |y_0-y_0|\big|} is, of course, always zero (and never larger than any {\epsilon >0}). Hence, the measure is always zero and hence, this version of the Gromov-Prokhorov distance also gives

\displaystyle  d_{GP}^2(X,Y) = 0.

Note that all four metrics can not see that the two Diracs are at different points. The reason seems to be, that one can “deform the metric space outside of the support of the measures” arbitrarily, in some sense.

It seems, that the computation of {d_{GW}^2} and {d_{GP}^2} were easier, since one needs to know all measure couplings and no metric couplings has been involved.

4. Next difficult example: Compare some mm-space to a point

Ok, we have seen that mm-spaces which carry their measure in just a single point all look alike in both versions of the Gromov-Wasserstein metric and also in both versions of the Gromov-Prokhorov metric. Let’s look at a slightly more difficult example: We consider some mm-space {(X,d_X,\mu)} and want to calculate its distance to a single point, i.e. the mm-space {Y=\{0\}} with the only possible metric and measure. This should somehow measure the “size” or “spread” of the mm-space {X}.

First, we need to know all measure couplings and all metric couplings between these spaces. The measure couplings are very easy: There is just one, namely

\displaystyle  \nu = \mu\otimes \delta

(i.e. all subsets of {X\times \{0\}} are treated as if they were subsets of {X}). Concerning metric couplings, there are a few more. We allow semi-metrics {d} on the disjoint union {X\sqcup\{0\}}: Since {d} should respect the metric {d_X} on {X} we see that all metric couplings are parametrized by the points {x_0} in {X} by identifying {0} (the element in {Y}) with this point {x_0}, i.e. all metric couplings are of the form {d_{x_0}} defined by

\displaystyle  d_{x_0}(x,y) = d_X(x,y)\ (\text{for}\ x,y\in X),\qquad d_{x_0}(x,0) = d_X(x,x_0).

(This only gives semi-metrics since we have {d_{x_0}(x_0,0)=0} although, formally {x_0\not 0}.)

Let’s calculate the first Gromov-Wasserstein metric: There is only one measure coupling and we use the parametrization of the metric couplings to deduce

\displaystyle  \begin{array}{rcl}  d_{GW}^1(X,\{0\}) &=& \inf_d\Big(\int_{X\times\{0\}} d(x,0)^pd(\mu\otimes \delta)(x,y)\Big)^{1/p}\\ &=& \inf_{x_0\in X}\Big(\int_{X} d_X(x,x_0)^pd\mu\Big)^{1/p}. \end{array}

This quantity seems to be known as “minimal {p}-th central moment” of {(X,d_X,\mu)}.

The second variant of the Gromov-Wasserstein metric is (remember, there is only one measure coupling)

\displaystyle  \begin{array}{rcl}  d_{GW}^2(X,\{0\}) &=& \tfrac12\Big(\int_{X\times\{0\}}\int_{X\times\{0\}} |d_X(x_1,x_2) - d_Y(0,0)|^p d(\mu\times\delta)(x_1,y_1) d(\mu\times\delta)(x_2,y_2) \Big)^{1/p}\\ &=& \tfrac12\Big(\int_X\int_X d_X(x_1,x_2)^p d\mu d\mu\Big)^{1/p}. \end{array}

This quantity (without the factor {1/2}) is called the “{p}-diameter” of {(X,d_X)}.

Let’s turn to the Gromov-Prokhorov metrics. The first one is (remember, that the metric couplings {d} are parametrized by the points in {X})

\displaystyle  \begin{array}{rcl}  d_{GP}^1(X,\{0\}) &=& \inf_{d}\Big\{\epsilon>0\ :\ (\mu\otimes\delta)\{(x,0)\ :\ d(x,0)\geq\epsilon\}\leq\epsilon\Big\}\\ &=& \inf_{x_0\in X}\Big\{\epsilon>0\ :\ \mu\{x\ :\ d_X(x,x_0)\geq\epsilon\}\leq\epsilon\Big\}. \end{array}

If this looks familiar, then you may have encountered the Ky-Fan metric already? The Ky-Fan metric is a metric between random variables {\xi_1} and {\xi_2} defined of the same probability space {(X,\mu)} with values in a metric space with metric {d}. It reads as

\displaystyle  d_{KF}(\xi_1,\xi_2) = \inf\Big\{\epsilon>0\ :\ \mu\{\omega\ :\ d(\xi_1(\omega),\xi_2(\omega))\geq\epsilon\}\leq\epsilon\Big\}.

Hence, the first version of the Gromov-Prokhorov metric is

\displaystyle  d_{GP}^1(X,\{0\}) = \inf_{x_0\in X}\ d_{KF}(\mathrm{id},x_0),

i.e., the minimal Ky-Fan metric between the identity mapping and the constant mappings. (In other words, it measures how far the identity is from the constant mappings in the sense of Ky Fan.)

The second variant of the Gromov-Prokhorov metric is (remember, the only measure coupling is {\nu = \mu\times \delta})

\displaystyle  \begin{array}{rcl}  d_{GP}^2(X,\{0\}) &=& \inf \Big\{\epsilon>0\ :\ \nu\otimes\nu\{(x_1,0,x_2,0)\ :\ |d_X(x_1,x_2) -0|\geq\epsilon\}\leq\epsilon\Big\}\\ &=& \inf\Big\{\epsilon>0\ :\ \mu\otimes\mu\{(x_1,x_2)\ :\ d_X(x_1,x_2)\geq\epsilon\}\leq\epsilon\Big\}. \end{array}

I do not have a neat name or a good intuition for this metric yet (although it also looks like it measures “size” or “non-localization” of {X} in some sense). If you have one, let me know!

With this post I delve into a topic which is somehow new to me, although I planned to look deeper into this for quite some time already. I stumbled upon the paper Gromov-Wasserstein distances and the metric approach to object matching by Facundo Mémoli which was a pleasure to read and motivated this post.

1. Comparing measures with norms and metrics

There are different notions in mathematics to compare two objects, think of the size of real numbers, the cardinality of sets or the length of the difference of two vectors. Here we will deal with not only comparison of objects but with “measures of similarity”. Two fundamental notions for this are norms in vector spaces and metrics. The norm is the stronger concept in that it uses more structure than a metric and also, every norm induces a metric but not the other way round. There are occasions in which both a norm and a metric are available but lead to different concepts of similarity. One of these instances occurs in sparse recovery, especially in the continuous formulation, e.g. as described in a previous post. Consider the unit interval {I = [0,1]} and two Radon measures {\mu_1} and {\mu_2} on {I} ({I} could also be an aritrary metric space). On the space of Radon measures {\mathfrak{M}(I)} there is the variation norm

\displaystyle \|\mu\|_{\mathfrak{M}}= \sup_\Pi\sum_{A\in\Pi}|\mu(A)|

where the supremum is taken over all partitions {\Pi} of {I} into a finite number of measurable sets. Moreover, there are different metrics one can put on the space of Radon measures, e.g. the Prokhorov metric which is defined for two probability measures (e.g. non-negative ones with unit total mass)

\displaystyle  \begin{array}{rcl}  d_P(\mu_1,\mu_2) & = & \inf\{\epsilon>0\ :\ \mu_1(A)\leq \mu_2(A^\epsilon) + \epsilon,\nonumber\\ & & \qquad \mu_2(A)\leq \mu_1(A^\epsilon) + \epsilon\ \text{for all measurable}\ A\} \end{array}

where {A^\epsilon} denotes the {\epsilon}-neighborhood of {A}. Another familiy of metrics are the Wasserstein metrics: For {p\geq 1} define

\displaystyle d_{W,p}(\mu_1,\mu_2) = \Big(\inf_\nu\int_{I\times I} |x-y|^p d\nu(x,y)\Big)^{1/p} \ \ \ \ \ (1)

where the infimum is taken over all measure couplings of {\mu_1} and {\mu_2}, that is, all measures {\nu} on {I\times I} such that for measurable {A} it holds that

\displaystyle \nu(A\times I) = \mu_1(A)\ \text{and}\ \nu(I\times A) = \mu_2(A).

Example 1 We compare two Dirac measures {\mu_1 = \delta_{x_1}} and {\mu_2 = \delta_{x_2}} located at distinct points {x_1\neq x_2} in {I} as seen here:

058_two_diracsThe variation norm measures their distance as

\displaystyle \|\mu_1-\mu_2\|_{\mathfrak{M}} = \sup_\Pi\sum_{A\in\Pi}|\delta_{x_1}(A) - \delta_{x_2}(A)| = 2

(choose {\Pi} such that it contains {A_1} and {A_2} small enough that {x_1\in A_1}, {x_2\in A_2} but {x_1\notin A_2} and {x_2\notin A_1}). The calculate the Prokhorov metric note that you only need to consider {A}‘s which contain only one of the points {x_{1/2}} and hence, it evaluates to

\displaystyle d_P(\mu_1,\mu_2) = |x_1-x_2|.

For the Wasserstein metric we observe that there is only one possible measure coupling of {\delta_{x_1}} and {\delta_{x_2}}, namely the measure {\nu = \delta_{(x_1,x_2)}}. Hence, we have

\displaystyle d_{W,p}(\mu_1,\mu_2) = \Big(\int_{I\times I}|x-y|^pd\delta_{(x_1,x_2)}(x,y)\Big)^{1/p} = |x_1-x_2|.

The variation norm distinguishes the two Diracs but is not able to grasp the distance of their supports. On the other hand, both metrics return the geometric distance of the supports in the underlying space {I} as distance of the Diracs. Put in pictures: The variation norm of the difference measures the size ob this object


while both metrics capture the distance of the measures like here

Two diracs

It should not stay unnoted that convergence in both the Prokhorov metric and the Wasserstein metrics is exactly the weak convergence of probability measures.

The above example provides a motivation to study metric structures on spaces, even if they are also equipped with a norm. Another reason to shift attention from normed spaces to metric spaces is the fact that there has emerged a body of work to build a theory of analysis in metric spaces (see, e.g. this answer on mathoverflow or the book Gradient Flows: In Metric Spaces And In The Space Of Probability Measures by Ambrosio, Gigli and Savaré (which puts special emphasis on the space of probability measures)). Yet another motivation for the study of metrics in this way is the problem of comparing shapes (without being precisely defined yet): Which of these shapes look most alike?


(Note that shapes need not to be two dimensional figures, you may also think of more complex objects like surfaces in three dimensions or Riemannian manifolds.)

One may also ask the question how two compare different images defined on different shapes, i.e. different “distributions of colour” on two different shapes.

2. Comparing shapes: Metric spaces

Up to now we tried to compare different measures, defined on the same set. At least to me it seems that both the Prokhorov and the Wasserstein metrics are suited to measure the similarity of measures and in fact, they do so somehow finer than the usual norm does.

Let’s try to go one step further and ask ourselves, how we could compare two measures {\mu_1} and {\mu_2} which are defined on two different sets? While thinking about an answer one need to balance several things:

  • The setup should be general enough to allow for the comparison of a wide range of objects.
  • It should include enough structure to allow meaningful statements.
  • It should lead to a measure which is easy enough to handle both analytically and computationally.

For the first and second bullet: We are going to work with measures not on arbitrary sets but on metric spaces. This will allow to measure distances between points in the sets and, as you probably know, does not pose a severe restriction. Although metric spaces are much more specific than topological spaces, we still aim at quantitative measures which are not provided by topologies. With respect to the last bullet: Note that both the Prokhorov and the Wasserstein metric are defined as infimums over fairly large and not too well structured sets (for the Prokhorov metric and need to consider all measurable sets and their {\epsilon}-neighborhoods, for the Wasserstein metric, one need to consider all measure couplings). While they can be handled quite well theoretically, their computational realization can be cumbersome.

In a similar spirit than Facundo Memoli’s paper we work our way up from comparing subsets of metric spaces up to comparing two different metric spaces with two measures defined on them.

2.1. Comparing compact subsets of a metric space: Hausdorff

Let {(X,d)} be a compact metric space. Almost hundred years ago Hausdorff introduced a metric on the family of all non-empty compact subsets of a metric space as follows: The Hausdorff metric of two compact subsets {A} and {B} of {X} is defined as

\displaystyle d_H(A,B) = \inf\{\epsilon>0 \ :\ A\subset B_\epsilon,\ B \subset A_\epsilon\}

(again, using the notion of {\epsilon}-neighborhood). This definition seems to be much in the spirit of the Prokhorov metric.

Proposition 2.1 in Facundo Memolis paper shows that the Hausdorff metric has an equivalent description as

\displaystyle d_H(A,B) = \inf_R \sup_{(a,b) \in R} d(a,b)

where the infimum is taken over all correspondences {R} of {A} and {B}, i.e., all subset {R\subset A\times B} such that for all {a\in A} there is {b\in B} such that {(a,b) \in R} and for all {b\in B} there {a\in A} such that {(a,b)\in R}. One may also say set coupling of {A} and {B} instead of correspondence.

Example 2 There is always the full coupling {R = A\times B}. Three different set couplings of two subsets {A} and {B} of the unit interval are shown here:


the “full one” {A\times B} in green and two “slim” ones in red and orange. Other “slim” couplings can be obtained from surjective mappings {f:A\rightarrow B} by {R = \{(a,f(a))\ :\ a\in A\}} (or with the roles of {A} and {B} swapped): If you couple a set {A} with itself, there is also the trivial coupling

\displaystyle R = \{(a,a)\ : \ a\in A\}

which is just the diagonal of {A\times A}

Note that the alternative definition of the Hausdorff metric is more in the spirit of the Wasserstein metric: It does not use enlarged objects (by {\epsilon}-neighborhoods) but couplings.

The Hausdorff metric is indeed a metric on the set {\mathfrak{C}(X)} of all non-empty compact subsets of a metric space {X} and if {X} itself is compact it even holds that {(\mathfrak{C}(X),d_H)} is a compact metric space (a result, known as Blaschke Selection Theorem).

One may say that we went up an abstraction ladder one step by moving from {(X,d)} to {(\mathfrak{C}(X),d_H)}.

2.2. Comparing compact metric spaces: Gromov-Hausdorff

In the previous subsection we worked within one metric space {X}. In the book “Metric Structures for Riemannian and Non-Riemannian Spaces” Misha Gromov introduced a notion to compare two different metric spaces. For compact metric space {X} and {Y} the Gromov-Hausdorff metric is defined as

\displaystyle d_{GH}(X,Y) = \inf_{Z,f,g} d_H(f(X),g(Y)) \ \ \ \ \ (2)

where the infimum is taken over

  • all metric spaces {Z} and
  • all isometric embeddings {f} and {g} which embed {X} and {Y} into {Z} respectively.

In words: To compute the Gromov-Hausdorff metric, you try embed both {X} and {Y} into a common larger space isometrically such that they are as close as possible according to the Hausdorff metric in that space.

Strictly speaking, the above definition is not well stated as one can not form an infimum over all metric spaces since this collection does not form a set according to the rules of set theory. More precisely one should write that {d_{GH}(X,Y)} is the infimum over all {r>0} such that there exists a metric space {Z} and isometric embeddings {f} and {g} of {X} and {Y}, respectively, such that {d_H(f(X),g(Y))<r}.

As the Hausdorff metric could be reformulated with set couplings there is a reformulation of the Gromov-Hausdorff metric based on metric couplings: A metric coupling of two metric spaces {(X,d_X)} and {(Y,d_Y)} is a metric {d} on the disjoint union {X\sqcup Y} of {X} and {Y} such that for all {x,x'\in X} and {y,y'\in Y} it holds that {d(x,x') = d_X(x,x')} and {d(y,y') = d_Y(y,y')}.

Example 3 We couple a metric space {(X,d)} with itself. We denote with {(X',d')} an identical copy of {(X,d)} and look for a metric {D} on {X\times X'} that respects the metrics {d} and {d'} in the way a metric coupling has to.

To distinguish elements from {X} and {X'} we put a {'} on all quantities from {X'}. Moreover, for {x\in X} we denote by {x'} its identical copy in {X'} (and similarly for {x'\in X'}, {x} is its identical twin). Then, for any {\epsilon>0} we can define {D_\epsilon(x,x') = D_\epsilon(x',x) = \epsilon} (i.e. the distance between any two identical twins is {\epsilon}. By the triangle inequality we get for {x\in X} and {y'\in X'} that {D_\epsilon(x,y')} should fulfill

\displaystyle D_\epsilon(x',y') - D_\epsilon(x',x) \leq D_\epsilon(x,y') \leq D_\epsilon(x,y) + D_\epsilon(y,y')

and hence

\displaystyle d(x,y) - \epsilon \leq D_\epsilon(x,y') \leq d(x,y) + \epsilon.

Indeed we can choose {D_\epsilon(x,y') = d(x,y)} if {x\in X} and {y'\in Y} leading to one specific metric coupling for any {\epsilon}. This couplings allow to distinguish identical twins and behave as a metric on the whole disjoint union. In the limiting case {\epsilon\rightarrow 0} we do not obtain a metric but a semi-metric or pseudo-metric which is just the same as a metric but without the assumption that {d(x,y) = 0} implies that {x=y}.

Example 4 The above example of a metric coupling of a metric space with itself was somehow “reproducing” the given metric as accurate as possible. There are also other couplings that put very different distances to points {D(x,y')} and there is also a way to visualize metric couplings: When building the disjoint union of two metric spaces {X} and {Y}, you can imagine this as isometrically embedding both in a larger metric space {Z} in a non-overlapping way and obtain the metric coupling {D} as the restriction of the metric on {Z} to {X\sqcup Y}. For {X=Y=[0,1]} you can embed both into {Z = {\mathbb R}^2}. A metric coupling which is similar (but not equal) to the coupling of the previous example is obtained by putting {X} and {Y} side by side at distance {\epsilon} as here (one space in green, the other in blue).


A quite different coupling is obtained by putting {X} and {Y} side by side, but in a reversed way as here:


You may even embed them in a more weired way as here:


but remember that the embeddings has to be isometric, hence, distortions like here are not allowed.

058_metric_coupling_embedding4This example illustrate that the idea of metric coupling is in similar spirit as of “embedding two spaces in a common larger one”.

With the notion of metric coupling, the Gromov-Hausdorff metric can be written as

\displaystyle d_{GH}(X,Y) = \inf_{R,d} \sup_{(x,y)\in R} d(x,y) \ \ \ \ \ (3)

where the infimum is taken over all set couplings {R} of {X} and {Y} and all metric couplings {d} of {(X,d_X)} and {(Y,d_Y)}.

In words: To compute the Gromov-Hausdorff metric this way, you look for a set coupling of the base sets {X} and {Y} and a metric coupling {d} of the metrics {d_X} and {d_Y} such that the maximal distance of two coupled points {x} and {y} is as small as possible. While this may look more complicated than the original definition from~(2), note that the original definition uses all metric spaces {Z} in which you can embed {X} and {Y} isometrically, which seems barely impossible to realize. Granted, the new definition also considers a lot of quantities.

Also note that this definition is in spirit of the Wasserstein metric from~(1): If there were natural measures {\mu_R} on the set couplings {R} we could write \begin{equation*} d_{GH}(X,Y) = \inf_{R,d} \Big(\int d(x,y)^pd\mu_R\Big)^{1/p} \end{equation*} and in the limit {p\rightarrow\infty} we would recover definition~(3).

Example 5 The Gromov-Hausdorff distance of a metric space {(X,d_X)} to itself is easily seen to be zero: Consider the trivial coupling {R = \{(x,x)\ :\ x\in X\}} from Example~2 and the family {D_\epsilon} of metric couplings from Example~3. Then we have {d_{GH}(X,X) \leq \epsilon} for any {\epsilon >0} showing {d_{GH}(X,X) = 0}. Let’s take one of the next-complicated examples and compute the distance of {X = [0,1]} and {Y=[0,2]}, both equipped with the euclidean metric. We couple the sets {X} and {Y} by {R = \{(x,2x)\ : \ x\in X\}} and the respective metrics by embedding {X} and {Y} into {{\mathbb R}^2} as follows: Put {Y} at the line from {(0,0)} to {(2,0)} and {X} at the line from {(\tfrac12,\epsilon)} to {(1\tfrac12,\epsilon)}:


This shows that {d_{GH}(X,Y) \leq \tfrac12} and actually, we have equality here.

There is another reformulation of the Gromov-Hausdorff metric, the equivalence of which is shown in Theorem 7.3.25 in the book “A Course in Metric Geometry” by Dmitri Burago, Yuri Burago and Sergei Ivanov:

\displaystyle d_{GH}(X,Y) = \tfrac12\inf_R \sup_{\overset{\overset{x_{1/2}\in X}{y_{1/2}\in Y}}{(x_i,y_i)\in R}}\big| d_X(x_1,x_2) - d_Y(y_1,y_2)\big| \ \ \ \ \ (4)

where the infimum is taken over all set couplings {R} of {X} and {Y}.

In words: Look for a set coupling such that any two coupled pairs {(x_1,y_1)} and {(x_2,y_2)} have the “most equal” distance.

This reformulation may have the advantage over the form (3) in that is only considers the set couplings and the given metrics {d_X} and {d_Y} and no metric coupling is needed.

Note that, as the previous reformulation~(3), it is also in the spirit of the Wasserstein metric: If there were natural measures {\mu_R} in the set couplings {R}, we could write

\displaystyle d_{GH}(X,Y) = \tfrac12\inf_R \Big(\int_{R\times R}\big| d_X(x_1,x_2) - d_Y(y_1,y_2)\big|^p d\mu_R(x_1,y_1)d\mu_R(x_2,y_2)\Big)^{1/p}.

and recover the formulation~(4) in the limit {p\rightarrow\infty}.

One may say that we went up an abstraction ladder one step further by moving from {(X,d)} to {(\mathfrak{C}(X),d_H)} to {(\text{All compact metric spaces},d_{GH})}.

Since this post has been grown pretty long already, I decided to do the next step (which is the already announced metric on metric spaces which additionally carry some measure on them – so-called metric measure spaces) in a later post.


Get every new post delivered to your Inbox.

Join 46 other followers