### Imaging

Currently I am at the SIAM Imaging conference in Hong Kong. It’s a great conference with great people at a great place. I am pretty sure that this will be the only post from here, since the conference is quite intense. I just wanted to report on two ideas that have become clear to me here; both are pretty simple and probably already widely known, but anyway:

1. Non-convex + convex objective

There are a lot of talks that deal with optimization problems of the form

$\displaystyle \min_u F(u) + G(u).$

Especially, people try to leverage as much structure of the functionals ${F}$ and ${G}$ as possible. Frequently, there arises a need to deal with non-convex parts of the objective, and indeed, there are several approaches around that deal in one way or another with non-convexity of ${F}$ or even ${F+G}$. Usually, in the presence of an ${F}$ that is not convex, it is helpful if ${G}$ has favorable properties, e.g. that still ${F+G}$ is bounded from below, coercive or even convex again. A particularly helpful property is strong convexity of ${G}$ (i.e. ${G}$ stays convex even if you subtract ${\epsilon/2\|\cdot\|^2}$ from it). Here comes the simple idea: If you already allow ${F}$ to be non-convex, but only have a ${G}$ that is merely convex, but not strongly so, you can modify your objective to

$\displaystyle \underbrace{F(u) - \tfrac\epsilon2\|u\|^2}_{\leftarrow F(u)} + \underbrace{G(u) + \tfrac\epsilon2\|u\|^2}_{\leftarrow G(u)}$

for some ${\epsilon>0}$. This will give you strong convexity of ${G}$ and an ${F}$ that is (often) theoretically no worse than it used to be. It appeared to me that this is an idea that Kristian Bredies told me already almost ten years ago and which we made into a paper (together with Peter Maaß) in 2005, which got somehow delayed and was not published until 2009.

If your problem has the form

$\displaystyle \min_u F(u) + G(Ku)$

with some linear operator ${K}$ and both ${F}$ and ${G}$ convex, it has turned out that it is tremendously helpful to consider the corresponding saddle-point formulation: using the convex conjugate ${G^*}$ of ${G}$, you write

$\displaystyle \min_u \max_v F(u) + \langle Ku, v\rangle -G^*(v).$

A class of algorithms that look like the Arrow-Hurwicz method at first glance has been sparked by the method proposed by Chambolle and Pock. This method allows ${F}$ and ${G}$ to be merely convex (no smoothness or strong convexity needed) and only needs the proximal operators of both ${F}$ and ${G^*}$. I also worked on algorithms for slightly more general problems, involving a reformulation of the saddle point problem as a monotone inclusion, with Tom Pock in the paper An accelerated forward-backward algorithm for monotone inclusions, and I should also mention this nice approach by Bredies and Sun who consider another reformulation of the monotone inclusion. However, in the spirit of the first point, one should take advantage of all the available structure in the problem, e.g. smoothness of one of the terms. Some algorithms can exploit smoothness of either ${F}$ or ${G^*}$ and only need convexity of the other term. An idea that has been used for some time already, to tackle the case in which ${F}$, say, is a sum of a smooth and a non-smooth part (and ${G^*}$ is not smooth), is to dualize the non-smooth part of ${F}$: Say we have ${F = F_1 + F_2}$ with smooth ${F_1}$; then you could write

$\displaystyle \begin{array}{rcl} &\min_u\max_v F_1(u) + F_2(u) + \langle Ku, v\rangle -G^*(v)\\ & \qquad= \min_u \max_{v,w} F_1(u) + \langle u,w\rangle + \langle Ku, v\rangle -G^*(v) - F_2^*(w) \end{array}$

and you are back in business if your method allows for sums of convex functions in the dual. The trick got the sticky name “dual transportation trick” in a talk by Marc Teboulle here, and probably that will help me not to forget it from now on…
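To make the saddle-point machinery concrete, here is a minimal Python/NumPy sketch of the Chambolle-Pock iteration for the toy problem ${\min_u \tfrac12\|u-f\|^2 + \lambda\|Ku\|_1}$, i.e. ${F(u)=\tfrac12\|u-f\|^2}$ and ${G=\lambda\|\cdot\|_1}$; the step sizes and the iteration count are my own illustrative choices:

```python
import numpy as np

def chambolle_pock(K, f, lam, n_iter=500):
    """Primal-dual iteration for min_u 0.5*||u - f||^2 + lam*||K u||_1,
    i.e. F(u) = 0.5*||u - f||^2 and G = lam*||.||_1 composed with K."""
    L = np.linalg.norm(K, 2)          # operator norm of K
    tau = sigma = 0.9 / L             # step sizes with tau*sigma*L^2 < 1
    u = np.zeros(K.shape[1])
    u_bar = u.copy()
    v = np.zeros(K.shape[0])
    for _ in range(n_iter):
        # prox of sigma*G^*: G^* is the indicator of the box [-lam, lam],
        # so the prox is just clipping
        v = np.clip(v + sigma * (K @ u_bar), -lam, lam)
        # prox of tau*F: argmin_w 0.5*||w - x||^2 + tau*0.5*||w - f||^2
        u_new = (u - tau * (K.T @ v) + tau * f) / (1 + tau)
        u_bar = 2 * u_new - u
        u = u_new
    return u
```

For ${K=I}$ the exact minimizer is the soft-thresholding of ${f}$ with threshold ${\lambda}$, which gives a handy sanity check.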

I fell a little bit behind on reporting on my new preprints. In this post I’ll blog about two closely related ones; one of them already a bit old, the other one quite recent:

The papers are

As is clear from the titles, both papers treat a similar method. The first paper contains all the theory and the second one a few particularly interesting applications.

In the first paper we propose to view several known algorithms, such as the linearized Bregman method, the Kaczmarz method or the Landweber method, from a different angle, from which they all appear as special cases of a single more general algorithm. To start with, consider a linear system

$\displaystyle Ax=b$

with ${A\in{\mathbb R}^{m\times n}}$. A fairly simple and old method to solve this, is the Landweber iteration which is

$\displaystyle x^{k+1} = x^k - t_k A^T(Ax^k-b).$

Obviously, this is nothing else than gradient descent for the functional ${\tfrac12\|Ax-b\|_2^2}$ and indeed it converges to a minimizer of this functional (i.e. a least squares solution) if the stepsizes ${t_k}$ fulfill ${\epsilon\leq t_k\leq 2\|A\|^{-2} - \epsilon}$ for some ${\epsilon>0}$. If one initializes the method with ${x^0=0}$ it converges to the least squares solution with minimal norm, i.e. to ${A^\dag b}$ (with the pseudo-inverse ${A^\dag}$).
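As a quick illustration (not from the paper), the Landweber iteration can be sketched in a few lines of Python/NumPy; the constant stepsize ${t = \|A\|^{-2}}$ is one admissible choice within the bound above:

```python
import numpy as np

def landweber(A, b, n_iter=5000):
    """Landweber iteration x^{k+1} = x^k - t * A.T @ (A @ x^k - b),
    initialized with x^0 = 0, so the limit is the minimum-norm
    least squares solution, i.e. pinv(A) @ b."""
    t = 1.0 / np.linalg.norm(A, 2) ** 2   # constant stepsize below 2/||A||^2
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        x = x - t * A.T @ (A @ x - b)
    return x
```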

A totally different method is even older: The Kaczmarz method. Denoting by ${a_k}$ the ${k}$-th row of ${A}$ and ${b_k}$ the ${k}$-th entry of ${b}$ the method reads as

$\displaystyle x^{k+1} = x^k - a_{r(k)}^T\frac{a_{r(k)}\cdot x^k - b_{r(k)}}{\|a_{r(k)}\|_2^2}$

where ${r(k) = (k\mod m) +1}$ or any other “control sequence” that picks up every index infinitely often. This method also has a simple interpretation: Each equation ${a_k\cdot x = b_k}$ describes a hyperplane in ${{\mathbb R}^n}$. The method does nothing else than projecting the iterates orthogonally onto the hyperplanes in an iterative manner. In the case that the system has a solution, the method converges to one, and if it is initialized with ${x^0=0}$ we have again convergence to the minimum norm solution ${A^\dag b}$.
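A sketch of the cyclic Kaczmarz method with the control sequence ${r(k) = (k\mod m)+1}$, again in Python/NumPy purely for illustration:

```python
import numpy as np

def kaczmarz(A, b, n_sweeps=2000):
    """Cyclic Kaczmarz method: in each step, project the iterate
    orthogonally onto the hyperplane {x : a_i . x = b_i} of one row.
    Started at 0, it converges to the minimum-norm solution of a
    consistent system."""
    m, n = A.shape
    x = np.zeros(n)
    for _ in range(n_sweeps):
        for i in range(m):          # r(k) cycles through all rows
            a = A[i]
            x = x - a * (a @ x - b[i]) / (a @ a)
    return x
```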

There is yet another method that solves ${Ax=b}$ (but now it’s a bit more recent): The iteration produces two sequences of iterates

$\displaystyle \begin{array}{rcl} z^{k+1} & = &z^k - t_k A^T(Ax^k - b)\\ x^{k+1} & = &S_\lambda(z^{k+1}) \end{array}$

for some ${\lambda>0}$, the soft-thresholding function ${S_\lambda(x) = \max(|x|-\lambda,0)\mathrm{sgn}(x)}$ and some stepsize ${t_k}$. For reasons I will not detail here, this is called the linearized Bregman method. It also converges to a solution of the system. The method is remarkably similar to, but different from, the Landweber iteration (without the soft-thresholding function, both would be the same). It converges to the solution of ${Ax=b}$ that has the minimum value of the functional ${J(x) = \lambda\|x\|_1 + \tfrac12\|x\|_2^2}$. Since this solution is close, and for ${\lambda}$ large enough identical, to the minimum-${\|\cdot\|_1}$ solution, the linearized Bregman method is a method for sparse reconstruction and is applied in compressed sensing.
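The linearized Bregman iteration is equally short to sketch; the constant stepsize ${t=\|A\|^{-2}}$ and the test problem below are my own illustrative choices:

```python
import numpy as np

def soft(z, lam):
    """Soft-thresholding S_lam(z) = max(|z| - lam, 0) * sign(z)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def linearized_bregman(A, b, lam, n_iter=10000):
    """Linearized Bregman iteration with constant stepsize t < 2/||A||^2;
    x^k tends to the solution of Ax = b with minimal
    J(x) = lam*||x||_1 + 0.5*||x||_2^2."""
    t = 1.0 / np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    z = np.zeros(A.shape[1])
    for _ in range(n_iter):
        z = z - t * A.T @ (A @ x - b)
        x = soft(z, lam)
    return x
```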

Now we put all three methods in a joint framework, and this is the framework of split feasibility problems (SFP). An SFP is a special case of a convex feasibility problem, where one wants to find a point ${x}$ in the intersection of multiple simple convex sets. In an SFP one has two different kinds of convex constraints (which I will call “simple” and “difficult” in the following):

1. Constraints that just demand that ${x\in C_i}$ for some convex sets ${C_i}$. I call these constraints “simple” because we assume that the projection onto each ${C_i}$ is simple to obtain.
2. Constraints that demand ${A_ix\in Q_i}$ for some matrices ${A_i}$ and simple convex sets ${Q_i}$. Although we assume that projections onto the ${Q_i}$ are easy, these constraints are “difficult” because of the presence of the matrices ${A_i}$.

If there are only simple constraints, a very basic method to solve the problem is the method of alternating projections, also known as POCS (projection onto convex sets): simply project onto all the sets ${C_i}$ in an iterative manner. For difficult constraints, one can do the following: construct a hyperplane ${H_k}$ that separates the current iterate ${x^k}$ from the set defined by the constraint ${Ax\in Q}$ and project onto that hyperplane. Since projections onto hyperplanes are simple and since the hyperplane separates, we move closer to the constraint set, so this is a reasonable step to take. One such separating hyperplane is given as follows: for ${x^k}$ compute ${w^k = Ax^k-P_Q(Ax^k)}$ (with the orthogonal projection ${P_Q}$) and define

$\displaystyle H_k = \{x\ : (A^Tw^k)^T\cdot x \leq (A^Tw^k)^T\cdot x^k - \|w^k\|_2^2\}.$

Illustration of projections onto convex sets and separating hyperplanes
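For illustration, one such projection step can be sketched as follows (assuming ${A^Tw^k\neq 0}$); proj_Q is whatever simple projection the set ${Q}$ admits:

```python
import numpy as np

def project_separating_hyperplane(x, A, proj_Q):
    """One step for a 'difficult' constraint Ax in Q: build the separating
    halfspace H_k from w^k = A x - P_Q(A x) and project x onto it."""
    w = A @ x - proj_Q(A @ x)
    if not np.any(w):                 # A x already lies in Q
        return x
    c = A.T @ w                       # normal vector of H_k (assumed nonzero)
    delta = c @ x - w @ w             # H_k = {y : c . y <= delta}
    # orthogonal projection onto the halfspace H_k
    return x - max(0.0, c @ x - delta) / (c @ c) * c
```

For the single constraint ${Q=\{b\}}$ one has ${w^k = Ax^k-b}$ and the step strictly decreases the residual of ${Ax=b}$ whenever the iterate is infeasible.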

Now we already can unite the Landweber iteration and the Kaczmarz method as follows: Consider the system ${Ax=b}$ as a split feasibility problem in two different ways:

1. Treat ${Ax=b}$ as one single difficult constraint (i.e. set ${Q=\{b\}}$). Some calculations show that the above proposed method leads to the Landweber iteration (with a special stepsize).
2. Treat ${Ax=b}$ as ${m}$ simple constraints ${a_i\cdot x = b_i}$. Again, some calculations show that this gives the Kaczmarz method.

Of course, one could also work “block-wise” and consider groups of equations as difficult constraints to obtain “block-Kaczmarz methods”.

Now comes the last twist: By adapting the notion of “projection” one gets even more methods. Particularly interesting is the notion of Bregman projections which comes with Bregman distances. I will not go into detail here, but Bregman distances are associated with convex functionals ${J}$, and by replacing the projections onto the ${C_i}$ or onto hyperplanes by the respective Bregman projections, one gets another method for split feasibility problems. Two things I found remarkable:

• The Bregman projection onto hyperplanes is pretty simple. To project some ${x^k}$ onto the hyperplane ${H = \{x\ :\ a^T\cdot x\leq \beta\}}$, one needs a subgradient ${z^k\in\partial J(x^k)}$ (in fact an “admissible one” but for that detail see the paper) and then performs

$\displaystyle x^{k+1} = \nabla J^*(z^k - t_k a)$

(${J^*}$ is the convex dual of ${J}$) with some appropriate stepsize ${t_k}$ (which is the solution of a one-dimensional convex minimization problem). Moreover, ${z^{k+1} = z^k - t_k a}$ is a new admissible subgradient at ${x^{k+1}}$.

• If one has a problem with a constraint ${Ax=b}$ (formulated as an SFP in one way or another) the method converges to the minimum-${J}$ solution of the equation if ${J}$ is strongly convex.

Note that strong convexity of ${J}$ implies differentiability of ${J^*}$ and Lipschitz continuity of ${\nabla J^*}$ and hence, the Bregman projection can indeed be carried out.

Now one already sees how this relates to the linearized Bregman method: Setting ${J(x) = \lambda\|x\|_1 + \tfrac12\|x\|_2^2}$, a little calculation shows that

$\displaystyle \nabla J^*(z) = S_\lambda(z).$

Hence, using the formulation with a “single difficult constraint” leads to the linearized Bregman method with a specific stepsize. It turns out that this stepsize is a pretty good one, but also that a constant stepsize works as long as it is positive and smaller than ${2\|A\|^{-2}}$.

In the paper we present several examples of how one can use the framework. I see one strength of this approach in the fact that one can add convex constraints to a given problem without getting into any trouble with the algorithmic framework.

The second paper extends a remark that we make in the first one: If one applies the framework of the linearized Bregman method to the case in which one considers the system ${Ax=b}$ as ${m}$ simple (hyperplane-)constraints one obtains a sparse Kaczmarz solver. Indeed one can use the simple iteration

$\displaystyle \begin{array}{rcl} z^{k+1} & = &z^k - a_{r(k)}^T\frac{a_{r(k)}\cdot x^k - b_{r(k)}}{\|a_{r(k)}\|_2^2}\\ x^{k+1} & = &S_\lambda(z^{k+1}) \end{array}$

and will converge to the same sparse solution as the linearized Bregman method.
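A sketch of this sparse Kaczmarz iteration in Python/NumPy (cyclic control sequence; the consistent test system is my own illustrative choice):

```python
import numpy as np

def soft(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def sparse_kaczmarz(A, b, lam, n_sweeps=5000):
    """Row-wise linearized Bregman iteration ('sparse Kaczmarz'):
    the z-update is a Kaczmarz step, followed by soft-thresholding."""
    m, n = A.shape
    x = np.zeros(n)
    z = np.zeros(n)
    for _ in range(n_sweeps):
        for i in range(m):            # cyclic control sequence
            a = A[i]
            z = z - a * (a @ x - b[i]) / (a @ a)
            x = soft(z, lam)
    return x
```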

This method has a nice application to “online compressed sensing”: We illustrate this in the paper with an example from radio interferometry. There, large arrays of radio telescopes collect radio emissions from the sky. Each pair of telescopes leads to a single measurement of the Fourier transform of the quantity of interest. Hence, ${k}$ telescopes give ${k(k-1)/2}$ samples in the Fourier domain per measurement. In our example we used data from the Very Large Array telescope, which has 27 telescopes, leading to 351 Fourier samples. That’s not much if one wants a picture of the emission with several tens of thousands of pixels. But the good thing is that the Earth rotates (that’s good for several reasons): When the Earth rotates relative to the sky, the sampling pattern also rotates. Hence, one waits a small amount of time and makes another measurement. Commonly, this is done until the Earth has made half a rotation, i.e. one complete measurement takes 12 hours. With the “online compressed sensing” framework we propose, one can start reconstructing the image as soon as the first measurements have arrived. Interestingly, one observes the following behavior: If one monitors the residual of the equation, it goes down during the iterations and jumps up when new measurements arrive. But from some point on, the residual stays small! This says that the new measurements do not contradict the previous ones, and, more interestingly, this happened precisely when the reconstruction error had dropped so far that “exact reconstruction” in the sense of compressed sensing had happened. In the example of radio interferometry, this happened after 2.5 hours!

Reconstruction by online compressed sensing

You can find slides of a talk I gave at the Sparse Tomo Days here.

Today I’d like to collect some comments on a few papers I stumbled upon recently on the arXiv.

1. TGV minimizers in 1D

First, about a month ago, two very similar papers appeared in the same week:

Both papers treat the recently proposed “total generalized variation” model (which is a somehow-but-not-really-higher-order generalization of total variation). The total variation of a function ${u\in L^1(\Omega)}$ (${\Omega\subset{\mathbb R}^d}$) is defined by duality

$\displaystyle TV(u) = \sup\Big\{\int_\Omega \mathrm{div} \phi\, u\,dx\ :\ \phi\in C^\infty_c(\Omega,{\mathbb R}^d), |\phi|\leq 1\Big\}.$

(Note that the demanded high regularity of the test functions ${\phi}$ is not essential here, as we take a supremum over all these functions under the only, but important, requirement that the functions are bounded. Test functions from ${C^1_c(\Omega,{\mathbb R}^d)}$ would also do.)

Several possibilities to extend and generalize the total variation exist by somehow including higher order derivatives. The “total generalized variation” is a particularly successful approach which reads as (now using two non-negative parameters ${\alpha,\beta}$ which do a weighting):

$\displaystyle TGV_{\beta,\alpha}^2(u) = \sup\Big\{\int_\Omega \mathrm{div}^2 \phi\, u\,dx\ :\ \phi\in C^\infty_c(\Omega,S^{d\times d}),\ |\phi|\leq \beta,\ |\mathrm{div}\phi|\leq \alpha\Big\}.$

To clarify some notation: ${S^{d\times d}}$ are the symmetric ${d\times d}$ matrices, ${\mathrm{div}^n}$ is the negative adjoint of ${\nabla^n}$ which is the differential operator that collects all partial derivatives up to the ${n}$-th order in a ${d\times\cdots\times d}$-tensor. Moreover ${|\phi|}$ is some matrix norm (e.g. the Frobenius norm) and ${|\mathrm{div}\phi|}$ is some vector norm (e.g. the 2-norm).

Both papers investigate so-called denoising problems with TGV penalty and ${L^2}$ discrepancy, i.e. minimization problems

$\displaystyle \min_u \frac12\int_\Omega(u-u^0)^2\, dx + TGV_{\beta,\alpha}^2(u)$

for a given ${u^0}$. Moreover, both papers treat the one dimensional case and investigate very special cases in which they calculate minimizers analytically. In one dimension the definition of ${TGV^2}$ becomes a little more familiar:

$\displaystyle TGV_{\beta,\alpha}^2(u) = \sup\Big\{\int_\Omega \phi''\, u\,dx\ :\ \phi\in C^\infty_c(\Omega,{\mathbb R}),\ |\phi|\leq \beta,\ |\phi'|\leq \alpha\Big\}.$

Some images of both papers are really similar: this one from Papafitsoros and Bredies

and this one from Pöschl and Scherzer

Although both papers have very similar scopes, it is worth reading both. The calculations are tedious, but both papers try to make them accessible and try hard (and did a good job) to provide helpful illustrations. Curiously, the earlier paper cites the later one but not conversely…

Another paper I found very interesting was

This paper shows a nice duality which I hadn’t been aware of, namely the one between subgradient descent methods and conditional gradient methods. In fact the conditional gradient method which is treated there is a generalization of the conditional gradient method which Kristian and I proposed a while ago in the context of ${\ell^1}$-minimization in the paper Iterated hard shrinkage for minimization problems with sparsity constraints: To minimize the sum

$\displaystyle F(u) + \Phi(u)$

with a differentiable ${F}$ and a convex ${\Phi}$ for which the subgradient of ${\Phi}$ is easily invertible (or, put differently, for which you can minimize ${\langle u,a\rangle + \Phi(u)}$ easily), perform the following iteration:

1. At iterate ${u^n}$ linearize ${F}$ but not ${\Phi}$ and calculate a new point ${v^n}$ by

$\displaystyle v^n = \mathrm{argmin}_v \langle F'(u^n),v\rangle + \Phi(v)$

2. Choose a stepsize ${s_n\in [0,1]}$ and set the next iterate as a convex combination of ${u^n}$ and ${v^n}$:

$\displaystyle u^{n+1} = u^n + s_n(v^n - u^n).$

Note that for an indicator function

$\displaystyle \Phi(u) = \begin{cases} 0 & u\in C\\ \infty & \text{else} \end{cases}$

you obtain the conditional gradient method (also known as the Frank-Wolfe method). While Kristian and I derived convergence with an asymptotic rate for the case of ${F(u) = \tfrac12\|Ku-f\|^2}$ and strongly coercive ${\Phi}$, Francis uses the formulation ${F(u) = f(Au)}$ and the assumption that the dual ${f^*}$ of ${f}$ has a bounded effective domain (which says that ${f}$ has linear growth in all directions). With this assumption he obtains explicit constants and rates also for the primal-dual gap. It was great to see that eventually somebody really took up the idea from the paper Iterated hard shrinkage for minimization problems with sparsity constraints (and does not think that we do heuristics for ${\ell^0}$ minimization…).
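For illustration, here is a sketch of the generalized conditional gradient iteration for ${F(u)=\tfrac12\|Ku-f\|^2}$ and the strongly coercive ${\Phi(u)=\lambda\|u\|_1+\tfrac12\|u\|^2}$, for which the linearized subproblem has the closed-form solution ${v = S_\lambda(-F'(u))}$; the open-loop stepsize ${s_n = 2/(n+2)}$ is a standard choice, not necessarily the one from the papers:

```python
import numpy as np

def soft(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def gen_cond_grad(K, f, lam, n_iter=5000):
    """Generalized conditional gradient for F(u) = 0.5*||K u - f||^2 and
    Phi(u) = lam*||u||_1 + 0.5*||u||^2. The linearized subproblem
    argmin_v <F'(u), v> + Phi(v) is solved by soft-thresholding."""
    u = np.zeros(K.shape[1])
    for n in range(n_iter):
        g = K.T @ (K @ u - f)        # F'(u)
        v = soft(-g, lam)            # minimizer of <g, v> + Phi(v)
        s = 2.0 / (n + 2)            # open-loop stepsize in [0, 1]
        u = u + s * (v - u)
    return u
```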

I found this draft of a post in my backlog and decided that it would be better to finish it than to leave it unpublished. Actually, the story happened already over a year ago.

Some months ago I stumbled upon this piece of shadow art

by Fred Eerdekens, and a few days later I received the information that my university was going to celebrate its yearly “open house event” this year as “TU Night”. The theme of the TU Night was “Night, Light, Energy” and all members were invited to submit ideas for talks, experiments and exhibits.

The piece of shadow art prompted the question: “If this weird piece of metal can cast this sentence as a shadow, wouldn’t it be possible to produce another piece of metal that casts two different sentences as shadows when illuminated from different directions?” Together with the announcement of the upcoming TU Night, I wondered whether one could even produce an exhibit like this.

Since I am by no means an artist, I looked around at my university and found that there is a Department of Architecture. Since architects are much closer to being artists than I am, I contacted the department and proposed a collaboration, and Henri Greil proposed to have a joint seminar on this topic. Hence, this summer term I had the experience of working with students of architecture.

In the end, the students produced very nice pieces of shadow art:

Although the exhibits produced interesting and unexpected shadows, no group of students managed to produce two different shadows out of the same object.

However, some nice effects can be produced pretty easily:

The basic idea is that moving one object around will move both shadows around rather independently. Well, this is not totally true, but what you can do is “zoom” one shadow while moving the other sideways (just move the object straight towards one light source). See this movie for a small illustration:

I also did my best to produce a more complex object. While it is theoretically not very difficult to see that some given shadows are possible in some given projection geometry, it is not at all straightforward to construct the object theoretically (not to speak of the real-world problems while building the piece). I tried hard but I could not do better than this:

In this post I will explore a bit the question on how to calculate the discrete gradient and the discrete divergence of an image and a vector field, respectively.

Let ${u_0\in{\mathbb R}^{N\times M}}$ be a discrete image, i.e. ${u_0(i,j)}$ denotes the gray value at the ${i,j}$-th pixel. The famous total variation denoising amounts to minimizing the functional

$\displaystyle \Phi(u) = \tfrac12 \int (u-u_0)^2 + \lambda\int|\nabla u|$

where the integral shall be understood as a summation, ${\nabla}$ stands for the gradient, i.e. the vector of partial derivatives, and ${|\cdot|}$ stands for the euclidean absolute value.

When using primal-dual methods, it is of crucial importance that the operators used for the gradient and the divergence are numerically adjoint (up to the minus sign). That is, the numerical operations are adjoint in the following sense: if grad is the operation for the gradient and div is the operation for the divergence, then for any variables ${u}$ and ${v}$ of suitable size and with gradu = grad(u), divv = div(v), the quantities sum(gradu(:).*v(:)) and -sum(u(:).*divv(:)) should be equal up to numerical precision. Due to the boundary treatment, the internal MATLAB operations gradient and divergence do not fulfill this requirement.
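This adjointness requirement is easy to test numerically. Here is a sketch in Python/NumPy (the MATLAB implementations below follow the same pattern): forward differences with Neumann boundary for the gradient, backward differences with zero boundary for the negative adjoint, i.e. the divergence:

```python
import numpy as np

def grad(u):
    """Forward differences with Neumann boundary (last row/column zero)."""
    gx = np.zeros_like(u)
    gy = np.zeros_like(u)
    gx[:-1, :] = u[1:, :] - u[:-1, :]
    gy[:, :-1] = u[:, 1:] - u[:, :-1]
    return gx, gy

def div(px, py):
    """Backward differences with zero boundary: the negative adjoint of grad."""
    d = np.zeros_like(px)
    d[0, :] = px[0, :]
    d[1:-1, :] = px[1:-1, :] - px[:-2, :]
    d[-1, :] = -px[-2, :]
    d[:, 0] += py[:, 0]
    d[:, 1:-1] += py[:, 1:-1] - py[:, :-2]
    d[:, -1] += -py[:, -2]
    return d
```

With random u, px, py the inner products sum(gx*px) + sum(gy*py) and -sum(u*div(px, py)) agree up to machine precision.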

The most common discretization of the gradient uses discrete forward differences and a constant padding at the boundary (which means that Neumann boundary values are applied). In formula, this reads as

$\displaystyle (D_xu)_{i,j} = \begin{cases} u_{i+1,j} - u_{i,j} & i<N\\ 0 & i=N \end{cases} \qquad (D_yu)_{i,j} = \begin{cases} u_{i,j+1} - u_{i,j} & j<M\\ 0 & j=M. \end{cases}$

The respective adjoints are backward differences with zero boundary treatment (check it!). Apparently, there are many different ways to implement these routines (and the respective adjoints) in MATLAB. Here are four of them:

1. For loops: Run through all pixels in two for-loops and assign the difference to the output. Of course, one should preallocate the output prior to the loop. But you probably know the first two rules of MATLAB coding? If not, here they are: 1. Avoid for-loops. 2. Avoid for-loops, seriously. I put the routines into extra functions and created anonymous functions to call the gradient and the divergence as
grad = @(u) cat(3,dxp(u),dyp(u));
div = @(V) dxm(V(:,:,1)) + dym(V(:,:,2));

2. Shift and subtract: MATLAB is great in using vectors and matrices. And to avoid the for-loop one could also implement the forward difference in the ${x}$-direction by shifting the matrix and subtracting the original one, i.e. [u(2:end,:); u(end,:)] - u (and similarly for the other differences). Again, I wrote extra functions and used anonymous functions as above.
3. Small sparse matrices from the left and from the right: MATLAB is also pretty good with sparse matrices. Since the derivatives in ${x}$-direction only involve the subtraction of two elements in the same column, one can realize this by multiplying an image from the left with a sparse diagonal matrix with just two non-zero diagonals. Similarly, the derivative in ${y}$-direction can be realized by multiplying from the right with a suitable matrix. More precisely, this approach is realized by
Dx = spdiags([-ones(N,1) ones(N,1)],[0 1],N,N);
Dx(N,:) = 0;
Dy = spdiags([-ones(M,1) ones(M,1)],[0 1],M,M);
Dy(M,:) = 0;
Dxu = Dx*u;
Dyu = u*Dy';


(check it). Note that the adjoint of the ${x}$-derivative is simply the operation Dx'*u and the adjoint of the ${y}$-derivative is u*Dy (and the divergence is the negative sum of the adjoints). Together, the calculation of the gradient and the divergence was done by the anonymous functions

grad = @(u) cat(3,Dx*u,u*Dy');
div = @(V) -(Dx'*V(:,:,1) + V(:,:,2)*Dy);

4. Large sparse matrices: One could think about the following: Vectorize the image by U = u(:) (which amounts to stacking the columns above each other). Then assemble a large ${NM\times NM}$ sparse matrix which has just two non-zero diagonals to do the forward (and other) differences. More precisely this can be done by
(with Dx and Dy from above)

DX = kron(speye(M),Dx);
DY = kron(Dy,speye(N));
DxU = DX*U;
DyU = DY*U;


Here, it is clear that the respective adjoints are just multiplications with the transposed matrices. The anonymous functions are

grad = @(u) [DX*u DY*u];
div = @(V) -(DX'*V(:,1) + DY'*V(:,2));
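The kron construction is easy to get wrong (the order of the factors matters). Here is a small SciPy sketch, my own illustration, that checks the identities ${\mathrm{vec}(D_xu) = (I_M\otimes D_x)\mathrm{vec}(u)}$ and ${\mathrm{vec}(uD_y^T) = (D_y\otimes I_N)\mathrm{vec}(u)}$ for column-stacking vectorization (like u(:) in MATLAB):

```python
import numpy as np
import scipy.sparse as sp

def diff_matrix(n):
    """Forward differences with the last row zeroed (Neumann boundary)."""
    D = sp.diags([-np.ones(n), np.ones(n - 1)], [0, 1], format="lil")
    D[n - 1, :] = 0
    return sp.csr_matrix(D)

N, M = 4, 3
Dx, Dy = diff_matrix(N), diff_matrix(M)
DX = sp.kron(sp.eye(M), Dx)       # x-differences acting on vec(u)
DY = sp.kron(Dy, sp.eye(N))       # y-differences acting on vec(u)

u = np.random.default_rng(0).standard_normal((N, M))
U = u.flatten(order="F")          # column stacking, like u(:) in MATLAB
assert np.allclose(DX @ U, (Dx @ u).flatten(order="F"))
assert np.allclose(DY @ U, ((Dy @ u.T).T).flatten(order="F"))
```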


The different approaches have different pros and cons. Well, the for-loop only has cons: it is presumably slow and uses a lot of indexing which easily leads to bugs. The shift-and-subtract method should go into an extra function to make it easy to use – but this is not necessarily a drawback. For the multiplication with the large matrix, one has to vectorize the image first and, every time one wants to look at the result, one needs to do reshape(U,N,M). But let’s focus on speed: I implemented all methods, let them run on square images of different sizes for 50 times (after a warmup) and measured the execution times with tic and toc. The assembly of the matrices did not enter the timing. Moreover, no parallel computation was used – just a single core. Finally, memory was not an issue since the largest matrices (of size ${2500\times 2500\times 2}$) only need roughly 100MB of memory.

Here is the table with the average times for one calculation of the gradient (in seconds):

| ${N}$ | For-loop | Shift-subtract | Left-right (small sparse) | Left (large sparse) |
| --- | --- | --- | --- | --- |
| 100 | 0.0004 | 0.0002 | 0.0001 | 0.0003 |
| 200 | 0.0018 | 0.0010 | 0.0005 | 0.0015 |
| 300 | 0.0057 | 0.0011 | 0.0014 | 0.0020 |
| 400 | 0.0096 | 0.0031 | 0.0022 | 0.0035 |
| 500 | 0.0178 | 0.0035 | 0.0030 | 0.0054 |
| 750 | 0.0449 | 0.0114 | 0.0097 | 0.0123 |
| 1000 | 0.0737 | 0.0189 | 0.0128 | 0.0212 |
| 1500 | 0.2055 | 0.0576 | 0.0379 | 0.0601 |
| 2000 | 0.3942 | 0.0915 | 0.0671 | 0.1136 |
| 2500 | 0.6719 | 0.1571 | 0.1068 | 0.1788 |

and here is the one for the divergences:

| ${N}$ | For-loop | Shift-subtract | Left-right (small sparse) | Left (large sparse) |
| --- | --- | --- | --- | --- |
| 100 | 0.0004 | 0.0003 | 0.0002 | 0.0002 |
| 200 | 0.0018 | 0.0015 | 0.0005 | 0.0008 |
| 300 | 0.0048 | 0.0016 | 0.0015 | 0.0012 |
| 400 | 0.0090 | 0.0036 | 0.0020 | 0.0022 |
| 500 | 0.0158 | 0.0057 | 0.0027 | 0.0035 |
| 750 | 0.0409 | 0.0132 | 0.0073 | 0.0069 |
| 1000 | 0.0708 | 0.0238 | 0.0130 | 0.0125 |
| 1500 | 0.2008 | 0.0654 | 0.0344 | 0.0370 |
| 2000 | 0.3886 | 0.1285 | 0.0622 | 0.0671 |
| 2500 | 0.6627 | 0.2512 | 0.1084 | 0.1361 |

As expected, the for-loop is clearly slower, and, also as expected, all methods basically scale quadratically (doubling ${N}$ amounts to multiplying the running time by four) since the work per pixel is constant. A little surprisingly, the multiplication from the left and from the right is fastest, consistently a little faster than the multiplication from the left with the larger sparse matrix. I don’t know why the results are that different for the gradient and for the divergence. Maybe this is related to my use of anonymous functions or the allocation of memory?

The second day of SSVM started with an invited lecture by Tony Lindeberg, who has written one very influential and very early book about scale space theory. His talk was both a tour through scale space and a recap of recent developments in the field. Especially, he showed how the time aspect can be incorporated into scale space analysis by a close inspection of how receptive fields work. There were more talks but I only took notes on the talk of Jan Lellmann, who spoke about the problem of generating an elevation map from a few given level lines. One application of this could be to observe coastlines at different tides and then try to reconstruct the full height map at the coast. One specific feature here is that the surface one looks for may have ridges which stem from kinks in the level lines, and these ridges are important features of the surface. He argued that a pure convex regularization will not work and proposed to use more input, namely a vector field which is derived from the contour lines such that the vectors somehow “follow the ridges”, i.e. connect the level lines in a correct way.

Finally, another observation I had today: well, this is not a trend, but a notion which I heard for the first time here and which sounds very natural, namely the informal classification of data terms in variational models as “weak” or “strong”. For example, a denoising data term ${\|u-u^0\|^2_{L^2(\Omega)}}$ is a strong data term because it gives tight information on the whole set ${\Omega}$. On the other hand, an inpainting data term ${u|_{\Omega\setminus\Omega'} = u^0|_{\Omega\setminus\Omega'}}$ is a weak data term because it basically tells nothing within the region ${\Omega'}$.

In the afternoon the whole conference went on a tour to three amazing places:

• the Riegersburg, which is not only an impressive castle but also features interesting exhibitions about old arms and witches,
• the Zotter chocolate factory where they make amazing chocolate in mind-boggling varieties,
• and to Schloss Kronberg for the conference dinner (although it was pretty tough to start eating the dinner after visiting Zotter…).

I am at SSVM 13 and the first day is over. The conference is at a very cozy place in Austria, namely Schloss Seggau, which is located on a small hill near the small town of Leibnitz in Styria (which is in Austria but totally unaffected by the floods). The conference is also not very large, but there is a good crowd of interesting people here and the program looks interesting.

The time schedule is tight and I don’t know if I can make it to post something everyday. Especially, I will not be able to blog about every talk.

From the first day I have the impression that two things have appeared frequently at different places:

• Differential geometry: Tools from differential geometry like the notion of geodesics or Riemannian metrics have not only been used by Martin Rumpf to model different types of spaces of shapes but also to interpolate between textures.
• Primal dual methods: Variational models should be convex these days and if they are not you should make them convex somehow. Then you can use primal dual methods. Somehow the magic of primal dual methods is that they are incredibly flexible. Also they work robustly and reasonably fast.

Probably, there are more trends to see in the next days.

Another few notes to myself:

This term I am particularly busy as I am writing a book about variational methods in imaging. The book will be a textbook and I am writing it in parallel to a lecture I am currently teaching. (And it is planned that Kristian Bredies, the co-author, teaches the same material next term – then there will be another few months of editing, so it will be at least a year until publication.)

In the book we will treat variational approaches to a variety of basic imaging problems. Of course we treat denoising and deblurring, but there will also be sections about image interpolation, segmentation and optical flow. In the first part of the book, we present the variational problems and model them properly in Lebesgue and Sobolev spaces and of course in the space ${BV}$. Some effort goes into the analysis of the models, and the first step is usually to establish existence of solutions, i.e. of minimizers of the respective minimization problems. The workhorse is the direct method in the calculus of variations, and we mainly use the method for convex functionals in Banach spaces.

When I started the section on optical flow I noticed that I hadn’t thought about the existence of minimizers before and, moreover, that most papers and books do not treat this issue. Let’s recall the method of Horn and Schunck to calculate the optical flow:

For two images ${u_0}$, ${u_1}$ defined on a domain ${\Omega\subset{\mathbb R}^2}$ one seeks a flow field ${V:\Omega\rightarrow{\mathbb R}^2}$ such that ${V}$ describes the apparent motion that has happened between both images. Assuming that the points keep their gray value during motion (an assumption known as the brightness constancy constraint) and linearizing this assumption, one arrives at the condition

$\displaystyle \frac{u_1-u_0}{dt} + V\cdot\nabla u_0 = 0$

(where ${dt}$ is the time between the images ${u_0}$ and ${u_1}$). First, this does not give enough equations to determine ${V}$ and secondly, points with ${\nabla u_0=0}$ are problematic.

Horn and Schunck proposed to relax the constraint and to enforce some smoothness of the flow field ${V}$: Their model was to minimize

$\displaystyle F(V) = \int_\Omega\big(\frac{u_1-u_0}{dt} + V\cdot\nabla u_0\big)^2{\mathrm d}{x} + \lambda\int_\Omega|\nabla V|^2{\mathrm d}{x}$

for some parameter ${\lambda}$ weighting smoothness of ${V}$ (large ${\lambda}$) against the brightness constancy constraint (small ${\lambda}$). A little more generally, one could choose exponents ${p}$ and ${q}$ and minimize

$\displaystyle F(V) = \int_\Omega\big|\frac{u_1-u_0}{dt} + V\cdot\nabla u_0\big|^q{\mathrm d}{x} + \lambda\int_\Omega|\nabla V|^p{\mathrm d}{x}.$
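To make the objects concrete, here is a minimal numpy sketch that evaluates this functional for given discrete images and a flow field. The discretization (regular grid, `np.gradient` differences, grid spacing parameter `h`) is a choice of mine, not prescribed by the model:

```python
import numpy as np

def hs_energy(u0, u1, V, lam=0.1, p=2, q=2, dt=1.0, h=1.0):
    """Discrete sketch of F(V) = int |(u1-u0)/dt + V.grad(u0)|^q dx
    + lam * int |grad V|^p dx on a regular grid with spacing h."""
    u0x, u0y = np.gradient(u0, h)                 # grad u0
    data = (u1 - u0) / dt + V[0] * u0x + V[1] * u0y
    Vx_grads = np.gradient(V[0], h)               # Jacobian of V, row by row
    Vy_grads = np.gradient(V[1], h)
    grad_norm = np.sqrt(sum(g**2 for g in Vx_grads + Vy_grads))
    return (np.abs(data)**q).sum() * h**2 + lam * (grad_norm**p).sum() * h**2

# a smooth test image and the zero flow field
x, y = np.meshgrid(np.linspace(0, 1, 32), np.linspace(0, 1, 32), indexing="ij")
u0 = np.sin(3 * x) * np.cos(2 * y)
u1 = u0.copy()                                    # no motion between the frames
V0 = np.zeros((2, 32, 32))
print(hs_energy(u0, u1, V0))                      # identical images, zero flow -> 0
```

For identical images and zero flow both terms vanish, as they should; any nonzero flow that violates the linearized constraint makes the data term positive.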

To apply the direct method to obtain existence of minimizers of ${F}$ one ensures

1. properness, i.e. there is some ${V}$ such that ${F}$ is finite,
2. convexity of ${F}$,
3. lower semi-continuity of ${F}$ and
4. coercivity of ${F}$.

To check these things one has to choose an appropriate space to work in. It seems reasonable to choose ${V\in L^{q}(\Omega,{\mathbb R}^2)}$. Then properness of ${F}$ is easy (consider ${V=0}$, of course assuming that ${u_1-u_0\in L^q(\Omega)}$). Convexity is also clear and for lower semi-continuity one has to work a little more, but that is possible if, e.g., ${\nabla u_0}$ is bounded. Coercivity is not that clear and in fact ${F}$ is not coercive in general.

Example 1 (Non-coercivity of the Horn-and-Schunck model) Simply consider ${u_0(x,y) = ax + by}$ for some ${a,b\in{\mathbb R}}$. Then ${\nabla u_0(x,y) \equiv [a\ b]^T}$. Set ${V^n(x,y) \equiv [-nb\ na]^T}$ and note that ${\|V^n\|_q\rightarrow\infty}$ while ${F(V^n)}$ stays bounded (in fact constant), since ${V^n\cdot\nabla u_0 = n(-ba+ab) = 0}$ and ${\nabla V^n = 0}$.
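The example is easy to check numerically. The following sketch (grid size, the values of ${a,b,\lambda}$ and the gray-value offset between the frames are choices of mine) evaluates a discretized ${F}$ along the sequence ${V^n}$:

```python
import numpy as np

# For u0(x,y) = a*x + b*y the fields V^n = n*(-b, a) are orthogonal to
# grad u0, so the data term does not depend on n, and constant fields
# have zero gradient: F(V^n) stays constant while ||V^n|| blows up.
a, b, lam, dt = 2.0, -1.0, 0.1, 1.0
x, y = np.meshgrid(np.linspace(0, 1, 64), np.linspace(0, 1, 64), indexing="ij")
u0 = a * x + b * y
u1 = u0 + 0.3                      # some gray-value change between the frames

def F(V):
    gx, gy = np.gradient(u0)       # discrete version of (a, b)
    data = (u1 - u0) / dt + V[0] * gx + V[1] * gy
    Vgrads = np.gradient(V[0]) + np.gradient(V[1])
    return (data**2).sum() + lam * sum((g**2).sum() for g in Vgrads)

vals = []
for n in [1, 10, 100, 1000]:
    Vn = np.stack([np.full_like(u0, -n * b), np.full_like(u0, n * a)])
    vals.append(F(Vn))
print(vals)                        # all equal: F does not grow although ||V^n|| does
```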

I just checked the book “Mathematical Problems in Image Processing” by Gilles Aubert and Pierre Kornprobst and in Section 5.3.2 they mention that the Horn and Schunck model is not coercive. They add another term to ${F}$ which is roughly a weighted norm of ${V}$ which ensures coercivity. However, it turns out that coercivity of ${F}$ is true under a mild assumption on ${u_0}$. The idea can be found in a pretty old paper by Christoph Schnörr called “Determining Optical Flow for Irregular Domains by Minimizing Quadratic Functionals of a Certain Class” (Int. J. Comput. Vision, 6(1):25–38, 1991). His argument works for ${q=2}$:

Theorem 1 Let ${\Omega\subset{\mathbb R}^2}$ be a bounded Lipschitz domain, ${u_0,u_1\in L^2(\Omega)}$ with ${\nabla u_0\in L^\infty(\Omega)}$ such that ${\partial_x u_0}$ and ${\partial_y u_0}$ are linearly independent in ${L^2(\Omega)}$, and let ${1<p<\infty}$ and ${\lambda>0}$. Then it holds that ${F:L^2(\Omega,{\mathbb R}^2)\rightarrow {\mathbb R}\cup\{\infty\}}$ defined by

$\displaystyle F(V) = \int_\Omega\big(\frac{u_1-u_0}{dt} + V\cdot\nabla u_0\big)^2{\mathrm d}{x} + \lambda\int_\Omega|\nabla V|^p{\mathrm d}{x}$

is coercive.

Proof: Consider ${V^n}$ such that ${\|V^n\|_2\rightarrow\infty}$. We decompose the components of ${V^n}$ into the constant (mean-value) parts ${QV^n_x}$ and ${QV^n_y}$ and the “zero-mean” parts ${PV^n_x = V^n_x - QV^n_x}$ and ${PV^n_y = V^n_y - QV^n_y}$. First assume that ${PV^n}$ is unbounded, i.e. there is a subsequence (also denoted by ${V^n}$) such that ${\|PV^n\|_2\rightarrow\infty}$. By Sobolev embedding and the Poincaré inequality, we get that ${\int_\Omega|\nabla V^n|^p{\mathrm d}{x}\rightarrow\infty}$ and hence ${F(V^n)\rightarrow\infty}$.

Now consider bounded ${PV^n}$ and hence, unbounded mean values ${QV^n}$. Using a subsequence, we assume that ${|QV^n|\rightarrow\infty}$. Now we use

$\displaystyle \Big\|\frac{u_1 - u_0}{\Delta t} + V\cdot \nabla u_0\Big\|_2 \geq \Big\|QV\cdot\nabla u_0\Big\|_2 - \Big\|\frac{u_1 - u_0}{\Delta t} + PV\cdot \nabla u_0\Big\|_2 \ \ \ \ \ (1)$

and estimate the first term from below, noticing that ${QV_x}$ and ${QV_y}$ are constants, by

$\displaystyle \begin{array}{rcl} \|QV\cdot\nabla u_0\|_2^2 & = &\|QV_x\,\partial_x u_0 + QV_y\,\partial_y u_0\|_2^2\\ & = & \|QV_x\,\partial_x u_0\|_2^2 + \|QV_y\,\partial_y u_0\|_2^2 + 2\langle QV_x\,\partial_x u_0,QV_y\,\partial_y u_0\rangle\\ & \geq &|QV_x|^2\|\partial_x u_0\|_2^2 + |QV_y|^2\|\partial_y u_0\|_2^2\\ &&\qquad - \|QV_x\,\partial_xu_0\|_2\|QV_y\,\partial_yu_0\|_2\,2\frac{|\langle \partial_x u_0,\partial_y u_0\rangle|}{\|\partial_xu_0\|_2\|\partial_y u_0\|_2}\\ & \geq &(|QV_x|^2\|\partial_x u_0\|_2^2 + |QV_y|^2\|\partial_y u_0\|_2^2) \Big(1 - \frac{|\langle \partial_x u_0,\partial_y u_0\rangle|}{\|\partial_xu_0\|_2\|\partial_y u_0\|_2}\Big). \end{array}$

Since ${\partial_x u_0}$ and ${\partial_y u_0}$ are linearly independent, it holds that ${1 - \frac{|\langle \partial_x u_0,\partial_y u_0\rangle|}{\|\partial_xu_0\|_2\|\partial_y u_0\|_2}>0}$ and we conclude that ${\|QV^{n_k}\|_2\rightarrow\infty}$ implies that ${\|QV^{n_k}\cdot\nabla u_0\|_2^2\rightarrow\infty}$. Together with (1) and boundedness of ${PV^{n_k}}$ we obtain that ${F(V^{n_k})\rightarrow\infty}$. Since every subsequence of ${V^n}$ has a further subsequence ${V^{n_k}}$ such that ${F(V^{n_k})\rightarrow\infty}$, the same conclusion holds for the whole sequence, showing coercivity of ${F}$. $\Box$
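The crucial lower bound in the proof can be sanity-checked numerically. The following sketch (a construction of mine, with random vectors standing in for ${\partial_x u_0}$ and ${\partial_y u_0}$ and random constants for ${QV_x}$, ${QV_y}$) tests the estimate on many random samples:

```python
import numpy as np

# Check: for constants qx, qy and (almost surely linearly independent)
# f, g one has  ||qx*f + qy*g||^2 >= (qx^2||f||^2 + qy^2||g||^2)*(1 - c)
# with c = |<f,g>| / (||f|| ||g||) < 1, as in the chain of estimates above.
rng = np.random.default_rng(0)

def check(trials=1000, n=200):
    for _ in range(trials):
        f, g = rng.standard_normal(n), rng.standard_normal(n)
        qx, qy = rng.standard_normal(2)
        c = abs(f @ g) / (np.linalg.norm(f) * np.linalg.norm(g))
        lhs = np.linalg.norm(qx * f + qy * g) ** 2
        rhs = (qx**2 * (f @ f) + qy**2 * (g @ g)) * (1 - c)
        if lhs < rhs - 1e-9:
            return False
    return True

print(check())    # the lower bound holds on all random samples
```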

Basically the same argument works for ${TV}$ optical flow, i.e. coercivity of

$\displaystyle F(V) = \int_\Omega\big(\frac{u_1-u_0}{dt} + V\cdot\nabla u_0\big)^2{\mathrm d}{x} + \lambda TV(V).$

However, I do not know yet what happens for ${q\neq 2}$ and if the result on coercivity is “sharp” in the sense that linear independence of ${\partial_x u_0}$ and ${\partial_y u_0}$ is necessary. Also, I don’t know yet what is true in dimensions higher than ${2}$.

Here I continue my previous post on methods to compare shapes. In that post we started with different metrics between two probability measures ${\mu_1}$ and ${\mu_2}$ defined on the same set ${X}$, namely the Prokhorov metric and the Wasserstein metric. Then we also had a look at the Hausdorff metric which measures the distance between two compact subsets ${A}$ and ${B}$ of a metric space ${(X,d)}$ and finally considered the Gromov-Hausdorff metric between two metric spaces ${(X,d_X)}$ and ${(Y,d_Y)}$. Our tools have been

• set couplings of ${X}$ and ${Y}$, i.e. subsets ${R\subset X\times Y}$ such that for all ${x\in X}$ there is ${y\in Y}$ such that ${(x,y)\in R}$ and for all ${y\in Y}$ there is ${x\in X}$ such that ${(x,y)\in R}$,
• metric couplings of ${d_X}$ and ${d_Y}$, i.e. metrics ${d}$ on the disjoint union of ${X}$ and ${Y}$ such that ${d(x,x') = d_X(x,x')}$ if ${x}$ and ${x'}$ are in ${X}$ and ${d(y,y') = d_Y(y,y')}$ if ${y}$ and ${y'}$ are in ${Y}$. In fact, one could also work with semi-metrics ${d}$, i.e. they do not need to be positive definite, and
• measure couplings of ${\mu_X}$ and ${\mu_Y}$, i.e. measures ${\nu}$ on ${X\times Y}$ such that ${\nu(A\times Y) = \mu_X(A)}$ and ${\nu(X\times B) = \mu_Y(B)}$ for all ${\mu_X}$-/${\mu_Y}$-measurable sets ${A}$ and ${B}$, respectively.
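For finite spaces the measure couplings have a concrete matrix description: a coupling of two probability vectors is simply a nonnegative matrix with prescribed row and column sums, and the independent coupling always exists. A small numpy sketch (the weights are a toy choice of mine):

```python
import numpy as np

# A measure coupling of mu_X and mu_Y on finite spaces is a matrix
# nu >= 0 with row sums mu_X and column sums mu_Y.
mu_X = np.array([0.2, 0.5, 0.3])
mu_Y = np.array([0.6, 0.4])
nu = np.outer(mu_X, mu_Y)               # the independent coupling

def is_coupling(nu, mu_X, mu_Y, tol=1e-12):
    return bool((nu >= 0).all()
                and np.allclose(nu.sum(axis=1), mu_X, atol=tol)
                and np.allclose(nu.sum(axis=0), mu_Y, atol=tol))

print(is_coupling(nu, mu_X, mu_Y))      # True
```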

Now we make the next (and final) step and compare metric spaces ${(X,d_X)}$ and ${(Y,d_Y)}$ which are both equipped with a measure. These objects are known as metric measure spaces or mm-spaces and are formally defined as follows:

Definition 1 (mm-space) A metric measure space is a triple ${(X,d_X,\mu_X)}$ consisting of a compact metric space ${(X,d_X)}$ and a Borel probability measure ${\mu_X}$ on ${X}$.

Note that sometimes the requirement that ${\mu_X}$ has full support (i.e. ${\mathrm{supp}\,\mu_X = X}$) is included in the definition of an mm-space, but it seems that not everybody does it like that.

1. Comparing mm-spaces: Gromov-Wasserstein

Our question is: How do we compare two mm-spaces ${(X,d_X,\mu_X)}$ and ${(Y,d_Y,\mu_Y)}$? The plan is simply to augment the previous versions of the Gromov-Hausdorff distance defined in my previous posts here and here by something which takes the measures on the respective metric spaces into account. We recall both formulations of the Gromov-Hausdorff distance: The first is

$\displaystyle d_{GH}(X,Y) = \inf_{R,d} \sup_{(x,y)\in R} d(x,y) \ \ \ \ \ (1)$

where the infimum is taken over all set couplings ${R}$ of ${X}$ and ${Y}$ and metric couplings ${d}$ of ${d_X}$ and ${d_Y}$, and the second is

$\displaystyle d_{GH}(X,Y) = \tfrac12\inf_R \sup_{\overset{\overset{x_{1/2}\in X}{y_{1/2}\in Y}}{(x_i,y_i)\in R}}\big| d_X(x_1,x_2) - d_Y(y_1,y_2)\big| \ \ \ \ \ (2)$

where the infimum is also taken over all set couplings ${R}$ of ${X}$ and ${Y}$.
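For tiny finite metric spaces, formula (2) can even be evaluated by brute force over all set couplings. A Python sketch (the two-point spaces at the end are a toy example of mine):

```python
from itertools import chain, combinations

# Brute-force evaluation of formula (2): minimize over all set couplings R
# the distortion sup |dX(x1,x2) - dY(y1,y2)|, then halve.
def gh(dX, dY):
    nX, nY = len(dX), len(dY)
    pairs = [(i, j) for i in range(nX) for j in range(nY)]
    best = float("inf")
    for R in chain.from_iterable(combinations(pairs, r)
                                 for r in range(1, len(pairs) + 1)):
        # keep only set couplings: both projections must be surjective
        if {i for i, _ in R} != set(range(nX)):
            continue
        if {j for _, j in R} != set(range(nY)):
            continue
        dis = max(abs(dX[i1][i2] - dY[j1][j2])
                  for (i1, j1) in R for (i2, j2) in R)
        best = min(best, dis)
    return best / 2

dX = [[0, 1], [1, 0]]          # two points at distance 1
dY = [[0, 3], [3, 0]]          # two points at distance 3
print(gh(dX, dY))              # half the difference of the diameters: 1.0
```

The enumeration is exponential in ${|X|\,|Y|}$, so this is only feasible for a handful of points, but it makes the definition tangible.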

Basically, we have already seen what we should do to build a metric between mm-spaces. The idea is: if there were some “natural measure” ${\mu_R}$ for any set coupling ${R}$ of ${X}$ and ${Y}$, then we would simply define the distance between ${(X,d_X,\mu_X)}$ and ${(Y,d_Y,\mu_Y)}$ as

$\displaystyle d_{GH}(X,Y) = \inf_{R,d} \Big(\int d(x,y)^pd\mu_R\Big)^{1/p}$

(as a generalization of (1)) and

$\displaystyle d_{GH}(X,Y) = \tfrac12\inf_R \Big(\int_{R\times R}\big| d_X(x_1,x_2) - d_Y(y_1,y_2)\big|^p d\mu_R(x_1,y_1)d\mu_R(x_2,y_2)\Big)^{1/p}$

(as a generalization of (2)). In both cases we can have ${1\leq p<\infty}$. Note that the obvious modification for ${p=\infty}$ leads to something very similar to the Gromov-Hausdorff metrics.

But indeed there are “natural measures” on the set couplings of ${X}$ and ${Y}$: At least for the full coupling ${R= X\times Y}$ there are the measure couplings ${\nu}$ of ${\mu_X}$ and ${\mu_Y}$! (One does not need to consider smaller set couplings ${R\subsetneq X\times Y}$ since this can be taken into account by the measure couplings; they do not need to have full support anyway.)

Applying this idea to the version (1) of the Gromov-Hausdorff metric we arrive at the following expression, which can be called Gromov-Wasserstein metric,

$\displaystyle d_{GW}^1(X,Y) = \inf_{\nu,d} \Big(\int_{X\times Y} d(x,y)^pd\nu(x,y)\Big)^{1/p} \ \ \ \ \ (3)$

where the infimum is taken over all measure couplings ${\nu}$ of ${\mu_X}$ and ${\mu_Y}$ and all metric couplings ${d}$ of ${d_X}$ and ${d_Y}$.

Starting from the version (2) of the Gromov-Hausdorff metric we arrive at another formulation:

$\displaystyle d_{GW}^2(X,Y) = \tfrac12\inf_\nu \Big(\int_{X\times Y}\int_{X\times Y}\big| d_X(x_1,x_2) - d_Y(y_1,y_2)\big|^p d\nu(x_1,y_1)d\nu(x_2,y_2)\Big)^{1/p} \ \ \ \ \ (4)$

where the infimum is taken over all measure couplings ${\nu}$ of ${\mu_X}$ and ${\mu_Y}$.

While both versions of the Gromov-Hausdorff metric for compact metric spaces were equal, the same is not true for both generalizations to mm-spaces: In his paper Memoli proves that ${d_{GW}^2\leq d_{GW}^1}$ and gives an example (right after Remark 5.14) where strict inequality holds.
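For finite mm-spaces one can at least evaluate the objective of (4) for a concrete measure coupling, which gives an upper bound on ${d_{GW}^2}$. A numpy sketch (the spaces, the choice ${p=2}$ and the two couplings tried are toy choices of mine, not an exact solver):

```python
import numpy as np

def gw2_objective(dX, dY, nu, p=2):
    """Objective of (4): the double integral of |dX(x1,x2) - dY(y1,y2)|^p
    against nu (x) nu, for one fixed discrete coupling nu."""
    # diff[x1, y1, x2, y2] = |dX(x1,x2) - dY(y1,y2)|^p
    diff = np.abs(dX[:, None, :, None] - dY[None, :, None, :]) ** p
    val = np.einsum("ik,jl,ikjl->", nu, nu, diff)
    return 0.5 * val ** (1 / p)

dX = np.array([[0., 1.], [1., 0.]])
dY = np.array([[0., 1.], [1., 0.]])        # an isometric copy of X
mu = np.array([0.5, 0.5])
diag = np.diag(mu)                          # couple x_i with y_i
indep = np.outer(mu, mu)                    # independent coupling
print(gw2_objective(dX, dY, diag))          # 0: this coupling detects the isometry
print(gw2_objective(dX, dY, indep))         # > 0: a worse coupling
```

Minimizing this objective over the coupling polytope is a (nonconvex) quadratic program, which is why computing ${d_{GW}^2}$ exactly is hard even for small spaces.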

2. Comparing mm-spaces: Gromov-Prokhorov

Instead of starting from the Gromov-Hausdorff distance between metric spaces and augmenting its definition with something that takes the measures into account, we could also start from the Prokhorov metric between probability measures and augment its definition with something that takes the metric into account. In fact there are also two possibilities to do so: In the appendix of his paper, Memoli quotes this version (from this paper by Greven, Pfaffelhuber and Winter) of a metric between mm-spaces which we call the Gromov-Prokhorov metric

$\displaystyle d_{GP}^1(X,Y) = \inf_{\nu,d}\Big\{\epsilon>0\ :\ \nu\{(x,y)\ :\ d(x,y)\geq\epsilon\}\leq\epsilon\Big\} \ \ \ \ \ (5)$

where the infimum is taken over all measure couplings ${\nu}$ of ${\mu_X}$ and ${\mu_Y}$ and all metric couplings ${d}$ of ${d_X}$ and ${d_Y}$.

The next version (also from the same paper by Greven, Pfaffelhuber and Winter where it was called Eurandom metric) is

$\displaystyle d_{GP}^2(X,Y) = \inf_\nu\Big\{\epsilon>0\ :\ \nu\otimes\nu\{(x_1,y_1,x_2,y_2)\ :\ |d_X(x_1,x_2) - d_Y(y_1,y_2)|\geq\epsilon\}\leq\epsilon\Big\} \ \ \ \ \ (6)$

where the infimum is taken over all measure couplings ${\nu}$ only.
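For finite mm-spaces and one fixed measure coupling, the quantity inside (6) can be evaluated by bisection (fixing ${\nu}$ only gives an upper bound on ${d_{GP}^2}$). A numpy sketch with toy data of my choosing:

```python
import numpy as np

def eurandom_bound(dX, dY, nu, iters=60):
    """inf{eps > 0 : (nu (x) nu){|dX - dY| >= eps} <= eps} for a fixed
    discrete coupling nu, found by bisection: since the tail probability
    is nonincreasing in eps, {eps : tail(eps) <= eps} is an up-set."""
    diff = np.abs(dX[:, None, :, None] - dY[None, :, None, :])
    w = np.einsum("ik,jl->ikjl", nu, nu)        # weights of nu (x) nu
    tail = lambda eps: w[diff >= eps].sum()
    lo, hi = 0.0, diff.max() + 1.0              # tail(hi) = 0 <= hi
    for _ in range(iters):
        mid = (lo + hi) / 2
        lo, hi = (lo, mid) if tail(mid) <= mid else (mid, hi)
    return hi

dX = np.array([[0., 1.], [1., 0.]])
dY = np.array([[0., 1.], [1., 0.]])             # an isometric copy of X
mu = np.array([0.5, 0.5])
print(eurandom_bound(dX, dY, np.diag(mu)))      # essentially 0: isometric spaces
```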

3. A very simple example

The calculation of the proposed metrics by hand can be quite cumbersome. Let’s look at the simplest example.

We consider metric spaces ${X,Y\subset{\mathbb R}^d}$ (with the Euclidean metric) equipped with the measures ${\mu_X=\delta_{x_0}}$ and ${\mu_Y = \delta_{y_0}}$ for some points ${x_0\in X}$ and ${y_0\in Y}$. In this case there is only one measure coupling of ${\mu_X}$ and ${\mu_Y}$, namely

$\displaystyle \nu = \delta_{(x_0,y_0)}.$

Now it is easy to calculate the variant (4) of the Gromov-Wasserstein metric:

$\displaystyle \begin{array}{rcl} d_{GW}^2(X,Y) &=& \tfrac12\Big(\int_{X\times Y}\int_{X\times Y}\big| |x_1-x_2| - |y_1 -y_2|\big|^p d\delta_{(x_0,y_0)}(x_1,y_1)d\delta_{(x_0,y_0)}(x_2,y_2)\Big)^{1/p} \\ &=& \tfrac12\Big(\int_{X\times Y}\big| |x_0-x_2| - |y_0 -y_2|\big|^pd\delta_{(x_0,y_0)}(x_2,y_2)\Big)^{1/p} \\ &=& 0. \end{array}$

Let’s have a look at the variant (3): Since there is only one measure coupling, the metric is

$\displaystyle \begin{array}{rcl} d_{GW}^1(X,Y) & = &\inf_d\Big(\int_{X\times Y} d(x,y)^pd\delta_{(x_0,y_0)}(x,y)\Big)^{1/p} \\ & = &\inf_d d(x_0,y_0). \end{array}$

As we have learned in Example 4 in the previous post, we can find a metric coupling of ${X}$ and ${Y}$ that brings the points ${x_0}$ in ${X}$ and ${y_0}$ in ${Y}$ arbitrarily close together (by embedding both ${X}$ and ${Y}$ into some ${{\mathbb R}^n}$ such that these points are only ${\epsilon}$-far away from each other). Hence, we see that we have

$\displaystyle d_{GW}^1(X,Y) = 0$

similarly to ${d_{GW}^2}$.

Now let’s look at the Gromov-Prokhorov metric from (5). Again we only have one measure coupling and we get

$\displaystyle d_{GP}^1(X,Y) = \inf\Big\{\epsilon>0\ :\ \exists d\ \text{s.t}\ \delta_{(x_0,y_0)}\{(x,y)\ :\ d(x,y)\geq\epsilon\}\leq\epsilon\Big\}.$

Since the measure coupling is a Dirac, we can evaluate

$\displaystyle \delta_{(x_0,y_0)}\{(x,y)\ :\ d(x,y)\geq\epsilon\} = \begin{cases} 1 & d(x_0,y_0)\geq \epsilon\\ 0 & d(x_0,y_0)<\epsilon. \end{cases}$

As observed previously, there are metric couplings which bring the points ${x_0}$ and ${y_0}$ arbitrarily close together, and hence for any ${\epsilon>0}$ there is a metric coupling such that ${\delta_{(x_0,y_0)}\{(x,y)\ :\ d(x,y)\geq\epsilon\} = 0}$ which shows that

$\displaystyle d_{GP}^1(X,Y) = 0.$

Finally, consider the second variant of the Gromov-Prokhorov metric from (6). Since we only have the one measure coupling ${\nu = \delta_{(x_0,y_0)}}$ we have

$\displaystyle d_{GP}^2(X,Y) = \inf\Big\{\epsilon>0\ :\ \delta_{(x_0,y_0)}\otimes\delta_{(x_0,y_0)}\{(x_1,y_1,x_2,y_2)\ :\ \big| |x_1-x_2| - |y_1-y_2|\big|\geq\epsilon\}\leq\epsilon\Big\}.$

Evaluating the tensored Dirac delta is easy: ${\delta_{(x_0,y_0)}\otimes\delta_{(x_0,y_0)}\{(x_1,y_1,x_2,y_2)\ :\ \big| |x_1-x_2| - |y_1-y_2|\big|\geq\epsilon\}}$ is either one or zero, and it is one if and only if the point ${(x_0,y_0,x_0,y_0)}$ is in the set that is measured. However, ${\big| |x_0-x_0| - |y_0-y_0|\big|}$ is, of course, always zero (and never larger than any ${\epsilon >0}$). Hence, the measure is always zero, and this version of the Gromov-Prokhorov distance also gives

$\displaystyle d_{GP}^2(X,Y) = 0.$

Note that all four metrics cannot see that the two Diracs are at different points. The reason seems to be that one can “deform the metric space outside of the support of the measures” arbitrarily, in some sense.

It seems that the computation of ${d_{GW}^2}$ and ${d_{GP}^2}$ was easier, since one only needs to know all measure couplings and no metric couplings are involved.

4. Next difficult example: Compare some mm-space to a point

Ok, we have seen that mm-spaces which carry their measure in just a single point all look alike in both versions of the Gromov-Wasserstein metric and also in both versions of the Gromov-Prokhorov metric. Let’s look at a slightly more difficult example: We consider some mm-space ${(X,d_X,\mu)}$ and want to calculate its distance to a single point, i.e. the mm-space ${Y=\{0\}}$ with the only possible metric and measure. This should somehow measure the “size” or “spread” of the mm-space ${X}$.

First, we need to know all measure couplings and all metric couplings between these spaces. The measure couplings are very easy: There is just one, namely

$\displaystyle \nu = \mu\otimes \delta$

(i.e. all subsets of ${X\times \{0\}}$ are treated as if they were subsets of ${X}$). Concerning metric couplings, there are a few more. We allow semi-metrics ${d}$ on the disjoint union ${X\sqcup\{0\}}$: Since ${d}$ should respect the metric ${d_X}$ on ${X}$, we see that the metric couplings are parametrized by the points ${x_0}$ in ${X}$, obtained by identifying ${0}$ (the element of ${Y}$) with the point ${x_0}$, i.e. all metric couplings are of the form ${d_{x_0}}$ defined by

$\displaystyle d_{x_0}(x,y) = d_X(x,y)\ (\text{for}\ x,y\in X),\qquad d_{x_0}(x,0) = d_X(x,x_0).$

(This only gives a semi-metric since ${d_{x_0}(x_0,0)=0}$ although formally ${x_0\neq 0}$.)

Let’s calculate the first Gromov-Wasserstein metric: There is only one measure coupling and we use the parametrization of the metric couplings to deduce

$\displaystyle \begin{array}{rcl} d_{GW}^1(X,\{0\}) &=& \inf_d\Big(\int_{X\times\{0\}} d(x,0)^pd(\mu\otimes \delta)(x,y)\Big)^{1/p}\\ &=& \inf_{x_0\in X}\Big(\int_{X} d_X(x,x_0)^pd\mu\Big)^{1/p}. \end{array}$

This quantity seems to be known as “minimal ${p}$-th central moment” of ${(X,d_X,\mu)}$.

The second variant of the Gromov-Wasserstein metric is (remember, there is only one measure coupling)

$\displaystyle \begin{array}{rcl} d_{GW}^2(X,\{0\}) &=& \tfrac12\Big(\int_{X\times\{0\}}\int_{X\times\{0\}} |d_X(x_1,x_2) - d_Y(0,0)|^p d(\mu\otimes\delta)(x_1,y_1) d(\mu\otimes\delta)(x_2,y_2) \Big)^{1/p}\\ &=& \tfrac12\Big(\int_X\int_X d_X(x_1,x_2)^p d\mu d\mu\Big)^{1/p}. \end{array}$

This quantity (without the factor ${1/2}$) is called the “${p}$-diameter” of ${(X,d_X)}$.
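Both quantities are straightforward to compute for a finite mm-space given as a distance matrix with weights. A small numpy sketch (the three-point space on a line is a toy example of mine):

```python
import numpy as np

def min_central_moment(dX, mu, p=2):
    """inf over x0 of ( int d_X(x, x0)^p dmu )^(1/p), i.e. d_GW^1(X,{0})."""
    return min((dX[:, j] ** p @ mu) ** (1 / p) for j in range(len(mu)))

def p_diameter(dX, mu, p=2):
    """( int int d_X(x1, x2)^p dmu dmu )^(1/p), i.e. 2 * d_GW^2(X,{0})."""
    return (mu @ (dX ** p) @ mu) ** (1 / p)

# three points on a line at 0, 1, 2 with uniform weights
dX = np.abs(np.subtract.outer([0., 1., 2.], [0., 1., 2.]))
mu = np.full(3, 1 / 3)
print(min_central_moment(dX, mu))   # attained at the middle point: sqrt(2/3)
print(p_diameter(dX, mu))           # sqrt(4/3)
```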

Let’s turn to the Gromov-Prokhorov metrics. The first one is (remember, that the metric couplings ${d}$ are parametrized by the points in ${X}$)

$\displaystyle \begin{array}{rcl} d_{GP}^1(X,\{0\}) &=& \inf_{d}\Big\{\epsilon>0\ :\ (\mu\otimes\delta)\{(x,0)\ :\ d(x,0)\geq\epsilon\}\leq\epsilon\Big\}\\ &=& \inf_{x_0\in X}\Big\{\epsilon>0\ :\ \mu\{x\ :\ d_X(x,x_0)\geq\epsilon\}\leq\epsilon\Big\}. \end{array}$

If this looks familiar, you may have encountered the Ky-Fan metric already: it is a metric between random variables ${\xi_1}$ and ${\xi_2}$ defined on the same probability space ${(X,\mu)}$ with values in a metric space with metric ${d}$. It reads as

$\displaystyle d_{KF}(\xi_1,\xi_2) = \inf\Big\{\epsilon>0\ :\ \mu\{\omega\ :\ d(\xi_1(\omega),\xi_2(\omega))\geq\epsilon\}\leq\epsilon\Big\}.$

Hence, the first version of the Gromov-Prokhorov metric is

$\displaystyle d_{GP}^1(X,\{0\}) = \inf_{x_0\in X}\ d_{KF}(\mathrm{id},x_0),$

i.e., the minimal Ky-Fan metric between the identity mapping and the constant mappings. (In other words, it measures how far the identity is from the constant mappings in the sense of Ky Fan.)
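For a finite mm-space this minimal Ky-Fan distance can be computed directly; here is a numpy sketch (the bisection, the function names and the three-point example are my choices):

```python
import numpy as np

def ky_fan(dists, mu, iters=60):
    """inf{eps > 0 : mu{d >= eps} <= eps} for a discrete distribution of
    distances, found by bisection ({eps : tail(eps) <= eps} is an up-set)."""
    tail = lambda eps: mu[dists >= eps].sum()
    lo, hi = 0.0, dists.max() + 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        lo, hi = (lo, mid) if tail(mid) <= mid else (mid, hi)
    return hi

def dgp1_to_point(dX, mu):
    # minimize the Ky-Fan distance over the candidate points x0
    return min(ky_fan(dX[:, j], mu) for j in range(len(mu)))

# three points on a line at 0, 1, 2 with uniform weights
dX = np.abs(np.subtract.outer([0., 1., 2.], [0., 1., 2.]))
mu = np.full(3, 1 / 3)
print(dgp1_to_point(dX, mu))     # 2/3: a mass of 2/3 sits at distance >= 2/3
```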

The second variant of the Gromov-Prokhorov metric is (remember, the only measure coupling is ${\nu = \mu\otimes \delta}$)

$\displaystyle \begin{array}{rcl} d_{GP}^2(X,\{0\}) &=& \inf \Big\{\epsilon>0\ :\ \nu\otimes\nu\{(x_1,0,x_2,0)\ :\ |d_X(x_1,x_2) -0|\geq\epsilon\}\leq\epsilon\Big\}\\ &=& \inf\Big\{\epsilon>0\ :\ \mu\otimes\mu\{(x_1,x_2)\ :\ d_X(x_1,x_2)\geq\epsilon\}\leq\epsilon\Big\}. \end{array}$

I do not have a neat name or a good intuition for this metric yet (although it also looks like it measures “size” or “non-localization” of ${X}$ in some sense). If you have one, let me know!
