In sparse recovery one aims to leverage the sparsity of some vector ${u}$ to get a good reconstruction of this vector from a noisy and indirect measurement ${u^{0}}$. Sparsity can be expressed in two different ways: The analysis way and the synthesis way. For both of them you need a system of other vectors ${v_{k}}$.

• The analysis way: Here you consider the inner products ${\langle u,v_{k}\rangle}$ and the sparsity assumption is, that this sequence is sparse.
• The synthesis way: Here you assume that there is sparse sequence of coefficients ${\psi_{k}}$ such that ${u=\sum_{k} \psi_{k}v_{k}}$.

The first approach is called “analysis” since here one analyzes the vector ${u}$ with the system of vectors. In the second approach, you use the vectors ${v_{k}}$ to synthesize the vector ${u}$.

We will keep things very simple here and only consider a noisy observation (and not an indirect one). The recovery approach in the analysis way is then to find ${u}$ as a solution of

$\displaystyle \tfrac12\|u-u_{0}\| + \alpha R(\langle u,v_{k}\rangle)$

while for the synthesis way you write ${u = \sum_{k} \psi_{k}v_{k} = V\psi}$ (where we collected the vectors ${v_{k}}$ as the columns of the matrix ${V}$) and solve

$\displaystyle \tfrac12\|V\psi-u_{0}\| + \alpha R(\psi)$

The functional ${R}$ should enforce the sparsity of the vector of coefficients in the former case and the sparsity of ${\psi}$ in the latter case. With the matrix ${V}$ the analysis way reads as

$\displaystyle \tfrac12\|u-u_{0}\| + \alpha R(V^{T}u)$

One important observation is, that both approaches coincide if the matrix ${V}$ is orthonormal: Since ${V^{-1} = V^{T}}$ in this case, ${u=V\psi}$ is equivalent to ${\psi = V^{T}u}$, i.e. ${\psi_{k} = \langle u,v_{k}\rangle}$. For other sets of vectors ${v_{k}}$, this is not true and it is not so clear, how the analysis and the synthesis approach are related. In this post I’d like to show one instance where both approaches do lead to results that are not even close to being similar.

1. The “ROF analysis” approach to denoising

If ${u^{0}}$ is an image, the famous Rudin-Osher-Fatemi denoising model fits under the “analysis” framework: The vectors ${v_{k}}$ are all the finite differences of neighboring pixels such that ${V^{T}u = \nabla u}$ with the discrete gradient ${\nabla u}$. As sparsifying functional you take a mixed ${2}$${1}$ norm, i.e. ${R(\nabla u) = \|\nabla u\|_{2,1} = \sum_{ij}\sqrt{(\partial_{1}u_{ij})^{2} + (\partial_{2}u_{ij})^{2}}}$. The denoised ${u}$ is the solution to

$\displaystyle \min_{u} \tfrac12\|u-u^{0}\|^{2} + \alpha\|\nabla u\|_{2,1}$

Thus, the model will lead to some denoised ${u}$ with a sparse gradient. This sparse gradient is partly the reason for the success of the model in that it keeps edges, but is also responsible for the drawback of this approach: The denoised image will be mostly piesewise constant.

2. The “ROF synthesis” approach

As ${V^{T} = \nabla}$, the corresponding synthesis approach would be to solve

$\displaystyle \min_{\psi}\tfrac12\|\nabla^{T}\psi - u^{0}\|^{2} + \alpha \|\psi\|_{2,1}$

and the set ${u = \nabla^{T}\psi}$. In this approach, the denoised image will be a sparse linear combination of particular two-pixel images. If you think about that, it does not really sound like a good idea to approximate an image by a sparse linear combination of two-pixel images. (One technicality: The operator ${\nabla^{T}}$ is not onto – ${\nabla}$ has a non-trivial kernel, consisting of the constant images and hence, all images in the range of ${\nabla^{T}}$ have zero mean. Thus, to get something remotely close to ${u^{0}}$ one should subtract the mean of ${u^{0}}$ before the reconstruction and add it back afterwards.)

Here are some results:

Note that the results actually fit to what we would have predicted: For larger ${\alpha}$ we get a “sparser gradient” (less edges, less variation) for the analysis approach, and “sparser two-pixel combinations” for the synthesis approach. In fact, for the synthesis approach to give something that is remotely close to the image, one needs quite a lot two-pixel images, i.e. a very low sparsity.

In conclusion, one should not consider the analysis and synthesis approach to be something similar.

Incidentally, the dual of the analysis approach

$\displaystyle \min_{u} \tfrac12\|u-u^{0}\|^{2} + \alpha\|\nabla u\|_{2,1}$

can be written as

$\displaystyle \min_{\|\phi\|_{2,\infty}\leq\alpha}\tfrac12\|\nabla^{T}\phi-u^{0}\|^{2} = \min_{\phi}\tfrac12\|\nabla^{T}\phi-u^{0}\|^{2} + I_{\|\cdot\|_{2,\infty}\leq\alpha}(\phi)$

which looks related to the synthesis approach (one basically replaces the ${2,1}$-norm by its conjugate function). Actually, a similar relation between the dual of the analysis approach and the synthesis approach always holds.

Consider a convex optimization problem of the form

$\displaystyle \begin{array}{rcl} \min_{x}F(x) + G(Ax) \end{array}$

with convex ${F}$ and ${G}$ and matrix ${A}$. (We formulate everything quite loosely, skipping over details like continuity and such, as they are irrelevant for the subject matter). Optimization problems of this type have a specific type of dual problem, namely the Fenchel-Rockafellar dual, which is

$\displaystyle \begin{array}{rcl} \max_{y}-F^{*}(-A^{T}y) - G^{*}(y) \end{array}$

and under certain regularity conditions it holds that the optimal value of the dual equals the the objective value of the primal and, moreover, that a pair ${(x^{*},y^{*})}$ is both primal and dual optimal if and only if the primal dual gap is zero, i.e. if and only if

$\displaystyle \begin{array}{rcl} F(x^{*})+G(Ax^{*}) + F^{*}(-A^{T}y^{*})+G^{*}(y^{*}) = 0. \end{array}$

Hence, it is quite handy to use the primal dual gap as a stopping criteria for iterative methods to solve these problems. So, if one runs an algorithm which produces primal iterates ${x^{k}}$ and dual iterates ${y^{k}}$ one can monitor

$\displaystyle \begin{array}{rcl} \mathcal{G}(x^{k},y^{k}) = F(x^{k})+G(Ax^{k}) + F^{*}(-A^{T}y^{k})+G^{*}(y^{k}). \end{array}$

and stop if the value falls below a desired tolerance.

There is some problem with this approach which appears if the method produces infeasible iterates in the sense that one of the four terms in ${\mathcal{G}}$ is actually ${+\infty}$. This may be the case if ${F}$ or ${G}$ are not everywhere finite or, loosely speaking, have linear growth in some directions (since then the respective conjugate will not be finite everywhere). In the rest of the post, I’ll sketch a general method that can often solve this particular problem.

For the sake of simplicity, consider the following primal dual algorithm

$\displaystyle \begin{array}{rcl} x^{k+1} & = &\mathrm{prox}_{\tau F}(x^{k}-A^{T}y^{k})\\ y^{k+1} & = &\mathrm{prox}_{\sigma G^{*}}(y^{k}+\sigma A(2x^{k+1}-x^{k})) \end{array}$

(also know as primal dual hybrid gradient method or Chambolle-Pock’s algorithm). It converges as soon as ${\sigma\tau\leq \|A\|^{-2}}$.

While the structure of the algorithm ensures that ${F(x^{k})}$ and ${G^{*}(y^{k})}$ are always finite (since always ${\mathrm{prox}_{F}(x)\in\mathrm{dom}(F)}$), is may be that ${F^{*}(-A^{T}y^{k})}$ or ${G(Ax^{k})}$ are indeed infinite, rendering the primal dual gap useless.

Let us assume that the problematic term is ${F^{*}(-A^{T}y^{k})}$. Here is a way out in the case where one can deduce some a-priori bounds on ${x^{*}}$, i.e. a bounded and convex set ${C}$ with ${x^{*}\in C}$. In fact, this is often the case (e.g. one may know a-priori that there exist lower bounds ${l_{i}}$ and upper bounds ${u_{i}}$, i.e. it holds that ${l_{i}\leq x^{*}_{i}\leq u_{i}}$). Then, adding these constraints to the problem will not change the solution.

Let us see, how this changes the primal dual gap: We set ${\tilde F(x) = F(x) + I_{C}(x)}$ where ${C}$ is the set which models the bound constraints. Since ${C}$ is a bounded convex set and ${F}$ is finite on ${C}$, it is clear that

$\displaystyle \begin{array}{rcl} \tilde F^{*}(\xi) = \sup_{x\in C}\,\langle \xi,x\rangle - F(x) \end{array}$

is finite for every ${\xi}$. This leads to a finite duality gap. However, one should also adapt the prox operator. But this is also simple in the case where the constraint ${C}$ and the function ${F}$ are separable, i.e. ${C}$ encodes bound constraints as above (in other words ${C = [l_{1},u_{1}]\times\cdots\times [l_{n},u_{n}]}$) and

$\displaystyle \begin{array}{rcl} F(x) = \sum_{i} f_{i}(x_{i}). \end{array}$

Here it holds that

$\displaystyle \begin{array}{rcl} \mathrm{prox}_{\sigma \tilde F}(x)_{i} = \mathrm{prox}_{\sigma f_{i} + I_{[l_{i},u_{i}]}}(x_{i}) \end{array}$

and it is simple to see that

$\displaystyle \begin{array}{rcl} \mathrm{prox}_{\sigma f_{i} + I_{[l_{i},u_{i}]}}(x_{i}) = \mathrm{proj}_{[l_{i},u_{i}]}\mathrm{prox}_{\tau f_{i}}(x_{i}), \end{array}$

i.e., only uses the proximal operator of ${F}$ and project onto the constraints. For general ${C}$, this step may be more complicated.

One example, where this makes sense is ${L^{1}-TV}$ denoising which can be written as

$\displaystyle \begin{array}{rcl} \min_{u}\|u-u^{0}\|_{1} + \lambda TV(u). \end{array}$

Here we have

$\displaystyle \begin{array}{rcl} F(u) = \|u-u^{0}\|_{1},\quad A = \nabla,\quad G(\phi) = I_{|\phi_{ij}|\leq 1}(\phi). \end{array}$

The guy that causes problems here is ${F^{*}}$ which is an indicator functional and indeed ${A^{T}\phi^{k}}$ will usually be dual infeasible. But since ${u}$ is an image with a know range of gray values one can simple add the constraints ${0\leq u\leq 1}$ to the problem and obtains a finite dual while still keeping a simple proximal operator. It is quite instructive to compute ${\tilde F}$ in this case.

Yesterday I uploaded the paper An extended Perona-Malik model based on probabilistic models by Lars Mescheder and myself to the arXiv. Recently I already blogged about this work, so I do not have to add that much. The main theme of the work was that if we have an image that is a blurred and noisy version of some true image, we formulate the reconstruction via Bayesian statistics. As a prior model for images we used a Gaussian scale mixture, i.e. we have a latent variable $z$ (in every pixel $x$) and the joint prior for the image $u$ and the latent variable $z$ is

$p(u,z) \propto \exp\Big(-\sum_x \frac{z(x)}2 |\nabla u(x)|^2 + v(z(x))\Big)$

where $x$ denotes the pixels, $\nabla$ is the discrete gradient of the image and $v$ is some non-negative function defined on the non-negative reals. Besides algorithms to approximate the MAP estimate, Lars proposed a mean-field approximation which does not calculate the most probable image, but iteratively approximates the posterior distribution by distributions which factorize over the variables $u$ and $z$. Using some further approximation (since the resulting algorithm in its plain form is not implementable) one arrives at an algorithm which some keeps more uncertainty in $u$ and, in practice, gives a point estimate for the denoised image that seems more “representative”. Here is an example:

This is the blurred and noisy image (the blur is motions blur and implemented with periodic boundary conditions for simplicity):

The next image is the approximation of the MAP estimate we got and you see the usual drawbacks. Since the MAP neglects all uncertainty in $u$ and maximizes the posterior probability, the image is “too sharp” in the way that smooth transitions (e.g. at the lighthouse) turn into piecewise constant stairs. Also the rocks appear very blocky.

Here is the result from the mean-field approximation. Since uncertainty in $u$ is taken into account, the image does not suffer from staircasing and also has a more natural appeal, most prominently in the part with the rocks.

The paper has some more examples and also shows another relation to Mumford-Shah denoising (loosely speaking, one uses a discrete latent variable $z$ to serve as a binary variable to say if a pixel is an edge or not). Oh, by the way, the algorithms behind the mean-field approximation use some parts of more general duality between random variables that Lars and his co-authors develop in another paper.

Today I gave a talk at STOR colloquium at the University of North Carolina (UNC). I spoke about the Master’s thesis of Lars Mescheder in which he developed probabilistic models for image denoising. Among other things, he proposed a Gaussian scale mixture model as an image prior and developed several methods to infer information from the respective posterior. One curios result was that the Perona-Malik diffusivity (and variants thereof) popped up in basically every algorithm he derived. It’s not only that the general idea to have an edge dependent diffusion coefficient in a PDE or, formally equivalent, a concave penalty for the magnitude of the gradient in a variational formulation, but it also turned out that the very diffusion coefficient ${\frac1{1+|\nabla u|^{2}}}$ appeared exactly in this form.

Besides the talk I got the chance to chat with many interesting people. I’d like to mention a chat about a background story on Perona-Malik which I find quite interesting:

Steven Pizer, Kenan Professor for computer science at UNC and active in a lot of fields for several decades, and in particular a driving for in the early days of digital image analysis, told that the very idea of the non-linear, gradient-magnitude dependent diffusion coefficient actually dates back to works of neuroscientists Stephen Grossberg and that he moderated a discussion between Stephan Grossberg, Pietro Perona and Jitendra Malik in which the agreed that the method should be attributed as “Grossberg-Perona-Malik” method. I could not find any more hints for this story anywhere else, but given the mathematical, psychological and biological background of Grossberg and a look at Grossbergs list of publications in the 1980’s and earlier make this story appear more than plausible.

It may also well be, that this story is just another instance of the effect, that good ideas usually pop up at more than one place…

This term I am particularly busy as I am writing a book about variational methods in imaging. The book will be a textbook and I am writing it parallel to a lecture I am currently teaching. (And it is planned that Kristian Bredies, the co-author, teaches the same stuff next term – then there will be another few month of editing, so it will be at least a year until publishing.)

In the book we will treat variational approaches to a variety of basic imaging problems. Of course we treat denoising and deblurring but there will also be sections about image interpolation, segmentation and optical flow. In the first part of the book, we present the variational problem and model them properly in Lebesgue and Sobolev spaces and of course in the space ${BV}$. Some effort goes into the analysis of the models and the first step is usually to establish existence of solutions, i.e. minimizers of the respective minimization problems. The work horse is the direct method in the calculus of variations and we mainly use the method for convex functionals in Banach spaces.

When I started the section on optical flow I noticed that I hadn’t thought about existence of minimizers before and moreover, most papers and books do not treat this issue. Let’s recall the method of Horn and Schunck to calculate the optical flow:

For two images ${u_0}$, ${u_1}$ defined on a domain ${\Omega\subset{\mathbb R}^2}$ one seeks a flow field ${V:\Omega\rightarrow{\mathbb R}^2}$ such that ${V}$ the describes the apparent motion that has happened between both images. Assuming that the points keep their gray value during motion (an assumption known as the brightness constancy constraint) and linearizing this assumption one arrives at the condition

$\displaystyle \frac{u_1-u_0}{dt} + V\cdot\nabla u_0 = 0$

(where ${dt}$ is the time between the images ${u_0}$ and ${u_1}$). First, this does not give enough equations to determine ${V}$ and secondly, points with ${\nabla u_0=0}$ are problematic.

Horn and Schunck proposed to loose the constraint and to enforce some smoothness of the flow field ${V}$: Their model was to minimize

$\displaystyle F(V) = \int_\Omega\big(\frac{u_1-u_0}{dt} + V\cdot\nabla u_0\big)^2{\mathrm d}{x} + \lambda\int_\Omega|\nabla V|^2{\mathrm d}{x}$

for some parameter ${\lambda}$ weighting smoothness of ${V}$ (large ${\lambda}$) against the brightness constancy constraint (small ${\lambda}$). A little bit more general one could choose exponents ${p}$ and ${q}$ and minimize

$\displaystyle F(V) = \int_\Omega\big|\frac{u_1-u_0}{dt} + V\cdot\nabla u_0\big|^q{\mathrm d}{x} + \lambda\int_\Omega|\nabla V|^p{\mathrm d}{x}.$

To apply the direct method to obtain existence of minimizers of ${F}$ one ensures

1. properness, i.e. there is some ${V}$ such that ${F}$ is finite,
2. convexity of ${F}$,
3. lower semi-continuiuty of ${F}$ and
4. coercivity of ${F}$.

To check these things one has to choose an appropriate space to work in. It seems reasonable to choose ${V\in L^{q}(\Omega,{\mathbb R}^2)}$. Then properness of ${F}$ is easy (consider ${V=0}$, of course assuming that ${u_1-u_0\in L^q(\Omega)}$). Convexity is also clear and for lower semi-continuity one has to work a little more, but that is possible if, e.g., ${\nabla u_0}$ is bounded. Coercivity is not that clear and in fact ${F}$ is not coercive in general.

Example 1 (Non-coercivity of the Horn-and-Schunck-model) Simply consider ${u_0(x,y) = ax + by}$ for some ${a,b\in{\mathbb R}}$. Then ${\nabla u(x,y) \equiv [a\ b]^T}$. Set ${V_n(x,y) \equiv [-nb\ na]^T}$ and note that ${\|V^n\|_q\rightarrow\infty}$ while ${F(V^n)}$ stays bounded (in fact constant).

I just checked the book “Mathematical problems in Imaging” by Gilles Aubert and Pierre Kornprobst and in Section 5.3.2 they mention that the Horn and Schunck model is not coercive. They add another term to ${F}$ which is roughly a weighted norm of ${V}$ which ensures coercivity. However, it turns out that coercivity of ${F}$ is true under a mild assumption of ${u_0}$. The idea can be found in a pretty old paper by Christoph Schnörr which is called “ Determining Optical Flow for Irregular Domains by Minimizing Quadratic Functionals of a Certain Class” (Int. J. of Comp. Vision, 6(1):25–38, 1991). His argument works for ${q=2}$:

Theorem 1 Let ${\Omega\subset{\mathbb R}^2}$ be a bounded Lipschitz domain, ${u_0,u_1\in L^2(\Omega)}$ with ${\nabla u_0\in L^\infty(\Omega)}$ such that ${\partial_x u_0}$ and ${\partial_y u_0}$ are linearly independent in ${L^2(\Omega)}$ and let ${1. Then it holds that ${F:L^2(\Omega)\rightarrow {\mathbb R}\cup\{\infty\}}$ defined by

$\displaystyle F(V) = \int_\Omega\big(\frac{u_1-u_0}{dt} + V\cdot\nabla u_0\big)^2{\mathrm d}{x} + \lambda\int_\Omega|\nabla V|^2{\mathrm d}{x}$

is coercive.

Proof: Now consider ${V^n}$ such that ${\|V^n\|_2\rightarrow\infty}$. Now we decompose the components of ${V}$ into the constant parts ${QV^n_x}$ and ${QV^n_y}$ and the “zero-mean”-part ${PV^n_x = V^n_x - QV^n_x}$ and ${PV^n_y = V^n_y - QV^n_y}$. First consider that ${PV^n}$ is unbounded, i.e. there is subsequence (also denoted by ${V^n}$) such that ${\|PV^n\|_2\rightarrow\infty}$. By Sobolev embedding and the \href{http://en.wikipedia.org/wiki/Poincar inequality}, we get that ${\int_\Omega|\nabla V^n|^p{\mathrm d}{x}\rightarrow\infty}$.

Now consider bounded ${PV^n}$ and hence, unbounded mean values ${QV^n}$. Using a subsequence, we assume that ${QV^n\rightarrow\infty}$. Now we use

$\displaystyle \Big\|\frac{u_1 - u_0}{\Delta t} + V\cdot \nabla u_0\Big\|_2 \geq \Big\|QV\cdot\nabla u_0\Big\|_2 - \Big\|\frac{u_1 - u_0}{\Delta t} + PV\cdot \nabla u_0\Big\|_2 \ \ \ \ \ (1)$

and estimate the first term from below, noticing that ${QV_x}$ and ${QV_y}$ are constants, by

$\displaystyle \begin{array}{rcl} \|QV\cdot\nabla u_0\|_2^2 & = &\|QV_x\,\partial_x u_0 + QV_y\,\partial_y u_0\|_2^2\\ & = & \|QV_x\,\partial_x u_0\|_2^2 + \|QV_y\,\partial_y u_0\|_2^2 + 2\langle QV_x\,\partial_x u_0,QV_y\,\partial_y u_0\rangle\\ & \geq &|QV_x|^2\|\partial_x u_0\|_2^2 + |QV_y|^2\|\partial_y u_0\|_2^2\\ &&\qquad - \|QV_x\,\partial_xu_0\|_2\|QV_y\,\partial_yu_0\|_2\,2\frac{|\langle \partial_x u_0,\partial_y u_0\rangle|}{\|\partial_xu_0\|_2\|\partial_y u_0\|_2}\\ & \geq &(|QV_x|^2\|\partial_x u_0\|_2^2 + |QV_y|^2\|\partial_y u_0\|_2^2) \Big(1 - \frac{|\langle \partial_x u_0,\partial_y u_0\rangle|}{\|\partial_xu_0\|_2\|\partial_y u_0\|_2}\Big). \end{array}$

Since ${\partial_x u_0}$ and ${\partial_y u_0}$ are linearly independent, it holds that ${1 - \frac{|\langle \partial_x u_0,\partial_y u_0\rangle|}{\|\partial_xu_0\|_2\|\partial_y u_0\|_2}>0}$ and we conclude that ${\|QV^{n_k}\|_2\rightarrow\infty}$ implies that ${\|QV^{n_k}\cdot\nabla u_0\|_2^2\rightarrow\infty}$. Together with~(1) and boundedness of ${PV^{n_k}}$ we obtain that ${F(V^{n_k})\rightarrow\infty}$. Since for every subsequence of ${V^n}$ we get another subsequence ${V^{n_k}}$ such that ${F(V^{n_k})\rightarrow\infty}$, the same conclusion holds for the whole sequence, showing coercivity of ${F}$. $\Box$

Basically the same arguments works for ${TV}$ optical flow, i.e. coercivity of

$\displaystyle F(V) = \int_\Omega\big(\frac{u_1-u_0}{dt} + V\cdot\nabla u_0\big)^2{\mathrm d}{x} + \lambda TV(V).$

However, I do not know yet what happens for ${q\neq 2}$ and if the result on coercivity is “sharp” in the sense that linear independence of ${\partial_x u_0}$ and ${\partial_y u_0}$ is necessary. Also, I don’t know yet what is true in dimensions higher than ${2}$.

Here I continue my previous post on methods to compare shapes. In that post we started with different metrics between two probability measures ${\mu_1}$ and ${\mu_2}$ defined on the same set ${X}$, namely the Prokhorov metric and the Wasserstein metric. Then we also had a look on the Hausdorff metric which measures the distance between two compact subsets ${A}$ and ${B}$ of a metric space ${(X,d)}$ and finally considered the Gromov-Hausdorff metric between two metric spaces ${(X,d_X)}$ and ${(Y,d_Y)}$. Our tools have been

• set couplings of ${X}$ and ${Y}$, i.e. subsets ${R\subset X\times Y}$ such that for all ${x\in X}$ there ${y\in Y}$ such that ${(x,y)\in R}$ and for all ${y\in Y}$ there is ${x\in X}$ such that ${(x,y)\in R}$,
• metric couplings of ${d_X}$ and ${d_Y}$, i.e. metrics ${d}$ on the disjoint union of ${X}$ and ${Y}$ such that ${d(x,x') = d_X(x,x')}$ if ${x}$ and ${x'}$ are in ${X}$ and ${d(y,y') = d_Y(y,y')}$ if ${y}$ and ${y'}$ are in ${Y}$. In fact, one could also work with semi-metrics ${d}$, i.e. they do not need to be positive definite, and
• measure couplings of ${\mu_X}$ and ${\mu_Y}$, i.e. measures ${\nu}$ on ${X\times Y}$ such that ${\nu(A\times Y) = \mu_X(A)}$ and ${\nu(X\times B) = \mu_Y(B)}$ for all ${\mu_X}$-/${\mu_Y}$-measurable set ${A}$ and ${B}$, respectively.

Now we make the next (and final) step and compare metric spaces ${(X,d_X)}$ and ${(Y,d_Y)}$ which are both equipped with a measure. These objects are known as metric measure spaces or mm-spaces and are formally defined as follows:

Definition 1 (mm-space) A metric measure space is a tripel ${(X,d_X,\mu_X)}$ consisting a compact metric space ${(X,d_X)}$ and a Borel probability measure ${\mu_X}$ on ${X}$.

Note that sometimes it is included that ${\mu_X}$ has full support (i.e., equal to ${X}$) in the definition of an mm-space, but it seems that not everybody does it like that.

1. Comparing mm-spaces: Gromov-Wasserstein

Our question is: How to we compare two mm-spaces ${(X,d_X,\mu_X)}$ and ${(Y,d_Y,\mu_Y)}$? The plan is simply to augment the previous versions of the Gromov-Hausdorff distances defined in my previos post here and here by something which takes the measures on the respective metric spaces into account. We recall both formulations of the Gromov-Hausdorff distance: The first is

$\displaystyle d_{GH}(X,Y) = \inf_{R,d} \sup_{(x,y)\in R} d(x,y) \ \ \ \ \ (1)$

where the infimum is taken over all set couplings ${R}$ of ${X}$ and ${Y}$ and metric couplings ${d}$ of ${d_X}$ and ${d_Y}$, and the second is

$\displaystyle d_{GH}(X,Y) = \tfrac12\inf_R \sup_{\overset{\overset{x_{1/2}\in X}{y_{1/2}\in Y}}{(x_i,y_i)\in R}}\big| d_X(x_1,x_2) - d_Y(y_1,y_2)\big| \ \ \ \ \ (2)$

where the infimum is also taken over all set couplings ${R}$ of ${X}$ and ${Y}$.

Basically, we have already seen what we should do to build a metric between mm-spaces. The idea is: if there were some “natural measure” ${\mu_R}$ for any set coupling ${R}$ of ${X}$ and ${Y}$, then we would simply define the distance between ${(X,d_X,\mu_X)}$ and ${(Y,d_Y,\mu_Y)}$ as

$\displaystyle d_{GH}(X,Y) = \inf_{R,d} \Big(\int d(x,y)^pd\mu_R\Big)^{1/p}$

(as a generalization of (1)) and

$\displaystyle d_{GH}(X,Y) = \tfrac12\inf_R \Big(\int_{R\times R}\big| d_X(x_1,x_2) - d_Y(y_1,y_2)\big|^p d\mu_R(x_1,y_1)d\mu_R(x_2,y_2)\Big)^{1/p}$

(as a generalization of (2)). In both cases we can have ${1\leq p<\infty}$. Note that the obvious modification for ${p=\infty}$ leads to something very similar to the Gromov-Hausdorff metrics.

But indeed there are “natural measures” on the set couplings of ${X}$ and ${Y}$: At least for the full coupling ${R= X\times Y}$ there are the measure couplings ${\nu}$ of ${\mu_X}$ and ${\mu_Y}$! (One does not need to consider smaller set couplings ${R\subsetneq X\times Y}$ since this can be taken into account by the measure couplings; they do not need to have full support anyway.)

Applying this idea to the version (1) of the Gromov-Hausdorff metric we arrive at the following expression, which can be called Gromov-Wasserstein metric,

$\displaystyle d_{Gw}^1(X,Y) = \inf_{\nu,d} \Big(\int_{X\times Y} d(x,y)^pd\nu(x,y)\Big)^{1/p} \ \ \ \ \ (3)$

where the infimum is taken over all measure couplings ${\nu}$ of ${\mu_X}$ and ${\mu_Y}$ and all metric couplings ${d}$ of ${d_X}$ and ${d_Y}$.

Starting from the version (2) of the Gromov-Hausdorff metric we arrive at another formulation:

$\displaystyle d_{GW}^2(X,Y) = \tfrac12\inf_\nu \Big(\int_{X\times Y}\int_{X\times Y}\big| d_X(x_1,x_2) - d_Y(y_1,y_2)\big|^p d\nu(x_1,y_1)d\nu(x_2,y_2)\Big)^{1/p} \ \ \ \ \ (4)$

where the infimum is taken over all measure couplings ${\nu}$ of ${\mu_X}$ and ${\mu_Y}$.

While both versions of the Gromov-Hausdorff metric for compact metric spaces where equal, the same is not true for both generalizations to mm-spaces: In his paper Memoli proves that ${d_{GW}^2\leq d_{GW}^1}$ and gives an example (right after Remark 5.14) where strict inequality holds.

2. Comparing mm-spaces: Gromov-Prokhorov

Instead of starting from the Gromov-Hausdorff distance between metric spaces and augmenting their definition with something that takes the measures into account, we could also start from the “>Prokhorov metric between probability measures and augment the definition with something that takes the metric into account. In fact there are also two possibilities to do so: In the appendix of his paper, Memoli quotes this version (from this paper by Greven, Pfaffelhuber and Winter) of a metric between mm-spaces which we call Gromov-Prokhorov metric

$\displaystyle d_{GP}^1(X,Y) = \inf_{\nu,d}\Big\{\epsilon>0\ :\ \nu\{(x,y)\ :\ d(x,y)\geq\epsilon\}\leq\epsilon\Big\} \ \ \ \ \ (5)$

where the infimum is taken over all measure couplings ${\nu}$ of ${\mu_X}$ and ${\mu_Y}$ and all metric couplings ${d}$ of ${d_X}$ and ${d_Y}$.

The next version (also from the same paper by Greven, Pfaffelhuber and Winter where it was called Eurandom metric) is

$\displaystyle d_{GP}^2(X,Y) = \inf_\nu\Big\{\epsilon>0\ :\ \nu\otimes\nu\{(x_1,y_1,x_2,y_2)\ :\ |d_X(x_1,x_2) - d_Y(y_1,y_2)|\geq\epsilon\}\leq\epsilon\Big\} \ \ \ \ \ (6)$

where the infimum is taken over all measure couplings ${\nu}$ only.

3. A very simple example

The calculation of the proposed metrics by hand can be quite cumbersome. Let’s look at the simplest example.

We consider metric spaces ${X,Y\subset{\mathbb R}^d}$ (with the euclidean metric) accompanied with the measures ${\mu_X=\delta_{x_0}}$ and ${\mu_Y = \delta_{y_0}}$ for some points ${x_0\in X}$ and ${y_0\in Y}$. In this case the is only one measure coupling of ${\mu_X}$ and ${\mu_Y}$, namely

$\displaystyle \nu = \delta_{(x_0,y_0)}.$

Now it is easy to calculate the variant (4) of the Gromov-Wasserstein metric:

$\displaystyle \begin{array}{rcl} d_{GW}^2(X,Y) &=& \tfrac12\Big(\int_{X\times Y}\int_{X\times Y}\big| |x_1-x_2| - |y_1 -y_2|\big|^p d\delta_{(x_0,y_0)}(x_1,y_1)d\delta_{(x_0,y_0)}\Big)^{1/p} \\ &=& \tfrac12\Big(\int_{X\times Y}\big| |x_0-x_2| - |y_0 -y_2|\big|^pd\delta_{(x_0,y_0)}\Big)^{1/p} \\ &=& 0. \end{array}$

Let’s have a look at the variant (3): Since there is only one measure coupling, the metric is

$\displaystyle \begin{array}{rcl} d_{GW}^1(X,Y) & = &\inf_d\Big(\int_{X\times Y} d(x,y)^pd\delta_{(x_0,y_0)}(x,y)\Big)^{1/p} \\ & = &\inf_d d(x_0,y_0). \end{array}$

As we have learned in Example 4 in the previous post, we can find a metric coupling of ${X}$ and ${Y}$ that brings the points ${x_0}$ in ${X}$ and ${y_0}$ in ${Y}$ arbitrarily close together (by embedding both ${X}$ and ${Y}$ into some ${{\mathbb R}^n}$ such that these points are only ${\epsilon}$-far away from each other). Hence, we see that we have

$\displaystyle d_{GW}^1(X,Y) = 0$

similarly to ${d_{GW}^2}$.

Now let’s look at the Gromov-Prokhorov metric from (5). Again we only have one measure coupling and we get

$\displaystyle d_{GP}^1(X,Y) = \inf\Big\{\epsilon>0\ :\ \exists d\ \text{s.t}\ \delta_{(x_0,y_0)}\{(x,y)\ :\ d(x,y)\geq\epsilon\}\leq\epsilon\Big\}.$

Since the measure coupling is a Dirac, we can evaluate

$\displaystyle \delta_{(x_0,y_0)}\{(x,y)\ :\ d(x,y)\geq\epsilon\} = \begin{cases} 1 & d(x_0,y_0)\geq \epsilon\\ 0 & d(x_0,y_0)<\epsilon. \end{cases}$

As observed previously, there are metric couplings which bring the points ${x_0}$ and ${y_0}$ arbitrarily close together, and hence for any ${\epsilon>0}$ there is a metric coupling such that ${\delta_{(x_0,y_0)}\{(x,y)\ :\ d(x,y)\geq\epsilon\} = 0}$ which shows that

$\displaystyle d_{GP}^1(X,Y) = 0.$

Finally, consider the second variant of the Gromov-Prokhorov metric from (6). Since we only have the one measure coupling ${\nu = \delta_{(x_0,y_0)}}$ we have

$\displaystyle d_{GP}^2(X,Y) = \inf\Big\{\epsilon>0\ :\ \delta_{(x_0,y_0)}\otimes\delta_{(x_0,y_0)}\{(x_1,y_1,x_2,y_2)\ :\ \big| |x_1-x_2| - |y_1-y_2|\big|\geq\epsilon\}<\epsilon\Big\}.$

Evaluating the tensored Dirac delta is easy: We have that ${\delta_{(x_0,y_0)}\otimes\delta_{(x_0,y_0)}\{(x_1,y_1,x_2,y_2)\ :\ \big| |x_1-x_2| - |y_1-y_2|\big|\geq\epsilon\}}$ is either one or zero and it is one if and only if the point ${(x_0,y_0,x_0,y_0)}$ is in the set that is measured. However, we have that ${\big| |x_0-x_0| - |y_0-y_0|\big|}$ is, of course, always zero (and never larger than any ${\epsilon >0}$). Hence, the measure is always zero and hence, this version of the Gromov-Prokhorov distance also gives

$\displaystyle d_{GP}^2(X,Y) = 0.$

Note that all four metrics can not see that the two Diracs are at different points. The reason seems to be, that one can “deform the metric space outside of the support of the measures” arbitrarily, in some sense.

It seems, that the computation of ${d_{GW}^2}$ and ${d_{GP}^2}$ were easier, since one needs to know all measure couplings and no metric couplings has been involved.

4. Next difficult example: Compare some mm-space to a point

Ok, we have seen that mm-spaces which carry their measure in just a single point all look alike in both versions of the Gromov-Wasserstein metric and also in both versions of the Gromov-Prokhorov metric. Let’s look at a slightly more difficult example: We consider some mm-space ${(X,d_X,\mu)}$ and want to calculate its distance to a single point, i.e. the mm-space ${Y=\{0\}}$ with the only possible metric and measure. This should somehow measure the “size” or “spread” of the mm-space ${X}$.

First, we need to know all measure couplings and all metric couplings between these spaces. The measure couplings are very easy: There is just one, namely

$\displaystyle \nu = \mu\otimes \delta$

(i.e. all subsets of ${X\times \{0\}}$ are treated as if they were subsets of ${X}$). Concerning metric couplings, there are a few more. We allow semi-metrics ${d}$ on the disjoint union ${X\sqcup\{0\}}$: Since ${d}$ should respect the metric ${d_X}$ on ${X}$ we see that all metric couplings are parametrized by the points ${x_0}$ in ${X}$ by identifying ${0}$ (the element in ${Y}$) with this point ${x_0}$, i.e. all metric couplings are of the form ${d_{x_0}}$ defined by

$\displaystyle d_{x_0}(x,y) = d_X(x,y)\ (\text{for}\ x,y\in X),\qquad d_{x_0}(x,0) = d_X(x,x_0).$

(This only gives semi-metrics since we have ${d_{x_0}(x_0,0)=0}$ although, formally ${x_0\not 0}$.)

Let’s calculate the first Gromov-Wasserstein metric: There is only one measure coupling and we use the parametrization of the metric couplings to deduce

$\displaystyle \begin{array}{rcl} d_{GW}^1(X,\{0\}) &=& \inf_d\Big(\int_{X\times\{0\}} d(x,0)^pd(\mu\otimes \delta)(x,y)\Big)^{1/p}\\ &=& \inf_{x_0\in X}\Big(\int_{X} d_X(x,x_0)^pd\mu\Big)^{1/p}. \end{array}$

This quantity seems to be known as “minimal ${p}$-th central moment” of ${(X,d_X,\mu)}$.

The second variant of the Gromov-Wasserstein metric is (remember, there is only one measure coupling)

$\displaystyle \begin{array}{rcl} d_{GW}^2(X,\{0\}) &=& \tfrac12\Big(\int_{X\times\{0\}}\int_{X\times\{0\}} |d_X(x_1,x_2) - d_Y(0,0)|^p d(\mu\times\delta)(x_1,y_1) d(\mu\times\delta)(x_2,y_2) \Big)^{1/p}\\ &=& \tfrac12\Big(\int_X\int_X d_X(x_1,x_2)^p d\mu d\mu\Big)^{1/p}. \end{array}$

This quantity (without the factor ${1/2}$) is called the “${p}$-diameter” of ${(X,d_X)}$.

Let’s turn to the Gromov-Prokhorov metrics. The first one is (remember, that the metric couplings ${d}$ are parametrized by the points in ${X}$)

$\displaystyle \begin{array}{rcl} d_{GP}^1(X,\{0\}) &=& \inf_{d}\Big\{\epsilon>0\ :\ (\mu\otimes\delta)\{(x,0)\ :\ d(x,0)\geq\epsilon\}\leq\epsilon\Big\}\\ &=& \inf_{x_0\in X}\Big\{\epsilon>0\ :\ \mu\{x\ :\ d_X(x,x_0)\geq\epsilon\}\leq\epsilon\Big\}. \end{array}$

If this looks familiar, then you may have encountered the Ky-Fan metric already? The Ky-Fan metric is a metric between random variables ${\xi_1}$ and ${\xi_2}$ defined of the same probability space ${(X,\mu)}$ with values in a metric space with metric ${d}$. It reads as

$\displaystyle d_{KF}(\xi_1,\xi_2) = \inf\Big\{\epsilon>0\ :\ \mu\{\omega\ :\ d(\xi_1(\omega),\xi_2(\omega))\geq\epsilon\}\leq\epsilon\Big\}.$

Hence, the first version of the Gromov-Prokhorov metric is

$\displaystyle d_{GP}^1(X,\{0\}) = \inf_{x_0\in X}\ d_{KF}(\mathrm{id},x_0),$

i.e., the minimal Ky-Fan metric between the identity mapping and the constant mappings. (In other words, it measures how far the identity is from the constant mappings in the sense of Ky Fan.)

The second variant of the Gromov-Prokhorov metric is (remember, the only measure coupling is ${\nu = \mu\times \delta}$)

$\displaystyle \begin{array}{rcl} d_{GP}^2(X,\{0\}) &=& \inf \Big\{\epsilon>0\ :\ \nu\otimes\nu\{(x_1,0,x_2,0)\ :\ |d_X(x_1,x_2) -0|\geq\epsilon\}\leq\epsilon\Big\}\\ &=& \inf\Big\{\epsilon>0\ :\ \mu\otimes\mu\{(x_1,x_2)\ :\ d_X(x_1,x_2)\geq\epsilon\}\leq\epsilon\Big\}. \end{array}$

I do not have a neat name or a good intuition for this metric yet (although it also looks like it measures “size” or “non-localization” of ${X}$ in some sense). If you have one, let me know!

With this post I delve into a topic which is somehow new to me, although I planned to look deeper into this for quite some time already. I stumbled upon the paper Gromov-Wasserstein distances and the metric approach to object matching by Facundo Mémoli which was a pleasure to read and motivated this post.

1. Comparing measures with norms and metrics

There are different notions in mathematics to compare two objects, think of the size of real numbers, the cardinality of sets or the length of the difference of two vectors. Here we will deal with not only comparison of objects but with “measures of similarity”. Two fundamental notions for this are norms in vector spaces and metrics. The norm is the stronger concept in that it uses more structure than a metric and also, every norm induces a metric but not the other way round. There are occasions in which both a norm and a metric are available but lead to different concepts of similarity. One of these instances occurs in sparse recovery, especially in the continuous formulation, e.g. as described in a previous post. Consider the unit interval ${I = [0,1]}$ and two Radon measures ${\mu_1}$ and ${\mu_2}$ on ${I}$ (${I}$ could also be an aritrary metric space). On the space of Radon measures ${\mathfrak{M}(I)}$ there is the variation norm

$\displaystyle \|\mu\|_{\mathfrak{M}}= \sup_\Pi\sum_{A\in\Pi}|\mu(A)|$

where the supremum is taken over all partitions ${\Pi}$ of ${I}$ into a finite number of measurable sets. Moreover, there are different metrics one can put on the space of Radon measures, e.g. the Prokhorov metric which is defined for two probability measures (e.g. non-negative ones with unit total mass)

$\displaystyle \begin{array}{rcl} d_P(\mu_1,\mu_2) & = & \inf\{\epsilon>0\ :\ \mu_1(A)\leq \mu_2(A^\epsilon) + \epsilon,\nonumber\\ & & \qquad \mu_2(A)\leq \mu_1(A^\epsilon) + \epsilon\ \text{for all measurable}\ A\} \end{array}$

where ${A^\epsilon}$ denotes the ${\epsilon}$-neighborhood of ${A}$. Another familiy of metrics are the Wasserstein metrics: For ${p\geq 1}$ define

$\displaystyle d_{W,p}(\mu_1,\mu_2) = \Big(\inf_\nu\int_{I\times I} |x-y|^p d\nu(x,y)\Big)^{1/p} \ \ \ \ \ (1)$

where the infimum is taken over all measure couplings of ${\mu_1}$ and ${\mu_2}$, that is, all measures ${\nu}$ on ${I\times I}$ such that for measurable ${A}$ it holds that

$\displaystyle \nu(A\times I) = \mu_1(A)\ \text{and}\ \nu(I\times A) = \mu_2(A).$

Example 1 We compare two Dirac measures ${\mu_1 = \delta_{x_1}}$ and ${\mu_2 = \delta_{x_2}}$ located at distinct points ${x_1\neq x_2}$ in ${I}$ as seen here:

The variation norm measures their distance as

$\displaystyle \|\mu_1-\mu_2\|_{\mathfrak{M}} = \sup_\Pi\sum_{A\in\Pi}|\delta_{x_1}(A) - \delta_{x_2}(A)| = 2$

(choose ${\Pi}$ such that it contains ${A_1}$ and ${A_2}$ small enough that ${x_1\in A_1}$, ${x_2\in A_2}$ but ${x_1\notin A_2}$ and ${x_2\notin A_1}$). The calculate the Prokhorov metric note that you only need to consider ${A}$‘s which contain only one of the points ${x_{1/2}}$ and hence, it evaluates to

$\displaystyle d_P(\mu_1,\mu_2) = |x_1-x_2|.$

For the Wasserstein metric we observe that there is only one possible measure coupling of ${\delta_{x_1}}$ and ${\delta_{x_2}}$, namely the measure ${\nu = \delta_{(x_1,x_2)}}$. Hence, we have

$\displaystyle d_{W,p}(\mu_1,\mu_2) = \Big(\int_{I\times I}|x-y|^pd\delta_{(x_1,x_2)}(x,y)\Big)^{1/p} = |x_1-x_2|.$

The variation norm distinguishes the two Diracs but is not able to grasp the distance of their supports. On the other hand, both metrics return the geometric distance of the supports in the underlying space ${I}$ as distance of the Diracs. Put in pictures: The variation norm of the difference measures the size ob this object

while both metrics capture the distance of the measures like here

It should not stay unnoted that convergence in both the Prokhorov metric and the Wasserstein metrics is exactly the weak convergence of probability measures.

The above example provides a motivation to study metric structures on spaces, even if they are also equipped with a norm. Another reason to shift attention from normed spaces to metric spaces is the fact that there has emerged a body of work to build a theory of analysis in metric spaces (see, e.g. this answer on mathoverflow or the book Gradient Flows: In Metric Spaces And In The Space Of Probability Measures by Ambrosio, Gigli and Savaré (which puts special emphasis on the space of probability measures)). Yet another motivation for the study of metrics in this way is the problem of comparing shapes (without being precisely defined yet): Which of these shapes look most alike?

(Note that shapes need not to be two dimensional figures, you may also think of more complex objects like surfaces in three dimensions or Riemannian manifolds.)

One may also ask the question how two compare different images defined on different shapes, i.e. different “distributions of colour” on two different shapes.

2. Comparing shapes: Metric spaces

Up to now we tried to compare different measures, defined on the same set. At least to me it seems that both the Prokhorov and the Wasserstein metrics are suited to measure the similarity of measures and in fact, they do so somehow finer than the usual norm does.

Let’s try to go one step further and ask ourselves, how we could compare two measures ${\mu_1}$ and ${\mu_2}$ which are defined on two different sets? While thinking about an answer one need to balance several things:

• The setup should be general enough to allow for the comparison of a wide range of objects.
• It should include enough structure to allow meaningful statements.
• It should lead to a measure which is easy enough to handle both analytically and computationally.

For the first and second bullet: We are going to work with measures not on arbitrary sets but on metric spaces. This will allow to measure distances between points in the sets and, as you probably know, does not pose a severe restriction. Although metric spaces are much more specific than topological spaces, we still aim at quantitative measures which are not provided by topologies. With respect to the last bullet: Note that both the Prokhorov and the Wasserstein metric are defined as infimums over fairly large and not too well structured sets (for the Prokhorov metric and need to consider all measurable sets and their ${\epsilon}$-neighborhoods, for the Wasserstein metric, one need to consider all measure couplings). While they can be handled quite well theoretically, their computational realization can be cumbersome.

In a similar spirit than Facundo Memoli’s paper we work our way up from comparing subsets of metric spaces up to comparing two different metric spaces with two measures defined on them.

2.1. Comparing compact subsets of a metric space: Hausdorff

Let ${(X,d)}$ be a compact metric space. Almost hundred years ago Hausdorff introduced a metric on the family of all non-empty compact subsets of a metric space as follows: The Hausdorff metric of two compact subsets ${A}$ and ${B}$ of ${X}$ is defined as

$\displaystyle d_H(A,B) = \inf\{\epsilon>0 \ :\ A\subset B_\epsilon,\ B \subset A_\epsilon\}$

(again, using the notion of ${\epsilon}$-neighborhood). This definition seems to be much in the spirit of the Prokhorov metric.

Proposition 2.1 in Facundo Memolis paper shows that the Hausdorff metric has an equivalent description as

$\displaystyle d_H(A,B) = \inf_R \sup_{(a,b) \in R} d(a,b)$

where the infimum is taken over all correspondences ${R}$ of ${A}$ and ${B}$, i.e., all subset ${R\subset A\times B}$ such that for all ${a\in A}$ there is ${b\in B}$ such that ${(a,b) \in R}$ and for all ${b\in B}$ there ${a\in A}$ such that ${(a,b)\in R}$. One may also say set coupling of ${A}$ and ${B}$ instead of correspondence.

Example 2 There is always the full coupling ${R = A\times B}$. Three different set couplings of two subsets ${A}$ and ${B}$ of the unit interval are shown here:

the “full one” ${A\times B}$ in green and two “slim” ones in red and orange. Other “slim” couplings can be obtained from surjective mappings ${f:A\rightarrow B}$ by ${R = \{(a,f(a))\ :\ a\in A\}}$ (or with the roles of ${A}$ and ${B}$ swapped): If you couple a set ${A}$ with itself, there is also the trivial coupling

$\displaystyle R = \{(a,a)\ : \ a\in A\}$

which is just the diagonal of ${A\times A}$

Note that the alternative definition of the Hausdorff metric is more in the spirit of the Wasserstein metric: It does not use enlarged objects (by ${\epsilon}$-neighborhoods) but couplings.

The Hausdorff metric is indeed a metric on the set ${\mathfrak{C}(X)}$ of all non-empty compact subsets of a metric space ${X}$ and if ${X}$ itself is compact it even holds that ${(\mathfrak{C}(X),d_H)}$ is a compact metric space (a result, known as Blaschke Selection Theorem).

One may say that we went up an abstraction ladder one step by moving from ${(X,d)}$ to ${(\mathfrak{C}(X),d_H)}$.

2.2. Comparing compact metric spaces: Gromov-Hausdorff

In the previous subsection we worked within one metric space ${X}$. In the book “Metric Structures for Riemannian and Non-Riemannian Spaces” Misha Gromov introduced a notion to compare two different metric spaces. For compact metric space ${X}$ and ${Y}$ the Gromov-Hausdorff metric is defined as

$\displaystyle d_{GH}(X,Y) = \inf_{Z,f,g} d_H(f(X),g(Y)) \ \ \ \ \ (2)$

where the infimum is taken over

• all metric spaces ${Z}$ and
• all isometric embeddings ${f}$ and ${g}$ which embed ${X}$ and ${Y}$ into ${Z}$ respectively.

In words: To compute the Gromov-Hausdorff metric, you try embed both ${X}$ and ${Y}$ into a common larger space isometrically such that they are as close as possible according to the Hausdorff metric in that space.

Strictly speaking, the above definition is not well stated as one can not form an infimum over all metric spaces since this collection does not form a set according to the rules of set theory. More precisely one should write that ${d_{GH}(X,Y)}$ is the infimum over all ${r>0}$ such that there exists a metric space ${Z}$ and isometric embeddings ${f}$ and ${g}$ of ${X}$ and ${Y}$, respectively, such that ${d_H(f(X),g(Y)).

As the Hausdorff metric could be reformulated with set couplings there is a reformulation of the Gromov-Hausdorff metric based on metric couplings: A metric coupling of two metric spaces ${(X,d_X)}$ and ${(Y,d_Y)}$ is a metric ${d}$ on the disjoint union ${X\sqcup Y}$ of ${X}$ and ${Y}$ such that for all ${x,x'\in X}$ and ${y,y'\in Y}$ it holds that ${d(x,x') = d_X(x,x')}$ and ${d(y,y') = d_Y(y,y')}$.

Example 3 We couple a metric space ${(X,d)}$ with itself. We denote with ${(X',d')}$ an identical copy of ${(X,d)}$ and look for a metric ${D}$ on ${X\times X'}$ that respects the metrics ${d}$ and ${d'}$ in the way a metric coupling has to.

To distinguish elements from ${X}$ and ${X'}$ we put a ${'}$ on all quantities from ${X'}$. Moreover, for ${x\in X}$ we denote by ${x'}$ its identical copy in ${X'}$ (and similarly for ${x'\in X'}$, ${x}$ is its identical twin). Then, for any ${\epsilon>0}$ we can define ${D_\epsilon(x,x') = D_\epsilon(x',x) = \epsilon}$ (i.e. the distance between any two identical twins is ${\epsilon}$. By the triangle inequality we get for ${x\in X}$ and ${y'\in X'}$ that ${D_\epsilon(x,y')}$ should fulfill

$\displaystyle D_\epsilon(x',y') - D_\epsilon(x',x) \leq D_\epsilon(x,y') \leq D_\epsilon(x,y) + D_\epsilon(y,y')$

and hence

$\displaystyle d(x,y) - \epsilon \leq D_\epsilon(x,y') \leq d(x,y) + \epsilon.$

Indeed we can choose ${D_\epsilon(x,y') = d(x,y)}$ if ${x\in X}$ and ${y'\in Y}$ leading to one specific metric coupling for any ${\epsilon}$. This couplings allow to distinguish identical twins and behave as a metric on the whole disjoint union. In the limiting case ${\epsilon\rightarrow 0}$ we do not obtain a metric but a semi-metric or pseudo-metric which is just the same as a metric but without the assumption that ${d(x,y) = 0}$ implies that ${x=y}$.

Example 4 The above example of a metric coupling of a metric space with itself was somehow “reproducing” the given metric as accurate as possible. There are also other couplings that put very different distances to points ${D(x,y')}$ and there is also a way to visualize metric couplings: When building the disjoint union of two metric spaces ${X}$ and ${Y}$, you can imagine this as isometrically embedding both in a larger metric space ${Z}$ in a non-overlapping way and obtain the metric coupling ${D}$ as the restriction of the metric on ${Z}$ to ${X\sqcup Y}$. For ${X=Y=[0,1]}$ you can embed both into ${Z = {\mathbb R}^2}$. A metric coupling which is similar (but not equal) to the coupling of the previous example is obtained by putting ${X}$ and ${Y}$ side by side at distance ${\epsilon}$ as here (one space in green, the other in blue).

A quite different coupling is obtained by putting ${X}$ and ${Y}$ side by side, but in a reversed way as here:

You may even embed them in a more weired way as here:

but remember that the embeddings has to be isometric, hence, distortions like here are not allowed.

This example illustrate that the idea of metric coupling is in similar spirit as of “embedding two spaces in a common larger one”.

With the notion of metric coupling, the Gromov-Hausdorff metric can be written as

$\displaystyle d_{GH}(X,Y) = \inf_{R,d} \sup_{(x,y)\in R} d(x,y) \ \ \ \ \ (3)$

where the infimum is taken over all set couplings ${R}$ of ${X}$ and ${Y}$ and all metric couplings ${d}$ of ${(X,d_X)}$ and ${(Y,d_Y)}$.

In words: To compute the Gromov-Hausdorff metric this way, you look for a set coupling of the base sets ${X}$ and ${Y}$ and a metric coupling ${d}$ of the metrics ${d_X}$ and ${d_Y}$ such that the maximal distance of two coupled points ${x}$ and ${y}$ is as small as possible. While this may look more complicated than the original definition from~(2), note that the original definition uses all metric spaces ${Z}$ in which you can embed ${X}$ and ${Y}$ isometrically, which seems barely impossible to realize. Granted, the new definition also considers a lot of quantities.

Also note that this definition is in spirit of the Wasserstein metric from~(1): If there were natural measures ${\mu_R}$ on the set couplings ${R}$ we could write \begin{equation*} d_{GH}(X,Y) = \inf_{R,d} \Big(\int d(x,y)^pd\mu_R\Big)^{1/p} \end{equation*} and in the limit ${p\rightarrow\infty}$ we would recover definition~(3).

Example 5 The Gromov-Hausdorff distance of a metric space ${(X,d_X)}$ to itself is easily seen to be zero: Consider the trivial coupling ${R = \{(x,x)\ :\ x\in X\}}$ from Example~2 and the family ${D_\epsilon}$ of metric couplings from Example~3. Then we have ${d_{GH}(X,X) \leq \epsilon}$ for any ${\epsilon >0}$ showing ${d_{GH}(X,X) = 0}$. Let’s take one of the next-complicated examples and compute the distance of ${X = [0,1]}$ and ${Y=[0,2]}$, both equipped with the euclidean metric. We couple the sets ${X}$ and ${Y}$ by ${R = \{(x,2x)\ : \ x\in X\}}$ and the respective metrics by embedding ${X}$ and ${Y}$ into ${{\mathbb R}^2}$ as follows: Put ${Y}$ at the line from ${(0,0)}$ to ${(2,0)}$ and ${X}$ at the line from ${(\tfrac12,\epsilon)}$ to ${(1\tfrac12,\epsilon)}$:

This shows that ${d_{GH}(X,Y) \leq \tfrac12}$ and actually, we have equality here.

There is another reformulation of the Gromov-Hausdorff metric, the equivalence of which is shown in Theorem 7.3.25 in the book “A Course in Metric Geometry” by Dmitri Burago, Yuri Burago and Sergei Ivanov:

$\displaystyle d_{GH}(X,Y) = \tfrac12\inf_R \sup_{\overset{\overset{x_{1/2}\in X}{y_{1/2}\in Y}}{(x_i,y_i)\in R}}\big| d_X(x_1,x_2) - d_Y(y_1,y_2)\big| \ \ \ \ \ (4)$

where the infimum is taken over all set couplings ${R}$ of ${X}$ and ${Y}$.

In words: Look for a set coupling such that any two coupled pairs ${(x_1,y_1)}$ and ${(x_2,y_2)}$ have the “most equal” distance.

This reformulation may have the advantage over the form (3) in that is only considers the set couplings and the given metrics ${d_X}$ and ${d_Y}$ and no metric coupling is needed.

Note that, as the previous reformulation~(3), it is also in the spirit of the Wasserstein metric: If there were natural measures ${\mu_R}$ in the set couplings ${R}$, we could write

$\displaystyle d_{GH}(X,Y) = \tfrac12\inf_R \Big(\int_{R\times R}\big| d_X(x_1,x_2) - d_Y(y_1,y_2)\big|^p d\mu_R(x_1,y_1)d\mu_R(x_2,y_2)\Big)^{1/p}.$

and recover the formulation~(4) in the limit ${p\rightarrow\infty}$.

One may say that we went up an abstraction ladder one step further by moving from ${(X,d)}$ to ${(\mathfrak{C}(X),d_H)}$ to ${(\text{All compact metric spaces},d_{GH})}$.

Since this post has been grown pretty long already, I decided to do the next step (which is the already announced metric on metric spaces which additionally carry some measure on them – so-called metric measure spaces) in a later post.

Today there are several things I could blog on. The first is the planary by Rich Baraniuk on Compressed Sensing. However, I don’t think that I could reflect the content in a way which would be helpful for a potential reader. Just for the record: If you have the chance to visit one of Rich’s talk: Do it!

The second thing is the talk by Bernd Hofmann on source conditions, smoothness and variational inequalities and their use in regularization of inverse problems. However, this would be too technical for now and I just did not take enough notes to write a meaningful post.

As a third thing I have the talk by Christian Clason on inverse problems with uniformly distributed noise. He argued that for uniform noise it is much better to use an ${L^\infty}$ discrepancy term instead of the usual ${L^2}$-one. He presented a path-following semismooth Newton method to solve the problem

$\displaystyle \min_x \frac{1}{p}\|Kx-y^\delta\|_\infty^p + \frac{\alpha}{2}\|x\|_2^2$

and showed examples with different kinds of noise. Indeed the examples showed that ${L^\infty}$ works much better than ${L^2}$ here. But in fact it works even better, if the noise is not uniformly distributed but “impulsive” i.e. it attains bounds ${\pm\delta}$ almost everywhere. It seems to me that uniform noise would need a slightly different penalty but I don’t know which one – probably you do? Moreover, Christian presented the balancing principle to choose the regularization parameter (without knowledge about the noise level) and this was the first time I really got what it’s about. What one does here is, to choose ${\alpha}$ such that (for some ${\sigma>0}$ which only depends on ${K}$, but not on the noise)

$\displaystyle \sigma\|Kx_\alpha^\delta-y^\delta\|_\infty = \frac{\alpha}{2}\|x_\alpha^\delta\|_2^2.$

The rational behind this is, that the left hand side is monotonically non-decreasing in ${\alpha}$, while the right hand side is monotonically non-increasing. Hence, there should be some ${\alpha}$ “in the middle” which make both somewhat equally large. Of course, we do neither want to “over-regularize” (which would usually “smooth too much”) nor to “under-regularize” (which would not eliminate noise). Hence, balancing seems to be a valid choice. From a practical point of view the balancing is also nice because one can use the fixed-point iteration

$\displaystyle \alpha^{n+1} = 2\sigma\frac{\|Kx_{\alpha^n}^\delta - y^\delta\|_\infty}{\|x_{\alpha_n}^\delta\|_2^2}$

which converges in a few number of iterations.

Then there was the talk by Esther Klann, but unfortunately, I was late so only heard the last half…

Last but not least we have the talk by Christiane Pöschl. If you are interested in Total-Variation-Denoising (TV denoising), then you probably have heard many times that “TV denoising preserves edges” (have a look at the Wikipedia page – it claims this twice). What Christiane showed (in a work with Vicent Caselles and M. Novaga) that this claim is not true in general but only for very special cases. In case of characteristic functions, the only functions for which the TV minimizer has sharp edges are these so-called calibrated sets, introduced by Caselles et el. Building on earlier works by Caselles and co-workers she calculated exact minimizers for TV denoising in the case that the image consists of characteristic functions of two convex sets or of a single star shaped domain, that is, for a given set $B$ she calculated the solution of

$\displaystyle \min_u\int (u - \chi_B)^2dx + \lambda \int|Du|.$

This is not is as easy as it may sound. Even for the minimizer for a single convex set one has to make some effort. She presented a nice connection of the shape of the obtained level-sets with the morphological operators of closing and opening. With the help of this link she derived a methodology to obtain the exact TV denoising minimizer for all parameters. I do not have the images right now but be assured that most of the time, the minimizers do not have sharp edges all over the place. Even for simple geometries (like two rectangles touching in a corner) strange things happen and only very few sharp edges appear. I’ll keep you posted in case the paper comes out (or appears as a preprint).

Christiane has some nice images which make this much more clear:

For two circles edges are preserved if they are far enough away from each other. If they are close, the area “in between” them is filled and, moreover, obey this fuzzy boundary. I remember myself seeing effects like this in the output of TV-solvers and thinking “well, it seems that the algorithm is either not good or not converged yet – TV should output sharp edges!”.

For a star-shaped shape (well, actually a star) the output looks like this. The corners are not only rounded but also blurred and this is true both for the “outer” corners and the “inner” corners.

So, if you have any TV-minimizing code, go ahead and check if your code actually does the right things on images like this!
Moreover, I would love to see similar results for more complicated extensions of TV like Total Generalized Variation, I treated here.

In this post I would like to comment on two papers I “stumbled upon”, one in regularization theory and one in image processing.

The first one is A regularization parameter for nonsmooth Tikhonov regularization by Kazufumi Ito, Bangti Jin and Tomoya Takeuchi. As the title announces, the paper addresses the problem of determining suitable regularization parameter for some kind of Tikhonov regularization. In particular, the authors propose a new heuristic method, i.e. method which does not use any estimate of the noise level in the data. This is an interesting and important topic for several reasons:

1. Practically, estimates on the noise level are rarely available and if they are, they are not too reliable.
2. Strictly speaking, these kind of rules are “bad” since there is the “Bakushinksii Veto”: Such rules only provide regularizations (e.g. in the sense of Engl, Hanke, Neubauer for problems which are well-posed (as a great service, the authors state and prove this statement as Theore 3.2).
3. Despite this veto, several heuristic rules produce excellent results in practice.

Note that the last second points are not in contradiction. They merely say that the notion of “regularization” may be too strict. Usually, it uses a worst case estimate which may practically never observed.

The paper contributes a new rule and state that it is applicable to a broad range of problems. They use very general Tikhonov functional:

$\displaystyle \phi(x,y^\delta) + \eta\psi(x)$

and do not assume that ${\phi}$ or ${\psi}$ are smooth. They use the value function

$\displaystyle F(\eta) = \min_x \phi(x,y^\delta) + \eta\psi(x)$

and propose the following rule for ${\eta}$: For some ${\gamma}$ choose ${\eta}$ such that

$\displaystyle \Phi_\gamma(\eta) = \frac{F(\eta)^{1+\gamma}}{\eta}$

is minimal. I do not have any intuition for this rule (however, from they proofs you see that they work, at least for “partially smooth cases”, see below). Lacking a name for this rule, one may use the term “weighted value function rule”.

They prove several nice properties of the value function (continuity, monotonicity and concavity) with loose assumptions on ${\phi}$ and ${\psi}$ (especially they do not even need existence of minimizers for ${\phi(x,y^\delta) + \eta\psi(x)}$, only that the minimum exists). However, when it comes to error estimates, they only obtain results for a specific discrepancy measure, namely a squares Hilbert space norm:

$\displaystyle \phi(x,y^\delta) = \tfrac12\|Kx-y^\delta\|^2.$

It seems that, for general convex and lower-semicontinuous penalties ${\psi}$ they build upon results from my paper with Bangti Jin on the Hanke-Raus rule and the quasi-optimality principle.

Another contribution of the paper is that it gives an algorithm that realizes the weighted value function rule (a thing which I omitted in my paper with Bangti). Their numerical experiments demonstrate that the weighted value function rule and the proposed algorithm works well for academic examples.

The next paper I want to discuss is the preprint Properties of ${L^1-\text{TGV}^2}$: The one-dimensional case by Kristian Bredies, Karl Kunisch and Tuomo Valkonen. There the authors analyze the somehow recent generalization “total generalized variation” ${\text{TGV}}$ of the omnipresent total variation. The TGV has been proposed by Bredies, Kunisch and Pock in this paper recently and Kristian and me also briefly described it in our book on mathematical image processing. Loosely speaking, the TGV shall be a generalization of the usual total variation which does not lead to “staircasing”. While one may observe from the construction of the TGV functional, that staircasing is not to be expected, the authors in this paper give precise statements. By restricting to the one dimensional case they prove several interesting properties of the TGV functional, most notably that it leads to an equivalent norm of the space ${BV}$.

Maybe I should state the definitions here: The total variation of a function ${u\in L^1(\Omega)}$ is

$\displaystyle \text{TV}(u) = \sup\{\int_\Omega u v'\ |\ v\in C^1_c(\Omega),\ \|v\|_\infty\leq 1\}$

leading the the ${BV}$-norm

$\displaystyle \|u\|_{BV} = \|u\|_{L^1} + \text{TV}(u).$

The ${\text{TGV}^2}$ seminorm for a parameter tuple ${(\alpha,\beta)}$ is

$\displaystyle \text{TGV}^2_{(\alpha,\beta)}(u) = \sup\{\int_\Omega u v''\ |\ C^2_c(\Omega), \|v\|_\infty\leq\beta,\ \|v'\|_\infty\leq\alpha\}$

and the associated norm is

$\displaystyle \|u\|_{BGV^2} = \|u\|_{L^1} + \text{TGV}^2(u).$

In Lemma 3.3 they prove that ${\|\cdot\|_{BV}}$ and ${\|\cdot\|_{BGV^2}}$ are equivalent norms on ${\text{BV}}$. In Section 4 they show that minimizers of

$\displaystyle \|u-f\|_{L^1} + \alpha\text{TV}(u)$

obey staircasing of degree 0, i.e. the solution ${u}$ is piecewise constant when it is not equal to ${f}$. For the minimizers of

$\displaystyle \|u-f\|_{L^1} + \text{TGV}^2_{(\alpha,\beta)}(u)$

one has staircasing of degree 1: ${u}$ is affine linear where it is not equal to ${f}$.

These two facts combined (norm equivalence of ${\text{BV}}$ and ${\text{BGV}^2}$ and the staircasing of degree 1) seem quite remarkable to me. They somehow show that staircasing is not related to the space ${\text{BV}}$ of functions of bounded variation but only to the specific ${\text{TV}}$ semi-norm. This is somehow satisfying since I still remember the thorough motivation of L. Rudin in his 1987 thesis for the usage of the space ${\text{BV}}$ in image processing: If there where images which are not in ${\text{BV}}$ we could not observe them. (He even draws an analogy to the question: How many angles can dance on the point of a needle?) Moreover, he further argues that ${\text{BV}}$ is not too large in the sense that its elements are still accessible to analysis (e.g. in defining a weak notion of curvature although they may be discontinuous). The ${\text{BGV}^2}$-model shows that it is possible to overcome the undesired effect of staircasing while staying in the well founded and mathematically sound and appealing framework of ${\text{BV}}$.

The paper contains several more interesting results (e.g. on preservation of continuity and “affinity” and on convergence of with respect to ${(\alpha,\beta)}$ which I do not collect here.