This term I am particularly busy as I am writing a book about variational methods in imaging. The book will be a textbook and I am writing it in parallel with a lecture I am currently teaching. (And it is planned that Kristian Bredies, the co-author, teaches the same stuff next term – then there will be another few months of editing, so it will be at least a year until publication.)

In the book we will treat variational approaches to a variety of basic imaging problems. Of course we treat denoising and deblurring, but there will also be sections about image interpolation, segmentation and optical flow. In the first part of the book, we present the variational problems and model them properly in Lebesgue and Sobolev spaces and of course in the space {BV}. Some effort goes into the analysis of the models and the first step is usually to establish existence of solutions, i.e. minimizers of the respective minimization problems. The workhorse is the direct method in the calculus of variations and we mainly use the method for convex functionals in Banach spaces.

When I started the section on optical flow I noticed that I hadn’t thought about existence of minimizers before and moreover, most papers and books do not treat this issue. Let’s recall the method of Horn and Schunck to calculate the optical flow:

For two images {u_0}, {u_1} defined on a domain {\Omega\subset{\mathbb R}^2} one seeks a flow field {V:\Omega\rightarrow{\mathbb R}^2} such that {V} describes the apparent motion that has happened between the two images. Assuming that the points keep their gray value during motion (an assumption known as the brightness constancy constraint) and linearizing this assumption, one arrives at the condition

\displaystyle  \frac{u_1-u_0}{dt} + V\cdot\nabla u_0 = 0

(where {dt} is the time between the images {u_0} and {u_1}). First, this does not give enough equations to determine {V} and secondly, points with {\nabla u_0=0} are problematic.

Horn and Schunck proposed to loosen the constraint and to enforce some smoothness of the flow field {V}: Their model was to minimize

\displaystyle  F(V) = \int_\Omega\big(\frac{u_1-u_0}{dt} + V\cdot\nabla u_0\big)^2{\mathrm d}{x} + \lambda\int_\Omega|\nabla V|^2{\mathrm d}{x}

for some parameter {\lambda} weighting smoothness of {V} (large {\lambda}) against the brightness constancy constraint (small {\lambda}). A little more generally, one could choose exponents {p} and {q} and minimize

\displaystyle  F(V) = \int_\Omega\big|\frac{u_1-u_0}{dt} + V\cdot\nabla u_0\big|^q{\mathrm d}{x} + \lambda\int_\Omega|\nabla V|^p{\mathrm d}{x}.
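For the classical case {p=q=2} the minimization is a linear problem and the well-known iterative scheme of Horn and Schunck applies. Here is a minimal numerical sketch of that case in Python (the inputs u0, u1 and the parameter lam are placeholders, I set {dt=1}, use periodic boundary conditions via np.roll and absorb the constant of the discrete Laplacian into {\lambda} – all simplifications of my choosing):

```python
import numpy as np

def horn_schunck(u0, u1, lam=0.1, iters=200):
    """Sketch of the classical Horn-Schunck iteration (p = q = 2)."""
    Iy, Ix = np.gradient(u0)            # spatial gradient of u0 (rows = y axis)
    It = u1 - u0                        # (u1 - u0)/dt with dt = 1
    Vx, Vy = np.zeros_like(u0), np.zeros_like(u0)

    def avg(V):
        # neighbor average replacing the discrete Laplacian: Delta V ~ avg(V) - V
        return (np.roll(V, 1, 0) + np.roll(V, -1, 0)
                + np.roll(V, 1, 1) + np.roll(V, -1, 1)) / 4

    for _ in range(iters):
        Vxa, Vya = avg(Vx), avg(Vy)
        # Jacobi-type step for the Euler-Lagrange equations of F
        r = (Ix * Vxa + Iy * Vya + It) / (lam + Ix**2 + Iy**2)
        Vx, Vy = Vxa - Ix * r, Vya - Iy * r
    return Vx, Vy
```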

To apply the direct method to obtain existence of minimizers of {F} one ensures

  1. properness, i.e. there is some {V} such that {F(V)} is finite,
  2. convexity of {F},
  3. lower semi-continuity of {F} and
  4. coercivity of {F}.

To check these things one has to choose an appropriate space to work in. It seems reasonable to choose {V\in L^{q}(\Omega,{\mathbb R}^2)}. Then properness of {F} is easy (consider {V=0}, of course assuming that {u_1-u_0\in L^q(\Omega)}). Convexity is also clear and for lower semi-continuity one has to work a little more, but that is possible if, e.g., {\nabla u_0} is bounded. Coercivity is not that clear and in fact {F} is not coercive in general.

Example 1 (Non-coercivity of the Horn and Schunck model) Simply consider {u_0(x,y) = ax + by} for some {a,b\in{\mathbb R}}. Then {\nabla u_0(x,y) \equiv [a\ b]^T}. Set {V^n(x,y) \equiv [-nb\ na]^T} and note that {\|V^n\|_q\rightarrow\infty} while {F(V^n)} stays bounded (in fact constant, since {V^n\cdot\nabla u_0 = 0} and {\nabla V^n = 0}).
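One can also watch this non-coercivity numerically on a discretized domain; the following quick check (with arbitrary values of {a}, {b} and an arbitrary grid, all my own choices) confirms that the functional does not react to {V^n} at all:

```python
import numpy as np

# u0(x, y) = a*x + b*y on a grid; V^n = n*(-b, a) is a constant field
a, b, n = 2.0, 3.0, 1000
x, y = np.meshgrid(np.linspace(0, 1, 100), np.linspace(0, 1, 100))
grad_u0 = np.stack([a * np.ones_like(x), b * np.ones_like(x)])
Vn = n * np.stack([-b * np.ones_like(x), a * np.ones_like(x)])

# V^n . grad(u0) = n*(-b*a + a*b) = 0 pointwise and grad(V^n) = 0,
# so F(V^n) is independent of n while ||V^n||_q grows like n
print(np.abs((Vn * grad_u0).sum(axis=0)).max())  # 0 up to rounding
print(np.abs(Vn).max())                          # grows with n
```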

I just checked the book “Mathematical Problems in Image Processing” by Gilles Aubert and Pierre Kornprobst and in Section 5.3.2 they mention that the Horn and Schunck model is not coercive. They add another term to {F} which is roughly a weighted norm of {V} which ensures coercivity. However, it turns out that coercivity of {F} is true under a mild assumption on {u_0}. The idea can be found in a pretty old paper by Christoph Schnörr which is called “Determining Optical Flow for Irregular Domains by Minimizing Quadratic Functionals of a Certain Class” (Int. J. of Comp. Vision, 6(1):25–38, 1991). His argument works for {q=2}:

Theorem 1 Let {\Omega\subset{\mathbb R}^2} be a bounded Lipschitz domain, {u_0,u_1\in L^2(\Omega)} with {\nabla u_0\in L^\infty(\Omega)} such that {\partial_x u_0} and {\partial_y u_0} are linearly independent in {L^2(\Omega)} and let {1<p<\infty}. Then the functional {F:L^2(\Omega,{\mathbb R}^2)\rightarrow {\mathbb R}\cup\{\infty\}} defined by

\displaystyle  F(V) = \int_\Omega\big(\frac{u_1-u_0}{dt} + V\cdot\nabla u_0\big)^2{\mathrm d}{x} + \lambda\int_\Omega|\nabla V|^p{\mathrm d}{x}

is coercive.

Proof: Consider {V^n} such that {\|V^n\|_2\rightarrow\infty}. We decompose the components of {V^n} into the constant mean-value parts {QV^n_x} and {QV^n_y} and the “zero-mean” parts {PV^n_x = V^n_x - QV^n_x} and {PV^n_y = V^n_y - QV^n_y}. First consider the case that {PV^n} is unbounded, i.e. there is a subsequence (also denoted by {V^n}) such that {\|PV^n\|_2\rightarrow\infty}. By the Sobolev embedding and the Poincaré inequality, we get that {\int_\Omega|\nabla V^n|^p{\mathrm d}{x}\rightarrow\infty}.

Now consider the case that {PV^n} is bounded and hence the mean values {QV^n} are unbounded. Passing to a subsequence, we may assume that {|QV^n|\rightarrow\infty}. Now we use

\displaystyle   \Big\|\frac{u_1 - u_0}{dt} + V\cdot \nabla u_0\Big\|_2 \geq \Big\|QV\cdot\nabla u_0\Big\|_2 - \Big\|\frac{u_1 - u_0}{dt} + PV\cdot \nabla u_0\Big\|_2 \ \ \ \ \ (1)

and estimate the first term from below, noticing that {QV_x} and {QV_y} are constants, by

\displaystyle  \begin{array}{rcl}  \|QV\cdot\nabla u_0\|_2^2 & = &\|QV_x\,\partial_x u_0 + QV_y\,\partial_y u_0\|_2^2\\ & = & \|QV_x\,\partial_x u_0\|_2^2 + \|QV_y\,\partial_y u_0\|_2^2 + 2\langle QV_x\,\partial_x u_0,QV_y\,\partial_y u_0\rangle\\ & \geq &|QV_x|^2\|\partial_x u_0\|_2^2 + |QV_y|^2\|\partial_y u_0\|_2^2\\ &&\qquad - \|QV_x\,\partial_xu_0\|_2\|QV_y\,\partial_yu_0\|_2\,2\frac{|\langle \partial_x u_0,\partial_y u_0\rangle|}{\|\partial_xu_0\|_2\|\partial_y u_0\|_2}\\ & \geq &(|QV_x|^2\|\partial_x u_0\|_2^2 + |QV_y|^2\|\partial_y u_0\|_2^2) \Big(1 - \frac{|\langle \partial_x u_0,\partial_y u_0\rangle|}{\|\partial_xu_0\|_2\|\partial_y u_0\|_2}\Big). \end{array}

Since {\partial_x u_0} and {\partial_y u_0} are linearly independent, it holds that {1 - \frac{|\langle \partial_x u_0,\partial_y u_0\rangle|}{\|\partial_xu_0\|_2\|\partial_y u_0\|_2}>0} and we conclude that {|QV^n|\rightarrow\infty} implies that {\|QV^n\cdot\nabla u_0\|_2^2\rightarrow\infty}. Together with (1) and boundedness of {PV^n} we obtain that {F(V^n)\rightarrow\infty} along the chosen subsequence. Since every subsequence of {V^n} has a further subsequence {V^{n_k}} such that {F(V^{n_k})\rightarrow\infty}, the same conclusion holds for the whole sequence, showing coercivity of {F}. \Box
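As a small numerical companion to the theorem (not part of Schnörr's argument): for a {u_0} with linearly independent partial derivatives the data term alone already blows up along constant fields. A sketch with the arbitrary choice {u_0(x,y)=xy}, for which {\partial_x u_0 = y} and {\partial_y u_0 = x} are linearly independent in {L^2}:

```python
import numpy as np

# u0(x, y) = x*y; along any constant field V^n = n*(c1, c2) the
# data term of F grows like n^2, in accordance with the estimate above
x, y = np.meshgrid(np.linspace(0, 1, 200), np.linspace(0, 1, 200))
ux, uy = y, x                 # partial derivatives of u0
c1, c2 = -1.0, 1.0            # any fixed direction works
for n in [1, 10, 100]:
    data = ((n * c1 * ux + n * c2 * uy) ** 2).mean()  # ~ ||QV . grad(u0)||_2^2
    print(n, data)
```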

Basically the same argument works for {TV} optical flow, i.e. for coercivity of

\displaystyle  F(V) = \int_\Omega\big(\frac{u_1-u_0}{dt} + V\cdot\nabla u_0\big)^2{\mathrm d}{x} + \lambda TV(V).

However, I do not know yet what happens for {q\neq 2} and whether the result on coercivity is “sharp” in the sense that linear independence of {\partial_x u_0} and {\partial_y u_0} is necessary. Also, I don’t know yet what is true in dimensions higher than {2}.

Here I continue my previous post on methods to compare shapes. In that post we started with different metrics between two probability measures {\mu_1} and {\mu_2} defined on the same set {X}, namely the Prokhorov metric and the Wasserstein metric. Then we also had a look at the Hausdorff metric, which measures the distance between two compact subsets {A} and {B} of a metric space {(X,d)}, and finally considered the Gromov-Hausdorff metric between two metric spaces {(X,d_X)} and {(Y,d_Y)}. Our tools have been

  • set couplings of {X} and {Y}, i.e. subsets {R\subset X\times Y} such that for all {x\in X} there is {y\in Y} such that {(x,y)\in R} and for all {y\in Y} there is {x\in X} such that {(x,y)\in R},
  • metric couplings of {d_X} and {d_Y}, i.e. metrics {d} on the disjoint union of {X} and {Y} such that {d(x,x') = d_X(x,x')} if {x} and {x'} are in {X} and {d(y,y') = d_Y(y,y')} if {y} and {y'} are in {Y}. In fact, one could also work with semi-metrics {d}, i.e. they do not need to be positive definite, and
  • measure couplings of {\mu_X} and {\mu_Y}, i.e. measures {\nu} on {X\times Y} such that {\nu(A\times Y) = \mu_X(A)} and {\nu(X\times B) = \mu_Y(B)} for all {\mu_X}- and {\mu_Y}-measurable sets {A} and {B}, respectively.

Now we make the next (and final) step and compare metric spaces {(X,d_X)} and {(Y,d_Y)} which are both equipped with a measure. These objects are known as metric measure spaces or mm-spaces and are formally defined as follows:

Definition 1 (mm-space) A metric measure space is a triple {(X,d_X,\mu_X)} consisting of a compact metric space {(X,d_X)} and a Borel probability measure {\mu_X} on {X}.

Note that sometimes the definition of an mm-space additionally requires that {\mu_X} has full support (i.e., its support is all of {X}), but it seems that not everybody does it like that.

1. Comparing mm-spaces: Gromov-Wasserstein

Our question is: How do we compare two mm-spaces {(X,d_X,\mu_X)} and {(Y,d_Y,\mu_Y)}? The plan is simply to augment the previous versions of the Gromov-Hausdorff distance defined in my previous post here and here by something which takes the measures on the respective metric spaces into account. We recall both formulations of the Gromov-Hausdorff distance: The first is

\displaystyle   d_{GH}(X,Y) = \inf_{R,d} \sup_{(x,y)\in R} d(x,y) \ \ \ \ \ (1)

where the infimum is taken over all set couplings {R} of {X} and {Y} and metric couplings {d} of {d_X} and {d_Y}, and the second is

\displaystyle   d_{GH}(X,Y) = \tfrac12\inf_R \sup_{\overset{\overset{x_{1/2}\in X}{y_{1/2}\in Y}}{(x_i,y_i)\in R}}\big| d_X(x_1,x_2) - d_Y(y_1,y_2)\big| \ \ \ \ \ (2)

where the infimum is also taken over all set couplings {R} of {X} and {Y}.

Basically, we have already seen what we should do to build a metric between mm-spaces. The idea is: if there were some “natural measure” {\mu_R} for any set coupling {R} of {X} and {Y}, then we would simply define the distance between {(X,d_X,\mu_X)} and {(Y,d_Y,\mu_Y)} as

\displaystyle  d_{GH}(X,Y) = \inf_{R,d} \Big(\int d(x,y)^pd\mu_R\Big)^{1/p}

(as a generalization of (1)) and

\displaystyle  d_{GH}(X,Y) = \tfrac12\inf_R \Big(\int_{R\times R}\big| d_X(x_1,x_2) - d_Y(y_1,y_2)\big|^p d\mu_R(x_1,y_1)d\mu_R(x_2,y_2)\Big)^{1/p}

(as a generalization of (2)). In both cases we can have {1\leq p<\infty}. Note that the obvious modification for {p=\infty} leads to something very similar to the Gromov-Hausdorff metrics.

But indeed there are “natural measures” on the set couplings of {X} and {Y}: At least for the full coupling {R= X\times Y} there are the measure couplings {\nu} of {\mu_X} and {\mu_Y}! (One does not need to consider smaller set couplings {R\subsetneq X\times Y} since this can be taken into account by the measure couplings; they do not need to have full support anyway.)

Applying this idea to the version (1) of the Gromov-Hausdorff metric we arrive at the following expression, which can be called Gromov-Wasserstein metric,

\displaystyle   d_{GW}^1(X,Y) = \inf_{\nu,d} \Big(\int_{X\times Y} d(x,y)^pd\nu(x,y)\Big)^{1/p} \ \ \ \ \ (3)

where the infimum is taken over all measure couplings {\nu} of {\mu_X} and {\mu_Y} and all metric couplings {d} of {d_X} and {d_Y}.

Starting from the version (2) of the Gromov-Hausdorff metric we arrive at another formulation:

\displaystyle   d_{GW}^2(X,Y) = \tfrac12\inf_\nu \Big(\int_{X\times Y}\int_{X\times Y}\big| d_X(x_1,x_2) - d_Y(y_1,y_2)\big|^p d\nu(x_1,y_1)d\nu(x_2,y_2)\Big)^{1/p} \ \ \ \ \ (4)

where the infimum is taken over all measure couplings {\nu} of {\mu_X} and {\mu_Y}.
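For finite mm-spaces (given by distance matrices and weight vectors) one can at least evaluate the expression under the infimum in (4) for a fixed measure coupling; the actual metric is the infimum over all couplings, which is a non-convex problem that I will not touch here. A sketch (the function name and the choice of the independent coupling are mine):

```python
import numpy as np

def gw2_objective(dX, dY, nu, p=2):
    """Evaluate the integral in (4) for a fixed measure coupling nu.

    dX: (m,m) distance matrix of X, dY: (n,n) distance matrix of Y,
    nu: (m,n) coupling with row sums mu_X and column sums mu_Y."""
    # distortion |d_X(x1,x2) - d_Y(y1,y2)|^p for all pairs of pairs
    G = np.abs(dX[:, None, :, None] - dY[None, :, None, :]) ** p
    return 0.5 * np.einsum('ij,ijkl,kl->', nu, G, nu) ** (1 / p)

# the independent coupling nu = np.outer(mu_X, mu_Y) is always admissible
# and hence gives an upper bound for d_GW^2
```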

While both versions of the Gromov-Hausdorff metric for compact metric spaces were equal, the same is not true for their generalizations to mm-spaces: In his paper Memoli proves that {d_{GW}^2\leq d_{GW}^1} and gives an example (right after Remark 5.14) where strict inequality holds.

2. Comparing mm-spaces: Gromov-Prokhorov

Instead of starting from the Gromov-Hausdorff distance between metric spaces and augmenting its definition with something that takes the measures into account, we could also start from the Prokhorov metric between probability measures and augment the definition with something that takes the metric into account. In fact there are also two possibilities to do so: In the appendix of his paper, Memoli quotes this version (from this paper by Greven, Pfaffelhuber and Winter) of a metric between mm-spaces which we call the Gromov-Prokhorov metric

\displaystyle   d_{GP}^1(X,Y) = \inf_{\nu,d}\Big\{\epsilon>0\ :\ \nu\{(x,y)\ :\ d(x,y)\geq\epsilon\}\leq\epsilon\Big\} \ \ \ \ \ (5)

where the infimum is taken over all measure couplings {\nu} of {\mu_X} and {\mu_Y} and all metric couplings {d} of {d_X} and {d_Y}.

The next version (also from the same paper by Greven, Pfaffelhuber and Winter where it was called Eurandom metric) is

\displaystyle   d_{GP}^2(X,Y) = \inf_\nu\Big\{\epsilon>0\ :\ \nu\otimes\nu\{(x_1,y_1,x_2,y_2)\ :\ |d_X(x_1,x_2) - d_Y(y_1,y_2)|\geq\epsilon\}\leq\epsilon\Big\} \ \ \ \ \ (6)

where the infimum is taken over all measure couplings {\nu} only.
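For finite mm-spaces and a fixed coupling the bracket in (6) can be evaluated: the feasibility of a given {\epsilon} is monotone in {\epsilon} (if {\epsilon} is feasible, so is every larger value), so a simple bisection does the job. A sketch of mine, again only giving an upper bound for the metric since it treats a single coupling:

```python
import numpy as np

def gp2_objective(dX, dY, nu, iters=60):
    """Smallest eps with (nu x nu){ |dX - dY| >= eps } <= eps (fixed nu)."""
    dist = np.abs(dX[:, None, :, None] - dY[None, :, None, :])  # distortions
    w = nu[:, :, None, None] * nu[None, None, :, :]             # nu x nu weights
    lo, hi = 0.0, max(1.0, float(dist.max()))
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if w[dist >= mid].sum() <= mid:   # feasibility is monotone in eps
            hi = mid
        else:
            lo = mid
    return hi
```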

3. A very simple example

The calculation of the proposed metrics by hand can be quite cumbersome. Let’s look at the simplest example.

We consider metric spaces {X,Y\subset{\mathbb R}^d} (with the Euclidean metric) equipped with the measures {\mu_X=\delta_{x_0}} and {\mu_Y = \delta_{y_0}} for some points {x_0\in X} and {y_0\in Y}. In this case there is only one measure coupling of {\mu_X} and {\mu_Y}, namely

\displaystyle  \nu = \delta_{(x_0,y_0)}.

Now it is easy to calculate the variant (4) of the Gromov-Wasserstein metric:

\displaystyle  \begin{array}{rcl}  d_{GW}^2(X,Y) &=& \tfrac12\Big(\int_{X\times Y}\int_{X\times Y}\big| |x_1-x_2| - |y_1 -y_2|\big|^p d\delta_{(x_0,y_0)}(x_1,y_1)d\delta_{(x_0,y_0)}(x_2,y_2)\Big)^{1/p} \\ &=& \tfrac12\Big(\int_{X\times Y}\big| |x_0-x_2| - |y_0 -y_2|\big|^pd\delta_{(x_0,y_0)}(x_2,y_2)\Big)^{1/p} \\ &=& 0. \end{array}

Let’s have a look at the variant (3): Since there is only one measure coupling, the metric is

\displaystyle  \begin{array}{rcl}  d_{GW}^1(X,Y) & = &\inf_d\Big(\int_{X\times Y} d(x,y)^pd\delta_{(x_0,y_0)}(x,y)\Big)^{1/p} \\ & = &\inf_d d(x_0,y_0). \end{array}

As we have learned in Example 4 in the previous post, we can find a metric coupling of {X} and {Y} that brings the points {x_0} in {X} and {y_0} in {Y} arbitrarily close together (by embedding both {X} and {Y} into some {{\mathbb R}^n} such that these points are only {\epsilon}-far away from each other). Hence, we see that we have

\displaystyle  d_{GW}^1(X,Y) = 0

similarly to {d_{GW}^2}.

Now let’s look at the Gromov-Prokhorov metric from (5). Again we only have one measure coupling and we get

\displaystyle  d_{GP}^1(X,Y) = \inf\Big\{\epsilon>0\ :\ \exists d\ \text{s.t.}\ \delta_{(x_0,y_0)}\{(x,y)\ :\ d(x,y)\geq\epsilon\}\leq\epsilon\Big\}.

Since the measure coupling is a Dirac, we can evaluate

\displaystyle  \delta_{(x_0,y_0)}\{(x,y)\ :\ d(x,y)\geq\epsilon\} = \begin{cases} 1 & d(x_0,y_0)\geq \epsilon\\ 0 & d(x_0,y_0)<\epsilon. \end{cases}

As observed previously, there are metric couplings which bring the points {x_0} and {y_0} arbitrarily close together, and hence for any {\epsilon>0} there is a metric coupling such that {\delta_{(x_0,y_0)}\{(x,y)\ :\ d(x,y)\geq\epsilon\} = 0} which shows that

\displaystyle  d_{GP}^1(X,Y) = 0.

Finally, consider the second variant of the Gromov-Prokhorov metric from (6). Since we only have the one measure coupling {\nu = \delta_{(x_0,y_0)}} we have

\displaystyle  d_{GP}^2(X,Y) = \inf\Big\{\epsilon>0\ :\ \delta_{(x_0,y_0)}\otimes\delta_{(x_0,y_0)}\{(x_1,y_1,x_2,y_2)\ :\ \big| |x_1-x_2| - |y_1-y_2|\big|\geq\epsilon\}<\epsilon\Big\}.

Evaluating the tensored Dirac delta is easy: We have that {\delta_{(x_0,y_0)}\otimes\delta_{(x_0,y_0)}\{(x_1,y_1,x_2,y_2)\ :\ \big| |x_1-x_2| - |y_1-y_2|\big|\geq\epsilon\}} is either one or zero and it is one if and only if the point {(x_0,y_0,x_0,y_0)} is in the set that is measured. However, we have that {\big| |x_0-x_0| - |y_0-y_0|\big|} is, of course, always zero (and never larger than any {\epsilon >0}). Hence, the measure is always zero and hence, this version of the Gromov-Prokhorov distance also gives

\displaystyle  d_{GP}^2(X,Y) = 0.

Note that all four metrics cannot see that the two Diracs are at different points. The reason seems to be that one can “deform the metric space outside of the support of the measures” arbitrarily, in some sense.

It seems that the computations of {d_{GW}^2} and {d_{GP}^2} were easier, since one only needs to know all measure couplings and no metric couplings are involved.

4. Next difficult example: Compare some mm-space to a point

Ok, we have seen that mm-spaces which carry their measure in just a single point all look alike in both versions of the Gromov-Wasserstein metric and also in both versions of the Gromov-Prokhorov metric. Let’s look at a slightly more difficult example: We consider some mm-space {(X,d_X,\mu)} and want to calculate its distance to a single point, i.e. the mm-space {Y=\{0\}} with the only possible metric and measure. This should somehow measure the “size” or “spread” of the mm-space {X}.

First, we need to know all measure couplings and all metric couplings between these spaces. The measure couplings are very easy: There is just one, namely

\displaystyle  \nu = \mu\otimes \delta

(i.e. all subsets of {X\times \{0\}} are treated as if they were subsets of {X}). Concerning metric couplings, there are a few more. We allow semi-metrics {d} on the disjoint union {X\sqcup\{0\}}: Since {d} should respect the metric {d_X} on {X} we see that all metric couplings are parametrized by the points {x_0} in {X} by identifying {0} (the element in {Y}) with this point {x_0}, i.e. all metric couplings are of the form {d_{x_0}} defined by

\displaystyle  d_{x_0}(x,y) = d_X(x,y)\ (\text{for}\ x,y\in X),\qquad d_{x_0}(x,0) = d_X(x,x_0).

(This only gives semi-metrics since we have {d_{x_0}(x_0,0)=0} although, formally, {x_0\neq 0}.)

Let’s calculate the first Gromov-Wasserstein metric: There is only one measure coupling and we use the parametrization of the metric couplings to deduce

\displaystyle  \begin{array}{rcl}  d_{GW}^1(X,\{0\}) &=& \inf_d\Big(\int_{X\times\{0\}} d(x,0)^pd(\mu\otimes \delta)(x,y)\Big)^{1/p}\\ &=& \inf_{x_0\in X}\Big(\int_{X} d_X(x,x_0)^pd\mu\Big)^{1/p}. \end{array}

This quantity seems to be known as the “minimal {p}-th central moment” of {(X,d_X,\mu)}.

The second variant of the Gromov-Wasserstein metric is (remember, there is only one measure coupling)

\displaystyle  \begin{array}{rcl}  d_{GW}^2(X,\{0\}) &=& \tfrac12\Big(\int_{X\times\{0\}}\int_{X\times\{0\}} |d_X(x_1,x_2) - d_Y(0,0)|^p d(\mu\otimes\delta)(x_1,y_1) d(\mu\otimes\delta)(x_2,y_2) \Big)^{1/p}\\ &=& \tfrac12\Big(\int_X\int_X d_X(x_1,x_2)^p d\mu\, d\mu\Big)^{1/p}. \end{array}

This quantity (without the factor {1/2}) is called the “{p}-diameter” of {(X,d_X)}.
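For a finite mm-space both quantities are straightforward to evaluate; here is a small sketch (I restrict the infimum over {x_0} to the support points, which is an approximation of my own making):

```python
import numpy as np

def minimal_central_moment(dX, mu, p=2):
    # inf over x0 of ( int d_X(x, x0)^p dmu(x) )^(1/p), x0 on the grid
    return float(((dX ** p) @ mu).min() ** (1 / p))

def p_diameter(dX, mu, p=2):
    # ( double int d_X(x1, x2)^p dmu dmu )^(1/p); half of it is d_GW^2(X, {0})
    return float((mu @ (dX ** p) @ mu) ** (1 / p))
```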

Let’s turn to the Gromov-Prokhorov metrics. The first one is (remember, that the metric couplings {d} are parametrized by the points in {X})

\displaystyle  \begin{array}{rcl}  d_{GP}^1(X,\{0\}) &=& \inf_{d}\Big\{\epsilon>0\ :\ (\mu\otimes\delta)\{(x,0)\ :\ d(x,0)\geq\epsilon\}\leq\epsilon\Big\}\\ &=& \inf_{x_0\in X}\Big\{\epsilon>0\ :\ \mu\{x\ :\ d_X(x,x_0)\geq\epsilon\}\leq\epsilon\Big\}. \end{array}

If this looks familiar, then you may have encountered the Ky-Fan metric already. The Ky-Fan metric is a metric between random variables {\xi_1} and {\xi_2} defined on the same probability space {(X,\mu)} with values in a metric space with metric {d}. It reads as

\displaystyle  d_{KF}(\xi_1,\xi_2) = \inf\Big\{\epsilon>0\ :\ \mu\{\omega\ :\ d(\xi_1(\omega),\xi_2(\omega))\geq\epsilon\}\leq\epsilon\Big\}.

Hence, the first version of the Gromov-Prokhorov metric is

\displaystyle  d_{GP}^1(X,\{0\}) = \inf_{x_0\in X}\ d_{KF}(\mathrm{id},x_0),

i.e., the minimal Ky-Fan metric between the identity mapping and the constant mappings. (In other words, it measures how far the identity is from the constant mappings in the sense of Ky Fan.)

The second variant of the Gromov-Prokhorov metric is (remember, the only measure coupling is {\nu = \mu\otimes \delta})

\displaystyle  \begin{array}{rcl}  d_{GP}^2(X,\{0\}) &=& \inf \Big\{\epsilon>0\ :\ \nu\otimes\nu\{(x_1,0,x_2,0)\ :\ |d_X(x_1,x_2) -0|\geq\epsilon\}\leq\epsilon\Big\}\\ &=& \inf\Big\{\epsilon>0\ :\ \mu\otimes\mu\{(x_1,x_2)\ :\ d_X(x_1,x_2)\geq\epsilon\}\leq\epsilon\Big\}. \end{array}

I do not have a neat name or a good intuition for this metric yet (although it also looks like it measures “size” or “non-localization” of {X} in some sense). If you have one, let me know!

With this post I delve into a topic which is somehow new to me, although I planned to look deeper into this for quite some time already. I stumbled upon the paper Gromov-Wasserstein distances and the metric approach to object matching by Facundo Mémoli which was a pleasure to read and motivated this post.

1. Comparing measures with norms and metrics

There are different notions in mathematics to compare two objects, think of the size of real numbers, the cardinality of sets or the length of the difference of two vectors. Here we will deal not only with comparison of objects but with “measures of similarity”. Two fundamental notions for this are norms in vector spaces and metrics. The norm is the stronger concept in that it uses more structure than a metric and also, every norm induces a metric but not the other way round. There are occasions in which both a norm and a metric are available but lead to different concepts of similarity. One of these instances occurs in sparse recovery, especially in the continuous formulation, e.g. as described in a previous post. Consider the unit interval {I = [0,1]} and two Radon measures {\mu_1} and {\mu_2} on {I} ({I} could also be an arbitrary metric space). On the space of Radon measures {\mathfrak{M}(I)} there is the variation norm

\displaystyle \|\mu\|_{\mathfrak{M}}= \sup_\Pi\sum_{A\in\Pi}|\mu(A)|

where the supremum is taken over all partitions {\Pi} of {I} into a finite number of measurable sets. Moreover, there are different metrics one can put on the space of Radon measures, e.g. the Prokhorov metric, which is defined for two probability measures (i.e. non-negative ones with unit total mass) by

\displaystyle  \begin{array}{rcl}  d_P(\mu_1,\mu_2) & = & \inf\{\epsilon>0\ :\ \mu_1(A)\leq \mu_2(A^\epsilon) + \epsilon,\\ & & \qquad \mu_2(A)\leq \mu_1(A^\epsilon) + \epsilon\ \text{for all measurable}\ A\} \end{array}

where {A^\epsilon} denotes the {\epsilon}-neighborhood of {A}. Another family of metrics are the Wasserstein metrics: For {p\geq 1} define

\displaystyle d_{W,p}(\mu_1,\mu_2) = \Big(\inf_\nu\int_{I\times I} |x-y|^p d\nu(x,y)\Big)^{1/p} \ \ \ \ \ (1)

where the infimum is taken over all measure couplings of {\mu_1} and {\mu_2}, that is, all measures {\nu} on {I\times I} such that for measurable {A} it holds that

\displaystyle \nu(A\times I) = \mu_1(A)\ \text{and}\ \nu(I\times A) = \mu_2(A).
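For discrete measures the infimum in (1) is a finite-dimensional linear program over the couplings, which gives one way to compute {d_{W,p}} in practice. A sketch using scipy's linprog (the example data are arbitrary choices of mine):

```python
import numpy as np
from scipy.optimize import linprog

# two discrete measures on I = [0,1]: points xs, ys with weights w1, w2
xs, w1 = np.array([0.1, 0.4, 0.9]), np.array([0.5, 0.3, 0.2])
ys, w2 = np.array([0.2, 0.8]), np.array([0.6, 0.4])
p = 2

C = np.abs(xs[:, None] - ys[None, :]) ** p   # cost |x_i - y_j|^p
m, n = C.shape
A_eq = np.zeros((m + n, m * n))              # marginal constraints on nu
for i in range(m):
    A_eq[i, i * n:(i + 1) * n] = 1           # row sums equal w1
for j in range(n):
    A_eq[m + j, j::n] = 1                    # column sums equal w2
res = linprog(C.ravel(), A_eq=A_eq, b_eq=np.concatenate([w1, w2]),
              bounds=(0, None))
print(res.fun ** (1 / p))                    # the Wasserstein-p distance
```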

Example 1 We compare two Dirac measures {\mu_1 = \delta_{x_1}} and {\mu_2 = \delta_{x_2}} located at distinct points {x_1\neq x_2} in {I} as seen here:

058_two_diracs

The variation norm measures their distance as

\displaystyle \|\mu_1-\mu_2\|_{\mathfrak{M}} = \sup_\Pi\sum_{A\in\Pi}|\delta_{x_1}(A) - \delta_{x_2}(A)| = 2

(choose {\Pi} such that it contains {A_1} and {A_2} small enough that {x_1\in A_1}, {x_2\in A_2} but {x_1\notin A_2} and {x_2\notin A_1}). To calculate the Prokhorov metric note that you only need to consider {A}‘s which contain only one of the points {x_{1/2}} and hence, it evaluates to

\displaystyle d_P(\mu_1,\mu_2) = |x_1-x_2|.

For the Wasserstein metric we observe that there is only one possible measure coupling of {\delta_{x_1}} and {\delta_{x_2}}, namely the measure {\nu = \delta_{(x_1,x_2)}}. Hence, we have

\displaystyle d_{W,p}(\mu_1,\mu_2) = \Big(\int_{I\times I}|x-y|^pd\delta_{(x_1,x_2)}(x,y)\Big)^{1/p} = |x_1-x_2|.

The variation norm distinguishes the two Diracs but is not able to grasp the distance of their supports. On the other hand, both metrics return the geometric distance of the supports in the underlying space {I} as the distance of the Diracs. Put in pictures: The variation norm of the difference measures the size of this object

058_two_diracs_difference

while both metrics capture the distance of the measures like here

Two diracs

It should not go unnoticed that convergence in both the Prokhorov metric and the Wasserstein metrics is exactly the weak convergence of probability measures.

The above example provides a motivation to study metric structures on spaces, even if they are also equipped with a norm. Another reason to shift attention from normed spaces to metric spaces is the fact that there has emerged a body of work to build a theory of analysis in metric spaces (see, e.g. this answer on mathoverflow or the book Gradient Flows: In Metric Spaces And In The Space Of Probability Measures by Ambrosio, Gigli and Savaré (which puts special emphasis on the space of probability measures)). Yet another motivation for the study of metrics in this way is the problem of comparing shapes (a notion not yet precisely defined): Which of these shapes look most alike?

058_shapes

(Note that shapes need not be two-dimensional figures; you may also think of more complex objects like surfaces in three dimensions or Riemannian manifolds.)

One may also ask the question of how to compare different images defined on different shapes, i.e. different “distributions of colour” on two different shapes.

2. Comparing shapes: Metric spaces

Up to now we tried to compare different measures defined on the same set. At least to me it seems that both the Prokhorov and the Wasserstein metrics are well suited to measure the similarity of measures and, in fact, they do so in a somewhat finer way than the usual norm does.

Let’s try to go one step further and ask ourselves how we could compare two measures {\mu_1} and {\mu_2} which are defined on two different sets. While thinking about an answer one needs to balance several things:

  • The setup should be general enough to allow for the comparison of a wide range of objects.
  • It should include enough structure to allow meaningful statements.
  • It should lead to a measure which is easy enough to handle both analytically and computationally.

For the first and second bullet: We are going to work with measures not on arbitrary sets but on metric spaces. This allows us to measure distances between points in the sets and, as you probably know, does not pose a severe restriction. Although metric spaces are much more specific than topological spaces, we still aim at quantitative measures which are not provided by topologies. With respect to the last bullet: Note that both the Prokhorov and the Wasserstein metric are defined as infima over fairly large and not too well structured sets (for the Prokhorov metric one needs to consider all measurable sets and their {\epsilon}-neighborhoods; for the Wasserstein metric, one needs to consider all measure couplings). While they can be handled quite well theoretically, their computational realization can be cumbersome.

In a similar spirit to Facundo Memoli’s paper we work our way up from comparing subsets of a metric space to comparing two different metric spaces with two measures defined on them.

2.1. Comparing compact subsets of a metric space: Hausdorff

Let {(X,d)} be a compact metric space. Almost a hundred years ago Hausdorff introduced a metric on the family of all non-empty compact subsets of a metric space as follows: The Hausdorff metric of two compact subsets {A} and {B} of {X} is defined as

\displaystyle d_H(A,B) = \inf\{\epsilon>0 \ :\ A\subset B_\epsilon,\ B \subset A_\epsilon\}

(again, using the notion of {\epsilon}-neighborhood). This definition seems to be much in the spirit of the Prokhorov metric.
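For finite point sets the infimum over {\epsilon} is attained and the Hausdorff metric reduces to a min-max expression which is easy to evaluate; a small sketch:

```python
import numpy as np

def hausdorff(A, B):
    """Hausdorff distance of two finite point sets in R^d (rows = points)."""
    D = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)  # pairwise distances
    # max( sup_a inf_b d(a,b), sup_b inf_a d(a,b) )
    return max(D.min(axis=1).max(), D.min(axis=0).max())
```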

Proposition 2.1 in Facundo Memoli’s paper shows that the Hausdorff metric has an equivalent description as

\displaystyle d_H(A,B) = \inf_R \sup_{(a,b) \in R} d(a,b)

where the infimum is taken over all correspondences {R} of {A} and {B}, i.e., all subsets {R\subset A\times B} such that for all {a\in A} there is {b\in B} such that {(a,b) \in R} and for all {b\in B} there is {a\in A} such that {(a,b)\in R}. One may also say set coupling of {A} and {B} instead of correspondence.

Example 2 There is always the full coupling {R = A\times B}. Three different set couplings of two subsets {A} and {B} of the unit interval are shown here:

058_set_coupling

the “full one” {A\times B} in green and two “slim” ones in red and orange. Other “slim” couplings can be obtained from surjective mappings {f:A\rightarrow B} by {R = \{(a,f(a))\ :\ a\in A\}} (or with the roles of {A} and {B} swapped). If you couple a set {A} with itself, there is also the trivial coupling

\displaystyle R = \{(a,a)\ : \ a\in A\}

which is just the diagonal of {A\times A}.

Note that the alternative definition of the Hausdorff metric is more in the spirit of the Wasserstein metric: It does not use enlarged objects (by {\epsilon}-neighborhoods) but couplings.

The Hausdorff metric is indeed a metric on the set {\mathfrak{C}(X)} of all non-empty compact subsets of a metric space {X} and if {X} itself is compact it even holds that {(\mathfrak{C}(X),d_H)} is a compact metric space (a result, known as Blaschke Selection Theorem).

One may say that we went up an abstraction ladder one step by moving from {(X,d)} to {(\mathfrak{C}(X),d_H)}.

2.2. Comparing compact metric spaces: Gromov-Hausdorff

In the previous subsection we worked within one metric space {X}. In the book “Metric Structures for Riemannian and Non-Riemannian Spaces” Misha Gromov introduced a notion to compare two different metric spaces. For compact metric spaces {X} and {Y} the Gromov-Hausdorff metric is defined as

\displaystyle d_{GH}(X,Y) = \inf_{Z,f,g} d_H(f(X),g(Y)) \ \ \ \ \ (2)

where the infimum is taken over

  • all metric spaces {Z} and
  • all isometric embeddings {f} and {g} which embed {X} and {Y} into {Z} respectively.

In words: To compute the Gromov-Hausdorff metric, you try to embed both {X} and {Y} isometrically into a common larger space such that they are as close as possible according to the Hausdorff metric in that space.

Strictly speaking, the above definition is not well stated as one cannot form an infimum over all metric spaces since this collection does not form a set according to the rules of set theory. More precisely, one should say that {d_{GH}(X,Y)} is the infimum over all {r>0} such that there exists a metric space {Z} and isometric embeddings {f} and {g} of {X} and {Y}, respectively, such that {d_H(f(X),g(Y))<r}.

As the Hausdorff metric could be reformulated with set couplings there is a reformulation of the Gromov-Hausdorff metric based on metric couplings: A metric coupling of two metric spaces {(X,d_X)} and {(Y,d_Y)} is a metric {d} on the disjoint union {X\sqcup Y} of {X} and {Y} such that for all {x,x'\in X} and {y,y'\in Y} it holds that {d(x,x') = d_X(x,x')} and {d(y,y') = d_Y(y,y')}.

Example 3 We couple a metric space {(X,d)} with itself. We denote by {(X',d')} an identical copy of {(X,d)} and look for a metric {D} on the disjoint union {X\sqcup X'} that respects the metrics {d} and {d'} in the way a metric coupling has to.

To distinguish elements from {X} and {X'} we put a {'} on all quantities from {X'}. Moreover, for {x\in X} we denote by {x'} its identical copy in {X'} (and similarly, for {x'\in X'}, {x} is its identical twin). Then, for any {\epsilon>0} we can define {D_\epsilon(x,x') = D_\epsilon(x',x) = \epsilon} (i.e. the distance between any two identical twins is {\epsilon}). By the triangle inequality we get for {x\in X} and {y'\in X'} that {D_\epsilon(x,y')} should fulfill

\displaystyle D_\epsilon(x',y') - D_\epsilon(x',x) \leq D_\epsilon(x,y') \leq D_\epsilon(x,y) + D_\epsilon(y,y')

and hence

\displaystyle d(x,y) - \epsilon \leq D_\epsilon(x,y') \leq d(x,y) + \epsilon.

Indeed we can choose {D_\epsilon(x,y') = d(x,y) + \epsilon} for {x\in X} and {y'\in X'} (which is consistent with {D_\epsilon(x,x')=\epsilon}), leading to one specific metric coupling for any {\epsilon}. These couplings distinguish identical twins and behave as a metric on the whole disjoint union. In the limiting case {\epsilon\rightarrow 0} we do not obtain a metric but a semi-metric or pseudo-metric, which is just the same as a metric but without the assumption that {d(x,y) = 0} implies {x=y}.

Example 4 The above example of a metric coupling of a metric space with itself was somehow “reproducing” the given metric as accurately as possible. There are also other couplings that assign very different distances {D(x,y')} to points, and there is also a way to visualize metric couplings: When building the disjoint union of two metric spaces {X} and {Y}, you can imagine this as isometrically embedding both in a larger metric space {Z} in a non-overlapping way and obtaining the metric coupling {D} as the restriction of the metric on {Z} to {X\sqcup Y}. For {X=Y=[0,1]} you can embed both into {Z = {\mathbb R}^2}. A metric coupling which is similar (but not equal) to the coupling of the previous example is obtained by putting {X} and {Y} side by side at distance {\epsilon} as here (one space in green, the other in blue).

058_metric_coupling_embedding1

A quite different coupling is obtained by putting {X} and {Y} side by side, but in a reversed way as here:

058_metric_coupling_embedding2

You may even embed them in a weirder way, as here:

058_metric_coupling_embedding3

but remember that the embeddings have to be isometric; hence, distortions like here are not allowed.

058_metric_coupling_embedding4

This example illustrates that the idea of metric couplings is in a similar spirit as that of “embedding two spaces in a common larger one”.

With the notion of metric coupling, the Gromov-Hausdorff metric can be written as

\displaystyle d_{GH}(X,Y) = \inf_{R,d} \sup_{(x,y)\in R} d(x,y) \ \ \ \ \ (3)

where the infimum is taken over all set couplings {R} of {X} and {Y} and all metric couplings {d} of {(X,d_X)} and {(Y,d_Y)}.

In words: To compute the Gromov-Hausdorff metric this way, you look for a set coupling of the base sets {X} and {Y} and a metric coupling {d} of the metrics {d_X} and {d_Y} such that the maximal distance of two coupled points {x} and {y} is as small as possible. While this may look more complicated than the original definition from (2), note that the original definition uses all metric spaces {Z} in which you can embed {X} and {Y} isometrically, which seems barely possible to realize. Granted, the new definition also considers a lot of quantities.

Also note that this definition is in the spirit of the Wasserstein metric from (1): If there were natural measures {\mu_R} on the set couplings {R} we could write

\displaystyle  d_{GH}(X,Y) = \inf_{R,d} \Big(\int d(x,y)^pd\mu_R\Big)^{1/p}

and in the limit {p\rightarrow\infty} we would recover definition (3).

Example 5 The Gromov-Hausdorff distance of a metric space {(X,d_X)} to itself is easily seen to be zero: Consider the trivial coupling {R = \{(x,x)\ :\ x\in X\}} from Example 2 and the family {D_\epsilon} of metric couplings from Example 3. Then we have {d_{GH}(X,X) \leq \epsilon} for any {\epsilon >0}, showing {d_{GH}(X,X) = 0}. Let’s take one of the next most complicated examples and compute the distance of {X = [0,1]} and {Y=[0,2]}, both equipped with the Euclidean metric. We couple the sets {X} and {Y} by {R = \{(x,2x)\ : \ x\in X\}} and the respective metrics by embedding {X} and {Y} into {{\mathbb R}^2} as follows: Put {Y} on the line from {(0,0)} to {(2,0)} and {X} on the line from {(\tfrac12,\epsilon)} to {(1\tfrac12,\epsilon)}:

058_gromov_hausdorff_12

This shows that {d_{GH}(X,Y) \leq \tfrac12} and actually, we have equality here.

There is another reformulation of the Gromov-Hausdorff metric, the equivalence of which is shown in Theorem 7.3.25 in the book “A Course in Metric Geometry” by Dmitri Burago, Yuri Burago and Sergei Ivanov:

\displaystyle d_{GH}(X,Y) = \tfrac12\inf_R \sup_{\overset{\overset{x_{1/2}\in X}{y_{1/2}\in Y}}{(x_i,y_i)\in R}}\big| d_X(x_1,x_2) - d_Y(y_1,y_2)\big| \ \ \ \ \ (4)

where the infimum is taken over all set couplings {R} of {X} and {Y}.

In words: Look for a set coupling such that any two coupled pairs {(x_1,y_1)} and {(x_2,y_2)} have the “most equal” distance.
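For tiny finite metric spaces one can evaluate (4) literally by enumerating all correspondences; this is exponential in the number of pairs and only meant as an illustration of the definition (the function name is mine):

```python
import itertools
import numpy as np

def gh_by_enumeration(dX, dY):
    """Evaluate (4) by brute force (only sensible for very small spaces)."""
    m, n = len(dX), len(dY)
    pairs = list(itertools.product(range(m), range(n)))
    best = np.inf
    for bits in itertools.product([0, 1], repeat=m * n):
        R = [pairs[t] for t in range(m * n) if bits[t]]
        # R must project onto all of X and all of Y to be a correspondence
        if {i for i, _ in R} != set(range(m)) or {j for _, j in R} != set(range(n)):
            continue
        dis = max(abs(dX[i1][i2] - dY[j1][j2]) for i1, j1 in R for i2, j2 in R)
        best = min(best, dis)
    return best / 2

# two-point spaces with diameters 1 and 2 give 1/2, the two-point
# analogue of Example 5:
# gh_by_enumeration(np.array([[0., 1.], [1., 0.]]),
#                   np.array([[0., 2.], [2., 0.]]))
```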

This reformulation may have the advantage over the form (3) that it only considers the set couplings and the given metrics {d_X} and {d_Y}; no metric coupling is needed.

Note that, like the previous reformulation (3), it is also in the spirit of the Wasserstein metric: If there were natural measures {\mu_R} on the set couplings {R}, we could write

\displaystyle d_{GH}(X,Y) = \tfrac12\inf_R \Big(\int_{R\times R}\big| d_X(x_1,x_2) - d_Y(y_1,y_2)\big|^p d\mu_R(x_1,y_1)d\mu_R(x_2,y_2)\Big)^{1/p}

and recover the formulation (4) in the limit {p\rightarrow\infty}.

One may say that we went up an abstraction ladder one step further by moving from {(X,d)} to {(\mathfrak{C}(X),d_H)} to {(\text{All compact metric spaces},d_{GH})}.

Since this post has grown pretty long already, I decided to postpone the next step (which is the already announced metric on metric spaces which additionally carry some measure on them – so-called metric measure spaces) to a later post.

If mathematicians study an object they sometimes study the properties of the object. E.g. you can study the object “2” (that is, the natural number 2) by saying that it denotes a certain cardinality, namely the cardinality of two things. Well, that does not give you anything new. However, mathematicians often study objects by studying either what the object can do to other things or what other things can do to the object. In our (quite silly) example of the number two you see that you can use 2 to add it to other numbers (or, conversely, you can add other numbers to 2). One may feel that this does not add any insight, but for objects which are more complicated (e.g. sets of other objects or sets of these) this view can be particularly useful: If you consider a normed vector space {X}, you can either ask for properties of the space itself (e.g. completeness) or you ask how one can map the objects of the space to other sets. Using the underlying ingredients to give more structure to the mappings under consideration, this leads to new constructions: In the case of a normed vector space you have the underlying field {K}. This leads to mappings from {X} to {K}. You also have a linear structure on {X}. This leads to linear mappings from {X} to {K}. Finally, there is a norm on {X} and if there is an absolute value on {K}, this leads you to bounded linear mappings from {X} to {K}. And guess what: These objects form the so-called dual space.

Dual spaces of (say) Banach spaces tell you new things about the space itself and it seems to me that if you want to understand the nature of a Banach space you need to know its dual space (or a space of which your space is the dual, i.e. a pre-dual space).

This post shall not be about duality in general but about dual spaces of a particular type of spaces, namely spaces which consist of continuous functions. While these spaces usually form nice Banach spaces, they are not as simple as Lebesgue spaces or Sobolev spaces in that there are no Hilbert spaces among them and not even reflexive ones.

1. What spaces of continuous functions?

For sets {\Omega} and {V} one can study continuous functions {f:\Omega\rightarrow V} as soon as both {\Omega} and {V} are topological spaces. For these functions to form a vector space we also want {V} to be a vector space over a field {K}. To get a normed space of continuous functions we endow {V} with a norm {\|\cdot\|} and try to define the norm of {f:\Omega \rightarrow V} as

\displaystyle \|f\|_\infty = \sup_{x\in\Omega}\|f(x)\|.

But wait: The supremum may not exist. We better add the assumption that {f} is bounded. Now this gives us a normed space and we denote:

\displaystyle C_b(\Omega,V) = \{f:\Omega\rightarrow V\ |\ f\ \text{continuous and bounded}\}.

Together with the norm {\|\cdot\|_\infty} this gives a Banach space as soon as {V} is a Banach space. In the case that {\Omega} is compact, we can omit the condition of boundedness and obtain

\displaystyle C(\Omega,V) = \{f:\Omega\rightarrow V\ |\ f\ \text{continuous}\}

and again get a Banach space as soon as {V} is a Banach space.

Instead of assuming compactness of the whole space {\Omega} we could shift this property to all functions in this space. This leads to

\displaystyle C_c(\Omega,V) = \{f:\Omega\rightarrow V\ |\ f\ \text{continuous with compact support}\}.

This does not lead to a Banach space in the case that {\Omega} is not compact. Indeed, there is a last possibility: considering the closure of {C_c(\Omega,V)} with respect to {\|\cdot\|_\infty}, leading to the space

\displaystyle C_0(\Omega,V) = \overline{C_c(\Omega,V)}^{\|\cdot\|_\infty}.

Informally one says that these are the continuous functions “that vanish on the boundary of {\Omega}”.

All these spaces admit dual spaces and all of them are described by some kind of “Riesz representation theorem”. These dual spaces consist of some kind of measures and the dual pairing is then defined as integrating the continuous functions with respect to these measures. Here I want to clarify (at least for myself) what these dual spaces are and how they are related.

The case in which {V=\mathbb{K}} (i.e. the real or complex numbers) is a little easier and we omit the argument {V} in the spaces in this case: {C_b(\Omega) = C_b(\Omega,\mathbb{K})}, {C(\Omega) = C(\Omega,\mathbb{K})},…

2. The dual of {C_b(\Omega)} for normal {\Omega}

We deal with real or complex valued functions on a set {\Omega} and assume that {\Omega} is a normal topological space (also called a {T_4}-space), that is, the points in {\Omega} are all closed and two disjoint closed sets can be separated by open neighborhoods. Of course, a metric space {\Omega} is a normal topological space.

2.1. Regular bounded finitely additive measures

To see what the dual space of {C_b(\Omega)} could be, we have to think of integrating a bounded continuous function against a measure {\mu}. Here the topological structure of {\Omega} comes into play because it naturally leads to a {\sigma}-algebra on {\Omega}, namely the one which is generated by the closed sets in {\Omega} (or equivalently by the open sets), which is also called the Borel algebra {B(\Omega)} on {\Omega}. On a {\sigma}-algebra we can define a set function (which shall be a measure) by mapping the elements in {B(\Omega)} to real (or complex) numbers. Such a set function {\mu} is called bounded if {|\mu(E)|<\infty} for all {E\in B(\Omega)} and finitely additive if for any finite number of disjoint sets {E_1,\dots,E_n\in B(\Omega)} it holds that

\displaystyle \mu(\cup E_i) = \sum \mu(E_i). \ \ \ \ \ (1)


The set of all bounded and finitely additive set functions (nowadays often called bounded finitely additive measures) is denoted by {ba(\Omega)}. Equipped with the variational norm {\|\mu\| = |\mu|(\Omega)}, {ba(\Omega)} becomes a Banach space. The space {ba(\Omega)} is a fairly large space of measures: Note that we assumed neither that the members are countably additive nor that the values {\mu(E)} are non-negative. However, we assumed that all values {|\mu(E)|} are finite, which excludes, e.g., the Lebesgue measure on {\Omega=\mathbb{R}} (while weighted Lebesgue measures can be allowed if the weight is integrable). The space {ba(\Omega)} is slightly too large to be the dual of {C_b(\Omega)} and we need another restriction. We call an element {\mu\in ba(\Omega)} regular if for any {E \in B(\Omega)} and any {\epsilon>0} there exist a closed set {F\subset E} and an open set {G\supset E} such that for every {C\subset G\setminus F} it holds that {|\mu(C)|<\epsilon}. The space of regular bounded finitely additive measures is denoted by {rba(\Omega)} and is a closed subspace of {ba(\Omega)}.

2.2. Integration with respect to {\mu\in rba(\Omega)}

Now we can define the integral of a bounded continuous function {f\in C_b(\Omega)} with respect to a regular bounded finitely additive measure {\mu\in rba(\Omega)} as follows: Since {f(\Omega)} is a bounded set in {\mathbb{K}} we can cover it with open sets {G_1,\dots, G_n} of diameter less than a given {\epsilon>0}. Define {A_1=G_1} and {A_j = G_j\setminus \bigcup_{i=1}^{j-1} G_i}. If {A_j} is not empty, choose {\alpha_j\in A_j} (otherwise set {\alpha_j=0}). Since {f} is continuous, the sets {B_j = f^{-1}(A_j)\subset\Omega} are in {B(\Omega)} and we can define

\displaystyle f_\epsilon = \sum_{j=1}^n \alpha_j \chi_{B_j}

which is an {\epsilon}-approximation to {f} and indeed, {f} is the uniform limit of the {f_\epsilon}. For each {f_\epsilon}, the integral is naturally defined as

\displaystyle \int f_\epsilon d\mu = \sum \alpha_j \mu(B_j)

and the limit of this expression is used as the integral of {f}.
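As a toy illustration of this range-partition construction, here is a sketch for {\Omega=[0,1]} and {\mu} the Lebesgue measure (which is in particular regular and bounded, hence admissible); approximating {\mu(B_j)} by the fraction of uniform sample points is a simplification of mine:

```python
import numpy as np

def integrate_level_sets(f, eps, n=100000):
    """Range-partition integral of f over [0,1] w.r.t. Lebesgue measure."""
    x = np.linspace(0, 1, n, endpoint=False) + 0.5 / n  # sample points
    y = f(x)
    bins = np.floor((y - y.min()) / eps).astype(int)    # A_j: eps-slices of the range
    total = 0.0
    for j in np.unique(bins):
        mask = bins == j                                # B_j = f^{-1}(A_j)
        alpha = y[mask][0]                              # some alpha_j in A_j
        total += alpha * mask.mean()                    # mu(B_j) ~ fraction of samples
    return total

# integrate_level_sets(np.sin, 0.01) is close to 1 - cos(1) = 0.4597...
```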

2.3. Duality

Since it holds that

\displaystyle \Big| \int f d\mu\Big| \leq \|f\|_\infty |\mu|

every {\mu\in rba(\Omega)} defines a continuous linear functional on {C_b(\Omega)}. This shows that the dual space of {C_b(\Omega)} contains {rba(\Omega)}. Indeed both spaces are equal:

Theorem 1 For a normal topological space {\Omega} it holds that

\displaystyle C_b(\Omega)^* = rba(\Omega).

The proof is lengthy and technical and the prime reference is “Linear Operators” by Dunford and Schwartz (Theorem IV.6.2).

3. The dual of {C(\Omega)} for compact Hausdorff {\Omega}

Now we assume that {\Omega} is a compact Hausdorff space (think of closed and bounded subsets of {\mathbb{R}^d}). Due to compactness we can omit the boundedness assumption on our continuous functions (they are bounded anyway).

3.1. Regular bounded countably additive measures

We move from finitely additive measures to countably additive measures, that is, (1) holds for countably many disjoint sets {E_i}. Together with finiteness and regularity we arrive at the space {rca(\Omega)} of regular bounded countably additive measures. In our case, where {\Omega} is also a topological space and the {\sigma}-algebra is the Borel algebra, this space is also called the space of regular Borel measures or Radon measures and is often denoted by {\mathfrak{M}(\Omega)}.

3.2. Duality

Here we have the following representation theorem:

Theorem 2 If {\Omega} is compact and Hausdorff it holds that

\displaystyle C(\Omega)^* = \mathfrak{M}(\Omega)\ (=rca(\Omega)).

By

\displaystyle f\mapsto \int f d\mu

every {\mu\in\mathfrak{M}(\Omega)} determines an element in {C(\Omega)^*}. Moreover, it is clear that {rba(\Omega)\supset C(\Omega)^*}. It remains to show that every {\lambda\in rba(\Omega)} determines a {\mu\in rca(\Omega) = \mathfrak{M}(\Omega)} such that for all {f\in C(\Omega)} it holds that {\int fd\lambda = \int fd\mu}.

4. The dual of {C_0(\Omega)} for locally compact Hausdorff {\Omega}

This case is basically the same as for {C(\Omega)} with compact Hausdorff {\Omega}. The absence of compactness is compensated by the fact that the functions “vanish on the boundary”. Well, “vanishing on the boundary” really means that the functions can be approximated by continuous functions with compact support (in the {\infty}-norm) and hence, the result does not differ from the previous one:

Theorem 3 If {\Omega} is locally compact and Hausdorff it holds that

\displaystyle C_0(\Omega)^* = \mathfrak{M}(\Omega).

Final remarks:

  • It turned out that the situation is easier than I expected. Basically there are two cases: The first is bounded continuous functions on a normal space {\Omega}; here we get the regular bounded and finitely additive measures as the dual space. The second is continuous functions on a compact space, or continuous functions which “vanish at the boundary” on a locally compact space; in both cases we arrive at the regular bounded countably additive measures, aka Radon measures.


  • For {\Omega\subset{\mathbb R}^d} the distinction is:
    1. {\Omega} arbitrary and bounded continuous functions give {rba(\Omega)}.
    2. {\Omega} compact and continuous functions give {\mathfrak{M}(\Omega)}.
    3. {\Omega} arbitrary and continuous functions vanishing on {\partial\Omega} also give {\mathfrak{M}(\Omega)}.


  • There are a lot of excellent references for the Riesz representation theorem on the web and you’ll easily find several proofs. However, the proof is always technical and lengthy. One attempt at a short proof in the AMM article “The Riesz Representation Theorem Revisited” was successful; however, it uses several heavy tools: Stone-Čech compactification, Carathéodory extension and a result about clopen sets in extremally disconnected spaces.


Concerning the fact that weak and strong convergence coincide in {\ell^1} (also known as Schur’s Theorem) I asked about an elementary proof in my last post. And in fact Markus Grasmair sent me one which I’d like to present here. It is indeed elementary as it does not use any deep mathematics – however, it is a bit tricky.

Theorem 1 (Schur’s Theorem) A sequence {(x^k)} in {\ell^1} converges weakly iff it converges strongly.

Proof: As always, strong convergence implies weak convergence. To prove the opposite we assume that {x^k\rightharpoonup 0} but {\|x^k\|_1\geq\epsilon} for some {\epsilon>0}. From this assumption we are going to derive a contradiction by constructing a vector {f\in\ell^\infty} with unit norm and a subsequence {x^{k_l}} such that {\langle f,x^{k_l}\rangle} does not converge to zero. We initialize {j_0=0}, set {k_1=1}, choose {j_1\in{\mathbb N}} such that {\sum_{j>j_1}|x^1_j|\leq \epsilon/6} and define the first {j_1} entries of {f} as

\displaystyle  f_j = \text{sign}(x^1_j)\ \text{ for }\ 1\leq j\leq j_1.

Now we proceed inductively and assume that for some {l\geq 1} the numbers {j_1,\dots,j_l}, the subsequence elements {x^{k_1},\dots,x^{k_l}} and the entries {f_1,\dots,f_{j_l}} have already been constructed and fulfill for all {m\leq l}

\displaystyle  \Big|\sum_{j\leq j_{m-1}} f_j x^{k_m}_j\Big| \leq \frac{\epsilon}{6} \ \ \ \ \ (1)

\displaystyle  \sum_{j=j_{m-1}+1}^{j_m} f_j x^{k_m}_j\geq \frac{2\epsilon}{3} \ \ \ \ \ (2)

\displaystyle  \sum_{j>j_m}|x^{k_m}_j|\leq \frac{\epsilon}{6}. \ \ \ \ \ (3)

(Note that for {l=1} these conditions are fulfilled: (1) is fulfilled since the sum is empty, (2) is fulfilled since {\sum_{j=1}^{j_1}f_j x^1_j = \|x^1\|_1 - \sum_{j>j_1}|x^1_j|\geq 5\epsilon/6} and (3) is fulfilled by definition.) To go from a given {l} to the next one, we first observe that {x^k\rightharpoonup 0} implies that for all {j} it holds that {x^k_j\rightarrow 0}. Hence, we may take {k_{l+1}} such that {\sum_{j\leq j_l} |x^{k_{l+1}}_j|\leq \epsilon/6} and of course we can take {k_{l+1}>k_l}. Since {x^{k_{l+1}}} is a summable sequence, we find {j_{l+1}} such that {\sum_{j>j_{l+1}}|x^{k_{l+1}}_j|<\epsilon/6} and again we may take {j_{l+1}>j_l}. We set

\displaystyle  f_j = \text{sign}(x^{k_{l+1}}_j)\ \text{ for }\ j_l< j\leq j_{l+1}

and observe

\displaystyle  \sum_{j = j_l+1}^{j_{l+1}} f_j x^{k_{l+1}}_j = \sum_{j = j_l+1}^{j_{l+1}} |x^{k_{l+1}}_j| \geq \|x^{k_{l+1}}\|_1 - \frac{\epsilon}{3} \geq \frac{2\epsilon}{3}.

By construction, the properties (1), (2) and (3) are fulfilled for {l+1}, and we see that we can continue our procedure ad infinitum. For the resulting {f\in\ell^\infty} (indeed {\|f\|_\infty=1}) and the subsequence {(x^{k_l})} we obtain (using the properties (1), (2) and (3)) that

\displaystyle  \begin{array}{rcl}  \langle f,x^{k_l}\rangle &=& \sum_j f_j x^{k_l}_j\\ & = & \sum_{j\leq j_{l-1}} f_j x^{k_l}_j + \sum_{j= j_{l-1}+1}^{j_l} f_j x^{k_l}_j + \sum_{j> j_l} f_j x^{k_l}_j\\ & \geq & -\Big|\sum_{j\leq j_{l-1}} f_j x^{k_l}_j \Big| + \sum_{j= j_{l-1}+1}^{j_l} f_j x^{k_l}_j - \sum_{j> j_l} |x^{k_l}_j|\\ & \geq& - \frac{\epsilon}{6} + \frac{2\epsilon}{3} - \frac{\epsilon}{6}\geq \frac{\epsilon}{3}. \end{array}

\Box

Although this proof is really elementary (you only need to know what convergence means and have a proper {\epsilon}-handling) I found it hard to digest. I think that this is basically not avoidable – Schur’s Theorem seems to be one of those facts which are easy to remember, hard to believe and hard to prove.
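Since the construction is quite mechanical, one can follow it on finite truncations. Here is a sketch of mine (0-based arrays; it only illustrates the bookkeeping of the proof and assumes the while loops terminate inside the given truncations):

```python
import numpy as np

def gliding_hump(x, eps, steps=5):
    """Follow the inductive construction from the proof.

    x: list of 1d arrays (truncations of the x^k), eps: the assumed
    lower bound on the norms ||x^k||_1.  Returns the constructed
    entries of f and the selected subsequence indices k_l."""
    f, js, ks, k = {}, [0], [], 0
    for _ in range(steps):
        # pointwise convergence: find k with small mass on the fixed entries
        while np.abs(x[k][:js[-1]]).sum() > eps / 6:
            k += 1
        # summability: find j_l with small tail
        j = js[-1] + 1
        while np.abs(x[k][j:]).sum() > eps / 6:
            j += 1
        for i in range(js[-1], j):
            f[i] = np.sign(x[k][i])     # f_j = sign(x^{k_l}_j) on the new block
        ks.append(k); js.append(j); k += 1
    return f, ks

# e.g. the unit vectors x = [np.eye(20)[k] for k in range(20)] (pointwise
# null, norm 1) yield f = (1, 1, 1, ...) and <f, x^{k_l}> = 1
```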

Let’s try a direct proof that {x^k\rightharpoonup 0} implies {x^k\rightarrow 0} in {\ell^1}.

Proof: We know that a weakly convergent sequence in {\ell^1} is pointwise convergent (by testing with the canonical basis vectors), hence for every {j} it holds that {x^k_j\rightarrow 0} for {k\rightarrow\infty}. For some {J\in{\mathbb N}} we write

\displaystyle  \|x^k\|_1 = \sum_{j< J} |x^k_j| + \sum_{j\geq J} |x^k_j|.

By pointwise convergence of {x^k_j} we know that the first sum converges to zero for every {J}. Hence, we could proceed as follows: For a given {\epsilon>0} first choose {J} so large that {\sum_{j\geq J} |x^k_j|\leq \epsilon/2} for all {k} and then choose {k} so large that {\sum_{j< J} |x^k_j|\leq \epsilon/2}. This sounds nice, but wait! What about the choice of {J}? The tails of the series have to be bounded by {\epsilon/2} uniformly in {k}. That sounds possible but not obvious and indeed this is the place where the hard work starts! Indeed, one can prove that this choice of {J} is possible by contradiction: Assume that there exists an {\epsilon} such that for all {J} there exists a {k} such that {\sum_{j\geq J}|x^k_j|>\epsilon}. Alas, this is the same thing which we have proven in Markus’ proof… \Box

Indeed, the construction Markus made in his proof can be generalized to obtain the Nikodym convergence theorem:

Theorem 2 Let {(\sigma_n)} be a sequence of finite signed measures for which it holds that {\sigma_n(E)\rightarrow 0} for every measurable set {E} (which is equivalent to the weak convergence of {\sigma_n} to zero). Then the sequence {(\sigma_n)} is uniformly countably additive, i.e. for any sequence {(E_m)} of disjoint measurable sets the series {\sum_{m=1}^\infty |\sigma_n(E_m)|} converges uniformly in {n}.

One may compare this alternative proof with Exercise 11 (right after Theorem 5 [Nikodym convergence theorem]) in this blog post by Terry, where one is asked to take a similar path to prove Schur’s Theorem.

This morning I was looking for the book “Vector Measures” by Diestel and Uhl in our local math library. Although it was not there, I found the book “Sequences and Series in Banach Spaces” by Joseph Diestel. This (very nicely written) book on somewhat advanced theory of Banach spaces contains a number of facts of the type “yeah, that is true, although not easy to see; however, I do not know any reference for it…”.

1. The dual of {\ell^\infty}

The question of how the dual space of {\ell^\infty} looks seems to be of interest to many people (e.g. questions on it are frequently asked on the web, e.g. at math.stackexchange). Diestel’s book has the chapter “The Classical Banach Spaces” and in the section “The Classical Nonreflexive Sequence Spaces” he describes helpful properties of the spaces {c_0}, {\ell^1} and {\ell^\infty}.

The spaces {c_0} and {\ell^\infty} have the property that maps into them can be extended to larger domains:

Theorem 1 Let {Y} be a linear subspace of a Banach space {X} and let {T:Y\rightarrow\ell^\infty} be bounded and linear. Then there is an extension {S:X\rightarrow\ell^\infty} of {T} having the same norm as {T}.

Proof: Since {T} is bounded, it holds for all {n\in{\mathbb N}} that {|(Ty)_n|\leq \|Ty\|_\infty\leq c\|y\|_Y} and hence the mappings {y\mapsto (Ty)_n} are all elements in the space {Y^*} – we denote them by {y^*_n}. This gives the representation

\displaystyle  Ty = (y^*_n(y))_n.

By Hahn-Banach there exist extensions {x^*_n} of {y^*_n} onto {X} (with equal norms) and hence,

\displaystyle  Sx = (x^*_n(x))_n

defines an extension of {T} with the same norm. \Box

For {c_0} something similar holds but only for separable spaces {X} (with a more involved proof):

Theorem 2 Let {Y} be a linear subspace of a separable Banach space {X} and let {T:Y\rightarrow c_0} be bounded and linear. Then there is an extension {S:X\rightarrow c_0} of {T}.

Then Diestel moves on to the dual space of {\ell^\infty}: {ba(2^{\mathbb N})} (which stands for the bounded and (finitely) additive measures). Although this topic is also treated in other classical books (such as “Linear Operators” by Dunford and Schwartz), the exposition in Diestel’s book is the most lively I have come across.

Start with an {x^*\in (\ell^\infty)^*}. For a subset {\Delta} of the natural numbers the characteristic function {\chi_\Delta} belongs to {\ell^\infty} and hence we can evaluate {x^*(\chi_\Delta)}. Of course, {\Delta\mapsto x^*(\chi_\Delta)} is additive on disjoint sets, and for disjoint {\Delta_1,\dots,\Delta_n} we have

\displaystyle  \begin{array}{rcl}  \sum_{i=1}^n |x^*(\chi_{\Delta_i})| &= &\sum_{i=1}^n x^*(\chi_{\Delta_i})\, \text{sgn}(x^*(\chi_{\Delta_i}))\\ & = &x^*(\sum_{i=1}^n \text{sgn}(x^*(\chi_{\Delta_i}))\,\chi_{\Delta_i})\\ & \leq &\|x^*\|\,\|\sum_{i=1}^n \text{sgn}(x^*(\chi_{\Delta_i}))\,\chi_{\Delta_i}\|_\infty\leq\|x^*\|. \end{array}

This shows that every {x^*\in(\ell^\infty)^*} induces a bounded finitely additive measure {\Delta\mapsto x^*(\chi_\Delta)} on the subsets of {{\mathbb N}}. I'd like to quote Diestel directly:

A scalar-valued measure being bounded and additive is very like a countably additive measure and is not [...] at all pathological.

Now we denote by {ba(2^{\mathbb N})} the Banach space of bounded, (finitely) additive scalar-valued measures on {{\mathbb N}}, endowed with the variation norm {\|\cdot\|_{ba}} (which can be defined through the Hahn-Jordan decomposition and is, in a nutshell, the measure of the positive set plus the absolute value of the measure of the negative set).

For {\mu\in ba(2^{\mathbb N})} and a sequence {(\Delta_k)} of pairwise disjoint subsets of the natural numbers it holds for every {n} that

\displaystyle  \sum_{i=1}^n |\mu(\Delta_i)|\leq \|\mu\|_{ba}

and hence

\displaystyle  \sum_{i=1}^\infty |\mu(\Delta_i)|\leq \|\mu\|_{ba}.

We see that {\mu} adds up the measures of countably many disjoint sets; in particular, {\sum_{i=1}^\infty \mu(\Delta_i)} is an absolutely convergent series. However, the sum does not have to be the right one: {\sum_{i=1}^\infty \mu(\Delta_i)} may be smaller than {\mu(\bigcup_{i=1}^\infty \Delta_i)}. Diestel says that it is not fair to blame the measure {\mu} for this possible defect; rather it is “a failure on the part of the underlying field of sets”. He goes on to make this precise by the use of the Stone representation theorem, which assigns to any Boolean algebra (in our case the algebra {\Sigma=2^{\mathbb N}} of subsets of {{\mathbb N}}) the appropriately topologized set of ultrafilters, and in this set he considers the Boolean algebra {\mathcal{S}(\Sigma)} of the sets which are “clopen” (which is isomorphic to the original algebra {\Sigma}). He then considers {\mu} not acting on {\Sigma} but its identical twin {\hat\mu} on {\mathcal{S}(\Sigma)} and shows that there one needs to work with the closure of the union of sets, and this makes the identical twin of {\mu} even countably additive. In this sense, the lack of countable additivity is not a failure of {\mu} but of {\Sigma}, so to say.
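A standard example for this defect (my addition here, not taken from Diestel's exposition) is given by a Banach limit: let {\text{LIM}\in(\ell^\infty)^*} be a Hahn-Banach extension of the limit functional on the space {c} of convergent sequences and set {\mu(\Delta) = \text{LIM}(\chi_\Delta)}. Then {\mu} is finitely additive and every singleton gets measure zero (since {\chi_{\{i\}}} converges to zero), but

\displaystyle  \sum_{i=1}^\infty \mu(\{i\}) = 0 < 1 = \text{LIM}(\chi_{{\mathbb N}}) = \mu\Big(\bigcup_{i=1}^\infty \{i\}\Big).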

2. Weak convergence in {\ell^1} is strong

As a unique feature of {\ell^1} I’d like to quote Diestel again:

Face it: the norm of a vector {\sum_n t_n e_n} in {\ell^1} is as big as it can be ({\|\sum_n t_n e_n\|_1 = \sum_n |t_n|}) if respect for the triangle inequality and the “unit” vector is to be preserved.

The following theorems hold:

Theorem 3 For every separable Banach space {X} there exists a bounded and linear operator {T:X\rightarrow\ell^1} which is onto.

Theorem 4 Let {X} be a Banach space and {T:X\rightarrow\ell^1} bounded, linear and onto. Then {X} contains a subspace that is isomorphic to {\ell^1} and complemented.

Probably the most stunning fact about {\ell^1} is that weak and strong convergence of sequences coincide. To show this Diestel uses heavy machinery, namely Phillips' Lemma.

Lemma 5 (Phillips’ Lemma) Let {\mu_n} be in {ba(2^{\mathbb N})} and satisfy {\lim_{n\rightarrow\infty}\mu_n(\Delta)=0} for every {\Delta\subset{\mathbb N}}. Then

\displaystyle  \lim_n \sum_k |\mu_n(\{k\})|=0.

Theorem 6 (Schur’s Theorem) In {\ell^1} it holds that {x_n\rightharpoonup x} iff {x_n\rightarrow x}.

Proof: As always, strong convergence implies weak convergence and we only have to show the converse. We consider {\ell^1=\ell^1({\mathbb N})} and hence, by the canonical embedding into the double dual {(\ell^1)^{**} = (\ell^\infty)^* = ba(2^{\mathbb N})}, there is for every {x\in\ell^1} a {\mu_x\in ba(2^{\mathbb N})} and for every {\Delta\subset {\mathbb N}} it holds that

\displaystyle  \mu_x(\Delta) = \sum_{k\in\Delta} x(k).

Let {(x_n)} be a weak null sequence in {\ell^1}. Then it holds that

\displaystyle  \begin{array}{rcl}  \lim_{n\rightarrow\infty} \mu_{x_n}(\Delta) &=&\lim_{n\rightarrow\infty}\sum_{k\in\Delta}x_n(k)\\ & = & \lim_{n\rightarrow\infty}\langle \chi_\Delta,x_n\rangle_{\ell^\infty\times\ell^1} = 0 \end{array}

Now, Phillips’ Lemma gives

\displaystyle  0 = \lim_n \sum_k |\mu_{x_n}(\{k\})|= \lim_n \sum_k |x_n(k)|=\lim_n \|x_n\|_1.

\Box
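To see what Schur's Theorem rules out, here is a small numerical sketch (my own illustration, not from Diestel's book): the unit vectors {e_n} are the standard example of a weakly null sequence that is not strongly null in {\ell^2}; in {\ell^1} the same sequence fails to be weakly null, since the pairing with {g=(1,1,1,\dots)\in\ell^\infty} is constantly one, consistent with {\|e_n\|_1 = 1} and Schur's Theorem.

    import numpy as np

    N = 10000                          # truncate the sequence space to length N
    g_linf = np.ones(N)                # g = (1,1,1,...) in l^infty, a test functional for l^1
    g_l2 = 1.0 / np.arange(1, N + 1)   # a fixed test element of l^2

    for n in [1, 10, 100, 1000]:
        e_n = np.zeros(N)
        e_n[n] = 1.0                   # the n-th unit vector
        print(n, e_n @ g_linf, e_n @ g_l2)
    # the l^1-pairing stays 1 for all n: (e_n) is not weakly null in l^1;
    # the l^2-pairing tends to 0, as it does for every fixed l^2 test vector.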

By the way: Does anyone have a reference for a more direct/elementary proof of Schur's Theorem?

There are still some things left that I wanted to add about the issue of weak-* convergence in {L^\infty}, non-linear distortions and Young measures. The first is that Young measures are not able to describe all effects of weak-* convergence; namely, the notion does not handle concentrations properly. The second is that there is an alternative approach based on the {\chi}-function which I also find graphically appealing.

1. Concentrations and Young measures

One can distinguish several “modes” in which a sequence of functions can converge: In this blog entry Terry Tao introduces four more modes apart from oscillation:

  1. escape to horizontal infinity
  2. escape to width infinity
  3. escape to vertical infinity
  4. typewriter sequence

1.1. Escape to horizontal infinity

This mode is most easily described by the sequence {f_n = \chi_{[n,n+1]}}, i.e. the characteristic functions of intervals of unit length which escape to infinity. Obviously, this sequence does not converge in any {L^p({\mathbb R})} norm, and its weak convergence depends on {p}:

For {p>1} the sequence does converge weakly (weakly-* for {p=\infty}) to zero. This can be seen as follows: Assume that for some non-negative {g\in L^q} (with {1/p + 1/q = 1}) and some {\epsilon>0} we have {\epsilon \leq \int g f_n = \int_n^{n+1} g} for all {n}. Then we get with Hölder's inequality that {\epsilon^q \leq \int_n^{n+1} |g|^q}; summing over {n} contradicts the fact that {g\in L^q}.

For {p=1} the sequence does not converge weakly to zero, as can be seen by testing with the function {g \equiv 1}, and it does not converge weakly at all (test with {g = \sum_n (-1)^n\chi_{[n,n+1]}} and observe that the dual pairings do not converge).

However, this type of convergence does not occur on bounded domains and hence cannot be treated with Young measures as they have been introduced in my previous entry.
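A quick numerical check (my own sketch, not from the cited post): pairing {f_n=\chi_{[n,n+1]}} with the integrable test function {g(x)=e^{-x}} shows the decay, while pairing with {g\equiv 1} exhibits the failure of weak convergence in {L^1}. The integrals below are exact closed forms, so no quadrature is needed.

    import numpy as np

    # f_n = characteristic function of [n, n+1]
    for n in [1, 5, 10, 20]:
        # pairing with g(x) = exp(-x), which lies in every L^q:
        # int_n^{n+1} exp(-x) dx = exp(-n) * (1 - exp(-1))
        pairing_decay = np.exp(-n) * (1 - np.exp(-1))
        # pairing with g = 1 in L^infty: int_n^{n+1} 1 dx = 1 for every n
        pairing_const = 1.0
        print(n, pairing_decay, pairing_const)
    # the first pairing tends to 0 (weak convergence for p > 1),
    # the second stays 1 (no weak convergence to zero in L^1)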

1.2. Escape to width infinity

The paradigm for this mode of convergence is {f_n = \frac{1}{n}\chi_{[0,n]}}. This sequence even converges strongly to zero in {L^p} for {p>1}, but not strongly in {L^1}: the norms {\|f_n\|_1} are constantly one. It does not converge weakly to zero in {L^1} either, since testing with {g\equiv 1\in L^\infty} gives {\int f_n g\,{\mathrm d}{x} = 1} for every {n}; the mass does not vanish but spreads out. This mode needs, similar to the previous one, an unbounded domain.
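To make the norm behaviour explicit:

\displaystyle  \|f_n\|_p^p = \int_0^n n^{-p}\,{\mathrm d}{x} = n^{1-p}\rightarrow 0\ \text{for}\ p>1, \qquad \|f_n\|_1 = \int_0^n n^{-1}\,{\mathrm d}{x} = 1.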

1.3. Escape to vertical infinity

The prime example for this mode, with constant norm {\|f_n\|_{L^2(]-1,1[)} = \sqrt{2}}, is {f_n = \sqrt{n}\chi_{[-1/n,1/n]}}. By testing with continuous functions (which is enough by density) one sees that the weak limit is zero.

If one wants to assign some limit to this sequence {f_n}, one can say that the measures {f_n^2\mathfrak{L}} converge weakly in the sense of measures to {2\delta_0}, i.e. twice the point-mass at zero.
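Indeed, for a bounded continuous {\phi} one computes

\displaystyle  \int_{-1}^1 f_n^2\,\phi\,{\mathrm d}{x} = n\int_{-1/n}^{1/n}\phi(x)\,{\mathrm d}{x}\rightarrow 2\phi(0),

which is exactly the pairing of {\phi} with {2\delta_0}.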

Now, what does the Young measure say here?

We check narrow convergence of the Young measures {\mu^{f_n}} by testing with functions of the type {\psi(x,y) = \chi_B(x)\phi(y)} for a Borel set {B} and a bounded continuous function {\phi} with {\phi(0)=0}. Since {f_n} vanishes outside of {[-1/n,1/n]}, we get for {n\rightarrow \infty}

\displaystyle  \Big|\int_{\Omega\times{\mathbb R}} \psi(x,y)\,{\mathrm d}\mu^{f_n}(x,y)\Big| \leq \int_{-1/n}^{1/n}|\phi(\sqrt{n})|\,{\mathrm d}\mathfrak{L}(x) = \tfrac{2}{n}\,|\phi(\sqrt{n})|\rightarrow 0.

Hence, the limit does not see anything but the value at zero:

\displaystyle  \mu^{f_n}\rightharpoonup \mu^{0},

the Young measure associated with the zero function, i.e. {\nu_x = \delta_0}.

We conclude that the concentration cannot be seen by Young measures: the sequence {f_n} has the same Young measure limit as the constant zero sequence. As Attouch et al. say in Section 4.3.7: “Young measures do not capture concentrations”.

1.4. Typewriter sequence

A typewriter sequence on an interval (as described in Example 4 of this blog entry of Terry Tao) is a sequence of functions which are mostly zero, while the non-zero parts revisit every point of the interval again and again, however, with smaller and smaller support and integral. This is an example of a sequence which converges in the {L^1}-norm but not pointwise at any point. However, this mode of convergence is not very interesting with respect to Young measures; it basically behaves like “escape to vertical infinity” above.
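For illustration, here is a small sketch of such a sequence (my own construction following Tao's example; the indexing is one of several possible choices):

    import numpy as np

    def typewriter(n, x):
        # stage m consists of the blocks k = 0, ..., 2^m - 1 of width 2^-m;
        # the function with index n = 2^m - 1 + k is the indicator of
        # the interval [k * 2^-m, (k+1) * 2^-m)
        m = int(np.floor(np.log2(n + 1)))
        k = n + 1 - 2**m
        return ((x >= k / 2**m) & (x < (k + 1) / 2**m)).astype(float)

    x = np.linspace(0.0, 1.0, 1024, endpoint=False)
    l1_norms = [typewriter(n, x).mean() for n in range(63)]
    values_at_half = [typewriter(n, x)[512] for n in range(63)]
    # l1_norms tends to zero (it equals 2^-m at stage m), while the
    # values at x = 1/2 keep returning to 1: no pointwise convergence.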

2. Weak convergence via the {\chi}-function

While Young measures put a uniformly distributed measure on the graph of the function and thus are a more “graphical” representation of it, the approach described now uses the area between the graph and the {x}-axis.

We consider an open and bounded domain {\Omega\subset {\mathbb R}^d}. Now we define the {\chi}-function {\chi:{\mathbb R}\times{\mathbb R}\rightarrow {\mathbb R}} by

\displaystyle  \chi(\xi,u) = \begin{cases} 1, & 0<\xi<u\\ -1, & u<\xi<0\\ 0, &\text{else}. \end{cases}

[Figure: the graph of the {\chi}-function {(\xi,u)\mapsto\chi(\xi,u)}.]
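In code, the definition reads as follows (a small numpy sketch of mine):

    import numpy as np

    def chi(xi, u):
        # chi(xi, u) = 1 if 0 < xi < u, -1 if u < xi < 0, and 0 otherwise
        return np.where((0 < xi) & (xi < u), 1.0,
                        np.where((u < xi) & (xi < 0), -1.0, 0.0))

    # example: evaluate U(x, xi) = chi(xi, u(x)) on a grid for u(x) = sin(2 pi x)
    x = np.linspace(0.0, 1.0, 200)
    xi = np.linspace(-1.5, 1.5, 200)
    X, XI = np.meshgrid(x, xi)
    U = chi(XI, np.sin(2 * np.pi * X))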

We then associate to a given function {u:\Omega\rightarrow{\mathbb R}} the function {U:(x,\xi) \mapsto \chi(\xi,u(x))}. Graphically, this function has the value {1} if {\xi} is positive and lies between zero and {u(x)}, and it has the value {-1} if {\xi} is negative and again lies between zero and {u(x)}. In other words: the function {(x,\xi)\mapsto \chi(\xi,u(x))} is piecewise constant on the area between zero and the graph of {u}, encoding the sign of {u}. [Figure: the associated functions for the {f_n} from Example 1 in the previous post.]

Similar to the approach via Young measures, we now consider the sequence of the new objects, i.e. the sequence of the functions {(x,\xi)\mapsto \chi(\xi,f_n(x))}, and use a weak notion of convergence here. For Young measures we used narrow convergence; here we use plain weak-* convergence.

One can show the following lemma:

Lemma 1 Assume that {f_n} converges weakly-* in {L^\infty({\mathbb R})}. Then, for the weak-* limit {F} of the mappings {(x,\xi)\mapsto\chi(\xi,f_n(x))}, there exists a probability measure {\nu_x} such that

\displaystyle  \partial_\xi F(x,\cdot) = \delta_0 - \nu_x.

The proof (in a slightly different situation) can be found in Kinetic Formulation of Conservation Laws, Lemma 2.3.1.

Example 1 We again consider Example 1 from my previous post:

\displaystyle  f_n(x) = \begin{cases} a & \text{for }\ \tfrac{2k}{n} \leq x < \tfrac{2k+1}{n},\ k\in{\mathbb Z}\\ b & \text{else.} \end{cases}

The graph of some {f_n} and the corresponding function {F_n:(x,\xi) \mapsto \chi(\xi,f_n(x))} were shown above. Since the sets {\{f_n = a\}} and {\{f_n = b\}} both fill the interval with local density {\tfrac12}, the weak-* limit of these {\chi}-functions {F_n} is (in the case {b<0<a}) given by

\displaystyle  F(x,\xi) = \begin{cases} \tfrac12, & 0 < \xi < a\\ -\tfrac12, & b< \xi < 0\\ 0, & \text{else.} \end{cases}

[Figure: illustration of the limit function {F}.]

Now take the weak derivative with respect to {\xi} (which is, as the function {F} itself, independent of {x}): each jump of the piecewise constant {F} contributes a point mass weighted by the jump height, i.e. {+1} at {\xi=0} and {-\tfrac12} at {\xi=a} and at {\xi=b}. This gives

\displaystyle  \partial_\xi F(x,\cdot) = \delta_0 - \tfrac12 (\delta_a + \delta_b)

and, comparing with Lemma 1, we see

\displaystyle  \nu_x = \tfrac12 (\delta_a + \delta_b).

Cool: That is precisely the same limit as obtained by the Young measure!

Well, the observation in this example is not an accident and indeed this approach is closely related to Young measures. Namely, it holds that {\mu^{f_n}_x\rightharpoonup \nu_x}.

Maybe I'll come back to the proof of this fact later (it seemed not too hard, but it uses a slightly different definition of Young measures than the one I used here).

To conclude: Both the approach via Young measures and the approach via the {\chi}-function lead to the same new understanding of weak-* limits in {L^\infty}. This new understanding is a little bit deeper than the usual one, as it goes well with non-linear distortions of the functions. And finally, both approaches are geometric in nature: Young measures put an equidistributed measure on the graph of the function, while the {\chi}-function puts {\pm1} on the area between the graph and the {x}-axis.
