Here is a small signal boost for the

# Workshop on the Interface of Statistics and Optimization

to  be held at Duke University, Durham, North Carolina from Feb 8 to Feb 10 2017. The workshop is part of the long-year program on optimization currently taking place at the Statistical and Applied Mathematical Sciences Institute (SAMSI).

There will be a lineup of invited speakers from the forefront of Statistics and Optimization each of which has made influential contributions to the other field as well. The planning is still ongoing, and hence, the list of speakers will grow some.

If you can’t make it to North Carolina next February, still mark the date since the talks will be broadcasted via the net and (if tech works out) you may even participate in the Q&A sessions after the talks via your computer.

Last week Christoph Brauer, Andreas Tillmann and myself uploaded the paper A Primal-Dual Homotopy Algorithm for ${\ell_{1}}$-Minimization with ${\ell_{\infty}}$-Constraints to the arXiv (and we missed being the first ever arXiv-paper with a non-trivial five-digit identifier by twenty-something papers…). This paper specifically deals with the optimization problem

$\displaystyle \begin{array}{rcl} \min_{x}\|x\|_{1}\quad\text{s.t.}\quad \|Ax-b\|_{\infty}\leq \delta \end{array}$

where ${A}$ and ${b}$ are a real matrix and vector, respecively, of compatible size. While the related problem with ${\ell_{2}}$ constraint has been addressed quite often (and the penalized problem ${\min_{x}\|x\|_{1} + \tfrac1{2\lambda}\|Ax-b\|_{2}^{2}}$ is even more popular) there is not much code around to solve this specific problem. Obvious candidates are

• Linear optimization: The problem can be recast as a linear program: The constraint is basically linear already (rewriting it with help of the ones vector ${\mathbf{1}}$ as ${Ax\leq \delta\mathbf{1}+b}$, ${-Ax\leq \delta\mathbf{1}-b}$) and for the objective one can, for example, perform a variable split ${x = x^{+}-x^{-}}$, ${x^{-},x^{+}\geq 0}$ and then write ${\|x\|_{1} = \mathbf{1}^{T}x^{+}+ \mathbf{1}^{T}x^{-}}$.
• Splitting methods: The problem is convex problem of the form ${\min_{x}F(x) + G(Ax)}$ with

$\displaystyle \begin{array}{rcl} F(x) & = & \|x\|_{1}\\ G(y) & = & \begin{cases} 0 & \|y-b\|\leq\delta\\ \infty & \text{else.} \end{cases} \end{array}$

and hence, several methods for these kind of problem are available, such as the alternating direction method of multipliers or the Chambolle-Pock algorithm.

The formulation as linear program has the advantage that one can choose among a lot of highly sophisticated software tools and the second has the advantage that the methods are very easy to code, usually in just a few lines. Drawbacks are, that both methods do exploit too much specific structure of the problem.

Application of the problem with ${\ell_{\infty}}$ constraint are, for example:

• The Dantzig selector, a statistical estimation technique, were the problem is

$\displaystyle \begin{array}{rcl} \min_{x}\|x\|_{1}\quad\text{s.t.}\quad \|A^{T}(Ax-b)\|_{\infty}\leq\delta. \end{array}$

• Sparse dequantization as elaborated here by Jacques, Hammond and Fadili and applied here to de-quantizaton of speech signals by Christoph, Timo Gerkmann and myself.

We wanted to see if one of the most efficient methods known for sparse reconstruction with ${\ell_{2}}$ penalty, namely the homotopy method, can be adapted to this case. The homotopy method for ${\min_{x}\|x\|_{1} + \tfrac1{2\lambda}\|Ax-b\|_{2}^{2}}$ builds on the observation that the solution for ${\lambda\geq \|A^{T}b\|_{\infty}}$ is zero and that the set of solutions ${x_{\lambda}}$, parameterized by the parameter ${\lambda}$, is piecewise linear. Hence, on can start from ${\lambda_{0}= \|A^{T}b\|_{\infty}}$, calculate which direction to go, how far the breakpoint is away, go there and start over. I’ve blogged on the homotopy method here already and there you’ll find some links to great software packages, but also the fact that there can be up to exponentially many breakpoints. However, in practice the homotopy method is usually blazingly fast and with some care, can be made numerically stable and accurate, see, e.g. our extensive study here (and here is the optimization online preprint).

The problem with ${\ell_{\infty}}$ constraint seems similar, especially it is clear that for ${\delta = \|b\|_{\infty}}$, ${x=0}$ is a solution. It is also not so difficult to see that there is a piecewise linear path of solutions ${x_{\delta}}$. What is not so clear is, how it can be computed. It turned out, that in this case the whole truth can be seen when the problem is viewed from a primal-dual viewpoint. The associated dual problem is

$\displaystyle \begin{array}{rcl} \max_{y}\ -b^{T}y - \delta\|y\|_{1}\quad\text{s.t.}\quad\|A^{T}y\|_{\infty}\leq\infty \end{array}$

and a pair ${(x^{*},y^{*})}$ is primal and dual optimal if and only if

$\displaystyle \begin{array}{rcl} -A^{T}y^{*}&\in&\text{Sign}(x^{*})\\ Ax^{*}-b & \in & \delta\text{Sign}(y^{*}) \end{array}$

(where ${\text{Sign}}$ denotes the sign function, multivalued at zero, giving ${[-1,1]}$ there). One can note some things from the primal-dual optimality system:

• You may find a direction ${d}$ such that ${(x^{*}+td,y^{*})}$ stays primal-dual optimal for the constraint ${\leq\delta-t}$ for small ${t}$,
• for a fixed primal optimal ${x_{\delta}}$ there may be several dual optimal ${y_{\delta}}$.

On the other hand, it is not that clear which of the probably many dual optimal ${y_{\delta}}$ allows to find a new direction ${d}$ such that ${x_{\delta}+td}$ with stay primal optimal when reducing ${\delta}$. In fact, it turned out that, at a breakpoint, a new dual variable needs to be found to allow for the next jump in the primal variable. So, the solution path is piecewise linear in the primal variable, but piecewise constant in the dual variable (a situation similar to the adaptive inverse scale space method).

What we found is, that some adapted theorem of the alternative (a.k.a. Farkas’ Lemma) allows to calculate the next dual optimal ${y}$ such that a jump in ${x}$ will be possible.

What is more, is that the calculation of a new primal or dual optimal point amounts to solving a linear program (in contrast to a linear system for ${\ell_{2}}$ homotopy). Hence, the trick of up- and downdating a suitable factorization of a suitable matrix to speed up computation does not work. However, one can somehow leverage the special structure of the problem and use a tailored active set method to progress through the path. Our numerical tests indicated is that the resulting method, which we termed ${\ell_{1}}$-Houdini, is able to solve moderately large problems faster than a commercial LP-solver (while also not only solving the given problem, but calculating the whole solution path on the fly) as can be seen from this table from the paper:

The code of $\ell_1$-Houdini is on Christoph’s homepage, you may also reproduce the data in the above table with your own hardware.

Today I gave a talk at STOR colloquium at the University of North Carolina (UNC). I spoke about the Master’s thesis of Lars Mescheder in which he developed probabilistic models for image denoising. Among other things, he proposed a Gaussian scale mixture model as an image prior and developed several methods to infer information from the respective posterior. One curios result was that the Perona-Malik diffusivity (and variants thereof) popped up in basically every algorithm he derived. It’s not only that the general idea to have an edge dependent diffusion coefficient in a PDE or, formally equivalent, a concave penalty for the magnitude of the gradient in a variational formulation, but it also turned out that the very diffusion coefficient ${\frac1{1+|\nabla u|^{2}}}$ appeared exactly in this form.

Besides the talk I got the chance to chat with many interesting people. I’d like to mention a chat about a background story on Perona-Malik which I find quite interesting:

Steven Pizer, Kenan Professor for computer science at UNC and active in a lot of fields for several decades, and in particular a driving for in the early days of digital image analysis, told that the very idea of the non-linear, gradient-magnitude dependent diffusion coefficient actually dates back to works of neuroscientists Stephen Grossberg and that he moderated a discussion between Stephan Grossberg, Pietro Perona and Jitendra Malik in which the agreed that the method should be attributed as “Grossberg-Perona-Malik” method. I could not find any more hints for this story anywhere else, but given the mathematical, psychological and biological background of Grossberg and a look at Grossbergs list of publications in the 1980’s and earlier make this story appear more than plausible.

It may also well be, that this story is just another instance of the effect, that good ideas usually pop up at more than one place…

Yesterday I uploaded the paper “Linear convergence of the Randomized Sparse Kaczmarz Method” by Frank Schöpfer and myself to the arXiv.

Recall that the Kaczmarz method for linear systems

$\displaystyle \begin{array}{rcl} Ax&=&b \end{array}$

iterates

$\displaystyle \begin{array}{rcl} x^{k+1} &=& x^{k} - \frac{\langle a_{i},x_{k}\rangle-b_{i}}{\|a_{i}\|^{2}}a_{i} \end{array}$

where ${a_{i}}$ is the ${i}$-th row of ${A}$, ${b_{i}}$ is the ${i}$th entry of ${b}$ and the index ${i}$ is chosen according to some rule. We could, e.g. choose the rows in a cyclic manner, i.e. starting with the first row, proceeding to the next row and once we came to the last row we would start over from the first again. It is known (and probably proved by Kaczmarz himself) that the method converges to a solution of ${Ax=b}$ whenever the system has a solution. Moreover, it easy to see that we converge to the minimum norm solution in case of underdetermined systems when the method is initialized with zero. This is due to the fact that the whole iteration takes place in the range space of ${A^{T}}$.

In this and this paper we proposed a simple modification of the Kaczmarz method, that makes it converge to sparse solutions. The modification is simply

$\displaystyle \begin{array}{rcl} z^{k+1} & = & z^{k} - \frac{\langle a_{i},x_{k}\rangle-b_{i}}{\|a_{i}\|^{2}}a_{i}\\ x^{k+1}& = & S_{\lambda}(z^{k+1}) \end{array}$

where ${S_{\lambda}(x) = \max(|x|-\lambda,0)\text{sign}(x)}$ is the soft thresholding function. In this paper we proved that this variant converges, when initialized with zero and for a consistent system, to the solution of

$\displaystyle \begin{array}{rcl} \min_{x}\lambda\|x\|_{1} + \tfrac12\|x\|_{2}^{2},\quad\text{s.t.}\quad Ax=b. \end{array}$

For not too small values of ${\lambda}$ this is indeed a sparse solution of ${Ax=b}$ and Frank also proved that there is a threshold such that for ${\lambda}$ larger than this threshold the solution is also the minimum ${\ell^{1}}$ solution.

In general, convergence rates for the Kaczmarz method (and its sparse variant) are hard to prove. To convince oneself of this fact note that the convergence speed can drastically change if the rows of the system are reordered. The situation changes if one uses a randomization of the sparse Kaczmarz method where, in each iteration, a row is chose at random. Strohmer and Vershynin proved that this randomization leads to linear convergence rate. In the above mentioned paper we were able to prove the same result for the randomized sparse Kaczmarz method. While this sounds like an obvious generalization, the methods we use are totally different. While the linear convergence of the randomized Kaczmarz method can be proven in a few lines(well, probably one page) using only very basic tools, we need, among other things, quite intricate error bounds for Bregman projections w.r.t. piecewise linear-quadratic convex functions.

In fact, the linear convergence can be observed in practice and we illustrate the usefulness of the randomization and also the “sparsification” on some examples in the paper. For example the following figure shows the decay of the residual for the the randomized Kaczmarz method (black), the randomized sparse Kaczmarz method (red) and the randomized sparse Kaczmarz method with exact steps (green) which I did not explain here.

More details are in the paper…

This is the follow up to this post on the completion of the space of Radon measures with respect to transport norm. In the that post we have seen that

$\displaystyle \mathrm{KR}_0(X)^*\cong \mathrm{Lip}_0(X),$

i.e. that the completion of the Radon measure with zero total mass with respect to he dual Lipschitz norm

$\displaystyle \|\mu\|_{\mathrm{Lip}_0}^* = \sup\{\int f{\mathrm d}\mu\ :\ \mathrm{Lip}(f)\leq 1,\ f(e)=0\}$

where ${e\in X}$ is some base point in the metric space ${X}$.

Recall that on a compact metric space ${(X,d)}$ we have ${\mathfrak{M}(X)}$, the space of Radon measures on ${X}$. The Kantorovich-Rubinstein norm for measure is defined by

$\displaystyle \|\mu\|_{\mathrm{KR}} = \sup\{\int f{\mathrm d} \mu\ :\ \mathrm{Lip}(f)\leq 1,\ \|f\|_\infty\leq 1\}.$

Theorem 1 For a compact metric space ${(X,d)}$ it holds that

$\displaystyle \mathrm{KR}(X)^* \cong \mathrm{Lip}(X).$

Proof: We give an explicit identification for ${\mathrm{KR}(X)^*}$ as follows:

1. Define a Lipschitz function from an element of ${\mathrm{KR}(X)^*}$: For every ${\phi\in\mathrm{KR}(X)^*}$ and ${x\in X}$ we set

$\displaystyle (T\phi)(x) = \phi(\delta_x).$

Now we check that by linearity and continuity of ${\phi}$ that for any ${x\in X}$ it holds that

$\displaystyle |(T\phi)(x)| = |\phi(\delta_x)|\leq \|\phi\|\|\delta_x\|_{\mathrm{KR}} = \|\phi\|.$

This shows that ${T\phi:X\rightarrow {\mathbb R}}$ is a bounded function. Similarly for all ${x,y\in X}$ we have

$\displaystyle |(T\phi)(x)-(T\phi)(y)| = |\phi(\delta_x-\delta_y)|\leq \|\phi\|\|\delta_x-\delta_y\|_{\mathrm{KR}}\leq \|\phi\|\min(2,d(x,y))\leq \|\phi\|d(x,y).$

This shows that ${(T\phi)}$ is actually Lipschitz continuous, and moreover, that ${T:\mathrm{KR}(X)^*\rightarrow\mathrm{Lip}(X)}$ is continuous with norm ${\|T\|\leq 1}$.

2. Define an element in ${\mathrm{KR}(X)^*}$ from a Lipschitz function: We just set for ${f\in\mathrm{Lip}(X)}$ and ${\mu\in\mathfrak{M}(X)}$

$\displaystyle (Sf)(\mu) = \int_Xf{\mathrm d}\mu.$

By the definition of the ${\mathrm{KR}}$-norm we have

$\displaystyle |(Sf)(\mu)\leq \|f\|_{\mathrm{Lip}}\|\mu\|_{\mathrm{KR}},$

which shows that ${S(f)}$ can be extended to a continuous and linear functional ${S:\mathrm{KR}(X)\rightarrow{\mathbb R}}$, i.e. ${Sf\in\mathrm{KR}(X)^*}$, and we also have that ${\|S\|\leq 1}$.

3. Check that ${S}$ and ${T}$ invert each other: Finally we check that ${T}$ and ${S}$ are inverses of each other. We begin with ${TS}$: For ${x\in X}$ and ${f\in\mathrm{Lip}(X)}$ observe that

$\displaystyle T(Sf)(x) = Sf(\delta_x) = \int_X f{\mathrm d}\delta_x = f(x)$

i.e. ${TS}$ is the identity on ${\mathrm{Lip}(X)}$. Conversely, for any ${\phi\in\mathrm{KR}(X)^*}$ and ${x\in X}$ we check

$\displaystyle S(T\phi)(\delta_x) = \int_X T\phi{\mathrm d}\delta_x = (T\phi)(x_0) = \phi(\delta_x).$

By density of the Dirac measures in ${\mathrm{KR}(X)}$ we conclude that indeed ${ST\phi = \phi}$, i.e. ${ST}$ is the identity on ${\mathrm{KR}(X)^*}$.

$\Box$

In this post, the first of two related ones, I describe the closure of the space of Radon measures with zero total mass with the respect to the Kantorovich-Rubinstein norm. Somebody asked me what this was after some talk – I did not know the answer and couldn’t figure it out, so I asked this over on mathoverflow, got the right pointers, and here’s the post. A great deal of the post comes from the book Lipschitz Algebras by Nik Weaver (the very same Nik Weaver who answered over at MO) but beware that the notation is slightly different in this post (especially the usage of ${\mathrm{KR}}$).

1. The Kantorovich-Rubinstein norm on the space of Radon measures

Let ${(X,d)}$ be a metric space and denote by ${\mathfrak{M}(X)}$ the space of Radon measures on ${X}$. We will work with compact ${X}$ throughout this post, although in many places compactness is not needed. Following Bourbaki we say that the space ${\mathfrak{M}(X)}$ is defined as the dual of the space ${C(X)}$ of continuous functions on ${X}$, equipped with the supremum norm. The dual pairing of a measure ${\mu}$ and a continuous function ${f}$ is then

$\displaystyle \langle \mu,f\rangle = \int_X f{\mathrm d} \mu$

(and the integral can also be understood in terms of standard measure theory where a measure is defined by a set function of a ${\sigma}$-algebra). The norm of ${\mathfrak{M}(X)}$ is

$\displaystyle \|\mu\|_{\mathfrak{M}} = \sup\{\int_X f{\mathrm d}\mu\ :\ \|f\|_\infty\leq 1\}$

and is called variation norm or Radon norm (sometimes also total variation, not to be confused with the seminorm on the space of functions with bounded variation with a similar name). The variation norm captures some of the topological structure of ${X}$ but does not take into account the metric structure of ${X}$ (in fact, in can be defined on compact topological spaces ${X}$).

One can define further norms and semi-norms on ${\mathfrak{M}(X)}$ and subspaces thereof and two of them that are particularly suited to capture the underlying metric structure of ${X}$ are:

1. The dual Lipschitz-norm

$\displaystyle \|\mu\|_{\mathrm{Lip}_0}^* = \sup\{\int f{\mathrm d} \mu\ :\ \mathrm{Lip}(f)\leq 1\}$

gives finite values on the closed subspace ${\mathfrak{M}_0(X)}$ of measures with total mass equal to zero, i.e. ${\mu(X)=0}$.

2. The Kantorovich-Rubinstein norm

$\displaystyle \|\mu\|_{\mathrm{KR}} = \sup\{\int f{\mathrm d} \mu\ :\ \mathrm{Lip}(f)\leq 1,\ \|f\|_\infty\leq 1\}$

gives finite values for all Radon measures.

Actually, it is not totally trivial to show that these are indeed norm but also not too hard (you may look up this fact for ${\mathrm{Lip}_0^*}$ in Proposition 2.3.2 in the above mentioned book Lipschitz Algebras and the case of ${\mathrm{KR}}$ is basically similar).

Obviously ${\|\mu\|_{\mathrm{KR}}\leq \|\mu\|_{\mathfrak{M}}}$. The reversed inequality is not true and also does not even hold up to a constant if ${X}$ has a non-constant convergent sequence. This can be seen by observing that for ${x\neq y}$ one has ${\|\delta_x - \delta_y\|_{\mathfrak{M}} = 2}$ while ${\|\delta_x - \delta_y\|_{\mathrm{KR}} = \|\delta_x-\delta_y\|_{\mathrm{Lip}_0}^* \leq d(x,y)}$.

While ${(\mathfrak{M}(X),\|\cdot\|_{\mathfrak{M}})}$ is a Banach space (by its very definition as a dual space) it is not a priori clear if ${(\mathfrak{M}(X),\|\cdot\|_{\mathrm{KR}})}$ is also a Banach space. In fact, it is clear that it is not:

Theorem 1 The space ${(\mathfrak{M}(X),\|\cdot\|_{\mathrm{KR}})}$ is complete if and only if ${X}$ is finite, similarly for ${(\mathfrak{M}_0(X),\|\cdot\|_{\mathrm{Lip}_0}^*)}$.

Proof: Let ${X}$ be finite with ${n}$ elements, say. Then ${\mathfrak{M}(X)}$ is naturally identified with ${{\mathbb R}^n}$ and there all norm are equivalent, which shows completeness with all norms. Now let ${X}$ be infinite. Since ${X}$ is also compact, there are, for every ${\epsilon>0}$ two points ${x,y\in X}$ with ${d(x,y)<\epsilon}$. For these points we’ve already observed that ${\|\delta_x - \delta_y\|_{\mathrm{KR}} <\epsilon}$ while ${\|\delta_x - \delta_y\|_{\mathfrak{M}} = 2}$. This shows that the identity ${(\mathfrak{M}(X),\|\cdot\|_{\mathfrak{M}}) \rightarrow (\mathfrak{M}(X),\|\cdot\|_{\mathrm{KR}})}$ is a continuous linear bijection but does not have a continuous inverse. So the open mapping theorem shows that ${(\mathfrak{M}(X),\|\cdot\|_{\mathrm{KR}})}$ can not be complete. The argument for ${(\mathfrak{M}_0(X),\|\cdot\|_{\mathrm{Lip}_0}^*)}$ is the same. $\Box$

This raises two related questions:

1. What is the completion of the space of Radon measures ${\mathfrak{M}(X)}$ with respect to Kantorovich-Rubinstein norm ${\|\cdot\|_\mathrm{KR}}$?
2. What is the completion of the space of Radon measures with zero mass ${\mathfrak{M}_0(X)}$ with respect to the dual Lipschitz norm ${\|\cdot\|_{\mathrm{Lip}_0}^*}$?

In this post we’ll see the answer to the second question.

Note that the proof above indicates some important phenomenon: Diracs with opposite signs that are getting closer seem to be a particular thing here. Indeed, if ${x_k}$ and ${y_k}$ are two sequences in ${X}$ with ${x_k-y_k\rightarrow 0}$ then for

$\displaystyle \mu_k = \delta_{x_k} - \delta_{y_k}$

it holds that ${\|\mu_k\|_{\mathrm{KR}}\leq d(x_k,y_k)\rightarrow 0}$ (similarly for ${\|\cdot\|_{\mathrm{Lip}_0}^*}$) but ${\mu_k}$ is not converging to zero with respect to ${\|\cdot\|_{\mathfrak{M}}}$. It holds that ${\mu_k}$ does converge weakly to zero if ${x_k}$ (and hence ${y_k}$) does converge to something, but in the case that ${x_k}$ (and hence ${y_k}$) does not converge, ${\mu_k}$ does not even converge weakly.

We’ll come to a description of the completion in a minute but first we introduce several notions that will be used.

2. Spaces of Lipschitz functions

For a compact metric space ${(X,d)}$ we say that ${f:X\rightarrow{\mathbb R}}$ is Lipschitz continuous if it has finite Lipschitz-constant

$\displaystyle \mathrm{Lip}(f) = \sup_{x\neq y}\frac{|f(x)-f(y)|}{d(x,y)}.$

The theorem of Arzela-Ascoli shows that the space ${\mathrm{Lip}(X)}$ of all Lipschitz continuous functions is a Banach space, when equipped with the norm

$\displaystyle \|f\|_{\mathrm{Lip}} = \max(\|f\|_\infty,\mathrm{Lip}(f)).$

If we additionally mark a distinguished base point ${e\in X}$ we can also consider the space

$\displaystyle \mathrm{Lip}_0(X) = \{f\in \mathrm{Lip}(X)\ :\ f(e)=0\}.$

On this space the Lipschitz constant itself is a norm (and not only a semi-norm) and we denote it by

$\displaystyle \|f\|_{\mathrm{Lip}_0} = \mathrm{Lip}(f).$

In fact the spaces ${\mathrm{Lip}}$ and ${\mathrm{Lip}_0}$ are really closely related since in the case of metric spaces of finite diameter one space can be turned into the other and vice versa. The idea is to enhance a given metric space ${X}$ with an additional base point that we put in some way “right in the middle of ${X}$” or, maybe more accurately, “at equal distance of anything” as follows:

Theorem 2 For a compact metric space ${(X,d)}$ (with no special base point) and diameter ${\mathrm{diam}(X) = D}$ denote ${X^+ := X\cup\{e\}}$ for some new point ${e}$ and let ${d^+:X^+\times X^+\rightarrow{\mathbb R}}$ be equal to ${d}$ on ${X}$ and satisfy ${d^+(x,e)=D/2}$ for all ${x\in X}$. Then:

1. ${(X^+,d^+)}$ is a compact metric space with diameter ${\mathrm{diam}(X^+)=D}$.
2. For ${f\in\mathrm{Lip}(X)}$ define ${f^+\in \mathrm{Lip}_0(X^+)}$ by ${f^+(x) = f(x)}$ for ${x\in X}$ and ${f^+(e)=0}$. Then the mapping ${T:f\mapsto f^+}$ is a homeomorphism from ${\mathrm{Lip}(X)}$ to ${\mathrm{Lip}_0(X^+)}$.

Proof:

1. The only thing to check for ${d^+}$ to be a metric is that ${d^+}$ also fulfills the triangle inequality: but since for ${x,y\in X}$ it holds that ${d^+(x,e)+d^+(e,y) = D/2 + D/2 = D\geq d(x,y) = d^+(x,y)}$ this is fulfilled.
2. Let ${f\in \mathrm{Lip}(X)}$. Then

$\displaystyle \frac{|f^+(x)-f^+(e)|}{d^+(x,e)} = \frac{|f(p)-0|}{D/2}\leq \frac2D\|f\|_\infty \leq \frac2D\|f\|_{\mathrm{Lip}}$

and consequently

$\displaystyle \mathrm{Lip}(f^+) \leq\tfrac2D\|f\|_{\mathrm{Lip}}.$

On the other hand for all ${x\in X}$

$\displaystyle |f(x)| = \frac{D}{2}\frac{|f(x)-0|}{D/2}\leq \frac{D}2\mathrm{Lip}(f^+)$

which shows ${\|f\|_\infty\leq\tfrac{D}2\mathrm{Lip}(f^+)}$. Obviously we have ${\mathrm{Lip}(f)\leq \mathrm{Lip}(f^+)}$, which finishes the argument.

$\Box$

3. The Arens-Eells space

That spaces of Lipschitz functions should play a role for the ${\mathrm{KR}}$– and ${\mathrm{Lip}_0^*}$-norm seems pretty obvious—Lipschitz functions are used it the very definition of the norm. Now we come to some space for which the relation with the ${\mathrm{KR}}$– and ${\mathrm{Lip}_0^*}$-norm may not be that obvious but which is indeed very much related.

Definition 3 For a compact metric space ${(X,d)}$ we define a molecule as a function ${m:X\rightarrow{\mathbb R}}$ with finite support with ${\sum_{x\in X} m(x)=0}$.

Obviously, the set of molecules is a linear space.

Any molecule can be identified with a measure on ${X}$ as follows: For a molecule ${m}$ define a measure

$\displaystyle \mu_m = \sum_{x\in X} m(x)\delta_{x}.$

In other words: At every point ${x}$ of the support of ${m}$ we put a delta with weight ${m(x)}$.

Since the support of ${m}$ is finite, the sum is actually finite and in addition, this measure ${\mu_m}$ has zero total measure since

$\displaystyle \mu_m(X) = \sum_{x\in X}m(x)\delta_{x}(X) = \sum_{x\in X}m(x)=0.$

So one may say that the linear space of all molecules is naturally identified with the linear space of all finite linear combinations of deltas with zero total mass.

On the space of molecules we define the so-called Arens-Eells norm. To do so we introduce elementary molecules consisting having a two-element support with values ${\pm 1}$ on them, namely for ${x,y\in X}$

$\displaystyle m_{xy}(z) = \begin{cases} 1, & z=x\\ -1, & z=y\\ 0, & \text{else.} \end{cases}$

Note that any molecule ${m}$ can be written as finite linear combination of elementary molecules. (To do so you may order the support of some given ${m}$ as ${\{x_1,\dots x_n\}}$, start with ${m(x_1)m_{x_1 x_2}}$, then form ${m^1 = m - m(x_1)m_{x_1x_2}}$, observe that ${m^1}$ has a support of size ${n-1}$, and proceed iteratively as started). Importantly, the decomposition of a molecule into elementary molecules is not unique.

Definition 4 For a molecule ${m}$ on ${X}$ we set

$\displaystyle \|m\|_{\AE} = \inf\{\sum_{i=1}^n |a_i|d(x_i,y_i)\ :\ m = \sum_{i=1}^n a_i m_{x_iy_i}\}$

where the infimum is taken over all decomposition of ${m}$ into elementary molecules.

Actually, it is not clear from the start that ${\|\cdot\|_{\AE}}$ is indeed a norm. While homogeneity and the triangle inequality are more or less straight forward, definiteness is not clear from the beginning. We will see that in a minute, but it makes sense to first introduce the completion of the space of molecules with respect to the (semi-)norm ${\|\cdot\|_{\AE}}$ since its dual is indeed familiar. Since we do not know yet that ${\|\cdot\|_{\AE}}$ is a norm, we have to be careful with the completion:

Definition 5 The Arens-Eells space ${\AE(X)}$ on a compact metric space is the completion of the space of molecules with respect to the seminorm ${\|\cdot\|_{\AE}}$ modulo elements with zero norm.

The great thing is that the dual of ${\AE(X)}$ is a well known space:

Theorem 6 For a compact metric space ${(X,d)}$ with some base point ${e}$ it holds that

$\displaystyle \AE(X)^* \cong \mathrm{Lip}_0(X).$

On bounded sets, weak* convergence in ${\mathrm{Lip}_0(X)}$ agrees with pointwise convergence.

Proof: We represent elements of ${\AE(X)^*}$ as Lipschitz functions by the mapping ${T_1:\AE(X)^*\rightarrow \mathrm{Lip}_0(X)}$ be setting for ${\phi\in \AE(X)^*}$ and ${x\in X}$

$\displaystyle (T_1\phi)(x) = \phi(m_{xe})$

Note that ${(T_1\phi)(e) = \phi(m_{ee}) = 0}$. By the definition of ${\|\cdot\|_{\AE}}$ it holds that ${\|m_{xy}\|_{\AE}\leq d(x,y)}$ and hence,

$\displaystyle |(T_1\phi)(x)-(T_1\phi)(y)| = |\phi(m_{xe} - m_{ye})| = |\phi(m_{xy})|\leq \|\phi\|d(x,y)$

which shows that ${\mathrm{Lip}(T_1\phi)\leq \|\phi\|}$ and we conclude that ${T_1\phi\in\mathrm{Lip}_0(X)}$ and also that ${\|T_1\|\leq 1}$. Conversely, we now define a map that associates to every Lipschitz function vanishing at the base point an element in ${\AE(X)^*}$: Define ${T_2:\mathrm{Lip}_0(X)\rightarrow \AE(X)^*}$ for any molecule ${m}$ and any ${f\in \mathrm{Lip}_0(X)}$ by

$\displaystyle (T_2f)(m) = \sum_{x\in X} f(x)m(x).$

Note that for a decomposed molecule ${m = \sum_{i=1}^n a_i m_{x_iy_i}}$ we have

$\displaystyle \begin{array}{rcl} |(T_2 f)(m)| & =&|(T_2 f)(\sum_{i=1}^n a_i m_{x_iy_i})\\ & \leq & \sum_{i=1}^n |a_i| |(T_2 f)(m_{x_iy_i})|\\ & = &\sum_{i=1}^n |a_i| |f(x_i) - f(y_i)|\\ & \leq & \mathrm{Lip}(f) \sum_{i=1}^n |a_i| d(x_i,y_i). \end{array}$

Taking the infimum over all decompositions, we conclude ${|(T_2 f)(m)|\leq \mathrm{Lip}(f)\|m\|_{\AE}}$ showing that ${T_2 f\in \AE(X)^*}$ and ${\|T_2f\|\leq\mathrm{Lip}(f)}$, i.e, ${\|T_2\|\leq 1}$. Now let’s look at ${T_1T_2}$: For ${x\in X}$ and ${f\in\mathrm{Lip}_0(X)}$ it holds that ${T_1(T_2 f)(x) = (T_2 f)(m_{xe}) = f(x) - f(e) = f(x)}$, i.e. ${T_1T_2 = I}$. Also for ${\phi\in \AE(X)^*}$ and a molecule ${m}$: ${T_2(T_1\phi)(m) = \sum_{x\in X}(T_1\phi)(x)m(x) = \sum_{x\in X}\phi(m_{xe})m(x) = \phi(\sum_{x\in X}m(x)m_{xe})= \phi(m)}$. So ${T_2}$ and ${T_1}$ are indeed inverses and hence provide identifications of ${\AE(X)^*}$ and ${\mathrm{Lip}_0(X)}$. Now consider a sequence of function ${f_n\in \mathrm{Lip}_0(X)}$ that converges weakly* to ${f}$. This says that (in the notation above) for any molecule ${m}$ it holds that ${(T_2f_n)(m)\rightarrow (T_2 f)(m)}$. Particularly for the molecules ${m_{xe}}$ we have

$\displaystyle (T_2 f_n)(m_{xe}) = \sum_{y\in X} f_n(y)m_{xe}(y) = f_n(y).$

It follows that ${f_n(x)\rightarrow f(x)}$ for all ${x}$. This shows that weak* convergence implies pointwise convergence. If, conversely, ${f_n}$ converges pointwise, we conclude that ${(T_2 f_n)(m_{xe}) \rightarrow (T_2 f)(m_{xe})}$ for all ${m_{xe}}$. By definition, these are dense in ${\AE(X)}$ and since we assume that the ${f_n}$ are bounded by assumption, this implies weak* convergence. $\Box$

The above proof also shows that ${\|\cdot\|_{\AE}}$ is indeed a norm and moreover, that it can be written as

$\displaystyle \|m\|_{\AE} = \sup\{\sum_{x\in X} f(x)m(x)\ :\ f(e)=0,\ \mathrm{Lip}(f)\leq 1\}$

(also, by Arzela-Ascoli, there is a Lipschitz function ${f}$ with ${\mathrm{Lip}(f)\leq 1}$ and ${f(e)=0}$ where the supremum is attained, but we will not need this). This shows the following fact:

Corollary 7 For any molecule ${m}$ on a compact metric space ${X}$ with base point ${e}$ and the corresponding measure ${\mu_m}$ it follows that

$\displaystyle \|m\|_{\AE} = \|\mu_m\|_{\mathrm{KR}}.$

4. The completion of ${\mathfrak{M}_0(X)}$ under the dual Lipschitz norm

Now we come to the answer of the second question posed in the beginning. We define by ${\mathrm{KR}_0(X)}$ the completion of ${\mathfrak{M}_0(X)}$ (the space of Radon measures with zero total mass) with respect to ${\|\cdot\|_{\mathrm{Lip}_0}^*}$ and aim at a characterization of this space.

Theorem 8 It holds that ${\AE(X)\cong \mathrm{KR}_0(X)}$ isometrically.

Proof: We first show that any measure ${\mu}$ on ${X}$ with zero total mass can be approximated in the ${\mathrm{Lip}_0^*}$-norm by measures ${\mu_m}$ defined my molecules ${m}$. So let ${\epsilon>0}$ and ${\mu\in \mathrm{KR}_0(X)}$. We set ${\epsilon' = \epsilon/\|\mu\|_{\mathfrak{M}}}$. Now we cover ${X}$ by an ${\epsilon'}$-net, that is, we take ${x_1,\dots,x_n}$ such that balls ${B_k}$ around ${x_k}$ with radius ${\epsilon'}$ cover ${X}$. From this cover we define by ${V_1=U_1}$, ${V_k = U_k\setminus(U_1\cup\cdots\cup U_{k-1})}$ a disjoint cover of ${X}$ (i.e. a partition). Now we define a molecule ${m}$ by

$\displaystyle m(x) = \left\{ \begin{array}{ll} \mu(V_k), & \text{if}\ x=x_k\\ 0, & \text{else.} \end{array} \right.$

Clearly ${m}$ is a molecule (its support is the ${n}$ point ${x_k}$ and the sum ${\sum_{x}m(x)}$ equals ${\mu(X)=0}$). To see that ${\mu_m}$ approximates ${\mu}$ in the ${\mathrm{Lip}_0^*}$-norm we start as follows: For any function ${f}$ with ${\mathrm{Lip}(f)\leq 1}$ we see

$\displaystyle \begin{array}{rcl} |\int_{V_k}f{\mathrm d}(\mu-\mu_m)| &=& |\int_{V_k} f{\mathrm d}\mu - \int_{V_k}f{\mathrm d}\mu_m|\\ &=& |\int_{V_k} f{\mathrm d}\mu - f(x_k)\mu(V_k)|\\ &=& |\int_{V_k} (f-f(x_k)){\mathrm d}\mu|\\ &\leq& |\int_{V_k} \mathrm{Lip}(f)\epsilon'{\mathrm d}\mu| = \epsilon'\|\mu\|_{\mathfrak{M}(V_k)}. \end{array}$

Summing this over ${k}$ we get for every ${f}$ with ${\mathrm{Lip}(f)\leq 1}$

$\displaystyle |\int_X f{\mathrm d}(\mu-\mu_m)|\leq \epsilon'\|\mu\|_{\mathfrak{M}} = \epsilon.$

Taking the supremum over these ${f}$ gives ${\|\mu-\mu_m\|_{\mathrm{Lip}_0}^*\leq \epsilon}$ as desired. Corollary 7 shows that ${\|m|\|_{\AE}=\|\mu_m\|_{\mathrm{Lip}_0}^*}$ and from the above we see that these measures are dense in ${\mathrm{KR}_0(X)}$. Hence, the correspondence ${m\mapsto\mu_m}$ is an isometry on dense subsets of ${\AE(X)}$ and ${\mathrm{KR}_0(X)}$ which proves the assertion. $\Box$

Finally, we come to the announced description of the space ${\mathrm{KR}_0(X)}$ (defined as completion of ${(\mathfrak{M}_0(X),\|\cdot\|_{\mathrm{Lip}_0}^*)}$): It is equal to the Arens-Eells space and hence, its dual space is the space of Lipschitz functions rooted on a base point.

Corollary 9 For a compact metric space ${(X,d)}$ with basepoint it holds that

$\displaystyle \mathrm{KR}_0(X)^*\cong\mathrm{Lip}_0(X).$

To conclude this post, I just mention that the techniques used above are related to the Kuratowski embedding and the related Kuratowski–Wojdysławski embedding that embed metric spaces into certain Banach spaces.

Do you remember free time as a kid that you wasted with connecting dots? If you actually liked it, here’s some good news: There are dot-to-dot books for grown-ups! Most notable, there are the books by Thomas Pavitte with great pictures with 1000 dots.

So, these are some of the books

and here is some video:

Actually, it takes some time to connect 1000 dots; I need ten minutes or so, depending a bit an the picture.

For the fun of it, I coded some lines in MATLAB to connect the dots automatically. And since I am a lazy programmer, I did not bother to connect the dots in the manner that was prescribed by the artist but more efficiently:

1. Greedy paths

For the greedy path, we start at some randomly chosen dot and connect the dot where we are with the closest possible dot where we haven’t been already.

Here’s how this looks for one of Thomas’ pictures:

2. Shortest paths

The greedy path sometimes makes very large jumps (when it runs into some corner, using all the dots in the vicinity). This leads to some spurious large jumps in the picture. In used some simple heuristics to find some “locally shortest paths” through the thousand dots. (And “locally shortest” means that there are no two edges for which swapping improves the total lengths of the paths.) Actually, I started out with the goal to solve the travelling salesman problem over the thousand dots, i.e., to find the shortest path of all. Then it turned out that

1. Solving the travelling salesman problem is not that simple to solve – well it’s one of the best investigated NP-hard optimization problems and there is no doubt that it would take my laptop only little time to solve it with 1000 dots if fed with the right code.
2. The locally shortest path already looks quite appealing and I am not sure how the shortest path would look any different.

Here is the locally shortest path:

Oh, by the way: the image is a pug dog, the one that you can party see on this cover

Here are some more pictures (not by Thomas Pavitte). Middle is the greedy, right the locally shortest path:

This slideshow requires JavaScript.

Since the Wikipedia page of the travelling salesman problem contains a formulation as an integer linear program to solve it, I may give it a try in the future…