The problem of optimal transport of mass from one distribution to another can be stated in many forms. Here is the formulation going back to Kantorovich: We have two measurable sets ${\Omega_{1}}$ and ${\Omega_{2}}$, coming with two measures ${\mu_{1}}$ and ${\mu_{2}}$. We also have a function ${c:\Omega_{1}\times \Omega_{2}\rightarrow {\mathbb R}}$ which assigns a transport cost, i.e. ${c(x_{1},x_{2})}$ is the cost that it takes to carry one unit of mass from the point ${x_{1}\in\Omega_{1}}$ to the point ${x_{2}\in\Omega_{2}}$. What we want is a plan that says where the mass in ${\mu_{1}}$ should be placed in ${\Omega_{2}}$ (or vice versa). There are different ways to formulate this mathematically.

A simple way is to look for a map ${T:\Omega_{1}\rightarrow\Omega_{2}}$ which says that the mass at ${x_{1}}$ should be moved to ${T(x_{1})}$. While natural, there is a serious problem with this: What if not all mass at ${x_{1}}$ should go to the same point in ${\Omega_{2}}$? This happens in simple situations where all mass in ${\Omega_{1}}$ sits at just one point, but there are at least two different points in ${\Omega_{2}}$ where mass should end up. This is not going to work with a map ${T}$ as above. So the map ${T}$ is not flexible enough to model all kinds of transport we may need.

What we want is a way to distribute mass from one point in ${\Omega_{1}}$ to the whole set ${\Omega_{2}}$. This looks like we want maps ${\mathcal{T}}$ which map points in ${\Omega_{1}}$ to functions on ${\Omega_{2}}$, i.e. something like ${\mathcal{T}:\Omega_{1}\rightarrow (\Omega_{2}\rightarrow{\mathbb R})}$ where ${(\Omega_{2}\rightarrow{\mathbb R})}$ stands for some set of functions on ${\Omega_{2}}$. We can de-curry this function to some ${\tau:\Omega_{1}\times\Omega_{2}\rightarrow{\mathbb R}}$ by ${\tau(x_{1},x_{2}) = \mathcal{T}(x_{1})(x_{2})}$. That’s good in principle, but we still run into problems when the target mass distribution ${\mu_{2}}$ is singular in the sense that ${\Omega_{2}}$ is a “continuous” set and there are single points in ${\Omega_{2}}$ that carry some mass according to ${\mu_{2}}$. Since we are in the world of measure theory already, the way out suggests itself: Instead of a function ${\tau}$ on ${\Omega_{1}\times\Omega_{2}}$ we look for a measure ${\pi}$ on ${\Omega_{1}\times \Omega_{2}}$ as a transport plan.

The demand that we should move all of the mass in ${\mu_{1}}$ and produce exactly the distribution ${\mu_{2}}$ is formulated via marginal constraints. For simplicity we just write these constraints as

$\displaystyle \int_{\Omega_{2}}\pi\, d x_{2} = \mu_{1},\qquad \int_{\Omega_{1}}\pi\, d x_{1} = \mu_{2}$

(with the understanding that the first equation really means that for all continuous functions ${f:\Omega_{1}\rightarrow {\mathbb R}}$ it holds that ${\int_{\Omega_{1}\times \Omega_{2}} f(x_{1})\,d\pi(x_{1},x_{2}) = \int_{\Omega_{1}}f(x_{1})\,d\mu_{1}(x_{1})}$, and similarly for the second).

This leads us to the full transport problem

$\displaystyle \min_{\pi}\int_{\Omega_{1}\times \Omega_{2}}c(x_{1},x_{2})\,d\pi(x_{1},x_{2})\quad \text{s.t.}\quad \int_{\Omega_{2}}\pi\, d x_{2} = \mu_{1},\quad \int_{\Omega_{1}}\pi\, d x_{1} = \mu_{2}.$
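For intuition, here is how the marginal constraints look in a discrete setting, sketched in plain numpy (the two marginals and all variable names are my own toy choices): the product coupling ${\mu_{1}\otimes\mu_{2}}$ is always a feasible plan.

```python
import numpy as np

# two discrete marginals with the same total mass (here: one)
mu1 = np.array([0.2, 0.5, 0.3])
mu2 = np.array([0.4, 0.1, 0.25, 0.25])

# the product coupling pi(x1, x2) = mu1(x1) * mu2(x2) is always feasible
pi = np.outer(mu1, mu2)

# integrating out the second variable recovers mu1 ...
assert np.allclose(pi.sum(axis=1), mu1)
# ... and integrating out the first variable recovers mu2
assert np.allclose(pi.sum(axis=0), mu2)
```

Of course the product coupling is usually far from optimal; it just shows that the feasible set is nonempty.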

There is the following theorem which characterizes optimality of a plan and which is the topic of this post:

Theorem 1 (Fundamental theorem of optimal transport) Under some technicalities we can say that a plan ${\pi}$ which fulfills the marginal constraints is optimal if and only if one of the following equivalent conditions is satisfied:

1. The support ${\mathrm{supp}(\pi)}$ of ${\pi}$ is ${c}$-cyclically monotone.
2. There exists a ${c}$-concave function ${\phi}$ such that its ${c}$-superdifferential contains the support of ${\pi}$, i.e. ${\mathrm{supp}(\pi)\subset \partial^{c}\phi}$.

A few clarifications: The technicalities involve continuity, integrability, and boundedness conditions of ${c}$ and integrability conditions on the marginals. The full theorem can be found as Theorem 1.13 in A user’s guide to optimal transport by Ambrosio and Gigli. Also the notions ${c}$-cyclically monotone, ${c}$-concave and ${c}$-superdifferential probably need explanation. We start with a simpler notion: ${c}$-monotonicity:

Definition 2 A set ${\Gamma\subset\Omega_{1}\times\Omega_{2}}$ is ${c}$-monotone, if for all ${(x_{1},x_{2}),(x_{1}',x_{2}')\in\Gamma}$ it holds that

$\displaystyle c(x_{1},x_{2}) + c(x_{1}',x_{2}')\leq c(x_{1},x_{2}') + c(x_{1}',x_{2}).$

If you find it unclear what this has to do with monotonicity, look at this example:

Example 1 Let ${\Omega_{1},\Omega_{2}\subset{\mathbb R}^{d}}$ and let ${c(x_{1},x_{2}) = -\langle x_{1},x_{2}\rangle}$ be the negative of the usual scalar product. Then ${c}$-monotonicity is the condition that for all ${(x_{1},x_{2}),(x_{1}',x_{2}')\in\Gamma\subset\Omega_{1}\times\Omega_{2}}$ it holds that

$\displaystyle 0\leq \langle x_{1}-x_{1}',x_{2}-x_{2}'\rangle$

which may look more familiar. Indeed, when ${\Omega_{1}}$ and ${\Omega_{2}}$ are subsets of the real line, the above condition means that the set ${\Gamma}$ somehow “moves up in ${\Omega_{2}}$” if we “move right in ${\Omega_{1}}$”. So ${c}$-monotonicity for ${c(x_{1},x_{2}) = -\langle x_{1},x_{2}\rangle}$ is something like “monotonically increasing”. Similarly, for ${c(x_{1},x_{2}) = \langle x_{1},x_{2}\rangle}$, ${c}$-monotonicity means “monotonically decreasing”.

You may say that both ${c(x_{1},x_{2}) = \langle x_{1},x_{2}\rangle}$ and ${c(x_{1},x_{2}) = -\langle x_{1},x_{2}\rangle}$ are strange cost functions and I can’t argue with that. But consider this: ${c(x_{1},x_{2}) = |x_{1}-x_{2}|^{2}}$ (${|\,\cdot\,|}$ being the Euclidean norm) seems more natural, right? But if we have a transport plan ${\pi}$ for this ${c}$ and some marginals ${\mu_{1}}$ and ${\mu_{2}}$ we also have

$\displaystyle \begin{array}{rcl} \int_{\Omega_{1}\times \Omega_{2}}c(x_{1},x_{2})d\pi(x_{1},x_{2}) & = & \int_{\Omega_{1}\times \Omega_{2}}|x_{1}|^{2} d\pi(x_{1},x_{2})\\ &&\quad- 2\int_{\Omega_{1}\times \Omega_{2}}\langle x_{1},x_{2}\rangle d\pi(x_{1},x_{2})\\ && \qquad+ \int_{\Omega_{1}\times \Omega_{2}} |x_{2}|^{2}d\pi(x_{1},x_{2})\\ & = &\int_{\Omega_{1}}|x_{1}|^{2}d\mu_{1}(x_{1}) - 2\int_{\Omega_{1}\times \Omega_{2}}\langle x_{1},x_{2}\rangle d\pi(x_{1},x_{2}) + \int_{\Omega_{2}}|x_{2}|^{2}d\mu_{2}(x_{2}) \end{array}$

i.e., the transport cost for ${c(x_{1},x_{2}) = |x_{1}-x_{2}|^{2}}$ differs from twice the one for ${c(x_{1},x_{2}) = -\langle x_{1},x_{2}\rangle}$ only by a constant independent of ${\pi}$, so we may as well use the latter.
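For discrete marginals this ${\pi}$-independence is a one-line computation with numpy. The following sketch (my own setup: uniform marginals on random points) evaluates both costs for two different feasible plans and checks that the difference is the same constant:

```python
import numpy as np

rng = np.random.default_rng(0)

# points in R^2 and two uniform discrete marginals of total mass one
n = 4
x1 = rng.normal(size=(n, 2))   # support of mu_1
x2 = rng.normal(size=(n, 2))   # support of mu_2
mu1 = np.full(n, 1 / n)
mu2 = np.full(n, 1 / n)

sq_cost = ((x1[:, None, :] - x2[None, :, :]) ** 2).sum(axis=2)  # |x1 - x2|^2
ip_cost = -(x1[:, None, :] * x2[None, :, :]).sum(axis=2)        # -<x1, x2>

# two different feasible plans with the same marginals:
pi_a = np.outer(mu1, mu2)   # product coupling
pi_b = np.diag(mu1)         # diagonal coupling (feasible since mu1 == mu2)

# the quadratic cost minus twice the scalar-product cost is pi-independent
const_a = (sq_cost * pi_a).sum() - 2 * (ip_cost * pi_a).sum()
const_b = (sq_cost * pi_b).sum() - 2 * (ip_cost * pi_b).sum()
assert np.isclose(const_a, const_b)
```

The common constant is exactly ${\int|x_{1}|^{2}d\mu_{1} + \int|x_{2}|^{2}d\mu_{2}}$, so both costs have the same minimizers.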

The fundamental theorem of optimal transport uses the notion of ${c}$-cyclical monotonicity which is stronger than just ${c}$-monotonicity:

Definition 3 A set ${\Gamma\subset \Omega_{1}\times \Omega_{2}}$ is ${c}$-cyclically monotone, if for all ${(x_{1}^{i},x_{2}^{i})\in\Gamma}$, ${i=1,\dots,n}$ and all permutations ${\sigma}$ of ${\{1,\dots,n\}}$ it holds that

$\displaystyle \sum_{i=1}^{n}c(x_{1}^{i},x_{2}^{i}) \leq \sum_{i=1}^{n}c(x_{1}^{i},x_{2}^{\sigma(i)}).$

For ${n=2}$ we get back the notion of ${c}$-monotonicity.
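On a finite set of pairs the definition can be checked by brute force over subsets and permutations. A small sketch (the helper is my own, exponential in the number of points and purely illustrative):

```python
import itertools

def is_c_cyclically_monotone(points, c):
    """Brute-force check of c-cyclical monotonicity on a finite list of
    pairs (x1, x2) for a cost function c (illustrative only)."""
    for n in range(2, len(points) + 1):
        for subset in itertools.combinations(points, n):
            base = sum(c(x1, x2) for x1, x2 in subset)
            for sigma in itertools.permutations(range(n)):
                permuted = sum(c(subset[i][0], subset[sigma[i]][1])
                               for i in range(n))
                if base > permuted + 1e-12:
                    return False
    return True

# the "increasing" pairing on the line is c-cyclically monotone
# for c(x1, x2) = -x1 * x2 ...
c = lambda x1, x2: -x1 * x2
assert is_c_cyclically_monotone([(0, 0), (1, 1), (2, 2)], c)
# ... while a "crossing" pairing is not
assert not is_c_cyclically_monotone([(0, 2), (2, 0)], c)
```
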

Definition 4 A function ${\phi:\Omega_{1}\rightarrow {\mathbb R}}$ is ${c}$-concave if there exists some function ${\psi:\Omega_{2}\rightarrow{\mathbb R}}$ such that

$\displaystyle \phi(x_{1}) = \inf_{x_{2}\in\Omega_{2}}c(x_{1},x_{2}) - \psi(x_{2}).$

This definition of ${c}$-concavity resembles the notion of convex conjugate:

Example 2 Again using ${c(x_{1},x_{2}) = -\langle x_{1},x_{2}\rangle}$ we get that a function ${\phi}$ is ${c}$-concave if

$\displaystyle \phi(x_{1}) = \inf_{x_{2}}-\langle x_{1},x_{2}\rangle - \psi(x_{2}),$

and, as an infimum of affine functions of ${x_{1}}$, ${\phi}$ is clearly concave in the usual sense.

Definition 5 The ${c}$-superdifferential of a ${c}$-concave function is

$\displaystyle \partial^{c}\phi = \{(x_{1},x_{2})\mid \phi(x_{1}) + \phi^{c}(x_{2}) = c(x_{1},x_{2})\},$

where ${\phi^{c}}$ is the ${c}$-conjugate of ${\phi}$ defined by

$\displaystyle \phi^{c}(x_{2}) = \inf_{x_{1}\in\Omega_{1}}c(x_{1},x_{2}) -\phi(x_{1}).$

Again one may look at ${c(x_{1},x_{2}) = -\langle x_{1},x_{2}\rangle}$ and observe that the ${c}$-superdifferential is the usual superdifferential related to the supergradient of concave functions (there is a Wikipedia page for subgradient only, but the concept is the same with reversed signs in some sense).
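On finite sets the infima in these definitions become minima over rows and columns of a cost matrix, and the standard involution property (for any ${\psi}$, the function ${\phi}$ from Definition 4 is ${c}$-concave, and applying the two transforms once more returns ${\phi}$) can be checked numerically. A sketch with my own naming:

```python
import numpy as np

rng = np.random.default_rng(1)

# a cost matrix on two finite sets: C[i, j] = c(x1_i, x2_j)
C = rng.normal(size=(5, 7))

def c_transform_to_omega1(psi):
    # phi(x1) = min_{x2} c(x1, x2) - psi(x2)   (as in Definition 4)
    return (C - psi[None, :]).min(axis=1)

def c_transform_to_omega2(phi):
    # phi^c(x2) = min_{x1} c(x1, x2) - phi(x1)   (as in Definition 5)
    return (C - phi[:, None]).min(axis=0)

# start from an arbitrary psi; its transform phi is c-concave by definition ...
psi = rng.normal(size=7)
phi = c_transform_to_omega1(psi)

# ... and applying the two transforms once more gives phi back
phi2 = c_transform_to_omega1(c_transform_to_omega2(phi))
assert np.allclose(phi, phi2)
```
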

Now let us sketch the proof of the fundamental theorem of optimal transport:

Proof (of the fundamental theorem of optimal transport). Let ${\pi}$ be an optimal transport plan. We aim to show that ${\mathrm{supp}(\pi)}$ is ${c}$-cyclically monotone and assume the contrary. That is, we assume that there are points ${(x_{1}^{i},x_{2}^{i})\in\mathrm{supp}(\pi)}$ and a permutation ${\sigma}$ such that

$\displaystyle \sum_{i=1}^{n}c(x_{1}^{i},x_{2}^{i}) > \sum_{i=1}^{n}c(x_{1}^{i},x_{2}^{\sigma(i)}).$

We aim to construct a ${\tilde\pi}$ which is still feasible but has a smaller transport cost. To do so, we note that continuity of ${c}$ implies that there are neighborhoods ${U_{i}}$ of ${x_{1}^{i}}$ and ${V_{i}}$ of ${x_{2}^{i}}$ such that for all ${u_{i}\in U_{i}}$ and ${v_{i}\in V_{i}}$ it holds that

$\displaystyle \sum_{i=1}^{n}c(u_{i},v_{\sigma(i)}) - c(u_{i},v_{i})<0.$

We use this to construct a better plan ${\tilde \pi}$: Take the mass of ${\pi}$ in the sets ${U_{i}\times V_{i}}$ and shift it around. The full construction is a little messy to write down: Define a probability measure ${\nu}$ on the product ${X = \bigotimes_{i=1}^{n}U_{i}\times V_{i}}$ as the product of the measures ${\tfrac{1}{\pi(U_{i}\times V_{i})}\pi|_{U_{i}\times V_{i}}}$. Now let ${P^{U_{i}}}$ and ${P^{V_{i}}}$ be the projections of ${X}$ onto ${U_{i}}$ and ${V_{i}}$, respectively, and set

$\displaystyle \eta = \tfrac{\min_{i}\pi(U_{i}\times V_{i})}{n}\sum_{i=1}^{n}\Big((P^{U_{i}},P^{V_{\sigma(i)}})_{\#}\nu - (P^{U_{i}},P^{V_{i}})_{\#}\nu\Big)$

where ${_{\#}}$ denotes the pushforward of measures. Note that the measure ${\eta}$ is signed and that ${\tilde\pi = \pi + \eta}$ fulfills

1. ${\tilde\pi}$ is a non-negative measure
2. ${\tilde\pi}$ is feasible, i.e. has the correct marginals
3. ${\int c\,d\tilde \pi<\int c\,d\pi}$

which, taken together, gives a contradiction to the optimality of ${\pi}$. The implication from item 1 to item 2 of the theorem is not really related to optimal transport but is a general fact about ${c}$-concavity and ${c}$-cyclical monotonicity (cf. this previous blog post of mine where I wrote a similar statement for convexity). So let us just prove the implication from item 2 to optimality of ${\pi}$: Let ${\pi}$ fulfill item 2, i.e. ${\pi}$ is feasible and ${\mathrm{supp}(\pi)}$ is contained in the ${c}$-superdifferential of some ${c}$-concave function ${\phi}$. Moreover, let ${\tilde\pi}$ be any feasible transport plan. We aim to show that ${\int c\,d\pi\leq \int c\,d\tilde\pi}$. By the definition of the ${c}$-superdifferential and the ${c}$-conjugate we have

$\displaystyle \begin{array}{rcl} \phi(x_{1}) + \phi^{c}(x_{2}) &=& c(x_{1},x_{2})\ \forall (x_{1},x_{2})\in\partial^{c}\phi\\ \phi(x_{1}) + \phi^{c}(x_{2}) & \leq& c(x_{1},x_{2})\ \forall (x_{1},x_{2})\in\Omega_{1}\times \Omega_{2}. \end{array}$

Since ${\mathrm{supp}(\pi)\subset\partial^{c}\phi}$ by assumption, this gives

$\displaystyle \begin{array}{rcl} \int_{\Omega_{1}\times \Omega_{2}}c(x_{1},x_{2})\,d\pi(x_{1},x_{2}) & =& \int_{\Omega_{1}\times \Omega_{2}}\phi(x_{1}) + \phi^{c}(x_{2})\,d\pi(x_{1},x_{2})\\ &=& \int_{\Omega_{1}}\phi(x_{1})\,d\mu_{1}(x_{1}) + \int_{\Omega_{2}}\phi^{c}(x_{2})\,d\mu_{2}(x_{2})\\ &=& \int_{\Omega_{1}\times \Omega_{2}}\phi(x_{1}) + \phi^{c}(x_{2})\,d\tilde\pi(x_{1},x_{2})\\ &\leq& \int_{\Omega_{1}\times \Omega_{2}}c(x_{1},x_{2})\,d\tilde\pi(x_{1},x_{2}) \end{array}$

which shows the claim.

${\Box}$

Corollary 6 If ${\pi}$ is a measure on ${\Omega_{1}\times \Omega_{2}}$ which is supported on a ${c}$-superdifferential of a ${c}$-concave function, then ${\pi}$ is an optimal transport plan for its marginals with respect to the transport cost ${c}$.

This is the follow-up to this post on the completion of the space of Radon measures with respect to a transport norm. In that post we have seen that

$\displaystyle \mathrm{KR}_0(X)^*\cong \mathrm{Lip}_0(X),$

i.e. that the completion of the space of Radon measures with zero total mass with respect to the dual Lipschitz norm

$\displaystyle \|\mu\|_{\mathrm{Lip}_0}^* = \sup\{\int f{\mathrm d}\mu\ :\ \mathrm{Lip}(f)\leq 1,\ f(e)=0\}$

where ${e\in X}$ is some base point in the metric space ${X}$.

Recall that on a compact metric space ${(X,d)}$ we have ${\mathfrak{M}(X)}$, the space of Radon measures on ${X}$. The Kantorovich-Rubinstein norm for measures is defined by

$\displaystyle \|\mu\|_{\mathrm{KR}} = \sup\{\int f{\mathrm d} \mu\ :\ \mathrm{Lip}(f)\leq 1,\ \|f\|_\infty\leq 1\}.$

Theorem 1 For a compact metric space ${(X,d)}$ it holds that

$\displaystyle \mathrm{KR}(X)^* \cong \mathrm{Lip}(X).$

Proof: We give an explicit identification for ${\mathrm{KR}(X)^*}$ as follows:

1. Define a Lipschitz function from an element of ${\mathrm{KR}(X)^*}$: For every ${\phi\in\mathrm{KR}(X)^*}$ and ${x\in X}$ we set

$\displaystyle (T\phi)(x) = \phi(\delta_x).$

Now we check, using linearity and continuity of ${\phi}$, that for any ${x\in X}$ it holds that

$\displaystyle |(T\phi)(x)| = |\phi(\delta_x)|\leq \|\phi\|\|\delta_x\|_{\mathrm{KR}} = \|\phi\|.$

This shows that ${T\phi:X\rightarrow {\mathbb R}}$ is a bounded function. Similarly for all ${x,y\in X}$ we have

$\displaystyle |(T\phi)(x)-(T\phi)(y)| = |\phi(\delta_x-\delta_y)|\leq \|\phi\|\|\delta_x-\delta_y\|_{\mathrm{KR}}\leq \|\phi\|\min(2,d(x,y))\leq \|\phi\|d(x,y).$

This shows that ${(T\phi)}$ is actually Lipschitz continuous, and moreover, that ${T:\mathrm{KR}(X)^*\rightarrow\mathrm{Lip}(X)}$ is continuous with norm ${\|T\|\leq 1}$.

2. Define an element in ${\mathrm{KR}(X)^*}$ from a Lipschitz function: We just set for ${f\in\mathrm{Lip}(X)}$ and ${\mu\in\mathfrak{M}(X)}$

$\displaystyle (Sf)(\mu) = \int_Xf{\mathrm d}\mu.$

By the definition of the ${\mathrm{KR}}$-norm we have

$\displaystyle |(Sf)(\mu)|\leq \|f\|_{\mathrm{Lip}}\|\mu\|_{\mathrm{KR}},$

which shows that ${Sf}$ extends to a continuous linear functional on ${\mathrm{KR}(X)}$, i.e. ${Sf\in\mathrm{KR}(X)^*}$, and we also have ${\|S\|\leq 1}$.

3. Check that ${S}$ and ${T}$ invert each other: We begin with ${TS}$: For ${x\in X}$ and ${f\in\mathrm{Lip}(X)}$ observe that

$\displaystyle T(Sf)(x) = Sf(\delta_x) = \int_X f{\mathrm d}\delta_x = f(x)$

i.e. ${TS}$ is the identity on ${\mathrm{Lip}(X)}$. Conversely, for any ${\phi\in\mathrm{KR}(X)^*}$ and ${x\in X}$ we check

$\displaystyle S(T\phi)(\delta_x) = \int_X T\phi{\mathrm d}\delta_x = (T\phi)(x) = \phi(\delta_x).$

By density of the Dirac measures in ${\mathrm{KR}(X)}$ we conclude that indeed ${ST\phi = \phi}$, i.e. ${ST}$ is the identity on ${\mathrm{KR}(X)^*}$.

$\Box$

In this post, the first of two related ones, I describe the completion of the space of Radon measures with zero total mass with respect to the Kantorovich-Rubinstein norm. Somebody asked me about this after some talk – I did not know the answer and couldn’t figure it out, so I asked this over on mathoverflow, got the right pointers, and here’s the post. A great deal of the post comes from the book Lipschitz Algebras by Nik Weaver (the very same Nik Weaver who answered over at MO) but beware that the notation in this post is slightly different (especially the usage of ${\mathrm{KR}}$).

1. The Kantorovich-Rubinstein norm on the space of Radon measures

Let ${(X,d)}$ be a metric space and denote by ${\mathfrak{M}(X)}$ the space of Radon measures on ${X}$. We will work with compact ${X}$ throughout this post, although in many places compactness is not needed. Following Bourbaki, we define the space ${\mathfrak{M}(X)}$ as the dual of the space ${C(X)}$ of continuous functions on ${X}$, equipped with the supremum norm. The dual pairing of a measure ${\mu}$ and a continuous function ${f}$ is then

$\displaystyle \langle \mu,f\rangle = \int_X f{\mathrm d} \mu$

(and the integral can also be understood in terms of standard measure theory where a measure is defined by a set function of a ${\sigma}$-algebra). The norm of ${\mathfrak{M}(X)}$ is

$\displaystyle \|\mu\|_{\mathfrak{M}} = \sup\{\int_X f{\mathrm d}\mu\ :\ \|f\|_\infty\leq 1\}$

and is called the variation norm or Radon norm (sometimes also total variation, not to be confused with the similarly named seminorm on the space of functions of bounded variation). The variation norm captures some of the topological structure of ${X}$ but does not take into account the metric structure of ${X}$ (in fact, it can be defined on compact topological spaces ${X}$).

One can define further norms and semi-norms on ${\mathfrak{M}(X)}$ and subspaces thereof; two that are particularly suited to capture the underlying metric structure of ${X}$ are:

1. The dual Lipschitz-norm

$\displaystyle \|\mu\|_{\mathrm{Lip}_0}^* = \sup\{\int f{\mathrm d} \mu\ :\ \mathrm{Lip}(f)\leq 1\}$

gives finite values on the closed subspace ${\mathfrak{M}_0(X)}$ of measures with total mass equal to zero, i.e. ${\mu(X)=0}$.

2. The Kantorovich-Rubinstein norm

$\displaystyle \|\mu\|_{\mathrm{KR}} = \sup\{\int f{\mathrm d} \mu\ :\ \mathrm{Lip}(f)\leq 1,\ \|f\|_\infty\leq 1\}$

gives finite values for all Radon measures.

Actually, it is not totally trivial to show that these are indeed norms, but also not too hard (you may look up this fact for ${\mathrm{Lip}_0^*}$ in Proposition 2.3.2 in the above-mentioned book Lipschitz Algebras; the case of ${\mathrm{KR}}$ is basically similar).

Obviously ${\|\mu\|_{\mathrm{KR}}\leq \|\mu\|_{\mathfrak{M}}}$. The reversed inequality is not true and also does not even hold up to a constant if ${X}$ has a non-constant convergent sequence. This can be seen by observing that for ${x\neq y}$ one has ${\|\delta_x - \delta_y\|_{\mathfrak{M}} = 2}$ while ${\|\delta_x - \delta_y\|_{\mathrm{KR}} = \|\delta_x-\delta_y\|_{\mathrm{Lip}_0}^* \leq d(x,y)}$.
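For finitely supported measures the supremum in the definition of ${\|\cdot\|_{\mathrm{KR}}}$ is a linear program in the function values at the support points (any feasible choice of values extends to a ${1}$-Lipschitz function bounded by ${1}$, e.g. by McShane extension and truncation). Here is a sketch on the real line using scipy; the function and variable names are mine:

```python
import numpy as np
from scipy.optimize import linprog

def kr_norm(weights, points):
    """KR norm of sum_i weights[i] * delta_{points[i]} on the real line:
    maximize sum_i w_i f_i subject to |f_i| <= 1, |f_i - f_j| <= |x_i - x_j|."""
    n = len(points)
    A, b = [], []
    for i in range(n):
        for j in range(i + 1, n):
            dist = abs(points[i] - points[j])
            row = np.zeros(n)
            row[i], row[j] = 1.0, -1.0
            A.append(row.copy()); b.append(dist)   # f_i - f_j <= d(x_i, x_j)
            A.append(-row); b.append(dist)         # f_j - f_i <= d(x_i, x_j)
    # linprog minimizes, so negate the objective
    res = linprog(-np.asarray(weights, dtype=float), A_ub=np.array(A),
                  b_ub=np.array(b), bounds=[(-1, 1)] * n)
    return -res.fun

# ||delta_x - delta_y||_KR = min(2, d(x, y)): small for close points ...
assert np.isclose(kr_norm([1, -1], [0.0, 0.3]), 0.3)
# ... and capped at 2 (the variation norm) for far-apart points
assert np.isclose(kr_norm([1, -1], [0.0, 5.0]), 2.0)
```

For measures with zero total mass, dropping the bounds ${|f_i|\leq 1}$ and fixing the value at a base point gives the dual Lipschitz norm instead.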

While ${(\mathfrak{M}(X),\|\cdot\|_{\mathfrak{M}})}$ is a Banach space (by its very definition as a dual space), it is not a priori clear whether ${(\mathfrak{M}(X),\|\cdot\|_{\mathrm{KR}})}$ is also a Banach space. In fact, it is not, unless ${X}$ is finite:

Theorem 1 The space ${(\mathfrak{M}(X),\|\cdot\|_{\mathrm{KR}})}$ is complete if and only if ${X}$ is finite, similarly for ${(\mathfrak{M}_0(X),\|\cdot\|_{\mathrm{Lip}_0}^*)}$.

Proof: Let ${X}$ be finite with ${n}$ elements, say. Then ${\mathfrak{M}(X)}$ is naturally identified with ${{\mathbb R}^n}$ and there all norms are equivalent, which gives completeness in any of the norms. Now let ${X}$ be infinite. Since ${X}$ is also compact, there are, for every ${\epsilon>0}$, two distinct points ${x,y\in X}$ with ${d(x,y)<\epsilon}$. For these points we’ve already observed that ${\|\delta_x - \delta_y\|_{\mathrm{KR}} <\epsilon}$ while ${\|\delta_x - \delta_y\|_{\mathfrak{M}} = 2}$. This shows that the identity ${(\mathfrak{M}(X),\|\cdot\|_{\mathfrak{M}}) \rightarrow (\mathfrak{M}(X),\|\cdot\|_{\mathrm{KR}})}$ is a continuous linear bijection that does not have a continuous inverse. So the open mapping theorem shows that ${(\mathfrak{M}(X),\|\cdot\|_{\mathrm{KR}})}$ cannot be complete. The argument for ${(\mathfrak{M}_0(X),\|\cdot\|_{\mathrm{Lip}_0}^*)}$ is the same. $\Box$

This raises two related questions:

1. What is the completion of the space of Radon measures ${\mathfrak{M}(X)}$ with respect to Kantorovich-Rubinstein norm ${\|\cdot\|_\mathrm{KR}}$?
2. What is the completion of the space of Radon measures with zero mass ${\mathfrak{M}_0(X)}$ with respect to the dual Lipschitz norm ${\|\cdot\|_{\mathrm{Lip}_0}^*}$?

In this post we’ll see the answer to the second question.

Note that the proof above indicates an important phenomenon: differences of Diracs with opposite signs at points that approach each other play a special role here. Indeed, if ${x_k}$ and ${y_k}$ are two sequences in ${X}$ with ${d(x_k,y_k)\rightarrow 0}$ then for

$\displaystyle \mu_k = \delta_{x_k} - \delta_{y_k}$

it holds that ${\|\mu_k\|_{\mathrm{KR}}\leq d(x_k,y_k)\rightarrow 0}$ (similarly for ${\|\cdot\|_{\mathrm{Lip}_0}^*}$) but ${\mu_k}$ does not converge to zero with respect to ${\|\cdot\|_{\mathfrak{M}}}$ (its variation norm is constantly ${2}$ as long as ${x_k\neq y_k}$). Note, however, that ${\mu_k}$ does converge weakly* to zero: every continuous function on the compact space ${X}$ is uniformly continuous, and hence ${\int_X f\,{\mathrm d}\mu_k = f(x_k)-f(y_k)\rightarrow 0}$.

We’ll come to a description of the completion in a minute but first we introduce several notions that will be used.

2. Spaces of Lipschitz functions

For a compact metric space ${(X,d)}$ we say that ${f:X\rightarrow{\mathbb R}}$ is Lipschitz continuous if it has finite Lipschitz constant

$\displaystyle \mathrm{Lip}(f) = \sup_{x\neq y}\frac{|f(x)-f(y)|}{d(x,y)}.$
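On a finite set of points the supremum is a maximum over pairs, so the Lipschitz constant can be computed directly. A tiny sketch (the helper name is mine):

```python
import itertools
import math

def lip_constant(f, points, d):
    """Lipschitz constant of f restricted to a finite set of points
    of a metric space with metric d (illustrative helper)."""
    return max(abs(f(x) - f(y)) / d(x, y)
               for x, y in itertools.combinations(points, 2))

d = lambda x, y: abs(x - y)
pts = [0.1 * k for k in range(20)]

# sin is 1-Lipschitz on the real line, so also on any finite subset ...
assert lip_constant(math.sin, pts, d) <= 1.0
# ... and a linear function has its slope as Lipschitz constant
assert abs(lip_constant(lambda x: 2 * x, pts, d) - 2.0) < 1e-9
```
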

The theorem of Arzela-Ascoli shows that the space ${\mathrm{Lip}(X)}$ of all Lipschitz continuous functions is a Banach space, when equipped with the norm

$\displaystyle \|f\|_{\mathrm{Lip}} = \max(\|f\|_\infty,\mathrm{Lip}(f)).$

If we additionally mark a distinguished base point ${e\in X}$ we can also consider the space

$\displaystyle \mathrm{Lip}_0(X) = \{f\in \mathrm{Lip}(X)\ :\ f(e)=0\}.$

On this space the Lipschitz constant itself is a norm (and not only a semi-norm) and we denote it by

$\displaystyle \|f\|_{\mathrm{Lip}_0} = \mathrm{Lip}(f).$

In fact the spaces ${\mathrm{Lip}}$ and ${\mathrm{Lip}_0}$ are really closely related since in the case of metric spaces of finite diameter one space can be turned into the other and vice versa. The idea is to enhance a given metric space ${X}$ with an additional base point that we put in some way “right in the middle of ${X}$” or, maybe more accurately, “at equal distance of anything” as follows:

Theorem 2 For a compact metric space ${(X,d)}$ (with no special base point) and diameter ${\mathrm{diam}(X) = D}$ denote ${X^+ := X\cup\{e\}}$ for some new point ${e}$ and let ${d^+:X^+\times X^+\rightarrow{\mathbb R}}$ be equal to ${d}$ on ${X}$ and satisfy ${d^+(x,e)=D/2}$ for all ${x\in X}$. Then:

1. ${(X^+,d^+)}$ is a compact metric space with diameter ${\mathrm{diam}(X^+)=D}$.
2. For ${f\in\mathrm{Lip}(X)}$ define ${f^+\in \mathrm{Lip}_0(X^+)}$ by ${f^+(x) = f(x)}$ for ${x\in X}$ and ${f^+(e)=0}$. Then the mapping ${T:f\mapsto f^+}$ is a homeomorphism from ${\mathrm{Lip}(X)}$ to ${\mathrm{Lip}_0(X^+)}$.

Proof:

1. The only thing to check for ${d^+}$ to be a metric is that ${d^+}$ also fulfills the triangle inequality: but since for ${x,y\in X}$ it holds that ${d^+(x,e)+d^+(e,y) = D/2 + D/2 = D\geq d(x,y) = d^+(x,y)}$ this is fulfilled.
2. Let ${f\in \mathrm{Lip}(X)}$. Then

$\displaystyle \frac{|f^+(x)-f^+(e)|}{d^+(x,e)} = \frac{|f(x)-0|}{D/2}\leq \frac2D\|f\|_\infty \leq \frac2D\|f\|_{\mathrm{Lip}}$

and consequently, since pairs of points within ${X}$ contribute at most ${\mathrm{Lip}(f)\leq \|f\|_{\mathrm{Lip}}}$,

$\displaystyle \mathrm{Lip}(f^+) \leq\max(1,\tfrac2D)\|f\|_{\mathrm{Lip}}.$

On the other hand for all ${x\in X}$

$\displaystyle |f(x)| = \frac{D}{2}\frac{|f(x)-0|}{D/2}\leq \frac{D}2\mathrm{Lip}(f^+)$

which shows ${\|f\|_\infty\leq\tfrac{D}2\mathrm{Lip}(f^+)}$. Obviously we have ${\mathrm{Lip}(f)\leq \mathrm{Lip}(f^+)}$, which finishes the argument.

$\Box$
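Part 1 of the proof can be sanity-checked by brute force on a small finite metric space. A sketch (my own encoding: the new base point ${e}$ is represented by `None`):

```python
import itertools

# a finite metric space: points on the line with the absolute-value metric
X = [0.0, 1.0, 3.0]
d = lambda x, y: abs(x - y)
D = max(d(x, y) for x, y in itertools.combinations(X, 2))  # diameter

# extend by a base point e at distance D/2 from everything
def d_plus(x, y):
    if x is None and y is None:
        return 0.0
    if x is None or y is None:
        return D / 2
    return d(x, y)

# check the triangle inequality on X^+ = X + {e} by brute force
Xp = X + [None]
for x, y, z in itertools.product(Xp, repeat=3):
    assert d_plus(x, z) <= d_plus(x, y) + d_plus(y, z) + 1e-12
```
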

3. The Arens-Eells space

That spaces of Lipschitz functions should play a role for the ${\mathrm{KR}}$– and ${\mathrm{Lip}_0^*}$-norm seems pretty obvious: Lipschitz functions are used in the very definition of the norms. Now we come to a space whose relation to the ${\mathrm{KR}}$– and ${\mathrm{Lip}_0^*}$-norm may not be that obvious but which is indeed very much related.

Definition 3 For a compact metric space ${(X,d)}$ we define a molecule as a function ${m:X\rightarrow{\mathbb R}}$ with finite support and ${\sum_{x\in X} m(x)=0}$.

Obviously, the set of molecules is a linear space.

Any molecule can be identified with a measure on ${X}$ as follows: For a molecule ${m}$ define a measure

$\displaystyle \mu_m = \sum_{x\in X} m(x)\delta_{x}.$

In other words: At every point ${x}$ of the support of ${m}$ we put a delta with weight ${m(x)}$.

Since the support of ${m}$ is finite, the sum is actually finite and in addition, this measure ${\mu_m}$ has zero total measure since

$\displaystyle \mu_m(X) = \sum_{x\in X}m(x)\delta_{x}(X) = \sum_{x\in X}m(x)=0.$

So one may say that the linear space of all molecules is naturally identified with the linear space of all finite linear combinations of deltas with zero total mass.

On the space of molecules we define the so-called Arens-Eells norm. To do so we introduce elementary molecules, having a two-element support with values ${\pm 1}$, namely for ${x,y\in X}$

$\displaystyle m_{xy}(z) = \begin{cases} 1, & z=x\\ -1, & z=y\\ 0, & \text{else.} \end{cases}$

Note that any molecule ${m}$ can be written as a finite linear combination of elementary molecules. (To do so you may order the support of a given ${m}$ as ${\{x_1,\dots,x_n\}}$, start with ${m(x_1)m_{x_1 x_2}}$, then form ${m^1 = m - m(x_1)m_{x_1x_2}}$, observe that ${m^1}$ has a support of size at most ${n-1}$, and proceed iteratively.) Importantly, the decomposition of a molecule into elementary molecules is not unique.
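The iterative decomposition just described is easy to carry out in code. A small sketch (my own representation: a molecule is a dict mapping points to nonzero weights; the resulting decomposition is by no means norm-optimal):

```python
def decompose(m):
    """Decompose a molecule (dict point -> weight, weights summing to zero)
    into elementary molecules; returns a list of triples (a, x, y),
    each meaning a * m_{xy}."""
    m = {x: w for x, w in m.items() if w != 0}
    parts = []
    while len(m) > 1:
        xs = sorted(m)              # any ordering of the support works
        x1, x2 = xs[0], xs[1]
        a = m.pop(x1)               # remove x1 from the support ...
        m[x2] = m.get(x2, 0) + a    # ... and shift its weight onto x2
        parts.append((a, x1, x2))
        m = {x: w for x, w in m.items() if w != 0}
    return parts

# reconstruct the molecule from its decomposition
m = {0: 1.0, 1: -3.0, 2: 2.0}
parts = decompose(m)
rec = {}
for a, x, y in parts:
    rec[x] = rec.get(x, 0) + a
    rec[y] = rec.get(y, 0) - a
assert {x: w for x, w in rec.items() if w != 0} == m
```
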

Definition 4 For a molecule ${m}$ on ${X}$ we set

$\displaystyle \|m\|_{\AE} = \inf\{\sum_{i=1}^n |a_i|d(x_i,y_i)\ :\ m = \sum_{i=1}^n a_i m_{x_iy_i}\}$

where the infimum is taken over all decompositions of ${m}$ into elementary molecules.

Actually, it is not clear from the start that ${\|\cdot\|_{\AE}}$ is indeed a norm. While homogeneity and the triangle inequality are more or less straightforward, definiteness is not clear from the beginning. We will see it in a minute, but it makes sense to first introduce the completion of the space of molecules with respect to the (semi-)norm ${\|\cdot\|_{\AE}}$ since its dual is indeed familiar. Since we do not know yet that ${\|\cdot\|_{\AE}}$ is a norm, we have to be careful with the completion:

Definition 5 The Arens-Eells space ${\AE(X)}$ on a compact metric space is the completion of the space of molecules with respect to the seminorm ${\|\cdot\|_{\AE}}$ modulo elements with zero norm.

The great thing is that the dual of ${\AE(X)}$ is a well known space:

Theorem 6 For a compact metric space ${(X,d)}$ with some base point ${e}$ it holds that

$\displaystyle \AE(X)^* \cong \mathrm{Lip}_0(X).$

On bounded sets, weak* convergence in ${\mathrm{Lip}_0(X)}$ agrees with pointwise convergence.

Proof: We represent elements of ${\AE(X)^*}$ as Lipschitz functions by the mapping ${T_1:\AE(X)^*\rightarrow \mathrm{Lip}_0(X)}$ by setting for ${\phi\in \AE(X)^*}$ and ${x\in X}$

$\displaystyle (T_1\phi)(x) = \phi(m_{xe})$

Note that ${(T_1\phi)(e) = \phi(m_{ee}) = 0}$. By the definition of ${\|\cdot\|_{\AE}}$ it holds that ${\|m_{xy}\|_{\AE}\leq d(x,y)}$ and hence,

$\displaystyle |(T_1\phi)(x)-(T_1\phi)(y)| = |\phi(m_{xe} - m_{ye})| = |\phi(m_{xy})|\leq \|\phi\|d(x,y)$

which shows that ${\mathrm{Lip}(T_1\phi)\leq \|\phi\|}$ and we conclude that ${T_1\phi\in\mathrm{Lip}_0(X)}$ and also that ${\|T_1\|\leq 1}$. Conversely, we now define a map that associates to every Lipschitz function vanishing at the base point an element in ${\AE(X)^*}$: Define ${T_2:\mathrm{Lip}_0(X)\rightarrow \AE(X)^*}$ for any molecule ${m}$ and any ${f\in \mathrm{Lip}_0(X)}$ by

$\displaystyle (T_2f)(m) = \sum_{x\in X} f(x)m(x).$

Note that for a decomposed molecule ${m = \sum_{i=1}^n a_i m_{x_iy_i}}$ we have

$\displaystyle \begin{array}{rcl} |(T_2 f)(m)| & =&|(T_2 f)(\sum_{i=1}^n a_i m_{x_iy_i})|\\ & \leq & \sum_{i=1}^n |a_i| |(T_2 f)(m_{x_iy_i})|\\ & = &\sum_{i=1}^n |a_i| |f(x_i) - f(y_i)|\\ & \leq & \mathrm{Lip}(f) \sum_{i=1}^n |a_i| d(x_i,y_i). \end{array}$

Taking the infimum over all decompositions, we conclude ${|(T_2 f)(m)|\leq \mathrm{Lip}(f)\|m\|_{\AE}}$ showing that ${T_2 f\in \AE(X)^*}$ and ${\|T_2f\|\leq\mathrm{Lip}(f)}$, i.e., ${\|T_2\|\leq 1}$. Now let’s look at ${T_1T_2}$: For ${x\in X}$ and ${f\in\mathrm{Lip}_0(X)}$ it holds that ${T_1(T_2 f)(x) = (T_2 f)(m_{xe}) = f(x) - f(e) = f(x)}$, i.e. ${T_1T_2 = I}$. Also for ${\phi\in \AE(X)^*}$ and a molecule ${m}$: ${T_2(T_1\phi)(m) = \sum_{x\in X}(T_1\phi)(x)m(x) = \sum_{x\in X}\phi(m_{xe})m(x) = \phi(\sum_{x\in X}m(x)m_{xe})= \phi(m)}$. So ${T_2}$ and ${T_1}$ are indeed inverses and hence provide identifications of ${\AE(X)^*}$ and ${\mathrm{Lip}_0(X)}$. Now consider a sequence of functions ${f_n\in \mathrm{Lip}_0(X)}$ that converges weakly* to ${f}$. This says that (in the notation above) for any molecule ${m}$ it holds that ${(T_2f_n)(m)\rightarrow (T_2 f)(m)}$. Particularly for the molecules ${m_{xe}}$ we have

$\displaystyle (T_2 f_n)(m_{xe}) = \sum_{y\in X} f_n(y)m_{xe}(y) = f_n(x).$

It follows that ${f_n(x)\rightarrow f(x)}$ for all ${x}$. This shows that weak* convergence implies pointwise convergence. If, conversely, ${f_n}$ converges pointwise, we conclude that ${(T_2 f_n)(m_{xe}) \rightarrow (T_2 f)(m_{xe})}$ for all ${m_{xe}}$. Linear combinations of these molecules are dense in ${\AE(X)}$ and since the ${f_n}$ are bounded by assumption, this implies weak* convergence. $\Box$

The above proof also shows that ${\|\cdot\|_{\AE}}$ is indeed a norm and moreover, that it can be written as

$\displaystyle \|m\|_{\AE} = \sup\{\sum_{x\in X} f(x)m(x)\ :\ f(e)=0,\ \mathrm{Lip}(f)\leq 1\}$

(also, by Arzela-Ascoli, there is a Lipschitz function ${f}$ with ${\mathrm{Lip}(f)\leq 1}$ and ${f(e)=0}$ where the supremum is attained, but we will not need this). This shows the following fact:

Corollary 7 For any molecule ${m}$ on a compact metric space ${X}$ with base point ${e}$ and the corresponding measure ${\mu_m}$ it follows that

$\displaystyle \|m\|_{\AE} = \|\mu_m\|_{\mathrm{Lip}_0}^*.$

4. The completion of ${\mathfrak{M}_0(X)}$ under the dual Lipschitz norm

Now we come to the answer of the second question posed in the beginning. We define by ${\mathrm{KR}_0(X)}$ the completion of ${\mathfrak{M}_0(X)}$ (the space of Radon measures with zero total mass) with respect to ${\|\cdot\|_{\mathrm{Lip}_0}^*}$ and aim at a characterization of this space.

Theorem 8 It holds that ${\AE(X)\cong \mathrm{KR}_0(X)}$ isometrically.

Proof: We first show that any measure ${\mu}$ on ${X}$ with zero total mass can be approximated in the ${\mathrm{Lip}_0^*}$-norm by measures ${\mu_m}$ defined by molecules ${m}$. So let ${\epsilon>0}$ and let ${\mu\in \mathfrak{M}_0(X)}$ be nonzero. We set ${\epsilon' = \epsilon/\|\mu\|_{\mathfrak{M}}}$. Now we cover ${X}$ by an ${\epsilon'}$-net, that is, we take ${x_1,\dots,x_n}$ such that the balls ${U_k}$ around ${x_k}$ with radius ${\epsilon'}$ cover ${X}$. From this cover we define by ${V_1=U_1}$, ${V_k = U_k\setminus(U_1\cup\cdots\cup U_{k-1})}$ a disjoint cover of ${X}$ (i.e. a partition). Now we define a molecule ${m}$ by

$\displaystyle m(x) = \left\{ \begin{array}{ll} \mu(V_k), & \text{if}\ x=x_k\\ 0, & \text{else.} \end{array} \right.$

Clearly ${m}$ is a molecule (its support is contained in the ${n}$ points ${x_k}$ and the sum ${\sum_{x}m(x)}$ equals ${\mu(X)=0}$). To see that ${\mu_m}$ approximates ${\mu}$ in the ${\mathrm{Lip}_0^*}$-norm we start as follows: For any function ${f}$ with ${\mathrm{Lip}(f)\leq 1}$ we see

$\displaystyle \begin{array}{rcl} |\int_{V_k} f{\mathrm d}\mu - f(x_k)\mu(V_k)| &=& |\int_{V_k} (f-f(x_k)){\mathrm d}\mu|\\ &\leq& \int_{V_k} \mathrm{Lip}(f)\,\epsilon'\,{\mathrm d}|\mu| = \epsilon'\,|\mu|(V_k). \end{array}$

Summing this over ${k}$ we get for every ${f}$ with ${\mathrm{Lip}(f)\leq 1}$

$\displaystyle |\int_X f{\mathrm d}(\mu-\mu_m)|\leq \epsilon'\|\mu\|_{\mathfrak{M}} = \epsilon.$

Taking the supremum over these ${f}$ gives ${\|\mu-\mu_m\|_{\mathrm{Lip}_0}^*\leq \epsilon}$ as desired. Corollary 7 shows that ${\|m\|_{\AE}=\|\mu_m\|_{\mathrm{Lip}_0}^*}$ and from the above we see that the measures ${\mu_m}$ are dense in ${\mathrm{KR}_0(X)}$. Hence, the correspondence ${m\mapsto\mu_m}$ is an isometry on dense subsets of ${\AE(X)}$ and ${\mathrm{KR}_0(X)}$ which proves the assertion. $\Box$

Finally, we come to the announced description of the space ${\mathrm{KR}_0(X)}$ (defined as completion of ${(\mathfrak{M}_0(X),\|\cdot\|_{\mathrm{Lip}_0}^*)}$): It is equal to the Arens-Eells space and hence, its dual space is the space of Lipschitz functions rooted on a base point.

Corollary 9 For a compact metric space ${(X,d)}$ with basepoint it holds that

$\displaystyle \mathrm{KR}_0(X)^*\cong\mathrm{Lip}_0(X).$

To conclude this post, I just mention that the techniques used above are connected to the Kuratowski embedding and the related Kuratowski–Wojdysławski embedding, which embed metric spaces into certain Banach spaces.

Optimal transport in a discrete setting goes as follows: Consider two vectors ${p,q\in{\mathbb R}^N}$ with non-negative entries and ${\|p\|_1 = \|q\|_1}$. You may also say that ${p}$ and ${q}$ are two probability distributions or discrete measures with the same total mass. A transport plan for ${p}$ and ${q}$ is a matrix ${\pi\in{\mathbb R}^{N\times N}}$ with non-negative entries such that

$\displaystyle \sum_i \pi_{i,j} = p_j,\quad\sum_j\pi_{i,j} = q_i.$

The interpretation of ${\pi}$ is that ${\pi_{i,j}}$ indicates how much of the mass ${p_j}$ sitting at the ${j}$-th entry of ${p}$ is transported to the ${i}$-th entry of ${q}$. To speak about optimal transport we add an objective function to the constraints, namely the cost ${c_{i,j}}$ that says how much it costs to transport one unit of mass from the ${j}$-th entry of ${p}$ to the ${i}$-th entry of ${q}$. Then the optimal transport problem is

$\displaystyle \min_{\pi\in{\mathbb R}^{N\times N}} \sum_{i,j}c_{i,j}\pi_{i,j}\quad\text{s.t.}\quad\sum_i \pi_{i,j} = p_j,\quad\sum_j\pi_{i,j} = q_i.$

The resulting ${\pi}$ is an optimal transport plan and the resulting objective value is the minimal cost at which ${p}$ can be transported to ${q}$. In fact, the minimization problem is a linear program, and not only that: it is one of the best studied linear programs, and I am sure there is a lot that I don't know about its structure (you may have a look at these slides by Jesus De Loera to get an impression of what is known about the structure of this linear program).
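For moderate ${N}$ one can hand this linear program directly to a generic LP solver. Here is a rough sketch in Python (not MATLAB, and not from the original post) using `scipy.optimize.linprog`; the marginals ${p}$, ${q}$ and the cost ${c_{i,j}=|i-j|}$ are made up for illustration:

```python
import numpy as np
from scipy.optimize import linprog

# Made-up example: two small discrete marginals with equal total mass
# and the cost c_{i,j} = |i-j|.
N = 4
p = np.array([0.4, 0.3, 0.2, 0.1])
q = np.array([0.1, 0.2, 0.3, 0.4])
c = np.abs(np.subtract.outer(np.arange(N), np.arange(N))).astype(float)

# Flatten pi row-major: pi_{i,j} sits at index i*N + j.
# First N equality constraints: column sums  sum_i pi_{i,j} = p_j,
# next N equality constraints:  row sums     sum_j pi_{i,j} = q_i.
A_eq = np.zeros((2 * N, N * N))
for j in range(N):
    A_eq[j, j::N] = 1.0
for i in range(N):
    A_eq[N + i, i * N:(i + 1) * N] = 1.0
b_eq = np.concatenate([p, q])

res = linprog(c.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
pi_opt = res.x.reshape(N, N)  # optimal transport plan
```

Note that the constraint matrix already has ${2N\cdot N^2}$ entries, which illustrates the memory problem discussed below.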

So it looks like discrete optimal transport is a fairly standard problem with standard solvers available. But all these solvers have one severe drawback when it comes to large ${N}$: The optimization variable has ${N^2}$ entries. If ${N^2}$ is too large to store ${\pi}$ or to keep it in memory, it seems that there is not much one can do. This is the memory bottleneck for optimal transport problems.

1. Kantorovich-Rubinstein duality

In the case where the cost has the special form ${c_{i,j} = |i-j|}$ one can reduce the memory burden. This special cost makes sense if the indices ${i}$ and ${j}$ correspond to spatial locations, since then ${c_{i,j} = |i-j|}$ is just the distance from location ${i}$ to location ${j}$. It turns out that in this case there is a simple dual optimal transport problem, namely

$\displaystyle \max_{f\in{\mathbb R}^N} f^T(p-q)\quad\text{s.t.}\quad |f_i-f_{i-1}|\leq 1,\ 2\leq i\leq N.$

(This is a simple form of the Kantorovich-Rubinstein duality and works similarly if the cost ${c}$ is any other metric on the set of indices.) The new optimization problem is still linear, but the memory requirement is only ${N}$ instead of ${N^2}$, and moreover there are only ${O(N)}$ constraints for ${f}$. This idea is behind the method from the paper Imaging with Kantorovich-Rubinstein discrepancy by Jan Lellmann, myself, Carola Schönlieb and Tuomo Valkonen.
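The dual can also be written down directly for a generic LP solver. A small Python sketch (again an illustration with made-up marginals, not the method of the paper); since `linprog` minimizes, the objective gets a sign flip, and ${f}$ is unconstrained in sign:

```python
import numpy as np
from scipy.optimize import linprog

# Made-up marginals on a grid of N points, equal total mass.
N = 50
x = np.linspace(0.0, 1.0, N)
p = np.exp(-100.0 * (x - 0.3) ** 2); p /= p.sum()
q = np.exp(-100.0 * (x - 0.7) ** 2); q /= q.sum()

# Dual problem: max f^T (p - q)  s.t.  |f_i - f_{i-1}| <= 1.
# Finite-difference matrix D with (D f)_i = f_{i+1} - f_i:
D = np.zeros((N - 1, N))
D[np.arange(N - 1), np.arange(1, N)] = 1.0
D[np.arange(N - 1), np.arange(N - 1)] = -1.0
A_ub = np.vstack([D, -D])  # encodes D f <= 1 and -D f <= 1
b_ub = np.ones(2 * (N - 1))

# linprog minimizes, so minimize -(p - q)^T f over free f.
res = linprog(-(p - q), A_ub=A_ub, b_ub=b_ub, bounds=(None, None))
w1 = -res.fun  # minimal transport cost for c_{i,j} = |i - j|
```

Only the vector ${f}$ and the sparse difference constraints are needed, which is the ${O(N)}$ memory footprint mentioned above.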

2. Entropic regularization

In this post I’d like to describe another method to break through the memory bottleneck of optimal transport. This method works for basically any cost ${c}$ but involves a little bit of smoothing/regularization.

We go from the linear program to a non-linear but still convex one by adding the negative entropy of the transport plan to the objective, i.e. we consider the objective

$\displaystyle \sum_{i,j}\Big[c_{i,j}\pi_{i,j} + \gamma \pi_{i,j}(\log(\pi_{i,j}) -1)\Big]$

for some ${\gamma>0}$.

What’s the point of doing so? Let’s look at the Lagrangian: For the constraints ${\sum_i \pi_{i,j} = p_j}$ and ${\sum_j\pi_{i,j} = q_i}$ we introduce the ones vector ${{\mathbf 1}\in{\mathbb R}^N}$, write the constraints as ${\pi^T{\mathbf 1} = p}$ and ${\pi{\mathbf 1}=q}$, add Lagrange multipliers ${\alpha}$ and ${\beta}$ and get

$\begin{array}{rl}\mathcal{L}(\pi,\alpha,\beta) = & \sum_{i,j}\Big[c_{i,j}\pi_{i,j} + \gamma\pi_{i,j}(\log(\pi_{i,j}) -1)\Big]\\ & \quad+ \alpha^T(\pi^T{\mathbf 1}-p) + \beta^T(\pi{\mathbf 1}-q)\end{array}$

The cool thing happens with the optimality condition when differentiating ${\mathcal{L}}$ with respect to ${\pi_{i,j}}$:

$\displaystyle \partial_{\pi_{i,j}}\mathcal{L} = c_{i,j} + \gamma\log(\pi_{i,j}) + \alpha_j + \beta_i \stackrel{!}{=} 0$

We can solve for ${\pi_{i,j}}$ and get

$\displaystyle \pi_{i,j} = \exp(-\tfrac{c_{i,j}}{\gamma})\exp(-\tfrac{\alpha_j}\gamma)\exp(-\tfrac{\beta_i}\gamma).$

What does that say? It says that the optimal ${\pi}$ is obtained from the matrix

$\displaystyle M_{i,j} = \exp(-\tfrac{c_{i,j}}{\gamma})$

with rows and columns rescaled by vectors ${u_j = \exp(-\tfrac{\alpha_j}\gamma)}$ and ${v_i = \exp(-\tfrac{\beta_i}\gamma)}$, respectively, i.e.

$\displaystyle \pi = \mathbf{diag}(v)M\mathbf{diag}(u).$

This reduces the memory requirement from ${N^2}$ to ${2N}$! The price for doing so is the regularization by the entropy.

Actually, the existence of the vectors ${u}$ and ${v}$ also follows from Sinkhorn’s theorem, which states that for every matrix ${A}$ with positive entries there exist diagonal matrices $D_1, D_2$ with positive diagonals such that ${A = D_1\pi D_2}$ for some doubly stochastic matrix ${\pi}$ (i.e. one with non-negative entries and unit row and column sums). For our setting one replaces “doubly stochastic” by “with prescribed positive row and column sums”. Also, the entropic regularization ensures that the transport plan has strictly positive (in particular non-zero) entries.
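To illustrate Sinkhorn’s theorem itself (the doubly stochastic case, before we come to the transport setting), one can scale a made-up positive matrix by alternating row and column normalizations; a small Python sketch:

```python
import numpy as np

# Sketch: Sinkhorn scaling of a (made-up) positive matrix to a
# doubly stochastic one by alternating row/column normalization.
rng = np.random.default_rng(0)
N = 6
A = rng.uniform(0.5, 2.0, size=(N, N))  # positive entries

u = np.ones(N)
v = np.ones(N)
for _ in range(1000):
    u = 1.0 / (A.T @ v)  # column sums of diag(v) A diag(u) become 1
    v = 1.0 / (A @ u)    # row sums become 1

S = np.diag(v) @ A @ np.diag(u)  # (numerically) doubly stochastic
# and A = D_1 S D_2 with D_1 = diag(1/v), D_2 = diag(1/u)
```

Replacing the right-hand sides `1.0` by prescribed row and column sums gives exactly the scaling we need for transport plans.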

But there is even more to the story: To calculate the vectors ${u}$ and ${v}$ you simply have to do the following iteration:

$\displaystyle \begin{array}{rcl} u^{n+1} &=& \frac{p}{M^Tv^n}\\ v^{n+1} &=& \frac{q}{Mu^{n+1}} \end{array}$

where the fraction means element-wise division. Pretty simple.

What the iteration does in the first step is simply to take the current ${v}$ and calculate a column scaling ${u}$ such that the column sums of ${\pi}$ match ${p}$. In the second step it calculates the row scaling ${v}$ such that the row sums match ${q}$. This iteration is also known as the Sinkhorn-Knopp algorithm.

This is pretty simple to code in MATLAB. Here is a simple script that does the above iteration (using the cost $c_{i,j} = |i-j|^2$, for which ${M}$ is symmetric):

%Parameters
gamma = 10; %regularization parameter for the entropy
maxiter = 100; %maximal number of iterations
map = colormap(gray);

N = 100; %size
x = linspace(0,1,N)'; %spatial coordinate

% marginals
p = exp(-(x-0.2).^2*10^2) + exp(-abs(x-0.4)*20); p = p./sum(p); %for column sums
q = exp(-(x-0.8).^2*10^2); q = q./sum(q); %for row sums

[i,j] = meshgrid(1:N);
M = exp(-(i-j).^2/gamma); % exp(-cost/gamma), symmetric

% initialize u and v
u = ones(N,1); v = ones(N,1);

% Sinkhorn-Knopp:
% iteratively scale rows and columns
for k = 1:maxiter
% update u and v
u = p./(M*v);
v = q./(M'*u);
% assemble pi (only for illustration purposes;
% note that this shadows the built-in constant pi)
pi = diag(v)*M*diag(u);
% display pi (with marginals on top and to the left)
imagesc([p'/max(p) 0;pi/max(pi(:)) q/max(q)])
colormap(1-map)
drawnow
end
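For readers without MATLAB, here is a rough NumPy port of the script above (the plotting is omitted; everything else follows the same steps):

```python
import numpy as np

# NumPy port of the MATLAB script above (display omitted).
gamma = 10.0   # regularization parameter for the entropy
maxiter = 100  # maximal number of iterations
N = 100
x = np.linspace(0.0, 1.0, N)  # spatial coordinate

# marginals
p = np.exp(-(x - 0.2) ** 2 * 10 ** 2) + np.exp(-np.abs(x - 0.4) * 20)
p /= p.sum()  # for column sums
q = np.exp(-(x - 0.8) ** 2 * 10 ** 2)
q /= q.sum()  # for row sums

i, j = np.meshgrid(np.arange(N), np.arange(N))
M = np.exp(-(i - j) ** 2 / gamma)  # exp(-cost/gamma), symmetric

# initialize u and v, then iteratively scale rows and columns
u = np.ones(N)
v = np.ones(N)
for _ in range(maxiter):
    u = p / (M @ v)    # match column sums to p
    v = q / (M.T @ u)  # match row sums to q

pi = np.diag(v) @ M @ np.diag(u)  # transport plan
```

Since ${M}$ is symmetric here, it does not matter on which update the transpose sits, exactly as in the MATLAB version.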


Here are some results:

(From the left to the right: $\gamma=40,20,10,7$. The first row of pixels is ${p}$, the last column is ${q}$ and in between there is ${\pi}$, all things normalized such that black is the maximal value in ${p}$, ${q}$ and ${\pi}$, respectively.)

You see that for large ${\gamma}$ the plan is much smoother and less well localized than it should be for an optimal plan.

Oh, and here is an animation of 100 iterations of Sinkhorn-Knopp showing the result after both $u$ and $v$ have been updated:

There is a catch with this regularization: For small ${\gamma}$ (in this example about ${\gamma\leq 6}$) the method runs into problems with under- and overflow: the entries in ${Mv^n}$ and ${M^Tu^n}$ become very small. One can fight this effect a bit, but I don’t have a convincing numerical method to deal with this yet. It seems that the entries of the optimal $u$ and $v$ really have to be incredibly small and large, and I mean really small and large (on the order of $10^{300}$ and $10^{-300}$ in both $u$ and $v$).

While the Sinkhorn-Knopp algorithm dates back to the 1960s, its application to optimal transport seems fairly new – I learned about it in talks by Gabriel Peyré and Bernhard Schmitzer, and the reference is Sinkhorn Distances: Lightspeed Computation of Optimal Transport (presented at NIPS 2013) by Marco Cuturi.