The problem of optimal transport of mass from one distribution to another can be stated in many forms. Here is the formulation going back to Kantorovich: We have two measurable sets ${\Omega_{1}}$ and ${\Omega_{2}}$, coming with two measures ${\mu_{1}}$ and ${\mu_{2}}$. We also have a function ${c:\Omega_{1}\times \Omega_{2}\rightarrow {\mathbb R}}$ which assigns a transport cost, i.e. ${c(x_{1},x_{2})}$ is the cost that it takes to carry one unit of mass from point ${x_{1}\in\Omega_{2}}$ to ${x_{2}\in\Omega_{2}}$. What we want is a plan that says where the mass in ${\mu_{1}}$ should be placed in ${\Omega_{2}}$ (or vice versa). There are different ways to formulate this mathematically.

A simple way is to look for a map ${T:\Omega_{1}\rightarrow\Omega_{2}}$ which says that thet mass in ${x_{1}}$ should be moved to ${T(x_{1})}$. While natural, there is a serious problem with this: What if not all mass at ${x_{1}}$ should go to the same point in ${\Omega_{2}}$? This happens in simple situations where all mass in ${\Omega_{1}}$ sits in just one point, but there are at least two different points in ${\Omega_{2}}$ where mass should end up. This is not going to work with a map ${T}$ as above. So, the map ${T}$ is not flexible enough to model all kinds of transport we may need.

What we want is a way to distribute mass from one point in ${\Omega_{1}}$ to the whole set ${\Omega_{2}}$. This looks like we want maps ${\mathcal{T}}$ which map points in ${\Omega_{1}}$ to functions on ${\Omega_{2}}$, i.e. something like ${\mathcal{T}:\Omega_{1}\rightarrow (\Omega_{2}\rightarrow{\mathbb R})}$ where ${(\Omega_{2}\rightarrow{\mathbb R})}$ stands for some set of functions on ${\Omega_{2}}$. We can de-curry this function to some ${\tau:\Omega_{1}\times\Omega_{2}\rightarrow{\mathbb R}}$ by ${\tau(x_{1},x_{2}) = \mathcal{T}(x_{1})(x_{2})}$. That’s good in principle, be we still run into problems when the target mass distribution ${\mu_{2}}$ is singular in the sense that ${\Omega_{2}}$ is a “continuous” set and there are single points in ${\Omega_{2}}$ that carry some mass according to ${\mu_{2}}$. Since we are in the world of measure theory already, the way out suggests itself: Instead of a function ${\tau}$ on ${\Omega_{1}\times\Omega_{2}}$ we look for a measure ${\pi}$ on ${\Omega_{1}\times \Omega_{2}}$ as a transport plan.

The demand that we should carry all of the mass in ${\Omega_{1}}$ to reach all of ${\mu_{2}}$ is formulated by marginals. For simplicity we just write these constraints as

$\displaystyle \int_{\Omega_{2}}\pi\, d x_{2} = \mu_{1},\qquad \int_{\Omega_{1}}\pi\, d x_{1} = \mu_{2}$

(with the understanding that the first equation really means that for all continuous function ${f:\Omega_{1}\rightarrow {\mathbb R}}$ it holds that ${\int_{\Omega_{1}\times \Omega_{2}} f(x_{1})\,d\pi(x_{1},x_{2}) = \int_{\Omega_{1}}f(x_{1})\,d\mu_{1}(x_{1})}$).

This leads us to the full transport problem

$\displaystyle \min_{\pi}\int_{\Omega_{1}\times \Omega_{2}}c(x,y)\,d\pi(x_{1}x_{2})\quad \text{s.t.}\quad \int_{\Omega_{2}}\pi\, d x_{2} = \mu_{1},\quad \int_{\Omega_{1}}\pi\, d x_{1} = \mu_{2}.$

There is the following theorem which characterizes optimality of a plan and which is the topic of this post:

Theorem 1 (Fundamental theorem of optimal transport) Under some technicalities we can say that a plan ${\pi}$ which fulfills the marginal constraints is optimal if and only if one of the following equivalent conditions is satisfied:

1. The support ${\mathrm{supp}(\pi)}$ of ${\pi}$ is ${c}$-cyclically monotone.
2. There exists a ${c}$-concave function ${\phi}$ such that its ${c}$-superdifferential contains the support of ${\pi}$, i.e. ${\mathrm{supp}(\pi)\subset \partial^{c}\phi}$.

A few clarifications: The technicalities involve continuity, integrability, and boundedness conditions of ${c}$ and integrability conditions on the marginals. The full theorem can be found as Theorem 1.13 in A user’s guide to optimal transport by Ambrosio and Gigli. Also the notions ${c}$-cyclically monotone, ${c}$-concave and ${c}$-superdifferential probably need explanation. We start with a simpler notion: ${c}$-monotonicity:

Definition 2 A set ${\Gamma\subset\Omega_{1}\times\Omega_{2}}$ is ${c}$-monotone, if for all ${(x_{1}x_{2}),(x_{1}',x_{2}')\in\Gamma}$ it holds that

$\displaystyle c(x_{1},x_{2}) + c(x_{1}',x_{2}')\leq c(x_{1},x_{2}') + c(x_{1}',x_{2}).$

If you find it unclear what this has to do with monotonicity, look at this example:

Example 1 Let ${\Omega_{1/2}\in{\mathbb R}^{d}}$ and let ${c(x_{1},x_{2}) = \langle x_{1},x_{2}\rangle}$ be the usual scalar product. Then ${c}$-monotonicity is the condition that for all ${(x_{1}x_{2}),(x_{1}',x_{2}')\in\Gamma\subset\Omega_{1}\times\Omega_{2}}$ it holds that

$\displaystyle 0\leq \langle x_{1}-x_{1}',x_{2}-x_{2}'\rangle$

which may look more familiar. Indeed, when ${\Omega_{1}}$ and ${\Omega_{2}}$ are subset of the real line, the above conditions means that the set ${\Gamma}$ somehow “moves up in ${\Omega_{2}}$” if we “move right in ${\Omega_{1}}$”. So ${c}$-monotonicity for ${c(x_{1},x_{2}) = \langle x_{1},x_{2}\rangle}$ is something like “monotonically increasing”. Similarly, for ${c(x_{1},x_{2}) = -\langle x_{1},x_{2}\rangle}$, ${c}$-monotonicity means “monotonically decreasing”.

You may say that both ${c(x_{1},x_{2}) = \langle x_{1},x_{2}\rangle}$ and ${c(x_{1},x_{2}) = -\langle x_{1},x_{2}\rangle}$ are strange cost functions and I can’t argue with that. But here comes: ${c(x_{1},x_{2}) = |x_{1}-x_{2}|^{2}}$ (${|\,\cdot\,|}$ being the euclidean norm) seems more natural, right? But if we have a transport plan ${\pi}$ for this ${c}$ for some marginals ${\mu_{1}}$ and ${\mu_{2}}$ we also have

$\displaystyle \begin{array}{rcl} \int_{\Omega_{1}\times \Omega_{2}}c(x_{1},x_{2})d\pi(x_{1},x_{2}) & = & \int_{\Omega_{1}\times \Omega_{2}}|x_{1}|^{2} d\pi(x_{1},x_{2})\\ &&\quad- \int_{\Omega_{1}\times \Omega_{2}}\langle x_{1},x_{2}\rangle d\pi(x_{1},x_{2})\\ && \qquad+ \int_{\Omega_{1}\times \Omega_{2}} |x_{2}|^{2}d\pi(x_{1},x_{2})\\ & = &\int_{\Omega_{1}}|x_{1}|^{2}d\mu_{1}(x_{1}) - \int_{\Omega_{1}\times \Omega_{2}}\langle x_{1},x_{2}\rangle d\pi(x_{1},x_{2}) + \int_{\Omega_{2}}|x_{2}|^{2}d\mu_{2}(x_{2}) \end{array}$

i.e., the transport cost for ${c(x_{1},x_{2}) = |x_{1}-x_{2}|^{2}}$ differs from the one for ${c(x_{1},x_{2}) = -\langle x_{1},x_{2}\rangle}$ only by a constant independent of ${\pi}$, so may well use the latter.

The fundamental theorem of optimal transport uses the notion of ${c}$-cyclical monotonicity which is stronger that just ${c}$-monotonicity:

Definition 3 A set ${\Gamma\subset \Omega_{1}\times \Omega_{2}}$ is ${c}$-cyclically monotone, if for all ${(x_{1}^{i},x_{2}^{i})\in\Gamma}$, ${i=1,\dots n}$ and all permutations ${\sigma}$ of ${\{1,\dots,n\}}$ it holds that

$\displaystyle \sum_{i=1}^{n}c(x_{1}^{i},x_{2}^{i}) \leq \sum_{i=1}^{n}c(x_{1}^{i},x_{2}^{\sigma(i)}).$

For ${n=2}$ we get back the notion of ${c}$-monotonicity.

Definition 4 A function ${\phi:\Omega_{1}\rightarrow {\mathbb R}}$ is ${c}$-concave if there exists some function ${\psi:\Omega_{2}\rightarrow{\mathbb R}}$ such that

$\displaystyle \phi(x_{1}) = \inf_{x_{2}\in\Omega_{2}}c(x_{1},x_{2}) - \psi(x_{2}).$

This definition of ${c}$-concavity resembles the notion of convex conjugate:

Example 2 Again using ${c(x_{1},x_{2}) = -\langle x_{1},x_{2}\rangle}$ we get that a function ${\phi}$ is ${c}$-concave if

$\displaystyle \phi(x_{1}) = \inf_{x_{2}}-\langle x_{1},x_{2}\rangle - \psi(x_{2}),$

and, as an infimum over linear functions, ${\phi}$ is clearly concave in the usual way.

Definition 5 The ${c}$-superdifferential of a ${c}$-concave function is

$\displaystyle \partial^{c}\phi = \{(x_{1},x_{2})\mid \phi(x_{1}) + \phi^{c}(x_{2}) = c(x,y)\},$

where ${\phi^{c}}$ is the ${c}$-conjugate of ${\phi}$ defined by

$\displaystyle \phi^{c}(x_{2}) = \inf_{x_{1}\in\Omega_{1}}c(x_{1},x_{2}) -\phi(x_{1}).$

Again one may look at ${c(x_{1},x_{2}) = -\langle x_{1},x_{2}\rangle}$ and observe that the ${c}$-superdifferential is the usual superdifferential related to the supergradient of concave functions (there is a Wikipedia page for subgradient only, but the concept is the same with reversed signs in some sense).

Now let us sketch the proof of the fundamental theorem of optimal transport: \medskip

Proof (of the fundamental theorem of optimal transport). Let ${\pi}$ be an optimal transport plan. We aim to show that ${\mathrm{supp}(\pi)}$ is ${c}$-cyclically monotone and assume the contrary. That is, we assume that there are points ${(x_{1}^{i},x_{2}^{i})\in\mathrm{supp}(\pi)}$ and a permutation ${\sigma}$ such that

$\displaystyle \sum_{i=1}^{n}c(x_{1}^{i},x_{2}^{i}) > \sum_{i=1}^{n}c(x_{1}^{i},x_{2}^{\sigma(i)}).$

We aim to construct a ${\tilde\pi}$ such that ${\tilde\pi}$ is still feasible but has a smaller transport cost. To do so, we note that continuity of ${c}$ implies that there are neighborhoods ${U_{i}}$ of ${x_{1}^{i}}$ and ${V_{i}}$ of ${x_{2}^{i}}$ such that for all ${u_{i}\in U_{1}}$ and ${v_{i}\in V_{i}}$ it holds that

$\displaystyle \sum_{i=1}^{n}c(u_{i},v_{\sigma(i)}) - c(u_{i},v_{i})<0.$

We use this to construct a better plan ${\tilde \pi}$: Take the mass of ${\pi}$ in the sets ${U_{i}\times V_{i}}$ and shift it around. The full construction is a little messy to write down: Define a probability measure ${\nu}$ on the product ${X = \bigotimes_{i=1}^{N}U_{i}\times V_{i}}$ as the product of the measures ${\tfrac{1}{\pi(U_{i}\times V_{i})}\pi|_{U_{i}\times V_{i}}}$. Now let ${P^{U_{1}}}$ and ${P^{V_{i}}}$ be the projections of ${X}$ onto ${U_{i}}$ and ${V_{i}}$, respectively, and set

$\displaystyle \nu = \tfrac{\min_{i}\pi(U_{i}\times V_{i})}{n}\sum_{i=1}^{n}(P^{U_{i}},P^{V_{\sigma(i)}})_{\#}\nu - (P^{U_{i}},P^{V_{i}})_{\#}\nu$

where ${_{\#}}$ denotes the pushforward of measures. Note that the new measure ${\nu}$ is signed and that ${\tilde\pi = \pi + \nu}$ fulfills

1. ${\tilde\pi}$ is a non-negative measure
2. ${\tilde\pi}$ is feasible, i.e. has the correct marginals
3. ${\int c\,d\tilde \pi<\int c\,d\pi}$

which, all together, gives a contradiction to optimality of ${\pi}$. The implication of item 1 to item 2 of the theorem is not really related to optimal transport but a general fact about ${c}$-concavity and ${c}$-cyclical monotonicity (c.f.~this previous blog post of mine where I wrote a similar statement for convexity). So let us just prove the implication from item 2 to optimality of ${\pi}$: Let ${\pi}$ fulfill item 2, i.e. ${\pi}$ is feasible and ${\mathrm{supp}(\pi)}$ is contained in the ${c}$-superdifferential of some ${c}$-concave function ${\phi}$. Moreover let ${\tilde\pi}$ be any feasible transport plan. We aim to show that ${\int c\,d\pi\leq \int c\,d\tilde\pi}$. By definition of the ${c}$-superdifferential and the ${c}$-conjugate we have

$\displaystyle \begin{array}{rcl} \phi(x_{1}) + \phi^{c}(x_{2}) &=& c(x_{1},x_{2})\ \forall (x_{1},x_{2})\in\partial^{c}\phi\\ \phi(x_{1}) + \phi^{c}(x_{2}) & \leq& c(x_{1},x_{2})\ \forall (x_{1},x_{2})\in\Omega_{1}\times \Omega_{2}. \end{array}$

Since ${\mathrm{supp}(\pi)\subset\partial^{c}\phi}$ by assumption, this gives

$\displaystyle \begin{array}{rcl} \int_{\Omega_{1}\times \Omega_{2}}c(x_{1},x_{2})\,d\pi(x_{1},x_{2}) & =& \int_{\Omega_{1}\times \Omega_{2}}\phi(x_{1}) + \phi^{c}(x_{1})\,d\pi(x_{1},x_{2})\\ &=& \int_{\Omega_{1}}\phi(x_{1})\,d\mu_{1}(x_{1}) + \int_{\Omega_{1}}\phi^{c}(x_{2})\,d\mu_{2}(x_{2})\\ &=& \int_{\Omega_{1}\times \Omega_{2}}\phi(x_{1}) + \phi^{c}(x_{1})\,d\tilde\pi(x_{1},x_{2})\\ &\leq& \int_{\Omega_{1}\times \Omega_{2}}c(x_{1},x_{2})\,d\tilde\pi(x_{1},x_{2}) \end{array}$

which shows the claim.

${\Box}$

Corollary 6 If ${\pi}$ is a measure on ${\Omega_{1}\times \Omega_{2}}$ which is supported on a ${c}$-superdifferential of a ${c}$-concave function, then ${\pi}$ is an optimal transport plan for its marginals with respect to the transport cost ${c}$.