The problem of optimal transport of mass from one distribution to another can be stated in many forms. Here is the formulation going back to Kantorovich: We have two measurable sets {\Omega_{1}} and {\Omega_{2}}, coming with two measures {\mu_{1}} and {\mu_{2}}. We also have a function {c:\Omega_{1}\times \Omega_{2}\rightarrow {\mathbb R}} which assigns a transport cost, i.e. {c(x_{1},x_{2})} is the cost that it takes to carry one unit of mass from point {x_{1}\in\Omega_{2}} to {x_{2}\in\Omega_{2}}. What we want is a plan that says where the mass in {\mu_{1}} should be placed in {\Omega_{2}} (or vice versa). There are different ways to formulate this mathematically.

A simple way is to look for a map {T:\Omega_{1}\rightarrow\Omega_{2}} which says that thet mass in {x_{1}} should be moved to {T(x_{1})}. While natural, there is a serious problem with this: What if not all mass at {x_{1}} should go to the same point in {\Omega_{2}}? This happens in simple situations where all mass in {\Omega_{1}} sits in just one point, but there are at least two different points in {\Omega_{2}} where mass should end up. This is not going to work with a map {T} as above. So, the map {T} is not flexible enough to model all kinds of transport we may need.

What we want is a way to distribute mass from one point in {\Omega_{1}} to the whole set {\Omega_{2}}. This looks like we want maps {\mathcal{T}} which map points in {\Omega_{1}} to functions on {\Omega_{2}}, i.e. something like {\mathcal{T}:\Omega_{1}\rightarrow (\Omega_{2}\rightarrow{\mathbb R})} where {(\Omega_{2}\rightarrow{\mathbb R})} stands for some set of functions on {\Omega_{2}}. We can de-curry this function to some {\tau:\Omega_{1}\times\Omega_{2}\rightarrow{\mathbb R}} by {\tau(x_{1},x_{2}) = \mathcal{T}(x_{1})(x_{2})}. That’s good in principle, be we still run into problems when the target mass distribution {\mu_{2}} is singular in the sense that {\Omega_{2}} is a “continuous” set and there are single points in {\Omega_{2}} that carry some mass according to {\mu_{2}}. Since we are in the world of measure theory already, the way out suggests itself: Instead of a function {\tau} on {\Omega_{1}\times\Omega_{2}} we look for a measure {\pi} on {\Omega_{1}\times \Omega_{2}} as a transport plan.

The demand that we should carry all of the mass in {\Omega_{1}} to reach all of {\mu_{2}} is formulated by marginals. For simplicity we just write these constraints as

\displaystyle \int_{\Omega_{2}}\pi\, d x_{2} = \mu_{1},\qquad \int_{\Omega_{1}}\pi\, d x_{1} = \mu_{2}

(with the understanding that the first equation really means that for all continuous function {f:\Omega_{1}\rightarrow {\mathbb R}} it holds that {\int_{\Omega_{1}\times \Omega_{2}} f(x_{1})\,d\pi(x_{1},x_{2}) = \int_{\Omega_{1}}f(x_{1})\,d\mu_{1}(x_{1})}).

This leads us to the full transport problem

\displaystyle \min_{\pi}\int_{\Omega_{1}\times \Omega_{2}}c(x,y)\,d\pi(x_{1}x_{2})\quad \text{s.t.}\quad \int_{\Omega_{2}}\pi\, d x_{2} = \mu_{1},\quad \int_{\Omega_{1}}\pi\, d x_{1} = \mu_{2}.

There is the following theorem which characterizes optimality of a plan and which is the topic of this post:

Theorem 1 (Fundamental theorem of optimal transport) Under some technicalities we can say that a plan {\pi} which fulfills the marginal constraints is optimal if and only if one of the following equivalent conditions is satisfied:

  1. The support {\mathrm{supp}(\pi)} of {\pi} is {c}-cyclically monotone.
  2. There exists a {c}-concave function {\phi} such that its {c}-superdifferential contains the support of {\pi}, i.e. {\mathrm{supp}(\pi)\subset \partial^{c}\phi}.


A few clarifications: The technicalities involve continuity, integrability, and boundedness conditions of {c} and integrability conditions on the marginals. The full theorem can be found as Theorem 1.13 in A user’s guide to optimal transport by Ambrosio and Gigli. Also the notions {c}-cyclically monotone, {c}-concave and {c}-superdifferential probably need explanation. We start with a simpler notion: {c}-monotonicity:

Definition 2 A set {\Gamma\subset\Omega_{1}\times\Omega_{2}} is {c}-monotone, if for all {(x_{1}x_{2}),(x_{1}',x_{2}')\in\Gamma} it holds that

\displaystyle c(x_{1},x_{2}) + c(x_{1}',x_{2}')\leq c(x_{1},x_{2}') + c(x_{1}',x_{2}).


If you find it unclear what this has to do with monotonicity, look at this example:

Example 1 Let {\Omega_{1/2}\in{\mathbb R}^{d}} and let {c(x_{1},x_{2}) = \langle x_{1},x_{2}\rangle} be the usual scalar product. Then {c}-monotonicity is the condition that for all {(x_{1}x_{2}),(x_{1}',x_{2}')\in\Gamma\subset\Omega_{1}\times\Omega_{2}} it holds that

\displaystyle 0\leq \langle x_{1}-x_{1}',x_{2}-x_{2}'\rangle

which may look more familiar. Indeed, when {\Omega_{1}} and {\Omega_{2}} are subset of the real line, the above conditions means that the set {\Gamma} somehow “moves up in {\Omega_{2}}” if we “move right in {\Omega_{1}}”. So {c}-monotonicity for {c(x_{1},x_{2}) = \langle x_{1},x_{2}\rangle} is something like “monotonically increasing”. Similarly, for {c(x_{1},x_{2}) = -\langle x_{1},x_{2}\rangle}, {c}-monotonicity means “monotonically decreasing”.

You may say that both {c(x_{1},x_{2}) = \langle x_{1},x_{2}\rangle} and {c(x_{1},x_{2}) = -\langle x_{1},x_{2}\rangle} are strange cost functions and I can’t argue with that. But here comes: {c(x_{1},x_{2}) = |x_{1}-x_{2}|^{2}} ({|\,\cdot\,|} being the euclidean norm) seems more natural, right? But if we have a transport plan {\pi} for this {c} for some marginals {\mu_{1}} and {\mu_{2}} we also have

\displaystyle \begin{array}{rcl} \int_{\Omega_{1}\times \Omega_{2}}c(x_{1},x_{2})d\pi(x_{1},x_{2}) & = & \int_{\Omega_{1}\times \Omega_{2}}|x_{1}|^{2} d\pi(x_{1},x_{2})\\ &&\quad- \int_{\Omega_{1}\times \Omega_{2}}\langle x_{1},x_{2}\rangle d\pi(x_{1},x_{2})\\ && \qquad+ \int_{\Omega_{1}\times \Omega_{2}} |x_{2}|^{2}d\pi(x_{1},x_{2})\\ & = &\int_{\Omega_{1}}|x_{1}|^{2}d\mu_{1}(x_{1}) - \int_{\Omega_{1}\times \Omega_{2}}\langle x_{1},x_{2}\rangle d\pi(x_{1},x_{2}) + \int_{\Omega_{2}}|x_{2}|^{2}d\mu_{2}(x_{2}) \end{array}

i.e., the transport cost for {c(x_{1},x_{2}) = |x_{1}-x_{2}|^{2}} differs from the one for {c(x_{1},x_{2}) = -\langle x_{1},x_{2}\rangle} only by a constant independent of {\pi}, so may well use the latter.

The fundamental theorem of optimal transport uses the notion of {c}-cyclical monotonicity which is stronger that just {c}-monotonicity:

Definition 3 A set {\Gamma\subset \Omega_{1}\times \Omega_{2}} is {c}-cyclically monotone, if for all {(x_{1}^{i},x_{2}^{i})\in\Gamma}, {i=1,\dots n} and all permutations {\sigma} of {\{1,\dots,n\}} it holds that

\displaystyle \sum_{i=1}^{n}c(x_{1}^{i},x_{2}^{i}) \leq \sum_{i=1}^{n}c(x_{1}^{i},x_{2}^{\sigma(i)}).


For {n=2} we get back the notion of {c}-monotonicity.

Definition 4 A function {\phi:\Omega_{1}\rightarrow {\mathbb R}} is {c}-concave if there exists some function {\psi:\Omega_{2}\rightarrow{\mathbb R}} such that

\displaystyle \phi(x_{1}) = \inf_{x_{2}\in\Omega_{2}}c(x_{1},x_{2}) - \psi(x_{2}).


This definition of {c}-concavity resembles the notion of convex conjugate:

Example 2 Again using {c(x_{1},x_{2}) = -\langle x_{1},x_{2}\rangle} we get that a function {\phi} is {c}-concave if

\displaystyle \phi(x_{1}) = \inf_{x_{2}}-\langle x_{1},x_{2}\rangle - \psi(x_{2}),

and, as an infimum over linear functions, {\phi} is clearly concave in the usual way.

Definition 5 The {c}-superdifferential of a {c}-concave function is

\displaystyle \partial^{c}\phi = \{(x_{1},x_{2})\mid \phi(x_{1}) + \phi^{c}(x_{2}) = c(x,y)\},

where {\phi^{c}} is the {c}-conjugate of {\phi} defined by

\displaystyle \phi^{c}(x_{2}) = \inf_{x_{1}\in\Omega_{1}}c(x_{1},x_{2}) -\phi(x_{1}).


Again one may look at {c(x_{1},x_{2}) = -\langle x_{1},x_{2}\rangle} and observe that the {c}-superdifferential is the usual superdifferential related to the supergradient of concave functions (there is a Wikipedia page for subgradient only, but the concept is the same with reversed signs in some sense).

Now let us sketch the proof of the fundamental theorem of optimal transport: \medskip

Proof (of the fundamental theorem of optimal transport). Let {\pi} be an optimal transport plan. We aim to show that {\mathrm{supp}(\pi)} is {c}-cyclically monotone and assume the contrary. That is, we assume that there are points {(x_{1}^{i},x_{2}^{i})\in\mathrm{supp}(\pi)} and a permutation {\sigma} such that

\displaystyle \sum_{i=1}^{n}c(x_{1}^{i},x_{2}^{i}) > \sum_{i=1}^{n}c(x_{1}^{i},x_{2}^{\sigma(i)}).

We aim to construct a {\tilde\pi} such that {\tilde\pi} is still feasible but has a smaller transport cost. To do so, we note that continuity of {c} implies that there are neighborhoods {U_{i}} of {x_{1}^{i}} and {V_{i}} of {x_{2}^{i}} such that for all {u_{i}\in U_{1}} and {v_{i}\in V_{i}} it holds that

\displaystyle \sum_{i=1}^{n}c(u_{i},v_{\sigma(i)}) - c(u_{i},v_{i})<0.

We use this to construct a better plan {\tilde \pi}: Take the mass of {\pi} in the sets {U_{i}\times V_{i}} and shift it around. The full construction is a little messy to write down: Define a probability measure {\nu} on the product {X = \bigotimes_{i=1}^{N}U_{i}\times V_{i}} as the product of the measures {\tfrac{1}{\pi(U_{i}\times V_{i})}\pi|_{U_{i}\times V_{i}}}. Now let {P^{U_{1}}} and {P^{V_{i}}} be the projections of {X} onto {U_{i}} and {V_{i}}, respectively, and set

\displaystyle \nu = \tfrac{\min_{i}\pi(U_{i}\times V_{i})}{n}\sum_{i=1}^{n}(P^{U_{i}},P^{V_{\sigma(i)}})_{\#}\nu - (P^{U_{i}},P^{V_{i}})_{\#}\nu

where {_{\#}} denotes the pushforward of measures. Note that the new measure {\nu} is signed and that {\tilde\pi = \pi + \nu} fulfills

  1. {\tilde\pi} is a non-negative measure
  2. {\tilde\pi} is feasible, i.e. has the correct marginals
  3. {\int c\,d\tilde \pi<\int c\,d\pi}

which, all together, gives a contradiction to optimality of {\pi}. The implication of item 1 to item 2 of the theorem is not really related to optimal transport but a general fact about {c}-concavity and {c}-cyclical monotonicity (c.f.~this previous blog post of mine where I wrote a similar statement for convexity). So let us just prove the implication from item 2 to optimality of {\pi}: Let {\pi} fulfill item 2, i.e. {\pi} is feasible and {\mathrm{supp}(\pi)} is contained in the {c}-superdifferential of some {c}-concave function {\phi}. Moreover let {\tilde\pi} be any feasible transport plan. We aim to show that {\int c\,d\pi\leq \int c\,d\tilde\pi}. By definition of the {c}-superdifferential and the {c}-conjugate we have

\displaystyle \begin{array}{rcl} \phi(x_{1}) + \phi^{c}(x_{2}) &=& c(x_{1},x_{2})\ \forall (x_{1},x_{2})\in\partial^{c}\phi\\ \phi(x_{1}) + \phi^{c}(x_{2}) & \leq& c(x_{1},x_{2})\ \forall (x_{1},x_{2})\in\Omega_{1}\times \Omega_{2}. \end{array}

Since {\mathrm{supp}(\pi)\subset\partial^{c}\phi} by assumption, this gives

\displaystyle \begin{array}{rcl} \int_{\Omega_{1}\times \Omega_{2}}c(x_{1},x_{2})\,d\pi(x_{1},x_{2}) & =& \int_{\Omega_{1}\times \Omega_{2}}\phi(x_{1}) + \phi^{c}(x_{1})\,d\pi(x_{1},x_{2})\\ &=& \int_{\Omega_{1}}\phi(x_{1})\,d\mu_{1}(x_{1}) + \int_{\Omega_{1}}\phi^{c}(x_{2})\,d\mu_{2}(x_{2})\\ &=& \int_{\Omega_{1}\times \Omega_{2}}\phi(x_{1}) + \phi^{c}(x_{1})\,d\tilde\pi(x_{1},x_{2})\\ &\leq& \int_{\Omega_{1}\times \Omega_{2}}c(x_{1},x_{2})\,d\tilde\pi(x_{1},x_{2}) \end{array}

which shows the claim.


Corollary 6 If {\pi} is a measure on {\Omega_{1}\times \Omega_{2}} which is supported on a {c}-superdifferential of a {c}-concave function, then {\pi} is an optimal transport plan for its marginals with respect to the transport cost {c}.



This is a short follow up on my last post where I wrote about the sweet spot of the stepsize of the Douglas-Rachford iteration. For the case \beta-Lipschitz + \mu-strongly monotone, the iteration with stepsize t converges linear with rate

\displaystyle r(t) = \tfrac{1}{2(1+t\mu)}\left(\sqrt{2t^{2}\mu^{2}+2t\mu + 1 +2(1 - \tfrac{1}{(1+t\beta)^{2}} - \tfrac1{1+t^{2}\beta^{2}})t\mu(1+t\mu)} + 1\right)

Here is animated plot of this contraction factor depending on \beta and \mu and t acts as time variable:


What is interesting is, that this factor has increasing or decreasing in t depending on the values of \beta and \mu.

For each pair (\beta,\mu) there is a best t^* and also a smallest contraction factor r(t^*). Here are plots of these quantities:

Comparing the plot of te optimal contraction factor to the animated plot above, you see that the right choice of the stepsize matters a lot.

I blogged about the Douglas-Rachford method before here and here. It’s a method to solve monotone inclusions in the form

\displaystyle 0 \in Ax + Bx

with monotone multivalued operators {A,B} from a Hilbert space into itself. Using the resolvent {J_{A} = (I+A)^{-1}} and the reflector {R_{A} = 2J_{A} - I}, the Douglas-Rachford iteration is concisely written as

\displaystyle u^{n+1} = \tfrac12(I + R_{B}R_{A})u_{n}.

The convergence of the method has been clarified is a number of papers, see, e.g.

Lions, Pierre-Louis, and Bertrand Mercier. “Splitting algorithms for the sum of two nonlinear operators.” SIAM Journal on Numerical Analysis 16.6 (1979): 964-979.

for the first treatment in the context of monotone operators and

Svaiter, Benar Fux. “On weak convergence of the Douglas–Rachford method.” SIAM Journal on Control and Optimization 49.1 (2011): 280-287.

for a recent very general convergence result.

Since {tA} is monotone if {A} is monotone and {t>0}, we can introduce a stepsize for the Douglas-Rachford iteration

\displaystyle u^{n+1} = \tfrac12(I + R_{tB}R_{tA})u^{n}.

It turns out, that this stepsize matters a lot in practice; too small and too large stepsizes lead to slow convergence. It is a kind of folk wisdom, that there is “sweet spot” for the stepsize. In a recent preprint Quoc Tran-Dinh and I investigated this sweet spot in the simple case of linear operators {A} and {B} and this tweet has a visualization.

A few days ago Walaa Moursi and Lieven Vandenberghe published the preprint “Douglas-Rachford splitting for a Lipschitz continuous and a strongly monotone operator” and derived some linear convergence rates in the special case they mention in the title. One result (Theorem 4.3) goes as follows: If {A} is monotone and Lipschitz continuous with constant {\beta} and {B} is maximally monotone and {\mu}-strongly monotone, than the Douglas-Rachford iterates converge strongly to a solution with a linear rate

\displaystyle r = \tfrac{1}{2(1+\mu)}\left(\sqrt{2\mu^{2}+2\mu + 1 +2(1 - \tfrac{1}{(1+\beta)^{2}} - \tfrac1{1+\beta^{2}})\mu(1+\mu)} + 1\right).

This is a surprisingly complicated expression, but there is a nice thing about it: It allows to optimize for the stepsize! The rate depends on the stepsize as

\displaystyle r(t) = \tfrac{1}{2(1+t\mu)}\left(\sqrt{2t^{2}\mu^{2}+2t\mu + 1 +2(1 - \tfrac{1}{(1+t\beta)^{2}} - \tfrac1{1+t^{2}\beta^{2}})t\mu(1+t\mu)} + 1\right)

and the two plots of this function below show the sweet spot clearly.


If one knows the Lipschitz constant of {A} and the constant of strong monotonicity of {B}, one can minimize {r(t)} to get on optimal stepsize (in the sense that the guaranteed contraction factor is as small as possible). As Moursi and Vandenberghe explain in their Remark 5.4, this optimization involves finding the root of a polynomial of degree 5, so it is possible but cumbersome.

Now I wonder if there is any hope to show that the adaptive stepsize Quoc and I proposed here (which basically amounts to {t_{n} = \|u^{n}\|/\|Au^{n}\|} in the case of single valued {A} – note that the role of {A} and {B} is swapped in our paper) is able to find the sweet spot (experimentally it does).

The fundamental theorem of calculus relates the derivative of a function to the function itself via an integral. A little bit more precise, it says that one can recover a function from its derivative up to an additive constant (on a simply connected domain).

In one space dimension, one can fix some {x_{0}} in the domain (which has to be an interval) and then set

\displaystyle F(x) = \int_{x_{0}}^{x}f'(t) dt.

Then {F(x) = f(x) + c} with {c= - f(x_{0})}.

Actually, a similar claim is true in higher space dimensions: If {f} is defined on a simply connected domain in {{\mathbb R}^{d}} we can recover {f} from its gradient up to an additive constant as follows: Select some {x_{0}} and set

\displaystyle F(x) = \int_{\gamma}\nabla f(x)\cdot dx \ \ \ \ \ (1)

for any path {\gamma} from {x_{0}} to {x}. Then it holds under suitable conditions that

\displaystyle F(x) = f(x) - f(x_{0}).

And now for something completely different: Convex functions and subgradients.

A function {f} on {{\mathbb R}^{n}} is convex if for every {x} there exists a subgradient {x^{*}\in{\mathbb R}^{n}} such that for all {y} one has the subgradient inequality

\displaystyle f(y) \geq f(x) + \langle x^{*}, y-x\rangle.

Writing this down for {x} and {y} with interchanged roles (and {y^{*}} as corresponding subgradient to {y}), we see that

\displaystyle \langle x-y,x^{*}-y^{*}\rangle \geq 0.

In other words: For a convex function {f} it holds that the subgradient {\partial f} is a (multivalued) monotone mapping. Recall that a multivalued map {A} is monotone, if for every {y \in A(x)} and {y^{*}\in A(y)} it holds that {\langle x-y,x^{*}-y^{*}\rangle \geq 0}. It is not hard to see that not every monotone map is actually a subgradient of a convex function (not even, if we go to “maximally monotone maps”, a notion that we sweep under the rug in this post). A simple counterexample is a (singlevalued) linear monotone map represented by

\displaystyle \begin{bmatrix} 0 & 1\\ -1 & 0 \end{bmatrix}

(which can not be a subgradient of some {f}, since it is not symmetric).

Another hint that monotonicity of a map does not imply that the map is a subgradient is that subgradients have some stronger properties than monotone maps. Let us write down the subgradient inequalities for a number of points {x_{0},\dots x_{n}}:

\displaystyle \begin{array}{rcl} f(x_{1}) & \geq & f(x_{0}) + \langle x_{0}^{*},x_{1}-x_{0}\rangle\\ f(x_{2}) & \geq & f(x_{1}) + \langle x_{1}^{*},x_{2}-x_{1}\rangle\\ \vdots & & \vdots\\ f(x_{n}) & \geq & f(x_{n-1}) + \langle x_{n-1}^{*},x_{n}-x_{n-1}\rangle\\ f(x_{0}) & \geq & f(x_{n}) + \langle x_{n}^{*},x_{0}-x_{n}\rangle. \end{array}

If we sum all these up, we obtain

\displaystyle 0 \geq \langle x_{0}^{*},x_{1}-x_{0}\rangle + \langle x_{1}^{*},x_{2}-x_{1}\rangle + \cdots + \langle x_{n-1}^{*},x_{n}-x_{n-1}\rangle + \langle x_{n}^{*},x_{0}-x_{n}\rangle.

This property is called {n}-cyclical monotonicity. In these terms we can say that a subgradient of a convex function is cyclical monotone, which means that it is {n}-cyclically monotone for every integer {n}.

By a remarkable result by Rockafellar, the converse is also true:

Theorem 1 (Rockafellar, 1966) Let {A} by a cyclically monotone map. Then there exists a convex function {f} such that {A = \partial f}.

Even more remarkable, the proof is somehow “just an application of the fundamental theorem of calculus” where we recover a function by its subgradient (up to an additive constant that depends on the basepoint).

Proof: we aim to “reconstruct” {f} from {A = \partial f}. The trick is to choose some base point {x_{0}} and corresponding {x_{0}^{*}\in A(x_{0})} and set

\displaystyle \begin{array}{rcl} f(x) &=& \sup\left\{ \langle x_{0}^{*}, x_{1}-x_{0}\rangle + \langle x_{1}^{*},x_{2}-x_{1}\rangle + \cdots + \langle x_{n-1}^{*},x_{n}-x_{n-1}\rangle\right.\\&& \qquad\left. + \langle x_{n}^{*},x-x_{n}\rangle\right\}\qquad\qquad (0)\end{array}

where the supremum is taken over all integers {m} and all pairs {x_{i}^{*}\in A(x_{i})} {i=1,\dots, m}. As a supremum of affine functions {f} is clearly convex (even lower semicontinuous) and {f(x_{0}) = 0} since {A} is cyclically monotone (this shows that {f} is proper, i.e. not equal to {\infty} everywhere). Finally, for {\bar x^{*}\in A(\bar x)} we have

\displaystyle \begin{array}{rcl} f(x) &\geq& \sup\left\{ \langle x_{0}^{*}, x_{1}-x_{0}\rangle + \langle x_{1}^{*},x_{2}-x_{1}\rangle + \cdots + \langle x_{n-1}^{*},x_{n}-x_{n-1}\rangle\right.\\ & & \qquad \left.+ \langle x_{n}^{*},\bar x-x_{n}\rangle + \langle \bar x^{*},\bar x-\bar x\rangle\right\} \end{array}

with the supremum taken over all integers {m} and all pairs {x_{i}^{*}\in A(x_{i})} {i=1,\dots, m}. The right hand side is equal to {f(x) + \langle \bar x^{*},x-\bar x\rangle} and this shows that {f} is indeed convex. \Box

Where did we use the fundamental theorem of calculus? Let us have another look at equation~(0). Just for simplicity, let us denote {x_{i}^{*} =\nabla f(x_{i})}. Now consider a path {\gamma} from {x_{0}} to {x} and points {0=t_{0}< t_{1}<\cdots < t_{n}< t_{n+1} = 1} with {\gamma(t_{i}) = x_{i}}. Then the term inside the supremum of~(0) equals

\displaystyle \langle \nabla f(\gamma(t_{0})),\gamma(t_{1})-\gamma(t_{0})\rangle + \dots + \langle \nabla f(\gamma(t_{n})),\gamma(t_{n+1})-\gamma(t_{n})\rangle.

This is Riemannian sum for an integral of the form {\int_{\gamma}\nabla f(x)\cdot dx}. By monotonicity of {f}, we increase this sum, if we add another point {\bar t} (e.g. {t_{i}<\bar t<t_{i+1}}, and hence, the supremum does converge to the integral, i.e.~(0) is equal to

\displaystyle f(x) = \int_{\gamma}\nabla f(x)\cdot dx

where {\gamma} is any path from {x_{0}} to {x}.

In my last blog post I wrote about Luxemburg norms which are constructions to get norms out of a non-homogeneous function {\phi:[0,\infty[\rightarrow[0,\infty[} which satisfies {\phi(0) = 0} and are increasing and convex (and thus, continuous). The definition of the Luxemburg norm in this case is

\displaystyle \|x\|_{\phi} := \inf\left\{\lambda>0\ :\ \sum_{k}\phi\left(\frac{|x_{k}|}{\lambda}\right)\leq 1\right\},

and we saw that {\|x\|_{\phi} = c} if {\sum_{k}\phi\left(\frac{|x_{k}|}{c}\right)= 1}.

Actually, one can have a little more flexibility in the construction as one can also use different functions {\phi} in each coordinate: If {\phi_{k}} are functions as {\phi} above, we can define

\displaystyle \|x\|_{\phi_{k}} := \inf\left\{\lambda>0\ :\ \sum_{k}\phi_{k}\left(\frac{|x_{k}|}{\lambda}\right)\leq 1\right\},

and it still holds that {\|x\|_{\phi_{k}} = c} if and only if {\sum_{k}\phi_{k}\left(\frac{|x_{k}|}{c}\right)= 1}. The proof that this construction indeed gives a norm is the same as in the one in the previous post.

This construction allows, among other things, to construct norms that behave like different {p}-norms in different directions. Here is a simple example: In the case of {x\in{\mathbb R}^{d}} we can split the variables into two groups, say the first {k} variables and the last {d-k} variables. The first group shall be treated with a {p}-norm and the second group with a {q}-norm. For the respective Luxemburg norm one has

\displaystyle \|x\|:= \inf\left\{\lambda>0\ :\ \sum_{i=1}^{k}\frac{|x_{i}|^{p}}{\lambda^{p}} + \sum_{i=k+1}^{d}\frac{|x_{i}|^{q}}{\lambda^{q}}\leq 1\right\},

Note that there is a different way to do a similar thing, namely a mixed norm defined as

\displaystyle \|x\|_{p,q}:= \left(\sum_{i=1}^{k}|x_{i}|^{p}\right)^{1/p} + \left(\sum_{i=k+1}^{d}|x_{i}|^{q}\right)^{1/q}.

As any two norms, these are equivalent, but they induce a different geometry. On top of that, one could in principle also consider {\Phi} functionals

\displaystyle \Phi_{p,q}(x) = \sum_{i=1}^{k}|x_{i}|^{p} + \sum_{i=k+1}^{d}|x_{i}|^{q},

which is again something different.

A bit more general, we may consider all these three conditions for general partitions of the index sets and a different exponent for each set.

Here are some observations on the differences:

  • For the Luxemburg norm the only thing that counts are the exponents (or functions {\phi_{k}}). If you partition the index set into two parts but give the same exponents to both, the Luxemburg norm is the same as if you would consider the two parts as just one part.
  • The mixed norm is not the {p}-norm, even if the set the exponent to {p} for every part.
  • The Luxemburg norm has the flexibility to use other functionals than just the powers.
  • For the mixed norms one could consider additional mixing by not just summing the norms of the different parts, which is the same as taking the {1}-norm of the vector of norms. Of course, other norms are possible, e.g. constructions like

    \displaystyle \left(\left(\sum_{i=1}^{k}|x_{i}|^{p}\right)^{r/p} + \left(\sum_{i=k+1}^{d}|x_{i}|^{q}\right)^{r/q}\right)^{1/r}

    are also norms. (Actually, the term mixed norm is often used for the above construction with {p=q\neq r}.)

Here are some pictures that show the different geometry that these three functionals induce. We consider {d=3} i.e., three-dimensional space, and picture the norm-balls (of level sets in the case the functionals {\Phi}).

  • Consider the case {k=1} and the first exponent to be {p=1} and the second {q=2}. The mixed norm is

    \displaystyle \|x\|_{1,2} = |x_{1}| + \sqrt{x_{2}^{2}+x_{3}^{2}},

    the {\Phi}-functional is

    \displaystyle \Phi(x)_{1,2} = |x_{1}| + x_{2}^{2}+x_{3}^{2},

    and for the Luxemburg norm it holds that

    \displaystyle \|x\| = c\iff \frac{|x_{1}|}{c} + \frac{x_{2}^{2} + x_{3}^{2}}{c^{2}} = 1.

    Here are animated images of the respective level sets/norm balls for radii {0.25, 0.5, 0.75,\dots,3}:balls_122

    You may note the different shapes of the balls of the mixed norm and the Luxemburg norm. Also, the shape of their norm balls stays the same as you scale the radius. The last observation is not true for the {\Phi} functional: Different directions scale differently.

  • Now consider {k=2} and the same exponents. This makes the mixed norm equal to the {1}-norm, since

    \displaystyle \|x\|_{1,2} = |x_{1}| + |x_{2}| + \sqrt{x_{3}^{2}} = \|x\|_{1}.

    The {\Phi}-functional is

    \displaystyle \Phi(x)_{1,2} = |x_{1}| + |x_{2}|+x_{3}^{2},

    and for the Luxemburg norm it holds that

    \displaystyle \|x\| = c\iff \frac{|x_{1}| + |x_{2}|}{c} + \frac{x_{3}^{2}}{c^{2}} = 1.

    Here are animated images of the respective level sets/norm balls of the {\Phi} functional and the Luxemburg norm for the same radii as above (I do not show the balls for the mixed norm – they are just the standard cross-polytopes/{1}-norm balls/octahedra): balls_112Again note how the Luxemburg ball keeps its shape while the level sets of the {\Phi}-functional changes shape while scaling.

  • Now we consider three different exponents: {p_{1}=1}, {p_{2} = 2} and {p_{3} = 3}. The mixed norm is again the {1}-norm. The {\Phi}-functional is

    \displaystyle \Phi(x)_{1,2} = |x_{1}| + x_{2}^{2}+|x_{3}|^{3},

    and for the Luxemburg norm it holds that

    \displaystyle \|x\| = c\iff \frac{|x_{1}|}{c} + \frac{x_{2}^{2}}{c^{2}} + \frac{|x_{3}|^{3}}{c^{3}} = 1.

    Here are animated images of the respective level sets/norm balls of the {\Phi} functional and the Luxemburg norm for the same radii as above (again, the balls for the mixed norm are just the standard cross-polytopes/{1}-norm balls/octahedra):balls_123


A function {\phi:[0,\infty[\rightarrow[0,\infty[} is called {p}-homogeneous, if for any {t>0} it holds that {\phi(tx) = t^{p}\phi(x)}. This implies that {\phi(x) = x^{p}\phi(1)} and thus, all {p}-homogeneous functions on the positive reals are just multiples of powers of {p}. If you have such a power of {p} you can form the {p}-norm

\displaystyle \|x\|_{p} = \left(\sum_{k}|x_{k}|^{p}\right)^{1/p}.

By Minkowski’s inequality, this in indeed a norm for {p\geq 1}.

If we have just some function {\phi:[0,\infty[\rightarrow[0,\infty[} that is not homogeneous, we could still try to do a similar thing and consider

\displaystyle \phi^{-1}\left(\sum_{k}\phi(|x_{k}|)\right).

It is easy to see that one needs {\phi(0)=0} and {\phi} increasing and invertible to have any chance that this expression can be a norm. However, one usually does not get positive homogeneity of the expression, i.e. in general

\displaystyle \phi^{-1}\left(\sum_{k}\phi(t|x_{k}|)\right)\neq t \phi^{-1}\left(\sum_{k}\phi(|x_{k}|)\right).

A construction that helps in this situation is the Luxemburg-norm. The definition is as follows:

Definition 1 (and lemma). Let {\phi:[0,\infty[\rightarrow[0,\infty[} fulfill {\phi(0)=0}, {\phi} be increasing and convex. Then we define the Luxemburg norm for {\phi} as

\displaystyle \|x\|_{\phi} := \inf\{\lambda>0\ :\ \sum_{k}\phi\left(\frac{|x_{k}|}{\lambda}\right)\leq 1\}.

Let’s check if this really is a norm. To do so we make the following observation:

Lemma 2 If {x\neq 0}, then {c = \|x\|_{\phi}} if and only if {\sum_{k}\phi\left(\frac{|x_{k}|}{c}\right) = 1}.

Proof: Basically follows by continuity of {\phi} from the fact that for {\lambda >c} we have {\sum_{k}\phi\left(\frac{|x_{k}|}{\lambda}\right) \leq 1} and for {\lambda<c} we have {\sum_{k}\phi\left(\frac{|x_{k}|}{\lambda}\right) > 1}. \Box

Lemma 3 {\|x\|_{\phi}} is a norm on {{\mathbb R}^{d}}.

Proof: For {x=0} we easily see that {\|x\|_{\phi}=0} (since {\phi(0)=0}). Conversely, if {\|x\|_{\phi}=0}, then {\limsup_{\lambda\rightarrow 0}\sum_{k}\phi\left(\tfrac{|x_{k}|}{\lambda}\right) \leq 1} but since {\lim_{t\rightarrow\infty}\phi(t) = \infty} this can only hold if {x_{1}=\cdots=x_{d}= 0}. For positive homogeneity observe

\displaystyle \begin{array}{rcl} \|tx\|_{\phi} & = & \inf\{\lambda>0\ :\ \sum_{k}\phi\left(\frac{|tx_{k}|}{\lambda}\right)\leq 1\}\\ & = & \inf\{|t|\mu>0\ :\ \sum_{k}\phi\left(\frac{|x_{k}|}{\mu}\right)\leq 1\}\\ & = & |t|\|x\|_{\phi}. \end{array}

For the triangle inequality let {c = \|x\|_{\phi}} and {d = \|y\|_{\phi}} (which implies that {\sum_{k}\phi\left(\tfrac{|x_{k}|}{c}\right)\leq 1} and {\sum_{k}\phi\left(\tfrac{|y_{k}|}{d}\right)\leq 1}). Then it follows

\displaystyle \begin{array}{rcl} \sum_{k}\phi\left(\tfrac{|x_{k}+y_{k}|}{c+d}\right) &\leq& \sum_{k}\phi\left(\tfrac{c}{c+d}\tfrac{|x_{k}|}{c} +\tfrac{d}{c+d}\tfrac{|y_{k}|}{d}\right)\\ &\leq& \tfrac{c}{c+d}\underbrace{\sum_{k} \phi\left(\tfrac{|x_{k}|}{c}\right)}_{\leq 1} + \tfrac{d}{c+d}\underbrace{\sum_{k}\phi\left(\tfrac{|y_{k}|}{d}\right)}_{\leq 1}\\ &\leq& 1 \end{array}

and this implies that {c+d \geq \|x+y\|_{\phi}} as desired. \Box

As a simple exercise you can convince yourself that {\phi(t) = t^{p}} lead to {\|x\|_{\phi} = \|x\|_{p}}.

Let us see how the Luxemburg norm looks for other functions.

Example 1 Let ‘s take {\phi(t) = \exp(t)-1}.


The function {\phi} fulfills the conditions we need and here are the level lines of the functional {x\mapsto \phi^{-1}\left(\sum_{k}\phi(|x_{k}|)\right)} (which is not a norm):


[Levels are {0.5, 1, 2, 3}]

The picture shows that this functional is not a norm ad the shape of the “norm-balls” changes with the size. In contrast to that, the level lines of the respective Luxemburg norm look like this:


[Levels are {0.5, 1, 2, 3}]


I have on open position for a PhD student – here is the official job-ad:

The group of Prof. Dirk Lorenz at the Institute of Analysis and Algebra has an open PhD position for a Scientific Assistant (75\% TV-L EG 13). The position is available as soon as possible and is initially limited to three years.

The scientific focus of the group includes optimization for inverse problems and machine learning, and mathematical imaging. Besides teaching and research, the position includes work in projects or preperation of projects.

We offer

  • a dynamic team and a creative research and work environment
  • mentoring and career planning programs (offered by TU Braunschweig), possibilities for personal qualification, language courses and the possibility to participate in international conferences
  • flexible work hours and a family friendly work environment.

We are looking for candidates with

  • a degree (Masters or Diploma) in mathematics above average,
  • a focus on optimization and/or numerical mathematics and applications, e.g. imaging or machine learning
  • programming skills in MATLAB, Julia and/or Python
  • good knowledge of German and/or English
  • capacity for teamwork, independent work, high level of motivation and organizational talent

Equally qualified severely challenged persons will be given preference. The TU Braunschweig especially encourages women to apply for the position. Please send your application including CV, copies of certificates and letters of recommendation (if any) in electronic form via e-mail to Dirk Lorenz.

Application deadline: 30.06.2018
Contact Prof. Dirk Lorenz | Tel. +49 531 391 7423|

Please forward to anyone interested!