Do you remember the free time you wasted as a kid connecting dots? If you actually liked it, here’s some good news: there are dot-to-dot books for grown-ups! Most notably, there are the books by Thomas Pavitte, with great pictures made of 1000 dots.

So, these are some of the books

and here is some video:


Actually, it takes some time to connect 1000 dots; I need ten minutes or so, depending a bit on the picture.

For the fun of it, I coded some lines in MATLAB to connect the dots automatically. And since I am a lazy programmer, I did not bother to connect the dots in the manner prescribed by the artist, but more efficiently:

1. Greedy paths

For the greedy path, we start at a randomly chosen dot and repeatedly connect the dot where we are with the closest dot we haven’t visited yet.
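The greedy construction is straightforward to code; here is a minimal sketch (in Python rather than the MATLAB I used, with random dots standing in for a real picture):

```python
import math
import random

def greedy_path(points, start=0):
    """Nearest-neighbor path: from the current dot, always move to the
    closest dot that has not been visited yet."""
    unvisited = set(range(len(points)))
    unvisited.remove(start)
    path = [start]
    while unvisited:
        x, y = points[path[-1]]
        nxt = min(unvisited,
                  key=lambda i: math.hypot(points[i][0] - x, points[i][1] - y))
        unvisited.remove(nxt)
        path.append(nxt)
    return path

random.seed(0)
dots = [(random.random(), random.random()) for _ in range(1000)]
path = greedy_path(dots)
print(len(path))  # -> 1000, every dot visited exactly once
```

Plotting the segments between consecutive dots in `path` gives pictures like the one below.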

Here’s how this looks for one of Thomas’ pictures:


2. Shortest paths

The greedy path sometimes makes very large jumps (when it runs into a corner after using up all the dots in the vicinity). This leads to some spurious long edges in the picture. I used some simple heuristics to find a “locally shortest path” through the thousand dots. (“Locally shortest” means that there are no two edges for which swapping improves the total length of the path.) Actually, I started out with the goal of solving the travelling salesman problem over the thousand dots, i.e., of finding the shortest path of all. Then it turned out that

  1. Solving the travelling salesman problem is not that simple – well, it is one of the best investigated NP-hard optimization problems, and there is no doubt that it would take my laptop only little time to solve an instance with 1000 dots if fed with the right code.
  2. The locally shortest path already looks quite appealing, and I am not sure whether the truly shortest path would look any different.
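The “no two edges can be swapped to improve the length” condition is what is usually called a 2-opt local optimum: one keeps reversing sub-segments of the path as long as that shortens it. A small sketch of this improvement step (again Python, not my original MATLAB code):

```python
import math

def path_length(points, path):
    """Total length of the polyline visiting points in the order of path."""
    return sum(math.dist(points[path[k]], points[path[k + 1]])
               for k in range(len(path) - 1))

def two_opt(points, path):
    """Reverse sub-segments while doing so shortens the path; stops when no
    two edges can be swapped to improve the total length ("locally shortest")."""
    path = list(path)
    improved = True
    while improved:
        improved = False
        for i in range(1, len(path) - 1):
            for j in range(i + 1, len(path)):
                cand = path[:i] + path[i:j + 1][::-1] + path[j + 1:]
                if path_length(points, cand) < path_length(points, path) - 1e-12:
                    path, improved = cand, True
    return path

# a deliberately crossing path on the corners of the unit square
square = [(0, 0), (1, 1), (1, 0), (0, 1)]
print(two_opt(square, [0, 1, 2, 3]))  # -> [0, 2, 1, 3] (the crossing removed)
```

Starting 2-opt from the greedy path removes exactly the spurious long jumps mentioned above.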

Here is the locally shortest path:


Oh, by the way: the image is a pug dog, the one that you can partly see on this cover

Here are some more pictures (not by Thomas Pavitte). The middle column shows the greedy path, the right one the locally shortest path:


Since the Wikipedia page on the travelling salesman problem contains a formulation as an integer linear program, I may give that a try in the future…

In the winter semester 2015/2016 I taught the course “Analysis 3” and wrote these lecture notes:

This page serves to collect the errors found in the comments and to document them here.

Errata for the printed version:

  • Page 16: “|\det(U)|” and \int_{\mathbb{R}} instead of \int_{\mathbb{-R}}
  • Page 17, second paragraph: “where G is a subset of \mathbb{R}^{n+1}”
  • Page 34, Example 18.5 ii): the argument was slightly reworded, since it was not entirely correct before
  • Page 45: \varphi_2 = \dots = \varphi''
  • Page 58, Theorem 19.3, Step 2, last line: “This means, however, that \varphi^1,\dots,\varphi^n are linearly dependent. Contradiction!”
    Correction: “This means, however, that \psi^1,\dots,\psi^n are linearly dependent. Contradiction!”
  • Page 59, first remark (second line): solutions \varphi^k instead of \psi^k
  • Page 61, Corollary 19.7: “fundamental matrix for y'=A(x)y” instead of “y'=A(x)”.
  • Page 70, proof of Theorem 19.17: +b(x) instead of -b(x).
  • Page 80, Example 19.29, 2.: \mathrm{i}\omega is a zero.
  • Page 107, Theorem 21.4: \tilde\phi:\tilde T\to M.
  • Page 116: \omega_5 = \tfrac{8\pi^2}{3}.
  • Page 137, bottom: “from f_\nu\to f uniformly it follows…”
  • Page 140, proof of Lemma 23.11: \partial^\alpha\phi_\nu\stackrel{\mathcal{D}}{\to}\partial^\alpha\phi.
  • Page 141: [x\phi(x)]_{-\infty}^0
  • Page 152: (LE)*\rho = \delta_0*\rho

The lecture notes for the accompanying course “Analysis 2” can be found here.

Optimal transport in a discrete setting goes as follows: Consider two vectors {p,q\in{\mathbb R}^N} with non-negative entries and {\|p\|_1 = \|q\|_1}. You may also say that {p} and {q} are two probability distributions or discrete measures with the same total mass. A transport plan for {p} and {q} is a matrix {\pi\in{\mathbb R}^{N\times N}} with non-negative entries such that

\displaystyle \sum_i \pi_{i,j} = p_j,\quad\sum_j\pi_{i,j} = q_i.

The interpretation of {\pi} is that {\pi_{i,j}} indicates how much of the mass {p_j} sitting at the {j}-th entry of {p} is transported to the {i}-th entry of {q}. To speak about optimal transport, we add an objective function to the constraints, namely the cost {c_{i,j}} that says how much it costs to transport one unit from the {j}-th entry of {p} to the {i}-th entry of {q}. Then the optimal transport problem is

\displaystyle \min_{\pi\in{\mathbb R}^{N\times N}} \sum_{i,j}c_{i,j}\pi_{i,j}\quad\text{s.t.}\quad\sum_i \pi_{i,j} = p_j,\quad\sum_j\pi_{i,j} = q_i.

The resulting {\pi} is an optimal transport plan, and the resulting objective value is the minimal cost at which {p} can be transported to {q}. In fact, the minimization problem is a linear program, and not just any: it is one of the best studied linear programs, and I am sure there is a lot that I don’t know about its structure (you may have a look at these slides by Jesus De Loera to get an impression of what is known).

So it looks like discrete optimal transport is a fairly standard problem with standard solvers available. But all solvers have one severe drawback when it comes to large {N}: The optimization variable has {N^2} entries. If {N^2} is too large to store {\pi} or keep {\pi} in memory, it seems that there is not much one can do. This is the memory bottleneck for optimal transport problems.

1. Kantorovich-Rubinstein duality

In the case where the cost has the special form {c_{i,j} = |i-j|}, one can reduce the memory burden. This special cost makes sense if the indices {i} and {j} correspond to spatial locations, since then the cost {c_{i,j} = |i-j|} is just the distance from location {j} to location {i}. It turns out that in this case there is a simple dual optimal transport problem, namely

\displaystyle \max_{f\in{\mathbb R}^N} f^T(p-q)\quad\text{s.t.}\quad |f_i-f_{i-1}|\leq 1,\ 2\leq i\leq N.

(This is a simple form of the Kantorovich-Rubinstein duality and works similarly if the cost {c} is any other metric on the set of indices.) The new optimization problem is still linear, but the memory requirement is only {N} instead of {N^2}, and moreover there are only {O(N)} constraints on {f}. This idea is behind the method from the paper Imaging with Kantorovich-Rubinstein discrepancy by Jan Lellmann, myself, Carola Schönlieb and Tuomo Valkonen.
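As a side remark, this special cost also makes the optimal value itself cheap to compute in one dimension: for {c_{i,j} = |i-j|} the minimal transport cost equals the sum of the absolute values of the cumulative sums of {p-q} (the well-known closed form of the one-dimensional Wasserstein-1 distance). A tiny illustration in Python:

```python
from itertools import accumulate

def w1_line(p, q):
    """Optimal transport cost for the cost c[i][j] = |i - j| on a line:
    the sum of the absolute cumulative sums of p - q (1D Wasserstein-1)."""
    return sum(abs(s) for s in accumulate(pi - qi for pi, qi in zip(p, q)))

# moving one unit of mass from index 0 to index 3 costs 3
print(w1_line([1, 0, 0, 0], [0, 0, 0, 1]))  # -> 3
```

No transport plan with its {N^2} entries is ever formed; only running sums over the {N} entries are needed.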

2. Entropic regularization

In this post I’d like to describe another method to break through the memory bottleneck of optimal transport. This method works for basically any cost {c} but involves a little bit of smoothing/regularization.

We go from the linear program to a non-linear but still convex one by adding the negative entropy of the transport plan to the objective, i.e. we consider the objective

\displaystyle \sum_{i,j}\Big[c_{i,j}\pi_{i,j} + \gamma \pi_{i,j}(\log(\pi_{i,j}) -1)\Big]

for some {\gamma>0}.

What’s the point of doing so? Let’s look at the Lagrangian: For the constraints {\sum_i \pi_{i,j} = p_j} and {\sum_j\pi_{i,j} = q_i} we introduce the ones vector {{\mathbf 1}\in{\mathbb R}^N}, write them as {\pi^T{\mathbf 1} = p} and {\pi{\mathbf 1}=q}, add Lagrange multipliers {\alpha} and {\beta} and get

\begin{array}{rl}\mathcal{L}(\pi,\alpha,\beta) = & \sum_{i,j}\Big[c_{i,j}\pi_{i,j} + \gamma\pi_{i,j}(\log(\pi_{i,j}) -1)\Big]\\ & \quad+ \alpha^T(\pi^T{\mathbf 1}-p) + \beta^T(\pi{\mathbf 1}-q).\end{array}

The cool thing happens with the optimality condition when differentiating {\mathcal{L}} with respect to {\pi_{i,j}}:

\displaystyle \partial_{\pi_{i,j}}\mathcal{L} = c_{i,j} + \gamma\log(\pi_{i,j}) + \alpha_j + \beta_i \stackrel{!}{=} 0

We can solve for {\pi_{i,j}} and get

\displaystyle \pi_{i,j} = \exp(-\tfrac{c_{i,j}}{\gamma})\exp(-\tfrac{\alpha_j}\gamma)\exp(-\tfrac{\beta_i}\gamma).

What does that say? It says that the optimal {\pi} is obtained from the matrix

\displaystyle M_{i,j} = \exp(-\tfrac{c_{i,j}}{\gamma})

with columns and rows rescaled by the vectors {u_j = \exp(-\tfrac{\alpha_j}\gamma)} and {v_i = \exp(-\tfrac{\beta_i}\gamma)}, respectively, i.e.

\displaystyle \pi = \mathbf{diag}(v)M\mathbf{diag}(u).

This reduces the memory requirement from {N^2} to {2N}! The price for doing so is the regularization by the entropy.

Actually, the existence of the vectors {u} and {v} also follows from Sinkhorn’s theorem, which states that every matrix {A} with positive entries can be written as {A = D_1MD_2} with diagonal matrices {D_1}, {D_2} and a doubly stochastic matrix {M} (i.e. one with non-negative entries and unit row and column sums). The entropic regularization ensures that the transport plan indeed has positive (in particular, non-zero) entries.

But there is even more to the story: to calculate the vectors {u} and {v} you simply have to do the following iteration:

\displaystyle \begin{array}{rcl} u^{n+1} &=& \frac{p}{Mv^n}\\ v^{n+1} &=& \frac{q}{M^Tu^{n+1}} \end{array}

where the fraction means element-wise division. Pretty simple.

What the iteration does in the first step is simply to take the current {v} and calculate a column scaling {u} such that the column sums match {p}. In the second step it calculates the row scaling {v} such that the row sums match {q}. This iteration is known as the Sinkhorn-Knopp algorithm.

This is pretty simple to code in MATLAB. Here is a simple script that does the above iteration (using c_{i,j} = |i-j|^2):

gamma = 10;    % regularization parameter for the entropy
maxiter = 100; % number of Sinkhorn iterations
colormap(gray);

N = 100;              % size
x = linspace(0,1,N)'; % spatial coordinate

% marginals
p = exp(-(x-0.2).^2*10^2) + exp(-abs(x-0.4)*20); p = p./sum(p); % for column sums
q = exp(-(x-0.8).^2*10^2); q = q./sum(q);                       % for row sums

[i,j] = meshgrid(1:N);
M = exp(-(i-j).^2/gamma); % M = exp(-cost/gamma)

% initialize u and v
u = ones(N,1); v = ones(N,1);

% Sinkhorn-Knopp: iteratively scale rows and columns
for k = 1:maxiter
    % update u and v
    u = p./(M*v);
    v = q./(M'*u);
    % assemble pi (only for illustration purposes)
    pi = diag(v)*M*diag(u);
    % display pi (with the marginals on top and to the right)
    imagesc([p'/max(p) 0; pi/max(pi(:)) q/max(q)])
    drawnow
end

Here are some results:


(From the left to the right: \gamma=40,20,10,7. The first row of pixels is {p}, the last column is {q} and in between there is {\pi}, all things normalized such that black is the maximal value in {p}, {q} and {\pi}, respectively.)

You see that for large {\gamma}, the plan is much smoother and not as well localized as it should be for an optimal plan.

Oh, and here is an animation of 100 iterations of Sinkhorn-Knopp showing the result after both u and v have been updated:


There is a catch with this regularization: for small {\gamma} (in this example about {\gamma\leq 6}) the method runs into problems with under- and overflow: the entries in {Mv^n} and {M^Tu^n} become very small. One can fight this effect a bit, but I don’t have a convincing numerical method to deal with this yet. It seems that the entries of the optimal {u} and {v} really have to be incredibly small and large, and I mean really small and large (on the order of {10^{300}} and {10^{-300}} in both {u} and {v}).
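One direction one could try (a sketch of the idea, not the method used above) is to run the iteration in the log domain: store {\log u} and {\log v} and replace the matrix-vector products by log-sum-exp operations, so that the astronomically large and small values of {u} and {v} never appear explicitly. A pure-Python sketch on a toy problem:

```python
import math

def logsumexp(vals):
    """Numerically stable log(sum(exp(vals)))."""
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))

def sinkhorn_log(C, p, q, gamma, iters=1000):
    """Sinkhorn-Knopp in the log domain: iterate on log(u) and log(v) so
    that the huge/tiny entries of u and v never appear explicitly."""
    n = len(p)
    log_u = [0.0] * n
    log_v = [0.0] * n
    for _ in range(iters):
        # column sums of pi must match p:  u_j = p_j / sum_i M[i][j] v_i
        log_u = [math.log(p[j])
                 - logsumexp([-C[i][j] / gamma + log_v[i] for i in range(n)])
                 for j in range(n)]
        # row sums of pi must match q:     v_i = q_i / sum_j M[i][j] u_j
        log_v = [math.log(q[i])
                 - logsumexp([-C[i][j] / gamma + log_u[j] for j in range(n)])
                 for i in range(n)]
    # pi_{i,j} = exp(-c_{i,j}/gamma) u_j v_i, assembled only to inspect it
    return [[math.exp(-C[i][j] / gamma + log_u[j] + log_v[i]) for j in range(n)]
            for i in range(n)]

n = 5
C = [[abs(i - j) for j in range(n)] for i in range(n)]
p = [1.0 / n] * n
q = [0.4, 0.3, 0.2, 0.05, 0.05]
pi = sinkhorn_log(C, p, q, gamma=0.5)
print(round(sum(pi[0]), 6))  # -> 0.4, the first row sum matches q
```

The price is that each iteration now costs a full pass over the {N^2} cost entries instead of two matrix-vector products, so it trades speed for numerical range.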

While the Sinkhorn-Knopp algorithm dates back to the 1960s, its application to optimal transport seems fairly new – I learned about it in talks by Gabriel Peyré and Bernhard Schmitzer, and the reference is Sinkhorn Distances: Lightspeed Computation of Optimal Transport (presented at NIPS 2013) by Marco Cuturi.

Consider the simple linear transport equation

\displaystyle  \partial_t f + v\partial_x f = 0,\quad f(x,0) = \phi(x)

with a velocity {v}. Of course the solution is

\displaystyle  f(x,t) = \phi(x-tv),

i.e. the initial datum is just transported in direction of {v}, as the name of the equation suggests. We may also view the solution {f} not only as depending on space {x} and time {t} but also on the velocity {v}, i.e. we write {f(x,t,v) =\phi(x-tv)}.

Now consider that the velocity is not really known but somehow uncertain (while the initial datum {\phi} is still known exactly). Hence, it does not make too much sense to look at the exact solution {f}, because the effect of a wrong velocity gets linearly amplified in time. It seems more sensible to assume a distribution {\rho} of velocities and to look at the average of the solutions corresponding to the different velocities {v}. Hence, the quantity to look at would be

\displaystyle  g(x,t) = \int_{-\infty}^\infty f(x,t,v)\rho(v) dv.

Let’s have a closer look at the averaged solution {g}. We write out {f}, perform a change of variables and end up with

\displaystyle  \begin{array}{rcl}  g(x,t) & = &\int_{-\infty}^\infty f(x,t,v)\rho(v)dv\\ & =& \int_{-\infty}^\infty \phi(x-tv)\rho(v)dv\\ & =& \int_{-\infty}^\infty \phi(x-w)\tfrac1t\rho(w/t)dw. \end{array}

In the case of a Gaussian distribution {\rho}, i.e.

\displaystyle  \rho(v) = \frac{1}{\sqrt{4\pi}}\exp\Big(-\frac{v^2}{4}\Big)

we get

\displaystyle  g(x,t) =\int_{-\infty}^\infty \phi(x-w)G(t,w)dw

with

\displaystyle  G(t,w) = \frac{1}{\sqrt{4\pi}\,t}\exp\Big(-\frac{w^2}{4t^2}\Big).

Now we make a time rescaling {\tau = t^2}, denote {h(x,t^2) = g(x,t)} and see that

\displaystyle  h(x,\tau) = \int_{-\infty}^\infty \phi(x-w)\frac{1}{\sqrt{4\pi \tau}}\exp\Big(-\frac{w^2}{4\tau}\Big)dw.

So what’s the point of all this? It turns out that the averaged and time rescaled solution {h} of the transport equation indeed solves the heat equation

\displaystyle  \partial_\tau h - \partial_{xx} h = 0,\quad h(x,0) = \phi(x).

In other words, velocity averaging and time rescaling turn a transport equation (a hyperbolic PDE) into a diffusion equation (a parabolic PDE).
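This can also be checked numerically: for the Gaussian initial datum {\phi(x) = \exp(-x^2/4)}, the heat equation has the closed-form solution {h(x,\tau) = (1+\tau)^{-1/2}\exp(-x^2/(4(1+\tau)))}, and simple quadrature of the velocity average reproduces it at {\tau = t^2}. A quick sketch (Python, midpoint-rule quadrature):

```python
import math

def phi(x):  # Gaussian initial datum
    return math.exp(-x * x / 4)

def rho(v):  # the Gaussian velocity distribution from above
    return math.exp(-v * v / 4) / math.sqrt(4 * math.pi)

def averaged(x, t, dv=0.01, V=15.0):
    """g(x,t) = int phi(x - t v) rho(v) dv by midpoint-rule quadrature."""
    n = int(2 * V / dv)
    return dv * sum(phi(x - t * v) * rho(v)
                    for v in (-V + (k + 0.5) * dv for k in range(n)))

def heat(x, tau):
    """Closed-form heat-equation solution for this initial datum."""
    return math.exp(-x * x / (4 * (1 + tau))) / math.sqrt(1 + tau)

t = 1.5
for x in (0.0, 0.7, 2.0):
    assert abs(averaged(x, t) - heat(x, t * t)) < 1e-8
print("velocity average matches the heat solution at tau = t^2")
```

Of course this only confirms the Gaussian calculation above; the derivation itself holds for any integrable initial datum.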

I’ve seen this derivation in a talk by Enrique Zuazua at SciCADE 2015.

To end this blog post, consider the slight generalization of the transport equation

\displaystyle  \partial_t f + \psi(x,v)\partial_x f = 0

where the velocity depends on {x} and {v}. According to Enrique Zuazua it’s open what happens here when you average over velocities…

In the summer semester 2015 I taught the course “Analysis 2” and wrote these lecture notes:

This page serves to collect the errors found in the comments and to document them here.

Errata for the printed version:

  • p. 54, proof of Theorem 13.32: the last line should read D_i f(x) instead of D_if(0).
  • p. 61, proof of Theorem 14.8: after the first paragraph it should read v-y=A(g(v))(g(v)-g(y)).
  • p. 63, proof of Theorem 14.11: in the line below (*) there is one “gilt” too many.
  • p. 66, proof of Theorem 14.13: in the penultimate line the “d” of “passend” is missing.
  • p. 82, Lemma 15.16: in the first line it must read j\neq k.
  • p. 93, line above Example 16.1: it should read “..allerdings keine brauchbaren Sätze”.
  • p. 106, proof of Theorem 16.24: first paragraph, third line: “Somit ist x\mapsto D_t f(x,t)…”
  • p. 108, Example 16.26: in the third equation, the second integral should read
    \int_0^\infty x^3\exp(-tx)\mathrm{d}x = \tfrac{6}{t^4}.
  • p. 108, Example 16.26: the penultimate integral should read \int_0^\infty x^n\exp(-x)\mathrm{d}x = n!.
  • p. 109, second paragraph should read “… bei jedem Integral um einen Grenzwert…”.
  • p. 109, last sentence should read “…und den Größen…”.
  • p. 110, Theorem 16.27: the last sentence should read “Ist \lambda(M)<\infty, so ist f\in L^1(\mathbb{R}).”

For the supplementary Analysis 2 course for physicists, with its brief “hands-on” introduction to vector analysis, the lecture notes are likewise available here:



I am at IFIP TC7, and today I talked about the inertial primal-dual forward-backward method that Tom Pock and I developed in this paper (find my slides here). I got a few interesting questions, and one was about the heavy-ball method.

I used the heavy-ball method by Polyak as a motivation for the inertial primal-dual forward-backward method: To minimize a convex function {F}, Polyak proposed the heavy-ball method

\displaystyle y_k = x_k + \alpha_k(x_k-x_{k-1}),\qquad x_{k+1} = y_k - \lambda_k \nabla F(x_k) \ \ \ \ \ (1)

with appropriate step sizes {\lambda_k} and extrapolation factors {\alpha_k}. Polyak’s motivation was as follows: the usual gradient descent {x_{k+1} = x_k - \lambda_k \nabla F(x_k)} can be seen as a discretization of the ODE {\dot x = -\nabla F(x)}, and its comparably slow convergence comes from the fact that after discretization, the iterates start to “zig-zag” in directions that do not point straight towards the minimizer. Adding “inertia” to the iteration should help to keep the method on track towards the solution. So he proposed to discretize the ODE {\gamma\ddot x + \dot x = -\nabla F(x)}, leading to his heavy-ball method. After the talk, Peter Maaß asked me if the heavy-ball method has an interpretation in which you do usual gradient descent but change the function in each iteration (somewhat in the spirit of the conjugate gradient method). Indeed, one can do the following: write the iteration as

\displaystyle  x_{k+1} = x_k - \lambda_k\Big[\tfrac{\alpha_k}{\lambda_k}(x_{k-1}-x_k) + \nabla F(x_k)\Big]

and then observe that this is

\displaystyle  x_{k+1} = x_k - \lambda_k \nabla G_k(x_k)

with

\displaystyle  G_k(x) = - \tfrac{\alpha_k}{2\lambda_k}\|x-x_{k-1}\|^2 + F(x).

Hence, you indeed have a perturbed gradient descent, and the perturbation acts in such a way that it moves the minimizer of the objective a bit, so that it lies more in the direction towards which you were heading anyway, and, moreover, pushes you away from the previous iterate {x_{k-1}}. This nicely contrasts with the original interpretation (1), in which one takes the direction of the current gradient, but before going in this direction moves a bit further in the direction in which one was already moving.
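The equivalence of the two readings is easy to check numerically; here is a small sketch on the one-dimensional quadratic {F(x) = x^2} (the constant step size and extrapolation factor are arbitrary picks for illustration):

```python
def grad_F(x):  # F(x) = x^2, so grad F(x) = 2x
    return 2.0 * x

lam, alpha = 0.1, 0.5  # illustrative step size and extrapolation factor

def heavy_ball_step(x_prev, x_curr):
    """Form (1): extrapolate first, then take a gradient step at x_curr."""
    y = x_curr + alpha * (x_curr - x_prev)
    return y - lam * grad_F(x_curr)

def perturbed_gd_step(x_prev, x_curr):
    """Gradient step on G_k(x) = -alpha/(2 lam) * (x - x_prev)^2 + F(x)."""
    grad_G = -alpha / lam * (x_curr - x_prev) + grad_F(x_curr)
    return x_curr - lam * grad_G

x_prev = x_curr = 1.0
for _ in range(60):
    a = heavy_ball_step(x_prev, x_curr)
    b = perturbed_gd_step(x_prev, x_curr)
    assert abs(a - b) < 1e-12  # the two forms produce identical iterates
    x_prev, x_curr = x_curr, a
print(abs(x_curr) < 1e-6)  # -> True: converged towards the minimizer 0
```

The assertion inside the loop is just the algebra above: expanding {-\lambda_k \nabla G_k(x_k)} gives back the extrapolation term plus the gradient step.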

Today I learned that Ernie Esser passed away on March 8th, 2015, at age 34. Ernie was a postdoc at UBC, Vancouver (PhD from UCLA) and worked in applied mathematics, especially imaging, optimization and inverse problems. I can’t say that I knew him very well – but we were in similar research communities, and I remember sitting with him in restaurants and chatting on several occasions (e.g. in Vancouver and Hong Kong…), and I remember him as a particularly nice guy. I also followed his work, and it frequently happened that I found papers by him, thought “ok, that’s solved”, and crossed a problem off the list of things I would have liked to work on.

I am seriously shocked to learn that he died from pneumonia, especially at his young age. There is an obituary weblog here, and I am glad that his workgroup not only set up this page but also plans a scientific event to recognize his work.