### Optimization

Do you remember free time as a kid that you wasted with connecting dots? If you actually liked it, here’s some good news: There are dot-to-dot books for grown-ups! Most notable, there are the books by Thomas Pavitte with great pictures with 1000 dots.

So, these are some of the books

and here is some video:

Actually, it takes some time to connect 1000 dots; I need ten minutes or so, depending a bit an the picture.

For the fun of it, I coded some lines in MATLAB to connect the dots automatically. And since I am a lazy programmer, I did not bother to connect the dots in the manner that was prescribed by the artist but more efficiently:

1. Greedy paths

For the greedy path, we start at some randomly chosen dot and connect the dot where we are with the closest possible dot where we haven’t been already.

Here’s how this looks for one of Thomas’ pictures:

2. Shortest paths

The greedy path sometimes makes very large jumps (when it runs into some corner, using all the dots in the vicinity). This leads to some spurious large jumps in the picture. In used some simple heuristics to find some “locally shortest paths” through the thousand dots. (And “locally shortest” means that there are no two edges for which swapping improves the total lengths of the paths.) Actually, I started out with the goal to solve the travelling salesman problem over the thousand dots, i.e., to find the shortest path of all. Then it turned out that

1. Solving the travelling salesman problem is not that simple to solve – well it’s one of the best investigated NP-hard optimization problems and there is no doubt that it would take my laptop only little time to solve it with 1000 dots if fed with the right code.
2. The locally shortest path already looks quite appealing and I am not sure how the shortest path would look any different.

Here is the locally shortest path:

Oh, by the way: the image is a pug dog, the one that you can party see on this cover

Here are some more pictures (not by Thomas Pavitte). Middle is the greedy, right the locally shortest path:

This slideshow requires JavaScript.

Since the Wikipedia page of the travelling salesman problem contains a formulation as an integer linear program to solve it, I may give it a try in the future…

Optimal transport in a discrete setting goes as follows: Consider two vectors ${p,q\in{\mathbb R}^N}$ with non-negative entries and ${\|p\|_1 = \|q\|_1}$. You may also say that ${p}$ and ${q}$ are two probability distributions or discrete measures with same total mass. A transport plan for ${p}$ and ${q}$ is a matrix ${\pi\in{\mathbb R}^{N\times N}}$ with non-negative entries such that

$\displaystyle \sum_i \pi_{i,j} = p_j,\quad\sum_j\pi_{i,j} = q_i.$

The interpretation of ${\pi}$ is, that ${\pi_{i,j}}$ indicates how much of the mass ${p_j}$ sitting in the ${j}$-th entry of ${p}$ is transported to the ${i}$-th entry of ${q}$. To speak about optimal transport we add an objective function to the constraints, namely, the cost ${c_{i,j}}$ that says how much it costs to transport one unit from the ${j}$-th entry in ${p}$ to the ${i}$-th entry in ${q}$. Then the optimal transport problem is

$\displaystyle \min_{\pi\in{\mathbb R}^{N\times N}} \sum_{i,j}c_{i,j}\pi_{i,j}\quad\text{s.t.}\quad\sum_i \pi_{i,j} = p_j,\quad\sum_j\pi_{i,j} = q_i.$

The resulting ${\pi}$ is an optimal transport plan and the resulting objective value is the minimal cost at which ${p}$ can be transported to ${q}$. In fact the minimization problem is a linear program and not only that, it’s one of the best studied linear programs and I am sure there is a lot that I don’t know about the structure of this linear program (you may have a look at these slides by Jesus De Loera to get an impression what is known about the structure of the linear program)

So it looks like discrete optimal transport is a fairly standard problem with standard solvers available. But all solvers have one severe drawback when it comes to large ${N}$: The optimization variable has ${N^2}$ entries. If ${N^2}$ is too large to store ${\pi}$ or keep ${\pi}$ in memory, it seems that there is not much one can do. This is the memory bottleneck for optimal transport problems.

1. Kantorovich-Rubinstein duality

In the case when the cost has the special form ${c_{i,j} = |i-j|}$ on can reduce to memory-burden. This special cost makes sense if the indices ${i}$ and ${j}$ correspond to spatial locations, since then the cost ${c_{i,j} = |i-j|}$ is just the distance from location ${i}$ to ${j}$. It turns out that in this case there is a simple dual optimal transport problem, namely

$\displaystyle \max_{f\in{\mathbb R}^N} f^T(p-q)\quad\text{s.t.}\quad |f_i-f_{i-1}|\leq 1,\ 2\leq i\leq N.$

(This is a simple form of the Kantorovich-Rubinstein duality and works similarly if the cost ${c}$ is any other metric on the set of indices.) The new optimization problem is still linear but the memory requirements is only ${N}$ and not ${N^2}$ anymore and moreover there are only ${O(N)}$ constraints for ${f}$. This idea is behind the method from the paper Imaging with Kantorovich-Rubinstein discrepancy by Jan Lellmann, myself, Carola Schönlieb and Tuomo Valkonen.

2. Entropic regularization

In this post I’d like to describe another method to break through the memory bottleneck of optimal transport. This method works for basically any cost ${c}$ but involves a little bit of smoothing/regularization.

We go from the linear program to a non-linear but still convex one by adding the negative entropy of the transport plan to the objective, i.e. we consider the objective

$\displaystyle \sum_{i,j}\Big[c_{i,j}\pi_{i,j} + \gamma \pi_{i,j}(\log(\pi_{i,j}) -1)\Big]$

for some ${\gamma>0}$.

What’s the point of doing so? Let’s look at the Lagrangian: For the constraints ${\sum_i \pi_{i,j} = p_j}$ and ${\sum_j\pi_{i,j} = q_i}$ we introduce the ones vector ${{\mathbf 1}\in{\mathbb R}^n}$, write them as ${\pi^T{\mathbf 1} = p}$ and ${\pi{\mathbf 1}=q}$, add Lagrange multipliers ${\alpha}$ and ${\beta}$ and get

$\begin{array}{rl}\mathcal{L}(\pi,\alpha,\beta) = & \sum_{i,j}\Big[c_{i,j}\pi_{i,j} + \pi_{i,j}(\log(\pi_{i,j}) -1)\Big]\\ & \quad+ \alpha^T(\pi^T{\mathbf 1}-p) + \beta^T(\pi{\mathbf 1}-q)\end{array}$

The cool thing happens with the optimality condition when deriving ${\mathcal{L}}$ with respect to ${\pi_{i,j}}$:

$\displaystyle \partial_{\pi_{i,j}}\mathcal{L} = c_{i,j} + \gamma\log(\pi_{i,j}) + \alpha_j + \beta_i \stackrel{!}{=} 0$

We can solve for ${\pi_{i,j}}$ and get

$\displaystyle \pi_{i,j} = \exp(-\tfrac{c_{i,j}}{\gamma})\exp(-\tfrac{\alpha_j}\gamma)\exp(-\tfrac{\beta_i}\gamma).$

What does that say? It says that the optimal ${\pi}$ is obtained from the matrix

$\displaystyle M_{i,j} = \exp(-\tfrac{c_{i,j}}{\gamma})$

with rows and columns rescaled by vectors ${u_j = \exp(-\tfrac{\alpha_j}\gamma)}$ and ${v_i = \exp(-\tfrac{\beta_i}\gamma)}$, respectively, i.e.

$\displaystyle \pi = \mathbf{diag}(v)M\mathbf{diag}(u).$

The reduces the memory requirement from ${N^2}$ to ${2N}$! The cost for doing so is the regularization by the entropy.

Actually, the existence of the vectors ${u}$ and ${v}$ also follows from Sinkhorn’s theorem which states that every matrix ${A}$ with positive entries can be written as ${A = D_1MD_2}$ with diagonal matrices and a doubly stochastic matrix ${M}$ (i.e. one with non-negative entries and unit row and column sums). The entropic regularization for the transport plan ensures that the entries of the transport plan has indeed positive (especially non-zero) entries.

But there is even more to the story:. To calculate the vectors ${u}$ and ${v}$ you simply have to do the following iteration:

$\displaystyle \begin{array}{rcl} u^{n+1} &=& \frac{p}{Mv^n}\\ v^{n+1} &=& \frac{q}{M^Tu^n} \end{array}$

where the fraction means element-wise division. Pretty simple.

What the iteration does in the first step is simply to take the actual ${v}$ and calculates a column scaling ${u}$ such that the column sums match ${p}$. In the second step it calculates the row scaling ${v}$ such that the row sums match ${q}$. This iteration is also known as Sinkhorn-Knopp algorithm.

This is pretty simple to code in MATLAB. Here is a simple code that does the above iteration (using $c_{i,j} = |i-j|^2$):

%Parameters
gamma = 10; %reg for entropy
maxiter = 100; % maxiter
map = colormap(gray);

N = 100; % size
x = linspace(0,1,N)';%spatial coordinate

% marginals
p = exp(-(x-0.2).^2*10^2) + exp(-abs(x-0.4)*20);p=p./sum(p); %for colums sum
q = exp(-(x-0.8).^2*10^2);q = q./sum(q); % for row sum

[i,j] = meshgrid(1:N);
M = exp(-(i-j).^2/gamma); % exp(-cost/gamma)

% intialize u and v
u = ones(N,1);v = ones(N,1);

% Sinkhorn-Knopp
% iteratively scale rows and columns
for k = 1:maxiter
% update u and v
u = p./(M*v);
v = q./(M'*u);
% assemble pi (only for illustration purposes)
pi = diag(v)*M*diag(u);
% display pi (with marginals on top and to the left)
imagesc([p'/max(p) 0;pi/max(pi(:)) q/max(q)])
colormap(1-map)
drawnow
end


Here are some results:

(From the left to the right: $\gamma=40,20,10,7$. The first row of pixels is ${p}$, the last column is ${q}$ and in between there is ${\pi}$, all things normalized such that black is the maximal value in ${p}$, ${q}$ and ${\pi}$, respectively.)

You see that for large ${\gamma}$, the plan is much more smooth and not so well localized as it should be for an optimal plan.

Oh, and here is an animation of 100 iterations of Sinkhorn-Knopp showing the result after both $u$ and $v$ have been updated:

There is a catch with this regularization: For small ${\gamma}$ (in this example about ${\gamma\leq 6}$) the method runs into problem with under- and overflow: the entries in ${Mv^n}$ and ${M^Tu^n}$ become very small. One can fight this effect a bit but I don’t have a convincing numerical method to deal with this, yet.It seems that the entries of the optimal $u$ and $v$ really have to be incredibly small and large, and I mean really small and large (in the order of $10^{300}$ and $10^{-300}$ in both $u$ and $v$).

While the Sinkhorn-Knopp algorithm is already from the 1960s, its application to optimal transport seems fairly new – I learned about from in talks by Gabriel Peyré and Bernhard Schmitzer and the reference is Sinkhorn Distances: Lightspeed Computation of Optimal Transport (presented at NIPS 2013) by Marco Cuturi.

I am at at IFIP TC7 and today I talked about the inertial primal-dual forward-backward method Tom Pock and I developed in this paper (find my slides here). I got a few interesting questions and one was about the heavy-ball method.

I used the heavy-ball method by Polyak as a motivation for the inertial primal-dual forward-backward method: To minimize a convex function ${F}$, Polyak proposed the heavy-ball method

$\displaystyle y_k = x_k + \alpha_k(x_k-x_{k-1}),\qquad x_{k+1} = y_k - \lambda_k \nabla F(x_k) \ \ \ \ \ (1)$

with appropriate step sizes ${\lambda_k}$ and extrapolation factors ${\alpha_k}$. Polyaks motivation was as follows: The usual gradient descent ${x_{k+1} = x_k - \lambda_k \nabla F(x_k)}$ can be seen as a discretization of the ODE ${\dot x = -\nabla F(x)}$ and its comparably slow convergence comes from the fact that after discretization, the iterates starts to “zig-zag” in directions that do not point straight towards the minimizer. Adding “inertia” to the iteration should help to keep the method on track towards the solution. So he proposed to take the ODE ${\gamma\ddot x + \dot x = -\nabla F(x)}$, leading to his heavy ball method. After the talk, Peter Maaß asked me, if the heavy-ball method has an interpretation in a way that you do usual gradient descent but change to function in each iteration (somehow in the spirit of the conjugate gradient method). Indeed, one can do the following: Write the iteration as

$\displaystyle x_{k+1} = x_k - \lambda_k\Big[\tfrac{\alpha_k}{\lambda_k}(x_{k-1}-x_k) + \nabla F(x_k)\Big]$

and then observe that this is

$\displaystyle x_{k+1} = x_k - \lambda_k \nabla G_k(x_k)$

with

$\displaystyle G_k(x) = - \tfrac{\alpha_k}{2\lambda_k}\|x-x_{k-1}\|^2 + F(x).$

Hence, you have indeed a perturbed gradient descent and the perturbation acts in a way, that it moves the minimizer of the objective a bit such that it lies more in the direction towards which you where heading anyway and, moreover, pushes you away from the previous iterate ${x_{k-1}}$. This nicely contrasts the original interpretation from~(1) in which one says that one takes the direction coming from the current gradient, but before going into this direction move a bit more in the direction where you were moving.

I am not an optimizer by training. My road to optimization went through convex analysis. I started with variational methods for inverse problems and mathematical imaging with the goal to derive properties of minimizers of convex functions. Hence, I studied a lot of convex analysis. Later I got interested in how to actually solve convex optimization problems and started to read books about (convex) optimization. At first I was always distracted by the way optimizers treated constraints. To me, a convex optimization problem always looks like

$\displaystyle \min_x F(x).$

Everything can be packed into the convex objective. If you have a convex objective ${f}$ and a constraint ${c(x) \leq 0}$ with a convex function ${c}$, just take ${F = f + I_{\{c\leq 0\}}}$, i.e., add the indicator function of the constraint to the objective (for some strange reason, Wikipedia has the name and notation for indicator and characteristic function the other way round than I, and many others…). . Similarly for multiple constraints ${c_i(x)\leq 0}$ or linear equality constraints ${Ax=b}$ and such.

In this simple world it is particularly easy to characterize all solutions of convex minimization problems: They are just those ${x}$ for which

$\displaystyle 0\in\partial F(x).$

Simple as that. Only take the subgradient of the objective and that’s it.

When reading the optimization books and seeing how difficult the treatment of constraints is there, I was especially puzzled how complicated optimality conditions such as KKT looked like in contrast to ${0\in\partial F(x)}$ and also and by the notion of constraint qualifications.

These constraint qualifications are additional assumptions that are needed to ensure that a minimizer ${x}$ fulfills the KKT-conditions. For example, if one has constraints ${c_i(x)\leq 0}$ then the linear independence constraint qualification (LICQ) states that all the gradients ${\nabla c_i(x)}$ for constraints that are “active” (i.e. ${c_i(x)=0}$) have to be linearly independent.

It took me while to realize that there is a similar issue in my simple “convex analysis view” on optimization: When passing from the gradient of a function to the subgradient, many things stay as they are. But not everything. One thing that does change is the simple sum-rule. If ${F}$ and ${G}$ are differentiable, then ${\nabla(F+G)(x) = \nabla F(x) + \nabla G(x)}$, always. That’s not true for subgradients! You always have that ${\partial F(x) + \partial G(x) \subset \partial(F+G)(x)}$. The reverse inclusion is not always true but holds, e.g., if there is some point for which ${G}$ is finite and ${F}$ is continuous. At first glance this sounds like a very weak assumption. But in fact, this is precisely in the spirit of constraint qualifications!

Take two constraints ${c_1(x)\leq 0}$ and ${c_2(x)\leq 0}$ with convex and differentiable ${c_{1/2}}$. We can express these by ${x\in K_i = \{x\ :\ c_i(x)\leq 0\}}$ (${i=1,2}$). Then it is equivalent to write

$\displaystyle \min_x f(x)\ \text{s.t.}\ c_i(x)\leq 0$

and

$\displaystyle \min_x (f + I_{K_1} + I_{K_2})(x).$

So characterizing solution to either of these is just saying that ${0 \in\partial (f + I_{K_1} + I_{K_2})(x)}$. Oh, there we are: Are we allowed to pull the subgradient apart? We need to apply the sum rule twice and at some point we need that there is a point at which ${I_{K_1}}$ is finite and the other one ${I_{K_2}}$ is continuous (or vice versa)! But an indicator function is only continuous in the interior of the set where it is finite. So the simplest form of the sum rule only holds in the case where only one of two constraints is active! Actually, the sum rule holds in many more cases but it is not always simple to find out if it really holds for some particular case.

So, constraint qualifications are indeed similar to rules that guarantee that a sum rule for subgradients holds.

Geometrically speaking, both shall guarantee that if one “looks at the constraints individually” one still can see what is going on at points of optimality. It may well be that the sum of individual subgradients is too small to get any points with ${0\in \partial F(x) + \partial I_{K_1}(x) + \partial I_{K_2}(x)}$ but still there are solutions to the optimization problem!

As a very simple illustration take the constraints ${K_1 = \{(x,y)\ :\ y\leq 0\}}$ and ${K_2 = \{(x,y)\ :\ y^2\geq x\}}$ in two dimensions. The first constraint says “be in the lower half-plane” while the second says “be above the parabola ${y^2=x}$”. Now take the point ${(0,0)}$ which is on the boundary for both sets. It’s simple to see (geometrically and algebraically) that ${\partial I_{K_1}(0,0) = \{(0,y)\ :\ y\geq 0\}}$ and ${\partial I_{K_2}(0,0) = \{(0,y)\ :\ y\leq 0\}}$, so treating the constraints individually gives ${\partial I_{K_1}(0,0) + \partial I_{K_2}(0,0) = \{(0,y)\ :\ y\in{\mathbb R}\}}$. But the full story is that ${K_1\cap K_2 = \{(0,0)\}}$, thus ${\partial(I_{K_1} + I_{K_2})(0,0) = \partial I_{K_1\cap K_2}(0,0) = {\mathbb R}^2}$ and consequently, the subgradient is much bigger.

The Douglas-Rachford method is a method to solve a monotone inclusion ${0\in (A+B)x}$ with two maximally monotone operators ${A,B}$ defined on a Hilbert space ${X}$. The method uses the resolvents ${(I + \lambda A)^{-1}}$ and ${(I + \lambda B)^{-1}}$ and produces two sequences of iterates

$\displaystyle \begin{array}{rcl} x^{k+1}& =& (I + \lambda B)^{-1}(v^k)\\ v^{k+1} & = & v^k + (I+\lambda A)^{-1}(2x^{k+1} - v^k) -x^{k+1}. \end{array}$

Looks pretty opaque to me and I did not have some good intuition where this methods comes from and why it should work. Here’s a way I can remember (and which is copied from “Preconditioned Douglas-Rachford Splitting Methods for Convex-Concave Saddle-Point Problems” by Hongpeng Sun and Kristian Bredies):

Substituting ${w = Ax}$ gives the optimality system

$\displaystyle 0 \in w + Bx,\qquad 0 \in -x + A^{-1} w$

or, written differently

$\displaystyle 0 \in \begin{bmatrix} B & I\\ -I & A^{-1} \end{bmatrix} \begin{bmatrix} x\\w \end{bmatrix}.$

This is again a monotone inclusion, but now on ${X\times X}$. We introduce the positive definite operator

$\displaystyle M = \begin{bmatrix} I & -I\\ -I & I \end{bmatrix}$

and perform the iteration

$\displaystyle (M + \begin{bmatrix} B & I\\ -I & A^{-1} \end{bmatrix}) \begin{bmatrix} x^{k+1}\\w^{k+1} \end{bmatrix} \ni M \begin{bmatrix} x^k\\w^k \end{bmatrix}.$

(This is basically the same as applying the proximal point method to the preconditioned inclusion

$\displaystyle 0\in M^{-1} \begin{bmatrix} B & I\\ -I & A^{-1} \end{bmatrix} \begin{bmatrix} x\\w \end{bmatrix}.)$

Writing out the iteration gives

$\displaystyle \begin{array}{rcl} x^{k+1} & = &(I + B)^{-1}(x^k - w^k)\\ w^{k+1} & = &(I + A^{-1})^{-1}(w^k + 2x^{k+1} - x^k). \end{array}$

Now, applying the Moreau identity for monotone operators (${(I + A)^{-1} + (I+A^{-1})^{-1} = I}$), gives

$\displaystyle \begin{array}{rcl} x^{k+1} & = &(I + B)^{-1}(x^k - w^k)\\ w^{k+1} & = &w^k + 2x^{k+1} - x^k - (I + A)^{-1}(w^k + 2x^{k+1} - x^k) \end{array}$

substituting ${v^k = x^k - w^k}$ finally gives Douglas-Rachford:

$\displaystyle \begin{array}{rcl} x^{k+1} & = &(I + B)^{-1}(v^k)\\ v^{k+1} & = & -x^{k+1} + v^k + (I + A)^{-1}(2x^{k+1} - v^k) \end{array}$

(besides the stepsize ${\lambda}$ which we would get by starting with the equivalent inclusion ${0 \in \lambda(A+B)x}$ in the first place).

Probably the shortest derivation of Douglas-Rachford I have seen. Oh, and also the (weak) convergence proof comes for free: It’s a proximal point iteration and you just use the result by Rockafellar from “Monotone operators and the proximal point algorithm”, SIAM J. Control and Optimization 14(5), 1976.

Currently I am at the SIAM Imaging conference in Hong Kong. It’s a great conference with great people at a great place. I am pretty sure that this will be the only post from here, since the conference is quite intense. I just wanted to report on two ideas that have become clear here, although, they are both pretty easy and probably already widely known, but anyway:

1. Non-convex + convex objective

There are a lot of talks that deal with optimization problems of the form

$\displaystyle \min_u F(u) + G(u).$

Especially, people try to leverage as much structure of the functionals ${F}$ and ${G}$ as possible. Frequently, there arises a need to deal with non-convex parts of the objective, and indeed, there are several approaches around that deal in one way or another with non-convexity of ${F}$ or even ${F+G}$. Usually, in the presence of an ${F}$ that is not convex, it is helpful if ${G}$ has favorable properties, e.g. that still ${F+G}$ is bounded from below, coercive or even convex again. A particularly helpful property is strong convexity of ${G}$ (i.e. ${G}$ stays convex even if you subtract ${\epsilon/2\|\cdot\|^2}$ from it). Here comes the simple idea: If you already allow ${F}$ to be non-convex, but only have a ${G}$ that is merely convex, but not strongly so, you can modify your objective to

$\displaystyle \underbrace{F(u) - \tfrac\epsilon2\|u\|^2}_{\leftarrow F(u)} + \underbrace{G(u) + \tfrac\epsilon2\|u\|^2}_{\leftarrow G(u)}$

for some ${\epsilon>0}$. This will give you strong convexity of ${G}$ and an ${F}$ that is (often) theoretically no worse than it used to be. It appeared to me that this is an idea that Kristian Bredies told me already almost ten years ago and which me made into a paper (together with Peter Maaß) in 2005 which got somehow delayed and published no earlier than 2009.

If your problem has the form

$\displaystyle \min_u F(u) + G(Ku)$

with some linear operator ${K}$ and both ${F}$ and ${G}$ are convex, it has turned out, that it is tremendously helpful for the solution to consider the corresponding saddle point formulation: I.e. using the convex conjugate ${G^*}$ of ${G}$, you write

$\displaystyle \min_u \max_v F(u) + \langle Ku, v\rangle -G^*(v).$

A class of algorithms, that looks like to Arrow-Hurwicz-method at first glance, has been sparked be the method proposed by Chambolle and Pock. This method allows ${F}$ and ${G}$ to be merely convex (no smoothness or strong convexity needed) and only needs the proximal operators for both ${F}$ and ${G^*}$. I also worked on algorithms for slightly more general problems, involving a reformulation of the saddle point problem as a monotone inclusion, with Tom Pock in the paper An accelerated forward-backward algorithm for monotone inclusions and I also should mention this nice approach by Bredies and Sun who consider another reformulation of the monotone inclusion. However, in the spirit of the first point, one should take advantage of all the available structure in the problem, e.g. smoothness of one of the terms. Some algorithm can exploit smoothness of either ${F}$ or ${G^*}$ and only need convexity of the other term. An idea, that has been used for some time already, to tackle the case if ${F}$, say, is a sum of a smooth part and a non-smooth part (and ${G^*}$ is not smooth), is, to dualize the non-smooth part of ${F}$: Say we have ${F = F_1 + F_2}$ with smooth ${F_1}$, then you could write

$\displaystyle \begin{array}{rcl} &\min_u\max_v F_1(u) + F_2(u) + \langle Ku, v\rangle -G^*(v)\\ & \qquad= \max_u \min_{v,w} F_1(u) + \langle u,w\rangle + \langle Ku, v\rangle -G^*(v) - F_2^*(w) \end{array}$

and you are back in business, if your method allows for sums of convex functions in the dual. The trick got the sticky name “dual transportation trick” in a talk by Marc Teboulle here and probably that will help, that I will not forget it from now on…

I fell a little bit behind on reporting on my new preprints. In this posts I’ll blog on two closely related ones; one of them already a bit old, the other one quite recent:

The papers are

As clear from the titles, both papers treat a similar method. The first paper contains all the theory and the second one has few particularly interesting applications.

In the first paper we propose to view several known algorithms such as the linearized Bregman method, the Kaczmarz method or the Landweber method from a different angle from which they all are special cases of another algorithm. To start with, consider a linear system

$\displaystyle Ax=b$

with ${A\in{\mathbb R}^{m\times n}}$. A fairly simple and old method to solve this, is the Landweber iteration which is

$\displaystyle x^{k+1} = x^k - t_k A^T(Ax^k-b).$

Obviously, this is nothing else than a gradient descent for the functional ${\|Ax-b\|_2^2}$ and indeed converges to a minimizer of this functional (i.e. a least squares solution) if the stepsizes ${t_k}$ fulfill ${\epsilon\leq t_k\leq 2\|A\|^{-2} - \epsilon}$ for some ${\epsilon>0}$. If one initializes the method with ${x^0=0}$ it converges to the least squares solution with minimal norm, i.e. to ${A^\dag b}$ (with the pseudo-inverse ${A^\dag}$).

A totally different method is even older: The Kaczmarz method. Denoting by ${a_k}$ the ${k}$-th row of ${A}$ and ${b_k}$ the ${k}$-th entry of ${b}$ the method reads as

$\displaystyle x^{k+1} = x^k - a_{r(k)}^T\frac{a_{r(k)}\cdot x^k - b_k}{\|a_{r(k)}\|_2^2}$

where ${r(k) = (k\mod m) +1}$ or any other “control sequence” that picks up every index infinitely often. This method also has a simple interpretation: Each equation ${a_k\cdot x = b_k}$ describes a hyperplane in ${{\mathbb R}^n}$. The method does nothing else than projecting the iterates orthogonally onto the hyperplanes in an iterative manner. In the case that the system has a solution, the method converges to one, and if it is initialized with ${x^0=0}$ we have again convergence to the minimum norm solution ${A^\dag b}$.

There is yet another method that solves ${Ax=b}$ (but now it’s a bit more recent): The iteration produces two sequences of iterates

$\displaystyle \begin{array}{rcl} z^{k+1} & = &z^k - t_k A^T(Ax^k - b)\\ x^{k+1} & = &S_\lambda(z^{k+1}) \end{array}$

for some ${\lambda>0}$, the soft-thresholding function ${S_\lambda(x) = \max(|x|-\lambda,0)\mathrm{sgn}(x)}$ and some stepsize ${t_k}$. For reasons I will not detail here, this is called the linearized Bregman method. It also converges to a solution of the system. The method is remarkably similar, but different from, the Landweber iteration (if the soft-thresholding function wouldn’t be there, both would be the same). It converges to the solution of ${Ax=b}$ that has the minimum value for the functional ${J(x) = \lambda\|x\|_1 + \tfrac12\|x\|_2^2}$. Since this solution of close, and for ${\lambda}$ large enough identical, to the minimum ${\|\cdot\|_1}$ solution, the linearized Bregman method is a method for sparse reconstruction and applied in compressed sensing.

Now we put all three methods in a joint framework, and this is the framework of split feasibility problems (SFP). An SFP is a special case of a convex feasibility problems where one wants to find a point ${x}$ in the intersection of multiple simple convex sets. In an SFP one has two different kinds of convex constraints (which I will call “simple” and “difficult” in the following):

1. Constraints that just demand that ${x\in C_i}$ for some convex sets ${C_i}$. I call these constraints “simple” because we assume that the projection onto each ${C_i}$ is simple to obtain.
2. Constraints that demand ${A_ix\in Q_i}$ for some matrices ${A_i}$ and simple convex sets ${Q_i}$. Although we assume that projections onto the ${Q_i}$ are easy, these constraints are “difficult” because of the presence of the matrices ${A_i}$.

If there were only simple constraints a very basic method to solve the problem is the methods of alternating projections, also known as POCS (projection onto convex sets): Simply project onto all the sets ${C_i}$ in an iterative manner. For difficult constraints, one can do the following: Construct a hyperplane ${H_k}$ that separates the current iterate ${x^k}$ from the set defined by the constraint ${Ax\in Q}$ and project onto the hyperplane. Since projections onto hyperplanes are simple and since the hyperplane separates we move closer to the constraint set and this is a reasonable step to take. One such separating hyperplane is given as follows: For ${x^k}$ compute ${w^k = Ax^k-P_Q(Ax^k)}$ (with the orthogonal projection ${P_Q}$) and define

$\displaystyle H_k = \{x\ : (A^Tw^k)^T\cdot x \leq (A^Tw^k)^T\cdot x^k - \|w^k\|_2^2\}.$

Illustration of projections onto convex sets and separating hyperplanes

Now we already can unite the Landweber iteration and the Kaczmarz method as follows: Consider the system ${Ax=b}$ as a split feasibility problem in two different ways:

1. Treat ${Ax=b}$ as one single difficult constraint (i.e. set ${Q=\{b\}}$). Some calculations show that the above proposed method leads to the Landweber iteration (with a special stepsize).
2. Treat ${Ax=b}$ as ${m}$ simple constraints ${a_i\cdot x = b_i}$. Again, some calculations show that this gives the Kaczmarz method.

Of course, one could also work “block-wise” and consider groups of equations as difficult constraints to obtain “block-Kaczmarz methods”.

Now comes the last twist: By adapting the term of “projection” one gets more methods. Particularly interesting is the notion of Bregman projections which comes from Bregman distances. I will not go into detail here, but Bregman distances are associated to convex functionals ${J}$ and by replacing “projection onto ${C_i}$ or hyperplanes” by respective Bregman projections, one gets another method for split feasibility problems. The two things I found remarkable:

• The Bregman projection onto hyperplanes is pretty simple. To project some ${x^k}$ onto the hyperplane ${H = \{x\ :\ a^T\cdot x\leq \beta\}}$, one needs a subgradient ${z^k\in\partial J(x^k)}$ (in fact an “admissible one” but for that detail see the paper) and then performs

$\displaystyle x^{k+1} = \nabla J^*(z^k - t_k a)$

(${J^*}$ is the convex dual of ${J}$) with some appropriate stepsize ${t_k}$ (which is the solution of a one-dimensional convex minimization problem). Moreover, ${z^{k+1} = z^k - t_k a}$ is a new admissible subgradient at ${x^{k+1}}$.

• If one has a problem with a constraint ${Ax=b}$ (formulated as an SFP in one way or another) the method converges to the minimum-${J}$ solution of the equation if ${J}$ is strongly convex.

Note that strong convexity of ${J}$ implies differentiability of ${J^*}$ and Lipschitz continuity of ${\nabla J}$ and hence, the Bregman projection can indeed be carried out.

Now one already sees how this relates to the linearized Bregman method: Setting ${J(x) = \lambda\|x\|_1 + \tfrac12\|x\|_2^2}$, a little calculation shows that

$\displaystyle \nabla J^*(z) = S_\lambda(z).$

Hence, using the formulation with a “single difficult constraint” leads to the linearized Bregman method with a specific stepsize. It turns out that this stepsize is a pretty good one but also that one can show that a constant stepsize also works as long as it is positive and smaller that ${2\|A\|^{-2}}$.

In the paper we present several examples how one can use the framework. I see one strengths of this approach that one can add convex constraints to a given problem without getting into any trouble with the algorithmic framework.

The second paper extends a remark that we make in the first one: If one applies the framework of the linearized Bregman method to the case in which one considers the system ${Ax=b}$ as ${m}$ simple (hyperplane-)constraints one obtains a sparse Kaczmarz solver. Indeed one can use the simple iteration

$\displaystyle \begin{array}{rcl} z^{k+1} & = &z^k - a_{r(k)}^T\frac{a_{r(k)}\cdot x^k - b_k}{\|a_{r(k)}\|_2^2}\\ x^{k+1} & = &S_\lambda(z^{k+1}) \end{array}$

and will converge to the same sparse solution as the linearized Bregman method.

This method has a nice application to “online compressed sensing”: We illustrate this in the paper with an example from radio interferometry. There, large arrays of radio telescopes collect radio emissions from the sky. Each pair of telescopes lead to a single measurement of the Fourier transform of the quantity of interest. Hence, for ${k}$ telescopes, each measurement gives ${k(k-1)/2}$ samples in the Fourier domain. In our example we used data from the Very Large Array telescope which has 27 telescopes leading to 351 Fourier samples. That’s not much, if one want a picture of the emission with several ten thousands of pixels. But the good thing is that the Earth rotates (that’s good for several reasons): When the Earth rotates relative to the sky, the sampling pattern also rotates. Hence, one waits a small amount of time and makes another measurement. Commonly, this is done until the earth has made a half rotation, i.e. one complete measurement takes 12 hours. With the “online compressed sensing” framework we proposed, one can start reconstructing the image as soon the first measurements have arrived. Interestingly, one observes the following behavior: If one monitors the residual of the equation, it goes down during iterations and jumps up when new measurements arrive. But from some point on, the residual stays small! This says that the new measurements do not contradict the previous ones and more interestingly this happened precisely when the reconstruction error dropped down such that “exact reconstruction” in the sense of compressed sensing has happened. In the example of radio interferometry, this happened after 2.5 hours!

Reconstruction by online compressed sensing

You can find slides of a talk I gave at the Sparse Tomo Days here.

Next Page »