I am at at IFIP TC7 and today I talked about the inertial primal-dual forward-backward method Tom Pock and I developed in this paper (find my slides here). I got a few interesting questions and one was about the heavy-ball method.

I used the heavy-ball method by Polyak as a motivation for the inertial primal-dual forward-backward method: To minimize a convex function ${F}$, Polyak proposed the heavy-ball method

$\displaystyle y_k = x_k + \alpha_k(x_k-x_{k-1}),\qquad x_{k+1} = y_k - \lambda_k \nabla F(x_k) \ \ \ \ \ (1)$

with appropriate step sizes ${\lambda_k}$ and extrapolation factors ${\alpha_k}$. Polyaks motivation was as follows: The usual gradient descent ${x_{k+1} = x_k - \lambda_k \nabla F(x_k)}$ can be seen as a discretization of the ODE ${\dot x = -\nabla F(x)}$ and its comparably slow convergence comes from the fact that after discretization, the iterates starts to “zig-zag” in directions that do not point straight towards the minimizer. Adding “inertia” to the iteration should help to keep the method on track towards the solution. So he proposed to take the ODE ${\gamma\ddot x + \dot x = -\nabla F(x)}$, leading to his heavy ball method. After the talk, Peter Maaß asked me, if the heavy-ball method has an interpretation in a way that you do usual gradient descent but change to function in each iteration (somehow in the spirit of the conjugate gradient method). Indeed, one can do the following: Write the iteration as

$\displaystyle x_{k+1} = x_k - \lambda_k\Big[\tfrac{\alpha_k}{\lambda_k}(x_{k-1}-x_k) + \nabla F(x_k)\Big]$

and then observe that this is

$\displaystyle x_{k+1} = x_k - \lambda_k \nabla G_k(x_k)$

with

$\displaystyle G_k(x) = - \tfrac{\alpha_k}{2\lambda_k}\|x-x_{k-1}\|^2 + F(x).$

Hence, you have indeed a perturbed gradient descent and the perturbation acts in a way, that it moves the minimizer of the objective a bit such that it lies more in the direction towards which you where heading anyway and, moreover, pushes you away from the previous iterate ${x_{k-1}}$. This nicely contrasts the original interpretation from~(1) in which one says that one takes the direction coming from the current gradient, but before going into this direction move a bit more in the direction where you were moving.

Today I learned that Ernie Esser passed away on March 8th, 2015 at age 34. Ernie was a PostDoc at UBC, Vancouver (PhD from UCLA) and worked in applied mathematics, especially imaging, optimization and inverse problems. I can’t say that I knew him very well – but we were in the similar research communities and I remember sitting with him in restaurants and chatting at several occasions (e.g. in Vancouver and Hong Kong…) and I remember him as a perticularly nice guy. Also I followed his work and it frequently happened that I found papers by him thinking “ok, that’s solved” crossing a problem from the list of thing I would have liked to work on.

I am seriously shocked to learn that he died from pneumonia, especially at his young age. There is a obituary weblog here and I am glad that his workgroup did not only set up this page but also plans to do a scientific event to recognize his work.

I am not an optimizer by training. My road to optimization went through convex analysis. I started with variational methods for inverse problems and mathematical imaging with the goal to derive properties of minimizers of convex functions. Hence, I studied a lot of convex analysis. Later I got interested in how to actually solve convex optimization problems and started to read books about (convex) optimization. At first I was always distracted by the way optimizers treated constraints. To me, a convex optimization problem always looks like

$\displaystyle \min_x F(x).$

Everything can be packed into the convex objective. If you have a convex objective ${f}$ and a constraint ${c(x) \leq 0}$ with a convex function ${c}$, just take ${F = f + I_{\{c\leq 0\}}}$, i.e., add the indicator function of the constraint to the objective (for some strange reason, Wikipedia has the name and notation for indicator and characteristic function the other way round than I, and many others…). . Similarly for multiple constraints ${c_i(x)\leq 0}$ or linear equality constraints ${Ax=b}$ and such.

In this simple world it is particularly easy to characterize all solutions of convex minimization problems: They are just those ${x}$ for which

$\displaystyle 0\in\partial F(x).$

Simple as that. Only take the subgradient of the objective and that’s it.

When reading the optimization books and seeing how difficult the treatment of constraints is there, I was especially puzzled how complicated optimality conditions such as KKT looked like in contrast to ${0\in\partial F(x)}$ and also and by the notion of constraint qualifications.

These constraint qualifications are additional assumptions that are needed to ensure that a minimizer ${x}$ fulfills the KKT-conditions. For example, if one has constraints ${c_i(x)\leq 0}$ then the linear independence constraint qualification (LICQ) states that all the gradients ${\nabla c_i(x)}$ for constraints that are “active” (i.e. ${c_i(x)=0}$) have to be linearly independent.

It took me while to realize that there is a similar issue in my simple “convex analysis view” on optimization: When passing from the gradient of a function to the subgradient, many things stay as they are. But not everything. One thing that does change is the simple sum-rule. If ${F}$ and ${G}$ are differentiable, then ${\nabla(F+G)(x) = \nabla F(x) + \nabla G(x)}$, always. That’s not true for subgradients! You always have that ${\partial F(x) + \partial G(x) \subset \partial(F+G)(x)}$. The reverse inclusion is not always true but holds, e.g., if there is some point for which ${G}$ is finite and ${F}$ is continuous. At first glance this sounds like a very weak assumption. But in fact, this is precisely in the spirit of constraint qualifications!

Take two constraints ${c_1(x)\leq 0}$ and ${c_2(x)\leq 0}$ with convex and differentiable ${c_{1/2}}$. We can express these by ${x\in K_i = \{x\ :\ c_i(x)\leq 0\}}$ (${i=1,2}$). Then it is equivalent to write

$\displaystyle \min_x f(x)\ \text{s.t.}\ c_i(x)\leq 0$

and

$\displaystyle \min_x (f + I_{K_1} + I_{K_2})(x).$

So characterizing solution to either of these is just saying that ${0 \in\partial (f + I_{K_1} + I_{K_2})(x)}$. Oh, there we are: Are we allowed to pull the subgradient apart? We need to apply the sum rule twice and at some point we need that there is a point at which ${I_{K_1}}$ is finite and the other one ${I_{K_2}}$ is continuous (or vice versa)! But an indicator function is only continuous in the interior of the set where it is finite. So the simplest form of the sum rule only holds in the case where only one of two constraints is active! Actually, the sum rule holds in many more cases but it is not always simple to find out if it really holds for some particular case.

So, constraint qualifications are indeed similar to rules that guarantee that a sum rule for subgradients holds.

Geometrically speaking, both shall guarantee that if one “looks at the constraints individually” one still can see what is going on at points of optimality. It may well be that the sum of individual subgradients is too small to get any points with ${0\in \partial F(x) + \partial I_{K_1}(x) + \partial I_{K_2}(x)}$ but still there are solutions to the optimization problem!

As a very simple illustration take the constraints ${K_1 = \{(x,y)\ :\ y\leq 0\}}$ and ${K_2 = \{(x,y)\ :\ y^2\geq x\}}$ in two dimensions. The first constraint says “be in the lower half-plane” while the second says “be above the parabola ${y^2=x}$”. Now take the point ${(0,0)}$ which is on the boundary for both sets. It’s simple to see (geometrically and algebraically) that ${\partial I_{K_1}(0,0) = \{(0,y)\ :\ y\geq 0\}}$ and ${\partial I_{K_2}(0,0) = \{(0,y)\ :\ y\leq 0\}}$, so treating the constraints individually gives ${\partial I_{K_1}(0,0) + \partial I_{K_2}(0,0) = \{(0,y)\ :\ y\in{\mathbb R}\}}$. But the full story is that ${K_1\cap K_2 = \{(0,0)\}}$, thus ${\partial(I_{K_1} + I_{K_2})(0,0) = \partial I_{K_1\cap K_2}(0,0) = {\mathbb R}^2}$ and consequently, the subgradient is much bigger.

In my Analysis class today I defined the trigonometric functions ${\sin}$ and ${\cos}$ by means of the complex exponential. As usual I noted that for real ${x}$ we have ${|\mathrm{e}^{\mathrm{i} x}| = 1}$, i.e. ${\mathrm{e}^{\mathrm{i} x}}$ lies on the complex unit circle. Then I drew the following picture:

This was meant to show that the real part and the imaginary part of ${\mathrm{e}^{\mathrm{i} x}}$ are what is known as ${\cos(x)}$ and ${\sin(x)}$, respectively.

After the lecture a student came to me and noted that we could have started with ${a>1}$ and note that ${|a^{\mathrm{i} x}|=1}$ and could do the same thing. The question is: Does this work out? My initial reaction was: Yeah, that works, but you’ll get a different ${\pi}$

But then I wondered, if this would lead to something useful. At least for the logarithm one does a similar thing. We define ${a^x}$ for ${a>0}$ and real ${x}$ as ${a^x = \exp_a(x) = \exp(x\ln(a))}$, notes that this gives a bijection between ${{\mathbb R}}$ and ${]0,\infty[}$ and defines the inverse function as

$\displaystyle \log_a = \exp_a^{-1}.$

So, nothing stops us from defining

$\displaystyle \cos_a(x) = \Re(a^{\mathrm{i} x}),\qquad \sin_a(x) = \Im(a^{\mathrm{i} x}).$

Many identities are still valid, e.g.

$\displaystyle \sin_a(x)^2 + \cos_a(x)^2 = 1$

or

$\displaystyle \cos_a(x)^2 = \tfrac12(1 + \cos_a(2x)).$

For the derivative one has to be a bit more careful as it holds

$\displaystyle \sin_a'(x) = \ln(a)\cos_a(x),\qquad \cos_a'(x) = -\ln(a)\sin_a(x).$

Coming back to “you’ll get a different ${\pi}$”: In the next lecture I am going to define ${\pi}$ by saying that ${\pi/2}$ is the smallest positive root of the functions ${\cos}$. Naturally this leads to a definition of “${\pi}$ in base ${a}$” as follows:

Definition 1 ${\pi_a/2}$ is the smallest positive root of ${\cos_a}$.

How is this related to the area of the unit circle (which is another definition for ${\pi}$)?

The usual analysis proof goes by calculating the area of a quarter the unit circle by integral ${\int_0^1 \sqrt{1-x^2} dx}$.

Doing this in base ${a}$ goes by substituting ${x = \sin_a(\theta)}$:

$\displaystyle \begin{array}{rcl} \int\limits_0^1\sqrt{1-x^2}\, dx & = & \int\limits_0^{\pi_a/2}\sqrt{1-\sin_a(\theta)^2}\, \ln(a)\cos_a(\theta)\, d\theta\\ & = & \ln(a) \int\limits_0^{\pi_a/2} \cos_a(\theta)^2\, d\theta\\ & = & \ln(a) \frac12 \int\limits_0^{\pi_a/2}(1 + \cos_a(2\theta))\, d\theta\\ & = & \frac{\ln(a)}{2} \Big( \frac{\pi_a}{2} + \int\limits_0^{\pi_a/2}\cos_a(2\theta)\, d\theta\\ & = & \frac{\ln(a)\pi_a}{4} + 0. \end{array}$

Thus, the area of the unit circle is now ${\ln(a)\pi_a}$

Oh, and by the way, you’ll get the nice identity

$\displaystyle \pi_{\mathrm{e}^\pi} = 1$

(and hence, the area of the unit circle is indeed ${\ln(\mathrm{e}^\pi)\pi_{\mathrm{e}^\pi} = \pi}$)…

In another blogpost I wrote about convexity from an abstract point of view. Recall, that convex functions ${f:X\rightarrow Y}$ can be defined as soon as we have a real linear structure on ${X}$ and an order on ${Y}$ as this allows to formulate the basic requirement for a convex function, namely that for all ${x,y\in X }$ and ${0\leq\lambda\leq 1}$ it holds that

$\displaystyle f(\lambda x + (1-\lambda)y)\leq \lambda f(x) + (1-\lambda)f(y).$

One amazing thing about convexity is, that it implies some regularity for the function. Indeed, you’ll find something on the net if you search for “convexity implies continuity”. But wait. How can that be? We have a mapping ${f}$ from a vector space ${X}$ to some ordered space ${Y}$ (which I will always assume to be ${{\mathbb R}\cup\{\infty\}}$ here, i.e. the extended real line) and we did not specify any topology on ${X}$ (while the extended real line carries its usual order topology). Indeed, one can equip a vector space with a lot of different topologies so how can it be that some property like convexity, which is expressed in purely algebraical terms, implies something like continuity, which is topological property? The answer is, that it is not really true that “convexity implies continuity”. The correct statement is a bit more subtle:

A convex function is Lipschitz continuous at any point where it is locally bounded.

Ok, here we have something more: We need boundedness of ${f}$, but this is still related to ${Y}$ and not related to ${X}$. But there is this little word “locally” and this is the point where some topology on ${X}$ comes into play. Let’s assume that we have even a metric on ${X}$ so that we can talk about balls. Then, the statement reads as:

A convex function ${f}$ is Lipschitz continuous at a point ${x}$ if there exists a ${C>0}$ and ${r>0}$ such that ${|f(y)|\leq C}$ for ${y\in B_r(x)}$.

Put differently: The continuity of a convex function ${f}$ depends on the boundedness of ${f}$ on neighborhoods. Consequently, if we change the topology, we change the set of neighborhoods and hence, a fixed convex function may have different continuity behavior in different topologies. This does indeed happen. Consider the following extreme example: Let ${x_0\in X}$ and

$\displaystyle f(x) = \begin{cases} 0 & x=x_0\\ \infty & \text{else.} \end{cases}$

This function is convex but, for the norm-topology, not continuous at any point. Also, it is not locally bounded at any point. However, if we change the topology such that each point is its own neighborhood (that is, we take the discrete metric), than we get local boundedness and also continuity of ${f}$.

The Douglas-Rachford method is a method to solve a monotone inclusion ${0\in (A+B)x}$ with two maximally monotone operators ${A,B}$ defined on a Hilbert space ${X}$. The method uses the resolvents ${(I + \lambda A)^{-1}}$ and ${(I + \lambda B)^{-1}}$ and produces two sequences of iterates

$\displaystyle \begin{array}{rcl} x^{k+1}& =& (I + \lambda B)^{-1}(v^k)\\ v^{k+1} & = & v^k + (I+\lambda A)^{-1}(2x^{k+1} - v^k) -x^{k+1}. \end{array}$

Looks pretty opaque to me and I did not have some good intuition where this methods comes from and why it should work. Here’s a way I can remember (and which is copied from “Preconditioned Douglas-Rachford Splitting Methods for Convex-Concave Saddle-Point Problems” by Hongpeng Sun and Kristian Bredies):

Substituting ${w = Ax}$ gives the optimality system

$\displaystyle 0 \in w + Bx,\qquad 0 \in -x + A^{-1} w$

or, written differently

$\displaystyle 0 \in \begin{bmatrix} B & I\\ -I & A^{-1} \end{bmatrix} \begin{bmatrix} x\\w \end{bmatrix}.$

This is again a monotone inclusion, but now on ${X\times X}$. We introduce the positive definite operator

$\displaystyle M = \begin{bmatrix} I & -I\\ -I & I \end{bmatrix}$

and perform the iteration

$\displaystyle (M + \begin{bmatrix} B & I\\ -I & A^{-1} \end{bmatrix}) \begin{bmatrix} x^{k+1}\\w^{k+1} \end{bmatrix} \ni M \begin{bmatrix} x^k\\w^k \end{bmatrix}.$

(This is basically the same as applying the proximal point method to the preconditioned inclusion

$\displaystyle 0\in M^{-1} \begin{bmatrix} B & I\\ -I & A^{-1} \end{bmatrix} \begin{bmatrix} x\\w \end{bmatrix}.)$

Writing out the iteration gives

$\displaystyle \begin{array}{rcl} x^{k+1} & = &(I + B)^{-1}(x^k - w^k)\\ w^{k+1} & = &(I + A^{-1})^{-1}(w^k + 2x^{k+1} - x^k). \end{array}$

Now, applying the Moreau identity for monotone operators (${(I + A)^{-1} + (I+A^{-1})^{-1} = I}$), gives

$\displaystyle \begin{array}{rcl} x^{k+1} & = &(I + B)^{-1}(x^k - w^k)\\ w^{k+1} & = &w^k + 2x^{k+1} - x^k - (I + A)^{-1}(w^k + 2x^{k+1} - x^k) \end{array}$

substituting ${v^k = x^k - w^k}$ finally gives Douglas-Rachford:

$\displaystyle \begin{array}{rcl} x^{k+1} & = &(I + B)^{-1}(v^k)\\ v^{k+1} & = & -x^{k+1} + v^k + (I + A)^{-1}(2x^{k+1} - v^k) \end{array}$

(besides the stepsize ${\lambda}$ which we would get by starting with the equivalent inclusion ${0 \in \lambda(A+B)x}$ in the first place).

Probably the shortest derivation of Douglas-Rachford I have seen. Oh, and also the (weak) convergence proof comes for free: It’s a proximal point iteration and you just use the result by Rockafellar from “Monotone operators and the proximal point algorithm”, SIAM J. Control and Optimization 14(5), 1976.

Currently I am at the SIAM Imaging conference in Hong Kong. It’s a great conference with great people at a great place. I am pretty sure that this will be the only post from here, since the conference is quite intense. I just wanted to report on two ideas that have become clear here, although, they are both pretty easy and probably already widely known, but anyway:

1. Non-convex + convex objective

There are a lot of talks that deal with optimization problems of the form

$\displaystyle \min_u F(u) + G(u).$

Especially, people try to leverage as much structure of the functionals ${F}$ and ${G}$ as possible. Frequently, there arises a need to deal with non-convex parts of the objective, and indeed, there are several approaches around that deal in one way or another with non-convexity of ${F}$ or even ${F+G}$. Usually, in the presence of an ${F}$ that is not convex, it is helpful if ${G}$ has favorable properties, e.g. that still ${F+G}$ is bounded from below, coercive or even convex again. A particularly helpful property is strong convexity of ${G}$ (i.e. ${G}$ stays convex even if you subtract ${\epsilon/2\|\cdot\|^2}$ from it). Here comes the simple idea: If you already allow ${F}$ to be non-convex, but only have a ${G}$ that is merely convex, but not strongly so, you can modify your objective to

$\displaystyle \underbrace{F(u) - \tfrac\epsilon2\|u\|^2}_{\leftarrow F(u)} + \underbrace{G(u) + \tfrac\epsilon2\|u\|^2}_{\leftarrow G(u)}$

for some ${\epsilon>0}$. This will give you strong convexity of ${G}$ and an ${F}$ that is (often) theoretically no worse than it used to be. It appeared to me that this is an idea that Kristian Bredies told me already almost ten years ago and which me made into a paper (together with Peter Maaß) in 2005 which got somehow delayed and published no earlier than 2009.

If your problem has the form

$\displaystyle \min_u F(u) + G(Ku)$

with some linear operator ${K}$ and both ${F}$ and ${G}$ are convex, it has turned out, that it is tremendously helpful for the solution to consider the corresponding saddle point formulation: I.e. using the convex conjugate ${G^*}$ of ${G}$, you write

$\displaystyle \min_u \max_v F(u) + \langle Ku, v\rangle -G^*(v).$

A class of algorithms, that looks like to Arrow-Hurwicz-method at first glance, has been sparked be the method proposed by Chambolle and Pock. This method allows ${F}$ and ${G}$ to be merely convex (no smoothness or strong convexity needed) and only needs the proximal operators for both ${F}$ and ${G^*}$. I also worked on algorithms for slightly more general problems, involving a reformulation of the saddle point problem as a monotone inclusion, with Tom Pock in the paper An accelerated forward-backward algorithm for monotone inclusions and I also should mention this nice approach by Bredies and Sun who consider another reformulation of the monotone inclusion. However, in the spirit of the first point, one should take advantage of all the available structure in the problem, e.g. smoothness of one of the terms. Some algorithm can exploit smoothness of either ${F}$ or ${G^*}$ and only need convexity of the other term. An idea, that has been used for some time already, to tackle the case if ${F}$, say, is a sum of a smooth part and a non-smooth part (and ${G^*}$ is not smooth), is, to dualize the non-smooth part of ${F}$: Say we have ${F = F_1 + F_2}$ with smooth ${F_1}$, then you could write

$\displaystyle \begin{array}{rcl} &\min_u\max_v F_1(u) + F_2(u) + \langle Ku, v\rangle -G^*(v)\\ & \qquad= \max_u \min_{v,w} F_1(u) + \langle u,w\rangle + \langle Ku, v\rangle -G^*(v) - F_2^*(w) \end{array}$

and you are back in business, if your method allows for sums of convex functions in the dual. The trick got the sticky name “dual transportation trick” in a talk by Marc Teboulle here and probably that will help, that I will not forget it from now on…