I am not an optimizer by training. My road to optimization went through convex analysis. I started with variational methods for inverse problems and mathematical imaging with the goal of deriving properties of minimizers of convex functions. Hence, I studied a lot of convex analysis. Later I got interested in how to actually solve convex optimization problems and started to read books about (convex) optimization. At first I was always distracted by the way optimizers treat constraints. To me, a convex optimization problem always looks like

\displaystyle  \min_x F(x).

Everything can be packed into the convex objective. If you have a convex objective {f} and a constraint {c(x) \leq 0} with a convex function {c}, just take {F = f + I_{\{c\leq 0\}}}, i.e., add the indicator function of the constraint to the objective (for some strange reason, Wikipedia has the name and notation for indicator and characteristic function the other way round than I, and many others, do…). Similarly for multiple constraints {c_i(x)\leq 0}, linear equality constraints {Ax=b} and such.

In this simple world it is particularly easy to characterize all solutions of convex minimization problems: They are just those {x} for which

\displaystyle  0\in\partial F(x).

Simple as that. Just take the subgradient of the objective and that’s it.
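As a tiny concrete example: to minimize {f(x)=x^2} subject to {c(x)=1-x\leq 0}, set {F(x) = x^2 + I_{[1,\infty)}(x)}. Then

\displaystyle  \partial F(1) = 2 + \partial I_{[1,\infty)}(1) = 2 + (-\infty,0] = (-\infty,2]\ni 0,

while for any {x>1} one has {\partial F(x) = \{2x\}\not\ni 0}, so the condition {0\in\partial F(x)} singles out the constrained minimizer {x=1}.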

When reading the optimization books and seeing how difficult the treatment of constraints is there, I was especially puzzled by how complicated optimality conditions such as the KKT conditions look in contrast to {0\in\partial F(x)}, and also by the notion of constraint qualifications.

These constraint qualifications are additional assumptions that are needed to ensure that a minimizer {x} fulfills the KKT-conditions. For example, if one has constraints {c_i(x)\leq 0} then the linear independence constraint qualification (LICQ) states that all the gradients {\nabla c_i(x)} for constraints that are “active” (i.e. {c_i(x)=0}) have to be linearly independent.

It took me a while to realize that there is a similar issue in my simple “convex analysis view” on optimization: When passing from the gradient of a function to the subgradient, many things stay as they are. But not everything. One thing that does change is the simple sum rule. If {F} and {G} are differentiable, then {\nabla(F+G)(x) = \nabla F(x) + \nabla G(x)}, always. That’s not true for subgradients! You always have that {\partial F(x) + \partial G(x) \subset \partial(F+G)(x)}. The reverse inclusion is not always true but holds, e.g., if there is some point at which {G} is finite and {F} is continuous. At first glance this sounds like a very weak assumption. But in fact, this is precisely in the spirit of constraint qualifications!

Take two constraints {c_1(x)\leq 0} and {c_2(x)\leq 0} with convex and differentiable {c_{1/2}}. We can express these by {x\in K_i = \{x\ :\ c_i(x)\leq 0\}} ({i=1,2}). Then it is equivalent to write

\displaystyle  \min_x f(x)\ \text{s.t.}\ c_i(x)\leq 0

and

\displaystyle  \min_x (f + I_{K_1} + I_{K_2})(x).

So characterizing solutions to either of these is just saying that {0 \in\partial (f + I_{K_1} + I_{K_2})(x)}. Oh, there we are: Are we allowed to pull the subgradient apart? We need to apply the sum rule twice, and at some point we need that there is a point at which {I_{K_1}} is finite and the other one, {I_{K_2}}, is continuous (or vice versa)! But an indicator function is only continuous in the interior of the set where it is finite. So the simplest form of the sum rule only applies if there is a feasible point at which at most one of the two constraints is active! Actually, the sum rule holds in many more cases, but it is not always simple to find out if it really holds in some particular case.

So, constraint qualifications are indeed similar to rules that guarantee that a sum rule for subgradients holds.

Geometrically speaking, both are meant to guarantee that if one “looks at the constraints individually” one can still see what is going on at points of optimality. It may well be that the sum of the individual subgradients is too small to yield any points with {0\in \partial f(x) + \partial I_{K_1}(x) + \partial I_{K_2}(x)}, although the optimization problem does have solutions!

As a very simple illustration take the constraints {K_1 = \{(x,y)\ :\ y\leq 0\}} and {K_2 = \{(x,y)\ :\ x^2\leq y\}} in two dimensions. The first constraint says “be in the lower half-plane” while the second says “be above the parabola {y=x^2}”. Now take the point {(0,0)} which is on the boundary of both sets. It’s simple to see (geometrically and algebraically) that {\partial I_{K_1}(0,0) = \{(0,y)\ :\ y\geq 0\}} and {\partial I_{K_2}(0,0) = \{(0,y)\ :\ y\leq 0\}}, so treating the constraints individually gives {\partial I_{K_1}(0,0) + \partial I_{K_2}(0,0) = \{(0,y)\ :\ y\in{\mathbb R}\}}. But the full story is that {K_1\cap K_2 = \{(0,0)\}}, thus {\partial(I_{K_1} + I_{K_2})(0,0) = \partial I_{K_1\cap K_2}(0,0) = {\mathbb R}^2} and consequently, the subgradient is much bigger.
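To make the failure explicit, take the objective {f(x,y)=x} on {K_1\cap K_2=\{(0,0)\}}: the point {(0,0)} is trivially the minimizer, but

\displaystyle  \nabla f(0,0) + \partial I_{K_1}(0,0) + \partial I_{K_2}(0,0) = \{(1,y)\ :\ y\in{\mathbb R}\}\not\ni (0,0),

so the “individual” optimality condition fails there, whereas of course {0\in\partial(f+I_{K_1}+I_{K_2})(0,0) = (1,0)+{\mathbb R}^2}.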

The mother example of optimization is to solve problems

\displaystyle  \min_{x\in C} f(x)

for functions {f:{\mathbb R}^n\rightarrow{\mathbb R}} and sets {C\subset{\mathbb R}^n}. One further classifies problems according to additional properties of {f} and {C}: If {C={\mathbb R}^n} one speaks of unconstrained optimization, if {f} is smooth one speaks of smooth optimization, if {f} and {C} are convex one speaks of convex optimization, and so on.

1. Classification, goals and accuracy

Usually, optimization problems do not have a closed form solution. Consequently, optimization is not primarily concerned with calculating solutions to optimization problems, but with algorithms to solve them. However, having a convergent or terminating algorithm is not fully satisfactory without knowing an upper bound on the runtime. There are several concepts one can work with in this respect and one is the iteration complexity. Here, one gives an upper bound on the number of iterations (which are only allowed to use certain operations such as evaluations of the function {f}, its gradient {\nabla f}, its Hessian, solving linear systems of dimension {n}, projecting onto {C}, calculating halfspaces which contain {C}, or others) to reach a certain accuracy. But also for the notion of accuracy there are several definitions:

  • For general problems one can of course desire to be within a certain distance to the optimal point {x^*}, i.e. {\|x-x^*\|\leq \epsilon} for the solution {x^*} and a given point {x}.
  • One could also demand to be at a point whose function value is close to the optimal one {f^*}, i.e., {f(x) - f^*\leq \epsilon}. Note that for this and for the first point one could also ask for relative accuracy.
  • For convex and unconstrained problems, one knows that the inclusion {0\in\partial f(x^*)} (with the subgradient {\partial f(x)}) characterizes the minimizers and hence, accuracy can be defined by demanding that {\min\{\|\xi\|\ :\ \xi\in\partial f(x)\}\leq \epsilon}.

It turns out that the first two definitions of accuracy are much too hard to attain for general problems, and even for smooth and unconstrained problems. The main issue is that for general functions one cannot decide whether a local minimizer is also a solution (i.e. a global minimizer) by only considering local quantities. Hence, one resorts to different notions of accuracy, e.g.

  • For smooth, unconstrained problems, aim at stationary points, i.e. find {x} such that {\|\nabla f(x)\|\leq \epsilon}.
  • For smooth problems with smooth constraints, aim at “approximate KKT points”, i.e. points that satisfy the Karush-Kuhn-Tucker conditions approximately.

(There are adaptations to the nonsmooth case in the same spirit.) Hence, it would be more honest not to write {\min_x f(x)} in these cases, since this is often not really the problem one is interested in. However, people write “solve {\min_x f(x)}” all the time even if they only want to find “approximately stationary points”.

2. The gradient method for smooth, unconstrained optimization

Consider a smooth function {f:{\mathbb R}^n\rightarrow {\mathbb R}} (we’ll say more precisely how smooth in a minute). We make no assumption on convexity and hence, we are only interested in finding stationary points. From calculus in several dimensions it is known that {-\nabla f(x)} is a direction of descent from the point {x}, i.e. there is a value {h>0} such that {f(x - h\nabla f(x))< f(x)}. Hence, it seems like moving into the direction of the negative gradient is a good idea. We arrive at what is known as gradient method:

\displaystyle  x_{k+1} = x_k - h_k \nabla f(x_k).

Now let’s be more specific about the smoothness of {f}. Of course we need that {f} is differentiable and we also want the gradient to be continuous (to make the evaluation of {\nabla f} stable). It turns out that some more smoothness makes the gradient method more efficient, namely we require that the gradient of {f} is Lipschitz continuous with a known Lipschitz constant {L}. The Lipschitz constant can be used to produce efficient stepsizes {h_k}: namely, for {h_k = 1/L} one has (as a direct consequence of the descent lemma {f(y)\leq f(x)+\langle\nabla f(x),y-x\rangle+\tfrac{L}{2}\|y-x\|^2}) the estimate

\displaystyle  f(x_k) - f(x_{k+1})\geq \frac{1}{2L}\|\nabla f(x_k)\|^2.

This inequality is really great because one can use telescoping to arrive at

\displaystyle  \frac{1}{2L}\sum_{k=0}^N \|\nabla f(x_k)\|^2 \leq f(x_0) - f(x_{N+1}) \leq f(x_0) - f^*

with the optimal value {f^*} (note that we do not need to know {f^*} to run the method; it only enters the estimate). We immediately arrive at

\displaystyle  \min_{0\leq k\leq N} \|\nabla f(x_k)\| \leq \frac{1}{\sqrt{N+1}}\sqrt{2L(f(x_0)-f^*)}.

That’s already a result on the iteration complexity! Among the iterates {x_0,\dots,x_N} there is one whose gradient norm is of order {N^{-1/2}}.
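Just to make this estimate tangible, here is a minimal Python sketch of the gradient method with {h_k=1/L}; the quadratic test function, the starting point and the number of iterations are arbitrary assumptions for illustration.

```python
import numpy as np

# Minimal sketch: gradient method with constant stepsize h = 1/L on a simple
# convex quadratic (an assumed toy example), checked against the bound
# min_k ||grad f(x_k)|| <= sqrt(2 L (f(x_0) - f^*) / (N + 1)).

H = np.diag([1.0, 3.0, 10.0])          # Hessian of the quadratic test function
f      = lambda x: 0.5 * x @ H @ x
grad_f = lambda x: H @ x
L = np.linalg.eigvalsh(H).max()        # Lipschitz constant of grad f
f_star = 0.0                           # minimal value of this particular f

x0 = np.array([5.0, -2.0, 1.0])
x, N = x0.copy(), 200
grad_norms = []
for k in range(N + 1):                 # iterates x_0, ..., x_N
    g = grad_f(x)
    grad_norms.append(np.linalg.norm(g))
    x = x - g / L                      # gradient step with h_k = 1/L

bound = np.sqrt(2 * L * (f(x0) - f_star) / (N + 1))
print(min(grad_norms), "<=", bound)    # the iteration complexity estimate holds
```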

However, from here on it gets complicated: We cannot say anything about the function values {f(x_k)} or about convergence of the iterates {x_k}. And even for convex functions {f} (which allow for more estimates from above and below) one needs some more effort to prove convergence of the function values to the global minimal one.

But how about convergence of the iterates for the gradient method if convexity is not given? It turns out that this is a hard problem. As illustration, consider the continuous case, i.e. a trajectory of the dynamical system

\displaystyle  \dot x = -\nabla f(x)

(which is a continuous limit of the gradient method as the stepsize goes to zero). A physical intuition about this dynamical system in {{\mathbb R}^2} is as follows: The function {f} describes a landscape and {x} are the coordinates of an object. Now, if the landscape is slippery, the object slides down the landscape, and if we omit friction and inertia, the object will always slide in the direction of the negative gradient. Consider now a favorable situation: {f} is smooth, bounded from below, and the level sets {\{f\leq t\}} are compact. What can one say about the trajectories of {\dot x = -\nabla f(x)}? Well, it seems clear that one will arrive at a local minimum after some time. But with a little imagination one can see that the trajectory of {x} does not even have to be of finite length! To see this, consider a landscape {f} that is a kind of bowl-shaped valley with a path which goes down the hillside in a spiral way such that it winds around the minimum infinitely often. This situation seems somewhat pathological and one usually does not expect situations like this in practice. If you have ever tried to prove convergence of the iterates of gradient or subgradient descent, you may have wondered why the proof turns out to be so complicated. The reason lies in the fact that such pathological functions are not excluded. But what functions should be excluded in order to avoid this pathological behavior without restricting to overly simple functions?

3. The Kurdyka-Łojasiewicz inequality

Here comes the so-called Kurdyka-Łojasiewicz inequality into play. I do not know its history well, but if you want a pointer, you could start with the paper “On gradients of functions definable in o-minimal structures” by Kurdyka.

The inequality provides a way to turn a complexity estimate for the gradient of a function into a complexity estimate for the function values. Hence, one would like to control the difference in function value by the gradient. One way to do so is the following:

Definition 1 Let {f} be a real valued function and assume (without loss of generality) that {f} has a unique minimum at {0} and that {f(0)=0}. Then {f} satisfies a Kurdyka-Łojasiewicz inequality if there exists a differentiable function {\kappa:[0,r]\rightarrow {\mathbb R}} on some interval {[0,r]} with {\kappa'>0} and {\kappa(0)=0} such that

\displaystyle  \|\nabla(\kappa\circ f)(x)\|\geq 1

for all {x} such that {0<f(x)<r}.

Informally, this definition ensures that one can “reparameterize the range of the function such that the resulting function has a kink in the minimum and is steep around that minimum”. This definition is due to the above paper by Kurdyka from 1998. In fact it is a slight generalization of the Łojasiewicz inequality (which dates back to a note of Łojasiewicz from 1963), which states that there are some {C>0} and some exponent {\theta} such that in the above situation it holds that

\displaystyle  \|\nabla f(x)\|\geq C|f(x)|^\theta.

To see this, take {\kappa(s) = s^{1-\theta}} and compute {\nabla(\kappa\circ f)(x) = (1-\theta)f(x)^{-\theta}\nabla f(x)} to obtain {1\leq (1-\theta)|f(x)|^{-\theta}\|\nabla f(x)\|}. This also makes clear that, when the inequality is fulfilled, the gradient provides control over the function values.
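As a tiny numerical illustration, consider the toy example {f(x)=x^4}, which has a degenerate minimum at {0}: since {|f'(x)| = 4|x|^3 = 4|f(x)|^{3/4}}, it satisfies the Łojasiewicz inequality with {\theta=3/4} and {C=4} (corresponding to {\kappa(s)=s^{1/4}}). The following sketch simply checks this on a grid.

```python
import numpy as np

# Check the Lojasiewicz inequality ||grad f(x)|| >= C |f(x)|^theta for the toy
# example f(x) = x^4 (degenerate minimum at 0), which satisfies it with
# theta = 3/4 and C = 4; the corresponding reparameterization is kappa(s) = s^{1/4}.

f      = lambda x: x**4
grad_f = lambda x: 4 * x**3
theta, C = 0.75, 4.0

xs = np.linspace(-1.0, 1.0, 2001)
xs = xs[xs != 0.0]                          # exclude the minimizer itself
lhs = np.abs(grad_f(xs))
rhs = C * np.abs(f(xs)) ** theta
print(bool(np.all(lhs >= rhs - 1e-12)))     # True on [-1, 1] \ {0}
```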

The works of Łojasiewicz and Kurdyka show that a large class of functions {f} fulfill the respective inequalities, e.g. piecewise analytic functions and an even larger class (functions definable in so-called o-minimal structures) which I haven’t fully understood yet. Since the Kurdyka-Łojasiewicz inequality allows one to turn estimates on {\|\nabla f(x_k)\|} into estimates on {|f(x_k)|}, it plays a key role in the analysis of descent methods. It somehow explains why one never really sees pathological behavior such as infinite minimization paths in practice. Lately there have been several works on further generalizations of the Kurdyka-Łojasiewicz inequality to the non-smooth case, see e.g. Characterizations of Lojasiewicz inequalities: subgradient flows, talweg, convexity by Bolte, Daniilidis, Ley and Mazet, and Convergence of non-smooth descent methods using the Kurdyka-Łojasiewicz inequality by Noll (however, I do not try to give an overview over the latest developments here). In particular, here at the French-German-Polish Conference on Optimization, which is taking place these days in Krakow, the Kurdyka-Łojasiewicz inequality has popped up several times.

In my previous post I announced the draft of the paper “Infeasible-Point Subgradient Algorithm and Computational Solver Comparison for l1-Minimization” which is now available as a preprint at optimization online.

1. Fixed bugs; different results

Basically not much has changed from the draft to the preprint; however, we had to fix some bugs in our computational comparison of solvers and this changed the results. For example, {\ell^1}-magic is now a little better, especially when combined with the heuristic support evaluation (HSE) we propose in the paper. But most notably, {\ell^1}-Homotopy is not the winner anymore. This is due to the fact that we had a conceptual error in our setup. Remember that {\ell^1}-Homotopy solves the Basis Pursuit denoising problem

\displaystyle  \min_x \frac12\|Ax-b\|_2^2 + \lambda\|x\|_1

starting with {\lambda = \|A^Tb\|_\infty} (which results in {x=0}) and decreasing {\lambda} while tracking the (piecewise linear) solution path. Provably, this reaches the Basis Pursuit solution for {\lambda=0} after crossing a finite number of breakpoints in the solution path. However, in our first experiments we used a final parameter of {\lambda = 10^{-9}}. And that was against our rules: We only considered solvers which (in theory) calculate the exact Basis Pursuit solution. Now we reran the calculations with {\lambda=0} and surprisingly the results were worse in terms of reconstruction accuracy (and, of course, also in terms of speed). We did not find out precisely which part of the solver is responsible for this effect, but it should have something to do with the accuracy of the inverse of the submatrix of {A^TA} which is maintained throughout the iterations.
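To make the role of the parameter {\lambda} in this problem concrete, here is a small sketch that solves the Basis Pursuit denoising problem by the simple proximal gradient method (ISTA), not by the homotopy method discussed above; the random test data and the choices of {\lambda} are illustrative assumptions.

```python
import numpy as np

# Sketch: solve min_x 0.5*||Ax - b||_2^2 + lam*||x||_1 by the proximal gradient
# method (ISTA). This is not the homotopy method discussed above, just a simple
# way to see what the problem and the parameter lam mean; A, b, lam are toy data.

def ista(A, b, lam, iters=3000):
    L = np.linalg.norm(A, 2) ** 2               # Lipschitz constant of the smooth part's gradient
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        z = x - A.T @ (A @ x - b) / L           # gradient step on 0.5*||Ax - b||^2
        x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft thresholding
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 50))
x_true = np.zeros(50)
x_true[[3, 17, 42]] = [1.0, -2.0, 0.5]
b = A @ x_true

print(np.round(ista(A, b, lam=0.01)[[3, 17, 42]], 2))    # roughly (1, -2, 0.5) for small lam
lam_max = np.max(np.abs(A.T @ b))
print(np.count_nonzero(ista(A, b, lam=lam_max)))         # 0: for lam = ||A^T b||_inf the solution is x = 0
```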

Another surprise was that the runs with {\lambda=10^{-9}} always ended with roughly the same solution accuracy (about {10^{-8}}) for all test instances (no matter what size, matrix type or number of nonzeros we used). That is a surprise because there is no formula which tells you in advance how accurate the Basis Pursuit denoising solution for a particular {\lambda} will be (compared to the Basis Pursuit solution). Maybe an explanation lies in the common features all our test instances share: All matrix columns are normalized to unit Euclidean norm and all non-zero entries in the solutions follow the same distribution.

If you want to have a closer look at our results you can find all the data (i.e. all the running times and solution accuracies for all solvers and all instances) on our SPEAR project website, here.

By the way: Now the overall winner is CPLEX (using the dual simplex method)! So, please stop spreading the message that standard LP solvers are not good for Basis Pursuit…

2. Testset online!

With the submission of the paper, we also made our testset publicly available. You can download all our test instances from the website of our SPEAR project, either as Matlab .mat files or as ASCII data (if you would like to use another language). Remember: Each instance comes with a matrix {A}, a vector {b} and a vector {x} which is guaranteed to be the unique solution of {Ax=b} with minimal one-norm. Moreover, there are instances for which the support of the solution is so large that the minimal-one-norm solution is not necessarily the sparsest solution anymore, which is also an interesting borderline case for most Basis Pursuit solvers.

3. ISAL1 online

Also, the Matlab code of ISAL1 (infeasible point subgradient algorithm for {\ell^1}) is online at the website of our SPEAR project. Check it out if you like.

Recently Andreas Tillmann presented the poster “An Infeasible-Point Subgradient Algorithm and a Computational Solver Comparison for l1-Minimization” at SPARS11. This poster summarized some results of the project SPEAR on sparse exact and approximate recovery of Marc Pfetsch and myself. We used this as an opportunity to release a draft of the accompanying paper with the same title. Although this draft is not totally ready to be submitted yet, I already summarize its content here.

In this paper we considered the Basis Pursuit problem (beware: the linked Wikipedia page is a stub at this time) from a pure optimization point of view. The Basis Pursuit problem is: For a given matrix {A\in{\mathbb R}^{m\times n}} (with {m<n}) and a vector {b\in{\mathbb R}^m}, find the solution to

\displaystyle \min_{x} \|x\|_1\quad\text{s.t.}\quad Ax = b. \ \ \ \ \ (1)

Hence, we mostly neglected its interesting features, such as reproducing the sparsest solution of an underdetermined linear system, and concentrated solely on its solution as an optimization problem.

The paper has three somewhat separate contributions:

  • The new algorithm ISAL1: The problem (1) is a convex nonsmooth constrained optimization problem. Marc and Andreas are optimizers and they wondered how the most basic method for this class of problems would perform: The projected subgradient method: For solving

    \displaystyle \min_x f(x)\quad\text{s.t.}\quad x\in C

    take steps along some negative subgradient and project back to {C}: {x^{k+1} = P_C(x^k - \alpha_k h^k)}. For (1) subgradients are readily available, e.g. {h^k = \text{sgn}(x^k)} (taken coordinate-wise). However, projecting onto the constraint {Ax=b} is not too easy. Denoting the projection simply by {P}, we can give a closed form expression (assuming that {A} has full rank) as

    \displaystyle P(z) = (I - A^T (AA^T)^{-1} A) z + A^T(AA^T)^{-1}b,

    this has the drawback that one needs to explicitly invert a matrix (which, however, is just {m\times m} and hence usually not too large, since we assume {m\ll n}). However, we proposed to replace the exact projection by an approximate one: In each step we solve for the projection by a truncated conjugate gradient method. While we expected that one would have to increase the accuracy of the approximate projection (i.e. the number of CG steps) as the iteration proceeds, surprisingly that is not true: Throughout the iteration, a fixed small number of CG steps (say {5} for matrices of size {1000\times 4000}, and largely independent of the size) suffices to obtain convergence (and especially feasibility of the iterates). In the paper we give a proof of convergence of the method under several assumptions on the step sizes and projection accuracies, building on our previous paper in which we analyzed this method in the general case. Moreover, we described several ways to speed up and stabilize the subgradient method. Finally, we called this method “Infeasible point subgradient algorithm for {\ell^1}”: ISAL1. A Matlab implementation can be found at the SPEAR website; a small illustrative sketch of the approximate projection idea is also given at the end of this post.

  • HSE, the heuristic support evaluation: That’s a pretty neat device which can be integrated in any Basis Pursuit solver (beware: not Basis Pursuit denoising; we want the equality constraint). The idea is based on the following small lemma:

    Lemma 1 A feasible vector {\bar x} (i.e. {A\bar x = b}) is optimal for (1) if and only if there is {w\in{\mathbb R}^m} such that {A^Tw \in\partial\|\bar x\|_1}.

    The proof basically consists of noting that the normal cone to the constraint set {\{x\ :\ Ax=b\}} is the range of {A^T} and hence the condition is equivalent to saying that this normal cone intersects the subgradient {\partial\|\bar x\|_1}, which is necessary and sufficient for {\bar x} being optimal. In practice the HSE does the following:

    • deduce candidate (approx.) support {S} from a given {x}
    • compute an approximate solution {\hat{w}} of {A_{S}^T w = \text{sgn}(x_{S})} via {\hat w = (A_S^T)^\dagger\text{sgn}(x_S)} with the help of CG
    • if {\|A^T \hat{w}\|_\infty \approx 1} check existence of a {\hat{x}} with {A_{S} \hat{x}_{S} = b} and {\hat{x}_i = 0} {\forall\, i \notin S}
    • if that {\hat x} exists, check if the relative duality gap {(\|\hat{x}\|_1 + b^T (-\hat{w}))/\|\hat{x}\|_1} is small and return “success” if so, i.e. take {\hat x} as an optimal solution

    Again, CG usually performs great here and only very few iterations (say {5}) are needed. In practice this method never returned a vector {\hat x} marked as optimal that was not actually optimal. (A small sketch of this check is given at the end of this post.)

  • Computational comparison: We faced the challenge of a computational comparison for Basis Pursuit solvers.
    The first step was to design a testset. We constructed 100 matrices (74 of which are dense, 26 are sparse) by several constructions and concatenations (see Section 5 in the draft). More complicated was the construction of appropriate right hand sides. Why? Because we wanted to have unique solutions! That is because we wanted to use the norm difference {\|x^*-\hat x\|_2} between the optimal and the computed solution as a measure for both optimality and feasibility. In the first place we used the exact recovery condition (ERC) due to Joel Tropp (described, e.g., in this blog post of Bob’s blog). However, this does not only guarantee uniqueness of solutions but also that the minimum {1}-norm solution is also the sparsest. Since that is probably too much to ask of all solutions (i.e. they have to be very sparse) we constructed some more right hand sides using L1TestPack: Construct an {x} such that there is a {w} with {A^T w \in\partial \|x\|_1} and use {b = Ax}. This also leads to unique solutions for Basis Pursuit if {A} is injective when restricted to the columns corresponding to the entries with {(A^Tw)_i = \pm 1}, but allows for much larger supports. For the results of the comparison of ISAL1, SPGL1, YALL1, {\ell^1}-MAGIC, SparseLab, the homotopy solver of Salman Asif and CPLEX, check the paper. However, some things are interesting:

    1. homotopy is the overall winner (which is somewhat expected for the instances constructed via the ERC, but not for the others). Great work Salman!
    2. ISAL1 is quite good (although it is the simplest among all methods).
    3. HSE works great: Including it, e.g., in SPGL1 produces “better” solutions in less time.
    4. CPLEX is remarkably good (we used the dual simplex method). So: how come so many people keep saying that standard LP solvers do not work well for Basis Pursuit? That is simply not true for the dual simplex! (However, the interior point method in CPLEX was not competitive at all.)

    We plan to make a somewhat deeper evaluation of our computational results before submitting the paper, in order to draw more detailed conclusions on the performance of the solvers on different instances.
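As referenced in the description of ISAL1 above, here is a small Python sketch of the projected subgradient iteration with an approximate projection onto {\{x\ :\ Ax=b\}} computed by a truncated conjugate gradient method. It is only meant to illustrate the idea: the random data, the stepsizes {\alpha_k = 1/k} and the fixed number of {5} CG steps are illustrative assumptions, not the tuned choices from the paper.

```python
import numpy as np

# Projected subgradient sketch for min ||x||_1 s.t. Ax = b with an inexact
# projection: solve (A A^T) y = A z - b by a few CG steps and set P(z) ~ z - A^T y.
# Data, stepsizes and the number of CG steps are illustrative assumptions.

def truncated_cg(M, r, iters=5):
    """A few CG steps for M y = r (M symmetric positive definite), started at y = 0."""
    y = np.zeros_like(r)
    p, res = r.copy(), r.copy()
    for _ in range(iters):
        Mp = M @ p
        alpha = (res @ res) / (p @ Mp)
        y = y + alpha * p
        res_new = res - alpha * Mp
        p = res_new + (res_new @ res_new) / (res @ res) * p
        res = res_new
    return y

def approx_project(A, b, z, cg_iters=5):
    """Approximate projection of z onto {x : Ax = b}."""
    y = truncated_cg(A @ A.T, A @ z - b, iters=cg_iters)
    return z - A.T @ y

rng = np.random.default_rng(1)
A = rng.standard_normal((30, 100))
b = A @ rng.standard_normal(100)

x = approx_project(A, b, np.zeros(100))     # (approximately) feasible starting point
for k in range(1, 201):
    h = np.sign(x)                          # a subgradient of ||.||_1 at x
    x = approx_project(A, b, x - h / k)     # subgradient step + inexact projection
print(np.linalg.norm(A @ x - b))            # infeasibility stays modest despite only 5 CG steps per projection
```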
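Similarly, as referenced in the description of the HSE, here is a compact sketch of that optimality check based on Lemma 1. It replaces the CG solves used in the paper by dense least-squares solves, and all tolerances and names are ad hoc illustrative assumptions.

```python
import numpy as np

# Sketch of the heuristic support evaluation (HSE) from Lemma 1: a feasible x is
# optimal for min ||x||_1 s.t. Ax = b iff there is w with A^T w in the subgradient
# of ||.||_1 at x. Least-squares solves replace the CG steps used in the paper;
# the tolerances are ad hoc illustrative choices.

def hse_check(A, b, x, tol=1e-6):
    S = np.flatnonzero(np.abs(x) > tol)                # candidate support deduced from x
    if S.size == 0:
        return False, None
    w, *_ = np.linalg.lstsq(A[:, S].T, np.sign(x[S]), rcond=None)   # A_S^T w ~ sgn(x_S)
    if np.max(np.abs(A.T @ w)) > 1 + tol:              # dual feasibility: ||A^T w||_inf <= 1
        return False, None
    xhat_S, *_ = np.linalg.lstsq(A[:, S], b, rcond=None)            # try A_S xhat_S = b
    xhat = np.zeros_like(x)
    xhat[S] = xhat_S
    if np.linalg.norm(A @ xhat - b) > tol * np.linalg.norm(b):
        return False, None
    gap = (np.linalg.norm(xhat, 1) + b @ (-w)) / np.linalg.norm(xhat, 1)
    return abs(gap) < tol, xhat                        # small relative duality gap => accept xhat

# Usage: call hse_check(A, b, x_current) inside any Basis Pursuit solver and stop
# early with xhat if it reports success.
```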