1. A numerical experiment on sparse regularization
To start, I take a standard problem from the Regularization Tools Matlab toolbox: The problem \texttt{deriv2}. This problem generates a matrix
and two vectors
and
such that the equation
is a Galerkin discretization of the integral equation

with a kernel
such that the solution amounts to solving a boundary value problem. The Galerkin ansatz functions are simply orthonormal characteristic functions on intervals, i.e.
. Thus, I work with matrices
and vectors
and
.
I want to use sparse regularization to reconstruct spiky solutions, that is, I solve problems

Now, my first experiment goes as follows:
Experiment 1 (Discretization goes to zero)
I generate spiky data: I fix a point
in the interval
, namely
, and a value
. Now I consider the data
which is a delta peak of height
and
(which in turn leads to a right hand side
). I construct the corresponding
and the right hand side
. Now I aim at solving

for different discretizations (
). In the numerics, I have to scale
with
, i.e. I solve

and I obtain the following results: In black I show the data
,
and so on, and in blue I plot the minimizer and its image under
.
For
:
For
: 
For
: 
For
: 
For
:
Note that the scale varies in the pictures, except in the lower left one where I show the discretized
. As is should be, this converges nicely to a piecewise linear function. However, the discretization of the solution blows up which is also as it should be, since I discretize a delta peak. Well, this basically shows, that my scaling is correct.
From the paper Sparse regularization with
penalty term one can extract the following result.
Theorem 1 Let
be linear, bounded and injective and let
have finite support. Moreover let
and
. Furthermore, denote with
the minimizer of

Then, for
it holds that

Now let’s observe this convergence rates in a second experiment:
Experiment 2 (Convergence rate
) Now we fix the discretization (i.e.
), and construct a series of
‘s for
in a logscale between
and
. I scale
proportional to
and caluclate minimizers of

The I measure the error
and plot it doubly logarithmically against
.

And there you see the linear convergence rate as predicted.
In a final experiment I vary both
and
:
Experiment 3 (
and “
”)Now we repeat Experiment 1 for different
and put all the loglog plots in one figure. This looks like this: You clearly observe the linear convergence rate in any case. But there is another important thing: The larger the
(i.e. the smaller the
), the later the linear rate kicks is (i.e. for smaller
). You may wonder what the reason is. By looking at the reconstruction for varying
and
(which I do not show here) you see the following behavior: For large noise the regularized solutions consist of several peaks located all over the place and with vanishing noise, one peak close to the original one gets dominant. However, this peak is not at the exact position, but at a slightly larger
; moreover, it is slightly smaller. Then, this peak moves towards the right position and is also getting larger. Finally, the peak arrives at the exact position and remains there while approaching the correct height.

Hence, the linear rate kicks in, precisely when the accuracy is higher than the discretization level.
Conclusion:
- The linear convergence rate is only present in the discrete case. Moreover, it starts at a level which can not be resolved by the discretization.
- “Sparsity penalties” in the continuous case are a different and delicate matter. You may consult the preprint “Inverse problems in spaces of measures” which formulates the sparse recovery problem in a continuous setting but in the space of Radon measures rather than in
(which is simply not working). There Kristian and Hanna show weak* convergence of the minimizers.
- Finally, for “continuous sparsity” also some kind of convergence is true, however, not in norm (which really should be the variation norm in measure space). Weak* convergence can be quantified by the Prokhorov metric or the Wasserstein metric (which is also called earth movers distance in some comiunities). Convergence with respect to these metric should be true (under some assumptions) but seem hard to prove. Convergence rates would be cool, but seem even harder.
Like this:
Like Loading...
In my previous post I announced the draft of the paper “Infeasible-Point Subgradient Algorithm and Computational Solver Comparison for l1-Minimization” which is now available as a preprint at optimization online.
1. Fixed bugs; different results
Basically not much has changed from the draft to the preprint, however, we had to fix some bugs in our computational comparison of solvers and this changed the results. For example,
-magic is now a little better, especially when combined with the heuristic support evaluation (HSE) we propose in the paper. But most notable,
-Homotopy is not the winner anymore. This is due to the fact that we had a conceptual error in our setup. Remember, that
-Homotopy solves that Basis Pursuit denoising problem

starting with
(which results in
) and decreases
while tracking the (piecewise linear) solution path. Provable this reaches the Basis Pursuit solution for
after crossing a finite number of breaks in the solution path. However, in our first experiments we used a final parameter of
. And that was against our rules: We only considered solvers which (in theory) calculate the exact Basis Pursuit solution. Now we reran the calculations with
and surprisingly the results were worse in terms of reconstruction accuracy (of course, also in terms of speed). We did not precisely found out which part of the solver is responsible for this effect, but it should have something to do with the accuracy of the inverse of the submatrix of
which is maintained throughout the iterations.
Another surprise was that the results for
always ended with an approximate solution accuracy (about
) for all test instances (no matter what size, matrix type or number of nonzeros we used). That is a surprise because there is no formula which tells you in advance how accurate the Basis Pursuit denoising for a particular
will be (compared to the Basis Pursuit solution). Maybe an explanation lies is the common features all our test instances share: All matrix columns are normalized to unit Euclidean norm and all non-zero entries in the solutions follow the same distribution.
If you want to have a closer look on our results you can find all the data (i.e. all the running times and solution accuracies for all solvers and all instances) on our SPEAR project website, here.
By the way: Now the overall winner is CPLEX (using the dual simplex method)! So, please stop carrying the message that standard LP solvers are not good for Basis Pursuit…
2. Testset online!
With the submission of the paper, we also made our testset publicly available. You can download all our test instances the website of our SPEAR project both as Matlab .mat files or as ASCII-data (if you would like to use another language). Remember: Each instance comes with a matrix
, a vector
and a vector
which is guaranteed to be the unique solution of
with minimal one-norm. Moreover, there are instance for which the support of the solution is that large, that the minimal-one-norm solution is not necessarily the sparsest solution anymore which is also an interesting borderline case for most Basis Pursuit solvers.
3. ISAL1 online
Also, the Matlab code of ISAL1 (infeasible point subgradient algorithm for
) is online at the website of our SPEAR project. Check it out if you like.
Like this:
Like Loading...
I used to work on “non-convex” regularization with
-penalties, that is, studying the Tikhonov functional

with a linear operator
and
.
The regularization properties are quite nice as shown by Markus Grasmair in “Well-posedness and convergence rates for sparse regularization with sublinear
penalty term” and “Non-convex sparse regularisation” and Kristian Bredies and myself in “Regularization with non-convex separable constraints”.
The next important issue is, to have some way to calculate global minimizers for~(1). But, well, this task may be hard, if not hopeless: Of course one expects a whole lot of local minimizers.
It is quite instructive to consider the simple case in which
is the identity first:
Example 1 Consider the minimization of

This problem separates over the coordinates and hence, can be solved by solving the one-dimensional minimization problem

We observe:
- For
we get
.
- Replacing
by
leads to
instead of
.
Hence, we can reduce the problem to: For
find

One local minimizer is always
since the growth of the
-th power beats the term
. Then,
is large enough, there are two more extrema for~(4) which are given as the solutions to

one of which is a local maximum (the one which is smaller in magnitude) and the other is a local minimum (the one which is larger in magnitude). This is illustrated in the following “bifurcation” picture:

Now the challenge is, to find out which local minimum has the smaller value. And here a strange thing happens: The “switching point” for
at which the global minimizer jumps from
to the upper branch of the (multivalued) inverse of
is not at the place at which the second local minimum occurs. It is a little bit larger: In “Convergence rates and source conditions for Tikhonov regularization with sparsity constraints” I calculated this “jumping point” the as the weird expression

This leads to the following picture of the mapping


1. Iterative re-weighting
One approach to calculate minimizers in~(1) is the so called iterative re-weighting, which appeared at least in “An unconstrained
minimization for sparse solution of under determined linear systems” by Ming-Jun Lai and Jingyue Wang but is probably older. I think for the problem with equality constraints

the approach at least dates back to the 80s but I forgot the reference… For the minimization of (1) the approach goes as follows: For
choose a
and a small
and rewrite the
-quasi-norm as

The necessary condition for a minimizer of

is (formally take the derivative)
![\displaystyle 0 = \alpha \Big[\frac{p}{q} (\epsilon + |x_i|^q)^{\frac{p}{q}-1} q \textup{sgn}(x_i) |x_i|^{q-1}\Big]_i + A^*(Ax-b) \displaystyle 0 = \alpha \Big[\frac{p}{q} (\epsilon + |x_i|^q)^{\frac{p}{q}-1} q \textup{sgn}(x_i) |x_i|^{q-1}\Big]_i + A^*(Ax-b)](http://s0.wp.com/latex.php?latex=%5Cdisplaystyle+0+%3D+%5Calpha+%5CBig%5B%5Cfrac%7Bp%7D%7Bq%7D+%28%5Cepsilon+%2B+%7Cx_i%7C%5Eq%29%5E%7B%5Cfrac%7Bp%7D%7Bq%7D-1%7D+q+%5Ctextup%7Bsgn%7D%28x_i%29+%7Cx_i%7C%5E%7Bq-1%7D%5CBig%5D_i+%2B+A%5E%2A%28Ax-b%29+&bg=ffffff&fg=000000&s=0)
Note that the exponent
is negative (which is also a reason for the introduction of the small
). Aiming at an iteration, we fix some of the
‘s and try to solve for others: If we have a current iterate
we try to find
by solving
![\displaystyle 0 = \alpha \Big[\frac{p}{q} (\epsilon + |x_i^k|^q)^{\frac{p}{q}-1} q \textup{sgn}(x_i) |x_i|^{q-1}\Big]_i + A^*(Ax-b) \displaystyle 0 = \alpha \Big[\frac{p}{q} (\epsilon + |x_i^k|^q)^{\frac{p}{q}-1} q \textup{sgn}(x_i) |x_i|^{q-1}\Big]_i + A^*(Ax-b)](http://s0.wp.com/latex.php?latex=%5Cdisplaystyle+0+%3D+%5Calpha+%5CBig%5B%5Cfrac%7Bp%7D%7Bq%7D+%28%5Cepsilon+%2B+%7Cx_i%5Ek%7C%5Eq%29%5E%7B%5Cfrac%7Bp%7D%7Bq%7D-1%7D+q+%5Ctextup%7Bsgn%7D%28x_i%29+%7Cx_i%7C%5E%7Bq-1%7D%5CBig%5D_i+%2B+A%5E%2A%28Ax-b%29+&bg=ffffff&fg=000000&s=0)
for
. This is the necessary condition for another minimization problem which involves a weighted
-norm: Define (non-negative) weights
an iterate

Lai and Wang do this for
which has the benefit that each iteration can be done by solving a linear system. However, for general
each iteration is still a convex minimization problem. The paper “Convergence of Reweighted
Minimization Algorithms and Unique Solution of Truncated
Minimization” by Xiaojun Chen and Weijun Zhou uses
and both papers deliver some theoretical results of the iteration. Indeed in both cases one can show (subsequential) convergence to a critical point.
Of course the question arises if there is a chance that the limit will be a global minimizer. Unfortunately this is not probable as a simple numerical experiment shows:
Example 2 We apply the iteration (5) to the one dimensional problem (3) in which we know the solution. And we do this for many values of
and plot the value of
against the limit of the iteration. Good news first: Everything converges nicely to critical points as deserved. Even better:
can be really small—machine precision works. The bad news: The limit depends on the initial value. Even worse: The method frequently ends on “the wrong branch”, i.e. in the local minimum which is not global. I made the following experiment: I took
, set
and chose
. First I initialized for all values of
with
. This produced the following output (I plotted every fifth iteration):

Well, the iteration always chose the upper branch… In a second experiment I initialized with a smaller value, namely with
for all
. This gave:

That’s interesting: I ended at the upper branch for all values below the point where the lower branch (the one with the local maximum) crosses the initialization line. This seems to be true in general. Starting with
gave 
Well, probably this is not too interesting: Starting “below the local maximum” will bring you to the local minimum which is lower and vice versa. Indeed Lai and Wang proved in their Theorem 2.5 that for a specific setting (
of completely full rank, sparsity high enough) there is an
small enough such that the method will pick the global minimizer. But wait—they do not say anything about initialization… What happens if we initialize with zero? I have to check…
By the way: A similar experiment as in this example with different values of
showed the same behavior (getting the right branch if the initialization is ok). However: smaller
gave much faster convergence. But remember: For
(experimentally the fastest) each iteration is an
penalized problem while for
one has to solve a linear system. So there seems to be a tradeoff between “small number of iterations in IRLP” and “complexity of the subproblems”.
2. Iterative thresholding
Together with Kristian Bredies I developed another approach to these nasty non-convex minimization problems with
-quasi-norms. We wrote a preprint back in 2009 which is currently under revision. Moreover, we always worked in a Hilbert space setting that is
maps the sequence space
into a separable Hilbert space.
Remark 1 When showing result for problems in separable Hilbert space I sometimes get the impression that others think this is somehow pointless since in the end one always works with
in practice. However, I think that working directly in a separable Hilbert space is preferable since then one can be sure that the results will not depend on the dimension
in any nasty way.
Basically our approach was, to use one of the most popular approaches to the
-penalized problem: Iterative thresholding aka forward-backward splitting aka generalized gradient projection. I prefer the last motivation: Consider the minimization of a smooth function
over a convex set 

by the projected gradient method. That is: do a gradient step and use the projection
to project back onto
:

Now note that the projection is characterized by

Now we replace the “convex constraint”
by a penalty function
, i.e. we want to solve

Then, we just replace the minimization problem for the projection with

and iterate

The crucial thing is, that one needs global minimizers to obtain
. However, for these
penalties with
these are available as we have seem in Example~1. Hence, the algorithm can be applied in the case

One easily proves that one gets descent of the objective functional:
Lemma 1 Let
be weakly lower semicontinuous and differentiable with Lipschitz continuous gradient
with Lipschitz constant
and let
be weakly lower semicontinuous and coercive. Furthermore let
denote any solution of

Then for
it holds that

Proof: Start with the minimizing property

and conclude (by rearranging, expanding the norm-square, canceling terms and adding
to both sides) that

Finally, use Lipschitz-continuity of
to conclude


This gives descent of the functional value as long as
. Now starts the hard part of the investigation: Under what circumstances do we get convergence and what are possible limits?
To make a long story short: For
-penalties (and also other non-convex penalties which leave the origin with infinite slope) and fixed step-size
one gets that every subsequence of the iterates has a strong accumulation point which is a fixed point of the iteration for that particular
as long as
. Well that’s good, but here’s the bad news: As long as
we do not obtain the global minimizer. That’s for sure: Consider
and any
…
However, with considerably more effort one can show the following: For the iteration
with
(and another technical condition on the Lipschitz constant of
) the iterates have a strong accumulation point which is a solution
and hence, satisfies necessary conditions for a global minimizer.
That’s not too bad yet. Currently Kristian and I, together with Stefan Reiterer, work to show that the whole sequence of iterates converges. Funny enough: This seems to be true for
and
with rational
in
… Basically, Stefan was able to show this with the help of Gröbner bases and this has been my first contact with this nice theory. We hope to finalize our revision soon.
Like this:
Like Loading...
Recently Andreas Tillmann presented the poster “An Infeasible-Point Subgradient Algorithm and a Computational Solver Comparison for l1-Minimization” at SPARS11. This poster summarized some results of the project SPEAR on sparse exact and approximate recovery of Marc Pfetsch an myself. We used this as an opportunity to release a draft of the accompanying paper with the same title. Although this draft is not totally ready to be submitted yet, I already summarize its content here.
Is this paper we considered the Basis Pursuit problem (beware: the linked Wikipedia page is stub at this time) from a purely optimization point of view. The Basis Pursuit problem is: For given matrix
(with
) and a vector
, find the solution to

Hence, we mainly neglected all its interesting features of reproducing the sparsest solution of an underdetermined linear system and so on and solely concentrated on its solution as an optimization problem.
The paper has three somehow separated contributions:
- The new algorithm ISAL1: The problem (1) is a convex nonsmooth constrained optimization problem. Marc and Andreas are optimizers and they wondered how the most basic method for this class of problems would perform: The projected subgradient method: For solving

take steps along some negative subgradient and project back to
:
. For (1) subgradients are readily available, e.g.
(taken coordinate-wise). However, projecting onto the constraint
is not too easy. Denoting the projection simply by
, we can give a closed form expression (assuming that
has full rank) as

this has the drawback that one needs to explicitly invert a matrix (which, however, is just
and hence, is usually not too large since we assume
). However, we proposed replace the exact projection by an approximate one: In each step we solve for the projection by a truncated conjugate gradient method. While we expected that one should increase the accuracy of the approximate projection by increasing the number of CG-steps during iteration, surprisingly that is not true: Throughout the iteration, a fixed small number of iterations (say
for matrices of size
but mainly independently of the size) suffices to obtain convergence (and especially feasibility of the iterates). In this paper we give a proof of convergence of the methods under several assumptions on the step-sizes and projection accuracies building on our previous paper in which we analyzed this method in the general case. Moreover, we described several ways to speed up and stabilize the subgradient method. Finally, we called this method “Infeasible point subgradient algorithm for
”: ISAL1. A Matlab implementation can be found and the SPEAR website.
- HSE, the heuristic support evaluation: That’s a pretty neat device which can be integrated in any Basis Pursuit solver (beware: not Basis Pursuit denoising; we want the equality constraint). The idea is based on the following small lemma:
Lemma 1 A feasible vector
(i.e.
) is optimal for (1) if and only if there is
such that
.
The proof basically consists of noting that the normal cone on the constraint
is the image space of
and hence, the condition is equivalent to saying that this normal cone intersects the subgradient
which is necessary and sufficient for
being optimal. In practice the HSE does the following:
- deduce candidate (approx.) support
from a given 
- compute approximate solution
to
by
with the help of CG
- if
check existence of a
with
and

- if that
exists, check if the relative duality gap
is small and return “success” if so, i.e. take
as an optimal solution
Again, CG usually performs great here and only a very few iterations (say
) are needed. In practice this methods did never return any vector
marked as optimal which was wrong.
- Computational comparison: We faced the challenge of a computational comparison for Basis Pursuit solvers.
The first step was, to design a testset. We constructed 100 matrices (74 of which are dense, 26 are sparse) by several constructions and concatenations (see Section 5 in the draft). More complicated was the construction of appropriate right hand sides. Why? Because we wanted to have unique solutions! That is, because we wanted to have the norm difference
between optimal and computed solution as a measure for both optimality and feasibility. In the first place we used the ERC due to Joel Tropp (e.g. described in this blog post of Bob’s blog). However, this does not only guarantee uniqueness of solutions but also that the minimum
-norm solution is also the sparsest. Since that is probably too much to have for solutions (i.e. they have to be very sparse) we constructed some more right hand sides using L1TestPack: Construct an
such that there is
such that
and use
. This also leads to unique solutions for Basis Pursuit if
is injective when restricted to the columns which related to the entries in which
but allows for much larger supports.For the results of the comparison of ISAL1, SPGL1, YALL1,
-MAGIC, SparseLAB, the homotopy solves of Salman Asif and CPLEX check the paper. However, some things are interesting:
- homotopy is the overall winner (which is somehow clear for the instances constructed with ERC but not for others). Great work Salman!
- ISAL1 is quite good (although it is the simplest among all methods).
- HSE works great: Including it e.g. in SPGL1 produces “better” solution in less time.
- CPLEX is remarkably good (we used the dual simplex). So: How does it come that so many people keep saying that standard LP-solves do not work well for Basis Pursuit? That is simply not true for the dual simplex! (However, the interior point methods in CPLEX was not competitive at all.)
We plan to make a somehow deeper evaluation of our computational results before submitting the paper to have some more detailed conclusions on the performance of the solvers an different instances.
Like this:
Like Loading...