I used to work on “non-convex” regularization with {\ell^p}-penalties, that is, studying the Tikhonov functional

\displaystyle \frac12 \|Ax-b\|_2^2 + \alpha\sum_{i}|x_i|^p \ \ \ \ \ (1)

 

with a linear operator {A} and {0<p<1}.

The regularization properties are quite nice as shown by Markus Grasmair in his two papers “Well-posedness and convergence rates for sparse regularization with sublinear {l^q} penalty term” and “Non-convex sparse regularisation” and by Kristian Bredies and myself in “Regularization with non-convex separable constraints”.

The next important issue is to have some way to calculate global minimizers for (1). But, well, this task may be hard, if not hopeless. Of course one expects a whole lot of local minimizers.

It is quite instructive to consider the simple case in which {A} is the identity first:

Example 1 Consider the minimization of

\displaystyle F(x) = \frac12\|x-b\|_2^2 + \alpha\sum_i |x_i|^p. \ \ \ \ \ (2)

 

This problem separates over the coordinates and hence, can be solved by solving the one-dimensional minimization problem

\displaystyle s^*\in\textup{arg}\min_s \frac12 (s-b)^2 + \alpha|s|^p. \ \ \ \ \ (3)

 

We observe:

  • For {b\geq 0} we get {s^*\geq 0}.
  • Replacing {b} by {-b} leads to {-s^*} instead of {s^*}.

Hence, we can reduce the problem to: For {b\geq 0} find

\displaystyle s^* \in\textup{arg}\min_{s\geq 0} \frac12 (s-b)^2 + \alpha\, s^p. \ \ \ \ \ (4)

 

One local minimizer is always {s^*=0} since the growth of the {p}-th power beats the term {(\cdot-b)^2}. Then, for {b} large enough, there are two more extrema for~(4) which are given as the solutions to

\displaystyle s + \alpha p s^{p-1} = b

one of which is a local maximum (the one which is smaller in magnitude) and the other is a local minimum (the one which is larger in magnitude). This is illustrated in the following “bifurcation” picture:

Now the challenge is, to find out which local minimum has the smaller value. And here a strange thing happens: The “switching point” for {b} at which the global minimizer jumps from {0} to the upper branch of the (multivalued) inverse of {s\mapsto s + \alpha p s^{p-1}} is not at the place at which the second local minimum occurs. It is a little bit larger: In “Convergence rates and source conditions for Tikhonov regularization with sparsity constraints” I calculated this “jumping point” the as the weird expression

\displaystyle b^* = \frac{2-p}{2-2p}\Bigl(2\alpha(1-p)\Bigr)^{\frac{1}{2-p}}.

This leads to the following picture of the mapping

\displaystyle b^\mapsto \textup{arg}\min_s \frac12 (s-b)^2 + \alpha|s|^p

1. Iterative re-weighting

One approach to calculate minimizers in~(1) is the so called iterative re-weighting, which appeared at least in “An unconstrained {\ell^q} minimization for sparse solution of under determined linear systems” by Ming-Jun Lai and Jingyue Wang but is probably older. I think for the problem with equality constraints

\displaystyle \min \|x\|_q\ \textup{ s.t. }\ Ax=b

the approach at least dates back to the 80s but I forgot the reference… For the minimization of (1) the approach goes as follows: For {0<p<1} choose a {q\geq 1} and a small {\epsilon>0} and rewrite the {p}-quasi-norm as

\displaystyle \sum_i |x_i|^p \approx \sum_i (\epsilon + |x_i|^q)^{\frac{p}{q}}.

The necessary condition for a minimizer of

\displaystyle \frac12\|Ax-b\|_2^2 + \alpha\sum_i (\epsilon + |x_i|^q)^{\frac{p}{q}}

is (formally take the derivative)

\displaystyle 0 = \alpha \Big[\frac{p}{q} (\epsilon + |x_i|^q)^{\frac{p}{q}-1} q \textup{sgn}(x_i) |x_i|^{q-1}\Big]_i + A^*(Ax-b)

Note that the exponent {\frac{p}{q}-1} is negative (which is also a reason for the introduction of the small {\epsilon}). Aiming at an iteration, we fix some of the {x}‘s and try to solve for others: If we have a current iterate {x^k} we try to find {x^{k+1}} by solving

\displaystyle 0 = \alpha \Big[\frac{p}{q} (\epsilon + |x_i^k|^q)^{\frac{p}{q}-1} q \textup{sgn}(x_i) |x_i|^{q-1}\Big]_i + A^*(Ax-b)

for {x}. This is the necessary condition for another minimization problem which involves a weighted {q}-norm: Define (non-negative) weights {w^k_i = \frac{p}{q} (\epsilon + |x^k_i|^p)^{\frac{p}{q}-1}} an iterate

\displaystyle x^{k+1}\in \textup{arg}\min_x \frac12\|Ax-b\|_2^2 + \alpha\sum_i w_i^k |x_i|^q. \ \ \ \ \ (5)

 

Lai and Wang do this for {q=2} which has the benefit that each iteration can be done by solving a linear system. However, for general {1\leq q\leq 2} each iteration is still a convex minimization problem. The paper “Convergence of Reweighted {\ell^1} Minimization Algorithms and Unique Solution of Truncated {\ell^p} Minimization” by Xiaojun Chen and Weijun Zhou uses {q=1} and both papers deliver some theoretical results of the iteration. Indeed in both cases one can show (subsequential) convergence to a critical point.

Of course the question arises if there is a chance that the limit will be a global minimizer. Unfortunately this is not probable as a simple numerical experiment shows:

Example 2 We apply the iteration (5) to the one dimensional problem (3) in which we know the solution. We do this for many values of {b} and plot the value of {b} against the limit of the iteration. Good news first: Everything converges nicely to critical points as expected. Even better: {\epsilon} can be really small–machine precision works. The bad news: The limit depends on the initial value. Even worse: The method frequently ends on “the wrong branch”, i.e. in the local minimum which is not global. I made the following experiment: I took {p=1/2}, set {\alpha=1} and chose {q=2}. First I initialized for all values of {b} with {s^0=1}. This produced the following output (I plotted every fifth iteration):

Well, the iteration always chose the upper branch… In a second experiment I initialized with a smaller value, namely with {s^0=0.1} for all {b}. This gave:

That’s interesting: I ended at the upper branch for all values below the point where the lower branch (the one with the local maximum) crosses the initialization line. This seems to be true in general. Starting with {s^0=0.05} gave
Well, probably this is not too interesting: Starting “below the local maximum” will bring you to the local minimum which is lower and vice versa. Indeed Lai and Wang proved in their Theorem 2.5 that for a specific setting ({A} of completely full rank, sparsity high enough) there is an {\alpha} small enough such that the method will pick the global minimizer. But wait—they do not say anything about initialization… What happens if we initialize with zero? I have to check…

By the way: A similar experiment as in this example with different values of {q\geq 1} showed the same behavior (getting the right branch if the initialization is ok). However: smaller {q} gave much faster convergence. But remember: For {q=1} (experimentally the fastest) each iteration is an {\ell^1} penalized problem while for {q=2} one has to solve a linear system. So there seems to be a tradeoff between “small number of iterations in IRLP” and “complexity of the subproblems”.

2. Iterative thresholding

Together with Kristian Bredies I developed another approach to these nasty non-convex minimization problems with {\ell^p}-quasi-norms. We wrote a preprint back in 2009 which is currently under revision. Moreover, we always worked in a Hilbert space setting that is {A} maps the sequence space {\ell^2} into a separable Hilbert space.

Remark 1 When showing results for problems in separable Hilbert space I sometimes get the impression that others think this is somehow pointless since in the end one always works with {{\mathbb R}^N} in practice. However, I think that working directly in a separable Hilbert space is preferable since then one can be sure that the results will not depend on the dimension {N} in any nasty way.

Basically our approach was, to use one of the most popular approaches to the {\ell^1}-penalized problem: Iterative thresholding aka forward-backward splitting aka generalized gradient projection. I prefer the last motivation: Consider the minimization of a smooth function {F} over a convex set {C}

\displaystyle \min_{x\in C} F(x)

by the projected gradient method. That is: do a gradient step and use the projection {P_C} to project back onto {C}:

\displaystyle x^{n+1} = P_C(x^n - s_n \nabla F(x^n)).

Now note that the projection is characterized by

\displaystyle P_C(x) = \textup{arg}\min_{y\in C}\frac{1}{2}\|y-x\|^2.

Now we replace the “convex constraint” {C} by a penalty function {\alpha R}, i.e. we want to solve

\displaystyle \min_x F(x) + \alpha R(x).

Then, we just replace the minimization problem for the projection with

\displaystyle P_s(x) = \textup{arg}\min_{y}\frac{1}{2}\|y-x\|^2 + s\alpha R(y)

and iterate

\displaystyle x^{n+1} = P_{s_n}(x^n - s_n \nabla F (x^n)).

The crucial thing is, that one needs global minimizers to obtain {P_s}. However, for these {\ell^p} penalties with {0<p<1} these are available as we have seem in Example~1. Hence, the algorithm can be applied in the case

\displaystyle F(x) = \tfrac{1}{2}\|Ax-y\|^2,\qquad R(x) = \sum_i |x_i|^p.

One easily proves that one gets descent of the objective functional:

Lemma 1 Let {F} be weakly lower semicontinuous and differentiable with Lipschitz continuous gradient {\nabla F} with Lipschitz constant {L} and let {R} be weakly lower semicontinuous and coercive. Furthermore let {P_s(x)} denote any solution of

\displaystyle \min_y \tfrac{1}{2}\|y-x\|^2 + s\alpha R(y).

Then for {y = P_s(x - s\nabla F(x))} it holds that

\displaystyle F(y) + \alpha R(y) \leq F(x) + \alpha R(x) - \tfrac{1}{2}\big(\tfrac{1}{s} - L\big)\|y-x\|^2.

Proof: Start with the minimizing property

\displaystyle \tfrac{1}{2}\|y - (x- s\nabla F(x))\|^2 + s\alpha R(y) \leq \tfrac{1}{2}\|s\nabla F(x)\|^2 + s\alpha R(x).

and conclude (by rearranging, expanding the norm-square, canceling terms and adding {F(y) - F(x)} to both sides) that

\displaystyle (F+\alpha R)(y) - (F+\alpha R)(x) \leq F(y) - F(x) - \langle \nabla F(x),y-x\rangle - \tfrac{1}{2s}\|y-x\|^2.

Finally, use Lipschitz-continuity of {\nabla F} to conclude

\displaystyle F(y) - F(x) - \langle \nabla F(x),y-x\rangle \leq \tfrac{L}{2}\|x-y\|^2.

\Box

This gives descent of the functional value as long as {0< s < 1/L}. Now starts the hard part of the investigation: Under what circumstances do we get convergence and what are possible limits?

To make a long story short: For {\ell^p}-penalties (and also other non-convex penalties which leave the origin with infinite slope) and fixed step-size {s_n=s} one gets that every subsequence of the iterates has a strong accumulation point which is a fixed point of the iteration for that particular {s} as long as {0< s< 1/L}. Well that’s good, but here’s the bad news: As long as {s<1/L} we do not obtain the global minimizer. That’s for sure: Consider {F(x) = \tfrac{1}{2}\|x-b\|^2} and any {0<s<1}

However, with considerably more effort one can show the following: For the iteration {x^{n+1} = P_{s_n}(x^n - s_n \nabla F(x))} with {s_n = (L + 1/n)^{-1}\rightarrow 1/L} (and another technical condition on the Lipschitz constant of {\nabla F}) the iterates have a strong accumulation point which is a solution {x = P_{1/L}(x - 1/L\,\nabla F(x)} and hence, satisfies necessary conditions for a global minimizer.

That’s not too bad yet. Currently Kristian and I, together with Stefan Reiterer, work to show that the whole sequence of iterates converges. Funny enough: This seems to be true for {F(x) = \tfrac{1}{2}\|Ax-b\|^2} and {R(x) = \sum_i |x_i|^p} with rational {p} in {]0,1[}… Basically, Stefan was able to show this with the help of Gröbner bases and this has been my first contact with this nice theory. We hope to finalize our revision soon.