Regularization


If you are working on optimization with partial differential equations as constraints, you may be interested in the website

“OPTPDE – A Collection of Problems in PDE-Constrained Optimization”, http://www.optpde.net.

If you have developed an algorithm which can handle a certain class of optimization problems you need to do evaluations and tests on how well the method performs. To do so, you need well constructed test problems. This could be either problems where the optimal solution is known analytically our problems where the solution is known with a rigorous error bound obtained with a bullet-proof solver. Both things are not always easy to obtain and OPTPDE shall serve as a resource for such problems. It has been designed by Roland Herzog, Arnd Rösch, Stefan Ulbrich and Winnifried Wollner.

The generation of test instance for optimization problems seems quite important to me and indeed, several things can go wrong if this is not done right. Frequently, one sees tests for optimization routines on problems where the optimal solution is not known. Since there are usually different ways to express optimality conditions it is not always clear how to check for optimality; even more so, if you only check for “approximate optimality”, e.g. up to machine precision. A frequently observed effect is a kind of “trusted method bias”. By this I mean that an optimal solution is calculated by some trusted method and comparing the outcome of the tested routine with this solution. However, the trusted method uses some stopping criterion usually based on some specific set of formulations of optimality conditions and these can be different from what the new method has been tuned to. And most often, the stopping criteria do not give a rigorous error bound for the solution or the optimal objective value.

For sparse reconstruction problems, I dealt with this issue in “Constructing test instance for Basis Pursuit Denoising” (preprint available here) but I think this methodology could be used for other settings as well.

Today there are several things I could blog on. The first is the planary by Rich Baraniuk on Compressed Sensing. However, I don’t think that I could reflect the content in a way which would be helpful for a potential reader. Just for the record: If you have the chance to visit one of Rich’s talk: Do it!

The second thing is the talk by Bernd Hofmann on source conditions, smoothness and variational inequalities and their use in regularization of inverse problems. However, this would be too technical for now and I just did not take enough notes to write a meaningful post.

As a third thing I have the talk by Christian Clason on inverse problems with uniformly distributed noise. He argued that for uniform noise it is much better to use an {L^\infty} discrepancy term instead of the usual {L^2}-one. He presented a path-following semismooth Newton method to solve the problem

\displaystyle \min_x \frac{1}{p}\|Kx-y^\delta\|_\infty^p + \frac{\alpha}{2}\|x\|_2^2

and showed examples with different kinds of noise. Indeed the examples showed that {L^\infty} works much better than {L^2} here. But in fact it works even better, if the noise is not uniformly distributed but “impulsive” i.e. it attains bounds {\pm\delta} almost everywhere. It seems to me that uniform noise would need a slightly different penalty but I don’t know which one – probably you do? Moreover, Christian presented the balancing principle to choose the regularization parameter (without knowledge about the noise level) and this was the first time I really got what it’s about. What one does here is, to choose {\alpha} such that (for some {\sigma>0} which only depends on {K}, but not on the noise)

\displaystyle \sigma\|Kx_\alpha^\delta-y^\delta\|_\infty = \frac{\alpha}{2}\|x_\alpha^\delta\|_2^2.

The rational behind this is, that the left hand side is monotonically non-decreasing in {\alpha}, while the right hand side is monotonically non-increasing. Hence, there should be some {\alpha} “in the middle” which make both somewhat equally large. Of course, we do neither want to “over-regularize” (which would usually “smooth too much”) nor to “under-regularize” (which would not eliminate noise). Hence, balancing seems to be a valid choice. From a practical point of view the balancing is also nice because one can use the fixed-point iteration

\displaystyle \alpha^{n+1} = 2\sigma\frac{\|Kx_{\alpha^n}^\delta - y^\delta\|_\infty}{\|x_{\alpha_n}^\delta\|_2^2}

which converges in a few number of iterations.

Then there was the talk by Esther Klann, but unfortunately, I was late so only heard the last half…

Last but not least we have the talk by Christiane Pöschl. If you are interested in Total-Variation-Denoising (TV denoising), then you probably have heard many times that “TV denoising preserves edges” (have a look at the Wikipedia page – it claims this twice). What Christiane showed (in a work with Vicent Caselles and M. Novaga) that this claim is not true in general but only for very special cases. In case of characteristic functions, the only functions for which the TV minimizer has sharp edges are these so-called calibrated sets, introduced by Caselles et el. Building on earlier works by Caselles and co-workers she calculated exact minimizers for TV denoising in the case that the image consists of characteristic functions of two convex sets or of a single star shaped domain, that is, for a given set B she calculated the solution of

\displaystyle \min_u\int (u - \chi_B)^2dx + \lambda \int|Du|.

This is not is as easy as it may sound. Even for the minimizer for a single convex set one has to make some effort. She presented a nice connection of the shape of the obtained level-sets with the morphological operators of closing and opening. With the help of this link she derived a methodology to obtain the exact TV denoising minimizer for all parameters. I do not have the images right now but be assured that most of the time, the minimizers do not have sharp edges all over the place. Even for simple geometries (like two rectangles touching in a corner) strange things happen and only very few sharp edges appear. I’ll keep you posted in case the paper comes out (or appears as a preprint).

Christiane has some nice images which make this much more clear:

For two circles edges are preserved if they are far enough away from each other. If they are close, the area “in between” them is filled and, moreover, obey this fuzzy boundary. I remember myself seeing effects like this in the output of TV-solvers and thinking “well, it seems that the algorithm is either not good or not converged yet – TV should output sharp edges!”.

 

For a star-shaped shape (well, actually a star) the output looks like this. The corners are not only rounded but also blurred and this is true both for the “outer” corners and the “inner” corners.

 

So, if you have any TV-minimizing code, go ahead and check if your code actually does the right things on images like this!
Moreover, I would love to see similar results for more complicated extensions of TV like Total Generalized Variation, I treated here.

 

 

 

 

In this post I gladly announce that three problems that bothered me have been solved: The computational complexity of certifying RIP and NSP and the number of steps the homotopy method needs to obtain a solution of the Basis Pursuit problem.

1. Complexity of RIP and NSP

On this issue we have two papers:

The first paper has the more general results and hence, we start with the second one: The main result of the second paper is this:

Theorem 1 Let a matrix {A}, a positive integer {K} and some {0<\delta<1} be given. It is hard for NP under randomized polynomial-time reductions to check if {A} satisfies the {(K,\delta)} restricted isometry property.

That does not yet say that it’s NP-hard to check if {\delta} is an RIP constant for {K}-sparse vectors but it’s close. I think that Dustin Mixon has explained this issue better on his blog than I could do here.

In the first paper (which is, by the way, on outcome of the SPEAR-project in which I am involved…) the main result is indeed the conjectured NP-hardness of calculating RIP constants:

Theorem 2 For a given matrix {A} and a positive integer {K}, it is NP-hard to compute the restricted isometry constant.

Moreover, this is just a corollary to the main theorem of that paper which reads as

Theorem 3 For a given matrix {A} and a positive integer {K}, the problem to decide whether {A} satisfies the restricted isometry property of order {K} for some constant {\delta<1} is coNP-complete.

They also provide a slightly strengthened version of Theorem~1:

Theorem 4 Let a matrix {A}, a positive integer {K} and some {0<\delta<1} be given. It is coNP-complete to check if {A} satisfies the {(K,\delta)} restricted isometry property.

Moreover, the paper by Pfetsch and Tillmann also proves something about the null space property (NSP):

Definition 5 A matrix {A} satisfies the null space property of order {K} if there is a constant {\alpha>0} such that for all elements {x} in the null space of {A} it holds that the sum of the {K} largest absolute values of {x} is smaller that {\alpha} times the 1-norm of {x}. The smallest such constant {\alpha} is called the null space constant of order {K}.

Their main result is as follows:

Theorem 6 F or a given matrix {A} and a positive integer {K}, the problem to decide whether {A} satisfies the null space property order {K} for some constant {\alpha<1} is coNP-complete. Consequently, it is NP-hard to compute the null space constant of {A}.

2. Complexity of the homotopy method for Basis Pursuit

The second issue is about the basis pursuit problem

\displaystyle  \min_x \|x\|_1\quad\text{s.t.}\ Ax=b.

which can be approximated by the “denoising variant”

\displaystyle  \min_x \lambda\|x\|_1 + \tfrac12\|Ax-b\|_2^2.

What is pretty interesting about the denoising variant is, that the solution {x(\lambda)} (if it is unique throughout) depends on {\lambda} in a piecewise linear way and converges to the solution of basis pursuit for {\lambda\rightarrow 0}. This leads to an algorithm for the solution of basis pursuit: Start with {\lambda=\|A^Tb\|_\infty} (for which the unique solution is {x(\lambda)=0}), calculate the direction of the “solution path”, follow it until you reach a “break point”, calculate the next direction and so on until {\lambda} hits zero. This is for example implemented for MATLAB in L1Homotopy (the SPAMS package also seems to have this implemented, however, I haven’t used it yet). In practice, this approach (usually called homotopy method) is pretty fast and moreover, only detects a few break points. However, an obvious upper bound on the number of break point is exponential in the number of entries in {x}. Hence, it seemed that one was faced with a situation similar to the simplex method for linear programming: The algorithms performs great an average but the worst case complexity is bad. That this is really true for linear programming is known since some time by the Klee-Minty example, an example for which the simplex method takes an exponential number of steps. What I asked myself for some time: Is there a Klee-Minty example for the homotopy method?

Now the answer is there: Yes, there is!

The denoising variant of basis pursuit is also known as LASSO regularization in the statistics literature and this explains the title of the paper which comes up with the example:

Julien and Bin investigate the number of linear segments in the regularization path and first observe that this is upper bounded by {(3^p+1)/2} is {p} is the number of entries in {x} (i.e. the number of variables of the problem). Then they try to construct an instance that matches this upper bound. They succeed in a clever way: For a given instance {(A,b)} with a path with {k} linear segments they try to construct an instance which has one more variable such that the number of linear segments in increased by a factor. Their result goes like this:

Theorem 7 Let {A\in{\mathbb R}^{n\times p}} have full rank and let {b\in{\mathbb R}^n} be in the range of {A}. Assume that the homotopy path has {k} linear segments and denote by {\lambda_1} the regularization parameter which corresponds to the smallest kink in the path. Now choose {b_{n+1}\neq 0} and {\alpha} such that

\displaystyle   0<\alpha < \frac{\lambda_1}{2\|b\|_2^2 + b_{n+1}^2} \ \ \ \ \ (1)

and define {\tilde b\in{\mathbb R}^{n+1}} and {\tilde A\in{\mathbb R}{(n+1)\times (p+1)}} by

\displaystyle  \tilde b = \begin{bmatrix} b\\ b_{n+1} \end{bmatrix}, \quad \tilde A = \begin{bmatrix} A & 2\alpha b\\ 0 & \alpha b_{n+1} \end{bmatrix}.

Then the homotopy path for the basis pursuit problem with matrix {\tilde A} and right hand side {\tilde b} has {3k-1} linear segments.

With this theorem at hand, it is straightforward to recursively build a “Mairal-Yu”-example which matches the upper bound for the number of linear segments. The idea is to start with a {1\times 1} example and let it grow by one row and one column according to Theorem~7. We start with the simplest {1\times 1} example, namely {A = [1]} and {b=[1]}. To move to the next bigger example you can choose the next entry {b_{n+1}} and we always choose {1} for convenience. Moreover, you need the next {\alpha} and you need to know the smallest kink in the path. I calculated the paths and kinks with L1Packv2 by Ignace Loris because it is written in Mathematica and can use exact arithmetics with rational numbers (and you will see, that accuracy will be an issue even for small instances) and seemed bullet proof for me. Let’s see where this idea brings us:

Example 1 (Mairal-Yu example)

  • Stage 1: We start with {n=p=1}, {b=[1]} and {A=[1]}. The homotopy path has one kink at {\lambda_1=1} (with corresponding solution {[0]}) and hence, two linear segments. Now let’s go to the next larger instance:
  • Stage 2: We can choose the entry {b_2} as we like and choose it equals to 1, i.e. our new {b} is

    \displaystyle  b = \begin{bmatrix} 1\\1 \end{bmatrix}.

    Now we have to choose {\alpha} according to (1), i.e

    \displaystyle  0 < \alpha < \frac{\lambda_1}{2\|b\|_2^2 + b_{n+1}^2} = \frac{1}{2+1} = \frac{1}{3}

    and we can choose, e.g., {\alpha = 1/4} which gives our new matrix

    \displaystyle  A = \begin{bmatrix} 1 & \frac12\\ 0 & \frac14 \end{bmatrix}.

    The calculation of the new regularization path shows that it has exactly the announced number of 5 segments and the parameter of the smallest kink is {\lambda_1 = \frac{1}{13}}.

  • Stage 2: Again we choose {b_{n+1} = 1} giving

    \displaystyle  b = \begin{bmatrix} 1\\1\\1 \end{bmatrix}

    For the choice of {\alpha} we need that

    \displaystyle  0<\alpha < \frac{1}{13(4+1)} = \frac{1}{75}

    and we may choose

    \displaystyle  \alpha = \frac1{80}.

    which gives the next matrix

    \displaystyle  A = \begin{bmatrix} 1 & \frac12 & \tfrac{1}{40}\\ 0 & \frac14 & \tfrac{1}{40}\\ 0 & 0 & \tfrac{1}{80} \end{bmatrix}.

    We calculate the regularization path, observe that it has the predicted 14 segments and that the parameter of the smallest kink is {\lambda_1 = \frac{1}{193}}.

  • Stage 3: Again we choose {b_{n+1} = 1} giving

    \displaystyle  b = \begin{bmatrix} 1\\1\\1\\1 \end{bmatrix}

    For the choice of {\alpha} we need that

    \displaystyle  0<\alpha < \frac{1}{193(6+1)} = \frac{1}{1351}

    and we see that things are getting awkward here…

Proceeding in this way we always increase the number of linear segments {k_n} for the {n\times n}-case from {k_n} to {k_{n+1} = 3k_n-1} in each step and one checks easily that this leads to {k_n = (3^n+1)/2} which is the worst case! If you are interested in the regularization path: I produced picture for the first three dimensions (well, I could not draw a 4d {\ell^1}-ball) and here they are:

1d Mairal-Yu example


2d Mairal-Yu example


3d Mairal-Yu example

It is not really easy to perceive the whole paths from the pictures because the magnitude of the entries vary strongly. I’ve drawn the path in red, each kink marked with a small circle. Moreover, I have drawn the according {\ell^1}-balls of the respective radii to provide more geometric information.

The paper by Mairal and Yu has more results of the paths if one looks for approximate solutions of the linear system but I will not go into detail about them here.

At least two questions come to mind:

  • The Mairal-Yu example is {n\times n}. What is the worst case complexity for the true rectangular case? In other words: What is the complexity for {p\times n} in terms of {p} and {n}?
  • The example and the construction leads to matrices that does not have normed columns and moreover, they are far from being equal in norm. But matrices with normed columns seem to be more “well behaved”. Does the worst case complexity lowers if the consider matrices with unit-norm columns? Probably one can construct a unit-norm example by proper choice of {b}

Today I’d like to blog about two papers which appeared on the arxiv.

1. Regularization with the Augmented Lagrangian Method – Convergence Rates from Variational Inequalities

The first one is the paper “Regularization of Linear Ill-posed Problems by the Augmented Lagrangian Method and Variational Inequalities” by Klaus Frick and Markus Grasmair.

Well, the title basically describes the content quite accurate. However, recall that the Augmented Lagrangian Method (ALM) is a method to calculate solutions to certain convex optimization problems. For a convex, proper and lower-semicontinuous function {J} on a Banach space {X}, a linear and bounded operator {K:X\rightarrow H} from {X} into a Hilbert space {H} and an element {g\in H} consider the problem

\displaystyle   \inf_{u} J(u)\quad\text{s.t.}\quad Ku=g. \ \ \ \ \ (1)

The ALM goes as follows: Start with an initial dual variable {p_0}, choose step-sizes {\tau_k>0} and iterate

\displaystyle  u_k \in \text{argmin}\Big(\frac{\tau_k}{2}\|Ku-g\|^2 + J(u) + \langle p_{k-1},Ku-g\rangle\Big)

\displaystyle  p_k = p_{k-1}+\tau_k(g-Ku_k).

(These days one should note that this iteration is also known under the name Bregman iteration…). Indeed, it is known that the ALM converges to a solution of (1) if there exists one. Klaus and Markus consider the ill-posed case, i.e. the range of {K} is not closed and {g} is replaced by some {g^\delta} which fulfills {\|g-g^\delta\|\leq\delta} (and hence, {g^\delta} is generally not in the range of {K}). Then, the ALM does not converge but diverges. However, one observes “semi-convergence” in practice, i.e. the iterates approach an approximate “solution to {Ku=g^\delta}” (or even a true solution to {Ku=g}) first but then start to diverge from some point on. Then it is natural to ask, if the ALM with {g} replaced by {g^\delta} can be used for regularization, i.e. can one choose a stopping index {k^*} (depending on {\delta} and {g^\delta}) such that the iterates {u_{k^*}^\delta} approach the solution of (1) if {\delta} vanishes? The question has been answered in the affirmative in previous work by Klaus (here and here) and also estimates on the error and convergence rates have been derived under an additional assumption on the solution of (1). This assumption used to be what is called “source condition” and says that there should exist some {p^\dag\in H} such that for a solution {u^\dagger} of (1) it holds that

\displaystyle  K^* p^\dagger \in\partial J(u^\dagger).

Under this assumption it has been shown that the Bregman distance {D(u_{k^*}^\delta,u^\dag)} goes to zero linearly in {\delta} under appropriate stopping rules. What Klaus and Markus investigate in this paper are different conditions which ensure slower convergence rates than linear. These conditions come in the form of “variational inequalities” which gained some popularity lately. As usual, these variational inequalities look some kind of messy at first sight. Klaus and Markus use

\displaystyle  D(u,u^\dag)\leq J(u) - J(u^\dag) + \Phi(\|Ku-g\|^2)

for some positive functional {D} with {D(u,u)=0} and some non-negative, strictly increasing and concave function {\Phi}. Under this assumption (and special {D}) they derive convergence rates which again look quite complicated but can be reduced to simpler and more transparent cases which resemble the situation one knows for other regularization methods (like ordinary Tikhonov regularization).

In the last section Klaus and Markus also treat sparse regularization (i.e. with {J(u) = \|u\|_1}) and derive that a weak condition (like {(K^*K)^\nu p^\dag\in\partial J(u^\dag)} for some {0<\nu<1/2} already imply the stronger one (1) (with a different {p^\dag}). Hence, interestingly, it seems that for sparse regularization one either gets a linear rate or nothing (in this framework).

2. On necessary conditions for variational regularization

The second paper is “Necessary conditions for variational regularization schemes” by Nadja Worliczek and myself. I have discussed some parts of this paper alread on this blog here and here. In this paper we tried to formalize the notion of “a variational method” for regularization with the goal to obtain necessary conditions for a variational scheme to be regularizing. As expected, this goal is quite ambitions and we can not claim that we came up with ultimate necessary condition which describe what kind of variational methods are not possible. However, we could first relate the three kinds of variational methods (which I called Tikhonov, Morozov and Ivanov regularization here) and moreover investigated the conditions on the data space a little closer. In recent years it turned out that one should not always use a term like {\|Ku-g^\delta\|^2} to measure the noise or to penalize the deviation from {Ku} to {g^\delta}. For several noise models (like Poisson noise or multiplicative noise) other functionals are better suited. However, these functionals raise several issues: They are often not defined on a linear space but on a convex set, sometimes with the nasty property that their interior is empty. They often do not have convenient algebraic properties (e.g. scaling invariance, triangle inequalities or the like). Finally they are not necessarily (lower semi-)continuous with respect to the usual topologies. Hence, we approached the data space from quite abstract way: The data space {(Y,\tau_Y)} is topological space which comes with an additional sequential convergence structure {\mathcal{S}} (see e.g. here) and on (a subset of) which there is a discrepancy functional {\rho:Y\times Y\rightarrow [0,\infty]}. Then we analyzed the interplay of these three things {\tau_Y}, {\mathcal{S}} and {\rho}. If you wonder why we use the additional sequential convergence structure, remember that in the (by now classical) setting for Tikhonov regularization in Banach spaces with a functional like

\displaystyle  \|Ku-g^\delta\|_Y^q + \alpha\|u\|_X^p

with some Banach space norms {\|\cdot\|_Y} and {\|\cdot\|_X} there are also two kinds of convergence on {Y}: The weak convergence (which is replaced by {\tau_Y} in our setting) which is, e.g., used to describe convenient (lower semi-)continuity properties of {K} and the norm {\|\cdot\|_Y} and the norm convergence which is used to describe that {g^\delta\rightarrow g^\dag} for {\delta\rightarrow 0}. And since we do not have a normed space {Y} in our setting and one does not use any topological properties of the norm convergence in all the proofs of regularizing properties, Nadja suggested to use a sequential convergence structure instead.

Today I would like to comment on two arxiv-preprints I stumbled upon:

1. “Augmented L1 and Nuclear-Norm Models with a Globally Linearly Convergent Algorithm” – The Elastic Net rediscovered

The paper “Augmented L1 and Nuclear-Norm Models with a Globally Linearly Convergent Algorithm” by Ming-Jun Lai and Wotao Yin is another contribution to a field which is (or was?) probably the fastest growing field in applied mathematics: Algorithms for convex problems with non-smooth {\ell^1}-like terms. The “mother problem” here is as follows: Consider a matrix {A\in{\mathbb R}^{m\times n}}, {b\in{\mathbb R}^m} try to find a solution of

\displaystyle  \min_{x\in{\mathbb R}^n}\|x\|_1\quad\text{s.t.}\quad Ax=b

or, for {\sigma>0}

\displaystyle  \min_{x\in{\mathbb R}^n}\|x\|_1\quad\text{s.t.}\quad \|Ax-b\|\leq\sigma

which appeared here on this blog previously. Although this is a convex problem and even has a reformulation as linear program, some instances of this problem are notoriously hard to solve and gained a lot of attention (because their applicability in sparse recovery and compressed sensing). Very roughly speaking, a part of its hardness comes from the fact that the problem is neither smooth nor strictly convex.

The contribution of Lai and Yin is that they analyze a slight perturbation of the problem which makes its solution much easier: They add another term in the objective; for {\alpha>0} they consider

\displaystyle  \min_{x\in{\mathbb R}^n}\|x\|_1 + \frac{1}{2\alpha}\|x\|_2^2\quad\text{s.t.}\quad Ax=b

or

\displaystyle  \min_{x\in{\mathbb R}^n}\|x\|_1 + \frac{1}{2\alpha}\|x\|_2^2\quad\text{s.t.}\quad \|Ax-b\|\leq\sigma.

This perturbation does not make the problem smooth but renders it strongly convex (which usually makes the dual more smooth). It turns out that this perturbation makes life with this problem (and related ones) much easier – recovery guarantees still exists and algorithms behave better.

I think it is important to note that the “augmentation” of the {\ell^1} objective with an additional squared {\ell^2}-term goes back to Zou and Hastie from the statistics community. There, the motivation was as follows: They observed that the pure {\ell^1} objective tends to “overpromote” sparsity in the sense that if there are two columns in {A} which are almost equally good in explaining some component of {b} then only one of them is used. The “augmented problem”, however, tends to use both of them. They coined the method as “elastic net” (for reasons which I never really got).

I also worked on elastic-net problems for problems in the form

\displaystyle  \min_x \frac{1}{2}\|Ax-b\|^2 + \alpha\|x\|_1 + \frac{\beta}{2}\|x\|_2^2

in this paper (doi-link). Here it also turns out that the problem gets much easier algorithmically. I found it very convenient to rewrite the elastic-net problem as

\displaystyle  \min_x \frac{1}{2}\|\begin{bmatrix}A\\ \sqrt{\beta} I\end{bmatrix}x-\begin{bmatrix}b\\ 0\end{bmatrix}\|^2 + \alpha\|x\|_1

which turns the elastic-net problem into just another {\ell^1}-penalized problem with a special matrix and right hand side. Quite convenient for analysis and also somehow algorithmically.

2. Towards a Mathematical Theory of Super-Resolution

The second preprint is “Towards a Mathematical Theory of Super-Resolution” by Emmanuel Candes and Carlos Fernandez-Granda.

The idea of super-resolution seems to pretty old and, very roughly speaking, is to extract a higher resolution of a measured quantity (e.g. an image) than the measured data allows. Of course, in this formulation this is impossible. But often one can gain something by additional knowledge of the image. Basically, this also is the idea behind compressed sensing and hence, it does not come as a surprise that the results in compressed sensing are used to try to explain when super-resolution is possible.

The paper by Candes and Fernandez-Granada seems to be pretty close in spirit to Exact Reconstruction using Support Pursuit on which I blogged earlier. They model the sparse signal as a Radon measure, especially as a sum of Diracs. However, different from the support-pursuit-paper they use complex exponentials (in contrast to real polynomials). Their reconstruction method is basically the same as support pursuit: The try to solve

\displaystyle   \min_{x\in\mathcal{M}} \|x\|\quad\text{s.t.}\quad Fx=y, \ \ \ \ \ (1)

i.e. they minimize over the set of Radon measures {\mathcal{M}} under the constraint that certain measurements {Fx\in{\mathbb R}^n} result in certain given values {y}. Moreover, they make a thorough analysis of what is “reconstructable” by their ansatz and obtain a lower bound on the distance of two Diracs (in other words, a lower bound in the Prokhorov distance). I have to admit that I do not share one of their claims from the abstract: “We show that one can super-resolve these point sources with infinite precision—i.e. recover the exact locations and amplitudes—by solving a simple convex program.” My point is that I can not see to what extend the problem (1) is a simple one. Well, it is convex, but it does not seem to be simple.

I want to add that the idea of “continuous sparse modelling” in the space of signed measures is very appealing to me and appeared first in Inverse problems in spaces of measures by Kristian Bredies and Hanna Pikkarainen.

How many samples are needed to reconstruct a sparse signal?

Well, there are many, many results around some of which you probably know (at least if you are following this blog or this one). Today I write about a neat result which I found quite some time ago on reconstruction of nonnegative sparse signals from a semi-continuous perspective.

1. From discrete sparse reconstruction/compressed sensing to semi-continuous

The basic sparse reconstruction problem asks the following: Say we have a vector {x\in{\mathbb R}^m} which only has {s<m} non-zero entries and a fat matrix {A\in{\mathbb R}^{n\times m}} (i.e. {n>m}) and consider that we are given measurements {b=Ax}. Of course, the system {Ax=b} is underdetermined. However, we may add a little more prior knowledge on the solution and ask: Is is possible to reconstruct {x} from {b} if we know that the vector {x} is sparse? If yes: How? Under what conditions on {m}, {s}, {n} and {A}? This question created the expanding universe of compressed sensing recently (and this universe is expanding so fast that for sure there has to be some dark energy in it). As a matter of fact, a powerful method to obtain sparse solutions to underdetermined systems is {\ell^1}-minimization a.k.a. Basis Pursuit on which I blogged recently: Solve

\displaystyle  \min_x \|x\|_1\ \text{s.t.}\ Ax=b

and the important ingredient here is the {\ell^1}-norm of the vector in the objective function.

In this post I’ll formulate semi-continuous sparse reconstruction. We move from an {m}-vector {x} to a finite signed measure {\mu} on a closed interval (which we assume to be {I=[-1,1]} for simplicty). We may embed the {m}-vectors into the space of finite signed measures by choosing {m} points {t_i}, {i=1,\dots, m} from the interval {I} and build {\mu = \sum_{i=1}^m x_i \delta_{t_i}} with the point-masses (or Dirac measures) {\delta_{t_i}}. To a be a bit more precise, we speak about the space {\mathfrak{M}} of Radon measures on {I}, which are defined on the Borel {\sigma}-algebra of {I} and are finite. Radon measures are not very scary objects and an intuitive way to think of them is to use Riesz representation: Every Radon measure arises as a continuous linear functional on a space of continuous functions, namely the space {C_0(I)} which is the closure of the continuous functions with compact support in {{]{-1,1}[}} with respect to the supremum norm. Hence, Radon measures work on these functions as {\int_I fd\mu}. It is also natural to speak of the support {\text{supp}(\mu)} of a Radon measure {\mu} and it holds for any continuous function {f} that

\displaystyle  \int_I f d\mu = \int_{\text{supp}(\mu)}f d\mu.

An important tool for Radon measures is the Hahn-Jordan decomposition which decomposes {\mu} into a positive part {\mu^+} and a negative part {\mu^-}, i.e. {\mu^+} and {\mu^-} are non-negative and {\mu = \mu^+-\mu^-}. Finally the variation of a measure, which is

\displaystyle  \|\mu\| = \mu^+(I) + \mu^-(I)

provides a norm on the space of Radon measures.

Example 1 For the measure {\mu = \sum_{i=1}^m x_i \delta_{t_i}} one readily calculates that

\displaystyle  \mu^+ = \sum_i \max(0,x_i)\delta_{t_i},\quad \mu^- = \sum_i \max(0,-x_i)\delta_{t_i}

and hence

\displaystyle  \|\mu\| = \sum_i |x_i| = \|x\|_1.

In this sense, the space of Radon measures provides a generalization of {\ell^1}.

We may sample a Radon measure {\mu} with {n+1} linear functionals and these can be encoded by {n+1} continuous functions {u_0,\dots,u_n} as

\displaystyle  b_k = \int_I u_k d\mu.

This sampling gives a bounded linear operator {K:\mathfrak{M}\rightarrow {\mathbb R}^{n+1}}. The generalization of Basis Pursuit is then given by

\displaystyle  \min_{\mu\in\mathfrak{M}} \|\mu\|\ \text{s.t.}\ K\mu = b.

This was introduced and called “Support Pursuit” in the preprint Exact Reconstruction using Support Pursuit by Yohann de Castro and Frabrice Gamboa.

More on the motivation and the use of Radon measures for sparsity can be found in Inverse problems in spaces of measures by Kristian Bredies and Hanna Pikkarainen.

2. Exact reconstruction of sparse nonnegative Radon measures

Before I talk about the results we may count the degrees of freedom a sparse Radon measure has: If {\mu = \sum_{i=1}^s x_i \delta_{t_i}} with some {s} than {\mu} is defined by the {s} weights {x_i} and the {s} positions {t_i}. Hence, we expect that at least {2s} linear measurements should be necessary to reconstruct {\mu}. Surprisingly, this is almost enough if we know that the measure is nonnegative! We only need one more measurement, that is {2s+1} and moreover, we can take fairly simple measurements, namely the monomials: {u_i(t) = t^i} {i=0,\dots, n} (with the convention that {u_0(t)\equiv 1}). This is shown in the following theorem by de Castro and Gamboa.

Theorem 1 Let {\mu = \sum_{i=1}^s x_i\delta_{t_i}} with {x_i\geq 0}, {n=2s} and let {u_i}, {i=0,\dots n} be the monomials as above. Define {b_i = \int_I u_i(t)d\mu}. Then {\mu} is the unique solution of the support pursuit problem, that is of

\displaystyle  \min \|\nu\|\ \text{s.t.}\ K\nu = b.\qquad \textup{(SP)}

Proof: The following polynomial will be of importance: For a constant {c>0} define

\displaystyle  P(t) = 1 - c \prod_{i=1}^s (t-t_i)^2.

The following properties of {P} will be used:

  1. {P(t_i) = 1} for {i=1,\dots,s}
  2. {P} has degree {n=2s} and hence, is a linear combination of the {u_i}, {i=0,\dots,n}, i.e. {P = \sum_{k=0}^n a_k u_k}.
  3. For {c} small enough it holds for {t\neq t_i} that {|P(t)|<1}.

Now let {\sigma} be a solution of (SP). We have to show that {\|\mu\|\leq \|\sigma\|}. Due to property 2 we know that

\displaystyle  \int_I u_k d\sigma = (K\sigma)k = b_k = \int_I u_k d\mu.

Due to property 1 and non-negativity of {\mu} we conclude that

\displaystyle  \begin{array}{rcl}  \|\mu\| & = & \sum_{i=1}^s x_i = \int_I P d\mu\\ & = & \int_I \sum_{k=0}^n a_k u_k d\mu\\ & = & \sum_{k=0}^n a_k \int_I u_k d\mu\\ & = & \sum_{k=0}^n a_k \int_I u_k d\sigma\\ & = & \int_I P d\sigma. \end{array}

Moreover, by Lebesgue’s decomposition we can decompose {\sigma} with respect to {\mu} such that

\displaystyle  \sigma = \underbrace{\sum_{i=1}^s y_i\delta_{t_i}}_{=\sigma_1} + \sigma_2

and {\sigma_2} is singular with respect to {\mu}. We get

\displaystyle  \begin{array}{rcl}  \int_I P d\sigma = \sum_{i=1}^s y_i + \int P d\sigma_2 \leq \|\sigma_1\| + \|\sigma_2\|=\|\sigma\| \end{array}

and we conclude that {\|\sigma\| = \|\mu\|} and especially {\int_I P d\sigma_2 = \|\sigma_2\|}. This shows that {\mu} is a solution to {(SP)}. It remains to show uniqueness. We show the following: If there is a {\nu\in\mathfrak{M}} with support in {I\setminus\{x_1,\dots,x_s\}} such that {\int_I Pd\nu = \|\nu\|}, then {\nu=0}. To see this, we build, for any {r>0}, the sets

\displaystyle  \Omega_r = [-1,1]\setminus \bigcup_{i=1}^s ]x_i-r,x_i+r[.

and assume that there exists {r>0} such that {\|\nu|_{\Omega_r}\|\neq 0} ({\nu|_{\Omega_r}} denoting the restriction of {\nu} to {\Omega_r}). However, it holds by property 3 of {P} that

\displaystyle  \int_{\Omega_r} P d\nu < \|\nu|_{\Omega_r}\|

and consequently

\displaystyle  \begin{array}{rcl}  \|\nu\| &=& \int Pd\nu = \int_{\Omega_r} Pd\nu + \int_{\Omega_r^C} P d\nu\\ &<& \|\nu|_{\Omega_r}\| + \|\nu|_{\Omega_r^C}\| = \|\nu\| \end{array}

which is a contradiction. Hence, {\nu|_{\Omega_r}=0} for all {r} and this implies {\nu=0}. Since {\sigma_2} has its support in {I\setminus\{x_1,\dots,x_s\}} we conclude that {\sigma_2=0}. Hence the support of {\sigma} is exactly {\{x_1,\dots,x_s\}}. and since {K\sigma = b = K\mu} and hence {K(\sigma-\mu) = 0}. This can be written as a Vandermonde system

\displaystyle  \begin{pmatrix} u_0(t_1)& \dots &u_0(t_s)\\ \vdots & & \vdots\\ u_n(t_1)& \dots & u(t_s) \end{pmatrix} \begin{pmatrix} y_1 - x_1\\ \vdots\\ y_s - x_s \end{pmatrix} = 0

which only has the zero solution, giving {y_i=x_i}. \Box

3. Generalization to other measurements

The measurement by monomials may sound a bit unusual. However, de Castro and Gamboa show more. What really matters here is that the monomials for a so-called Chebyshev-System (or Tchebyscheff-system or T-system – by the way, have you ever tried to google for a T-system?). This is explained, for example in the book “Tchebycheff Systems: With Applications in Analysis and Statistics” by Karlin and Studden. A T-system on {I} is simply a set of {n+1} functions {\{u_0,\dots, u_n\}} such that any linear combination of these functions has at most {n} zeros. These systems are called after Tchebyscheff since they obey many of the helpful properties of the Tchebyscheff-polynomials.

What is helpful in our context is the following theorem of Krein:

Theorem 2 (Krein) If {\{u_0,\dots,u_n\}} is a T-system for {I}, {k\leq n/2} and {t_1,\dots,t_k} are in the interior of {I}, then there exists a linear combination {\sum_{k=0}^n a_k u_k} which is non-negative and vanishes exactly the the point {t_i}.

Now consider that we replace the monomials in Theorem~1 by a T-system. You recognize that Krein’s Theorem allows to construct a “generalized polynomial” which fulfills the same requirements than the polynomial {P} is the proof of Theorem~1 as soon as the constant function 1 lies in the span of the T-system and indeed the result of Theorem~1 is also valid in that case.

4. Exact reconstruction of {s}-sparse nonnegative vectors from {2s+1} measurements

From the above one can deduce a reconstruction result for {s}-sparse vectors and I quote Theorem 2.4 from Exact Reconstruction using Support Pursuit:

Theorem 3 Let {n}, {m}, {s} be integers such that {s\leq \min(n/2,m)} and let {\{1,u_1,\dots,u_n\}} be a complete T-system on {I} (that is, {\{1,u_1,\dots,u_r\}} is a T-system on {I} for all {r<n}). Then it holds: For any distinct reals {t_1,\dots,t_m} and {A} defined as

\displaystyle  A=\begin{pmatrix} 1 & \dots & 1\\ u_1(t_1)& \dots &u_1(t_m)\\ \vdots & & \vdots\\ u_n(t_1)& \dots & u(t_m) \end{pmatrix}

Basis Pursuit recovers all nonnegative {s}-sparse vectors {x\in{\mathbb R}^m}.

5. Concluding remarks

Note that Theorem~3 gives a deterministic construction of a measurement matrix.

Also note, that nonnegativity is crucial in what we did here. This allowed (in the monomial case) to work with squares and obtain the polynomial {P} in the proof of Theorem~1 (which is also called “dual certificate” in this context). This raises the question how this method can be adapted to all sparse signals. One needs (in the monomial case) a polynomial which is bounded by 1 but matches the signs of the measure on its support. While this can be done (I think) for polynomials it seems difficult to obtain a generalization of Krein’s Theorem to this case…

On my way to ENUMATH 11 in Leicester I stumbled upon the preprint Multi-parameter Tikhonov Regularisation in Topological Spaces by Markus Grasmair. The paper deals with fairly general Tikhonov functionals and its regularizing properties. Markus considers (nonlinear) operators {F:X\rightarrow Y} between two set {X} and {Y} and analyzes minimizers of the functional

\displaystyle T(x) = S(F(x),y) + \sum_k \alpha_k R_k(x).

The functionals {S} and {R_k} play the roles of a similarity measure and regularization terms, respectively. While he also treats the issue of noise in the operator and the multiple regularization terms, I was mostly interested in his approach to the general similarity measure. The category in which he works in that of topological spaces and he writes:

“Because anyway no trace of an original Hilbert space or Banach space structure is left in the formulation of the Tikhonov functional {T} [...], we will completely discard all assumption of a linear structure and instead consider the situation, where both the domain {X} and the co-domain {Y} of the operator {F} are mere topological spaces, with the topology of {Y} defined by the distance measure {S}.”

The last part of the sentence is important since previous papers often worked the other way round: Assume some topology in {Y} and then state conditions on {S}. Nadja Worliczek observed in her talk “Sparse Regularization with Bregman Discrepancy” at GAMM 2011 that it seems more natural to deduce the topology from the similarity measure and Markus took the same approach. While Nadja used the notion of “initial topology” (that is, take the coarsest topology that makes the functionals {y\mapsto S(z,y)} continuous), Markus uses the following family of pseudo-metrics: For {z\in Y} define

\displaystyle d^{(z)}(y,\tilde y) = |S(z,y)-S(z,\tilde y)|.

Unfortunately, the preprint is a little bit too brief for me at this point and I did not totally get what he means with “the topology {\sigma} induced by the uniformity induced by the pseudo-metric”. Also, I am not totally sure if “pseudo-metric” is unambiguous.. However, the topology he has in mind seems to be well suited in the sense that {y^n\rightarrow y} if {S(z,y^n)\rightarrow S(z,y)} for all {z}. Moreover, the condition that {S(z,y)=0} iff {z=y} implies that {\sigma} is Hausdorff. It would be good to have a better understanding on how the properties of the similarity measure are related to the properties of the induced topology. Are there examples in which the induced topology is both different from usual norm and weak topologies and also interesting?

Moreover, I would be interested, in the relations of the two approaches: via “uniformities” and the initial topology…

After a very refreshing and enjoying summer vacation I am back to work and back to blogging. This week there is the ILAS 11 (where ILAS stands for International Linear Algebra Society) here at TU Braunschweig; and since the talks take place in the building just next to where I am, I enjoyed some of them. Two talks have been especially interesting to me.

The first one was Tikhonov Regularization for Large Scale Inverse Problems by Melina Freitag (maybe the first link is not working yet but Melina said that she is going to upload her slides under that address). She talked about the ways the weather forecast is done these days in the UK and in Europe and especially on the process of Data Assimilation where one uses the previous weather forecasts and newly arrived measurements to produce a better estimate of the state. As a matter of fact, two popular methods use in this field (3dVar and 4dVar) are equivalent to classical Tikhonov regularization. In her section on {L^1}-penalties (which I usually call {\ell^1}-penalties…) she actually introduced a kind of discrete {TV}-penalty as a replacement for the usual quadratic ({\ell^2}) penalty. Her motivation was as usual: Tikhonov regularization smoothes too much and weather fronts are not smooth. She did not have results of this kind of {TV} regularization as a replacement in 4dVar with real weather data but with smaller toy examples (with non-linear advection equations) since the resulting optimization problem is LARGE. However, her results look promising. I am not sure if she did, but one was tempted to arrive at the conclusion that “4dVar with {TV} penalty gives a better resolution of weather fronts”. It happened that during her talk there was thunderstorm with heavy rain in front of the windows which has not been predicted by the forecast (according to which, the thunderstorm should happen the next day). Now: Would a {TV} penalty be able to predict this thunderstorm for the right time? I am not sure. While {TV} penalties do enforce edges, the precise place of the edge is still not too sure. My feeling is, that the accuracy of the position is better, the less the curvature of the edge is, but in general this highly depends on the ill-posed problem at hand.

The second talk was Recent Progress in the Solution of Discrete Ill-Posed Problems by Michiel Hochstenbach. The talk was pretty amusing; I especially liked the slogan for discrete ill-posed problems:

How to wisely divide by zero.

Also he introduced the three forms of variational regularization which I usually call Tikhonov, Ivanov and Morozov regularization (on slide 34) and introduced the Pareto front (under the name it usually has in discrete ill-posed problems: L-curve).

Another appealing slogan was:

Discrete ill-posed problems are essentially underdetermined.

(Since we do not want to solve the respective equation exactly, we usually have a lot of approximate solution of which we have to choose the one we like the most. Of course this is the same for continuous ill-posed problems.) As a consequence: One should use as much prior knowledge as possible.

In the rest of his talk he talked about improvements of the classical Tikhonov regularization (which is a linear method) by other linear methods with respect to both reconstruction quality and computational burden. He motivated his SRSVD (subspace restricted SVD) by the observation that the simple truncated SVD is often failing due to the fact that the first singular vectors do not contain useful information of the solution. He proposed to use a selected set of (orthonormal) vectors and use a restricted SVD in the following. However, does this use the knowledge that the solution shall be a linear combination of these selected vectors? And wouldn’t it be beneficial to use a larger set of (non-orthonormal) vectors, put into a (possible overcomplete) dictionary and use a sparse reconstruction method such a {\ell^1} regularization which automatically selects the relevant vectors? He also proposed the so called “linear combination approach” which basically takes several outputs of various methods and search within the linear combinations of this outputs for a better solution. To do so he proposed to use another Ivanov-type regularization (slide 79). I still did not get why he uses the largest available norm as a constraint here… However there should be an answer somewhere is his papers.

Edit: Michiel sent me the following answer:

I used the largest available norm, since the norms of many solution approaches are often smaller than, or approximately equal to the true norm. In the paper Discrete ill-posed least-squares problems with a solution norm constraint we reach the conclusion that
“For the approach based on the solution norm constraint, it seems important that \|x\| not be underestimated.”

Edit: The above mentioned paper Discrete ill-posed least-squares problems with a solution norm constraint is to be published in “Linear Algebra and its Applications”. It can be found via its doi.

1. A numerical experiment on sparse regularization

To start, I take a standard problem from the Regularization Tools Matlab toolbox: The problem \texttt{deriv2}. This problem generates a matrix {A} and two vectors {x} and {b} such that the equation {Ax=b} is a Galerkin discretization of the integral equation

\displaystyle  g(t) = \int_0^1 k(t,s)f(s) ds

with a kernel {k} such that the solution amounts to solving a boundary value problem. The Galerkin ansatz functions are simply orthonormal characteristic functions on intervals, i.e. {\psi_i(x) = h^{-1/2}\chi_{[ih,(i+1)h]}(x)}. Thus, I work with matrices {A_h} and vectors {x_h} and {b_h}.

I want to use sparse regularization to reconstruct spiky solutions, that is, I solve problems

\displaystyle  \min_{x_h} \tfrac{1}{2}\|A_h x_x - b_h\|^2 + \alpha_h\|x_h\|_1.

Now, my first experiment goes as follows:

Experiment 1 (Discretization goes to zero)
I generate spiky data: I fix a point {t_0} in the interval {[0,1]}, namely {t_0 = 0.2}, and a value {a_0=1}. Now I consider the data {f} which is a delta peak of height {a_0} and {t_0} (which in turn leads to a right hand side {g}). I construct the corresponding {x_h} and the right hand side {b_h=A_hx_h}. Now I aim at solving

\displaystyle  \min_f \tfrac{1}{2}\| g - \int_0^1 k(\cdot,s)f(s)ds\|_2^2 + \alpha \|f\|_1

for different discretizations ({h\rightarrow 0}). In the numerics, I have to scale {\alpha} with {h}, i.e. I solve

\displaystyle  \min_{x_h} \tfrac{1}{2}\|A_h x_x - b_h\|^2 + h\,\alpha\|x_h\|_1.

and I obtain the following results: In black I show the data {x}, {b} and so on, and in blue I plot the minimizer and its image under {A}.

For {n=10}:

For {n=50}:

For {n=100}:

For {n=500}:

For {n=1000}:

Note that the scale varies in the pictures, except in the lower left one where I show the discretized {g}. As is should be, this converges nicely to a piecewise linear function. However, the discretization of the solution blows up which is also as it should be, since I discretize a delta peak. Well, this basically shows, that my scaling is correct.

From the paper Sparse regularization with {\ell^q} penalty term one can extract the following result.

Theorem 1 Let {K:\ell^2\rightarrow Y} be linear, bounded and injective and let {u^\dagger \in \ell^2} have finite support. Moreover let {g^\dagger = Ku^\dagger} and {\|g^\dagger-g^\delta\|\leq \delta}. Furthermore, denote with {u_\alpha^\delta} the minimizer of

\displaystyle  \tfrac12\|Ku-g^\delta\|^2 + \alpha\|u\|_1.

Then, for {\alpha = c\delta} it holds that

\displaystyle  \|u_\alpha^\delta - u^\dagger\|_1 = \mathcal{O}(\delta).

Now let’s observe this convergence rates in a second experiment:

Experiment 2 (Convergence rate {\mathcal{O}(\delta)}) Now we fix the discretization (i.e. {n=500}), and construct a series of {g^\delta}‘s for {\delta} in a logscale between {1} and {10^{-6}}. I scale {\alpha} proportional to {\delta} and caluclate minimizers of

\displaystyle  \min_{x_h} \tfrac{1}{2}\|A_h x_x - b_h\|^2 + h\,\alpha\|x_h\|_1.

The I measure the error {\|f_\alpha^\delta-f^\dagger\|_1} and plot it doubly logarithmically against {\delta}.

And there you see the linear convergence rate as predicted.

In a final experiment I vary both {\delta} and {n}:

Experiment 3 ({\delta\rightarrow 0} and “{n\rightarrow\infty}”)Now we repeat Experiment 1 for different {n} and put all the loglog plots in one figure. This looks like this: You clearly observe the linear convergence rate in any case. But there is another important thing: The larger the {n} (i.e. the smaller the {h}), the later the linear rate kicks is (i.e. for smaller {\delta}). You may wonder what the reason is. By looking at the reconstruction for varying {n} and {\delta} (which I do not show here) you see the following behavior: For large noise the regularized solutions consist of several peaks located all over the place and with vanishing noise, one peak close to the original one gets dominant. However, this peak is not at the exact position, but at a slightly larger {t}; moreover, it is slightly smaller. Then, this peak moves towards the right position and is also getting larger. Finally, the peak arrives at the exact position and remains there while approaching the correct height.

Hence, the linear rate kicks in, precisely when the accuracy is higher than the discretization level.

Conclusion:

  • The linear convergence rate is only present in the discrete case. Moreover, it starts at a level which can not be resolved by the discretization.
  • “Sparsity penalties” in the continuous case are a different and delicate matter. You may consult the preprint “Inverse problems in spaces of measures” which formulates the sparse recovery problem in a continuous setting but in the space of Radon measures rather than in {L^1} (which is simply not working). There Kristian and Hanna show weak* convergence of the minimizers.
  • Finally, for “continuous sparsity” also some kind of convergence is true, however, not in norm (which really should be the variation norm in measure space). Weak* convergence can be quantified by the Prokhorov metric or the Wasserstein metric (which is also called earth movers distance in some comiunities). Convergence with respect to these metric should be true (under some assumptions) but seem hard to prove. Convergence rates would be cool, but seem even harder.

I used to work on “non-convex” regularization with {\ell^p}-penalties, that is, studying the Tikhonov functional

\displaystyle \frac12 \|Ax-b\|_2^2 + \alpha\sum_{i}|x_i|^p \ \ \ \ \ (1)

 

with a linear operator {A} and {0<p<1}.

The regularization properties are quite nice as shown by Markus Grasmair in “Well-posedness and convergence rates for sparse regularization with sublinear {l^q} penalty term” and “Non-convex sparse regularisation” and Kristian Bredies and myself in “Regularization with non-convex separable constraints”.

The next important issue is, to have some way to calculate global minimizers for~(1). But, well, this task may be hard, if not hopeless: Of course one expects a whole lot of local minimizers.

It is quite instructive to consider the simple case in which {A} is the identity first:

Example 1 Consider the minimization of

\displaystyle F(x) = \frac12\|x-b\|_2^2 + \alpha\sum_i |x_i|^p. \ \ \ \ \ (2)

 

This problem separates over the coordinates and hence, can be solved by solving the one-dimensional minimization problem

\displaystyle s^*\in\textup{arg}\min_s \frac12 (s-b)^2 + \alpha|s|^p. \ \ \ \ \ (3)

 

We observe:

  • For {b\geq 0} we get {s^*\geq 0}.
  • Replacing {b} by {-b} leads to {-s^*} instead of {s^*}.

Hence, we can reduce the problem to: For {b\geq 0} find

\displaystyle s^* \in\textup{arg}\min_{s\geq 0} \frac12 (s-b)^2 + \alpha\, s^p. \ \ \ \ \ (4)

 

One local minimizer is always {s^*=0} since the growth of the {p}-th power beats the term {(\cdot-b)^2}. Then, {b} is large enough, there are two more extrema for~(4) which are given as the solutions to

\displaystyle s + \alpha p s^{p-1} = b

one of which is a local maximum (the one which is smaller in magnitude) and the other is a local minimum (the one which is larger in magnitude). This is illustrated in the following “bifurcation” picture:

Now the challenge is, to find out which local minimum has the smaller value. And here a strange thing happens: The “switching point” for {b} at which the global minimizer jumps from {0} to the upper branch of the (multivalued) inverse of {s\mapsto s + \alpha p s^{p-1}} is not at the place at which the second local minimum occurs. It is a little bit larger: In “Convergence rates and source conditions for Tikhonov regularization with sparsity constraints” I calculated this “jumping point” the as the weird expression

\displaystyle b^* = \frac{2-p}{2-2p}\Bigl(2\alpha(1-p)\Bigr)^{\frac{1}{2-p}}.

This leads to the following picture of the mapping

\displaystyle b^\mapsto \textup{arg}\min_s \frac12 (s-b)^2 + \alpha|s|^p

1. Iterative re-weighting

One approach to calculate minimizers in~(1) is the so called iterative re-weighting, which appeared at least in “An unconstrained {\ell^q} minimization for sparse solution of under determined linear systems” by Ming-Jun Lai and Jingyue Wang but is probably older. I think for the problem with equality constraints

\displaystyle \min \|x\|_q\ \textup{ s.t. }\ Ax=b

the approach at least dates back to the 80s but I forgot the reference… For the minimization of (1) the approach goes as follows: For {0<p<1} choose a {q\geq 1} and a small {\epsilon>0} and rewrite the {p}-quasi-norm as

\displaystyle \sum_i |x_i|^p \approx \sum_i (\epsilon + |x_i|^q)^{\frac{p}{q}}.

The necessary condition for a minimizer of

\displaystyle \frac12\|Ax-b\|_2^2 + \alpha\sum_i (\epsilon + |x_i|^q)^{\frac{p}{q}}

is (formally take the derivative)

\displaystyle 0 = \alpha \Big[\frac{p}{q} (\epsilon + |x_i|^q)^{\frac{p}{q}-1} q \textup{sgn}(x_i) |x_i|^{q-1}\Big]_i + A^*(Ax-b)

Note that the exponent {\frac{p}{q}-1} is negative (which is also a reason for the introduction of the small {\epsilon}). Aiming at an iteration, we fix some of the {x}‘s and try to solve for others: If we have a current iterate {x^k} we try to find {x^{k+1}} by solving

\displaystyle 0 = \alpha \Big[\frac{p}{q} (\epsilon + |x_i^k|^q)^{\frac{p}{q}-1} q \textup{sgn}(x_i) |x_i|^{q-1}\Big]_i + A^*(Ax-b)

for {x}. This is the necessary condition for another minimization problem which involves a weighted {q}-norm: Define (non-negative) weights {w^k_i = \frac{p}{q} (\epsilon + |x^k_i|^p)^{\frac{p}{q}-1}} an iterate

\displaystyle x^{k+1}\in \textup{arg}\min_x \frac12\|Ax-b\|_2^2 + \alpha\sum_i w_i^k |x_i|^q. \ \ \ \ \ (5)

 

Lai and Wang do this for {q=2} which has the benefit that each iteration can be done by solving a linear system. However, for general {1\leq q\leq 2} each iteration is still a convex minimization problem. The paper “Convergence of Reweighted {\ell^1} Minimization Algorithms and Unique Solution of Truncated {\ell^p} Minimization” by Xiaojun Chen and Weijun Zhou uses {q=1} and both papers deliver some theoretical results of the iteration. Indeed in both cases one can show (subsequential) convergence to a critical point.

Of course the question arises if there is a chance that the limit will be a global minimizer. Unfortunately this is not probable as a simple numerical experiment shows:

Example 2 We apply the iteration (5) to the one dimensional problem (3) in which we know the solution. And we do this for many values of {b} and plot the value of {b} against the limit of the iteration. Good news first: Everything converges nicely to critical points as deserved. Even better: {\epsilon} can be really small—machine precision works. The bad news: The limit depends on the initial value. Even worse: The method frequently ends on “the wrong branch”, i.e. in the local minimum which is not global. I made the following experiment: I took {p=1/2}, set {\alpha=1} and chose {q=2}. First I initialized for all values of {b} with {s^0=1}. This produced the following output (I plotted every fifth iteration):

Well, the iteration always chose the upper branch… In a second experiment I initialized with a smaller value, namely with {s^0=0.1} for all {b}. This gave:

That’s interesting: I ended at the upper branch for all values below the point where the lower branch (the one with the local maximum) crosses the initialization line. This seems to be true in general. Starting with {s^0=0.05} gave
Well, probably this is not too interesting: Starting “below the local maximum” will bring you to the local minimum which is lower and vice versa. Indeed Lai and Wang proved in their Theorem 2.5 that for a specific setting ({A} of completely full rank, sparsity high enough) there is an {\alpha} small enough such that the method will pick the global minimizer. But wait—they do not say anything about initialization… What happens if we initialize with zero? I have to check…

By the way: A similar experiment as in this example with different values of {q\geq 1} showed the same behavior (getting the right branch if the initialization is ok). However: smaller {q} gave much faster convergence. But remember: For {q=1} (experimentally the fastest) each iteration is an {\ell^1} penalized problem while for {q=2} one has to solve a linear system. So there seems to be a tradeoff between “small number of iterations in IRLP” and “complexity of the subproblems”.

2. Iterative thresholding

Together with Kristian Bredies I developed another approach to these nasty non-convex minimization problems with {\ell^p}-quasi-norms. We wrote a preprint back in 2009 which is currently under revision. Moreover, we always worked in a Hilbert space setting that is {A} maps the sequence space {\ell^2} into a separable Hilbert space.

Remark 1 When showing result for problems in separable Hilbert space I sometimes get the impression that others think this is somehow pointless since in the end one always works with {{\mathbb R}^N} in practice. However, I think that working directly in a separable Hilbert space is preferable since then one can be sure that the results will not depend on the dimension {N} in any nasty way.

Basically our approach was, to use one of the most popular approaches to the {\ell^1}-penalized problem: Iterative thresholding aka forward-backward splitting aka generalized gradient projection. I prefer the last motivation: Consider the minimization of a smooth function {F} over a convex set {C}

\displaystyle \min_{x\in C} F(x)

by the projected gradient method. That is: do a gradient step and use the projection {P_C} to project back onto {C}:

\displaystyle x^{n+1} = P_C(x^n - s_n \nabla F(x^n)).

Now note that the projection is characterized by

\displaystyle P_C(x) = \textup{arg}\min_{y\in C}\frac{1}{2}\|y-x\|^2.

Now we replace the “convex constraint” {C} by a penalty function {\alpha R}, i.e. we want to solve

\displaystyle \min_x F(x) + \alpha R(x).

Then, we just replace the minimization problem for the projection with

\displaystyle P_s(x) = \textup{arg}\min_{y}\frac{1}{2}\|y-x\|^2 + s\alpha R(y)

and iterate

\displaystyle x^{n+1} = P_{s_n}(x^n - s_n \nabla F (x^n)).

The crucial thing is, that one needs global minimizers to obtain {P_s}. However, for these {\ell^p} penalties with {0<p<1} these are available as we have seem in Example~1. Hence, the algorithm can be applied in the case

\displaystyle F(x) = \tfrac{1}{2}\|Ax-y\|^2,\qquad R(x) = \sum_i |x_i|^p.

One easily proves that one gets descent of the objective functional:

Lemma 1 Let {F} be weakly lower semicontinuous and differentiable with Lipschitz continuous gradient {\nabla F} with Lipschitz constant {L} and let {R} be weakly lower semicontinuous and coercive. Furthermore let {P_s(x)} denote any solution of

\displaystyle \min_y \tfrac{1}{2}\|y-x\|^2 + s\alpha R(y).

Then for {y = P_s(x - s\nabla F(x))} it holds that

\displaystyle F(y) + \alpha R(y) \leq F(x) + \alpha R(x) - \tfrac{1}{2}\big(\tfrac{1}{s} - L\big)\|y-x\|^2.

Proof: Start with the minimizing property

\displaystyle \tfrac{1}{2}\|y - (x- s\nabla F(x))\|^2 + s\alpha R(y) \leq \tfrac{1}{2}\|s\nabla F(x)\|^2 + s\alpha R(x).

and conclude (by rearranging, expanding the norm-square, canceling terms and adding {F(y) - F(x)} to both sides) that

\displaystyle (F+\alpha R)(y) - (F+\alpha R)(x) \leq F(y) - F(x) - \langle \nabla F(x),y-x\rangle - \tfrac{1}{2s}\|y-x\|^2.

Finally, use Lipschitz-continuity of {\nabla F} to conclude

\displaystyle F(y) - F(x) - \langle \nabla F(x),y-x\rangle \leq \tfrac{L}{2}\|x-y\|^2.

\Box

This gives descent of the functional value as long as {0< s < 1/L}. Now starts the hard part of the investigation: Under what circumstances do we get convergence and what are possible limits?

To make a long story short: For {\ell^p}-penalties (and also other non-convex penalties which leave the origin with infinite slope) and fixed step-size {s_n=s} one gets that every subsequence of the iterates has a strong accumulation point which is a fixed point of the iteration for that particular {s} as long as {0< s< 1/L}. Well that’s good, but here’s the bad news: As long as {s<1/L} we do not obtain the global minimizer. That’s for sure: Consider {F(x) = \tfrac{1}{2}\|x-b\|^2} and any {0<s<1}

However, with considerably more effort one can show the following: For the iteration {x^{n+1} = P_{s_n}(x^n - s_n \nabla F(x))} with {s_n = (L + 1/n)^{-1}\rightarrow 1/L} (and another technical condition on the Lipschitz constant of {\nabla F}) the iterates have a strong accumulation point which is a solution {x = P_{1/L}(x - 1/L\,\nabla F(x)} and hence, satisfies necessary conditions for a global minimizer.

That’s not too bad yet. Currently Kristian and I, together with Stefan Reiterer, work to show that the whole sequence of iterates converges. Funny enough: This seems to be true for {F(x) = \tfrac{1}{2}\|Ax-b\|^2} and {R(x) = \sum_i |x_i|^p} with rational {p} in {]0,1[}… Basically, Stefan was able to show this with the help of Gröbner bases and this has been my first contact with this nice theory. We hope to finalize our revision soon.

Next Page »

Follow

Get every new post delivered to your Inbox.

Join 30 other followers