Let {\Omega} be a compact subset of {{\mathbb R}^d} and consider the space {C(\Omega)} of continuous functions {f:\Omega\rightarrow {\mathbb R}} with the usual supremum norm. The Riesz Representation Theorem states that the dual space of {C(\Omega)} is in this case the set of all Radon measures, denoted by {\mathfrak{M}(\Omega)}, and the canonical duality pairing is given by

\displaystyle  \langle\mu,f\rangle = \mu(f) = \int_\Omega fd\mu.

We can equip {\mathfrak{M}(\Omega)} with the usual notion of weak* convergence which reads as

\displaystyle  \mu_n\rightharpoonup^* \mu\ \iff\ \text{for every}\ f\in C(\Omega):\ \mu_n(f)\rightarrow\mu(f).

We call a measure {\mu} positive if {f\geq 0} implies that {\mu(f)\geq 0}. If a positive measure satisfies {\mu(1)=1} (i.e. it integrates the constant function with unit value to one), we call it a probability measure and we denote by {\Delta\subset \mathfrak{M}(\Omega)} the set of all probability measures.

Example 1 Every non-negative integrable function {\phi:\Omega\rightarrow{\mathbb R}} with {\int_\Omega \phi(x)dx = 1} induces a probability measure via

\displaystyle  f\mapsto \int_\Omega f(x)\phi(x)dx.

Quite different probability measures are the {\delta}-measures: For every {x\in\Omega} there is the {\delta}-measure at this point, defined by

\displaystyle  \delta_x(f) = f(x).

In some sense, the set {\Delta} of probability measures is the generalization of the standard simplex in {{\mathbb R}^n} to infinite dimensions (in fact uncountably many dimensions): The {\delta}-measures are the extreme points of {\Delta} and since the set {\Delta} is compact in the weak* topology, the Krein-Milman Theorem states that {\Delta} is the weak*-closure of the set of convex combinations of the {\delta}-measures – similarly as the standard simplex in {{\mathbb R}^n} is the set of convex combinations of the canonical basis vectors of {{\mathbb R}^n}.

Remark 1 If we drop the positivity assumption and form the set

\displaystyle  O = \{\mu\in\mathfrak{M}(\Omega)\ :\ |f|\leq 1\implies |\mu(f)|\leq 1\}

we have that {O} is the weak*-closure of the set of convex combinations of the measures {\pm\delta_x} ({x\in\Omega}). Hence, {O} resembles the hyper-octahedron (aka cross polytope or {\ell^1}-ball).

I’ve taken the above (with nearly the same notation) from the book “A Course in Convexity” by Alexander Barvinok. I was curious to find (in Chapter III, Section 9) something which reads like a nice glimpse of semi-continuous compressed sensing: Proposition 9.4 reads as follows

Proposition 1 Let {g,f_1,\dots,f_m\in C(\Omega)}, {b\in{\mathbb R}^m} and suppose that the subset {B} of {\Delta} consisting of the probability measures {\mu} such that for {i=1,\dots,m}

\displaystyle  \int f_id\mu = b_i

is not empty. Then there exist {\mu^+,\mu^-\in B} such that

  1. {\mu^+} and {\mu^-} are convex combinations of at most {m+1} {\delta}-measures, and
  2. it holds that for all {\mu\in B} we have

    \displaystyle  \mu^-(g)\leq \mu(g)\leq \mu^+(g).

In terms of compressed sensing this says: Among all probability measures which comply with the data {b} obtained from {m} linear measurements, there are two extremal ones which consist of at most {m+1} {\delta}-measures.

Note that something similar to “support-pursuit” does not work here: The minimization problem {\min_{\mu\in B}\|\mu\|_{\mathfrak{M}}} does not make much sense, since {\|\mu\|_{\mathfrak{M}}=1} for all {\mu\in B}.

Today I would like to comment on two arxiv-preprints I stumbled upon:

1. “Augmented L1 and Nuclear-Norm Models with a Globally Linearly Convergent Algorithm” – The Elastic Net rediscovered

The paper “Augmented L1 and Nuclear-Norm Models with a Globally Linearly Convergent Algorithm” by Ming-Jun Lai and Wotao Yin is another contribution to a field which is (or was?) probably the fastest growing field in applied mathematics: Algorithms for convex problems with non-smooth {\ell^1}-like terms. The “mother problem” here is as follows: Consider a matrix {A\in{\mathbb R}^{m\times n}} and a vector {b\in{\mathbb R}^m} and try to find a solution of

\displaystyle  \min_{x\in{\mathbb R}^n}\|x\|_1\quad\text{s.t.}\quad Ax=b

or, for {\sigma>0}

\displaystyle  \min_{x\in{\mathbb R}^n}\|x\|_1\quad\text{s.t.}\quad \|Ax-b\|\leq\sigma

which appeared here on this blog previously. Although this is a convex problem and even has a reformulation as a linear program, some instances of this problem are notoriously hard to solve and gained a lot of attention (because of their applicability in sparse recovery and compressed sensing). Very roughly speaking, a part of its hardness comes from the fact that the problem is neither smooth nor strictly convex.

The contribution of Lai and Yin is that they analyze a slight perturbation of the problem which makes its solution much easier: They add another term in the objective; for {\alpha>0} they consider

\displaystyle  \min_{x\in{\mathbb R}^n}\|x\|_1 + \frac{1}{2\alpha}\|x\|_2^2\quad\text{s.t.}\quad Ax=b

or

\displaystyle  \min_{x\in{\mathbb R}^n}\|x\|_1 + \frac{1}{2\alpha}\|x\|_2^2\quad\text{s.t.}\quad \|Ax-b\|\leq\sigma.

This perturbation does not make the problem smooth but renders it strongly convex (which usually makes the dual more smooth). It turns out that this perturbation makes life with this problem (and related ones) much easier – recovery guarantees still exist and algorithms behave better.

I think it is important to note that the “augmentation” of the {\ell^1} objective with an additional squared {\ell^2}-term goes back to Zou and Hastie from the statistics community. There, the motivation was as follows: They observed that the pure {\ell^1} objective tends to “overpromote” sparsity in the sense that if there are two columns in {A} which are almost equally good in explaining some component of {b}, then only one of them is used. The “augmented problem”, however, tends to use both of them. They coined the name “elastic net” for the method (for reasons which I never really got).

I also worked on elastic-net problems for problems in the form

\displaystyle  \min_x \frac{1}{2}\|Ax-b\|^2 + \alpha\|x\|_1 + \frac{\beta}{2}\|x\|_2^2

in this paper (doi-link). Here it also turns out that the problem gets much easier algorithmically. I found it very convenient to rewrite the elastic-net problem as

\displaystyle  \min_x \frac{1}{2}\|\begin{bmatrix}A\\ \sqrt{\beta} I\end{bmatrix}x-\begin{bmatrix}b\\ 0\end{bmatrix}\|^2 + \alpha\|x\|_1

which turns the elastic-net problem into just another {\ell^1}-penalized problem with a special matrix and right hand side. Quite convenient for the analysis and, to some extent, also algorithmically.
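
As a minimal illustration of this rewriting (a sketch only, not the algorithm from the paper; the matrix {A}, the vector {b} and the parameters {\alpha,\beta>0} are assumed to be given), one can build the stacked matrix and right hand side and feed them to any solver for the {\ell^1}-penalized least squares problem – here plain iterative soft-thresholding:

Atilde = [A; sqrt(beta)*eye(size(A,2))];   % stacked matrix
btilde = [b; zeros(size(A,2),1)];          % stacked right hand side
L = norm(Atilde)^2;                        % Lipschitz constant of the gradient of the smooth part
x = zeros(size(A,2),1);
for k = 1:500
  g = Atilde'*(Atilde*x - btilde);         % gradient of 0.5*||Atilde*x - btilde||^2
  x = x - g/L;                             % gradient step
  x = sign(x).*max(abs(x) - alpha/L, 0);   % soft thresholding
end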

2. Towards a Mathematical Theory of Super-Resolution

The second preprint is “Towards a Mathematical Theory of Super-Resolution” by Emmanuel Candes and Carlos Fernandez-Granda.

The idea of super-resolution seems to be pretty old and, very roughly speaking, is to extract a higher resolution of a measured quantity (e.g. an image) than the measured data allows. Of course, in this formulation this is impossible. But often one can gain something by additional knowledge of the image. Basically, this is also the idea behind compressed sensing and hence, it does not come as a surprise that the results in compressed sensing are used to try to explain when super-resolution is possible.

The paper by Candes and Fernandez-Granda seems to be pretty close in spirit to Exact Reconstruction using Support Pursuit on which I blogged earlier. They model the sparse signal as a Radon measure, specifically as a sum of Diracs. However, in contrast to the support-pursuit paper they use complex exponentials (instead of real polynomials) as measurements. Their reconstruction method is basically the same as support pursuit: They try to solve

\displaystyle   \min_{x\in\mathcal{M}} \|x\|\quad\text{s.t.}\quad Fx=y, \ \ \ \ \ (1)

i.e. they minimize over the set of Radon measures {\mathcal{M}} under the constraint that certain measurements {Fx\in{\mathbb R}^n} result in certain given values {y}. Moreover, they make a thorough analysis of what is “reconstructable” by their ansatz and obtain a lower bound on the distance of two Diracs (in other words, a lower bound in the Prokhorov distance). I have to admit that I do not share one of their claims from the abstract: “We show that one can super-resolve these point sources with infinite precision—i.e. recover the exact locations and amplitudes—by solving a simple convex program.” My point is that I cannot see to what extent the problem (1) is a simple one. Well, it is convex, but it does not seem to be simple.

I want to add that the idea of “continuous sparse modelling” in the space of signed measures is very appealing to me and appeared first in Inverse problems in spaces of measures by Kristian Bredies and Hanna Pikkarainen.

If you want to have a sparse solution to a linear system of equations and have heard of compressed sensing or sparse reconstruction, then you probably know what to do: Get one of the many solvers for Basis Pursuit and be happy.

Basis Pursuit was designed as a convex approximation of the generally intractable problem of finding the sparsest solution (that is, the solution with the smallest number of non-zero entries). By abuse of notation, we define for {x\in\mathbb{R}^n}

\displaystyle \|x\|_0 = \#\{i\ : x_i\neq 0\}.

(Because of {\|x\|_0 = \lim_{p\rightarrow 0}\|x\|_p^p} some people prefer the, probably more correct but also more confusing, notation {\|x\|_0^0}…).

Then, the sparsest solution of {Ax=b} is the solution of

\displaystyle \min_x \|x\|_0,\quad \text{s.t.}\ Ax=b

and Basis Pursuit replaces {\|x\|_0} with “the closest convex proxy”, i.e.

\displaystyle \min_x \|x\|_1,\quad\text{s.t.}\ Ax=b.

The good thing about Basis Pursuit is that it really gives the sparsest solution under appropriate conditions, as is widely known nowadays. Here I’d like to present two simple examples in which the Basis Pursuit solution is

  • far away (in norm) from the sparsest solution, and
  • not sparse at all.

1. A small bad matrix

We can build a bad matrix for Basis Pursuit, even in the case {2\times 3}: For a small {\epsilon>0} define

\displaystyle A = \begin{bmatrix} \epsilon & 1 & 0\\ \epsilon & 0 & 1 \end{bmatrix}, \quad b = \begin{bmatrix} 1\\1 \end{bmatrix}.

Of course, the sparsest solution is

\displaystyle x_0 = \begin{bmatrix} 1/\epsilon\\ 0\\ 0\end{bmatrix}

while the solution of Basis Pursuit is

\displaystyle x_1 = \begin{bmatrix} 0\\1\\1 \end{bmatrix}.

To summarize: For {\epsilon<1/2}

\displaystyle \|x_0\|_0 = 1 < 2 = \|x_1\|_0,\quad \|x_0\|_1 = 1/\epsilon > 2 = \|x_1\|_1.

(There is also the least squares solution {A^\dagger b} of minimal Euclidean norm; it has three non-zero entries and a one-norm slightly larger than 2.)
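
A quick numerical check of this small example (a sketch; the value of {\epsilon} is arbitrary):

epsilon = 0.1;
A = [epsilon 1 0; epsilon 0 1];
b = [1; 1];
x0  = [1/epsilon; 0; 0];              % sparsest solution
x1  = [0; 1; 1];                      % Basis Pursuit solution
xls = pinv(A)*b;                      % least squares solution of minimal Euclidean norm
[norm(x0,1) norm(x1,1) norm(xls,1)]   % 10, 2, and slightly more than 2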

Granted, this matrix is stupid. In particular, its first column has a very small norm compared to the others. OK, let’s construct a matrix with normalized columns.

2. A small bad matrix with normalized columns

Fix an integer {n} and a small {\epsilon>0}. We define an {n\times(n+2)}-matrix

\displaystyle \begin{bmatrix} 1+\epsilon/2 & -1+\epsilon/2 & 1 & 0 & \dots & \dots &0\\ -1+\epsilon/2 & 1+\epsilon/2 & 0 & 1 & \ddots & & 0\\ \epsilon/2 & \epsilon/2 & \vdots & \ddots& \ddots & \ddots& \vdots\\ \vdots & \vdots & \vdots & & \ddots & \ddots& 0\\ \epsilon/2 & \epsilon/2 & 0 & \dots& \dots& 0 & 1 \end{bmatrix}.

Ok, the first two columns do not have norm 1 yet, so we normalize them by multiplying with the right constant

\displaystyle c = \frac{1}{\sqrt{2 + \tfrac{n\epsilon^2}{4}}}

(which is close to {1/\sqrt{2}}) to get

\displaystyle A = \begin{bmatrix} c(1+\epsilon/2) & c(-1+\epsilon/2) & 1 & 0 & \dots & \dots &0\\ c(-1+\epsilon/2) & c(1+\epsilon/2) & 0 & 1 & \ddots & & 0\\ c\epsilon/2 & c\epsilon/2 & \vdots & \ddots& \ddots & \ddots& \vdots\\ \vdots & \vdots & \vdots & & \ddots & \ddots& 0\\ c\epsilon/2 & c\epsilon/2 & 0 & \dots& \dots& 0 & 1 \end{bmatrix}.

Now we take the right hand side

\displaystyle b = \begin{bmatrix} 1\\\vdots\\1 \end{bmatrix}

and see what solutions to {Ax=b} there are.

First, there is the least squares solution {x_{\text{ls}} = A^\dagger b} of minimal Euclidean norm. It has no zero entries at all: the last {n} entries are slightly smaller than {1} and the first two are between {0} and {1}; hence, {\|x_{\text{ls}}\|_1 \approx n} (in fact, slightly larger).

Second, there is a very sparse solution

\displaystyle x_0 = \frac{1}{\epsilon c} \begin{bmatrix} 1\\ 1\\ 0\\ \vdots\\ 0 \end{bmatrix}.

This has two non-zero entries and a pretty large one-norm: {\|x_0\|_1 = 2/(\epsilon c)}.

Third there is a solution with small one-norm:

\displaystyle x_1 = \begin{bmatrix} 0\\ 0\\ 1\\ \vdots\\ 1 \end{bmatrix}.

We have {n} non-zero entries and {\|x_1\|_1 = n}. You can check that this {x_1} is also the unique Basis Pursuit solution (e.g. by observing that {A^T[1,\dots,1]^T} is an element of {\partial\|x_1\|_1} and that the first two entries in {A^T[1,\dots,1]^T} are strictly smaller than 1 and positive – put differently, the vector {[1,\dots,1]^T} is a dual certificate for {x_1}).

To summarize, for {\epsilon < \sqrt{\frac{8}{n^2-n}}} it holds that

\displaystyle \|x_0\|_0 = 2 < n = \|x_1\|_0,\quad \|x_0\|_1 = 2/(c\epsilon) > n = \|x_1\|_1.

The geometric idea behind this matrix is as follows: We take {n} simple normalized columns (the identity part in {A}) which sum up to the right hand side {b}. Then we take two normalized vectors which are almost orthogonal to {b} but have {b} in their span (but one needs huge factors here to obtain {b}).

Well, this matrix looks very artificial and indeed it’s constructed for one special purpose: To show that minimal {\ell^1}-norm solutions are not always sparse (even when a sparse solution exists). It’s some kind of a hobby for me to construct instances for sparse reconstruction with extreme properties and I am thinking about a kind of “gallery” of these instances (probably extending the “gallery” command in Matlab).
By the way: if you want to play around with this matrix, here is the code

n = 100;
epsilon = 0.9*sqrt(8/(n^2-n));          % choose epsilon below the threshold sqrt(8/(n^2-n))
c = 1/sqrt(2+n*epsilon^2/4);            % normalization constant for the first two columns
A = zeros(n,n+2);
A(1:2,1:2) = ([1 -1;-1 1]+epsilon/2)*c; % upper left 2x2 block
A(3:n,1:2) = epsilon/2*c;               % remaining entries of the first two columns
A(1:n,3:n+2) = eye(n);                  % identity block
b = ones(n,1);
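
To actually compute the Basis Pursuit solution for this instance you can use any solver you like; here is a sketch using the standard LP reformulation (assuming that linprog from the Optimization Toolbox is available):

N = n+2;
f = [zeros(N,1); ones(N,1)];                 % minimize sum(t) over z = [x; t]
Ain = [eye(N) -eye(N); -eye(N) -eye(N)];     % x - t <= 0 and -x - t <= 0, i.e. |x| <= t
bin = zeros(2*N,1);
Aeq = [A zeros(n,N)];                        % equality constraint Ax = b
z = linprog(f, Ain, bin, Aeq, b);
x_bp = z(1:N);
x0 = [ones(2,1)/(epsilon*c); zeros(n,1)];    % the sparse solution
x1 = [zeros(2,1); ones(n,1)];                % the claimed Basis Pursuit solution
[norm(x_bp - x1) norm(x0,1) norm(x1,1)]      % x_bp should coincide with x1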

How many samples are needed to reconstruct a sparse signal?

Well, there are many, many results around, some of which you probably know (at least if you are following this blog or this one). Today I write about a neat result, which I found quite some time ago, on the reconstruction of nonnegative sparse signals from a semi-continuous perspective.

1. From discrete sparse reconstruction/compressed sensing to semi-continuous

The basic sparse reconstruction problem asks the following: Say we have a vector {x\in{\mathbb R}^m} which only has {s<m} non-zero entries and a fat matrix {A\in{\mathbb R}^{n\times m}} (i.e. {n<m}) and consider that we are given the measurements {b=Ax}. Of course, the system {Ax=b} is underdetermined. However, we may add a little more prior knowledge on the solution and ask: Is it possible to reconstruct {x} from {b} if we know that the vector {x} is sparse? If yes: How? Under what conditions on {m}, {s}, {n} and {A}? This question created the expanding universe of compressed sensing recently (and this universe is expanding so fast that for sure there has to be some dark energy in it). As a matter of fact, a powerful method to obtain sparse solutions to underdetermined systems is {\ell^1}-minimization a.k.a. Basis Pursuit on which I blogged recently: Solve

\displaystyle  \min_x \|x\|_1\ \text{s.t.}\ Ax=b

and the important ingredient here is the {\ell^1}-norm of the vector in the objective function.

In this post I’ll formulate semi-continuous sparse reconstruction. We move from an {m}-vector {x} to a finite signed measure {\mu} on a closed interval (which we assume to be {I=[-1,1]} for simplicity). We may embed the {m}-vectors into the space of finite signed measures by choosing {m} points {t_i}, {i=1,\dots, m}, from the interval {I} and building {\mu = \sum_{i=1}^m x_i \delta_{t_i}} with the point-masses (or Dirac measures) {\delta_{t_i}}. To be a bit more precise, we speak about the space {\mathfrak{M}} of Radon measures on {I}, which are defined on the Borel {\sigma}-algebra of {I} and are finite. Radon measures are not very scary objects and an intuitive way to think of them is to use Riesz representation: Every Radon measure arises as a continuous linear functional on a space of continuous functions, namely the space {C_0(I)} which is the closure of the continuous functions with compact support in {{]{-1,1}[}} with respect to the supremum norm. Hence, Radon measures act on these functions as {\int_I fd\mu}. It is also natural to speak of the support {\text{supp}(\mu)} of a Radon measure {\mu} and it holds for any continuous function {f} that

\displaystyle  \int_I f d\mu = \int_{\text{supp}(\mu)}f d\mu.

An important tool for Radon measures is the Hahn-Jordan decomposition which decomposes {\mu} into a positive part {\mu^+} and a negative part {\mu^-}, i.e. {\mu^+} and {\mu^-} are non-negative and {\mu = \mu^+-\mu^-}. Finally, the (total) variation of a measure, which is

\displaystyle  \|\mu\| = \mu^+(I) + \mu^-(I)

provides a norm on the space of Radon measures.

Example 1 For the measure {\mu = \sum_{i=1}^m x_i \delta_{t_i}} one readily calculates that

\displaystyle  \mu^+ = \sum_i \max(0,x_i)\delta_{t_i},\quad \mu^- = \sum_i \max(0,-x_i)\delta_{t_i}

and hence

\displaystyle  \|\mu\| = \sum_i |x_i| = \|x\|_1.

In this sense, the space of Radon measures provides a generalization of {\ell^1}.

We may sample a Radon measure {\mu} with {n+1} linear functionals and these can be encoded by {n+1} continuous functions {u_0,\dots,u_n} as

\displaystyle  b_k = \int_I u_k d\mu.

This sampling gives a bounded linear operator {K:\mathfrak{M}\rightarrow {\mathbb R}^{n+1}}. The generalization of Basis Pursuit is then given by

\displaystyle  \min_{\mu\in\mathfrak{M}} \|\mu\|\ \text{s.t.}\ K\mu = b.

This was introduced and called “Support Pursuit” in the preprint Exact Reconstruction using Support Pursuit by Yohann de Castro and Fabrice Gamboa.

More on the motivation and the use of Radon measures for sparsity can be found in Inverse problems in spaces of measures by Kristian Bredies and Hanna Pikkarainen.

2. Exact reconstruction of sparse nonnegative Radon measures

Before I talk about the results we may count the degrees of freedom a sparse Radon measure has: If {\mu = \sum_{i=1}^s x_i \delta_{t_i}} for some {s}, then {\mu} is defined by the {s} weights {x_i} and the {s} positions {t_i}. Hence, we expect that at least {2s} linear measurements should be necessary to reconstruct {\mu}. Surprisingly, this is almost enough if we know that the measure is nonnegative! We only need one more measurement, that is {2s+1}, and moreover, we can take fairly simple measurements, namely the monomials: {u_i(t) = t^i}, {i=0,\dots, n} (with the convention that {u_0(t)\equiv 1}). This is shown in the following theorem by de Castro and Gamboa.

Theorem 1 Let {\mu = \sum_{i=1}^s x_i\delta_{t_i}} with {x_i\geq 0}, {n=2s} and let {u_i}, {i=0,\dots,n}, be the monomials as above. Define {b_i = \int_I u_i(t)d\mu}. Then {\mu} is the unique solution of the support pursuit problem, that is of

\displaystyle  \min \|\nu\|\ \text{s.t.}\ K\nu = b.\qquad \textup{(SP)}

Proof: The following polynomial will be of importance: For a constant {c>0} define

\displaystyle  P(t) = 1 - c \prod_{i=1}^s (t-t_i)^2.

The following properties of {P} will be used:

  1. {P(t_i) = 1} for {i=1,\dots,s}
  2. {P} has degree {n=2s} and hence, is a linear combination of the {u_i}, {i=0,\dots,n}, i.e. {P = \sum_{k=0}^n a_k u_k}.
  3. For {c} small enough it holds for all {t\in I} with {t\neq t_i} that {|P(t)|<1}.

Now let {\sigma} be a solution of (SP). Since {\mu} is feasible for (SP), we have {\|\sigma\|\leq\|\mu\|} and it remains to show that {\|\mu\|\leq \|\sigma\|}. Due to property 2 we know that

\displaystyle  \int_I u_k d\sigma = (K\sigma)_k = b_k = \int_I u_k d\mu.

Due to property 1 and non-negativity of {\mu} we conclude that

\displaystyle  \begin{array}{rcl}  \|\mu\| & = & \sum_{i=1}^s x_i = \int_I P d\mu\\ & = & \int_I \sum_{k=0}^n a_k u_k d\mu\\ & = & \sum_{k=0}^n a_k \int_I u_k d\mu\\ & = & \sum_{k=0}^n a_k \int_I u_k d\sigma\\ & = & \int_I P d\sigma. \end{array}

Moreover, by Lebesgue’s decomposition we can decompose {\sigma} with respect to {\mu} such that

\displaystyle  \sigma = \underbrace{\sum_{i=1}^s y_i\delta_{t_i}}_{=\sigma_1} + \sigma_2

and {\sigma_2} is singular with respect to {\mu}. We get

\displaystyle  \begin{array}{rcl}  \int_I P d\sigma = \sum_{i=1}^s y_i + \int P d\sigma_2 \leq \|\sigma_1\| + \|\sigma_2\|=\|\sigma\| \end{array}

and we conclude that {\|\sigma\| = \|\mu\|} and especially {\int_I P d\sigma_2 = \|\sigma_2\|}. This shows that {\mu} is a solution of (SP). It remains to show uniqueness. We show the following: If there is a {\nu\in\mathfrak{M}} with support in {I\setminus\{t_1,\dots,t_s\}} such that {\int_I Pd\nu = \|\nu\|}, then {\nu=0}. To see this, we build, for any {r>0}, the sets

\displaystyle  \Omega_r = [-1,1]\setminus \bigcup_{i=1}^s\ ]t_i-r,t_i+r[.

and assume that there exists {r>0} such that {\|\nu|_{\Omega_r}\|\neq 0} ({\nu|_{\Omega_r}} denoting the restriction of {\nu} to {\Omega_r}). However, it holds by property 3 of {P} that

\displaystyle  \int_{\Omega_r} P d\nu < \|\nu|_{\Omega_r}\|

and consequently

\displaystyle  \begin{array}{rcl}  \|\nu\| &=& \int Pd\nu = \int_{\Omega_r} Pd\nu + \int_{\Omega_r^C} P d\nu\\ &<& \|\nu|_{\Omega_r}\| + \|\nu|_{\Omega_r^C}\| = \|\nu\| \end{array}

which is a contradiction. Hence, {\nu|_{\Omega_r}=0} for all {r} and this implies {\nu=0}. Since {\sigma_2} has its support in {I\setminus\{t_1,\dots,t_s\}} and satisfies {\int_I P d\sigma_2 = \|\sigma_2\|}, we conclude that {\sigma_2=0}. Hence, the support of {\sigma} is contained in {\{t_1,\dots,t_s\}}, i.e. {\sigma = \sum_{i=1}^s y_i\delta_{t_i}}, and from {K\sigma = b = K\mu} we get {K(\sigma-\mu) = 0}. This can be written as a Vandermonde-type system

\displaystyle  \begin{pmatrix} u_0(t_1)& \dots &u_0(t_s)\\ \vdots & & \vdots\\ u_n(t_1)& \dots & u_n(t_s) \end{pmatrix} \begin{pmatrix} y_1 - x_1\\ \vdots\\ y_s - x_s \end{pmatrix} = 0

which only has the zero solution, giving {y_i=x_i}. \Box
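
To get a feeling for the polynomial {P} from the proof, here is a small numerical illustration (a sketch with arbitrarily chosen positions {t_i}; the constant {c} is simply taken small enough based on a fine grid):

s = 3;
t = [-0.5; 0.1; 0.7];                 % positions t_i of the Diracs
tt = linspace(-1,1,2001)';            % fine grid on I = [-1,1]
prodsq = ones(size(tt));
for i = 1:s
  prodsq = prodsq.*(tt - t(i)).^2;    % prod_i (t - t_i)^2 on the grid
end
c = 1/max(prodsq);                    % small enough such that |P| < 1 away from the t_i
P = 1 - c*prodsq;
[min(P) max(P)]                       % P stays within [0,1] and attains 1 (numerically) only at the t_i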

3. Generalization to other measurements

The measurement by monomials may sound a bit unusual. However, de Castro and Gamboa show more. What really matters here is that the monomials form a so-called Chebyshev system (or Tchebyscheff system or T-system – by the way, have you ever tried to google for a T-system?). This is explained, for example, in the book “Tchebycheff Systems: With Applications in Analysis and Statistics” by Karlin and Studden. A T-system on {I} is simply a set of {n+1} functions {\{u_0,\dots, u_n\}} such that any non-trivial linear combination of these functions has at most {n} zeros. These systems are named after Tchebyscheff since they obey many of the helpful properties of the Tchebyscheff polynomials.

What is helpful in our context is the following theorem of Krein:

Theorem 2 (Krein) If {\{u_0,\dots,u_n\}} is a T-system for {I}, {k\leq n/2} and {t_1,\dots,t_k} are in the interior of {I}, then there exists a linear combination {\sum_{j=0}^n a_j u_j} which is non-negative and vanishes exactly at the points {t_1,\dots,t_k}.

Now consider that we replace the monomials in Theorem 1 by a T-system. You recognize that Krein’s Theorem allows one to construct a “generalized polynomial” which fulfills the same requirements as the polynomial {P} in the proof of Theorem 1, as soon as the constant function 1 lies in the span of the T-system – and indeed the result of Theorem 1 is also valid in that case.

4. Exact reconstruction of {s}-sparse nonnegative vectors from {2s+1} measurements

From the above one can deduce a reconstruction result for {s}-sparse vectors and I quote Theorem 2.4 from Exact Reconstruction using Support Pursuit:

Theorem 3 Let {n}, {m}, {s} be integers such that {s\leq \min(n/2,m)} and let {\{1,u_1,\dots,u_n\}} be a complete T-system on {I} (that is, {\{1,u_1,\dots,u_r\}} is a T-system on {I} for all {r<n}). Then it holds: For any distinct reals {t_1,\dots,t_m} and {A} defined as

\displaystyle  A=\begin{pmatrix} 1 & \dots & 1\\ u_1(t_1)& \dots &u_1(t_m)\\ \vdots & & \vdots\\ u_n(t_1)& \dots & u(t_m) \end{pmatrix}

Basis Pursuit (applied to {b = Ax}) recovers all nonnegative {s}-sparse vectors {x\in{\mathbb R}^m}.
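
As a small numerical illustration (a sketch only, using the monomials {1,t,\dots,t^n} as T-system and linprog as Basis Pursuit solver; for larger {s} the Vandermonde-type matrix quickly becomes ill-conditioned):

s = 3; n = 2*s; m = 20;
t = linspace(-1,1,m);                      % candidate positions t_j
A = zeros(n+1,m);
for k = 0:n
  A(k+1,:) = t.^k;                         % monomial measurements u_k(t) = t^k
end
x = zeros(m,1); x([3 10 17]) = [1; 2; 0.5];% a nonnegative s-sparse vector
b = A*x;
% Basis Pursuit via the LP with the splitting x = u - v, u,v >= 0
f = ones(2*m,1);
z = linprog(f, [], [], [A -A], b, zeros(2*m,1), []);
x_bp = z(1:m) - z(m+1:end);
norm(x_bp - x, inf)                        % should be (numerically) zero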

5. Concluding remarks

Note that Theorem 3 gives a deterministic construction of a measurement matrix.

Also note that nonnegativity is crucial in what we did here. It allowed us (in the monomial case) to work with squares and obtain the polynomial {P} in the proof of Theorem 1 (which is also called a “dual certificate” in this context). This raises the question of how this method can be adapted to general sparse signals. One needs (in the monomial case) a polynomial which is bounded by 1 in absolute value but matches the signs of the measure on its support. While this can be done (I think) for polynomials, it seems difficult to obtain a generalization of Krein’s Theorem to this case…

In my previous post I announced the draft of the paper “Infeasible-Point Subgradient Algorithm and Computational Solver Comparison for l1-Minimization” which is now available as a preprint at optimization online.

1. Fixed bugs; different results

Basically not much has changed from the draft to the preprint; however, we had to fix some bugs in our computational comparison of solvers and this changed the results. For example, {\ell^1}-magic is now a little better, especially when combined with the heuristic support evaluation (HSE) we propose in the paper. But most notably, {\ell^1}-Homotopy is not the winner anymore. This is due to the fact that we had a conceptual error in our setup. Remember that {\ell^1}-Homotopy solves the Basis Pursuit denoising problem

\displaystyle  \min_x \frac12\|Ax-b\|_2^2 + \lambda\|x\|_1

starting with {\lambda = \|A^Tb\|_\infty} (which results in {x=0}) and decreasing {\lambda} while tracking the (piecewise linear) solution path. Provably, this reaches the Basis Pursuit solution for {\lambda=0} after crossing a finite number of breaks in the solution path. However, in our first experiments we used a final parameter of {\lambda = 10^{-9}}. And that was against our rules: We only considered solvers which (in theory) calculate the exact Basis Pursuit solution. Now we reran the calculations with {\lambda=0} and surprisingly the results were worse in terms of reconstruction accuracy (and, of course, also in terms of speed). We did not find out precisely which part of the solver is responsible for this effect, but it should have something to do with the accuracy of the inverse of the submatrix of {A^TA} which is maintained throughout the iterations.
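
(The starting value is easy to check: {x=0} solves the Basis Pursuit denoising problem precisely when the optimality condition

\displaystyle  0\in A^T(A\cdot 0 - b) + \lambda\partial\|0\|_1 = -A^Tb + \lambda[-1,1]^n

holds, i.e. when {\|A^Tb\|_\infty\leq\lambda}.)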

Another surprise was that the results for {\lambda=10^{-9}} always ended with roughly the same solution accuracy (about {10^{-8}}) for all test instances (no matter what size, matrix type or number of nonzeros we used). That is a surprise because there is no formula which tells you in advance how accurate the Basis Pursuit denoising solution for a particular {\lambda} will be (compared to the Basis Pursuit solution). Maybe an explanation lies in the common features all our test instances share: All matrix columns are normalized to unit Euclidean norm and all non-zero entries in the solutions follow the same distribution.

If you want to have a closer look at our results you can find all the data (i.e. all the running times and solution accuracies for all solvers and all instances) on our SPEAR project website, here.

By the way: Now the overall winner is CPLEX (using the dual simplex method)! So, please stop spreading the message that standard LP solvers are not good for Basis Pursuit…

2. Testset online!

With the submission of the paper, we also made our testset publicly available. You can download all our test instances from the website of our SPEAR project, both as Matlab .mat files and as ASCII data (if you would like to use another language). Remember: Each instance comes with a matrix {A}, a vector {b} and a vector {x} which is guaranteed to be the unique solution of {Ax=b} with minimal one-norm. Moreover, there are instances for which the support of the solution is so large that the minimal-one-norm solution is not necessarily the sparsest solution anymore, which is also an interesting borderline case for most Basis Pursuit solvers.

3. ISAL1 online

Also, the Matlab code of ISAL1 (infeasible point subgradient algorithm for {\ell^1}) is online at the website of our SPEAR project. Check it out if you like.

L1TestPack has just been updated to version 1.1. With the help of Andreas Tillmann I enhanced this small gadget for issues related to {\ell^1} minimization. New functions are

  • Routines to directly calculate a source element for a given matrix {A} and a vector {x^\dagger}, that is, calculate a vector {y} such that

    \displaystyle  A^* y \in\partial\|x^\dagger\|_1.

    The existence of such a vector {y} ensures that the minimization problem (the Basis Pursuit problem)

    \displaystyle  \min_x \|x\|_1\ \text{ s.t. }\ Ax = Ax^\dagger

    has the unique solution {x^\dagger} (in other words: {x^\dagger} is recovered exactly). This is particularly helpful if you are interested in unique solutions for Basis Pursuit without posing strong conditions which even imply {\ell^0}-{\ell^1}-equivalence. A minimal sketch of this idea follows after this list.

  • Routines related to RIP constants, the ERC coefficient of Joel Tropp and the mutual coherence.
  • An implementation of the heuristic support evaluation HSE (also described in my previous post). (By the way: We were tempted to call this device “support evaluation routine” with acronym SuppER but abandoned this idea.)
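
Here is a minimal sketch of the idea behind the source elements (this is not the L1TestPack code; it merely checks whether a randomly generated {x^\dagger} admits a source element):

m = 20; n = 50; s = 5;
A = randn(m,n);
x_dag = zeros(n,1); x_dag(randperm(n,s)) = randn(s,1);   % a test vector with s non-zero entries
S  = find(x_dag ~= 0);                    % support of x_dag
y  = pinv(A(:,S)')*sign(x_dag(S));        % least-squares solution of A_S^T y = sgn(x_dag) on S
w  = A'*y;                                % candidate element of the subdifferential of ||.||_1 at x_dag
Sc = setdiff(1:n, S);                     % complement of the support
max(abs(w(Sc)))                           % if this is < 1 (and A_S is injective), x_dag is the
                                          % unique Basis Pursuit solution for b = A*x_dag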

Recently Andreas Tillmann presented the poster “An Infeasible-Point Subgradient Algorithm and a Computational Solver Comparison for l1-Minimization” at SPARS11. This poster summarized some results of the project SPEAR on sparse exact and approximate recovery of Marc Pfetsch and myself. We used this as an opportunity to release a draft of the accompanying paper with the same title. Although this draft is not totally ready to be submitted yet, I already summarize its content here.

In this paper we considered the Basis Pursuit problem (beware: the linked Wikipedia page is a stub at this time) from a pure optimization point of view. The Basis Pursuit problem is: For a given matrix {A\in{\mathbb R}^{m\times n}} (with {m<n}) and a vector {b\in{\mathbb R}^m}, find the solution to

\displaystyle \min_{x} \|x\|_1\quad\text{s.t.}\quad Ax = b. \ \ \ \ \ (1)

Hence, we mostly neglected all its interesting features (reproducing the sparsest solution of an underdetermined linear system and so on) and concentrated solely on its solution as an optimization problem.

The paper has three somewhat separate contributions:

  • The new algorithm ISAL1: The problem (1) is a convex nonsmooth constrained optimization problem. Marc and Andreas are optimizers and they wondered how the most basic method for this class of problems would perform: the projected subgradient method. For solving

    \displaystyle \min_x f(x)\quad\text{s.t.}\quad x\in C

    take steps along some negative subgradient and project back to {C}: {x^{k+1} = P_C(x^k - \alpha_k h^k)}. For (1) subgradients are readily available, e.g. {h^k = \text{sgn}(x^k)} (taken coordinate-wise). However, projecting onto the constraint {Ax=b} is not too easy. Denoting the projection simply by {P}, we can give a closed form expression (assuming that {A} has full rank) as

    \displaystyle P(z) = (I - A^T (AA^T)^{-1} A) z + A^T(AA^T)^{-1}b,

    this has the drawback that one needs to explicitly invert a matrix (which, however, is just {m\times m} and hence usually not too large since we assume {m\ll n}). However, we proposed to replace the exact projection by an approximate one: In each step we solve for the projection by a truncated conjugate gradient method. While we expected that one should increase the accuracy of the approximate projection by increasing the number of CG-steps during the iteration, surprisingly that is not true: Throughout the iteration, a fixed small number of CG-iterations (say {5} for matrices of size {1000\times 4000}, but largely independent of the size) suffices to obtain convergence (and especially feasibility of the iterates). In the paper we give a proof of convergence of the method under several assumptions on the step-sizes and projection accuracies, building on our previous paper in which we analyzed this method in the general case. Moreover, we described several ways to speed up and stabilize the subgradient method. Finally, we called this method “Infeasible point subgradient algorithm for {\ell^1}”: ISAL1. A Matlab implementation can be found on the SPEAR website; a rough sketch of such an iteration is given at the end of this post.

  • HSE, the heuristic support evaluation: That’s a pretty neat device which can be integrated in any Basis Pursuit solver (beware: not Basis Pursuit denoising; we want the equality constraint). The idea is based on the following small lemma:

    Lemma 1 A feasible vector {\bar x} (i.e. {A\bar x = b}) is optimal for (1) if and only if there is {w\in{\mathbb R}^m} such that {A^Tw \in\partial\|\bar x\|_1}.

    The proof basically consists of noting that the normal cone of the constraint set {\{x\,:\,Ax=b\}} is the range of {A^T} and hence the condition is equivalent to saying that this normal cone intersects the subdifferential {\partial\|\bar x\|_1}, which is necessary and sufficient for {\bar x} being optimal. In practice the HSE does the following:

    • deduce candidate (approx.) support {S} from a given {x}
    • compute approximate solution {\hat{w}} to {A_{S}^T w = \text{sgn}(x_{S})} by {w = (A_S^T)^\dagger\text{sgn}(x_S)} with the help of CG
    • if {\|A^T \hat{w}\|_\infty \approx 1} check existence of a {\hat{x}} with {A_{S} \hat{x}_{S} = b} and {\hat{x}_i = 0} {\forall\, i \notin S}
    • if that {\hat x} exists, check if the relative duality gap {(\|\hat{x}\|_1 + b^T (-\hat{w}))/\|\hat{x}\|_1} is small and return “success” if so, i.e. take {\hat x} as an optimal solution

    Again, CG usually performs great here and only very few iterations (say {5}) are needed. In practice this method never returned a vector {\hat x} marked as optimal which was not actually optimal.

  • Computational comparison: We faced the challenge of a computational comparison for Basis Pursuit solvers.
    The first step was to design a testset. We constructed 100 matrices (74 of which are dense, 26 are sparse) by several constructions and concatenations (see Section 5 in the draft). More complicated was the construction of appropriate right hand sides. Why? Because we wanted to have unique solutions! That is, because we wanted to have the norm difference {\|x^*-\hat x\|_2} between the optimal and the computed solution as a measure for both optimality and feasibility. In the first place we used the ERC due to Joel Tropp (e.g. described in this blog post of Bob’s blog). However, this not only guarantees uniqueness of solutions but also that the minimum {1}-norm solution is the sparsest. Since that is probably too much to ask of the solutions (i.e. they have to be very sparse), we constructed some more right hand sides using L1TestPack: Construct an {x} such that there is a {w} such that {A^T w \in\partial \|x\|_1} and use {b = Ax}. This also leads to unique solutions for Basis Pursuit if {A} is injective when restricted to the columns which correspond to the entries in which {(A^T w)_i = \pm 1}, but allows for much larger supports. For the results of the comparison of ISAL1, SPGL1, YALL1, {\ell^1}-MAGIC, SparseLab, the homotopy solver of Salman Asif and CPLEX check the paper. However, some things are interesting:

    1. homotopy is the overall winner (which is somewhat to be expected for the instances constructed with ERC, but not for the others). Great work, Salman!
    2. ISAL1 is quite good (although it is the simplest among all methods).
    3. HSE works great: Including it e.g. in SPGL1 produces “better” solutions in less time.
    4. CPLEX is remarkably good (we used the dual simplex). So: How come so many people keep saying that standard LP solvers do not work well for Basis Pursuit? That is simply not true for the dual simplex! (However, the interior point method in CPLEX was not competitive at all.)

    We plan to make a somewhat deeper evaluation of our computational results before submitting the paper, in order to draw some more detailed conclusions on the performance of the solvers on different instances.
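
To close, here is a rough sketch of a projected subgradient iteration with inexact CG projections as described above (this is not the ISAL1 code from the SPEAR website, just a minimal illustration with an ad-hoc diminishing step size and a random test instance):

m = 200; n = 1000;
A = randn(m,n);
xtrue = zeros(n,1); xtrue(randperm(n,20)) = randn(20,1);
b = A*xtrue;
x = A'*((A*A')\b);                          % feasible starting point (minimum 2-norm solution)
for k = 1:5000
  h = sign(x);                              % a subgradient of ||.||_1 at x
  z = x - h/k;                              % subgradient step with step size 1/k
  % approximate projection onto {x : Ax = b}: solve (A*A')*w = A*z - b by a few CG steps
  [w,~] = pcg(@(v) A*(A'*v), A*z - b, 1e-10, 5);
  x = z - A'*w;
end
[norm(A*x-b) norm(x,1) norm(xtrue,1)]       % (approximate) feasibility and objective values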