Let {\Omega} be a compact subset of {{\mathbb R}^d} and consider the space {C(\Omega)} of continuous functions {f:\Omega\rightarrow {\mathbb R}} with the usual supremum norm. The Riesz Representation Theorem states that the dual space of {C(\Omega)} is in this case the set of all Radon measures, denoted by {\mathfrak{M}(\Omega)}, and the canonical duality pairing is given by

\displaystyle  \langle\mu,f\rangle = \mu(f) = \int_\Omega fd\mu.

We can equip {\mathfrak{M}(\Omega)} with the usual notion of weak* convergence, which reads as

\displaystyle  \mu_n\rightharpoonup^* \mu\ \iff\ \text{for every}\ f\in C(\Omega):\ \mu_n(f)\rightarrow\mu(f).

We call a measure {\mu} positive if {f\geq 0} implies that {\mu(f)\geq 0}. If a positive measure satisfies {\mu(1)=1} (i.e. it integrates the constant function with unit value to one), we call it a probability measure and we denote with {\Delta\subset \mathfrak{M}(\Omega)} the set of all probability measures.

Example 1 Every non-negative integrable function {\phi:\Omega\rightarrow{\mathbb R}} with {\int_\Omega \phi(x)dx = 1} induces a probability measure via

\displaystyle  f\mapsto \int_\Omega f(x)\phi(x)dx.

Quite different probability measures are the {\delta}-measures: For every {x\in\Omega} there is the {\delta}-measure at this point, defined by

\displaystyle  \delta_x(f) = f(x).
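
Just to see these two examples in action numerically, here is a tiny Python check of mine (density, test function and point chosen arbitrarily; the integral is approximated by a plain Riemann sum):

```
import numpy as np

# Omega = [0,1], discretized for quadrature
x = np.linspace(0.0, 1.0, 200_001)
dx = x[1] - x[0]

phi = 2.0 * x                      # non-negative density with integral 1 on [0,1]
f = np.cos(np.pi * x)              # a continuous test function

mu_f = np.sum(f * phi) * dx        # mu(f) = int_Omega f(x) phi(x) dx
delta_f = np.cos(np.pi * 0.5)      # delta_{1/2}(f) = f(1/2)

print(np.sum(phi) * dx, mu_f, delta_f)   # total mass approx 1, and the two pairings
```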

In some sense, the set {\Delta} of probability measures is the generalization of the standard simplex in {{\mathbb R}^n} to infinite dimensions (in fact uncountably many dimensions): The {\delta}-measures are the extreme points of {\Delta} and since the set {\Delta} is compact in the weak* topology, the Krein-Milman Theorem states that {\Delta} is the weak*-closure of the set of convex combinations of the {\delta}-measures – similarly as the standard simplex in {{\mathbb R}^n} is the set of convex combinations of the canonical basis vectors of {{\mathbb R}^n}.

Remark 1 If we drop the positivity assumption and form the set

\displaystyle  O = \{\mu\in\mathfrak{M}(\Omega)\ :\ |f|\leq 1\implies |\mu(f)|\leq 1\}

we have that {O} is the weak*-closure of the set of convex combinations of the measures {\pm\delta_x} ({x\in\Omega}). Hence, {O} resembles the hyper-octahedron (a.k.a. cross polytope or {\ell^1}-ball).

I’ve taken the above (with almost identical notation) from the book “A Course in Convexity” by Alexander Barvinok. I was intrigued to find there (in Chapter III, Section 9) something which reads like a nice glimpse of semi-continuous compressed sensing: Proposition 9.4 reads as follows

Proposition 1 Let {g,f_1,\dots,f_m\in C(\Omega)}, {b\in{\mathbb R}^m} and suppose that the subset {B} of {\Delta} consisting of the probability measures {\mu} such that for {i=1,\dots,m}

\displaystyle  \int f_id\mu = b_i

is not empty. Then there exist {\mu^+,\mu^-\in B} such that

  1. {\mu^+} and {\mu^-} are convex combinations of at most {m+1} {\delta}-measures, and
  2. for all {\mu\in B} it holds that

    \displaystyle  \mu^-(g)\leq \mu(g)\leq \mu^+(g).

In terms of compressed sensing this says: Among all probability measures which comply with the data {b} measured by {m} linear measurements, there are two extremal ones which consist of at most {m+1} {\delta}-measures.

Note that something similar to “support pursuit” does not work here: The minimization problem {\min_{\mu\in B}\|\mu\|_{\mathfrak{M}}} does not make much sense, since {\|\mu\|_{\mathfrak{M}}=1} for all {\mu\in B}.
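
To see the proposition “in action”, here is a small numerical sketch of my own (not from Barvinok’s book; grid, functions and data chosen arbitrarily): discretize {\Omega=[0,1]}, impose {m} moment constraints, and maximize {\mu(g)} over the discretized probability measures with a simplex-type LP solver. The solver returns a vertex of the feasible polytope, and indeed the maximizer is supported on at most {m+1} grid points.

```
import numpy as np
from scipy.optimize import linprog

N, m = 400, 4                              # grid size, number of measurements
t = np.linspace(0.0, 1.0, N)               # discretized Omega = [0,1]
F = np.vstack([t**i for i in range(1, m + 1)])   # f_i(t) = t^i, i = 1,...,m
g = np.cos(5 * t)                          # the function to be extremized

p_true = np.full(N, 1.0 / N)               # some probability measure, so B is non-empty
b = F @ p_true                             # data b_i = mu(f_i)

A_eq = np.vstack([F, np.ones((1, N))])     # moment constraints plus mu(1) = 1
b_eq = np.concatenate([b, [1.0]])

# maximize mu(g) = g^T p  <=>  minimize -g^T p; a dual-simplex solver returns a vertex
res = linprog(-g, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs-ds")
support = np.flatnonzero(res.x > 1e-9)
print(len(support), "<= m + 1 =", m + 1)   # mu^+ is supported on at most m+1 points
```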

Today I report on two things I came across here at ISMP:

  • The first is a talk by Russell Luke on Constraint qualifications for nonconvex feasibility problems. Luke treated the NP-hard problem of finding sparsest solutions of linear systems. In fact, he did not tackle this problem directly but rather the problem of finding an {s}-sparse solution of an {m\times n} system of equations. He formulated this as a feasibility problem (well, Heinz Bauschke was a collaborator) as follows: With the usual malpractice let us denote by {\|x\|_0} the number of non-zero entries of {x\in{\mathbb R}^n}. Then the problem of finding an {s}-sparse solution to {Ax=b} is:

    \displaystyle  \text{Find}\ x\ \text{in}\ \{\|x\|_0\leq s\}\cap\{Ax=b\}.

    In other words: find a feasible point, i.e. a point which lies in the intersection of the two sets. Well, most often feasibility problems involve convex sets, but here the first set, given by this “{0}-norm”, is definitely not convex. One of the simplest algorithms for the convex feasibility problem is to alternatingly project onto both sets. This algorithm dates back to von Neumann and has been analyzed in great detail. To make this method work for non-convex sets one only needs to know how to project onto both sets. For the case of the equality constraint {Ax=b} one can use numerical linear algebra to obtain the projection. The non-convex constraint on the number of non-zero entries is in fact even easier: For {x\in{\mathbb R}^n} the projection onto {\{\|x\|_0\leq s\}} consists of just keeping the {s} largest entries of {x} while setting the others to zero (known as the “best {s}-term approximation”); see the small sketch after this list. However, the theory breaks down in the case of non-convex sets. Russell treated this problem in several papers (have a look at his publication page) and in the talk he focused on the question of constraint qualification, i.e. what kind of regularity has to be imposed on the intersection of the two sets. He could show that (local) linear convergence of the algorithm (which is observed numerically) can indeed be justified theoretically. One point which is still open is the phenomenon that the method seems to be convergent regardless of the initialization and that (even more surprisingly) the limit point seems to be independent of the starting point (and also seems to be robust with respect to overestimating the sparsity {s}). I wondered if his results are robust with respect to inexact projections. For larger problems the projection onto the equality constraint {Ax=b} is computationally expensive. For example, it would be interesting to see what happens if one approximates the projection with a truncated CG-iteration as Andreas, Marc and I did in our paper on subgradient methods for Basis Pursuit.

  • Joel Tropp reported on his paper Sharp recovery bounds for convex deconvolution, with applications together with Michael McCoy. However, in his talk he used demixing instead of deconvolution (which, I think, is more appropriate and leads to less confusion). With “demixing” they mean the following: Suppose you have two signals {x_0} and {y_0} of which you only observe the superposition of {x_0} and a unitarily transformed {y_0}, i.e. for a unitary matrix {U} you observe

    \displaystyle  z_0 = x_0 + Uy_0.

    Of course, without further assumptions there is no way to recover {x_0} and {y_0} from the knowledge of {z_0} and {U}. As one motivation he used the assumption that both {x_0} and {y_0} are sparse. After the big bang of compressed sensing nobody is surprised that one turns to convex optimization with {\ell^1}-norms in the following manner:

    \displaystyle   \min_{x,y} \|x\|_1 + \lambda\|y\|_1 \ \text{such that}\ x + Uy = z_0. \ \ \ \ \ (1)

    This looks a lot like sparse approximation: Eliminating {x} one obtains the unconstrained problem

    \displaystyle  \min_y \|z_0-Uy\|_1 + \lambda \|y\|_1.

    Phrased differently, this problem aims at finding an approximate sparse solution of {Uy=z_0} such that the residual (one could also say “noise”) {z_0-Uy=x} is also sparse. This differs from the common Basis Pursuit Denoising (BPDN) in the penalty used for the residual (which in BPDN is the squared {2}-norm). This is due to the fact that in BPDN one usually assumes Gaussian noise, which naturally leads to the squared {2}-norm. Well, one man’s noise is the other man’s signal, as we see here. Tropp and McCoy obtained very sharp thresholds on the sparsity of {x_0} and {y_0} which allow for exact recovery of both of them by solving (1). One thing which makes their analysis simpler is the following reformulation: They treated the related problem

    \displaystyle  \min_{x,y} \|x\|_1\ \text{such that}\ \|y\|_1\leq\alpha,\ x+Uy=z_0

    (which I would call the Ivanov version of the Tikhonov problem (1)). This allows for precise exploitation of prior knowledge by assuming that the number {\alpha_0 = \|y_0\|_1} is known.

    First I wondered if this reformulation was responsible for their unusually sharp results (sharper than the results for exact recovery by BPDN), but I think it’s not. I think this is due to the fact that they have this strong assumption on the “residual”, namely that it is sparse. This can be formulated with the help of the {1}-norm (which is “non-smooth”), in contrast to the smooth {2}-norm which is what one gets as prior for Gaussian noise. Moreover, McCoy and Tropp generalized their result to the case in which the structure of {x_0} and {y_0} is formulated by two functionals {f} and {g}, respectively. Assuming a kind of non-smoothness of {f} and {g}, they obtain the same kind of results; in particular, matrix decomposition problems are covered. A small numerical sketch of problem (1) follows right after the alternating-projections sketch below.
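
Here is the small alternating-projections sketch promised in the first item above. It is only a toy implementation of mine under simplifying assumptions (exact projection onto {\{Ax=b\}} via the pseudo-inverse, a Gaussian {A}, the sparsity {s} known), and not Luke’s code:

```
import numpy as np

def project_sparse(x, s):
    # best s-term approximation: keep the s largest entries in magnitude
    y = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-s:]
    y[idx] = x[idx]
    return y

def project_affine(x, A_pinv, A, b):
    # orthogonal projection onto the affine set {x : Ax = b}
    return x - A_pinv @ (A @ x - b)

def alternating_projections(A, b, s, iters=500):
    A_pinv = np.linalg.pinv(A)          # for large problems: replace by an (inexact) iterative solver
    x = A_pinv @ b                      # start from the least-norm solution
    for _ in range(iters):
        x = project_affine(project_sparse(x, s), A_pinv, A, b)
    return project_sparse(x, s)

# toy example: m x n system with m < n and an s-sparse solution
rng = np.random.default_rng(0)
m, n, s = 20, 50, 3
A = rng.standard_normal((m, n))
x_true = np.zeros(n)
x_true[rng.choice(n, s, replace=False)] = rng.standard_normal(s)
x_rec = alternating_projections(A, A @ x_true, s)
print(np.linalg.norm(x_rec - x_true))
```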
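
And here is a small sketch for the demixing problem (1) from the second item, again only my own illustration: I use cvxpy as a generic convex solver, a random orthogonal {U}, {\lambda=1}, and sparsity levels that are small compared to the dimension, so that one typically observes (numerically) exact recovery:

```
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(1)
n, s = 100, 5

# random unitary (here: orthogonal) matrix via QR
U, _ = np.linalg.qr(rng.standard_normal((n, n)))

x0 = np.zeros(n); x0[rng.choice(n, s, replace=False)] = rng.standard_normal(s)
y0 = np.zeros(n); y0[rng.choice(n, s, replace=False)] = rng.standard_normal(s)
z0 = x0 + U @ y0                      # the observed superposition

lam = 1.0
x, y = cp.Variable(n), cp.Variable(n)
problem = cp.Problem(cp.Minimize(cp.norm1(x) + lam * cp.norm1(y)),
                     [x + U @ y == z0])
problem.solve()

print(np.linalg.norm(x.value - x0), np.linalg.norm(y.value - y0))
```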

In this post I just collect a few papers that caught my attention in the last month.

I begin with Estimating Unknown Sparsity in Compressed Sensing by Miles E. Lopes. The abstract reads

Within the framework of compressed sensing, many theoretical guarantees for signal reconstruction require that the number of linear measurements {n} exceed the sparsity {\|x\|_0} of the unknown signal {x\in\mathbb{R}^p}. However, if the sparsity {\|x\|_0} is unknown, the choice of {n} remains problematic. This paper considers the problem of estimating the unknown degree of sparsity of {x} with only a small number of linear measurements. Although we show that estimation of {\|x\|_0} is generally intractable in this framework, we consider an alternative measure of sparsity {s(x):=\frac{\|x\|_1^2}{\|x\|_2^2}}, which is a sharp lower bound on {\|x\|_0}, and is more amenable to estimation. When {x} is a non-negative vector, we propose a computationally efficient estimator {\hat{s}(x)}, and use non-asymptotic methods to bound the relative error of {\hat{s}(x)} in terms of a finite number of measurements. Remarkably, the quality of estimation is dimension-free, which ensures that {\hat{s}(x)} is well-suited to the high-dimensional regime where {n<<p}. These results also extend naturally to the problem of using linear measurements to estimate the rank of a positive semi-definite matrix, or the sparsity of a non-negative matrix. Finally, we show that if no structural assumption (such as non-negativity) is made on the signal {x}, then the quantity {s(x)} cannot generally be estimated when {n<<p}.

It’s a nice combination of the observation that the quotient {s(x)} is a sharp lower bound for {\|x\|_0} and the fact that it is possible to estimate the one-norm and the two-norm of a vector {x} (with additional properties) from carefully chosen measurements. For a non-negative vector {x} you just measure with the constant-one vector, which (in a noisy environment) gives you an estimate of {\|x\|_1}. Similarly, measuring with Gaussian random vectors you can obtain an estimate of {\|x\|_2}.
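
Here is a rough numerical sketch of that idea (a simplification of mine, not Lopes’ estimator with its confidence bounds; sparsity level and number of Gaussian measurements chosen arbitrarily):

```
import numpy as np

rng = np.random.default_rng(0)
p, k = 10_000, 50                    # ambient dimension, number of Gaussian measurements

x = np.zeros(p)                      # a non-negative, 20-sparse vector
x[rng.choice(p, 20, replace=False)] = rng.random(20)

l1_est = np.ones(p) @ x              # measurement with the constant-one vector gives ||x||_1
gauss_meas = rng.standard_normal((k, p)) @ x
l2sq_est = np.mean(gauss_meas**2)    # E[(a^T x)^2] = ||x||_2^2 for standard Gaussian a

s_hat = l1_est**2 / l2sq_est
s_true = np.linalg.norm(x, 1)**2 / np.linalg.norm(x, 2)**2
print(s_hat, s_true, np.count_nonzero(x))   # s(x) is a lower bound for ||x||_0
```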

Then there is the dissertation of Dustin Mixon on the arxiv: Sparse Signal Processing with Frame Theory which is well worth reading but too long to provide a short overview. Here is the abstract:

Many emerging applications involve sparse signals, and their processing is a subject of active research. We desire a large class of sensing matrices which allow the user to discern important properties of the measured sparse signal. Of particular interest are matrices with the restricted isometry property (RIP). RIP matrices are known to enable efficient and stable reconstruction of sufficiently sparse signals, but the deterministic construction of such matrices has proven very difficult. In this thesis, we discuss this matrix design problem in the context of a growing field of study known as frame theory. In the first two chapters, we build large families of equiangular tight frames and full spark frames, and we discuss their relationship to RIP matrices as well as their utility in other aspects of sparse signal processing. In Chapter 3, we pave the road to deterministic RIP matrices, evaluating various techniques to demonstrate RIP, and making interesting connections with graph theory and number theory. We conclude in Chapter 4 with a coherence-based alternative to RIP, which provides near-optimal probabilistic guarantees for various aspects of sparse signal processing while at the same time admitting a whole host of deterministic constructions.

By the way, the thesis is dedicated “To all those who never dedicated a dissertation to themselves.”

Further we have Proximal Newton-type Methods for Minimizing Convex Objective Functions in Composite Form by Jason D. Lee, Yuekai Sun, Michael A. Saunders. This paper extends the well explored first order methods for problems of the type {\min g(x) + h(x)} with Lipschitz-differentiable {g} and simple {\mathrm{prox}_h} to second order Newton-type methods. The abstract reads

We consider minimizing convex objective functions in composite form

\displaystyle \min_{x\in\mathbb{R}^n} f(x) := g(x) + h(x)

where {g} is convex and twice-continuously differentiable and {h:\mathbb{R}^n\rightarrow\mathbb{R}} is a convex but not necessarily differentiable function whose proximal mapping can be evaluated efficiently. We derive a generalization of Newton-type methods to handle such convex but nonsmooth objective functions. Many problems of relevance in high-dimensional statistics, machine learning, and signal processing can be formulated in composite form. We prove such methods are globally convergent to a minimizer and achieve quadratic rates of convergence in the vicinity of a unique minimizer. We also demonstrate the performance of such methods using problems of relevance in machine learning and high-dimensional statistics.
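
For reference, here is a minimal sketch of the kind of first-order method that such proximal Newton schemes generalize: plain proximal gradient (ISTA) for {g(x)=\tfrac{1}{2}\|Ax-b\|_2^2} and {h(x)=\lambda\|x\|_1}, whose prox is soft-thresholding. This is my own illustration with random data and not the method of the paper:

```
import numpy as np

def soft_threshold(v, tau):
    # proximal mapping of tau*||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def proximal_gradient(A, b, lam, iters=500):
    # minimize 0.5*||Ax - b||^2 + lam*||x||_1 by forward-backward steps
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient of g
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        grad = A.T @ (A @ x - b)           # gradient of the smooth part g
        x = soft_threshold(x - grad / L, lam / L)   # prox step for h
    return x

rng = np.random.default_rng(3)
A = rng.standard_normal((40, 100))
b = rng.standard_normal(40)
print(np.count_nonzero(proximal_gradient(A, b, lam=1.0)))
```

A proximal Newton step would instead minimize a quadratic model of {g} built from (an approximation of) the Hessian plus {h}, i.e. it replaces the plain prox step by a scaled prox subproblem.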

With this post I say goodbye for a few weeks of holiday.

How many samples are needed to reconstruct a sparse signal?

Well, there are many, many results around, some of which you probably know (at least if you are following this blog or this one). Today I write about a neat result on the reconstruction of nonnegative sparse signals from a semi-continuous perspective, which I found quite some time ago.

1. From discrete sparse reconstruction/compressed sensing to semi-continuous

The basic sparse reconstruction problem asks the following: Say we have a vector {x\in{\mathbb R}^m} which only has {s<m} non-zero entries and a fat matrix {A\in{\mathbb R}^{n\times m}} (i.e. {n<m}), and consider that we are given measurements {b=Ax}. Of course, the system {Ax=b} is underdetermined. However, we may add a little more prior knowledge on the solution and ask: Is it possible to reconstruct {x} from {b} if we know that the vector {x} is sparse? If yes: How? Under what conditions on {m}, {s}, {n} and {A}? This question recently created the expanding universe of compressed sensing (and this universe is expanding so fast that for sure there has to be some dark energy in it). As a matter of fact, a powerful method to obtain sparse solutions to underdetermined systems is {\ell^1}-minimization, a.k.a. Basis Pursuit, on which I blogged recently: Solve

\displaystyle  \min_x \|x\|_1\ \text{s.t.}\ Ax=b

and the important ingredient here is the {\ell^1}-norm of the vector in the objective function.
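
Numerically, one standard route (among several) to solve this problem is to split {x=x^+-x^-} and pass to the equivalent linear program {\min\ \mathbf{1}^T(x^++x^-)} subject to {A(x^+-x^-)=b}, {x^\pm\geq 0}. Here is a small sketch of this route (problem sizes chosen arbitrarily):

```
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(A, b):
    # min ||x||_1 s.t. Ax = b, via the LP with the splitting x = xp - xm
    n, m = A.shape
    c = np.ones(2 * m)
    res = linprog(c, A_eq=np.hstack([A, -A]), b_eq=b,
                  bounds=(0, None), method="highs")
    return res.x[:m] - res.x[m:]

rng = np.random.default_rng(2)
n, m, s = 30, 80, 4                    # n measurements, dimension m, sparsity s
A = rng.standard_normal((n, m))
x_true = np.zeros(m)
x_true[rng.choice(m, s, replace=False)] = rng.standard_normal(s)
x_rec = basis_pursuit(A, A @ x_true)
print(np.linalg.norm(x_rec - x_true))  # typically (close to) zero for small s
```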

In this post I’ll formulate semi-continuous sparse reconstruction. We move from an {m}-vector {x} to a finite signed measure {\mu} on a closed interval (which we assume to be {I=[-1,1]} for simplicity). We may embed the {m}-vectors into the space of finite signed measures by choosing {m} points {t_i}, {i=1,\dots, m}, from the interval {I} and building {\mu = \sum_{i=1}^m x_i \delta_{t_i}} with the point-masses (or Dirac measures) {\delta_{t_i}}. To be a bit more precise, we speak about the space {\mathfrak{M}} of Radon measures on {I}, which are defined on the Borel {\sigma}-algebra of {I} and are finite. Radon measures are not very scary objects and an intuitive way to think of them is to use Riesz representation: Every Radon measure arises as a continuous linear functional on a space of continuous functions, namely the space {C_0(I)}, which is the closure of the continuous functions with compact support in {{]{-1,1}[}} with respect to the supremum norm. Hence, Radon measures act on these functions as {\int_I fd\mu}. It is also natural to speak of the support {\text{supp}(\mu)} of a Radon measure {\mu} and it holds for any continuous function {f} that

\displaystyle  \int_I f d\mu = \int_{\text{supp}(\mu)}f d\mu.

An important tool for Radon measures is the Hahn-Jordan decomposition which decomposes {\mu} into a positive part {\mu^+} and a negative part {\mu^-}, i.e. {\mu^+} and {\mu^-} are non-negative and {\mu = \mu^+-\mu^-}. Finally the variation of a measure, which is

\displaystyle  \|\mu\| = \mu^+(I) + \mu^-(I)

provides a norm on the space of Radon measures.

Example 1 For the measure {\mu = \sum_{i=1}^m x_i \delta_{t_i}} one readily calculates that

\displaystyle  \mu^+ = \sum_i \max(0,x_i)\delta_{t_i},\quad \mu^- = \sum_i \max(0,-x_i)\delta_{t_i}

and hence

\displaystyle  \|\mu\| = \sum_i |x_i| = \|x\|_1.

In this sense, the space of Radon measures provides a generalization of {\ell^1}.

We may sample a Radon measure {\mu} with {n+1} linear functionals and these can be encoded by {n+1} continuous functions {u_0,\dots,u_n} as

\displaystyle  b_k = \int_I u_k d\mu.

This sampling gives a bounded linear operator {K:\mathfrak{M}\rightarrow {\mathbb R}^{n+1}}. The generalization of Basis Pursuit is then given by

\displaystyle  \min_{\mu\in\mathfrak{M}} \|\mu\|\ \text{s.t.}\ K\mu = b.

This was introduced and called “Support Pursuit” in the preprint Exact Reconstruction using Support Pursuit by Yohann de Castro and Fabrice Gamboa.

More on the motivation and the use of Radon measures for sparsity can be found in Inverse problems in spaces of measures by Kristian Bredies and Hanna Pikkarainen.

2. Exact reconstruction of sparse nonnegative Radon measures

Before I talk about the results we may count the degrees of freedom a sparse Radon measure has: If {\mu = \sum_{i=1}^s x_i \delta_{t_i}} for some {s}, then {\mu} is determined by the {s} weights {x_i} and the {s} positions {t_i}. Hence, we expect that at least {2s} linear measurements should be necessary to reconstruct {\mu}. Surprisingly, this is almost enough if we know that the measure is nonnegative! We only need one more measurement, that is {2s+1}, and moreover, we can take fairly simple measurements, namely the monomials: {u_i(t) = t^i}, {i=0,\dots, n} (with the convention that {u_0(t)\equiv 1}). This is shown in the following theorem by de Castro and Gamboa.

Theorem 1 Let {\mu = \sum_{i=1}^s x_i\delta_{t_i}} with {x_i\geq 0}, {n=2s} and let {u_i}, {i=0,\dots n} be the monomials as above. Define {b_i = \int_I u_i(t)d\mu}. Then {\mu} is the unique solution of the support pursuit problem, that is of

\displaystyle  \min \|\nu\|\ \text{s.t.}\ K\nu = b.\qquad \textup{(SP)}

Proof: The following polynomial will be of importance: For a constant {c>0} define

\displaystyle  P(t) = 1 - c \prod_{i=1}^s (t-t_i)^2.

The following properties of {P} will be used:

  1. {P(t_i) = 1} for {i=1,\dots,s}
  2. {P} has degree {n=2s} and hence, is a linear combination of the {u_i}, {i=0,\dots,n}, i.e. {P = \sum_{k=0}^n a_k u_k}.
  3. For {c} small enough it holds for all {t\in I} with {t\neq t_i} that {|P(t)|<1}.

Now let {\sigma} be a solution of (SP). We have to show that {\|\mu\|\leq \|\sigma\|}. Due to property 2 we know that

\displaystyle  \int_I u_k d\sigma = (K\sigma)_k = b_k = \int_I u_k d\mu.

Due to property 1 and non-negativity of {\mu} we conclude that

\displaystyle  \begin{array}{rcl}  \|\mu\| & = & \sum_{i=1}^s x_i = \int_I P d\mu\\ & = & \int_I \sum_{k=0}^n a_k u_k d\mu\\ & = & \sum_{k=0}^n a_k \int_I u_k d\mu\\ & = & \sum_{k=0}^n a_k \int_I u_k d\sigma\\ & = & \int_I P d\sigma. \end{array}

Moreover, by Lebesgue’s decomposition we can decompose {\sigma} with respect to {\mu} such that

\displaystyle  \sigma = \underbrace{\sum_{i=1}^s y_i\delta_{t_i}}_{=\sigma_1} + \sigma_2

and {\sigma_2} is singular with respect to {\mu}. We get

\displaystyle  \begin{array}{rcl}  \int_I P d\sigma = \sum_{i=1}^s y_i + \int P d\sigma_2 \leq \|\sigma_1\| + \|\sigma_2\|=\|\sigma\| \end{array}

and we conclude that {\|\sigma\| = \|\mu\|} (note that {\|\sigma\|\leq \|\mu\|} holds anyway since {\sigma} is a minimizer and {\mu} is feasible) and especially {\int_I P d\sigma_2 = \|\sigma_2\|}. This shows that {\mu} is a solution to (SP). It remains to show uniqueness. We show the following: If there is a {\nu\in\mathfrak{M}} with support in {I\setminus\{t_1,\dots,t_s\}} such that {\int_I Pd\nu = \|\nu\|}, then {\nu=0}. To see this, we build, for any {r>0}, the sets

\displaystyle  \Omega_r = [-1,1]\setminus \bigcup_{i=1}^s\ ]t_i-r,t_i+r[

and assume that there exists {r>0} such that {\|\nu|_{\Omega_r}\|\neq 0} ({\nu|_{\Omega_r}} denoting the restriction of {\nu} to {\Omega_r}). However, it holds by property 3 of {P} that

\displaystyle  \int_{\Omega_r} P d\nu < \|\nu|_{\Omega_r}\|

and consequently

\displaystyle  \begin{array}{rcl}  \|\nu\| &=& \int Pd\nu = \int_{\Omega_r} Pd\nu + \int_{\Omega_r^C} P d\nu\\ &<& \|\nu|_{\Omega_r}\| + \|\nu|_{\Omega_r^C}\| = \|\nu\| \end{array}

which is a contradiction. Hence, {\nu|_{\Omega_r}=0} for all {r} and this implies {\nu=0}. Since {\sigma_2} has its support in {I\setminus\{t_1,\dots,t_s\}} we conclude that {\sigma_2=0}. Hence, the support of {\sigma} is contained in {\{t_1,\dots,t_s\}}, i.e. {\sigma=\sum_{i=1}^s y_i\delta_{t_i}}. Moreover, {K\sigma = b = K\mu} and hence {K(\sigma-\mu) = 0}. This can be written as a Vandermonde system

\displaystyle  \begin{pmatrix} u_0(t_1)& \dots &u_0(t_s)\\ \vdots & & \vdots\\ u_n(t_1)& \dots & u_n(t_s) \end{pmatrix} \begin{pmatrix} y_1 - x_1\\ \vdots\\ y_s - x_s \end{pmatrix} = 0

which only has the zero solution, giving {y_i=x_i}. \Box
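
To see the theorem at work numerically, here is a sketch of my own (not from the de Castro–Gamboa paper): I discretize {I} by a fine grid which contains the true spike positions and solve the support pursuit problem over measures supported on the grid as a linear program via the splitting {\nu=\nu^+-\nu^-}; the {2s+1} monomial moments then single out the true nonnegative measure.

```
import numpy as np
from scipy.optimize import linprog

s = 3
grid = np.linspace(-1.0, 1.0, 401)        # fine grid on I = [-1,1]
t_true = grid[[80, 220, 340]]             # spike positions -0.6, 0.1, 0.7 (grid points)
x_true = np.array([1.0, 0.5, 2.0])        # nonnegative weights
N = grid.size

K = np.vstack([grid**k for k in range(2 * s + 1)])    # monomial measurements u_k(t) = t^k
b = np.vstack([t_true**k for k in range(2 * s + 1)]) @ x_true

# support pursuit on the grid: min ||nu|| s.t. K nu = b, with nu = nu_plus - nu_minus
c = np.ones(2 * N)
res = linprog(c, A_eq=np.hstack([K, -K]), b_eq=b, bounds=(0, None), method="highs")
nu = res.x[:N] - res.x[N:]

mask = np.abs(nu) > 1e-6
print(grid[mask], nu[mask])               # should reproduce t_true and x_true
```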

3. Generalization to other measurements

The measurement by monomials may sound a bit unusual. However, de Castro and Gamboa show more. What really matters here is that the monomials form a so-called Chebyshev system (or Tchebyscheff-system or T-system – by the way, have you ever tried to google for a T-system?). This is explained, for example, in the book “Tchebycheff Systems: With Applications in Analysis and Statistics” by Karlin and Studden. A T-system on {I} is simply a set of {n+1} functions {\{u_0,\dots, u_n\}} such that any non-trivial linear combination of these functions has at most {n} zeros. These systems are named after Tchebyscheff since they obey many of the helpful properties of the Tchebyscheff polynomials.

What is helpful in our context is the following theorem of Krein:

Theorem 2 (Krein) If {\{u_0,\dots,u_n\}} is a T-system for {I}, {k\leq n/2} and {t_1,\dots,t_k} are in the interior of {I}, then there exists a linear combination {\sum_{i=0}^n a_i u_i} which is non-negative and vanishes exactly at the points {t_1,\dots,t_k}.

Now consider that we replace the monomials in Theorem 1 by a T-system. You recognize that Krein’s Theorem allows one to construct a “generalized polynomial” which fulfills the same requirements as the polynomial {P} in the proof of Theorem 1, as soon as the constant function 1 lies in the span of the T-system, and indeed the result of Theorem 1 is also valid in that case.

4. Exact reconstruction of {s}-sparse nonnegative vectors from {2s+1} measurements

From the above one can deduce a reconstruction result for {s}-sparse vectors and I quote Theorem 2.4 from Exact Reconstruction using Support Pursuit:

Theorem 3 Let {n}, {m}, {s} be integers such that {s\leq \min(n/2,m)} and let {\{1,u_1,\dots,u_n\}} be a complete T-system on {I} (that is, {\{1,u_1,\dots,u_r\}} is a T-system on {I} for all {r<n}). Then it holds: For any distinct reals {t_1,\dots,t_m} and {A} defined as

\displaystyle  A=\begin{pmatrix} 1 & \dots & 1\\ u_1(t_1)& \dots &u_1(t_m)\\ \vdots & & \vdots\\ u_n(t_1)& \dots & u_n(t_m) \end{pmatrix}

Basis Pursuit recovers all nonnegative {s}-sparse vectors {x\in{\mathbb R}^m}.

5. Concluding remarks

Note that Theorem 3 gives a deterministic construction of a measurement matrix.

Also note that nonnegativity is crucial in what we did here. This allowed us (in the monomial case) to work with squares and obtain the polynomial {P} in the proof of Theorem 1 (which is also called a “dual certificate” in this context). This raises the question of how this method can be adapted to general sparse signals. One needs (in the monomial case) a polynomial which is bounded by 1 in absolute value but matches the signs of the measure on its support. While this can be done (I think) for polynomials, it seems difficult to obtain a generalization of Krein’s Theorem to this case…