Sparsity


Let {\Omega} be a compact subset of {{\mathbb R}^d} and consider the space {C(\Omega)} of continuous functions {f:\Omega\rightarrow {\mathbb R}} with the usual supremum norm. The Riesz Representation Theorem states that the dual space of {C(\Omega)} is in this case the set of all Radon measures, denoted by {\mathfrak{M}(\Omega)} and the canonical duality pairing is given by

\displaystyle  \langle\mu,f\rangle = \mu(f) = \int_\Omega fd\mu.

We can equip {\mathfrak{M}(\Omega)} with the usual notion of weak* convergence which read as

\displaystyle  \mu_n\rightharpoonup^* \mu\ \iff\ \text{for every}\ f:\ \mu_n(f)\rightarrow\mu(f).

We call a measure {\mu} positive if {f\geq 0} implies that {\mu(f)\geq 0}. If a positive measure satisfies {\mu(1)=1} (i.e. it integrates the constant function with unit value to one), we call it a probability measure and we denote with {\Delta\subset \mathfrak{M}(\Omega)} the set of all probability measures.

Example 1 Every non-negative integrable function {\phi:\Omega\rightarrow{\mathbb R}} with {\int_\Omega \phi(x)dx} induces a probability measure via

\displaystyle  f\mapsto \int_\Omega f(x)\phi(x)dx.

Quite different probability measures are the {\delta}-measures: For every {x\in\Omega} there is the {\delta}-measure at this point, defined by

\displaystyle  \delta_x(f) = f(x).

In some sense, the set {\Delta} of probability measure is the generalization of the standard simplex in {{\mathbb R}^n} to infinite dimensions (in fact uncountably many dimensions): The {\delta}-measures are the extreme points of {\Delta} and since the set {\Delta} is compact in the weak* topology, the Krein-Milman Theorem states that {\Delta} is the weak*-closure of the set of convex combinations of the {\delta}-measures – similarly as the standard simplex in {{\mathbb R}^n} is the convex combination of the canonical basis vectors of {{\mathbb R}^n}.

Remark 1 If we drop the positivity assumption and form the set

\displaystyle  O = \{\mu\in\mathfrak{M}(\Omega)\ :\ |f|\leq 1\implies |\mu(f)|\leq 1\}

we have the {O} is the set of convex combinations of the measures {\pm\delta_x} ({x\in\Omega}). Hence, {O} resembles the hyper-octahedron (aka cross polytope or {\ell^1}-ball).

I’ve taken the above (with almost similar notation) from the book “ A Course in Convexity” by Alexander Barvinok. I was curious to find (in Chapter III, Section 9) something which reads as a nice glimpse on semi-continuous compressed sensing: Proposition 9.4 reads as follows

Proposition 1 Let {g,f_1,\dots,f_m\in C(\Omega)}, {b\in{\mathbb R}^m} and suppose that the subset {B} of {\Delta} consisting of the probability measures {\mu} such that for {i=1,\dots,m}

\displaystyle  \int f_id\mu = b_i

is not empty. Then there exists {\mu^+,\mu^-\in B} such that

  1. {\mu^+} and {\mu^-} are convex combinations of at most {m+1} {\delta}-measures, and
  2. it holds that for all {\mu\in B} we have

    \displaystyle  \mu^-(g)\leq \mu(g)\leq \mu^+(g).

In terms of compressed sensing this says: Among all probability measures which comply with the data {b} measured by {m} linear measurements, there are two extremal ones which consists of {m+1} {\delta}-measures.

Note that something similar to “support-pursuit” does not work here: The minimization problem {\min_{\mu\in B, \mu(f_i)=b_i}\|\mu\|_{\mathfrak{M}}} does not make much sense, since {\|\mu\|_{\mathfrak{M}}=1} for all {\mu\in B}.

ISMP LogoISMP is over now and I’m already home. I do not have many things to report on from the last day. This is not due the lower quality of the talks but due to the fact that I was a little bit exhausted, as usual at the end of a five-day conference. However, I collect a few things for the record:

  • In the morning I visited the semi-planary by Xiaojun Chenon non-convex and non-smooth minimization with smoothing methods. Not surprisingly, she treated the problem

    \displaystyle \min_x f(x) + \|x\|_p^p

    with convex and smooth {f:{\mathbb R}^n\rightarrow{\mathbb R}} and {0<p<1}. She proposed and analyzed smoothing methods, that is, to smooth the problem a bit to obtain a Lipschitz-continuous objective function {\phi_\epsilon}, minimizing this and then gradually decreasing {\epsilon}. This works, as she showed. If I remember correctly, she also treated “iteratively reweighted least squares” as I described in my previous post. Unfortunately, she did not include the generalized forward-backward methods based on {\text{prox}}-functions for non-convex functions. Kristian and I pursued this approach in our paper Minimization of non-smooth, non-convex functionals by iterative thresholding and some special features of our analysis include:

    • A condition which excludes some (but not all) local minimizers from being global.
    • An algorithm which avoids this non-global minimizers by carefully adjusting the steplength of the method.
    • A result that the number of local minimizers is still finite, even if the problem is posed in {\ell^2({\mathbb N})} and not in {{\mathbb R}^n}.

    Most of our results hold true, if the {p}-quasi-norm is replaced by functions of the form

    \displaystyle \sum_n \phi_n(|x_n|)

    with special non-convex {\phi}, namely fulfilling a list of assumptions like

    • {\phi'(x) \rightarrow \infty} for {x\rightarrow 0} (infinite slope at {0}) and {\phi(x)\rightarrow\infty} for {x\rightarrow\infty} (mild coercivity),
    • {\phi'} strictly convex on {]0,\infty[} and {\phi'(x)/x\rightarrow 0} for {x\rightarrow\infty},
    • for each {b>0} there is {a>0} such that for {x<b} it holds that {\phi(x)>ax^2}, and
    • local integrability of some section of {\partial\phi'(x) x}.

    As one easily sees, {p}-quasi-norms fulfill the assumptions and some other interesting functions as well (e.g. some with very steep slope at {0} like {x\mapsto \log(x^{1/3}+1)}).

  • Jorge Nocedalgave a talk on second-order methods for non-smooth problems and his main example was a functional like

    \displaystyle \min_x f(x) + \|x\|_1

    with a convex and smooth {f}, but different from Xiaojun Chen, he only considered the {1}-norm. His talked is among the best planary talks I have ever attended and it was a great pleasure to listen to him. He carefully explained things and put them in perspective. In the case he skipped slides, he made me feel that I either did not miss an important thing, or understood them even though he didn’t show them He argued that it is not necessarily more expensive to use second order information in contrast to first order methods. Indeed, the {1}-norm can be used to reduce the number of degrees of freedom for a second order step. What was pretty interesting is, that he advocated semismooth Newton methods for this problem. Roland and I pursued this approach some time ago in our paper A Semismooth Newton Method for Tikhonov Functionals with Sparsity Constraints and, if I remember correctly (my notes are not complete at this point), his family of methods included our ssn-method. The method Roland and I proposed worked amazingly well in the cases in which it converged but the method suffered from non-global convergence. We had some preliminary ideas for globalization, which we could not tune enough to retain the speed of the method, and abandoned the topic. Now, that the topic will most probably be revived by the community, I am looking forward to fresh ideas here.

Today I report on two things I came across here at ISMP:

  • The first is a talk by Russell Luke on Constraint qualifications for nonconvex feasibility problems. Luke treated the NP-hard problem of sparsest solutions of linear systems. In fact he did not tackle this problem but the problem to find an {s}-sparse solution of an {m\times n} system of equations. He formulated this as a feasibility-problem (well, Heinz Bauschke was a collaborator) as follows: With the usual malpractice let us denote by {\|x\|_0} the number of non-zero entries of {x\in{\mathbb R}^n}. Then the problem of finding an {s}-sparse solution to {Ax=b} is:

    \displaystyle  \text{Find}\ x\ \text{in}\ \{\|x\|_0\leq s\}\cap\{Ax=b\}.

    In other words: find a feasible point, i.e. a point which lies in the intersection of the two sets. Well, most often feasibility problems involve convex sets but here, the first one given by this “{0}-norm” is definitely not convex. One of the simplest algorithms for the convex feasibility problem is to alternatingly project onto both sets. This algorithm dates back to von Neumann and has been analyzed in great detail. To make this method work for non-convex sets one only needs to know how to project onto both sets. For the case of the equality constraint {Ax=b} one can use numerical linear algebra to obtain the projection. The non-convex constraint on the number of non-zero entries is in fact even easier: For {x\in{\mathbb R}^n} the projection onto {\{\|x\|_0\leq s\}} consists of just keeping the {s} largest entries of {x} while setting the others to zero (known as the “best {s}-term approximation”). However, the theory breaks down in the case of non-convex sets. Russell treated problem in several papers (have a look at his publication page) and in the talk he focused on the problem of constraint qualification, i.e. what kind of regularity has to be imposed on the intersection of the two sets. He could shows that (local) linear convergence of the algorithm (which is observed numerically) can indeed be justified theoretically. One point which is still open is the phenomenon that the method seems to be convergent regardless of the initialization and that (even more surprisingly) that the limit point seems to be independent of the starting point (and also seems to be robust with respect to overestimating the sparsity {s}). I wondered if his results are robust with respect to inexact projections. For larger problems the projection onto the equality constraint {Ax=b} are computationally expensive. For example it would be interesting to see what happens if one approximates the projection with a truncated CG-iteration as Andreas, Marc and I did in our paper on subgradient methods for Basis Pursuit.

  • Joel Tropp reported on his paper Sharp recovery bounds for convex deconvolution, with applications together with Michael McCoy. However, in his title he used demixing instead of deconvolution (which, I think, is more appropriate and leads to less confusion). With “demixing” they mean the following: Suppose you have two signals {x_0} and {y_0} of which you observe only the superposition of {x_0} and a unitarily transformed {y_0}, i.e. for a unitary matrix {U} you observe

    \displaystyle  z_0 = x_0 + Uy_0.

    Of course, without further assumptions there is no way to recover {x_0} and {y_0} from the knowledge of {z_0} and {U}. As one motivation he used the assumption that both {x_0} and {y_0} are sparse. After the big bang of compressed sensing nobody wonders that one turns to convex optimization with {\ell^1}-norms in the following manner:

    \displaystyle   \min_{x,y} \|x\|_1 + \lambda\|y\|_1 \ \text{such that}\ x + Uy = z_0. \ \ \ \ \ (1)

    This looks a lot like sparse approximation: Eliminating {x} one obtains the unconstraint problem \begin{equation*} \min_y \|z_0-Uy\|_1 + \lambda \|y\|_1. \end{equation*}

    Phrased differently, this problem aims at finding an approximate sparse solution of {Uy=z_0} such that the residual (could also say “noise”) {z_0-Uy=x} is also sparse. This differs from the common Basis Pursuit Denoising (BPDN) by the structure function for the residual (which is the squared {2}-norm). This is due to the fact that in BPDN one usually assumes Gaussian noise which naturally lead to the squared {2}-norm. Well, one man’s noise is the other man’s signal, as we see here. Tropp and McCoy obtained very sharp thresholds on the sparsity of {x_0} and {y_0} which allow for exact recovery of both of them by solving (1). One thing which makes their analysis simpler is the following reformulation: The treated the related problem \begin{equation*} \min_{x,y} \|x\|_1 \text{such that} \|y\|_1\leq\alpha, x+Uy=z_0 \end{equation*} (which I would call this the Ivanov version of the Tikhonov-problem (1)). This allows for precise exploitation of prior knowledge by assuming that the number {\alpha_0 = \|y_0\|_1} is known.

    First I wondered if this reformulation was responsible for their unusual sharp results (sharper the results for exact recovery by BDPN), but I think it’s not. I think this is due to the fact that they have this strong assumption on the “residual”, namely that it is sparse. This can be formulated with the help of {1}-norm (which is “non-smooth”) in contrast to the smooth {2}-norm which is what one gets as prior for Gaussian noise. Moreover, McCoy and Tropp generalized their result to the case in which the structure of {x_0} and {y_0} is formulated by two functionals {f} and {g}, respectively. Assuming a kind of non-smoothness of {f} and {g} the obtain the same kind of results and especially matrix decomposition problems are covered.

The second day of ISMP started (for me) with the session I organized and chaired.

The first talk was by Michael Goldman on Continuous Primal-Dual Methods in Image Processing. He considered the continuous Arrow-Hurwitz method for saddle point problems

\displaystyle \min_{u}\max_{\xi} K(u,\xi)

with {K} convex in the first and concave in the second variable. The continuous Arrow-Hurwitz method consists of solving

\displaystyle \begin{array}{rcl} \partial_t u(t) &=& -\nabla_u K(u(t),\xi(t))\\ \partial_t \xi(t) &=& \nabla_\xi K(u(t),\xi(t)). \end{array}

His talk evolved around the problem if {K} comes from a functional which contains the total variation, namely he considered

\displaystyle K(u,\xi) = -\int_\Omega u\text{div}(\xi) + G(u)

with the additional constraints {\xi\in C^1_C(\Omega,{\mathbb R}^2)} and {|\xi|\leq 1}. For the case of {G(u) = \lambda\|u-f\|^2/2} he presented a nice analysis of the problem including convergence of the method to a solution of the primal problem and some a-posteriori estimates. This reminded me of Showalters method for the regularization of ill-posed problems. The Arrow-Hurwitz method looks like a regularized version of Showalters method and hence, early stopping does not seem to be necessary for regularization. The related paper is Continuous Primal-Dual Methods for Image Processing.

The second talk was given by Elias Helou and was on Incremental Algorithms for Convex Non-Smooth Optimization with Applications in Image Reconstructions. He presented his work on a very general framework for problems of the class

\displaystyle \min_{x\in X} f(x)

with a convex function {f} and a convex set {X}. Basically, he abstracted the properties of the projected subgradient method. This consists of taking subgradient descent steps for {f} followed by projection onto {X} iteratively: With a subgradient {g^n\in\partial f(x^n)} this reads as

\displaystyle x^{n+1} = P_X(x^n -\alpha_n g^n)

he extracted the conditions one needs from the subgradient descent step and from the projection step and formulated an algorithm which consists of successive application of an “optimality operator” {\mathcal{O}_f} (replacing the subgradient step) and a feasibility operator {\mathcal{F}_X} (replacing the projection step). The algorithm then reads as

\displaystyle \begin{array}{rcl} x^{n+1/2} &=& \mathcal{O}_f(x^n,\alpha_n)\\ x^{n+1} &=& \mathcal{F}_x(x^{n+1/2} \end{array}

and he showed convergence under the extracted conditions. The related paper is , Incremental Subgradients for Constrained Convex Optimization: a Unified Framework and New Methods.

The third talk was by Jerome Fehrenbach on Stripes removel in images, apllications in microscopy. He considered the problem of very specific noise which is appear in the form of stripes (and appears, for example, “single plane illumination microscopy”). In fact he considered a little more general case and the model he proposed was as follows: The observed image is

\displaystyle u_{\text{OBS}} = u + n,

i.e. the usual sum of the true image {u} and noise {n}. However, for the noise he assumed that it is given by

\displaystyle n = \sum_{j=1}^m \psi_j*\lambda_j,

i.e. it is a sum of different convolutions. The {\psi_j} are kind of shape-functions which describe the “pattern of the noise” and the {\lambda_j} are samples of noise processes, following specific distributions (could be white noise realizations, impulsive noise or something else)-. He then formulated a variational method to identify the variables {\lambda_j} which reads as

\displaystyle \min \|\nabla(u_{\text{OBS}} - \sum_{j=1}^m \psi_j*\lambda_j)\|_1 + \sum_j \phi_j(\lambda_j).

Basically, this is the usual variational approach to image denoising, but nor the optimization variable is the noise rather than the image. This is due to the fact that the noise has a specific complicated structure and the usual formulation with {u = u_{\text{OBS}} +n} is not feasible. He used the primal-dual algorithm by Chambolle and Pock for this problem and showed that the method works well on real world problems.

Another theme which caught my attention here is “optimization with variational inequalities as constraints”. At first glance that sounds pretty awkward. Variational inequalities can be quite complicated things and why on earth would somebody considers these things as side conditions in optimization problems? In fact there are good reasons to do so. One reason is, if you have to deal with bi-level optimization problems. Consider an optimization problem

\displaystyle \min_{x\in C} F(x,p) \ \ \ \ \ (1)

 

with convex {C} and {F(\cdot,p)} (omitting regularity conditions which could be necessary to impose) depending on a parameter {p}. Now consider the case that you want to choose the parameter {p} in an optimal way, i.e. it solves another optimization problem. This could look like

\displaystyle \min_p G(x)\quad\text{s.t.}\ x\ \text{solves (1)}. \ \ \ \ \ (2)

 

Now you have an optimization problem as a constraint. Now we use the optimality condition for the problem~(1): For differentiable {F}, {x^*} solves~(1) if and only if

\displaystyle \forall y\in C:\ \nabla_x F(x^*(p),p)(y-x^*(p))\geq 0.

In other words: We con reformulate (2) as

\displaystyle \min_p G(x)\quad\text{s.t.}\ \forall y\in C:\ \nabla_x F(x^*(p),p)(y-x^*(p))\geq 0. \ \ \ \ \ (3)

 

And there it is, our optimization problem with a variational inequality as constraint. Here at ISMP there are entire sessions devoted to this, see here and here.

The scientific program at ISMP started today and I planned to write a small personal summary of each day. However, it is a very intense meeting. Lot’s of excellent talks, lot’s of people to meet and little spare time. So I’m afraid that I have to deviate from my plan a little bit. Instead of a summary of every day I just pick out a few events. I remark that these picks do not reflect quality, significance or something like this in any way. I just pick things for which I have something to record for personal reasons.

My day started after the first plenary which the session Testing environments for machine learning and compressed sensing in which my own talk was located. The session started with the talk by Michael Friedlander of the SPOT toolbox. Haven’t heard of SPOT yet? Take a look! In a nutshell its a toolbox which turns MATLAB into “OPLAB”, i.e. it allows to treat abstract linear operators like matrices. By the way, the code is on github.

The second talk was by Katya Scheinberg (who is giving a semi-planary talk on derivative free optimization at the moment…). She talked about speeding up FISTA by cleverly adjusting step-sizes and over-relaxation parameters and generalizing these ideas to other methods like alternating direction methods. Notably, she used the “SPEAR test instances” from our project homepage! (And credited them as “surprisingly hard sparsity problems”.)

My own talk was the third and last one in that session. I talked about the issue of constructing test instance for Basis Pursuit Denoising. I argued that the naive approach (which takes a matrix {A}, a right hand side {b} and a parameter {\lambda} and let some great solver run for a while to obtain a solution {x^*}) may suffer from “trusted method bias”. I proposed to use “reverse instance construction” which is: First choose {A}, {\lambda} and the solution {x^*} and the construct the right hand side {b} (I blogged on this before here).

Last but not least, I’d like to mention the talk by Thomas Pock: He talked about parameter selection on variational models (think of the regularization parameter in Tikhonov, for example). In a paper with Karl Kunisch titled A bilevel optimization approach for parameter learning in variational models they formulated this as a bi-level optimization problem. An approach which seemed to have been overdue! Although they treat somehow simple inverse problems (well, denoising) (but with not so easy regularizers) it is a promising first step in this direction.

In this post I just collect a few papers that caught my attention in the last moth.

I begin with Estimating Unknown Sparsity in Compressed Sensing by Miles E. Lopes. The abstract reads

Within the framework of compressed sensing, many theoretical guarantees for signal reconstruction require that the number of linear measurements {n} exceed the sparsity {\|x\|_0} of the unknown signal {x\in\mathbb{R}^p}. However, if the sparsity {\|x\|_0} is unknown, the choice of {n} remains problematic. This paper considers the problem of estimating the unknown degree of sparsity of {x} with only a small number of linear measurements. Although we show that estimation of {\|x\|_0} is generally intractable in this framework, we consider an alternative measure of sparsity {s(x):=\frac{\|x\|_1^2}{\|x\|_2^2}}, which is a sharp lower bound on {\|x\|_0}, and is more amenable to estimation. When {x} is a non-negative vector, we propose a computationally efficient estimator {\hat{s}(x)}, and use non-asymptotic methods to bound the relative error of {\hat{s}(x)} in terms of a finite number of measurements. Remarkably, the quality of estimation is dimension-free, which ensures that {\hat{s}(x)} is well-suited to the high-dimensional regime where {n<<p}. These results also extend naturally to the problem of using linear measurements to estimate the rank of a positive semi-definite matrix, or the sparsity of a non-negative matrix. Finally, we show that if no structural assumption (such as non-negativity) is made on the signal {x}, then the quantity {s(x)} cannot generally be estimated when {n<<p}.

It’s a nice combination of the observation that the quotient {s(x)} is a sharp lower bound for {\|x\|_0} and that it is possible to estimate the one-norm and the two norm of a vector {x} (with additional properties) from carefully chosen measurements. For a non-negative vector {x} you just measure with the constant-one vector which (in a noisy environment) gives you an estimate of {\|x\|_1}. Similarly, measuring with Gaussian random vector you can obtain an estimate of {\|x\|_2}.

Then there is the dissertation of Dustin Mixon on the arxiv: Sparse Signal Processing with Frame Theory which is well worth reading but too long to provide a short overview. Here is the abstract:

Many emerging applications involve sparse signals, and their processing is a subject of active research. We desire a large class of sensing matrices which allow the user to discern important properties of the measured sparse signal. Of particular interest are matrices with the restricted isometry property (RIP). RIP matrices are known to enable efficient and stable reconstruction of sfficiently sparse signals, but the deterministic construction of such matrices has proven very dfficult. In this thesis, we discuss this matrix design problem in the context of a growing field of study known as frame theory. In the first two chapters, we build large families of equiangular tight frames and full spark frames, and we discuss their relationship to RIP matrices as well as their utility in other aspects of sparse signal processing. In Chapter 3, we pave the road to deterministic RIP matrices, evaluating various techniques to demonstrate RIP, and making interesting connections with graph theory and number theory. We conclude in Chapter 4 with a coherence-based alternative to RIP, which provides near-optimal probabilistic guarantees for various aspects of sparse signal processing while at the same time admitting a whole host of deterministic constructions.

By the way, the thesis is dedicated “To all those who never dedicated a dissertation to themselves.”

Further we have Proximal Newton-type Methods for Minimizing Convex Objective Functions in Composite Form by Jason D Lee, Yuekai Sun, Michael A. Saunders. This paper extends the well explored first order methods for problem of the type {\min g(x) + h(x)} with Lipschitz-differentiable {g} or simple {\mathrm{prox}_h} to second order Newton-type methods. The abstract reads

We consider minimizing convex objective functions in composite form

\displaystyle \min_{x\in\mathbb{R}^n} f(x) := g(x) + h(x)

where {g} is convex and twice-continuously differentiable and {h:\mathbb{R}^n\rightarrow\mathbb{R}} is a convex but not necessarily differentiable function whose proximal mapping can be evaluated efficiently. We derive a generalization of Newton-type methods to handle such convex but nonsmooth objective functions. Many problems of relevance in high-dimensional statistics, machine learning, and signal processing can be formulated in composite form. We prove such methods are globally convergent to a minimizer and achieve quadratic rates of convergence in the vicinity of a unique minimizer. We also demonstrate the performance of such methods using problems of relevance in machine learning and high-dimensional statistics.

With this post I say goodbye for a few weeks of holiday.

In this post I gladly announce that three problems that bothered me have been solved: The computational complexity of certifying RIP and NSP and the number of steps the homotopy method needs to obtain a solution of the Basis Pursuit problem.

1. Complexity of RIP and NSP

On this issue we have two papers:

The first paper has the more general results and hence, we start with the second one: The main result of the second paper is this:

Theorem 1 Let a matrix {A}, a positive integer {K} and some {0<\delta<1} be given. It is hard for NP under randomized polynomial-time reductions to check if {A} satisfies the {(K,\delta)} restricted isometry property.

That does not yet say that it’s NP-hard to check if {\delta} is an RIP constant for {K}-sparse vectors but it’s close. I think that Dustin Mixon has explained this issue better on his blog than I could do here.

In the first paper (which is, by the way, on outcome of the SPEAR-project in which I am involved…) the main result is indeed the conjectured NP-hardness of calculating RIP constants:

Theorem 2 For a given matrix {A} and a positive integer {K}, it is NP-hard to compute the restricted isometry constant.

Moreover, this is just a corollary to the main theorem of that paper which reads as

Theorem 3 For a given matrix {A} and a positive integer {K}, the problem to decide whether {A} satisfies the restricted isometry property of order {K} for some constant {\delta<1} is coNP-complete.

They also provide a slightly strengthened version of Theorem~1:

Theorem 4 Let a matrix {A}, a positive integer {K} and some {0<\delta<1} be given. It is coNP-complete to check if {A} satisfies the {(K,\delta)} restricted isometry property.

Moreover, the paper by Pfetsch and Tillmann also proves something about the null space property (NSP):

Definition 5 A matrix {A} satisfies the null space property of order {K} if there is a constant {\alpha>0} such that for all elements {x} in the null space of {A} it holds that the sum of the {K} largest absolute values of {x} is smaller that {\alpha} times the 1-norm of {x}. The smallest such constant {\alpha} is called the null space constant of order {K}.

Their main result is as follows:

Theorem 6 F or a given matrix {A} and a positive integer {K}, the problem to decide whether {A} satisfies the null space property order {K} for some constant {\alpha<1} is coNP-complete. Consequently, it is NP-hard to compute the null space constant of {A}.

2. Complexity of the homotopy method for Basis Pursuit

The second issue is about the basis pursuit problem

\displaystyle  \min_x \|x\|_1\quad\text{s.t.}\ Ax=b.

which can be approximated by the “denoising variant”

\displaystyle  \min_x \lambda\|x\|_1 + \tfrac12\|Ax-b\|_2^2.

What is pretty interesting about the denoising variant is, that the solution {x(\lambda)} (if it is unique throughout) depends on {\lambda} in a piecewise linear way and converges to the solution of basis pursuit for {\lambda\rightarrow 0}. This leads to an algorithm for the solution of basis pursuit: Start with {\lambda=\|A^Tb\|_\infty} (for which the unique solution is {x(\lambda)=0}), calculate the direction of the “solution path”, follow it until you reach a “break point”, calculate the next direction and so on until {\lambda} hits zero. This is for example implemented for MATLAB in L1Homotopy (the SPAMS package also seems to have this implemented, however, I haven’t used it yet). In practice, this approach (usually called homotopy method) is pretty fast and moreover, only detects a few break points. However, an obvious upper bound on the number of break point is exponential in the number of entries in {x}. Hence, it seemed that one was faced with a situation similar to the simplex method for linear programming: The algorithms performs great an average but the worst case complexity is bad. That this is really true for linear programming is known since some time by the Klee-Minty example, an example for which the simplex method takes an exponential number of steps. What I asked myself for some time: Is there a Klee-Minty example for the homotopy method?

Now the answer is there: Yes, there is!

The denoising variant of basis pursuit is also known as LASSO regularization in the statistics literature and this explains the title of the paper which comes up with the example:

Julien and Bin investigate the number of linear segments in the regularization path and first observe that this is upper bounded by {(3^p+1)/2} is {p} is the number of entries in {x} (i.e. the number of variables of the problem). Then they try to construct an instance that matches this upper bound. They succeed in a clever way: For a given instance {(A,b)} with a path with {k} linear segments they try to construct an instance which has one more variable such that the number of linear segments in increased by a factor. Their result goes like this:

Theorem 7 Let {A\in{\mathbb R}^{n\times p}} have full rank and let {b\in{\mathbb R}^n} be in the range of {A}. Assume that the homotopy path has {k} linear segments and denote by {\lambda_1} the regularization parameter which corresponds to the smallest kink in the path. Now choose {b_{n+1}\neq 0} and {\alpha} such that

\displaystyle   0<\alpha < \frac{\lambda_1}{2\|b\|_2^2 + b_{n+1}^2} \ \ \ \ \ (1)

and define {\tilde b\in{\mathbb R}^{n+1}} and {\tilde A\in{\mathbb R}{(n+1)\times (p+1)}} by

\displaystyle  \tilde b = \begin{bmatrix} b\\ b_{n+1} \end{bmatrix}, \quad \tilde A = \begin{bmatrix} A & 2\alpha b\\ 0 & \alpha b_{n+1} \end{bmatrix}.

Then the homotopy path for the basis pursuit problem with matrix {\tilde A} and right hand side {\tilde b} has {3k-1} linear segments.

With this theorem at hand, it is straightforward to recursively build a “Mairal-Yu”-example which matches the upper bound for the number of linear segments. The idea is to start with a {1\times 1} example and let it grow by one row and one column according to Theorem~7. We start with the simplest {1\times 1} example, namely {A = [1]} and {b=[1]}. To move to the next bigger example you can choose the next entry {b_{n+1}} and we always choose {1} for convenience. Moreover, you need the next {\alpha} and you need to know the smallest kink in the path. I calculated the paths and kinks with L1Packv2 by Ignace Loris because it is written in Mathematica and can use exact arithmetics with rational numbers (and you will see, that accuracy will be an issue even for small instances) and seemed bullet proof for me. Let’s see where this idea brings us:

Example 1 (Mairal-Yu example)

  • Stage 1: We start with {n=p=1}, {b=[1]} and {A=[1]}. The homotopy path has one kink at {\lambda_1=1} (with corresponding solution {[0]}) and hence, two linear segments. Now let’s go to the next larger instance:
  • Stage 2: We can choose the entry {b_2} as we like and choose it equals to 1, i.e. our new {b} is

    \displaystyle  b = \begin{bmatrix} 1\\1 \end{bmatrix}.

    Now we have to choose {\alpha} according to (1), i.e

    \displaystyle  0 < \alpha < \frac{\lambda_1}{2\|b\|_2^2 + b_{n+1}^2} = \frac{1}{2+1} = \frac{1}{3}

    and we can choose, e.g., {\alpha = 1/4} which gives our new matrix

    \displaystyle  A = \begin{bmatrix} 1 & \frac12\\ 0 & \frac14 \end{bmatrix}.

    The calculation of the new regularization path shows that it has exactly the announced number of 5 segments and the parameter of the smallest kink is {\lambda_1 = \frac{1}{13}}.

  • Stage 2: Again we choose {b_{n+1} = 1} giving

    \displaystyle  b = \begin{bmatrix} 1\\1\\1 \end{bmatrix}

    For the choice of {\alpha} we need that

    \displaystyle  0<\alpha < \frac{1}{13(4+1)} = \frac{1}{75}

    and we may choose

    \displaystyle  \alpha = \frac1{80}.

    which gives the next matrix

    \displaystyle  A = \begin{bmatrix} 1 & \frac12 & \tfrac{1}{40}\\ 0 & \frac14 & \tfrac{1}{40}\\ 0 & 0 & \tfrac{1}{80} \end{bmatrix}.

    We calculate the regularization path, observe that it has the predicted 14 segments and that the parameter of the smallest kink is {\lambda_1 = \frac{1}{193}}.

  • Stage 3: Again we choose {b_{n+1} = 1} giving

    \displaystyle  b = \begin{bmatrix} 1\\1\\1\\1 \end{bmatrix}

    For the choice of {\alpha} we need that

    \displaystyle  0<\alpha < \frac{1}{193(6+1)} = \frac{1}{1351}

    and we see that things are getting awkward here…

Proceeding in this way we always increase the number of linear segments {k_n} for the {n\times n}-case from {k_n} to {k_{n+1} = 3k_n-1} in each step and one checks easily that this leads to {k_n = (3^n+1)/2} which is the worst case! If you are interested in the regularization path: I produced picture for the first three dimensions (well, I could not draw a 4d {\ell^1}-ball) and here they are:

1d Mairal-Yu example


2d Mairal-Yu example


3d Mairal-Yu example

It is not really easy to perceive the whole paths from the pictures because the magnitude of the entries vary strongly. I’ve drawn the path in red, each kink marked with a small circle. Moreover, I have drawn the according {\ell^1}-balls of the respective radii to provide more geometric information.

The paper by Mairal and Yu has more results of the paths if one looks for approximate solutions of the linear system but I will not go into detail about them here.

At least two questions come to mind:

  • The Mairal-Yu example is {n\times n}. What is the worst case complexity for the true rectangular case? In other words: What is the complexity for {p\times n} in terms of {p} and {n}?
  • The example and the construction leads to matrices that does not have normed columns and moreover, they are far from being equal in norm. But matrices with normed columns seem to be more “well behaved”. Does the worst case complexity lowers if the consider matrices with unit-norm columns? Probably one can construct a unit-norm example by proper choice of {b}

Today I would like to comment on two arxiv-preprints I stumbled upon:

1. “Augmented L1 and Nuclear-Norm Models with a Globally Linearly Convergent Algorithm” – The Elastic Net rediscovered

The paper “Augmented L1 and Nuclear-Norm Models with a Globally Linearly Convergent Algorithm” by Ming-Jun Lai and Wotao Yin is another contribution to a field which is (or was?) probably the fastest growing field in applied mathematics: Algorithms for convex problems with non-smooth {\ell^1}-like terms. The “mother problem” here is as follows: Consider a matrix {A\in{\mathbb R}^{m\times n}}, {b\in{\mathbb R}^m} try to find a solution of

\displaystyle  \min_{x\in{\mathbb R}^n}\|x\|_1\quad\text{s.t.}\quad Ax=b

or, for {\sigma>0}

\displaystyle  \min_{x\in{\mathbb R}^n}\|x\|_1\quad\text{s.t.}\quad \|Ax-b\|\leq\sigma

which appeared here on this blog previously. Although this is a convex problem and even has a reformulation as linear program, some instances of this problem are notoriously hard to solve and gained a lot of attention (because their applicability in sparse recovery and compressed sensing). Very roughly speaking, a part of its hardness comes from the fact that the problem is neither smooth nor strictly convex.

The contribution of Lai and Yin is that they analyze a slight perturbation of the problem which makes its solution much easier: They add another term in the objective; for {\alpha>0} they consider

\displaystyle  \min_{x\in{\mathbb R}^n}\|x\|_1 + \frac{1}{2\alpha}\|x\|_2^2\quad\text{s.t.}\quad Ax=b

or

\displaystyle  \min_{x\in{\mathbb R}^n}\|x\|_1 + \frac{1}{2\alpha}\|x\|_2^2\quad\text{s.t.}\quad \|Ax-b\|\leq\sigma.

This perturbation does not make the problem smooth but renders it strongly convex (which usually makes the dual more smooth). It turns out that this perturbation makes life with this problem (and related ones) much easier – recovery guarantees still exists and algorithms behave better.

I think it is important to note that the “augmentation” of the {\ell^1} objective with an additional squared {\ell^2}-term goes back to Zou and Hastie from the statistics community. There, the motivation was as follows: They observed that the pure {\ell^1} objective tends to “overpromote” sparsity in the sense that if there are two columns in {A} which are almost equally good in explaining some component of {b} then only one of them is used. The “augmented problem”, however, tends to use both of them. They coined the method as “elastic net” (for reasons which I never really got).

I also worked on elastic-net problems for problems in the form

\displaystyle  \min_x \frac{1}{2}\|Ax-b\|^2 + \alpha\|x\|_1 + \frac{\beta}{2}\|x\|_2^2

in this paper (doi-link). Here it also turns out that the problem gets much easier algorithmically. I found it very convenient to rewrite the elastic-net problem as

\displaystyle  \min_x \frac{1}{2}\|\begin{bmatrix}A\\ \sqrt{\beta} I\end{bmatrix}x-\begin{bmatrix}b\\ 0\end{bmatrix}\|^2 + \alpha\|x\|_1

which turns the elastic-net problem into just another {\ell^1}-penalized problem with a special matrix and right hand side. Quite convenient for analysis and also somehow algorithmically.

2. Towards a Mathematical Theory of Super-Resolution

The second preprint is “Towards a Mathematical Theory of Super-Resolution” by Emmanuel Candes and Carlos Fernandez-Granda.

The idea of super-resolution seems to pretty old and, very roughly speaking, is to extract a higher resolution of a measured quantity (e.g. an image) than the measured data allows. Of course, in this formulation this is impossible. But often one can gain something by additional knowledge of the image. Basically, this also is the idea behind compressed sensing and hence, it does not come as a surprise that the results in compressed sensing are used to try to explain when super-resolution is possible.

The paper by Candes and Fernandez-Granada seems to be pretty close in spirit to Exact Reconstruction using Support Pursuit on which I blogged earlier. They model the sparse signal as a Radon measure, especially as a sum of Diracs. However, different from the support-pursuit-paper they use complex exponentials (in contrast to real polynomials). Their reconstruction method is basically the same as support pursuit: The try to solve

\displaystyle   \min_{x\in\mathcal{M}} \|x\|\quad\text{s.t.}\quad Fx=y, \ \ \ \ \ (1)

i.e. they minimize over the set of Radon measures {\mathcal{M}} under the constraint that certain measurements {Fx\in{\mathbb R}^n} result in certain given values {y}. Moreover, they make a thorough analysis of what is “reconstructable” by their ansatz and obtain a lower bound on the distance of two Diracs (in other words, a lower bound in the Prokhorov distance). I have to admit that I do not share one of their claims from the abstract: “We show that one can super-resolve these point sources with infinite precision—i.e. recover the exact locations and amplitudes—by solving a simple convex program.” My point is that I can not see to what extend the problem (1) is a simple one. Well, it is convex, but it does not seem to be simple.

I want to add that the idea of “continuous sparse modelling” in the space of signed measures is very appealing to me and appeared first in Inverse problems in spaces of measures by Kristian Bredies and Hanna Pikkarainen.

Recently the recipients of Sloan Fellowships for 2012 has been announced. This is a kind grant/price awarded to young scientists, usually people who are assistant professors on the tenure track, from the U.S. and Canada. While the actual award is not exorbitant (but still large) the Sloan Fellowships seem to be a good indicator for further success in the career and, more importantly, for awaited breakthroughs by the fellows.

This year there are 20 mathematicians among the recipients and it appeared that I knew three of them:

  • Rachel Ward (UTexas at Austin), a student of Ingrid Daubechies, has written a very nice PhD Thesis “Freedom through imperfection” on signal processing based on redundancy. Among several interesting works she has a very recent preprint Stable image reconstruction using total variation minimization which I plan to cover in a future post (according to the abstract the paper provides rigorous theory for exact recovery with discrete total variation which is cool).
  • Ben Recht (University of Wisconsin, Madison) also works in the field of signal processing (among other fields) and is especially known for his work on exact matrix completion with compressed sensing techniques (together with Emmanuel Candes); I also liked his paper “The Convex Geometry of Linear Inverse Problems” (with Venkat Chandrasekaran, Pablo A. Parrilo, and Alan Willsky) with elaborated on the generalization of {\ell^1}-minimization and nuclear norm minimization.
  • Greg Blekherman (Georgia Tech) works (among other things) on algebraic geometry and especially on the problem which non-negative polynomials (in more the one variable) can be written as sums of squares of polynomials, a question which is related to one of Hilbert’s 23 problems, namely Hilbert’s seventeenth problem. An interesting thing about non-negative polynomials and sums of squares is that: 1. Checking if a polynomial is non-negative is NP-hard. 2. Checking is a polynomial is a sum of squares can be done fast. 3. It seem that “most” non-negative polynomials are actually sums of squares but it unclear “how many of them”, see Greg’s chapter of the forthcoming book “Semidefinite Optimization and Convex Algebraic Geometry” here.

Congratulations!

Today I write about sparse recovery of “multidimensional” signal. With “multidimensional” I mean something like this: A one-dimensional signal is a vector {x\in{\mathbb R}^n} while a two-dimensional signal is a matrix {x\in{\mathbb R}^{n_1\times n_2}}. Similarly, {d}-dimensional signal is {x\in{\mathbb R}^{n_1\times\cdots\times n_d}}. Of course, images as two-dimensional signals, come time mind. Moreover, movies are three-dimensional, a hyperspectral 2D image (which has a whole spectrum attached to any pixel) is also three-dimensional), and time-dependent volume-data is four-dimensional.

Multidimensional data is often a challenge due the large amount of data. While the size of the signals is usually not the problem it is more the size of the measurement matrices. In the context of compressed sensing or sparse recovery the signal is measured with a linear operator, i.e. one applies a number {m} of linear functionals to the signal. In the {d}-dimensional case this can be encoded as a matrix {A\in {\mathbb R}^{m\times \prod_1^d n_i}} and this is where the trouble with the data comes in: If you have megapixel image (which is still quite small) the matrix has a million of columns and if you have a dense matrix, storage becomes an issue.

One approach (which is indeed quite old) to tackle this problem is, to consider special measurement matrices: If the signal has a sparse structure is every slice, i.e. every vectors of the form {x(i_1,\dots,i_{k-1},:,i_{k+1},\dots,i_d)} where we fix all but the {k}-th component, then the Kronecker product of measurement matrices for each dimension is the right thing.

The Kronecker product of two matrices {A\in{\mathbb R}^{m\times n}} and {B\in{\mathbb R}^{k\times j}} is the {mk\times nj} matrix

\displaystyle  A\otimes B = \begin{bmatrix} a_{11}B & \dots & a_{1n}B\\ \vdots & & \vdots\\ a_{n1}B & \dots & a_{nn}B \end{bmatrix}.

This has a lot to do with the tensor product and you should read the Wikipedia entry. Moreover, it is numerically advantageous not to build the Kronecker product of dense matrices if you only want to apply it to a given signal. To see this, we introduce the vectorization operator {\text{vec}:{\mathbb R}^{m\times n}\rightarrow{\mathbb R}^{nm}} which takes a matrix {X} and stacks its columns into a tall column vector. For matrices {A} and {B} (of fitting sizes) it holds that

\displaystyle  (B^T\otimes A)\text{vec}(X) = \text{vec}(AXB).

So, multiplying {X} from the left and from the right gives the application of the Kronecker product.

The use of Kronecker products in numerical linear algebra is fairly old (for example they are helpful for multidimensional finite difference schemes where you can build Kronecker products of sparse difference operators in respective dimensions). Recently, they have been discovered for compressed sensing and sparse recovery in these two papers: Sparse solutions to underdetermined Kronecker product systems by Sadegh Jokar and Volker Mehrmann and the more recent Kronecker Compressed Sensing by Marco Duarte and Rich Baraniuk.

From these papers you can extract some interestingly simple and nice theorems:

Theorem 1 For matrices {A_1,\dots A_d} with restricted isometry constant {\delta_K(A_1),\dots,\delta_K(A_d)} of order {K} it holds that the restricted isometry constant of the Kronecker product fulfills

\displaystyle  \max_i \delta_K(A_i) \leq \delta_K(A_1\otimes\cdots\otimes A_d) \leq \prod_1^d (1+\delta_K(A_i))-1.

Basically, the RIP constant of a Kronecker product is not better than the worst one but still not too large.

Theorem 2 For matrices {A_1,\dots, A_d} with columns normalized to one, it hold that the spark of their Kronecker product fulfills

\displaystyle  \text{spark}(A\otimes \dots\otimes A_d) = = \min_i\text{spark}(A_i).

Theorem 3 For matrices {A_1,\dots, A_d} with columns normalized to one, it hold that the mutual coherence of their Kronecker product fulfills

\displaystyle  \mu(A_1\otimes\dots\otimes A_d) = \max_i \mu(A_i).

Next Page »

Follow

Get every new post delivered to your Inbox.

Join 30 other followers