With this post I delve into a topic which is somehow new to me, although I planned to look deeper into this for quite some time already. I stumbled upon the paper Gromov-Wasserstein distances and the metric approach to object matching by Facundo Mémoli which was a pleasure to read and motivated this post.

1. Comparing measures with norms and metrics

There are different notions in mathematics to compare two objects, think of the size of real numbers, the cardinality of sets or the length of the difference of two vectors. Here we will deal with not only comparison of objects but with “measures of similarity”. Two fundamental notions for this are norms in vector spaces and metrics. The norm is the stronger concept in that it uses more structure than a metric and also, every norm induces a metric but not the other way round. There are occasions in which both a norm and a metric are available but lead to different concepts of similarity. One of these instances occurs in sparse recovery, especially in the continuous formulation, e.g. as described in a previous post. Consider the unit interval ${I = [0,1]}$ and two Radon measures ${\mu_1}$ and ${\mu_2}$ on ${I}$ (${I}$ could also be an aritrary metric space). On the space of Radon measures ${\mathfrak{M}(I)}$ there is the variation norm

$\displaystyle \|\mu\|_{\mathfrak{M}}= \sup_\Pi\sum_{A\in\Pi}|\mu(A)|$

where the supremum is taken over all partitions ${\Pi}$ of ${I}$ into a finite number of measurable sets. Moreover, there are different metrics one can put on the space of Radon measures, e.g. the Prokhorov metric which is defined for two probability measures (e.g. non-negative ones with unit total mass)

$\displaystyle \begin{array}{rcl} d_P(\mu_1,\mu_2) & = & \inf\{\epsilon>0\ :\ \mu_1(A)\leq \mu_2(A^\epsilon) + \epsilon,\nonumber\\ & & \qquad \mu_2(A)\leq \mu_1(A^\epsilon) + \epsilon\ \text{for all measurable}\ A\} \end{array}$

where ${A^\epsilon}$ denotes the ${\epsilon}$-neighborhood of ${A}$. Another familiy of metrics are the Wasserstein metrics: For ${p\geq 1}$ define

$\displaystyle d_{W,p}(\mu_1,\mu_2) = \Big(\inf_\nu\int_{I\times I} |x-y|^p d\nu(x,y)\Big)^{1/p} \ \ \ \ \ (1)$

where the infimum is taken over all measure couplings of ${\mu_1}$ and ${\mu_2}$, that is, all measures ${\nu}$ on ${I\times I}$ such that for measurable ${A}$ it holds that

$\displaystyle \nu(A\times I) = \mu_1(A)\ \text{and}\ \nu(I\times A) = \mu_2(A).$

Example 1 We compare two Dirac measures ${\mu_1 = \delta_{x_1}}$ and ${\mu_2 = \delta_{x_2}}$ located at distinct points ${x_1\neq x_2}$ in ${I}$ as seen here:

The variation norm measures their distance as

$\displaystyle \|\mu_1-\mu_2\|_{\mathfrak{M}} = \sup_\Pi\sum_{A\in\Pi}|\delta_{x_1}(A) - \delta_{x_2}(A)| = 2$

(choose ${\Pi}$ such that it contains ${A_1}$ and ${A_2}$ small enough that ${x_1\in A_1}$, ${x_2\in A_2}$ but ${x_1\notin A_2}$ and ${x_2\notin A_1}$). The calculate the Prokhorov metric note that you only need to consider ${A}$‘s which contain only one of the points ${x_{1/2}}$ and hence, it evaluates to

$\displaystyle d_P(\mu_1,\mu_2) = |x_1-x_2|.$

For the Wasserstein metric we observe that there is only one possible measure coupling of ${\delta_{x_1}}$ and ${\delta_{x_2}}$, namely the measure ${\nu = \delta_{(x_1,x_2)}}$. Hence, we have

$\displaystyle d_{W,p}(\mu_1,\mu_2) = \Big(\int_{I\times I}|x-y|^pd\delta_{(x_1,x_2)}(x,y)\Big)^{1/p} = |x_1-x_2|.$

The variation norm distinguishes the two Diracs but is not able to grasp the distance of their supports. On the other hand, both metrics return the geometric distance of the supports in the underlying space ${I}$ as distance of the Diracs. Put in pictures: The variation norm of the difference measures the size ob this object

while both metrics capture the distance of the measures like here

It should not stay unnoted that convergence in both the Prokhorov metric and the Wasserstein metrics is exactly the weak convergence of probability measures.

The above example provides a motivation to study metric structures on spaces, even if they are also equipped with a norm. Another reason to shift attention from normed spaces to metric spaces is the fact that there has emerged a body of work to build a theory of analysis in metric spaces (see, e.g. this answer on mathoverflow or the book Gradient Flows: In Metric Spaces And In The Space Of Probability Measures by Ambrosio, Gigli and Savaré (which puts special emphasis on the space of probability measures)). Yet another motivation for the study of metrics in this way is the problem of comparing shapes (without being precisely defined yet): Which of these shapes look most alike?

(Note that shapes need not to be two dimensional figures, you may also think of more complex objects like surfaces in three dimensions or Riemannian manifolds.)

One may also ask the question how two compare different images defined on different shapes, i.e. different “distributions of colour” on two different shapes.

2. Comparing shapes: Metric spaces

Up to now we tried to compare different measures, defined on the same set. At least to me it seems that both the Prokhorov and the Wasserstein metrics are suited to measure the similarity of measures and in fact, they do so somehow finer than the usual norm does.

Let’s try to go one step further and ask ourselves, how we could compare two measures ${\mu_1}$ and ${\mu_2}$ which are defined on two different sets? While thinking about an answer one need to balance several things:

• The setup should be general enough to allow for the comparison of a wide range of objects.
• It should include enough structure to allow meaningful statements.
• It should lead to a measure which is easy enough to handle both analytically and computationally.

For the first and second bullet: We are going to work with measures not on arbitrary sets but on metric spaces. This will allow to measure distances between points in the sets and, as you probably know, does not pose a severe restriction. Although metric spaces are much more specific than topological spaces, we still aim at quantitative measures which are not provided by topologies. With respect to the last bullet: Note that both the Prokhorov and the Wasserstein metric are defined as infimums over fairly large and not too well structured sets (for the Prokhorov metric and need to consider all measurable sets and their ${\epsilon}$-neighborhoods, for the Wasserstein metric, one need to consider all measure couplings). While they can be handled quite well theoretically, their computational realization can be cumbersome.

In a similar spirit than Facundo Memoli’s paper we work our way up from comparing subsets of metric spaces up to comparing two different metric spaces with two measures defined on them.

2.1. Comparing compact subsets of a metric space: Hausdorff

Let ${(X,d)}$ be a compact metric space. Almost hundred years ago Hausdorff introduced a metric on the family of all non-empty compact subsets of a metric space as follows: The Hausdorff metric of two compact subsets ${A}$ and ${B}$ of ${X}$ is defined as

$\displaystyle d_H(A,B) = \inf\{\epsilon>0 \ :\ A\subset B_\epsilon,\ B \subset A_\epsilon\}$

(again, using the notion of ${\epsilon}$-neighborhood). This definition seems to be much in the spirit of the Prokhorov metric.

Proposition 2.1 in Facundo Memolis paper shows that the Hausdorff metric has an equivalent description as

$\displaystyle d_H(A,B) = \inf_R \sup_{(a,b) \in R} d(a,b)$

where the infimum is taken over all correspondences ${R}$ of ${A}$ and ${B}$, i.e., all subset ${R\subset A\times B}$ such that for all ${a\in A}$ there is ${b\in B}$ such that ${(a,b) \in R}$ and for all ${b\in B}$ there ${a\in A}$ such that ${(a,b)\in R}$. One may also say set coupling of ${A}$ and ${B}$ instead of correspondence.

Example 2 There is always the full coupling ${R = A\times B}$. Three different set couplings of two subsets ${A}$ and ${B}$ of the unit interval are shown here:

the “full one” ${A\times B}$ in green and two “slim” ones in red and orange. Other “slim” couplings can be obtained from surjective mappings ${f:A\rightarrow B}$ by ${R = \{(a,f(a))\ :\ a\in A\}}$ (or with the roles of ${A}$ and ${B}$ swapped): If you couple a set ${A}$ with itself, there is also the trivial coupling

$\displaystyle R = \{(a,a)\ : \ a\in A\}$

which is just the diagonal of ${A\times A}$

Note that the alternative definition of the Hausdorff metric is more in the spirit of the Wasserstein metric: It does not use enlarged objects (by ${\epsilon}$-neighborhoods) but couplings.

The Hausdorff metric is indeed a metric on the set ${\mathfrak{C}(X)}$ of all non-empty compact subsets of a metric space ${X}$ and if ${X}$ itself is compact it even holds that ${(\mathfrak{C}(X),d_H)}$ is a compact metric space (a result, known as Blaschke Selection Theorem).

One may say that we went up an abstraction ladder one step by moving from ${(X,d)}$ to ${(\mathfrak{C}(X),d_H)}$.

2.2. Comparing compact metric spaces: Gromov-Hausdorff

In the previous subsection we worked within one metric space ${X}$. In the book “Metric Structures for Riemannian and Non-Riemannian Spaces” Misha Gromov introduced a notion to compare two different metric spaces. For compact metric space ${X}$ and ${Y}$ the Gromov-Hausdorff metric is defined as

$\displaystyle d_{GH}(X,Y) = \inf_{Z,f,g} d_H(f(X),g(Y)) \ \ \ \ \ (2)$

where the infimum is taken over

• all metric spaces ${Z}$ and
• all isometric embeddings ${f}$ and ${g}$ which embed ${X}$ and ${Y}$ into ${Z}$ respectively.

In words: To compute the Gromov-Hausdorff metric, you try embed both ${X}$ and ${Y}$ into a common larger space isometrically such that they are as close as possible according to the Hausdorff metric in that space.

Strictly speaking, the above definition is not well stated as one can not form an infimum over all metric spaces since this collection does not form a set according to the rules of set theory. More precisely one should write that ${d_{GH}(X,Y)}$ is the infimum over all ${r>0}$ such that there exists a metric space ${Z}$ and isometric embeddings ${f}$ and ${g}$ of ${X}$ and ${Y}$, respectively, such that ${d_H(f(X),g(Y)).

As the Hausdorff metric could be reformulated with set couplings there is a reformulation of the Gromov-Hausdorff metric based on metric couplings: A metric coupling of two metric spaces ${(X,d_X)}$ and ${(Y,d_Y)}$ is a metric ${d}$ on the disjoint union ${X\sqcup Y}$ of ${X}$ and ${Y}$ such that for all ${x,x'\in X}$ and ${y,y'\in Y}$ it holds that ${d(x,x') = d_X(x,x')}$ and ${d(y,y') = d_Y(y,y')}$.

Example 3 We couple a metric space ${(X,d)}$ with itself. We denote with ${(X',d')}$ an identical copy of ${(X,d)}$ and look for a metric ${D}$ on ${X\times X'}$ that respects the metrics ${d}$ and ${d'}$ in the way a metric coupling has to.

To distinguish elements from ${X}$ and ${X'}$ we put a ${'}$ on all quantities from ${X'}$. Moreover, for ${x\in X}$ we denote by ${x'}$ its identical copy in ${X'}$ (and similarly for ${x'\in X'}$, ${x}$ is its identical twin). Then, for any ${\epsilon>0}$ we can define ${D_\epsilon(x,x') = D_\epsilon(x',x) = \epsilon}$ (i.e. the distance between any two identical twins is ${\epsilon}$. By the triangle inequality we get for ${x\in X}$ and ${y'\in X'}$ that ${D_\epsilon(x,y')}$ should fulfill

$\displaystyle D_\epsilon(x',y') - D_\epsilon(x',x) \leq D_\epsilon(x,y') \leq D_\epsilon(x,y) + D_\epsilon(y,y')$

and hence

$\displaystyle d(x,y) - \epsilon \leq D_\epsilon(x,y') \leq d(x,y) + \epsilon.$

Indeed we can choose ${D_\epsilon(x,y') = d(x,y)}$ if ${x\in X}$ and ${y'\in Y}$ leading to one specific metric coupling for any ${\epsilon}$. This couplings allow to distinguish identical twins and behave as a metric on the whole disjoint union. In the limiting case ${\epsilon\rightarrow 0}$ we do not obtain a metric but a semi-metric or pseudo-metric which is just the same as a metric but without the assumption that ${d(x,y) = 0}$ implies that ${x=y}$.

Example 4 The above example of a metric coupling of a metric space with itself was somehow “reproducing” the given metric as accurate as possible. There are also other couplings that put very different distances to points ${D(x,y')}$ and there is also a way to visualize metric couplings: When building the disjoint union of two metric spaces ${X}$ and ${Y}$, you can imagine this as isometrically embedding both in a larger metric space ${Z}$ in a non-overlapping way and obtain the metric coupling ${D}$ as the restriction of the metric on ${Z}$ to ${X\sqcup Y}$. For ${X=Y=[0,1]}$ you can embed both into ${Z = {\mathbb R}^2}$. A metric coupling which is similar (but not equal) to the coupling of the previous example is obtained by putting ${X}$ and ${Y}$ side by side at distance ${\epsilon}$ as here (one space in green, the other in blue).

A quite different coupling is obtained by putting ${X}$ and ${Y}$ side by side, but in a reversed way as here:

You may even embed them in a more weired way as here:

but remember that the embeddings has to be isometric, hence, distortions like here are not allowed.

This example illustrate that the idea of metric coupling is in similar spirit as of “embedding two spaces in a common larger one”.

With the notion of metric coupling, the Gromov-Hausdorff metric can be written as

$\displaystyle d_{GH}(X,Y) = \inf_{R,d} \sup_{(x,y)\in R} d(x,y) \ \ \ \ \ (3)$

where the infimum is taken over all set couplings ${R}$ of ${X}$ and ${Y}$ and all metric couplings ${d}$ of ${(X,d_X)}$ and ${(Y,d_Y)}$.

In words: To compute the Gromov-Hausdorff metric this way, you look for a set coupling of the base sets ${X}$ and ${Y}$ and a metric coupling ${d}$ of the metrics ${d_X}$ and ${d_Y}$ such that the maximal distance of two coupled points ${x}$ and ${y}$ is as small as possible. While this may look more complicated than the original definition from~(2), note that the original definition uses all metric spaces ${Z}$ in which you can embed ${X}$ and ${Y}$ isometrically, which seems barely impossible to realize. Granted, the new definition also considers a lot of quantities.

Also note that this definition is in spirit of the Wasserstein metric from~(1): If there were natural measures ${\mu_R}$ on the set couplings ${R}$ we could write \begin{equation*} d_{GH}(X,Y) = \inf_{R,d} \Big(\int d(x,y)^pd\mu_R\Big)^{1/p} \end{equation*} and in the limit ${p\rightarrow\infty}$ we would recover definition~(3).

Example 5 The Gromov-Hausdorff distance of a metric space ${(X,d_X)}$ to itself is easily seen to be zero: Consider the trivial coupling ${R = \{(x,x)\ :\ x\in X\}}$ from Example~2 and the family ${D_\epsilon}$ of metric couplings from Example~3. Then we have ${d_{GH}(X,X) \leq \epsilon}$ for any ${\epsilon >0}$ showing ${d_{GH}(X,X) = 0}$. Let’s take one of the next-complicated examples and compute the distance of ${X = [0,1]}$ and ${Y=[0,2]}$, both equipped with the euclidean metric. We couple the sets ${X}$ and ${Y}$ by ${R = \{(x,2x)\ : \ x\in X\}}$ and the respective metrics by embedding ${X}$ and ${Y}$ into ${{\mathbb R}^2}$ as follows: Put ${Y}$ at the line from ${(0,0)}$ to ${(2,0)}$ and ${X}$ at the line from ${(\tfrac12,\epsilon)}$ to ${(1\tfrac12,\epsilon)}$:

This shows that ${d_{GH}(X,Y) \leq \tfrac12}$ and actually, we have equality here.

There is another reformulation of the Gromov-Hausdorff metric, the equivalence of which is shown in Theorem 7.3.25 in the book “A Course in Metric Geometry” by Dmitri Burago, Yuri Burago and Sergei Ivanov:

$\displaystyle d_{GH}(X,Y) = \tfrac12\inf_R \sup_{\overset{\overset{x_{1/2}\in X}{y_{1/2}\in Y}}{(x_i,y_i)\in R}}\big| d_X(x_1,x_2) - d_Y(y_1,y_2)\big| \ \ \ \ \ (4)$

where the infimum is taken over all set couplings ${R}$ of ${X}$ and ${Y}$.

In words: Look for a set coupling such that any two coupled pairs ${(x_1,y_1)}$ and ${(x_2,y_2)}$ have the “most equal” distance.

This reformulation may have the advantage over the form (3) in that is only considers the set couplings and the given metrics ${d_X}$ and ${d_Y}$ and no metric coupling is needed.

Note that, as the previous reformulation~(3), it is also in the spirit of the Wasserstein metric: If there were natural measures ${\mu_R}$ in the set couplings ${R}$, we could write

$\displaystyle d_{GH}(X,Y) = \tfrac12\inf_R \Big(\int_{R\times R}\big| d_X(x_1,x_2) - d_Y(y_1,y_2)\big|^p d\mu_R(x_1,y_1)d\mu_R(x_2,y_2)\Big)^{1/p}.$

and recover the formulation~(4) in the limit ${p\rightarrow\infty}$.

One may say that we went up an abstraction ladder one step further by moving from ${(X,d)}$ to ${(\mathfrak{C}(X),d_H)}$ to ${(\text{All compact metric spaces},d_{GH})}$.

Since this post has been grown pretty long already, I decided to do the next step (which is the already announced metric on metric spaces which additionally carry some measure on them – so-called metric measure spaces) in a later post.

Let ${\Omega}$ be a compact subset of ${{\mathbb R}^d}$ and consider the space ${C(\Omega)}$ of continuous functions ${f:\Omega\rightarrow {\mathbb R}}$ with the usual supremum norm. The Riesz Representation Theorem states that the dual space of ${C(\Omega)}$ is in this case the set of all Radon measures, denoted by ${\mathfrak{M}(\Omega)}$ and the canonical duality pairing is given by

$\displaystyle \langle\mu,f\rangle = \mu(f) = \int_\Omega fd\mu.$

We can equip ${\mathfrak{M}(\Omega)}$ with the usual notion of weak* convergence which read as

$\displaystyle \mu_n\rightharpoonup^* \mu\ \iff\ \text{for every}\ f:\ \mu_n(f)\rightarrow\mu(f).$

We call a measure ${\mu}$ positive if ${f\geq 0}$ implies that ${\mu(f)\geq 0}$. If a positive measure satisfies ${\mu(1)=1}$ (i.e. it integrates the constant function with unit value to one), we call it a probability measure and we denote with ${\Delta\subset \mathfrak{M}(\Omega)}$ the set of all probability measures.

Example 1 Every non-negative integrable function ${\phi:\Omega\rightarrow{\mathbb R}}$ with ${\int_\Omega \phi(x)dx}$ induces a probability measure via

$\displaystyle f\mapsto \int_\Omega f(x)\phi(x)dx.$

Quite different probability measures are the ${\delta}$-measures: For every ${x\in\Omega}$ there is the ${\delta}$-measure at this point, defined by

$\displaystyle \delta_x(f) = f(x).$

In some sense, the set ${\Delta}$ of probability measure is the generalization of the standard simplex in ${{\mathbb R}^n}$ to infinite dimensions (in fact uncountably many dimensions): The ${\delta}$-measures are the extreme points of ${\Delta}$ and since the set ${\Delta}$ is compact in the weak* topology, the Krein-Milman Theorem states that ${\Delta}$ is the weak*-closure of the set of convex combinations of the ${\delta}$-measures – similarly as the standard simplex in ${{\mathbb R}^n}$ is the convex combination of the canonical basis vectors of ${{\mathbb R}^n}$.

Remark 1 If we drop the positivity assumption and form the set

$\displaystyle O = \{\mu\in\mathfrak{M}(\Omega)\ :\ |f|\leq 1\implies |\mu(f)|\leq 1\}$

we have the ${O}$ is the set of convex combinations of the measures ${\pm\delta_x}$ (${x\in\Omega}$). Hence, ${O}$ resembles the hyper-octahedron (aka cross polytope or ${\ell^1}$-ball).

I’ve taken the above (with almost similar notation) from the book “ A Course in Convexity” by Alexander Barvinok. I was curious to find (in Chapter III, Section 9) something which reads as a nice glimpse on semi-continuous compressed sensing: Proposition 9.4 reads as follows

Proposition 1 Let ${g,f_1,\dots,f_m\in C(\Omega)}$, ${b\in{\mathbb R}^m}$ and suppose that the subset ${B}$ of ${\Delta}$ consisting of the probability measures ${\mu}$ such that for ${i=1,\dots,m}$

$\displaystyle \int f_id\mu = b_i$

is not empty. Then there exists ${\mu^+,\mu^-\in B}$ such that

1. ${\mu^+}$ and ${\mu^-}$ are convex combinations of at most ${m+1}$ ${\delta}$-measures, and
2. it holds that for all ${\mu\in B}$ we have

$\displaystyle \mu^-(g)\leq \mu(g)\leq \mu^+(g).$

In terms of compressed sensing this says: Among all probability measures which comply with the data ${b}$ measured by ${m}$ linear measurements, there are two extremal ones which consists of ${m+1}$ ${\delta}$-measures.

Note that something similar to “support-pursuit” does not work here: The minimization problem ${\min_{\mu\in B, \mu(f_i)=b_i}\|\mu\|_{\mathfrak{M}}}$ does not make much sense, since ${\|\mu\|_{\mathfrak{M}}=1}$ for all ${\mu\in B}$.

ISMP is over now and I’m already home. I do not have many things to report on from the last day. This is not due the lower quality of the talks but due to the fact that I was a little bit exhausted, as usual at the end of a five-day conference. However, I collect a few things for the record:

• In the morning I visited the semi-planary by Xiaojun Chenon non-convex and non-smooth minimization with smoothing methods. Not surprisingly, she treated the problem

$\displaystyle \min_x f(x) + \|x\|_p^p$

with convex and smooth ${f:{\mathbb R}^n\rightarrow{\mathbb R}}$ and ${0. She proposed and analyzed smoothing methods, that is, to smooth the problem a bit to obtain a Lipschitz-continuous objective function ${\phi_\epsilon}$, minimizing this and then gradually decreasing ${\epsilon}$. This works, as she showed. If I remember correctly, she also treated “iteratively reweighted least squares” as I described in my previous post. Unfortunately, she did not include the generalized forward-backward methods based on ${\text{prox}}$-functions for non-convex functions. Kristian and I pursued this approach in our paper Minimization of non-smooth, non-convex functionals by iterative thresholding and some special features of our analysis include:

• A condition which excludes some (but not all) local minimizers from being global.
• An algorithm which avoids this non-global minimizers by carefully adjusting the steplength of the method.
• A result that the number of local minimizers is still finite, even if the problem is posed in ${\ell^2({\mathbb N})}$ and not in ${{\mathbb R}^n}$.

Most of our results hold true, if the ${p}$-quasi-norm is replaced by functions of the form

$\displaystyle \sum_n \phi_n(|x_n|)$

with special non-convex ${\phi}$, namely fulfilling a list of assumptions like

• ${\phi'(x) \rightarrow \infty}$ for ${x\rightarrow 0}$ (infinite slope at ${0}$) and ${\phi(x)\rightarrow\infty}$ for ${x\rightarrow\infty}$ (mild coercivity),
• ${\phi'}$ strictly convex on ${]0,\infty[}$ and ${\phi'(x)/x\rightarrow 0}$ for ${x\rightarrow\infty}$,
• for each ${b>0}$ there is ${a>0}$ such that for ${x it holds that ${\phi(x)>ax^2}$, and
• local integrability of some section of ${\partial\phi'(x) x}$.

As one easily sees, ${p}$-quasi-norms fulfill the assumptions and some other interesting functions as well (e.g. some with very steep slope at ${0}$ like ${x\mapsto \log(x^{1/3}+1)}$).

• Jorge Nocedalgave a talk on second-order methods for non-smooth problems and his main example was a functional like

$\displaystyle \min_x f(x) + \|x\|_1$

with a convex and smooth ${f}$, but different from Xiaojun Chen, he only considered the ${1}$-norm. His talked is among the best planary talks I have ever attended and it was a great pleasure to listen to him. He carefully explained things and put them in perspective. In the case he skipped slides, he made me feel that I either did not miss an important thing, or understood them even though he didn’t show them He argued that it is not necessarily more expensive to use second order information in contrast to first order methods. Indeed, the ${1}$-norm can be used to reduce the number of degrees of freedom for a second order step. What was pretty interesting is, that he advocated semismooth Newton methods for this problem. Roland and I pursued this approach some time ago in our paper A Semismooth Newton Method for Tikhonov Functionals with Sparsity Constraints and, if I remember correctly (my notes are not complete at this point), his family of methods included our ssn-method. The method Roland and I proposed worked amazingly well in the cases in which it converged but the method suffered from non-global convergence. We had some preliminary ideas for globalization, which we could not tune enough to retain the speed of the method, and abandoned the topic. Now, that the topic will most probably be revived by the community, I am looking forward to fresh ideas here.

Today I report on two things I came across here at ISMP:

• The first is a talk by Russell Luke on Constraint qualifications for nonconvex feasibility problems. Luke treated the NP-hard problem of sparsest solutions of linear systems. In fact he did not tackle this problem but the problem to find an ${s}$-sparse solution of an ${m\times n}$ system of equations. He formulated this as a feasibility-problem (well, Heinz Bauschke was a collaborator) as follows: With the usual malpractice let us denote by ${\|x\|_0}$ the number of non-zero entries of ${x\in{\mathbb R}^n}$. Then the problem of finding an ${s}$-sparse solution to ${Ax=b}$ is:

$\displaystyle \text{Find}\ x\ \text{in}\ \{\|x\|_0\leq s\}\cap\{Ax=b\}.$

In other words: find a feasible point, i.e. a point which lies in the intersection of the two sets. Well, most often feasibility problems involve convex sets but here, the first one given by this “${0}$-norm” is definitely not convex. One of the simplest algorithms for the convex feasibility problem is to alternatingly project onto both sets. This algorithm dates back to von Neumann and has been analyzed in great detail. To make this method work for non-convex sets one only needs to know how to project onto both sets. For the case of the equality constraint ${Ax=b}$ one can use numerical linear algebra to obtain the projection. The non-convex constraint on the number of non-zero entries is in fact even easier: For ${x\in{\mathbb R}^n}$ the projection onto ${\{\|x\|_0\leq s\}}$ consists of just keeping the ${s}$ largest entries of ${x}$ while setting the others to zero (known as the “best ${s}$-term approximation”). However, the theory breaks down in the case of non-convex sets. Russell treated problem in several papers (have a look at his publication page) and in the talk he focused on the problem of constraint qualification, i.e. what kind of regularity has to be imposed on the intersection of the two sets. He could shows that (local) linear convergence of the algorithm (which is observed numerically) can indeed be justified theoretically. One point which is still open is the phenomenon that the method seems to be convergent regardless of the initialization and that (even more surprisingly) that the limit point seems to be independent of the starting point (and also seems to be robust with respect to overestimating the sparsity ${s}$). I wondered if his results are robust with respect to inexact projections. For larger problems the projection onto the equality constraint ${Ax=b}$ are computationally expensive. For example it would be interesting to see what happens if one approximates the projection with a truncated CG-iteration as Andreas, Marc and I did in our paper on subgradient methods for Basis Pursuit.

• Joel Tropp reported on his paper Sharp recovery bounds for convex deconvolution, with applications together with Michael McCoy. However, in his title he used demixing instead of deconvolution (which, I think, is more appropriate and leads to less confusion). With “demixing” they mean the following: Suppose you have two signals ${x_0}$ and ${y_0}$ of which you observe only the superposition of ${x_0}$ and a unitarily transformed ${y_0}$, i.e. for a unitary matrix ${U}$ you observe

$\displaystyle z_0 = x_0 + Uy_0.$

Of course, without further assumptions there is no way to recover ${x_0}$ and ${y_0}$ from the knowledge of ${z_0}$ and ${U}$. As one motivation he used the assumption that both ${x_0}$ and ${y_0}$ are sparse. After the big bang of compressed sensing nobody wonders that one turns to convex optimization with ${\ell^1}$-norms in the following manner:

$\displaystyle \min_{x,y} \|x\|_1 + \lambda\|y\|_1 \ \text{such that}\ x + Uy = z_0. \ \ \ \ \ (1)$

This looks a lot like sparse approximation: Eliminating ${x}$ one obtains the unconstraint problem \begin{equation*} \min_y \|z_0-Uy\|_1 + \lambda \|y\|_1. \end{equation*}

Phrased differently, this problem aims at finding an approximate sparse solution of ${Uy=z_0}$ such that the residual (could also say “noise”) ${z_0-Uy=x}$ is also sparse. This differs from the common Basis Pursuit Denoising (BPDN) by the structure function for the residual (which is the squared ${2}$-norm). This is due to the fact that in BPDN one usually assumes Gaussian noise which naturally lead to the squared ${2}$-norm. Well, one man’s noise is the other man’s signal, as we see here. Tropp and McCoy obtained very sharp thresholds on the sparsity of ${x_0}$ and ${y_0}$ which allow for exact recovery of both of them by solving (1). One thing which makes their analysis simpler is the following reformulation: The treated the related problem \begin{equation*} \min_{x,y} \|x\|_1 \text{such that} \|y\|_1\leq\alpha, x+Uy=z_0 \end{equation*} (which I would call this the Ivanov version of the Tikhonov-problem (1)). This allows for precise exploitation of prior knowledge by assuming that the number ${\alpha_0 = \|y_0\|_1}$ is known.

First I wondered if this reformulation was responsible for their unusual sharp results (sharper the results for exact recovery by BDPN), but I think it’s not. I think this is due to the fact that they have this strong assumption on the “residual”, namely that it is sparse. This can be formulated with the help of ${1}$-norm (which is “non-smooth”) in contrast to the smooth ${2}$-norm which is what one gets as prior for Gaussian noise. Moreover, McCoy and Tropp generalized their result to the case in which the structure of ${x_0}$ and ${y_0}$ is formulated by two functionals ${f}$ and ${g}$, respectively. Assuming a kind of non-smoothness of ${f}$ and ${g}$ the obtain the same kind of results and especially matrix decomposition problems are covered.

The scientific program at ISMP started today and I planned to write a small personal summary of each day. However, it is a very intense meeting. Lot’s of excellent talks, lot’s of people to meet and little spare time. So I’m afraid that I have to deviate from my plan a little bit. Instead of a summary of every day I just pick out a few events. I remark that these picks do not reflect quality, significance or something like this in any way. I just pick things for which I have something to record for personal reasons.

My day started after the first plenary which the session Testing environments for machine learning and compressed sensing in which my own talk was located. The session started with the talk by Michael Friedlander of the SPOT toolbox. Haven’t heard of SPOT yet? Take a look! In a nutshell its a toolbox which turns MATLAB into “OPLAB”, i.e. it allows to treat abstract linear operators like matrices. By the way, the code is on github.

The second talk was by Katya Scheinberg (who is giving a semi-planary talk on derivative free optimization at the moment…). She talked about speeding up FISTA by cleverly adjusting step-sizes and over-relaxation parameters and generalizing these ideas to other methods like alternating direction methods. Notably, she used the “SPEAR test instances” from our project homepage! (And credited them as “surprisingly hard sparsity problems”.)

My own talk was the third and last one in that session. I talked about the issue of constructing test instance for Basis Pursuit Denoising. I argued that the naive approach (which takes a matrix ${A}$, a right hand side ${b}$ and a parameter ${\lambda}$ and let some great solver run for a while to obtain a solution ${x^*}$) may suffer from “trusted method bias”. I proposed to use “reverse instance construction” which is: First choose ${A}$, ${\lambda}$ and the solution ${x^*}$ and the construct the right hand side ${b}$ (I blogged on this before here).

Last but not least, I’d like to mention the talk by Thomas Pock: He talked about parameter selection on variational models (think of the regularization parameter in Tikhonov, for example). In a paper with Karl Kunisch titled A bilevel optimization approach for parameter learning in variational models they formulated this as a bi-level optimization problem. An approach which seemed to have been overdue! Although they treat somehow simple inverse problems (well, denoising) (but with not so easy regularizers) it is a promising first step in this direction.

In this post I just collect a few papers that caught my attention in the last moth.

I begin with Estimating Unknown Sparsity in Compressed Sensing by Miles E. Lopes. The abstract reads

Within the framework of compressed sensing, many theoretical guarantees for signal reconstruction require that the number of linear measurements ${n}$ exceed the sparsity ${\|x\|_0}$ of the unknown signal ${x\in\mathbb{R}^p}$. However, if the sparsity ${\|x\|_0}$ is unknown, the choice of ${n}$ remains problematic. This paper considers the problem of estimating the unknown degree of sparsity of ${x}$ with only a small number of linear measurements. Although we show that estimation of ${\|x\|_0}$ is generally intractable in this framework, we consider an alternative measure of sparsity ${s(x):=\frac{\|x\|_1^2}{\|x\|_2^2}}$, which is a sharp lower bound on ${\|x\|_0}$, and is more amenable to estimation. When ${x}$ is a non-negative vector, we propose a computationally efficient estimator ${\hat{s}(x)}$, and use non-asymptotic methods to bound the relative error of ${\hat{s}(x)}$ in terms of a finite number of measurements. Remarkably, the quality of estimation is dimension-free, which ensures that ${\hat{s}(x)}$ is well-suited to the high-dimensional regime where ${n<. These results also extend naturally to the problem of using linear measurements to estimate the rank of a positive semi-definite matrix, or the sparsity of a non-negative matrix. Finally, we show that if no structural assumption (such as non-negativity) is made on the signal ${x}$, then the quantity ${s(x)}$ cannot generally be estimated when ${n<.

It’s a nice combination of the observation that the quotient ${s(x)}$ is a sharp lower bound for ${\|x\|_0}$ and that it is possible to estimate the one-norm and the two norm of a vector ${x}$ (with additional properties) from carefully chosen measurements. For a non-negative vector ${x}$ you just measure with the constant-one vector which (in a noisy environment) gives you an estimate of ${\|x\|_1}$. Similarly, measuring with Gaussian random vector you can obtain an estimate of ${\|x\|_2}$.

Then there is the dissertation of Dustin Mixon on the arxiv: Sparse Signal Processing with Frame Theory which is well worth reading but too long to provide a short overview. Here is the abstract:

Many emerging applications involve sparse signals, and their processing is a subject of active research. We desire a large class of sensing matrices which allow the user to discern important properties of the measured sparse signal. Of particular interest are matrices with the restricted isometry property (RIP). RIP matrices are known to enable eﬃcient and stable reconstruction of sfficiently sparse signals, but the deterministic construction of such matrices has proven very dfficult. In this thesis, we discuss this matrix design problem in the context of a growing field of study known as frame theory. In the ﬁrst two chapters, we build large families of equiangular tight frames and full spark frames, and we discuss their relationship to RIP matrices as well as their utility in other aspects of sparse signal processing. In Chapter 3, we pave the road to deterministic RIP matrices, evaluating various techniques to demonstrate RIP, and making interesting connections with graph theory and number theory. We conclude in Chapter 4 with a coherence-based alternative to RIP, which provides near-optimal probabilistic guarantees for various aspects of sparse signal processing while at the same time admitting a whole host of deterministic constructions.

By the way, the thesis is dedicated “To all those who never dedicated a dissertation to themselves.”

Further we have Proximal Newton-type Methods for Minimizing Convex Objective Functions in Composite Form by Jason D Lee, Yuekai Sun, Michael A. Saunders. This paper extends the well explored first order methods for problem of the type ${\min g(x) + h(x)}$ with Lipschitz-differentiable ${g}$ or simple ${\mathrm{prox}_h}$ to second order Newton-type methods. The abstract reads

We consider minimizing convex objective functions in composite form

$\displaystyle \min_{x\in\mathbb{R}^n} f(x) := g(x) + h(x)$

where ${g}$ is convex and twice-continuously differentiable and ${h:\mathbb{R}^n\rightarrow\mathbb{R}}$ is a convex but not necessarily differentiable function whose proximal mapping can be evaluated efficiently. We derive a generalization of Newton-type methods to handle such convex but nonsmooth objective functions. Many problems of relevance in high-dimensional statistics, machine learning, and signal processing can be formulated in composite form. We prove such methods are globally convergent to a minimizer and achieve quadratic rates of convergence in the vicinity of a unique minimizer. We also demonstrate the performance of such methods using problems of relevance in machine learning and high-dimensional statistics.

With this post I say goodbye for a few weeks of holiday.

Today I would like to comment on two arxiv-preprints I stumbled upon:

1. “Augmented L1 and Nuclear-Norm Models with a Globally Linearly Convergent Algorithm” – The Elastic Net rediscovered

The paper “Augmented L1 and Nuclear-Norm Models with a Globally Linearly Convergent Algorithm” by Ming-Jun Lai and Wotao Yin is another contribution to a field which is (or was?) probably the fastest growing field in applied mathematics: Algorithms for convex problems with non-smooth ${\ell^1}$-like terms. The “mother problem” here is as follows: Consider a matrix ${A\in{\mathbb R}^{m\times n}}$, ${b\in{\mathbb R}^m}$ try to find a solution of

$\displaystyle \min_{x\in{\mathbb R}^n}\|x\|_1\quad\text{s.t.}\quad Ax=b$

or, for ${\sigma>0}$

$\displaystyle \min_{x\in{\mathbb R}^n}\|x\|_1\quad\text{s.t.}\quad \|Ax-b\|\leq\sigma$

which appeared here on this blog previously. Although this is a convex problem and even has a reformulation as linear program, some instances of this problem are notoriously hard to solve and gained a lot of attention (because their applicability in sparse recovery and compressed sensing). Very roughly speaking, a part of its hardness comes from the fact that the problem is neither smooth nor strictly convex.

The contribution of Lai and Yin is that they analyze a slight perturbation of the problem which makes its solution much easier: They add another term in the objective; for ${\alpha>0}$ they consider

$\displaystyle \min_{x\in{\mathbb R}^n}\|x\|_1 + \frac{1}{2\alpha}\|x\|_2^2\quad\text{s.t.}\quad Ax=b$

or

$\displaystyle \min_{x\in{\mathbb R}^n}\|x\|_1 + \frac{1}{2\alpha}\|x\|_2^2\quad\text{s.t.}\quad \|Ax-b\|\leq\sigma.$

This perturbation does not make the problem smooth but renders it strongly convex (which usually makes the dual more smooth). It turns out that this perturbation makes life with this problem (and related ones) much easier – recovery guarantees still exists and algorithms behave better.

I think it is important to note that the “augmentation” of the ${\ell^1}$ objective with an additional squared ${\ell^2}$-term goes back to Zou and Hastie from the statistics community. There, the motivation was as follows: They observed that the pure ${\ell^1}$ objective tends to “overpromote” sparsity in the sense that if there are two columns in ${A}$ which are almost equally good in explaining some component of ${b}$ then only one of them is used. The “augmented problem”, however, tends to use both of them. They coined the method as “elastic net” (for reasons which I never really got).

I also worked on elastic-net problems for problems in the form

$\displaystyle \min_x \frac{1}{2}\|Ax-b\|^2 + \alpha\|x\|_1 + \frac{\beta}{2}\|x\|_2^2$

in this paper (doi-link). Here it also turns out that the problem gets much easier algorithmically. I found it very convenient to rewrite the elastic-net problem as

$\displaystyle \min_x \frac{1}{2}\|\begin{bmatrix}A\\ \sqrt{\beta} I\end{bmatrix}x-\begin{bmatrix}b\\ 0\end{bmatrix}\|^2 + \alpha\|x\|_1$

which turns the elastic-net problem into just another ${\ell^1}$-penalized problem with a special matrix and right hand side. Quite convenient for analysis and also somehow algorithmically.

2. Towards a Mathematical Theory of Super-Resolution

The second preprint is “Towards a Mathematical Theory of Super-Resolution” by Emmanuel Candes and Carlos Fernandez-Granda.

The idea of super-resolution seems to pretty old and, very roughly speaking, is to extract a higher resolution of a measured quantity (e.g. an image) than the measured data allows. Of course, in this formulation this is impossible. But often one can gain something by additional knowledge of the image. Basically, this also is the idea behind compressed sensing and hence, it does not come as a surprise that the results in compressed sensing are used to try to explain when super-resolution is possible.

The paper by Candes and Fernandez-Granada seems to be pretty close in spirit to Exact Reconstruction using Support Pursuit on which I blogged earlier. They model the sparse signal as a Radon measure, especially as a sum of Diracs. However, different from the support-pursuit-paper they use complex exponentials (in contrast to real polynomials). Their reconstruction method is basically the same as support pursuit: The try to solve

$\displaystyle \min_{x\in\mathcal{M}} \|x\|\quad\text{s.t.}\quad Fx=y, \ \ \ \ \ (1)$

i.e. they minimize over the set of Radon measures ${\mathcal{M}}$ under the constraint that certain measurements ${Fx\in{\mathbb R}^n}$ result in certain given values ${y}$. Moreover, they make a thorough analysis of what is “reconstructable” by their ansatz and obtain a lower bound on the distance of two Diracs (in other words, a lower bound in the Prokhorov distance). I have to admit that I do not share one of their claims from the abstract: “We show that one can super-resolve these point sources with infinite precision—i.e. recover the exact locations and amplitudes—by solving a simple convex program.” My point is that I can not see to what extend the problem (1) is a simple one. Well, it is convex, but it does not seem to be simple.

I want to add that the idea of “continuous sparse modelling” in the space of signed measures is very appealing to me and appeared first in Inverse problems in spaces of measures by Kristian Bredies and Hanna Pikkarainen.

Today I write about sparse recovery of “multidimensional” signal. With “multidimensional” I mean something like this: A one-dimensional signal is a vector ${x\in{\mathbb R}^n}$ while a two-dimensional signal is a matrix ${x\in{\mathbb R}^{n_1\times n_2}}$. Similarly, ${d}$-dimensional signal is ${x\in{\mathbb R}^{n_1\times\cdots\times n_d}}$. Of course, images as two-dimensional signals, come time mind. Moreover, movies are three-dimensional, a hyperspectral 2D image (which has a whole spectrum attached to any pixel) is also three-dimensional), and time-dependent volume-data is four-dimensional.

Multidimensional data is often a challenge due the large amount of data. While the size of the signals is usually not the problem it is more the size of the measurement matrices. In the context of compressed sensing or sparse recovery the signal is measured with a linear operator, i.e. one applies a number ${m}$ of linear functionals to the signal. In the ${d}$-dimensional case this can be encoded as a matrix ${A\in {\mathbb R}^{m\times \prod_1^d n_i}}$ and this is where the trouble with the data comes in: If you have megapixel image (which is still quite small) the matrix has a million of columns and if you have a dense matrix, storage becomes an issue.

One approach (which is indeed quite old) to tackle this problem is, to consider special measurement matrices: If the signal has a sparse structure is every slice, i.e. every vectors of the form ${x(i_1,\dots,i_{k-1},:,i_{k+1},\dots,i_d)}$ where we fix all but the ${k}$-th component, then the Kronecker product of measurement matrices for each dimension is the right thing.

The Kronecker product of two matrices ${A\in{\mathbb R}^{m\times n}}$ and ${B\in{\mathbb R}^{k\times j}}$ is the ${mk\times nj}$ matrix

$\displaystyle A\otimes B = \begin{bmatrix} a_{11}B & \dots & a_{1n}B\\ \vdots & & \vdots\\ a_{n1}B & \dots & a_{nn}B \end{bmatrix}.$

This has a lot to do with the tensor product and you should read the Wikipedia entry. Moreover, it is numerically advantageous not to build the Kronecker product of dense matrices if you only want to apply it to a given signal. To see this, we introduce the vectorization operator ${\text{vec}:{\mathbb R}^{m\times n}\rightarrow{\mathbb R}^{nm}}$ which takes a matrix ${X}$ and stacks its columns into a tall column vector. For matrices ${A}$ and ${B}$ (of fitting sizes) it holds that

$\displaystyle (B^T\otimes A)\text{vec}(X) = \text{vec}(AXB).$

So, multiplying ${X}$ from the left and from the right gives the application of the Kronecker product.

The use of Kronecker products in numerical linear algebra is fairly old (for example they are helpful for multidimensional finite difference schemes where you can build Kronecker products of sparse difference operators in respective dimensions). Recently, they have been discovered for compressed sensing and sparse recovery in these two papers: Sparse solutions to underdetermined Kronecker product systems by Sadegh Jokar and Volker Mehrmann and the more recent Kronecker Compressed Sensing by Marco Duarte and Rich Baraniuk.

From these papers you can extract some interestingly simple and nice theorems:

Theorem 1 For matrices ${A_1,\dots A_d}$ with restricted isometry constant ${\delta_K(A_1),\dots,\delta_K(A_d)}$ of order ${K}$ it holds that the restricted isometry constant of the Kronecker product fulfills

$\displaystyle \max_i \delta_K(A_i) \leq \delta_K(A_1\otimes\cdots\otimes A_d) \leq \prod_1^d (1+\delta_K(A_i))-1.$

Basically, the RIP constant of a Kronecker product is not better than the worst one but still not too large.

Theorem 2 For matrices ${A_1,\dots, A_d}$ with columns normalized to one, it hold that the spark of their Kronecker product fulfills

$\displaystyle \text{spark}(A\otimes \dots\otimes A_d) = = \min_i\text{spark}(A_i).$

Theorem 3 For matrices ${A_1,\dots, A_d}$ with columns normalized to one, it hold that the mutual coherence of their Kronecker product fulfills

$\displaystyle \mu(A_1\otimes\dots\otimes A_d) = \max_i \mu(A_i).$

If you want to have a sparse solution to a linear system of equation and have heard of compressed sensing or sparse reconstruction than you probably know what to do: Get one of the many solvers for Basis Pursuit and be happy.

Basis Pursuit was designed as a convex approximation of the generally intractable problem of finding the sparsest solution (that is, the solution with the smallest number of non-zero entries). By abuse of notation, we define for ${x\in\mathbb{R}^n}$

$\displaystyle \|x\|_0 = \#\{i\ : x_i\neq 0\}.$

(Because of ${\|x\|_0 = \lim_{p\rightarrow 0}\|x\|_p^p}$ some people prefer the, probably more correct but also more confusing, notation ${\|x\|_0^0}$…).

Then, the sparsest solution of ${Ax=b}$ is the solution of

$\displaystyle \min_x \|x\|_0,\quad \text{s.t.}\ Ax=b$

and Basis Pursuit replaces ${\|x\|_0}$ with “the closest convex proxy”, i.e.

$\displaystyle \min_x \|x\|_1,\quad\text{s.t.}\ Ax=b.$

The good thing about Basis Pursuit suit is, that is really gives the sparsest solution under appropriate conditions as is widely known nowadays. Here I’d like to present two simple examples in which the Basis Pursuit solution is

• not even close to the sparsest solution (by norm).
• not sparse at all.

We can build a bad matrix for Basis Pursuit, even in the case ${2\times 3}$: For a small ${\epsilon>0}$ define

$\displaystyle A = \begin{bmatrix} \epsilon & 1 & 0\\ \epsilon & 0 & 1 \end{bmatrix}, \quad b = \begin{bmatrix} 1\\1 \end{bmatrix}.$

Of course, the sparsest solution is

$\displaystyle x_0 = \begin{bmatrix} 1/\epsilon\\ 0\\ 0\end{bmatrix}$

while the solution of Basis Pursuit is

$\displaystyle x_1 = \begin{bmatrix} 0\\1\\1 \end{bmatrix}.$

The summarize: For ${\epsilon<1/2}$

$\displaystyle \|x_0\|_0 = 1 < 2 = \|x_1\|_0,\quad \|x_0\|_1 = 1/\epsilon > 2 = \|x_1\|_1.$

(There is also a least squares solution that has three non-zero entries and a one-norm slightly larger than 2.)

Granted, this matrix is stupid. Especially, its first column has a very small norm compared to the others. Ok, let’s construct a matrix with normalized columns.

2. A small bad matrix with normalized columns

Fix an integer ${n}$ and a small ${\epsilon>0}$. We define a ${n\times(n+2)}$-matrix

$\displaystyle \begin{bmatrix} 1+\epsilon/2 & -1+\epsilon/2 & 1 & 0 & \dots & \dots &0\\ -1+\epsilon/2 & 1+\epsilon/2 & 0 & 1 & \ddots & & 0\\ \epsilon/2 & \epsilon/2 & \vdots & \ddots& \ddots & \ddots& \vdots\\ \vdots & \vdots & \vdots & & \ddots & \ddots& 0\\ \epsilon/2 & \epsilon/2 & 0 & \dots& \dots& 0 & 1 \end{bmatrix}.$

Ok, the first two columns do not have norm 1 yet, so we normalize them by multiplying with the right constant

$\displaystyle c = \frac{1}{\sqrt{2 + \tfrac{n\epsilon^2}{4}}}$

(which is close to ${1/\sqrt{2}}$) to get

$\displaystyle A = \begin{bmatrix} c(1+\epsilon/2) & c(-1+\epsilon/2) & 1 & 0 & \dots & \dots &0\\ c(-1+\epsilon/2) & c(1+\epsilon/2) & 0 & 1 & \ddots & & 0\\ c\epsilon/2 & c\epsilon/2 & \vdots & \ddots& \ddots & \ddots& \vdots\\ \vdots & \vdots & \vdots & & \ddots & \ddots& 0\\ c\epsilon/2 & c\epsilon/2 & 0 & \dots& \dots& 0 & 1 \end{bmatrix}.$

Now we take the right hand side

$\displaystyle b = \begin{bmatrix} 1\\\vdots\\1 \end{bmatrix}$

and see what solutions to ${Ax=b}$ are there.

First, there is the least squares solution ${x_{\text{ls}} = A^\dagger b}$. This has only non-zero entries, the last ${n}$ entries are slightly smaller than ${1}$ and the first two are between ${0}$ and ${1}$, hence, ${\|x_{\text{ls}}\|_1 \approx n}$ (in fact, slightly larger).

Second, there is a very sparse solution

$\displaystyle x_0 = \frac{1}{\epsilon c} \begin{bmatrix} 1\\ 1\\ 0\\ \vdots\\ 0 \end{bmatrix}.$

This has two non-zero entries and a pretty large one-norm: ${\|x_0\|_1 = 2/(\epsilon c)}$.

Third there is a solution with small one-norm:

$\displaystyle x_1 = \begin{bmatrix} 0\\ 0\\ 1\\ \vdots\\ 1 \end{bmatrix}.$

We have ${n}$ non-zero entries and ${\|x_1\|_1 = n}$. You can check that this ${x_1}$ is also the unique Basis Pursuit solution (e.g. by observing that ${A^T[1,\dots,1]^T}$ is an element of ${\partial\|x_1\|_1}$ and that the first two entries in ${A^T[1,\dots,1]^T}$ are strictly smaller than 1 and positive – put differently, the vector ${[1,\dots,1]^T}$ is dual certificate for ${x_1}$).

To summarize, for ${\epsilon < \sqrt{\frac{8}{n^2-n}}}$ it holds that

$\displaystyle \|x_0\|_0 = 2 < n = \|x_1\|_0,\quad \|x_0\|_1 = 2/(c\epsilon) > n = \|x_1\|_1.$

The geometric idea behind this matrix is as follows: We take ${n}$ simple normalized columns (the identity part in ${A}$) which sum up to the right hand side ${b}$. Then we take two normalized vectors which are almost orthogonal to ${b}$ but have ${b}$ in their span (but one needs huge factors here to obtain ${b}$).

Well, this matrix looks very artificial and indeed it’s constructed for one special purpose: To show that minimal ${\ell^1}$-norm solution are not always sparse (even when a sparse solution exists). It’s some kind of a hobby for me to construct instances for sparse reconstruction with extreme properties and I am thinking about a kind of “gallery” of these instances (probably extending the “gallery” command in Matlab).
By the way: if you want to play around with this matrix, here is the code

n = 100;
epsilon = sqrt(8/(n^2-n))+0.1;
c = 1/sqrt(2+n*epsilon^2/4);
A = zeros(n,n+2);
A(1:2,1:2) = ([1 -1;-1,1]+epsilon/2)*c;
A(3:n,1:2) = epsilon/2*c;
A(1:n,3:n+2) = eye(n);
b = ones(n,1);

How many samples are needed to reconstruct a sparse signal?

Well, there are many, many results around some of which you probably know (at least if you are following this blog or this one). Today I write about a neat result which I found quite some time ago on reconstruction of nonnegative sparse signals from a semi-continuous perspective.

1. From discrete sparse reconstruction/compressed sensing to semi-continuous

The basic sparse reconstruction problem asks the following: Say we have a vector ${x\in{\mathbb R}^m}$ which only has ${s non-zero entries and a fat matrix ${A\in{\mathbb R}^{n\times m}}$ (i.e. ${n>m}$) and consider that we are given measurements ${b=Ax}$. Of course, the system ${Ax=b}$ is underdetermined. However, we may add a little more prior knowledge on the solution and ask: Is is possible to reconstruct ${x}$ from ${b}$ if we know that the vector ${x}$ is sparse? If yes: How? Under what conditions on ${m}$, ${s}$, ${n}$ and ${A}$? This question created the expanding universe of compressed sensing recently (and this universe is expanding so fast that for sure there has to be some dark energy in it). As a matter of fact, a powerful method to obtain sparse solutions to underdetermined systems is ${\ell^1}$-minimization a.k.a. Basis Pursuit on which I blogged recently: Solve

$\displaystyle \min_x \|x\|_1\ \text{s.t.}\ Ax=b$

and the important ingredient here is the ${\ell^1}$-norm of the vector in the objective function.

In this post I’ll formulate semi-continuous sparse reconstruction. We move from an ${m}$-vector ${x}$ to a finite signed measure ${\mu}$ on a closed interval (which we assume to be ${I=[-1,1]}$ for simplicty). We may embed the ${m}$-vectors into the space of finite signed measures by choosing ${m}$ points ${t_i}$, ${i=1,\dots, m}$ from the interval ${I}$ and build ${\mu = \sum_{i=1}^m x_i \delta_{t_i}}$ with the point-masses (or Dirac measures) ${\delta_{t_i}}$. To a be a bit more precise, we speak about the space ${\mathfrak{M}}$ of Radon measures on ${I}$, which are defined on the Borel ${\sigma}$-algebra of ${I}$ and are finite. Radon measures are not very scary objects and an intuitive way to think of them is to use Riesz representation: Every Radon measure arises as a continuous linear functional on a space of continuous functions, namely the space ${C_0(I)}$ which is the closure of the continuous functions with compact support in ${{]{-1,1}[}}$ with respect to the supremum norm. Hence, Radon measures work on these functions as ${\int_I fd\mu}$. It is also natural to speak of the support ${\text{supp}(\mu)}$ of a Radon measure ${\mu}$ and it holds for any continuous function ${f}$ that

$\displaystyle \int_I f d\mu = \int_{\text{supp}(\mu)}f d\mu.$

An important tool for Radon measures is the Hahn-Jordan decomposition which decomposes ${\mu}$ into a positive part ${\mu^+}$ and a negative part ${\mu^-}$, i.e. ${\mu^+}$ and ${\mu^-}$ are non-negative and ${\mu = \mu^+-\mu^-}$. Finally the variation of a measure, which is

$\displaystyle \|\mu\| = \mu^+(I) + \mu^-(I)$

provides a norm on the space of Radon measures.

Example 1 For the measure ${\mu = \sum_{i=1}^m x_i \delta_{t_i}}$ one readily calculates that

$\displaystyle \mu^+ = \sum_i \max(0,x_i)\delta_{t_i},\quad \mu^- = \sum_i \max(0,-x_i)\delta_{t_i}$

and hence

$\displaystyle \|\mu\| = \sum_i |x_i| = \|x\|_1.$

In this sense, the space of Radon measures provides a generalization of ${\ell^1}$.

We may sample a Radon measure ${\mu}$ with ${n+1}$ linear functionals and these can be encoded by ${n+1}$ continuous functions ${u_0,\dots,u_n}$ as

$\displaystyle b_k = \int_I u_k d\mu.$

This sampling gives a bounded linear operator ${K:\mathfrak{M}\rightarrow {\mathbb R}^{n+1}}$. The generalization of Basis Pursuit is then given by

$\displaystyle \min_{\mu\in\mathfrak{M}} \|\mu\|\ \text{s.t.}\ K\mu = b.$

This was introduced and called “Support Pursuit” in the preprint Exact Reconstruction using Support Pursuit by Yohann de Castro and Frabrice Gamboa.

More on the motivation and the use of Radon measures for sparsity can be found in Inverse problems in spaces of measures by Kristian Bredies and Hanna Pikkarainen.

2. Exact reconstruction of sparse nonnegative Radon measures

Before I talk about the results we may count the degrees of freedom a sparse Radon measure has: If ${\mu = \sum_{i=1}^s x_i \delta_{t_i}}$ with some ${s}$ than ${\mu}$ is defined by the ${s}$ weights ${x_i}$ and the ${s}$ positions ${t_i}$. Hence, we expect that at least ${2s}$ linear measurements should be necessary to reconstruct ${\mu}$. Surprisingly, this is almost enough if we know that the measure is nonnegative! We only need one more measurement, that is ${2s+1}$ and moreover, we can take fairly simple measurements, namely the monomials: ${u_i(t) = t^i}$ ${i=0,\dots, n}$ (with the convention that ${u_0(t)\equiv 1}$). This is shown in the following theorem by de Castro and Gamboa.

Theorem 1 Let ${\mu = \sum_{i=1}^s x_i\delta_{t_i}}$ with ${x_i\geq 0}$, ${n=2s}$ and let ${u_i}$, ${i=0,\dots n}$ be the monomials as above. Define ${b_i = \int_I u_i(t)d\mu}$. Then ${\mu}$ is the unique solution of the support pursuit problem, that is of

$\displaystyle \min \|\nu\|\ \text{s.t.}\ K\nu = b.\qquad \textup{(SP)}$

Proof: The following polynomial will be of importance: For a constant ${c>0}$ define

$\displaystyle P(t) = 1 - c \prod_{i=1}^s (t-t_i)^2.$

The following properties of ${P}$ will be used:

1. ${P(t_i) = 1}$ for ${i=1,\dots,s}$
2. ${P}$ has degree ${n=2s}$ and hence, is a linear combination of the ${u_i}$, ${i=0,\dots,n}$, i.e. ${P = \sum_{k=0}^n a_k u_k}$.
3. For ${c}$ small enough it holds for ${t\neq t_i}$ that ${|P(t)|<1}$.

Now let ${\sigma}$ be a solution of (SP). We have to show that ${\|\mu\|\leq \|\sigma\|}$. Due to property 2 we know that

$\displaystyle \int_I u_k d\sigma = (K\sigma)k = b_k = \int_I u_k d\mu.$

Due to property 1 and non-negativity of ${\mu}$ we conclude that

$\displaystyle \begin{array}{rcl} \|\mu\| & = & \sum_{i=1}^s x_i = \int_I P d\mu\\ & = & \int_I \sum_{k=0}^n a_k u_k d\mu\\ & = & \sum_{k=0}^n a_k \int_I u_k d\mu\\ & = & \sum_{k=0}^n a_k \int_I u_k d\sigma\\ & = & \int_I P d\sigma. \end{array}$

Moreover, by Lebesgue’s decomposition we can decompose ${\sigma}$ with respect to ${\mu}$ such that

$\displaystyle \sigma = \underbrace{\sum_{i=1}^s y_i\delta_{t_i}}_{=\sigma_1} + \sigma_2$

and ${\sigma_2}$ is singular with respect to ${\mu}$. We get

$\displaystyle \begin{array}{rcl} \int_I P d\sigma = \sum_{i=1}^s y_i + \int P d\sigma_2 \leq \|\sigma_1\| + \|\sigma_2\|=\|\sigma\| \end{array}$

and we conclude that ${\|\sigma\| = \|\mu\|}$ and especially ${\int_I P d\sigma_2 = \|\sigma_2\|}$. This shows that ${\mu}$ is a solution to ${(SP)}$. It remains to show uniqueness. We show the following: If there is a ${\nu\in\mathfrak{M}}$ with support in ${I\setminus\{x_1,\dots,x_s\}}$ such that ${\int_I Pd\nu = \|\nu\|}$, then ${\nu=0}$. To see this, we build, for any ${r>0}$, the sets

$\displaystyle \Omega_r = [-1,1]\setminus \bigcup_{i=1}^s ]x_i-r,x_i+r[.$

and assume that there exists ${r>0}$ such that ${\|\nu|_{\Omega_r}\|\neq 0}$ (${\nu|_{\Omega_r}}$ denoting the restriction of ${\nu}$ to ${\Omega_r}$). However, it holds by property 3 of ${P}$ that

$\displaystyle \int_{\Omega_r} P d\nu < \|\nu|_{\Omega_r}\|$

and consequently

$\displaystyle \begin{array}{rcl} \|\nu\| &=& \int Pd\nu = \int_{\Omega_r} Pd\nu + \int_{\Omega_r^C} P d\nu\\ &<& \|\nu|_{\Omega_r}\| + \|\nu|_{\Omega_r^C}\| = \|\nu\| \end{array}$

which is a contradiction. Hence, ${\nu|_{\Omega_r}=0}$ for all ${r}$ and this implies ${\nu=0}$. Since ${\sigma_2}$ has its support in ${I\setminus\{x_1,\dots,x_s\}}$ we conclude that ${\sigma_2=0}$. Hence the support of ${\sigma}$ is exactly ${\{x_1,\dots,x_s\}}$. and since ${K\sigma = b = K\mu}$ and hence ${K(\sigma-\mu) = 0}$. This can be written as a Vandermonde system

$\displaystyle \begin{pmatrix} u_0(t_1)& \dots &u_0(t_s)\\ \vdots & & \vdots\\ u_n(t_1)& \dots & u(t_s) \end{pmatrix} \begin{pmatrix} y_1 - x_1\\ \vdots\\ y_s - x_s \end{pmatrix} = 0$

which only has the zero solution, giving ${y_i=x_i}$. $\Box$

3. Generalization to other measurements

The measurement by monomials may sound a bit unusual. However, de Castro and Gamboa show more. What really matters here is that the monomials for a so-called Chebyshev-System (or Tchebyscheff-system or T-system – by the way, have you ever tried to google for a T-system?). This is explained, for example in the book “Tchebycheff Systems: With Applications in Analysis and Statistics” by Karlin and Studden. A T-system on ${I}$ is simply a set of ${n+1}$ functions ${\{u_0,\dots, u_n\}}$ such that any linear combination of these functions has at most ${n}$ zeros. These systems are called after Tchebyscheff since they obey many of the helpful properties of the Tchebyscheff-polynomials.

What is helpful in our context is the following theorem of Krein:

Theorem 2 (Krein) If ${\{u_0,\dots,u_n\}}$ is a T-system for ${I}$, ${k\leq n/2}$ and ${t_1,\dots,t_k}$ are in the interior of ${I}$, then there exists a linear combination ${\sum_{k=0}^n a_k u_k}$ which is non-negative and vanishes exactly the the point ${t_i}$.

Now consider that we replace the monomials in Theorem~1 by a T-system. You recognize that Krein’s Theorem allows to construct a “generalized polynomial” which fulfills the same requirements than the polynomial ${P}$ is the proof of Theorem~1 as soon as the constant function 1 lies in the span of the T-system and indeed the result of Theorem~1 is also valid in that case.

4. Exact reconstruction of ${s}$-sparse nonnegative vectors from ${2s+1}$ measurements

From the above one can deduce a reconstruction result for ${s}$-sparse vectors and I quote Theorem 2.4 from Exact Reconstruction using Support Pursuit:

Theorem 3 Let ${n}$, ${m}$, ${s}$ be integers such that ${s\leq \min(n/2,m)}$ and let ${\{1,u_1,\dots,u_n\}}$ be a complete T-system on ${I}$ (that is, ${\{1,u_1,\dots,u_r\}}$ is a T-system on ${I}$ for all ${r). Then it holds: For any distinct reals ${t_1,\dots,t_m}$ and ${A}$ defined as

$\displaystyle A=\begin{pmatrix} 1 & \dots & 1\\ u_1(t_1)& \dots &u_1(t_m)\\ \vdots & & \vdots\\ u_n(t_1)& \dots & u(t_m) \end{pmatrix}$

Basis Pursuit recovers all nonnegative ${s}$-sparse vectors ${x\in{\mathbb R}^m}$.

5. Concluding remarks

Note that Theorem~3 gives a deterministic construction of a measurement matrix.

Also note, that nonnegativity is crucial in what we did here. This allowed (in the monomial case) to work with squares and obtain the polynomial ${P}$ in the proof of Theorem~1 (which is also called “dual certificate” in this context). This raises the question how this method can be adapted to all sparse signals. One needs (in the monomial case) a polynomial which is bounded by 1 but matches the signs of the measure on its support. While this can be done (I think) for polynomials it seems difficult to obtain a generalization of Krein’s Theorem to this case…