### Signal and image processing

How many samples are needed to reconstruct a sparse signal?

Well, there are many, many results around, some of which you probably know (at least if you are following this blog or this one). Today I write about a neat result, which I found quite some time ago, on the reconstruction of nonnegative sparse signals from a semi-continuous perspective.

1. From discrete sparse reconstruction/compressed sensing to semi-continuous

The basic sparse reconstruction problem asks the following: Say we have a vector ${x\in{\mathbb R}^m}$ which only has ${s}$ non-zero entries and a fat matrix ${A\in{\mathbb R}^{n\times m}}$ (i.e. ${n<m}$) and consider that we are given measurements ${b=Ax}$. Of course, the system ${Ax=b}$ is underdetermined. However, we may add a little more prior knowledge about the solution and ask: Is it possible to reconstruct ${x}$ from ${b}$ if we know that the vector ${x}$ is sparse? If yes: How? Under what conditions on ${m}$, ${s}$, ${n}$ and ${A}$? This question created the expanding universe of compressed sensing recently (and this universe is expanding so fast that for sure there has to be some dark energy in it). As a matter of fact, a powerful method to obtain sparse solutions of underdetermined systems is ${\ell^1}$-minimization, a.k.a. Basis Pursuit, on which I blogged recently: Solve

$\displaystyle \min_x \|x\|_1\ \text{s.t.}\ Ax=b$

and the important ingredient here is the ${\ell^1}$-norm of the vector in the objective function.
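As a side remark (not from the original post), Basis Pursuit can be solved with a plain linear program by splitting ${x = u - v}$ with ${u,v\geq 0}$; here is a minimal sketch using numpy and scipy (the sizes and the sparsity pattern below are made up for illustration):

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(A, b):
    """Solve min ||x||_1 s.t. Ax = b via the LP split x = u - v with u, v >= 0."""
    n, m = A.shape
    res = linprog(np.ones(2 * m),                    # objective: sum(u) + sum(v)
                  A_eq=np.hstack([A, -A]), b_eq=b,   # [A, -A] [u; v] = b
                  bounds=[(0, None)] * (2 * m))
    return res.x[:m] - res.x[m:]

rng = np.random.default_rng(0)
A = rng.standard_normal((25, 60))        # fat matrix: 25 measurements, 60 unknowns
x_true = np.zeros(60)
x_true[[3, 17, 40]] = [2.0, -1.5, 1.0]   # a 3-sparse signal
b = A @ x_true

x_rec = basis_pursuit(A, b)
print(np.abs(x_rec - x_true).max())      # with enough random measurements this is tiny
```

The recovered vector is always feasible and has ${\ell^1}$-norm at most that of the true signal; exact recovery is what compressed sensing theory predicts for enough random measurements.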

In this post I’ll formulate semi-continuous sparse reconstruction. We move from an ${m}$-vector ${x}$ to a finite signed measure ${\mu}$ on a closed interval (which we assume to be ${I=[-1,1]}$ for simplicity). We may embed the ${m}$-vectors into the space of finite signed measures by choosing ${m}$ points ${t_i}$, ${i=1,\dots, m}$ from the interval ${I}$ and building ${\mu = \sum_{i=1}^m x_i \delta_{t_i}}$ with the point-masses (or Dirac measures) ${\delta_{t_i}}$. To be a bit more precise, we speak about the space ${\mathfrak{M}}$ of Radon measures on ${I}$, which are defined on the Borel ${\sigma}$-algebra of ${I}$ and are finite. Radon measures are not very scary objects and an intuitive way to think of them is to use Riesz representation: Every Radon measure arises as a continuous linear functional on a space of continuous functions, namely the space ${C_0(I)}$ which is the closure of the continuous functions with compact support in ${{]{-1,1}[}}$ with respect to the supremum norm. Hence, Radon measures act on these functions as ${\int_I fd\mu}$. It is also natural to speak of the support ${\text{supp}(\mu)}$ of a Radon measure ${\mu}$ and it holds for any continuous function ${f}$ that

$\displaystyle \int_I f d\mu = \int_{\text{supp}(\mu)}f d\mu.$

An important tool for Radon measures is the Hahn-Jordan decomposition which decomposes ${\mu}$ into a positive part ${\mu^+}$ and a negative part ${\mu^-}$, i.e. ${\mu^+}$ and ${\mu^-}$ are non-negative and ${\mu = \mu^+-\mu^-}$. Finally the variation of a measure, which is

$\displaystyle \|\mu\| = \mu^+(I) + \mu^-(I)$

provides a norm on the space of Radon measures.

Example 1 For the measure ${\mu = \sum_{i=1}^m x_i \delta_{t_i}}$ one readily calculates that

$\displaystyle \mu^+ = \sum_i \max(0,x_i)\delta_{t_i},\quad \mu^- = \sum_i \max(0,-x_i)\delta_{t_i}$

and hence

$\displaystyle \|\mu\| = \sum_i |x_i| = \|x\|_1.$

In this sense, the space of Radon measures provides a generalization of ${\ell^1}$.

We may sample a Radon measure ${\mu}$ with ${n+1}$ linear functionals and these can be encoded by ${n+1}$ continuous functions ${u_0,\dots,u_n}$ as

$\displaystyle b_k = \int_I u_k d\mu.$

This sampling gives a bounded linear operator ${K:\mathfrak{M}\rightarrow {\mathbb R}^{n+1}}$. The generalization of Basis Pursuit is then given by

$\displaystyle \min_{\mu\in\mathfrak{M}} \|\mu\|\ \text{s.t.}\ K\mu = b.$

This was introduced and called “Support Pursuit” in the preprint Exact Reconstruction using Support Pursuit by Yohann de Castro and Fabrice Gamboa.

More on the motivation and the use of Radon measures for sparsity can be found in Inverse problems in spaces of measures by Kristian Bredies and Hanna Pikkarainen.

2. Exact reconstruction of sparse nonnegative Radon measures

Before I talk about the results we may count the degrees of freedom a sparse Radon measure has: If ${\mu = \sum_{i=1}^s x_i \delta_{t_i}}$ for some ${s}$, then ${\mu}$ is determined by the ${s}$ weights ${x_i}$ and the ${s}$ positions ${t_i}$. Hence, we expect that at least ${2s}$ linear measurements should be necessary to reconstruct ${\mu}$. Surprisingly, this is almost enough if we know that the measure is nonnegative! We only need one more measurement, that is, ${2s+1}$, and moreover, we can take fairly simple measurements, namely the monomials: ${u_i(t) = t^i}$, ${i=0,\dots, n}$ (with the convention that ${u_0(t)\equiv 1}$). This is shown in the following theorem by de Castro and Gamboa.

Theorem 1 Let ${\mu = \sum_{i=1}^s x_i\delta_{t_i}}$ with ${x_i\geq 0}$, ${n=2s}$ and let ${u_i}$, ${i=0,\dots,n}$ be the monomials as above. Define ${b_i = \int_I u_i(t)d\mu}$. Then ${\mu}$ is the unique solution of the support pursuit problem, that is, of

$\displaystyle \min \|\nu\|\ \text{s.t.}\ K\nu = b.\qquad \textup{(SP)}$

Proof: The following polynomial will be of importance: For a constant ${c>0}$ define

$\displaystyle P(t) = 1 - c \prod_{i=1}^s (t-t_i)^2.$

The following properties of ${P}$ will be used:

1. ${P(t_i) = 1}$ for ${i=1,\dots,s}$
2. ${P}$ has degree ${n=2s}$ and hence, is a linear combination of the ${u_i}$, ${i=0,\dots,n}$, i.e. ${P = \sum_{k=0}^n a_k u_k}$.
3. For ${c}$ small enough it holds for all ${t\in I}$ with ${t\neq t_i}$ that ${|P(t)|<1}$.

Now let ${\sigma}$ be a solution of (SP). We have to show that ${\|\mu\|\leq \|\sigma\|}$. Due to property 2 we know that

$\displaystyle \int_I u_k d\sigma = (K\sigma)_k = b_k = \int_I u_k d\mu.$

Due to property 1 and non-negativity of ${\mu}$ we conclude that

$\displaystyle \begin{array}{rcl} \|\mu\| & = & \sum_{i=1}^s x_i = \int_I P d\mu\\ & = & \int_I \sum_{k=0}^n a_k u_k d\mu\\ & = & \sum_{k=0}^n a_k \int_I u_k d\mu\\ & = & \sum_{k=0}^n a_k \int_I u_k d\sigma\\ & = & \int_I P d\sigma. \end{array}$

Moreover, by Lebesgue’s decomposition we can decompose ${\sigma}$ with respect to ${\mu}$ such that

$\displaystyle \sigma = \underbrace{\sum_{i=1}^s y_i\delta_{t_i}}_{=\sigma_1} + \sigma_2$

and ${\sigma_2}$ is singular with respect to ${\mu}$. We get

$\displaystyle \begin{array}{rcl} \int_I P d\sigma = \sum_{i=1}^s y_i + \int P d\sigma_2 \leq \|\sigma_1\| + \|\sigma_2\|=\|\sigma\| \end{array}$

and we conclude that ${\|\sigma\| = \|\mu\|}$ and especially ${\int_I P d\sigma_2 = \|\sigma_2\|}$. This shows that ${\mu}$ is a solution of ${(SP)}$. It remains to show uniqueness. We show the following: If there is a ${\nu\in\mathfrak{M}}$ with support in ${I\setminus\{t_1,\dots,t_s\}}$ such that ${\int_I Pd\nu = \|\nu\|}$, then ${\nu=0}$. To see this, we build, for any ${r>0}$, the sets

$\displaystyle \Omega_r = [-1,1]\setminus \bigcup_{i=1}^s ]t_i-r,t_i+r[.$

and assume that there exists ${r>0}$ such that ${\|\nu|_{\Omega_r}\|\neq 0}$ (${\nu|_{\Omega_r}}$ denoting the restriction of ${\nu}$ to ${\Omega_r}$). However, it holds by property 3 of ${P}$ that

$\displaystyle \int_{\Omega_r} P d\nu < \|\nu|_{\Omega_r}\|$

and consequently

$\displaystyle \begin{array}{rcl} \|\nu\| &=& \int Pd\nu = \int_{\Omega_r} Pd\nu + \int_{\Omega_r^C} P d\nu\\ &<& \|\nu|_{\Omega_r}\| + \|\nu|_{\Omega_r^C}\| = \|\nu\| \end{array}$

which is a contradiction. Hence, ${\nu|_{\Omega_r}=0}$ for all ${r}$ and this implies ${\nu=0}$. Since ${\sigma_2}$ has its support in ${I\setminus\{t_1,\dots,t_s\}}$ we conclude that ${\sigma_2=0}$. Hence, the support of ${\sigma}$ is contained in ${\{t_1,\dots,t_s\}}$, i.e. ${\sigma = \sum_{i=1}^s y_i\delta_{t_i}}$. Since ${K\sigma = b = K\mu}$ we get ${K(\sigma-\mu) = 0}$. This can be written as a Vandermonde system

$\displaystyle \begin{pmatrix} u_0(t_1)& \dots &u_0(t_s)\\ \vdots & & \vdots\\ u_n(t_1)& \dots & u_n(t_s) \end{pmatrix} \begin{pmatrix} y_1 - x_1\\ \vdots\\ y_s - x_s \end{pmatrix} = 0$

which only has the zero solution, giving ${y_i=x_i}$. $\Box$
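To get a feeling for the dual certificate ${P}$, here is a small numerical check (my own toy example, not from the paper) of properties 1 and 3 for three support points and a fine grid on ${I=[-1,1]}$:

```python
import numpy as np

t_nodes = np.array([-0.7, 0.1, 0.6])                 # s = 3 support points in [-1, 1]
grid = np.linspace(-1, 1, 20001)

# the product prod_i (t - t_i)^2 evaluated on the grid
prod_vals = np.prod((grid[:, None] - t_nodes) ** 2, axis=1)
c = 1.9 / prod_vals.max()                            # keeps P above -1 on the grid

P = 1.0 - c * prod_vals                              # P(t) = 1 - c * prod_i (t - t_i)^2
P_at_nodes = 1.0 - c * np.prod((t_nodes[:, None] - t_nodes) ** 2, axis=1)

print(np.allclose(P_at_nodes, 1.0))                  # property 1: P(t_i) = 1
print(P.max() <= 1.0 and P.min() > -1.0)             # property 3 on the grid
```

Since the product term vanishes exactly at the ${t_i}$ and is positive elsewhere, ${P\leq 1}$ always, and the choice of ${c}$ keeps ${P>-1}$ on the interval.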

3. Generalization to other measurements

The measurement by monomials may sound a bit unusual. However, de Castro and Gamboa show more. What really matters here is that the monomials form a so-called Chebyshev system (or Tchebyscheff system or T-system – by the way, have you ever tried to google for a T-system?). This is explained, for example, in the book “Tchebycheff Systems: With Applications in Analysis and Statistics” by Karlin and Studden. A T-system on ${I}$ is simply a set of ${n+1}$ functions ${\{u_0,\dots, u_n\}}$ such that any non-trivial linear combination of these functions has at most ${n}$ zeros. These systems are named after Tchebyscheff since they obey many of the helpful properties of the Tchebyscheff polynomials.

What is helpful in our context is the following theorem of Krein:

Theorem 2 (Krein) If ${\{u_0,\dots,u_n\}}$ is a T-system for ${I}$, ${k\leq n/2}$ and ${t_1,\dots,t_k}$ are in the interior of ${I}$, then there exists a linear combination ${\sum_{j=0}^n a_j u_j}$ which is non-negative and vanishes exactly at the points ${t_1,\dots,t_k}$.

Now consider that we replace the monomials in Theorem~1 by a T-system. You recognize that Krein’s Theorem allows us to construct a “generalized polynomial” which fulfills the same requirements as the polynomial ${P}$ in the proof of Theorem~1, as soon as the constant function ${1}$ lies in the span of the T-system, and indeed the result of Theorem~1 is also valid in that case.

4. Exact reconstruction of ${s}$-sparse nonnegative vectors from ${2s+1}$ measurements

From the above one can deduce a reconstruction result for ${s}$-sparse vectors and I quote Theorem 2.4 from Exact Reconstruction using Support Pursuit:

Theorem 3 Let ${n}$, ${m}$, ${s}$ be integers such that ${s\leq \min(n/2,m)}$ and let ${\{1,u_1,\dots,u_n\}}$ be a complete T-system on ${I}$ (that is, ${\{1,u_1,\dots,u_r\}}$ is a T-system on ${I}$ for all ${r\leq n}$). Then it holds: For any distinct reals ${t_1,\dots,t_m}$ and ${A}$ defined as

$\displaystyle A=\begin{pmatrix} 1 & \dots & 1\\ u_1(t_1)& \dots &u_1(t_m)\\ \vdots & & \vdots\\ u_n(t_1)& \dots & u_n(t_m) \end{pmatrix}$

Basis Pursuit recovers all nonnegative ${s}$-sparse vectors ${x\in{\mathbb R}^m}$.
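Here is a toy numerical illustration of this result (positions, weights and sizes are my own choices, not from the paper): ${s=2}$ nonnegative spikes among ${m=7}$ candidate positions, measured by the ${2s+1=5}$ monomials ${1,t,\dots,t^4}$ and recovered by Basis Pursuit posed as a linear program:

```python
import numpy as np
from scipy.optimize import linprog

t = np.array([-0.8, -0.5, -0.1, 0.2, 0.3, 0.6, 0.9])   # m = 7 candidate positions
x_true = np.zeros(7)
x_true[[1, 4]] = [1.0, 2.0]                             # nonnegative, s = 2 sparse

n = 4                                                    # n = 2s
A = np.vander(t, n + 1, increasing=True).T               # rows: t^0, ..., t^n
b = A @ x_true                                           # 2s + 1 = 5 measurements

# basis pursuit min ||x||_1 s.t. Ax = b as an LP with the split x = u - v
m = len(t)
res = linprog(np.ones(2 * m), A_eq=np.hstack([A, -A]), b_eq=b,
              bounds=[(0, None)] * (2 * m))
x_rec = res.x[:m] - res.x[m:]
print(np.allclose(x_rec, x_true, atol=1e-6))            # exact recovery
```

Five measurements suffice here, although the system ${Ax=b}$ has seven unknowns.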

5. Concluding remarks

Note that Theorem~3 gives a deterministic construction of a measurement matrix.

Also note that nonnegativity is crucial in what we did here. This allowed us (in the monomial case) to work with squares and obtain the polynomial ${P}$ in the proof of Theorem~1 (which is also called a “dual certificate” in this context). This raises the question of how this method can be adapted to general sparse signals. One needs (in the monomial case) a polynomial which is bounded by 1 in absolute value but matches the signs of the measure on its support. While this can be done (I think) for polynomials, it seems difficult to obtain a generalization of Krein’s Theorem to this case…

This last post on uncertainty principles will be probably the hardest one for me. As said in my first post, I supervised a Master’s thesis and posed the very vague question

“Why are the uncertainty principles for the windowed Fourier transform and the wavelet transform so different?”

I had different things in mind:

• The windowed Fourier transform can be generalized to arbitrary dimensions easily. Especially, the underlying Weyl-Heisenberg group can be generalized to arbitrary dimensions. Interestingly, the uncertainty principle carries over almost exactly: For the windowed Fourier transform in ${d}$ dimensions, the uncertainty principle reads as

$\displaystyle \text{Var}_g\cdot\text{Var}_{\hat g}\geq \tfrac{d^2}{4}$

and again, this inequality is sharp for the multivariate Gaussians. A generalization of the wavelet transform is by no means canonical. The spirit in one dimension was to use translation and scaling. However, in higher dimensions there are a lot more geometric transformations you can apply: rotations, anisotropic scalings and shearings. Here one has to identify a suitable group of actions and try to carry all things over. The most naive way, which uses isotropic scaling and rotation, does lead to uncertainty relations but no function will make these inequalities sharp…

• The lower bound in the Heisenberg uncertainty principle is fixed (for normed ${g}$). However, the lower bound in the affine uncertainty principle (equation (1) in my previous post) is not fixed (for normed ${f}$). Indeed ${|\langle f',f\rangle|}$ can be arbitrarily small. Hence, a function which makes the inequality sharp may not lead to the minimum product of the corresponding operator variances. For other wavelet-like transformations (i.e. ones that include some kind of scaling) the same holds.
• The Heisenberg uncertainty principle has a clear and crisp interpretation involving the product of the variances for a function and its Fourier transform. There is no such thing available for the affine uncertainty principle. (In fact, this question was not addressed in the thesis but in the paper “Do uncertainty minimizers attain minimal uncertainty” and the Diploma thesis by Bastian Kanning).

The outcome was the (German) thesis Unschärferelationen für unitäre Gruppendarstellungen (Uncertainty relations for unitary group representations) by Melanie Rehders. As the question is so vague, there could not be one simple answer, but as a result of the thesis, one could say in a nutshell:

“The uncertainty principles are so different because the groups underlying wavelet-like transforms are semidirect products of a matrix group and ${\mathbb{R}^d}$ and hence, the identity can not be an infinitesimal generator and hence, not be a commutator”.

In this post I’ll face the challenge to give some meaning to this sentence.

1. The abstract structure behind

Let me introduce the players in a diagram which I redraw from the thesis:

As you see, we need several algebraic structures (as well as analytical ones).

2. From group representations to integral transforms

First, we need a locally compact group ${G}$, and naturally, this comes with a left invariant measure ${\mu}$, which is called the Haar measure. With this tool we can integrate complex-valued functions defined on the group, ${\int_G f(x)d\mu}$, and we may also form the spaces ${L^p(G)}$.

Having the space ${L^2(G)}$, we can define a special representation of the group (remember that a group representation is a description of the group in terms of linear transformations of a vector space, in other words, a group homomorphism from ${G}$ to the group ${GL(V)}$ of invertible linear mappings on some vector space ${V}$). The special representation we use is the so-called left regular representation, mapping into the space of unitary operators on ${L^2(G)}$ (denoted by ${\mathcal{U}(L^2(G))}$). This representation is the mapping ${\pi:G\rightarrow \mathcal{U}(L^2(G))}$ defined by

$\displaystyle \pi(a) f(x) = f(a^{-1}x).$

One easily checks that this is a homomorphism, and the unitarity follows from the left invariance of the Haar measure. One could say that the group ${G}$ acts on the functions ${f}$ in ${L^2(G)}$ in a unitary way. We may now define an integral transform as follows: For ${\psi\in L^2(G)}$ define

$\displaystyle V_\psi f(a) = \langle f,\pi(a)\psi\rangle_{L^2(G)}. \ \ \ \ \ (1)$

Comparing with the previous two posts, you see that this gives precisely the windowed Fourier transform (for the Weyl-Heisenberg group) and the wavelet transform (for the affine group).
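A finite toy analogue may help (my own illustration, not from the post): for the cyclic group ${{\mathbb Z}/n{\mathbb Z}}$ with counting measure as Haar measure, the left regular representation ${\pi(a)f(x) = f(x-a)}$ acts by permutation matrices, and both the homomorphism property and unitarity can be checked directly:

```python
import numpy as np

n = 5                                   # the cyclic group Z/5Z as a finite stand-in

def pi(a):
    # left regular representation on l^2(Z/nZ): (pi(a) f)(x) = f(x - a mod n)
    M = np.zeros((n, n))
    for xidx in range(n):
        M[xidx, (xidx - a) % n] = 1.0
    return M

# homomorphism: pi(a) pi(b) = pi(a + b mod n)
print(np.array_equal(pi(2) @ pi(4), pi((2 + 4) % n)))
# unitarity: pi(a)^T pi(a) = I (real permutation matrices are orthogonal)
print(np.array_equal(pi(3).T @ pi(3), np.eye(n)))
```

Both checks print True; the permutation matrices are exactly the unitary operators induced by left translation, with the counting measure playing the role of the Haar measure.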

To have convenient properties for the integral transform one needs some more conditions:

1. Irreducibility, i.e. that the only subspaces of ${L^2(G)}$ which are invariant under every ${\pi(a)}$ are ${\{0\}}$ and ${L^2(G)}$.
2. Square integrability, i.e. that there exists a non-zero function ${\psi\in L^2(G)}$ such that

$\displaystyle \int_G |\langle \psi,\pi(a)\psi\rangle|^2 d\mu < \infty;$

these functions ${\psi}$ are called admissible.

We have the following theorem by Grossmann, Morlet and Paul: Let ${\pi}$ be a unitary, irreducible, and square integrable representation of a locally compact group ${G}$ on ${L^2(G)}$ and let ${\psi}$ be admissible. Then the mapping ${V_\psi}$ defined in (1) is a multiple of an isometry. Especially, ${V_\psi}$ has a left-inverse which is (up to a constant) given by its adjoint.

This somehow clarifies the arrow from “group representation” to “integral transform”.

3. From group representations to Lie algebra representations

For a closed linear group ${G}$, i.e. a closed subgroup of ${GL(d,\mathbb{R})}$, one has the associated Lie-Algebra ${\mathfrak{g}}$ defined with the help of the matrix exponential by

$\displaystyle \mathfrak{g} = \{ X\ |\ \text{for all}\ t:\ \exp(tX)\in G\}.$

The corresponding Lie-bracket is the commutator:

$\displaystyle [X,Y] = XY-YX.$

If we now have a representation of our group ${G}$ on some Hilbert space ${H}$ (you may think of ${H = L^2(G)}$ but here we may have any Hilbert space), we may ask if there is an associated representation of the Lie-Algebra ${\mathfrak{g}}$. Indeed there is one which is called the derived representation. To formulate this representation we need the following subspace of ${H}$:

$\displaystyle H_\pi^\infty = \{f\in H\ |\ a\mapsto \pi(a)f\ \text{is a}\ C^\infty\ \text{mapping}\}.$

Theorem 1 Let ${\pi}$ be a representation of a closed linear group ${G}$ in a Hilbert space ${H}$. The mapping ${d\pi}$ defined by

$\displaystyle d\pi(X)f = \lim_{t\rightarrow 0}\frac{\pi(\exp(tX))f - f}{t}$

is a representation of the Lie-Algebra ${\mathfrak{g}}$ on the space ${H_\pi^\infty}$.

This clarifies the arrow from “group representations” to “Lie algebra representations”.

4. Lie-algebra representations and uncertainty relations

We are now ready to illustrate the abstract path from Lie-algebra representations to uncertainty relations. This path uses the so called infinitesimal generators:

Definition 2 Let ${G}$ be a closed linear group with Lie algebra ${\mathfrak{g}}$ and let ${{X_1,\dots,X_m}}$ be a basis of ${\mathfrak{g}}$. Let ${\pi}$ be a representation of ${G}$ on a complex Hilbert space ${H}$ and let the derived representation ${d\pi}$ be injective. Then, the operators ${T_j= \mathrm{i} d\pi(X_j)}$ are called the infinitesimal generators of ${G}$ with respect to the representation ${\pi}$.

These infinitesimal generators are always self-adjoint. Hence, we may apply Robertson’s uncertainty principle for every two infinitesimal generators for which the commutator does not vanish.

The abstract way, described in the Sections 2, 3 and 4 is precisely how we have derived the Heisenberg uncertainty principle and the affine uncertainty principle in the two previous posts. But now the question remains: Why are they so different?

The so-called commutator tables of the Lie-algebras shed some light on this:

Example 1 (The Heisenberg algebra) The associated Lie algebra to the Weyl-Heisenberg group is the real vector space ${{\mathbb R}^2 \times \mathrm{i} {\mathbb R}}$ with the Lie bracket

$\displaystyle [(\omega,b,\mathrm{i}\phi),(\omega',b',\mathrm{i}\phi')] = (0,0,\mathrm{i}(\omega b' - \omega' b)).$

A basis of this Lie algebra is ${(1,0,0)}$, ${(0,1,0)}$, ${(0,0,\mathrm{i})}$ and the three commutators are

$\displaystyle [(1,0,0),(0,1,0)] = (0,0,\mathrm{i})$

$\displaystyle [(1,0,0),(0,0,\mathrm{i})]=(0,0,0)$

$\displaystyle [(0,1,0),(0,0,\mathrm{i})] = (0,0,0).$

Two facts are important: There is an element which commutes with every other element. In other words: The center of the algebra is one-dimensional and spanned by one of the basis elements. If we remember the three infinitesimal generators ${\mathrm{i} T_\omega}$, ${\mathrm{i} T_b}$ and ${\mathrm{i} T_\tau}$ for the windowed Fourier transform, we observe that they obey the same commutator relations (which is not a surprise…).
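The same commutator pattern can be checked in the standard realization of the Heisenberg algebra by strictly upper triangular ${3\times 3}$ matrices (an isomorphic model, used here only for verification): the bracket of the two non-central generators is the central element, which commutes with everything.

```python
import numpy as np

P = np.array([[0, 1, 0], [0, 0, 0], [0, 0, 0]])   # plays the role of the omega-direction
Q = np.array([[0, 0, 0], [0, 0, 1], [0, 0, 0]])   # plays the role of the b-direction
Z = np.array([[0, 0, 1], [0, 0, 0], [0, 0, 0]])   # the central direction

def bracket(X, Y):
    return X @ Y - Y @ X

print(np.array_equal(bracket(P, Q), Z))                    # [P, Q] = Z
print(np.array_equal(bracket(P, Z), np.zeros((3, 3))))     # Z is central
print(np.array_equal(bracket(Q, Z), np.zeros((3, 3))))
```

All three checks print True, mirroring the commutator table above: one basis element spans the center and arises as a commutator of the other two.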

Example 2 (The “affine Lie algebra”) The Lie algebra of the affine group ${({\mathbb R}\setminus\{0\})\times {\mathbb R}}$ (with composition ${(a,b)(a',b') = (aa',ab'+b)}$) is ${{\mathbb R}\times {\mathbb R}}$ with Lie bracket

$\displaystyle [(x,y),(x',y')] = (0,xy'-x'y).$

A basis of the Lie algebra is ${(1,0)}$, ${(0,1)}$ and the commutator is

$\displaystyle [(1,0),(0,1)] = (0,1).$

Here, there is no element which commutes with everything, i.e. the center of the Lie algebra is trivial. Of course, the commutator relation resembles the one for the infinitesimal generators ${\mathrm{i} T_a}$ and ${\mathrm{i} T_b}$ for the wavelet transform.
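This Lie algebra also has a simple matrix model (my own illustrative choice, not from the post): realizing the scaling direction ${(1,0)}$ and the translation direction ${(0,1)}$ as ${2\times 2}$ matrices recovers the commutator ${[(1,0),(0,1)] = (0,1)}$.

```python
import numpy as np

X = np.array([[1, 0], [0, 0]])   # scaling direction, corresponds to (1, 0)
Y = np.array([[0, 1], [0, 0]])   # translation direction, corresponds to (0, 1)

comm = X @ Y - Y @ X
print(np.array_equal(comm, Y))   # [X, Y] = Y, i.e. [(1,0),(0,1)] = (0,1)
```

In particular, the commutator lands back in the translation direction and never in the identity matrix, in line with the trivial center.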

5. Higher dimensional wavelets

Wavelets in higher dimensions are a bit tricky. If one thinks of groups acting on ${{\mathbb R}^d}$ which consist of translations and something like dilations, one observes that one basically deals with semidirect products of a subgroup ${D}$ of ${GL(d,{\mathbb R})}$ and ${{\mathbb R}^d}$: For ${A\in D}$ and ${b\in{\mathbb R}^d}$ one may transform a function ${f:{\mathbb R}^d\rightarrow{\mathbb C}}$ as

$\displaystyle \pi(A,b)f(x) = |\det(A)|^{-1/2}f(A^{-1}(x-b)). \ \ \ \ \ (2)$

Indeed this is the so-called quasiregular representation of the semidirect product of ${D}$ and ${{\mathbb R}^d}$. Two important examples of 2-dimensional wavelet-like transformations are:

Example 3 The “standard” 2-dimensional wavelet transform. One takes the group

$\displaystyle D = \left\{ \begin{bmatrix} a_1 & -a_2\\ a_2 & a_1 \end{bmatrix}\ :\ (a_1,a_2)\neq (0,0) \right\}$

which is a combination of rotation and isotropic scaling. Another parametrization is:

$\displaystyle \left\{ a \begin{bmatrix} \cos(\phi) &-\sin(\phi)\\ \sin(\phi) & \cos(\phi) \end{bmatrix}\ :\ a> 0,\ \phi\in [0,2\pi{[}\right\}$

where ${a}$ is the scaling factor and ${\phi}$ is the rotation angle.

Example 4 The shearlet transform is based on the group

$\displaystyle D = \left\{ \begin{bmatrix} a & \sqrt{a}\,s\\ 0 & \sqrt{a} \end{bmatrix}\ : s\in {\mathbb R},\ a>0 \right\}$

which consists of anisotropic scaling by ${a}$ and “shear” by ${s}$.
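One can check numerically that these shearlet matrices indeed form a group: the product of two of them is again of the same form, with scale ${aa'}$ and shear ${s + \sqrt{a}\,s'}$ (this composition law is computed by hand and should be taken as a sketch):

```python
import numpy as np

def shearlet_matrix(a, s):
    # anisotropic scaling by a > 0 combined with shear by s
    return np.array([[a, np.sqrt(a) * s], [0.0, np.sqrt(a)]])

M = shearlet_matrix(2.0, 0.5) @ shearlet_matrix(3.0, -1.0)

a_new = M[0, 0]                      # the product has scale a * a' = 6
s_new = M[0, 1] / np.sqrt(a_new)     # and shear s + sqrt(a) * s'
print(np.allclose(M, shearlet_matrix(a_new, s_new)))
print(np.isclose(s_new, 0.5 + np.sqrt(2.0) * (-1.0)))
```

Both checks print True, so ${D}$ is closed under multiplication, which is what makes the semidirect product construction applicable.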

Doing some more algebra, one observes that the center of the Lie algebra associated to a semidirect product represented as in (2) is always trivial and hence, the identity never appears as a commutator. This neat observation shows that no wavelet-like transformation which is based on such a group structure can ever have an uncertainty relation which behaves like

$\displaystyle c\|f\|\leq \text{some product of variances of operators}$

as in the Heisenberg case.

Although this may not be a groundbreaking discovery, this observation and the whole underlying algebra somehow cleared my view on this issue.

1. The affine group behind the wavelet transform

Continuing my previous post on the uncertainty principle for the windowed Fourier transform, we now come to another integral transform: The wavelet transform.

In contrast to the windowed Fourier transform (which analyzes a function with respect to position and frequency) the wavelet transform analyzes a function with respect to position and scale. For a given analyzing function ${\psi}$ and a signal ${f}$, the wavelet transform is (for ${a\neq 0}$, ${b\in{\mathbb R}}$):

$\displaystyle W_\psi f(a,b) = \int_{\mathbb R} f(x) \tfrac{1}{\sqrt{|a|}}\psi(\tfrac{x-b}{a})dx.$
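This integral can be approximated by simple quadrature; the following sketch (with a Mexican hat wavelet as analyzing function, my own choice) also checks the translation covariance ${W_\psi [f(\cdot-c)](a,b) = W_\psi f(a,b-c)}$, which follows by substituting in the integral:

```python
import numpy as np

x = np.linspace(-20, 20, 40001)
dx = x[1] - x[0]

def psi(y):
    # Mexican hat wavelet, a standard real-valued analyzing function
    return (1 - y**2) * np.exp(-y**2 / 2)

def wavelet_transform(f_vals, a, b):
    # Riemann-sum approximation of W_psi f(a,b) = ∫ f(x) |a|^{-1/2} psi((x-b)/a) dx
    return np.sum(f_vals * psi((x - b) / a)) * dx / np.sqrt(abs(a))

def f(y):
    return np.exp(-y**2) * np.cos(3 * y)

c = 0.7
lhs = wavelet_transform(f(x - c), 2.0, 1.0)   # transform of the shifted signal
rhs = wavelet_transform(f(x), 2.0, 1.0 - c)   # original signal at shifted position
print(np.isclose(lhs, rhs, atol=1e-5))
```

The agreement illustrates that translating the signal just translates the position variable of the transform, a direct consequence of the group structure discussed below.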

In the same way, the windowed Fourier transform could be written as inner products of ${f}$ with a translated and modulated window function, the wavelet transform can be written as inner products of ${f}$ with translated and scaled functions ${\psi}$. And again, these modifications which happen to the analyzing function come from a group.

Definition 1 The affine group ${G_{\text{aff}}}$ is the set ${({\mathbb R}\setminus\{0\})\times {\mathbb R}}$ endowed with the operation

$\displaystyle (a,b)(a',b') = (a\,a',a\,b' + b).$

Indeed this is a group (with identity ${(1,0)}$ and inverse ${(a,b)^{-1} = (a^{-1},-a^{-1}b)}$). The name affine group stems from the fact that the group operation behaves like the composition of one dimensional affine linear functions: For ${f(x) = ax+b}$ and ${g(x) = a'\,x+b'}$ we have ${f(g(x)) = (a\,a')\, x + a\,b' + b}$.

The affine group admits a representation on the space of unitary operators on ${L^2({\mathbb R})}$:

$\displaystyle \Pi(a,b)f(x) = \tfrac{1}{\sqrt{|a|}}f(\tfrac{x-b}{a})$

(note the normalizing factor ${1/\sqrt{|a|}}$).

2. The affine uncertainty principle

I am not sure who is to be credited for the group theoretical background behind wavelets; however, the two-part paper “Transforms associated to square integrable group representations” by Grossmann, Morlet and Paul has been influential (and can be found, e.g., in the compilation “Fundamental papers in wavelet theory” by Heil and Walnut).

As done by Stephan Dahlke and Peter Maass in “The Affine Uncertainty Principle in One and Two Dimensions”, we can proceed in analogy to the windowed Fourier transform and the corresponding Weyl-Heisenberg group and compute the infinitesimal operators: Take the derivative of the representation with respect to the group parameters and evaluate at the identity:

$\displaystyle T_a f(x) := \frac{d}{da}\Pi(a,b)f(x)\Big|_{(a,b) = (1,0)} = -\tfrac{1}{2}f(x) - xf'(x)$

and

$\displaystyle T_b f(x) := \frac{d}{db}\Pi(a,b)f(x)\Big|_{(a,b) = (1,0)} = -f'(x).$

Again, these operators are skew adjoint and hence, multiplying by ${\mathrm{i}}$ gives self-adjoint operators.

These operators ${\mathrm{i} T_b}$ and ${\mathrm{i} T_a}$ do not commute and hence, applying Robertson’s uncertainty principle gives an inequality. The commutator of ${\mathrm{i} T_a}$ and ${\mathrm{i} T_b}$ is

$\displaystyle [\mathrm{i} T_a,\mathrm{i} T_b] f(x) = f'(x).$
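This commutator can be verified symbolically; here is a quick check with sympy (not part of the original derivation):

```python
import sympy as sp

x = sp.symbols('x', real=True)
f = sp.Function('f')

def T_a(g):
    # infinitesimal generator of scaling: -(1/2) g - x g'
    return -sp.Rational(1, 2) * g - x * sp.diff(g, x)

def T_b(g):
    # infinitesimal generator of translation: -g'
    return -sp.diff(g, x)

I = sp.I
comm = I * T_a(I * T_b(f(x))) - I * T_b(I * T_a(f(x)))
print(sp.simplify(comm - sp.diff(f(x), x)))   # 0: the commutator equals f'
```

The terms with ${f}$ and ${xf''}$ cancel and only ${f'}$ survives.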

Robertson’s uncertainty principle then gives

$\displaystyle \tfrac{1}{2}|\langle f',f\rangle| \leq \|(\mathrm{i} T_a - \mu_1 I)f\|\,\| (\mathrm{i} T_b - \mu_2 I)f\| \ \ \ \ \ (1)$

and with some manipulation this turns into (for ${\|f\|=1}$)

$\displaystyle \tfrac{1}{4}|\langle f',f\rangle|^2\leq (\|f'\|^2 - \mu_2^2)(\|xf'\|^2 -\tfrac{1}{4} - \mu_1^2). \ \ \ \ \ (2)$

Again, one can derive the functions for which equality is attained and these are the functions of the form

$\displaystyle f(x) = c(x-\mathrm{i} \lambda) ^{-1/2 + \mathrm{i}\lambda\mu_2+\mathrm{i}\mu_1}$

for real ${\lambda}$. (By the way, these functions are indeed wavelets and sometimes called Cauchy-wavelets because of their analogy with the Cauchy kernel from complex analysis.)

By the way: These functions are necessarily complex valued. If one restricts oneself to real valued functions there is a simpler inequality, which one may call “real valued affine uncertainty”. First, observe that ${\langle f',f\rangle = 0}$ for real valued ${f}$, and hence, the left hand side in (1) is zero (which makes the inequality a bit pointless). Using that for real valued ${f}$ we have ${\langle T_a f,f\rangle = 0}$, and that ${\|T_a f\|^2 = \|xf'\|^2 - \tfrac{1}{4}\|f\|^2}$ together with ${\|(\mathrm{i} T_b -\mu_2)f\|^2\neq 0}$ for ${f\neq 0}$ we obtain (with ${\mu_1=0}$) from (1)

$\displaystyle \tfrac{1}{2} \|f\| \leq \|xf'\|.$

Since we know that equality is only attained for the Cauchy wavelets (which are not real valued) we can state:

Corollary 2 (Real valued affine uncertainty) For any real valued function ${f}$ which is in the domain of ${[T_a,T_b]}$ it holds that

$\displaystyle \tfrac{1}{2} \|f\| < \|xf'\|.$

As some strange curiosity, one can derive this “real valued affine uncertainty principle” by formal integration by parts and the Cauchy-Schwarz inequality, totally similar to the Heisenberg uncertainty principle (as I’ve done in my previous post):

$\displaystyle \begin{array}{rcl} \|g\|_2^2 & =& \int_{\mathbb R} 1\cdot|g(x)|^2dx\\ & = &-\int_{\mathbb R} x\tfrac{d}{dx}|g(x)|^2dx\\ & = &-2\int_{\mathbb R} xg(x)g'(x)dx\\ & \leq &2\int_{\mathbb R} |xg'(x)|\,|g(x)|dx\\ & \leq &2\Big(\int_{\mathbb R} |xg'(x)|^2dx\Big)^{1/2}\Big(\int_{\mathbb R} |g(x)|^2dx\Big)^{1/2}. \end{array}$

Dividing by ${2\|g\|}$ gives the “real valued affine uncertainty” (but only in the non-strict way).
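A quick numerical sanity check of the strict inequality for a concrete real-valued function (a Gaussian, my own pick; the derivative is inserted analytically):

```python
import numpy as np

x = np.linspace(-20, 20, 400001)
dx = x[1] - x[0]

f = np.exp(-x**2 / 2)                       # a real-valued test function
fprime = -x * np.exp(-x**2 / 2)             # its derivative, computed analytically

norm_f = np.sqrt(np.sum(f**2) * dx)         # quadrature for ||f||
norm_xfprime = np.sqrt(np.sum((x * fprime)**2) * dx)   # quadrature for ||x f'||

print(0.5 * norm_f < norm_xfprime)          # the strict inequality holds
print(round(norm_xfprime / norm_f, 3))      # ratio sqrt(3)/2, about 0.866
```

For this Gaussian the ratio ${\|xf'\|/\|f\|}$ is ${\sqrt{3}/2\approx 0.866}$, comfortably above the bound ${1/2}$, consistent with the corollary's strictness for real-valued functions.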

Some years ago I became fascinated by uncertainty principles. I got to know them via signal processing and not via physics, although, from a mathematical point of view they are the same.

I recently supervised a Master’s thesis on this topic and the results clarified a few things for me which I used to find obscure, and I’d like to illustrate this here on my blog. However, it takes some space to introduce notation and to explain what it’s all about and hence, I decided to write a short series of posts in which I try to explain what new insights I got from the thesis. Here comes the first post:

1. The Fourier transform and the windowed Fourier transform

Let’s start with an important tool from signal processing you all know: The Fourier transform. For ${f:{\mathbb R}\rightarrow{\mathbb C}}$ the Fourier transform is

$\displaystyle \hat f(\omega) = (2\pi)^{-1/2}\int_{\mathbb R} f(x) \mathrm{e}^{-\mathrm{i} x\omega} dx.$

(I was tempted to say “whenever the integral is defined”. However, the details here are a little bit more involved and I will not go into them here; ${\hat f}$ is defined for ${L^2}$-functions, for ${L^1}$-functions and even for tempered distributions…) Roughly speaking, the Fourier transform decomposes a signal into its frequency components, which can be seen from the Fourier inversion formula:

$\displaystyle f(x) = (2\pi)^{-1/2}\int_{\mathbb R} \hat f(\omega) \mathrm{e}^{\mathrm{i} x\omega} d\omega,$

i.e. the (complex) number ${\hat f(\omega)}$ says “how much the frequency ${\omega}$ (i.e. the function ${x\mapsto \mathrm{e}^{\mathrm{i} x\omega}}$) contributes to ${f}$”. In the context of signal processing one often speaks of the “time representation” ${f}$ and the “frequency representation” ${\hat f}$.

One drawback of the Fourier transform, when used to analyze signals, is its “global” nature in that the value ${\hat f(\omega)}$ depends on every value of ${f}$, i.e. a change of ${f}$ in a small interval results in a change of all of ${\hat f}$. A natural idea (which is usually attributed to Gabor) is to introduce a window function ${g}$, which is supposed to be a bump function centered at zero, then translate this function and “localize” ${f}$ by multiplying it with ${g(\cdot-t)}$. The resulting transform

$\displaystyle G_gf(\omega,t) = (2\pi)^{-1/2}\int_{\mathbb R} f(x)g(x-t)\mathrm{e}^{-ix\omega}dx$

is called windowed Fourier transform, short-time Fourier transform or (in the case of ${g(x) = \mathrm{e}^{-x^2/2}}$) Gabor transform.
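A small quadrature sketch of the Gabor transform (the signal, a Gaussian wave packet at frequency 5, is my own choice): its magnitude at ${t=0}$ peaks at the packet’s frequency, illustrating the frequency localization:

```python
import numpy as np

x = np.linspace(-30, 30, 60001)
dx = x[1] - x[0]

# Gaussian window g(x) = exp(-x^2/2), i.e. the Gabor transform case
f = np.exp(1j * 5 * x) * np.exp(-x**2 / 8)   # wave packet at frequency 5

def gabor(f_vals, omega, t):
    # quadrature for (2 pi)^{-1/2} ∫ f(x) g(x - t) e^{-i x omega} dx
    window = np.exp(-(x - t)**2 / 2)
    return np.sum(f_vals * window * np.exp(-1j * x * omega)) * dx / np.sqrt(2 * np.pi)

omegas = np.linspace(0, 10, 201)
spectrum = np.abs(np.array([gabor(f, w, 0.0) for w in omegas]))
print(omegas[np.argmax(spectrum)])           # the maximum sits at frequency 5
```

The product of the two Gaussians is again a Gaussian, so the magnitude of the transform in ${\omega}$ is a Gaussian bump centered exactly at the packet frequency.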

Of course we can write the windowed Fourier transform in terms of the usual Fourier transform as

$\displaystyle G_gf(\omega,t) = \widehat{(f\,g(\cdot-t))}(\omega). \ \ \ \ \ (1)$

In other words: The localization in time is precisely determined by the “locality” of ${g}$, that is, how well ${g}$ is concentrated around zero. The better ${g}$ is concentrated around ${0}$, the more “localized around ${t}$” is the information of ${f}$, the windowed Fourier transform ${G_gf(\omega,t)}$ uses.

For the localization in frequency one obtains (by Plancherel’s formula and integral substitution) that

$\displaystyle G_gf(\omega,t) = \mathrm{e}^{-\mathrm{i} t\omega}\widehat{(\hat f\,\hat g(\cdot-\omega))}(-t).$

In other words: The localization in frequency is precisely determined by the “locality” of ${\hat g}$, that is, how well ${\hat g}$ is concentrated around zero. The better ${\hat g}$ is concentrated around ${0}$, the more “localized around ${\omega}$” is the information of ${\hat f}$, the windowed Fourier transform ${G_gf(\omega,t)}$ uses.

Hence, it seems clear that a function ${g}$ is well suited as a window function if it is well localized in both time and frequency. If one measures the localization of a function around zero by its variance

$\displaystyle \text{Var}_g = \int_{\mathbb R} x^2|g(x)|^2 dx$,

then there is the fundamental lower bound on the product of the variance of a function and the variance of its Fourier transform, known under the name “Heisenberg uncertainty principle” (or, as I learned from Wikipedia, “Gabor limit”): For ${\|g\|_{L^2}=1}$ it holds that

$\displaystyle \text{Var}_g\cdot\text{Var}_{\hat g}\geq \tfrac14.$

Proof: A simple (not totally rigorous) proof for real-valued ${g}$ goes like this: We use partial integration, the Cauchy-Schwarz inequality and the Plancherel formula:

$\displaystyle \begin{array}{rcl} \|g\|_2^2 & =& \int_{\mathbb R} 1\cdot|g(x)|^2dx\\ & = &-\int_{\mathbb R} x\tfrac{d}{dx}|g(x)|^2dx\\ & = &-2\int_{\mathbb R} xg(x)g'(x)dx\\ & \leq &2\int_{\mathbb R} |xg(x)|\,|g'(x)|dx\\ & \leq &2\Big(\int_{\mathbb R} |xg(x)|^2dx\Big)^{1/2}\Big(\int_{\mathbb R} |g'(x)|^2dx\Big)^{1/2}\\ & = &2\Big(\int_{\mathbb R} |xg(x)|^2dx\Big)^{1/2}\Big(\int_{\mathbb R} |\omega \hat g(\omega)|^2d\omega\Big)^{1/2}\\ & = &2(\text{Var}_g)^{1/2}(\text{Var}_{\hat g})^{1/2}. \end{array}$

$\Box$

Moreover, the inequality is sharp for the functions ${g(x) = C\mathrm{e}^{-\lambda x^2}}$ for ${\lambda>0}$. In this sense, these Gaussians are best suited for the windowed Fourier transform.
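One can check this numerically. The following sketch (my own, with an assumed grid and FFT normalization) approximates both variances; for the Gaussian the product is essentially ${1/4}$, for other windows it is strictly larger:

```python
import numpy as np

def variance_product(gvals, x):
    """Numerically approximate Var_g * Var_ghat for samples gvals of g on the
    uniform grid x, after normalizing g to unit L2 norm."""
    dx = x[1] - x[0]
    g = gvals / np.sqrt(np.sum(np.abs(gvals)**2) * dx)   # enforce ||g||_2 = 1
    var_g = np.sum(x**2 * np.abs(g)**2) * dx
    # |ghat(omega_k)|^2 on the FFT frequency grid; the linear phase factor
    # coming from the grid offset drops out under the absolute value
    omega = 2*np.pi*np.fft.fftfreq(len(x), d=dx)
    ghat_sq = (dx**2 / (2*np.pi)) * np.abs(np.fft.fft(g))**2
    domega = 2*np.pi / (len(x)*dx)
    var_ghat = np.sum(omega**2 * ghat_sq) * domega
    return var_g * var_ghat

x = np.linspace(-30, 30, 4096)
p_gauss = variance_product(np.exp(-x**2/2), x)   # equality case, ~ 1/4
p_other = variance_product(1/(1 + x**2), x)      # strictly above 1/4
```

For the second window one gets a product of roughly ${1/2}$, comfortably above the bound.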

While this presentation was geared towards usability, there is a quite different approach to uncertainty principles related to integral transforms which uses the underlying group structure.

2. The group behind the windowed Fourier transform

The windowed Fourier transform (1) can also be seen as taking inner products of ${f}$ with the family of functions ${x\mapsto \mathrm{e}^{\mathrm{i}\omega x}g(x-t)}$ (for real-valued ${g}$). This family is obtained from the single function ${g}$ by letting the so-called Weyl-Heisenberg group act on it:

Definition 1 The Weyl-Heisenberg group ${G_{WH}}$ is the set ${{\mathbb R}\times{\mathbb R}\times S^1}$ endowed with the operation

$\displaystyle (\omega,b,\tau)(\omega',b',\tau') = (\omega+\omega',b+b',\tau\tau'\mathrm{e}^{\mathrm{i} (\omega b'-\omega' b)/2}).$
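A quick numerical sanity check (a sketch, nothing more) that this operation is indeed associative and that ${(0,0,1)}$ is the identity:

```python
import numpy as np

def wh_mul(p, q):
    """Composition in the Weyl-Heisenberg group R x R x S^1."""
    (w, b, t), (w2, b2, t2) = p, q
    return (w + w2, b + b2, t * t2 * np.exp(1j * (w*b2 - w2*b) / 2))

rng = np.random.default_rng(0)

def rand_elem():
    w, b = rng.normal(size=2)
    return (w, b, np.exp(1j * rng.uniform(0, 2*np.pi)))

p, q, r = rand_elem(), rand_elem(), rand_elem()
lhs = wh_mul(wh_mul(p, q), r)   # (pq)r
rhs = wh_mul(p, wh_mul(q, r))   # p(qr)
```

The two phase factors picked up along the two ways of composing agree, which is exactly the cocycle condition hidden in the exponent.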

The Weyl-Heisenberg group admits a representation in the space of unitary operators on ${L^2({\mathbb R})}$, that is, a map ${\Pi:G_{WH}\rightarrow U(L^2({\mathbb R}))}$

$\displaystyle \Pi(\omega,b,\tau)f(x) = \tau\mathrm{e}^{-\mathrm{i} \omega b/2}\mathrm{e}^{\mathrm{i}\omega x}f(x-b).$

Indeed, the operators ${\Pi(\omega,b,\tau)}$ are unitary and it holds that

$\displaystyle \Pi(\omega,b,\tau)\Pi(\omega',b',\tau')f(x) = \Pi((\omega,b,\tau)(\omega',b',\tau'))f(x).$

Moreover, the mapping ${(\omega,b,\tau)\mapsto \Pi(\omega,b,\tau)f}$ is continuous for all ${f}$, a property of the representation which is called strong continuity.

In this light, the windowed Fourier transform can be written as

$\displaystyle G_g f(\omega,t) = (2\pi)^{-1/2} \langle f,\Pi(\omega,t,\mathrm{e}^{\mathrm{i}\omega t/2})g\rangle.$

Now there is a motivation for the uncertainty principle as follows: Associated to the Weyl-Heisenberg group there is the Weyl-Heisenberg algebra, a basis of which is given by the so called infinitesimal generators. These are, roughly speaking, the derivatives of the representation with respect to the group parameters, evaluated at the identity. In the Weyl-Heisenberg case:

$\displaystyle T_\omega f(x) := \frac{d}{d\omega}\big[\Pi(\omega,b,\tau)f(x)\big]\Big|_{(\omega,b,\tau) = (0,0,1)} = \mathrm{i} xf(x)$

and

$\displaystyle T_b f(x) := \frac{d}{db}\big[\Pi(\omega,b,\tau)f(x)\big]\Big|_{(\omega,b,\tau) = (0,0,1)} = -f'(x)$

and

$\displaystyle T_\tau f(x) := \frac{d}{d\tau}\big[\Pi(\omega,b,\tau)f(x)\big]\Big|_{(\omega,b,\tau) = (0,0,1)} = \mathrm{i} f(x).$

(In the last case, my notation was not too good: Note that ${\tau = \mathrm{e}^{\mathrm{i} t}}$ with ${t\in[0,2\pi[}$ and the derivative has to be taken with respect to ${t}$.)

All these operators are skew adjoint on ${L^2({\mathbb R})}$ and hence the operators

$\displaystyle \mathrm{i} T_\omega f(x) = -xf(x),\quad \mathrm{i} T_b f(x) = -\mathrm{i} f'(x),\quad \mathrm{i} T_\tau f(x) = -f(x)$

are self adjoint.

For any two (possibly unbounded) self adjoint operators on a Hilbert space there is a kind of abstract uncertainty principle (apparently sometimes known as Robertson’s uncertainty principle). It uses the commutator ${[A,B] = AB-BA}$ of two operators:

Theorem 2 For any two self adjoint operators ${A}$ and ${B}$ on a Hilbert space, any ${f}$ in the domain of definition of ${[A,B]}$ and any real numbers ${\mu_1}$ and ${\mu_2}$ it holds that

$\displaystyle \tfrac12|\langle [A,B] f,f\rangle|\leq \|(A-\mu_1 I)f\|\,\|(B-\mu_2 I)f\|.$

Proof: The proof simply consists of noting that

$\displaystyle \begin{array}{rcl} |\langle [A,B] f,f\rangle| &=& |\langle B f,Af\rangle -\langle Af,B f\rangle|\\ & =& |\langle (B-\mu_2 I) f,(A-\mu_1 I)f\rangle -\langle (A-\mu_1 I)f,(B-\mu_2 I) f\rangle|\\ & =& |2\text{Im} \langle (B-\mu_2 I) f,(A-\mu_1 I)f\rangle|\\ & \leq &2|\langle (B-\mu_2 I) f,(A-\mu_1 I)f\rangle|. \end{array}$

Now use Cauchy-Schwarz to obtain the result. $\Box$
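In finite dimensions (where every Hermitian matrix qualifies and domain questions disappear) the inequality is easy to test; here a small sketch of my own with random Hermitian matrices:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8

def rand_hermitian(n):
    """A random complex Hermitian (hence self adjoint) n x n matrix."""
    M = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
    return (M + M.conj().T) / 2

A, B = rand_hermitian(n), rand_hermitian(n)
f = rng.normal(size=n) + 1j * rng.normal(size=n)
mu1, mu2 = 0.3, -0.7

comm = A @ B - B @ A                              # the commutator [A, B]
lhs = 0.5 * abs(np.vdot(comm @ f, f))
rhs = (np.linalg.norm((A - mu1*np.eye(n)) @ f)
       * np.linalg.norm((B - mu2*np.eye(n)) @ f))
```

Note that the commutator of two Hermitian matrices is skew-Hermitian, which is why ${\langle [A,B]f,f\rangle}$ is purely imaginary in the proof above.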

Looking closer at the inequalities in the proof, one infers in which cases Robertson’s uncertainty principle is sharp: Precisely if there is a real ${\lambda}$ such that

$\displaystyle (A-\mu_1 I)f = -\mathrm{i}\lambda (B-\mu_2 I)f. \ \ \ \ \ (2)$

Now the three self-adjoint operators ${\mathrm{i} T_\omega}$, ${\mathrm{i} T_b}$ and ${\mathrm{i} T_\tau}$ give rise to three commutators, but since ${\mathrm{i} T_\tau}$ is a multiple of the identity, it commutes with the others. I.e. there is only one nontrivial commutator relation:

$\displaystyle [\mathrm{i} T_b,\mathrm{i} T_\omega] f = \mathrm{i} f.$

Hence, using ${A=\mathrm{i} T_b}$, ${B = \mathrm{i} T_\omega}$ and ${\mu_1=\mu_2=0}$ in Robertson’s uncertainty principle gives (in sloppy notation)

$\displaystyle \tfrac12 \|f\|^2\leq \|f'\|_2\|xf\|_2$

which is exactly the Heisenberg uncertainty principle.

Moreover, by (2), equality happens if ${f}$ fulfills the differential equation

$\displaystyle -\mathrm{i} f'(x) = \mathrm{i}\lambda x f(x)$

the solutions of which are exactly the functions ${f(x) = C\mathrm{e}^{-\lambda x^2/2}}$.

Since the Heisenberg uncertainty principle is such an impressive thing with a broad range of implications (a colleague said that its interpretation, consequences and motivation somehow form a kind of “holy grail” in some communities), one may try to find other cool uncertainty principles by generalizing the approach of the previous sections to other transformations.

In the next post I am going to write about other group-related integral transforms and their “uncertainty principles”.

L1TestPack has just been updated to version 1.1. With the help of Andreas Tillmann I enhanced this small gadget for issues related to ${\ell^1}$ minimization. New functions are

• Routines to directly calculate a source element for a given matrix ${A}$ and a vector ${x^\dagger}$, that is, calculate a vector ${y}$ such that

$\displaystyle A^* y \in\partial\|x^\dagger\|_1.$

The existence of such a vector ${y}$ ensures that the minimization problem (the Basis Pursuit problem)

$\displaystyle \min_x \|x\|_1\ \text{ s.t. }\ Ax = Ax^\dagger$

has the unique solution ${x^\dagger}$ (in other words: ${x^\dagger}$ is recovered exactly). This is particularly helpful if you are interested in unique solutions for Basis Pursuit without posing strong conditions which even imply ${\ell^0}$-${\ell^1}$-equivalence.

• Routines related to RIP constants, the ERC coefficient of Joel Tropp and the mutual coherence.
• An implementation of the heuristic support evaluation HSE (also described in my previous post). (By the way: We were tempted to call this device “support evaluation routine” with acronym SuppER but abandoned this idea.)
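The source element computation from the first bullet can be sketched as follows (my own Python toy version; L1TestPack itself is a Matlab package and its actual routines may well differ):

```python
import numpy as np

def source_element(A, x, tol=1e-10):
    """Try to find y with A^T y in the subdifferential of ||.||_1 at x, i.e.
    (A^T y)_i = sgn(x_i) on the support of x and |(A^T y)_i| <= 1 elsewhere.
    Solves the on-support equations by least squares and checks the rest."""
    S = np.abs(x) > tol
    y, *_ = np.linalg.lstsq(A[:, S].T, np.sign(x[S]), rcond=None)
    z = A.T @ y
    ok = np.allclose(z[S], np.sign(x[S])) and np.all(np.abs(z[~S]) <= 1 + tol)
    return y if ok else None

# toy example where a source element clearly exists: the support columns
# are orthonormal and the remaining columns are small
rng = np.random.default_rng(2)
A = np.hstack([np.eye(4), 0.1 * rng.normal(size=(4, 6))])
x_dagger = np.zeros(10); x_dagger[0], x_dagger[1] = 1.5, -2.0
y = source_element(A, x_dagger)
```

For a general matrix the simple least-squares candidate may of course fail the off-support check, in which case the routine reports failure rather than inventing a certificate.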

Recently Andreas Tillmann presented the poster “An Infeasible-Point Subgradient Algorithm and a Computational Solver Comparison for l1-Minimization” at SPARS11. This poster summarized some results of the project SPEAR on sparse exact and approximate recovery of Marc Pfetsch and myself. We used this as an opportunity to release a draft of the accompanying paper with the same title. Although this draft is not totally ready to be submitted yet, I already summarize its content here.

In this paper we considered the Basis Pursuit problem (beware: the linked Wikipedia page is a stub at this time) from a purely optimization point of view. The Basis Pursuit problem is: For a given matrix ${A\in{\mathbb R}^{m\times n}}$ (with ${m<n}$) and a vector ${b\in{\mathbb R}^m}$, find the solution to

$\displaystyle \min_{x} \|x\|_1\quad\text{s.t.}\quad Ax = b. \ \ \ \ \ (1)$

Hence, we mainly neglected all its interesting features of reproducing the sparsest solution of an underdetermined linear system and solely concentrated on its solution as an optimization problem.
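From this point of view, (1) is equivalent to a linear program (split ${x = u - v}$ with ${u,v\geq 0}$), so one can get a feeling for it with any LP solver. A sketch using scipy (which is, to be clear, not one of the solvers compared in the paper):

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit_lp(A, b):
    """Solve min ||x||_1 s.t. Ax = b as the LP
    min 1^T (u + v)  s.t.  A(u - v) = b,  u, v >= 0,  with x = u - v."""
    m, n = A.shape
    c = np.ones(2 * n)
    A_eq = np.hstack([A, -A])
    res = linprog(c, A_eq=A_eq, b_eq=b,
                  bounds=[(0, None)] * (2 * n), method="highs")
    return res.x[:n] - res.x[n:]

# a sparse vector is typically recovered exactly in this regime
rng = np.random.default_rng(3)
A = rng.normal(size=(20, 50))
x_true = np.zeros(50); x_true[[3, 17, 41]] = [2.0, -1.0, 0.5]
b = A @ x_true
x = basis_pursuit_lp(A, b)
```

At an LP optimum ${u_iv_i=0}$ holds, so the objective really is ${\|x\|_1}$.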

The paper has three somehow separated contributions:

• The new algorithm ISAL1: The problem (1) is a convex nonsmooth constrained optimization problem. Marc and Andreas are optimizers and they wondered how the most basic method for this class of problems would perform: the projected subgradient method. For solving

$\displaystyle \min_x f(x)\quad\text{s.t.}\quad x\in C$

take steps along some negative subgradient and project back to ${C}$: ${x^{k+1} = P_C(x^k - \alpha_k h^k)}$. For (1) subgradients are readily available, e.g. ${h^k = \text{sgn}(x^k)}$ (taken coordinate-wise). However, projecting onto the constraint ${Ax=b}$ is not too easy. Denoting the projection simply by ${P}$, we can give a closed form expression (assuming that ${A}$ has full rank) as

$\displaystyle P(z) = (I - A^T (AA^T)^{-1} A) z + A^T(AA^T)^{-1}b,$

this has the drawback that one needs to explicitly invert a matrix (which, however, is just ${m\times m}$ and hence usually not too large, since we assume ${m<n}$). However, we proposed to replace the exact projection by an approximate one: In each step we solve for the projection by a truncated conjugate gradient method. While we expected that one should increase the accuracy of the approximate projection by increasing the number of CG steps during the iteration, surprisingly that is not true: Throughout the iteration, a fixed small number of CG steps (say ${5}$ for matrices of size ${1000\times 4000}$, but mainly independent of the size) suffices to obtain convergence (and especially feasibility of the iterates). In the paper we give a proof of convergence of the method under several assumptions on the step sizes and projection accuracies, building on our previous paper in which we analyzed this method in the general case. Moreover, we describe several ways to speed up and stabilize the subgradient method. Finally, we called this method “Infeasible point subgradient algorithm for ${\ell^1}$”: ISAL1. A Matlab implementation can be found on the SPEAR website.
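A stripped-down version of this scheme looks as follows (my own sketch, not the released ISAL1 implementation; the step sizes and the fixed CG budget are ad-hoc choices):

```python
import numpy as np

def cg(M, r, steps):
    """A few plain conjugate gradient steps for M z = r, started at zero."""
    z = np.zeros_like(r); p = r.copy(); rs = r @ r
    for _ in range(steps):
        if rs < 1e-30:
            break
        Mp = M(p)
        alpha = rs / (p @ Mp)
        z = z + alpha * p
        r = r - alpha * Mp
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return z

def isal1_sketch(A, b, iters=2000, cg_steps=5):
    """Projected subgradient method for min ||x||_1 s.t. Ax = b, where the
    projection P(z) = z - A^T w with A A^T w = A z - b is only computed
    approximately by a fixed small number of CG steps."""
    M = lambda w: A @ (A.T @ w)
    x = A.T @ cg(M, b, cg_steps)                  # approximately feasible start
    for k in range(1, iters + 1):
        z = x - 0.1 / np.sqrt(k) * np.sign(x)     # subgradient step, diminishing size
        x = z - A.T @ cg(M, A @ z - b, cg_steps)  # approximate projection
    return x

rng = np.random.default_rng(4)
A = rng.normal(size=(10, 30))
x_true = np.zeros(30); x_true[2], x_true[11] = 1.0, -2.0
b = A @ x_true
x = isal1_sketch(A, b)
```

Even on this toy problem one observes the effect described above: a handful of CG steps per iteration keeps the iterates essentially feasible while the ${1}$-norm decreases.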

• HSE, the heuristic support evaluation: That’s a pretty neat device which can be integrated in any Basis Pursuit solver (beware: not Basis Pursuit denoising; we want the equality constraint). The idea is based on the following small lemma:

Lemma 1 A feasible vector ${\bar x}$ (i.e. ${A\bar x = b}$) is optimal for (1) if and only if there is ${w\in{\mathbb R}^m}$ such that ${A^Tw \in\partial\|\bar x\|_1}$.

The proof basically consists of noting that the normal cone on the constraint set ${\{x:\ Ax=b\}}$ is the image space of ${A^T}$ and hence, the condition is equivalent to saying that this normal cone intersects the subdifferential ${\partial\| \bar x\|_1}$, which is necessary and sufficient for ${\bar x}$ being optimal. In practice the HSE does the following:

• deduce candidate (approx.) support ${S}$ from a given ${x}$
• compute approximate solution ${\hat{w}}$ to ${A_{S}^T w = \text{sgn}(x_{S})}$ by ${w = (A_S^T)^\dagger\text{sgn}(x_S)}$ with the help of CG
• if ${\|A^T \hat{w}\|_\infty \approx 1}$ check existence of a ${\hat{x}}$ with ${A_{S} \hat{x}_{S} = b}$ and ${\hat{x}_i = 0}$ ${\forall\, i \notin S}$
• if that ${\hat x}$ exists, check if the relative duality gap ${(\|\hat{x}\|_1 + b^T (-\hat{w}))/\|\hat{x}\|_1}$ is small and return “success” if so, i.e. take ${\hat x}$ as an optimal solution

Again, CG usually performs great here and only very few iterations (say ${5}$) are needed. In practice this method never returned a vector ${\hat x}$ marked as optimal which was not in fact optimal.
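The four steps can be condensed into a few lines (a sketch of mine with least-squares solves in place of the CG solves, and ad-hoc tolerances; not the actual implementation):

```python
import numpy as np

def hse(A, x, b, sup_tol=1e-2, tol=1e-6):
    """Heuristic support evaluation, following the four steps above:
    guess the support from x, fit a dual vector w, check the certificate,
    and accept the candidate x_hat if it is feasible with a tiny relative
    duality gap."""
    S = np.abs(x) > sup_tol * np.max(np.abs(x))          # 1. candidate support
    w, *_ = np.linalg.lstsq(A[:, S].T, np.sign(x[S]), rcond=None)  # 2. dual fit
    if np.max(np.abs(A.T @ w)) > 1 + tol:                # 3. ||A^T w||_inf ~ 1?
        return None
    x_hat = np.zeros_like(x)
    x_hat[S], *_ = np.linalg.lstsq(A[:, S], b, rcond=None)
    if np.linalg.norm(A @ x_hat - b) > tol * np.linalg.norm(b):
        return None                                      # no feasible x_hat on S
    gap = (np.linalg.norm(x_hat, 1) - b @ w) / np.linalg.norm(x_hat, 1)
    return x_hat if abs(gap) < tol else None             # 4. duality gap check

# toy test: feed HSE an inexact iterate near the true solution
rng = np.random.default_rng(5)
A = np.hstack([np.eye(4), 0.1 * rng.normal(size=(4, 6))])
x_true = np.zeros(10); x_true[0], x_true[1] = 1.5, -2.0
b = A @ x_true
x_hat = hse(A, x_true + 1e-4 * rng.normal(size=10), b)
```

Here ${b^T\hat w}$ is the dual objective of Basis Pursuit, so a vanishing gap certifies optimality of ${\hat x}$.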

• Computational comparison: We faced the challenge of a computational comparison for Basis Pursuit solvers.
The first step was to design a testset. We constructed 100 matrices (74 of which are dense, 26 are sparse) by several constructions and concatenations (see Section 5 in the draft). More complicated was the construction of appropriate right hand sides. Why? Because we wanted to have unique solutions! That is because we wanted to use the norm difference ${\|x^*-\hat x\|_2}$ between optimal and computed solution as a measure for both optimality and feasibility. In the first place we used the ERC due to Joel Tropp (e.g. described in this blog post of Bob’s blog). However, this does not only guarantee uniqueness of solutions but also that the minimum ${1}$-norm solution is also the sparsest. Since that is probably too much to ask of the solutions (i.e. they have to be very sparse), we constructed some more right hand sides using L1TestPack: Construct an ${x}$ such that there is a ${w}$ with ${A^T w \in\partial \|x\|_1}$ and use ${b = Ax}$. This also leads to unique solutions for Basis Pursuit if ${A}$ is injective when restricted to the columns corresponding to the entries in which ${(A^T w)_i = \pm 1}$, but allows for much larger supports. For the results of the comparison of ISAL1, SPGL1, YALL1, ${\ell^1}$-MAGIC, SparseLab, the homotopy solver of Salman Asif and CPLEX, check the paper. However, some things are interesting:

1. homotopy is the overall winner (which is somehow clear for the instances constructed with ERC but not for others). Great work Salman!
2. ISAL1 is quite good (although it is the simplest among all methods).
3. HSE works great: Including it e.g. in SPGL1 produces “better” solutions in less time.
4. CPLEX is remarkably good (we used the dual simplex). So: How does it come that so many people keep saying that standard LP solvers do not work well for Basis Pursuit? That is simply not true for the dual simplex! (However, the interior point method in CPLEX was not competitive at all.)

We plan to make a somewhat deeper evaluation of our computational results before submitting the paper, to have some more detailed conclusions on the performance of the solvers on different instances.

In this post I would like to comment on two papers I “stumbled upon”, one in regularization theory and one in image processing.

The first one is A regularization parameter for nonsmooth Tikhonov regularization by Kazufumi Ito, Bangti Jin and Tomoya Takeuchi. As the title announces, the paper addresses the problem of determining a suitable regularization parameter for some kind of Tikhonov regularization. In particular, the authors propose a new heuristic method, i.e. a method which does not use any estimate of the noise level in the data. This is an interesting and important topic for several reasons:

1. Practically, estimates on the noise level are rarely available and if they are, they are not too reliable.
2. Strictly speaking, these kinds of rules are “bad” since there is the “Bakushinskii veto”: Such rules only provide regularizations (e.g. in the sense of Engl, Hanke, Neubauer) for problems which are well-posed (as a great service, the authors state and prove this statement as Theorem 3.2).
3. Despite this veto, several heuristic rules produce excellent results in practice.

Note that the last two points are not in contradiction. They merely say that the notion of “regularization” may be too strict. Usually, it uses a worst case estimate which may practically never be observed.

The paper contributes a new rule and states that it is applicable to a broad range of problems. The authors use a very general Tikhonov functional:

$\displaystyle \phi(x,y^\delta) + \eta\psi(x)$

and do not assume that ${\phi}$ or ${\psi}$ are smooth. They use the value function

$\displaystyle F(\eta) = \min_x \phi(x,y^\delta) + \eta\psi(x)$

and propose the following rule for ${\eta}$: For some ${\gamma}$ choose ${\eta}$ such that

$\displaystyle \Phi_\gamma(\eta) = \frac{F(\eta)^{1+\gamma}}{\eta}$

is minimal. I do not have any intuition for this rule (however, from the proofs you see that it works, at least for “partially smooth cases”, see below). Lacking a name for this rule, one may use the term “weighted value function rule”.
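To see the rule in action one can take the classical case ${\phi(x,y^\delta)=\tfrac12\|Kx-y^\delta\|^2}$, ${\psi(x)=\tfrac12\|x\|^2}$, where the minimizer has a closed form (a sketch with made-up problem data; the paper of course covers far more general ${\phi}$ and ${\psi}$):

```python
import numpy as np

def tikhonov_value(K, y_delta, eta):
    """Value function F(eta) = min_x 0.5||Kx - y||^2 + eta * 0.5||x||^2,
    evaluated via the closed-form ridge minimizer x = (K^T K + eta I)^{-1} K^T y."""
    n = K.shape[1]
    x = np.linalg.solve(K.T @ K + eta * np.eye(n), K.T @ y_delta)
    return 0.5*np.linalg.norm(K @ x - y_delta)**2 + eta*0.5*np.linalg.norm(x)**2

def weighted_value_rule(K, y_delta, etas, gamma=0.5):
    """Pick eta minimizing Phi_gamma(eta) = F(eta)^(1+gamma) / eta on a grid."""
    vals = [tikhonov_value(K, y_delta, e)**(1 + gamma) / e for e in etas]
    return etas[int(np.argmin(vals))]

# a mildly ill-conditioned operator with noisy data
rng = np.random.default_rng(6)
n = 20
U, _ = np.linalg.qr(rng.normal(size=(n, n)))
V, _ = np.linalg.qr(rng.normal(size=(n, n)))
K = U @ np.diag(0.9 ** np.arange(n)) @ V.T
y_delta = K @ rng.normal(size=n) + 0.01 * rng.normal(size=n)
etas = np.logspace(-8, 0, 60)
eta_star = weighted_value_rule(K, y_delta, etas)
```

Note that no noise level enters anywhere; the rule sees only the value function.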

They prove several nice properties of the value function (continuity, monotonicity and concavity) under loose assumptions on ${\phi}$ and ${\psi}$ (especially they do not even need existence of minimizers for ${\phi(x,y^\delta) + \eta\psi(x)}$, only that the infimum is finite). However, when it comes to error estimates, they only obtain results for a specific discrepancy measure, namely a squared Hilbert space norm:

$\displaystyle \phi(x,y^\delta) = \tfrac12\|Kx-y^\delta\|^2.$

It seems that, for general convex and lower-semicontinuous penalties ${\psi}$ they build upon results from my paper with Bangti Jin on the Hanke-Raus rule and the quasi-optimality principle.

Another contribution of the paper is that it gives an algorithm that realizes the weighted value function rule (a thing which I omitted in my paper with Bangti). Their numerical experiments demonstrate that the weighted value function rule and the proposed algorithm works well for academic examples.

The next paper I want to discuss is the preprint Properties of ${L^1-\text{TGV}^2}$: The one-dimensional case by Kristian Bredies, Karl Kunisch and Tuomo Valkonen. There the authors analyze the somehow recent generalization “total generalized variation” ${\text{TGV}}$ of the omnipresent total variation. The TGV has been proposed by Bredies, Kunisch and Pock in this paper recently and Kristian and I also briefly described it in our book on mathematical image processing. Loosely speaking, the TGV shall be a generalization of the usual total variation which does not lead to “staircasing”. While one may observe from the construction of the TGV functional that staircasing is not to be expected, the authors in this paper give precise statements. By restricting to the one-dimensional case they prove several interesting properties of the TGV functional, most notably that it leads to an equivalent norm on the space ${BV}$.

Maybe I should state the definitions here: The total variation of a function ${u\in L^1(\Omega)}$ is

$\displaystyle \text{TV}(u) = \sup\{\int_\Omega u v'\ |\ v\in C^1_c(\Omega),\ \|v\|_\infty\leq 1\}$

leading to the ${BV}$-norm

$\displaystyle \|u\|_{BV} = \|u\|_{L^1} + \text{TV}(u).$
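In the discrete one-dimensional setting the total variation is just the sum of absolute jumps (a quick numpy sketch; the grid and the examples are mine):

```python
import numpy as np

def tv(u):
    """Discrete total variation of a 1-D signal: the sum of absolute jumps,
    a discrete analogue of sup { int u v' : |v| <= 1 }."""
    return np.sum(np.abs(np.diff(u)))

x = np.linspace(0, 1, 1001)
step = np.where(x < 0.5, 0.0, 1.0)   # one jump of height 1
ramp = x.copy()                       # smooth increase from 0 to 1
# both have total variation 1: TV only measures the total amount of
# increase, not whether it happens in one jump or gradually
```

This also illustrates why ${\text{BV}}$ comfortably contains discontinuous functions: the step and the ramp are indistinguishable for the ${\text{TV}}$ seminorm.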

The ${\text{TGV}^2}$ seminorm for a parameter tuple ${(\alpha,\beta)}$ is

$\displaystyle \text{TGV}^2_{(\alpha,\beta)}(u) = \sup\{\int_\Omega u v''\ |\ v\in C^2_c(\Omega),\ \|v\|_\infty\leq\beta,\ \|v'\|_\infty\leq\alpha\}$

and the associated norm is

$\displaystyle \|u\|_{BGV^2} = \|u\|_{L^1} + \text{TGV}^2(u).$

In Lemma 3.3 they prove that ${\|\cdot\|_{BV}}$ and ${\|\cdot\|_{BGV^2}}$ are equivalent norms on ${\text{BV}}$. In Section 4 they show that minimizers of

$\displaystyle \|u-f\|_{L^1} + \alpha\text{TV}(u)$

obey staircasing of degree 0, i.e. the solution ${u}$ is piecewise constant when it is not equal to ${f}$. For the minimizers of

$\displaystyle \|u-f\|_{L^1} + \text{TGV}^2_{(\alpha,\beta)}(u)$

one has staircasing of degree 1: ${u}$ is affine linear where it is not equal to ${f}$.

These two facts combined (norm equivalence of ${\text{BV}}$ and ${\text{BGV}^2}$ and the staircasing of degree 1) seem quite remarkable to me. They somehow show that staircasing is not related to the space ${\text{BV}}$ of functions of bounded variation but only to the specific ${\text{TV}}$ semi-norm. This is somehow satisfying since I still remember the thorough motivation of L. Rudin in his 1987 thesis for the usage of the space ${\text{BV}}$ in image processing: If there were images which are not in ${\text{BV}}$, we could not observe them. (He even draws an analogy to the question: How many angels can dance on the point of a needle?) Moreover, he further argues that ${\text{BV}}$ is not too large in the sense that its elements are still accessible to analysis (e.g. in defining a weak notion of curvature although they may be discontinuous). The ${\text{BGV}^2}$-model shows that it is possible to overcome the undesired effect of staircasing while staying in the well founded and mathematically sound and appealing framework of ${\text{BV}}$.

The paper contains several more interesting results (e.g. on preservation of continuity and “affinity” and on convergence with respect to ${(\alpha,\beta)}$) which I do not collect here.
