ISMP is over now and I’m already home. I do not have many things to report on from the last day. This is not due the lower quality of the talks but due to the fact that I was a little bit exhausted, as usual at the end of a five-day conference. However, I collect a few things for the record:

• In the morning I visited the semi-planary by Xiaojun Chenon non-convex and non-smooth minimization with smoothing methods. Not surprisingly, she treated the problem

$\displaystyle \min_x f(x) + \|x\|_p^p$

with convex and smooth ${f:{\mathbb R}^n\rightarrow{\mathbb R}}$ and ${0. She proposed and analyzed smoothing methods, that is, to smooth the problem a bit to obtain a Lipschitz-continuous objective function ${\phi_\epsilon}$, minimizing this and then gradually decreasing ${\epsilon}$. This works, as she showed. If I remember correctly, she also treated “iteratively reweighted least squares” as I described in my previous post. Unfortunately, she did not include the generalized forward-backward methods based on ${\text{prox}}$-functions for non-convex functions. Kristian and I pursued this approach in our paper Minimization of non-smooth, non-convex functionals by iterative thresholding and some special features of our analysis include:

• A condition which excludes some (but not all) local minimizers from being global.
• An algorithm which avoids this non-global minimizers by carefully adjusting the steplength of the method.
• A result that the number of local minimizers is still finite, even if the problem is posed in ${\ell^2({\mathbb N})}$ and not in ${{\mathbb R}^n}$.

Most of our results hold true, if the ${p}$-quasi-norm is replaced by functions of the form

$\displaystyle \sum_n \phi_n(|x_n|)$

with special non-convex ${\phi}$, namely fulfilling a list of assumptions like

• ${\phi'(x) \rightarrow \infty}$ for ${x\rightarrow 0}$ (infinite slope at ${0}$) and ${\phi(x)\rightarrow\infty}$ for ${x\rightarrow\infty}$ (mild coercivity),
• ${\phi'}$ strictly convex on ${]0,\infty[}$ and ${\phi'(x)/x\rightarrow 0}$ for ${x\rightarrow\infty}$,
• for each ${b>0}$ there is ${a>0}$ such that for ${x it holds that ${\phi(x)>ax^2}$, and
• local integrability of some section of ${\partial\phi'(x) x}$.

As one easily sees, ${p}$-quasi-norms fulfill the assumptions and some other interesting functions as well (e.g. some with very steep slope at ${0}$ like ${x\mapsto \log(x^{1/3}+1)}$).

• Jorge Nocedalgave a talk on second-order methods for non-smooth problems and his main example was a functional like

$\displaystyle \min_x f(x) + \|x\|_1$

with a convex and smooth ${f}$, but different from Xiaojun Chen, he only considered the ${1}$-norm. His talked is among the best planary talks I have ever attended and it was a great pleasure to listen to him. He carefully explained things and put them in perspective. In the case he skipped slides, he made me feel that I either did not miss an important thing, or understood them even though he didn’t show them He argued that it is not necessarily more expensive to use second order information in contrast to first order methods. Indeed, the ${1}$-norm can be used to reduce the number of degrees of freedom for a second order step. What was pretty interesting is, that he advocated semismooth Newton methods for this problem. Roland and I pursued this approach some time ago in our paper A Semismooth Newton Method for Tikhonov Functionals with Sparsity Constraints and, if I remember correctly (my notes are not complete at this point), his family of methods included our ssn-method. The method Roland and I proposed worked amazingly well in the cases in which it converged but the method suffered from non-global convergence. We had some preliminary ideas for globalization, which we could not tune enough to retain the speed of the method, and abandoned the topic. Now, that the topic will most probably be revived by the community, I am looking forward to fresh ideas here.

Today there are several things I could blog on. The first is the planary by Rich Baraniuk on Compressed Sensing. However, I don’t think that I could reflect the content in a way which would be helpful for a potential reader. Just for the record: If you have the chance to visit one of Rich’s talk: Do it!

The second thing is the talk by Bernd Hofmann on source conditions, smoothness and variational inequalities and their use in regularization of inverse problems. However, this would be too technical for now and I just did not take enough notes to write a meaningful post.

As a third thing I have the talk by Christian Clason on inverse problems with uniformly distributed noise. He argued that for uniform noise it is much better to use an ${L^\infty}$ discrepancy term instead of the usual ${L^2}$-one. He presented a path-following semismooth Newton method to solve the problem

$\displaystyle \min_x \frac{1}{p}\|Kx-y^\delta\|_\infty^p + \frac{\alpha}{2}\|x\|_2^2$

and showed examples with different kinds of noise. Indeed the examples showed that ${L^\infty}$ works much better than ${L^2}$ here. But in fact it works even better, if the noise is not uniformly distributed but “impulsive” i.e. it attains bounds ${\pm\delta}$ almost everywhere. It seems to me that uniform noise would need a slightly different penalty but I don’t know which one – probably you do? Moreover, Christian presented the balancing principle to choose the regularization parameter (without knowledge about the noise level) and this was the first time I really got what it’s about. What one does here is, to choose ${\alpha}$ such that (for some ${\sigma>0}$ which only depends on ${K}$, but not on the noise)

$\displaystyle \sigma\|Kx_\alpha^\delta-y^\delta\|_\infty = \frac{\alpha}{2}\|x_\alpha^\delta\|_2^2.$

The rational behind this is, that the left hand side is monotonically non-decreasing in ${\alpha}$, while the right hand side is monotonically non-increasing. Hence, there should be some ${\alpha}$ “in the middle” which make both somewhat equally large. Of course, we do neither want to “over-regularize” (which would usually “smooth too much”) nor to “under-regularize” (which would not eliminate noise). Hence, balancing seems to be a valid choice. From a practical point of view the balancing is also nice because one can use the fixed-point iteration

$\displaystyle \alpha^{n+1} = 2\sigma\frac{\|Kx_{\alpha^n}^\delta - y^\delta\|_\infty}{\|x_{\alpha_n}^\delta\|_2^2}$

which converges in a few number of iterations.

Then there was the talk by Esther Klann, but unfortunately, I was late so only heard the last half…

Last but not least we have the talk by Christiane Pöschl. If you are interested in Total-Variation-Denoising (TV denoising), then you probably have heard many times that “TV denoising preserves edges” (have a look at the Wikipedia page – it claims this twice). What Christiane showed (in a work with Vicent Caselles and M. Novaga) that this claim is not true in general but only for very special cases. In case of characteristic functions, the only functions for which the TV minimizer has sharp edges are these so-called calibrated sets, introduced by Caselles et el. Building on earlier works by Caselles and co-workers she calculated exact minimizers for TV denoising in the case that the image consists of characteristic functions of two convex sets or of a single star shaped domain, that is, for a given set $B$ she calculated the solution of

$\displaystyle \min_u\int (u - \chi_B)^2dx + \lambda \int|Du|.$

This is not is as easy as it may sound. Even for the minimizer for a single convex set one has to make some effort. She presented a nice connection of the shape of the obtained level-sets with the morphological operators of closing and opening. With the help of this link she derived a methodology to obtain the exact TV denoising minimizer for all parameters. I do not have the images right now but be assured that most of the time, the minimizers do not have sharp edges all over the place. Even for simple geometries (like two rectangles touching in a corner) strange things happen and only very few sharp edges appear. I’ll keep you posted in case the paper comes out (or appears as a preprint).

Christiane has some nice images which make this much more clear:

For two circles edges are preserved if they are far enough away from each other. If they are close, the area “in between” them is filled and, moreover, obey this fuzzy boundary. I remember myself seeing effects like this in the output of TV-solvers and thinking “well, it seems that the algorithm is either not good or not converged yet – TV should output sharp edges!”.

For a star-shaped shape (well, actually a star) the output looks like this. The corners are not only rounded but also blurred and this is true both for the “outer” corners and the “inner” corners.

So, if you have any TV-minimizing code, go ahead and check if your code actually does the right things on images like this!
Moreover, I would love to see similar results for more complicated extensions of TV like Total Generalized Variation, I treated here.

Today I report on two things I came across here at ISMP:

• The first is a talk by Russell Luke on Constraint qualifications for nonconvex feasibility problems. Luke treated the NP-hard problem of sparsest solutions of linear systems. In fact he did not tackle this problem but the problem to find an ${s}$-sparse solution of an ${m\times n}$ system of equations. He formulated this as a feasibility-problem (well, Heinz Bauschke was a collaborator) as follows: With the usual malpractice let us denote by ${\|x\|_0}$ the number of non-zero entries of ${x\in{\mathbb R}^n}$. Then the problem of finding an ${s}$-sparse solution to ${Ax=b}$ is:

$\displaystyle \text{Find}\ x\ \text{in}\ \{\|x\|_0\leq s\}\cap\{Ax=b\}.$

In other words: find a feasible point, i.e. a point which lies in the intersection of the two sets. Well, most often feasibility problems involve convex sets but here, the first one given by this “${0}$-norm” is definitely not convex. One of the simplest algorithms for the convex feasibility problem is to alternatingly project onto both sets. This algorithm dates back to von Neumann and has been analyzed in great detail. To make this method work for non-convex sets one only needs to know how to project onto both sets. For the case of the equality constraint ${Ax=b}$ one can use numerical linear algebra to obtain the projection. The non-convex constraint on the number of non-zero entries is in fact even easier: For ${x\in{\mathbb R}^n}$ the projection onto ${\{\|x\|_0\leq s\}}$ consists of just keeping the ${s}$ largest entries of ${x}$ while setting the others to zero (known as the “best ${s}$-term approximation”). However, the theory breaks down in the case of non-convex sets. Russell treated problem in several papers (have a look at his publication page) and in the talk he focused on the problem of constraint qualification, i.e. what kind of regularity has to be imposed on the intersection of the two sets. He could shows that (local) linear convergence of the algorithm (which is observed numerically) can indeed be justified theoretically. One point which is still open is the phenomenon that the method seems to be convergent regardless of the initialization and that (even more surprisingly) that the limit point seems to be independent of the starting point (and also seems to be robust with respect to overestimating the sparsity ${s}$). I wondered if his results are robust with respect to inexact projections. For larger problems the projection onto the equality constraint ${Ax=b}$ are computationally expensive. For example it would be interesting to see what happens if one approximates the projection with a truncated CG-iteration as Andreas, Marc and I did in our paper on subgradient methods for Basis Pursuit.

• Joel Tropp reported on his paper Sharp recovery bounds for convex deconvolution, with applications together with Michael McCoy. However, in his title he used demixing instead of deconvolution (which, I think, is more appropriate and leads to less confusion). With “demixing” they mean the following: Suppose you have two signals ${x_0}$ and ${y_0}$ of which you observe only the superposition of ${x_0}$ and a unitarily transformed ${y_0}$, i.e. for a unitary matrix ${U}$ you observe

$\displaystyle z_0 = x_0 + Uy_0.$

Of course, without further assumptions there is no way to recover ${x_0}$ and ${y_0}$ from the knowledge of ${z_0}$ and ${U}$. As one motivation he used the assumption that both ${x_0}$ and ${y_0}$ are sparse. After the big bang of compressed sensing nobody wonders that one turns to convex optimization with ${\ell^1}$-norms in the following manner:

$\displaystyle \min_{x,y} \|x\|_1 + \lambda\|y\|_1 \ \text{such that}\ x + Uy = z_0. \ \ \ \ \ (1)$

This looks a lot like sparse approximation: Eliminating ${x}$ one obtains the unconstraint problem \begin{equation*} \min_y \|z_0-Uy\|_1 + \lambda \|y\|_1. \end{equation*}

Phrased differently, this problem aims at finding an approximate sparse solution of ${Uy=z_0}$ such that the residual (could also say “noise”) ${z_0-Uy=x}$ is also sparse. This differs from the common Basis Pursuit Denoising (BPDN) by the structure function for the residual (which is the squared ${2}$-norm). This is due to the fact that in BPDN one usually assumes Gaussian noise which naturally lead to the squared ${2}$-norm. Well, one man’s noise is the other man’s signal, as we see here. Tropp and McCoy obtained very sharp thresholds on the sparsity of ${x_0}$ and ${y_0}$ which allow for exact recovery of both of them by solving (1). One thing which makes their analysis simpler is the following reformulation: The treated the related problem \begin{equation*} \min_{x,y} \|x\|_1 \text{such that} \|y\|_1\leq\alpha, x+Uy=z_0 \end{equation*} (which I would call this the Ivanov version of the Tikhonov-problem (1)). This allows for precise exploitation of prior knowledge by assuming that the number ${\alpha_0 = \|y_0\|_1}$ is known.

First I wondered if this reformulation was responsible for their unusual sharp results (sharper the results for exact recovery by BDPN), but I think it’s not. I think this is due to the fact that they have this strong assumption on the “residual”, namely that it is sparse. This can be formulated with the help of ${1}$-norm (which is “non-smooth”) in contrast to the smooth ${2}$-norm which is what one gets as prior for Gaussian noise. Moreover, McCoy and Tropp generalized their result to the case in which the structure of ${x_0}$ and ${y_0}$ is formulated by two functionals ${f}$ and ${g}$, respectively. Assuming a kind of non-smoothness of ${f}$ and ${g}$ the obtain the same kind of results and especially matrix decomposition problems are covered.

The second day of ISMP started (for me) with the session I organized and chaired.

The first talk was by Michael Goldman on Continuous Primal-Dual Methods in Image Processing. He considered the continuous Arrow-Hurwitz method for saddle point problems

$\displaystyle \min_{u}\max_{\xi} K(u,\xi)$

with ${K}$ convex in the first and concave in the second variable. The continuous Arrow-Hurwitz method consists of solving

$\displaystyle \begin{array}{rcl} \partial_t u(t) &=& -\nabla_u K(u(t),\xi(t))\\ \partial_t \xi(t) &=& \nabla_\xi K(u(t),\xi(t)). \end{array}$

His talk evolved around the problem if ${K}$ comes from a functional which contains the total variation, namely he considered

$\displaystyle K(u,\xi) = -\int_\Omega u\text{div}(\xi) + G(u)$

with the additional constraints ${\xi\in C^1_C(\Omega,{\mathbb R}^2)}$ and ${|\xi|\leq 1}$. For the case of ${G(u) = \lambda\|u-f\|^2/2}$ he presented a nice analysis of the problem including convergence of the method to a solution of the primal problem and some a-posteriori estimates. This reminded me of Showalters method for the regularization of ill-posed problems. The Arrow-Hurwitz method looks like a regularized version of Showalters method and hence, early stopping does not seem to be necessary for regularization. The related paper is Continuous Primal-Dual Methods for Image Processing.

The second talk was given by Elias Helou and was on Incremental Algorithms for Convex Non-Smooth Optimization with Applications in Image Reconstructions. He presented his work on a very general framework for problems of the class

$\displaystyle \min_{x\in X} f(x)$

with a convex function ${f}$ and a convex set ${X}$. Basically, he abstracted the properties of the projected subgradient method. This consists of taking subgradient descent steps for ${f}$ followed by projection onto ${X}$ iteratively: With a subgradient ${g^n\in\partial f(x^n)}$ this reads as

$\displaystyle x^{n+1} = P_X(x^n -\alpha_n g^n)$

he extracted the conditions one needs from the subgradient descent step and from the projection step and formulated an algorithm which consists of successive application of an “optimality operator” ${\mathcal{O}_f}$ (replacing the subgradient step) and a feasibility operator ${\mathcal{F}_X}$ (replacing the projection step). The algorithm then reads as

$\displaystyle \begin{array}{rcl} x^{n+1/2} &=& \mathcal{O}_f(x^n,\alpha_n)\\ x^{n+1} &=& \mathcal{F}_x(x^{n+1/2} \end{array}$

and he showed convergence under the extracted conditions. The related paper is , Incremental Subgradients for Constrained Convex Optimization: a Unified Framework and New Methods.

The third talk was by Jerome Fehrenbach on Stripes removel in images, apllications in microscopy. He considered the problem of very specific noise which is appear in the form of stripes (and appears, for example, “single plane illumination microscopy”). In fact he considered a little more general case and the model he proposed was as follows: The observed image is

$\displaystyle u_{\text{OBS}} = u + n,$

i.e. the usual sum of the true image ${u}$ and noise ${n}$. However, for the noise he assumed that it is given by

$\displaystyle n = \sum_{j=1}^m \psi_j*\lambda_j,$

i.e. it is a sum of different convolutions. The ${\psi_j}$ are kind of shape-functions which describe the “pattern of the noise” and the ${\lambda_j}$ are samples of noise processes, following specific distributions (could be white noise realizations, impulsive noise or something else)-. He then formulated a variational method to identify the variables ${\lambda_j}$ which reads as

$\displaystyle \min \|\nabla(u_{\text{OBS}} - \sum_{j=1}^m \psi_j*\lambda_j)\|_1 + \sum_j \phi_j(\lambda_j).$

Basically, this is the usual variational approach to image denoising, but nor the optimization variable is the noise rather than the image. This is due to the fact that the noise has a specific complicated structure and the usual formulation with ${u = u_{\text{OBS}} +n}$ is not feasible. He used the primal-dual algorithm by Chambolle and Pock for this problem and showed that the method works well on real world problems.

Another theme which caught my attention here is “optimization with variational inequalities as constraints”. At first glance that sounds pretty awkward. Variational inequalities can be quite complicated things and why on earth would somebody considers these things as side conditions in optimization problems? In fact there are good reasons to do so. One reason is, if you have to deal with bi-level optimization problems. Consider an optimization problem

$\displaystyle \min_{x\in C} F(x,p) \ \ \ \ \ (1)$

with convex ${C}$ and ${F(\cdot,p)}$ (omitting regularity conditions which could be necessary to impose) depending on a parameter ${p}$. Now consider the case that you want to choose the parameter ${p}$ in an optimal way, i.e. it solves another optimization problem. This could look like

$\displaystyle \min_p G(x)\quad\text{s.t.}\ x\ \text{solves (1)}. \ \ \ \ \ (2)$

Now you have an optimization problem as a constraint. Now we use the optimality condition for the problem~(1): For differentiable ${F}$, ${x^*}$ solves~(1) if and only if

$\displaystyle \forall y\in C:\ \nabla_x F(x^*(p),p)(y-x^*(p))\geq 0.$

In other words: We con reformulate (2) as

$\displaystyle \min_p G(x)\quad\text{s.t.}\ \forall y\in C:\ \nabla_x F(x^*(p),p)(y-x^*(p))\geq 0. \ \ \ \ \ (3)$

And there it is, our optimization problem with a variational inequality as constraint. Here at ISMP there are entire sessions devoted to this, see here and here.

The scientific program at ISMP started today and I planned to write a small personal summary of each day. However, it is a very intense meeting. Lot’s of excellent talks, lot’s of people to meet and little spare time. So I’m afraid that I have to deviate from my plan a little bit. Instead of a summary of every day I just pick out a few events. I remark that these picks do not reflect quality, significance or something like this in any way. I just pick things for which I have something to record for personal reasons.

My day started after the first plenary which the session Testing environments for machine learning and compressed sensing in which my own talk was located. The session started with the talk by Michael Friedlander of the SPOT toolbox. Haven’t heard of SPOT yet? Take a look! In a nutshell its a toolbox which turns MATLAB into “OPLAB”, i.e. it allows to treat abstract linear operators like matrices. By the way, the code is on github.

The second talk was by Katya Scheinberg (who is giving a semi-planary talk on derivative free optimization at the moment…). She talked about speeding up FISTA by cleverly adjusting step-sizes and over-relaxation parameters and generalizing these ideas to other methods like alternating direction methods. Notably, she used the “SPEAR test instances” from our project homepage! (And credited them as “surprisingly hard sparsity problems”.)

My own talk was the third and last one in that session. I talked about the issue of constructing test instance for Basis Pursuit Denoising. I argued that the naive approach (which takes a matrix ${A}$, a right hand side ${b}$ and a parameter ${\lambda}$ and let some great solver run for a while to obtain a solution ${x^*}$) may suffer from “trusted method bias”. I proposed to use “reverse instance construction” which is: First choose ${A}$, ${\lambda}$ and the solution ${x^*}$ and the construct the right hand side ${b}$ (I blogged on this before here).

Last but not least, I’d like to mention the talk by Thomas Pock: He talked about parameter selection on variational models (think of the regularization parameter in Tikhonov, for example). In a paper with Karl Kunisch titled A bilevel optimization approach for parameter learning in variational models they formulated this as a bi-level optimization problem. An approach which seemed to have been overdue! Although they treat somehow simple inverse problems (well, denoising) (but with not so easy regularizers) it is a promising first step in this direction.

Today I arrived at ISMP in Berlin. This seems to be the largest conference on optimization and is hosted by the Mathematical Optimization Society. (As a side note: The society recently changed its name from MPS (Mathematical Programming Society) to MOS. Probably the conference will be called ISMO in a few years…).

The reception today was special in comparison to conference receptions I have attended so far. First, it was held in the Konzerthaus which is a pretty fancy neo-classical building. Well, I’ve been to equally fancy buildings at conference receptions at GAMM or AIP conferences already, but the distinguishing feature this evening was the program. As usual it featured welcome notes by important people (notably, the one by the official government representative was accurate and entertaining!), prizes and music. The music was great, the host (G.M. Ziegler) did a great job and the ceremony felt like a show rather than an opening reception.

From the prices I’d like to mention two:

After this reception I am looking even more forward to the rest of this conference.

As I a side note: Something seems to be wrong with me and optimization conferences. It seems like every time I visit such a conference, I am loosing my cell-phone. Happened to me at SIOPT 2011 in Darmstadt and happened to me again today…

A few days ago I’ve been to IFIP TC7 and participated in the minisymposium “Optimization in Banach spaces with sparsity constraints” organized by Kristian Bredies and Christian Clason. My session was very interesting, consisting of talks of Masha Zhariy (Optimization algorithms for sparse reconstruction in inverse problems with adaptive stopping rules), Caroline Vehoeven (On a generalization of the iterative soft-thresholding algorithm for the case of non-separable penalty) and Kristian (Inverse problems in spaces of measures). Unfortunately, I could not take part in the second half of the minisymposium.

Although the minisymposium was interesting, I’d like to talk about the plenary Talk “Signal and systems representation; signal spaces, sampling, and quantization effects” by Holger Boche. The title sounded as I should be able to follow, but I’ve never heard the name before. And indeed, Holger Boche gave a fresh view onto the sampling and reconstruction problem, i.e. the digital-to-analog conversion and its inverse.

It was somehow refreshing to have a talk about sampling which was totally free of the words “compressive” and “compressed”. Holger Boche modelled continuous time signals in the well known Bernstein and Paley-Wiener spaces and started with fairly old but not too well known results on exact reconstruction from non-uniform sampling.

Definition 1 For ${1\leq p\leq \infty}$ and ${\sigma>0}$, the Paley-Wiener space ${\mathcal{PW}_\sigma^p}$ is

$\displaystyle \mathcal{PW}_\sigma^p = \{\hat g\ :\ g\in L^p([-\sigma,\sigma])\}.$

Definition 2 A function ${\phi}$ is of sine-type if it is of exponential type ${\pi}$ and

1. its zeros are separated, and
2. there exist real ${A,B,H}$ such that for all ${|y|\geq H}$ it holds that

$\displaystyle Ae^{\pi|y|}\leq |f(x+iy)| \leq Be^{\pi|y|}.$

Then, one theorem of reconstruction for non-uniform sampling points goes as follows:

Theorem 3 Let ${\phi}$ be a function of sine-type, whose zeros ${(t_k)}$ are all real and define

$\displaystyle \phi_k(t) = \frac{\phi(t)}{\phi'(t_k)(t-t_k)}.$

Then, for every ${f\in\mathcal{PW_\pi^1}}$ and all ${T>0}$ it holds that

$\displaystyle \lim_{N\rightarrow\infty}\|f - \sum_{k=-N}^N f(t_k)\phi_k\|_{L^\infty([-T,T])} \rightarrow 0.$

This theorem is a generalization of a result of J.L. Brown on local uniform convergence of the Shannon-sampling series.

These results are related to a sampling theorem due to Logan which I learned some years ago. First we recall that the Hilbert transform ${\mathcal{H}f}$ of a function ${f}$ can be defined via the Fourier transform ${\mathcal{F}}$ as

$\displaystyle \mathcal{F}(\mathcal{H}(u))(\omega) = (-i\text{sign}(\omega))\mathcal{F}(u)(\omega).$

Then, Logan’s Theorem is as follows:

Theorem 4 (Logan’s Sampling Theorem) Let ${f_1,f_2:\mathbb{R}\rightarrow\mathbb{R}}$ such that the support of both ${\hat f_1}$ and ${\hat f_2}$ is contained in ${[-b,-a]\cup[a,b]}$ for some number ${a}$ and ${b}$ with ${0. Then, if ${f_1}$ and ${f_2}$ do not have a root in common with their Hilbert transform, it holds that ${\text{sign} f_1 = \text{sign} f_2}$ implies that ${f_2}$ is equal to ${f_1}$ up to a constant factor.

Proof: (Informal) Since modulation shifts the Fourier transform, we can write a function ${f}$ with ${\text{supp}\hat f\subset -b,-a]\cup[a,b]}$ as a modulated sum of functions ${h_1}$ and ${h_2}$ with bandwidth ${(b-a)/2}$, using ${\mu=(a+b)/2}$ as follows

$\displaystyle f(t) = -i h_1(t)e^{i\mu t} + if_2(t)e^{-i\mu t}.$

Expressing ${\exp}$ in terms of ${\sin}$ and ${\cos}$ gives

$\displaystyle \begin{array}{rcl} f(t) &=& (-if_1(t) + if_2(t))\cos(\mu t) + (f_1(t)+f_2(t))\sin(\mu t)\\ &=& p(t)\cos(\mu t) + q(t)\sin(\mu t). \end{array}$

Note that ${p}$ and ${q}$ are real valued since ${f}$ is assumed to be real valued. Using ${\mathcal{H}(\sin) = \cos}$ and ${\mathcal{H}(\cos) = -\sin}$ we conclude that

$\displaystyle \mathcal{H}(f)(t) = p(t)\sin(\mu t) - q(t)\cos(\mu t).$

With the obvious notation we obtain that

$\displaystyle \mathcal{H}(f_1)(t)f_2(t) - \mathcal{H}(f_2)(t)f_1(t) = p_1(t)q_2(t) - p_2(t)q_1(t) = g(t).$

Now, the zeros of ${g}$ are at least the common zeros of ${f_1}$ and ${f_2}$. Moreover, the bandwidth of ${g}$ is at most ${b-a}$ (since the bandwidth of ${p_j}$ and ${q_j}$ is at most ${(b-a)/2}$). We can conclude that ${g}$ is identically zero by an argument which uses that upper density of the zeros of ${f_1}$ (resp. ${f_2}$) is high enough to force ${g}$ to zero. Hence, we know that

$\displaystyle \frac{f_1(t)}{\mathcal{H}(f_1)(t)} = \frac{f_2(t)}{\mathcal{H}(f_2)(t)}$

To finish the proof one needs some argument from complex analysis (using that both sides of the above equations are extendable to meromorphic function of exponetial type) to conclude that ${f_1}$ equals ${f_2}$ up to a constant factor. $\Box$

When I asked about the relation between Logan’s theorem and his results, he replied that Logan’s theorem was the starting point since he was curious why there was no practical implementation of this sampling procedure available. More precisely, he was interested in designing a digital implementation, based on sampling points, which allows to digitize all linear time invariant filters. One of his negative results was the following.

Definition 5 A sequence ${(t_k)}$ is called a complete interpolation sequence for ${\mathcal{PW}_\pi^2}$ and ${\ell^2}$ if for every ${c\in\ell^2}$ there is exactly one ${f\in \mathcal{PW}_\pi^2}$ such that ${f(t_k) = c_k}$.

Theorem 6 For a complete interpolation sequence ${(t_k)}$ for ${\mathcal{PW}_\pi^2}$ and ${\ell^2}$, the corresponding reconstruction functions ${\phi_k}$ and a given ${t\in\mathbb{R}}$ there always exists a stable linear time invariant filter ${T}$ and a signal ${f\in \mathcal{PW}_\pi^1}$ such that

$\displaystyle \lim\sup_{N\rightarrow\infty} | (Tf)(t) - \sum_{k=-N}^N f(t_k)(T\phi_k)(t)| = \infty.$

He conjectured (Conjecture 2.1 in his talk), that this phenomenon of point-wise divergence of the approximation of a linear time invariant filter remains, even if oversampling is applied, that there always is a function ${f\in\mathcal{PW}_{\beta\pi}^1}$ (${0<\beta<1}$) such that point-wise divergence happens. Moreover, he conjectured that (Conjecture 2.2) digitizing stable linear time invariant filter should be possible if one replaced point-sampling by appropriate linear functionals and posed the open problem to both prove this conjecture and to find such linear functionals.

Although I am at ENUMATH since four days, I did not post any news yet.

Most talks here dealt with numerical methods for PDEs and since this is not my primary topic I sometimes had a hard time to grab what the problems and goals were. One exception was the talk by Franco Brezzi. Ha gave an entertaining talk entitled “To reconstruct or not to reconstruct?”. In a nutshell  he talked about discretization methods for PDEs by either finite differences and finite elements. He distinguished the two methods by the fact that finite difference methods only use and produce “nodal values”,  i.e. values of the solution at specific points. On the other hand, finite element methods  also work with a set of values describing the solution. However, these numbers are coefficients to some basis functions and hence, one can “reconstruct” a true function from these values. His first point was, that methods with “reconstruction” usually have much simples proofs for convergence and so on. However, finite difference methods are usually much more simple to implement since the discretization itself explicitly dictates the linear system one has to solve. He then introduced “mimetic finite differences” which, in my incomplete understanding, are a kind of finite difference methods that incorporate more geometric information. During his talk I thought about phrasing his concepts of “reconstruction” and “evaluating” in the language of signal processing as “interpolation” and “sampling” and if this would give another perspective which could be helpful.

On Thursday and Friday, a minisymposium on “Compressed Sensing and Sparse Approximation Algorithms” takes place at ILAS 11.

Holger Rauhut talked about Compressive Sensing and Structured Random Matrices and his abstract was

Compressive sensing is a recent paradigm in signal processing and sampling theory that predicts that sparse signals can be recovered from a small number of linear and non- adaptive measurements using convex optimization or greedy algorithms. Quite remark- ably, all good constructions of the so called measurement matrix known so far are based on randomness. While Gaussian random matrices provide optimal recovery guarantees, such unstructured matrices are of limited use in applications. Indeed, structure often allows to have fast matrix vector multiplies. This is crucial in order to speed up recovery algorithms and to deal with large scale problems. The talk discusses models of structured random matrices that are useful in certain applications, and presents corresponding recovery guarantees. An important type of structured random matrix arises in connection with sampling sparse expansions in terms of bounded orthogonal systems (such as the Fourier system). The second type of structured random matrices to be discussed are partial random circulant matrices, that is, from convolution. In particular, we present recent results with J. Romberg and J. Tropp on the restricted isometry property of such matrices.

Mark A. Iwen talked about Compressed Sensing for Manifold Data and his abstract was

Preliminary work involving the recovery of manifold data using compressive measurements will be discussed, along with related sampling bounds for the approximation of manifold data via a recent multiscale piecewise linear approximation method known as “Geometric Wavelets” [Allard, Chen, Maggioni, Multiscale Geometric Methods for Data Sets II: Geometric Wavelets, Submitted, 2011].

John Wright talked about Local Correctness of Dictionary Learning Algorithms and his abstract was

The idea that many important classes of signals can be well-represented by linear com- binations of a small set of atoms selected from a given dictionary has had dramatic impact on the theory and practice of signal processing. For practical problems in which an appropriate sparsifying dictionary is not known ahead of time, a very popular and successful heuristic is to search for a dictionary that minimizes an appropriate sparsity surrogate over a given set of sample data. In this talk, we describe steps towards un- derstanding when this problem can be solved by efficient algorithms. We show that under mild hypotheses, the dictionary learning problem is locally well-posed: the desired solution is a local minimum of the 1 norm. Namely, if ${A \in \mathbf{R}^{m\times n}}$ is an incoherent (and possibly overcomplete) dictionary, and the coefficients ${X \in \mathbf{R}^{n\times p}}$ follow a random sparse model, then with high probability ${(A, X)}$ is a local minimum of the 1 norm over the manifold of factorizations ${(A', X')}$ satisfying ${A' X' = Y}$ , provided the number of samples ${p = \Omega(n^3 k)}$. For overcomplete A, this is the first result showing that the dictionary learning problem is locally solvable.

The last speaker of the first session was Jakob Lemvig, talking on Sparse Dual Frames and his abstract was

Frames are generalizations of bases which lead to redundant signal expansions, and they play an important role in many applications, e.g., in the theory of nonuniform sampling, wireless communications, and Sigma-Delta quantization. Sparsity of frame vectors (in some fixed orthonormal basis) is a new paradigm in frame theory that among other things allows for simple representation of the frame vectors and fast decomposi- tion and reconstruction procedures. Recently, a general construction method for sparse tight frames was proposed in [Casazza, Heinecke, Krahmer, Kutyniok, Optimally sparse frames, preprint]. In this talk, however, we will initiate the study of sparse dual frames of a given (not necessarily tight nor sparse) frame. We present theoretical and experimental results showing that sparse duals can have many desirable properties as well as providing fast reconstruction.

Today we have to following speakers:

Gitta Kutyniok, Separation of Data by Sparse Approximations

Modern data is customarily of multimodal nature, and analysis tasks typically require separation into the single components. Although a highly ill-posed problem, the morphological difference of these components often allows a precise separation. A novel methodology consists in applying 1 minimization to a composed dictionary consisting of tight frames each sparsifying one of the components. In this talk, we will first discuss a very general approach to derive estimates on the accuracy of separation using cluster coherence and clustered/geometric sparsity. Then we will use these results to analyze performance of this methodology for separation of image components.

Götz E. Pfander, From the Bourgain Tzafriri Restricted Invertibility Theorem to restricted isometries

The Bourgain-Tzafriri Restricted Invertibility Theorem states conditions under which a Riesz bases (for its span) can be extracted from a given system of norm one vectors in finite dimensional spaces. The challenge is to choose a large subset while making sure that the resulting lower Riesz bound remains large. An upper Riesz bound of the selected Riesz bases is inherited from the frame bound of the original system of vectors. In this talk, we shall present an algorithm that allows us to control both, the lower and the upper, Riesz bounds.

Sadegh Jokar, Compressed Sensing and Sparse Solution of PDEs

Compressed sensing says that a high dimensional but sparse signal can be reconstructed from only a small number of (random) measurements. We consider a related but different problem. We are interested in the numerical solution of partial differential equations ${Lu = f}$ with a differential operator L using some ideas from compressed sensing. Considering a classical Galerkin or Petrov-Galerkin finite ele- ment approach, one seeks a solution u in some function space ${U}$ (which is spanned by ${\{\varphi_1 , \dots,\varphi_n\}}$), represented as ${u = \sum_{i=1}^n u_i \varphi_i}$. Here we would like to find the sparsest representation of the solution ${u}$ in the space ${U = span\{u_1 , \dots , u_n \}}$ using discretized finite elements via ${\ell^1}$-minimization. Specially in adaptive methods, the usual approach to achieve this goal is to use local a posteriori error estimation to determine where a refinement is needed, but we use 1 -minimization and linear programming to perform the adaptive refinement in the finite element method in such a way that the solution is sparsely represented by a linear combination of basis functions. We also perform the analysis of solutions of underdetermined linear systems that have Kronecker product structure and show some theoretical results in this direction.

In fact, Sadegh also talked about his work on Compressed Sensing with matrices which are Kronecker product where he investigated how several interesting constants behave under the formation of Kronecker products. He (I guess) invented the term “sparkity” for the smallness of the spark of a matrix, and then the result sound funny: The sparkity of a Kronecker product is equal to the minimum of the sparkity of the products.

Felix Krahmer, New and Improved Johnson-Lindenstrauss Embeddings via the Restricted Isometry Property

The Johnson-Lindenstrauss (JL) Lemma states that any set of p points in high dimensional Euclidean space can be embedded into ${O(\delta^{-2} \log(p))}$ dimensions, without distorting the distance between any two points by more than a factor between ${1-\delta}$ and ${1 + \delta}$ . We establish a new connection between the JL Lemma and the Restricted Isometry Property (RIP), a well-known concept in the theory of sparse recovery often used for showing the success of ${\ell^1}$-minimization. Consider an ${m \times N}$ matrix satisfying the ${(k, \delta_k )}$-RIP and an arbitrary set ${E}$ of ${O(e^k }$points in ${R^N}$ . We show that with high probability, such a matrix with randomized column signs maps ${E}$ into ${R^m}$ without distorting the distance between any two points by more than a factor of ${1 \pm 4\delta k}$ . Consequently, matrices satisfying the Restricted Isometry of optimal order provide optimal Johnson-Lindenstrauss embeddings up to a logarithmic factor in ${N}$. Moreover, our results yield the best known bounds on the necessary embedding dimension m for a wide class of structured random matrices. Our results also have a direct application in the area of compressed sensing for redundant dictionaries.

After a very refreshing and enjoying summer vacation I am back to work and back to blogging. This week there is the ILAS 11 (where ILAS stands for International Linear Algebra Society) here at TU Braunschweig; and since the talks take place in the building just next to where I am, I enjoyed some of them. Two talks have been especially interesting to me.

The first one was Tikhonov Regularization for Large Scale Inverse Problems by Melina Freitag (maybe the first link is not working yet but Melina said that she is going to upload her slides under that address). She talked about the ways the weather forecast is done these days in the UK and in Europe and especially on the process of Data Assimilation where one uses the previous weather forecasts and newly arrived measurements to produce a better estimate of the state. As a matter of fact, two popular methods use in this field (3dVar and 4dVar) are equivalent to classical Tikhonov regularization. In her section on ${L^1}$-penalties (which I usually call ${\ell^1}$-penalties…) she actually introduced a kind of discrete ${TV}$-penalty as a replacement for the usual quadratic (${\ell^2}$) penalty. Her motivation was as usual: Tikhonov regularization smoothes too much and weather fronts are not smooth. She did not have results of this kind of ${TV}$ regularization as a replacement in 4dVar with real weather data but with smaller toy examples (with non-linear advection equations) since the resulting optimization problem is LARGE. However, her results look promising. I am not sure if she did, but one was tempted to arrive at the conclusion that “4dVar with ${TV}$ penalty gives a better resolution of weather fronts”. It happened that during her talk there was thunderstorm with heavy rain in front of the windows which has not been predicted by the forecast (according to which, the thunderstorm should happen the next day). Now: Would a ${TV}$ penalty be able to predict this thunderstorm for the right time? I am not sure. While ${TV}$ penalties do enforce edges, the precise place of the edge is still not too sure. My feeling is, that the accuracy of the position is better, the less the curvature of the edge is, but in general this highly depends on the ill-posed problem at hand.

The second talk was Recent Progress in the Solution of Discrete Ill-Posed Problems by Michiel Hochstenbach. The talk was pretty amusing; I especially liked the slogan for discrete ill-posed problems:

How to wisely divide by zero.

Also he introduced the three forms of variational regularization which I usually call Tikhonov, Ivanov and Morozov regularization (on slide 34) and introduced the Pareto front (under the name it usually has in discrete ill-posed problems: L-curve).

Another appealing slogan was:

Discrete ill-posed problems are essentially underdetermined.

(Since we do not want to solve the respective equation exactly, we usually have a lot of approximate solution of which we have to choose the one we like the most. Of course this is the same for continuous ill-posed problems.) As a consequence: One should use as much prior knowledge as possible.

In the rest of his talk he talked about improvements of the classical Tikhonov regularization (which is a linear method) by other linear methods with respect to both reconstruction quality and computational burden. He motivated his SRSVD (subspace restricted SVD) by the observation that the simple truncated SVD is often failing due to the fact that the first singular vectors do not contain useful information of the solution. He proposed to use a selected set of (orthonormal) vectors and use a restricted SVD in the following. However, does this use the knowledge that the solution shall be a linear combination of these selected vectors? And wouldn’t it be beneficial to use a larger set of (non-orthonormal) vectors, put into a (possible overcomplete) dictionary and use a sparse reconstruction method such a ${\ell^1}$ regularization which automatically selects the relevant vectors? He also proposed the so called “linear combination approach” which basically takes several outputs of various methods and search within the linear combinations of this outputs for a better solution. To do so he proposed to use another Ivanov-type regularization (slide 79). I still did not get why he uses the largest available norm as a constraint here… However there should be an answer somewhere is his papers.

Edit: Michiel sent me the following answer:

I used the largest available norm, since the norms of many solution approaches are often smaller than, or approximately equal to the true norm. In the paper Discrete ill-posed least-squares problems with a solution norm constraint we reach the conclusion that
“For the approach based on the solution norm constraint, it seems important that $\|x\|$ not be underestimated.”

Edit: The above mentioned paper Discrete ill-posed least-squares problems with a solution norm constraint is to be published in “Linear Algebra and its Applications”. It can be found via its doi.