### Signal and image processing

In this post I will explore a bit the question on how to calculate the discrete gradient and the discrete divergence of an image and a vector field, respectively.

Let ${u_0\in{\mathbb R}^{N\times M}}$ be a discrete image, i.e. ${u_0(i,j)}$ denotes the gray value at the ${i,j}$-th pixel. The famous total variation denoising amounts to minimizing the functional

$\displaystyle \Phi(u) = \tfrac12 \int (u-u_0)^2 + \lambda\int|\nabla u|$

where the integral shall be understood as summation, ${\nabla}$ stand for the gradient, i.e. the vector of the partial derivatives, and ${|\cdot|}$ stands for the euclidean absolute value.

When using primal-dual methods, it is of crucial importance, that the used operators for the gradient and the divergence are numerically adjoint (up to the minus sign). That is, the numerical operations are adjoint in the following sense: If grad is the operation for the gradient and div is the operation for the divergence, then it should hold for any variables ${u}$ and ${v}$ of suitable size and with gradu = grad(u), divv = div(v) that sum(gradu(:).*v(:)) and -sum(u(:).*divv(:)) are equal up to numerical precision. Due to the boundary treatment, the internal MATLAB operations gradient and divergence do not fulfill this requirement.

The most common discretization of the gradient uses discrete forward differences and a constant padding at the boundary (which means that Neumann boundary values are applied). In formula, this reads as

$\displaystyle (D_xu)_{i,j} = \begin{cases} u_{i+1,j} - u_{i,j} & i

The respective adjoints are backward differences with zero boundary treatment (check it!). Apparently, there are many different ways to implement this routines (and the respective adjoints) in MATLAB. Here are four of them:

1. For loops: Run through all pixels in two for-loops and assign the difference to the output. Of course, one should preallocate the output prior to the loop. But you may probably know the first two rules of MATLAB coding? If not, here they are: 1. Avoid for-loops. 2. Avoid for-loops, seriously. I put the routines into extra functions and created anonymous function to call the gradient and the divergence as
grad = @(u) cat(3,dxp(u),dyp(u));
div = @(V) dxm(V(:,:,1)) + dym(V(:,:,2));

2. Shift and subtract: MATLAB is great in using vectors and matrices. And to avoid the for-loop one could also implement the forward difference in ${x}$-direction by shifting the matrix and subtract the original one, i.e. [u(:,2:end) u(:,end)] - u (and similarly for the other differences). Again, I wrote extra functions and used anonymous function as above.
3. Small sparse matrices from the left and from the right: MATLAB is also pretty good with sparse matrices. Since the derivatives in ${x}$-direction only involve the subtraction of two elements in the same column, one can realize this by multiplying an image from the left with a sparse diagonal matrix with just two non-zero diagonals. Similarly, the derivative in ${y}$-direction can be realized by multiplying from the right with a suitable matrix. More precisely, this approach is realized by
Dy = spdiags([-ones(M,1) ones(M,1)],[0 1],M,M);
Dy(M,:) = 0;
Dx = spdiags([-ones(N,1) ones(N,1)],[0 1],N,N);
Dy(N,:) = 0;
Dxu = Dx*u;
Dyu = u*Dy';


(check it). Note that the adjoint of ${x}$-derivative if simple the operation Dx'*u and the operation of the ${y}$-derivative is u*Dy. Together, the calculation of the gradient and the divergence was done by the anonymous functions

grad = @(u) cat(3,u*Dx',Dy*u);
div = @(V) V(:,:,1)*Dx + Dy'*V(:,:,2);

4. Large sparse matrices: One could think about the following: Vectorize the image by U = u(:) (which amounts to stacking the columns above each other). Then assemble a large ${NM\times NM}$ sparse matrix which has just two non-zero diagonals to do the forward (and other) differences. More precisely this can be done by
(with Dx and Dy from above)

DX = kron(Dx,speye(M));
DY = kron(speye(N),Dy);
DxU = DX*U;
DyU = DY*U;


Here, it is clear that the respective adjoint are just the multiplication with the transposed matrices. Here, the anonymous functions are

grad = @(u) [DX*u DY*u];
div = @(V) DX'*V(:,1) + DY'*V(:,2);


The different approaches have different pros and cons. Well, the for-loop only has cons: It is presumably slow and uses a lot indexing which easily leads to bugs. The shift-and-subtract method should go into an extra function to make it easy to use – but this is not necessarily a drawback. For the multiplication with the large matrix, one has to vectorize the image first and every time one wants to look at the result and need to do reshape(U,N,M). But let’s focus on speed: I implemented all methods, and let the run on square images of different sizes for 50 times (after a warmup) and measures the execution times with tic and toc. The assembly of the matrices did not enter the timing. Moreover, no parallel computation was used – just a single core. Finally, memory was not an issue since the larges matrices (of size ${2500\times 2500\times 2}$) only need roughly 100MB of memory.

Here is the table with the average times for one calculation of the gradient (in seconds):

 ${N}$ For-loop Shift-subtract left-right left 100 0.0004 0.0002 0.0001 0.0003 200 0.0018 0.0010 0.0005 0.0015 300 0.0057 0.0011 0.0014 0.0020 400 0.0096 0.0031 0.0022 0.0035 500 0.0178 0.0035 0.0030 0.0054 750 0.0449 0.0114 0.0097 0.0123 1000 0.0737 0.0189 0.0128 0.0212 1500 0.2055 0.0576 0.0379 0.0601 2000 0.3942 0.0915 0.0671 0.1136 2500 0.6719 0.1571 0.1068 0.1788

and here is the one for the divergences:

 ${N}$ For-loop Shift-subtract left-right left 100 0.0004 0.0003 0.0002 0.0002 200 0.0018 0.0015 0.0005 0.0008 300 0.0048 0.0016 0.0015 0.0012 400 0.0090 0.0036 0.0020 0.0022 500 0.0158 0.0057 0.0027 0.0035 750 0.0409 0.0132 0.0073 0.0069 1000 0.0708 0.0238 0.0130 0.0125 1500 0.2008 0.0654 0.0344 0.0370 2000 0.3886 0.1285 0.0622 0.0671 2500 0.6627 0.2512 0.1084 0.1361

As expected, the for-loop is clearly slower and also as expected, all methods basically scale quadratically (doubling ${N}$ amounts to multiplying the running time by four) since the work per pixel is constant. A little surprisingly, the multiplication from the left and from the right is fastest and also consistently a little faster then multiplication from the left with the larger sparse matrix. I don’t know, why the results are that different for the gradient and for the divergence. Maybe this is related to my use on anonymous functions or the allocation of memory?

Another few notes to myself:

ISMP is over now and I’m already home. I do not have many things to report on from the last day. This is not due the lower quality of the talks but due to the fact that I was a little bit exhausted, as usual at the end of a five-day conference. However, I collect a few things for the record:

• In the morning I visited the semi-planary by Xiaojun Chenon non-convex and non-smooth minimization with smoothing methods. Not surprisingly, she treated the problem

$\displaystyle \min_x f(x) + \|x\|_p^p$

with convex and smooth ${f:{\mathbb R}^n\rightarrow{\mathbb R}}$ and ${0. She proposed and analyzed smoothing methods, that is, to smooth the problem a bit to obtain a Lipschitz-continuous objective function ${\phi_\epsilon}$, minimizing this and then gradually decreasing ${\epsilon}$. This works, as she showed. If I remember correctly, she also treated “iteratively reweighted least squares” as I described in my previous post. Unfortunately, she did not include the generalized forward-backward methods based on ${\text{prox}}$-functions for non-convex functions. Kristian and I pursued this approach in our paper Minimization of non-smooth, non-convex functionals by iterative thresholding and some special features of our analysis include:

• A condition which excludes some (but not all) local minimizers from being global.
• An algorithm which avoids this non-global minimizers by carefully adjusting the steplength of the method.
• A result that the number of local minimizers is still finite, even if the problem is posed in ${\ell^2({\mathbb N})}$ and not in ${{\mathbb R}^n}$.

Most of our results hold true, if the ${p}$-quasi-norm is replaced by functions of the form

$\displaystyle \sum_n \phi_n(|x_n|)$

with special non-convex ${\phi}$, namely fulfilling a list of assumptions like

• ${\phi'(x) \rightarrow \infty}$ for ${x\rightarrow 0}$ (infinite slope at ${0}$) and ${\phi(x)\rightarrow\infty}$ for ${x\rightarrow\infty}$ (mild coercivity),
• ${\phi'}$ strictly convex on ${]0,\infty[}$ and ${\phi'(x)/x\rightarrow 0}$ for ${x\rightarrow\infty}$,
• for each ${b>0}$ there is ${a>0}$ such that for ${x it holds that ${\phi(x)>ax^2}$, and
• local integrability of some section of ${\partial\phi'(x) x}$.

As one easily sees, ${p}$-quasi-norms fulfill the assumptions and some other interesting functions as well (e.g. some with very steep slope at ${0}$ like ${x\mapsto \log(x^{1/3}+1)}$).

• Jorge Nocedalgave a talk on second-order methods for non-smooth problems and his main example was a functional like

$\displaystyle \min_x f(x) + \|x\|_1$

with a convex and smooth ${f}$, but different from Xiaojun Chen, he only considered the ${1}$-norm. His talked is among the best planary talks I have ever attended and it was a great pleasure to listen to him. He carefully explained things and put them in perspective. In the case he skipped slides, he made me feel that I either did not miss an important thing, or understood them even though he didn’t show them He argued that it is not necessarily more expensive to use second order information in contrast to first order methods. Indeed, the ${1}$-norm can be used to reduce the number of degrees of freedom for a second order step. What was pretty interesting is, that he advocated semismooth Newton methods for this problem. Roland and I pursued this approach some time ago in our paper A Semismooth Newton Method for Tikhonov Functionals with Sparsity Constraints and, if I remember correctly (my notes are not complete at this point), his family of methods included our ssn-method. The method Roland and I proposed worked amazingly well in the cases in which it converged but the method suffered from non-global convergence. We had some preliminary ideas for globalization, which we could not tune enough to retain the speed of the method, and abandoned the topic. Now, that the topic will most probably be revived by the community, I am looking forward to fresh ideas here.

Today there are several things I could blog on. The first is the planary by Rich Baraniuk on Compressed Sensing. However, I don’t think that I could reflect the content in a way which would be helpful for a potential reader. Just for the record: If you have the chance to visit one of Rich’s talk: Do it!

The second thing is the talk by Bernd Hofmann on source conditions, smoothness and variational inequalities and their use in regularization of inverse problems. However, this would be too technical for now and I just did not take enough notes to write a meaningful post.

As a third thing I have the talk by Christian Clason on inverse problems with uniformly distributed noise. He argued that for uniform noise it is much better to use an ${L^\infty}$ discrepancy term instead of the usual ${L^2}$-one. He presented a path-following semismooth Newton method to solve the problem

$\displaystyle \min_x \frac{1}{p}\|Kx-y^\delta\|_\infty^p + \frac{\alpha}{2}\|x\|_2^2$

and showed examples with different kinds of noise. Indeed the examples showed that ${L^\infty}$ works much better than ${L^2}$ here. But in fact it works even better, if the noise is not uniformly distributed but “impulsive” i.e. it attains bounds ${\pm\delta}$ almost everywhere. It seems to me that uniform noise would need a slightly different penalty but I don’t know which one – probably you do? Moreover, Christian presented the balancing principle to choose the regularization parameter (without knowledge about the noise level) and this was the first time I really got what it’s about. What one does here is, to choose ${\alpha}$ such that (for some ${\sigma>0}$ which only depends on ${K}$, but not on the noise)

$\displaystyle \sigma\|Kx_\alpha^\delta-y^\delta\|_\infty = \frac{\alpha}{2}\|x_\alpha^\delta\|_2^2.$

The rational behind this is, that the left hand side is monotonically non-decreasing in ${\alpha}$, while the right hand side is monotonically non-increasing. Hence, there should be some ${\alpha}$ “in the middle” which make both somewhat equally large. Of course, we do neither want to “over-regularize” (which would usually “smooth too much”) nor to “under-regularize” (which would not eliminate noise). Hence, balancing seems to be a valid choice. From a practical point of view the balancing is also nice because one can use the fixed-point iteration

$\displaystyle \alpha^{n+1} = 2\sigma\frac{\|Kx_{\alpha^n}^\delta - y^\delta\|_\infty}{\|x_{\alpha_n}^\delta\|_2^2}$

which converges in a few number of iterations.

Then there was the talk by Esther Klann, but unfortunately, I was late so only heard the last half…

Last but not least we have the talk by Christiane Pöschl. If you are interested in Total-Variation-Denoising (TV denoising), then you probably have heard many times that “TV denoising preserves edges” (have a look at the Wikipedia page – it claims this twice). What Christiane showed (in a work with Vicent Caselles and M. Novaga) that this claim is not true in general but only for very special cases. In case of characteristic functions, the only functions for which the TV minimizer has sharp edges are these so-called calibrated sets, introduced by Caselles et el. Building on earlier works by Caselles and co-workers she calculated exact minimizers for TV denoising in the case that the image consists of characteristic functions of two convex sets or of a single star shaped domain, that is, for a given set $B$ she calculated the solution of

$\displaystyle \min_u\int (u - \chi_B)^2dx + \lambda \int|Du|.$

This is not is as easy as it may sound. Even for the minimizer for a single convex set one has to make some effort. She presented a nice connection of the shape of the obtained level-sets with the morphological operators of closing and opening. With the help of this link she derived a methodology to obtain the exact TV denoising minimizer for all parameters. I do not have the images right now but be assured that most of the time, the minimizers do not have sharp edges all over the place. Even for simple geometries (like two rectangles touching in a corner) strange things happen and only very few sharp edges appear. I’ll keep you posted in case the paper comes out (or appears as a preprint).

Christiane has some nice images which make this much more clear:

For two circles edges are preserved if they are far enough away from each other. If they are close, the area “in between” them is filled and, moreover, obey this fuzzy boundary. I remember myself seeing effects like this in the output of TV-solvers and thinking “well, it seems that the algorithm is either not good or not converged yet – TV should output sharp edges!”.

For a star-shaped shape (well, actually a star) the output looks like this. The corners are not only rounded but also blurred and this is true both for the “outer” corners and the “inner” corners.

So, if you have any TV-minimizing code, go ahead and check if your code actually does the right things on images like this!
Moreover, I would love to see similar results for more complicated extensions of TV like Total Generalized Variation, I treated here.

The scientific program at ISMP started today and I planned to write a small personal summary of each day. However, it is a very intense meeting. Lot’s of excellent talks, lot’s of people to meet and little spare time. So I’m afraid that I have to deviate from my plan a little bit. Instead of a summary of every day I just pick out a few events. I remark that these picks do not reflect quality, significance or something like this in any way. I just pick things for which I have something to record for personal reasons.

My day started after the first plenary which the session Testing environments for machine learning and compressed sensing in which my own talk was located. The session started with the talk by Michael Friedlander of the SPOT toolbox. Haven’t heard of SPOT yet? Take a look! In a nutshell its a toolbox which turns MATLAB into “OPLAB”, i.e. it allows to treat abstract linear operators like matrices. By the way, the code is on github.

The second talk was by Katya Scheinberg (who is giving a semi-planary talk on derivative free optimization at the moment…). She talked about speeding up FISTA by cleverly adjusting step-sizes and over-relaxation parameters and generalizing these ideas to other methods like alternating direction methods. Notably, she used the “SPEAR test instances” from our project homepage! (And credited them as “surprisingly hard sparsity problems”.)

My own talk was the third and last one in that session. I talked about the issue of constructing test instance for Basis Pursuit Denoising. I argued that the naive approach (which takes a matrix ${A}$, a right hand side ${b}$ and a parameter ${\lambda}$ and let some great solver run for a while to obtain a solution ${x^*}$) may suffer from “trusted method bias”. I proposed to use “reverse instance construction” which is: First choose ${A}$, ${\lambda}$ and the solution ${x^*}$ and the construct the right hand side ${b}$ (I blogged on this before here).

Last but not least, I’d like to mention the talk by Thomas Pock: He talked about parameter selection on variational models (think of the regularization parameter in Tikhonov, for example). In a paper with Karl Kunisch titled A bilevel optimization approach for parameter learning in variational models they formulated this as a bi-level optimization problem. An approach which seemed to have been overdue! Although they treat somehow simple inverse problems (well, denoising) (but with not so easy regularizers) it is a promising first step in this direction.

In this post I just collect a few papers that caught my attention in the last moth.

I begin with Estimating Unknown Sparsity in Compressed Sensing by Miles E. Lopes. The abstract reads

Within the framework of compressed sensing, many theoretical guarantees for signal reconstruction require that the number of linear measurements ${n}$ exceed the sparsity ${\|x\|_0}$ of the unknown signal ${x\in\mathbb{R}^p}$. However, if the sparsity ${\|x\|_0}$ is unknown, the choice of ${n}$ remains problematic. This paper considers the problem of estimating the unknown degree of sparsity of ${x}$ with only a small number of linear measurements. Although we show that estimation of ${\|x\|_0}$ is generally intractable in this framework, we consider an alternative measure of sparsity ${s(x):=\frac{\|x\|_1^2}{\|x\|_2^2}}$, which is a sharp lower bound on ${\|x\|_0}$, and is more amenable to estimation. When ${x}$ is a non-negative vector, we propose a computationally efficient estimator ${\hat{s}(x)}$, and use non-asymptotic methods to bound the relative error of ${\hat{s}(x)}$ in terms of a finite number of measurements. Remarkably, the quality of estimation is dimension-free, which ensures that ${\hat{s}(x)}$ is well-suited to the high-dimensional regime where ${n<. These results also extend naturally to the problem of using linear measurements to estimate the rank of a positive semi-definite matrix, or the sparsity of a non-negative matrix. Finally, we show that if no structural assumption (such as non-negativity) is made on the signal ${x}$, then the quantity ${s(x)}$ cannot generally be estimated when ${n<.

It’s a nice combination of the observation that the quotient ${s(x)}$ is a sharp lower bound for ${\|x\|_0}$ and that it is possible to estimate the one-norm and the two norm of a vector ${x}$ (with additional properties) from carefully chosen measurements. For a non-negative vector ${x}$ you just measure with the constant-one vector which (in a noisy environment) gives you an estimate of ${\|x\|_1}$. Similarly, measuring with Gaussian random vector you can obtain an estimate of ${\|x\|_2}$.

Then there is the dissertation of Dustin Mixon on the arxiv: Sparse Signal Processing with Frame Theory which is well worth reading but too long to provide a short overview. Here is the abstract:

Many emerging applications involve sparse signals, and their processing is a subject of active research. We desire a large class of sensing matrices which allow the user to discern important properties of the measured sparse signal. Of particular interest are matrices with the restricted isometry property (RIP). RIP matrices are known to enable eﬃcient and stable reconstruction of sfficiently sparse signals, but the deterministic construction of such matrices has proven very dfficult. In this thesis, we discuss this matrix design problem in the context of a growing field of study known as frame theory. In the ﬁrst two chapters, we build large families of equiangular tight frames and full spark frames, and we discuss their relationship to RIP matrices as well as their utility in other aspects of sparse signal processing. In Chapter 3, we pave the road to deterministic RIP matrices, evaluating various techniques to demonstrate RIP, and making interesting connections with graph theory and number theory. We conclude in Chapter 4 with a coherence-based alternative to RIP, which provides near-optimal probabilistic guarantees for various aspects of sparse signal processing while at the same time admitting a whole host of deterministic constructions.

By the way, the thesis is dedicated “To all those who never dedicated a dissertation to themselves.”

Further we have Proximal Newton-type Methods for Minimizing Convex Objective Functions in Composite Form by Jason D Lee, Yuekai Sun, Michael A. Saunders. This paper extends the well explored first order methods for problem of the type ${\min g(x) + h(x)}$ with Lipschitz-differentiable ${g}$ or simple ${\mathrm{prox}_h}$ to second order Newton-type methods. The abstract reads

We consider minimizing convex objective functions in composite form

$\displaystyle \min_{x\in\mathbb{R}^n} f(x) := g(x) + h(x)$

where ${g}$ is convex and twice-continuously differentiable and ${h:\mathbb{R}^n\rightarrow\mathbb{R}}$ is a convex but not necessarily differentiable function whose proximal mapping can be evaluated efficiently. We derive a generalization of Newton-type methods to handle such convex but nonsmooth objective functions. Many problems of relevance in high-dimensional statistics, machine learning, and signal processing can be formulated in composite form. We prove such methods are globally convergent to a minimizer and achieve quadratic rates of convergence in the vicinity of a unique minimizer. We also demonstrate the performance of such methods using problems of relevance in machine learning and high-dimensional statistics.

With this post I say goodbye for a few weeks of holiday.

Today I would like to comment on two arxiv-preprints I stumbled upon:

1. “Augmented L1 and Nuclear-Norm Models with a Globally Linearly Convergent Algorithm” – The Elastic Net rediscovered

The paper “Augmented L1 and Nuclear-Norm Models with a Globally Linearly Convergent Algorithm” by Ming-Jun Lai and Wotao Yin is another contribution to a field which is (or was?) probably the fastest growing field in applied mathematics: Algorithms for convex problems with non-smooth ${\ell^1}$-like terms. The “mother problem” here is as follows: Consider a matrix ${A\in{\mathbb R}^{m\times n}}$, ${b\in{\mathbb R}^m}$ try to find a solution of

$\displaystyle \min_{x\in{\mathbb R}^n}\|x\|_1\quad\text{s.t.}\quad Ax=b$

or, for ${\sigma>0}$

$\displaystyle \min_{x\in{\mathbb R}^n}\|x\|_1\quad\text{s.t.}\quad \|Ax-b\|\leq\sigma$

which appeared here on this blog previously. Although this is a convex problem and even has a reformulation as linear program, some instances of this problem are notoriously hard to solve and gained a lot of attention (because their applicability in sparse recovery and compressed sensing). Very roughly speaking, a part of its hardness comes from the fact that the problem is neither smooth nor strictly convex.

The contribution of Lai and Yin is that they analyze a slight perturbation of the problem which makes its solution much easier: They add another term in the objective; for ${\alpha>0}$ they consider

$\displaystyle \min_{x\in{\mathbb R}^n}\|x\|_1 + \frac{1}{2\alpha}\|x\|_2^2\quad\text{s.t.}\quad Ax=b$

or

$\displaystyle \min_{x\in{\mathbb R}^n}\|x\|_1 + \frac{1}{2\alpha}\|x\|_2^2\quad\text{s.t.}\quad \|Ax-b\|\leq\sigma.$

This perturbation does not make the problem smooth but renders it strongly convex (which usually makes the dual more smooth). It turns out that this perturbation makes life with this problem (and related ones) much easier – recovery guarantees still exists and algorithms behave better.

I think it is important to note that the “augmentation” of the ${\ell^1}$ objective with an additional squared ${\ell^2}$-term goes back to Zou and Hastie from the statistics community. There, the motivation was as follows: They observed that the pure ${\ell^1}$ objective tends to “overpromote” sparsity in the sense that if there are two columns in ${A}$ which are almost equally good in explaining some component of ${b}$ then only one of them is used. The “augmented problem”, however, tends to use both of them. They coined the method as “elastic net” (for reasons which I never really got).

I also worked on elastic-net problems for problems in the form

$\displaystyle \min_x \frac{1}{2}\|Ax-b\|^2 + \alpha\|x\|_1 + \frac{\beta}{2}\|x\|_2^2$

in this paper (doi-link). Here it also turns out that the problem gets much easier algorithmically. I found it very convenient to rewrite the elastic-net problem as

$\displaystyle \min_x \frac{1}{2}\|\begin{bmatrix}A\\ \sqrt{\beta} I\end{bmatrix}x-\begin{bmatrix}b\\ 0\end{bmatrix}\|^2 + \alpha\|x\|_1$

which turns the elastic-net problem into just another ${\ell^1}$-penalized problem with a special matrix and right hand side. Quite convenient for analysis and also somehow algorithmically.

2. Towards a Mathematical Theory of Super-Resolution

The second preprint is “Towards a Mathematical Theory of Super-Resolution” by Emmanuel Candes and Carlos Fernandez-Granda.

The idea of super-resolution seems to pretty old and, very roughly speaking, is to extract a higher resolution of a measured quantity (e.g. an image) than the measured data allows. Of course, in this formulation this is impossible. But often one can gain something by additional knowledge of the image. Basically, this also is the idea behind compressed sensing and hence, it does not come as a surprise that the results in compressed sensing are used to try to explain when super-resolution is possible.

The paper by Candes and Fernandez-Granada seems to be pretty close in spirit to Exact Reconstruction using Support Pursuit on which I blogged earlier. They model the sparse signal as a Radon measure, especially as a sum of Diracs. However, different from the support-pursuit-paper they use complex exponentials (in contrast to real polynomials). Their reconstruction method is basically the same as support pursuit: The try to solve

$\displaystyle \min_{x\in\mathcal{M}} \|x\|\quad\text{s.t.}\quad Fx=y, \ \ \ \ \ (1)$

i.e. they minimize over the set of Radon measures ${\mathcal{M}}$ under the constraint that certain measurements ${Fx\in{\mathbb R}^n}$ result in certain given values ${y}$. Moreover, they make a thorough analysis of what is “reconstructable” by their ansatz and obtain a lower bound on the distance of two Diracs (in other words, a lower bound in the Prokhorov distance). I have to admit that I do not share one of their claims from the abstract: “We show that one can super-resolve these point sources with infinite precision—i.e. recover the exact locations and amplitudes—by solving a simple convex program.” My point is that I can not see to what extend the problem (1) is a simple one. Well, it is convex, but it does not seem to be simple.

I want to add that the idea of “continuous sparse modelling” in the space of signed measures is very appealing to me and appeared first in Inverse problems in spaces of measures by Kristian Bredies and Hanna Pikkarainen.

Recently the recipients of Sloan Fellowships for 2012 has been announced. This is a kind grant/price awarded to young scientists, usually people who are assistant professors on the tenure track, from the U.S. and Canada. While the actual award is not exorbitant (but still large) the Sloan Fellowships seem to be a good indicator for further success in the career and, more importantly, for awaited breakthroughs by the fellows.

This year there are 20 mathematicians among the recipients and it appeared that I knew three of them:

• Rachel Ward (UTexas at Austin), a student of Ingrid Daubechies, has written a very nice PhD Thesis “Freedom through imperfection” on signal processing based on redundancy. Among several interesting works she has a very recent preprint Stable image reconstruction using total variation minimization which I plan to cover in a future post (according to the abstract the paper provides rigorous theory for exact recovery with discrete total variation which is cool).
• Ben Recht (University of Wisconsin, Madison) also works in the field of signal processing (among other fields) and is especially known for his work on exact matrix completion with compressed sensing techniques (together with Emmanuel Candes); I also liked his paper “The Convex Geometry of Linear Inverse Problems” (with Venkat Chandrasekaran, Pablo A. Parrilo, and Alan Willsky) with elaborated on the generalization of ${\ell^1}$-minimization and nuclear norm minimization.
• Greg Blekherman (Georgia Tech) works (among other things) on algebraic geometry and especially on the problem which non-negative polynomials (in more the one variable) can be written as sums of squares of polynomials, a question which is related to one of Hilbert’s 23 problems, namely Hilbert’s seventeenth problem. An interesting thing about non-negative polynomials and sums of squares is that: 1. Checking if a polynomial is non-negative is NP-hard. 2. Checking is a polynomial is a sum of squares can be done fast. 3. It seem that “most” non-negative polynomials are actually sums of squares but it unclear “how many of them”, see Greg’s chapter of the forthcoming book “Semidefinite Optimization and Convex Algebraic Geometry” here.

Congratulations!

Today I write about sparse recovery of “multidimensional” signal. With “multidimensional” I mean something like this: A one-dimensional signal is a vector ${x\in{\mathbb R}^n}$ while a two-dimensional signal is a matrix ${x\in{\mathbb R}^{n_1\times n_2}}$. Similarly, ${d}$-dimensional signal is ${x\in{\mathbb R}^{n_1\times\cdots\times n_d}}$. Of course, images as two-dimensional signals, come time mind. Moreover, movies are three-dimensional, a hyperspectral 2D image (which has a whole spectrum attached to any pixel) is also three-dimensional), and time-dependent volume-data is four-dimensional.

Multidimensional data is often a challenge due the large amount of data. While the size of the signals is usually not the problem it is more the size of the measurement matrices. In the context of compressed sensing or sparse recovery the signal is measured with a linear operator, i.e. one applies a number ${m}$ of linear functionals to the signal. In the ${d}$-dimensional case this can be encoded as a matrix ${A\in {\mathbb R}^{m\times \prod_1^d n_i}}$ and this is where the trouble with the data comes in: If you have megapixel image (which is still quite small) the matrix has a million of columns and if you have a dense matrix, storage becomes an issue.

One approach (which is indeed quite old) to tackle this problem is, to consider special measurement matrices: If the signal has a sparse structure is every slice, i.e. every vectors of the form ${x(i_1,\dots,i_{k-1},:,i_{k+1},\dots,i_d)}$ where we fix all but the ${k}$-th component, then the Kronecker product of measurement matrices for each dimension is the right thing.

The Kronecker product of two matrices ${A\in{\mathbb R}^{m\times n}}$ and ${B\in{\mathbb R}^{k\times j}}$ is the ${mk\times nj}$ matrix

$\displaystyle A\otimes B = \begin{bmatrix} a_{11}B & \dots & a_{1n}B\\ \vdots & & \vdots\\ a_{n1}B & \dots & a_{nn}B \end{bmatrix}.$

This has a lot to do with the tensor product and you should read the Wikipedia entry. Moreover, it is numerically advantageous not to build the Kronecker product of dense matrices if you only want to apply it to a given signal. To see this, we introduce the vectorization operator ${\text{vec}:{\mathbb R}^{m\times n}\rightarrow{\mathbb R}^{nm}}$ which takes a matrix ${X}$ and stacks its columns into a tall column vector. For matrices ${A}$ and ${B}$ (of fitting sizes) it holds that

$\displaystyle (B^T\otimes A)\text{vec}(X) = \text{vec}(AXB).$

So, multiplying ${X}$ from the left and from the right gives the application of the Kronecker product.

The use of Kronecker products in numerical linear algebra is fairly old (for example they are helpful for multidimensional finite difference schemes where you can build Kronecker products of sparse difference operators in respective dimensions). Recently, they have been discovered for compressed sensing and sparse recovery in these two papers: Sparse solutions to underdetermined Kronecker product systems by Sadegh Jokar and Volker Mehrmann and the more recent Kronecker Compressed Sensing by Marco Duarte and Rich Baraniuk.

From these papers you can extract some interestingly simple and nice theorems:

Theorem 1 For matrices ${A_1,\dots A_d}$ with restricted isometry constant ${\delta_K(A_1),\dots,\delta_K(A_d)}$ of order ${K}$ it holds that the restricted isometry constant of the Kronecker product fulfills

$\displaystyle \max_i \delta_K(A_i) \leq \delta_K(A_1\otimes\cdots\otimes A_d) \leq \prod_1^d (1+\delta_K(A_i))-1.$

Basically, the RIP constant of a Kronecker product is not better than the worst one but still not too large.

Theorem 2 For matrices ${A_1,\dots, A_d}$ with columns normalized to one, it hold that the spark of their Kronecker product fulfills

$\displaystyle \text{spark}(A\otimes \dots\otimes A_d) = = \min_i\text{spark}(A_i).$

Theorem 3 For matrices ${A_1,\dots, A_d}$ with columns normalized to one, it hold that the mutual coherence of their Kronecker product fulfills

$\displaystyle \mu(A_1\otimes\dots\otimes A_d) = \max_i \mu(A_i).$

In a recent post I wrote about time varying channels. These were operators, described by the spreading function ${s:{\mathbb R}\times{\mathbb R}\rightarrow{\mathbb C}}$, mapping an input signal ${x}$ to an output signal ${y}$ via

$\displaystyle y(t) = H_s x(t) = \int_{{\mathbb R}}\int_{{\mathbb R}} s(\tau,\nu)x(t-\tau)\mathrm{e}^{\mathrm{i}\nu t}d\nu d\tau.$

1. The problem statement

In this post I write about the problem of identifying the channels from input-output measurements. More precisely, the question is as follows:

Is it possible to choose a single input signal such that the corresponding output signal allows the computation of the spreading function?

At first sight this seems hopeless: The degrees of freedom we have is a single function ${x:{\mathbb R}\rightarrow{\mathbb C}}$ and we look for a function ${s:{\mathbb R}\times{\mathbb R}\rightarrow{\mathbb C}}$. However, additional assumptions on ${s}$ may help and indeed there is the following identification theorem due to Kailath:

Theorem 1 (Kailath) The channel ${H_s}$ can be identified if the support of the spreading function ${s}$ is contained in a rectangle with area smaller than ${2\pi}$.

Probably this looks a bit weird: The size of the support shall allow for identification? However, after turning the problem a bit around one sees that there is an intuitive explanation for this theorem which also explains the constant ${2\pi}$.

2. Translation to integral operators

The first step is, to translate the problem into one with integral operators of the form

$\displaystyle F_k x(t) = \int_{\mathbb R} k(t,\tau)x(\tau)d\tau.$

We observe that ${k}$ is linked to ${s}$ via

$\displaystyle k(t,\tau) = \int s(t-\tau,\nu)e^{i\nu t}d\nu = (2\pi)^{1/2}\mathcal{F}_2^{-1}(s(t-\tau,\cdot))(t). \ \ \ \ \ (1)$

and hence, identifying ${s}$ is equivalent to identifying ${k}$ (by the way: I use the same normalization of the Fourier transform as in this post).

From my point of view, the formulation of the identification problem is a bit clearer in the form of the integral operator ${F_k}$. It very much resembles “matrix-vector multiplication”: First you multiply ${k(t,\tau)}$ with you input signal ${x}$ in ${\tau}$ (forming “pointwise products of the rows and the input vector” as in matrix-vector multiplication) and then integrate with respect to ${t}$ (“summing up” the result).

3. Sending Diracs through the channel

Now, let’s see, if we can identify ${k}$ from a single input-output measurement. We start, and send a single Dirac impulse through the channel: ${x = \delta_0}$. This formally gives

$\displaystyle y(t) = \int_{\mathbb R} k(t,\tau)\delta_0(\tau)d\tau = k(t,0).$

Ok, this is not too much. From the knowledge of ${y}$ we can directly infer the values ${k(t,0)}$ but that’s it – nothing more.

But we could also send a Dirac at a different position ${\tau_0}$, ${x = \delta_{\tau_0}}$ and obtain

$\displaystyle y(t) = \int_{\mathbb R} k(t,\tau)\delta_{\tau_0}(\tau)d\tau = k(t,\tau_0).$

Ok, different Diracs give us information about ${k}$ for all ${t}$ but only for a single ${\tau_0}$. Let’s send a whole spike train (or Dirac comb) of “width” ${c>0}$: ${x = \Delta_c = \sum_{n\in{\mathbb Z}} \delta_{cn}}$. We obtain

$\displaystyle y(t) = \int_{\mathbb R} k(t,\tau)\sum_{n\in{\mathbb Z}}\delta_{nc}(\tau)d\tau = \sum_{n\in{\mathbb Z}}k(t,cn).$

Oh – this does not reveal a single value of ${k}$ but “aggregated” values…

4. Choosing the right spike train

Now let’s translate the one condition on ${s}$ we have to a condition on ${k}$: The support of ${s}$ is contained in a rectangle with area less then ${2\pi}$. Let’s assume that this rectangle in centered around zero (which is no loss of generality) and denote it by ${[-a/2,a/2]\times [-b/2,b/2]}$, i.e. ${s(\tau,\nu) = 0}$ if ${|\tau|>a/2}$ or ${|\nu|>b/2}$ (and of course ${ab<2\pi}$). From (1) we conclude that

$\displaystyle k(t,\tau) = 0\ \text{ for }\ |t-\tau|>a/2.$

In the ${(t,\tau)}$-plane, the support looks like this:

If we now send this spike train in the channel, we can visualize the output as follows:

We observe that in this case we can infer the values of ${k(t,nc)}$ at some points ${t}$ but at other points we have an overlap and only observe a sum of ${k(t,nc)}$ and ${k(t,(n+1)c)}$. But if we make ${c}$ large enough, namely larger than ${a}$, we identify the values of ${k(t,nc)}$ exactly in the output signal!

5. Is that enough?

Ok, now we know something about ${k}$ but how can we infer the rest of the values? Don’t forget, that we know that ${ab < 2\pi}$ and that we know that ${s(\tau,\nu)}$ is supported in ${[-a/2,a/2]\times[-b/2,b/2]}$. Up to now we only used the value of ${a}$. How does the support limitation if ${\nu}$ translates to ${k}$? We look again at (1) and see: The function ${\tilde k_t:\tau\mapsto k(t,t-\tau)}$ is bandlimited with bandwidth ${b/2}$ for every ${t}$. And what do we know about these functions ${\tilde k_t}$? We know the samples ${k(t,nc) = \tilde k_t(t-nc)}$. In other words, we have samples of ${\tilde k_t}$ with the sampling rate ${c}$. But there is the famous Nyquist-Shannon sampling theorem:

Theorem 2 (Nyquist-Shannon) A function ${f:{\mathbb R}\rightarrow{\mathbb C}}$ which is bandlimited with bandwidth ${B}$ is totally determined by its discrete time samples ${f(n\pi/B)}$, ${n\in{\mathbb Z}}$ and it holds that

$\displaystyle f(t) = \sum_{n\in{\mathbb Z}} f(n\pi/N)\textup{sinc}\Big(\tfrac{B}{\pi}(x - \tfrac{n\pi}{B})\Big).$

We are happy with the first assertion: The samples totally determine the function if they are dense enough. In our case the bandwidth of ${\tilde k_t}$ is ${b/2}$ and hence, we need the sampling rate ${2\pi/b}$. What we have, are samples with rate ${c}$.

We collect our conditions on ${c}$: We need ${c}$

• larger than ${a}$ to separate the values of ${k(t,nc)}$ in the output.
• smaller than ${2\pi/b}$ to determine the full functions ${\tilde k_t(\tau) = k(t,t-\tau)}$.

Both together say

$\displaystyle a < c < \frac{2\pi}{b}.$

Such a ${c}$ exists, if ${ab < 2\pi}$ and we have proven Theorem 1.

6. Concluding remarks

The proof of Kailath’s theorem reveal the role of the constraint ${ab <2\pi}$: We want ${a}$ not to be too large to be able to somehow separate the values of ${k(t,nc)}$ in the output measurement and we need ${b}$ not too large to ensure that we can interpolate the values ${k(t,nc)}$ exactly.

However, a severe drawback of this results is that one needs to know ${a}$ to construct the “sensing signal” which was the spike train ${\Delta_c}$. Moreover, this sensing signal has itself an infinite bandwidth which is practically not desirable. It seems to be an open question whether there are more practical sensing signals.

There is more known about sensing of time varying channels: The support of ${s}$ does not have to be in a rectangle, it is enough if the measure of the support is smaller than ${2\pi}$ (a result usually attributed to Bello). Moreover, there are converse results which say that linear and stable identification is only possible under this restriction on the support size (see the work of Götz Pfander and coauthors).

Next Page »