### Signal and image processing

I fell a little bit behind on reporting on my new preprints. In this posts I’ll blog on two closely related ones; one of them already a bit old, the other one quite recent:

The papers are

As clear from the titles, both papers treat a similar method. The first paper contains all the theory and the second one has few particularly interesting applications.

In the first paper we propose to view several known algorithms such as the linearized Bregman method, the Kaczmarz method or the Landweber method from a different angle from which they all are special cases of another algorithm. To start with, consider a linear system

$\displaystyle Ax=b$

with ${A\in{\mathbb R}^{m\times n}}$. A fairly simple and old method to solve this, is the Landweber iteration which is

$\displaystyle x^{k+1} = x^k - t_k A^T(Ax^k-b).$

Obviously, this is nothing else than a gradient descent for the functional ${\|Ax-b\|_2^2}$ and indeed converges to a minimizer of this functional (i.e. a least squares solution) if the stepsizes ${t_k}$ fulfill ${\epsilon\leq t_k\leq 2\|A\|^{-2} - \epsilon}$ for some ${\epsilon>0}$. If one initializes the method with ${x^0=0}$ it converges to the least squares solution with minimal norm, i.e. to ${A^\dag b}$ (with the pseudo-inverse ${A^\dag}$).

A totally different method is even older: The Kaczmarz method. Denoting by ${a_k}$ the ${k}$-th row of ${A}$ and ${b_k}$ the ${k}$-th entry of ${b}$ the method reads as

$\displaystyle x^{k+1} = x^k - a_{r(k)}^T\frac{a_{r(k)}\cdot x^k - b_k}{\|a_{r(k)}\|_2^2}$

where ${r(k) = (k\mod m) +1}$ or any other “control sequence” that picks up every index infinitely often. This method also has a simple interpretation: Each equation ${a_k\cdot x = b_k}$ describes a hyperplane in ${{\mathbb R}^n}$. The method does nothing else than projecting the iterates orthogonally onto the hyperplanes in an iterative manner. In the case that the system has a solution, the method converges to one, and if it is initialized with ${x^0=0}$ we have again convergence to the minimum norm solution ${A^\dag b}$.

There is yet another method that solves ${Ax=b}$ (but now it’s a bit more recent): The iteration produces two sequences of iterates

$\displaystyle \begin{array}{rcl} z^{k+1} & = &z^k - t_k A^T(Ax^k - b)\\ x^{k+1} & = &S_\lambda(z^{k+1}) \end{array}$

for some ${\lambda>0}$, the soft-thresholding function ${S_\lambda(x) = \max(|x|-\lambda,0)\mathrm{sgn}(x)}$ and some stepsize ${t_k}$. For reasons I will not detail here, this is called the linearized Bregman method. It also converges to a solution of the system. The method is remarkably similar, but different from, the Landweber iteration (if the soft-thresholding function wouldn’t be there, both would be the same). It converges to the solution of ${Ax=b}$ that has the minimum value for the functional ${J(x) = \lambda\|x\|_1 + \tfrac12\|x\|_2^2}$. Since this solution of close, and for ${\lambda}$ large enough identical, to the minimum ${\|\cdot\|_1}$ solution, the linearized Bregman method is a method for sparse reconstruction and applied in compressed sensing.

Now we put all three methods in a joint framework, and this is the framework of split feasibility problems (SFP). An SFP is a special case of a convex feasibility problems where one wants to find a point ${x}$ in the intersection of multiple simple convex sets. In an SFP one has two different kinds of convex constraints (which I will call “simple” and “difficult” in the following):

1. Constraints that just demand that ${x\in C_i}$ for some convex sets ${C_i}$. I call these constraints “simple” because we assume that the projection onto each ${C_i}$ is simple to obtain.
2. Constraints that demand ${A_ix\in Q_i}$ for some matrices ${A_i}$ and simple convex sets ${Q_i}$. Although we assume that projections onto the ${Q_i}$ are easy, these constraints are “difficult” because of the presence of the matrices ${A_i}$.

If there were only simple constraints a very basic method to solve the problem is the methods of alternating projections, also known as POCS (projection onto convex sets): Simply project onto all the sets ${C_i}$ in an iterative manner. For difficult constraints, one can do the following: Construct a hyperplane ${H_k}$ that separates the current iterate ${x^k}$ from the set defined by the constraint ${Ax\in Q}$ and project onto the hyperplane. Since projections onto hyperplanes are simple and since the hyperplane separates we move closer to the constraint set and this is a reasonable step to take. One such separating hyperplane is given as follows: For ${x^k}$ compute ${w^k = Ax^k-P_Q(Ax^k)}$ (with the orthogonal projection ${P_Q}$) and define

$\displaystyle H_k = \{x\ : (A^Tw^k)^T\cdot x \leq (A^Tw^k)^T\cdot x^k - \|w^k\|_2^2\}.$

Illustration of projections onto convex sets and separating hyperplanes

Now we already can unite the Landweber iteration and the Kaczmarz method as follows: Consider the system ${Ax=b}$ as a split feasibility problem in two different ways:

1. Treat ${Ax=b}$ as one single difficult constraint (i.e. set ${Q=\{b\}}$). Some calculations show that the above proposed method leads to the Landweber iteration (with a special stepsize).
2. Treat ${Ax=b}$ as ${m}$ simple constraints ${a_i\cdot x = b_i}$. Again, some calculations show that this gives the Kaczmarz method.

Of course, one could also work “block-wise” and consider groups of equations as difficult constraints to obtain “block-Kaczmarz methods”.

Now comes the last twist: By adapting the term of “projection” one gets more methods. Particularly interesting is the notion of Bregman projections which comes from Bregman distances. I will not go into detail here, but Bregman distances are associated to convex functionals ${J}$ and by replacing “projection onto ${C_i}$ or hyperplanes” by respective Bregman projections, one gets another method for split feasibility problems. The two things I found remarkable:

• The Bregman projection onto hyperplanes is pretty simple. To project some ${x^k}$ onto the hyperplane ${H = \{x\ :\ a^T\cdot x\leq \beta\}}$, one needs a subgradient ${z^k\in\partial J(x^k)}$ (in fact an “admissible one” but for that detail see the paper) and then performs

$\displaystyle x^{k+1} = \nabla J^*(z^k - t_k a)$

(${J^*}$ is the convex dual of ${J}$) with some appropriate stepsize ${t_k}$ (which is the solution of a one-dimensional convex minimization problem). Moreover, ${z^{k+1} = z^k - t_k a}$ is a new admissible subgradient at ${x^{k+1}}$.

• If one has a problem with a constraint ${Ax=b}$ (formulated as an SFP in one way or another) the method converges to the minimum-${J}$ solution of the equation if ${J}$ is strongly convex.

Note that strong convexity of ${J}$ implies differentiability of ${J^*}$ and Lipschitz continuity of ${\nabla J}$ and hence, the Bregman projection can indeed be carried out.

Now one already sees how this relates to the linearized Bregman method: Setting ${J(x) = \lambda\|x\|_1 + \tfrac12\|x\|_2^2}$, a little calculation shows that

$\displaystyle \nabla J^*(z) = S_\lambda(z).$

Hence, using the formulation with a “single difficult constraint” leads to the linearized Bregman method with a specific stepsize. It turns out that this stepsize is a pretty good one but also that one can show that a constant stepsize also works as long as it is positive and smaller that ${2\|A\|^{-2}}$.

In the paper we present several examples how one can use the framework. I see one strengths of this approach that one can add convex constraints to a given problem without getting into any trouble with the algorithmic framework.

The second paper extends a remark that we make in the first one: If one applies the framework of the linearized Bregman method to the case in which one considers the system ${Ax=b}$ as ${m}$ simple (hyperplane-)constraints one obtains a sparse Kaczmarz solver. Indeed one can use the simple iteration

$\displaystyle \begin{array}{rcl} z^{k+1} & = &z^k - a_{r(k)}^T\frac{a_{r(k)}\cdot x^k - b_k}{\|a_{r(k)}\|_2^2}\\ x^{k+1} & = &S_\lambda(z^{k+1}) \end{array}$

and will converge to the same sparse solution as the linearized Bregman method.

This method has a nice application to “online compressed sensing”: We illustrate this in the paper with an example from radio interferometry. There, large arrays of radio telescopes collect radio emissions from the sky. Each pair of telescopes lead to a single measurement of the Fourier transform of the quantity of interest. Hence, for ${k}$ telescopes, each measurement gives ${k(k-1)/2}$ samples in the Fourier domain. In our example we used data from the Very Large Array telescope which has 27 telescopes leading to 351 Fourier samples. That’s not much, if one want a picture of the emission with several ten thousands of pixels. But the good thing is that the Earth rotates (that’s good for several reasons): When the Earth rotates relative to the sky, the sampling pattern also rotates. Hence, one waits a small amount of time and makes another measurement. Commonly, this is done until the earth has made a half rotation, i.e. one complete measurement takes 12 hours. With the “online compressed sensing” framework we proposed, one can start reconstructing the image as soon the first measurements have arrived. Interestingly, one observes the following behavior: If one monitors the residual of the equation, it goes down during iterations and jumps up when new measurements arrive. But from some point on, the residual stays small! This says that the new measurements do not contradict the previous ones and more interestingly this happened precisely when the reconstruction error dropped down such that “exact reconstruction” in the sense of compressed sensing has happened. In the example of radio interferometry, this happened after 2.5 hours!

Reconstruction by online compressed sensing

You can find slides of a talk I gave at the Sparse Tomo Days here.

In this post I will explore a bit the question on how to calculate the discrete gradient and the discrete divergence of an image and a vector field, respectively.

Let ${u_0\in{\mathbb R}^{N\times M}}$ be a discrete image, i.e. ${u_0(i,j)}$ denotes the gray value at the ${i,j}$-th pixel. The famous total variation denoising amounts to minimizing the functional

$\displaystyle \Phi(u) = \tfrac12 \int (u-u_0)^2 + \lambda\int|\nabla u|$

where the integral shall be understood as summation, ${\nabla}$ stand for the gradient, i.e. the vector of the partial derivatives, and ${|\cdot|}$ stands for the euclidean absolute value.

When using primal-dual methods, it is of crucial importance, that the used operators for the gradient and the divergence are numerically adjoint (up to the minus sign). That is, the numerical operations are adjoint in the following sense: If grad is the operation for the gradient and div is the operation for the divergence, then it should hold for any variables ${u}$ and ${v}$ of suitable size and with gradu = grad(u), divv = div(v) that sum(gradu(:).*v(:)) and -sum(u(:).*divv(:)) are equal up to numerical precision. Due to the boundary treatment, the internal MATLAB operations gradient and divergence do not fulfill this requirement.

The most common discretization of the gradient uses discrete forward differences and a constant padding at the boundary (which means that Neumann boundary values are applied). In formula, this reads as

$\displaystyle (D_xu)_{i,j} = \begin{cases} u_{i+1,j} - u_{i,j} & i

The respective adjoints are backward differences with zero boundary treatment (check it!). Apparently, there are many different ways to implement this routines (and the respective adjoints) in MATLAB. Here are four of them:

1. For loops: Run through all pixels in two for-loops and assign the difference to the output. Of course, one should preallocate the output prior to the loop. But you may probably know the first two rules of MATLAB coding? If not, here they are: 1. Avoid for-loops. 2. Avoid for-loops, seriously. I put the routines into extra functions and created anonymous function to call the gradient and the divergence as
div = @(V) dxm(V(:,:,1)) + dym(V(:,:,2));

2. Shift and subtract: MATLAB is great in using vectors and matrices. And to avoid the for-loop one could also implement the forward difference in ${x}$-direction by shifting the matrix and subtract the original one, i.e. [u(:,2:end) u(:,end)] - u (and similarly for the other differences). Again, I wrote extra functions and used anonymous function as above.
3. Small sparse matrices from the left and from the right: MATLAB is also pretty good with sparse matrices. Since the derivatives in ${x}$-direction only involve the subtraction of two elements in the same column, one can realize this by multiplying an image from the left with a sparse diagonal matrix with just two non-zero diagonals. Similarly, the derivative in ${y}$-direction can be realized by multiplying from the right with a suitable matrix. More precisely, this approach is realized by
Dy = spdiags([-ones(M,1) ones(M,1)],[0 1],M,M);
Dy(M,:) = 0;
Dx = spdiags([-ones(N,1) ones(N,1)],[0 1],N,N);
Dy(N,:) = 0;
Dxu = Dx*u;
Dyu = u*Dy';

(check it). Note that the adjoint of ${x}$-derivative if simple the operation Dx'*u and the operation of the ${y}$-derivative is u*Dy. Together, the calculation of the gradient and the divergence was done by the anonymous functions

div = @(V) V(:,:,1)*Dx + Dy'*V(:,:,2);

4. Large sparse matrices: One could think about the following: Vectorize the image by U = u(:) (which amounts to stacking the columns above each other). Then assemble a large ${NM\times NM}$ sparse matrix which has just two non-zero diagonals to do the forward (and other) differences. More precisely this can be done by
(with Dx and Dy from above)

DX = kron(Dx,speye(M));
DY = kron(speye(N),Dy);
DxU = DX*U;
DyU = DY*U;

Here, it is clear that the respective adjoint are just the multiplication with the transposed matrices. Here, the anonymous functions are

div = @(V) DX'*V(:,1) + DY'*V(:,2);

The different approaches have different pros and cons. Well, the for-loop only has cons: It is presumably slow and uses a lot indexing which easily leads to bugs. The shift-and-subtract method should go into an extra function to make it easy to use – but this is not necessarily a drawback. For the multiplication with the large matrix, one has to vectorize the image first and every time one wants to look at the result and need to do reshape(U,N,M). But let’s focus on speed: I implemented all methods, and let the run on square images of different sizes for 50 times (after a warmup) and measures the execution times with tic and toc. The assembly of the matrices did not enter the timing. Moreover, no parallel computation was used – just a single core. Finally, memory was not an issue since the larges matrices (of size ${2500\times 2500\times 2}$) only need roughly 100MB of memory.

Here is the table with the average times for one calculation of the gradient (in seconds):

 ${N}$ For-loop Shift-subtract left-right left 100 0.0004 0.0002 0.0001 0.0003 200 0.0018 0.0010 0.0005 0.0015 300 0.0057 0.0011 0.0014 0.0020 400 0.0096 0.0031 0.0022 0.0035 500 0.0178 0.0035 0.0030 0.0054 750 0.0449 0.0114 0.0097 0.0123 1000 0.0737 0.0189 0.0128 0.0212 1500 0.2055 0.0576 0.0379 0.0601 2000 0.3942 0.0915 0.0671 0.1136 2500 0.6719 0.1571 0.1068 0.1788

and here is the one for the divergences:

 ${N}$ For-loop Shift-subtract left-right left 100 0.0004 0.0003 0.0002 0.0002 200 0.0018 0.0015 0.0005 0.0008 300 0.0048 0.0016 0.0015 0.0012 400 0.0090 0.0036 0.0020 0.0022 500 0.0158 0.0057 0.0027 0.0035 750 0.0409 0.0132 0.0073 0.0069 1000 0.0708 0.0238 0.0130 0.0125 1500 0.2008 0.0654 0.0344 0.0370 2000 0.3886 0.1285 0.0622 0.0671 2500 0.6627 0.2512 0.1084 0.1361

As expected, the for-loop is clearly slower and also as expected, all methods basically scale quadratically (doubling ${N}$ amounts to multiplying the running time by four) since the work per pixel is constant. A little surprisingly, the multiplication from the left and from the right is fastest and also consistently a little faster then multiplication from the left with the larger sparse matrix. I don’t know, why the results are that different for the gradient and for the divergence. Maybe this is related to my use on anonymous functions or the allocation of memory?

Another few notes to myself:

ISMP is over now and I’m already home. I do not have many things to report on from the last day. This is not due the lower quality of the talks but due to the fact that I was a little bit exhausted, as usual at the end of a five-day conference. However, I collect a few things for the record:

• In the morning I visited the semi-planary by Xiaojun Chenon non-convex and non-smooth minimization with smoothing methods. Not surprisingly, she treated the problem

$\displaystyle \min_x f(x) + \|x\|_p^p$

with convex and smooth ${f:{\mathbb R}^n\rightarrow{\mathbb R}}$ and ${0. She proposed and analyzed smoothing methods, that is, to smooth the problem a bit to obtain a Lipschitz-continuous objective function ${\phi_\epsilon}$, minimizing this and then gradually decreasing ${\epsilon}$. This works, as she showed. If I remember correctly, she also treated “iteratively reweighted least squares” as I described in my previous post. Unfortunately, she did not include the generalized forward-backward methods based on ${\text{prox}}$-functions for non-convex functions. Kristian and I pursued this approach in our paper Minimization of non-smooth, non-convex functionals by iterative thresholding and some special features of our analysis include:

• A condition which excludes some (but not all) local minimizers from being global.
• An algorithm which avoids this non-global minimizers by carefully adjusting the steplength of the method.
• A result that the number of local minimizers is still finite, even if the problem is posed in ${\ell^2({\mathbb N})}$ and not in ${{\mathbb R}^n}$.

Most of our results hold true, if the ${p}$-quasi-norm is replaced by functions of the form

$\displaystyle \sum_n \phi_n(|x_n|)$

with special non-convex ${\phi}$, namely fulfilling a list of assumptions like

• ${\phi'(x) \rightarrow \infty}$ for ${x\rightarrow 0}$ (infinite slope at ${0}$) and ${\phi(x)\rightarrow\infty}$ for ${x\rightarrow\infty}$ (mild coercivity),
• ${\phi'}$ strictly convex on ${]0,\infty[}$ and ${\phi'(x)/x\rightarrow 0}$ for ${x\rightarrow\infty}$,
• for each ${b>0}$ there is ${a>0}$ such that for ${x it holds that ${\phi(x)>ax^2}$, and
• local integrability of some section of ${\partial\phi'(x) x}$.

As one easily sees, ${p}$-quasi-norms fulfill the assumptions and some other interesting functions as well (e.g. some with very steep slope at ${0}$ like ${x\mapsto \log(x^{1/3}+1)}$).

• Jorge Nocedalgave a talk on second-order methods for non-smooth problems and his main example was a functional like

$\displaystyle \min_x f(x) + \|x\|_1$

with a convex and smooth ${f}$, but different from Xiaojun Chen, he only considered the ${1}$-norm. His talked is among the best planary talks I have ever attended and it was a great pleasure to listen to him. He carefully explained things and put them in perspective. In the case he skipped slides, he made me feel that I either did not miss an important thing, or understood them even though he didn’t show them He argued that it is not necessarily more expensive to use second order information in contrast to first order methods. Indeed, the ${1}$-norm can be used to reduce the number of degrees of freedom for a second order step. What was pretty interesting is, that he advocated semismooth Newton methods for this problem. Roland and I pursued this approach some time ago in our paper A Semismooth Newton Method for Tikhonov Functionals with Sparsity Constraints and, if I remember correctly (my notes are not complete at this point), his family of methods included our ssn-method. The method Roland and I proposed worked amazingly well in the cases in which it converged but the method suffered from non-global convergence. We had some preliminary ideas for globalization, which we could not tune enough to retain the speed of the method, and abandoned the topic. Now, that the topic will most probably be revived by the community, I am looking forward to fresh ideas here.

Today there are several things I could blog on. The first is the planary by Rich Baraniuk on Compressed Sensing. However, I don’t think that I could reflect the content in a way which would be helpful for a potential reader. Just for the record: If you have the chance to visit one of Rich’s talk: Do it!

The second thing is the talk by Bernd Hofmann on source conditions, smoothness and variational inequalities and their use in regularization of inverse problems. However, this would be too technical for now and I just did not take enough notes to write a meaningful post.

As a third thing I have the talk by Christian Clason on inverse problems with uniformly distributed noise. He argued that for uniform noise it is much better to use an ${L^\infty}$ discrepancy term instead of the usual ${L^2}$-one. He presented a path-following semismooth Newton method to solve the problem

$\displaystyle \min_x \frac{1}{p}\|Kx-y^\delta\|_\infty^p + \frac{\alpha}{2}\|x\|_2^2$

and showed examples with different kinds of noise. Indeed the examples showed that ${L^\infty}$ works much better than ${L^2}$ here. But in fact it works even better, if the noise is not uniformly distributed but “impulsive” i.e. it attains bounds ${\pm\delta}$ almost everywhere. It seems to me that uniform noise would need a slightly different penalty but I don’t know which one – probably you do? Moreover, Christian presented the balancing principle to choose the regularization parameter (without knowledge about the noise level) and this was the first time I really got what it’s about. What one does here is, to choose ${\alpha}$ such that (for some ${\sigma>0}$ which only depends on ${K}$, but not on the noise)

$\displaystyle \sigma\|Kx_\alpha^\delta-y^\delta\|_\infty = \frac{\alpha}{2}\|x_\alpha^\delta\|_2^2.$

The rational behind this is, that the left hand side is monotonically non-decreasing in ${\alpha}$, while the right hand side is monotonically non-increasing. Hence, there should be some ${\alpha}$ “in the middle” which make both somewhat equally large. Of course, we do neither want to “over-regularize” (which would usually “smooth too much”) nor to “under-regularize” (which would not eliminate noise). Hence, balancing seems to be a valid choice. From a practical point of view the balancing is also nice because one can use the fixed-point iteration

$\displaystyle \alpha^{n+1} = 2\sigma\frac{\|Kx_{\alpha^n}^\delta - y^\delta\|_\infty}{\|x_{\alpha_n}^\delta\|_2^2}$

which converges in a few number of iterations.

Then there was the talk by Esther Klann, but unfortunately, I was late so only heard the last half…

Last but not least we have the talk by Christiane Pöschl. If you are interested in Total-Variation-Denoising (TV denoising), then you probably have heard many times that “TV denoising preserves edges” (have a look at the Wikipedia page – it claims this twice). What Christiane showed (in a work with Vicent Caselles and M. Novaga) that this claim is not true in general but only for very special cases. In case of characteristic functions, the only functions for which the TV minimizer has sharp edges are these so-called calibrated sets, introduced by Caselles et el. Building on earlier works by Caselles and co-workers she calculated exact minimizers for TV denoising in the case that the image consists of characteristic functions of two convex sets or of a single star shaped domain, that is, for a given set $B$ she calculated the solution of

$\displaystyle \min_u\int (u - \chi_B)^2dx + \lambda \int|Du|.$

This is not is as easy as it may sound. Even for the minimizer for a single convex set one has to make some effort. She presented a nice connection of the shape of the obtained level-sets with the morphological operators of closing and opening. With the help of this link she derived a methodology to obtain the exact TV denoising minimizer for all parameters. I do not have the images right now but be assured that most of the time, the minimizers do not have sharp edges all over the place. Even for simple geometries (like two rectangles touching in a corner) strange things happen and only very few sharp edges appear. I’ll keep you posted in case the paper comes out (or appears as a preprint).

Christiane has some nice images which make this much more clear:

For two circles edges are preserved if they are far enough away from each other. If they are close, the area “in between” them is filled and, moreover, obey this fuzzy boundary. I remember myself seeing effects like this in the output of TV-solvers and thinking “well, it seems that the algorithm is either not good or not converged yet – TV should output sharp edges!”.

For a star-shaped shape (well, actually a star) the output looks like this. The corners are not only rounded but also blurred and this is true both for the “outer” corners and the “inner” corners.

So, if you have any TV-minimizing code, go ahead and check if your code actually does the right things on images like this!
Moreover, I would love to see similar results for more complicated extensions of TV like Total Generalized Variation, I treated here.

The scientific program at ISMP started today and I planned to write a small personal summary of each day. However, it is a very intense meeting. Lot’s of excellent talks, lot’s of people to meet and little spare time. So I’m afraid that I have to deviate from my plan a little bit. Instead of a summary of every day I just pick out a few events. I remark that these picks do not reflect quality, significance or something like this in any way. I just pick things for which I have something to record for personal reasons.

My day started after the first plenary which the session Testing environments for machine learning and compressed sensing in which my own talk was located. The session started with the talk by Michael Friedlander of the SPOT toolbox. Haven’t heard of SPOT yet? Take a look! In a nutshell its a toolbox which turns MATLAB into “OPLAB”, i.e. it allows to treat abstract linear operators like matrices. By the way, the code is on github.

The second talk was by Katya Scheinberg (who is giving a semi-planary talk on derivative free optimization at the moment…). She talked about speeding up FISTA by cleverly adjusting step-sizes and over-relaxation parameters and generalizing these ideas to other methods like alternating direction methods. Notably, she used the “SPEAR test instances” from our project homepage! (And credited them as “surprisingly hard sparsity problems”.)

My own talk was the third and last one in that session. I talked about the issue of constructing test instance for Basis Pursuit Denoising. I argued that the naive approach (which takes a matrix ${A}$, a right hand side ${b}$ and a parameter ${\lambda}$ and let some great solver run for a while to obtain a solution ${x^*}$) may suffer from “trusted method bias”. I proposed to use “reverse instance construction” which is: First choose ${A}$, ${\lambda}$ and the solution ${x^*}$ and the construct the right hand side ${b}$ (I blogged on this before here).

Last but not least, I’d like to mention the talk by Thomas Pock: He talked about parameter selection on variational models (think of the regularization parameter in Tikhonov, for example). In a paper with Karl Kunisch titled A bilevel optimization approach for parameter learning in variational models they formulated this as a bi-level optimization problem. An approach which seemed to have been overdue! Although they treat somehow simple inverse problems (well, denoising) (but with not so easy regularizers) it is a promising first step in this direction.

In this post I just collect a few papers that caught my attention in the last moth.

I begin with Estimating Unknown Sparsity in Compressed Sensing by Miles E. Lopes. The abstract reads

Within the framework of compressed sensing, many theoretical guarantees for signal reconstruction require that the number of linear measurements ${n}$ exceed the sparsity ${\|x\|_0}$ of the unknown signal ${x\in\mathbb{R}^p}$. However, if the sparsity ${\|x\|_0}$ is unknown, the choice of ${n}$ remains problematic. This paper considers the problem of estimating the unknown degree of sparsity of ${x}$ with only a small number of linear measurements. Although we show that estimation of ${\|x\|_0}$ is generally intractable in this framework, we consider an alternative measure of sparsity ${s(x):=\frac{\|x\|_1^2}{\|x\|_2^2}}$, which is a sharp lower bound on ${\|x\|_0}$, and is more amenable to estimation. When ${x}$ is a non-negative vector, we propose a computationally efficient estimator ${\hat{s}(x)}$, and use non-asymptotic methods to bound the relative error of ${\hat{s}(x)}$ in terms of a finite number of measurements. Remarkably, the quality of estimation is dimension-free, which ensures that ${\hat{s}(x)}$ is well-suited to the high-dimensional regime where ${n<. These results also extend naturally to the problem of using linear measurements to estimate the rank of a positive semi-definite matrix, or the sparsity of a non-negative matrix. Finally, we show that if no structural assumption (such as non-negativity) is made on the signal ${x}$, then the quantity ${s(x)}$ cannot generally be estimated when ${n<.

It’s a nice combination of the observation that the quotient ${s(x)}$ is a sharp lower bound for ${\|x\|_0}$ and that it is possible to estimate the one-norm and the two norm of a vector ${x}$ (with additional properties) from carefully chosen measurements. For a non-negative vector ${x}$ you just measure with the constant-one vector which (in a noisy environment) gives you an estimate of ${\|x\|_1}$. Similarly, measuring with Gaussian random vector you can obtain an estimate of ${\|x\|_2}$.

Then there is the dissertation of Dustin Mixon on the arxiv: Sparse Signal Processing with Frame Theory which is well worth reading but too long to provide a short overview. Here is the abstract:

Many emerging applications involve sparse signals, and their processing is a subject of active research. We desire a large class of sensing matrices which allow the user to discern important properties of the measured sparse signal. Of particular interest are matrices with the restricted isometry property (RIP). RIP matrices are known to enable eﬃcient and stable reconstruction of sfficiently sparse signals, but the deterministic construction of such matrices has proven very dfficult. In this thesis, we discuss this matrix design problem in the context of a growing field of study known as frame theory. In the ﬁrst two chapters, we build large families of equiangular tight frames and full spark frames, and we discuss their relationship to RIP matrices as well as their utility in other aspects of sparse signal processing. In Chapter 3, we pave the road to deterministic RIP matrices, evaluating various techniques to demonstrate RIP, and making interesting connections with graph theory and number theory. We conclude in Chapter 4 with a coherence-based alternative to RIP, which provides near-optimal probabilistic guarantees for various aspects of sparse signal processing while at the same time admitting a whole host of deterministic constructions.

By the way, the thesis is dedicated “To all those who never dedicated a dissertation to themselves.”

Further we have Proximal Newton-type Methods for Minimizing Convex Objective Functions in Composite Form by Jason D Lee, Yuekai Sun, Michael A. Saunders. This paper extends the well explored first order methods for problem of the type ${\min g(x) + h(x)}$ with Lipschitz-differentiable ${g}$ or simple ${\mathrm{prox}_h}$ to second order Newton-type methods. The abstract reads

We consider minimizing convex objective functions in composite form

$\displaystyle \min_{x\in\mathbb{R}^n} f(x) := g(x) + h(x)$

where ${g}$ is convex and twice-continuously differentiable and ${h:\mathbb{R}^n\rightarrow\mathbb{R}}$ is a convex but not necessarily differentiable function whose proximal mapping can be evaluated efficiently. We derive a generalization of Newton-type methods to handle such convex but nonsmooth objective functions. Many problems of relevance in high-dimensional statistics, machine learning, and signal processing can be formulated in composite form. We prove such methods are globally convergent to a minimizer and achieve quadratic rates of convergence in the vicinity of a unique minimizer. We also demonstrate the performance of such methods using problems of relevance in machine learning and high-dimensional statistics.

With this post I say goodbye for a few weeks of holiday.

Today I would like to comment on two arxiv-preprints I stumbled upon:

1. “Augmented L1 and Nuclear-Norm Models with a Globally Linearly Convergent Algorithm” – The Elastic Net rediscovered

The paper “Augmented L1 and Nuclear-Norm Models with a Globally Linearly Convergent Algorithm” by Ming-Jun Lai and Wotao Yin is another contribution to a field which is (or was?) probably the fastest growing field in applied mathematics: Algorithms for convex problems with non-smooth ${\ell^1}$-like terms. The “mother problem” here is as follows: Consider a matrix ${A\in{\mathbb R}^{m\times n}}$, ${b\in{\mathbb R}^m}$ try to find a solution of

$\displaystyle \min_{x\in{\mathbb R}^n}\|x\|_1\quad\text{s.t.}\quad Ax=b$

or, for ${\sigma>0}$

$\displaystyle \min_{x\in{\mathbb R}^n}\|x\|_1\quad\text{s.t.}\quad \|Ax-b\|\leq\sigma$

which appeared here on this blog previously. Although this is a convex problem and even has a reformulation as linear program, some instances of this problem are notoriously hard to solve and gained a lot of attention (because their applicability in sparse recovery and compressed sensing). Very roughly speaking, a part of its hardness comes from the fact that the problem is neither smooth nor strictly convex.

The contribution of Lai and Yin is that they analyze a slight perturbation of the problem which makes its solution much easier: They add another term in the objective; for ${\alpha>0}$ they consider

$\displaystyle \min_{x\in{\mathbb R}^n}\|x\|_1 + \frac{1}{2\alpha}\|x\|_2^2\quad\text{s.t.}\quad Ax=b$

or

$\displaystyle \min_{x\in{\mathbb R}^n}\|x\|_1 + \frac{1}{2\alpha}\|x\|_2^2\quad\text{s.t.}\quad \|Ax-b\|\leq\sigma.$

This perturbation does not make the problem smooth but renders it strongly convex (which usually makes the dual more smooth). It turns out that this perturbation makes life with this problem (and related ones) much easier – recovery guarantees still exists and algorithms behave better.

I think it is important to note that the “augmentation” of the ${\ell^1}$ objective with an additional squared ${\ell^2}$-term goes back to Zou and Hastie from the statistics community. There, the motivation was as follows: They observed that the pure ${\ell^1}$ objective tends to “overpromote” sparsity in the sense that if there are two columns in ${A}$ which are almost equally good in explaining some component of ${b}$ then only one of them is used. The “augmented problem”, however, tends to use both of them. They coined the method as “elastic net” (for reasons which I never really got).

I also worked on elastic-net problems for problems in the form

$\displaystyle \min_x \frac{1}{2}\|Ax-b\|^2 + \alpha\|x\|_1 + \frac{\beta}{2}\|x\|_2^2$

in this paper (doi-link). Here it also turns out that the problem gets much easier algorithmically. I found it very convenient to rewrite the elastic-net problem as

$\displaystyle \min_x \frac{1}{2}\|\begin{bmatrix}A\\ \sqrt{\beta} I\end{bmatrix}x-\begin{bmatrix}b\\ 0\end{bmatrix}\|^2 + \alpha\|x\|_1$

which turns the elastic-net problem into just another ${\ell^1}$-penalized problem with a special matrix and right hand side. Quite convenient for analysis and also somehow algorithmically.

2. Towards a Mathematical Theory of Super-Resolution

The second preprint is “Towards a Mathematical Theory of Super-Resolution” by Emmanuel Candes and Carlos Fernandez-Granda.

The idea of super-resolution seems to pretty old and, very roughly speaking, is to extract a higher resolution of a measured quantity (e.g. an image) than the measured data allows. Of course, in this formulation this is impossible. But often one can gain something by additional knowledge of the image. Basically, this also is the idea behind compressed sensing and hence, it does not come as a surprise that the results in compressed sensing are used to try to explain when super-resolution is possible.

The paper by Candes and Fernandez-Granada seems to be pretty close in spirit to Exact Reconstruction using Support Pursuit on which I blogged earlier. They model the sparse signal as a Radon measure, especially as a sum of Diracs. However, different from the support-pursuit-paper they use complex exponentials (in contrast to real polynomials). Their reconstruction method is basically the same as support pursuit: The try to solve

$\displaystyle \min_{x\in\mathcal{M}} \|x\|\quad\text{s.t.}\quad Fx=y, \ \ \ \ \ (1)$

i.e. they minimize over the set of Radon measures ${\mathcal{M}}$ under the constraint that certain measurements ${Fx\in{\mathbb R}^n}$ result in certain given values ${y}$. Moreover, they make a thorough analysis of what is “reconstructable” by their ansatz and obtain a lower bound on the distance of two Diracs (in other words, a lower bound in the Prokhorov distance). I have to admit that I do not share one of their claims from the abstract: “We show that one can super-resolve these point sources with infinite precision—i.e. recover the exact locations and amplitudes—by solving a simple convex program.” My point is that I can not see to what extend the problem (1) is a simple one. Well, it is convex, but it does not seem to be simple.

I want to add that the idea of “continuous sparse modelling” in the space of signed measures is very appealing to me and appeared first in Inverse problems in spaces of measures by Kristian Bredies and Hanna Pikkarainen.

Recently the recipients of Sloan Fellowships for 2012 has been announced. This is a kind grant/price awarded to young scientists, usually people who are assistant professors on the tenure track, from the U.S. and Canada. While the actual award is not exorbitant (but still large) the Sloan Fellowships seem to be a good indicator for further success in the career and, more importantly, for awaited breakthroughs by the fellows.

This year there are 20 mathematicians among the recipients and it appeared that I knew three of them:

• Rachel Ward (UTexas at Austin), a student of Ingrid Daubechies, has written a very nice PhD Thesis “Freedom through imperfection” on signal processing based on redundancy. Among several interesting works she has a very recent preprint Stable image reconstruction using total variation minimization which I plan to cover in a future post (according to the abstract the paper provides rigorous theory for exact recovery with discrete total variation which is cool).
• Ben Recht (University of Wisconsin, Madison) also works in the field of signal processing (among other fields) and is especially known for his work on exact matrix completion with compressed sensing techniques (together with Emmanuel Candes); I also liked his paper “The Convex Geometry of Linear Inverse Problems” (with Venkat Chandrasekaran, Pablo A. Parrilo, and Alan Willsky) with elaborated on the generalization of ${\ell^1}$-minimization and nuclear norm minimization.
• Greg Blekherman (Georgia Tech) works (among other things) on algebraic geometry and especially on the problem which non-negative polynomials (in more the one variable) can be written as sums of squares of polynomials, a question which is related to one of Hilbert’s 23 problems, namely Hilbert’s seventeenth problem. An interesting thing about non-negative polynomials and sums of squares is that: 1. Checking if a polynomial is non-negative is NP-hard. 2. Checking is a polynomial is a sum of squares can be done fast. 3. It seem that “most” non-negative polynomials are actually sums of squares but it unclear “how many of them”, see Greg’s chapter of the forthcoming book “Semidefinite Optimization and Convex Algebraic Geometry” here.

Congratulations!

Today I write about sparse recovery of “multidimensional” signal. With “multidimensional” I mean something like this: A one-dimensional signal is a vector ${x\in{\mathbb R}^n}$ while a two-dimensional signal is a matrix ${x\in{\mathbb R}^{n_1\times n_2}}$. Similarly, ${d}$-dimensional signal is ${x\in{\mathbb R}^{n_1\times\cdots\times n_d}}$. Of course, images as two-dimensional signals, come time mind. Moreover, movies are three-dimensional, a hyperspectral 2D image (which has a whole spectrum attached to any pixel) is also three-dimensional), and time-dependent volume-data is four-dimensional.

Multidimensional data is often a challenge due the large amount of data. While the size of the signals is usually not the problem it is more the size of the measurement matrices. In the context of compressed sensing or sparse recovery the signal is measured with a linear operator, i.e. one applies a number ${m}$ of linear functionals to the signal. In the ${d}$-dimensional case this can be encoded as a matrix ${A\in {\mathbb R}^{m\times \prod_1^d n_i}}$ and this is where the trouble with the data comes in: If you have megapixel image (which is still quite small) the matrix has a million of columns and if you have a dense matrix, storage becomes an issue.

One approach (which is indeed quite old) to tackle this problem is, to consider special measurement matrices: If the signal has a sparse structure is every slice, i.e. every vectors of the form ${x(i_1,\dots,i_{k-1},:,i_{k+1},\dots,i_d)}$ where we fix all but the ${k}$-th component, then the Kronecker product of measurement matrices for each dimension is the right thing.

The Kronecker product of two matrices ${A\in{\mathbb R}^{m\times n}}$ and ${B\in{\mathbb R}^{k\times j}}$ is the ${mk\times nj}$ matrix

$\displaystyle A\otimes B = \begin{bmatrix} a_{11}B & \dots & a_{1n}B\\ \vdots & & \vdots\\ a_{n1}B & \dots & a_{nn}B \end{bmatrix}.$

This has a lot to do with the tensor product and you should read the Wikipedia entry. Moreover, it is numerically advantageous not to build the Kronecker product of dense matrices if you only want to apply it to a given signal. To see this, we introduce the vectorization operator ${\text{vec}:{\mathbb R}^{m\times n}\rightarrow{\mathbb R}^{nm}}$ which takes a matrix ${X}$ and stacks its columns into a tall column vector. For matrices ${A}$ and ${B}$ (of fitting sizes) it holds that

$\displaystyle (B^T\otimes A)\text{vec}(X) = \text{vec}(AXB).$

So, multiplying ${X}$ from the left and from the right gives the application of the Kronecker product.

The use of Kronecker products in numerical linear algebra is fairly old (for example they are helpful for multidimensional finite difference schemes where you can build Kronecker products of sparse difference operators in respective dimensions). Recently, they have been discovered for compressed sensing and sparse recovery in these two papers: Sparse solutions to underdetermined Kronecker product systems by Sadegh Jokar and Volker Mehrmann and the more recent Kronecker Compressed Sensing by Marco Duarte and Rich Baraniuk.

From these papers you can extract some interestingly simple and nice theorems:

Theorem 1 For matrices ${A_1,\dots A_d}$ with restricted isometry constant ${\delta_K(A_1),\dots,\delta_K(A_d)}$ of order ${K}$ it holds that the restricted isometry constant of the Kronecker product fulfills

$\displaystyle \max_i \delta_K(A_i) \leq \delta_K(A_1\otimes\cdots\otimes A_d) \leq \prod_1^d (1+\delta_K(A_i))-1.$

Basically, the RIP constant of a Kronecker product is not better than the worst one but still not too large.

Theorem 2 For matrices ${A_1,\dots, A_d}$ with columns normalized to one, it hold that the spark of their Kronecker product fulfills

$\displaystyle \text{spark}(A\otimes \dots\otimes A_d) = = \min_i\text{spark}(A_i).$

Theorem 3 For matrices ${A_1,\dots, A_d}$ with columns normalized to one, it hold that the mutual coherence of their Kronecker product fulfills

$\displaystyle \mu(A_1\otimes\dots\otimes A_d) = \max_i \mu(A_i).$

Next Page »