ISMP is over now and I’m already home. I do not have many things to report on from the last day. This is not due the lower quality of the talks but due to the fact that I was a little bit exhausted, as usual at the end of a five-day conference. However, I collect a few things for the record:

• In the morning I visited the semi-planary by Xiaojun Chenon non-convex and non-smooth minimization with smoothing methods. Not surprisingly, she treated the problem

$\displaystyle \min_x f(x) + \|x\|_p^p$

with convex and smooth ${f:{\mathbb R}^n\rightarrow{\mathbb R}}$ and ${0. She proposed and analyzed smoothing methods, that is, to smooth the problem a bit to obtain a Lipschitz-continuous objective function ${\phi_\epsilon}$, minimizing this and then gradually decreasing ${\epsilon}$. This works, as she showed. If I remember correctly, she also treated “iteratively reweighted least squares” as I described in my previous post. Unfortunately, she did not include the generalized forward-backward methods based on ${\text{prox}}$-functions for non-convex functions. Kristian and I pursued this approach in our paper Minimization of non-smooth, non-convex functionals by iterative thresholding and some special features of our analysis include:

• A condition which excludes some (but not all) local minimizers from being global.
• An algorithm which avoids this non-global minimizers by carefully adjusting the steplength of the method.
• A result that the number of local minimizers is still finite, even if the problem is posed in ${\ell^2({\mathbb N})}$ and not in ${{\mathbb R}^n}$.

Most of our results hold true, if the ${p}$-quasi-norm is replaced by functions of the form

$\displaystyle \sum_n \phi_n(|x_n|)$

with special non-convex ${\phi}$, namely fulfilling a list of assumptions like

• ${\phi'(x) \rightarrow \infty}$ for ${x\rightarrow 0}$ (infinite slope at ${0}$) and ${\phi(x)\rightarrow\infty}$ for ${x\rightarrow\infty}$ (mild coercivity),
• ${\phi'}$ strictly convex on ${]0,\infty[}$ and ${\phi'(x)/x\rightarrow 0}$ for ${x\rightarrow\infty}$,
• for each ${b>0}$ there is ${a>0}$ such that for ${x it holds that ${\phi(x)>ax^2}$, and
• local integrability of some section of ${\partial\phi'(x) x}$.

As one easily sees, ${p}$-quasi-norms fulfill the assumptions and some other interesting functions as well (e.g. some with very steep slope at ${0}$ like ${x\mapsto \log(x^{1/3}+1)}$).

• Jorge Nocedalgave a talk on second-order methods for non-smooth problems and his main example was a functional like

$\displaystyle \min_x f(x) + \|x\|_1$

with a convex and smooth ${f}$, but different from Xiaojun Chen, he only considered the ${1}$-norm. His talked is among the best planary talks I have ever attended and it was a great pleasure to listen to him. He carefully explained things and put them in perspective. In the case he skipped slides, he made me feel that I either did not miss an important thing, or understood them even though he didn’t show them He argued that it is not necessarily more expensive to use second order information in contrast to first order methods. Indeed, the ${1}$-norm can be used to reduce the number of degrees of freedom for a second order step. What was pretty interesting is, that he advocated semismooth Newton methods for this problem. Roland and I pursued this approach some time ago in our paper A Semismooth Newton Method for Tikhonov Functionals with Sparsity Constraints and, if I remember correctly (my notes are not complete at this point), his family of methods included our ssn-method. The method Roland and I proposed worked amazingly well in the cases in which it converged but the method suffered from non-global convergence. We had some preliminary ideas for globalization, which we could not tune enough to retain the speed of the method, and abandoned the topic. Now, that the topic will most probably be revived by the community, I am looking forward to fresh ideas here.

Today there are several things I could blog on. The first is the planary by Rich Baraniuk on Compressed Sensing. However, I don’t think that I could reflect the content in a way which would be helpful for a potential reader. Just for the record: If you have the chance to visit one of Rich’s talk: Do it!

The second thing is the talk by Bernd Hofmann on source conditions, smoothness and variational inequalities and their use in regularization of inverse problems. However, this would be too technical for now and I just did not take enough notes to write a meaningful post.

As a third thing I have the talk by Christian Clason on inverse problems with uniformly distributed noise. He argued that for uniform noise it is much better to use an ${L^\infty}$ discrepancy term instead of the usual ${L^2}$-one. He presented a path-following semismooth Newton method to solve the problem

$\displaystyle \min_x \frac{1}{p}\|Kx-y^\delta\|_\infty^p + \frac{\alpha}{2}\|x\|_2^2$

and showed examples with different kinds of noise. Indeed the examples showed that ${L^\infty}$ works much better than ${L^2}$ here. But in fact it works even better, if the noise is not uniformly distributed but “impulsive” i.e. it attains bounds ${\pm\delta}$ almost everywhere. It seems to me that uniform noise would need a slightly different penalty but I don’t know which one – probably you do? Moreover, Christian presented the balancing principle to choose the regularization parameter (without knowledge about the noise level) and this was the first time I really got what it’s about. What one does here is, to choose ${\alpha}$ such that (for some ${\sigma>0}$ which only depends on ${K}$, but not on the noise)

$\displaystyle \sigma\|Kx_\alpha^\delta-y^\delta\|_\infty = \frac{\alpha}{2}\|x_\alpha^\delta\|_2^2.$

The rational behind this is, that the left hand side is monotonically non-decreasing in ${\alpha}$, while the right hand side is monotonically non-increasing. Hence, there should be some ${\alpha}$ “in the middle” which make both somewhat equally large. Of course, we do neither want to “over-regularize” (which would usually “smooth too much”) nor to “under-regularize” (which would not eliminate noise). Hence, balancing seems to be a valid choice. From a practical point of view the balancing is also nice because one can use the fixed-point iteration

$\displaystyle \alpha^{n+1} = 2\sigma\frac{\|Kx_{\alpha^n}^\delta - y^\delta\|_\infty}{\|x_{\alpha_n}^\delta\|_2^2}$

which converges in a few number of iterations.

Then there was the talk by Esther Klann, but unfortunately, I was late so only heard the last half…

Last but not least we have the talk by Christiane Pöschl. If you are interested in Total-Variation-Denoising (TV denoising), then you probably have heard many times that “TV denoising preserves edges” (have a look at the Wikipedia page – it claims this twice). What Christiane showed (in a work with Vicent Caselles and M. Novaga) that this claim is not true in general but only for very special cases. In case of characteristic functions, the only functions for which the TV minimizer has sharp edges are these so-called calibrated sets, introduced by Caselles et el. Building on earlier works by Caselles and co-workers she calculated exact minimizers for TV denoising in the case that the image consists of characteristic functions of two convex sets or of a single star shaped domain, that is, for a given set $B$ she calculated the solution of

$\displaystyle \min_u\int (u - \chi_B)^2dx + \lambda \int|Du|.$

This is not is as easy as it may sound. Even for the minimizer for a single convex set one has to make some effort. She presented a nice connection of the shape of the obtained level-sets with the morphological operators of closing and opening. With the help of this link she derived a methodology to obtain the exact TV denoising minimizer for all parameters. I do not have the images right now but be assured that most of the time, the minimizers do not have sharp edges all over the place. Even for simple geometries (like two rectangles touching in a corner) strange things happen and only very few sharp edges appear. I’ll keep you posted in case the paper comes out (or appears as a preprint).

Christiane has some nice images which make this much more clear:

For two circles edges are preserved if they are far enough away from each other. If they are close, the area “in between” them is filled and, moreover, obey this fuzzy boundary. I remember myself seeing effects like this in the output of TV-solvers and thinking “well, it seems that the algorithm is either not good or not converged yet – TV should output sharp edges!”.

For a star-shaped shape (well, actually a star) the output looks like this. The corners are not only rounded but also blurred and this is true both for the “outer” corners and the “inner” corners.

So, if you have any TV-minimizing code, go ahead and check if your code actually does the right things on images like this!
Moreover, I would love to see similar results for more complicated extensions of TV like Total Generalized Variation, I treated here.

Today I report on two things I came across here at ISMP:

• The first is a talk by Russell Luke on Constraint qualifications for nonconvex feasibility problems. Luke treated the NP-hard problem of sparsest solutions of linear systems. In fact he did not tackle this problem but the problem to find an ${s}$-sparse solution of an ${m\times n}$ system of equations. He formulated this as a feasibility-problem (well, Heinz Bauschke was a collaborator) as follows: With the usual malpractice let us denote by ${\|x\|_0}$ the number of non-zero entries of ${x\in{\mathbb R}^n}$. Then the problem of finding an ${s}$-sparse solution to ${Ax=b}$ is:

$\displaystyle \text{Find}\ x\ \text{in}\ \{\|x\|_0\leq s\}\cap\{Ax=b\}.$

In other words: find a feasible point, i.e. a point which lies in the intersection of the two sets. Well, most often feasibility problems involve convex sets but here, the first one given by this “${0}$-norm” is definitely not convex. One of the simplest algorithms for the convex feasibility problem is to alternatingly project onto both sets. This algorithm dates back to von Neumann and has been analyzed in great detail. To make this method work for non-convex sets one only needs to know how to project onto both sets. For the case of the equality constraint ${Ax=b}$ one can use numerical linear algebra to obtain the projection. The non-convex constraint on the number of non-zero entries is in fact even easier: For ${x\in{\mathbb R}^n}$ the projection onto ${\{\|x\|_0\leq s\}}$ consists of just keeping the ${s}$ largest entries of ${x}$ while setting the others to zero (known as the “best ${s}$-term approximation”). However, the theory breaks down in the case of non-convex sets. Russell treated problem in several papers (have a look at his publication page) and in the talk he focused on the problem of constraint qualification, i.e. what kind of regularity has to be imposed on the intersection of the two sets. He could shows that (local) linear convergence of the algorithm (which is observed numerically) can indeed be justified theoretically. One point which is still open is the phenomenon that the method seems to be convergent regardless of the initialization and that (even more surprisingly) that the limit point seems to be independent of the starting point (and also seems to be robust with respect to overestimating the sparsity ${s}$). I wondered if his results are robust with respect to inexact projections. For larger problems the projection onto the equality constraint ${Ax=b}$ are computationally expensive. For example it would be interesting to see what happens if one approximates the projection with a truncated CG-iteration as Andreas, Marc and I did in our paper on subgradient methods for Basis Pursuit.

• Joel Tropp reported on his paper Sharp recovery bounds for convex deconvolution, with applications together with Michael McCoy. However, in his title he used demixing instead of deconvolution (which, I think, is more appropriate and leads to less confusion). With “demixing” they mean the following: Suppose you have two signals ${x_0}$ and ${y_0}$ of which you observe only the superposition of ${x_0}$ and a unitarily transformed ${y_0}$, i.e. for a unitary matrix ${U}$ you observe

$\displaystyle z_0 = x_0 + Uy_0.$

Of course, without further assumptions there is no way to recover ${x_0}$ and ${y_0}$ from the knowledge of ${z_0}$ and ${U}$. As one motivation he used the assumption that both ${x_0}$ and ${y_0}$ are sparse. After the big bang of compressed sensing nobody wonders that one turns to convex optimization with ${\ell^1}$-norms in the following manner:

$\displaystyle \min_{x,y} \|x\|_1 + \lambda\|y\|_1 \ \text{such that}\ x + Uy = z_0. \ \ \ \ \ (1)$

This looks a lot like sparse approximation: Eliminating ${x}$ one obtains the unconstraint problem \begin{equation*} \min_y \|z_0-Uy\|_1 + \lambda \|y\|_1. \end{equation*}

Phrased differently, this problem aims at finding an approximate sparse solution of ${Uy=z_0}$ such that the residual (could also say “noise”) ${z_0-Uy=x}$ is also sparse. This differs from the common Basis Pursuit Denoising (BPDN) by the structure function for the residual (which is the squared ${2}$-norm). This is due to the fact that in BPDN one usually assumes Gaussian noise which naturally lead to the squared ${2}$-norm. Well, one man’s noise is the other man’s signal, as we see here. Tropp and McCoy obtained very sharp thresholds on the sparsity of ${x_0}$ and ${y_0}$ which allow for exact recovery of both of them by solving (1). One thing which makes their analysis simpler is the following reformulation: The treated the related problem \begin{equation*} \min_{x,y} \|x\|_1 \text{such that} \|y\|_1\leq\alpha, x+Uy=z_0 \end{equation*} (which I would call this the Ivanov version of the Tikhonov-problem (1)). This allows for precise exploitation of prior knowledge by assuming that the number ${\alpha_0 = \|y_0\|_1}$ is known.

First I wondered if this reformulation was responsible for their unusual sharp results (sharper the results for exact recovery by BDPN), but I think it’s not. I think this is due to the fact that they have this strong assumption on the “residual”, namely that it is sparse. This can be formulated with the help of ${1}$-norm (which is “non-smooth”) in contrast to the smooth ${2}$-norm which is what one gets as prior for Gaussian noise. Moreover, McCoy and Tropp generalized their result to the case in which the structure of ${x_0}$ and ${y_0}$ is formulated by two functionals ${f}$ and ${g}$, respectively. Assuming a kind of non-smoothness of ${f}$ and ${g}$ the obtain the same kind of results and especially matrix decomposition problems are covered.

The second day of ISMP started (for me) with the session I organized and chaired.

The first talk was by Michael Goldman on Continuous Primal-Dual Methods in Image Processing. He considered the continuous Arrow-Hurwitz method for saddle point problems

$\displaystyle \min_{u}\max_{\xi} K(u,\xi)$

with ${K}$ convex in the first and concave in the second variable. The continuous Arrow-Hurwitz method consists of solving

$\displaystyle \begin{array}{rcl} \partial_t u(t) &=& -\nabla_u K(u(t),\xi(t))\\ \partial_t \xi(t) &=& \nabla_\xi K(u(t),\xi(t)). \end{array}$

His talk evolved around the problem if ${K}$ comes from a functional which contains the total variation, namely he considered

$\displaystyle K(u,\xi) = -\int_\Omega u\text{div}(\xi) + G(u)$

with the additional constraints ${\xi\in C^1_C(\Omega,{\mathbb R}^2)}$ and ${|\xi|\leq 1}$. For the case of ${G(u) = \lambda\|u-f\|^2/2}$ he presented a nice analysis of the problem including convergence of the method to a solution of the primal problem and some a-posteriori estimates. This reminded me of Showalters method for the regularization of ill-posed problems. The Arrow-Hurwitz method looks like a regularized version of Showalters method and hence, early stopping does not seem to be necessary for regularization. The related paper is Continuous Primal-Dual Methods for Image Processing.

The second talk was given by Elias Helou and was on Incremental Algorithms for Convex Non-Smooth Optimization with Applications in Image Reconstructions. He presented his work on a very general framework for problems of the class

$\displaystyle \min_{x\in X} f(x)$

with a convex function ${f}$ and a convex set ${X}$. Basically, he abstracted the properties of the projected subgradient method. This consists of taking subgradient descent steps for ${f}$ followed by projection onto ${X}$ iteratively: With a subgradient ${g^n\in\partial f(x^n)}$ this reads as

$\displaystyle x^{n+1} = P_X(x^n -\alpha_n g^n)$

he extracted the conditions one needs from the subgradient descent step and from the projection step and formulated an algorithm which consists of successive application of an “optimality operator” ${\mathcal{O}_f}$ (replacing the subgradient step) and a feasibility operator ${\mathcal{F}_X}$ (replacing the projection step). The algorithm then reads as

$\displaystyle \begin{array}{rcl} x^{n+1/2} &=& \mathcal{O}_f(x^n,\alpha_n)\\ x^{n+1} &=& \mathcal{F}_x(x^{n+1/2} \end{array}$

and he showed convergence under the extracted conditions. The related paper is , Incremental Subgradients for Constrained Convex Optimization: a Unified Framework and New Methods.

The third talk was by Jerome Fehrenbach on Stripes removel in images, apllications in microscopy. He considered the problem of very specific noise which is appear in the form of stripes (and appears, for example, “single plane illumination microscopy”). In fact he considered a little more general case and the model he proposed was as follows: The observed image is

$\displaystyle u_{\text{OBS}} = u + n,$

i.e. the usual sum of the true image ${u}$ and noise ${n}$. However, for the noise he assumed that it is given by

$\displaystyle n = \sum_{j=1}^m \psi_j*\lambda_j,$

i.e. it is a sum of different convolutions. The ${\psi_j}$ are kind of shape-functions which describe the “pattern of the noise” and the ${\lambda_j}$ are samples of noise processes, following specific distributions (could be white noise realizations, impulsive noise or something else)-. He then formulated a variational method to identify the variables ${\lambda_j}$ which reads as

$\displaystyle \min \|\nabla(u_{\text{OBS}} - \sum_{j=1}^m \psi_j*\lambda_j)\|_1 + \sum_j \phi_j(\lambda_j).$

Basically, this is the usual variational approach to image denoising, but nor the optimization variable is the noise rather than the image. This is due to the fact that the noise has a specific complicated structure and the usual formulation with ${u = u_{\text{OBS}} +n}$ is not feasible. He used the primal-dual algorithm by Chambolle and Pock for this problem and showed that the method works well on real world problems.

Another theme which caught my attention here is “optimization with variational inequalities as constraints”. At first glance that sounds pretty awkward. Variational inequalities can be quite complicated things and why on earth would somebody considers these things as side conditions in optimization problems? In fact there are good reasons to do so. One reason is, if you have to deal with bi-level optimization problems. Consider an optimization problem

$\displaystyle \min_{x\in C} F(x,p) \ \ \ \ \ (1)$

with convex ${C}$ and ${F(\cdot,p)}$ (omitting regularity conditions which could be necessary to impose) depending on a parameter ${p}$. Now consider the case that you want to choose the parameter ${p}$ in an optimal way, i.e. it solves another optimization problem. This could look like

$\displaystyle \min_p G(x)\quad\text{s.t.}\ x\ \text{solves (1)}. \ \ \ \ \ (2)$

Now you have an optimization problem as a constraint. Now we use the optimality condition for the problem~(1): For differentiable ${F}$, ${x^*}$ solves~(1) if and only if

$\displaystyle \forall y\in C:\ \nabla_x F(x^*(p),p)(y-x^*(p))\geq 0.$

In other words: We con reformulate (2) as

$\displaystyle \min_p G(x)\quad\text{s.t.}\ \forall y\in C:\ \nabla_x F(x^*(p),p)(y-x^*(p))\geq 0. \ \ \ \ \ (3)$

And there it is, our optimization problem with a variational inequality as constraint. Here at ISMP there are entire sessions devoted to this, see here and here.

The scientific program at ISMP started today and I planned to write a small personal summary of each day. However, it is a very intense meeting. Lot’s of excellent talks, lot’s of people to meet and little spare time. So I’m afraid that I have to deviate from my plan a little bit. Instead of a summary of every day I just pick out a few events. I remark that these picks do not reflect quality, significance or something like this in any way. I just pick things for which I have something to record for personal reasons.

My day started after the first plenary which the session Testing environments for machine learning and compressed sensing in which my own talk was located. The session started with the talk by Michael Friedlander of the SPOT toolbox. Haven’t heard of SPOT yet? Take a look! In a nutshell its a toolbox which turns MATLAB into “OPLAB”, i.e. it allows to treat abstract linear operators like matrices. By the way, the code is on github.

The second talk was by Katya Scheinberg (who is giving a semi-planary talk on derivative free optimization at the moment…). She talked about speeding up FISTA by cleverly adjusting step-sizes and over-relaxation parameters and generalizing these ideas to other methods like alternating direction methods. Notably, she used the “SPEAR test instances” from our project homepage! (And credited them as “surprisingly hard sparsity problems”.)

My own talk was the third and last one in that session. I talked about the issue of constructing test instance for Basis Pursuit Denoising. I argued that the naive approach (which takes a matrix ${A}$, a right hand side ${b}$ and a parameter ${\lambda}$ and let some great solver run for a while to obtain a solution ${x^*}$) may suffer from “trusted method bias”. I proposed to use “reverse instance construction” which is: First choose ${A}$, ${\lambda}$ and the solution ${x^*}$ and the construct the right hand side ${b}$ (I blogged on this before here).

Last but not least, I’d like to mention the talk by Thomas Pock: He talked about parameter selection on variational models (think of the regularization parameter in Tikhonov, for example). In a paper with Karl Kunisch titled A bilevel optimization approach for parameter learning in variational models they formulated this as a bi-level optimization problem. An approach which seemed to have been overdue! Although they treat somehow simple inverse problems (well, denoising) (but with not so easy regularizers) it is a promising first step in this direction.

Today I arrived at ISMP in Berlin. This seems to be the largest conference on optimization and is hosted by the Mathematical Optimization Society. (As a side note: The society recently changed its name from MPS (Mathematical Programming Society) to MOS. Probably the conference will be called ISMO in a few years…).

The reception today was special in comparison to conference receptions I have attended so far. First, it was held in the Konzerthaus which is a pretty fancy neo-classical building. Well, I’ve been to equally fancy buildings at conference receptions at GAMM or AIP conferences already, but the distinguishing feature this evening was the program. As usual it featured welcome notes by important people (notably, the one by the official government representative was accurate and entertaining!), prizes and music. The music was great, the host (G.M. Ziegler) did a great job and the ceremony felt like a show rather than an opening reception.

From the prices I’d like to mention two:

After this reception I am looking even more forward to the rest of this conference.

As I a side note: Something seems to be wrong with me and optimization conferences. It seems like every time I visit such a conference, I am loosing my cell-phone. Happened to me at SIOPT 2011 in Darmstadt and happened to me again today…