Conference


Currently I am at the SIAM Imaging conference in Hong Kong. It’s a great conference with great people at a great place. I am pretty sure that this will be the only post from here, since the conference is quite intense. I just wanted to report on two ideas that have become clear to me here; both are pretty simple and probably already widely known, but anyway:

1. Non-convex + convex objective

There are a lot of talks that deal with optimization problems of the form

\displaystyle  \min_u F(u) + G(u).

Especially, people try to leverage as much structure of the functionals {F} and {G} as possible. Frequently, there arises a need to deal with non-convex parts of the objective, and indeed there are several approaches around that deal in one way or another with non-convexity of {F} or even {F+G}. Usually, in the presence of an {F} that is not convex, it is helpful if {G} has favorable properties, e.g. such that {F+G} is still bounded from below, coercive, or even convex again. A particularly helpful property is strong convexity of {G} (i.e. {G} stays convex even if you subtract {\epsilon/2\|\cdot\|^2} from it). Here comes the simple idea: if you allow {F} to be non-convex anyway, but only have a {G} that is merely convex, and not strongly so, you can modify your objective to

\displaystyle  \underbrace{F(u) - \tfrac\epsilon2\|u\|^2}_{\leftarrow F(u)} + \underbrace{G(u) + \tfrac\epsilon2\|u\|^2}_{\leftarrow G(u)}

for some {\epsilon>0}. This gives you a strongly convex {G} and an {F} that is (often) theoretically no worse than before. It occurred to me that this is an idea that Kristian Bredies told me almost ten years ago and which we made into a paper (together with Peter Maaß) in 2005; it somehow got delayed and was not published before 2009.
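As a tiny numerical illustration of the shift (the particular {F} and {G} below are hypothetical examples, chosen only because they are easy to write down):

```python
import numpy as np

eps = 0.1  # amount of strong convexity shifted from F to G

def F(u):      # hypothetical non-convex part
    return np.sum(np.sqrt(np.abs(u)))

def G(u):      # hypothetical merely convex part (here: the l1-norm)
    return np.sum(np.abs(u))

def F_new(u):  # new F: still non-convex, often theoretically no worse
    return F(u) - 0.5 * eps * np.sum(u**2)

def G_new(u):  # new G: now eps-strongly convex
    return G(u) + 0.5 * eps * np.sum(u**2)

u = np.random.randn(5)
# the objective itself is unchanged by the shift
assert np.isclose(F(u) + G(u), F_new(u) + G_new(u))
```

Note also that the proximal operator of the shifted {G} comes for free from the one of {G}: one has {\text{prox}_{\tau(G+\frac\epsilon2\|\cdot\|^2)}(x) = \text{prox}_{\tau G/(1+\tau\epsilon)}(x/(1+\tau\epsilon))}, so prox-based algorithms remain applicable.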

2. Convex-concave saddle point problems

If your problem has the form

\displaystyle  \min_u F(u) + G(Ku)

with some linear operator {K} and both {F} and {G} convex, it has turned out that it is tremendously helpful for the solution to consider the corresponding saddle point formulation: i.e. using the convex conjugate {G^*} of {G}, you write

\displaystyle  \min_u \max_v F(u) + \langle Ku, v\rangle -G^*(v).

A class of algorithms that look like the Arrow-Hurwicz method at first glance has been sparked by the method proposed by Chambolle and Pock. This method allows {F} and {G} to be merely convex (no smoothness or strong convexity needed) and only needs the proximal operators of both {F} and {G^*}. I also worked on algorithms for slightly more general problems, involving a reformulation of the saddle point problem as a monotone inclusion, with Tom Pock in the paper An accelerated forward-backward algorithm for monotone inclusions, and I should also mention this nice approach by Bredies and Sun who consider another reformulation of the monotone inclusion. However, in the spirit of the first point, one should take advantage of all the available structure in the problem, e.g. smoothness of one of the terms. Some algorithms can exploit smoothness of either {F} or {G^*} and only need convexity of the other term. An idea that has been used for some time already to tackle the case where {F}, say, is a sum of a smooth and a non-smooth part (and {G^*} is not smooth) is to dualize the non-smooth part of {F}: Say we have {F = F_1 + F_2} with smooth {F_1}; then you could write

\displaystyle  \begin{array}{rcl}  &\min_u\max_v F_1(u) + F_2(u) + \langle Ku, v\rangle -G^*(v)\\ & \qquad= \min_u \max_{v,w} F_1(u) + \langle u,w\rangle + \langle Ku, v\rangle -G^*(v) - F_2^*(w) \end{array}

and you are back in business if your method allows for sums of convex functions in the dual. The trick got the sticky name “dual transportation trick” in a talk by Marc Teboulle here, and that will probably help me not to forget it from now on…
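For the record, here is a minimal sketch of the Chambolle-Pock iteration for {\min_u F(u) + G(Ku)} on a toy instance with {F(u) = \tfrac12\|u-f\|^2} and {G = \lambda\|\cdot\|_1}, so that both proximal operators are explicit (the instance and all parameter values are just assumptions for illustration):

```python
import numpy as np

# toy instance of min_u 0.5*||u - f||^2 + lam*||K u||_1
rng = np.random.default_rng(0)
K = rng.standard_normal((30, 50))
f = rng.standard_normal(50)
lam = 0.5

tau = sigma = 0.9 / np.linalg.norm(K, 2)  # so that tau*sigma*||K||^2 < 1
u = np.zeros(50); u_bar = u.copy(); v = np.zeros(30)

for _ in range(500):
    # dual step: prox of sigma*G^* = projection onto {||v||_inf <= lam}
    v = np.clip(v + sigma * (K @ u_bar), -lam, lam)
    # primal step: prox of tau*F, here (x + tau*f)/(1 + tau)
    u_old = u
    u = (u - tau * (K.T @ v) + tau * f) / (1 + tau)
    # extrapolation step
    u_bar = 2 * u - u_old
```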

The mother example of optimization is to solve problems

\displaystyle  \min_{x\in C} f(x)

for functions {f:{\mathbb R}^n\rightarrow{\mathbb R}} and sets {C\subseteq{\mathbb R}^n}. One further classifies problems according to additional properties of {f} and {C}: if {C={\mathbb R}^n} one speaks of unconstrained optimization, if {f} is smooth one speaks of smooth optimization, if {f} and {C} are convex one speaks of convex optimization and so on.

1. Classification, goals and accuracy

Usually, optimization problems do not have a closed form solution. Consequently, optimization is not primarily concerned with calculating solutions to optimization problems, but with algorithms to solve them. However, having a convergent or terminating algorithm is not fully satisfactory without knowing an upper bound on the runtime. There are several concepts one can work with in this respect and one is the iteration complexity. Here, one gives an upper bound on the number of iterations (which are only allowed to use certain operations such as evaluations of the function {f}, its gradient {\nabla f}, its Hessian, solving linear systems of dimension {n}, projecting onto {C}, calculating halfspaces which contain {C}, or others) to reach a certain accuracy. But also for the notion of accuracy there are several definitions:

  • For general problems one can of course desire to be within a certain distance to the optimal point {x^*}, i.e. {\|x-x^*\|\leq \epsilon} for the solution {x^*} and a given point {x}.
  • One could also demand to be at a point whose function value is close to the optimal one {f^*}, i.e. {f(x) - f^*\leq \epsilon}. Note that for this and for the first point one could also desire relative accuracy.
  • For convex and unconstrained problems, one knows that the inclusion {0\in\partial f(x^*)} (with the subgradient {\partial f(x)}) characterizes the minimizers and hence, accuracy can be defined by desiring that {\min\{\|\xi\|\ :\ \xi\in\partial f(x)\}\leq \epsilon}.

It turns out that the first two definitions of accuracy are much too hard to obtain for general problems and even for smooth and unconstrained problems. The main issue is that for general functions one cannot decide if a local minimizer is also a solution (i.e. a global minimizer) by only considering local quantities. Hence, one resorts to different notions of accuracy, e.g.

  • For smooth, unconstrained problems one aims at stationary points, i.e. finds {x} such that {\|\nabla f(x)\|\leq \epsilon}.
  • For smooth problems with smooth constraints one aims at “approximate KKT-points”, i.e. points that satisfy the Karush-Kuhn-Tucker conditions approximately.

(There are adaptations to the nonsmooth case that are in the same spirit.) Hence, it would be more honest not to write {\min_x f(x)} in these cases since this is often not really the problem one is interested in. However, people write “solve {\min_x f(x)}” all the time even if they only want to find “approximately stationary points”.

2. The gradient method for smooth, unconstrained optimization

Consider a smooth function {f:{\mathbb R}^n\rightarrow {\mathbb R}} (we’ll say more precisely how smooth in a minute). We make no assumption on convexity and hence, we are only interested in finding stationary points. From calculus in several dimensions it is known that {-\nabla f(x)} is a direction of descent from the point {x}, i.e. there is a value {h>0} such that {f(x - h\nabla f(x))< f(x)}. Hence, it seems like moving in the direction of the negative gradient is a good idea. We arrive at what is known as the gradient method:

\displaystyle  x_{k+1} = x_k - h_k \nabla f(x_k).

Now let’s be more specific about the smoothness of {f}. Of course we need that {f} is differentiable and we also want the gradient to be continuous (to make the evaluation of {\nabla f} stable). It turns out that some more smoothness makes the gradient method more efficient, namely we require that the gradient of {f} is Lipschitz continuous with a known Lipschitz constant {L}. The Lipschitz constant can be used to produce efficient stepsizes {h_k}, namely, for {h_k = 1/L} one has the estimate

\displaystyle  f(x_k) - f(x_{k+1})\geq \frac{1}{2L}\|\nabla f(x_k)\|^2.

This inequality is really great because one can use telescoping to arrive at

\displaystyle  \frac{1}{2L}\sum_{k=0}^N \|\nabla f(x_k)\|^2 \leq f(x_0) - f(x_{N+1}) \leq f(x_0) - f^*

with the optimal value {f^*} (note that we do not need to know {f^*} for the following). We immediately arrive at

\displaystyle  \min_{0\leq k\leq N} \|\nabla f(x_k)\| \leq \frac{1}{\sqrt{N+1}}\sqrt{2L(f(x_0)-f^*)}.

That’s already a result on the iteration complexity! Among the first {N} iterates there is one which has a gradient norm of order {N^{-1/2}}.
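As a sanity check, here is a tiny numerical sketch of this estimate on a made-up smooth, non-convex test function for which {L} and {f^*} happen to be known (both of these are assumptions for the purpose of illustration):

```python
import numpy as np

# f(x) = sum_i (1 - cos(x_i)) is smooth and non-convex with grad f(x) = sin(x);
# since |f''| <= 1 the gradient is Lipschitz with L = 1, and f* = 0.
L = 1.0
f      = lambda x: np.sum(1.0 - np.cos(x))
grad_f = lambda x: np.sin(x)

x = np.full(10, 2.5)          # some starting point
f0, fstar = f(x), 0.0
N = 1000
grad_norms = []
for k in range(N + 1):
    grad_norms.append(np.linalg.norm(grad_f(x)))
    x = x - (1.0 / L) * grad_f(x)      # gradient step with h_k = 1/L

# min_k ||grad f(x_k)|| should be below sqrt(2*L*(f(x_0) - f*)/(N+1))
bound = np.sqrt(2 * L * (f0 - fstar) / (N + 1))
print(min(grad_norms), "<=", bound)
```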

However, from here on it’s getting complicated: we cannot say anything about the function values {f(x_k)} or about convergence of the iterates {x_k}. And even for convex functions {f} (which allow for more estimates from above and below) one needs some more effort to prove convergence of the function values to the globally minimal one.

But how about convergence of the iterates of the gradient method if convexity is not given? It turns out that this is a hard problem. As an illustration, consider the continuous case, i.e. a trajectory of the dynamical system

\displaystyle  \dot x = -\nabla f(x)

(which is a continuous limit of the gradient method as the stepsize goes to zero). A physical intuition about this dynamical system in {{\mathbb R}^2} is as follows: the function {f} describes a landscape and {x} are the coordinates of an object. Now, if the landscape is slippery, the object slides down the landscape and, if we omit friction and inertia, it always slides in the direction of the negative gradient. Consider now a favorable situation: {f} is smooth, bounded from below, and the level sets {\{f\leq t\}} are compact. What can one say about the trajectories of {\dot x = -\nabla f(x)}? Well, it seems clear that one will arrive at a local minimum after some time. But with a little imagination one sees that the trajectory of {x} does not even have to be of finite length! To see this, consider a landscape {f} that is a kind of bowl-shaped valley with a path that goes down the hillside in a spiral such that it winds around the minimum infinitely often. This situation seems somewhat pathological and one usually does not expect situations like this in practice. But if you have ever tried to prove convergence of the iterates of gradient or subgradient descent, you may have wondered why the proofs turn out to be so complicated. The reason lies in the fact that such pathological functions are not excluded. But what functions should be excluded in order to avoid this pathological behavior without restricting to too simple functions?

3. The Kurdyka-Łojasiewicz inequality

Here the so-called Kurdyka-Łojasiewicz inequality comes into play. I do not know its history well, but if you want a pointer, you could start with the paper “On gradients of functions definable in o-minimal structures” by Kurdyka.

The inequality is a way to turn a complexity estimate for the gradient of a function into a complexity estimate for the function values. Hence, one would like to control the difference in function value by the gradient. One way to do so is the following:

Definition 1 Let {f} be a real valued function and assume (without loss of generality) that {f} has a unique minimum at {0} and that {f(0)=0}. Then {f} satisfies a Kurdyka-Łojasiewicz inequality if there exists a differentiable function {\kappa:[0,r]\rightarrow {\mathbb R}} on some interval {[0,r]} with {\kappa'>0} and {\kappa(0)=0} such that

\displaystyle  \|\nabla(\kappa\circ f)(x)\|\geq 1

for all {x} such that {f(x)<r}.

Informally, this definition ensures that one can “reparameterize the range of the function such that the resulting function has a kink in the minimum and is steep around that minimum”. This definition is due to the above paper by Kurdyka from 1998. In fact it is a slight generalization of the Łojasiewicz inequality (which dates back to a note of Łojasiewicz from 1963) which states that there are some {C>0} and some exponent {\theta} such that in the above situation it holds that

\displaystyle  \|\nabla f(x)\|\geq C|f(x)|^\theta.

To see this, take {\kappa(s) = s^{1-\theta}} and evaluate the gradient to {\nabla(\kappa\circ f)(x) = (1-\theta)f(x)^{-\theta}\nabla f(x)} to obtain {1\leq (1-\theta)|f(x)|^{-\theta}\|\nabla f(x)\|}. This also makes clear that whenever the inequality is fulfilled, the gradient provides control over the function values.
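To sketch why such an inequality also rules out the spiral example from above (this is the standard finite-length argument, stated without any technical care): along the gradient flow {\dot x = -\nabla f(x)} one has {\|\dot x\| = \|\nabla f(x)\|}, and the Kurdyka-Łojasiewicz inequality gives

\displaystyle  -\frac{d}{dt}\kappa(f(x(t))) = \kappa'(f(x(t)))\|\nabla f(x(t))\|^2 \geq \|\dot x(t)\|.

Integrating from {0} to {T} yields

\displaystyle  \int_0^T \|\dot x(t)\|\,dt \leq \kappa(f(x(0))) - \kappa(f(x(T))) \leq \kappa(f(x(0))),

so the trajectory has finite length.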

The works of Łojasiewicz and Kurdyka show that a large class of functions {f} fulfill the respective inequalities, e.g. piecewise analytic functions and even a larger class (functions definable in so-called o-minimal structures) which I haven’t fully understood yet. Since the Kurdyka-Łojasiewicz inequality allows one to turn estimates of {\|\nabla f(x_k)\|} into estimates of {|f(x_k)|}, it plays a key role in the analysis of descent methods. It somehow explains why one never sees pathological behavior such as infinite minimization paths in practice. Lately there have been several works on further generalizations of the Kurdyka-Łojasiewicz inequality to the non-smooth case, see e.g. Characterizations of Lojasiewicz inequalities: subgradient flows, talweg, convexity by Bolte, Daniilidis, Ley and Mazet, or Convergence of non-smooth descent methods using the Kurdyka-Łojasiewicz inequality by Noll (however, I do not try to give an overview of the latest developments here). Especially here at the French-German-Polish Conference on Optimization, which takes place these days in Krakow, the Kurdyka-Łojasiewicz inequality has popped up several times.

The second day of SSVM started with an invited lecture by Tony Lindeberg, who has written one very influential and very early book about scale space theory. His talk was both a tour through scale space and a recap of recent developments in the field. Especially, he showed how the time aspect can be incorporated into scale space analysis by a close inspection of how receptive fields work. There were more talks but I only took notes during the talk of Jan Lellmann, who talked about the problem of generating an elevation map from a few given level lines. One application of this could be to observe coastlines at different tides and then try to reconstruct the full height map at the coast. One specific feature here is that the surface one looks for may have ridges which stem from kinks in the level lines, and these ridges are important features of the surface. He argued that a pure convex regularization will not work and proposed to use more input, namely a vector field which is derived from the contour lines such that the vectors somehow “follow the ridges”, i.e. connect the level lines in a correct way.

Finally, another observation I had today: well, this is not a trend, but a notion which I heard for the first time here and which sounds very natural is the informal classification of data terms in variational models as “weak” or “strong”. For example, a denoising data term {\|u-u^0\|^2_{L^2(\Omega)}} is a strong data term because it gives tight information on the whole set {\Omega}. On the other hand, an inpainting data term {u|_{\Omega\setminus\Omega'} = u^0|_{\Omega\setminus\Omega'}} is a weak data term because it basically tells you nothing within the region {\Omega'}.

In the afternoon the whole conference went on a tour to three amazing places:

  • the Riegersburg, which is not only an impressive castle but also features interesting exhibitions about old arms and witches,
  • the Zotter chocolate factory where they make amazing chocolate in mind-boggling varieties,
  • and to Schloss Kronberg for the conference dinner (although it was pretty tough to start eating the dinner after visiting Zotter…).

I am at SSVM 13 and the first day is over. The conference is in a very cozy place in Austria, namely Schloss Seggau, which is located on a small hill near the small town of Leibnitz in Styria (which is in Austria but totally unaffected by the floods). The conference is also not very large, but there is a good crowd of interesting people here and the program looks interesting.

Schloss Seggau

The time schedule is tight and I don’t know if I can manage to post something every day. In particular, I will not be able to blog about every talk.

From the first day I have the impression that two things have appeared frequently at different places:

  • Differential geometry: Tools from differential geometry like the notion of geodesics or Riemannian metrics have not only been used by Martin Rumpf to model different types of spaces of shapes but also to interpolate between textures.
  • Primal dual methods: Variational models should be convex these days and if they are not you should make them convex somehow. Then you can use primal dual methods. Somehow the magic of primal dual methods is that they are incredibly flexible. Also they work robustly and reasonably fast.

Probably, there are more trends to see in the next days.

In case anybody wondered: I did not manage to produce a single post from Oberwolfach last week. For me it was an unusually dense week: not only were there many interesting talks on a tight schedule, but the “talk-free” parts of the days were filled with interesting and stimulating discussions, and the night life did not permit writing posts either. Sorry – but you will probably read about the outcome of some of the discussions in future blog posts…

Guess where this is:

Well, I spoiled the solution in the title anyway (and of course, some of you recognized the sculpture in the last picture): I am in Oberwolfach this week at the workshop “Computational Inverse Problems”. Maybe I’ll find some time to blog about some talks but I am not sure if I can make it. Usually the Oberwolfach workshops are quite intense and hence, blogging will be too much…

Anyway, the autumn looks great here these days:

ISMP is over now and I’m already home. I do not have many things to report from the last day. This is not due to a lower quality of the talks but to the fact that I was a little bit exhausted, as usual at the end of a five-day conference. However, I collected a few things for the record:

  • In the morning I visited the semi-plenary by Xiaojun Chen on non-convex and non-smooth minimization with smoothing methods. Not surprisingly, she treated the problem

    \displaystyle \min_x f(x) + \|x\|_p^p

    with convex and smooth {f:{\mathbb R}^n\rightarrow{\mathbb R}} and {0<p<1}. She proposed and analyzed smoothing methods, that is, to smooth the problem a bit to obtain a Lipschitz-continuous objective function {\phi_\epsilon}, minimize this and then gradually decrease {\epsilon}. This works, as she showed. If I remember correctly, she also treated “iteratively reweighted least squares” as I described in my previous post. Unfortunately, she did not include the generalized forward-backward methods based on {\text{prox}}-functions for non-convex functions (a small sketch of plain forward-backward thresholding follows after this list). Kristian and I pursued this approach in our paper Minimization of non-smooth, non-convex functionals by iterative thresholding and some special features of our analysis include:

    • A condition which excludes some (but not all) local minimizers from being global.
    • An algorithm which avoids these non-global minimizers by carefully adjusting the steplength of the method.
    • A result that the number of local minimizers is still finite, even if the problem is posed in {\ell^2({\mathbb N})} and not in {{\mathbb R}^n}.

    Most of our results hold true, if the {p}-quasi-norm is replaced by functions of the form

    \displaystyle \sum_n \phi_n(|x_n|)

    with special non-convex {\phi}, namely fulfilling a list of assumptions like

    • {\phi'(x) \rightarrow \infty} for {x\rightarrow 0} (infinite slope at {0}) and {\phi(x)\rightarrow\infty} for {x\rightarrow\infty} (mild coercivity),
    • {\phi'} strictly convex on {]0,\infty[} and {\phi'(x)/x\rightarrow 0} for {x\rightarrow\infty},
    • for each {b>0} there is {a>0} such that for {x<b} it holds that {\phi(x)>ax^2}, and
    • local integrability of some section of {\partial\phi'(x) x}.

    As one easily sees, {p}-quasi-norms fulfill the assumptions and some other interesting functions as well (e.g. some with very steep slope at {0} like {x\mapsto \log(x^{1/3}+1)}).

  • Jorge Nocedal gave a talk on second-order methods for non-smooth problems and his main example was a functional like

    \displaystyle \min_x f(x) + \|x\|_1

    with a convex and smooth {f}, but different from Xiaojun Chen, he only considered the {1}-norm. His talk is among the best plenary talks I have ever attended and it was a great pleasure to listen to him. He carefully explained things and put them in perspective. In the cases where he skipped slides, he made me feel that I either did not miss an important thing or understood it even though he didn’t show it. He argued that it is not necessarily more expensive to use second-order information in contrast to first-order methods. Indeed, the {1}-norm can be used to reduce the number of degrees of freedom for a second-order step. What was pretty interesting is that he advocated semismooth Newton methods for this problem. Roland and I pursued this approach some time ago in our paper A Semismooth Newton Method for Tikhonov Functionals with Sparsity Constraints and, if I remember correctly (my notes are not complete at this point), his family of methods included our ssn-method. The method Roland and I proposed worked amazingly well in the cases in which it converged, but it suffered from non-global convergence. We had some preliminary ideas for globalization, which we could not tune enough to retain the speed of the method, and abandoned the topic. Now that the topic will most probably be revived by the community, I am looking forward to fresh ideas here.
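For contrast with these second-order methods, here is a minimal sketch of plain forward-backward splitting (iterative thresholding) for {\min_x f(x) + \sum_n \phi_n(|x_n|)}, shown with the {1}-norm whose prox is soft thresholding; for non-convex penalties such as the {p}-quasi-norm the thresholding function changes, but the structure of the iteration stays the same (the test problem is a made-up example):

```python
import numpy as np

def forward_backward(grad_f, prox_phi, x0, stepsize, iters=500):
    # gradient step on the smooth f, then prox of the penalty
    x = x0
    for _ in range(iters):
        x = prox_phi(x - stepsize * grad_f(x), stepsize)
    return x

# l1 penalty: the prox is soft thresholding
soft = lambda z, t: np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

# toy smooth part f(x) = 0.5*||Ax - b||^2 with grad f(x) = A^T(Ax - b)
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 50)); b = rng.standard_normal(20)
grad = lambda x: A.T @ (A @ x - b)
step = 1.0 / np.linalg.norm(A, 2) ** 2    # 1/L with L = ||A||_2^2
x = forward_backward(grad, soft, np.zeros(50), step)
```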

Today there are several things I could blog on. The first is the plenary by Rich Baraniuk on Compressed Sensing. However, I don’t think that I could reflect the content in a way which would be helpful for a potential reader. Just for the record: if you have the chance to attend one of Rich’s talks: do it!

The second thing is the talk by Bernd Hofmann on source conditions, smoothness and variational inequalities and their use in regularization of inverse problems. However, this would be too technical for now and I just did not take enough notes to write a meaningful post.

The third thing is the talk by Christian Clason on inverse problems with uniformly distributed noise. He argued that for uniform noise it is much better to use an {L^\infty} discrepancy term instead of the usual {L^2} one. He presented a path-following semismooth Newton method to solve the problem

\displaystyle \min_x \frac{1}{p}\|Kx-y^\delta\|_\infty^p + \frac{\alpha}{2}\|x\|_2^2

and showed examples with different kinds of noise. Indeed the examples showed that {L^\infty} works much better than {L^2} here. But in fact it works even better if the noise is not uniformly distributed but “impulsive”, i.e. it attains the bounds {\pm\delta} almost everywhere. It seems to me that uniform noise would need a slightly different penalty but I don’t know which one – probably you do? Moreover, Christian presented the balancing principle to choose the regularization parameter (without knowledge about the noise level) and this was the first time I really got what it’s about. What one does here is to choose {\alpha} such that (for some {\sigma>0} which only depends on {K}, but not on the noise)

\displaystyle \sigma\|Kx_\alpha^\delta-y^\delta\|_\infty = \frac{\alpha}{2}\|x_\alpha^\delta\|_2^2.

The rationale behind this is that the left hand side is monotonically non-decreasing in {\alpha}, while the right hand side is monotonically non-increasing. Hence, there should be some {\alpha} “in the middle” which makes both roughly equally large. Of course, we neither want to “over-regularize” (which would usually “smooth too much”) nor to “under-regularize” (which would not eliminate the noise). Hence, balancing seems to be a valid choice. From a practical point of view the balancing is also nice because one can use the fixed-point iteration

\displaystyle \alpha^{n+1} = 2\sigma\frac{\|Kx_{\alpha^n}^\delta - y^\delta\|_\infty}{\|x_{\alpha^n}^\delta\|_2^2}

which converges within a few iterations.
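In pseudo-Python the fixed-point iteration could look as follows; here solve_tikhonov is a hypothetical black box that returns the minimizer {x_\alpha^\delta} for a given {\alpha} (e.g. computed by the path-following semismooth Newton method from the talk):

```python
import numpy as np

def balance_alpha(K, y_delta, solve_tikhonov, sigma, alpha=1.0, tol=1e-6):
    # fixed-point iteration for the balancing principle
    for _ in range(50):
        x = solve_tikhonov(alpha)                           # x_alpha^delta
        residual = np.linalg.norm(K @ x - y_delta, np.inf)  # L^inf discrepancy
        alpha_new = 2.0 * sigma * residual / np.linalg.norm(x) ** 2
        if abs(alpha_new - alpha) <= tol * alpha:
            break
        alpha = alpha_new
    return alpha_new, x
```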

Then there was the talk by Esther Klann, but unfortunately, I was late and so I only heard the last half…

Last but not least there was the talk by Christiane Pöschl. If you are interested in Total-Variation-Denoising (TV denoising), then you have probably heard many times that “TV denoising preserves edges” (have a look at the Wikipedia page – it claims this twice). What Christiane showed (in a work with Vicent Caselles and M. Novaga) is that this claim is not true in general but only in very special cases. In the case of characteristic functions, the only sets whose characteristic functions have TV minimizers with sharp edges are the so-called calibrable sets, introduced by Caselles et al. Building on earlier works by Caselles and co-workers she calculated exact minimizers for TV denoising in the case that the image consists of characteristic functions of two convex sets or of a single star-shaped domain, that is, for a given set {B} she calculated the solution of

\displaystyle \min_u\int (u - \chi_B)^2dx + \lambda \int|Du|.

This is not as easy as it may sound. Even the minimizer for a single convex set takes some effort. She presented a nice connection between the shape of the obtained level-sets and the morphological operators of closing and opening. With the help of this link she derived a methodology to obtain the exact TV denoising minimizer for all parameters. I do not have the images right now, but be assured that most of the time the minimizers do not have sharp edges all over the place. Even for simple geometries (like two rectangles touching in a corner) strange things happen and only very few sharp edges appear. I’ll keep you posted in case the paper comes out (or appears as a preprint).

Christiane has some nice images which make this much more clear:

For two circles, edges are preserved if they are far enough away from each other. If they are close, the area “in between” them is filled in and, moreover, gets this fuzzy boundary. I remember seeing effects like this in the output of TV solvers and thinking “well, it seems that the algorithm is either not good or not converged yet – TV should output sharp edges!”.

For a star-shaped domain (well, actually a star) the output looks like this. The corners are not only rounded but also blurred, and this is true both for the “outer” corners and the “inner” corners.

So, if you have any TV-minimizing code, go ahead and check if your code actually does the right thing on images like this! Moreover, I would love to see similar results for more complicated extensions of TV like Total Generalized Variation, which I treated here.

Today I report on two things I came across here at ISMP:

  • The first is a talk by Russell Luke on Constraint qualifications for nonconvex feasibility problems. Luke treated the NP-hard problem of finding sparsest solutions of linear systems. In fact he did not tackle this problem directly but the problem of finding an {s}-sparse solution of an {m\times n} system of equations. He formulated this as a feasibility problem (well, Heinz Bauschke was a collaborator) as follows: with the usual malpractice let us denote by {\|x\|_0} the number of non-zero entries of {x\in{\mathbb R}^n}. Then the problem of finding an {s}-sparse solution to {Ax=b} is:

    \displaystyle  \text{Find}\ x\ \text{in}\ \{\|x\|_0\leq s\}\cap\{Ax=b\}.

    In other words: find a feasible point, i.e. a point which lies in the intersection of the two sets. Most often, feasibility problems involve convex sets, but here the first set, given by this “{0}-norm”, is definitely not convex. One of the simplest algorithms for the convex feasibility problem is to alternatingly project onto both sets. This algorithm dates back to von Neumann and has been analyzed in great detail. To make this method work for non-convex sets one only needs to know how to project onto both sets (a small sketch of this alternating projection scheme follows after this list). For the case of the equality constraint {Ax=b} one can use numerical linear algebra to obtain the projection. The non-convex constraint on the number of non-zero entries is in fact even easier: for {x\in{\mathbb R}^n} the projection onto {\{\|x\|_0\leq s\}} consists of just keeping the {s} largest entries of {x} while setting the others to zero (known as the “best {s}-term approximation”). However, the theory breaks down in the case of non-convex sets. Russell treated this problem in several papers (have a look at his publication page) and in the talk he focused on the question of constraint qualification, i.e. what kind of regularity has to be imposed on the intersection of the two sets. He could show that (local) linear convergence of the algorithm (which is observed numerically) can indeed be justified theoretically. One point which is still open is the phenomenon that the method seems to be convergent regardless of the initialization and (even more surprisingly) that the limit point seems to be independent of the starting point (and also seems to be robust with respect to overestimating the sparsity {s}). I wondered if his results are robust with respect to inexact projections. For larger problems the projection onto the equality constraint {Ax=b} is computationally expensive: for example, it would be interesting to see what happens if one approximates the projection with a truncated CG iteration as Andreas, Marc and I did in our paper on subgradient methods for Basis Pursuit.

  • Joel Tropp reported on his paper Sharp recovery bounds for convex deconvolution, with applications, joint work with Michael McCoy. However, in his title he used “demixing” instead of “deconvolution” (which, I think, is more appropriate and leads to less confusion). By “demixing” they mean the following: suppose you have two signals {x_0} and {y_0} of which you observe only the superposition of {x_0} and a unitarily transformed {y_0}, i.e. for a unitary matrix {U} you observe

    \displaystyle  z_0 = x_0 + Uy_0.

    Of course, without further assumptions there is no way to recover {x_0} and {y_0} from the knowledge of {z_0} and {U}. As one motivation he used the assumption that both {x_0} and {y_0} are sparse. After the big bang of compressed sensing nobody wonders that one turns to convex optimization with {\ell^1}-norms in the following manner:

    \displaystyle   \min_{x,y} \|x\|_1 + \lambda\|y\|_1 \ \text{such that}\ x + Uy = z_0. \ \ \ \ \ (1)

    This looks a lot like sparse approximation: eliminating {x} one obtains the unconstrained problem

    \displaystyle  \min_y \|z_0-Uy\|_1 + \lambda \|y\|_1.

    Phrased differently, this problem aims at finding an approximate sparse solution of {Uy=z_0} such that the residual (one could also say “noise”) {z_0-Uy=x} is itself sparse. This differs from common Basis Pursuit Denoising (BPDN) in the structure of the data term for the residual (which in BPDN is the squared {2}-norm). This is due to the fact that in BPDN one usually assumes Gaussian noise, which naturally leads to the squared {2}-norm. Well, one man’s noise is the other man’s signal, as we see here. Tropp and McCoy obtained very sharp thresholds on the sparsity of {x_0} and {y_0} which allow for exact recovery of both of them by solving (1) (see the sketch after this list). One thing which makes their analysis simpler is the following reformulation: they treated the related problem

    \displaystyle  \min_{x,y} \|x\|_1\ \text{such that}\ \|y\|_1\leq\alpha,\ x+Uy=z_0

    (which I would call the Ivanov version of the Tikhonov-problem (1)). This allows for precise exploitation of prior knowledge by assuming that the number {\alpha_0 = \|y_0\|_1} is known.

    First I wondered if this reformulation was responsible for their unusually sharp results (sharper than the results for exact recovery by BPDN), but I think it’s not. I think this is due to the fact that they have this strong assumption on the “residual”, namely that it is sparse. This can be formulated with the help of the {1}-norm (which is “non-smooth”), in contrast to the smooth squared {2}-norm which is what one gets as a prior for Gaussian noise. Moreover, McCoy and Tropp generalized their result to the case in which the structure of {x_0} and {y_0} is formulated by two functionals {f} and {g}, respectively. Assuming a kind of non-smoothness of {f} and {g}, they obtain the same kind of results, and especially matrix decomposition problems are covered.
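Since both projections in Russell Luke’s feasibility formulation are simple, the alternating projection method is easy to sketch; here is a minimal toy version (the instance and iteration count are my own illustrative assumptions, not a reproduction of his experiments):

```python
import numpy as np

def proj_sparse(x, s):
    # projection onto {||x||_0 <= s}: keep the s largest entries in magnitude
    # (the "best s-term approximation")
    z = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-s:]
    z[idx] = x[idx]
    return z

def proj_affine(x, A, b):
    # orthogonal projection onto {x : Ax = b}; for large problems this
    # linear solve is the expensive part (and could be done inexactly)
    return x - A.T @ np.linalg.solve(A @ A.T, A @ x - b)

def alternating_projections(A, b, s, iters=500):
    x = proj_affine(np.zeros(A.shape[1]), A, b)
    for _ in range(iters):
        x = proj_affine(proj_sparse(x, s), A, b)
    return x  # at convergence the outputs of both projections coincide

# toy instance: a 5-sparse solution of 40 equations in 100 unknowns
rng = np.random.default_rng(0)
A = rng.standard_normal((40, 100))
x_true = np.zeros(100)
x_true[rng.choice(100, 5, replace=False)] = rng.standard_normal(5)
b = A @ x_true
print(np.linalg.norm(alternating_projections(A, b, 5) - x_true))
```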
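And here is a toy version of the demixing program (1), sketched with cvxpy as a generic convex solver (the instance and the solver choice are my assumptions; any solver for this kind of convex program would do):

```python
import numpy as np
import cvxpy as cp

# toy demixing instance: x0 and y0 sparse, U a random unitary matrix
rng = np.random.default_rng(1)
n, s = 100, 4
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
x0 = np.zeros(n); x0[rng.choice(n, s, replace=False)] = 1.0
y0 = np.zeros(n); y0[rng.choice(n, s, replace=False)] = 1.0
z0 = x0 + U @ y0

# solve min ||x||_1 + lam*||y||_1  subject to  x + U y = z0
x, y = cp.Variable(n), cp.Variable(n)
lam = 1.0
cp.Problem(cp.Minimize(cp.norm1(x) + lam * cp.norm1(y)),
           [x + U @ y == z0]).solve()
print(np.linalg.norm(x.value - x0), np.linalg.norm(y.value - y0))
```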

The second day of ISMP started (for me) with the session I organized and chaired.

The first talk was by Michael Goldman on Continuous Primal-Dual Methods in Image Processing. He considered the continuous Arrow-Hurwitz method for saddle point problems

\displaystyle \min_{u}\max_{\xi} K(u,\xi)

with {K} convex in the first and concave in the second variable. The continuous Arrow-Hurwitz method consists of solving

\displaystyle \begin{array}{rcl} \partial_t u(t) &=& -\nabla_u K(u(t),\xi(t))\\ \partial_t \xi(t) &=& \nabla_\xi K(u(t),\xi(t)). \end{array}

His talk revolved around the case in which {K} comes from a functional which contains the total variation, namely he considered

\displaystyle K(u,\xi) = -\int_\Omega u\text{div}(\xi) + G(u)

with the additional constraints {\xi\in C^1_C(\Omega,{\mathbb R}^2)} and {|\xi|\leq 1}. For the case {G(u) = \lambda\|u-f\|^2/2} he presented a nice analysis of the problem, including convergence of the method to a solution of the primal problem and some a-posteriori estimates. This reminded me of Showalter’s method for the regularization of ill-posed problems: the Arrow-Hurwicz method looks like a regularized version of Showalter’s method and hence, early stopping does not seem to be necessary for regularization. The related paper is Continuous Primal-Dual Methods for Image Processing.

The second talk was given by Elias Helou and was on Incremental Algorithms for Convex Non-Smooth Optimization with Applications in Image Reconstructions. He presented his work on a very general framework for problems of the class

\displaystyle \min_{x\in X} f(x)

with a convex function {f} and a convex set {X}. Basically, he abstracted the properties of the projected subgradient method, which consists of iteratively taking subgradient descent steps for {f} followed by projections onto {X}: with a subgradient {g^n\in\partial f(x^n)} this reads as

\displaystyle x^{n+1} = P_X(x^n -\alpha_n g^n)

He extracted the conditions one needs from the subgradient descent step and from the projection step and formulated an algorithm which consists of the successive application of an “optimality operator” {\mathcal{O}_f} (replacing the subgradient step) and a “feasibility operator” {\mathcal{F}_X} (replacing the projection step). The algorithm then reads as

\displaystyle \begin{array}{rcl} x^{n+1/2} &=& \mathcal{O}_f(x^n,\alpha_n)\\ x^{n+1} &=& \mathcal{F}_X(x^{n+1/2}) \end{array}

and he showed convergence under the extracted conditions. The related paper is Incremental Subgradients for Constrained Convex Optimization: a Unified Framework and New Methods.
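A minimal sketch of this abstraction, with the projected subgradient method as the concrete instance (the test problem is a made-up example):

```python
import numpy as np

def incremental_scheme(opt_op, feas_op, x0, steps):
    # successive application of an "optimality operator" O_f and a
    # "feasibility operator" F_X
    x = x0
    for alpha in steps:
        x = feas_op(opt_op(x, alpha))
    return x

# instance: minimize f(x) = ||x - a||_1 over the Euclidean unit ball X
a = np.array([2.0, -3.0, 0.5])
opt_op = lambda x, alpha: x - alpha * np.sign(x - a)  # subgradient step
feas_op = lambda x: x / max(1.0, np.linalg.norm(x))   # projection onto X
steps = [1.0 / np.sqrt(n + 1) for n in range(5000)]   # diminishing steps
print(incremental_scheme(opt_op, feas_op, np.zeros(3), steps))
```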

The third talk was by Jerome Fehrenbach on Stripes removal in images, applications in microscopy. He considered the problem of a very specific kind of noise which appears in the form of stripes (for example, in “single plane illumination microscopy”). In fact he considered a slightly more general case, and the model he proposed was as follows: The observed image is

\displaystyle u_{\text{OBS}} = u + n,

i.e. the usual sum of the true image {u} and noise {n}. However, for the noise he assumed that it is given by

\displaystyle n = \sum_{j=1}^m \psi_j*\lambda_j,

i.e. it is a sum of different convolutions. The {\psi_j} are a kind of shape functions which describe the “pattern of the noise” and the {\lambda_j} are samples of noise processes following specific distributions (these could be white noise realizations, impulsive noise or something else). He then formulated a variational method to identify the variables {\lambda_j} which reads as

\displaystyle \min \|\nabla(u_{\text{OBS}} - \sum_{j=1}^m \psi_j*\lambda_j)\|_1 + \sum_j \phi_j(\lambda_j).

Basically, this is the usual variational approach to image denoising, but now the optimization variable is the noise rather than the image. This is due to the fact that the noise has a specific complicated structure and the usual formulation with {u = u_{\text{OBS}} - n} as the unknown is not feasible. He used the primal-dual algorithm by Chambolle and Pock for this problem and showed that the method works well on real world problems.

Another theme which caught my attention here is “optimization with variational inequalities as constraints”. At first glance that sounds pretty awkward: variational inequalities can be quite complicated things, and why on earth would somebody consider them as side conditions in optimization problems? In fact there are good reasons to do so. One reason is that one has to deal with bi-level optimization problems. Consider an optimization problem

\displaystyle \min_{x\in C} F(x,p) \ \ \ \ \ (1)


with convex {C} and {F(\cdot,p)} (omitting regularity conditions which could be necessary to impose) depending on a parameter {p}. Now consider the case that you want to choose the parameter {p} in an optimal way, i.e. it solves another optimization problem. This could look like

\displaystyle \min_p G(x)\quad\text{s.t.}\ x\ \text{solves (1)}. \ \ \ \ \ (2)


Now you have an optimization problem as a constraint. To proceed, we use the optimality condition for problem (1): for differentiable {F}, {x^*} solves (1) if and only if

\displaystyle \forall y\in C:\ \nabla_x F(x^*(p),p)(y-x^*(p))\geq 0.

In other words: we can reformulate (2) as

\displaystyle \min_p G(x)\quad\text{s.t.}\ \forall y\in C:\ \nabla_x F(x^*(p),p)(y-x^*(p))\geq 0. \ \ \ \ \ (3)


And there it is, our optimization problem with a variational inequality as constraint. Here at ISMP there are entire sessions devoted to this, see here and here.
