Math


The Douglas-Rachford method is a method to solve a monotone inclusion {0\in (A+B)x} with two maximally monotone operators {A,B} defined on a Hilbert space {X}. The method uses the resolvents {(I + \lambda A)^{-1}} and {(I + \lambda B)^{-1}} and produces two sequences of iterates

\displaystyle  \begin{array}{rcl}  x^{k+1}& =& (I + \lambda B)^{-1}(v^k)\\ v^{k+1} & = & v^k + (I+\lambda A)^{-1}(2x^{k+1} - v^k) -x^{k+1}. \end{array}

Looks pretty opaque to me and I did not have some good intuition where this methods comes from and why it should work. Here’s a way I can remember (and which is copied from “Preconditioned Douglas-Rachford Splitting Methods for Convex-Concave Saddle-Point Problems” by Hongpeng Sun and Kristian Bredies):

Substituting {w = Ax} gives the optimality system

\displaystyle  0 \in w + Bx,\qquad 0 \in -x + A^{-1} w

or, written differently

\displaystyle  0 \in \begin{bmatrix} B & I\\ -I & A^{-1} \end{bmatrix} \begin{bmatrix} x\\w \end{bmatrix}.

This is again a monotone inclusion, but now on {X\times X}. We introduce the positive definite operator

\displaystyle  M = \begin{bmatrix} I & -I\\ -I & I \end{bmatrix}

and perform the iteration

\displaystyle  (M + \begin{bmatrix} B & I\\ -I & A^{-1} \end{bmatrix}) \begin{bmatrix} x^{k+1}\\w^{k+1} \end{bmatrix} \ni M \begin{bmatrix} x^k\\w^k \end{bmatrix}.

(This is basically the same as applying the proximal point method to the preconditioned inclusion

\displaystyle  0\in M^{-1} \begin{bmatrix} B & I\\ -I & A^{-1} \end{bmatrix} \begin{bmatrix} x\\w \end{bmatrix}.)

Writing out the iteration gives

\displaystyle  \begin{array}{rcl}  x^{k+1} & = &(I + B)^{-1}(x^k - w^k)\\ w^{k+1} & = &(I + A^{-1})^{-1}(w^k + 2x^{k+1} - x^k). \end{array}

Now, applying the Moreau identity for monotone operators ({(I + A)^{-1} + (I+A^{-1})^{-1} = I}), gives

\displaystyle  \begin{array}{rcl}  x^{k+1} & = &(I + B)^{-1}(x^k - w^k)\\ w^{k+1} & = &w^k + 2x^k - x^k - (I + A)^{-1}(w^k + 2x^{k+1} - x^k) \end{array}

substituting {v^k = x^k - w^k} finally gives Douglas-Rachford:

\displaystyle  \begin{array}{rcl}  x^{k+1} & = &(I + B)^{-1}(v^k)\\ v^{k+1} & = & -x^{k+1} + v^k + (I + A)^{-1}(2x^{k+1} - v^k) \end{array}

(besides the stepsize {\lambda} which we would get by starting with the equivalent inclusion {0 \in \lambda(A+B)x} in the first place).

Probably the shortest derivation of Douglas-Rachford I have seen. Oh, and also the (weak) convergence proof comes for free: It’s a proximal point iteration and you just use the result by Rockafellar from “Monotone operators and the proximal point algorithm”, SIAM J. Control and Optimization 14(5), 1976.

Currently I am at the SIAM Imaging conference in Hong Kong. It’s a great conference with great people at a great place. I am pretty sure that this will be the only post from here, since the conference is quite intense. I just wanted to report on two ideas that have become clear here, although, they are both pretty easy and probably already widely known, but anyway:

1. Non-convex + convex objective

There are a lot of talks that deal with optimization problems of the form

\displaystyle  \min_u F(u) + G(u).

Especially, people try to leverage as much structure of the functionals {F} and {G} as possible. Frequently, there arises a need to deal with non-convex parts of the objective, and indeed, there are several approaches around that deal in one way or another with non-convexity of {F} or even {F+G}. Usually, in the presence of an {F} that is not convex, it is helpful if {G} has favorable properties, e.g. that still {F+G} is bounded from below, coercive or even convex again. A particularly helpful property is strong convexity of {G} (i.e. {G} stays convex even if you subtract {\epsilon/2\|\cdot\|^2} from it). Here comes the simple idea: If you already allow {F} to be non-convex, but only have a {G} that is merely convex, but not strongly so, you can modify your objective to

\displaystyle  \underbrace{F(u) - \tfrac\epsilon2\|u\|^2}_{\leftarrow F(u)} + \underbrace{G(u) + \tfrac\epsilon2\|u\|^2}_{\leftarrow G(u)}

for some {\epsilon>0}. This will give you strong convexity of {G} and an {F} that is (often) theoretically no worse than it used to be. It appeared to me that this is an idea that Kristian Bredies told me already almost ten years ago and which me made into a paper (together with Peter Maaß) in 2005 which got somehow delayed and published no earlier than 2009.

2. Convex-concave saddle point problems

If your problem has the form

\displaystyle  \min_u F(u) + G(Ku)

with some linear operator {K} and both {F} and {G} are convex, it has turned out, that it is tremendously helpful for the solution to consider the corresponding saddle point formulation: I.e. using the convex conjugate {G^*} of {G}, you write

\displaystyle  \min_u \max_v F(u) + \langle Ku, v\rangle -G^*(v).

A class of algorithms, that looks like to Arrow-Hurwicz-method at first glance, has been sparked be the method proposed by Chambolle and Pock. This method allows {F} and {G} to be merely convex (no smoothness or strong convexity needed) and only needs the proximal operators for both {F} and {G^*}. I also worked on algorithms for slightly more general problems, involving a reformulation of the saddle point problem as a monotone inclusion, with Tom Pock in the paper An accelerated forward-backward algorithm for monotone inclusions and I also should mention this nice approach by Bredies and Sun who consider another reformulation of the monotone inclusion. However, in the spirit of the first point, one should take advantage of all the available structure in the problem, e.g. smoothness of one of the terms. Some algorithm can exploit smoothness of either {F} or {G^*} and only need convexity of the other term. An idea, that has been used for some time already, to tackle the case if {F}, say, is a sum of a smooth part and a non-smooth part (and {G^*} is not smooth), is, to dualize the non-smooth part of {F}: Say we have {F = F_1 + F_2} with smooth {F_1}, then you could write

\displaystyle  \begin{array}{rcl}  &\min_u\max_v F_1(u) + F_2(u) + \langle Ku, v\rangle -G^*(v)\\ & \qquad= \max_u \min_{v,w} F_1(u) + \langle u,w\rangle + \langle Ku, v\rangle -G^*(v) - F_2^*(w) \end{array}

and you are back in business, if your method allows for sums of convex functions in the dual. The trick got the sticky name “dual transportation trick” in a talk by Marc Teboulle here and probably that will help, that I will not forget it from now on…

I fell a little bit behind on reporting on my new preprints. In this posts I’ll blog on two closely related ones; one of them already a bit old, the other one quite recent:

The papers are

As clear from the titles, both papers treat a similar method. The first paper contains all the theory and the second one has few particularly interesting applications.

In the first paper we propose to view several known algorithms such as the linearized Bregman method, the Kaczmarz method or the Landweber method from a different angle from which they all are special cases of another algorithm. To start with, consider a linear system

\displaystyle Ax=b

with {A\in{\mathbb R}^{m\times n}}. A fairly simple and old method to solve this, is the Landweber iteration which is

\displaystyle x^{k+1} = x^k - t_k A^T(Ax^k-b).

Obviously, this is nothing else than a gradient descent for the functional {\|Ax-b\|_2^2} and indeed converges to a minimizer of this functional (i.e. a least squares solution) if the stepsizes {t_k} fulfill {\epsilon\leq t_k\leq 2\|A\|^{-2} - \epsilon} for some {\epsilon>0}. If one initializes the method with {x^0=0} it converges to the least squares solution with minimal norm, i.e. to {A^\dag b} (with the pseudo-inverse {A^\dag}).

A totally different method is even older: The Kaczmarz method. Denoting by {a_k} the {k}-th row of {A} and {b_k} the {k}-th entry of {b} the method reads as

\displaystyle x^{k+1} = x^k - a_{r(k)}^T\frac{a_{r(k)}\cdot x^k - b_k}{\|a_{r(k)}\|_2^2}

where {r(k) = (k\mod m) +1} or any other “control sequence” that picks up every index infinitely often. This method also has a simple interpretation: Each equation {a_k\cdot x = b_k} describes a hyperplane in {{\mathbb R}^n}. The method does nothing else than projecting the iterates orthogonally onto the hyperplanes in an iterative manner. In the case that the system has a solution, the method converges to one, and if it is initialized with {x^0=0} we have again convergence to the minimum norm solution {A^\dag b}.

There is yet another method that solves {Ax=b} (but now it’s a bit more recent): The iteration produces two sequences of iterates

\displaystyle \begin{array}{rcl} z^{k+1} & = &z^k - t_k A^T(Ax^k - b)\\ x^{k+1} & = &S_\lambda(z^{k+1}) \end{array}

for some {\lambda>0}, the soft-thresholding function {S_\lambda(x) = \max(|x|-\lambda,0)\mathrm{sgn}(x)} and some stepsize {t_k}. For reasons I will not detail here, this is called the linearized Bregman method. It also converges to a solution of the system. The method is remarkably similar, but different from, the Landweber iteration (if the soft-thresholding function wouldn’t be there, both would be the same). It converges to the solution of {Ax=b} that has the minimum value for the functional {J(x) = \lambda\|x\|_1 + \tfrac12\|x\|_2^2}. Since this solution of close, and for {\lambda} large enough identical, to the minimum {\|\cdot\|_1} solution, the linearized Bregman method is a method for sparse reconstruction and applied in compressed sensing.

Now we put all three methods in a joint framework, and this is the framework of split feasibility problems (SFP). An SFP is a special case of a convex feasibility problems where one wants to find a point {x} in the intersection of multiple simple convex sets. In an SFP one has two different kinds of convex constraints (which I will call “simple” and “difficult” in the following):

  1. Constraints that just demand that {x\in C_i} for some convex sets {C_i}. I call these constraints “simple” because we assume that the projection onto each {C_i} is simple to obtain.
  2. Constraints that demand {A_ix\in Q_i} for some matrices {A_i} and simple convex sets {Q_i}. Although we assume that projections onto the {Q_i} are easy, these constraints are “difficult” because of the presence of the matrices {A_i}.

If there were only simple constraints a very basic method to solve the problem is the methods of alternating projections, also known as POCS (projection onto convex sets): Simply project onto all the sets {C_i} in an iterative manner. For difficult constraints, one can do the following: Construct a hyperplane {H_k} that separates the current iterate {x^k} from the set defined by the constraint {Ax\in Q} and project onto the hyperplane. Since projections onto hyperplanes are simple and since the hyperplane separates we move closer to the constraint set and this is a reasonable step to take. One such separating hyperplane is given as follows: For {x^k} compute {w^k = Ax^k-P_Q(Ax^k)} (with the orthogonal projection {P_Q}) and define

\displaystyle H_k = \{x\ : (A^Tw^k)^T\cdot x \leq (A^Tw^k)^T\cdot x^k - \|w^k\|_2^2\}.

Illustration of projections onto convex sets and separating hyperplanes

Illustration of projections onto convex sets and separating hyperplanes

Now we already can unite the Landweber iteration and the Kaczmarz method as follows: Consider the system {Ax=b} as a split feasibility problem in two different ways:

  1. Treat {Ax=b} as one single difficult constraint (i.e. set {Q=\{b\}}). Some calculations show that the above proposed method leads to the Landweber iteration (with a special stepsize).
  2. Treat {Ax=b} as {m} simple constraints {a_i\cdot x = b_i}. Again, some calculations show that this gives the Kaczmarz method.

Of course, one could also work “block-wise” and consider groups of equations as difficult constraints to obtain “block-Kaczmarz methods”.

Now comes the last twist: By adapting the term of “projection” one gets more methods. Particularly interesting is the notion of Bregman projections which comes from Bregman distances. I will not go into detail here, but Bregman distances are associated to convex functionals {J} and by replacing “projection onto {C_i} or hyperplanes” by respective Bregman projections, one gets another method for split feasibility problems. The two things I found remarkable:

  • The Bregman projection onto hyperplanes is pretty simple. To project some {x^k} onto the hyperplane {H = \{x\ :\ a^T\cdot x\leq \beta\}}, one needs a subgradient {z^k\in\partial J(x^k)} (in fact an “admissible one” but for that detail see the paper) and then performs

    \displaystyle x^{k+1} = \nabla J^*(z^k - t_k a)

    ({J^*} is the convex dual of {J}) with some appropriate stepsize {t_k} (which is the solution of a one-dimensional convex minimization problem). Moreover, {z^{k+1} = z^k - t_k a} is a new admissible subgradient at {x^{k+1}}.

  • If one has a problem with a constraint {Ax=b} (formulated as an SFP in one way or another) the method converges to the minimum-{J} solution of the equation if {J} is strongly convex.

Note that strong convexity of {J} implies differentiability of {J^*} and Lipschitz continuity of {\nabla J} and hence, the Bregman projection can indeed be carried out.

Now one already sees how this relates to the linearized Bregman method: Setting {J(x) = \lambda\|x\|_1 + \tfrac12\|x\|_2^2}, a little calculation shows that

\displaystyle \nabla J^*(z) = S_\lambda(z).

Hence, using the formulation with a “single difficult constraint” leads to the linearized Bregman method with a specific stepsize. It turns out that this stepsize is a pretty good one but also that one can show that a constant stepsize also works as long as it is positive and smaller that {2\|A\|^{-2}}.

In the paper we present several examples how one can use the framework. I see one strengths of this approach that one can add convex constraints to a given problem without getting into any trouble with the algorithmic framework.

The second paper extends a remark that we make in the first one: If one applies the framework of the linearized Bregman method to the case in which one considers the system {Ax=b} as {m} simple (hyperplane-)constraints one obtains a sparse Kaczmarz solver. Indeed one can use the simple iteration

\displaystyle \begin{array}{rcl} z^{k+1} & = &z^k - a_{r(k)}^T\frac{a_{r(k)}\cdot x^k - b_k}{\|a_{r(k)}\|_2^2}\\ x^{k+1} & = &S_\lambda(z^{k+1}) \end{array}

and will converge to the same sparse solution as the linearized Bregman method.

This method has a nice application to “online compressed sensing”: We illustrate this in the paper with an example from radio interferometry. There, large arrays of radio telescopes collect radio emissions from the sky. Each pair of telescopes lead to a single measurement of the Fourier transform of the quantity of interest. Hence, for {k} telescopes, each measurement gives {k(k-1)/2} samples in the Fourier domain. In our example we used data from the Very Large Array telescope which has 27 telescopes leading to 351 Fourier samples. That’s not much, if one want a picture of the emission with several ten thousands of pixels. But the good thing is that the Earth rotates (that’s good for several reasons): When the Earth rotates relative to the sky, the sampling pattern also rotates. Hence, one waits a small amount of time and makes another measurement. Commonly, this is done until the earth has made a half rotation, i.e. one complete measurement takes 12 hours. With the “online compressed sensing” framework we proposed, one can start reconstructing the image as soon the first measurements have arrived. Interestingly, one observes the following behavior: If one monitors the residual of the equation, it goes down during iterations and jumps up when new measurements arrive. But from some point on, the residual stays small! This says that the new measurements do not contradict the previous ones and more interestingly this happened precisely when the reconstruction error dropped down such that “exact reconstruction” in the sense of compressed sensing has happened. In the example of radio interferometry, this happened after 2.5 hours!

Reconstruction by online compressed sensing

Reconstruction by online compressed sensing

You can find slides of a talk I gave at the Sparse Tomo Days here.

It’s out again! Our department has a vacant position for optimization to fill! This time we are seeking a professor (W2) working in discrete optimization.

The official job advertisement has been sent to various newletters and digests and you can find it for example here or here. In addition to that information let me give some more information about the math department here. Basically, I copied the following information from this previous advertisement:

The math department here is a medium sized department. It covers quite broad range of mathematics:

  • Numerical Linear Algebra (Fassbender, Bollhöfer)
  • PDEs (Sonar, Hempel)
  • Modelling (Langemann)
  • Stochastics (Kreiss, Lindner, Leucht)
  • Applied Analysis/Mathematical Physics (Bach, myself)
  • Algebra and Discrete Mathematics (Eick, Löwen, Opolka)

and, of course, Optimization (Zimmermann, tba). In most cases I find some expert around for all my questions that are a bit outside my field. All groups are active and working together smoothly. The department is located in the Carl-Friedrich Gauss Faculty which is also the home of the departments for Computer Science, Business Administration and Social Sciences. At the least in Computer Science and Business Administration there are some mathematically oriented groups, e.g.

and there are several groups with some mathematical background and interesting fields of applications (computer graphics, robotics,…). Moreover, the TU has a lot of engineering institutes with strong background in mathematics and cool applications.
In addition to a lively and interesting research environment, the university treats its staff well (as far as I can see) and administrative burden or failures are not harming too much (in fact less then at other places, I’ve heard)!

Full disclosure: I am the head of the hiring committee this time. All questions you may have about the position can be sent to me.

The deadline for application is 30.04.2014. The deadline is sharp and only electronic applications (addressed to fk1@tu-bs.de) will be considered. Please send a single pdf-file and make sure that all text in the document is searchable.

I recently updated my working hardware and now use a tablet pc for work (namely a Nexus 10). In consequence, I also updated the software I used to have things more synchronized across devices. For my RSS feeds I now use feedly and the gReader app. However, I was not that happy with the method to store and mark paper I found but found the sharing interfaces between the apps pretty handy. I adopted the workflow that when I see a paper that I want to remember I sent them to my Evernote account where I tag them. Then, from time to time I go over the papers I marked and have a more detailed look. If I think, they deserve to be kept for future reference, they get a small entry here. Here’s the first take with just two papers from the last weeks (there are more in my backlog…):

On the convergence rate improvement of a primal-dual splitting algorithm for solving monotone inclusion problems by Radu Ioan Boţ, Ernö Robert Csetnek, André Heinrich, Christopher Hendrich (Math Prog): As first sight, I found this work pretty inaccessible but the title sounded interesting. I was a bit scared by the formula for the kind of problems they investigated: Solve the following inclusion for {x}

\displaystyle 0 \in z + Ax + \sum_{i=1}^m L_i^*((B_i\square D_i)(L_ix -r_i)) + Cx

where {A}, {B_i} and {D_i} are maximally monotone, {D_i} also {\nu_i} strongly monotone, {C} is {\eta}-coercive, {L_i} are linear and bounded and {\square} denotes the parallel sum, i.e. {A\square B = (A^{-1}+B^{-1})^{-1}}. Also the proposed algorithm looked a bit like a monster. Then, on later pager, things became a bit more familiar. As an application, they considered the optimization problem

\displaystyle \min_x f(x) + \sum_{i=1}^m (g_i\square l_i)(L_ix - r_i) + h(x) - \langle x,z\rangle

with convex {f}, {g_i}, {l_i} ({l_i} {\nu_i^{-1}} strongly convex), {h} convex with {\eta}-Lipschitz gradient and {L_i} as above. By noting that the parallel sum is related to the infimal convolution of convex functions, things became clearer. Also, the algorithm looks more familiar now (Algorithm 18 in the paper – I’m too lazy to write it down here). They have an analysis of the algorithms that allow to deduce convergence rates for the iterates (usually {\mathcal{O}(1/n)}) but I haven’t checked the details yet.

Sparse Regularization: Convergence Of Iterative Jumping Thresholding Algorithm by Jinshan Zeng, Shaobo Lin, Zongben Xu: At first I was excited but then I realized that they simple tackled

\displaystyle \min F + \lambda \Phi

with smooth {F} and non-smooth, non-convex {\Phi} by “iterative thresholding”, i.e.

\displaystyle x^{n+1} = \mathrm{prox}_{\mu\lambda\Phi}(x^n - \mu \nabla F(x^n)).

The paper really much resembles what Kristian and I did in the paper Minimization of non-smooth, non-convex functionals by iterative thresholding (at least I couldn’t figure out the improvements…).

Some remark before you read this post: It is on a very specialized topic and only presents a theoretical insight which seems to be of no practical value whatsoever. Continue at your own risk of wasting your time.

Morozov’s discrepancy principle is a means to choose the regularization parameter when regularizing inverse ill-posed problems. To fix the setting, we consider two Hilbert spaces {X} and {Y} and a bounded linear operator {A:X\rightarrow Y}. We assume that the range of {A} is a non-closed subset of {Y}. As a consequence of the Bounded Inverse Theorem the pseudo-inverse {A^\dag} (defined on {{\mathrm rg} A \oplus ({\mathrm rg} A)^\bot}) is unbounded.

This is the classical setting for linear inverse problems: We have a process, modeled by {A}, such that the quantity {x^\dag\in X} we are interested in gives rise to on output {y^\dag = Ax^\dag}. We are able to measure {y} but, as it is always the case, the is some noise introduced in the measurement process and hence, we have access to a measured quantity {y^\delta\in Y} which is a perturbation of {y}. Our goal is, to approximate {x^\dag} from the knowledge of {y^\delta}. Note that simply taking {A^\dag y^\delta} is not an option, since firstly {y^\delta} is not guaranteed to be in the domain of the pseudo-inverse and, somehow even more severely and also more practical, the unboundedness of {A^\dag} will lead to a severe (in theory infinitely large) amplification of the noise, rendering the reconstruction useless.

The key idea is to approximate the pseudo-inverse {A^\dag} by a family of bounded operators {R_\alpha:Y\rightarrow X} with the hope that one may have

\displaystyle  R_\alpha y^\delta \rightarrow x^\dag\ \text{when}\ y^\delta\rightarrow y\in{\mathrm rg} A\ \text{and}\ \alpha\rightarrow 0 \ \text{appropriately.} \ \ \ \ \ (1)

(Note that the assumption {\alpha\rightarrow 0} is just a convention. It says that we assume that {R_\alpha} is a closer approximation to {A^\dag} the closer {\alpha} is to zero.) Now, we have two tasks:

  1. Construct a good family of regularizing operators {R_\alpha} and
  2. devise a parameter choice, i.e. a way to choose {\alpha}.

The famous Bakushinskii veto says that there is no parameter choice that can guarantee convergence in~(1) in the worst case and only uses the given data {y^\delta}. The situation changes if one introduces knowledge about the noise level {\delta = \|y-y^\delta\|}. (There is an ongoing debate if it is reasonable to assume that the noise level is known – my experience when working with engineers is that they are usually very good in quantifying the noise present in their system and hence, in my view the assumption that noise level is known is ok.)

One popular way to choose the regularization parameter in dependence on {y^\delta} and {\delta} is Morozov’s discrepancy principle:

Definition 1 Morozov’s discrepancy principle states that one shall choose the regularization parameter {\alpha(\delta,y^\delta)} such that

\displaystyle  \|AR_{\alpha(\delta,y^\delta)}y^\delta - y^\delta\| = c\delta

for some fixed {c>1}.

In other words: You shall choose {\alpha} such that the reconstruction {R_\alpha y^\delta} produces a discrepancy {\|AR_{\alpha}y^\delta - y^\delta\|} which is in the order of and slightly larger than the noise level.

Some years ago I wrote a paper about the use of Morozov’s discrepancy principle when using the augmented Lagragian method (aka Bregman iteration) as an iterative regularization method (where one can view the inverse of the iteration counter as a regularization parameter). The paper is Morozov’s principle for the augmented Lagrangian method applied to linear inverse problems (together with , the arxiv link is here). In that paper we derived an estimate for the (squared) error of {R_\alpha y^\delta} and {x^\dag} that behaves like

\displaystyle  C \frac{c(\sqrt{c}+1)}{\sqrt{c-1}}\delta

for some {C>0} and the {c>1} from Morozov’s discrepancy principle. The somehow complicated dependence on {c} was a bit puzzling to me. One can optimize {c>1} such that the error estimate is optimal. It turns out that {c\mapsto \frac{c(\sqrt{c}+1)}{\sqrt{c-1}}} attains the minimal value of about {4.68} for about {c=1.64}. I blamed the already quite complicated analysis of the augmented Lagragian method for this obviously much too complicated values (and in practice, using {c} much closer to {1} usually lead to much better results).

This term I teach a course on inverse problems and also covered Morozov’s discrepancy principle but this time for much simpler regularization methods, namely for linear methods such as, for example, Tikhonov regularization, i.e.

\displaystyle  R_\alpha y^\delta = (A^* A + \alpha I)^{-1} A^* y^\delta

(but other linear methods exist). There I arrived at an estimate for the (squared) error of {R_\alpha y^\delta} any {x^\dag} of the form

\displaystyle  C(\sqrt{c} + \tfrac1{\sqrt{c-1}})^2\delta.

Surprisingly, the dependence on {c} for this basic regularization method is also not very simple. Optimizing the estimate over {c>1} leads to an optimal value of about {c= 2.32} (with a minimal value of the respective constant of about {5.73}). Again, using {c} closer to {1} usually lead to much better results.

Well, these observations are not of great importance… I just found it curious to observe that the analysis of Morozov’s discrepancy principle seems to be inherently a bit more complicated than I thought…

Today I’d like to collect some comments one a few papers I stumbled upon recently on the arXiv.

1. TGV minimizers in 1D

First, about a month ago two very similar paper appeared in the same week:

Both papers treat the recently proposed “total generalized variation” model (which is a somehow-but-not-really-higher-order generalization of total variation). The total variation of a function {u\in L^1(\Omega)} ({\Omega\subset{\mathbb R}^d}) is defined by duality

\displaystyle  TV(u) = \sup\Big\{\int_\Omega \mathrm{div} \phi\, u\,dx\ :\ \phi\in C^\infty_c(\Omega,{\mathbb R}^d), |\phi|\leq 1\Big\}.

(Note that the demanded high regularity of the test functions {\phi} is not essential here, as we take a supremum over all these functions under the only, but important, requirement that the functions are bounded. Test functions from {C^1_c(\Omega,{\mathbb R}^d)} would also do.)

Several possibilities for extensions and generalization of the total variation exist by somehow including higher order derivatives. The “total generalized variation” is a particular successful approach which reads as (now using two non-negative parameter {\alpha,\beta} which do a weighting):

\displaystyle  TGV_{\beta,\alpha}^2(u) = \sup\Big\{\int_\Omega \mathrm{div}^2 \phi\, u\,dx\ :\ \phi\in C^\infty_c(\Omega,S^{d\times d}),\ |\phi|\leq \beta,\ |\mathrm{div}\phi|\leq \alpha\Big\}.

To clarify some notation: {S^{d\times d}} are the symmetric {d\times d} matrices, {\mathrm{div}^n} is the negative adjoint of {\nabla^n} which is the differential operator that collects all partial derivatives up to the {n}-th order in a {d\times\cdots\times d}-tensor. Moreover {|\phi|} is some matrix norm (e.g. the Frobenius norm) and {|\mathrm{div}\phi|} is some vector norm (e.g. the 2-norm).

Both papers investigate so called denoising problems with TGV penalty and {L^2} discrepancy, i.e. minimization problems

\displaystyle  \min_u \frac12\int_\Omega(u-u^0)^2\, dx + TGV_{\alpha,\beta}^2(u)

for a given {u^0}. Moreover, both papers treat the one dimensional case and investigate very special cases in which they calculate minimizers analytically. In one dimension the definition of {TGV^2} becomes a little more familiar:

\displaystyle  TGV_{\beta,\alpha}^2(u) = \sup\Big\{\int_\Omega \phi''\, u\,dx\ :\ \phi\in C^\infty_c(\Omega,{\mathbb R}),\ |\phi|\leq \beta,\ |\phi'|\leq \alpha\Big\}.

Some images of both papar are really similar: This one from Papafitsoros and Bredies

068_TGV_hat1

and this one from Pöschl and Scherzer
068_TGV_hat2

Although both paper have a very similar scopes it is worth to read both. The calculations are tedious but both paper try to make them accessible and try hard (and did a good job) to provide helpful illustrations. Curiously, the earlier paper cites the later one but not conversely…

2. Generalized conditional gradient methods

Another paper I found very interesting was

This paper shows a nice duality which I haven’t been aware of, namely the one between the subgradient descent methods and conditional gradient methods. In fact the conditional gradient method which is treated is a generalization of the conditional gradient method which Kristian and I also proposed a while ago in the context of {\ell^1}-minimization in the paper Iterated hard shrinkage for minimization problems with sparsity constraints: To minimize the sum

\displaystyle  F(u) + \Phi(u)

with a differentiable {F} and a convex {\Phi} for which the subgradient of {\Phi} is easily invertible (or, put differently, for which you can minimize {\langle u,a\rangle + \Phi(u)} easily), perform the following iteration:

  1. At iterate {u^n} linearize {F} but not {\Phi} and calculate a new point {v^n} by

    \displaystyle  v^n = \mathrm{argmin}_v \langle F'(u^n),v\rangle + \Phi(v)

  2. Choose a stepsize {s^n\in [0,1]} and set the next iterate as a convex combination of {u^n} and {v^n}

    \displaystyle  u^{n+1} = u^n + s_n(v^n - u^n).

Note that for and indicator function

\displaystyle  \Phi(u) = \begin{cases} 0 & u\in C\\ \infty & \text{else} \end{cases}

you obtain the conditional gradient method (also known as Frank-Wolfe method). While Kristian and I derived convergence with an asymptotic rate for the case of {F(u) = \tfrac12\|Ku-f\|^2} and {\Phi} strongly coercive, Francis uses the formulation {F(u) = f(Au)} the assumption that the dual {f^*} of {f} has a bounded effective domain (which say that {f} has linear growth in all directions). With this assumption he obtains explicit constants and rates also for the primal-dual gap. It was great to see that eventually somebody really took the idea from the paper Iterated hard shrinkage for minimization problems with sparsity constraints (and does not think that we do heuristics for {\ell^0} minimization…).

Next Page »

Follow

Get every new post delivered to your Inbox.

Join 54 other followers