Today there are several things I could blog about. The first is the plenary by Rich Baraniuk on Compressed Sensing. However, I don’t think that I could reflect the content in a way which would be helpful for a potential reader. Just for the record: If you have the chance to visit one of Rich’s talks: do it!

The second thing is the talk by Bernd Hofmann on source conditions, smoothness and variational inequalities and their use in regularization of inverse problems. However, this would be too technical for now and I just did not take enough notes to write a meaningful post.

As a third thing I have the talk by Christian Clason on inverse problems with uniformly distributed noise. He argued that for uniform noise it is much better to use an {L^\infty} discrepancy term instead of the usual {L^2}-one. He presented a path-following semismooth Newton method to solve the problem

\displaystyle \min_x \frac{1}{p}\|Kx-y^\delta\|_\infty^p + \frac{\alpha}{2}\|x\|_2^2

and showed examples with different kinds of noise. Indeed the examples showed that {L^\infty} works much better than {L^2} here. But in fact it works even better if the noise is not uniformly distributed but “impulsive”, i.e. it attains the bounds {\pm\delta} almost everywhere. It seems to me that uniform noise would need a slightly different penalty but I don’t know which one – probably you do? Moreover, Christian presented the balancing principle to choose the regularization parameter (without knowledge of the noise level) and this was the first time I really got what it’s about. What one does here is to choose {\alpha} such that (for some {\sigma>0} which only depends on {K}, but not on the noise)

\displaystyle \sigma\|Kx_\alpha^\delta-y^\delta\|_\infty = \frac{\alpha}{2}\|x_\alpha^\delta\|_2^2.

The rationale behind this is that the left hand side is monotonically non-decreasing in {\alpha}, while the right hand side is monotonically non-increasing. Hence, there should be some {\alpha} “in the middle” which makes both roughly equally large. Of course, we want neither to “over-regularize” (which would usually “smooth too much”) nor to “under-regularize” (which would not eliminate the noise). Hence, balancing seems to be a valid choice. From a practical point of view the balancing principle is also nice because one can use the fixed-point iteration

\displaystyle \alpha^{n+1} = 2\sigma\frac{\|Kx_{\alpha^n}^\delta - y^\delta\|_\infty}{\|x_{\alpha_n}^\delta\|_2^2}

which converges within a few iterations.
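To get a feeling for how this works, here is a small toy sketch of the fixed-point iteration (made-up data; the inner problem is replaced by plain quadratic Tikhonov regularization with a closed-form solution, while Christian solves his {L^\infty} problem with a path-following semismooth Newton method):

```python
import numpy as np

# Toy sketch of the balancing fixed-point iteration. The inner problem is replaced by
# classical quadratic Tikhonov regularization (closed-form solution); Christian's actual
# method uses the L^infty discrepancy and a path-following semismooth Newton solver.
rng = np.random.default_rng(0)
n = 100
K = rng.standard_normal((n, n)) / n           # made-up operator
x_true = np.zeros(n); x_true[20] = 1.0
y_delta = K @ x_true + 0.01 * rng.standard_normal(n)

sigma = 1.0                                   # assumed constant depending only on K

def solve_inner(alpha):
    # placeholder inner solver: (K^T K + alpha I) x = K^T y
    return np.linalg.solve(K.T @ K + alpha * np.eye(n), K.T @ y_delta)

alpha = 1.0
for it in range(20):
    x = solve_inner(alpha)
    alpha_new = 2 * sigma * np.linalg.norm(K @ x - y_delta, np.inf) / np.linalg.norm(x) ** 2
    if abs(alpha_new - alpha) < 1e-10 * alpha:
        break
    alpha = alpha_new

print("balanced alpha:", alpha, "after", it + 1, "iterations")
```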

Then there was the talk by Esther Klann, but unfortunately, I was late so only heard the last half…

Last but not least we have the talk by Christiane Pöschl. If you are interested in Total-Variation-Denoising (TV denoising), then you probably have heard many times that “TV denoising preserves edges” (have a look at the Wikipedia page – it claims this twice). What Christiane showed (in a work with Vicent Caselles and M. Novaga) is that this claim is not true in general but only in very special cases. In the case of characteristic functions, the only sets for which the TV minimizer has sharp edges are the so-called calibrated sets, introduced by Caselles et al. Building on earlier works by Caselles and co-workers she calculated exact minimizers for TV denoising in the case that the image consists of characteristic functions of two convex sets or of a single star-shaped domain, that is, for a given set B she calculated the solution of

\displaystyle \min_u\int (u - \chi_B)^2dx + \lambda \int|Du|.

This is not as easy as it may sound. Even for a single convex set one has to make some effort to compute the minimizer. She presented a nice connection between the shape of the obtained level sets and the morphological operators of closing and opening. With the help of this link she derived a methodology to obtain the exact TV-denoising minimizer for all parameters. I do not have the images right now but be assured that most of the time the minimizers do not have sharp edges all over the place. Even for simple geometries (like two rectangles touching in a corner) strange things happen and only very few sharp edges appear. I’ll keep you posted in case the paper comes out (or appears as a preprint).

Christiane has some nice images which make this much more clear:

For two circles, edges are preserved if they are far enough away from each other. If they are close, the area “in between” them is filled in and, moreover, exhibits a fuzzy boundary. I remember seeing effects like this in the output of TV solvers and thinking “well, it seems that the algorithm is either not good or has not converged yet – TV should output sharp edges!”.

 

For a star-shaped shape (well, actually a star) the output looks like this. The corners are not only rounded but also blurred and this is true both for the “outer” corners and the “inner” corners.

 

So, if you have any TV-minimizing code, go ahead and check if your code actually does the right things on images like this!
Moreover, I would love to see similar results for more complicated extensions of TV like Total Generalized Variation, which I treated here.


The scientific program at ISMP started today and I planned to write a small personal summary of each day. However, it is a very intense meeting: lots of excellent talks, lots of people to meet and little spare time. So I’m afraid that I have to deviate from my plan a little bit. Instead of a summary of every day I just pick out a few events. I remark that these picks do not reflect quality or significance in any way; I just pick things for which I have something to record for personal reasons.

My day started after the first plenary with the session Testing environments for machine learning and compressed sensing in which my own talk was located. The session started with the talk by Michael Friedlander on the SPOT toolbox. Haven’t heard of SPOT yet? Take a look! In a nutshell it’s a toolbox which turns MATLAB into “OPLAB”, i.e. it allows one to treat abstract linear operators like matrices. By the way, the code is on github.
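For readers who do not use MATLAB: the same idea can be mimicked, e.g., with scipy’s LinearOperator (the toy difference operator below and all names are my own; this is only a rough analogue of what SPOT provides):

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator

# Rough analogue of the SPOT idea: wrap an "abstract" linear operator so that it can be
# used with matrix-like syntax, without ever forming a matrix explicitly.
n = 256

def forward(x):                      # a toy forward-difference operator
    return np.diff(x, append=x[-1])

def adjoint(y):                      # its adjoint, worked out by hand
    z = -np.diff(y, prepend=0.0)
    z[-1] += y[-1]
    return z

D = LinearOperator((n, n), matvec=forward, rmatvec=adjoint, dtype=float)

x = np.random.rand(n)
y = D @ x                            # looks like a matrix-vector product
z = D.T @ y                          # adjoint/transpose via rmatvec
print(y.shape, z.shape)
```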

The second talk was by Katya Scheinberg (who is giving a semi-plenary talk on derivative free optimization at the moment…). She talked about speeding up FISTA by cleverly adjusting step-sizes and over-relaxation parameters and generalizing these ideas to other methods like alternating direction methods. Notably, she used the “SPEAR test instances” from our project homepage! (And credited them as “surprisingly hard sparsity problems”.)

My own talk was the third and last one in that session. I talked about the issue of constructing test instances for Basis Pursuit Denoising. I argued that the naive approach (which takes a matrix {A}, a right hand side {b} and a parameter {\lambda} and lets some great solver run for a while to obtain a solution {x^*}) may suffer from “trusted method bias”. I proposed to use “reverse instance construction” instead: First choose {A}, {\lambda} and the solution {x^*} and then construct the right hand side {b} (I blogged on this before here).
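To illustrate the basic idea, here is a toy sketch built directly from the optimality condition of Basis Pursuit Denoising (this is not the actual construction from the talk, only the naive version of it):

```python
import numpy as np

# Toy sketch of "reverse instance construction" for  min_x 0.5*||Ax-b||^2 + lam*||x||_1:
# choose A, lam and the solution x* first, then pick the residual e = b - A x* so that
# the optimality condition A^T e = lam * s, with s in the subdifferential of ||.||_1 at x*,
# holds. Here e is the least-norm choice for the on-support part; if the off-support
# check fails, this naive construction is not valid and a more careful choice is needed.
rng = np.random.default_rng(1)
m, n, k, lam = 50, 200, 5, 0.1
A = rng.standard_normal((m, n)) / np.sqrt(m)

x_star = np.zeros(n)
support = rng.choice(n, size=k, replace=False)
x_star[support] = rng.standard_normal(k)

AS = A[:, support]
e = AS @ np.linalg.solve(AS.T @ AS, lam * np.sign(x_star[support]))
b = A @ x_star + e

corr = A.T @ e
off = np.setdiff1d(np.arange(n), support)
print("on-support condition :", np.allclose(corr[support], lam * np.sign(x_star[support])))
print("off-support condition:", np.max(np.abs(corr[off])) <= lam + 1e-12)
```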

Last but not least, I’d like to mention the talk by Thomas Pock: He talked about parameter selection for variational models (think of the regularization parameter in Tikhonov regularization, for example). In a paper with Karl Kunisch titled A bilevel optimization approach for parameter learning in variational models they formulated this as a bilevel optimization problem – an approach which seems to have been overdue! Although they treat rather simple inverse problems (well, denoising), albeit with not-so-easy regularizers, it is a promising first step in this direction.

Today I’d like to blog about two papers which appeared on the arxiv.

1. Regularization with the Augmented Lagrangian Method – Convergence Rates from Variational Inequalities

The first one is the paper “Regularization of Linear Ill-posed Problems by the Augmented Lagrangian Method and Variational Inequalities” by Klaus Frick and Markus Grasmair.

Well, the title basically describes the content quite accurately. However, recall that the Augmented Lagrangian Method (ALM) is a method to calculate solutions to certain convex optimization problems. For a convex, proper and lower-semicontinuous function {J} on a Banach space {X}, a linear and bounded operator {K:X\rightarrow H} from {X} into a Hilbert space {H} and an element {g\in H} consider the problem

\displaystyle   \inf_{u} J(u)\quad\text{s.t.}\quad Ku=g. \ \ \ \ \ (1)

The ALM goes as follows: Start with an initial dual variable {p_0}, choose step-sizes {\tau_k>0} and iterate

\displaystyle  u_k \in \text{argmin}\Big(\frac{\tau_k}{2}\|Ku-g\|^2 + J(u) + \langle p_{k-1},Ku-g\rangle\Big)

\displaystyle  p_k = p_{k-1}+\tau_k(g-Ku_k).
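As a side remark, here is a minimal numerical sketch of this iteration for the toy choice {J(u)=\tfrac12\|u\|^2}, for which the inner minimization has a closed form (operator and data are made up):

```python
import numpy as np

# Minimal sketch of the ALM iteration above for J(u) = 0.5*||u||^2, where the inner
# minimization reduces to solving (tau*K^T K + I) u = tau*K^T g - K^T p. Operator and
# data are made up; the error to the true solution can be monitored along the iteration.
rng = np.random.default_rng(2)
n = 80
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
K = U @ np.diag(0.9 ** np.arange(n)) @ V.T       # ill-conditioned operator
u_true = V[:, 0] + 0.5 * V[:, 1]
g_delta = K @ u_true + 1e-3 * rng.standard_normal(n)

tau, p = 1.0, np.zeros(n)
errors = []
for k in range(200):
    u = np.linalg.solve(tau * K.T @ K + np.eye(n), tau * K.T @ g_delta - K.T @ p)
    p = p + tau * (g_delta - K @ u)
    errors.append(np.linalg.norm(u - u_true))

print("errors at iterations 1, 10, 50, 200:", [round(errors[i], 4) for i in (0, 9, 49, 199)])
```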

(These days one should note that this iteration is also known under the name Bregman iteration…). Indeed, it is known that the ALM converges to a solution of (1) if one exists. Klaus and Markus consider the ill-posed case, i.e. the range of {K} is not closed and {g} is replaced by some {g^\delta} which fulfills {\|g-g^\delta\|\leq\delta} (and hence, {g^\delta} is generally not in the range of {K}). Then the ALM does not converge but diverges. However, one observes “semi-convergence” in practice, i.e. the iterates first approach an approximate “solution of {Ku=g^\delta}” (or even a true solution of {Ku=g}) but then start to diverge from some point on. It is then natural to ask whether the ALM with {g} replaced by {g^\delta} can be used for regularization, i.e. whether one can choose a stopping index {k^*} (depending on {\delta} and {g^\delta}) such that the iterates {u_{k^*}^\delta} approach the solution of (1) as {\delta} vanishes. This question has been answered in the affirmative in previous work by Klaus (here and here), and estimates on the error and convergence rates have also been derived under an additional assumption on the solution of (1). This assumption is what is called a “source condition” and says that there should exist some {p^\dag\in H} such that for a solution {u^\dagger} of (1) it holds that

\displaystyle  K^* p^\dagger \in\partial J(u^\dagger).

Under this assumption it has been shown that the Bregman distance {D(u_{k^*}^\delta,u^\dag)} goes to zero linearly in {\delta} under appropriate stopping rules. What Klaus and Markus investigate in this paper are different conditions which ensure convergence rates slower than linear. These conditions come in the form of “variational inequalities” which have gained some popularity lately. As usual, these variational inequalities look somewhat messy at first sight. Klaus and Markus use

\displaystyle  D(u,u^\dag)\leq J(u) - J(u^\dag) + \Phi(\|Ku-g\|^2)

for some positive functional {D} with {D(u,u)=0} and some non-negative, strictly increasing and concave function {\Phi}. Under this assumption (and for special {D}) they derive convergence rates which again look quite complicated but can be reduced to simpler and more transparent cases that resemble the situation one knows from other regularization methods (like ordinary Tikhonov regularization).

In the last section Klaus and Markus also treat sparse regularization (i.e. with {J(u) = \|u\|_1}) and show that a weaker condition (like {(K^*K)^\nu p^\dag\in\partial J(u^\dag)} for some {0<\nu<1/2}) already implies the stronger source condition above (with a different {p^\dag}). Hence, interestingly, it seems that for sparse regularization one either gets a linear rate or nothing (in this framework).

2. On necessary conditions for variational regularization

The second paper is “Necessary conditions for variational regularization schemes” by Nadja Worliczek and myself. I have discussed some parts of this paper already on this blog here and here. In this paper we tried to formalize the notion of “a variational method” for regularization, with the goal of obtaining necessary conditions for a variational scheme to be regularizing. As expected, this goal is quite ambitious and we cannot claim that we came up with the ultimate necessary conditions which describe what kinds of variational methods are not possible. However, we were able to relate the three kinds of variational methods (which I called Tikhonov, Morozov and Ivanov regularization here) to each other and, moreover, to investigate the conditions on the data space a little more closely. In recent years it turned out that one should not always use a term like {\|Ku-g^\delta\|^2} to measure the noise or to penalize the deviation of {Ku} from {g^\delta}. For several noise models (like Poisson noise or multiplicative noise) other functionals are better suited. However, these functionals raise several issues: They are often not defined on a linear space but on a convex set, sometimes with the nasty property that its interior is empty. They often do not have convenient algebraic properties (e.g. scaling invariance, triangle inequalities or the like). Finally, they are not necessarily (lower semi-)continuous with respect to the usual topologies. Hence, we approached the data space in a quite abstract way: The data space {(Y,\tau_Y)} is a topological space which comes with an additional sequential convergence structure {\mathcal{S}} (see e.g. here) and on (a subset of) which there is a discrepancy functional {\rho:Y\times Y\rightarrow [0,\infty]}. Then we analyzed the interplay of these three things {\tau_Y}, {\mathcal{S}} and {\rho}. If you wonder why we use the additional sequential convergence structure, remember that in the (by now classical) setting of Tikhonov regularization in Banach spaces with a functional like

\displaystyle  \|Ku-g^\delta\|_Y^q + \alpha\|u\|_X^p

with some Banach space norms {\|\cdot\|_Y} and {\|\cdot\|_X}, there are two kinds of convergence on {Y}: the weak convergence (which is replaced by {\tau_Y} in our setting), which is, e.g., used to describe convenient (lower semi-)continuity properties of {K} and of the norm {\|\cdot\|_Y}, and the norm convergence, which is used to describe that {g^\delta\rightarrow g^\dag} for {\delta\rightarrow 0}. Since we do not have a normed space {Y} in our setting, and since the proofs of regularizing properties do not use any topological properties of the norm convergence, Nadja suggested using a sequential convergence structure instead.

Today I would like to comment on two arxiv-preprints I stumbled upon:

1. “Augmented L1 and Nuclear-Norm Models with a Globally Linearly Convergent Algorithm” – The Elastic Net rediscovered

The paper “Augmented L1 and Nuclear-Norm Models with a Globally Linearly Convergent Algorithm” by Ming-Jun Lai and Wotao Yin is another contribution to a field which is (or was?) probably the fastest growing field in applied mathematics: algorithms for convex problems with non-smooth {\ell^1}-like terms. The “mother problem” here is as follows: Given a matrix {A\in{\mathbb R}^{m\times n}} and {b\in{\mathbb R}^m}, try to find a solution of

\displaystyle  \min_{x\in{\mathbb R}^n}\|x\|_1\quad\text{s.t.}\quad Ax=b

or, for {\sigma>0}

\displaystyle  \min_{x\in{\mathbb R}^n}\|x\|_1\quad\text{s.t.}\quad \|Ax-b\|\leq\sigma

which appeared here on this blog previously. Although this is a convex problem and even has a reformulation as a linear program, some instances of this problem are notoriously hard to solve, and the problem has gained a lot of attention (because of its applicability in sparse recovery and compressed sensing). Very roughly speaking, part of its hardness comes from the fact that the problem is neither smooth nor strictly convex.

The contribution of Lai and Yin is that they analyze a slight perturbation of the problem which makes its solution much easier: They add another term to the objective; for {\alpha>0} they consider

\displaystyle  \min_{x\in{\mathbb R}^n}\|x\|_1 + \frac{1}{2\alpha}\|x\|_2^2\quad\text{s.t.}\quad Ax=b

or

\displaystyle  \min_{x\in{\mathbb R}^n}\|x\|_1 + \frac{1}{2\alpha}\|x\|_2^2\quad\text{s.t.}\quad \|Ax-b\|\leq\sigma.

This perturbation does not make the problem smooth but renders it strongly convex (which usually makes the dual more smooth). It turns out that this perturbation makes life with this problem (and related ones) much easier – recovery guarantees still exist and algorithms behave better.

I think it is important to note that the “augmentation” of the {\ell^1} objective with an additional squared {\ell^2}-term goes back to Zou and Hastie from the statistics community. There, the motivation was as follows: They observed that the pure {\ell^1} objective tends to “overpromote” sparsity in the sense that if there are two columns in {A} which are almost equally good in explaining some component of {b} then only one of them is used. The “augmented problem”, however, tends to use both of them. They coined the term “elastic net” for the method (for reasons which I never really got).

I also worked on elastic-net problems of the form

\displaystyle  \min_x \frac{1}{2}\|Ax-b\|^2 + \alpha\|x\|_1 + \frac{\beta}{2}\|x\|_2^2

in this paper (doi-link). Here it also turns out that the problem gets much easier algorithmically. I found it very convenient to rewrite the elastic-net problem as

\displaystyle  \min_x \frac{1}{2}\|\begin{bmatrix}A\\ \sqrt{\beta} I\end{bmatrix}x-\begin{bmatrix}b\\ 0\end{bmatrix}\|^2 + \alpha\|x\|_1

which turns the elastic-net problem into just another {\ell^1}-penalized problem with a special matrix and right hand side. Quite convenient for analysis and also somehow algorithmically.
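Here is a small sketch of this reformulation in action (made-up data, plain ISTA as the solver):

```python
import numpy as np

# Sketch of the reformulation above: the elastic-net problem as an l1-penalized problem
# with the stacked matrix [A; sqrt(beta) I], solved by plain ISTA (made-up data).
rng = np.random.default_rng(3)
m, n = 60, 150
A = rng.standard_normal((m, n)) / np.sqrt(m)
x_true = np.zeros(n); x_true[[3, 50, 120]] = [1.0, -2.0, 1.5]
b = A @ x_true + 0.01 * rng.standard_normal(m)

alpha, beta = 0.05, 0.1
A_aug = np.vstack([A, np.sqrt(beta) * np.eye(n)])
b_aug = np.concatenate([b, np.zeros(n)])

def soft(x, t):                                    # soft thresholding
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

L = np.linalg.norm(A_aug, 2) ** 2                  # Lipschitz constant of the gradient
x = np.zeros(n)
for _ in range(500):
    x = soft(x - A_aug.T @ (A_aug @ x - b_aug) / L, alpha / L)

print("recovered support:", np.flatnonzero(np.abs(x) > 1e-6))
```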

2. Towards a Mathematical Theory of Super-Resolution

The second preprint is “Towards a Mathematical Theory of Super-Resolution” by Emmanuel Candes and Carlos Fernandez-Granda.

The idea of super-resolution seems to be pretty old and, very roughly speaking, is to extract a higher resolution of a measured quantity (e.g. an image) than the measured data allows. Of course, in this formulation this is impossible. But often one can gain something by additional knowledge of the image. Basically, this also is the idea behind compressed sensing and hence, it does not come as a surprise that the results in compressed sensing are used to try to explain when super-resolution is possible.

The paper by Candes and Fernandez-Granda seems to be pretty close in spirit to Exact Reconstruction using Support Pursuit on which I blogged earlier. They model the sparse signal as a Radon measure, specifically as a sum of Diracs. However, in contrast to the support-pursuit paper they use complex exponentials instead of real polynomials. Their reconstruction method is basically the same as support pursuit: They try to solve

\displaystyle   \min_{x\in\mathcal{M}} \|x\|\quad\text{s.t.}\quad Fx=y, \ \ \ \ \ (1)

i.e. they minimize over the set of Radon measures {\mathcal{M}} under the constraint that certain measurements {Fx\in{\mathbb R}^n} result in certain given values {y}. Moreover, they make a thorough analysis of what is “reconstructable” by their ansatz and obtain a lower bound on the distance between two Diracs (in other words, a lower bound in the Prokhorov distance). I have to admit that I do not share one of their claims from the abstract: “We show that one can super-resolve these point sources with infinite precision—i.e. recover the exact locations and amplitudes—by solving a simple convex program.” My point is that I cannot see to what extent the problem (1) is a simple one. Well, it is convex, but it does not seem to be simple.

I want to add that the idea of “continuous sparse modelling” in the space of signed measures is very appealing to me and appeared first in Inverse problems in spaces of measures by Kristian Bredies and Hanna Pikkarainen.

On my way to ENUMATH 11 in Leicester I stumbled upon the preprint Multi-parameter Tikhonov Regularisation in Topological Spaces by Markus Grasmair. The paper deals with fairly general Tikhonov functionals and their regularizing properties. Markus considers (nonlinear) operators {F:X\rightarrow Y} between two sets {X} and {Y} and analyzes minimizers of the functional

\displaystyle T(x) = S(F(x),y) + \sum_k \alpha_k R_k(x).

The functionals {S} and {R_k} play the roles of a similarity measure and regularization terms, respectively. While he also treats the issue of noise in the operator and multiple regularization terms, I was mostly interested in his approach to the general similarity measure. The category in which he works is that of topological spaces, and he writes:

“Because anyway no trace of an original Hilbert space or Banach space structure is left in the formulation of the Tikhonov functional {T} […], we will completely discard all assumption of a linear structure and instead consider the situation, where both the domain {X} and the co-domain {Y} of the operator {F} are mere topological spaces, with the topology of {Y} defined by the distance measure {S}.”

The last part of the sentence is important since previous papers often worked the other way round: Assume some topology in {Y} and then state conditions on {S}. Nadja Worliczek observed in her talk “Sparse Regularization with Bregman Discrepancy” at GAMM 2011 that it seems more natural to deduce the topology from the similarity measure and Markus took the same approach. While Nadja used the notion of “initial topology” (that is, take the coarsest topology that makes the functionals {y\mapsto S(z,y)} continuous), Markus uses the following family of pseudo-metrics: For {z\in Y} define

\displaystyle d^{(z)}(y,\tilde y) = |S(z,y)-S(z,\tilde y)|.

Unfortunately, the preprint is a little bit too brief for me at this point and I did not totally get what he means by “the topology {\sigma} induced by the uniformity induced by the pseudo-metric”. Also, I am not totally sure if the term “pseudo-metric” is unambiguous… However, the topology he has in mind seems to be well suited in the sense that {y^n\rightarrow y} if {S(z,y^n)\rightarrow S(z,y)} for all {z}. Moreover, the condition that {S(z,y)=0} iff {z=y} implies that {\sigma} is Hausdorff. It would be good to have a better understanding of how the properties of the similarity measure are related to the properties of the induced topology. Are there examples in which the induced topology is both different from the usual norm and weak topologies and also interesting?

Moreover, I would be interested in the relation between the two approaches: via “uniformities” and via the initial topology…

1. A numerical experiment on sparse regularization

To start, I take a standard problem from the Regularization Tools Matlab toolbox: The problem \texttt{deriv2}. This problem generates a matrix {A} and two vectors {x} and {b} such that the equation {Ax=b} is a Galerkin discretization of the integral equation

\displaystyle  g(t) = \int_0^1 k(t,s)f(s) ds

with a kernel {k} such that the solution amounts to solving a boundary value problem. The Galerkin ansatz functions are simply orthonormal characteristic functions on intervals, i.e. {\psi_i(x) = h^{-1/2}\chi_{[ih,(i+1)h]}(x)}. Thus, I work with matrices {A_h} and vectors {x_h} and {b_h}.
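A rough sketch of such a discretization (to my knowledge the deriv2 kernel is the Green’s function of the second derivative; I use a simple midpoint rule instead of the exact Galerkin integrals, so this only approximates what deriv2 does):

```python
import numpy as np

# Sketch of a deriv2-like discretization. To my knowledge the kernel is the Green's
# function of the second derivative; instead of the exact Galerkin integrals a midpoint
# rule is used, so A only approximates what deriv2 produces.
def build_deriv2(n):
    h = 1.0 / n
    t = (np.arange(n) + 0.5) * h                  # midpoints of the intervals
    S, T = np.meshgrid(t, t, indexing="ij")
    k = np.where(S < T, S * (T - 1.0), T * (S - 1.0))
    A = h * k                                     # psi_i = h^{-1/2} chi_i  =>  A_ij ~ h*k(t_i,t_j)
    return A, t, h

A, t, h = build_deriv2(100)
print(A.shape, h)
```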

I want to use sparse regularization to reconstruct spiky solutions, that is, I solve problems

\displaystyle  \min_{x_h} \tfrac{1}{2}\|A_h x_h - b_h\|^2 + \alpha_h\|x_h\|_1.

Now, my first experiment goes as follows:

Experiment 1 (Discretization goes to zero)
I generate spiky data: I fix a point {t_0} in the interval {[0,1]}, namely {t_0 = 0.2}, and a value {a_0=1}. Now I consider the data {f} which is a delta peak of height {a_0} at {t_0} (which in turn leads to a right hand side {g}). I construct the corresponding {x_h} and the right hand side {b_h=A_hx_h}. Now I aim at solving

\displaystyle  \min_f \tfrac{1}{2}\| g - \int_0^1 k(\cdot,s)f(s)ds\|_2^2 + \alpha \|f\|_1

for different discretizations ({h\rightarrow 0}). In the numerics, I have to scale {\alpha} with {h}, i.e. I solve

\displaystyle  \min_{x_h} \tfrac{1}{2}\|A_h x_h - b_h\|^2 + h\,\alpha\|x_h\|_1.

and I obtain the following results: In black I show the data {x}, {b} and so on, and in blue I plot the minimizer and its image under {A}.

For {n=10}:

For {n=50}:

For {n=100}:

For {n=500}:

For {n=1000}:

Note that the scale varies in the pictures, except in the lower left one where I show the discretized {g}. As it should be, this converges nicely to a piecewise linear function. However, the discretization of the solution blows up, which is also as it should be, since I discretize a delta peak. Well, this basically shows that my scaling is correct.
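For completeness, here is a sketch of how Experiment 1 could be reproduced (using the build_deriv2 helper from the sketch above and a plain ISTA solver; the code I actually used is not shown here and may differ):

```python
import numpy as np

# Sketch of Experiment 1: discretize a delta peak of height a0 at t0, compute
# b_h = A_h x_h with the build_deriv2 helper from the sketch above, and solve the
# l1-penalized problem with plain ISTA, scaling alpha with h.
def soft(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def ista_l1(A, b, alph, iters=3000):
    L = np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        x = soft(x - A.T @ (A @ x - b) / L, alph / L)
    return x

alpha, t0, a0 = 1e-4, 0.2, 1.0
for n in (10, 50, 100, 500):
    A, t, h = build_deriv2(n)
    x_h = np.zeros(n)
    x_h[int(t0 * n)] = a0 * h ** (-0.5)           # coefficients of the delta peak w.r.t. psi_i
    b_h = A @ x_h
    rec = ista_l1(A, b_h, h * alpha)              # note the scaling of alpha with h
    print(n, "peak at", t[np.argmax(np.abs(rec))], "height", np.max(np.abs(rec)))
```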

From the paper Sparse regularization with {\ell^q} penalty term one can extract the following result.

Theorem 1 Let {K:\ell^2\rightarrow Y} be linear, bounded and injective and let {u^\dagger \in \ell^2} have finite support. Moreover, let {g^\dagger = Ku^\dagger} and {\|g^\dagger-g^\delta\|\leq \delta}. Furthermore, denote by {u_\alpha^\delta} the minimizer of

\displaystyle  \tfrac12\|Ku-g^\delta\|^2 + \alpha\|u\|_1.

Then, for {\alpha = c\delta} it holds that

\displaystyle  \|u_\alpha^\delta - u^\dagger\|_1 = \mathcal{O}(\delta).

Now let’s observe this convergence rates in a second experiment:

Experiment 2 (Convergence rate {\mathcal{O}(\delta)}) Now we fix the discretization (i.e. {n=500}) and construct a series of {g^\delta}’s for {\delta} on a logarithmic scale between {1} and {10^{-6}}. I scale {\alpha} proportionally to {\delta} and calculate minimizers of

\displaystyle  \min_{x_h} \tfrac{1}{2}\|A_h x_h - b_h\|^2 + h\,\alpha\|x_h\|_1.

Then I measure the error {\|f_\alpha^\delta-f^\dagger\|_1} and plot it doubly logarithmically against {\delta}.

And there you see the linear convergence rate as predicted.
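A sketch of how Experiment 2 could be set up (again only an illustration, reusing the helpers from the sketches above):

```python
import numpy as np

# Sketch of Experiment 2, reusing build_deriv2 and ista_l1 from the sketches above:
# fix n, vary delta on a log scale, scale alpha ~ delta and record the l1 error.
rng = np.random.default_rng(4)
n, t0, a0, c = 500, 0.2, 1.0, 1.0
A, t, h = build_deriv2(n)
x_h = np.zeros(n); x_h[int(t0 * n)] = a0 * h ** (-0.5)
b_h = A @ x_h

for delta in np.logspace(0, -6, 7):
    noise = rng.standard_normal(n)
    b_delta = b_h + delta * noise / np.linalg.norm(noise)
    rec = ista_l1(A, b_delta, h * c * delta)
    print(f"delta = {delta:.0e},  l1 error = {np.linalg.norm(rec - x_h, 1):.3e}")
```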

In a final experiment I vary both {\delta} and {n}:

Experiment 3 ({\delta\rightarrow 0} and “{n\rightarrow\infty}”) Now we repeat Experiment 2 for different {n} and put all the loglog plots in one figure. This looks like this: You clearly observe the linear convergence rate in every case. But there is another important thing: The larger the {n} (i.e. the smaller the {h}), the later the linear rate kicks in (i.e. only for smaller {\delta}). You may wonder what the reason is. By looking at the reconstructions for varying {n} and {\delta} (which I do not show here) you see the following behavior: For large noise the regularized solutions consist of several peaks located all over the place, and as the noise vanishes, one peak close to the original one becomes dominant. However, this peak is not at the exact position but at a slightly larger {t}; moreover, it is slightly smaller. Then this peak moves towards the right position and also gets larger. Finally, the peak arrives at the exact position and remains there while approaching the correct height.

Hence, the linear rate kicks in, precisely when the accuracy is higher than the discretization level.

Conclusion:

  • The linear convergence rate is only present in the discrete case. Moreover, it starts at a level which can not be resolved by the discretization.
  • “Sparsity penalties” in the continuous case are a different and delicate matter. You may consult the preprint “Inverse problems in spaces of measures” which formulates the sparse recovery problem in a continuous setting but in the space of Radon measures rather than in {L^1} (which simply does not work). There Kristian and Hanna show weak* convergence of the minimizers.
  • Finally, for “continuous sparsity” some kind of convergence also holds, however not in norm (which really should be the variation norm in measure space). Weak* convergence can be quantified by the Prokhorov metric or the Wasserstein metric (which is also called the earth mover’s distance in some communities). Convergence with respect to these metrics should hold (under some assumptions) but seems hard to prove. Convergence rates would be cool, but seem even harder.

In this post I would like to comment on two papers I “stumbled upon”, one in regularization theory and one in image processing.

The first one is A regularization parameter for nonsmooth Tikhonov regularization by Kazufumi Ito, Bangti Jin and Tomoya Takeuchi. As the title announces, the paper addresses the problem of determining a suitable regularization parameter for some kind of Tikhonov regularization. In particular, the authors propose a new heuristic method, i.e. a method which does not use any estimate of the noise level in the data. This is an interesting and important topic for several reasons:

  1. Practically, estimates on the noise level are rarely available and if they are, they are not too reliable.
  2. Strictly speaking, these kinds of rules are “bad” since there is the “Bakushinskii veto”: Such rules only provide regularizations (e.g. in the sense of Engl, Hanke, Neubauer) for problems which are well-posed (as a great service, the authors state and prove this statement as Theorem 3.2).
  3. Despite this veto, several heuristic rules produce excellent results in practice.

Note that the last two points are not in contradiction. They merely say that the notion of “regularization” may be too strict. Usually, it uses a worst-case estimate which may practically never be observed.

The paper contributes a new rule and states that it is applicable to a broad range of problems. The authors use a very general Tikhonov functional:

\displaystyle  \phi(x,y^\delta) + \eta\psi(x)

and do not assume that {\phi} or {\psi} are smooth. They use the value function

\displaystyle  F(\eta) = \min_x \phi(x,y^\delta) + \eta\psi(x)

and propose the following rule for {\eta}: For some {\gamma} choose {\eta} such that

\displaystyle  \Phi_\gamma(\eta) = \frac{F(\eta)^{1+\gamma}}{\eta}

is minimal. I do not have any intuition for this rule (however, from the proofs you see that it works, at least in “partially smooth cases”, see below). Lacking a name for this rule, one may use the term “weighted value function rule”.
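To make the rule concrete, here is a toy sketch for the smooth model case {\phi(x,y^\delta)=\tfrac12\|Kx-y^\delta\|^2} and {\psi(x)=\tfrac12\|x\|^2}, where the value function can be evaluated in closed form and {\eta} is chosen by minimizing {F(\eta)^{1+\gamma}/\eta} over a logarithmic grid ({\gamma} and the grid are my own choices):

```python
import numpy as np

# Toy sketch of the "weighted value function rule": for the smooth model case the value
# function F(eta) can be evaluated via the closed-form Tikhonov minimizer, and eta is
# chosen by minimizing F(eta)^(1+gamma)/eta over a grid (gamma and the grid are my choice).
rng = np.random.default_rng(5)
n = 80
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
K = U @ np.diag(0.85 ** np.arange(n)) @ V.T
x_true = V[:, :3] @ np.array([1.0, -0.5, 0.25])
y_delta = K @ x_true + 1e-3 * rng.standard_normal(n)

def value_function(eta):
    x = np.linalg.solve(K.T @ K + eta * np.eye(n), K.T @ y_delta)
    return 0.5 * np.linalg.norm(K @ x - y_delta) ** 2 + 0.5 * eta * np.linalg.norm(x) ** 2, x

gamma = 0.5
etas = np.logspace(-8, 0, 50)
scores = [value_function(eta)[0] ** (1 + gamma) / eta for eta in etas]
eta_star = etas[int(np.argmin(scores))]
print("chosen eta:", eta_star, " error:", np.linalg.norm(value_function(eta_star)[1] - x_true))
```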

They prove several nice properties of the value function (continuity, monotonicity and concavity) with loose assumptions on {\phi} and {\psi} (in particular, they do not even need existence of minimizers for {\phi(x,y^\delta) + \eta\psi(x)}, only that the minimum exists). However, when it comes to error estimates, they only obtain results for a specific discrepancy measure, namely a squared Hilbert space norm:

\displaystyle  \phi(x,y^\delta) = \tfrac12\|Kx-y^\delta\|^2.

It seems that, for general convex and lower-semicontinuous penalties {\psi} they build upon results from my paper with Bangti Jin on the Hanke-Raus rule and the quasi-optimality principle.

Another contribution of the paper is that it gives an algorithm that realizes the weighted value function rule (a thing which I omitted in my paper with Bangti). Their numerical experiments demonstrate that the weighted value function rule and the proposed algorithm work well on academic examples.

The next paper I want to discuss is the preprint Properties of {L^1-\text{TGV}^2}: The one-dimensional case by Kristian Bredies, Karl Kunisch and Tuomo Valkonen. There the authors analyze the fairly recent generalization “total generalized variation” {\text{TGV}} of the omnipresent total variation. TGV was proposed by Bredies, Kunisch and Pock in this paper recently, and Kristian and I also briefly described it in our book on mathematical image processing. Loosely speaking, TGV is meant to be a generalization of the usual total variation which does not lead to “staircasing”. While one may suspect from the construction of the TGV functional that staircasing is not to be expected, the authors of this paper give precise statements. By restricting themselves to the one-dimensional case they prove several interesting properties of the TGV functional, most notably that it leads to an equivalent norm on the space {BV}.

Maybe I should state the definitions here: The total variation of a function {u\in L^1(\Omega)} is

\displaystyle  \text{TV}(u) = \sup\{\int_\Omega u v'\ |\ v\in C^1_c(\Omega),\ \|v\|_\infty\leq 1\}

leading to the {BV}-norm

\displaystyle  \|u\|_{BV} = \|u\|_{L^1} + \text{TV}(u).

The {\text{TGV}^2} seminorm for a parameter tuple {(\alpha,\beta)} is

\displaystyle  \text{TGV}^2_{(\alpha,\beta)}(u) = \sup\{\int_\Omega u v''\ |\ v\in C^2_c(\Omega),\ \|v\|_\infty\leq\beta,\ \|v'\|_\infty\leq\alpha\}

and the associated norm is

\displaystyle  \|u\|_{BGV^2} = \|u\|_{L^1} + \text{TGV}^2(u).
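To get a feeling for these two seminorms, here is a small discrete 1D sketch (assuming CVXPY is available). For {\text{TGV}^2} I use the minimum representation {\text{TGV}^2_{(\alpha,\beta)}(u) = \min_w \alpha\|Du-w\|_1 + \beta\|Dw\|_1}; how exactly the parameters match the sup-definition above is a matter of convention, and the forward-difference discretization is my own choice. Note that the affine function has vanishing {\text{TGV}^2} but non-vanishing TV:

```python
import numpy as np
import cvxpy as cp

def tv(u):                              # discrete 1D total variation (forward differences)
    return np.sum(np.abs(np.diff(u)))

def tgv2(u, alpha, beta):               # discrete 1D TGV^2 via its minimum representation
    du = np.diff(u)
    w = cp.Variable(du.size)
    prob = cp.Problem(cp.Minimize(alpha * cp.norm1(du - w) + beta * cp.norm1(cp.diff(w))))
    prob.solve()
    return prob.value

t = np.linspace(0, 1, 200)
u_affine = 2 * t - 0.5                  # affine function: nonzero TV, but zero TGV^2
u_step = (t > 0.5).astype(float)        # jump: both seminorms are nonzero

print("TV(affine) =", tv(u_affine), " TGV2(affine) =", tgv2(u_affine, 1.0, 1.0))
print("TV(step)   =", tv(u_step),   " TGV2(step)   =", tgv2(u_step, 1.0, 1.0))
```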

In Lemma 3.3 they prove that {\|\cdot\|_{BV}} and {\|\cdot\|_{BGV^2}} are equivalent norms on {\text{BV}}. In Section 4 they show that minimizers of

\displaystyle  \|u-f\|_{L^1} + \alpha\text{TV}(u)

obey staircasing of degree 0, i.e. the solution {u} is piecewise constant when it is not equal to {f}. For the minimizers of

\displaystyle  \|u-f\|_{L^1} + \text{TGV}^2_{(\alpha,\beta)}(u)

one has staircasing of degree 1: {u} is affine linear where it is not equal to {f}.

These two facts combined (norm equivalence of {\text{BV}} and {\text{BGV}^2} and the staircasing of degree 1) seem quite remarkable to me. They somehow show that staircasing is not related to the space {\text{BV}} of functions of bounded variation but only to the specific {\text{TV}} semi-norm. This is somehow satisfying since I still remember the thorough motivation of L. Rudin in his 1987 thesis for the usage of the space {\text{BV}} in image processing: If there were images which are not in {\text{BV}} we could not observe them. (He even draws an analogy to the question: How many angels can dance on the point of a needle?) Moreover, he further argues that {\text{BV}} is not too large in the sense that its elements are still accessible to analysis (e.g. in defining a weak notion of curvature although they may be discontinuous). The {\text{BGV}^2} model shows that it is possible to overcome the undesired effect of staircasing while staying within the well-founded, mathematically sound and appealing framework of {\text{BV}}.

The paper contains several more interesting results (e.g. on preservation of continuity and “affinity” and on convergence with respect to {(\alpha,\beta)}) which I do not collect here.

Coming back to regularization, especially Ivanov regularization. Recall that I used the term Ivanov regularization for the minimization problem

\displaystyle   \min S(Ax,y^\delta)\ \text{ s.t. }\ R(x)\leq \tau. \ \ \ \ \ (1)

I again stumbled upon some reference: It seems that in the case that the constraint {R(x)\leq \tau} defines a compact set, this method is usually referred to as “method of quasi solutions”. More precisely, I found this in “Elements of the theory of inverse problems” by A.M. Denisov, Chapter 6. There he uses metric spaces and proves the following:

Theorem 1 Let {X,Y} be metric spaces with metrics {d_X}, {d_Y}, respectively and {A:X\rightarrow Y} continuous. Furthermore let {M\subset X} be compact, {y^\dagger} be in the range of {A} and assume that {x^\dagger} is the unique solution of {Ax=y^\dagger} which lies in {M}. Finally for a {y^\delta} with {d_Y(y^\delta,y^\dagger)\leq\delta} define {X_\delta = \{x\ :\ d_Y(Ax,y^\delta)\leq\delta\}} and {X_\delta^M = X_\delta\cap M}. Then it holds for {\delta\rightarrow 0} that

\displaystyle  \sup_{x\in X_\delta^M}d_X(x,x^\dagger) \rightarrow 0.

Remark 1 Before we prove this theorem, we relate it to what I called Ivanov regularization above: The set {M} is encoded in (1) as {M = \{x\ :\ R(x)\leq\tau\}} and the “discrepancy measure” {S} is simply the metric {d_Y}. Hence, let {x_M^\delta} denote a solution of

\displaystyle  \min\ d_Y(Ax,y^\delta)\ \text{ s.t. } x\in M.

Because {x^\dagger} is feasible for this problem, it follows from {d_Y(Ax^\dagger,y^\delta) = d_Y(y^\dagger,y^\delta)\leq\delta} that {d_Y(Ax_M^\delta,y^\delta)\leq\delta}. Hence, {x_M^\delta\in X_\delta^M}. In other words: Ivanov regularization produces one element of the set {X_\delta^M}. Now, the theorem says that every element of {X_\delta^M} is a good approximation of {x^\dagger} (at least asymptotically).

Proof: We take a sequence {\delta_n\rightarrow 0} and assume to the contrary that there exists {\epsilon>0} such that for every {n} there exists {x_{\delta_n}\in X_{\delta_n}^M} with {d_X(x_{\delta_n},x^\dagger)\geq \epsilon}. Since all {x_{\delta_n}} lie in {M}, which is compact, there is a convergent subsequence (which we do not relabel) with limit {\bar x}, and we obtain {d_X(\bar x,x^\dagger)\geq \epsilon}. However,

\displaystyle  d_Y(A\bar x,Ax^\dagger) = d_Y(A\bar x,y^\dagger) = \lim_{n\rightarrow \infty} d_Y(Ax_{\delta_n},y^\dagger) \leq \lim_{n\rightarrow \infty} \big(d_Y(Ax_{\delta_n},y^{\delta_n}) + d_Y(y^{\delta_n},y^\dagger)\big) \leq \lim_{n\rightarrow\infty} 2\delta_n = 0,

hence {A\bar x = y^\dagger}. Since {\bar x\in M}, the uniqueness assumption gives {\bar x = x^\dagger}, which contradicts {d_X(\bar x,x^\dagger)\geq\epsilon}. \Box

Coming back to the interpretation of Theorem 1 and Ivanov regularization: Instead of Ivanov regularization, one could also use the following feasibility problem: Find an {x} such that both {d_Y(Ax,y^\delta)\leq\delta} and {x\in M}. In the case of vector spaces {X} and {Y} and a convex set {M}, this would be a convex feasibility problem which one may attack with available methods.
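Here is a toy sketch in a linear Hilbert space setting: one element of {X_\delta^M} is computed by attacking the Ivanov problem {\min\ \|Ax-y^\delta\|\ \text{ s.t. }\ \|x\|\leq\tau} with projected gradient descent (operator, {\tau} and all parameters are made up):

```python
import numpy as np

# Toy sketch: compute one element of X_delta^M by solving the Ivanov problem
#   min 0.5*||Ax - y_delta||^2   s.t.   ||x|| <= tau
# with projected gradient descent (operator, tau and all parameters are made up).
rng = np.random.default_rng(6)
n = 60
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = U @ np.diag(0.9 ** np.arange(n)) @ V.T
x_true = V[:, 0]
delta = 1e-2
noise = rng.standard_normal(n); noise *= delta / np.linalg.norm(noise)
y_delta = A @ x_true + noise

tau = np.linalg.norm(x_true)                 # prior knowledge: ||x_true|| <= tau

def project_ball(x, tau):
    nx = np.linalg.norm(x)
    return x if nx <= tau else x * (tau / nx)

step = 1.0 / np.linalg.norm(A, 2) ** 2
x = np.zeros(n)
for _ in range(5000):
    x = project_ball(x - step * A.T @ (A @ x - y_delta), tau)

print("discrepancy:", np.linalg.norm(A @ x - y_delta), " noise level:", delta)
print("error to x_true:", np.linalg.norm(x - x_true))
```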

A further important remark is that we did not assume any linearity of {A} (of course: we did not even assume a linear structure on {X} or {Y}). Hence, the theorem seems very powerful: There is no regularization parameter involved and one still gets convergence to the true solution! However, one of the assumptions in the theorem is somewhat strong: the uniqueness of {x^\dagger}. To illustrate this we consider a special case:

Example 1 Let {X} and {Y} be real (or complex) vector spaces and {A} be linear with non-trivial null space. Furthermore, assume that {M\subset X} is convex and compact and consider scaled versions {\tau M} for {\tau>0}. Then the set of solutions of {Ax=y^\dagger} is an affine space in {X} and there are three cases for the intersection of this set and {\tau M}:

  1. The intersection is empty.
  2. The intersection is a convex set and contains infinitely many elements.
  3. The intersection contains exactly one element.

The last case occurs precisely when the affine space of solutions is tangential to {\tau M}. Loosely speaking, one may say that this case only occurs if the set {M} is scaled to precisely the right size such that it only touches the affine space of solutions.

Another strong assumption in Theorem 1 is that the set {M} is compact. First, there is a way to relax this condition somewhat. Basically, we only need compactness to obtain the convergent subsequence. Hence, one could try to work with a weaker topology on {X} (which results in a weaker notion of compactness) and then obtain a limit of a subsequence which converges in the weaker sense only. Then one needs some tool to deduce that the weak limit is indeed a solution. This strategy works, for example, in Banach spaces:

Example 2 Let {X} and {Y} be reflexive Banach spaces and {A:X\rightarrow Y} be linear, bounded and one-to-one. We use the set {M = \{x\ :\ \|x\|_X\leq R\}} as prior knowledge on the solution of {Ax=y^\dagger}. Moreover, we use the metrics induced by the norms of {X} and {Y}, respectively: {d_X(x,x') = \|x-x'\|_X} and {d_Y(y,y') = \|y-y'\|_Y}.

Obviously, {M} is not compact (if {X} is infinite dimensional) but it is weakly compact (and by the Eberlein-Smulian theorem also weakly sequentially compact). In the situation of the proof of Theorem 1 we only get a weakly converging subsequence {x_{\delta_n}\rightharpoonup \bar x}. However, a bounded linear operator {A} is also weak-to-weak continuous, and hence {Ax_{\delta_n}\rightharpoonup A\bar x}. While we only have a weakly converging sequence, we can still obtain the contradiction in the estimate from the proof of Theorem 1 since the norm is weakly lower semicontinuous.

Another way to justify the assumption that the solution is in a known compact set is that in practice we always use a representation of the solution which uses only a finite number of degrees of freedom (think of a Galerkin ansatz, for example). However, this interpretation somehow neglects that we are interested in finding the true solution of the true infinite dimensional problem and that the discretization of the problem should be treated as a separate issue. Just building on the regularizing effect of discretization will almost surely result in a method whose stability properties depend on the resolution of the discretization.

Finally: Another good reference for these somewhat ancient results in regularization theory is one of the first books on the topic: “Solutions of ill-posed problems” by Tikhonov and Arsenin (1977). While it took me some time to get used to the type of presentation, I have to admit that it is really worth reading this book (and other translations of Russian mathematical literature).

Recently Arnd Rösch and I organized the minisymposium “Parameter identification and nonlinear optimization” at the SIAM Conference on Optimization. One of the aims of this minisymposium was to initiate more connections between the communities of optimal control of PDEs on the one hand and regularization of ill-posed problems on the other hand. To give a little bit of background, let me formulate the “mother problems” of both fields:

Example 1 (Mother problem in optimal control of PDEs) We consider a bounded Lipschitz domain {\Omega} in {{\mathbb R}^2} (or {{\mathbb R}^3}). Assume that we are given a target (or desired state) {y_d}, a real-valued function on {\Omega}. Our aim is to find a function (or control) {u} (also defined on {\Omega}) such that the solution {y} of the equation

\displaystyle  \begin{array}{rcl}  \Delta y & = u,\ \text{ on }\ \Omega\\ y & = 0,\ \text{ on }\ \partial\Omega. \end{array}

is as close to the desired state {y_d} as possible. Moreover, our control {u} shall obey some pointwise bounds

\displaystyle  a\leq u \leq b.

This motivates the following constrained optimization problem

\displaystyle  \begin{array}{rcl}  \min_{y,u} \|y - y_d\|_{L^2}^2\quad \text{s.t.} & \Delta y = u,\ \text{ on }\ \Omega\\ & y = 0,\ \text{ on }\ \partial\Omega.\\ & a\leq u\leq b. \end{array}

Often, also the regularized problem is considered: For a small {\alpha>0} solve:

\displaystyle  \begin{array}{rcl}  \min_{y,u} \|y - y_d\|_{L^2}^2 + \alpha\|u\|_{L^2}^2\quad \text{s.t.} & \Delta y = u,\ \text{ on }\ \Omega\\ & y = 0,\ \text{ on }\ \partial\Omega.\\ & a\leq u\leq b. \end{array}

(This problem is also extensively treated in Section 1.2.1 of the excellent book “Optimal Control of Partial Differential Equations” by Fredi Tröltzsch.)
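As a small illustration, here is a one-dimensional toy analogue of the regularized problem (domain, target, bounds and discretization are all made up), solved by projected gradient with a simple clipping projection onto the box constraints:

```python
import numpy as np

# 1D toy analogue of the regularized mother problem: Delta y = u on (0,1), y(0)=y(1)=0,
# minimize 0.5*||y - y_d||^2 + 0.5*alpha*||u||^2 subject to a <= u <= b, solved by
# projected gradient (everything here is made up for illustration).
n, alpha, a, b = 100, 1e-4, -5.0, 5.0
h = 1.0 / (n + 1)
t = np.linspace(h, 1 - h, n)
L = (np.diag(-2.0 * np.ones(n)) + np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)) / h ** 2
S = np.linalg.inv(L)                          # control-to-state map u -> y (small n, so fine)
y_d = np.sin(np.pi * t)                       # desired state

def grad(u):
    return S.T @ (S @ u - y_d) + alpha * u

step = 1.0 / (np.linalg.norm(S, 2) ** 2 + alpha)
u = np.zeros(n)
for _ in range(2000):
    u = np.clip(u - step * grad(u), a, b)     # gradient step + projection onto the box

print("tracking error:", np.linalg.norm(S @ u - y_d))
print("active bound constraints:", int(np.sum((u <= a + 1e-8) | (u >= b - 1e-8))))
```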

For inverse problems we may formulate:

Example 2 (Mother problem in inverse problems) Consider a bounded linear operator {A:X\rightarrow Y} between two Hilbert spaces and assume that {A} has non-closed range. In this case, the pseudo-inverse {A^\dagger} is not a bounded operator. Assume now that we have measured data {y^\delta\in Y} which is basically a noisy version of the “true data” {y\in\text{range}A}. Our aim is to approximate a solution of {Ax=y} from the knowledge of {y^\delta}. Since {A} does not have closed range, it is usually the case that {y^\delta} is not in the domain of the pseudo-inverse and {A^\dagger y^\delta} simply does not make sense. A widely used approach, also treated in my previous post, is Tikhonov regularization, that is, solving for a small regularization parameter {\alpha>0}

\displaystyle  \min_x\, \|Ax-y^\delta\|_Y^2 + \alpha\|x\|_X^2

Clearly both mother problems have a very similar mathematical structure: We may use the solution operator of the PDE, denote it by {A}, and restate the mother problem of optimal control of PDEs in a form similar to the mother problem of inverse problems. However, there are some important conceptual differences:

Desired state vs. data: In Example 1, {y_d} is a desired state which, however, may not be reachable. In Example 2, {y^\delta} is noisy data and hence shall not be matched as closely as possible.

Control vs. solution: In Example 1 the result {u} is an optimal control. Its form is not of prime importance, as long as it fulfills the given bounds and allows for a good approximation of {y_d}. In Example 2 the result {x} is the approximate solution itself (which, of course, shall somehow explain the measured data {y^\delta}). Its properties are themselves important.

Regularization: In Example 1 the regularization is mainly there for numerical reasons. The problem itself also has a solution for {\alpha=0}. This is due to the fact that the set of admissible {u} forms a weakly compact set. However, in Example 2 one may not choose {\alpha=0}: first, because the functional will not have a minimizer anymore, and secondly, because one really does not want {\|Ax-y^\delta\|} to be as small as possible since {y^\delta} is corrupted by noise. In particular, the people from inverse problems are interested in the case in which both {y^\delta\rightarrow y\in\text{range}A} and {\alpha\rightarrow 0}. In optimal control of PDEs, however, {\alpha} is often seen as a model parameter which ensures that the control has a somewhat small energy.

These conceptual differences sometimes complicate the dialog between the fields. One often runs into discussion dead ends like “Why should we care about decaying {\alpha} – it’s given?” or “Why do you need these bounds on {u}? They make your problem worse and you may not reach the state as well as possible…”. It often takes some time until the people involved realize that they really pursue different goals, that quantities which even have similar names are different things, and that the minimization problems can nevertheless be solved with the same techniques.

In our minisymposium we had the following talks:

  • “Identification of an Unknown Parameter in the Main Part of an Elliptic PDE”, Arnd Rösch
  • “Adaptive Discretization Strategies for Parameter Identification Problems in PDEs in the Context of Tikhonov Type and Newton Type Regularization”, Barbara Kaltenbacher
  • “Optimal Control of PDEs with Directional Sparsity”, Gerd Wachsmuth
  • “Nonsmooth Regularization and Sparsity Optimization” Kazufumi Ito
  • “{L^1} Fitting for Nonlinear Parameter Identification Problems for PDEs”, Christian Clason
  • “Simultaneous Identification of Multiple Coefficients in an Elliptic PDE”, Bastian von Harrach

Finally, there was my own talk “Error Estimates for joint Tikhonov and Lavrentiev Regularization of Parameter Identification Problems”, which is based on a paper with a similar name which is at http://arxiv.org/abs/0909.4648 and published in Applicable Analysis. The slides of the presentation are here (beware, there may be some wrong exponents in the pdf…).

In a nutshell, the message of the talk is: Bounds both on the control/solution and on the state/data may also be added to a Tikhonov-regularized inverse problem. If the operator has convenient mapping properties, then the bounds will eventually be inactive if the true solution has the same property. Hence, the known estimates for usual inverse problems are asymptotically recovered.

Some time ago I picked up the phrase Ivanov regularization. Starting with an operator A:X\to Y between two Banach spaces (say), one encounters the problem of instability of the solution of Ax=y if A has non-closed range. One dominant tool to regularize the solution is called Tikhonov regularization and consists of minimizing the functional \|Ax - y^\delta\|_Y^p + \alpha \|x\|_X^q. The meaning behind these terms is as follows: The term \|Ax -y^\delta \|_Y^p is often called the discrepancy and it should not be too large, to guarantee that the “solution” somehow explains the data. The term \|x\|_X^q is often called the regularization functional and shall not be too large either, to have some meaningful notion of “solution”. The parameter \alpha>0 is called the regularization parameter and allows a weighting between discrepancy and regularization.

For the case of Hilbert spaces one typically chooses p=q=2 and obtains a functional whose minimizer is given more or less explicitly as

x_\alpha = (A^*A + \alpha I)^{-1} A^* y^\delta.

The existence of this explicit solution seems to be one of the main reasons for the broad usage of Tikhonov regularization in the Hilbert space setting.
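A quick numerical check of this closed-form solution (toy data, made-up sizes):

```python
import numpy as np

# Quick check of the closed-form Tikhonov minimizer: the gradient of
# ||Ax - y_delta||^2 + alpha*||x||^2 vanishes at (A^T A + alpha I)^{-1} A^T y_delta.
rng = np.random.default_rng(7)
m, n, alpha = 40, 60, 0.1
A = rng.standard_normal((m, n))
y_delta = rng.standard_normal(m)

x_alpha = np.linalg.solve(A.T @ A + alpha * np.eye(n), A.T @ y_delta)
grad = 2 * A.T @ (A @ x_alpha - y_delta) + 2 * alpha * x_alpha
print("norm of the gradient at x_alpha:", np.linalg.norm(grad))
```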

Another related approach is sometimes called residual method, however, I would prefer the term Morozov regularization. Here one again balances the terms “discrepancy” and “regularization” but in a different way: One solves

\min \|x\|_X\ \text{s.t.}\ \|Ax-y^\delta\|_Y\leq \delta.

That is, one tries to find an x with minimal norm which explains the data y^\delta up to an accuracy \delta. The idea is that \delta reflects the so-called noise level, i.e. an estimate of the error which is made during the measurement of y. One advantage of Morozov regularization over Tikhonov regularization is that the meaning of the parameter \delta>0 is much clearer than the meaning of \alpha>0. However, there is no closed form solution for Morozov regularization.

Ivanov regularization is yet another method: solve

\min \|Ax-y^\delta\|_Y\ \text{s.t.}\ \|x\|_X \leq \tau.

Here one could say that one wants to have the smallest discrepancy among all x which are not too “rough”.

Ivanov regularization in this form does not have too many appealing properties: The parameter \tau>0 does not seem to have a proper motivation and moreover, there is again no closed form solution.

However, recently the focus of variational regularization (as all these methods may be called) has shifted from using norms to the use of more general functionals. For example, one considers Tikhonov regularization in an abstract form as minimizing

S(Ax,y^\delta) + \alpha R(x)

with a “general” similarity measure S and a general regularization term R, see e.g. the dissertation of Christiane Pöschl (which can be found here, thanks Christiane) or the works of Jens Flemming. Prominent examples for the similarity measure are of course norms of differences or the Kullback-Leibler divergence or the Itakura-Saito divergence which are both treated in this paper. For the regularization term one uses norms and semi-norms in various spaces, e.g. Sobolev (semi-)norms, Besov (semi-)norms, the total variation seminorm or \ell^p norms.

In all these cases, the advantage of Tikhonov regularization of having a closed form solution is gone. Then the most natural choice would be, in my opinion, Morozov regularization, because one may use the noise level directly as a parameter. However, from a practical point of view one should also care about the problem of actually calculating the minimizers of the respective problems. Here, I think that Ivanov regularization becomes important again: Often the similarity measure S is somehow smooth but the regularization term R is nonsmooth (e.g. for total variation regularization or sparse regularization with an \ell^p penalty). Hence, both Tikhonov and Morozov regularization have a nonsmooth objective function. Still, Tikhonov regularization is a bit easier, since the minimization is unconstrained. Morozov regularization has a constraint which is usually quite difficult to handle: e.g. it is usually difficult (maybe even ill-posed?) to project onto the set defined by S(Ax,y^\delta)\leq \delta. Ivanov regularization, in contrast, has a smooth objective functional (at least if the similarity measure is smooth) and a constraint which is usually rather simple (i.e. projections are not too difficult to obtain).
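As an example of such a “simple” projection, here is the Euclidean projection onto the \ell^1 ball (the standard sort-and-threshold procedure), plugged into a projected gradient loop for an Ivanov-type problem \min \tfrac12\|Ax-y^\delta\|^2\ \text{s.t.}\ \|x\|_1\leq\tau (data and parameters are made up):

```python
import numpy as np

# Euclidean projection onto the l1 ball {x : ||x||_1 <= tau} by sorting and thresholding,
# plugged into a projected gradient loop for min 0.5*||Ax - y_delta||^2 s.t. ||x||_1 <= tau.
def project_l1_ball(x, tau):
    if np.sum(np.abs(x)) <= tau:
        return x
    u = np.sort(np.abs(x))[::-1]
    css = np.cumsum(u)
    idx = np.nonzero(u * np.arange(1, x.size + 1) > css - tau)[0][-1]
    theta = (css[idx] - tau) / (idx + 1.0)
    return np.sign(x) * np.maximum(np.abs(x) - theta, 0.0)

rng = np.random.default_rng(8)
m, n, tau = 40, 100, 1.0
A = rng.standard_normal((m, n)) / np.sqrt(m)
y_delta = rng.standard_normal(m)

step = 1.0 / np.linalg.norm(A, 2) ** 2
x = np.zeros(n)
for _ in range(1000):
    x = project_l1_ball(x - step * A.T @ (A @ x - y_delta), tau)

print("||x||_1 =", np.linalg.norm(x, 1), " discrepancy =", np.linalg.norm(A @ x - y_delta))
```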

Now I found that all three methods – Tikhonov, Morozov and Ivanov regularization – are treated in the book “Theory of linear ill-posed problems and its applications” by V. K. Ivanov, V. V. Vasin and Vitaliĭ Pavlovich Tanana in Sections 3.2, 3.3 and 3.4. Ivanov regularization goes under the name “method of quasi solutions” (Section 3.2) and Morozov regularization is called the “method of residuals” (Section 3.4). Well, I think I should read these sections a bit more closely now…