Regularization


In sparse recovery one aims to leverage the sparsity of some vector {u} to get a good reconstruction of this vector from a noisy and indirect measurement {u^{0}}. Sparsity can be expressed in two different ways: The analysis way and the synthesis way. For both of them you need a system of other vectors {v_{k}}.

  • The analysis way: Here you consider the inner products {\langle u,v_{k}\rangle} and the sparsity assumption is, that this sequence is sparse.
  • The synthesis way: Here you assume that there is sparse sequence of coefficients {\psi_{k}} such that {u=\sum_{k} \psi_{k}v_{k}}.

The first approach is called “analysis” since here one analyzes the vector {u} with the system of vectors. In the second approach, you use the vectors {v_{k}} to synthesize the vector {u}.

We will keep things very simple here and only consider a noisy observation (and not an indirect one). The recovery approach in the analysis way is then to find {u} as a solution of

\displaystyle \tfrac12\|u-u_{0}\| + \alpha R(\langle u,v_{k}\rangle)

while for the synthesis way you write {u = \sum_{k} \psi_{k}v_{k} = V\psi} (where we collected the vectors {v_{k}} as the columns of the matrix {V}) and solve

\displaystyle \tfrac12\|V\psi-u_{0}\| + \alpha R(\psi)

The functional {R} should enforce the sparsity of the vector of coefficients in the former case and the sparsity of {\psi} in the latter case. With the matrix {V} the analysis way reads as

\displaystyle \tfrac12\|u-u_{0}\| + \alpha R(V^{T}u)

One important observation is, that both approaches coincide if the matrix {V} is orthonormal: Since {V^{-1} = V^{T}} in this case, {u=V\psi} is equivalent to {\psi = V^{T}u}, i.e. {\psi_{k} = \langle u,v_{k}\rangle}. For other sets of vectors {v_{k}}, this is not true and it is not so clear, how the analysis and the synthesis approach are related. In this post I’d like to show one instance where both approaches do lead to results that are not even close to being similar.

1. The “ROF analysis” approach to denoising

If {u^{0}} is an image, the famous Rudin-Osher-Fatemi denoising model fits under the “analysis” framework: The vectors {v_{k}} are all the finite differences of neighboring pixels such that {V^{T}u = \nabla u} with the discrete gradient {\nabla u}. As sparsifying functional you take a mixed {2}{1} norm, i.e. {R(\nabla u) = \|\nabla u\|_{2,1} = \sum_{ij}\sqrt{(\partial_{1}u_{ij})^{2} + (\partial_{2}u_{ij})^{2}}}. The denoised {u} is the solution to

\displaystyle \min_{u} \tfrac12\|u-u^{0}\|^{2} + \alpha\|\nabla u\|_{2,1}

Thus, the model will lead to some denoised {u} with a sparse gradient. This sparse gradient is partly the reason for the success of the model in that it keeps edges, but is also responsible for the drawback of this approach: The denoised image will be mostly piesewise constant.

2. The “ROF synthesis” approach

As {V^{T} = \nabla}, the corresponding synthesis approach would be to solve

\displaystyle \min_{\psi}\tfrac12\|\nabla^{T}\psi - u^{0}\|^{2} + \alpha \|\psi\|_{2,1}

and the set {u = \nabla^{T}\psi}. In this approach, the denoised image will be a sparse linear combination of particular two-pixel images. If you think about that, it does not really sound like a good idea to approximate an image by a sparse linear combination of two-pixel images. (One technicality: The operator {\nabla^{T}} is not onto – {\nabla} has a non-trivial kernel, consisting of the constant images and hence, all images in the range of {\nabla^{T}} have zero mean. Thus, to get something remotely close to {u^{0}} one should subtract the mean of {u^{0}} before the reconstruction and add it back afterwards.)

Here are some results:

rof_analysis-_synthesis

 

Note that the results actually fit to what we would have predicted: For larger {\alpha} we get a “sparser gradient” (less edges, less variation) for the analysis approach, and “sparser two-pixel combinations” for the synthesis approach. In fact, for the synthesis approach to give something that is remotely close to the image, one needs quite a lot two-pixel images, i.e. a very low sparsity.

In conclusion, one should not consider the analysis and synthesis approach to be something similar.

Incidentally, the dual of the analysis approach

\displaystyle \min_{u} \tfrac12\|u-u^{0}\|^{2} + \alpha\|\nabla u\|_{2,1}

can be written as

\displaystyle \min_{\|\phi\|_{2,\infty}\leq\alpha}\tfrac12\|\nabla^{T}\phi-u^{0}\|^{2} = \min_{\phi}\tfrac12\|\nabla^{T}\phi-u^{0}\|^{2} + I_{\|\cdot\|_{2,\infty}\leq\alpha}(\phi)

which looks related to the synthesis approach (one basically replaces the {2,1}-norm by its conjugate function). Actually, a similar relation between the dual of the analysis approach and the synthesis approach always holds.

Advertisements

Some remark before you read this post: It is on a very specialized topic and only presents a theoretical insight which seems to be of no practical value whatsoever. Continue at your own risk of wasting your time.

Morozov’s discrepancy principle is a means to choose the regularization parameter when regularizing inverse ill-posed problems. To fix the setting, we consider two Hilbert spaces {X} and {Y} and a bounded linear operator {A:X\rightarrow Y}. We assume that the range of {A} is a non-closed subset of {Y}. As a consequence of the Bounded Inverse Theorem the pseudo-inverse {A^\dag} (defined on {{\mathrm rg} A \oplus ({\mathrm rg} A)^\bot}) is unbounded.

This is the classical setting for linear inverse problems: We have a process, modeled by {A}, such that the quantity {x^\dag\in X} we are interested in gives rise to on output {y^\dag = Ax^\dag}. We are able to measure {y} but, as it is always the case, the is some noise introduced in the measurement process and hence, we have access to a measured quantity {y^\delta\in Y} which is a perturbation of {y}. Our goal is, to approximate {x^\dag} from the knowledge of {y^\delta}. Note that simply taking {A^\dag y^\delta} is not an option, since firstly {y^\delta} is not guaranteed to be in the domain of the pseudo-inverse and, somehow even more severely and also more practical, the unboundedness of {A^\dag} will lead to a severe (in theory infinitely large) amplification of the noise, rendering the reconstruction useless.

The key idea is to approximate the pseudo-inverse {A^\dag} by a family of bounded operators {R_\alpha:Y\rightarrow X} with the hope that one may have

\displaystyle  R_\alpha y^\delta \rightarrow x^\dag\ \text{when}\ y^\delta\rightarrow y\in{\mathrm rg} A\ \text{and}\ \alpha\rightarrow 0 \ \text{appropriately.} \ \ \ \ \ (1)

(Note that the assumption {\alpha\rightarrow 0} is just a convention. It says that we assume that {R_\alpha} is a closer approximation to {A^\dag} the closer {\alpha} is to zero.) Now, we have two tasks:

  1. Construct a good family of regularizing operators {R_\alpha} and
  2. devise a parameter choice, i.e. a way to choose {\alpha}.

The famous Bakushinskii veto says that there is no parameter choice that can guarantee convergence in~(1) in the worst case and only uses the given data {y^\delta}. The situation changes if one introduces knowledge about the noise level {\delta = \|y-y^\delta\|}. (There is an ongoing debate if it is reasonable to assume that the noise level is known – my experience when working with engineers is that they are usually very good in quantifying the noise present in their system and hence, in my view the assumption that noise level is known is ok.)

One popular way to choose the regularization parameter in dependence on {y^\delta} and {\delta} is Morozov’s discrepancy principle:

Definition 1 Morozov’s discrepancy principle states that one shall choose the regularization parameter {\alpha(\delta,y^\delta)} such that

\displaystyle  \|AR_{\alpha(\delta,y^\delta)}y^\delta - y^\delta\| = c\delta

for some fixed {c>1}.

In other words: You shall choose {\alpha} such that the reconstruction {R_\alpha y^\delta} produces a discrepancy {\|AR_{\alpha}y^\delta - y^\delta\|} which is in the order of and slightly larger than the noise level.

Some years ago I wrote a paper about the use of Morozov’s discrepancy principle when using the augmented Lagragian method (aka Bregman iteration) as an iterative regularization method (where one can view the inverse of the iteration counter as a regularization parameter). The paper is Morozov’s principle for the augmented Lagrangian method applied to linear inverse problems (together with , the arxiv link is here). In that paper we derived an estimate for the (squared) error of {R_\alpha y^\delta} and {x^\dag} that behaves like

\displaystyle  C \frac{c(\sqrt{c}+1)}{\sqrt{c-1}}\delta

for some {C>0} and the {c>1} from Morozov’s discrepancy principle. The somehow complicated dependence on {c} was a bit puzzling to me. One can optimize {c>1} such that the error estimate is optimal. It turns out that {c\mapsto \frac{c(\sqrt{c}+1)}{\sqrt{c-1}}} attains the minimal value of about {4.68} for about {c=1.64}. I blamed the already quite complicated analysis of the augmented Lagragian method for this obviously much too complicated values (and in practice, using {c} much closer to {1} usually lead to much better results).

This term I teach a course on inverse problems and also covered Morozov’s discrepancy principle but this time for much simpler regularization methods, namely for linear methods such as, for example, Tikhonov regularization, i.e.

\displaystyle  R_\alpha y^\delta = (A^* A + \alpha I)^{-1} A^* y^\delta

(but other linear methods exist). There I arrived at an estimate for the (squared) error of {R_\alpha y^\delta} any {x^\dag} of the form

\displaystyle  C(\sqrt{c} + \tfrac1{\sqrt{c-1}})^2\delta.

Surprisingly, the dependence on {c} for this basic regularization method is also not very simple. Optimizing the estimate over {c>1} leads to an optimal value of about {c= 2.32} (with a minimal value of the respective constant of about {5.73}). Again, using {c} closer to {1} usually lead to much better results.

Well, these observations are not of great importance… I just found it curious to observe that the analysis of Morozov’s discrepancy principle seems to be inherently a bit more complicated than I thought…

Today I’d like to collect some comments one a few papers I stumbled upon recently on the arXiv.

1. TGV minimizers in 1D

First, about a month ago two very similar paper appeared in the same week:

Both papers treat the recently proposed “total generalized variation” model (which is a somehow-but-not-really-higher-order generalization of total variation). The total variation of a function {u\in L^1(\Omega)} ({\Omega\subset{\mathbb R}^d}) is defined by duality

\displaystyle  TV(u) = \sup\Big\{\int_\Omega \mathrm{div} \phi\, u\,dx\ :\ \phi\in C^\infty_c(\Omega,{\mathbb R}^d), |\phi|\leq 1\Big\}.

(Note that the demanded high regularity of the test functions {\phi} is not essential here, as we take a supremum over all these functions under the only, but important, requirement that the functions are bounded. Test functions from {C^1_c(\Omega,{\mathbb R}^d)} would also do.)

Several possibilities for extensions and generalization of the total variation exist by somehow including higher order derivatives. The “total generalized variation” is a particular successful approach which reads as (now using two non-negative parameter {\alpha,\beta} which do a weighting):

\displaystyle  TGV_{\beta,\alpha}^2(u) = \sup\Big\{\int_\Omega \mathrm{div}^2 \phi\, u\,dx\ :\ \phi\in C^\infty_c(\Omega,S^{d\times d}),\ |\phi|\leq \beta,\ |\mathrm{div}\phi|\leq \alpha\Big\}.

To clarify some notation: {S^{d\times d}} are the symmetric {d\times d} matrices, {\mathrm{div}^n} is the negative adjoint of {\nabla^n} which is the differential operator that collects all partial derivatives up to the {n}-th order in a {d\times\cdots\times d}-tensor. Moreover {|\phi|} is some matrix norm (e.g. the Frobenius norm) and {|\mathrm{div}\phi|} is some vector norm (e.g. the 2-norm).

Both papers investigate so called denoising problems with TGV penalty and {L^2} discrepancy, i.e. minimization problems

\displaystyle  \min_u \frac12\int_\Omega(u-u^0)^2\, dx + TGV_{\alpha,\beta}^2(u)

for a given {u^0}. Moreover, both papers treat the one dimensional case and investigate very special cases in which they calculate minimizers analytically. In one dimension the definition of {TGV^2} becomes a little more familiar:

\displaystyle  TGV_{\beta,\alpha}^2(u) = \sup\Big\{\int_\Omega \phi''\, u\,dx\ :\ \phi\in C^\infty_c(\Omega,{\mathbb R}),\ |\phi|\leq \beta,\ |\phi'|\leq \alpha\Big\}.

Some images of both papar are really similar: This one from Papafitsoros and Bredies

068_TGV_hat1

and this one from Pöschl and Scherzer
068_TGV_hat2

Although both paper have a very similar scopes it is worth to read both. The calculations are tedious but both paper try to make them accessible and try hard (and did a good job) to provide helpful illustrations. Curiously, the earlier paper cites the later one but not conversely…

2. Generalized conditional gradient methods

Another paper I found very interesting was

This paper shows a nice duality which I haven’t been aware of, namely the one between the subgradient descent methods and conditional gradient methods. In fact the conditional gradient method which is treated is a generalization of the conditional gradient method which Kristian and I also proposed a while ago in the context of {\ell^1}-minimization in the paper Iterated hard shrinkage for minimization problems with sparsity constraints: To minimize the sum

\displaystyle  F(u) + \Phi(u)

with a differentiable {F} and a convex {\Phi} for which the subgradient of {\Phi} is easily invertible (or, put differently, for which you can minimize {\langle u,a\rangle + \Phi(u)} easily), perform the following iteration:

  1. At iterate {u^n} linearize {F} but not {\Phi} and calculate a new point {v^n} by

    \displaystyle  v^n = \mathrm{argmin}_v \langle F'(u^n),v\rangle + \Phi(v)

  2. Choose a stepsize {s^n\in [0,1]} and set the next iterate as a convex combination of {u^n} and {v^n}

    \displaystyle  u^{n+1} = u^n + s_n(v^n - u^n).

Note that for and indicator function

\displaystyle  \Phi(u) = \begin{cases} 0 & u\in C\\ \infty & \text{else} \end{cases}

you obtain the conditional gradient method (also known as Frank-Wolfe method). While Kristian and I derived convergence with an asymptotic rate for the case of {F(u) = \tfrac12\|Ku-f\|^2} and {\Phi} strongly coercive, Francis uses the formulation {F(u) = f(Au)} the assumption that the dual {f^*} of {f} has a bounded effective domain (which say that {f} has linear growth in all directions). With this assumption he obtains explicit constants and rates also for the primal-dual gap. It was great to see that eventually somebody really took the idea from the paper Iterated hard shrinkage for minimization problems with sparsity constraints (and does not think that we do heuristics for {\ell^0} minimization…).

A quick post to keep track of several things:

Another few notes to myself:

If you are working on optimization with partial differential equations as constraints, you may be interested in the website

“OPTPDE – A Collection of Problems in PDE-Constrained Optimization”, http://www.optpde.net.

If you have developed an algorithm which can handle a certain class of optimization problems you need to do evaluations and tests on how well the method performs. To do so, you need well constructed test problems. This could be either problems where the optimal solution is known analytically our problems where the solution is known with a rigorous error bound obtained with a bullet-proof solver. Both things are not always easy to obtain and OPTPDE shall serve as a resource for such problems. It has been designed by Roland Herzog, Arnd Rösch, Stefan Ulbrich and Winnifried Wollner.

The generation of test instance for optimization problems seems quite important to me and indeed, several things can go wrong if this is not done right. Frequently, one sees tests for optimization routines on problems where the optimal solution is not known. Since there are usually different ways to express optimality conditions it is not always clear how to check for optimality; even more so, if you only check for “approximate optimality”, e.g. up to machine precision. A frequently observed effect is a kind of “trusted method bias”. By this I mean that an optimal solution is calculated by some trusted method and comparing the outcome of the tested routine with this solution. However, the trusted method uses some stopping criterion usually based on some specific set of formulations of optimality conditions and these can be different from what the new method has been tuned to. And most often, the stopping criteria do not give a rigorous error bound for the solution or the optimal objective value.

For sparse reconstruction problems, I dealt with this issue in “Constructing test instance for Basis Pursuit Denoising” (preprint available here) but I think this methodology could be used for other settings as well.

Today there are several things I could blog on. The first is the planary by Rich Baraniuk on Compressed Sensing. However, I don’t think that I could reflect the content in a way which would be helpful for a potential reader. Just for the record: If you have the chance to visit one of Rich’s talk: Do it!

The second thing is the talk by Bernd Hofmann on source conditions, smoothness and variational inequalities and their use in regularization of inverse problems. However, this would be too technical for now and I just did not take enough notes to write a meaningful post.

As a third thing I have the talk by Christian Clason on inverse problems with uniformly distributed noise. He argued that for uniform noise it is much better to use an {L^\infty} discrepancy term instead of the usual {L^2}-one. He presented a path-following semismooth Newton method to solve the problem

\displaystyle \min_x \frac{1}{p}\|Kx-y^\delta\|_\infty^p + \frac{\alpha}{2}\|x\|_2^2

and showed examples with different kinds of noise. Indeed the examples showed that {L^\infty} works much better than {L^2} here. But in fact it works even better, if the noise is not uniformly distributed but “impulsive” i.e. it attains bounds {\pm\delta} almost everywhere. It seems to me that uniform noise would need a slightly different penalty but I don’t know which one – probably you do? Moreover, Christian presented the balancing principle to choose the regularization parameter (without knowledge about the noise level) and this was the first time I really got what it’s about. What one does here is, to choose {\alpha} such that (for some {\sigma>0} which only depends on {K}, but not on the noise)

\displaystyle \sigma\|Kx_\alpha^\delta-y^\delta\|_\infty = \frac{\alpha}{2}\|x_\alpha^\delta\|_2^2.

The rational behind this is, that the left hand side is monotonically non-decreasing in {\alpha}, while the right hand side is monotonically non-increasing. Hence, there should be some {\alpha} “in the middle” which make both somewhat equally large. Of course, we do neither want to “over-regularize” (which would usually “smooth too much”) nor to “under-regularize” (which would not eliminate noise). Hence, balancing seems to be a valid choice. From a practical point of view the balancing is also nice because one can use the fixed-point iteration

\displaystyle \alpha^{n+1} = 2\sigma\frac{\|Kx_{\alpha^n}^\delta - y^\delta\|_\infty}{\|x_{\alpha_n}^\delta\|_2^2}

which converges in a few number of iterations.

Then there was the talk by Esther Klann, but unfortunately, I was late so only heard the last half…

Last but not least we have the talk by Christiane Pöschl. If you are interested in Total-Variation-Denoising (TV denoising), then you probably have heard many times that “TV denoising preserves edges” (have a look at the Wikipedia page – it claims this twice). What Christiane showed (in a work with Vicent Caselles and M. Novaga) that this claim is not true in general but only for very special cases. In case of characteristic functions, the only functions for which the TV minimizer has sharp edges are these so-called calibrated sets, introduced by Caselles et el. Building on earlier works by Caselles and co-workers she calculated exact minimizers for TV denoising in the case that the image consists of characteristic functions of two convex sets or of a single star shaped domain, that is, for a given set B she calculated the solution of

\displaystyle \min_u\int (u - \chi_B)^2dx + \lambda \int|Du|.

This is not is as easy as it may sound. Even for the minimizer for a single convex set one has to make some effort. She presented a nice connection of the shape of the obtained level-sets with the morphological operators of closing and opening. With the help of this link she derived a methodology to obtain the exact TV denoising minimizer for all parameters. I do not have the images right now but be assured that most of the time, the minimizers do not have sharp edges all over the place. Even for simple geometries (like two rectangles touching in a corner) strange things happen and only very few sharp edges appear. I’ll keep you posted in case the paper comes out (or appears as a preprint).

Christiane has some nice images which make this much more clear:

For two circles edges are preserved if they are far enough away from each other. If they are close, the area “in between” them is filled and, moreover, obey this fuzzy boundary. I remember myself seeing effects like this in the output of TV-solvers and thinking “well, it seems that the algorithm is either not good or not converged yet – TV should output sharp edges!”.

 

For a star-shaped shape (well, actually a star) the output looks like this. The corners are not only rounded but also blurred and this is true both for the “outer” corners and the “inner” corners.

 

So, if you have any TV-minimizing code, go ahead and check if your code actually does the right things on images like this!
Moreover, I would love to see similar results for more complicated extensions of TV like Total Generalized Variation, I treated here.

 

 

 

 

Next Page »