Regularization


A remark before you read this post: It is on a very specialized topic and only presents a theoretical insight which seems to be of no practical value whatsoever. Continue at your own risk of wasting your time.

Morozov’s discrepancy principle is a means to choose the regularization parameter when regularizing ill-posed inverse problems. To fix the setting, we consider two Hilbert spaces {X} and {Y} and a bounded linear operator {A:X\rightarrow Y}. We assume that the range of {A} is a non-closed subspace of {Y}. As a consequence of the Bounded Inverse Theorem the pseudo-inverse {A^\dag} (defined on {{\mathrm rg} A \oplus ({\mathrm rg} A)^\bot}) is unbounded.

This is the classical setting for linear inverse problems: We have a process, modeled by {A}, such that the quantity {x^\dag\in X} we are interested in gives rise to an output {y^\dag = Ax^\dag}. We would like to measure {y^\dag} but, as is always the case, some noise is introduced in the measurement process and hence, we only have access to a measured quantity {y^\delta\in Y} which is a perturbation of {y^\dag}. Our goal is to approximate {x^\dag} from the knowledge of {y^\delta}. Note that simply taking {A^\dag y^\delta} is not an option, since firstly {y^\delta} is not guaranteed to be in the domain of the pseudo-inverse and, even more severely and also more relevant in practice, the unboundedness of {A^\dag} will lead to a severe (in theory infinitely large) amplification of the noise, rendering the reconstruction useless.

The key idea is to approximate the pseudo-inverse {A^\dag} by a family of bounded operators {R_\alpha:Y\rightarrow X} with the hope that one may have

\displaystyle  R_\alpha y^\delta \rightarrow x^\dag\ \text{when}\ y^\delta\rightarrow y\in{\mathrm rg} A\ \text{and}\ \alpha\rightarrow 0 \ \text{appropriately.} \ \ \ \ \ (1)

(Note that the assumption {\alpha\rightarrow 0} is just a convention. It says that we assume that {R_\alpha} is a closer approximation to {A^\dag} the closer {\alpha} is to zero.) Now, we have two tasks:

  1. Construct a good family of regularizing operators {R_\alpha} and
  2. devise a parameter choice, i.e. a way to choose {\alpha}.

The famous Bakushinskii veto says that no parameter choice which uses only the given data {y^\delta} can guarantee convergence in (1) in the worst case. The situation changes if one introduces knowledge about the noise level {\delta = \|y^\dag-y^\delta\|}. (There is an ongoing debate whether it is reasonable to assume that the noise level is known – my experience when working with engineers is that they are usually very good at quantifying the noise present in their system and hence, in my view, the assumption that the noise level is known is OK.)

One popular way to choose the regularization parameter in dependence on {y^\delta} and {\delta} is Morozov’s discrepancy principle:

Definition 1 Morozov’s discrepancy principle states that one shall choose the regularization parameter {\alpha(\delta,y^\delta)} such that

\displaystyle  \|AR_{\alpha(\delta,y^\delta)}y^\delta - y^\delta\| = c\delta

for some fixed {c>1}.

In other words: You shall choose {\alpha} such that the reconstruction {R_\alpha y^\delta} produces a discrepancy {\|AR_{\alpha}y^\delta - y^\delta\|} which is on the order of, and slightly larger than, the noise level.
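
To make the principle concrete, here is a minimal numerical sketch (my own illustration, not from any particular paper) which uses Tikhonov regularization {R_\alpha y^\delta = (A^*A+\alpha I)^{-1}A^*y^\delta} (which also appears further below) as the family {R_\alpha}; since the discrepancy is monotonically increasing in {\alpha}, a simple bisection finds the {\alpha} with discrepancy {c\delta}:

    import numpy as np

    def tikhonov(A, y, alpha):
        # R_alpha y = (A^T A + alpha I)^{-1} A^T y
        return np.linalg.solve(A.T @ A + alpha * np.eye(A.shape[1]), A.T @ y)

    def discrepancy(A, y, alpha):
        return np.linalg.norm(A @ tikhonov(A, y, alpha) - y)

    def morozov_alpha(A, y_delta, delta, c=1.5, lo=1e-14, hi=1e4):
        # the discrepancy increases with alpha, so bisect on a log scale
        for _ in range(100):
            mid = np.sqrt(lo * hi)
            if discrepancy(A, y_delta, mid) > c * delta:
                hi = mid
            else:
                lo = mid
        return np.sqrt(lo * hi)

    # tiny synthetic test: operator with rapidly decaying singular values
    rng = np.random.default_rng(0)
    U, _ = np.linalg.qr(rng.standard_normal((50, 50)))
    V, _ = np.linalg.qr(rng.standard_normal((20, 20)))
    A = U[:, :20] * (0.8 ** np.arange(20)) @ V.T
    x_true = rng.standard_normal(20)
    noise = rng.standard_normal(50)
    delta = 1e-3
    y_delta = A @ x_true + delta * noise / np.linalg.norm(noise)
    alpha = morozov_alpha(A, y_delta, delta)
    print(alpha, np.linalg.norm(tikhonov(A, y_delta, alpha) - x_true))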

Some years ago I wrote a paper about the use of Morozov’s discrepancy principle when using the augmented Lagrangian method (aka Bregman iteration) as an iterative regularization method (where one can view the inverse of the iteration counter as a regularization parameter). The paper is Morozov’s principle for the augmented Lagrangian method applied to linear inverse problems (the arxiv link is here). In that paper we derived an estimate for the (squared) error between {R_\alpha y^\delta} and {x^\dag} that behaves like

\displaystyle  C \frac{c(\sqrt{c}+1)}{\sqrt{c-1}}\delta

for some {C>0} and the {c>1} from Morozov’s discrepancy principle. The somewhat complicated dependence on {c} was a bit puzzling to me. One can optimize over {c>1} such that the error estimate becomes minimal. It turns out that {c\mapsto \frac{c(\sqrt{c}+1)}{\sqrt{c-1}}} attains its minimal value of about {4.68} at about {c=1.64}. I blamed the already quite involved analysis of the augmented Lagrangian method for these unexpectedly complicated values (and in practice, using {c} much closer to {1} usually leads to much better results).

This term I am teaching a course on inverse problems and covered Morozov’s discrepancy principle again, but this time for much simpler regularization methods, namely for linear methods such as, for example, Tikhonov regularization, i.e.

\displaystyle  R_\alpha y^\delta = (A^* A + \alpha I)^{-1} A^* y^\delta

(but other linear methods exist). There I arrived at an estimate for the (squared) error between {R_\alpha y^\delta} and {x^\dag} of the form

\displaystyle  C(\sqrt{c} + \tfrac1{\sqrt{c-1}})^2\delta.

Surprisingly, the dependence on {c} for this basic regularization method is also not very simple. Optimizing the estimate over {c>1} leads to an optimal value of about {c= 2.32} (with a minimal value of the respective constant of about {5.73}). Again, using {c} closer to {1} usually leads to much better results.
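
Just to double-check the two numbers, here is a tiny snippet (my own quick check, assuming scipy is available) that minimizes both constants over {c>1}:

    import numpy as np
    from scipy.optimize import minimize_scalar

    # constant from the augmented Lagrangian estimate
    f_alm = lambda c: c * (np.sqrt(c) + 1) / np.sqrt(c - 1)
    # constant from the estimate for linear methods (e.g. Tikhonov)
    f_lin = lambda c: (np.sqrt(c) + 1 / np.sqrt(c - 1)) ** 2

    for f in (f_alm, f_lin):
        res = minimize_scalar(f, bounds=(1 + 1e-9, 20), method="bounded")
        print(round(res.x, 2), round(res.fun, 2))
    # prints roughly 1.64, 4.68 and 2.32, 5.73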

Well, these observations are not of great importance… I just found it curious to observe that the analysis of Morozov’s discrepancy principle seems to be inherently a bit more complicated than I thought…

Today I’d like to collect some comments on a few papers I stumbled upon recently on the arXiv.

1. TGV minimizers in 1D

First, about a month ago two very similar papers appeared in the same week: one by Papafitsoros and Bredies and one by Pöschl and Scherzer.

Both papers treat the recently proposed “total generalized variation” model (which is a somehow-but-not-really-higher-order generalization of total variation). The total variation of a function {u\in L^1(\Omega)} ({\Omega\subset{\mathbb R}^d}) is defined by duality

\displaystyle  TV(u) = \sup\Big\{\int_\Omega \mathrm{div} \phi\, u\,dx\ :\ \phi\in C^\infty_c(\Omega,{\mathbb R}^d), |\phi|\leq 1\Big\}.

(Note that the demanded high regularity of the test functions {\phi} is not essential here, as we take a supremum over all these functions under the only, but important, requirement that the functions are bounded. Test functions from {C^1_c(\Omega,{\mathbb R}^d)} would also do.)

Several possibilities exist to extend and generalize the total variation by including higher order derivatives in some way. The “total generalized variation” is a particularly successful approach which reads as follows (now using two non-negative parameters {\alpha,\beta} which do a weighting):

\displaystyle  TGV_{\beta,\alpha}^2(u) = \sup\Big\{\int_\Omega \mathrm{div}^2 \phi\, u\,dx\ :\ \phi\in C^\infty_c(\Omega,S^{d\times d}),\ |\phi|\leq \beta,\ |\mathrm{div}\phi|\leq \alpha\Big\}.

To clarify some notation: {S^{d\times d}} are the symmetric {d\times d} matrices, {\mathrm{div}^n} is the negative adjoint of {\nabla^n}, which is the differential operator that collects all partial derivatives of {n}-th order in a {d\times\cdots\times d}-tensor. Moreover {|\phi|} is some matrix norm (e.g. the Frobenius norm) and {|\mathrm{div}\phi|} is some vector norm (e.g. the 2-norm).

Both papers investigate so-called denoising problems with TGV penalty and {L^2} discrepancy, i.e. minimization problems

\displaystyle  \min_u \frac12\int_\Omega(u-u^0)^2\, dx + TGV_{\beta,\alpha}^2(u)

for a given {u^0}. Moreover, both papers treat the one dimensional case and investigate very special cases in which they calculate minimizers analytically. In one dimension the definition of {TGV^2} becomes a little more familiar:

\displaystyle  TGV_{\beta,\alpha}^2(u) = \sup\Big\{\int_\Omega \phi''\, u\,dx\ :\ \phi\in C^\infty_c(\Omega,{\mathbb R}),\ |\phi|\leq \beta,\ |\phi'|\leq \alpha\Big\}.
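
If you want to play with such 1D examples yourself, here is a small sketch (my own, not from either paper) of the discrete denoising problem. It uses the common equivalent primal form {TGV_{\beta,\alpha}^2(u) = \min_w\ \alpha\|u'-w\|_1 + \beta\|w'\|_1} and CVXPY; note that I ignore the scaling by the grid width here.

    import numpy as np
    import cvxpy as cp

    def tgv2_denoise_1d(u0, alpha, beta):
        # min_u 1/2||u - u0||^2 + min_w alpha*||Du - w||_1 + beta*||Dw||_1
        n = len(u0)
        D1 = np.diff(np.eye(n), axis=0)        # forward differences, (n-1) x n
        D2 = np.diff(np.eye(n - 1), axis=0)    # differences of w, (n-2) x (n-1)
        u = cp.Variable(n)
        w = cp.Variable(n - 1)
        obj = (0.5 * cp.sum_squares(u - u0)
               + alpha * cp.norm(D1 @ u - w, 1)
               + beta * cp.norm(D2 @ w, 1))
        cp.Problem(cp.Minimize(obj)).solve()
        return u.value

    # a noisy hat-shaped signal
    t = np.linspace(-1, 1, 200)
    rng = np.random.default_rng(0)
    u0 = np.maximum(0, 1 - 2 * np.abs(t)) + 0.05 * rng.standard_normal(200)
    u = tgv2_denoise_1d(u0, alpha=0.05, beta=0.05)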

Some images in both papers are really similar: this one from Papafitsoros and Bredies

[Figure: 068_TGV_hat1]

and this one from Pöschl and Scherzer
[Figure: 068_TGV_hat2]

Although both papers have very similar scopes it is worth reading both. The calculations are tedious but both papers try to make them accessible and try hard (and did a good job) to provide helpful illustrations. Curiously, the earlier paper cites the later one but not the other way around…

2. Generalized conditional gradient methods

Another paper I found very interesting was a recent one by Francis Bach.

This paper shows a nice duality which I had not been aware of, namely the one between subgradient descent methods and conditional gradient methods. In fact the conditional gradient method which is treated there is a generalization of the conditional gradient method which Kristian and I also proposed a while ago in the context of {\ell^1}-minimization in the paper Iterated hard shrinkage for minimization problems with sparsity constraints: To minimize the sum

\displaystyle  F(u) + \Phi(u)

with a differentiable {F} and a convex {\Phi} for which the subgradient of {\Phi} is easily invertible (or, put differently, for which you can minimize {\langle u,a\rangle + \Phi(u)} easily), perform the following iteration:

  1. At iterate {u^n} linearize {F} but not {\Phi} and calculate a new point {v^n} by

    \displaystyle  v^n = \mathrm{argmin}_v \langle F'(u^n),v\rangle + \Phi(v)

  2. Choose a stepsize {s_n\in [0,1]} and set the next iterate as a convex combination of {u^n} and {v^n}

    \displaystyle  u^{n+1} = u^n + s_n(v^n - u^n).

Note that for an indicator function

\displaystyle  \Phi(u) = \begin{cases} 0 & u\in C\\ \infty & \text{else} \end{cases}

you obtain the conditional gradient method (also known as the Frank-Wolfe method). While Kristian and I derived convergence with an asymptotic rate for the case of {F(u) = \tfrac12\|Ku-f\|^2} and {\Phi} strongly coercive, Francis uses the formulation {F(u) = f(Au)} under the assumption that the dual {f^*} of {f} has a bounded effective domain (which says that {f} has linear growth in all directions). With this assumption he obtains explicit constants and rates also for the primal-dual gap. It was great to see that eventually somebody really took up the idea from the paper Iterated hard shrinkage for minimization problems with sparsity constraints (and did not just think that we were doing heuristics for {\ell^0} minimization…).
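
To illustrate the indicator-function special case, here is a minimal Frank-Wolfe sketch (my own toy code, not from either paper) for {F(u)=\tfrac12\|Ku-f\|^2} and {\Phi} the indicator function of the {\ell^1}-ball of radius {\tau}; the linearized subproblem is then solved by a signed extreme point of the ball and I use the standard step size {s_n = 2/(n+2)}:

    import numpy as np

    def frank_wolfe_l1ball(K, f, tau, iters=500):
        # min 1/2||Ku - f||^2 subject to ||u||_1 <= tau
        u = np.zeros(K.shape[1])
        for n in range(iters):
            grad = K.T @ (K @ u - f)             # F'(u^n)
            i = np.argmax(np.abs(grad))          # linear minimizer over the l1-ball
            v = np.zeros_like(u)
            v[i] = -tau * np.sign(grad[i])       # extreme point of the ball
            s = 2.0 / (n + 2.0)                  # step size
            u = u + s * (v - u)
        return u

    rng = np.random.default_rng(1)
    K = rng.standard_normal((30, 100))
    f = rng.standard_normal(30)
    u = frank_wolfe_l1ball(K, f, tau=2.0)
    print(np.sum(np.abs(u)))                     # stays <= tau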

A few quick notes to myself:

If you are working on optimization with partial differential equations as constraints, you may be interested in the website

“OPTPDE – A Collection of Problems in PDE-Constrained Optimization”, http://www.optpde.net.

If you have developed an algorithm which can handle a certain class of optimization problems you need to do evaluations and tests on how well the method performs. To do so, you need well-constructed test problems. These could be either problems where the optimal solution is known analytically or problems where the solution is known with a rigorous error bound obtained with a bullet-proof solver. Neither is always easy to obtain and OPTPDE shall serve as a resource for such problems. It has been designed by Roland Herzog, Arnd Rösch, Stefan Ulbrich and Winnifried Wollner.

The generation of test instances for optimization problems seems quite important to me and indeed, several things can go wrong if this is not done right. Frequently, one sees tests for optimization routines on problems where the optimal solution is not known. Since there are usually different ways to express optimality conditions it is not always clear how to check for optimality; even more so if you only check for “approximate optimality”, e.g. up to machine precision. A frequently observed effect is a kind of “trusted method bias”. By this I mean that an optimal solution is calculated by some trusted method and the outcome of the tested routine is compared with this solution. However, the trusted method uses some stopping criterion, usually based on some specific formulation of the optimality conditions, and this can be different from what the new method has been tuned to. And most often, the stopping criteria do not give a rigorous error bound for the solution or the optimal objective value.

For sparse reconstruction problems, I dealt with this issue in “Constructing test instances for Basis Pursuit Denoising” (preprint available here) but I think this methodology could be used for other settings as well.

Today there are several things I could blog on. The first is the plenary by Rich Baraniuk on Compressed Sensing. However, I don’t think that I could reflect the content in a way which would be helpful for a potential reader. Just for the record: If you have the chance to visit one of Rich’s talks: Do it!

The second thing is the talk by Bernd Hofmann on source conditions, smoothness and variational inequalities and their use in regularization of inverse problems. However, this would be too technical for now and I just did not take enough notes to write a meaningful post.

As a third thing I have the talk by Christian Clason on inverse problems with uniformly distributed noise. He argued that for uniform noise it is much better to use an {L^\infty} discrepancy term instead of the usual {L^2}-one. He presented a path-following semismooth Newton method to solve the problem

\displaystyle \min_x \frac{1}{p}\|Kx-y^\delta\|_\infty^p + \frac{\alpha}{2}\|x\|_2^2

and showed examples with different kinds of noise. Indeed the examples showed that {L^\infty} works much better than {L^2} here. But in fact it works even better if the noise is not uniformly distributed but “impulsive”, i.e. it attains the bounds {\pm\delta} almost everywhere. It seems to me that uniform noise would need a slightly different penalty but I don’t know which one – probably you do? Moreover, Christian presented the balancing principle to choose the regularization parameter (without knowledge about the noise level) and this was the first time I really got what it’s about. What one does here is to choose {\alpha} such that (for some {\sigma>0} which only depends on {K}, but not on the noise)

\displaystyle \sigma\|Kx_\alpha^\delta-y^\delta\|_\infty = \frac{\alpha}{2}\|x_\alpha^\delta\|_2^2.

The rationale behind this is that the left hand side is monotonically non-decreasing in {\alpha}, while the right hand side is monotonically non-increasing. Hence, there should be some {\alpha} “in the middle” which makes both roughly equally large. Of course, we want neither to “over-regularize” (which would usually “smooth too much”) nor to “under-regularize” (which would not eliminate the noise). Hence, balancing seems to be a valid choice. From a practical point of view the balancing principle is also nice because one can use the fixed-point iteration

\displaystyle \alpha^{n+1} = 2\sigma\frac{\|Kx_{\alpha^n}^\delta - y^\delta\|_\infty}{\|x_{\alpha^n}^\delta\|_2^2}

which converges within a few iterations.
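
In code, the fixed-point iteration is only a few lines (my own sketch; solve_linf_tikhonov is a hypothetical placeholder for an actual solver of the inner problem, e.g. Christian’s path-following semismooth Newton method):

    import numpy as np

    def balance_alpha(K, y_delta, sigma, alpha0=1.0, iters=20, tol=1e-6):
        # fixed point: alpha <- 2*sigma*||K x_alpha - y||_inf / ||x_alpha||_2^2
        alpha = alpha0
        for _ in range(iters):
            # hypothetical solver for min_x 1/p||Kx - y||_inf^p + alpha/2||x||_2^2
            x = solve_linf_tikhonov(K, y_delta, alpha)
            alpha_new = (2 * sigma * np.linalg.norm(K @ x - y_delta, np.inf)
                         / np.linalg.norm(x) ** 2)
            if abs(alpha_new - alpha) <= tol * alpha:
                return alpha_new, x
            alpha = alpha_new
        return alpha, x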

Then there was the talk by Esther Klann, but unfortunately, I was late so only heard the last half…

Last but not least we have the talk by Christiane Pöschl. If you are interested in Total-Variation-Denoising (TV denoising), then you have probably heard many times that “TV denoising preserves edges” (have a look at the Wikipedia page – it claims this twice). What Christiane showed (in a work with Vicent Caselles and M. Novaga) is that this claim is not true in general but only in very special cases. In the case of characteristic functions, the only sets for which the TV minimizer has sharp edges are the so-called calibrated sets, introduced by Caselles et al. Building on earlier works by Caselles and co-workers she calculated exact minimizers for TV denoising in the case that the image consists of characteristic functions of two convex sets or of a single star-shaped domain, that is, for a given set B she calculated the solution of

\displaystyle \min_u\int (u - \chi_B)^2dx + \lambda \int|Du|.

This is not as easy as it may sound. Even the minimizer for a single convex set takes some effort. She presented a nice connection between the shape of the obtained level-sets and the morphological operators of closing and opening. With the help of this link she derived a methodology to obtain the exact TV denoising minimizer for all parameters. I do not have the images right now but be assured that most of the time, the minimizers do not have sharp edges all over the place. Even for simple geometries (like two rectangles touching in a corner) strange things happen and only very few sharp edges appear. I’ll keep you posted in case the paper comes out (or appears as a preprint).

Christiane has some nice images which make this much more clear:

For two circles, edges are preserved if they are far enough away from each other. If they are close, the area “in between” them is filled in and, moreover, gets a fuzzy boundary. I remember seeing effects like this in the output of TV solvers and thinking “well, it seems that the algorithm is either not good or not converged yet – TV should output sharp edges!”.

 

For a star-shaped shape (well, actually a star) the output looks like this. The corners are not only rounded but also blurred and this is true both for the “outer” corners and the “inner” corners.

 

So, if you have any TV-minimizing code, go ahead and check if your code actually does the right things on images like this!
Moreover, I would love to see similar results for more complicated extensions of TV like the total generalized variation which I treated here.
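
A quick check of this kind can, for instance, be set up with CVXPY (my own sketch; the geometry and parameters are of course not Christiane’s exact examples):

    import numpy as np
    import cvxpy as cp

    def tv_denoise(f, lam):
        # min_u ||u - f||_F^2 + lam * TV(u), using cvxpy's isotropic tv atom
        u = cp.Variable(f.shape)
        cp.Problem(cp.Minimize(cp.sum_squares(u - f) + lam * cp.tv(u))).solve()
        return u.value

    # two squares touching in a corner
    f = np.zeros((40, 40))
    f[5:20, 5:20] = 1.0
    f[20:35, 20:35] = 1.0
    u = tv_denoise(f, lam=3.0)
    # inspect a cross section near the touching corner
    print(np.round(u[20, 12:28], 2))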


In this post I gladly announce that three problems that bothered me have been solved: The computational complexity of certifying RIP and NSP and the number of steps the homotopy method needs to obtain a solution of the Basis Pursuit problem.

1. Complexity of RIP and NSP

On this issue we have two recent papers.

The first paper has the more general results and hence we start with the second one, whose main result is this:

Theorem 1 Let a matrix {A}, a positive integer {K} and some {0<\delta<1} be given. It is hard for NP under randomized polynomial-time reductions to check if {A} satisfies the {(K,\delta)} restricted isometry property.

That does not yet say that it’s NP-hard to check if {\delta} is an RIP constant for {K}-sparse vectors but it’s close. I think that Dustin Mixon has explained this issue better on his blog than I could do here.

In the first paper (which is, by the way, an outcome of the SPEAR project in which I am involved…) the main result is indeed the conjectured NP-hardness of calculating RIP constants:

Theorem 2 For a given matrix {A} and a positive integer {K}, it is NP-hard to compute the restricted isometry constant.

Moreover, this is just a corollary to the main theorem of that paper which reads as

Theorem 3 For a given matrix {A} and a positive integer {K}, the problem to decide whether {A} satisfies the restricted isometry property of order {K} for some constant {\delta<1} is coNP-complete.

They also provide a slightly strengthened version of Theorem 1:

Theorem 4 Let a matrix {A}, a positive integer {K} and some {0<\delta<1} be given. It is coNP-complete to check if {A} satisfies the {(K,\delta)} restricted isometry property.

Moreover, the paper by Pfetsch and Tillmann also proves something about the null space property (NSP):

Definition 5 A matrix {A} satisfies the null space property of order {K} if there is a constant {\alpha>0} such that for all elements {x} in the null space of {A} it holds that the sum of the {K} largest absolute values of {x} is smaller than {\alpha} times the 1-norm of {x}. The smallest such constant {\alpha} is called the null space constant of order {K}.

Their main result is as follows:

Theorem 6 For a given matrix {A} and a positive integer {K}, the problem to decide whether {A} satisfies the null space property of order {K} for some constant {\alpha<1} is coNP-complete. Consequently, it is NP-hard to compute the null space constant of {A}.

2. Complexity of the homotopy method for Basis Pursuit

The second issue is about the basis pursuit problem

\displaystyle  \min_x \|x\|_1\quad\text{s.t.}\ Ax=b.

which can be approximated by the “denoising variant”

\displaystyle  \min_x \lambda\|x\|_1 + \tfrac12\|Ax-b\|_2^2.

What is pretty interesting about the denoising variant is that the solution {x(\lambda)} (if it is unique throughout) depends on {\lambda} in a piecewise linear way and converges to the solution of basis pursuit for {\lambda\rightarrow 0}. This leads to an algorithm for the solution of basis pursuit: Start with {\lambda=\|A^Tb\|_\infty} (for which the unique solution is {x(\lambda)=0}), calculate the direction of the “solution path”, follow it until you reach a “break point”, calculate the next direction and so on until {\lambda} hits zero. This is for example implemented for MATLAB in L1Homotopy (the SPAMS package also seems to have this implemented, however, I haven’t used it yet). In practice, this approach (usually called the homotopy method) is pretty fast and moreover, only encounters a few break points. However, an obvious upper bound on the number of break points is exponential in the number of entries in {x}. Hence, it seemed that one was faced with a situation similar to the simplex method for linear programming: The algorithm performs great on average but the worst-case complexity is bad. That this is really true for linear programming has been known for some time due to the Klee-Minty example, an example for which the simplex method takes an exponential number of steps. What I asked myself for some time: Is there a Klee-Minty example for the homotopy method?

Now the answer is there: Yes, there is!

The denoising variant of basis pursuit is also known as LASSO regularization in the statistics literature and this explains the title of the paper by Julien Mairal and Bin Yu which comes up with the example.

Julien and Bin investigate the number of linear segments in the regularization path and first observe that this is upper bounded by {(3^p+1)/2} where {p} is the number of entries in {x} (i.e. the number of variables of the problem). Then they try to construct an instance that matches this upper bound. They succeed in a clever way: For a given instance {(A,b)} whose path has {k} linear segments they construct an instance which has one more variable such that the number of linear segments is roughly tripled. Their result goes like this:

Theorem 7 Let {A\in{\mathbb R}^{n\times p}} have full rank and let {b\in{\mathbb R}^n} be in the range of {A}. Assume that the homotopy path has {k} linear segments and denote by {\lambda_1} the regularization parameter which corresponds to the smallest kink in the path. Now choose {b_{n+1}\neq 0} and {\alpha} such that

\displaystyle   0<\alpha < \frac{\lambda_1}{2\|b\|_2^2 + b_{n+1}^2} \ \ \ \ \ (1)

and define {\tilde b\in{\mathbb R}^{n+1}} and {\tilde A\in{\mathbb R}^{(n+1)\times (p+1)}} by

\displaystyle  \tilde b = \begin{bmatrix} b\\ b_{n+1} \end{bmatrix}, \quad \tilde A = \begin{bmatrix} A & 2\alpha b\\ 0 & \alpha b_{n+1} \end{bmatrix}.

Then the homotopy path for the basis pursuit problem with matrix {\tilde A} and right hand side {\tilde b} has {3k-1} linear segments.
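
The construction in the theorem is straightforward to code up; here is a small sketch (my own, in the notation of the theorem) of the extension step, which reproduces the matrices of the example below when started from {A=[1]}, {b=[1]}:

    import numpy as np

    def extend_instance(A, b, lam1, b_next=1.0, safety=0.75):
        # one step of the construction: given (A, b) whose homotopy path has its
        # smallest kink at lam1, build an instance with one more row and column
        alpha_bound = lam1 / (2 * np.linalg.norm(b) ** 2 + b_next ** 2)
        alpha = safety * alpha_bound           # any 0 < alpha < alpha_bound works
        b_new = np.append(b, b_next)
        A_new = np.block([[A, 2 * alpha * b.reshape(-1, 1)],
                          [np.zeros((1, A.shape[1])), alpha * b_next]])
        return A_new, b_new

    A, b = np.array([[1.0]]), np.array([1.0])
    A2, b2 = extend_instance(A, b, lam1=1.0)   # alpha = 0.75 * 1/3 = 1/4
    print(A2)                                  # [[1, 1/2], [0, 1/4]] as in stage 2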

With this theorem at hand, it is straightforward to recursively build a “Mairal-Yu” example which matches the upper bound for the number of linear segments. The idea is to start with a {1\times 1} example and let it grow by one row and one column according to Theorem 7. We start with the simplest {1\times 1} example, namely {A = [1]} and {b=[1]}. To move to the next bigger example you can choose the next entry {b_{n+1}}, and we always choose {1} for convenience. Moreover, you need the next {\alpha} and you need to know the smallest kink in the path. I calculated the paths and kinks with L1Packv2 by Ignace Loris because it is written in Mathematica, can use exact arithmetic with rational numbers (and you will see that accuracy becomes an issue even for small instances) and seemed bullet-proof to me. Let’s see where this idea brings us:

Example 1 (Mairal-Yu example)

  • Stage 1: We start with {n=p=1}, {b=[1]} and {A=[1]}. The homotopy path has one kink at {\lambda_1=1} (with corresponding solution {[0]}) and hence, two linear segments. Now let’s go to the next larger instance:
  • Stage 2: We can choose the entry {b_2} as we like and choose it equal to 1, i.e. our new {b} is

    \displaystyle  b = \begin{bmatrix} 1\\1 \end{bmatrix}.

    Now we have to choose {\alpha} according to (1), i.e

    \displaystyle  0 < \alpha < \frac{\lambda_1}{2\|b\|_2^2 + b_{n+1}^2} = \frac{1}{2+1} = \frac{1}{3}

    and we can choose, e.g., {\alpha = 1/4} which gives our new matrix

    \displaystyle  A = \begin{bmatrix} 1 & \frac12\\ 0 & \frac14 \end{bmatrix}.

    The calculation of the new regularization path shows that it has exactly the announced number of 5 segments and the parameter of the smallest kink is {\lambda_1 = \frac{1}{13}}.

  • Stage 3: Again we choose {b_{n+1} = 1} giving

    \displaystyle  b = \begin{bmatrix} 1\\1\\1 \end{bmatrix}

    For the choice of {\alpha} we need that

    \displaystyle  0<\alpha < \frac{1}{13(4+1)} = \frac{1}{65}

    and we may choose

    \displaystyle  \alpha = \frac1{80}.

    which gives the next matrix

    \displaystyle  A = \begin{bmatrix} 1 & \frac12 & \tfrac{1}{40}\\ 0 & \frac14 & \tfrac{1}{40}\\ 0 & 0 & \tfrac{1}{80} \end{bmatrix}.

    We calculate the regularization path, observe that it has the predicted 14 segments and that the parameter of the smallest kink is {\lambda_1 = \frac{1}{193}}.

  • Stage 4: Again we choose {b_{n+1} = 1} giving

    \displaystyle  b = \begin{bmatrix} 1\\1\\1\\1 \end{bmatrix}

    For the choice of {\alpha} we need that

    \displaystyle  0<\alpha < \frac{1}{193(6+1)} = \frac{1}{1351}

    and we see that things are getting awkward here…

Proceeding in this way we always increase the number of linear segments for the {n\times n}-case from {k_n} to {k_{n+1} = 3k_n-1} in each step, and one checks easily that this leads to {k_n = (3^n+1)/2}, which is the worst case! If you are interested in the regularization paths: I produced pictures for the first three dimensions (well, I could not draw a 4d {\ell^1}-ball) and here they are:

1d Mairal-Yu example


2d Mairal-Yu example


3d Mairal-Yu example

It is not really easy to perceive the whole paths from the pictures because the magnitudes of the entries vary strongly. I’ve drawn the path in red, each kink marked with a small circle. Moreover, I have drawn the corresponding {\ell^1}-balls of the respective radii to provide more geometric information.

The paper by Mairal and Yu has more results on the paths if one looks for approximate solutions of the linear system but I will not go into detail about them here.

At least two questions come to mind:

  • The Mairal-Yu example is {n\times n}. What is the worst case complexity for the true rectangular case? In other words: What is the complexity for {p\times n} in terms of {p} and {n}?
  • The example and the construction lead to matrices that do not have normalized columns and moreover, the column norms are far from being equal. But matrices with normalized columns seem to be more “well behaved”. Does the worst-case complexity go down if one considers matrices with unit-norm columns? Probably one can construct a unit-norm example by a proper choice of {b}…

Today I’d like to blog about two papers which appeared on the arxiv.

1. Regularization with the Augmented Lagrangian Method – Convergence Rates from Variational Inequalities

The first one is the paper “Regularization of Linear Ill-posed Problems by the Augmented Lagrangian Method and Variational Inequalities” by Klaus Frick and Markus Grasmair.

Well, the title basically describes the content quite accurately. However, recall that the Augmented Lagrangian Method (ALM) is a method to calculate solutions of certain convex optimization problems. For a convex, proper and lower-semicontinuous function {J} on a Banach space {X}, a linear and bounded operator {K:X\rightarrow H} from {X} into a Hilbert space {H} and an element {g\in H} consider the problem

\displaystyle   \inf_{u} J(u)\quad\text{s.t.}\quad Ku=g. \ \ \ \ \ (1)

The ALM goes as follows: Start with an initial dual variable {p_0}, choose step-sizes {\tau_k>0} and iterate

\displaystyle  u_k \in \text{argmin}\Big(\frac{\tau_k}{2}\|Ku-g\|^2 + J(u) + \langle p_{k-1},Ku-g\rangle\Big)

\displaystyle  p_k = p_{k-1}+\tau_k(g-Ku_k).
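
As a toy illustration (my own sketch, not from the paper), here is the iteration for {J(u)=\|u\|_1} in finite dimensions, with the {u}-subproblem simply handed to CVXPY:

    import numpy as np
    import cvxpy as cp

    def alm_l1(K, g, tau=1.0, iters=50):
        # ALM / Bregman iteration for min ||u||_1 s.t. Ku = g
        m, n = K.shape
        p = np.zeros(m)
        for _ in range(iters):
            u = cp.Variable(n)
            obj = (0.5 * tau * cp.sum_squares(K @ u - g)
                   + cp.norm(u, 1)
                   + cp.sum(cp.multiply(p, K @ u - g)))
            cp.Problem(cp.Minimize(obj)).solve()
            uk = u.value
            p = p + tau * (g - K @ uk)
        return uk

    rng = np.random.default_rng(0)
    K = rng.standard_normal((10, 30))
    g = K @ np.concatenate([np.ones(3), np.zeros(27)])
    print(np.round(alm_l1(K, g)[:6], 3))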

(These days one should note that the ALM iteration is also known under the name Bregman iteration…). Indeed, it is known that the ALM converges to a solution of (1) if one exists. Klaus and Markus consider the ill-posed case, i.e. the range of {K} is not closed and {g} is replaced by some {g^\delta} which fulfills {\|g-g^\delta\|\leq\delta} (and hence, {g^\delta} is generally not in the range of {K}). Then, the ALM does not converge but diverges. However, one observes “semi-convergence” in practice, i.e. the iterates approach an approximate “solution to {Ku=g^\delta}” (or even a true solution to {Ku=g}) first but then start to diverge from some point on. It is then natural to ask if the ALM with {g} replaced by {g^\delta} can be used for regularization, i.e. can one choose a stopping index {k^*} (depending on {\delta} and {g^\delta}) such that the iterates {u_{k^*}^\delta} approach the solution of (1) as {\delta} vanishes? The question has been answered in the affirmative in previous work by Klaus (here and here), and estimates on the error and convergence rates have also been derived under an additional assumption on the solution of (1). This assumption used to be what is called a “source condition” and says that there should exist some {p^\dag\in H} such that for a solution {u^\dagger} of (1) it holds that

\displaystyle  K^* p^\dagger \in\partial J(u^\dagger).

Under this assumption it has been shown that the Bregman distance {D(u_{k^*}^\delta,u^\dag)} goes to zero linearly in {\delta} under appropriate stopping rules. What Klaus and Markus investigate in this paper are different conditions which ensure convergence rates slower than linear. These conditions come in the form of “variational inequalities” which have gained some popularity lately. As usual, these variational inequalities look somewhat messy at first sight. Klaus and Markus use

\displaystyle  D(u,u^\dag)\leq J(u) - J(u^\dag) + \Phi(\|Ku-g\|^2)

for some positive functional {D} with {D(u,u)=0} and some non-negative, strictly increasing and concave function {\Phi}. Under this assumption (and special {D}) they derive convergence rates which again look quite complicated but can be reduced to simpler and more transparent cases which resemble the situation one knows for other regularization methods (like ordinary Tikhonov regularization).

In the last section Klaus and Markus also treat sparse regularization (i.e. with {J(u) = \|u\|_1}) and derive that a weaker condition (like {(K^*K)^\nu p^\dag\in\partial J(u^\dag)} for some {0<\nu<1/2}) already implies the stronger source condition above (with a different {p^\dag}). Hence, interestingly, it seems that for sparse regularization one either gets a linear rate or nothing (in this framework).

2. On necessary conditions for variational regularization

The second paper is “Necessary conditions for variational regularization schemes” by Nadja Worliczek and myself. I have discussed some parts of this paper already on this blog here and here. In this paper we tried to formalize the notion of “a variational method” for regularization with the goal to obtain necessary conditions for a variational scheme to be regularizing. As expected, this goal is quite ambitious and we cannot claim that we came up with ultimate necessary conditions which describe what kind of variational methods are not possible. However, we could relate the three kinds of variational methods (which I called Tikhonov, Morozov and Ivanov regularization here) to each other and, moreover, investigate the conditions on the data space a little more closely. In recent years it turned out that one should not always use a term like {\|Ku-g^\delta\|^2} to measure the noise or to penalize the deviation of {Ku} from {g^\delta}. For several noise models (like Poisson noise or multiplicative noise) other functionals are better suited. However, these functionals raise several issues: They are often not defined on a linear space but on a convex set, sometimes with the nasty property that their interior is empty. They often do not have convenient algebraic properties (e.g. scaling invariance, triangle inequalities or the like). Finally they are not necessarily (lower semi-)continuous with respect to the usual topologies. Hence, we approached the data space in a quite abstract way: The data space {(Y,\tau_Y)} is a topological space which comes with an additional sequential convergence structure {\mathcal{S}} (see e.g. here) and on (a subset of) which there is a discrepancy functional {\rho:Y\times Y\rightarrow [0,\infty]}. Then we analyzed the interplay of these three things {\tau_Y}, {\mathcal{S}} and {\rho}. If you wonder why we use the additional sequential convergence structure, remember that in the (by now classical) setting for Tikhonov regularization in Banach spaces with a functional like

\displaystyle  \|Ku-g^\delta\|_Y^q + \alpha\|u\|_X^p

with some Banach space norms {\|\cdot\|_Y} and {\|\cdot\|_X} there are also two kinds of convergence on {Y}: The weak convergence (which is replaced by {\tau_Y} in our setting) which is, e.g., used to describe convenient (lower semi-)continuity properties of {K} and the norm {\|\cdot\|_Y} and the norm convergence which is used to describe that {g^\delta\rightarrow g^\dag} for {\delta\rightarrow 0}. And since we do not have a normed space {Y} in our setting and one does not use any topological properties of the norm convergence in all the proofs of regularizing properties, Nadja suggested to use a sequential convergence structure instead.

Today I would like to comment on two arxiv-preprints I stumbled upon:

1. “Augmented L1 and Nuclear-Norm Models with a Globally Linearly Convergent Algorithm” – The Elastic Net rediscovered

The paper “Augmented L1 and Nuclear-Norm Models with a Globally Linearly Convergent Algorithm” by Ming-Jun Lai and Wotao Yin is another contribution to a field which is (or was?) probably the fastest growing field in applied mathematics: algorithms for convex problems with non-smooth {\ell^1}-like terms. The “mother problem” here is as follows: Consider a matrix {A\in{\mathbb R}^{m\times n}} and {b\in{\mathbb R}^m} and try to find a solution of

\displaystyle  \min_{x\in{\mathbb R}^n}\|x\|_1\quad\text{s.t.}\quad Ax=b

or, for {\sigma>0}

\displaystyle  \min_{x\in{\mathbb R}^n}\|x\|_1\quad\text{s.t.}\quad \|Ax-b\|\leq\sigma

which appeared here on this blog previously. Although this is a convex problem and even has a reformulation as a linear program, some instances of this problem are notoriously hard to solve and gained a lot of attention (because of their applicability in sparse recovery and compressed sensing). Very roughly speaking, a part of its hardness comes from the fact that the problem is neither smooth nor strictly convex.

The contribution of Lai and Yin is that they analyze a slight perturbation of the problem which makes its solution much easier: They add another term in the objective; for {\alpha>0} they consider

\displaystyle  \min_{x\in{\mathbb R}^n}\|x\|_1 + \frac{1}{2\alpha}\|x\|_2^2\quad\text{s.t.}\quad Ax=b

or

\displaystyle  \min_{x\in{\mathbb R}^n}\|x\|_1 + \frac{1}{2\alpha}\|x\|_2^2\quad\text{s.t.}\quad \|Ax-b\|\leq\sigma.

This perturbation does not make the problem smooth but renders it strongly convex (which usually makes the dual smoother). It turns out that this perturbation makes life with this problem (and related ones) much easier – recovery guarantees still exist and algorithms behave better.

I think it is important to note that the “augmentation” of the {\ell^1} objective with an additional squared {\ell^2}-term goes back to Zou and Hastie from the statistics community. There, the motivation was as follows: They observed that the pure {\ell^1} objective tends to “overpromote” sparsity in the sense that if there are two columns in {A} which are almost equally good in explaining some component of {b} then only one of them is used. The “augmented problem”, however, tends to use both of them. They coined the name “elastic net” for the method (for reasons which I never really got).

I also worked on elastic-net problems of the form

\displaystyle  \min_x \frac{1}{2}\|Ax-b\|^2 + \alpha\|x\|_1 + \frac{\beta}{2}\|x\|_2^2

in this paper (doi-link). Here it also turns out that the problem gets much easier algorithmically. I found it very convenient to rewrite the elastic-net problem as

\displaystyle  \min_x \frac{1}{2}\|\begin{bmatrix}A\\ \sqrt{\beta} I\end{bmatrix}x-\begin{bmatrix}b\\ 0\end{bmatrix}\|^2 + \alpha\|x\|_1

which turns the elastic-net problem into just another {\ell^1}-penalized problem with a special matrix and right hand side. Quite convenient for analysis and also somehow algorithmically.
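
In code the rewriting is literally just stacking the matrix and the right hand side; here is a sketch (my own; the plain iterative soft-thresholding loop below just stands in for any {\ell^1} solver):

    import numpy as np

    def ista(A, b, alpha, iters=2000):
        # iterative soft-thresholding for min 1/2||Ax - b||^2 + alpha*||x||_1
        L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
        x = np.zeros(A.shape[1])
        for _ in range(iters):
            z = x - (A.T @ (A @ x - b)) / L
            x = np.sign(z) * np.maximum(np.abs(z) - alpha / L, 0)
        return x

    def elastic_net(A, b, alpha, beta):
        # min 1/2||Ax-b||^2 + alpha||x||_1 + beta/2||x||_2^2 via the stacked problem
        n = A.shape[1]
        A_aug = np.vstack([A, np.sqrt(beta) * np.eye(n)])
        b_aug = np.concatenate([b, np.zeros(n)])
        return ista(A_aug, b_aug, alpha)

    rng = np.random.default_rng(2)
    A = rng.standard_normal((40, 100))
    x0 = np.zeros(100); x0[:5] = 1.0
    print(np.round(elastic_net(A, A @ x0, alpha=0.05, beta=0.1)[:8], 3))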

2. Towards a Mathematical Theory of Super-Resolution

The second preprint is “Towards a Mathematical Theory of Super-Resolution” by Emmanuel Candes and Carlos Fernandez-Granda.

The idea of super-resolution seems to be pretty old and, very roughly speaking, is to extract a higher resolution of a measured quantity (e.g. an image) than the measured data allows. Of course, in this formulation this is impossible. But often one can gain something by additional knowledge about the image. Basically, this is also the idea behind compressed sensing and hence, it does not come as a surprise that results from compressed sensing are used to try to explain when super-resolution is possible.

The paper by Candes and Fernandez-Granda seems to be pretty close in spirit to Exact Reconstruction using Support Pursuit on which I blogged earlier. They model the sparse signal as a Radon measure, specifically as a sum of Diracs. However, different from the support-pursuit paper they use complex exponentials (in contrast to real polynomials). Their reconstruction method is basically the same as support pursuit: They try to solve

\displaystyle   \min_{x\in\mathcal{M}} \|x\|\quad\text{s.t.}\quad Fx=y, \ \ \ \ \ (1)

i.e. they minimize over the set of Radon measures {\mathcal{M}} under the constraint that certain measurements {Fx\in{\mathbb R}^n} result in certain given values {y}. Moreover, they make a thorough analysis of what is “reconstructable” by their ansatz and obtain a lower bound on the distance of two Diracs (in other words, a lower bound in the Prokhorov distance). I have to admit that I do not share one of their claims from the abstract: “We show that one can super-resolve these point sources with infinite precision—i.e. recover the exact locations and amplitudes—by solving a simple convex program.” My point is that I cannot see to what extent the problem (1) is a simple one. Well, it is convex, but it does not seem to be simple.

I want to add that the idea of “continuous sparse modelling” in the space of signed measures is very appealing to me and appeared first in Inverse problems in spaces of measures by Kristian Bredies and Hanna Pikkarainen.

How many samples are needed to reconstruct a sparse signal?

Well, there are many, many results around, some of which you probably know (at least if you are following this blog or this one). Today I write about a neat result on the reconstruction of nonnegative sparse signals from a semi-continuous perspective which I found quite some time ago.

1. From discrete sparse reconstruction/compressed sensing to semi-continuous

The basic sparse reconstruction problem asks the following: Say we have a vector {x\in{\mathbb R}^m} which only has {s<m} non-zero entries and a fat matrix {A\in{\mathbb R}^{n\times m}} (i.e. {n<m}) and consider that we are given measurements {b=Ax}. Of course, the system {Ax=b} is underdetermined. However, we may add a little more prior knowledge on the solution and ask: Is it possible to reconstruct {x} from {b} if we know that the vector {x} is sparse? If yes: How? Under what conditions on {m}, {s}, {n} and {A}? This question created the expanding universe of compressed sensing recently (and this universe is expanding so fast that for sure there has to be some dark energy in it). As a matter of fact, a powerful method to obtain sparse solutions of underdetermined systems is {\ell^1}-minimization a.k.a. Basis Pursuit, on which I blogged recently: Solve

\displaystyle  \min_x \|x\|_1\ \text{s.t.}\ Ax=b

and the important ingredient here is the {\ell^1}-norm of the vector in the objective function.
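
In finite dimensions this is a linear program and also easy to hand to a generic solver; a tiny sketch (my own, using CVXPY) on a random instance:

    import numpy as np
    import cvxpy as cp

    rng = np.random.default_rng(3)
    A = rng.standard_normal((20, 60))                 # fat matrix, n < m
    x0 = np.zeros(60)
    x0[[3, 17, 42]] = [1.0, -2.0, 0.5]                # 3-sparse ground truth
    b = A @ x0

    x = cp.Variable(60)
    cp.Problem(cp.Minimize(cp.norm(x, 1)), [A @ x == b]).solve()
    print(np.round(x.value[[3, 17, 42]], 3))          # typically recovers x0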

In this post I’ll formulate semi-continuous sparse reconstruction. We move from an {m}-vector {x} to a finite signed measure {\mu} on a closed interval (which we assume to be {I=[-1,1]} for simplicity). We may embed the {m}-vectors into the space of finite signed measures by choosing {m} points {t_i}, {i=1,\dots, m} from the interval {I} and building {\mu = \sum_{i=1}^m x_i \delta_{t_i}} with the point-masses (or Dirac measures) {\delta_{t_i}}. To be a bit more precise, we speak about the space {\mathfrak{M}} of Radon measures on {I}, which are defined on the Borel {\sigma}-algebra of {I} and are finite. Radon measures are not very scary objects and an intuitive way to think of them is to use Riesz representation: Every Radon measure arises as a continuous linear functional on a space of continuous functions, namely the space {C_0(I)} which is the closure of the continuous functions with compact support in {{]{-1,1}[}} with respect to the supremum norm. Hence, Radon measures act on these functions as {\int_I fd\mu}. It is also natural to speak of the support {\text{supp}(\mu)} of a Radon measure {\mu} and it holds for any continuous function {f} that

\displaystyle  \int_I f d\mu = \int_{\text{supp}(\mu)}f d\mu.

An important tool for Radon measures is the Hahn-Jordan decomposition which decomposes {\mu} into a positive part {\mu^+} and a negative part {\mu^-}, i.e. {\mu^+} and {\mu^-} are non-negative and {\mu = \mu^+-\mu^-}. Finally the variation of a measure, which is

\displaystyle  \|\mu\| = \mu^+(I) + \mu^-(I)

provides a norm on the space of Radon measures.

Example 1 For the measure {\mu = \sum_{i=1}^m x_i \delta_{t_i}} one readily calculates that

\displaystyle  \mu^+ = \sum_i \max(0,x_i)\delta_{t_i},\quad \mu^- = \sum_i \max(0,-x_i)\delta_{t_i}

and hence

\displaystyle  \|\mu\| = \sum_i |x_i| = \|x\|_1.

In this sense, the space of Radon measures provides a generalization of {\ell^1}.

We may sample a Radon measure {\mu} with {n+1} linear functionals and these can be encoded by {n+1} continuous functions {u_0,\dots,u_n} as

\displaystyle  b_k = \int_I u_k d\mu.

This sampling gives a bounded linear operator {K:\mathfrak{M}\rightarrow {\mathbb R}^{n+1}}. The generalization of Basis Pursuit is then given by

\displaystyle  \min_{\mu\in\mathfrak{M}} \|\mu\|\ \text{s.t.}\ K\mu = b.

This was introduced and called “Support Pursuit” in the preprint Exact Reconstruction using Support Pursuit by Yohann de Castro and Fabrice Gamboa.

More on the motivation and the use of Radon measures for sparsity can be found in Inverse problems in spaces of measures by Kristian Bredies and Hanna Pikkarainen.

2. Exact reconstruction of sparse nonnegative Radon measures

Before I talk about the results we may count the degrees of freedom a sparse Radon measure has: If {\mu = \sum_{i=1}^s x_i \delta_{t_i}} with some {s}, then {\mu} is defined by the {s} weights {x_i} and the {s} positions {t_i}. Hence, we expect that at least {2s} linear measurements should be necessary to reconstruct {\mu}. Surprisingly, this is almost enough if we know that the measure is nonnegative! We only need one more measurement, that is {2s+1}, and moreover, we can take fairly simple measurements, namely the monomials: {u_i(t) = t^i}, {i=0,\dots, n} (with the convention that {u_0(t)\equiv 1}). This is shown in the following theorem by de Castro and Gamboa.

Theorem 1 Let {\mu = \sum_{i=1}^s x_i\delta_{t_i}} with {x_i\geq 0}, {n=2s} and let {u_i}, {i=0,\dots n} be the monomials as above. Define {b_i = \int_I u_i(t)d\mu}. Then {\mu} is the unique solution of the support pursuit problem, that is of

\displaystyle  \min \|\nu\|\ \text{s.t.}\ K\nu = b.\qquad \textup{(SP)}

Proof: The following polynomial will be of importance: For a constant {c>0} define

\displaystyle  P(t) = 1 - c \prod_{i=1}^s (t-t_i)^2.

The following properties of {P} will be used:

  1. {P(t_i) = 1} for {i=1,\dots,s}
  2. {P} has degree {n=2s} and hence, is a linear combination of the {u_i}, {i=0,\dots,n}, i.e. {P = \sum_{k=0}^n a_k u_k}.
  3. For {c} small enough it holds for {t\neq t_i} that {|P(t)|<1}.

Now let {\sigma} be a solution of (SP). We have to show that {\|\mu\|\leq \|\sigma\|}. Due to property 2 we know that

\displaystyle  \int_I u_k d\sigma = (K\sigma)_k = b_k = \int_I u_k d\mu.

Due to property 1 and non-negativity of {\mu} we conclude that

\displaystyle  \begin{array}{rcl}  \|\mu\| & = & \sum_{i=1}^s x_i = \int_I P d\mu\\ & = & \int_I \sum_{k=0}^n a_k u_k d\mu\\ & = & \sum_{k=0}^n a_k \int_I u_k d\mu\\ & = & \sum_{k=0}^n a_k \int_I u_k d\sigma\\ & = & \int_I P d\sigma. \end{array}

Moreover, by Lebesgue’s decomposition we can decompose {\sigma} with respect to {\mu} such that

\displaystyle  \sigma = \underbrace{\sum_{i=1}^s y_i\delta_{t_i}}_{=\sigma_1} + \sigma_2

and {\sigma_2} is singular with respect to {\mu}. We get

\displaystyle  \begin{array}{rcl}  \int_I P d\sigma = \sum_{i=1}^s y_i + \int P d\sigma_2 \leq \|\sigma_1\| + \|\sigma_2\|=\|\sigma\| \end{array}

and we conclude that {\|\sigma\| = \|\mu\|} and especially {\int_I P d\sigma_2 = \|\sigma_2\|}. This shows that {\mu} is a solution to {(SP)}. It remains to show uniqueness. We show the following: If there is a {\nu\in\mathfrak{M}} with support in {I\setminus\{t_1,\dots,t_s\}} such that {\int_I Pd\nu = \|\nu\|}, then {\nu=0}. To see this, we build, for any {r>0}, the sets

\displaystyle  \Omega_r = [-1,1]\setminus \bigcup_{i=1}^s ]t_i-r,t_i+r[.

and assume that there exists {r>0} such that {\|\nu|_{\Omega_r}\|\neq 0} ({\nu|_{\Omega_r}} denoting the restriction of {\nu} to {\Omega_r}). However, it holds by property 3 of {P} that

\displaystyle  \int_{\Omega_r} P d\nu < \|\nu|_{\Omega_r}\|

and consequently

\displaystyle  \begin{array}{rcl}  \|\nu\| &=& \int Pd\nu = \int_{\Omega_r} Pd\nu + \int_{\Omega_r^C} P d\nu\\ &<& \|\nu|_{\Omega_r}\| + \|\nu|_{\Omega_r^C}\| = \|\nu\| \end{array}

which is a contradiction. Hence, {\nu|_{\Omega_r}=0} for all {r} and this implies {\nu=0}. Since {\sigma_2} has its support in {I\setminus\{t_1,\dots,t_s\}} and fulfills {\int_I P d\sigma_2 = \|\sigma_2\|}, we conclude that {\sigma_2=0}. Hence the support of {\sigma} is contained in {\{t_1,\dots,t_s\}}, i.e. {\sigma = \sum_{i=1}^s y_i\delta_{t_i}}. Since {K\sigma = b = K\mu} we get {K(\sigma-\mu) = 0}. This can be written as a Vandermonde system

\displaystyle  \begin{pmatrix} u_0(t_1)& \dots &u_0(t_s)\\ \vdots & & \vdots\\ u_n(t_1)& \dots & u_n(t_s) \end{pmatrix} \begin{pmatrix} y_1 - x_1\\ \vdots\\ y_s - x_s \end{pmatrix} = 0

which only has the zero solution, giving {y_i=x_i}. \Box

3. Generalization to other measurements

The measurement by monomials may sound a bit unusual. However, de Castro and Gamboa show more. What really matters here is that the monomials form a so-called Chebyshev system (or Tchebyscheff system or T-system – by the way, have you ever tried to google for a T-system?). This is explained, for example, in the book “Tchebycheff Systems: With Applications in Analysis and Statistics” by Karlin and Studden. A T-system on {I} is simply a set of {n+1} functions {\{u_0,\dots, u_n\}} such that any non-trivial linear combination of these functions has at most {n} zeros. These systems are named after Tchebyscheff since they share many of the helpful properties of the Tchebyscheff polynomials.

What is helpful in our context is the following theorem of Krein:

Theorem 2 (Krein) If {\{u_0,\dots,u_n\}} is a T-system for {I}, {k\leq n/2} and {t_1,\dots,t_k} are in the interior of {I}, then there exists a linear combination {\sum_{k=0}^n a_k u_k} which is non-negative and vanishes exactly at the points {t_1,\dots,t_k}.

Now consider that we replace the monomials in Theorem 1 by a T-system. You recognize that Krein’s Theorem allows one to construct a “generalized polynomial” which fulfills the same requirements as the polynomial {P} in the proof of Theorem 1 as soon as the constant function 1 lies in the span of the T-system, and indeed the result of Theorem 1 is also valid in that case.

4. Exact reconstruction of {s}-sparse nonnegative vectors from {2s+1} measurements

From the above one can deduce a reconstruction result for {s}-sparse vectors and I quote Theorem 2.4 from Exact Reconstruction using Support Pursuit:

Theorem 3 Let {n}, {m}, {s} be integers such that {s\leq \min(n/2,m)} and let {\{1,u_1,\dots,u_n\}} be a complete T-system on {I} (that is, {\{1,u_1,\dots,u_r\}} is a T-system on {I} for all {r<n}). Then it holds: For any distinct reals {t_1,\dots,t_m} and {A} defined as

\displaystyle  A=\begin{pmatrix} 1 & \dots & 1\\ u_1(t_1)& \dots &u_1(t_m)\\ \vdots & & \vdots\\ u_n(t_1)& \dots & u_n(t_m) \end{pmatrix}

Basis Pursuit recovers all nonnegative {s}-sparse vectors {x\in{\mathbb R}^m}.
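
Here is a quick numerical illustration of Theorem 3 in the monomial case (my own sketch; plain basis pursuit via CVXPY, no nonnegativity constraint in the program):

    import numpy as np
    import cvxpy as cp

    s, m = 3, 40
    n = 2 * s                                   # n+1 = 2s+1 measurements
    t = np.linspace(-0.95, 0.95, m)             # candidate positions in ]-1,1[
    A = np.vander(t, n + 1, increasing=True).T  # rows: monomials t^0, ..., t^n
    x0 = np.zeros(m)
    x0[[5, 18, 33]] = [0.7, 1.2, 0.4]           # nonnegative 3-sparse vector
    b = A @ x0

    x = cp.Variable(m)
    cp.Problem(cp.Minimize(cp.norm(x, 1)), [A @ x == b]).solve()
    print(np.round(x.value[[5, 18, 33]], 3))    # recovers the spikes up to solver accuracy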

5. Concluding remarks

Note that Theorem 3 gives a deterministic construction of a measurement matrix.

Also note that nonnegativity is crucial in what we did here. This allowed us (in the monomial case) to work with squares and obtain the polynomial {P} in the proof of Theorem 1 (which is also called a “dual certificate” in this context). This raises the question of how this method can be adapted to general sparse signals. One needs (in the monomial case) a polynomial which is bounded by 1 in absolute value but matches the signs of the measure on its support. While this can be done (I think) for polynomials it seems difficult to obtain a generalization of Krein’s Theorem to this case…
