Today I report on two things I came across here at ISMP:

• The first is a talk by Russell Luke on Constraint qualifications for nonconvex feasibility problems. Luke treated the NP-hard problem of sparsest solutions of linear systems. In fact he did not tackle this problem but the problem to find an ${s}$-sparse solution of an ${m\times n}$ system of equations. He formulated this as a feasibility-problem (well, Heinz Bauschke was a collaborator) as follows: With the usual malpractice let us denote by ${\|x\|_0}$ the number of non-zero entries of ${x\in{\mathbb R}^n}$. Then the problem of finding an ${s}$-sparse solution to ${Ax=b}$ is:

$\displaystyle \text{Find}\ x\ \text{in}\ \{\|x\|_0\leq s\}\cap\{Ax=b\}.$

In other words: find a feasible point, i.e. a point which lies in the intersection of the two sets. Well, most often feasibility problems involve convex sets but here, the first one given by this “${0}$-norm” is definitely not convex. One of the simplest algorithms for the convex feasibility problem is to alternatingly project onto both sets. This algorithm dates back to von Neumann and has been analyzed in great detail. To make this method work for non-convex sets one only needs to know how to project onto both sets. For the case of the equality constraint ${Ax=b}$ one can use numerical linear algebra to obtain the projection. The non-convex constraint on the number of non-zero entries is in fact even easier: For ${x\in{\mathbb R}^n}$ the projection onto ${\{\|x\|_0\leq s\}}$ consists of just keeping the ${s}$ largest entries of ${x}$ while setting the others to zero (known as the “best ${s}$-term approximation”). However, the theory breaks down in the case of non-convex sets. Russell treated problem in several papers (have a look at his publication page) and in the talk he focused on the problem of constraint qualification, i.e. what kind of regularity has to be imposed on the intersection of the two sets. He could shows that (local) linear convergence of the algorithm (which is observed numerically) can indeed be justified theoretically. One point which is still open is the phenomenon that the method seems to be convergent regardless of the initialization and that (even more surprisingly) that the limit point seems to be independent of the starting point (and also seems to be robust with respect to overestimating the sparsity ${s}$). I wondered if his results are robust with respect to inexact projections. For larger problems the projection onto the equality constraint ${Ax=b}$ are computationally expensive. For example it would be interesting to see what happens if one approximates the projection with a truncated CG-iteration as Andreas, Marc and I did in our paper on subgradient methods for Basis Pursuit.

• Joel Tropp reported on his paper Sharp recovery bounds for convex deconvolution, with applications together with Michael McCoy. However, in his title he used demixing instead of deconvolution (which, I think, is more appropriate and leads to less confusion). With “demixing” they mean the following: Suppose you have two signals ${x_0}$ and ${y_0}$ of which you observe only the superposition of ${x_0}$ and a unitarily transformed ${y_0}$, i.e. for a unitary matrix ${U}$ you observe

$\displaystyle z_0 = x_0 + Uy_0.$

Of course, without further assumptions there is no way to recover ${x_0}$ and ${y_0}$ from the knowledge of ${z_0}$ and ${U}$. As one motivation he used the assumption that both ${x_0}$ and ${y_0}$ are sparse. After the big bang of compressed sensing nobody wonders that one turns to convex optimization with ${\ell^1}$-norms in the following manner:

$\displaystyle \min_{x,y} \|x\|_1 + \lambda\|y\|_1 \ \text{such that}\ x + Uy = z_0. \ \ \ \ \ (1)$

This looks a lot like sparse approximation: Eliminating ${x}$ one obtains the unconstraint problem \begin{equation*} \min_y \|z_0-Uy\|_1 + \lambda \|y\|_1. \end{equation*}

Phrased differently, this problem aims at finding an approximate sparse solution of ${Uy=z_0}$ such that the residual (could also say “noise”) ${z_0-Uy=x}$ is also sparse. This differs from the common Basis Pursuit Denoising (BPDN) by the structure function for the residual (which is the squared ${2}$-norm). This is due to the fact that in BPDN one usually assumes Gaussian noise which naturally lead to the squared ${2}$-norm. Well, one man’s noise is the other man’s signal, as we see here. Tropp and McCoy obtained very sharp thresholds on the sparsity of ${x_0}$ and ${y_0}$ which allow for exact recovery of both of them by solving (1). One thing which makes their analysis simpler is the following reformulation: The treated the related problem \begin{equation*} \min_{x,y} \|x\|_1 \text{such that} \|y\|_1\leq\alpha, x+Uy=z_0 \end{equation*} (which I would call this the Ivanov version of the Tikhonov-problem (1)). This allows for precise exploitation of prior knowledge by assuming that the number ${\alpha_0 = \|y_0\|_1}$ is known.

First I wondered if this reformulation was responsible for their unusual sharp results (sharper the results for exact recovery by BDPN), but I think it’s not. I think this is due to the fact that they have this strong assumption on the “residual”, namely that it is sparse. This can be formulated with the help of ${1}$-norm (which is “non-smooth”) in contrast to the smooth ${2}$-norm which is what one gets as prior for Gaussian noise. Moreover, McCoy and Tropp generalized their result to the case in which the structure of ${x_0}$ and ${y_0}$ is formulated by two functionals ${f}$ and ${g}$, respectively. Assuming a kind of non-smoothness of ${f}$ and ${g}$ the obtain the same kind of results and especially matrix decomposition problems are covered.

Coming back to regularization, especially Ivanov regularization. Recall that I used the term Ivanov regularization for the minimization problem

$\displaystyle \min S(Ax,y^\delta)\ \text{ s.t. }\ R(x)\leq \tau. \ \ \ \ \ (1)$

I again stumbled upon some reference: It seems that in the case that the constraint ${R(x)\leq \tau}$ defines a compact set, this method is usually referred to as “method of quasi solutions”. More precisely, I found this in “Elements of the theory of inverse problems” by A.M. Denisov, Chapter 6. There he uses metric spaces and proves the following:

Theorem 1 Let ${X,Y}$ be metric spaces with metrics ${d_X}$, ${d_Y}$, respectively and ${A:X\rightarrow Y}$ continuous. Furthermore let ${M\subset X}$ be compact, ${y^\dagger}$ be in the range of ${A}$ and assume that ${x^\dagger}$ is the unique solution of ${Ax=y^\dagger}$ which lies in ${M}$. Finally for a ${y^\delta}$ with ${d_Y(y^\delta,y^\dagger)\leq\delta}$ define ${X_\delta = \{x\ :\ d_Y(Ax,y^\delta)\leq\delta\}}$ and ${X_\delta^M = X_\delta\cap M}$. Then it holds for ${\delta\rightarrow 0}$ that

$\displaystyle \sup_{x\in X_\delta^M}d_X(x,x^\dagger) \rightarrow 0.$

Remark 1 Before we prove this theorem, we relate is to what I called Ivanov regularization above: The set ${M}$ is encoded in~(1) as ${M = \{x\ :\ R(x)\leq\tau\}}$ and the “discrepancy measure” ${S}$ is simply the metric ${d_Y}$. Hence, let ${x_M^\delta}$ denote a solution of

$\displaystyle \min\ d_Y(Ax,y^\delta)\ \text{ s.t. } x\in M.$

Because ${x^\dagger}$ is feasible for this problem it follows from ${d_Y(Ax^\dagger,y^\delta) = d_Y(y^\dagger,y^\delta)\leq\delta}$ that ${d_Y(Ax_M^\delta,y^\delta)\leq\delta}$. Hence, ${x_M^\delta\in X_M^\delta}$. In other words: Ivanov regularization produces one element in the set ${X_M^\delta}$. Now, the theorem says that every element in ${X_M^\delta}$ is a good approximation for ${x^\dagger}$ (at least asymptotically).

Proof: We take a sequence ${\delta_n\rightarrow 0}$ and assume to the contrary that there exist ${\epsilon>0}$ such that for every ${n}$ there exists ${x_{\delta_n}\in X_M^{\delta_n}}$ such that it holds that ${d_X(x_{\delta_n},x^\dagger)\geq \epsilon}$. Since all ${x_{\delta_n}}$ lie in ${M}$ which is compact, there is a convergent subsequence ${(x_k)}$ with limit ${\bar x}$. We obtain ${d_X(\bar x,x^\dagger)\geq \epsilon}$. However, this contradicts the assumption: d_Y(A\bar x,Ax^\dagger) & = &d_Y(A\bar x,y^\dagger) = \lim_{n\rightarrow \infty} d_Y(Ax_{\delta_n},y^\dagger) \nonumber
& \leq &\lim_{n\rightarrow \infty} d_Y(Ax_{\delta_n},y^{\delta_n}) + d_Y(y^{\delta_n},y^\dagger) \leq \lim_{n\rightarrow\infty}2\delta_n =0. $\Box$

Coming back to the interpretation of the Theorem~1 and Ivanov regularization: Instead of Ivanov regularization, one could also use the following feasibility problem: Find an ${x}$ such that both ${d_Y(Ax,y^\delta)\leq\delta}$ and ${x\in M}$. For the case of vector spaces ${X}$ and ${Y}$ and a convex set ${M}$, this would be a convex feasibility problem which one may attack by available methods.

A further important remark is that we did not assume any linearity on ${A}$ (of course: we did not even assume a linear structure on ${X}$ or ${Y}$). Hence, the theorem seem very powerful: There is no regularization parameter involved and one still gets convergence to the true solution! However, one of the assumptions in the theorem is somehow strong: The uniqueness of ${x^\dagger}$. To illustrate this we consider a special case:

Example 1 Let ${X}$ and ${Y}$ be real (or complex) vector spaces and ${A}$ be linear with non-trivial null space. Furthermore, assume that ${M\subset X}$ is convex and compact and consider scaled versions ${\tau M}$ for ${\tau>0}$. Then the set of solutions of ${Ax=y^\dagger}$ is an affine space in ${X}$ and there are three cases for the intersection of this set and ${\tau M}$:

1. The intersection is empty.
2. The intersection is a convex set and contains infinitely many elements.
3. The intersection contains exactly one element.

The last case occurs precisely, when the affine space of solution is tangential to ${\tau M}$. Loosely speaking, one may say that this case only occurs, if the set ${M}$ is scaled precisely to the right size such that is only touches the affine space of solutions.

Another strong assumption in Theorem~1 is that the set ${M}$ is compact. First there is a way to somehow relax this condition. Basically, we only need compactness to obtain the converging subsequence. Hence, one could try to work with a weaker topology on ${Y}$ (which would result in a weaker notion of compactness) and then obtain a limit of a subsequence which converges in the weaker sense only. Then one would need some tool to deduce that the weak limit is indeed a solution. This strategy work, for example in Banach spaces:

Example 2 Let ${X}$ and ${Y}$ be reflexive Banach spaces and ${A:X\rightarrow Y}$ be linear, bounded and one-to-one. We use the set ${M = \{x\ :\ \|x\|_X\leq R\}}$ as prior knowledge on the solution of ${Ax=y^\dagger}$. Moreover, we use the metrics induced the norms of ${X}$ and ${Y}$, respectively: ${d_X(x,x') = \|x-x'\|_X}$ and ${d_Y(y,y') = \|y-y'\|_Y}$.

Obviously, ${M}$ is not compact (if ${X}$ is of infinite dimension) but it is weakly compact (and by the Eberlein-Smulian theorem also weakly-sequentially compact). In the situation of the proof of Theorem~1 we only get a weakly converging subsequence ${x_{\delta_n}\rightharpoonup \bar x}$. However, a linear operator ${A}$ is also weak-to-weak linear, and hence ${Ax_{\delta_n}\rightharpoonup A\bar x}$. While we only have a weakly converging sequence, we still can obtain the contradiction in~(0) since the norm is weakly lower semicontinuous.

Another way to justify the assumption that the solution is in a known compact set, is that in practice we always use a representation of the solution which only use a finite number of degrees of freedom (think of a Galerkin ansatz for example). However, this interpretation somehow neglects that we are interested in finding the true solution to the true infinite dimensional problem and that the discretization of the problem should be treated as a different issue. Just building on the regularizing effect of discretization will almost surely result in a method which stability properties depend on the resolution of the discretization.

Finally: Another good reference of these somehow ancient results in regularization theory is one of the first books on this topics: “Solutions of ill-posed problems” by Tikhonov and Arsenin (1977). While it took me some time to get used the type of presentation, I have to admit that it is really worth to read this book (and other translation of Russian mathematical literature).

Some time ago I picked up the phrase Ivanov regularization. Starting with an operator $A:X\to Y$ between to Banach spaces (say) one encounters the problem of instability of the solution of $Ax=y$ if $A$ has non-closed range. One dominant tool to regularize the solution is called Tikhonov regularization and consists of minimizing the functional $\|Ax - y^\delta\|_Y^p + \alpha \|x\|_Y^q$. The meaning behind these terms is as follows: The term $\|Ax -y^\delta \|_Y^p$ is often called discrepancy and it should be not too large to guarantee, that the “solution” somehow explains the data. The term $\|x\|_Y^q$ is often called regularization functional and shall not be too large to have some meaningful notion of “solution”. The parameter $\alpha>0$ is called regularization parameter and allows weighting between the discrepancy and regularization.

For the case of Hilbert space one typically chooses $p=q=2$ and gets a functional for which the minimizer is given more or less explicitly as

$x_\alpha = (A^*A + \alpha I)^{-1} A^* y^\delta$.

The existence of this explicit solution seems to be one of the main reasons for the broad usage of Tikhonov regularization in the Hilbert space setting.

Another related approach is sometimes called residual method, however, I would prefer the term Morozov regularization. Here one again balances the terms “discrepancy” and “regularization” but in a different way: One solves

$\min \|x\|_X\ \text{s.t.}\ \|Ax-y^\delta\|_Y\leq \delta.$

That is, one tries to find an $x$ with minimal norm which explains the data $y^\delta$ up to an accuracy $\delta$. The idea is, that $\delta$ reflects the so called noise level, i.e. an estimate of the error which is made during the measurment of $y$. One advantage of Morozov regularization over Tikhonov regularization is that the meaning of the parameter $\delta>0$ is much clearer that the meaning of $\alpha>0$. However, there is no closed form solution for Morozov regularization.

Ivanov regularization is yet another method: solve

$\min \|Ax-y^\delta\|_Y\ \text{s.t.}\ \|x\|_X \leq \tau.$

Here one could say, that one wants to have the smallest discrepancy among all $x$ which are not too “rough”.

Ivanov regularization in this form does not have too many appealing properties: The parameter $\tau>0$ does not seem to have a proper motivation and moreover, there is again no closed for solution.

However, recently the focus of variational regularization (as all these method may be called) has shifted from using norms, to the use of more general functionals. One even considers Tikhonov in an abstract form as minimizing

$S(Ax,y^\delta) + \alpha R(x)$

with a “general” similarity measure $S$ and a general regularization term $R$, see e.g. the dissertation of Christiane Pöschl (which can be found here, thanks Christiane) or the works of Jens Flemming. Prominent examples for the similarity measure are of course norms of differences or the Kullback-Leibler divergence or the Itakura-Saito divergence which are both treated in this paper. For the regularization term one uses norms and semi-norms in various spaces, e.g. Sobolev (semi-)norms, Besov (semi-)norms, the total variation seminorm or $\ell^p$ norms.

In all these cases, the advantage of Tikhonov regularization of having a closed form solution is not there anymore. Then, the most natural choice would be, in my opinion, Morozov regularization, because one may use the noise level directly as a parameter. However, from a practical point of view one also should care about the problem of calculating the minimizer of the respective problems. Here, I think that Ivanov regularization is important again: Often the similarity measure $S$ is somehow smooth but the regularization term $R$ is nonsmooth (e.g. for total variation regularization or sparse regularization with $\ell^p$-penalty). Hence, both Tikhononv and Morozov regularization have a nonsmooth objective function. Somehow, Tikhonov regularization is still a bit easier, since the minimization is unconstrained. Morozov regularization has a constraint which is usually quite difficult to handle. E.g. it is usually difficult (is it probably even ill posed?) to project onto the set defined by $S(Ax,y^\delta)\leq \delta$. Ivanov regulaization has a smooth objective functional (at least if the similarity measure is smooth) and a constraint which is usually somehow simple (i.e. projections are not too difficult to obtain).

Now, I found, that all thee methods, Tikhonov, Morozov and Ivanov regularizazion are all treated in the book “Theory of linear ill-posed problems and its applications” by V. K. Ivanov,V. V. Vasin and Vitaliĭ Pavlovich Tanana in section 3.2, 3.3 and 3.4 respectively. Ivanov regularization goes under the name “method of quasi solutions” (section 3.2) and Morozov regularization is called “Method of residual”(section 3.4). Well, I think I should read these sections a bit closer now…