1. A numerical experiment on sparse regularization

To start, I take a standard problem from the Regularization Tools Matlab toolbox: The problem \texttt{deriv2}. This problem generates a matrix {A} and two vectors {x} and {b} such that the equation {Ax=b} is a Galerkin discretization of the integral equation

\displaystyle  g(t) = \int_0^1 k(t,s)f(s) ds

with a kernel {k} such that the solution amounts to solving a boundary value problem. The Galerkin ansatz functions are simply orthonormal characteristic functions on intervals, i.e. {\psi_i(x) = h^{-1/2}\chi_{[ih,(i+1)h]}(x)}. Thus, I work with matrices {A_h} and vectors {x_h} and {b_h}.
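In case you want to play with this without the Matlab toolbox, a rough NumPy sketch of such a discretization looks as follows. The kernel below is the Green's function of the second derivative as used by \texttt{deriv2}, and the Galerkin integrals are only approximated by the midpoint rule, so this does not reproduce the toolbox matrix exactly:

```python
import numpy as np

def deriv2_galerkin(n):
    """Galerkin matrix A_h for g(t) = int_0^1 k(t,s) f(s) ds with the
    Green's function kernel of the second derivative,
        k(t,s) = s*(t-1) for s <= t,   k(t,s) = t*(s-1) for s > t,
    and orthonormal box functions psi_i = h^{-1/2} * chi_{[ih,(i+1)h]}.
    The double integrals are approximated by the midpoint rule, so the
    matrix differs slightly from the exact toolbox output."""
    h = 1.0 / n
    mid = (np.arange(n) + 0.5) * h       # interval midpoints
    S, T = np.meshgrid(mid, mid)         # T[i, j] = t_i (rows), S[i, j] = s_j (columns)
    K = np.where(S <= T, S * (T - 1.0), T * (S - 1.0))
    # A_ij = h^{-1} * int_i int_j k(t, s) ds dt  ~  h^{-1} * h * h * k(t_i, s_j)
    return h * K
```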

I want to use sparse regularization to reconstruct spiky solutions, that is, I solve problems

\displaystyle  \min_{x_h} \tfrac{1}{2}\|A_h x_h - b_h\|^2 + \alpha_h\|x_h\|_1.
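Any solver for {\ell^1}-regularized least squares will do for these problems. As a minimal sketch (not necessarily what was used for the figures below), plain iterative soft thresholding (ISTA) looks like this, continuing the NumPy snippet from above:

```python
def soft_threshold(x, tau):
    """Componentwise soft thresholding: the prox of tau * ||.||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def ista(A, b, alpha, iters=5000):
    """Iterative soft thresholding for
        min_x 0.5 * ||A x - b||^2 + alpha * ||x||_1
    with constant step size 1/L, where L = ||A||^2 is the Lipschitz
    constant of the gradient of the smooth part."""
    L = np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        x = soft_threshold(x - A.T @ (A @ x - b) / L, alpha / L)
    return x
```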

Now, my first experiment goes as follows:

Experiment 1 (Discretization goes to zero)
I generate spiky data: I fix a point {t_0} in the interval {[0,1]}, namely {t_0 = 0.2}, and a value {a_0=1}. Now I consider the data {f} which is a delta peak of height {a_0} at {t_0} (which in turn leads to a right hand side {g}). I construct the corresponding {x_h} and the right hand side {b_h=A_hx_h}. Now I aim at solving

\displaystyle  \min_f \tfrac{1}{2}\| g - \int_0^1 k(\cdot,s)f(s)ds\|_2^2 + \alpha \|f\|_1

for different discretizations ({h\rightarrow 0}). In the numerics, I have to scale {\alpha} with {h}, i.e. I solve

\displaystyle  \min_{x_h} \tfrac{1}{2}\|A_h x_h - b_h\|^2 + h\,\alpha\|x_h\|_1

and I obtain the following results: In black I show the data {x}, {b} and so on, and in blue I plot the minimizer and its image under {A}.

[Figures: reconstructions for {n=10}, {n=50}, {n=100}, {n=500} and {n=1000}.]

Note that the scale varies in the pictures, except in the lower left one where I show the discretized {g}. As it should, this converges nicely to a piecewise linear function. However, the discretization of the solution blows up, which is also as it should be, since I discretize a delta peak. Well, this basically shows that my scaling is correct.
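In code, the loop behind this experiment could look as follows, reusing \texttt{deriv2\_galerkin} and \texttt{ista} from the sketches above (the value of {\alpha} is an ad-hoc choice):

```python
# Experiment 1: a delta peak of weight a0 at t0, discretized in the box basis
# (its coefficient is a0 * h^{-1/2} in the interval containing t0), no noise,
# and the penalty scaled by h.
a0, t0, alpha = 1.0, 0.2, 1e-4            # alpha is an ad-hoc choice
for n in [10, 50, 100, 500, 1000]:
    h = 1.0 / n
    A = deriv2_galerkin(n)
    x_true = np.zeros(n)
    x_true[int(t0 * n)] = a0 / np.sqrt(h)  # blows up as h -> 0, like a delta peak
    b = A @ x_true
    x_rec = ista(A, b, h * alpha)
    # plot x_true and b (black) against x_rec and A @ x_rec (blue), as in the figures
```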

From the paper “Sparse regularization with {\ell^q} penalty term” one can extract the following result.

Theorem 1 Let {K:\ell^2\rightarrow Y} be linear, bounded and injective and let {u^\dagger \in \ell^2} have finite support. Moreover, let {g^\dagger = Ku^\dagger} and {\|g^\dagger-g^\delta\|\leq \delta}. Furthermore, denote by {u_\alpha^\delta} the minimizer of

\displaystyle  \tfrac12\|Ku-g^\delta\|^2 + \alpha\|u\|_1.

Then, for {\alpha = c\delta} it holds that

\displaystyle  \|u_\alpha^\delta - u^\dagger\|_1 = \mathcal{O}(\delta).

Now let’s observe this convergence rate in a second experiment:

Experiment 2 (Convergence rate {\mathcal{O}(\delta)}) Now we fix the discretization (i.e. {n=500}) and construct a series of {g^\delta}’s for {\delta} on a log scale between {1} and {10^{-6}}. I scale {\alpha} proportional to {\delta} and calculate minimizers of

\displaystyle  \min_{x_h} \tfrac{1}{2}\|A_h x_h - b_h\|^2 + h\,\alpha\|x_h\|_1.

Then I measure the error {\|f_\alpha^\delta-f^\dagger\|_1} and plot it doubly logarithmically against {\delta}.

And there you see the linear convergence rate as predicted.
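A sketch of this experiment, again building on the code above (the proportionality constant in {\alpha = c\,\delta} and the Gaussian noise model are ad-hoc choices):

```python
import matplotlib.pyplot as plt

# Experiment 2: fixed discretization, noise levels on a log scale, alpha ~ delta.
n = 500
h = 1.0 / n
A = deriv2_galerkin(n)
x_true = np.zeros(n)
x_true[int(0.2 * n)] = 1.0 / np.sqrt(h)   # delta peak of height a0 = 1 at t0 = 0.2
b = A @ x_true

deltas = np.logspace(0, -6, 13)
c = 1.0                                    # alpha = c * delta; the constant is a guess
rng = np.random.default_rng(0)
errors = []
for delta in deltas:
    noise = rng.standard_normal(n)
    b_delta = b + delta * noise / np.linalg.norm(noise)   # ||b_delta - b|| = delta
    # for very small delta, ista may need more iterations to resolve the error
    x_rec = ista(A, b_delta, h * c * delta)
    errors.append(np.linalg.norm(x_rec - x_true, 1))
plt.loglog(deltas, errors, 'o-', label='l1 error')
plt.loglog(deltas, deltas, 'k--', label='slope 1 reference')
plt.xlabel('delta'); plt.legend(); plt.show()
```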

In a final experiment I vary both {\delta} and {n}:

Experiment 3 ({\delta\rightarrow 0} and “{n\rightarrow\infty}”) Now we repeat Experiment 2 for different {n} and put all the loglog plots in one figure. This looks like this: You clearly observe the linear convergence rate in every case. But there is another important thing: The larger the {n} (i.e. the smaller the {h}), the later the linear rate kicks in (i.e. for smaller {\delta}). You may wonder what the reason is. By looking at the reconstructions for varying {n} and {\delta} (which I do not show here) you see the following behavior: For large noise the regularized solutions consist of several peaks located all over the place, and with vanishing noise one peak close to the original one becomes dominant. However, this peak is not at the exact position, but at a slightly larger {t}; moreover, it is slightly smaller. Then this peak moves towards the right position and also gets larger. Finally, the peak arrives at the exact position and remains there while approaching the correct height.

Hence, the linear rate kicks in precisely when the error drops below the level that the discretization can resolve.
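The corresponding sweep over both {n} and {\delta}, continuing the sketch above:

```python
# Experiment 3: repeat the delta sweep for several n and overlay the loglog curves;
# the larger n, the later (i.e. for smaller delta) the linear rate sets in.
for n in [10, 50, 100, 500, 1000]:
    h = 1.0 / n
    A = deriv2_galerkin(n)
    x_true = np.zeros(n)
    x_true[int(0.2 * n)] = 1.0 / np.sqrt(h)
    b = A @ x_true
    errs = []
    for delta in deltas:
        noise = rng.standard_normal(n)
        b_delta = b + delta * noise / np.linalg.norm(noise)
        errs.append(np.linalg.norm(ista(A, b_delta, h * delta) - x_true, 1))
    plt.loglog(deltas, errs, label=f'n = {n}')
plt.xlabel('delta'); plt.legend(); plt.show()
```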

Conclusion:

  • The linear convergence rate is only present in the discrete case. Moreover, it starts at a level which cannot be resolved by the discretization.
  • “Sparsity penalties” in the continuous case are a different and delicate matter. You may consult the preprint “Inverse problems in spaces of measures” which formulates the sparse recovery problem in a continuous setting, but in the space of Radon measures rather than in {L^1} (which simply does not work). There Kristian and Hanna show weak* convergence of the minimizers.
  • Finally, for “continuous sparsity” some kind of convergence is still true, however not in norm (which really should be the variation norm in measure space). Weak* convergence can be quantified by the Prokhorov metric or the Wasserstein metric (which is also called earth mover’s distance in some communities). Convergence with respect to these metrics should hold (under some assumptions) but seems hard to prove. Convergence rates would be cool, but seem even harder.