Regularization

The problem of optimal transport of mass from one distribution to another can be stated in many forms. Here is the formulation going back to Kantorovich: We have two measurable sets ${\Omega_{1}}$ and ${\Omega_{2}}$, coming with two measures ${\mu_{1}}$ and ${\mu_{2}}$. We also have a function ${c:\Omega_{1}\times \Omega_{2}\rightarrow {\mathbb R}}$ which assigns a transport cost, i.e. ${c(x_{1},x_{2})}$ is the cost that it takes to carry one unit of mass from point ${x_{1}\in\Omega_{1}}$ to ${x_{2}\in\Omega_{2}}$. What we want is a plan that says where the mass in ${\mu_{1}}$ should be placed in ${\Omega_{2}}$ (or vice versa). There are different ways to formulate this mathematically.

A simple way is to look for a map ${T:\Omega_{1}\rightarrow\Omega_{2}}$ which says that the mass in ${x_{1}}$ should be moved to ${T(x_{1})}$. While natural, there is a serious problem with this: What if not all mass at ${x_{1}}$ should go to the same point in ${\Omega_{2}}$? This happens in simple situations where all mass in ${\Omega_{1}}$ sits in just one point, but there are at least two different points in ${\Omega_{2}}$ where mass should end up. This is not going to work with a map ${T}$ as above. So, the map ${T}$ is not flexible enough to model all kinds of transport we may need.

What we want is a way to distribute mass from one point in ${\Omega_{1}}$ to the whole set ${\Omega_{2}}$. This looks like we want maps ${\mathcal{T}}$ which map points in ${\Omega_{1}}$ to functions on ${\Omega_{2}}$, i.e. something like ${\mathcal{T}:\Omega_{1}\rightarrow (\Omega_{2}\rightarrow{\mathbb R})}$ where ${(\Omega_{2}\rightarrow{\mathbb R})}$ stands for some set of functions on ${\Omega_{2}}$. We can de-curry this function to some ${\tau:\Omega_{1}\times\Omega_{2}\rightarrow{\mathbb R}}$ by ${\tau(x_{1},x_{2}) = \mathcal{T}(x_{1})(x_{2})}$. That’s good in principle, but we still run into problems when the target mass distribution ${\mu_{2}}$ is singular in the sense that ${\Omega_{2}}$ is a “continuous” set and there are single points in ${\Omega_{2}}$ that carry some mass according to ${\mu_{2}}$. Since we are in the world of measure theory already, the way out suggests itself: Instead of a function ${\tau}$ on ${\Omega_{1}\times\Omega_{2}}$ we look for a measure ${\pi}$ on ${\Omega_{1}\times \Omega_{2}}$ as a transport plan.

The demand that we should carry all of the mass in ${\Omega_{1}}$ to reach all of ${\mu_{2}}$ is formulated by marginals. For simplicity we just write these constraints as

$\displaystyle \int_{\Omega_{2}}\pi\, d x_{2} = \mu_{1},\qquad \int_{\Omega_{1}}\pi\, d x_{1} = \mu_{2}$

(with the understanding that the first equation really means that for all continuous functions ${f:\Omega_{1}\rightarrow {\mathbb R}}$ it holds that ${\int_{\Omega_{1}\times \Omega_{2}} f(x_{1})\,d\pi(x_{1},x_{2}) = \int_{\Omega_{1}}f(x_{1})\,d\mu_{1}(x_{1})}$).

This leads us to the full transport problem

$\displaystyle \min_{\pi}\int_{\Omega_{1}\times \Omega_{2}}c(x_{1},x_{2})\,d\pi(x_{1},x_{2})\quad \text{s.t.}\quad \int_{\Omega_{2}}\pi\, d x_{2} = \mu_{1},\quad \int_{\Omega_{1}}\pi\, d x_{1} = \mu_{2}.$
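To get a feeling for the problem, here is a tiny discrete sketch (my own illustration, not part of the theory above): if both marginals are uniform on ${n}$ points, an optimal plan can be taken as a permutation (an extreme point of the feasible set), so for small ${n}$ brute force over all permutations already solves the problem:

```python
from itertools import permutations

# Two uniform discrete marginals: n point masses of weight 1/n each.
x1 = [0.0, 1.0, 2.0]
x2 = [0.5, 1.5, 2.5]

def c(a, b):          # transport cost per unit of mass (squared distance as example)
    return (a - b) ** 2

# For uniform marginals an optimal Kantorovich plan can be taken to be a
# permutation matrix, so we can brute-force over all permutations.
best_cost, best_sigma = min(
    (sum(c(x1[i], x2[s[i]]) for i in range(len(x1))) / len(x1), s)
    for s in permutations(range(len(x2)))
)
print(best_cost, best_sigma)  # the order-preserving pairing wins for this cost
```

For the squared distance the optimal pairing is the monotone one, which already hints at the monotonicity notions discussed below.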

There is the following theorem which characterizes optimality of a plan and which is the topic of this post:

Theorem 1 (Fundamental theorem of optimal transport) Under some technicalities we can say that a plan ${\pi}$ which fulfills the marginal constraints is optimal if and only if one of the following equivalent conditions is satisfied:

1. The support ${\mathrm{supp}(\pi)}$ of ${\pi}$ is ${c}$-cyclically monotone.
2. There exists a ${c}$-concave function ${\phi}$ such that its ${c}$-superdifferential contains the support of ${\pi}$, i.e. ${\mathrm{supp}(\pi)\subset \partial^{c}\phi}$.

A few clarifications: The technicalities involve continuity, integrability, and boundedness conditions of ${c}$ and integrability conditions on the marginals. The full theorem can be found as Theorem 1.13 in A user’s guide to optimal transport by Ambrosio and Gigli. Also the notions ${c}$-cyclically monotone, ${c}$-concave and ${c}$-superdifferential probably need explanation. We start with a simpler notion: ${c}$-monotonicity:

Definition 2 A set ${\Gamma\subset\Omega_{1}\times\Omega_{2}}$ is ${c}$-monotone, if for all ${(x_{1},x_{2}),(x_{1}',x_{2}')\in\Gamma}$ it holds that

$\displaystyle c(x_{1},x_{2}) + c(x_{1}',x_{2}')\leq c(x_{1},x_{2}') + c(x_{1}',x_{2}).$

If you find it unclear what this has to do with monotonicity, look at this example:

Example 1 Let ${\Omega_{1},\Omega_{2}\subset{\mathbb R}^{d}}$ and let ${c(x_{1},x_{2}) = -\langle x_{1},x_{2}\rangle}$ be the negative of the usual scalar product. Then ${c}$-monotonicity is the condition that for all ${(x_{1},x_{2}),(x_{1}',x_{2}')\in\Gamma\subset\Omega_{1}\times\Omega_{2}}$ it holds that

$\displaystyle 0\leq \langle x_{1}-x_{1}',x_{2}-x_{2}'\rangle$

which may look more familiar. Indeed, when ${\Omega_{1}}$ and ${\Omega_{2}}$ are subsets of the real line, the above condition means that the set ${\Gamma}$ somehow “moves up in ${\Omega_{2}}$” if we “move right in ${\Omega_{1}}$”. So ${c}$-monotonicity for ${c(x_{1},x_{2}) = -\langle x_{1},x_{2}\rangle}$ is something like “monotonically increasing”. Similarly, for ${c(x_{1},x_{2}) = \langle x_{1},x_{2}\rangle}$, ${c}$-monotonicity means “monotonically decreasing”.

You may say that both ${c(x_{1},x_{2}) = \langle x_{1},x_{2}\rangle}$ and ${c(x_{1},x_{2}) = -\langle x_{1},x_{2}\rangle}$ are strange cost functions and I can’t argue with that. But consider ${c(x_{1},x_{2}) = |x_{1}-x_{2}|^{2}}$ (${|\,\cdot\,|}$ being the euclidean norm), which seems more natural, right? If we have a transport plan ${\pi}$ for this ${c}$ for some marginals ${\mu_{1}}$ and ${\mu_{2}}$ we also have

$\displaystyle \begin{array}{rcl} \int_{\Omega_{1}\times \Omega_{2}}c(x_{1},x_{2})\,d\pi(x_{1},x_{2}) & = & \int_{\Omega_{1}\times \Omega_{2}}|x_{1}|^{2}\, d\pi(x_{1},x_{2})\\ &&\quad- 2\int_{\Omega_{1}\times \Omega_{2}}\langle x_{1},x_{2}\rangle\, d\pi(x_{1},x_{2})\\ && \qquad+ \int_{\Omega_{1}\times \Omega_{2}} |x_{2}|^{2}\,d\pi(x_{1},x_{2})\\ & = &\int_{\Omega_{1}}|x_{1}|^{2}\,d\mu_{1}(x_{1}) - 2\int_{\Omega_{1}\times \Omega_{2}}\langle x_{1},x_{2}\rangle\, d\pi(x_{1},x_{2}) + \int_{\Omega_{2}}|x_{2}|^{2}\,d\mu_{2}(x_{2}) \end{array}$

i.e., the transport cost for ${c(x_{1},x_{2}) = |x_{1}-x_{2}|^{2}}$ differs from the one for ${c(x_{1},x_{2}) = -\langle x_{1},x_{2}\rangle}$ only by a positive factor and a constant independent of ${\pi}$, so we may as well use the latter.
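This invariance is easy to check numerically. Here is a small sketch (my own illustration) with two points per marginal and three different feasible plans; the squared-distance cost minus twice the negative-inner-product cost is the same for all of them:

```python
# Three feasible plans with the same marginals mu1 = mu2 = (1/2, 1/2).
x1 = [0.0, 1.0]
x2 = [0.0, 2.0]
plans = [
    [[0.5, 0.0], [0.0, 0.5]],      # "identity" plan
    [[0.0, 0.5], [0.5, 0.0]],      # "swap" plan
    [[0.25, 0.25], [0.25, 0.25]],  # product plan
]

def cost(plan, c):
    return sum(plan[i][j] * c(x1[i], x2[j]) for i in range(2) for j in range(2))

sq = [cost(p, lambda a, b: (a - b) ** 2) for p in plans]   # squared distance
ip = [cost(p, lambda a, b: -a * b) for p in plans]         # negative inner product
# cost_sq(pi) = const + 2*cost_ip(pi) for every feasible pi:
diffs = [s - 2 * t for s, t in zip(sq, ip)]
print(diffs)  # all three entries are equal
```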

The fundamental theorem of optimal transport uses the notion of ${c}$-cyclical monotonicity which is stronger than just ${c}$-monotonicity:

Definition 3 A set ${\Gamma\subset \Omega_{1}\times \Omega_{2}}$ is ${c}$-cyclically monotone, if for all ${(x_{1}^{i},x_{2}^{i})\in\Gamma}$, ${i=1,\dots,n}$ and all permutations ${\sigma}$ of ${\{1,\dots,n\}}$ it holds that

$\displaystyle \sum_{i=1}^{n}c(x_{1}^{i},x_{2}^{i}) \leq \sum_{i=1}^{n}c(x_{1}^{i},x_{2}^{\sigma(i)}).$

For ${n=2}$ we get back the notion of ${c}$-monotonicity.
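On a finite set the condition is purely combinatorial and can be checked by brute force. A small sketch (my own, with the obvious exponential cost in the number of points; tuples with repeated points are omitted for brevity):

```python
from itertools import permutations

def is_c_cyclically_monotone(Gamma, c, n_max=None):
    """Brute-force check of the definition on a finite set Gamma of pairs (x1, x2)."""
    n_max = n_max or len(Gamma)
    for n in range(2, n_max + 1):
        for pts in permutations(Gamma, n):      # ordered n-tuples of distinct points
            base = sum(c(a, b) for a, b in pts)
            for sigma in permutations(range(n)):
                if base > sum(c(pts[i][0], pts[sigma[i]][1]) for i in range(n)) + 1e-12:
                    return False
    return True

c = lambda a, b: -a * b  # for this cost, "increasing" sets pass the check
print(is_c_cyclically_monotone([(0, 0), (1, 1), (2, 2)], c))   # True
print(is_c_cyclically_monotone([(0, 1), (1, 0)], c))           # False
```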

Definition 4 A function ${\phi:\Omega_{1}\rightarrow {\mathbb R}}$ is ${c}$-concave if there exists some function ${\psi:\Omega_{2}\rightarrow{\mathbb R}}$ such that

$\displaystyle \phi(x_{1}) = \inf_{x_{2}\in\Omega_{2}}c(x_{1},x_{2}) - \psi(x_{2}).$

This definition of ${c}$-concavity resembles the notion of convex conjugate:

Example 2 Again using ${c(x_{1},x_{2}) = -\langle x_{1},x_{2}\rangle}$ we get that a function ${\phi}$ is ${c}$-concave if

$\displaystyle \phi(x_{1}) = \inf_{x_{2}}-\langle x_{1},x_{2}\rangle - \psi(x_{2}),$

and, as an infimum over linear functions, ${\phi}$ is clearly concave in the usual way.

Definition 5 The ${c}$-superdifferential of a ${c}$-concave function is

$\displaystyle \partial^{c}\phi = \{(x_{1},x_{2})\mid \phi(x_{1}) + \phi^{c}(x_{2}) = c(x_{1},x_{2})\},$

where ${\phi^{c}}$ is the ${c}$-conjugate of ${\phi}$ defined by

$\displaystyle \phi^{c}(x_{2}) = \inf_{x_{1}\in\Omega_{1}}c(x_{1},x_{2}) -\phi(x_{1}).$

Again one may look at ${c(x_{1},x_{2}) = -\langle x_{1},x_{2}\rangle}$ and observe that the ${c}$-superdifferential is the usual superdifferential related to the supergradient of concave functions (there is a Wikipedia page for subgradient only, but the concept is the same with reversed signs in some sense).
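On a grid, the ${c}$-conjugate and the ${c}$-superdifferential can be computed by brute force. A small sketch (my own illustration) with ${c(x_{1},x_{2}) = -\langle x_{1},x_{2}\rangle}$ and ${\phi(x) = -x^{2}/2}$, for which one can check by hand that ${\phi^{c}(x_{2}) = -x_{2}^{2}/2}$ and the superdifferential is the diagonal:

```python
# Discretized c-conjugate: phi^c(x2) = min over x1 of c(x1, x2) - phi(x1), and the
# c-superdifferential as the set of pairs where phi(x1) + phi^c(x2) = c(x1, x2).
X = [i / 10 for i in range(-20, 21)]          # grid used for both Omega_1 and Omega_2
c = lambda a, b: -a * b

phi = {a: -0.5 * a * a for a in X}            # candidate c-concave function
phi_c = {b: min(c(a, b) - phi[a] for a in X) for b in X}
superdiff = [(a, b) for a in X for b in X
             if abs(phi[a] + phi_c[b] - c(a, b)) < 1e-9]
print(len(superdiff))  # equality holds exactly on the diagonal x1 = x2 here
```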

Now let us sketch the proof of the fundamental theorem of optimal transport:

Proof (of the fundamental theorem of optimal transport). Let ${\pi}$ be an optimal transport plan. We aim to show that ${\mathrm{supp}(\pi)}$ is ${c}$-cyclically monotone and assume the contrary. That is, we assume that there are points ${(x_{1}^{i},x_{2}^{i})\in\mathrm{supp}(\pi)}$ and a permutation ${\sigma}$ such that

$\displaystyle \sum_{i=1}^{n}c(x_{1}^{i},x_{2}^{i}) > \sum_{i=1}^{n}c(x_{1}^{i},x_{2}^{\sigma(i)}).$

We aim to construct a ${\tilde\pi}$ such that ${\tilde\pi}$ is still feasible but has a smaller transport cost. To do so, we note that continuity of ${c}$ implies that there are neighborhoods ${U_{i}}$ of ${x_{1}^{i}}$ and ${V_{i}}$ of ${x_{2}^{i}}$ such that for all ${u_{i}\in U_{i}}$ and ${v_{i}\in V_{i}}$ it holds that

$\displaystyle \sum_{i=1}^{n}c(u_{i},v_{\sigma(i)}) - c(u_{i},v_{i})<0.$

We use this to construct a better plan ${\tilde \pi}$: Take the mass of ${\pi}$ in the sets ${U_{i}\times V_{i}}$ and shift it around. The full construction is a little messy to write down: Define a probability measure ${\nu}$ on the product ${X = \bigotimes_{i=1}^{n}U_{i}\times V_{i}}$ as the product of the measures ${\tfrac{1}{\pi(U_{i}\times V_{i})}\pi|_{U_{i}\times V_{i}}}$. Now let ${P^{U_{i}}}$ and ${P^{V_{i}}}$ be the projections of ${X}$ onto ${U_{i}}$ and ${V_{i}}$, respectively, and set

$\displaystyle \eta = \tfrac{\min_{i}\pi(U_{i}\times V_{i})}{n}\sum_{i=1}^{n}\Big[(P^{U_{i}},P^{V_{\sigma(i)}})_{\#}\nu - (P^{U_{i}},P^{V_{i}})_{\#}\nu\Big]$

where ${_{\#}}$ denotes the pushforward of measures. Note that the measure ${\eta}$ is signed and that ${\tilde\pi = \pi + \eta}$ fulfills

1. ${\tilde\pi}$ is a non-negative measure
2. ${\tilde\pi}$ is feasible, i.e. has the correct marginals
3. ${\int c\,d\tilde \pi<\int c\,d\pi}$

which, all together, gives a contradiction to optimality of ${\pi}$. The implication of item 1 to item 2 of the theorem is not really related to optimal transport but a general fact about ${c}$-concavity and ${c}$-cyclical monotonicity (cf. this previous blog post of mine where I wrote a similar statement for convexity). So let us just prove the implication from item 2 to optimality of ${\pi}$: Let ${\pi}$ fulfill item 2, i.e. ${\pi}$ is feasible and ${\mathrm{supp}(\pi)}$ is contained in the ${c}$-superdifferential of some ${c}$-concave function ${\phi}$. Moreover let ${\tilde\pi}$ be any feasible transport plan. We aim to show that ${\int c\,d\pi\leq \int c\,d\tilde\pi}$. By definition of the ${c}$-superdifferential and the ${c}$-conjugate we have

$\displaystyle \begin{array}{rcl} \phi(x_{1}) + \phi^{c}(x_{2}) &=& c(x_{1},x_{2})\ \forall (x_{1},x_{2})\in\partial^{c}\phi\\ \phi(x_{1}) + \phi^{c}(x_{2}) & \leq& c(x_{1},x_{2})\ \forall (x_{1},x_{2})\in\Omega_{1}\times \Omega_{2}. \end{array}$

Since ${\mathrm{supp}(\pi)\subset\partial^{c}\phi}$ by assumption, this gives

$\displaystyle \begin{array}{rcl} \int_{\Omega_{1}\times \Omega_{2}}c(x_{1},x_{2})\,d\pi(x_{1},x_{2}) & =& \int_{\Omega_{1}\times \Omega_{2}}\phi(x_{1}) + \phi^{c}(x_{2})\,d\pi(x_{1},x_{2})\\ &=& \int_{\Omega_{1}}\phi(x_{1})\,d\mu_{1}(x_{1}) + \int_{\Omega_{2}}\phi^{c}(x_{2})\,d\mu_{2}(x_{2})\\ &=& \int_{\Omega_{1}\times \Omega_{2}}\phi(x_{1}) + \phi^{c}(x_{2})\,d\tilde\pi(x_{1},x_{2})\\ &\leq& \int_{\Omega_{1}\times \Omega_{2}}c(x_{1},x_{2})\,d\tilde\pi(x_{1},x_{2}) \end{array}$

which shows the claim.

${\Box}$

Corollary 6 If ${\pi}$ is a measure on ${\Omega_{1}\times \Omega_{2}}$ which is supported on a ${c}$-superdifferential of a ${c}$-concave function, then ${\pi}$ is an optimal transport plan for its marginals with respect to the transport cost ${c}$.

I can’t claim that I am an expert in machine learning. I’d rather say that I am merely a tourist in this area. Anyway, here is a small piece of thought on how (supervised) machine learning imitates human learning.

What are some features of human learning? Ideally, humans aim to understand a subject. To achieve this, they study examples, try to make their own deductions, do experiments, make predictions and test them. The overall goal is to get to the heart of things.

What are features of so-called supervised machine learning? The methods get training data, i.e. pairs of input and output that match. The foremost goal of the method is to perform well on test data, i.e. to produce correct output for an input the method hasn’t seen before. In practice, one sets up a fairly general model (such as a neural network or a kernelized support vector machine) and often does as little modeling of the task at hand as possible.

This does not sound as though supervised machine learning and human learning are the same or even related. Their goals and methods are genuinely different.

But let us look at how humans learn for a very specific task: Preparing for an exam. Recently I had to prepare several larger exams in mathematics for engineers, each with a hundred plus students, and got to think about how they approach the task of learning. When the exam comes closer, the interactions with the students get more frequent. I had a large “ask anything” meeting, I had students coming to office hours, and I had “digital office hours” where the students could ask questions via a messenger in a chat room. So I had quite some interactions and could get a little insight into their way of learning, into their problems and progress.

Here are some observations of how the students tried to learn: The questions I got were mostly about the exercises we had handed out (in other words, the students asked for information on the “training data”). They were studying heavily on these exercises, barely using the textbook or their lecture notes to look up theorems or definitions (in other words, some were working “model free” or with a “general purpose model” which says something like “do computations following general rules”). They worked with the underlying assumption that the exam is made up of questions similar to the exercises (and actually, this is a reasonable assumption – I announced this in class) (in other words, the test data comes from the same distribution as the training data).

Viewed like this, learning of humans (for an exam, that is) and machine learning sound much more similar. And the similarities do not end here. Also some known problems with machine learning methods can be observed with the students: Students get stuck in local minima (they reach a point where further improvement is impossible by revising the seen data – even though they could, in principle, learn more from the given exercises, they keep going the known paths, practicing the known computations, not learning new techniques). Students overfit to the training data (on the test data, aka the exam, they face new problems and oftentimes apply the learned methods to tasks where they don’t work, getting wrong results which would have been correct if the problem had been a little different). The trained students are vulnerable to adversarial attacks (for every problem I posed as an exercise I could make a slight change that would confuse most students). Also, similar to recent observations in machine learning, overparametrization helps to avoid overfitting and to avoid spurious local valleys, i.e. when the students have more techniques at hand, which corresponds to a more flexible machine learning method, they do better on unseen data and do not get stuck at bad local minima where no improvement is possible.

Granted, some observations are a bit of a stretch, but still, in conclusion, I’d advocate to replace the term “machine learning” with machine cramming (the German version would be maschinelles Büffeln or maschinelles Pauken).

In sparse recovery one aims to leverage the sparsity of some vector ${u}$ to get a good reconstruction of this vector from a noisy and indirect measurement ${u^{0}}$. Sparsity can be expressed in two different ways: The analysis way and the synthesis way. For both of them you need a system of other vectors ${v_{k}}$.

• The analysis way: Here you consider the inner products ${\langle u,v_{k}\rangle}$ and the sparsity assumption is, that this sequence is sparse.
• The synthesis way: Here you assume that there is a sparse sequence of coefficients ${\psi_{k}}$ such that ${u=\sum_{k} \psi_{k}v_{k}}$.

The first approach is called “analysis” since here one analyzes the vector ${u}$ with the system of vectors. In the second approach, you use the vectors ${v_{k}}$ to synthesize the vector ${u}$.

We will keep things very simple here and only consider a noisy observation (and not an indirect one). The recovery approach in the analysis way is then to find ${u}$ as a solution of

$\displaystyle \min_{u}\ \tfrac12\|u-u^{0}\|^{2} + \alpha R((\langle u,v_{k}\rangle)_{k})$

while for the synthesis way you write ${u = \sum_{k} \psi_{k}v_{k} = V\psi}$ (where we collected the vectors ${v_{k}}$ as the columns of the matrix ${V}$) and solve

$\displaystyle \min_{\psi}\ \tfrac12\|V\psi-u^{0}\|^{2} + \alpha R(\psi)$

The functional ${R}$ should enforce the sparsity of the sequence ${(\langle u,v_{k}\rangle)_{k}}$ in the former case and the sparsity of ${\psi}$ in the latter case. With the matrix ${V}$ the analysis way reads as

$\displaystyle \min_{u}\ \tfrac12\|u-u^{0}\|^{2} + \alpha R(V^{T}u)$

One important observation is that both approaches coincide if the matrix ${V}$ is orthonormal: Since ${V^{-1} = V^{T}}$ in this case, ${u=V\psi}$ is equivalent to ${\psi = V^{T}u}$, i.e. ${\psi_{k} = \langle u,v_{k}\rangle}$. For other sets of vectors ${v_{k}}$, this is not true and it is not so clear how the analysis and the synthesis approach are related. In this post I’d like to show one instance where both approaches lead to results that are not even close to being similar.
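The orthonormal case is quickly verified numerically: under the substitution ${\psi = V^{T}u}$ the two objectives agree pointwise, so the minimization problems are identical. A small sketch (my own, with ${R}$ the ${\ell^{1}}$-norm and a rotation matrix as ${V}$):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.7
V = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # orthonormal columns v_k
u0 = rng.standard_normal(2)
alpha = 0.3
R = lambda z: np.sum(np.abs(z))                   # ell^1 as sparsity functional

u = rng.standard_normal(2)                        # an arbitrary test point
psi = V.T @ u                                     # psi_k = <u, v_k>
analysis  = 0.5 * np.sum((u - u0) ** 2) + alpha * R(V.T @ u)
synthesis = 0.5 * np.sum((V @ psi - u0) ** 2) + alpha * R(psi)
print(abs(analysis - synthesis))                  # zero up to rounding
```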

1. The “ROF analysis” approach to denoising

If ${u^{0}}$ is an image, the famous Rudin-Osher-Fatemi denoising model fits under the “analysis” framework: The vectors ${v_{k}}$ are all the finite differences of neighboring pixels such that ${V^{T}u = \nabla u}$ with the discrete gradient ${\nabla u}$. As sparsifying functional you take a mixed ${2}$-${1}$ norm, i.e. ${R(\nabla u) = \|\nabla u\|_{2,1} = \sum_{ij}\sqrt{(\partial_{1}u_{ij})^{2} + (\partial_{2}u_{ij})^{2}}}$. The denoised ${u}$ is the solution to

$\displaystyle \min_{u} \tfrac12\|u-u^{0}\|^{2} + \alpha\|\nabla u\|_{2,1}$

Thus, the model will lead to some denoised ${u}$ with a sparse gradient. This sparse gradient is partly the reason for the success of the model in that it keeps edges, but it is also responsible for the drawback of this approach: The denoised image will be mostly piecewise constant.

2. The “ROF synthesis” approach

As ${V^{T} = \nabla}$, the corresponding synthesis approach would be to solve

$\displaystyle \min_{\psi}\tfrac12\|\nabla^{T}\psi - u^{0}\|^{2} + \alpha \|\psi\|_{2,1}$

and then set ${u = \nabla^{T}\psi}$. In this approach, the denoised image will be a sparse linear combination of particular two-pixel images. If you think about that, it does not really sound like a good idea to approximate an image by a sparse linear combination of two-pixel images. (One technicality: The operator ${\nabla^{T}}$ is not onto – ${\nabla}$ has a non-trivial kernel, consisting of the constant images and hence, all images in the range of ${\nabla^{T}}$ have zero mean. Thus, to get something remotely close to ${u^{0}}$ one should subtract the mean of ${u^{0}}$ before the reconstruction and add it back afterwards.)

Here are some results:

Note that the results actually fit to what we would have predicted: For larger ${\alpha}$ we get a “sparser gradient” (less edges, less variation) for the analysis approach, and “sparser two-pixel combinations” for the synthesis approach. In fact, for the synthesis approach to give something that is remotely close to the image, one needs quite a lot two-pixel images, i.e. a very low sparsity.

In conclusion, one should not consider the analysis and synthesis approach to be something similar.

Incidentally, the dual of the analysis approach

$\displaystyle \min_{u} \tfrac12\|u-u^{0}\|^{2} + \alpha\|\nabla u\|_{2,1}$

can be written as

$\displaystyle \min_{\|\phi\|_{2,\infty}\leq\alpha}\tfrac12\|\nabla^{T}\phi-u^{0}\|^{2} = \min_{\phi}\tfrac12\|\nabla^{T}\phi-u^{0}\|^{2} + I_{\|\cdot\|_{2,\infty}\leq\alpha}(\phi)$

which looks related to the synthesis approach (one basically replaces the ${2,1}$-norm by its conjugate function). Actually, a similar relation between the dual of the analysis approach and the synthesis approach always holds.
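The dual form also suggests a simple algorithm: projected gradient on ${\phi}$, with a clipping step as projection, and recovery of the primal solution via the standard relation ${u = u^{0} - \nabla^{T}\phi}$. Here is a sketch for a 1D signal (my own illustration; in 1D the mixed ${2,1}$- and ${2,\infty}$-norms reduce to the plain ${1}$- and ${\infty}$-norms):

```python
import numpy as np

# 1D "ROF analysis" denoising min_u 0.5*||u - u0||^2 + alpha*||D u||_1, solved via
# its dual min_{||phi||_inf <= alpha} 0.5*||D^T phi - u0||^2 by projected gradient;
# the primal solution is then u = u0 - D^T phi.
def rof_1d(u0, alpha, iters=2000):
    n = len(u0)
    D = np.diff(np.eye(n), axis=0)     # forward differences, shape (n-1, n)
    phi = np.zeros(n - 1)
    step = 0.25                        # 1/||D||^2, since ||D||^2 <= 4 in 1D
    for _ in range(iters):
        grad = D @ (D.T @ phi - u0)
        phi = np.clip(phi - step * grad, -alpha, alpha)  # project onto inf-ball
    return u0 - D.T @ phi

u0 = np.array([0.0, 0.1, -0.1, 1.0, 1.1, 0.9])   # noisy two-level signal
u = rof_1d(u0, alpha=0.2)
print(np.round(u, 3))   # roughly piecewise constant; the mean is preserved
```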

Some remark before you read this post: It is on a very specialized topic and only presents a theoretical insight which seems to be of no practical value whatsoever. Continue at your own risk of wasting your time.

Morozov’s discrepancy principle is a means to choose the regularization parameter when regularizing inverse ill-posed problems. To fix the setting, we consider two Hilbert spaces ${X}$ and ${Y}$ and a bounded linear operator ${A:X\rightarrow Y}$. We assume that the range of ${A}$ is a non-closed subset of ${Y}$. As a consequence of the Bounded Inverse Theorem the pseudo-inverse ${A^\dag}$ (defined on ${{\mathrm rg} A \oplus ({\mathrm rg} A)^\bot}$) is unbounded.

This is the classical setting for linear inverse problems: We have a process, modeled by ${A}$, such that the quantity ${x^\dag\in X}$ we are interested in gives rise to an output ${y = Ax^\dag}$. We are able to measure ${y}$ but, as is always the case, some noise is introduced in the measurement process and hence, we have access to a measured quantity ${y^\delta\in Y}$ which is a perturbation of ${y}$. Our goal is to approximate ${x^\dag}$ from the knowledge of ${y^\delta}$. Note that simply taking ${A^\dag y^\delta}$ is not an option, since firstly ${y^\delta}$ is not guaranteed to be in the domain of the pseudo-inverse and, even more severely and also more practically relevant, the unboundedness of ${A^\dag}$ will lead to a severe (in theory infinitely large) amplification of the noise, rendering the reconstruction useless.

The key idea is to approximate the pseudo-inverse ${A^\dag}$ by a family of bounded operators ${R_\alpha:Y\rightarrow X}$ with the hope that one may have

$\displaystyle R_\alpha y^\delta \rightarrow x^\dag\ \text{when}\ y^\delta\rightarrow y\in{\mathrm rg} A\ \text{and}\ \alpha\rightarrow 0 \ \text{appropriately.} \ \ \ \ \ (1)$

(Note that the assumption ${\alpha\rightarrow 0}$ is just a convention. It says that we assume that ${R_\alpha}$ is a closer approximation to ${A^\dag}$ the closer ${\alpha}$ is to zero.) Now, we have two tasks:

1. Construct a good family of regularizing operators ${R_\alpha}$ and
2. devise a parameter choice, i.e. a way to choose ${\alpha}$.

The famous Bakushinskii veto says that there is no parameter choice that can guarantee convergence in (1) in the worst case and that only uses the given data ${y^\delta}$. The situation changes if one introduces knowledge about the noise level ${\delta = \|y-y^\delta\|}$. (There is an ongoing debate about whether it is reasonable to assume that the noise level is known – my experience when working with engineers is that they are usually very good at quantifying the noise present in their system and hence, in my view, the assumption that the noise level is known is ok.)

One popular way to choose the regularization parameter in dependence on ${y^\delta}$ and ${\delta}$ is Morozov’s discrepancy principle:

Definition 1 Morozov’s discrepancy principle states that one shall choose the regularization parameter ${\alpha(\delta,y^\delta)}$ such that

$\displaystyle \|AR_{\alpha(\delta,y^\delta)}y^\delta - y^\delta\| = c\delta$

for some fixed ${c>1}$.

In other words: You shall choose ${\alpha}$ such that the reconstruction ${R_\alpha y^\delta}$ produces a discrepancy ${\|AR_{\alpha}y^\delta - y^\delta\|}$ which is in the order of and slightly larger than the noise level.

Some years ago I wrote a paper about the use of Morozov’s discrepancy principle when using the augmented Lagrangian method (aka Bregman iteration) as an iterative regularization method (where one can view the inverse of the iteration counter as a regularization parameter). The paper is Morozov’s principle for the augmented Lagrangian method applied to linear inverse problems (together with , the arxiv link is here). In that paper we derived an estimate for the (squared) error of ${R_\alpha y^\delta}$ and ${x^\dag}$ that behaves like

$\displaystyle C \frac{c(\sqrt{c}+1)}{\sqrt{c-1}}\delta$

for some ${C>0}$ and the ${c>1}$ from Morozov’s discrepancy principle. The somehow complicated dependence on ${c}$ was a bit puzzling to me. One can optimize ${c>1}$ such that the error estimate is optimal. It turns out that ${c\mapsto \frac{c(\sqrt{c}+1)}{\sqrt{c-1}}}$ attains the minimal value of about ${4.68}$ for about ${c=1.64}$. I blamed the already quite complicated analysis of the augmented Lagrangian method for these obviously much too complicated values (and in practice, using ${c}$ much closer to ${1}$ usually led to much better results).

This term I teach a course on inverse problems and also covered Morozov’s discrepancy principle but this time for much simpler regularization methods, namely for linear methods such as, for example, Tikhonov regularization, i.e.

$\displaystyle R_\alpha y^\delta = (A^* A + \alpha I)^{-1} A^* y^\delta$

(but other linear methods exist). There I arrived at an estimate for the (squared) error of ${R_\alpha y^\delta}$ and ${x^\dag}$ of the form

$\displaystyle C(\sqrt{c} + \tfrac1{\sqrt{c-1}})^2\delta.$

Surprisingly, the dependence on ${c}$ for this basic regularization method is also not very simple. Optimizing the estimate over ${c>1}$ leads to an optimal value of about ${c= 2.32}$ (with a minimal value of the respective constant of about ${5.73}$). Again, using ${c}$ closer to ${1}$ usually led to much better results.
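Since the discrepancy ${\|AR_{\alpha}y^\delta - y^\delta\|}$ is increasing in ${\alpha}$ for Tikhonov regularization, the parameter from the discrepancy principle can be found by plain bisection. A small numerical sketch (my own toy setup, not from the lecture notes):

```python
import numpy as np

# Tikhonov reconstruction R_alpha y = (A^T A + alpha I)^{-1} A^T y, with alpha
# chosen by the discrepancy principle ||A R_alpha y_delta - y_delta|| = c*delta.
rng = np.random.default_rng(1)
U, _ = np.linalg.qr(rng.standard_normal((30, 30)))
W, _ = np.linalg.qr(rng.standard_normal((10, 10)))
A = U[:, :10] * (2.0 ** -np.arange(10)) @ W.T    # decaying singular values 1..2^-9
x_dag = rng.standard_normal(10)
y = A @ x_dag
noise = rng.standard_normal(30)
delta = 0.01
y_delta = y + delta * noise / np.linalg.norm(noise)

def discrepancy(alpha):
    x_alpha = np.linalg.solve(A.T @ A + alpha * np.eye(10), A.T @ y_delta)
    return np.linalg.norm(A @ x_alpha - y_delta)

c, lo, hi = 1.5, 1e-14, 1e6
for _ in range(200):                      # bisection: discrepancy increases in alpha
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if discrepancy(mid) < c * delta else (lo, mid)
alpha = 0.5 * (lo + hi)
print(discrepancy(alpha) / delta)         # the ratio is close to c
```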

Well, these observations are not of great importance… I just found it curious to observe that the analysis of Morozov’s discrepancy principle seems to be inherently a bit more complicated than I thought…

Today I’d like to collect some comments on a few papers I stumbled upon recently on the arXiv.

1. TGV minimizers in 1D

First, about a month ago two very similar papers appeared in the same week:

Both papers treat the recently proposed “total generalized variation” model (which is a somehow-but-not-really-higher-order generalization of total variation). The total variation of a function ${u\in L^1(\Omega)}$ (${\Omega\subset{\mathbb R}^d}$) is defined by duality

$\displaystyle TV(u) = \sup\Big\{\int_\Omega \mathrm{div} \phi\, u\,dx\ :\ \phi\in C^\infty_c(\Omega,{\mathbb R}^d), |\phi|\leq 1\Big\}.$

(Note that the demanded high regularity of the test functions ${\phi}$ is not essential here, as we take a supremum over all these functions under the only, but important, requirement that the functions are bounded. Test functions from ${C^1_c(\Omega,{\mathbb R}^d)}$ would also do.)
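As a toy illustration (mine, not from the papers): for a discrete 1D signal the supremum is attained and the total variation reduces to the sum of absolute differences of neighboring values:

```python
# Discrete 1D total variation of a signal with one step up and one step down.
u = [0.0, 0.0, 1.0, 1.0, 0.0]
tv = sum(abs(u[i + 1] - u[i]) for i in range(len(u) - 1))
print(tv)  # 2.0: one jump of height 1 up and one of height 1 down
```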

Several possibilities exist for extending and generalizing the total variation by somehow including higher order derivatives. The “total generalized variation” is a particularly successful approach which reads as (now using two non-negative parameters ${\alpha,\beta}$ which do a weighting):

$\displaystyle TGV_{\beta,\alpha}^2(u) = \sup\Big\{\int_\Omega \mathrm{div}^2 \phi\, u\,dx\ :\ \phi\in C^\infty_c(\Omega,S^{d\times d}),\ |\phi|\leq \beta,\ |\mathrm{div}\phi|\leq \alpha\Big\}.$

To clarify some notation: ${S^{d\times d}}$ are the symmetric ${d\times d}$ matrices, ${\mathrm{div}^n}$ is the negative adjoint of ${\nabla^n}$, which is the differential operator that collects all ${n}$-th order partial derivatives in a ${d\times\cdots\times d}$-tensor. Moreover ${|\phi|}$ is some matrix norm (e.g. the Frobenius norm) and ${|\mathrm{div}\phi|}$ is some vector norm (e.g. the 2-norm).

Both papers investigate so-called denoising problems with TGV penalty and ${L^2}$ discrepancy, i.e. minimization problems

$\displaystyle \min_u \frac12\int_\Omega(u-u^0)^2\, dx + TGV_{\beta,\alpha}^2(u)$

for a given ${u^0}$. Moreover, both papers treat the one dimensional case and investigate very special cases in which they calculate minimizers analytically. In one dimension the definition of ${TGV^2}$ becomes a little more familiar:

$\displaystyle TGV_{\beta,\alpha}^2(u) = \sup\Big\{\int_\Omega \phi''\, u\,dx\ :\ \phi\in C^\infty_c(\Omega,{\mathbb R}),\ |\phi|\leq \beta,\ |\phi'|\leq \alpha\Big\}.$

Some images of both papers are really similar: This one from Papafitsoros and Bredies

and this one from Pöschl and Scherzer

Although both papers have very similar scopes, it is worth reading both. The calculations are tedious but both papers try to make them accessible and try hard (and did a good job) to provide helpful illustrations. Curiously, the earlier paper cites the later one but not conversely…

2. Generalized conditional gradient methods

Another paper I found very interesting was

This paper shows a nice duality which I hadn’t been aware of, namely the one between subgradient descent methods and conditional gradient methods. In fact the conditional gradient method which is treated is a generalization of the conditional gradient method which Kristian and I also proposed a while ago in the context of ${\ell^1}$-minimization in the paper Iterated hard shrinkage for minimization problems with sparsity constraints: To minimize the sum

$\displaystyle F(u) + \Phi(u)$

with a differentiable ${F}$ and a convex ${\Phi}$ for which the subgradient of ${\Phi}$ is easily invertible (or, put differently, for which you can minimize ${\langle u,a\rangle + \Phi(u)}$ easily), perform the following iteration:

1. At iterate ${u^n}$ linearize ${F}$ but not ${\Phi}$ and calculate a new point ${v^n}$ by

$\displaystyle v^n = \mathrm{argmin}_v \langle F'(u^n),v\rangle + \Phi(v)$

2. Choose a stepsize ${s^n\in [0,1]}$ and set the next iterate as a convex combination of ${u^n}$ and ${v^n}$

$\displaystyle u^{n+1} = u^n + s_n(v^n - u^n).$

Note that for an indicator function

$\displaystyle \Phi(u) = \begin{cases} 0 & u\in C\\ \infty & \text{else} \end{cases}$

you obtain the conditional gradient method (also known as the Frank-Wolfe method). While Kristian and I derived convergence with an asymptotic rate for the case of ${F(u) = \tfrac12\|Ku-f\|^2}$ and ${\Phi}$ strongly coercive, Francis uses the formulation ${F(u) = f(Au)}$ with the assumption that the dual ${f^*}$ of ${f}$ has a bounded effective domain (which says that ${f}$ has linear growth in all directions). With this assumption he obtains explicit constants and rates also for the primal-dual gap. It was great to see that eventually somebody really took the idea from the paper Iterated hard shrinkage for minimization problems with sparsity constraints (and does not think that we do heuristics for ${\ell^0}$ minimization…).
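For concreteness, here is a minimal sketch (my own, not from either paper) of the two-step iteration above for ${F(u) = \tfrac12\|Ku-f\|^{2}}$ and ${\Phi}$ the indicator of an ${\ell^{1}}$-ball, where the linearized subproblem is solved by a signed vertex of the ball:

```python
import numpy as np

# Conditional gradient (Frank-Wolfe) sketch: F(u) = 0.5*||Ku - f||^2 and Phi the
# indicator of the ell^1-ball of radius R. The subproblem argmin_v <F'(u^n), v>
# over the ball is attained at a signed vertex of the ball.
rng = np.random.default_rng(2)
K = rng.standard_normal((15, 8))
f = rng.standard_normal(15)
R = 2.0

u = np.zeros(8)
for n in range(2000):
    grad = K.T @ (K @ u - f)              # F'(u^n)
    i = int(np.argmax(np.abs(grad)))
    v = np.zeros(8)
    v[i] = -R * np.sign(grad[i])          # vertex minimizing the linearization
    u = u + (2.0 / (n + 2)) * (v - u)     # classical step size s_n = 2/(n+2)
print(0.5 * np.sum((K @ u - f) ** 2))     # objective value after the iterations
```

The iterate stays in the ball since it is always a convex combination of feasible points.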

A quick post to keep track of several things:

• Christian Leonard has lecture notes on convex optimization with an application to optimal transport on his website.
• The paper Variational Properties of Value Functions by Aravkin, Burke, and Friedlander discusses how the value of minimization problems like $\min \rho(Ax-b)\quad \text{s.t.}\quad \phi(x)\leq \tau$ depends on $\tau$ and ${b}$. In inverse problems, the value function seems to contain important information on the regularization process and hence, the results in this paper may be helpful in designing and analyzing parameter choice rules.
• The paper Accelerated and Inexact Forward-Backward Algorithms by Villa, Salzo, Baldassarre, and Verri looks like an interesting development in the field of splitting methods.
• The paper Consistency of the posterior distribution in generalized linear inverse problems by Natalia Bochkina is another contribution on “probabilistic inverse problems” where one does not only try to infer a regularized solution to an ill-posed problem but also how the uncertainty in the data is propagated through the regularization process.

Another few notes to myself:
