
This is a short follow-up on my last post, where I wrote about the sweet spot of the stepsize of the Douglas-Rachford iteration. For the case of a $\beta$-Lipschitz and $\mu$-strongly monotone operator, the iteration with stepsize $t$ converges linearly with rate

$\displaystyle r(t) = \tfrac{1}{2(1+t\mu)}\left(\sqrt{2t^{2}\mu^{2}+2t\mu + 1 +2(1 - \tfrac{1}{(1+t\beta)^{2}} - \tfrac1{1+t^{2}\beta^{2}})t\mu(1+t\mu)} + 1\right)$

Here is an animated plot of this contraction factor depending on $\beta$ and $\mu$, where $t$ acts as the time variable:

What is interesting is that this factor can be increasing or decreasing in $t$, depending on the values of $\beta$ and $\mu$.

For each pair $(\beta,\mu)$ there is a best $t^*$ and also a smallest contraction factor $r(t^*)$. Here are plots of these quantities:

Comparing the plot of the optimal contraction factor to the animated plot above, you see that the right choice of the stepsize matters a lot.
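As a quick numerical sanity check (my own sketch, not from the post), here is the contraction factor as a Python function together with a crude grid search for the best stepsize; the example values of $\beta$ and $\mu$ are arbitrary:

```python
import numpy as np

def contraction_factor(t, beta, mu):
    """Contraction factor r(t) of the Douglas-Rachford iteration for a
    beta-Lipschitz, mu-strongly monotone operator (formula from the post)."""
    inner = (2 * t**2 * mu**2 + 2 * t * mu + 1
             + 2 * (1 - 1 / (1 + t * beta)**2 - 1 / (1 + t**2 * beta**2))
             * t * mu * (1 + t * mu))
    return (np.sqrt(inner) + 1) / (2 * (1 + t * mu))

# crude grid search for the optimal stepsize t* for one pair (beta, mu)
beta, mu = 5.0, 0.5          # arbitrary example values
ts = np.linspace(1e-3, 10.0, 10_000)
rs = contraction_factor(ts, beta, mu)
t_star = ts[np.argmin(rs)]
print(t_star, rs.min())      # best stepsize and smallest contraction factor
```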

The fundamental theorem of calculus relates the derivative of a function to the function itself via an integral. A little more precisely, it says that one can recover a function from its derivative up to an additive constant (on a simply connected domain).

In one space dimension, one can fix some ${x_{0}}$ in the domain (which has to be an interval) and then set

$\displaystyle F(x) = \int_{x_{0}}^{x}f'(t) dt.$

Then ${F(x) = f(x) + c}$ with ${c= - f(x_{0})}$.

Actually, a similar claim is true in higher space dimensions: If ${f}$ is defined on a simply connected domain in ${{\mathbb R}^{d}}$ we can recover ${f}$ from its gradient up to an additive constant as follows: Select some ${x_{0}}$ and set

$\displaystyle F(x) = \int_{\gamma}\nabla f(x)\cdot dx \ \ \ \ \ (1)$

for any path ${\gamma}$ from ${x_{0}}$ to ${x}$. Then it holds under suitable conditions that

$\displaystyle F(x) = f(x) - f(x_{0}).$
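As a small illustration (my own sketch, with a hypothetical choice of $f$), one can approximate the path integral in (1) numerically along a straight path and compare with ${f(x)-f(x_{0})}$:

```python
import numpy as np

def grad_f(p):
    # gradient of the illustrative choice f(x, y) = sin(x) + y**2
    return np.array([np.cos(p[0]), 2 * p[1]])

def recover(x, x0, n=10_000):
    """Midpoint-rule approximation of F(x) = ∫_γ ∇f · dγ along the
    straight path γ(s) = x0 + s (x - x0), s ∈ [0, 1]."""
    s = (np.arange(n) + 0.5) / n            # midpoints of n subintervals
    total = 0.0
    for si in s:
        p = x0 + si * (x - x0)              # point γ(s) on the path
        total += grad_f(p) @ (x - x0)       # ⟨∇f(γ(s)), γ'(s)⟩
    return total / n

x0, x = np.array([0.0, 0.0]), np.array([1.0, 2.0])
f = lambda p: np.sin(p[0]) + p[1]**2
print(recover(x, x0), f(x) - f(x0))         # both ≈ sin(1) + 4
```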

And now for something completely different: Convex functions and subgradients.

A function ${f}$ on ${{\mathbb R}^{n}}$ is convex if for every ${x}$ there exists a subgradient ${x^{*}\in{\mathbb R}^{n}}$ such that for all ${y}$ one has the subgradient inequality

$\displaystyle f(y) \geq f(x) + \langle x^{*}, y-x\rangle.$

Writing this down for ${x}$ and ${y}$ with interchanged roles (and ${y^{*}}$ as corresponding subgradient to ${y}$), we see that

$\displaystyle \langle x-y,x^{*}-y^{*}\rangle \geq 0.$

In other words: For a convex function ${f}$, the subgradient ${\partial f}$ is a (multivalued) monotone mapping. Recall that a multivalued map ${A}$ is monotone if for every ${x^{*}\in A(x)}$ and ${y^{*}\in A(y)}$ it holds that ${\langle x-y,x^{*}-y^{*}\rangle \geq 0}$. It is not hard to see that not every monotone map is actually the subgradient of a convex function (not even if we pass to “maximally monotone maps”, a notion that we sweep under the rug in this post). A simple counterexample is the (singlevalued) linear monotone map represented by

$\displaystyle \begin{bmatrix} 0 & 1\\ -1 & 0 \end{bmatrix}$

(which cannot be a subgradient of some ${f}$, since it is not symmetric).

Another hint that monotonicity of a map does not imply that the map is a subgradient is that subgradients have some stronger properties than monotone maps. Let us write down the subgradient inequalities for a number of points ${x_{0},\dots,x_{n}}$:

$\displaystyle \begin{array}{rcl} f(x_{1}) & \geq & f(x_{0}) + \langle x_{0}^{*},x_{1}-x_{0}\rangle\\ f(x_{2}) & \geq & f(x_{1}) + \langle x_{1}^{*},x_{2}-x_{1}\rangle\\ \vdots & & \vdots\\ f(x_{n}) & \geq & f(x_{n-1}) + \langle x_{n-1}^{*},x_{n}-x_{n-1}\rangle\\ f(x_{0}) & \geq & f(x_{n}) + \langle x_{n}^{*},x_{0}-x_{n}\rangle. \end{array}$

If we sum all these up, we obtain

$\displaystyle 0 \geq \langle x_{0}^{*},x_{1}-x_{0}\rangle + \langle x_{1}^{*},x_{2}-x_{1}\rangle + \cdots + \langle x_{n-1}^{*},x_{n}-x_{n-1}\rangle + \langle x_{n}^{*},x_{0}-x_{n}\rangle.$

This property is called ${n}$-cyclic monotonicity. In these terms we can say that the subgradient of a convex function is cyclically monotone, which means that it is ${n}$-cyclically monotone for every integer ${n}$.
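The rotation-like matrix from the counterexample above makes this concrete; a small check (my own sketch) confirms that it is monotone, while a suitable cycle of three points violates the cyclic monotonicity inequality:

```python
import numpy as np

A = np.array([[0.0, 1.0], [-1.0, 0.0]])   # monotone, but not a subgradient

# monotonicity: ⟨x - y, Ax - Ay⟩ = 0 ≥ 0 since A is skew-symmetric
rng = np.random.default_rng(0)
for _ in range(100):
    x, y = rng.standard_normal(2), rng.standard_normal(2)
    assert (x - y) @ (A @ x - A @ y) >= -1e-12

# but 3-cyclic monotonicity fails: this cycle makes the sum positive
cycle = [np.array([1.0, 0.0]), np.array([0.0, -1.0]), np.array([-1.0, 0.0])]
cyclic_sum = sum((A @ x) @ (y - x)
                 for x, y in zip(cycle, cycle[1:] + cycle[:1]))
print(cyclic_sum)   # 2.0 > 0, violating the inequality 0 ≥ (cyclic sum)
```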

By a remarkable result by Rockafellar, the converse is also true:

Theorem 1 (Rockafellar, 1966) Let ${A}$ be a cyclically monotone map. Then there exists a convex function ${f}$ such that ${A = \partial f}$.

Even more remarkably, the proof is somehow “just an application of the fundamental theorem of calculus”, where we recover a function from its subgradient (up to an additive constant that depends on the base point).

Proof: We aim to “reconstruct” ${f}$ from ${A = \partial f}$. The trick is to choose some base point ${x_{0}}$ and a corresponding ${x_{0}^{*}\in A(x_{0})}$ and set

$\displaystyle \begin{array}{rcl} f(x) &=& \sup\left\{ \langle x_{0}^{*}, x_{1}-x_{0}\rangle + \langle x_{1}^{*},x_{2}-x_{1}\rangle + \cdots + \langle x_{n-1}^{*},x_{n}-x_{n-1}\rangle\right.\\&& \qquad\left. + \langle x_{n}^{*},x-x_{n}\rangle\right\}\qquad\qquad (0)\end{array}$

where the supremum is taken over all integers ${n}$ and all pairs ${x_{i}^{*}\in A(x_{i})}$, ${i=1,\dots, n}$. As a supremum of affine functions, ${f}$ is clearly convex (even lower semicontinuous), and ${f(x_{0}) = 0}$ since ${A}$ is cyclically monotone (this shows that ${f}$ is proper, i.e. not equal to ${\infty}$ everywhere). Finally, for ${\bar x^{*}\in A(\bar x)}$ we have

$\displaystyle \begin{array}{rcl} f(x) &\geq& \sup\left\{ \langle x_{0}^{*}, x_{1}-x_{0}\rangle + \langle x_{1}^{*},x_{2}-x_{1}\rangle + \cdots + \langle x_{n-1}^{*},x_{n}-x_{n-1}\rangle\right.\\ & & \qquad \left.+ \langle x_{n}^{*},\bar x-x_{n}\rangle + \langle \bar x^{*},x-\bar x\rangle\right\} \end{array}$

with the supremum taken over all integers ${n}$ and all pairs ${x_{i}^{*}\in A(x_{i})}$, ${i=1,\dots, n}$. The right hand side is equal to ${f(\bar x) + \langle \bar x^{*},x-\bar x\rangle}$, and this shows that ${\bar x^{*}\in\partial f(\bar x)}$. $\Box$

Where did we use the fundamental theorem of calculus? Let us have another look at equation (0). Just for simplicity, let us denote ${x_{i}^{*} =\nabla f(x_{i})}$. Now consider a path ${\gamma}$ from ${x_{0}}$ to ${x}$ and points ${0=t_{0}< t_{1}<\cdots < t_{n}< t_{n+1} = 1}$ with ${\gamma(t_{i}) = x_{i}}$. Then the term inside the supremum of (0) equals

$\displaystyle \langle \nabla f(\gamma(t_{0})),\gamma(t_{1})-\gamma(t_{0})\rangle + \dots + \langle \nabla f(\gamma(t_{n})),\gamma(t_{n+1})-\gamma(t_{n})\rangle.$

This is a Riemann sum for an integral of the form ${\int_{\gamma}\nabla f(x)\cdot dx}$. By monotonicity of ${\nabla f}$, we increase this sum if we add another point ${\bar t}$ (e.g. ${t_{i}<\bar t< t_{i+1}}$), and hence, the supremum does converge to the integral, i.e. (0) is equal to

$\displaystyle f(x) = \int_{\gamma}\nabla f(x)\cdot dx$

where ${\gamma}$ is any path from ${x_{0}}$ to ${x}$.

In my last blog post I wrote about Luxemburg norms, which are a construction to get norms out of a non-homogeneous function ${\phi:[0,\infty[\rightarrow[0,\infty[}$ which satisfies ${\phi(0) = 0}$ and is increasing and convex (and thus, continuous). The definition of the Luxemburg norm in this case is

$\displaystyle \|x\|_{\phi} := \inf\left\{\lambda>0\ :\ \sum_{k}\phi\left(\frac{|x_{k}|}{\lambda}\right)\leq 1\right\},$

and we saw that ${\|x\|_{\phi} = c}$ if ${\sum_{k}\phi\left(\frac{|x_{k}|}{c}\right)= 1}$.

Actually, one can have a little more flexibility in the construction, as one can also use a different function ${\phi}$ in each coordinate: If the ${\phi_{k}}$ are functions like ${\phi}$ above, we can define

$\displaystyle \|x\|_{\phi_{k}} := \inf\left\{\lambda>0\ :\ \sum_{k}\phi_{k}\left(\frac{|x_{k}|}{\lambda}\right)\leq 1\right\},$

and it still holds that ${\|x\|_{\phi_{k}} = c}$ if and only if ${\sum_{k}\phi_{k}\left(\frac{|x_{k}|}{c}\right)= 1}$. The proof that this construction indeed gives a norm is the same as the one in the previous post.

This construction allows, among other things, to construct norms that behave like different ${p}$-norms in different directions. Here is a simple example: In the case of ${x\in{\mathbb R}^{d}}$ we can split the variables into two groups, say the first ${k}$ variables and the last ${d-k}$ variables. The first group shall be treated with a ${p}$-norm and the second group with a ${q}$-norm. For the respective Luxemburg norm one has

$\displaystyle \|x\|:= \inf\left\{\lambda>0\ :\ \sum_{i=1}^{k}\frac{|x_{i}|^{p}}{\lambda^{p}} + \sum_{i=k+1}^{d}\frac{|x_{i}|^{q}}{\lambda^{q}}\leq 1\right\}.$

Note that there is a different way to do a similar thing, namely a mixed norm defined as

$\displaystyle \|x\|_{p,q}:= \left(\sum_{i=1}^{k}|x_{i}|^{p}\right)^{1/p} + \left(\sum_{i=k+1}^{d}|x_{i}|^{q}\right)^{1/q}.$

Like any two norms, these are equivalent, but they induce a different geometry. On top of that, one could in principle also consider ${\Phi}$ functionals

$\displaystyle \Phi_{p,q}(x) = \sum_{i=1}^{k}|x_{i}|^{p} + \sum_{i=k+1}^{d}|x_{i}|^{q},$

which is again something different.
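To make the difference concrete, here is a small numerical sketch (my own; the vector and exponents are arbitrary) that evaluates the Luxemburg norm by bisection and compares it with the mixed norm and the ${\Phi}$-functional:

```python
import numpy as np

def luxemburg(x, phis, lo=1e-9, hi=1e9, tol=1e-12):
    """Luxemburg norm: smallest λ with Σ_k φ_k(|x_k|/λ) ≤ 1, found by
    bisection; assumes each φ_k is increasing and convex with φ_k(0) = 0."""
    if not np.any(x):
        return 0.0
    g = lambda lam: sum(phi(abs(xi) / lam) for phi, xi in zip(phis, x))
    while hi - lo > tol * max(1.0, hi):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if g(mid) > 1 else (lo, mid)
    return hi

x = np.array([0.3, 0.4, 0.5])
k, p, q = 1, 1, 2                    # first k coordinates get p, the rest q
phis = [lambda t: t**p] * k + [lambda t: t**q] * (len(x) - k)

lux = luxemburg(x, phis)
mixed = np.sum(np.abs(x[:k])) + np.linalg.norm(x[k:])       # ‖x‖_{p,q}
Phi = np.sum(np.abs(x[:k])**p) + np.sum(np.abs(x[k:])**q)   # Φ_{p,q}(x)
print(lux, mixed, Phi)               # three different values for one vector
```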

A bit more generally, we may consider all three constructions for general partitions of the index set and a different exponent for each set.

Here are some observations on the differences:

• For the Luxemburg norm the only things that count are the exponents (or functions ${\phi_{k}}$). If you partition the index set into two parts but give the same exponent to both, the Luxemburg norm is the same as if you considered the two parts as just one part.
• The mixed norm is not the ${p}$-norm, even if you set the exponent to ${p}$ for every part.
• The Luxemburg norm has the flexibility to use other functionals than just the powers.
• For the mixed norms one could consider additional mixing by not just summing the norms of the different parts, which is the same as taking the ${1}$-norm of the vector of norms. Of course, other norms are possible, e.g. constructions like

$\displaystyle \left(\left(\sum_{i=1}^{k}|x_{i}|^{p}\right)^{r/p} + \left(\sum_{i=k+1}^{d}|x_{i}|^{q}\right)^{r/q}\right)^{1/r}$

are also norms. (Actually, the term mixed norm is often used for the above construction with ${p=q\neq r}$.)

Here are some pictures that show the different geometries that these three functionals induce. We consider ${d=3}$, i.e., three-dimensional space, and picture the norm balls (or level sets in the case of the functionals ${\Phi}$).

• Consider the case ${k=1}$ with the first exponent ${p=1}$ and the second ${q=2}$. The mixed norm is

$\displaystyle \|x\|_{1,2} = |x_{1}| + \sqrt{x_{2}^{2}+x_{3}^{2}},$

the ${\Phi}$-functional is

$\displaystyle \Phi_{1,2}(x) = |x_{1}| + x_{2}^{2}+x_{3}^{2},$

and for the Luxemburg norm it holds that

$\displaystyle \|x\| = c\iff \frac{|x_{1}|}{c} + \frac{x_{2}^{2} + x_{3}^{2}}{c^{2}} = 1.$

Here are animated images of the respective level sets/norm balls for radii ${0.25, 0.5, 0.75,\dots,3}$:

You may note the different shapes of the balls of the mixed norm and the Luxemburg norm. Also, the shape of their norm balls stays the same as you scale the radius. The last observation is not true for the ${\Phi}$ functional: Different directions scale differently.

• Now consider ${k=2}$ and the same exponents. This makes the mixed norm equal to the ${1}$-norm, since

$\displaystyle \|x\|_{1,2} = |x_{1}| + |x_{2}| + \sqrt{x_{3}^{2}} = \|x\|_{1}.$

The ${\Phi}$-functional is

$\displaystyle \Phi_{1,2}(x) = |x_{1}| + |x_{2}|+x_{3}^{2},$

and for the Luxemburg norm it holds that

$\displaystyle \|x\| = c\iff \frac{|x_{1}| + |x_{2}|}{c} + \frac{x_{3}^{2}}{c^{2}} = 1.$

Here are animated images of the respective level sets/norm balls of the ${\Phi}$ functional and the Luxemburg norm for the same radii as above (I do not show the balls for the mixed norm – they are just the standard cross-polytopes/${1}$-norm balls/octahedra). Again note how the Luxemburg ball keeps its shape while the level sets of the ${\Phi}$-functional change shape while scaling.

• Now we consider three different exponents: ${p_{1}=1}$, ${p_{2} = 2}$ and ${p_{3} = 3}$. The mixed norm is again the ${1}$-norm. The ${\Phi}$-functional is

$\displaystyle \Phi_{1,2,3}(x) = |x_{1}| + x_{2}^{2}+|x_{3}|^{3},$

and for the Luxemburg norm it holds that

$\displaystyle \|x\| = c\iff \frac{|x_{1}|}{c} + \frac{x_{2}^{2}}{c^{2}} + \frac{|x_{3}|^{3}}{c^{3}} = 1.$

Here are animated images of the respective level sets/norm balls of the ${\Phi}$ functional and the Luxemburg norm for the same radii as above (again, the balls for the mixed norm are just the standard cross-polytopes/${1}$-norm balls/octahedra):

A function ${\phi:[0,\infty[\rightarrow[0,\infty[}$ is called ${p}$-homogeneous if for any ${t>0}$ it holds that ${\phi(tx) = t^{p}\phi(x)}$. This implies that ${\phi(x) = x^{p}\phi(1)}$, and thus all ${p}$-homogeneous functions on the positive reals are just multiples of the ${p}$-th power. With such a power you can form the ${p}$-norm

$\displaystyle \|x\|_{p} = \left(\sum_{k}|x_{k}|^{p}\right)^{1/p}.$

By Minkowski’s inequality, this is indeed a norm for ${p\geq 1}$.

If we have just some function ${\phi:[0,\infty[\rightarrow[0,\infty[}$ that is not homogeneous, we could still try to do a similar thing and consider

$\displaystyle \phi^{-1}\left(\sum_{k}\phi(|x_{k}|)\right).$

It is easy to see that one needs ${\phi(0)=0}$ and ${\phi}$ increasing and invertible to have any chance that this expression is a norm. However, one usually does not get positive homogeneity of the expression, i.e. in general

$\displaystyle \phi^{-1}\left(\sum_{k}\phi(t|x_{k}|)\right)\neq t \phi^{-1}\left(\sum_{k}\phi(|x_{k}|)\right).$
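A quick numerical check (my own sketch, using ${\phi(t)=\exp(t)-1}$) shows the failure of homogeneity:

```python
import numpy as np

phi = lambda t: np.exp(t) - 1.0
phi_inv = lambda s: np.log(1.0 + s)

def naive(x, t=1.0):
    # the "naive" construction φ^{-1}(Σ_k φ(t |x_k|)) from the text
    return phi_inv(np.sum(phi(t * np.abs(x))))

x = np.array([1.0, 2.0])
print(naive(x, 2.0), 2.0 * naive(x))   # ≈ 4.11 vs ≈ 4.42: not homogeneous
```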

A construction that helps in this situation is the Luxemburg norm. The definition is as follows:

Definition 1 (and lemma). Let ${\phi:[0,\infty[\rightarrow[0,\infty[}$ fulfill ${\phi(0)=0}$, ${\phi}$ be increasing and convex. Then we define the Luxemburg norm for ${\phi}$ as

$\displaystyle \|x\|_{\phi} := \inf\{\lambda>0\ :\ \sum_{k}\phi\left(\frac{|x_{k}|}{\lambda}\right)\leq 1\}.$

Let’s check if this really is a norm. To do so we make the following observation:

Lemma 2 If ${x\neq 0}$, then ${c = \|x\|_{\phi}}$ if and only if ${\sum_{k}\phi\left(\frac{|x_{k}|}{c}\right) = 1}$.

Proof: This basically follows by continuity of ${\phi}$ from the fact that for ${\lambda >c}$ we have ${\sum_{k}\phi\left(\frac{|x_{k}|}{\lambda}\right) \leq 1}$ and for ${\lambda < c}$ we have ${\sum_{k}\phi\left(\frac{|x_{k}|}{\lambda}\right) > 1}$. $\Box$

Lemma 3 ${\|x\|_{\phi}}$ is a norm on ${{\mathbb R}^{d}}$.

Proof: For ${x=0}$ we easily see that ${\|x\|_{\phi}=0}$ (since ${\phi(0)=0}$). Conversely, if ${\|x\|_{\phi}=0}$, then ${\limsup_{\lambda\rightarrow 0}\sum_{k}\phi\left(\tfrac{|x_{k}|}{\lambda}\right) \leq 1}$ but since ${\lim_{t\rightarrow\infty}\phi(t) = \infty}$ this can only hold if ${x_{1}=\cdots=x_{d}= 0}$. For positive homogeneity observe

$\displaystyle \begin{array}{rcl} \|tx\|_{\phi} & = & \inf\{\lambda>0\ :\ \sum_{k}\phi\left(\frac{|tx_{k}|}{\lambda}\right)\leq 1\}\\ & = & \inf\{|t|\mu>0\ :\ \sum_{k}\phi\left(\frac{|x_{k}|}{\mu}\right)\leq 1\}\\ & = & |t|\|x\|_{\phi}. \end{array}$

For the triangle inequality let ${c = \|x\|_{\phi}}$ and ${d = \|y\|_{\phi}}$ (which implies that ${\sum_{k}\phi\left(\tfrac{|x_{k}|}{c}\right)\leq 1}$ and ${\sum_{k}\phi\left(\tfrac{|y_{k}|}{d}\right)\leq 1}$). Then it follows

$\displaystyle \begin{array}{rcl} \sum_{k}\phi\left(\tfrac{|x_{k}+y_{k}|}{c+d}\right) &\leq& \sum_{k}\phi\left(\tfrac{c}{c+d}\tfrac{|x_{k}|}{c} +\tfrac{d}{c+d}\tfrac{|y_{k}|}{d}\right)\\ &\leq& \tfrac{c}{c+d}\underbrace{\sum_{k} \phi\left(\tfrac{|x_{k}|}{c}\right)}_{\leq 1} + \tfrac{d}{c+d}\underbrace{\sum_{k}\phi\left(\tfrac{|y_{k}|}{d}\right)}_{\leq 1}\\ &\leq& 1 \end{array}$

and this implies that ${c+d \geq \|x+y\|_{\phi}}$ as desired. $\Box$

As a simple exercise you can convince yourself that ${\phi(t) = t^{p}}$ leads to ${\|x\|_{\phi} = \|x\|_{p}}$.
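To spell the exercise out (a one-line computation, not in the original post): for ${\phi(t)=t^{p}}$ the constraint in the infimum can be solved for ${\lambda}$ directly,

$\displaystyle \|x\|_{\phi} = \inf\left\{\lambda>0\ :\ \sum_{k}\frac{|x_{k}|^{p}}{\lambda^{p}}\leq 1\right\} = \inf\left\{\lambda>0\ :\ \lambda \geq \left(\sum_{k}|x_{k}|^{p}\right)^{1/p}\right\} = \|x\|_{p}.$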

Let us see how the Luxemburg norm looks for other functions.

Example 1 Let’s take ${\phi(t) = \exp(t)-1}$.

The function ${\phi}$ fulfills the conditions we need and here are the level lines of the functional ${x\mapsto \phi^{-1}\left(\sum_{k}\phi(|x_{k}|)\right)}$ (which is not a norm):

[Levels are ${0.5, 1, 2, 3}$]

The picture shows that this functional is not a norm, as the shape of the “norm balls” changes with the size. In contrast to that, the level lines of the respective Luxemburg norm look like this:

[Levels are ${0.5, 1, 2, 3}$]

I have an open position for a PhD student – here is the official job ad:

The group of Prof. Dirk Lorenz at the Institute of Analysis and Algebra has an open PhD position for a Scientific Assistant (75% TV-L EG 13). The position is available as soon as possible and is initially limited to three years.

The scientific focus of the group includes optimization for inverse problems, machine learning, and mathematical imaging. Besides teaching and research, the position includes work in projects or the preparation of projects.

We offer

• a dynamic team and a creative research and work environment
• mentoring and career planning programs (offered by TU Braunschweig), possibilities for personal qualification, language courses and the possibility to participate in international conferences
• flexible work hours and a family friendly work environment.

We are looking for candidates with

• an above-average degree (Master’s or Diploma) in mathematics,
• a focus on optimization and/or numerical mathematics and applications, e.g. imaging or machine learning
• programming skills in MATLAB, Julia and/or Python
• good knowledge of German and/or English
• capacity for teamwork, independent work, high level of motivation and organizational talent

Equally qualified severely challenged persons will be given preference. The TU Braunschweig especially encourages women to apply for the position. Please send your application including CV, copies of certificates and letters of recommendation (if any) in electronic form via e-mail to Dirk Lorenz.

Contact Prof. Dirk Lorenz | Tel. +49 531 391 7423| d.lorenz@tu-braunschweig.de

I can’t claim that I am an expert in machine learning. I’d rather say that I am merely a tourist in this area. Anyway, here is a small piece of thought on how (supervised) machine learning imitates human learning.

What are some features of human learning? Ideally, humans aim to understand a subject. To achieve this, they study examples, try to make their own deductions, do experiments, make predictions and test them. The overall goal is to get to the heart of things.

What are the features of so-called supervised machine learning? The methods get training data, i.e. pairs of input and output that match. The foremost goal of the method is to perform well on test data, i.e. to produce correct output for an input the method hasn’t seen before. In practice, one sets up a fairly general model (such as a neural network or a kernelized support vector machine) and often does as little modeling of the task at hand as possible.

This does not sound as though supervised machine learning and human learning are the same or even related. Their goals and methods are genuinely different.

But let us look at how humans learn for a very specific task: preparing for an exam. Recently I had to prepare several larger exams in mathematics for engineers, each with a hundred-plus students, and got to think about how they approach the task of learning. When the exam comes closer, the interactions with the students get more frequent. I had a large “ask anything” meeting, I had students coming to office hours, and I had “digital office hours” where the students could ask questions via a messenger in a chat room. So I had quite some interaction and could get a little insight into their way of learning, into their problems and progress.

Here are some observations of how the students tried to learn: The questions I got were mostly about the exercises we had handed out (in other words, the students asked for information on the “training data”). They were studying heavily on these exercises, barely using the textbook or their lecture notes to look up theorems or definitions (in other words, some were working “model-free” or with a “general purpose model” which says something like “do computations following general rules”). They work under the assumption that the exam is made up of questions similar to the exercises (and actually, this is a reasonable assumption – I announced this in class) (in other words, the test data comes from the same distribution as the training data).

Viewed like this, the learning of humans (for an exam, that is) and machine learning sound much more similar. And the similarities do not end here. Some known problems with machine learning methods can also be observed with the students: Students get stuck in local minima (they reach a point where further improvement by revising the seen data is impossible – even though they could, in principle, learn more from the given exercises, they keep going down the known paths, practicing the known computations, not learning new techniques). Students overfit to the training data (on the test data, aka the exam, they face new problems and oftentimes apply the learned methods to tasks where they don’t work, getting wrong results which would be correct if the problem were a little different). The trained students are vulnerable to adversarial attacks (for every problem I posed as an exercise I could make a slight change that would confuse most students). Also, similar to recent observations in machine learning, overparametrization helps to avoid overfitting and spurious local valleys: when the students have more techniques at hand, which corresponds to a more flexible machine learning method, they do better on unseen data and do not get stuck at bad local minima where no improvement is possible.

Granted, some observations are a bit of a stretch, but still, in conclusion, I’d advocate replacing the term “machine learning” with machine cramming (the German version would be maschinelles Büffeln or maschinelles Pauken).

In sparse recovery one aims to leverage the sparsity of some vector ${u}$ to get a good reconstruction of this vector from a noisy and indirect measurement ${u^{0}}$. Sparsity can be expressed in two different ways: The analysis way and the synthesis way. For both of them you need a system of other vectors ${v_{k}}$.

• The analysis way: Here you consider the inner products ${\langle u,v_{k}\rangle}$ and the sparsity assumption is, that this sequence is sparse.
• The synthesis way: Here you assume that there is a sparse sequence of coefficients ${\psi_{k}}$ such that ${u=\sum_{k} \psi_{k}v_{k}}$.

The first approach is called “analysis” since here one analyzes the vector ${u}$ with the system of vectors. In the second approach, you use the vectors ${v_{k}}$ to synthesize the vector ${u}$.

We will keep things very simple here and only consider a noisy observation (and not an indirect one). The recovery approach in the analysis way is then to find ${u}$ as a minimizer of

$\displaystyle \min_{u}\tfrac12\|u-u^{0}\|^{2} + \alpha R(\langle u,v_{k}\rangle)$

while for the synthesis way you write ${u = \sum_{k} \psi_{k}v_{k} = V\psi}$ (where we collected the vectors ${v_{k}}$ as the columns of the matrix ${V}$) and solve

$\displaystyle \min_{\psi}\tfrac12\|V\psi-u^{0}\|^{2} + \alpha R(\psi).$

The functional ${R}$ should enforce the sparsity of the vector of inner products in the former case and the sparsity of ${\psi}$ in the latter case. With the matrix ${V}$, the analysis way reads as

$\displaystyle \min_{u}\tfrac12\|u-u^{0}\|^{2} + \alpha R(V^{T}u).$

One important observation is that both approaches coincide if the matrix ${V}$ is orthonormal: Since ${V^{-1} = V^{T}}$ in this case, ${u=V\psi}$ is equivalent to ${\psi = V^{T}u}$, i.e. ${\psi_{k} = \langle u,v_{k}\rangle}$. For other sets of vectors ${v_{k}}$, this is not true, and it is not so clear how the analysis and the synthesis approach are related. In this post I’d like to show one instance where the two approaches lead to results that are not even close to being similar.
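The orthonormal case is easy to check numerically; here is a small sketch (my own, with a random orthonormal ${V}$ from a QR factorization):

```python
import numpy as np

rng = np.random.default_rng(1)
V, _ = np.linalg.qr(rng.standard_normal((5, 5)))  # random orthonormal columns

u = rng.standard_normal(5)
psi = V.T @ u          # analysis coefficients ψ_k = ⟨u, v_k⟩
u_back = V @ psi       # synthesis with the very same coefficients
assert np.allclose(u_back, u)   # both viewpoints coincide for orthonormal V
print(np.max(np.abs(u_back - u)))
```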

1. The “ROF analysis” approach to denoising

If ${u^{0}}$ is an image, the famous Rudin-Osher-Fatemi denoising model fits under the “analysis” framework: The vectors ${v_{k}}$ are all the finite differences of neighboring pixels such that ${V^{T}u = \nabla u}$ with the discrete gradient ${\nabla u}$. As sparsifying functional you take the mixed ${2,1}$-norm, i.e. ${R(\nabla u) = \|\nabla u\|_{2,1} = \sum_{ij}\sqrt{(\partial_{1}u_{ij})^{2} + (\partial_{2}u_{ij})^{2}}}$. The denoised ${u}$ is the solution to

$\displaystyle \min_{u} \tfrac12\|u-u^{0}\|^{2} + \alpha\|\nabla u\|_{2,1}$

Thus, the model will lead to some denoised ${u}$ with a sparse gradient. This sparse gradient is partly the reason for the success of the model in that it keeps edges, but it is also responsible for the drawback of this approach: The denoised image will be mostly piecewise constant.

2. The “ROF synthesis” approach

As ${V^{T} = \nabla}$, the corresponding synthesis approach would be to solve

$\displaystyle \min_{\psi}\tfrac12\|\nabla^{T}\psi - u^{0}\|^{2} + \alpha \|\psi\|_{2,1}$

and then set ${u = \nabla^{T}\psi}$. In this approach, the denoised image will be a sparse linear combination of particular two-pixel images. If you think about it, it does not really sound like a good idea to approximate an image by a sparse linear combination of two-pixel images. (One technicality: The operator ${\nabla^{T}}$ is not onto – ${\nabla}$ has a non-trivial kernel consisting of the constant images, and hence all images in the range of ${\nabla^{T}}$ have zero mean. Thus, to get something remotely close to ${u^{0}}$, one should subtract the mean of ${u^{0}}$ before the reconstruction and add it back afterwards.)
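The zero-mean technicality is easy to verify on a toy discretization (my own sketch, using a one-dimensional forward-difference matrix as a stand-in for ${\nabla}$):

```python
import numpy as np

def grad_1d(n):
    """Forward-difference matrix D with (D u)_i = u_{i+1} - u_i, a 1-D
    stand-in for the discrete gradient ∇."""
    D = np.zeros((n - 1, n))
    for i in range(n - 1):
        D[i, i], D[i, i + 1] = -1.0, 1.0
    return D

D = grad_1d(6)
# constant signals lie in the kernel of ∇ ...
assert np.allclose(D @ np.ones(6), 0.0)
# ... hence everything in the range of ∇^T has zero mean
psi = np.random.default_rng(0).standard_normal(5)
print((D.T @ psi).sum())   # numerically zero
```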

Here are some results:

Note that the results actually fit what we would have predicted: For larger ${\alpha}$ we get a “sparser gradient” (fewer edges, less variation) for the analysis approach, and “sparser two-pixel combinations” for the synthesis approach. In fact, for the synthesis approach to give something that is remotely close to the image, one needs quite a lot of two-pixel images, i.e. a very low sparsity.

In conclusion, one should not consider the analysis and the synthesis approach to be similar.

Incidentally, the dual of the analysis approach

$\displaystyle \min_{u} \tfrac12\|u-u^{0}\|^{2} + \alpha\|\nabla u\|_{2,1}$

can be written as

$\displaystyle \min_{\|\phi\|_{2,\infty}\leq\alpha}\tfrac12\|\nabla^{T}\phi-u^{0}\|^{2} = \min_{\phi}\tfrac12\|\nabla^{T}\phi-u^{0}\|^{2} + I_{\|\cdot\|_{2,\infty}\leq\alpha}(\phi)$

which looks related to the synthesis approach (one basically replaces the ${2,1}$-norm by its conjugate function). Actually, a similar relation between the dual of the analysis approach and the synthesis approach always holds.
