### Imaging

In sparse recovery one aims to leverage the sparsity of some vector ${u}$ to get a good reconstruction of this vector from a noisy and indirect measurement ${u^{0}}$. Sparsity can be expressed in two different ways: The analysis way and the synthesis way. For both of them you need a system of other vectors ${v_{k}}$.

• The analysis way: Here you consider the inner products ${\langle u,v_{k}\rangle}$ and the sparsity assumption is, that this sequence is sparse.
• The synthesis way: Here you assume that there is sparse sequence of coefficients ${\psi_{k}}$ such that ${u=\sum_{k} \psi_{k}v_{k}}$.

The first approach is called “analysis” since here one analyzes the vector ${u}$ with the system of vectors. In the second approach, you use the vectors ${v_{k}}$ to synthesize the vector ${u}$.

We will keep things very simple here and only consider a noisy observation (and not an indirect one). The recovery approach in the analysis way is then to find ${u}$ as a solution of

$\displaystyle \tfrac12\|u-u_{0}\| + \alpha R(\langle u,v_{k}\rangle)$

while for the synthesis way you write ${u = \sum_{k} \psi_{k}v_{k} = V\psi}$ (where we collected the vectors ${v_{k}}$ as the columns of the matrix ${V}$) and solve

$\displaystyle \tfrac12\|V\psi-u_{0}\| + \alpha R(\psi)$

The functional ${R}$ should enforce the sparsity of the vector of coefficients in the former case and the sparsity of ${\psi}$ in the latter case. With the matrix ${V}$ the analysis way reads as

$\displaystyle \tfrac12\|u-u_{0}\| + \alpha R(V^{T}u)$

One important observation is, that both approaches coincide if the matrix ${V}$ is orthonormal: Since ${V^{-1} = V^{T}}$ in this case, ${u=V\psi}$ is equivalent to ${\psi = V^{T}u}$, i.e. ${\psi_{k} = \langle u,v_{k}\rangle}$. For other sets of vectors ${v_{k}}$, this is not true and it is not so clear, how the analysis and the synthesis approach are related. In this post I’d like to show one instance where both approaches do lead to results that are not even close to being similar.

1. The “ROF analysis” approach to denoising

If ${u^{0}}$ is an image, the famous Rudin-Osher-Fatemi denoising model fits under the “analysis” framework: The vectors ${v_{k}}$ are all the finite differences of neighboring pixels such that ${V^{T}u = \nabla u}$ with the discrete gradient ${\nabla u}$. As sparsifying functional you take a mixed ${2}$${1}$ norm, i.e. ${R(\nabla u) = \|\nabla u\|_{2,1} = \sum_{ij}\sqrt{(\partial_{1}u_{ij})^{2} + (\partial_{2}u_{ij})^{2}}}$. The denoised ${u}$ is the solution to

$\displaystyle \min_{u} \tfrac12\|u-u^{0}\|^{2} + \alpha\|\nabla u\|_{2,1}$

Thus, the model will lead to some denoised ${u}$ with a sparse gradient. This sparse gradient is partly the reason for the success of the model in that it keeps edges, but is also responsible for the drawback of this approach: The denoised image will be mostly piesewise constant.

2. The “ROF synthesis” approach

As ${V^{T} = \nabla}$, the corresponding synthesis approach would be to solve

$\displaystyle \min_{\psi}\tfrac12\|\nabla^{T}\psi - u^{0}\|^{2} + \alpha \|\psi\|_{2,1}$

and the set ${u = \nabla^{T}\psi}$. In this approach, the denoised image will be a sparse linear combination of particular two-pixel images. If you think about that, it does not really sound like a good idea to approximate an image by a sparse linear combination of two-pixel images. (One technicality: The operator ${\nabla^{T}}$ is not onto – ${\nabla}$ has a non-trivial kernel, consisting of the constant images and hence, all images in the range of ${\nabla^{T}}$ have zero mean. Thus, to get something remotely close to ${u^{0}}$ one should subtract the mean of ${u^{0}}$ before the reconstruction and add it back afterwards.)

Here are some results:

Note that the results actually fit to what we would have predicted: For larger ${\alpha}$ we get a “sparser gradient” (less edges, less variation) for the analysis approach, and “sparser two-pixel combinations” for the synthesis approach. In fact, for the synthesis approach to give something that is remotely close to the image, one needs quite a lot two-pixel images, i.e. a very low sparsity.

In conclusion, one should not consider the analysis and synthesis approach to be something similar.

Incidentally, the dual of the analysis approach

$\displaystyle \min_{u} \tfrac12\|u-u^{0}\|^{2} + \alpha\|\nabla u\|_{2,1}$

can be written as

$\displaystyle \min_{\|\phi\|_{2,\infty}\leq\alpha}\tfrac12\|\nabla^{T}\phi-u^{0}\|^{2} = \min_{\phi}\tfrac12\|\nabla^{T}\phi-u^{0}\|^{2} + I_{\|\cdot\|_{2,\infty}\leq\alpha}(\phi)$

which looks related to the synthesis approach (one basically replaces the ${2,1}$-norm by its conjugate function). Actually, a similar relation between the dual of the analysis approach and the synthesis approach always holds.

Yesterday I uploaded the paper An extended Perona-Malik model based on probabilistic models by Lars Mescheder and myself to the arXiv. Recently I already blogged about this work, so I do not have to add that much. The main theme of the work was that if we have an image that is a blurred and noisy version of some true image, we formulate the reconstruction via Bayesian statistics. As a prior model for images we used a Gaussian scale mixture, i.e. we have a latent variable $z$ (in every pixel $x$) and the joint prior for the image $u$ and the latent variable $z$ is

$p(u,z) \propto \exp\Big(-\sum_x \frac{z(x)}2 |\nabla u(x)|^2 + v(z(x))\Big)$

where $x$ denotes the pixels, $\nabla$ is the discrete gradient of the image and $v$ is some non-negative function defined on the non-negative reals. Besides algorithms to approximate the MAP estimate, Lars proposed a mean-field approximation which does not calculate the most probable image, but iteratively approximates the posterior distribution by distributions which factorize over the variables $u$ and $z$. Using some further approximation (since the resulting algorithm in its plain form is not implementable) one arrives at an algorithm which some keeps more uncertainty in $u$ and, in practice, gives a point estimate for the denoised image that seems more “representative”. Here is an example:

This is the blurred and noisy image (the blur is motions blur and implemented with periodic boundary conditions for simplicity):

The next image is the approximation of the MAP estimate we got and you see the usual drawbacks. Since the MAP neglects all uncertainty in $u$ and maximizes the posterior probability, the image is “too sharp” in the way that smooth transitions (e.g. at the lighthouse) turn into piecewise constant stairs. Also the rocks appear very blocky.

Here is the result from the mean-field approximation. Since uncertainty in $u$ is taken into account, the image does not suffer from staircasing and also has a more natural appeal, most prominently in the part with the rocks.

The paper has some more examples and also shows another relation to Mumford-Shah denoising (loosely speaking, one uses a discrete latent variable $z$ to serve as a binary variable to say if a pixel is an edge or not). Oh, by the way, the algorithms behind the mean-field approximation use some parts of more general duality between random variables that Lars and his co-authors develop in another paper.

Currently I am at the SIAM Imaging conference in Hong Kong. It’s a great conference with great people at a great place. I am pretty sure that this will be the only post from here, since the conference is quite intense. I just wanted to report on two ideas that have become clear here, although, they are both pretty easy and probably already widely known, but anyway:

1. Non-convex + convex objective

There are a lot of talks that deal with optimization problems of the form

$\displaystyle \min_u F(u) + G(u).$

Especially, people try to leverage as much structure of the functionals ${F}$ and ${G}$ as possible. Frequently, there arises a need to deal with non-convex parts of the objective, and indeed, there are several approaches around that deal in one way or another with non-convexity of ${F}$ or even ${F+G}$. Usually, in the presence of an ${F}$ that is not convex, it is helpful if ${G}$ has favorable properties, e.g. that still ${F+G}$ is bounded from below, coercive or even convex again. A particularly helpful property is strong convexity of ${G}$ (i.e. ${G}$ stays convex even if you subtract ${\epsilon/2\|\cdot\|^2}$ from it). Here comes the simple idea: If you already allow ${F}$ to be non-convex, but only have a ${G}$ that is merely convex, but not strongly so, you can modify your objective to

$\displaystyle \underbrace{F(u) - \tfrac\epsilon2\|u\|^2}_{\leftarrow F(u)} + \underbrace{G(u) + \tfrac\epsilon2\|u\|^2}_{\leftarrow G(u)}$

for some ${\epsilon>0}$. This will give you strong convexity of ${G}$ and an ${F}$ that is (often) theoretically no worse than it used to be. It appeared to me that this is an idea that Kristian Bredies told me already almost ten years ago and which me made into a paper (together with Peter Maaß) in 2005 which got somehow delayed and published no earlier than 2009.

2. Convex-concave saddle point problems

If your problem has the form

$\displaystyle \min_u F(u) + G(Ku)$

with some linear operator ${K}$ and both ${F}$ and ${G}$ are convex, it has turned out, that it is tremendously helpful for the solution to consider the corresponding saddle point formulation: I.e. using the convex conjugate ${G^*}$ of ${G}$, you write

$\displaystyle \min_u \max_v F(u) + \langle Ku, v\rangle -G^*(v).$

A class of algorithms, that looks like to Arrow-Hurwicz-method at first glance, has been sparked be the method proposed by Chambolle and Pock. This method allows ${F}$ and ${G}$ to be merely convex (no smoothness or strong convexity needed) and only needs the proximal operators for both ${F}$ and ${G^*}$. I also worked on algorithms for slightly more general problems, involving a reformulation of the saddle point problem as a monotone inclusion, with Tom Pock in the paper An accelerated forward-backward algorithm for monotone inclusions and I also should mention this nice approach by Bredies and Sun who consider another reformulation of the monotone inclusion. However, in the spirit of the first point, one should take advantage of all the available structure in the problem, e.g. smoothness of one of the terms. Some algorithm can exploit smoothness of either ${F}$ or ${G^*}$ and only need convexity of the other term. An idea, that has been used for some time already, to tackle the case if ${F}$, say, is a sum of a smooth part and a non-smooth part (and ${G^*}$ is not smooth), is, to dualize the non-smooth part of ${F}$: Say we have ${F = F_1 + F_2}$ with smooth ${F_1}$, then you could write

$\displaystyle \begin{array}{rcl} &\min_u\max_v F_1(u) + F_2(u) + \langle Ku, v\rangle -G^*(v)\\ & \qquad= \max_u \min_{v,w} F_1(u) + \langle u,w\rangle + \langle Ku, v\rangle -G^*(v) - F_2^*(w) \end{array}$

and you are back in business, if your method allows for sums of convex functions in the dual. The trick got the sticky name “dual transportation trick” in a talk by Marc Teboulle here and probably that will help, that I will not forget it from now on…

I fell a little bit behind on reporting on my new preprints. In this posts I’ll blog on two closely related ones; one of them already a bit old, the other one quite recent:

The papers are

As clear from the titles, both papers treat a similar method. The first paper contains all the theory and the second one has few particularly interesting applications.

In the first paper we propose to view several known algorithms such as the linearized Bregman method, the Kaczmarz method or the Landweber method from a different angle from which they all are special cases of another algorithm. To start with, consider a linear system

$\displaystyle Ax=b$

with ${A\in{\mathbb R}^{m\times n}}$. A fairly simple and old method to solve this, is the Landweber iteration which is

$\displaystyle x^{k+1} = x^k - t_k A^T(Ax^k-b).$

Obviously, this is nothing else than a gradient descent for the functional ${\|Ax-b\|_2^2}$ and indeed converges to a minimizer of this functional (i.e. a least squares solution) if the stepsizes ${t_k}$ fulfill ${\epsilon\leq t_k\leq 2\|A\|^{-2} - \epsilon}$ for some ${\epsilon>0}$. If one initializes the method with ${x^0=0}$ it converges to the least squares solution with minimal norm, i.e. to ${A^\dag b}$ (with the pseudo-inverse ${A^\dag}$).

A totally different method is even older: The Kaczmarz method. Denoting by ${a_k}$ the ${k}$-th row of ${A}$ and ${b_k}$ the ${k}$-th entry of ${b}$ the method reads as

$\displaystyle x^{k+1} = x^k - a_{r(k)}^T\frac{a_{r(k)}\cdot x^k - b_k}{\|a_{r(k)}\|_2^2}$

where ${r(k) = (k\mod m) +1}$ or any other “control sequence” that picks up every index infinitely often. This method also has a simple interpretation: Each equation ${a_k\cdot x = b_k}$ describes a hyperplane in ${{\mathbb R}^n}$. The method does nothing else than projecting the iterates orthogonally onto the hyperplanes in an iterative manner. In the case that the system has a solution, the method converges to one, and if it is initialized with ${x^0=0}$ we have again convergence to the minimum norm solution ${A^\dag b}$.

There is yet another method that solves ${Ax=b}$ (but now it’s a bit more recent): The iteration produces two sequences of iterates

$\displaystyle \begin{array}{rcl} z^{k+1} & = &z^k - t_k A^T(Ax^k - b)\\ x^{k+1} & = &S_\lambda(z^{k+1}) \end{array}$

for some ${\lambda>0}$, the soft-thresholding function ${S_\lambda(x) = \max(|x|-\lambda,0)\mathrm{sgn}(x)}$ and some stepsize ${t_k}$. For reasons I will not detail here, this is called the linearized Bregman method. It also converges to a solution of the system. The method is remarkably similar, but different from, the Landweber iteration (if the soft-thresholding function wouldn’t be there, both would be the same). It converges to the solution of ${Ax=b}$ that has the minimum value for the functional ${J(x) = \lambda\|x\|_1 + \tfrac12\|x\|_2^2}$. Since this solution of close, and for ${\lambda}$ large enough identical, to the minimum ${\|\cdot\|_1}$ solution, the linearized Bregman method is a method for sparse reconstruction and applied in compressed sensing.

Now we put all three methods in a joint framework, and this is the framework of split feasibility problems (SFP). An SFP is a special case of a convex feasibility problems where one wants to find a point ${x}$ in the intersection of multiple simple convex sets. In an SFP one has two different kinds of convex constraints (which I will call “simple” and “difficult” in the following):

1. Constraints that just demand that ${x\in C_i}$ for some convex sets ${C_i}$. I call these constraints “simple” because we assume that the projection onto each ${C_i}$ is simple to obtain.
2. Constraints that demand ${A_ix\in Q_i}$ for some matrices ${A_i}$ and simple convex sets ${Q_i}$. Although we assume that projections onto the ${Q_i}$ are easy, these constraints are “difficult” because of the presence of the matrices ${A_i}$.

If there were only simple constraints a very basic method to solve the problem is the methods of alternating projections, also known as POCS (projection onto convex sets): Simply project onto all the sets ${C_i}$ in an iterative manner. For difficult constraints, one can do the following: Construct a hyperplane ${H_k}$ that separates the current iterate ${x^k}$ from the set defined by the constraint ${Ax\in Q}$ and project onto the hyperplane. Since projections onto hyperplanes are simple and since the hyperplane separates we move closer to the constraint set and this is a reasonable step to take. One such separating hyperplane is given as follows: For ${x^k}$ compute ${w^k = Ax^k-P_Q(Ax^k)}$ (with the orthogonal projection ${P_Q}$) and define

$\displaystyle H_k = \{x\ : (A^Tw^k)^T\cdot x \leq (A^Tw^k)^T\cdot x^k - \|w^k\|_2^2\}.$

Illustration of projections onto convex sets and separating hyperplanes

Now we already can unite the Landweber iteration and the Kaczmarz method as follows: Consider the system ${Ax=b}$ as a split feasibility problem in two different ways:

1. Treat ${Ax=b}$ as one single difficult constraint (i.e. set ${Q=\{b\}}$). Some calculations show that the above proposed method leads to the Landweber iteration (with a special stepsize).
2. Treat ${Ax=b}$ as ${m}$ simple constraints ${a_i\cdot x = b_i}$. Again, some calculations show that this gives the Kaczmarz method.

Of course, one could also work “block-wise” and consider groups of equations as difficult constraints to obtain “block-Kaczmarz methods”.

Now comes the last twist: By adapting the term of “projection” one gets more methods. Particularly interesting is the notion of Bregman projections which comes from Bregman distances. I will not go into detail here, but Bregman distances are associated to convex functionals ${J}$ and by replacing “projection onto ${C_i}$ or hyperplanes” by respective Bregman projections, one gets another method for split feasibility problems. The two things I found remarkable:

• The Bregman projection onto hyperplanes is pretty simple. To project some ${x^k}$ onto the hyperplane ${H = \{x\ :\ a^T\cdot x\leq \beta\}}$, one needs a subgradient ${z^k\in\partial J(x^k)}$ (in fact an “admissible one” but for that detail see the paper) and then performs

$\displaystyle x^{k+1} = \nabla J^*(z^k - t_k a)$

(${J^*}$ is the convex dual of ${J}$) with some appropriate stepsize ${t_k}$ (which is the solution of a one-dimensional convex minimization problem). Moreover, ${z^{k+1} = z^k - t_k a}$ is a new admissible subgradient at ${x^{k+1}}$.

• If one has a problem with a constraint ${Ax=b}$ (formulated as an SFP in one way or another) the method converges to the minimum-${J}$ solution of the equation if ${J}$ is strongly convex.

Note that strong convexity of ${J}$ implies differentiability of ${J^*}$ and Lipschitz continuity of ${\nabla J}$ and hence, the Bregman projection can indeed be carried out.

Now one already sees how this relates to the linearized Bregman method: Setting ${J(x) = \lambda\|x\|_1 + \tfrac12\|x\|_2^2}$, a little calculation shows that

$\displaystyle \nabla J^*(z) = S_\lambda(z).$

Hence, using the formulation with a “single difficult constraint” leads to the linearized Bregman method with a specific stepsize. It turns out that this stepsize is a pretty good one but also that one can show that a constant stepsize also works as long as it is positive and smaller that ${2\|A\|^{-2}}$.

In the paper we present several examples how one can use the framework. I see one strengths of this approach that one can add convex constraints to a given problem without getting into any trouble with the algorithmic framework.

The second paper extends a remark that we make in the first one: If one applies the framework of the linearized Bregman method to the case in which one considers the system ${Ax=b}$ as ${m}$ simple (hyperplane-)constraints one obtains a sparse Kaczmarz solver. Indeed one can use the simple iteration

$\displaystyle \begin{array}{rcl} z^{k+1} & = &z^k - a_{r(k)}^T\frac{a_{r(k)}\cdot x^k - b_k}{\|a_{r(k)}\|_2^2}\\ x^{k+1} & = &S_\lambda(z^{k+1}) \end{array}$

and will converge to the same sparse solution as the linearized Bregman method.

This method has a nice application to “online compressed sensing”: We illustrate this in the paper with an example from radio interferometry. There, large arrays of radio telescopes collect radio emissions from the sky. Each pair of telescopes lead to a single measurement of the Fourier transform of the quantity of interest. Hence, for ${k}$ telescopes, each measurement gives ${k(k-1)/2}$ samples in the Fourier domain. In our example we used data from the Very Large Array telescope which has 27 telescopes leading to 351 Fourier samples. That’s not much, if one want a picture of the emission with several ten thousands of pixels. But the good thing is that the Earth rotates (that’s good for several reasons): When the Earth rotates relative to the sky, the sampling pattern also rotates. Hence, one waits a small amount of time and makes another measurement. Commonly, this is done until the earth has made a half rotation, i.e. one complete measurement takes 12 hours. With the “online compressed sensing” framework we proposed, one can start reconstructing the image as soon the first measurements have arrived. Interestingly, one observes the following behavior: If one monitors the residual of the equation, it goes down during iterations and jumps up when new measurements arrive. But from some point on, the residual stays small! This says that the new measurements do not contradict the previous ones and more interestingly this happened precisely when the reconstruction error dropped down such that “exact reconstruction” in the sense of compressed sensing has happened. In the example of radio interferometry, this happened after 2.5 hours!

Reconstruction by online compressed sensing

You can find slides of a talk I gave at the Sparse Tomo Days here.

Today I’d like to collect some comments one a few papers I stumbled upon recently on the arXiv.

1. TGV minimizers in 1D

First, about a month ago two very similar paper appeared in the same week:

Both papers treat the recently proposed “total generalized variation” model (which is a somehow-but-not-really-higher-order generalization of total variation). The total variation of a function ${u\in L^1(\Omega)}$ (${\Omega\subset{\mathbb R}^d}$) is defined by duality

$\displaystyle TV(u) = \sup\Big\{\int_\Omega \mathrm{div} \phi\, u\,dx\ :\ \phi\in C^\infty_c(\Omega,{\mathbb R}^d), |\phi|\leq 1\Big\}.$

(Note that the demanded high regularity of the test functions ${\phi}$ is not essential here, as we take a supremum over all these functions under the only, but important, requirement that the functions are bounded. Test functions from ${C^1_c(\Omega,{\mathbb R}^d)}$ would also do.)

Several possibilities for extensions and generalization of the total variation exist by somehow including higher order derivatives. The “total generalized variation” is a particular successful approach which reads as (now using two non-negative parameter ${\alpha,\beta}$ which do a weighting):

$\displaystyle TGV_{\beta,\alpha}^2(u) = \sup\Big\{\int_\Omega \mathrm{div}^2 \phi\, u\,dx\ :\ \phi\in C^\infty_c(\Omega,S^{d\times d}),\ |\phi|\leq \beta,\ |\mathrm{div}\phi|\leq \alpha\Big\}.$

To clarify some notation: ${S^{d\times d}}$ are the symmetric ${d\times d}$ matrices, ${\mathrm{div}^n}$ is the negative adjoint of ${\nabla^n}$ which is the differential operator that collects all partial derivatives up to the ${n}$-th order in a ${d\times\cdots\times d}$-tensor. Moreover ${|\phi|}$ is some matrix norm (e.g. the Frobenius norm) and ${|\mathrm{div}\phi|}$ is some vector norm (e.g. the 2-norm).

Both papers investigate so called denoising problems with TGV penalty and ${L^2}$ discrepancy, i.e. minimization problems

$\displaystyle \min_u \frac12\int_\Omega(u-u^0)^2\, dx + TGV_{\alpha,\beta}^2(u)$

for a given ${u^0}$. Moreover, both papers treat the one dimensional case and investigate very special cases in which they calculate minimizers analytically. In one dimension the definition of ${TGV^2}$ becomes a little more familiar:

$\displaystyle TGV_{\beta,\alpha}^2(u) = \sup\Big\{\int_\Omega \phi''\, u\,dx\ :\ \phi\in C^\infty_c(\Omega,{\mathbb R}),\ |\phi|\leq \beta,\ |\phi'|\leq \alpha\Big\}.$

Some images of both papar are really similar: This one from Papafitsoros and Bredies

and this one from Pöschl and Scherzer

Although both paper have a very similar scopes it is worth to read both. The calculations are tedious but both paper try to make them accessible and try hard (and did a good job) to provide helpful illustrations. Curiously, the earlier paper cites the later one but not conversely…

2. Generalized conditional gradient methods

Another paper I found very interesting was

This paper shows a nice duality which I haven’t been aware of, namely the one between the subgradient descent methods and conditional gradient methods. In fact the conditional gradient method which is treated is a generalization of the conditional gradient method which Kristian and I also proposed a while ago in the context of ${\ell^1}$-minimization in the paper Iterated hard shrinkage for minimization problems with sparsity constraints: To minimize the sum

$\displaystyle F(u) + \Phi(u)$

with a differentiable ${F}$ and a convex ${\Phi}$ for which the subgradient of ${\Phi}$ is easily invertible (or, put differently, for which you can minimize ${\langle u,a\rangle + \Phi(u)}$ easily), perform the following iteration:

1. At iterate ${u^n}$ linearize ${F}$ but not ${\Phi}$ and calculate a new point ${v^n}$ by

$\displaystyle v^n = \mathrm{argmin}_v \langle F'(u^n),v\rangle + \Phi(v)$

2. Choose a stepsize ${s^n\in [0,1]}$ and set the next iterate as a convex combination of ${u^n}$ and ${v^n}$

$\displaystyle u^{n+1} = u^n + s_n(v^n - u^n).$

Note that for and indicator function

$\displaystyle \Phi(u) = \begin{cases} 0 & u\in C\\ \infty & \text{else} \end{cases}$

you obtain the conditional gradient method (also known as Frank-Wolfe method). While Kristian and I derived convergence with an asymptotic rate for the case of ${F(u) = \tfrac12\|Ku-f\|^2}$ and ${\Phi}$ strongly coercive, Francis uses the formulation ${F(u) = f(Au)}$ the assumption that the dual ${f^*}$ of ${f}$ has a bounded effective domain (which say that ${f}$ has linear growth in all directions). With this assumption he obtains explicit constants and rates also for the primal-dual gap. It was great to see that eventually somebody really took the idea from the paper Iterated hard shrinkage for minimization problems with sparsity constraints (and does not think that we do heuristics for ${\ell^0}$ minimization…).

I found this draft of a post in my backlog and decided that it would be better to finish it than to leave it unpublished. Actually, the story happened already over a year ago.

Some month ago I stumbled upon this piece of shadow art

by Fred Eerdekens and a few days later I received the information that my university was going to celebrate his yearly “open house event” this year as “TU Night”. The theme of the TU Night was “Night, Light, Energy” and all member were invited to submit ideas for talks, experiments and exhibits.

The piece of shadow art prompted the question “If this weired piece of metal can cast this sentence as a shadow, wouldn’t it be possible to produce another piece of metal that can produce two different sentences, when illuminated from different directions” Together with the announcement of the upcoming TU Night I thought if one could even produce an exhibit like this.

Since I am by no means an artists, I looked around at my university and found that there is a Department of Architecture. Since architects are much closer to being artist than I am, I contacted the department and proposed a collaboration and well, Henri Greil proposed to have a joint seminar on this topic. Hence, this summer term I made the experience and worked with students of architecture.

In the end, the student produced very nice pieces of shadow art:

Although the exhibits produced interesting and unexpected shadows, no group of students could make it and produce two different shadows out of the same object.

However, some nice effects can be produced pretty easy:

The basic idea is that moving one object around will move around both shadows rather independently. Well this is not totally true but what you can do is to “zoom” one shadow while moving the other sideways (just move the object straight towards one light source). See this movie for a small illustration:

I also did my best to produce a more complex object. While it is theoretically not very difficult to see that some given shadows are possible in some given projection geometry, it is not at all straight forward to produce the object theoretically (not to speak of the real world problems while building the piece). It tried hard but I could not do better than this:

In this post I will explore a bit the question on how to calculate the discrete gradient and the discrete divergence of an image and a vector field, respectively.

Let ${u_0\in{\mathbb R}^{N\times M}}$ be a discrete image, i.e. ${u_0(i,j)}$ denotes the gray value at the ${i,j}$-th pixel. The famous total variation denoising amounts to minimizing the functional

$\displaystyle \Phi(u) = \tfrac12 \int (u-u_0)^2 + \lambda\int|\nabla u|$

where the integral shall be understood as summation, ${\nabla}$ stand for the gradient, i.e. the vector of the partial derivatives, and ${|\cdot|}$ stands for the euclidean absolute value.

When using primal-dual methods, it is of crucial importance, that the used operators for the gradient and the divergence are numerically adjoint (up to the minus sign). That is, the numerical operations are adjoint in the following sense: If grad is the operation for the gradient and div is the operation for the divergence, then it should hold for any variables ${u}$ and ${v}$ of suitable size and with gradu = grad(u), divv = div(v) that sum(gradu(:).*v(:)) and -sum(u(:).*divv(:)) are equal up to numerical precision. Due to the boundary treatment, the internal MATLAB operations gradient and divergence do not fulfill this requirement.

The most common discretization of the gradient uses discrete forward differences and a constant padding at the boundary (which means that Neumann boundary values are applied). In formula, this reads as

$\displaystyle (D_xu)_{i,j} = \begin{cases} u_{i+1,j} - u_{i,j} & i

The respective adjoints are backward differences with zero boundary treatment (check it!). Apparently, there are many different ways to implement this routines (and the respective adjoints) in MATLAB. Here are four of them:

1. For loops: Run through all pixels in two for-loops and assign the difference to the output. Of course, one should preallocate the output prior to the loop. But you may probably know the first two rules of MATLAB coding? If not, here they are: 1. Avoid for-loops. 2. Avoid for-loops, seriously. I put the routines into extra functions and created anonymous function to call the gradient and the divergence as
grad = @(u) cat(3,dxp(u),dyp(u));
div = @(V) dxm(V(:,:,1)) + dym(V(:,:,2));

2. Shift and subtract: MATLAB is great in using vectors and matrices. And to avoid the for-loop one could also implement the forward difference in ${x}$-direction by shifting the matrix and subtract the original one, i.e. [u(:,2:end) u(:,end)] - u (and similarly for the other differences). Again, I wrote extra functions and used anonymous function as above.
3. Small sparse matrices from the left and from the right: MATLAB is also pretty good with sparse matrices. Since the derivatives in ${x}$-direction only involve the subtraction of two elements in the same column, one can realize this by multiplying an image from the left with a sparse diagonal matrix with just two non-zero diagonals. Similarly, the derivative in ${y}$-direction can be realized by multiplying from the right with a suitable matrix. More precisely, this approach is realized by
Dy = spdiags([-ones(M,1) ones(M,1)],[0 1],M,M);
Dy(M,:) = 0;
Dx = spdiags([-ones(N,1) ones(N,1)],[0 1],N,N);
Dy(N,:) = 0;
Dxu = Dx*u;
Dyu = u*Dy';


(check it). Note that the adjoint of ${x}$-derivative if simple the operation Dx'*u and the operation of the ${y}$-derivative is u*Dy. Together, the calculation of the gradient and the divergence was done by the anonymous functions

grad = @(u) cat(3,u*Dx',Dy*u);
div = @(V) V(:,:,1)*Dx + Dy'*V(:,:,2);

4. Large sparse matrices: One could think about the following: Vectorize the image by U = u(:) (which amounts to stacking the columns above each other). Then assemble a large ${NM\times NM}$ sparse matrix which has just two non-zero diagonals to do the forward (and other) differences. More precisely this can be done by
(with Dx and Dy from above)

DX = kron(Dx,speye(M));
DY = kron(speye(N),Dy);
DxU = DX*U;
DyU = DY*U;


Here, it is clear that the respective adjoint are just the multiplication with the transposed matrices. Here, the anonymous functions are

grad = @(u) [DX*u DY*u];
div = @(V) DX'*V(:,1) + DY'*V(:,2);


The different approaches have different pros and cons. Well, the for-loop only has cons: It is presumably slow and uses a lot indexing which easily leads to bugs. The shift-and-subtract method should go into an extra function to make it easy to use – but this is not necessarily a drawback. For the multiplication with the large matrix, one has to vectorize the image first and every time one wants to look at the result and need to do reshape(U,N,M). But let’s focus on speed: I implemented all methods, and let the run on square images of different sizes for 50 times (after a warmup) and measures the execution times with tic and toc. The assembly of the matrices did not enter the timing. Moreover, no parallel computation was used – just a single core. Finally, memory was not an issue since the larges matrices (of size ${2500\times 2500\times 2}$) only need roughly 100MB of memory.

Here is the table with the average times for one calculation of the gradient (in seconds):

 ${N}$ For-loop Shift-subtract left-right left 100 0.0004 0.0002 0.0001 0.0003 200 0.0018 0.0010 0.0005 0.0015 300 0.0057 0.0011 0.0014 0.0020 400 0.0096 0.0031 0.0022 0.0035 500 0.0178 0.0035 0.0030 0.0054 750 0.0449 0.0114 0.0097 0.0123 1000 0.0737 0.0189 0.0128 0.0212 1500 0.2055 0.0576 0.0379 0.0601 2000 0.3942 0.0915 0.0671 0.1136 2500 0.6719 0.1571 0.1068 0.1788

and here is the one for the divergences:

 ${N}$ For-loop Shift-subtract left-right left 100 0.0004 0.0003 0.0002 0.0002 200 0.0018 0.0015 0.0005 0.0008 300 0.0048 0.0016 0.0015 0.0012 400 0.0090 0.0036 0.0020 0.0022 500 0.0158 0.0057 0.0027 0.0035 750 0.0409 0.0132 0.0073 0.0069 1000 0.0708 0.0238 0.0130 0.0125 1500 0.2008 0.0654 0.0344 0.0370 2000 0.3886 0.1285 0.0622 0.0671 2500 0.6627 0.2512 0.1084 0.1361

As expected, the for-loop is clearly slower and also as expected, all methods basically scale quadratically (doubling ${N}$ amounts to multiplying the running time by four) since the work per pixel is constant. A little surprisingly, the multiplication from the left and from the right is fastest and also consistently a little faster then multiplication from the left with the larger sparse matrix. I don’t know, why the results are that different for the gradient and for the divergence. Maybe this is related to my use on anonymous functions or the allocation of memory?

Next Page »