Assume that we are given a probability distribution ${p}$ on ${{\mathbb R}^{n}}$ and for simplicity we further assume that ${p}$ is continuous, i.e. ${p(x)}$ somehow indicates, how likely ${x\in{\mathbb R}^{n}}$ is. To be more precise, the probability, that ${x\in A}$ for some (measurable) set ${A}$ is, is ${\int_{A}dp}$. We do I start like this? There are cases, in which one known some probability distribution, then obtains some ${x}$ and wants to know, if this ${x}$ was particularly representative for this distribution. One example of this situation is a statistical inverse problem where the model is ${y = Ax + \epsilon}$ with a linear (known) map ${A}$ and some error ${\epsilon}$. From some observed ${y}$, one wants to know as much as possible about the underlying ${x}$. Assuming that the noise ${\epsilon}$ has a known distribution ${p(\epsilon)}$ and that one has prior information on ${x}$ (i.e. likely ${x}$ are ones where the prior distribution ${p(x)}$ is large) one can model the following probability distributions:

$\displaystyle \begin{array}{rcl} p(y\mid x) = p(y-Ax) = p(\epsilon) &&\text{prob. to see}\ y,\ \text{provided}\ x\\ p(x) && \text{prior distribution of}\ x\\ p(x\mid y) = \frac{p(y\mid x)p(x)}{p(y)} && \text{prob. of}\ x,\ \text{provided}\ y; \text{posterior}. \end{array}$

The latter distribution, the posterior, is what one is interested in, i.e. given that one has observed ${y}$, what is the probability that some ${x}$ (which we may generate in some way) is the true solution? So, in principle, all information we have is contained in the posterior, but how to get on grip on it?

What we know are the distribution ${p(y\mid x)}$ and ${p(x)}$ but note that ${p(y)}$ is not know and is usually not simple to compute. Another expression for ${p(y)}$ is ${\int p(y\mid x)p(x) dy}$ since the right hand side in the definition of the posterior has to be a probability distribution. So to calculate ${p(y)}$ we have to evaluate one integral, but note that this is in an integral over ${{\mathbb R}^{n}}$ and in the context of statistical inverse problems, this ${n}$ is usually not in the ten-thousands, but easily in the millions. So we see, that the calculation of ${p(y)}$ is usually out of reach (unless all distributions are very simple). So where are we: If we get some ${x}$ we can calculate ${p(y\mid x)}$ and ${p(x)}$, but what would it say if ${p(y\mid x)p(x) =0.01}$, for example? Not much, because we do not know the the normalization constant. While ${0.01}$ seems pretty small and hence ${x}$ to be somehow unlikely, it may be that ${p(y) = 0.011}$ and then ${ p(x\mid y) = 0.01/0.011 \approx 0.91}$ which would make ${x}$ to be rather likely. On the other hand we would really would like some more information about ${p(x\mid y)}$ to judge about that ${x}$ since ${p(x\mid y)}$ may have a large variance and a value of ${0.01}$ may be among the largest possible values, again rendering ${x}$ quite likely, but subject to large variance. To recap:

1. We have a probability distribution ${p}$ but we can only generate values ${Cp(x)}$ for some unknown ${C>0}$.
2. The domain of definition of ${p}$ has a huge dimension.
3. We want to know as much as possible about ${p}$, e.g. its expected value, its variance or its mode…

Note that some questions about ${p}$ can be answered without knowing the normalization constant, e.g. the mode ${x^{*}}$, i.e. the ${x}$ which maximizes ${p}$. That may one prime reason while MAP (maximum-a-posterior) estimators are so widely used… One approach, to get more information out of ${p}$ is to use sampling.

Before we come to methods that can sample from unnormalized distributions, we describe what we actually mean by sampling, and give the main building blocks.

1. Sampling from simple distribution

Sampling from a distribution ${p}$ means a method to generate values ${x}$ such that the values are distributed according to ${p}$. Expressed in formula: For every integrable function ${f}$ and sequence ${x_{k}}$ of generated samples, we want that

$\displaystyle \begin{array}{rcl} \lim_{n\rightarrow\infty}\frac1n\sum_{k=1}^{n}f(x_{k}) = \int f(x)dp(x). \end{array}$

In the following we want to describe a few methods to generate samples. All the methods will build on the ability to sample from some simple basic distributions, and these are

1. The uniform distribution on ${[0,1]}$ denoted by ${\mathcal{U}(0,1)}$. On a computer this is possible to a good approximation by pseudorandom number generators like the Mersenne twister (used, e.g., in MATLAB).
2. The standard normal distribution denoted by ${\mathcal{N}(0,1)}$. If you generate pseudorandom normally distributed numbers, then, under the hood, the machine generates uniformly distributed numbers and cleverly transforms these to be normally distributed, e.g. with the Box-Muller method.

Note that simple scaling allows to sample from ${\mathcal{U}(a,b)}$ and ${\mathcal{N}(\mu,\sigma^{2})}$.

If we sample ${x}$ from some distribution ${p}$ we will denote this by

$\displaystyle \begin{array}{rcl} x\sim p \end{array}$

so ${x\sim \mathcal{U}(0,1)}$ means that ${x}$ is drawn from a uniform distribution over ${[0,1]}$.

2. Rejection sampling

For rejection sampling from ${p}$ (having access to ${Cp}$ only) we need a so-called proposal distribution ${q}$ from which we can sample (i.e. a uniform one or a normal one) and we need to know some ${M>0}$ such that

$\displaystyle \begin{array}{rcl} Mq(x) \geq Cp(x) \end{array}$

for all ${x}$. The sampling scheme goes as follows: Draw proposals from the proposal distribution and accept them based on some rule that will ensure the we will end up with samples distributed according to ${p}$. In more detail:

1. generate ${x\sim q}$, i.e. sample ${x}$ from the proposal distribution,
2. calculate the acceptance rate

$\displaystyle \begin{array}{rcl} \alpha = \frac{Cp(x)}{Mq(x)}, \end{array}$

3. generate ${t\sim \mathcal{U}(0,1)}$,
4. accept ${x}$ if ${t\leq \alpha}$, otherwise reject it.

Proposition 1 The samples generated by rejection sampling are distributed according to ${p}$.

Proof: Let us denote by ${A}$ the event that a sample ${x}$, drawn from ${q}$ has been accepted. Further, we denote by ${\pi(x\mid A)}$ the distribution of ${x}$, provided it has been accepted. So we aim to show that ${\pi(x\mid A) = p(x)}$.

To do so, we employ Bayes’ theorem and write

$\displaystyle \begin{array}{rcl} \pi(x\mid A) = \frac{\pi(A\mid x)\pi(x)}{\pi(A)}. \end{array}$

For the three probabilities on the right hand side we know:

• ${\pi(x)}$ is the distribution of the sample, i.e the proposal distribution, ${\pi(x) = q(x)}$.
• The probability ${\pi(A\mid x)}$ is the probability of acceptance, provided that ${x}$ has been sampled. Looking at the acceptance step (step 2), we see that this probability is exactly ${\alpha}$, i.e.

$\displaystyle \begin{array}{rcl} \pi(A\mid x) & = & \alpha = \frac{Cp(x)}{Mq(x)}. \end{array}$

• The probability of the event ${A}$ is the probability of acceptance, and this is given by the integral of the joint distribution ${\pi(x,A)}$ over the values ${x}$. Since the joint distribution fulfills ${\pi(x,A) = \pi(A\mid x)\pi(x)}$ and ${\pi(x) = q(x)}$ we get

$\displaystyle \begin{array}{rcl} \pi(A) & = & \int \pi(x,A) dx = \int \pi(A\mid x)q(x)dx\\ & = & \int \frac{Cp(x)}{Mq(x)}q(x) dx\\ & = & \int \frac{C}{M}p(x) dx = \frac{M}{C} \int p(x)dx = \frac{M}{C} \end{array}$

since ${p}$ is a probability distribution.

Together we get

$\displaystyle \begin{array}{rcl} \pi(x\mid A) = \frac{\frac{Cp(x)}{Mq(x)}\, q(x)}{\frac{C}{M}} = p(x). \end{array}$

$\Box$

The crucial steps to apply rejection sampling is to find a good proposal distribution with small constant ${M}$ (and also, to calculate ${M}$ can be tricky). As an example, consider that you want to sample from the tail of a standard normal distribution, e.g. you want to obtain “normally distributed random numbers larger that ${2}$”. Rejection sampling is pretty straightforward: You choose ${q}$ as standard normal, get samples ${x}$ from that and only accept if ${x\geq 2}$. In this case you have ${M=1}$, but ${C}$ is small, namely ${C\approx 0.023}$ which means that less than 1 out of 43 samples are accepted…

Here comes another method. This method is from the class of Markov-Chain methods, i.e. we will generate a sequence ${x_{1},x_{2},\dots}$ such that they form a Markov chain with a given transition probability. This means, we have a distribution ${\kappa}$ such that ${x_{k+1}\sim \kappa(\cdot,x_{k})}$. Again, our goal is that the sequence is distributed according to ${p}$. A central result in the theory of Markov-chain methods is, that this is the case when the so-called (detailed) balance equation is fulfilled , i.e. if

$\displaystyle \begin{array}{rcl} p(x)\kappa(y,x) = p(y)\kappa(x,y). \end{array}$

Remember, that we only have access to ${Cp(x)}$ and we don’t know ${C}$. Here is the method which is called Metropolis-Hastings sampling: To begin, choose a symmetric proposal distribution ${q}$, (i.e. ${q(x,y)}$ is the distribution for “go from ${y}$ to ${x}$” and fulfills ${q(x,y) = q(y,x)}$), initialize with some ${x_{0}}$ and set ${k=1}$. Then:

1. generate ${x^{*}\sim q(\cdot,x_{k-1})}$, i.e. make a trial step from ${x_{k-1}}$,
2. set ${\alpha = \min(1,\tfrac{p(x^{*})}{p(x_{k-1})})}$,
3. generate ${t\sim\mathcal{U}(0,1)}$
4. accept ${x^{*}}$ with probability ${\alpha}$, i.e. set ${x_{k} = x^{*}}$ if ${t\leq \alpha}$, increase ${k}$ and repeat, otherwise do reject, i.e. do not increase ${k}$ and repeat.

Note that in step 3 we don’t really need ${p(x^{*})}$ and ${p(x_{k+1})}$ but only ${Cp(x^{*})}$ and ${Cp(x_{k-1})}$ since the unknown ${C}$‘s cancel out.

Proposition 2 The samples generated from Metropolis-Hastings sampling are distributed according to ${p}$.

Proof: We first calculate the transition probability of the method. The probability to go from ${x_{k-1}}$ to the proposed ${x^{*}}$ is “${q(x^{*},x_{k-1})}$ multiplied by the acceptance rate ${\alpha}$” and the probability to stay in ${x_{k-1}}$ is ${1-\alpha}$. In formula: We write the acceptance ratio as

$\displaystyle \begin{array}{rcl} \alpha(x_{k-1},x^{*}) = \min(1,\frac{p(x^{*})}{p(x_{k-1})}) = \begin{cases} 1 & p(x^{*})\geq p(x_{k-1})\\ \frac{p(x^{*})}{p(x_{k-1})} & p(x_{k-1})\geq p(x^{*}) \end{cases} \end{array}$

and can write the transition probability as

$\displaystyle \begin{array}{rcl} \kappa(y,x) = \alpha(x,y)q(y,x) + (1-\alpha^{*}(x))\delta_{x}(y) \end{array}$

with

$\displaystyle \begin{array}{rcl} \alpha^{*}(x) = \int \alpha(x,y)q(y,x) dy. \end{array}$

We aim to show the detailed balance equation ${p(x)\kappa(y,x) = p(y)\kappa(x,y)}$. First by a simple distinction of cases, we get

$\displaystyle \begin{array}{rcl} \frac{\alpha(x,y)}{\alpha(y,x)} = \frac{p(y)}{p(x)}. \end{array}$

This gives ${p(x)\alpha(x,y) = p(y)\alpha(y,x)}$ and since ${q}$ is symmetric, we get

$\displaystyle \begin{array}{rcl} p(x)\alpha(x,y)q(x,y) = p(y)\alpha(y,x)q(x,y). \end{array}$

Then we note that

$\displaystyle \begin{array}{rcl} (1-\alpha^{*}(x))\delta_{x}(y) p(x) = (1-\alpha^{*}(y))\delta_{y}(x)p(y) \end{array}$

(which can be seen by integration both sides against some ${f(x,y)}$). Putting things together, we get

$\displaystyle \begin{array}{rcl} p(x)\kappa(y,x) & = & p(x) \alpha(x,y)q(y,x) + p(x)(1-\alpha^{*}(x))\delta_{x}(y)\\ & = & p(y)\alpha(y,x)q(x,y) + p(y)(1-\alpha^{*}(y)\delta_{y}(x)\\ & = & p(y)\kappa(x,y) \end{array}$

as desired. $\Box$

Note that Metropolis-Hastings sampling is fairly easy to implement as soon as it is simple to get samples from the proposal distribution. A quick way is to use the standard normal distribution as proposal distribution. In this case, Metropolis-Hastings performs a “restricted random walk”, i.e. if we would accept every step, the sequence ${x_{k}}$ would be a random walk. However, we allow for the possibility to stop the walk in the cases where it would lead us to a region of lower probability — in the cases where it leads to a point of higher probability, we follow the random walk. Although this approach is simple to implement, it may take a lot of time for the chain to get something reasonable. The reasons are, that the random walk steps may be rejected quite often, that it is a long way to get to where the “most probability is” from the initialization or that the distribution ${p}$ has several modes, separated by valleys and that the random walk has difficulties to get from one mode to the other (again due to frequent rejection).

Also note that one can use Gaussian distributions with any variance as proposal distributions and that the variance acts like some stepsize. As often with stepsizes, there is a trade-off: smaller stepsizes mean, that the sampler will need more time to explore the distribution while larger stepsizes would allow faster exploration but also lead to more frequent rejection.

Consider a convex optimization problem of the form

$\displaystyle \begin{array}{rcl} \min_{x}F(x) + G(Ax) \end{array}$

with convex ${F}$ and ${G}$ and matrix ${A}$. (We formulate everything quite loosely, skipping over details like continuity and such, as they are irrelevant for the subject matter). Optimization problems of this type have a specific type of dual problem, namely the Fenchel-Rockafellar dual, which is

$\displaystyle \begin{array}{rcl} \max_{y}-F^{*}(-A^{T}y) - G^{*}(y) \end{array}$

and under certain regularity conditions it holds that the optimal value of the dual equals the the objective value of the primal and, moreover, that a pair ${(x^{*},y^{*})}$ is both primal and dual optimal if and only if the primal dual gap is zero, i.e. if and only if

$\displaystyle \begin{array}{rcl} F(x^{*})+G(Ax^{*}) + F^{*}(-A^{T}y^{*})+G^{*}(y^{*}) = 0. \end{array}$

Hence, it is quite handy to use the primal dual gap as a stopping criteria for iterative methods to solve these problems. So, if one runs an algorithm which produces primal iterates ${x^{k}}$ and dual iterates ${y^{k}}$ one can monitor

$\displaystyle \begin{array}{rcl} \mathcal{G}(x^{k},y^{k}) = F(x^{k})+G(Ax^{k}) + F^{*}(-A^{T}y^{k})+G^{*}(y^{k}). \end{array}$

and stop if the value falls below a desired tolerance.

There is some problem with this approach which appears if the method produces infeasible iterates in the sense that one of the four terms in ${\mathcal{G}}$ is actually ${+\infty}$. This may be the case if ${F}$ or ${G}$ are not everywhere finite or, loosely speaking, have linear growth in some directions (since then the respective conjugate will not be finite everywhere). In the rest of the post, I’ll sketch a general method that can often solve this particular problem.

For the sake of simplicity, consider the following primal dual algorithm

$\displaystyle \begin{array}{rcl} x^{k+1} & = &\mathrm{prox}_{\tau F}(x^{k}-A^{T}y^{k})\\ y^{k+1} & = &\mathrm{prox}_{\sigma G^{*}}(y^{k}+\sigma A(2x^{k+1}-x^{k})) \end{array}$

(also know as primal dual hybrid gradient method or Chambolle-Pock’s algorithm). It converges as soon as ${\sigma\tau\leq \|A\|^{-2}}$.

While the structure of the algorithm ensures that ${F(x^{k})}$ and ${G^{*}(y^{k})}$ are always finite (since always ${\mathrm{prox}_{F}(x)\in\mathrm{dom}(F)}$), is may be that ${F^{*}(-A^{T}y^{k})}$ or ${G(Ax^{k})}$ are indeed infinite, rendering the primal dual gap useless.

Let us assume that the problematic term is ${F^{*}(-A^{T}y^{k})}$. Here is a way out in the case where one can deduce some a-priori bounds on ${x^{*}}$, i.e. a bounded and convex set ${C}$ with ${x^{*}\in C}$. In fact, this is often the case (e.g. one may know a-priori that there exist lower bounds ${l_{i}}$ and upper bounds ${u_{i}}$, i.e. it holds that ${l_{i}\leq x^{*}_{i}\leq u_{i}}$). Then, adding these constraints to the problem will not change the solution.

Let us see, how this changes the primal dual gap: We set ${\tilde F(x) = F(x) + I_{C}(x)}$ where ${C}$ is the set which models the bound constraints. Since ${C}$ is a bounded convex set and ${F}$ is finite on ${C}$, it is clear that

$\displaystyle \begin{array}{rcl} \tilde F^{*}(\xi) = \sup_{x\in C}\,\langle \xi,x\rangle - F(x) \end{array}$

is finite for every ${\xi}$. This leads to a finite duality gap. However, one should also adapt the prox operator. But this is also simple in the case where the constraint ${C}$ and the function ${F}$ are separable, i.e. ${C}$ encodes bound constraints as above (in other words ${C = [l_{1},u_{1}]\times\cdots\times [l_{n},u_{n}]}$) and

$\displaystyle \begin{array}{rcl} F(x) = \sum_{i} f_{i}(x_{i}). \end{array}$

Here it holds that

$\displaystyle \begin{array}{rcl} \mathrm{prox}_{\sigma \tilde F}(x)_{i} = \mathrm{prox}_{\sigma f_{i} + I_{[l_{i},u_{i}]}}(x_{i}) \end{array}$

and it is simple to see that

$\displaystyle \begin{array}{rcl} \mathrm{prox}_{\sigma f_{i} + I_{[l_{i},u_{i}]}}(x_{i}) = \mathrm{proj}_{[l_{i},u_{i}]}\mathrm{prox}_{\tau f_{i}}(x_{i}), \end{array}$

i.e., only uses the proximal operator of ${F}$ and project onto the constraints. For general ${C}$, this step may be more complicated.

One example, where this makes sense is ${L^{1}-TV}$ denoising which can be written as

$\displaystyle \begin{array}{rcl} \min_{u}\|u-u^{0}\|_{1} + \lambda TV(u). \end{array}$

Here we have

$\displaystyle \begin{array}{rcl} F(u) = \|u-u^{0}\|_{1},\quad A = \nabla,\quad G(\phi) = I_{|\phi_{ij}|\leq 1}(\phi). \end{array}$

The guy that causes problems here is ${F^{*}}$ which is an indicator functional and indeed ${A^{T}\phi^{k}}$ will usually be dual infeasible. But since ${u}$ is an image with a know range of gray values one can simple add the constraints ${0\leq u\leq 1}$ to the problem and obtains a finite dual while still keeping a simple proximal operator. It is quite instructive to compute ${\tilde F}$ in this case.

Yesterday I uploaded the paper An extended Perona-Malik model based on probabilistic models by Lars Mescheder and myself to the arXiv. Recently I already blogged about this work, so I do not have to add that much. The main theme of the work was that if we have an image that is a blurred and noisy version of some true image, we formulate the reconstruction via Bayesian statistics. As a prior model for images we used a Gaussian scale mixture, i.e. we have a latent variable $z$ (in every pixel $x$) and the joint prior for the image $u$ and the latent variable $z$ is

$p(u,z) \propto \exp\Big(-\sum_x \frac{z(x)}2 |\nabla u(x)|^2 + v(z(x))\Big)$

where $x$ denotes the pixels, $\nabla$ is the discrete gradient of the image and $v$ is some non-negative function defined on the non-negative reals. Besides algorithms to approximate the MAP estimate, Lars proposed a mean-field approximation which does not calculate the most probable image, but iteratively approximates the posterior distribution by distributions which factorize over the variables $u$ and $z$. Using some further approximation (since the resulting algorithm in its plain form is not implementable) one arrives at an algorithm which some keeps more uncertainty in $u$ and, in practice, gives a point estimate for the denoised image that seems more “representative”. Here is an example:

This is the blurred and noisy image (the blur is motions blur and implemented with periodic boundary conditions for simplicity):

The next image is the approximation of the MAP estimate we got and you see the usual drawbacks. Since the MAP neglects all uncertainty in $u$ and maximizes the posterior probability, the image is “too sharp” in the way that smooth transitions (e.g. at the lighthouse) turn into piecewise constant stairs. Also the rocks appear very blocky.

Here is the result from the mean-field approximation. Since uncertainty in $u$ is taken into account, the image does not suffer from staircasing and also has a more natural appeal, most prominently in the part with the rocks.

The paper has some more examples and also shows another relation to Mumford-Shah denoising (loosely speaking, one uses a discrete latent variable $z$ to serve as a binary variable to say if a pixel is an edge or not). Oh, by the way, the algorithms behind the mean-field approximation use some parts of more general duality between random variables that Lars and his co-authors develop in another paper.

Here is a small signal boost for the

# Workshop on the Interface of Statistics and Optimization

to  be held at Duke University, Durham, North Carolina from Feb 8 to Feb 10 2017. The workshop is part of the long-year program on optimization currently taking place at the Statistical and Applied Mathematical Sciences Institute (SAMSI).

There will be a lineup of invited speakers from the forefront of Statistics and Optimization each of which has made influential contributions to the other field as well. The planning is still ongoing, and hence, the list of speakers will grow some.

If you can’t make it to North Carolina next February, still mark the date since the talks will be broadcasted via the net and (if tech works out) you may even participate in the Q&A sessions after the talks via your computer.

Last week Christoph Brauer, Andreas Tillmann and myself uploaded the paper A Primal-Dual Homotopy Algorithm for ${\ell_{1}}$-Minimization with ${\ell_{\infty}}$-Constraints to the arXiv (and we missed being the first ever arXiv-paper with a non-trivial five-digit identifier by twenty-something papers…). This paper specifically deals with the optimization problem

$\displaystyle \begin{array}{rcl} \min_{x}\|x\|_{1}\quad\text{s.t.}\quad \|Ax-b\|_{\infty}\leq \delta \end{array}$

where ${A}$ and ${b}$ are a real matrix and vector, respecively, of compatible size. While the related problem with ${\ell_{2}}$ constraint has been addressed quite often (and the penalized problem ${\min_{x}\|x\|_{1} + \tfrac1{2\lambda}\|Ax-b\|_{2}^{2}}$ is even more popular) there is not much code around to solve this specific problem. Obvious candidates are

• Linear optimization: The problem can be recast as a linear program: The constraint is basically linear already (rewriting it with help of the ones vector ${\mathbf{1}}$ as ${Ax\leq \delta\mathbf{1}+b}$, ${-Ax\leq \delta\mathbf{1}-b}$) and for the objective one can, for example, perform a variable split ${x = x^{+}-x^{-}}$, ${x^{-},x^{+}\geq 0}$ and then write ${\|x\|_{1} = \mathbf{1}^{T}x^{+}+ \mathbf{1}^{T}x^{-}}$.
• Splitting methods: The problem is convex problem of the form ${\min_{x}F(x) + G(Ax)}$ with

$\displaystyle \begin{array}{rcl} F(x) & = & \|x\|_{1}\\ G(y) & = & \begin{cases} 0 & \|y-b\|\leq\delta\\ \infty & \text{else.} \end{cases} \end{array}$

and hence, several methods for these kind of problem are available, such as the alternating direction method of multipliers or the Chambolle-Pock algorithm.

The formulation as linear program has the advantage that one can choose among a lot of highly sophisticated software tools and the second has the advantage that the methods are very easy to code, usually in just a few lines. Drawbacks are, that both methods do exploit too much specific structure of the problem.

Application of the problem with ${\ell_{\infty}}$ constraint are, for example:

• The Dantzig selector, a statistical estimation technique, were the problem is

$\displaystyle \begin{array}{rcl} \min_{x}\|x\|_{1}\quad\text{s.t.}\quad \|A^{T}(Ax-b)\|_{\infty}\leq\delta. \end{array}$

• Sparse dequantization as elaborated here by Jacques, Hammond and Fadili and applied here to de-quantizaton of speech signals by Christoph, Timo Gerkmann and myself.

We wanted to see if one of the most efficient methods known for sparse reconstruction with ${\ell_{2}}$ penalty, namely the homotopy method, can be adapted to this case. The homotopy method for ${\min_{x}\|x\|_{1} + \tfrac1{2\lambda}\|Ax-b\|_{2}^{2}}$ builds on the observation that the solution for ${\lambda\geq \|A^{T}b\|_{\infty}}$ is zero and that the set of solutions ${x_{\lambda}}$, parameterized by the parameter ${\lambda}$, is piecewise linear. Hence, on can start from ${\lambda_{0}= \|A^{T}b\|_{\infty}}$, calculate which direction to go, how far the breakpoint is away, go there and start over. I’ve blogged on the homotopy method here already and there you’ll find some links to great software packages, but also the fact that there can be up to exponentially many breakpoints. However, in practice the homotopy method is usually blazingly fast and with some care, can be made numerically stable and accurate, see, e.g. our extensive study here (and here is the optimization online preprint).

The problem with ${\ell_{\infty}}$ constraint seems similar, especially it is clear that for ${\delta = \|b\|_{\infty}}$, ${x=0}$ is a solution. It is also not so difficult to see that there is a piecewise linear path of solutions ${x_{\delta}}$. What is not so clear is, how it can be computed. It turned out, that in this case the whole truth can be seen when the problem is viewed from a primal-dual viewpoint. The associated dual problem is

$\displaystyle \begin{array}{rcl} \max_{y}\ -b^{T}y - \delta\|y\|_{1}\quad\text{s.t.}\quad\|A^{T}y\|_{\infty}\leq\infty \end{array}$

and a pair ${(x^{*},y^{*})}$ is primal and dual optimal if and only if

$\displaystyle \begin{array}{rcl} -A^{T}y^{*}&\in&\text{Sign}(x^{*})\\ Ax^{*}-b & \in & \delta\text{Sign}(y^{*}) \end{array}$

(where ${\text{Sign}}$ denotes the sign function, multivalued at zero, giving ${[-1,1]}$ there). One can note some things from the primal-dual optimality system:

• You may find a direction ${d}$ such that ${(x^{*}+td,y^{*})}$ stays primal-dual optimal for the constraint ${\leq\delta-t}$ for small ${t}$,
• for a fixed primal optimal ${x_{\delta}}$ there may be several dual optimal ${y_{\delta}}$.

On the other hand, it is not that clear which of the probably many dual optimal ${y_{\delta}}$ allows to find a new direction ${d}$ such that ${x_{\delta}+td}$ with stay primal optimal when reducing ${\delta}$. In fact, it turned out that, at a breakpoint, a new dual variable needs to be found to allow for the next jump in the primal variable. So, the solution path is piecewise linear in the primal variable, but piecewise constant in the dual variable (a situation similar to the adaptive inverse scale space method).

What we found is, that some adapted theorem of the alternative (a.k.a. Farkas’ Lemma) allows to calculate the next dual optimal ${y}$ such that a jump in ${x}$ will be possible.

What is more, is that the calculation of a new primal or dual optimal point amounts to solving a linear program (in contrast to a linear system for ${\ell_{2}}$ homotopy). Hence, the trick of up- and downdating a suitable factorization of a suitable matrix to speed up computation does not work. However, one can somehow leverage the special structure of the problem and use a tailored active set method to progress through the path. Our numerical tests indicated is that the resulting method, which we termed ${\ell_{1}}$-Houdini, is able to solve moderately large problems faster than a commercial LP-solver (while also not only solving the given problem, but calculating the whole solution path on the fly) as can be seen from this table from the paper:

The code of $\ell_1$-Houdini is on Christoph’s homepage, you may also reproduce the data in the above table with your own hardware.

Today I gave a talk at STOR colloquium at the University of North Carolina (UNC). I spoke about the Master’s thesis of Lars Mescheder in which he developed probabilistic models for image denoising. Among other things, he proposed a Gaussian scale mixture model as an image prior and developed several methods to infer information from the respective posterior. One curios result was that the Perona-Malik diffusivity (and variants thereof) popped up in basically every algorithm he derived. It’s not only that the general idea to have an edge dependent diffusion coefficient in a PDE or, formally equivalent, a concave penalty for the magnitude of the gradient in a variational formulation, but it also turned out that the very diffusion coefficient ${\frac1{1+|\nabla u|^{2}}}$ appeared exactly in this form.

Besides the talk I got the chance to chat with many interesting people. I’d like to mention a chat about a background story on Perona-Malik which I find quite interesting:

Steven Pizer, Kenan Professor for computer science at UNC and active in a lot of fields for several decades, and in particular a driving for in the early days of digital image analysis, told that the very idea of the non-linear, gradient-magnitude dependent diffusion coefficient actually dates back to works of neuroscientists Stephen Grossberg and that he moderated a discussion between Stephan Grossberg, Pietro Perona and Jitendra Malik in which the agreed that the method should be attributed as “Grossberg-Perona-Malik” method. I could not find any more hints for this story anywhere else, but given the mathematical, psychological and biological background of Grossberg and a look at Grossbergs list of publications in the 1980’s and earlier make this story appear more than plausible.

It may also well be, that this story is just another instance of the effect, that good ideas usually pop up at more than one place…

Yesterday I uploaded the paper “Linear convergence of the Randomized Sparse Kaczmarz Method” by Frank Schöpfer and myself to the arXiv.

Recall that the Kaczmarz method for linear systems

$\displaystyle \begin{array}{rcl} Ax&=&b \end{array}$

iterates

$\displaystyle \begin{array}{rcl} x^{k+1} &=& x^{k} - \frac{\langle a_{i},x_{k}\rangle-b_{i}}{\|a_{i}\|^{2}}a_{i} \end{array}$

where ${a_{i}}$ is the ${i}$-th row of ${A}$, ${b_{i}}$ is the ${i}$th entry of ${b}$ and the index ${i}$ is chosen according to some rule. We could, e.g. choose the rows in a cyclic manner, i.e. starting with the first row, proceeding to the next row and once we came to the last row we would start over from the first again. It is known (and probably proved by Kaczmarz himself) that the method converges to a solution of ${Ax=b}$ whenever the system has a solution. Moreover, it easy to see that we converge to the minimum norm solution in case of underdetermined systems when the method is initialized with zero. This is due to the fact that the whole iteration takes place in the range space of ${A^{T}}$.

In this and this paper we proposed a simple modification of the Kaczmarz method, that makes it converge to sparse solutions. The modification is simply

$\displaystyle \begin{array}{rcl} z^{k+1} & = & z^{k} - \frac{\langle a_{i},x_{k}\rangle-b_{i}}{\|a_{i}\|^{2}}a_{i}\\ x^{k+1}& = & S_{\lambda}(z^{k+1}) \end{array}$

where ${S_{\lambda}(x) = \max(|x|-\lambda,0)\text{sign}(x)}$ is the soft thresholding function. In this paper we proved that this variant converges, when initialized with zero and for a consistent system, to the solution of

$\displaystyle \begin{array}{rcl} \min_{x}\lambda\|x\|_{1} + \tfrac12\|x\|_{2}^{2},\quad\text{s.t.}\quad Ax=b. \end{array}$

For not too small values of ${\lambda}$ this is indeed a sparse solution of ${Ax=b}$ and Frank also proved that there is a threshold such that for ${\lambda}$ larger than this threshold the solution is also the minimum ${\ell^{1}}$ solution.

In general, convergence rates for the Kaczmarz method (and its sparse variant) are hard to prove. To convince oneself of this fact note that the convergence speed can drastically change if the rows of the system are reordered. The situation changes if one uses a randomization of the sparse Kaczmarz method where, in each iteration, a row is chose at random. Strohmer and Vershynin proved that this randomization leads to linear convergence rate. In the above mentioned paper we were able to prove the same result for the randomized sparse Kaczmarz method. While this sounds like an obvious generalization, the methods we use are totally different. While the linear convergence of the randomized Kaczmarz method can be proven in a few lines(well, probably one page) using only very basic tools, we need, among other things, quite intricate error bounds for Bregman projections w.r.t. piecewise linear-quadratic convex functions.

In fact, the linear convergence can be observed in practice and we illustrate the usefulness of the randomization and also the “sparsification” on some examples in the paper. For example the following figure shows the decay of the residual for the the randomized Kaczmarz method (black), the randomized sparse Kaczmarz method (red) and the randomized sparse Kaczmarz method with exact steps (green) which I did not explain here.

More details are in the paper…