Yesterday I uploaded the paper An extended Perona-Malik model based on probabilistic models by Lars Mescheder and myself to the arXiv. Recently I already blogged about this work, so I do not have to add that much. The main theme of the work was that if we have an image that is a blurred and noisy version of some true image, we formulate the reconstruction via Bayesian statistics. As a prior model for images we used a Gaussian scale mixture, i.e. we have a latent variable $z$ (in every pixel $x$) and the joint prior for the image $u$ and the latent variable $z$ is

$p(u,z) \propto \exp\Big(-\sum_x \frac{z(x)}2 |\nabla u(x)|^2 + v(z(x))\Big)$

where $x$ denotes the pixels, $\nabla$ is the discrete gradient of the image and $v$ is some non-negative function defined on the non-negative reals. Besides algorithms to approximate the MAP estimate, Lars proposed a mean-field approximation which does not calculate the most probable image, but iteratively approximates the posterior distribution by distributions which factorize over the variables $u$ and $z$. Using some further approximation (since the resulting algorithm in its plain form is not implementable) one arrives at an algorithm which some keeps more uncertainty in $u$ and, in practice, gives a point estimate for the denoised image that seems more “representative”. Here is an example:

This is the blurred and noisy image (the blur is motions blur and implemented with periodic boundary conditions for simplicity):

The next image is the approximation of the MAP estimate we got and you see the usual drawbacks. Since the MAP neglects all uncertainty in $u$ and maximizes the posterior probability, the image is “too sharp” in the way that smooth transitions (e.g. at the lighthouse) turn into piecewise constant stairs. Also the rocks appear very blocky.

Here is the result from the mean-field approximation. Since uncertainty in $u$ is taken into account, the image does not suffer from staircasing and also has a more natural appeal, most prominently in the part with the rocks.

The paper has some more examples and also shows another relation to Mumford-Shah denoising (loosely speaking, one uses a discrete latent variable $z$ to serve as a binary variable to say if a pixel is an edge or not). Oh, by the way, the algorithms behind the mean-field approximation use some parts of more general duality between random variables that Lars and his co-authors develop in another paper.

Last week Christoph Brauer, Andreas Tillmann and myself uploaded the paper A Primal-Dual Homotopy Algorithm for ${\ell_{1}}$-Minimization with ${\ell_{\infty}}$-Constraints to the arXiv (and we missed being the first ever arXiv-paper with a non-trivial five-digit identifier by twenty-something papers…). This paper specifically deals with the optimization problem

$\displaystyle \begin{array}{rcl} \min_{x}\|x\|_{1}\quad\text{s.t.}\quad \|Ax-b\|_{\infty}\leq \delta \end{array}$

where ${A}$ and ${b}$ are a real matrix and vector, respecively, of compatible size. While the related problem with ${\ell_{2}}$ constraint has been addressed quite often (and the penalized problem ${\min_{x}\|x\|_{1} + \tfrac1{2\lambda}\|Ax-b\|_{2}^{2}}$ is even more popular) there is not much code around to solve this specific problem. Obvious candidates are

• Linear optimization: The problem can be recast as a linear program: The constraint is basically linear already (rewriting it with help of the ones vector ${\mathbf{1}}$ as ${Ax\leq \delta\mathbf{1}+b}$, ${-Ax\leq \delta\mathbf{1}-b}$) and for the objective one can, for example, perform a variable split ${x = x^{+}-x^{-}}$, ${x^{-},x^{+}\geq 0}$ and then write ${\|x\|_{1} = \mathbf{1}^{T}x^{+}+ \mathbf{1}^{T}x^{-}}$.
• Splitting methods: The problem is convex problem of the form ${\min_{x}F(x) + G(Ax)}$ with

$\displaystyle \begin{array}{rcl} F(x) & = & \|x\|_{1}\\ G(y) & = & \begin{cases} 0 & \|y-b\|\leq\delta\\ \infty & \text{else.} \end{cases} \end{array}$

and hence, several methods for these kind of problem are available, such as the alternating direction method of multipliers or the Chambolle-Pock algorithm.

The formulation as linear program has the advantage that one can choose among a lot of highly sophisticated software tools and the second has the advantage that the methods are very easy to code, usually in just a few lines. Drawbacks are, that both methods do exploit too much specific structure of the problem.

Application of the problem with ${\ell_{\infty}}$ constraint are, for example:

• The Dantzig selector, a statistical estimation technique, were the problem is

$\displaystyle \begin{array}{rcl} \min_{x}\|x\|_{1}\quad\text{s.t.}\quad \|A^{T}(Ax-b)\|_{\infty}\leq\delta. \end{array}$

• Sparse dequantization as elaborated here by Jacques, Hammond and Fadili and applied here to de-quantizaton of speech signals by Christoph, Timo Gerkmann and myself.

We wanted to see if one of the most efficient methods known for sparse reconstruction with ${\ell_{2}}$ penalty, namely the homotopy method, can be adapted to this case. The homotopy method for ${\min_{x}\|x\|_{1} + \tfrac1{2\lambda}\|Ax-b\|_{2}^{2}}$ builds on the observation that the solution for ${\lambda\geq \|A^{T}b\|_{\infty}}$ is zero and that the set of solutions ${x_{\lambda}}$, parameterized by the parameter ${\lambda}$, is piecewise linear. Hence, on can start from ${\lambda_{0}= \|A^{T}b\|_{\infty}}$, calculate which direction to go, how far the breakpoint is away, go there and start over. I’ve blogged on the homotopy method here already and there you’ll find some links to great software packages, but also the fact that there can be up to exponentially many breakpoints. However, in practice the homotopy method is usually blazingly fast and with some care, can be made numerically stable and accurate, see, e.g. our extensive study here (and here is the optimization online preprint).

The problem with ${\ell_{\infty}}$ constraint seems similar, especially it is clear that for ${\delta = \|b\|_{\infty}}$, ${x=0}$ is a solution. It is also not so difficult to see that there is a piecewise linear path of solutions ${x_{\delta}}$. What is not so clear is, how it can be computed. It turned out, that in this case the whole truth can be seen when the problem is viewed from a primal-dual viewpoint. The associated dual problem is

$\displaystyle \begin{array}{rcl} \max_{y}\ -b^{T}y - \delta\|y\|_{1}\quad\text{s.t.}\quad\|A^{T}y\|_{\infty}\leq\infty \end{array}$

and a pair ${(x^{*},y^{*})}$ is primal and dual optimal if and only if

$\displaystyle \begin{array}{rcl} -A^{T}y^{*}&\in&\text{Sign}(x^{*})\\ Ax^{*}-b & \in & \delta\text{Sign}(y^{*}) \end{array}$

(where ${\text{Sign}}$ denotes the sign function, multivalued at zero, giving ${[-1,1]}$ there). One can note some things from the primal-dual optimality system:

• You may find a direction ${d}$ such that ${(x^{*}+td,y^{*})}$ stays primal-dual optimal for the constraint ${\leq\delta-t}$ for small ${t}$,
• for a fixed primal optimal ${x_{\delta}}$ there may be several dual optimal ${y_{\delta}}$.

On the other hand, it is not that clear which of the probably many dual optimal ${y_{\delta}}$ allows to find a new direction ${d}$ such that ${x_{\delta}+td}$ with stay primal optimal when reducing ${\delta}$. In fact, it turned out that, at a breakpoint, a new dual variable needs to be found to allow for the next jump in the primal variable. So, the solution path is piecewise linear in the primal variable, but piecewise constant in the dual variable (a situation similar to the adaptive inverse scale space method).

What we found is, that some adapted theorem of the alternative (a.k.a. Farkas’ Lemma) allows to calculate the next dual optimal ${y}$ such that a jump in ${x}$ will be possible.

What is more, is that the calculation of a new primal or dual optimal point amounts to solving a linear program (in contrast to a linear system for ${\ell_{2}}$ homotopy). Hence, the trick of up- and downdating a suitable factorization of a suitable matrix to speed up computation does not work. However, one can somehow leverage the special structure of the problem and use a tailored active set method to progress through the path. Our numerical tests indicated is that the resulting method, which we termed ${\ell_{1}}$-Houdini, is able to solve moderately large problems faster than a commercial LP-solver (while also not only solving the given problem, but calculating the whole solution path on the fly) as can be seen from this table from the paper:

The code of $\ell_1$-Houdini is on Christoph’s homepage, you may also reproduce the data in the above table with your own hardware.

Another few notes to myself:

In this post I just collect a few papers that caught my attention in the last moth.

I begin with Estimating Unknown Sparsity in Compressed Sensing by Miles E. Lopes. The abstract reads

Within the framework of compressed sensing, many theoretical guarantees for signal reconstruction require that the number of linear measurements ${n}$ exceed the sparsity ${\|x\|_0}$ of the unknown signal ${x\in\mathbb{R}^p}$. However, if the sparsity ${\|x\|_0}$ is unknown, the choice of ${n}$ remains problematic. This paper considers the problem of estimating the unknown degree of sparsity of ${x}$ with only a small number of linear measurements. Although we show that estimation of ${\|x\|_0}$ is generally intractable in this framework, we consider an alternative measure of sparsity ${s(x):=\frac{\|x\|_1^2}{\|x\|_2^2}}$, which is a sharp lower bound on ${\|x\|_0}$, and is more amenable to estimation. When ${x}$ is a non-negative vector, we propose a computationally efficient estimator ${\hat{s}(x)}$, and use non-asymptotic methods to bound the relative error of ${\hat{s}(x)}$ in terms of a finite number of measurements. Remarkably, the quality of estimation is dimension-free, which ensures that ${\hat{s}(x)}$ is well-suited to the high-dimensional regime where ${n<. These results also extend naturally to the problem of using linear measurements to estimate the rank of a positive semi-definite matrix, or the sparsity of a non-negative matrix. Finally, we show that if no structural assumption (such as non-negativity) is made on the signal ${x}$, then the quantity ${s(x)}$ cannot generally be estimated when ${n<.

It’s a nice combination of the observation that the quotient ${s(x)}$ is a sharp lower bound for ${\|x\|_0}$ and that it is possible to estimate the one-norm and the two norm of a vector ${x}$ (with additional properties) from carefully chosen measurements. For a non-negative vector ${x}$ you just measure with the constant-one vector which (in a noisy environment) gives you an estimate of ${\|x\|_1}$. Similarly, measuring with Gaussian random vector you can obtain an estimate of ${\|x\|_2}$.

Then there is the dissertation of Dustin Mixon on the arxiv: Sparse Signal Processing with Frame Theory which is well worth reading but too long to provide a short overview. Here is the abstract:

Many emerging applications involve sparse signals, and their processing is a subject of active research. We desire a large class of sensing matrices which allow the user to discern important properties of the measured sparse signal. Of particular interest are matrices with the restricted isometry property (RIP). RIP matrices are known to enable eﬃcient and stable reconstruction of sfficiently sparse signals, but the deterministic construction of such matrices has proven very dfficult. In this thesis, we discuss this matrix design problem in the context of a growing field of study known as frame theory. In the ﬁrst two chapters, we build large families of equiangular tight frames and full spark frames, and we discuss their relationship to RIP matrices as well as their utility in other aspects of sparse signal processing. In Chapter 3, we pave the road to deterministic RIP matrices, evaluating various techniques to demonstrate RIP, and making interesting connections with graph theory and number theory. We conclude in Chapter 4 with a coherence-based alternative to RIP, which provides near-optimal probabilistic guarantees for various aspects of sparse signal processing while at the same time admitting a whole host of deterministic constructions.

By the way, the thesis is dedicated “To all those who never dedicated a dissertation to themselves.”

Further we have Proximal Newton-type Methods for Minimizing Convex Objective Functions in Composite Form by Jason D Lee, Yuekai Sun, Michael A. Saunders. This paper extends the well explored first order methods for problem of the type ${\min g(x) + h(x)}$ with Lipschitz-differentiable ${g}$ or simple ${\mathrm{prox}_h}$ to second order Newton-type methods. The abstract reads

We consider minimizing convex objective functions in composite form

$\displaystyle \min_{x\in\mathbb{R}^n} f(x) := g(x) + h(x)$

where ${g}$ is convex and twice-continuously differentiable and ${h:\mathbb{R}^n\rightarrow\mathbb{R}}$ is a convex but not necessarily differentiable function whose proximal mapping can be evaluated efficiently. We derive a generalization of Newton-type methods to handle such convex but nonsmooth objective functions. Many problems of relevance in high-dimensional statistics, machine learning, and signal processing can be formulated in composite form. We prove such methods are globally convergent to a minimizer and achieve quadratic rates of convergence in the vicinity of a unique minimizer. We also demonstrate the performance of such methods using problems of relevance in machine learning and high-dimensional statistics.

With this post I say goodbye for a few weeks of holiday.