Here is a small signal boost for the

# Workshop on the Interface of Statistics and Optimization

to  be held at Duke University, Durham, North Carolina from Feb 8 to Feb 10 2017. The workshop is part of the long-year program on optimization currently taking place at the Statistical and Applied Mathematical Sciences Institute (SAMSI).

There will be a lineup of invited speakers from the forefront of Statistics and Optimization each of which has made influential contributions to the other field as well. The planning is still ongoing, and hence, the list of speakers will grow some.

If you can’t make it to North Carolina next February, still mark the date since the talks will be broadcasted via the net and (if tech works out) you may even participate in the Q&A sessions after the talks via your computer.

Consider the simple linear transport equation

$\displaystyle \partial_t f + v\partial_x f = 0,\quad f(x,0) = \phi(x)$

with a velocity ${v}$. Of course the solution is

$\displaystyle f(x,t) = \phi(x-tv),$

i.e. the initial datum is just transported in direction of ${v}$, as the name of the equation suggests. We may also view the solution ${f}$ as not only depending on space ${x}$ and time ${t}$ but also dependent on the velocity ${v}$, i.e. we write ${f(x,t,v) =\phi(x-tv)}$.

Now consider that the velocity is not really known but somehow uncertain (while the initial datum ${\phi}$ is still known exactly). Hence, it does not make too much sense to look at the exact solution ${f}$, because the effect of a wrong velocity will get linearly amplified in time. It seems more sensible to assume a distribution ${\rho}$ of velocities and look at the averaged solutions that correspond to the different velocities ${v}$. Hence, the quantity to look at would be

$\displaystyle g(x,t) = \int_{-\infty}^\infty f(x,t,v)\rho(v) dv.$

Let’s have a closer look at the averaged solution ${g}$. We write out ${f}$, perform a change of variables and end up with

$\displaystyle \begin{array}{rcl} g(x,t) & = &\int_{-\infty}^\infty f(x,t,v)\rho(v)dv\\ & =& \int_{-\infty}^\infty \phi(x-tv)\rho(v)dv\\ & =& \int_{-\infty}^\infty \phi(x-w)\tfrac1t\rho(w/t)dw. \end{array}$

In the case of a Gaussian distribution ${\rho}$, i.e.

$\displaystyle \rho(v) = \frac{1}{\sqrt{4\pi}}\exp\Big(-\frac{v^2}{4}\Big)$

we get

$\displaystyle g(x,t) =\int_{-\infty}^\infty \phi(x-w)G(t,w)dw$

with

$\displaystyle G(t,w) = \frac{1}{\sqrt{4\pi}\,t}\exp\Big(-\frac{w^2}{4t^2}\Big).$

Now we make a time rescaling ${\tau = t^2}$, denote ${h(x,\tau) = h(x,t^2) = g(x,t)}$ and see that

$\displaystyle h(x,\tau) = \int_{-\infty}^\infty \phi(x-w)\frac{1}{\sqrt{4\pi \tau}}\exp\Big(-\frac{w^2}{4\tau}\Big)dw.$

So what’s the point of all this? It turns out that the averaged and time rescaled solution ${h}$ of the transport equation indeed solves the heat equation

$\displaystyle \partial_t h - \partial_{xx} h = 0,\quad h(x,0) = \phi(x).$

In other words, velocity averaging and time rescaling turn a transport equation (a hyperbolic PDE) into a diffusion equation (a parabolic PDE).

I’ve seen this derivation in a talk by Enrique Zuazua in his talk at SciCADE 2015.

To end this blog post, consider the slight generalization of the transport equation

$\displaystyle \partial_t f + \psi(x,v)\partial_x f = 0$

where the velocity depends on ${x}$ and ${v}$. According to Enrique Zuazua it’s open what happens here when you average over velocities…

I am at the SSVM 13 and the first day is over. The conference is in a very cozy place in Austria, namely the Schloss Seggau which is located on a small hill near the small town Leibnitz in Styria (which is in Austria but totally unaffected by the floods). The conference is also not very large but there is a good crowd of interesting people here and the program reads interesting.

The time schedule is tight and I don’t know if I can make it to post something everyday. Especially, I will not be able to blog about every talk.

From the first day I have the impression that two things have appeared frequently at different places:

• Differential geometry: Tools from differential geometry like the notion of geodesics or Riemannian metrics have not only been used by Martin Rumpf to model different types of spaces of shapes but also to interpolate between textures.
• Primal dual methods: Variational models should be convex these days and if they are not you should make them convex somehow. Then you can use primal dual methods. Somehow the magic of primal dual methods is that they are incredibly flexible. Also they work robustly and reasonably fast.

Probably, there are more trends to see in the next days.

ISMP is over now and I’m already home. I do not have many things to report on from the last day. This is not due the lower quality of the talks but due to the fact that I was a little bit exhausted, as usual at the end of a five-day conference. However, I collect a few things for the record:

• In the morning I visited the semi-planary by Xiaojun Chenon non-convex and non-smooth minimization with smoothing methods. Not surprisingly, she treated the problem

$\displaystyle \min_x f(x) + \|x\|_p^p$

with convex and smooth ${f:{\mathbb R}^n\rightarrow{\mathbb R}}$ and ${0. She proposed and analyzed smoothing methods, that is, to smooth the problem a bit to obtain a Lipschitz-continuous objective function ${\phi_\epsilon}$, minimizing this and then gradually decreasing ${\epsilon}$. This works, as she showed. If I remember correctly, she also treated “iteratively reweighted least squares” as I described in my previous post. Unfortunately, she did not include the generalized forward-backward methods based on ${\text{prox}}$-functions for non-convex functions. Kristian and I pursued this approach in our paper Minimization of non-smooth, non-convex functionals by iterative thresholding and some special features of our analysis include:

• A condition which excludes some (but not all) local minimizers from being global.
• An algorithm which avoids this non-global minimizers by carefully adjusting the steplength of the method.
• A result that the number of local minimizers is still finite, even if the problem is posed in ${\ell^2({\mathbb N})}$ and not in ${{\mathbb R}^n}$.

Most of our results hold true, if the ${p}$-quasi-norm is replaced by functions of the form

$\displaystyle \sum_n \phi_n(|x_n|)$

with special non-convex ${\phi}$, namely fulfilling a list of assumptions like

• ${\phi'(x) \rightarrow \infty}$ for ${x\rightarrow 0}$ (infinite slope at ${0}$) and ${\phi(x)\rightarrow\infty}$ for ${x\rightarrow\infty}$ (mild coercivity),
• ${\phi'}$ strictly convex on ${]0,\infty[}$ and ${\phi'(x)/x\rightarrow 0}$ for ${x\rightarrow\infty}$,
• for each ${b>0}$ there is ${a>0}$ such that for ${x it holds that ${\phi(x)>ax^2}$, and
• local integrability of some section of ${\partial\phi'(x) x}$.

As one easily sees, ${p}$-quasi-norms fulfill the assumptions and some other interesting functions as well (e.g. some with very steep slope at ${0}$ like ${x\mapsto \log(x^{1/3}+1)}$).

• Jorge Nocedalgave a talk on second-order methods for non-smooth problems and his main example was a functional like

$\displaystyle \min_x f(x) + \|x\|_1$

with a convex and smooth ${f}$, but different from Xiaojun Chen, he only considered the ${1}$-norm. His talked is among the best planary talks I have ever attended and it was a great pleasure to listen to him. He carefully explained things and put them in perspective. In the case he skipped slides, he made me feel that I either did not miss an important thing, or understood them even though he didn’t show them He argued that it is not necessarily more expensive to use second order information in contrast to first order methods. Indeed, the ${1}$-norm can be used to reduce the number of degrees of freedom for a second order step. What was pretty interesting is, that he advocated semismooth Newton methods for this problem. Roland and I pursued this approach some time ago in our paper A Semismooth Newton Method for Tikhonov Functionals with Sparsity Constraints and, if I remember correctly (my notes are not complete at this point), his family of methods included our ssn-method. The method Roland and I proposed worked amazingly well in the cases in which it converged but the method suffered from non-global convergence. We had some preliminary ideas for globalization, which we could not tune enough to retain the speed of the method, and abandoned the topic. Now, that the topic will most probably be revived by the community, I am looking forward to fresh ideas here.

Today there are several things I could blog on. The first is the planary by Rich Baraniuk on Compressed Sensing. However, I don’t think that I could reflect the content in a way which would be helpful for a potential reader. Just for the record: If you have the chance to visit one of Rich’s talk: Do it!

The second thing is the talk by Bernd Hofmann on source conditions, smoothness and variational inequalities and their use in regularization of inverse problems. However, this would be too technical for now and I just did not take enough notes to write a meaningful post.

As a third thing I have the talk by Christian Clason on inverse problems with uniformly distributed noise. He argued that for uniform noise it is much better to use an ${L^\infty}$ discrepancy term instead of the usual ${L^2}$-one. He presented a path-following semismooth Newton method to solve the problem

$\displaystyle \min_x \frac{1}{p}\|Kx-y^\delta\|_\infty^p + \frac{\alpha}{2}\|x\|_2^2$

and showed examples with different kinds of noise. Indeed the examples showed that ${L^\infty}$ works much better than ${L^2}$ here. But in fact it works even better, if the noise is not uniformly distributed but “impulsive” i.e. it attains bounds ${\pm\delta}$ almost everywhere. It seems to me that uniform noise would need a slightly different penalty but I don’t know which one – probably you do? Moreover, Christian presented the balancing principle to choose the regularization parameter (without knowledge about the noise level) and this was the first time I really got what it’s about. What one does here is, to choose ${\alpha}$ such that (for some ${\sigma>0}$ which only depends on ${K}$, but not on the noise)

$\displaystyle \sigma\|Kx_\alpha^\delta-y^\delta\|_\infty = \frac{\alpha}{2}\|x_\alpha^\delta\|_2^2.$

The rational behind this is, that the left hand side is monotonically non-decreasing in ${\alpha}$, while the right hand side is monotonically non-increasing. Hence, there should be some ${\alpha}$ “in the middle” which make both somewhat equally large. Of course, we do neither want to “over-regularize” (which would usually “smooth too much”) nor to “under-regularize” (which would not eliminate noise). Hence, balancing seems to be a valid choice. From a practical point of view the balancing is also nice because one can use the fixed-point iteration

$\displaystyle \alpha^{n+1} = 2\sigma\frac{\|Kx_{\alpha^n}^\delta - y^\delta\|_\infty}{\|x_{\alpha_n}^\delta\|_2^2}$

which converges in a few number of iterations.

Then there was the talk by Esther Klann, but unfortunately, I was late so only heard the last half…

Last but not least we have the talk by Christiane Pöschl. If you are interested in Total-Variation-Denoising (TV denoising), then you probably have heard many times that “TV denoising preserves edges” (have a look at the Wikipedia page – it claims this twice). What Christiane showed (in a work with Vicent Caselles and M. Novaga) that this claim is not true in general but only for very special cases. In case of characteristic functions, the only functions for which the TV minimizer has sharp edges are these so-called calibrated sets, introduced by Caselles et el. Building on earlier works by Caselles and co-workers she calculated exact minimizers for TV denoising in the case that the image consists of characteristic functions of two convex sets or of a single star shaped domain, that is, for a given set $B$ she calculated the solution of

$\displaystyle \min_u\int (u - \chi_B)^2dx + \lambda \int|Du|.$

This is not is as easy as it may sound. Even for the minimizer for a single convex set one has to make some effort. She presented a nice connection of the shape of the obtained level-sets with the morphological operators of closing and opening. With the help of this link she derived a methodology to obtain the exact TV denoising minimizer for all parameters. I do not have the images right now but be assured that most of the time, the minimizers do not have sharp edges all over the place. Even for simple geometries (like two rectangles touching in a corner) strange things happen and only very few sharp edges appear. I’ll keep you posted in case the paper comes out (or appears as a preprint).

Christiane has some nice images which make this much more clear:

For two circles edges are preserved if they are far enough away from each other. If they are close, the area “in between” them is filled and, moreover, obey this fuzzy boundary. I remember myself seeing effects like this in the output of TV-solvers and thinking “well, it seems that the algorithm is either not good or not converged yet – TV should output sharp edges!”.

For a star-shaped shape (well, actually a star) the output looks like this. The corners are not only rounded but also blurred and this is true both for the “outer” corners and the “inner” corners.

So, if you have any TV-minimizing code, go ahead and check if your code actually does the right things on images like this!
Moreover, I would love to see similar results for more complicated extensions of TV like Total Generalized Variation, I treated here.

Today I report on two things I came across here at ISMP:

• The first is a talk by Russell Luke on Constraint qualifications for nonconvex feasibility problems. Luke treated the NP-hard problem of sparsest solutions of linear systems. In fact he did not tackle this problem but the problem to find an ${s}$-sparse solution of an ${m\times n}$ system of equations. He formulated this as a feasibility-problem (well, Heinz Bauschke was a collaborator) as follows: With the usual malpractice let us denote by ${\|x\|_0}$ the number of non-zero entries of ${x\in{\mathbb R}^n}$. Then the problem of finding an ${s}$-sparse solution to ${Ax=b}$ is:

$\displaystyle \text{Find}\ x\ \text{in}\ \{\|x\|_0\leq s\}\cap\{Ax=b\}.$

In other words: find a feasible point, i.e. a point which lies in the intersection of the two sets. Well, most often feasibility problems involve convex sets but here, the first one given by this “${0}$-norm” is definitely not convex. One of the simplest algorithms for the convex feasibility problem is to alternatingly project onto both sets. This algorithm dates back to von Neumann and has been analyzed in great detail. To make this method work for non-convex sets one only needs to know how to project onto both sets. For the case of the equality constraint ${Ax=b}$ one can use numerical linear algebra to obtain the projection. The non-convex constraint on the number of non-zero entries is in fact even easier: For ${x\in{\mathbb R}^n}$ the projection onto ${\{\|x\|_0\leq s\}}$ consists of just keeping the ${s}$ largest entries of ${x}$ while setting the others to zero (known as the “best ${s}$-term approximation”). However, the theory breaks down in the case of non-convex sets. Russell treated problem in several papers (have a look at his publication page) and in the talk he focused on the problem of constraint qualification, i.e. what kind of regularity has to be imposed on the intersection of the two sets. He could shows that (local) linear convergence of the algorithm (which is observed numerically) can indeed be justified theoretically. One point which is still open is the phenomenon that the method seems to be convergent regardless of the initialization and that (even more surprisingly) that the limit point seems to be independent of the starting point (and also seems to be robust with respect to overestimating the sparsity ${s}$). I wondered if his results are robust with respect to inexact projections. For larger problems the projection onto the equality constraint ${Ax=b}$ are computationally expensive. For example it would be interesting to see what happens if one approximates the projection with a truncated CG-iteration as Andreas, Marc and I did in our paper on subgradient methods for Basis Pursuit.

• Joel Tropp reported on his paper Sharp recovery bounds for convex deconvolution, with applications together with Michael McCoy. However, in his title he used demixing instead of deconvolution (which, I think, is more appropriate and leads to less confusion). With “demixing” they mean the following: Suppose you have two signals ${x_0}$ and ${y_0}$ of which you observe only the superposition of ${x_0}$ and a unitarily transformed ${y_0}$, i.e. for a unitary matrix ${U}$ you observe

$\displaystyle z_0 = x_0 + Uy_0.$

Of course, without further assumptions there is no way to recover ${x_0}$ and ${y_0}$ from the knowledge of ${z_0}$ and ${U}$. As one motivation he used the assumption that both ${x_0}$ and ${y_0}$ are sparse. After the big bang of compressed sensing nobody wonders that one turns to convex optimization with ${\ell^1}$-norms in the following manner:

$\displaystyle \min_{x,y} \|x\|_1 + \lambda\|y\|_1 \ \text{such that}\ x + Uy = z_0. \ \ \ \ \ (1)$

This looks a lot like sparse approximation: Eliminating ${x}$ one obtains the unconstraint problem \begin{equation*} \min_y \|z_0-Uy\|_1 + \lambda \|y\|_1. \end{equation*}

Phrased differently, this problem aims at finding an approximate sparse solution of ${Uy=z_0}$ such that the residual (could also say “noise”) ${z_0-Uy=x}$ is also sparse. This differs from the common Basis Pursuit Denoising (BPDN) by the structure function for the residual (which is the squared ${2}$-norm). This is due to the fact that in BPDN one usually assumes Gaussian noise which naturally lead to the squared ${2}$-norm. Well, one man’s noise is the other man’s signal, as we see here. Tropp and McCoy obtained very sharp thresholds on the sparsity of ${x_0}$ and ${y_0}$ which allow for exact recovery of both of them by solving (1). One thing which makes their analysis simpler is the following reformulation: The treated the related problem \begin{equation*} \min_{x,y} \|x\|_1 \text{such that} \|y\|_1\leq\alpha, x+Uy=z_0 \end{equation*} (which I would call this the Ivanov version of the Tikhonov-problem (1)). This allows for precise exploitation of prior knowledge by assuming that the number ${\alpha_0 = \|y_0\|_1}$ is known.

First I wondered if this reformulation was responsible for their unusual sharp results (sharper the results for exact recovery by BDPN), but I think it’s not. I think this is due to the fact that they have this strong assumption on the “residual”, namely that it is sparse. This can be formulated with the help of ${1}$-norm (which is “non-smooth”) in contrast to the smooth ${2}$-norm which is what one gets as prior for Gaussian noise. Moreover, McCoy and Tropp generalized their result to the case in which the structure of ${x_0}$ and ${y_0}$ is formulated by two functionals ${f}$ and ${g}$, respectively. Assuming a kind of non-smoothness of ${f}$ and ${g}$ the obtain the same kind of results and especially matrix decomposition problems are covered.

The second day of ISMP started (for me) with the session I organized and chaired.

The first talk was by Michael Goldman on Continuous Primal-Dual Methods in Image Processing. He considered the continuous Arrow-Hurwitz method for saddle point problems

$\displaystyle \min_{u}\max_{\xi} K(u,\xi)$

with ${K}$ convex in the first and concave in the second variable. The continuous Arrow-Hurwitz method consists of solving

$\displaystyle \begin{array}{rcl} \partial_t u(t) &=& -\nabla_u K(u(t),\xi(t))\\ \partial_t \xi(t) &=& \nabla_\xi K(u(t),\xi(t)). \end{array}$

His talk evolved around the problem if ${K}$ comes from a functional which contains the total variation, namely he considered

$\displaystyle K(u,\xi) = -\int_\Omega u\text{div}(\xi) + G(u)$

with the additional constraints ${\xi\in C^1_C(\Omega,{\mathbb R}^2)}$ and ${|\xi|\leq 1}$. For the case of ${G(u) = \lambda\|u-f\|^2/2}$ he presented a nice analysis of the problem including convergence of the method to a solution of the primal problem and some a-posteriori estimates. This reminded me of Showalters method for the regularization of ill-posed problems. The Arrow-Hurwitz method looks like a regularized version of Showalters method and hence, early stopping does not seem to be necessary for regularization. The related paper is Continuous Primal-Dual Methods for Image Processing.

The second talk was given by Elias Helou and was on Incremental Algorithms for Convex Non-Smooth Optimization with Applications in Image Reconstructions. He presented his work on a very general framework for problems of the class

$\displaystyle \min_{x\in X} f(x)$

with a convex function ${f}$ and a convex set ${X}$. Basically, he abstracted the properties of the projected subgradient method. This consists of taking subgradient descent steps for ${f}$ followed by projection onto ${X}$ iteratively: With a subgradient ${g^n\in\partial f(x^n)}$ this reads as

$\displaystyle x^{n+1} = P_X(x^n -\alpha_n g^n)$

he extracted the conditions one needs from the subgradient descent step and from the projection step and formulated an algorithm which consists of successive application of an “optimality operator” ${\mathcal{O}_f}$ (replacing the subgradient step) and a feasibility operator ${\mathcal{F}_X}$ (replacing the projection step). The algorithm then reads as

$\displaystyle \begin{array}{rcl} x^{n+1/2} &=& \mathcal{O}_f(x^n,\alpha_n)\\ x^{n+1} &=& \mathcal{F}_x(x^{n+1/2} \end{array}$

and he showed convergence under the extracted conditions. The related paper is , Incremental Subgradients for Constrained Convex Optimization: a Unified Framework and New Methods.

The third talk was by Jerome Fehrenbach on Stripes removel in images, apllications in microscopy. He considered the problem of very specific noise which is appear in the form of stripes (and appears, for example, “single plane illumination microscopy”). In fact he considered a little more general case and the model he proposed was as follows: The observed image is

$\displaystyle u_{\text{OBS}} = u + n,$

i.e. the usual sum of the true image ${u}$ and noise ${n}$. However, for the noise he assumed that it is given by

$\displaystyle n = \sum_{j=1}^m \psi_j*\lambda_j,$

i.e. it is a sum of different convolutions. The ${\psi_j}$ are kind of shape-functions which describe the “pattern of the noise” and the ${\lambda_j}$ are samples of noise processes, following specific distributions (could be white noise realizations, impulsive noise or something else)-. He then formulated a variational method to identify the variables ${\lambda_j}$ which reads as

$\displaystyle \min \|\nabla(u_{\text{OBS}} - \sum_{j=1}^m \psi_j*\lambda_j)\|_1 + \sum_j \phi_j(\lambda_j).$

Basically, this is the usual variational approach to image denoising, but nor the optimization variable is the noise rather than the image. This is due to the fact that the noise has a specific complicated structure and the usual formulation with ${u = u_{\text{OBS}} +n}$ is not feasible. He used the primal-dual algorithm by Chambolle and Pock for this problem and showed that the method works well on real world problems.

Another theme which caught my attention here is “optimization with variational inequalities as constraints”. At first glance that sounds pretty awkward. Variational inequalities can be quite complicated things and why on earth would somebody considers these things as side conditions in optimization problems? In fact there are good reasons to do so. One reason is, if you have to deal with bi-level optimization problems. Consider an optimization problem

$\displaystyle \min_{x\in C} F(x,p) \ \ \ \ \ (1)$

with convex ${C}$ and ${F(\cdot,p)}$ (omitting regularity conditions which could be necessary to impose) depending on a parameter ${p}$. Now consider the case that you want to choose the parameter ${p}$ in an optimal way, i.e. it solves another optimization problem. This could look like

$\displaystyle \min_p G(x)\quad\text{s.t.}\ x\ \text{solves (1)}. \ \ \ \ \ (2)$

Now you have an optimization problem as a constraint. Now we use the optimality condition for the problem~(1): For differentiable ${F}$, ${x^*}$ solves~(1) if and only if

$\displaystyle \forall y\in C:\ \nabla_x F(x^*(p),p)(y-x^*(p))\geq 0.$

In other words: We con reformulate (2) as

$\displaystyle \min_p G(x)\quad\text{s.t.}\ \forall y\in C:\ \nabla_x F(x^*(p),p)(y-x^*(p))\geq 0. \ \ \ \ \ (3)$

And there it is, our optimization problem with a variational inequality as constraint. Here at ISMP there are entire sessions devoted to this, see here and here.