Optimization


If you are working on optimization with partial differential equations as constraints, you may be interested in the website

“OPTPDE – A Collection of Problems in PDE-Constrained Optimization”, http://www.optpde.net.

If you have developed an algorithm which can handle a certain class of optimization problems you need to do evaluations and tests on how well the method performs. To do so, you need well constructed test problems. This could be either problems where the optimal solution is known analytically our problems where the solution is known with a rigorous error bound obtained with a bullet-proof solver. Both things are not always easy to obtain and OPTPDE shall serve as a resource for such problems. It has been designed by Roland Herzog, Arnd Rösch, Stefan Ulbrich and Winnifried Wollner.

The generation of test instance for optimization problems seems quite important to me and indeed, several things can go wrong if this is not done right. Frequently, one sees tests for optimization routines on problems where the optimal solution is not known. Since there are usually different ways to express optimality conditions it is not always clear how to check for optimality; even more so, if you only check for “approximate optimality”, e.g. up to machine precision. A frequently observed effect is a kind of “trusted method bias”. By this I mean that an optimal solution is calculated by some trusted method and comparing the outcome of the tested routine with this solution. However, the trusted method uses some stopping criterion usually based on some specific set of formulations of optimality conditions and these can be different from what the new method has been tuned to. And most often, the stopping criteria do not give a rigorous error bound for the solution or the optimal objective value.

For sparse reconstruction problems, I dealt with this issue in “Constructing test instance for Basis Pursuit Denoising” (preprint available here) but I think this methodology could be used for other settings as well.

There are several answers to the following question:

1. What is a convex set?

For a convex set you probably know these definitions:

Definition 1 A subset {C} of a real vector space {V} is convex if for any {x,y\in C} and {\lambda\in[0,1]} it holds that {\lambda x + (1-\lambda)y\in C}.

In other words: If two points lie in the set, then every convex combination also lies in the set.

While this is a “definition from the inside”, convex sets can also be characterized “from the outside”. We add closedness as an assumption and get:

Definition 2 A closed subset {C} of a real locally convex topological vector space {V} is convex if it is the intersection of closed halfspaces (i.e. sets of the form {\{x\in V\ :\ \langle a,x\rangle\geq c\}} for some {a} in the dual space {V^*} and {c\in {\mathbb R}}).

Moreover, we could define convex sets via convex functions:

Definition 3 A set {C\subset V} is convex if there is a convex function {f:V\rightarrow {\mathbb R}} such that {C = \{x\ :\ f(x)\leq 0\}}.

Of course, this only makes sense once we have defined convex functions. Hence, we could also ask the question:

2. What is a convex function?

We can define a convex function by means of convex sets as follows:

Definition 4 A function {f:V\rightarrow{\mathbb R}} from a real vector space into the real numbers is convex, if its epigraph {\text{epi} f = \{(x,\mu)\ :\ f(x)\leq \mu\}\subset V\times {\mathbb R}} is convex (as a subset of the vector space {V\times{\mathbb R}}).

The epigraph consists of the points {(x,\mu)} which lie above the graph of the function and carries the same information as the function.

(Let me note that one can replace the real numbers here and in the following with the extended real numbers {\bar {\mathbb R} = {\mathbb R}\cup\{-\infty,\infty\}} if one uses the right extension of the arithmetic and the obvious ordering, but we do not consider this in this post.)

Because epigraphs are not arbitrary convex sets but have a special form (if a point {(x,\mu)} is in an epigraph, then every {(x,\lambda)} with {\lambda\geq \mu} is also in the epigraph), and because the underlying vector space {V\times {\mathbb R}} comes with an order in the second component, some of the definitions for convex sets from above have a specialized form:

From the definition “convex combinations stay in the set” we get:

Definition 5 A function {f:V\rightarrow {\mathbb R}} is convex, if for all {x,y\in V} and {\lambda\in [0,1]} it holds that

\displaystyle f(\lambda x + (1-\lambda)y) \leq \lambda f(x) + (1-\lambda)f(y).

In other words: The secant of any two points on the graph lies above the graph.

From the definition by “intersection of half spaces” we get another definition. Since we added closedness in this case we add this assumption also here. However, closedness of the epigraph is equivalent to lower-semicontinuity (lsc) of the function and since lsc functions are very convenient we use this notion:

Definition 6 A function {f:V\rightarrow{\mathbb R}} is convex and lsc if it is the pointwise supremum of affine functions, i.e., for some set {S} of tuples {(a,c)\in V^*\times {\mathbb R}} it holds that

\displaystyle f(x) = \sup_{(a,c) \in S} \langle a,x\rangle + c.

A special consequence if this definition is that tangents to the graph of a convex function lie below the function. Another important consequence of this fact is that the local behavior of the function, i.e. its tangent plane at some point, carries some information about the global behavior. Especially, the property that the function lies above its tangent planes allows one to conclude that local minima of the function are also global. Probably the last properties are the ones which give convex functions a distinguished role, especially in the field of optimization.

Some of the previous definitions allow for generalizations into several direction and the quest for the abstract core of the notion of convexity has lead to the field of abstract convexity.

3. Abstract convexity

When searching for abstractions of the notion of convexity one may get confused by the various different approaches. For example there are generalized notions of convexity, e.g. for function of spaces of matrices (e.g. rank-one convexity, quasi-convexity or polyconvexity) and there are also further different notions like pseudo-convexity, invexity or another form ofquasi-convexity. Here we do not have generalization in mind but abstraction. Although both things are somehow related, the way of thinking is a bit different. Our aim is not to find a useful notion which is more general than the notion of convexity but to find a formulation which contains the notion of convexity and abstracts away some ingredients which probably not carry the essence of the notion.

In the literature one also finds several approaches in this direction and I mention only my favorite one (and I restrict myself to abstractly convex functions and not write about abstractly convex sets).

To me, the most appealing notion of an abstract convex function is an abstraction of the definition as “pointwise supremum of affine functions”. Let’s look again at the definition:

A function {f:V\rightarrow {\mathbb R}} is convex and lsc if there is a subset {S} of {V^*\times {\mathbb R}} such that

\displaystyle f(x) = \sup_{(a,c)\in S} \langle a,x\rangle + c.

We abstract away the vector space structure and hence, also the duality, but keep the real-valuedness (together with its order) and define:

Definition 7 Let {X} be a set and let {W} be a set of real valued function on {X}. Then a function {f:X\rightarrow{\mathbb R}} is said to be {W}-convex if there is a subset {S} of {W\times {\mathbb R}} such that

\displaystyle f(x) = \sup_{(w,c)\in S} w(x) + c.

What we did in this definition was simply to replace continuous affine functions {x\mapsto \langle a,x\rangle} on a vector space by an arbitrary collection of real valued functions {x\mapsto w(x)} on a set. On sees immediately that for every function {w\in W} and any real number {c} the function {w+c} is {W} convex (similarly to the fact that every affine linear function is convex).

Another nice thing about this approach is, that it allows for some notion of duality/conjugation. For {f:V\rightarrow{\mathbb R}} we define the {W}-conjugate by

\displaystyle f^{W*}(w) = \sup_{w\in W} \Big(w(x)- f(x) \Big)

and we can even formulate a biconjugate

\displaystyle f^{W**}(x) = \sup_x \Big(w(x) - f^{W*}(w)\Big).

We naturally have a Fenchel inequality

\displaystyle w(x) \leq f(x) + f^{W*}(w)

and we may even define subgradients as usual. Note that a conventional subgradient is an element of the dual space which defines a tangent plane at the point where the subgradient is taken, that is, {a\in V^*} is a subgradient of {f} at {x}, if for all {y\in V} it holds that {f(y) \geq f(x) + \langle a,y-x\rangle} or

\displaystyle f(y) - \langle a,y\rangle\geq f(x) - \langle a,x\rangle.

A {W}-subgradient is an element of {W}, namely we define: {w\in\partial^{W}f(x)} if

\displaystyle \text{for all}\ y:\quad f(y) -w(y) \geq f(x) - w(x).

Then we also have a Fenchel equality:

\displaystyle w\in\partial^W f(x)\iff w(x) = f(x) + f^{W*}(w).

One may also take dualization as the starting point for an abstraction.

4. Abstract conjugation

We could formulate the {W}-conjugate as follows: For {\Phi(w,x) = w(x)} we have

\displaystyle f^{W*}(x) = \sup_w \Big(\Phi(w,x) - f(x)\Big).

This opens the door to another abstraction: For some sets {X,W} (without any additional structure) define a coupling function {\Phi: X\times W \rightarrow {\mathbb R}} and define the {\Phi}-conjugate as

\displaystyle f^{\Phi*}(w) = \sup_x \Big(\Phi(w,x) - f(x)\Big)

and the {\Phi}-biconjugate as

\displaystyle f^{\Phi**}(x) = \sup_x \Big(\Phi(w,x) - f^{W*}(w)\Big)

ISMP LogoISMP is over now and I’m already home. I do not have many things to report on from the last day. This is not due the lower quality of the talks but due to the fact that I was a little bit exhausted, as usual at the end of a five-day conference. However, I collect a few things for the record:

  • In the morning I visited the semi-planary by Xiaojun Chenon non-convex and non-smooth minimization with smoothing methods. Not surprisingly, she treated the problem

    \displaystyle \min_x f(x) + \|x\|_p^p

    with convex and smooth {f:{\mathbb R}^n\rightarrow{\mathbb R}} and {0<p<1}. She proposed and analyzed smoothing methods, that is, to smooth the problem a bit to obtain a Lipschitz-continuous objective function {\phi_\epsilon}, minimizing this and then gradually decreasing {\epsilon}. This works, as she showed. If I remember correctly, she also treated “iteratively reweighted least squares” as I described in my previous post. Unfortunately, she did not include the generalized forward-backward methods based on {\text{prox}}-functions for non-convex functions. Kristian and I pursued this approach in our paper Minimization of non-smooth, non-convex functionals by iterative thresholding and some special features of our analysis include:

    • A condition which excludes some (but not all) local minimizers from being global.
    • An algorithm which avoids this non-global minimizers by carefully adjusting the steplength of the method.
    • A result that the number of local minimizers is still finite, even if the problem is posed in {\ell^2({\mathbb N})} and not in {{\mathbb R}^n}.

    Most of our results hold true, if the {p}-quasi-norm is replaced by functions of the form

    \displaystyle \sum_n \phi_n(|x_n|)

    with special non-convex {\phi}, namely fulfilling a list of assumptions like

    • {\phi'(x) \rightarrow \infty} for {x\rightarrow 0} (infinite slope at {0}) and {\phi(x)\rightarrow\infty} for {x\rightarrow\infty} (mild coercivity),
    • {\phi'} strictly convex on {]0,\infty[} and {\phi'(x)/x\rightarrow 0} for {x\rightarrow\infty},
    • for each {b>0} there is {a>0} such that for {x<b} it holds that {\phi(x)>ax^2}, and
    • local integrability of some section of {\partial\phi'(x) x}.

    As one easily sees, {p}-quasi-norms fulfill the assumptions and some other interesting functions as well (e.g. some with very steep slope at {0} like {x\mapsto \log(x^{1/3}+1)}).

  • Jorge Nocedalgave a talk on second-order methods for non-smooth problems and his main example was a functional like

    \displaystyle \min_x f(x) + \|x\|_1

    with a convex and smooth {f}, but different from Xiaojun Chen, he only considered the {1}-norm. His talked is among the best planary talks I have ever attended and it was a great pleasure to listen to him. He carefully explained things and put them in perspective. In the case he skipped slides, he made me feel that I either did not miss an important thing, or understood them even though he didn’t show them He argued that it is not necessarily more expensive to use second order information in contrast to first order methods. Indeed, the {1}-norm can be used to reduce the number of degrees of freedom for a second order step. What was pretty interesting is, that he advocated semismooth Newton methods for this problem. Roland and I pursued this approach some time ago in our paper A Semismooth Newton Method for Tikhonov Functionals with Sparsity Constraints and, if I remember correctly (my notes are not complete at this point), his family of methods included our ssn-method. The method Roland and I proposed worked amazingly well in the cases in which it converged but the method suffered from non-global convergence. We had some preliminary ideas for globalization, which we could not tune enough to retain the speed of the method, and abandoned the topic. Now, that the topic will most probably be revived by the community, I am looking forward to fresh ideas here.

Today I report on two things I came across here at ISMP:

  • The first is a talk by Russell Luke on Constraint qualifications for nonconvex feasibility problems. Luke treated the NP-hard problem of sparsest solutions of linear systems. In fact he did not tackle this problem but the problem to find an {s}-sparse solution of an {m\times n} system of equations. He formulated this as a feasibility-problem (well, Heinz Bauschke was a collaborator) as follows: With the usual malpractice let us denote by {\|x\|_0} the number of non-zero entries of {x\in{\mathbb R}^n}. Then the problem of finding an {s}-sparse solution to {Ax=b} is:

    \displaystyle  \text{Find}\ x\ \text{in}\ \{\|x\|_0\leq s\}\cap\{Ax=b\}.

    In other words: find a feasible point, i.e. a point which lies in the intersection of the two sets. Well, most often feasibility problems involve convex sets but here, the first one given by this “{0}-norm” is definitely not convex. One of the simplest algorithms for the convex feasibility problem is to alternatingly project onto both sets. This algorithm dates back to von Neumann and has been analyzed in great detail. To make this method work for non-convex sets one only needs to know how to project onto both sets. For the case of the equality constraint {Ax=b} one can use numerical linear algebra to obtain the projection. The non-convex constraint on the number of non-zero entries is in fact even easier: For {x\in{\mathbb R}^n} the projection onto {\{\|x\|_0\leq s\}} consists of just keeping the {s} largest entries of {x} while setting the others to zero (known as the “best {s}-term approximation”). However, the theory breaks down in the case of non-convex sets. Russell treated problem in several papers (have a look at his publication page) and in the talk he focused on the problem of constraint qualification, i.e. what kind of regularity has to be imposed on the intersection of the two sets. He could shows that (local) linear convergence of the algorithm (which is observed numerically) can indeed be justified theoretically. One point which is still open is the phenomenon that the method seems to be convergent regardless of the initialization and that (even more surprisingly) that the limit point seems to be independent of the starting point (and also seems to be robust with respect to overestimating the sparsity {s}). I wondered if his results are robust with respect to inexact projections. For larger problems the projection onto the equality constraint {Ax=b} are computationally expensive. For example it would be interesting to see what happens if one approximates the projection with a truncated CG-iteration as Andreas, Marc and I did in our paper on subgradient methods for Basis Pursuit.

  • Joel Tropp reported on his paper Sharp recovery bounds for convex deconvolution, with applications together with Michael McCoy. However, in his title he used demixing instead of deconvolution (which, I think, is more appropriate and leads to less confusion). With “demixing” they mean the following: Suppose you have two signals {x_0} and {y_0} of which you observe only the superposition of {x_0} and a unitarily transformed {y_0}, i.e. for a unitary matrix {U} you observe

    \displaystyle  z_0 = x_0 + Uy_0.

    Of course, without further assumptions there is no way to recover {x_0} and {y_0} from the knowledge of {z_0} and {U}. As one motivation he used the assumption that both {x_0} and {y_0} are sparse. After the big bang of compressed sensing nobody wonders that one turns to convex optimization with {\ell^1}-norms in the following manner:

    \displaystyle   \min_{x,y} \|x\|_1 + \lambda\|y\|_1 \ \text{such that}\ x + Uy = z_0. \ \ \ \ \ (1)

    This looks a lot like sparse approximation: Eliminating {x} one obtains the unconstraint problem \begin{equation*} \min_y \|z_0-Uy\|_1 + \lambda \|y\|_1. \end{equation*}

    Phrased differently, this problem aims at finding an approximate sparse solution of {Uy=z_0} such that the residual (could also say “noise”) {z_0-Uy=x} is also sparse. This differs from the common Basis Pursuit Denoising (BPDN) by the structure function for the residual (which is the squared {2}-norm). This is due to the fact that in BPDN one usually assumes Gaussian noise which naturally lead to the squared {2}-norm. Well, one man’s noise is the other man’s signal, as we see here. Tropp and McCoy obtained very sharp thresholds on the sparsity of {x_0} and {y_0} which allow for exact recovery of both of them by solving (1). One thing which makes their analysis simpler is the following reformulation: The treated the related problem \begin{equation*} \min_{x,y} \|x\|_1 \text{such that} \|y\|_1\leq\alpha, x+Uy=z_0 \end{equation*} (which I would call this the Ivanov version of the Tikhonov-problem (1)). This allows for precise exploitation of prior knowledge by assuming that the number {\alpha_0 = \|y_0\|_1} is known.

    First I wondered if this reformulation was responsible for their unusual sharp results (sharper the results for exact recovery by BDPN), but I think it’s not. I think this is due to the fact that they have this strong assumption on the “residual”, namely that it is sparse. This can be formulated with the help of {1}-norm (which is “non-smooth”) in contrast to the smooth {2}-norm which is what one gets as prior for Gaussian noise. Moreover, McCoy and Tropp generalized their result to the case in which the structure of {x_0} and {y_0} is formulated by two functionals {f} and {g}, respectively. Assuming a kind of non-smoothness of {f} and {g} the obtain the same kind of results and especially matrix decomposition problems are covered.

The second day of ISMP started (for me) with the session I organized and chaired.

The first talk was by Michael Goldman on Continuous Primal-Dual Methods in Image Processing. He considered the continuous Arrow-Hurwitz method for saddle point problems

\displaystyle \min_{u}\max_{\xi} K(u,\xi)

with {K} convex in the first and concave in the second variable. The continuous Arrow-Hurwitz method consists of solving

\displaystyle \begin{array}{rcl} \partial_t u(t) &=& -\nabla_u K(u(t),\xi(t))\\ \partial_t \xi(t) &=& \nabla_\xi K(u(t),\xi(t)). \end{array}

His talk evolved around the problem if {K} comes from a functional which contains the total variation, namely he considered

\displaystyle K(u,\xi) = -\int_\Omega u\text{div}(\xi) + G(u)

with the additional constraints {\xi\in C^1_C(\Omega,{\mathbb R}^2)} and {|\xi|\leq 1}. For the case of {G(u) = \lambda\|u-f\|^2/2} he presented a nice analysis of the problem including convergence of the method to a solution of the primal problem and some a-posteriori estimates. This reminded me of Showalters method for the regularization of ill-posed problems. The Arrow-Hurwitz method looks like a regularized version of Showalters method and hence, early stopping does not seem to be necessary for regularization. The related paper is Continuous Primal-Dual Methods for Image Processing.

The second talk was given by Elias Helou and was on Incremental Algorithms for Convex Non-Smooth Optimization with Applications in Image Reconstructions. He presented his work on a very general framework for problems of the class

\displaystyle \min_{x\in X} f(x)

with a convex function {f} and a convex set {X}. Basically, he abstracted the properties of the projected subgradient method. This consists of taking subgradient descent steps for {f} followed by projection onto {X} iteratively: With a subgradient {g^n\in\partial f(x^n)} this reads as

\displaystyle x^{n+1} = P_X(x^n -\alpha_n g^n)

he extracted the conditions one needs from the subgradient descent step and from the projection step and formulated an algorithm which consists of successive application of an “optimality operator” {\mathcal{O}_f} (replacing the subgradient step) and a feasibility operator {\mathcal{F}_X} (replacing the projection step). The algorithm then reads as

\displaystyle \begin{array}{rcl} x^{n+1/2} &=& \mathcal{O}_f(x^n,\alpha_n)\\ x^{n+1} &=& \mathcal{F}_x(x^{n+1/2} \end{array}

and he showed convergence under the extracted conditions. The related paper is , Incremental Subgradients for Constrained Convex Optimization: a Unified Framework and New Methods.

The third talk was by Jerome Fehrenbach on Stripes removel in images, apllications in microscopy. He considered the problem of very specific noise which is appear in the form of stripes (and appears, for example, “single plane illumination microscopy”). In fact he considered a little more general case and the model he proposed was as follows: The observed image is

\displaystyle u_{\text{OBS}} = u + n,

i.e. the usual sum of the true image {u} and noise {n}. However, for the noise he assumed that it is given by

\displaystyle n = \sum_{j=1}^m \psi_j*\lambda_j,

i.e. it is a sum of different convolutions. The {\psi_j} are kind of shape-functions which describe the “pattern of the noise” and the {\lambda_j} are samples of noise processes, following specific distributions (could be white noise realizations, impulsive noise or something else)-. He then formulated a variational method to identify the variables {\lambda_j} which reads as

\displaystyle \min \|\nabla(u_{\text{OBS}} - \sum_{j=1}^m \psi_j*\lambda_j)\|_1 + \sum_j \phi_j(\lambda_j).

Basically, this is the usual variational approach to image denoising, but nor the optimization variable is the noise rather than the image. This is due to the fact that the noise has a specific complicated structure and the usual formulation with {u = u_{\text{OBS}} +n} is not feasible. He used the primal-dual algorithm by Chambolle and Pock for this problem and showed that the method works well on real world problems.

Another theme which caught my attention here is “optimization with variational inequalities as constraints”. At first glance that sounds pretty awkward. Variational inequalities can be quite complicated things and why on earth would somebody considers these things as side conditions in optimization problems? In fact there are good reasons to do so. One reason is, if you have to deal with bi-level optimization problems. Consider an optimization problem

\displaystyle \min_{x\in C} F(x,p) \ \ \ \ \ (1)

 

with convex {C} and {F(\cdot,p)} (omitting regularity conditions which could be necessary to impose) depending on a parameter {p}. Now consider the case that you want to choose the parameter {p} in an optimal way, i.e. it solves another optimization problem. This could look like

\displaystyle \min_p G(x)\quad\text{s.t.}\ x\ \text{solves (1)}. \ \ \ \ \ (2)

 

Now you have an optimization problem as a constraint. Now we use the optimality condition for the problem~(1): For differentiable {F}, {x^*} solves~(1) if and only if

\displaystyle \forall y\in C:\ \nabla_x F(x^*(p),p)(y-x^*(p))\geq 0.

In other words: We con reformulate (2) as

\displaystyle \min_p G(x)\quad\text{s.t.}\ \forall y\in C:\ \nabla_x F(x^*(p),p)(y-x^*(p))\geq 0. \ \ \ \ \ (3)

 

And there it is, our optimization problem with a variational inequality as constraint. Here at ISMP there are entire sessions devoted to this, see here and here.

Today I arrived at ISMP in Berlin. This seems to be the largest conference on optimization and is hosted by the Mathematical Optimization Society. (As a side note: The society recently changed its name from MPS (Mathematical Programming Society) to MOS. Probably the conference will be called ISMO in a few years…).

The reception today was special in comparison to conference receptions I have attended so far. First, it was held in the Konzerthaus which is a pretty fancy neo-classical building. Well, I’ve been to equally fancy buildings at conference receptions at GAMM or AIP conferences already, but the distinguishing feature this evening was the program. As usual it featured welcome notes by important people (notably, the one by the official government representative was accurate and entertaining!), prizes and music. The music was great, the host (G.M. Ziegler) did a great job and the ceremony felt like a show rather than an opening reception.

From the prices I’d like to mention two:

After this reception I am looking even more forward to the rest of this conference.

As I a side note: Something seems to be wrong with me and optimization conferences. It seems like every time I visit such a conference, I am loosing my cell-phone. Happened to me at SIOPT 2011 in Darmstadt and happened to me again today…

In this post I just collect a few papers that caught my attention in the last moth.

I begin with Estimating Unknown Sparsity in Compressed Sensing by Miles E. Lopes. The abstract reads

Within the framework of compressed sensing, many theoretical guarantees for signal reconstruction require that the number of linear measurements {n} exceed the sparsity {\|x\|_0} of the unknown signal {x\in\mathbb{R}^p}. However, if the sparsity {\|x\|_0} is unknown, the choice of {n} remains problematic. This paper considers the problem of estimating the unknown degree of sparsity of {x} with only a small number of linear measurements. Although we show that estimation of {\|x\|_0} is generally intractable in this framework, we consider an alternative measure of sparsity {s(x):=\frac{\|x\|_1^2}{\|x\|_2^2}}, which is a sharp lower bound on {\|x\|_0}, and is more amenable to estimation. When {x} is a non-negative vector, we propose a computationally efficient estimator {\hat{s}(x)}, and use non-asymptotic methods to bound the relative error of {\hat{s}(x)} in terms of a finite number of measurements. Remarkably, the quality of estimation is dimension-free, which ensures that {\hat{s}(x)} is well-suited to the high-dimensional regime where {n<<p}. These results also extend naturally to the problem of using linear measurements to estimate the rank of a positive semi-definite matrix, or the sparsity of a non-negative matrix. Finally, we show that if no structural assumption (such as non-negativity) is made on the signal {x}, then the quantity {s(x)} cannot generally be estimated when {n<<p}.

It’s a nice combination of the observation that the quotient {s(x)} is a sharp lower bound for {\|x\|_0} and that it is possible to estimate the one-norm and the two norm of a vector {x} (with additional properties) from carefully chosen measurements. For a non-negative vector {x} you just measure with the constant-one vector which (in a noisy environment) gives you an estimate of {\|x\|_1}. Similarly, measuring with Gaussian random vector you can obtain an estimate of {\|x\|_2}.

Then there is the dissertation of Dustin Mixon on the arxiv: Sparse Signal Processing with Frame Theory which is well worth reading but too long to provide a short overview. Here is the abstract:

Many emerging applications involve sparse signals, and their processing is a subject of active research. We desire a large class of sensing matrices which allow the user to discern important properties of the measured sparse signal. Of particular interest are matrices with the restricted isometry property (RIP). RIP matrices are known to enable efficient and stable reconstruction of sfficiently sparse signals, but the deterministic construction of such matrices has proven very dfficult. In this thesis, we discuss this matrix design problem in the context of a growing field of study known as frame theory. In the first two chapters, we build large families of equiangular tight frames and full spark frames, and we discuss their relationship to RIP matrices as well as their utility in other aspects of sparse signal processing. In Chapter 3, we pave the road to deterministic RIP matrices, evaluating various techniques to demonstrate RIP, and making interesting connections with graph theory and number theory. We conclude in Chapter 4 with a coherence-based alternative to RIP, which provides near-optimal probabilistic guarantees for various aspects of sparse signal processing while at the same time admitting a whole host of deterministic constructions.

By the way, the thesis is dedicated “To all those who never dedicated a dissertation to themselves.”

Further we have Proximal Newton-type Methods for Minimizing Convex Objective Functions in Composite Form by Jason D Lee, Yuekai Sun, Michael A. Saunders. This paper extends the well explored first order methods for problem of the type {\min g(x) + h(x)} with Lipschitz-differentiable {g} or simple {\mathrm{prox}_h} to second order Newton-type methods. The abstract reads

We consider minimizing convex objective functions in composite form

\displaystyle \min_{x\in\mathbb{R}^n} f(x) := g(x) + h(x)

where {g} is convex and twice-continuously differentiable and {h:\mathbb{R}^n\rightarrow\mathbb{R}} is a convex but not necessarily differentiable function whose proximal mapping can be evaluated efficiently. We derive a generalization of Newton-type methods to handle such convex but nonsmooth objective functions. Many problems of relevance in high-dimensional statistics, machine learning, and signal processing can be formulated in composite form. We prove such methods are globally convergent to a minimizer and achieve quadratic rates of convergence in the vicinity of a unique minimizer. We also demonstrate the performance of such methods using problems of relevance in machine learning and high-dimensional statistics.

With this post I say goodbye for a few weeks of holiday.

In my previous post I announced the draft of the paper “Infeasible-Point Subgradient Algorithm and Computational Solver Comparison for l1-Minimization” which is now available as a preprint at optimization online.

1. Fixed bugs; different results

Basically not much has changed from the draft to the preprint, however, we had to fix some bugs in our computational comparison of solvers and this changed the results. For example, {\ell^1}-magic is now a little better, especially when combined with the heuristic support evaluation (HSE) we propose in the paper. But most notable, {\ell^1}-Homotopy is not the winner anymore. This is due to the fact that we had a conceptual error in our setup. Remember, that {\ell^1}-Homotopy solves that Basis Pursuit denoising problem

\displaystyle  \min_x \frac12\|Ax-b\|_2^2 + \lambda\|x\|_1

starting with {\lambda = \|A^Tb\|_\infty} (which results in {x=0}) and decreases {\lambda} while tracking the (piecewise linear) solution path. Provable this reaches the Basis Pursuit solution for {\lambda=0} after crossing a finite number of breaks in the solution path. However, in our first experiments we used a final parameter of {\lambda = 10^{-9}}. And that was against our rules: We only considered solvers which (in theory) calculate the exact Basis Pursuit solution. Now we reran the calculations with {\lambda=0} and surprisingly the results were worse in terms of reconstruction accuracy (of course, also in terms of speed). We did not precisely found out which part of the solver is responsible for this effect, but it should have something to do with the accuracy of the inverse of the submatrix of {A^TA} which is maintained throughout the iterations.

Another surprise was that the results for {\lambda=10^{-9}} always ended with an approximate solution accuracy (about {10^{-8}}) for all test instances (no matter what size, matrix type or number of nonzeros we used). That is a surprise because there is no formula which tells you in advance how accurate the Basis Pursuit denoising for a particular {\lambda} will be (compared to the Basis Pursuit solution). Maybe an explanation lies is the common features all our test instances share: All matrix columns are normalized to unit Euclidean norm and all non-zero entries in the solutions follow the same distribution.

If you want to have a closer look on our results you can find all the data (i.e. all the running times and solution accuracies for all solvers and all instances) on our SPEAR project website, here.

By the way: Now the overall winner is CPLEX (using the dual simplex method)! So, please stop carrying the message that standard LP solvers are not good for Basis Pursuit…

2. Testset online!

With the submission of the paper, we also made our testset publicly available. You can download all our test instances the website of our SPEAR project both as Matlab .mat files or as ASCII-data (if you would like to use another language). Remember: Each instance comes with a matrix {A}, a vector {b} and a vector {x} which is guaranteed to be the unique solution of {Ax=b} with minimal one-norm. Moreover, there are instance for which the support of the solution is that large, that the minimal-one-norm solution is not necessarily the sparsest solution anymore which is also an interesting borderline case for most Basis Pursuit solvers.

3. ISAL1 online

Also, the Matlab code of ISAL1 (infeasible point subgradient algorithm for {\ell^1}) is online at the website of our SPEAR project. Check it out if you like.

L1TestPack has just been updated to version 1.1. With the help of Andreas Tillmann I enhanced this small gadget for issues related to {\ell^1} minimization. New functions are

  • Routines to directly calculate a source element for a given matrix {A} and a vector {x^\dagger}, that is, calculate a vector {y} such that

    \displaystyle  A^* y \in\partial\|x^\dagger\|_1.

    The existence of such a vector {y} ensures that the minimization problem (the Basis Pursuit problem)

    \displaystyle  \min_x \|x\|_1\ \text{ s.t. }\ Ax = Ax^\dagger

    has the unique solution {x^\dagger} (is other words: {x^\dagger} is recovered exactly). This is particularly helpful is you are interested in unique solutions for Basis pursuit without posing strong conditions which even imply {\ell^0}-{\ell^1}-equivalence.

  • Routines related to RIP constants, the ERC coefficient of Joel Tropp and the mutual coherence.
  • An implementation of the heuristic support evaluation HSE (also described in my previous post). (By the way: We were tempted to call this device “support evaluation routine” with acronym SuppER but abandoned this idea.)

I used to work on “non-convex” regularization with {\ell^p}-penalties, that is, studying the Tikhonov functional

\displaystyle \frac12 \|Ax-b\|_2^2 + \alpha\sum_{i}|x_i|^p \ \ \ \ \ (1)

 

with a linear operator {A} and {0<p<1}.

The regularization properties are quite nice as shown by Markus Grasmair in “Well-posedness and convergence rates for sparse regularization with sublinear {l^q} penalty term” and “Non-convex sparse regularisation” and Kristian Bredies and myself in “Regularization with non-convex separable constraints”.

The next important issue is, to have some way to calculate global minimizers for~(1). But, well, this task may be hard, if not hopeless: Of course one expects a whole lot of local minimizers.

It is quite instructive to consider the simple case in which {A} is the identity first:

Example 1 Consider the minimization of

\displaystyle F(x) = \frac12\|x-b\|_2^2 + \alpha\sum_i |x_i|^p. \ \ \ \ \ (2)

 

This problem separates over the coordinates and hence, can be solved by solving the one-dimensional minimization problem

\displaystyle s^*\in\textup{arg}\min_s \frac12 (s-b)^2 + \alpha|s|^p. \ \ \ \ \ (3)

 

We observe:

  • For {b\geq 0} we get {s^*\geq 0}.
  • Replacing {b} by {-b} leads to {-s^*} instead of {s^*}.

Hence, we can reduce the problem to: For {b\geq 0} find

\displaystyle s^* \in\textup{arg}\min_{s\geq 0} \frac12 (s-b)^2 + \alpha\, s^p. \ \ \ \ \ (4)

 

One local minimizer is always {s^*=0} since the growth of the {p}-th power beats the term {(\cdot-b)^2}. Then, {b} is large enough, there are two more extrema for~(4) which are given as the solutions to

\displaystyle s + \alpha p s^{p-1} = b

one of which is a local maximum (the one which is smaller in magnitude) and the other is a local minimum (the one which is larger in magnitude). This is illustrated in the following “bifurcation” picture:

Now the challenge is, to find out which local minimum has the smaller value. And here a strange thing happens: The “switching point” for {b} at which the global minimizer jumps from {0} to the upper branch of the (multivalued) inverse of {s\mapsto s + \alpha p s^{p-1}} is not at the place at which the second local minimum occurs. It is a little bit larger: In “Convergence rates and source conditions for Tikhonov regularization with sparsity constraints” I calculated this “jumping point” the as the weird expression

\displaystyle b^* = \frac{2-p}{2-2p}\Bigl(2\alpha(1-p)\Bigr)^{\frac{1}{2-p}}.

This leads to the following picture of the mapping

\displaystyle b^\mapsto \textup{arg}\min_s \frac12 (s-b)^2 + \alpha|s|^p

1. Iterative re-weighting

One approach to calculate minimizers in~(1) is the so called iterative re-weighting, which appeared at least in “An unconstrained {\ell^q} minimization for sparse solution of under determined linear systems” by Ming-Jun Lai and Jingyue Wang but is probably older. I think for the problem with equality constraints

\displaystyle \min \|x\|_q\ \textup{ s.t. }\ Ax=b

the approach at least dates back to the 80s but I forgot the reference… For the minimization of (1) the approach goes as follows: For {0<p<1} choose a {q\geq 1} and a small {\epsilon>0} and rewrite the {p}-quasi-norm as

\displaystyle \sum_i |x_i|^p \approx \sum_i (\epsilon + |x_i|^q)^{\frac{p}{q}}.

The necessary condition for a minimizer of

\displaystyle \frac12\|Ax-b\|_2^2 + \alpha\sum_i (\epsilon + |x_i|^q)^{\frac{p}{q}}

is (formally take the derivative)

\displaystyle 0 = \alpha \Big[\frac{p}{q} (\epsilon + |x_i|^q)^{\frac{p}{q}-1} q \textup{sgn}(x_i) |x_i|^{q-1}\Big]_i + A^*(Ax-b)

Note that the exponent {\frac{p}{q}-1} is negative (which is also a reason for the introduction of the small {\epsilon}). Aiming at an iteration, we fix some of the {x}‘s and try to solve for others: If we have a current iterate {x^k} we try to find {x^{k+1}} by solving

\displaystyle 0 = \alpha \Big[\frac{p}{q} (\epsilon + |x_i^k|^q)^{\frac{p}{q}-1} q \textup{sgn}(x_i) |x_i|^{q-1}\Big]_i + A^*(Ax-b)

for {x}. This is the necessary condition for another minimization problem which involves a weighted {q}-norm: Define (non-negative) weights {w^k_i = \frac{p}{q} (\epsilon + |x^k_i|^p)^{\frac{p}{q}-1}} an iterate

\displaystyle x^{k+1}\in \textup{arg}\min_x \frac12\|Ax-b\|_2^2 + \alpha\sum_i w_i^k |x_i|^q. \ \ \ \ \ (5)

 

Lai and Wang do this for {q=2} which has the benefit that each iteration can be done by solving a linear system. However, for general {1\leq q\leq 2} each iteration is still a convex minimization problem. The paper “Convergence of Reweighted {\ell^1} Minimization Algorithms and Unique Solution of Truncated {\ell^p} Minimization” by Xiaojun Chen and Weijun Zhou uses {q=1} and both papers deliver some theoretical results of the iteration. Indeed in both cases one can show (subsequential) convergence to a critical point.

Of course the question arises if there is a chance that the limit will be a global minimizer. Unfortunately this is not probable as a simple numerical experiment shows:

Example 2 We apply the iteration (5) to the one dimensional problem (3) in which we know the solution. And we do this for many values of {b} and plot the value of {b} against the limit of the iteration. Good news first: Everything converges nicely to critical points as deserved. Even better: {\epsilon} can be really small—machine precision works. The bad news: The limit depends on the initial value. Even worse: The method frequently ends on “the wrong branch”, i.e. in the local minimum which is not global. I made the following experiment: I took {p=1/2}, set {\alpha=1} and chose {q=2}. First I initialized for all values of {b} with {s^0=1}. This produced the following output (I plotted every fifth iteration):

Well, the iteration always chose the upper branch… In a second experiment I initialized with a smaller value, namely with {s^0=0.1} for all {b}. This gave:

That’s interesting: I ended at the upper branch for all values below the point where the lower branch (the one with the local maximum) crosses the initialization line. This seems to be true in general. Starting with {s^0=0.05} gave
Well, probably this is not too interesting: Starting “below the local maximum” will bring you to the local minimum which is lower and vice versa. Indeed Lai and Wang proved in their Theorem 2.5 that for a specific setting ({A} of completely full rank, sparsity high enough) there is an {\alpha} small enough such that the method will pick the global minimizer. But wait—they do not say anything about initialization… What happens if we initialize with zero? I have to check…

By the way: A similar experiment as in this example with different values of {q\geq 1} showed the same behavior (getting the right branch if the initialization is ok). However: smaller {q} gave much faster convergence. But remember: For {q=1} (experimentally the fastest) each iteration is an {\ell^1} penalized problem while for {q=2} one has to solve a linear system. So there seems to be a tradeoff between “small number of iterations in IRLP” and “complexity of the subproblems”.

2. Iterative thresholding

Together with Kristian Bredies I developed another approach to these nasty non-convex minimization problems with {\ell^p}-quasi-norms. We wrote a preprint back in 2009 which is currently under revision. Moreover, we always worked in a Hilbert space setting that is {A} maps the sequence space {\ell^2} into a separable Hilbert space.

Remark 1 When showing result for problems in separable Hilbert space I sometimes get the impression that others think this is somehow pointless since in the end one always works with {{\mathbb R}^N} in practice. However, I think that working directly in a separable Hilbert space is preferable since then one can be sure that the results will not depend on the dimension {N} in any nasty way.

Basically our approach was, to use one of the most popular approaches to the {\ell^1}-penalized problem: Iterative thresholding aka forward-backward splitting aka generalized gradient projection. I prefer the last motivation: Consider the minimization of a smooth function {F} over a convex set {C}

\displaystyle \min_{x\in C} F(x)

by the projected gradient method. That is: do a gradient step and use the projection {P_C} to project back onto {C}:

\displaystyle x^{n+1} = P_C(x^n - s_n \nabla F(x^n)).

Now note that the projection is characterized by

\displaystyle P_C(x) = \textup{arg}\min_{y\in C}\frac{1}{2}\|y-x\|^2.

Now we replace the “convex constraint” {C} by a penalty function {\alpha R}, i.e. we want to solve

\displaystyle \min_x F(x) + \alpha R(x).

Then, we just replace the minimization problem for the projection with

\displaystyle P_s(x) = \textup{arg}\min_{y}\frac{1}{2}\|y-x\|^2 + s\alpha R(y)

and iterate

\displaystyle x^{n+1} = P_{s_n}(x^n - s_n \nabla F (x^n)).

The crucial thing is, that one needs global minimizers to obtain {P_s}. However, for these {\ell^p} penalties with {0<p<1} these are available as we have seem in Example~1. Hence, the algorithm can be applied in the case

\displaystyle F(x) = \tfrac{1}{2}\|Ax-y\|^2,\qquad R(x) = \sum_i |x_i|^p.

One easily proves that one gets descent of the objective functional:

Lemma 1 Let {F} be weakly lower semicontinuous and differentiable with Lipschitz continuous gradient {\nabla F} with Lipschitz constant {L} and let {R} be weakly lower semicontinuous and coercive. Furthermore let {P_s(x)} denote any solution of

\displaystyle \min_y \tfrac{1}{2}\|y-x\|^2 + s\alpha R(y).

Then for {y = P_s(x - s\nabla F(x))} it holds that

\displaystyle F(y) + \alpha R(y) \leq F(x) + \alpha R(x) - \tfrac{1}{2}\big(\tfrac{1}{s} - L\big)\|y-x\|^2.

Proof: Start with the minimizing property

\displaystyle \tfrac{1}{2}\|y - (x- s\nabla F(x))\|^2 + s\alpha R(y) \leq \tfrac{1}{2}\|s\nabla F(x)\|^2 + s\alpha R(x).

and conclude (by rearranging, expanding the norm-square, canceling terms and adding {F(y) - F(x)} to both sides) that

\displaystyle (F+\alpha R)(y) - (F+\alpha R)(x) \leq F(y) - F(x) - \langle \nabla F(x),y-x\rangle - \tfrac{1}{2s}\|y-x\|^2.

Finally, use Lipschitz-continuity of {\nabla F} to conclude

\displaystyle F(y) - F(x) - \langle \nabla F(x),y-x\rangle \leq \tfrac{L}{2}\|x-y\|^2.

\Box

This gives descent of the functional value as long as {0< s < 1/L}. Now starts the hard part of the investigation: Under what circumstances do we get convergence and what are possible limits?

To make a long story short: For {\ell^p}-penalties (and also other non-convex penalties which leave the origin with infinite slope) and fixed step-size {s_n=s} one gets that every subsequence of the iterates has a strong accumulation point which is a fixed point of the iteration for that particular {s} as long as {0< s< 1/L}. Well that’s good, but here’s the bad news: As long as {s<1/L} we do not obtain the global minimizer. That’s for sure: Consider {F(x) = \tfrac{1}{2}\|x-b\|^2} and any {0<s<1}

However, with considerably more effort one can show the following: For the iteration {x^{n+1} = P_{s_n}(x^n - s_n \nabla F(x))} with {s_n = (L + 1/n)^{-1}\rightarrow 1/L} (and another technical condition on the Lipschitz constant of {\nabla F}) the iterates have a strong accumulation point which is a solution {x = P_{1/L}(x - 1/L\,\nabla F(x)} and hence, satisfies necessary conditions for a global minimizer.

That’s not too bad yet. Currently Kristian and I, together with Stefan Reiterer, work to show that the whole sequence of iterates converges. Funny enough: This seems to be true for {F(x) = \tfrac{1}{2}\|Ax-b\|^2} and {R(x) = \sum_i |x_i|^p} with rational {p} in {]0,1[}… Basically, Stefan was able to show this with the help of Gröbner bases and this has been my first contact with this nice theory. We hope to finalize our revision soon.

Next Page »

Follow

Get every new post delivered to your Inbox.

Join 30 other followers