There are several answers to the following question:

1. What is a convex set?

For a convex set you probably know these definitions:

Definition 1 A subset {C} of a real vector space {V} is convex if for any {x,y\in C} and {\lambda\in[0,1]} it holds that {\lambda x + (1-\lambda)y\in C}.

In other words: If two points lie in the set, then every convex combination also lies in the set.

While this is a “definition from the inside”, convex sets can also be characterized “from the outside”. We add closedness as an assumption and get:

Definition 2 A closed subset {C} of a real locally convex topological vector space {V} is convex if it is the intersection of closed halfspaces (i.e. sets of the form {\{x\in V\ :\ \langle a,x\rangle\geq c\}} for some {a} in the dual space {V^*} and {c\in {\mathbb R}}).

Moreover, we could define convex sets via convex functions:

Definition 3 A set {C\subset V} is convex if there is a convex function {f:V\rightarrow {\mathbb R}} such that {C = \{x\ :\ f(x)\leq 0\}}.

Of course, this only makes sense once we have defined convex functions. Hence, we could also ask the question:

2. What is a convex function?

We can define a convex function by means of convex sets as follows:

Definition 4 A function {f:V\rightarrow{\mathbb R}} from a real vector space into the real numbers is convex, if its epigraph {\text{epi} f = \{(x,\mu)\ :\ f(x)\leq \mu\}\subset V\times {\mathbb R}} is convex (as a subset of the vector space {V\times{\mathbb R}}).

The epigraph consists of the points {(x,\mu)} which lie above the graph of the function and carries the same information as the function.

(Let me note that one can replace the real numbers here and in the following with the extended real numbers {\bar {\mathbb R} = {\mathbb R}\cup\{-\infty,\infty\}} if one uses the right extension of the arithmetic and the obvious ordering, but we do not consider this in this post.)

Because epigraphs are not arbitrary convex sets but have a special form (if a point {(x,\mu)} is in an epigraph, then every {(x,\lambda)} with {\lambda\geq \mu} is also in the epigraph), and because the underlying vector space {V\times {\mathbb R}} comes with an order in the second component, some of the definitions for convex sets from above have a specialized form:

From the definition “convex combinations stay in the set” we get:

Definition 5 A function {f:V\rightarrow {\mathbb R}} is convex, if for all {x,y\in V} and {\lambda\in [0,1]} it holds that

\displaystyle f(\lambda x + (1-\lambda)y) \leq \lambda f(x) + (1-\lambda)f(y).

In other words: The secant of any two points on the graph lies above the graph.

From the definition by “intersection of half spaces” we get another definition. Since we added closedness in this case we add this assumption also here. However, closedness of the epigraph is equivalent to lower-semicontinuity (lsc) of the function and since lsc functions are very convenient we use this notion:

Definition 6 A function {f:V\rightarrow{\mathbb R}} is convex and lsc if it is the pointwise supremum of affine functions, i.e., for some set {S} of tuples {(a,c)\in V^*\times {\mathbb R}} it holds that

\displaystyle f(x) = \sup_{(a,c) \in S} \langle a,x\rangle + c.

A special consequence if this definition is that tangents to the graph of a convex function lie below the function. Another important consequence of this fact is that the local behavior of the function, i.e. its tangent plane at some point, carries some information about the global behavior. Especially, the property that the function lies above its tangent planes allows one to conclude that local minima of the function are also global. Probably the last properties are the ones which give convex functions a distinguished role, especially in the field of optimization.

Some of the previous definitions allow for generalizations into several direction and the quest for the abstract core of the notion of convexity has lead to the field of abstract convexity.

3. Abstract convexity

When searching for abstractions of the notion of convexity one may get confused by the various different approaches. For example there are generalized notions of convexity, e.g. for function of spaces of matrices (e.g. rank-one convexity, quasi-convexity or polyconvexity) and there are also further different notions like pseudo-convexity, invexity or another form ofquasi-convexity. Here we do not have generalization in mind but abstraction. Although both things are somehow related, the way of thinking is a bit different. Our aim is not to find a useful notion which is more general than the notion of convexity but to find a formulation which contains the notion of convexity and abstracts away some ingredients which probably not carry the essence of the notion.

In the literature one also finds several approaches in this direction and I mention only my favorite one (and I restrict myself to abstractly convex functions and not write about abstractly convex sets).

To me, the most appealing notion of an abstract convex function is an abstraction of the definition as “pointwise supremum of affine functions”. Let’s look again at the definition:

A function {f:V\rightarrow {\mathbb R}} is convex and lsc if there is a subset {S} of {V^*\times {\mathbb R}} such that

\displaystyle f(x) = \sup_{(a,c)\in S} \langle a,x\rangle + c.

We abstract away the vector space structure and hence, also the duality, but keep the real-valuedness (together with its order) and define:

Definition 7 Let {X} be a set and let {W} be a set of real valued function on {X}. Then a function {f:X\rightarrow{\mathbb R}} is said to be {W}-convex if there is a subset {S} of {W\times {\mathbb R}} such that

\displaystyle f(x) = \sup_{(w,c)\in S} w(x) + c.

What we did in this definition was simply to replace continuous affine functions {x\mapsto \langle a,x\rangle} on a vector space by an arbitrary collection of real valued functions {x\mapsto w(x)} on a set. On sees immediately that for every function {w\in W} and any real number {c} the function {w+c} is {W} convex (similarly to the fact that every affine linear function is convex).

Another nice thing about this approach is, that it allows for some notion of duality/conjugation. For {f:V\rightarrow{\mathbb R}} we define the {W}-conjugate by

\displaystyle f^{W*}(w) = \sup_{w\in W} \Big(w(x)- f(x) \Big)

and we can even formulate a biconjugate

\displaystyle f^{W**}(x) = \sup_x \Big(w(x) - f^{W*}(w)\Big).

We naturally have a Fenchel inequality

\displaystyle w(x) \leq f(x) + f^{W*}(w)

and we may even define subgradients as usual. Note that a conventional subgradient is an element of the dual space which defines a tangent plane at the point where the subgradient is taken, that is, {a\in V^*} is a subgradient of {f} at {x}, if for all {y\in V} it holds that {f(y) \geq f(x) + \langle a,y-x\rangle} or

\displaystyle f(y) - \langle a,y\rangle\geq f(x) - \langle a,x\rangle.

A {W}-subgradient is an element of {W}, namely we define: {w\in\partial^{W}f(x)} if

\displaystyle \text{for all}\ y:\quad f(y) -w(y) \geq f(x) - w(x).

Then we also have a Fenchel equality:

\displaystyle w\in\partial^W f(x)\iff w(x) = f(x) + f^{W*}(w).

One may also take dualization as the starting point for an abstraction.

4. Abstract conjugation

We could formulate the {W}-conjugate as follows: For {\Phi(w,x) = w(x)} we have

\displaystyle f^{W*}(x) = \sup_w \Big(\Phi(w,x) - f(x)\Big).

This opens the door to another abstraction: For some sets {X,W} (without any additional structure) define a coupling function {\Phi: X\times W \rightarrow {\mathbb R}} and define the {\Phi}-conjugate as

\displaystyle f^{\Phi*}(w) = \sup_x \Big(\Phi(w,x) - f(x)\Big)

and the {\Phi}-biconjugate as

\displaystyle f^{\Phi**}(x) = \sup_x \Big(\Phi(w,x) - f^{W*}(w)\Big)

ISMP LogoISMP is over now and I’m already home. I do not have many things to report on from the last day. This is not due the lower quality of the talks but due to the fact that I was a little bit exhausted, as usual at the end of a five-day conference. However, I collect a few things for the record:

  • In the morning I visited the semi-planary by Xiaojun Chenon non-convex and non-smooth minimization with smoothing methods. Not surprisingly, she treated the problem

    \displaystyle \min_x f(x) + \|x\|_p^p

    with convex and smooth {f:{\mathbb R}^n\rightarrow{\mathbb R}} and {0<p<1}. She proposed and analyzed smoothing methods, that is, to smooth the problem a bit to obtain a Lipschitz-continuous objective function {\phi_\epsilon}, minimizing this and then gradually decreasing {\epsilon}. This works, as she showed. If I remember correctly, she also treated “iteratively reweighted least squares” as I described in my previous post. Unfortunately, she did not include the generalized forward-backward methods based on {\text{prox}}-functions for non-convex functions. Kristian and I pursued this approach in our paper Minimization of non-smooth, non-convex functionals by iterative thresholding and some special features of our analysis include:

    • A condition which excludes some (but not all) local minimizers from being global.
    • An algorithm which avoids this non-global minimizers by carefully adjusting the steplength of the method.
    • A result that the number of local minimizers is still finite, even if the problem is posed in {\ell^2({\mathbb N})} and not in {{\mathbb R}^n}.

    Most of our results hold true, if the {p}-quasi-norm is replaced by functions of the form

    \displaystyle \sum_n \phi_n(|x_n|)

    with special non-convex {\phi}, namely fulfilling a list of assumptions like

    • {\phi'(x) \rightarrow \infty} for {x\rightarrow 0} (infinite slope at {0}) and {\phi(x)\rightarrow\infty} for {x\rightarrow\infty} (mild coercivity),
    • {\phi'} strictly convex on {]0,\infty[} and {\phi'(x)/x\rightarrow 0} for {x\rightarrow\infty},
    • for each {b>0} there is {a>0} such that for {x<b} it holds that {\phi(x)>ax^2}, and
    • local integrability of some section of {\partial\phi'(x) x}.

    As one easily sees, {p}-quasi-norms fulfill the assumptions and some other interesting functions as well (e.g. some with very steep slope at {0} like {x\mapsto \log(x^{1/3}+1)}).

  • Jorge Nocedalgave a talk on second-order methods for non-smooth problems and his main example was a functional like

    \displaystyle \min_x f(x) + \|x\|_1

    with a convex and smooth {f}, but different from Xiaojun Chen, he only considered the {1}-norm. His talked is among the best planary talks I have ever attended and it was a great pleasure to listen to him. He carefully explained things and put them in perspective. In the case he skipped slides, he made me feel that I either did not miss an important thing, or understood them even though he didn’t show them He argued that it is not necessarily more expensive to use second order information in contrast to first order methods. Indeed, the {1}-norm can be used to reduce the number of degrees of freedom for a second order step. What was pretty interesting is, that he advocated semismooth Newton methods for this problem. Roland and I pursued this approach some time ago in our paper A Semismooth Newton Method for Tikhonov Functionals with Sparsity Constraints and, if I remember correctly (my notes are not complete at this point), his family of methods included our ssn-method. The method Roland and I proposed worked amazingly well in the cases in which it converged but the method suffered from non-global convergence. We had some preliminary ideas for globalization, which we could not tune enough to retain the speed of the method, and abandoned the topic. Now, that the topic will most probably be revived by the community, I am looking forward to fresh ideas here.

I used to work on “non-convex” regularization with {\ell^p}-penalties, that is, studying the Tikhonov functional

\displaystyle \frac12 \|Ax-b\|_2^2 + \alpha\sum_{i}|x_i|^p \ \ \ \ \ (1)

 

with a linear operator {A} and {0<p<1}.

The regularization properties are quite nice as shown by Markus Grasmair in “Well-posedness and convergence rates for sparse regularization with sublinear {l^q} penalty term” and “Non-convex sparse regularisation” and Kristian Bredies and myself in “Regularization with non-convex separable constraints”.

The next important issue is, to have some way to calculate global minimizers for~(1). But, well, this task may be hard, if not hopeless: Of course one expects a whole lot of local minimizers.

It is quite instructive to consider the simple case in which {A} is the identity first:

Example 1 Consider the minimization of

\displaystyle F(x) = \frac12\|x-b\|_2^2 + \alpha\sum_i |x_i|^p. \ \ \ \ \ (2)

 

This problem separates over the coordinates and hence, can be solved by solving the one-dimensional minimization problem

\displaystyle s^*\in\textup{arg}\min_s \frac12 (s-b)^2 + \alpha|s|^p. \ \ \ \ \ (3)

 

We observe:

  • For {b\geq 0} we get {s^*\geq 0}.
  • Replacing {b} by {-b} leads to {-s^*} instead of {s^*}.

Hence, we can reduce the problem to: For {b\geq 0} find

\displaystyle s^* \in\textup{arg}\min_{s\geq 0} \frac12 (s-b)^2 + \alpha\, s^p. \ \ \ \ \ (4)

 

One local minimizer is always {s^*=0} since the growth of the {p}-th power beats the term {(\cdot-b)^2}. Then, {b} is large enough, there are two more extrema for~(4) which are given as the solutions to

\displaystyle s + \alpha p s^{p-1} = b

one of which is a local maximum (the one which is smaller in magnitude) and the other is a local minimum (the one which is larger in magnitude). This is illustrated in the following “bifurcation” picture:

Now the challenge is, to find out which local minimum has the smaller value. And here a strange thing happens: The “switching point” for {b} at which the global minimizer jumps from {0} to the upper branch of the (multivalued) inverse of {s\mapsto s + \alpha p s^{p-1}} is not at the place at which the second local minimum occurs. It is a little bit larger: In “Convergence rates and source conditions for Tikhonov regularization with sparsity constraints” I calculated this “jumping point” the as the weird expression

\displaystyle b^* = \frac{2-p}{2-2p}\Bigl(2\alpha(1-p)\Bigr)^{\frac{1}{2-p}}.

This leads to the following picture of the mapping

\displaystyle b^\mapsto \textup{arg}\min_s \frac12 (s-b)^2 + \alpha|s|^p

1. Iterative re-weighting

One approach to calculate minimizers in~(1) is the so called iterative re-weighting, which appeared at least in “An unconstrained {\ell^q} minimization for sparse solution of under determined linear systems” by Ming-Jun Lai and Jingyue Wang but is probably older. I think for the problem with equality constraints

\displaystyle \min \|x\|_q\ \textup{ s.t. }\ Ax=b

the approach at least dates back to the 80s but I forgot the reference… For the minimization of (1) the approach goes as follows: For {0<p<1} choose a {q\geq 1} and a small {\epsilon>0} and rewrite the {p}-quasi-norm as

\displaystyle \sum_i |x_i|^p \approx \sum_i (\epsilon + |x_i|^q)^{\frac{p}{q}}.

The necessary condition for a minimizer of

\displaystyle \frac12\|Ax-b\|_2^2 + \alpha\sum_i (\epsilon + |x_i|^q)^{\frac{p}{q}}

is (formally take the derivative)

\displaystyle 0 = \alpha \Big[\frac{p}{q} (\epsilon + |x_i|^q)^{\frac{p}{q}-1} q \textup{sgn}(x_i) |x_i|^{q-1}\Big]_i + A^*(Ax-b)

Note that the exponent {\frac{p}{q}-1} is negative (which is also a reason for the introduction of the small {\epsilon}). Aiming at an iteration, we fix some of the {x}‘s and try to solve for others: If we have a current iterate {x^k} we try to find {x^{k+1}} by solving

\displaystyle 0 = \alpha \Big[\frac{p}{q} (\epsilon + |x_i^k|^q)^{\frac{p}{q}-1} q \textup{sgn}(x_i) |x_i|^{q-1}\Big]_i + A^*(Ax-b)

for {x}. This is the necessary condition for another minimization problem which involves a weighted {q}-norm: Define (non-negative) weights {w^k_i = \frac{p}{q} (\epsilon + |x^k_i|^p)^{\frac{p}{q}-1}} an iterate

\displaystyle x^{k+1}\in \textup{arg}\min_x \frac12\|Ax-b\|_2^2 + \alpha\sum_i w_i^k |x_i|^q. \ \ \ \ \ (5)

 

Lai and Wang do this for {q=2} which has the benefit that each iteration can be done by solving a linear system. However, for general {1\leq q\leq 2} each iteration is still a convex minimization problem. The paper “Convergence of Reweighted {\ell^1} Minimization Algorithms and Unique Solution of Truncated {\ell^p} Minimization” by Xiaojun Chen and Weijun Zhou uses {q=1} and both papers deliver some theoretical results of the iteration. Indeed in both cases one can show (subsequential) convergence to a critical point.

Of course the question arises if there is a chance that the limit will be a global minimizer. Unfortunately this is not probable as a simple numerical experiment shows:

Example 2 We apply the iteration (5) to the one dimensional problem (3) in which we know the solution. And we do this for many values of {b} and plot the value of {b} against the limit of the iteration. Good news first: Everything converges nicely to critical points as deserved. Even better: {\epsilon} can be really small—machine precision works. The bad news: The limit depends on the initial value. Even worse: The method frequently ends on “the wrong branch”, i.e. in the local minimum which is not global. I made the following experiment: I took {p=1/2}, set {\alpha=1} and chose {q=2}. First I initialized for all values of {b} with {s^0=1}. This produced the following output (I plotted every fifth iteration):

Well, the iteration always chose the upper branch… In a second experiment I initialized with a smaller value, namely with {s^0=0.1} for all {b}. This gave:

That’s interesting: I ended at the upper branch for all values below the point where the lower branch (the one with the local maximum) crosses the initialization line. This seems to be true in general. Starting with {s^0=0.05} gave
Well, probably this is not too interesting: Starting “below the local maximum” will bring you to the local minimum which is lower and vice versa. Indeed Lai and Wang proved in their Theorem 2.5 that for a specific setting ({A} of completely full rank, sparsity high enough) there is an {\alpha} small enough such that the method will pick the global minimizer. But wait—they do not say anything about initialization… What happens if we initialize with zero? I have to check…

By the way: A similar experiment as in this example with different values of {q\geq 1} showed the same behavior (getting the right branch if the initialization is ok). However: smaller {q} gave much faster convergence. But remember: For {q=1} (experimentally the fastest) each iteration is an {\ell^1} penalized problem while for {q=2} one has to solve a linear system. So there seems to be a tradeoff between “small number of iterations in IRLP” and “complexity of the subproblems”.

2. Iterative thresholding

Together with Kristian Bredies I developed another approach to these nasty non-convex minimization problems with {\ell^p}-quasi-norms. We wrote a preprint back in 2009 which is currently under revision. Moreover, we always worked in a Hilbert space setting that is {A} maps the sequence space {\ell^2} into a separable Hilbert space.

Remark 1 When showing result for problems in separable Hilbert space I sometimes get the impression that others think this is somehow pointless since in the end one always works with {{\mathbb R}^N} in practice. However, I think that working directly in a separable Hilbert space is preferable since then one can be sure that the results will not depend on the dimension {N} in any nasty way.

Basically our approach was, to use one of the most popular approaches to the {\ell^1}-penalized problem: Iterative thresholding aka forward-backward splitting aka generalized gradient projection. I prefer the last motivation: Consider the minimization of a smooth function {F} over a convex set {C}

\displaystyle \min_{x\in C} F(x)

by the projected gradient method. That is: do a gradient step and use the projection {P_C} to project back onto {C}:

\displaystyle x^{n+1} = P_C(x^n - s_n \nabla F(x^n)).

Now note that the projection is characterized by

\displaystyle P_C(x) = \textup{arg}\min_{y\in C}\frac{1}{2}\|y-x\|^2.

Now we replace the “convex constraint” {C} by a penalty function {\alpha R}, i.e. we want to solve

\displaystyle \min_x F(x) + \alpha R(x).

Then, we just replace the minimization problem for the projection with

\displaystyle P_s(x) = \textup{arg}\min_{y}\frac{1}{2}\|y-x\|^2 + s\alpha R(y)

and iterate

\displaystyle x^{n+1} = P_{s_n}(x^n - s_n \nabla F (x^n)).

The crucial thing is, that one needs global minimizers to obtain {P_s}. However, for these {\ell^p} penalties with {0<p<1} these are available as we have seem in Example~1. Hence, the algorithm can be applied in the case

\displaystyle F(x) = \tfrac{1}{2}\|Ax-y\|^2,\qquad R(x) = \sum_i |x_i|^p.

One easily proves that one gets descent of the objective functional:

Lemma 1 Let {F} be weakly lower semicontinuous and differentiable with Lipschitz continuous gradient {\nabla F} with Lipschitz constant {L} and let {R} be weakly lower semicontinuous and coercive. Furthermore let {P_s(x)} denote any solution of

\displaystyle \min_y \tfrac{1}{2}\|y-x\|^2 + s\alpha R(y).

Then for {y = P_s(x - s\nabla F(x))} it holds that

\displaystyle F(y) + \alpha R(y) \leq F(x) + \alpha R(x) - \tfrac{1}{2}\big(\tfrac{1}{s} - L\big)\|y-x\|^2.

Proof: Start with the minimizing property

\displaystyle \tfrac{1}{2}\|y - (x- s\nabla F(x))\|^2 + s\alpha R(y) \leq \tfrac{1}{2}\|s\nabla F(x)\|^2 + s\alpha R(x).

and conclude (by rearranging, expanding the norm-square, canceling terms and adding {F(y) - F(x)} to both sides) that

\displaystyle (F+\alpha R)(y) - (F+\alpha R)(x) \leq F(y) - F(x) - \langle \nabla F(x),y-x\rangle - \tfrac{1}{2s}\|y-x\|^2.

Finally, use Lipschitz-continuity of {\nabla F} to conclude

\displaystyle F(y) - F(x) - \langle \nabla F(x),y-x\rangle \leq \tfrac{L}{2}\|x-y\|^2.

\Box

This gives descent of the functional value as long as {0< s < 1/L}. Now starts the hard part of the investigation: Under what circumstances do we get convergence and what are possible limits?

To make a long story short: For {\ell^p}-penalties (and also other non-convex penalties which leave the origin with infinite slope) and fixed step-size {s_n=s} one gets that every subsequence of the iterates has a strong accumulation point which is a fixed point of the iteration for that particular {s} as long as {0< s< 1/L}. Well that’s good, but here’s the bad news: As long as {s<1/L} we do not obtain the global minimizer. That’s for sure: Consider {F(x) = \tfrac{1}{2}\|x-b\|^2} and any {0<s<1}

However, with considerably more effort one can show the following: For the iteration {x^{n+1} = P_{s_n}(x^n - s_n \nabla F(x))} with {s_n = (L + 1/n)^{-1}\rightarrow 1/L} (and another technical condition on the Lipschitz constant of {\nabla F}) the iterates have a strong accumulation point which is a solution {x = P_{1/L}(x - 1/L\,\nabla F(x)} and hence, satisfies necessary conditions for a global minimizer.

That’s not too bad yet. Currently Kristian and I, together with Stefan Reiterer, work to show that the whole sequence of iterates converges. Funny enough: This seems to be true for {F(x) = \tfrac{1}{2}\|Ax-b\|^2} and {R(x) = \sum_i |x_i|^p} with rational {p} in {]0,1[}… Basically, Stefan was able to show this with the help of Gröbner bases and this has been my first contact with this nice theory. We hope to finalize our revision soon.

Follow

Get every new post delivered to your Inbox.

Join 32 other followers