I am not an optimizer by training. My road to optimization went through convex analysis. I started with variational methods for inverse problems and mathematical imaging with the goal to derive properties of minimizers of convex functions. Hence, I studied a lot of convex analysis. Later I got interested in how to actually solve convex optimization problems and started to read books about (convex) optimization. At first I was always distracted by the way optimizers treated constraints. To me, a convex optimization problem always looks like

$\displaystyle \min_x F(x).$

Everything can be packed into the convex objective. If you have a convex objective ${f}$ and a constraint ${c(x) \leq 0}$ with a convex function ${c}$, just take ${F = f + I_{\{c\leq 0\}}}$, i.e., add the indicator function of the constraint to the objective (for some strange reason, Wikipedia has the name and notation for indicator and characteristic function the other way round than I, and many others…). . Similarly for multiple constraints ${c_i(x)\leq 0}$ or linear equality constraints ${Ax=b}$ and such.

In this simple world it is particularly easy to characterize all solutions of convex minimization problems: They are just those ${x}$ for which

$\displaystyle 0\in\partial F(x).$

Simple as that. Only take the subgradient of the objective and that’s it.

When reading the optimization books and seeing how difficult the treatment of constraints is there, I was especially puzzled how complicated optimality conditions such as KKT looked like in contrast to ${0\in\partial F(x)}$ and also and by the notion of constraint qualifications.

These constraint qualifications are additional assumptions that are needed to ensure that a minimizer ${x}$ fulfills the KKT-conditions. For example, if one has constraints ${c_i(x)\leq 0}$ then the linear independence constraint qualification (LICQ) states that all the gradients ${\nabla c_i(x)}$ for constraints that are “active” (i.e. ${c_i(x)=0}$) have to be linearly independent.

It took me while to realize that there is a similar issue in my simple “convex analysis view” on optimization: When passing from the gradient of a function to the subgradient, many things stay as they are. But not everything. One thing that does change is the simple sum-rule. If ${F}$ and ${G}$ are differentiable, then ${\nabla(F+G)(x) = \nabla F(x) + \nabla G(x)}$, always. That’s not true for subgradients! You always have that ${\partial F(x) + \partial G(x) \subset \partial(F+G)(x)}$. The reverse inclusion is not always true but holds, e.g., if there is some point for which ${G}$ is finite and ${F}$ is continuous. At first glance this sounds like a very weak assumption. But in fact, this is precisely in the spirit of constraint qualifications!

Take two constraints ${c_1(x)\leq 0}$ and ${c_2(x)\leq 0}$ with convex and differentiable ${c_{1/2}}$. We can express these by ${x\in K_i = \{x\ :\ c_i(x)\leq 0\}}$ (${i=1,2}$). Then it is equivalent to write

$\displaystyle \min_x f(x)\ \text{s.t.}\ c_i(x)\leq 0$

and

$\displaystyle \min_x (f + I_{K_1} + I_{K_2})(x).$

So characterizing solution to either of these is just saying that ${0 \in\partial (f + I_{K_1} + I_{K_2})(x)}$. Oh, there we are: Are we allowed to pull the subgradient apart? We need to apply the sum rule twice and at some point we need that there is a point at which ${I_{K_1}}$ is finite and the other one ${I_{K_2}}$ is continuous (or vice versa)! But an indicator function is only continuous in the interior of the set where it is finite. So the simplest form of the sum rule only holds in the case where only one of two constraints is active! Actually, the sum rule holds in many more cases but it is not always simple to find out if it really holds for some particular case.

So, constraint qualifications are indeed similar to rules that guarantee that a sum rule for subgradients holds.

Geometrically speaking, both shall guarantee that if one “looks at the constraints individually” one still can see what is going on at points of optimality. It may well be that the sum of individual subgradients is too small to get any points with ${0\in \partial F(x) + \partial I_{K_1}(x) + \partial I_{K_2}(x)}$ but still there are solutions to the optimization problem!

As a very simple illustration take the constraints ${K_1 = \{(x,y)\ :\ y\leq 0\}}$ and ${K_2 = \{(x,y)\ :\ y^2\geq x\}}$ in two dimensions. The first constraint says “be in the lower half-plane” while the second says “be above the parabola ${y^2=x}$”. Now take the point ${(0,0)}$ which is on the boundary for both sets. It’s simple to see (geometrically and algebraically) that ${\partial I_{K_1}(0,0) = \{(0,y)\ :\ y\geq 0\}}$ and ${\partial I_{K_2}(0,0) = \{(0,y)\ :\ y\leq 0\}}$, so treating the constraints individually gives ${\partial I_{K_1}(0,0) + \partial I_{K_2}(0,0) = \{(0,y)\ :\ y\in{\mathbb R}\}}$. But the full story is that ${K_1\cap K_2 = \{(0,0)\}}$, thus ${\partial(I_{K_1} + I_{K_2})(0,0) = \partial I_{K_1\cap K_2}(0,0) = {\mathbb R}^2}$ and consequently, the subgradient is much bigger.

There are several answers to the following question:

1. What is a convex set?

For a convex set you probably know these definitions:

Definition 1 A subset ${C}$ of a real vector space ${V}$ is convex if for any ${x,y\in C}$ and ${\lambda\in[0,1]}$ it holds that ${\lambda x + (1-\lambda)y\in C}$.

In other words: If two points lie in the set, then every convex combination also lies in the set.

While this is a “definition from the inside”, convex sets can also be characterized “from the outside”. We add closedness as an assumption and get:

Definition 2 A closed subset ${C}$ of a real locally convex topological vector space ${V}$ is convex if it is the intersection of closed halfspaces (i.e. sets of the form ${\{x\in V\ :\ \langle a,x\rangle\geq c\}}$ for some ${a}$ in the dual space ${V^*}$ and ${c\in {\mathbb R}}$).

Moreover, we could define convex sets via convex functions:

Definition 3 A set ${C\subset V}$ is convex if there is a convex function ${f:V\rightarrow {\mathbb R}}$ such that ${C = \{x\ :\ f(x)\leq 0\}}$.

Of course, this only makes sense once we have defined convex functions. Hence, we could also ask the question:

2. What is a convex function?

We can define a convex function by means of convex sets as follows:

Definition 4 A function ${f:V\rightarrow{\mathbb R}}$ from a real vector space into the real numbers is convex, if its epigraph ${\text{epi} f = \{(x,\mu)\ :\ f(x)\leq \mu\}\subset V\times {\mathbb R}}$ is convex (as a subset of the vector space ${V\times{\mathbb R}}$).

The epigraph consists of the points ${(x,\mu)}$ which lie above the graph of the function and carries the same information as the function.

(Let me note that one can replace the real numbers here and in the following with the extended real numbers ${\bar {\mathbb R} = {\mathbb R}\cup\{-\infty,\infty\}}$ if one uses the right extension of the arithmetic and the obvious ordering, but we do not consider this in this post.)

Because epigraphs are not arbitrary convex sets but have a special form (if a point ${(x,\mu)}$ is in an epigraph, then every ${(x,\lambda)}$ with ${\lambda\geq \mu}$ is also in the epigraph), and because the underlying vector space ${V\times {\mathbb R}}$ comes with an order in the second component, some of the definitions for convex sets from above have a specialized form:

From the definition “convex combinations stay in the set” we get:

Definition 5 A function ${f:V\rightarrow {\mathbb R}}$ is convex, if for all ${x,y\in V}$ and ${\lambda\in [0,1]}$ it holds that

$\displaystyle f(\lambda x + (1-\lambda)y) \leq \lambda f(x) + (1-\lambda)f(y).$

In other words: The secant of any two points on the graph lies above the graph.

From the definition by “intersection of half spaces” we get another definition. Since we added closedness in this case we add this assumption also here. However, closedness of the epigraph is equivalent to lower-semicontinuity (lsc) of the function and since lsc functions are very convenient we use this notion:

Definition 6 A function ${f:V\rightarrow{\mathbb R}}$ is convex and lsc if it is the pointwise supremum of affine functions, i.e., for some set ${S}$ of tuples ${(a,c)\in V^*\times {\mathbb R}}$ it holds that

$\displaystyle f(x) = \sup_{(a,c) \in S} \langle a,x\rangle + c.$

A special consequence if this definition is that tangents to the graph of a convex function lie below the function. Another important consequence of this fact is that the local behavior of the function, i.e. its tangent plane at some point, carries some information about the global behavior. Especially, the property that the function lies above its tangent planes allows one to conclude that local minima of the function are also global. Probably the last properties are the ones which give convex functions a distinguished role, especially in the field of optimization.

Some of the previous definitions allow for generalizations into several direction and the quest for the abstract core of the notion of convexity has lead to the field of abstract convexity.

3. Abstract convexity

When searching for abstractions of the notion of convexity one may get confused by the various different approaches. For example there are generalized notions of convexity, e.g. for function of spaces of matrices (e.g. rank-one convexity, quasi-convexity or polyconvexity) and there are also further different notions like pseudo-convexity, invexity or another form ofquasi-convexity. Here we do not have generalization in mind but abstraction. Although both things are somehow related, the way of thinking is a bit different. Our aim is not to find a useful notion which is more general than the notion of convexity but to find a formulation which contains the notion of convexity and abstracts away some ingredients which probably not carry the essence of the notion.

In the literature one also finds several approaches in this direction and I mention only my favorite one (and I restrict myself to abstractly convex functions and not write about abstractly convex sets).

To me, the most appealing notion of an abstract convex function is an abstraction of the definition as “pointwise supremum of affine functions”. Let’s look again at the definition:

A function ${f:V\rightarrow {\mathbb R}}$ is convex and lsc if there is a subset ${S}$ of ${V^*\times {\mathbb R}}$ such that

$\displaystyle f(x) = \sup_{(a,c)\in S} \langle a,x\rangle + c.$

We abstract away the vector space structure and hence, also the duality, but keep the real-valuedness (together with its order) and define:

Definition 7 Let ${X}$ be a set and let ${W}$ be a set of real valued function on ${X}$. Then a function ${f:X\rightarrow{\mathbb R}}$ is said to be ${W}$-convex if there is a subset ${S}$ of ${W\times {\mathbb R}}$ such that

$\displaystyle f(x) = \sup_{(w,c)\in S} w(x) + c.$

What we did in this definition was simply to replace continuous affine functions ${x\mapsto \langle a,x\rangle}$ on a vector space by an arbitrary collection of real valued functions ${x\mapsto w(x)}$ on a set. On sees immediately that for every function ${w\in W}$ and any real number ${c}$ the function ${w+c}$ is ${W}$ convex (similarly to the fact that every affine linear function is convex).

Another nice thing about this approach is, that it allows for some notion of duality/conjugation. For ${f:V\rightarrow{\mathbb R}}$ we define the ${W}$-conjugate by

$\displaystyle f^{W*}(w) = \sup_{w\in W} \Big(w(x)- f(x) \Big)$

and we can even formulate a biconjugate

$\displaystyle f^{W**}(x) = \sup_x \Big(w(x) - f^{W*}(w)\Big).$

We naturally have a Fenchel inequality

$\displaystyle w(x) \leq f(x) + f^{W*}(w)$

and we may even define subgradients as usual. Note that a conventional subgradient is an element of the dual space which defines a tangent plane at the point where the subgradient is taken, that is, ${a\in V^*}$ is a subgradient of ${f}$ at ${x}$, if for all ${y\in V}$ it holds that ${f(y) \geq f(x) + \langle a,y-x\rangle}$ or

$\displaystyle f(y) - \langle a,y\rangle\geq f(x) - \langle a,x\rangle.$

A ${W}$-subgradient is an element of ${W}$, namely we define: ${w\in\partial^{W}f(x)}$ if

$\displaystyle \text{for all}\ y:\quad f(y) -w(y) \geq f(x) - w(x).$

Then we also have a Fenchel equality:

$\displaystyle w\in\partial^W f(x)\iff w(x) = f(x) + f^{W*}(w).$

One may also take dualization as the starting point for an abstraction.

4. Abstract conjugation

We could formulate the ${W}$-conjugate as follows: For ${\Phi(w,x) = w(x)}$ we have

$\displaystyle f^{W*}(x) = \sup_w \Big(\Phi(w,x) - f(x)\Big).$

This opens the door to another abstraction: For some sets ${X,W}$ (without any additional structure) define a coupling function ${\Phi: X\times W \rightarrow {\mathbb R}}$ and define the ${\Phi}$-conjugate as

$\displaystyle f^{\Phi*}(w) = \sup_x \Big(\Phi(w,x) - f(x)\Big)$

and the ${\Phi}$-biconjugate as

$\displaystyle f^{\Phi**}(x) = \sup_x \Big(\Phi(w,x) - f^{W*}(w)\Big)$