The fundamental theorem of calculus relates the derivative of a function to the function itself via an integral. A little bit more precise, it says that one can recover a function from its derivative up to an additive constant (on a simply connected domain).

In one space dimension, one can fix some ${x_{0}}$ in the domain (which has to be an interval) and then set

$\displaystyle F(x) = \int_{x_{0}}^{x}f'(t) dt.$

Then ${F(x) = f(x) + c}$ with ${c= - f(x_{0})}$.

Actually, a similar claim is true in higher space dimensions: If ${f}$ is defined on a simply connected domain in ${{\mathbb R}^{d}}$ we can recover ${f}$ from its gradient up to an additive constant as follows: Select some ${x_{0}}$ and set

$\displaystyle F(x) = \int_{\gamma}\nabla f(x)\cdot dx \ \ \ \ \ (1)$

for any path ${\gamma}$ from ${x_{0}}$ to ${x}$. Then it holds under suitable conditions that

$\displaystyle F(x) = f(x) - f(x_{0}).$

And now for something completely different: Convex functions and subgradients.

A function ${f}$ on ${{\mathbb R}^{n}}$ is convex if for every ${x}$ there exists a subgradient ${x^{*}\in{\mathbb R}^{n}}$ such that for all ${y}$ one has the subgradient inequality

$\displaystyle f(y) \geq f(x) + \langle x^{*}, y-x\rangle.$

Writing this down for ${x}$ and ${y}$ with interchanged roles (and ${y^{*}}$ as corresponding subgradient to ${y}$), we see that

$\displaystyle \langle x-y,x^{*}-y^{*}\rangle \geq 0.$

In other words: For a convex function ${f}$ it holds that the subgradient ${\partial f}$ is a (multivalued) monotone mapping. Recall that a multivalued map ${A}$ is monotone, if for every ${y \in A(x)}$ and ${y^{*}\in A(y)}$ it holds that ${\langle x-y,x^{*}-y^{*}\rangle \geq 0}$. It is not hard to see that not every monotone map is actually a subgradient of a convex function (not even, if we go to “maximally monotone maps”, a notion that we sweep under the rug in this post). A simple counterexample is a (singlevalued) linear monotone map represented by

$\displaystyle \begin{bmatrix} 0 & 1\\ -1 & 0 \end{bmatrix}$

(which can not be a subgradient of some ${f}$, since it is not symmetric).

Another hint that monotonicity of a map does not imply that the map is a subgradient is that subgradients have some stronger properties than monotone maps. Let us write down the subgradient inequalities for a number of points ${x_{0},\dots x_{n}}$:

$\displaystyle \begin{array}{rcl} f(x_{1}) & \geq & f(x_{0}) + \langle x_{0}^{*},x_{1}-x_{0}\rangle\\ f(x_{2}) & \geq & f(x_{1}) + \langle x_{1}^{*},x_{2}-x_{1}\rangle\\ \vdots & & \vdots\\ f(x_{n}) & \geq & f(x_{n-1}) + \langle x_{n-1}^{*},x_{n}-x_{n-1}\rangle\\ f(x_{0}) & \geq & f(x_{n}) + \langle x_{n}^{*},x_{0}-x_{n}\rangle. \end{array}$

If we sum all these up, we obtain

$\displaystyle 0 \geq \langle x_{0}^{*},x_{1}-x_{0}\rangle + \langle x_{1}^{*},x_{2}-x_{1}\rangle + \cdots + \langle x_{n-1}^{*},x_{n}-x_{n-1}\rangle + \langle x_{n}^{*},x_{0}-x_{n}\rangle.$

This property is called ${n}$-cyclical monotonicity. In these terms we can say that a subgradient of a convex function is cyclical monotone, which means that it is ${n}$-cyclically monotone for every integer ${n}$.

By a remarkable result by Rockafellar, the converse is also true:

Theorem 1 (Rockafellar, 1966) Let ${A}$ by a cyclically monotone map. Then there exists a convex function ${f}$ such that ${A = \partial f}$.

Even more remarkable, the proof is somehow “just an application of the fundamental theorem of calculus” where we recover a function by its subgradient (up to an additive constant that depends on the basepoint).

Proof: we aim to “reconstruct” ${f}$ from ${A = \partial f}$. The trick is to choose some base point ${x_{0}}$ and corresponding ${x_{0}^{*}\in A(x_{0})}$ and set

$\displaystyle \begin{array}{rcl} f(x) &=& \sup\left\{ \langle x_{0}^{*}, x_{1}-x_{0}\rangle + \langle x_{1}^{*},x_{2}-x_{1}\rangle + \cdots + \langle x_{n-1}^{*},x_{n}-x_{n-1}\rangle\right.\\&& \qquad\left. + \langle x_{n}^{*},x-x_{n}\rangle\right\}\qquad\qquad (0)\end{array}$

where the supremum is taken over all integers ${m}$ and all pairs ${x_{i}^{*}\in A(x_{i})}$ ${i=1,\dots, m}$. As a supremum of affine functions ${f}$ is clearly convex (even lower semicontinuous) and ${f(x_{0}) = 0}$ since ${A}$ is cyclically monotone (this shows that ${f}$ is proper, i.e. not equal to ${\infty}$ everywhere). Finally, for ${\bar x^{*}\in A(\bar x)}$ we have

$\displaystyle \begin{array}{rcl} f(x) &\geq& \sup\left\{ \langle x_{0}^{*}, x_{1}-x_{0}\rangle + \langle x_{1}^{*},x_{2}-x_{1}\rangle + \cdots + \langle x_{n-1}^{*},x_{n}-x_{n-1}\rangle\right.\\ & & \qquad \left.+ \langle x_{n}^{*},\bar x-x_{n}\rangle + \langle \bar x^{*},\bar x-\bar x\rangle\right\} \end{array}$

with the supremum taken over all integers ${m}$ and all pairs ${x_{i}^{*}\in A(x_{i})}$ ${i=1,\dots, m}$. The right hand side is equal to ${f(x) + \langle \bar x^{*},x-\bar x\rangle}$ and this shows that ${f}$ is indeed convex. $\Box$

Where did we use the fundamental theorem of calculus? Let us have another look at equation~(0). Just for simplicity, let us denote ${x_{i}^{*} =\nabla f(x_{i})}$. Now consider a path ${\gamma}$ from ${x_{0}}$ to ${x}$ and points ${0=t_{0}< t_{1}<\cdots < t_{n}< t_{n+1} = 1}$ with ${\gamma(t_{i}) = x_{i}}$. Then the term inside the supremum of~(0) equals

$\displaystyle \langle \nabla f(\gamma(t_{0})),\gamma(t_{1})-\gamma(t_{0})\rangle + \dots + \langle \nabla f(\gamma(t_{n})),\gamma(t_{n+1})-\gamma(t_{n})\rangle.$

This is Riemannian sum for an integral of the form ${\int_{\gamma}\nabla f(x)\cdot dx}$. By monotonicity of ${f}$, we increase this sum, if we add another point ${\bar t}$ (e.g. ${t_{i}<\bar t, and hence, the supremum does converge to the integral, i.e.~(0) is equal to

$\displaystyle f(x) = \int_{\gamma}\nabla f(x)\cdot dx$

where ${\gamma}$ is any path from ${x_{0}}$ to ${x}$.

$\displaystyle \min_{x}\max_{y}F(x) + \langle Kx,y\rangle - G(y). \ \ \ \ \ (1)$

(where I omit all the standard assumptions, like convexity, continuity ans such…). Fenchel-Rockafellar duality says that solutions are characterized by the inclusion

$\displaystyle 0 \in\left( \begin{bmatrix} \partial F & 0\\ 0 & \partial G \end{bmatrix} + \begin{bmatrix} 0 & K^{T}\\ -K & 0 \end{bmatrix}\right) \begin{bmatrix} x^{*}\\y^{*} \end{bmatrix}$

Noting that the operators

$\displaystyle A = \begin{bmatrix} \partial F & 0\\ 0 & \partial G \end{bmatrix},\quad B = \begin{bmatrix} 0 & K^{T}\\ -K & 0 \end{bmatrix}$

are both monotone, we may apply any of the splitting methods available, for example the Douglas-Rachford method. In terms of resolvents

$\displaystyle R_{tA}(z) := (I+tA)^{-1}(z)$

$\displaystyle \begin{array}{rcl} z^{k+1} & = & R_{tB}(\bar z^{k})\\ \bar z^{k+1}& = & R_{tA}(2z^{k+1}-\bar z^{k}) + \bar z^{k}-z^{k+1}. \end{array}$

For the saddle point problem, this iteration is (with ${z = (x,y)}$)

$\displaystyle \begin{array}{rcl} x^{k+1} &=& R_{t\partial F}(\bar x^{k})\\ y^{k+1} &=& R_{t\partial G}(\bar y^{k})\\ \begin{bmatrix} \bar x^{k+1}\\ \bar y^{k+1} \end{bmatrix} & = & \begin{bmatrix} I & tK^{T}\\ -tK & I \end{bmatrix}^{-1} \begin{bmatrix} 2x^{k+1}-\bar x^{k}\\ 2y^{k+1}-\bar y^{k} \end{bmatrix} + \begin{bmatrix} \bar x^{k}- x^{k+1}\\ \bar y^{k}-y^{k+1} \end{bmatrix}. \end{array}$

The first two lines involve proximal steps and we assume that they are simple to implement. The last line, however, involves the solution of a large linear system. This can be broken down to a slightly smaller linear system involving the matrix ${(I+t^{2}K^{T}K)}$ as follows: The linear system equals

$\displaystyle \begin{array}{rcl} \bar x^{k+1} & = & x^{k+1} - tK^{T}(y^{k+1}+\bar y^{k+1}-\bar y^{k})\\ \bar y^{k+1} & = & y^{k+1} + tK(x^{k+1} + \bar x^{k+1}-\bar x^{k}). \end{array}$

Plugging ${\bar y^{k+1}}$ from the second equation into the first gives

$\displaystyle \bar x^{k+1} = x^{k+1} - tK^{T}(2y^{k+1}-\bar y^{k}) - tK^{T}K(x^{k+1}-\bar x^{k+1}-\bar x^{k})$

Denoting ${d^{k+1}= x^{k+1}+\bar x^{k+1}-\bar x^{k}}$ this can be written as

$\displaystyle (I+t^{2}K^{T}K)d^{k+1} = (2x^{k+1}-\bar x^{k}) - tK^{T}(2y^{k+1}-\bar y^{k}).$

and the second equation is just

$\displaystyle \bar y^{k+1} = y^{k+1} + tKd^{k+1}.$

This gives the overall iteration

$\displaystyle \begin{array}{rcl} x^{k+1} &=& R_{t\partial F}(\bar x^{k})\\ y^{k+1} &=& R_{t\partial G}(\bar y^{k})\\ d^{k+1} &=& (I+t^{2}K^{T}K)^{-1}(2x^{k+1}-\bar x^{k} - tK(2y^{k+1}-\bar y^{k}))\\ \bar x^{k+1}&=& \bar x^{k}-x^{k+1}+d^{k+1}\\ \bar y^{k+1}&=& y^{k+1}+tKd^{k+1} \end{array}$

This is nothing else than using the Schur complement or factoring as

$\displaystyle \begin{bmatrix} I & tK^{T}\\ -tK & I \end{bmatrix} = \begin{bmatrix} 0 & 0\\ 0 & I \end{bmatrix} + \begin{bmatrix} I\\tK \end{bmatrix} (I + t^{2}K^{T}K)^{-1} \begin{bmatrix} I & -tK^{T} \end{bmatrix}$

and has been applied to imaging problems by O’Connor and Vandenberghe in “Primal-Dual Decomposition by Operator Splitting and Applications to Image Deblurring” (doi). For many problems in imaging, the involved inversion may be fairly easy to perform (if ${K}$ is the image gradient, for example, we only need to solve an equation with an operator like ${(I - t^{2}\Delta)}$ and appropriate boundary conditions). However, there are problems where this inversion is a problem.

I’d like to show the following trick to circumvent the matrix inversion, which I learned from Bredies and Sun’s “Accelerated Douglas-Rachford methods for the solution of convex-concave saddle-point problems”: Here is a slightly different saddle point problem

$\displaystyle \min_{x}\max_{y,x_{p}}F(x) + \langle Kx,y\rangle + \langle Hx,x_{p}\rangle- G(y) - I_{\{0\}}(x_{p}). \ \ \ \ \ (2)$

We added a new dual variable ${x_{p}}$, which is forced to be zero by the additional indicator functional ${I_{\{0\}}}$. Hence, the additional bilinear term ${\langle Hx,x_{p}\rangle}$ is also zero, and we see that ${(x,y)}$ is a solution of (1) if and only if ${(x,y,0)}$ is a solution of (2). In other words: The problem just looks differently, but is, in essence, the same as before.

Now let us write down the Douglas-Rachford iteration for (2). We write this problem as

$\displaystyle \min_{x}\max_{\tilde y} F(x) + \langle \tilde Kx,\tilde y\rangle -\tilde G(\tilde y)$

with

$\displaystyle \tilde y = \begin{bmatrix} y\\x_{p} \end{bmatrix}, \quad \tilde K = \begin{bmatrix} K\\H \end{bmatrix}, \quad \tilde G(\tilde y) = \tilde G(y,x_{p}) = G(y) + I_{\{0\}}(x_{p}).$

Writing down the Douglas-Rachford iteration gives

$\displaystyle \begin{array}{rcl} x^{k+1} &=& R_{t\partial F}(\bar x^{k})\\ \tilde y^{k+1} &=& R_{t\partial \tilde G}(\bar{ \tilde y}^{k})\\ \begin{bmatrix} \bar x^{k+1}\\ \bar {\tilde y}^{k+1} \end{bmatrix} & = & \begin{bmatrix} I & t\tilde K^{T}\\ -t\tilde K & I \end{bmatrix}^{-1} \begin{bmatrix} 2x^{k+1}-\bar x^{k}\\ 2\tilde y^{k+1}-\bar {\tilde y}^{k} \end{bmatrix} + \begin{bmatrix} \bar x^{k}- x^{k+1}\\ \bar {\tilde y}^{k}-\tilde y^{k+1} \end{bmatrix}. \end{array}$

Switching back to variables without a tilde, we get, using ${R_{tI_{\{0\}}}(x) = 0}$,

$\displaystyle \begin{array}{rcl} x^{k+1} &=& R_{t\partial F}(\bar x^{k})\\ y^{k+1} &=& R_{t\partial \tilde G}(\bar{ y}^{k})\\ x_{p}^{k+1} &=& 0\\ \begin{bmatrix} \bar x^{k+1}\\ \bar {y}^{k+1}\\ \bar x_{p}^{k+1} \end{bmatrix} & = & \begin{bmatrix} I & tK^{T} & tH^{T}\\ -t K & I & 0\\ -t H & 0 & I \end{bmatrix}^{-1} \begin{bmatrix} 2x^{k+1}-\bar x^{k}\\ 2 y^{k+1}-\bar {y}^{k}\\ 2x_{p}^{k+1}-\bar x_{p}^{k} \end{bmatrix} + \begin{bmatrix} \bar x^{k}- x^{k+1}\\ \bar {y}^{k}-y^{k+1}\\ \bar x_{p}^{k}-x_{p}^{k+1} \end{bmatrix}. \end{array}$

First not that ${x_{p}^{k+1}=0}$ throughout the iteration and from the last line of the linear system we get that

$\displaystyle \begin{array}{rcl} -tH\bar x^{k+1} + \bar x_{p}^{k+1} = -\bar x_{p}^{k} -tH(\bar x^{k}-x^{k+1}) + \bar x_{p}^{k} \end{array}$

which implies that

$\displaystyle \bar x_{p}^{k+1} = tH\bar x^{k+1}.$

Thus, both variables ${x_{p}^{k}}$ and ${\bar x_{p}^{k}}$ disappear in the iteration. Now we rewrite the remaining first two lines of the linear system as

$\displaystyle \begin{array}{rcl} \bar x^{k+1} + tK^{T}\bar y^{k+1} + t^{2}H^{T}H\bar x^{k+1} &=& x^{k+1} + tK^{T}(\bar y^{k}-y^{k+1}) + t^{2}H^{T}H\bar x^{k}\\ \bar y^{k+1}-tK\bar x^{k+1} &=& y^{k+1} + tK(x^{k+1}-\bar x^{k}). \end{array}$

Again denoting ${d^{k+1}=x^{k+1}+\bar x^{k+1}-\bar x^{k}}$, solving the second equation for ${\bar y^{k+1}}$ and plugging the result in the first gives

$\displaystyle (I+t^{2}H^{T}H)\bar x^{k+1} +tK^{T}(y^{k+1}+tKd^{k+1}) = x^{k+1}+tK(\bar y^{k}-y^{k+1}) + t^{2}H^{T}H\bar x^{k}.$

To eliminate ${\bar x^{k+1}}$ we add ${(I+t^{2}H^{T}H)(x^{k+1}-\bar x^{k})}$ on both sides and get

$\displaystyle (I+t^{2}(H^{T}H+K^{T}K))d^{k+1} = 2x^{k+1}-\bar x^{k} -tK(y^{k+1}+\bar y^{k+1}-\bar y^{k}) + t^{2}H^{T}Hx^{k+1}.$

In total we obtain the following iteration:

$\displaystyle \begin{array}{rcl} x^{k+1} &=& R_{t\partial F}(\bar x^{k})\\ y^{k+1} &=& R_{t\partial G}(\bar y^{k})\\ d^{k+1} &=& (I+t^{2}(H^{T}H + K^{T}K))^{-1}(2x^{k+1}-\bar x^{k} - tK(2y^{k+1}-\bar y^{k}) + t^{2}H^{T}Hx^{k+1})\\ \bar x^{k+1}&=& \bar x^{k}-x^{k+1}+d^{k+1}\\ \bar y^{k+1}&=& y^{k+1}+tKd^{k+1} \end{array}$

and note that only the third line changed.

Since the above works for any matrix ${H}$, we have a lot of freedom. Let us see, that it is even possible to avoid any inversion whatsoever: We would like to choose ${H}$ in a way that ${I+t^{2}(H^{T}H + K^{T}K) = \lambda I}$ for some positive ${\lambda}$. This is equivalent to

$\displaystyle H^{T}H = \tfrac{\lambda-1}{t^{2}}I - K^{T}K.$

As soon as the right hand side is positive definite, Cholesky decomposition shows that such an ${H}$ exists, and this happens if ${\lambda\geq 1+t^{2}\|K\|^{2}}$. Further note, that we do need ${H}$ in any way, but only ${H^{T}H}$, and we can perform the iteration without ever solving any linear system since the third row reads as

$\displaystyle d^{k+1} = \tfrac{1}{\lambda}\left(2x^{k+1}-\bar x^{k} - tK(2y^{k+1}-\bar y^{k}) + ((\lambda-1)I - t^{2}K^{T}K)x^{k+1})\right).$

Here is a lemma that I find myself googling regularly since I always forget it’s exact form.

Lemma 1 Let ${A}$ be a monotone operator, ${\lambda>0}$ and denote by ${R_{\lambda A} = (I+\lambda A)^{-1}}$ the resolvent of ${\lambda A}$. Then it holds that

$\displaystyle \begin{array}{rcl} R_{\lambda A^{-1}}(x) = x - \lambda R_{\lambda^{-1}A}(\lambda^{-1}x). \end{array}$

Proof: We start with the left hand side ${y = R_{\lambda A^{-1}}(x) = (I+\lambda A^{-1})^{-1} x}$ and deduce

$\displaystyle \begin{array}{rcl} x &\in& y + \lambda A^{-1}y\\ \iff \frac{x-y}{\lambda} &\in& A^{-1}y\\ \iff y &\in& A(\frac{x-y}{\lambda})\\ \iff x &\in& A(\frac{x-y}{\lambda}) + x-y\\ \iff \frac{x}{\lambda} &\in& \frac{1}{\lambda}A(\frac{x-y}{\lambda}) + \frac{x-y}{\lambda}\\ \iff \frac{x-y}{\lambda} & = &(I + \lambda^{-1}A)^{-1}(\lambda^{-1}x)\\ \iff x - \lambda (I+\lambda^{-1}A)^{-1}(\lambda^{-1}x) & = & y. \end{array}$

$\Box$

I do not know any official name of this, but would call it Moreau’s identity which is the name of the respective statement for proximal operators for convex functions ${f}$ and ${g}$:

$\displaystyle \begin{array}{rcl} \mathrm{prox}_{\lambda f^{*}}(x) = x - \lambda\mathrm{prox}_{\lambda^{-1}f}(\lambda^{-1}x). \end{array}$

The version for monotone operators is Proposition 23.18 in the first edition of Bauschke and Combette’s book “Convex Analysis and Monotone Operator Theory in Hilbert Spaces”.

Last week Christoph Brauer, Andreas Tillmann and myself uploaded the paper A Primal-Dual Homotopy Algorithm for ${\ell_{1}}$-Minimization with ${\ell_{\infty}}$-Constraints to the arXiv (and we missed being the first ever arXiv-paper with a non-trivial five-digit identifier by twenty-something papers…). This paper specifically deals with the optimization problem

$\displaystyle \begin{array}{rcl} \min_{x}\|x\|_{1}\quad\text{s.t.}\quad \|Ax-b\|_{\infty}\leq \delta \end{array}$

where ${A}$ and ${b}$ are a real matrix and vector, respecively, of compatible size. While the related problem with ${\ell_{2}}$ constraint has been addressed quite often (and the penalized problem ${\min_{x}\|x\|_{1} + \tfrac1{2\lambda}\|Ax-b\|_{2}^{2}}$ is even more popular) there is not much code around to solve this specific problem. Obvious candidates are

• Linear optimization: The problem can be recast as a linear program: The constraint is basically linear already (rewriting it with help of the ones vector ${\mathbf{1}}$ as ${Ax\leq \delta\mathbf{1}+b}$, ${-Ax\leq \delta\mathbf{1}-b}$) and for the objective one can, for example, perform a variable split ${x = x^{+}-x^{-}}$, ${x^{-},x^{+}\geq 0}$ and then write ${\|x\|_{1} = \mathbf{1}^{T}x^{+}+ \mathbf{1}^{T}x^{-}}$.
• Splitting methods: The problem is convex problem of the form ${\min_{x}F(x) + G(Ax)}$ with

$\displaystyle \begin{array}{rcl} F(x) & = & \|x\|_{1}\\ G(y) & = & \begin{cases} 0 & \|y-b\|\leq\delta\\ \infty & \text{else.} \end{cases} \end{array}$

and hence, several methods for these kind of problem are available, such as the alternating direction method of multipliers or the Chambolle-Pock algorithm.

The formulation as linear program has the advantage that one can choose among a lot of highly sophisticated software tools and the second has the advantage that the methods are very easy to code, usually in just a few lines. Drawbacks are, that both methods do exploit too much specific structure of the problem.

Application of the problem with ${\ell_{\infty}}$ constraint are, for example:

• The Dantzig selector, a statistical estimation technique, were the problem is

$\displaystyle \begin{array}{rcl} \min_{x}\|x\|_{1}\quad\text{s.t.}\quad \|A^{T}(Ax-b)\|_{\infty}\leq\delta. \end{array}$

• Sparse dequantization as elaborated here by Jacques, Hammond and Fadili and applied here to de-quantizaton of speech signals by Christoph, Timo Gerkmann and myself.

We wanted to see if one of the most efficient methods known for sparse reconstruction with ${\ell_{2}}$ penalty, namely the homotopy method, can be adapted to this case. The homotopy method for ${\min_{x}\|x\|_{1} + \tfrac1{2\lambda}\|Ax-b\|_{2}^{2}}$ builds on the observation that the solution for ${\lambda\geq \|A^{T}b\|_{\infty}}$ is zero and that the set of solutions ${x_{\lambda}}$, parameterized by the parameter ${\lambda}$, is piecewise linear. Hence, on can start from ${\lambda_{0}= \|A^{T}b\|_{\infty}}$, calculate which direction to go, how far the breakpoint is away, go there and start over. I’ve blogged on the homotopy method here already and there you’ll find some links to great software packages, but also the fact that there can be up to exponentially many breakpoints. However, in practice the homotopy method is usually blazingly fast and with some care, can be made numerically stable and accurate, see, e.g. our extensive study here (and here is the optimization online preprint).

The problem with ${\ell_{\infty}}$ constraint seems similar, especially it is clear that for ${\delta = \|b\|_{\infty}}$, ${x=0}$ is a solution. It is also not so difficult to see that there is a piecewise linear path of solutions ${x_{\delta}}$. What is not so clear is, how it can be computed. It turned out, that in this case the whole truth can be seen when the problem is viewed from a primal-dual viewpoint. The associated dual problem is

$\displaystyle \begin{array}{rcl} \max_{y}\ -b^{T}y - \delta\|y\|_{1}\quad\text{s.t.}\quad\|A^{T}y\|_{\infty}\leq\infty \end{array}$

and a pair ${(x^{*},y^{*})}$ is primal and dual optimal if and only if

$\displaystyle \begin{array}{rcl} -A^{T}y^{*}&\in&\text{Sign}(x^{*})\\ Ax^{*}-b & \in & \delta\text{Sign}(y^{*}) \end{array}$

(where ${\text{Sign}}$ denotes the sign function, multivalued at zero, giving ${[-1,1]}$ there). One can note some things from the primal-dual optimality system:

• You may find a direction ${d}$ such that ${(x^{*}+td,y^{*})}$ stays primal-dual optimal for the constraint ${\leq\delta-t}$ for small ${t}$,
• for a fixed primal optimal ${x_{\delta}}$ there may be several dual optimal ${y_{\delta}}$.

On the other hand, it is not that clear which of the probably many dual optimal ${y_{\delta}}$ allows to find a new direction ${d}$ such that ${x_{\delta}+td}$ with stay primal optimal when reducing ${\delta}$. In fact, it turned out that, at a breakpoint, a new dual variable needs to be found to allow for the next jump in the primal variable. So, the solution path is piecewise linear in the primal variable, but piecewise constant in the dual variable (a situation similar to the adaptive inverse scale space method).

What we found is, that some adapted theorem of the alternative (a.k.a. Farkas’ Lemma) allows to calculate the next dual optimal ${y}$ such that a jump in ${x}$ will be possible.

What is more, is that the calculation of a new primal or dual optimal point amounts to solving a linear program (in contrast to a linear system for ${\ell_{2}}$ homotopy). Hence, the trick of up- and downdating a suitable factorization of a suitable matrix to speed up computation does not work. However, one can somehow leverage the special structure of the problem and use a tailored active set method to progress through the path. Our numerical tests indicated is that the resulting method, which we termed ${\ell_{1}}$-Houdini, is able to solve moderately large problems faster than a commercial LP-solver (while also not only solving the given problem, but calculating the whole solution path on the fly) as can be seen from this table from the paper:

The code of $\ell_1$-Houdini is on Christoph’s homepage, you may also reproduce the data in the above table with your own hardware.

I am not an optimizer by training. My road to optimization went through convex analysis. I started with variational methods for inverse problems and mathematical imaging with the goal to derive properties of minimizers of convex functions. Hence, I studied a lot of convex analysis. Later I got interested in how to actually solve convex optimization problems and started to read books about (convex) optimization. At first I was always distracted by the way optimizers treated constraints. To me, a convex optimization problem always looks like

$\displaystyle \min_x F(x).$

Everything can be packed into the convex objective. If you have a convex objective ${f}$ and a constraint ${c(x) \leq 0}$ with a convex function ${c}$, just take ${F = f + I_{\{c\leq 0\}}}$, i.e., add the indicator function of the constraint to the objective (for some strange reason, Wikipedia has the name and notation for indicator and characteristic function the other way round than I, and many others…). . Similarly for multiple constraints ${c_i(x)\leq 0}$ or linear equality constraints ${Ax=b}$ and such.

In this simple world it is particularly easy to characterize all solutions of convex minimization problems: They are just those ${x}$ for which

$\displaystyle 0\in\partial F(x).$

Simple as that. Only take the subgradient of the objective and that’s it.

When reading the optimization books and seeing how difficult the treatment of constraints is there, I was especially puzzled how complicated optimality conditions such as KKT looked like in contrast to ${0\in\partial F(x)}$ and also and by the notion of constraint qualifications.

These constraint qualifications are additional assumptions that are needed to ensure that a minimizer ${x}$ fulfills the KKT-conditions. For example, if one has constraints ${c_i(x)\leq 0}$ then the linear independence constraint qualification (LICQ) states that all the gradients ${\nabla c_i(x)}$ for constraints that are “active” (i.e. ${c_i(x)=0}$) have to be linearly independent.

It took me while to realize that there is a similar issue in my simple “convex analysis view” on optimization: When passing from the gradient of a function to the subgradient, many things stay as they are. But not everything. One thing that does change is the simple sum-rule. If ${F}$ and ${G}$ are differentiable, then ${\nabla(F+G)(x) = \nabla F(x) + \nabla G(x)}$, always. That’s not true for subgradients! You always have that ${\partial F(x) + \partial G(x) \subset \partial(F+G)(x)}$. The reverse inclusion is not always true but holds, e.g., if there is some point for which ${G}$ is finite and ${F}$ is continuous. At first glance this sounds like a very weak assumption. But in fact, this is precisely in the spirit of constraint qualifications!

Take two constraints ${c_1(x)\leq 0}$ and ${c_2(x)\leq 0}$ with convex and differentiable ${c_{1/2}}$. We can express these by ${x\in K_i = \{x\ :\ c_i(x)\leq 0\}}$ (${i=1,2}$). Then it is equivalent to write

$\displaystyle \min_x f(x)\ \text{s.t.}\ c_i(x)\leq 0$

and

$\displaystyle \min_x (f + I_{K_1} + I_{K_2})(x).$

So characterizing solution to either of these is just saying that ${0 \in\partial (f + I_{K_1} + I_{K_2})(x)}$. Oh, there we are: Are we allowed to pull the subgradient apart? We need to apply the sum rule twice and at some point we need that there is a point at which ${I_{K_1}}$ is finite and the other one ${I_{K_2}}$ is continuous (or vice versa)! But an indicator function is only continuous in the interior of the set where it is finite. So the simplest form of the sum rule only holds in the case where only one of two constraints is active! Actually, the sum rule holds in many more cases but it is not always simple to find out if it really holds for some particular case.

So, constraint qualifications are indeed similar to rules that guarantee that a sum rule for subgradients holds.

Geometrically speaking, both shall guarantee that if one “looks at the constraints individually” one still can see what is going on at points of optimality. It may well be that the sum of individual subgradients is too small to get any points with ${0\in \partial F(x) + \partial I_{K_1}(x) + \partial I_{K_2}(x)}$ but still there are solutions to the optimization problem!

As a very simple illustration take the constraints ${K_1 = \{(x,y)\ :\ y\leq 0\}}$ and ${K_2 = \{(x,y)\ :\ y^2\geq x\}}$ in two dimensions. The first constraint says “be in the lower half-plane” while the second says “be above the parabola ${y^2=x}$”. Now take the point ${(0,0)}$ which is on the boundary for both sets. It’s simple to see (geometrically and algebraically) that ${\partial I_{K_1}(0,0) = \{(0,y)\ :\ y\geq 0\}}$ and ${\partial I_{K_2}(0,0) = \{(0,y)\ :\ y\leq 0\}}$, so treating the constraints individually gives ${\partial I_{K_1}(0,0) + \partial I_{K_2}(0,0) = \{(0,y)\ :\ y\in{\mathbb R}\}}$. But the full story is that ${K_1\cap K_2 = \{(0,0)\}}$, thus ${\partial(I_{K_1} + I_{K_2})(0,0) = \partial I_{K_1\cap K_2}(0,0) = {\mathbb R}^2}$ and consequently, the subgradient is much bigger.

There are several answers to the following question:

1. What is a convex set?

For a convex set you probably know these definitions:

Definition 1 A subset ${C}$ of a real vector space ${V}$ is convex if for any ${x,y\in C}$ and ${\lambda\in[0,1]}$ it holds that ${\lambda x + (1-\lambda)y\in C}$.

In other words: If two points lie in the set, then every convex combination also lies in the set.

While this is a “definition from the inside”, convex sets can also be characterized “from the outside”. We add closedness as an assumption and get:

Definition 2 A closed subset ${C}$ of a real locally convex topological vector space ${V}$ is convex if it is the intersection of closed halfspaces (i.e. sets of the form ${\{x\in V\ :\ \langle a,x\rangle\geq c\}}$ for some ${a}$ in the dual space ${V^*}$ and ${c\in {\mathbb R}}$).

Moreover, we could define convex sets via convex functions:

Definition 3 A set ${C\subset V}$ is convex if there is a convex function ${f:V\rightarrow {\mathbb R}}$ such that ${C = \{x\ :\ f(x)\leq 0\}}$.

Of course, this only makes sense once we have defined convex functions. Hence, we could also ask the question:

2. What is a convex function?

We can define a convex function by means of convex sets as follows:

Definition 4 A function ${f:V\rightarrow{\mathbb R}}$ from a real vector space into the real numbers is convex, if its epigraph ${\text{epi} f = \{(x,\mu)\ :\ f(x)\leq \mu\}\subset V\times {\mathbb R}}$ is convex (as a subset of the vector space ${V\times{\mathbb R}}$).

The epigraph consists of the points ${(x,\mu)}$ which lie above the graph of the function and carries the same information as the function.

(Let me note that one can replace the real numbers here and in the following with the extended real numbers ${\bar {\mathbb R} = {\mathbb R}\cup\{-\infty,\infty\}}$ if one uses the right extension of the arithmetic and the obvious ordering, but we do not consider this in this post.)

Because epigraphs are not arbitrary convex sets but have a special form (if a point ${(x,\mu)}$ is in an epigraph, then every ${(x,\lambda)}$ with ${\lambda\geq \mu}$ is also in the epigraph), and because the underlying vector space ${V\times {\mathbb R}}$ comes with an order in the second component, some of the definitions for convex sets from above have a specialized form:

From the definition “convex combinations stay in the set” we get:

Definition 5 A function ${f:V\rightarrow {\mathbb R}}$ is convex, if for all ${x,y\in V}$ and ${\lambda\in [0,1]}$ it holds that

$\displaystyle f(\lambda x + (1-\lambda)y) \leq \lambda f(x) + (1-\lambda)f(y).$

In other words: The secant of any two points on the graph lies above the graph.

From the definition by “intersection of half spaces” we get another definition. Since we added closedness in this case we add this assumption also here. However, closedness of the epigraph is equivalent to lower-semicontinuity (lsc) of the function and since lsc functions are very convenient we use this notion:

Definition 6 A function ${f:V\rightarrow{\mathbb R}}$ is convex and lsc if it is the pointwise supremum of affine functions, i.e., for some set ${S}$ of tuples ${(a,c)\in V^*\times {\mathbb R}}$ it holds that

$\displaystyle f(x) = \sup_{(a,c) \in S} \langle a,x\rangle + c.$

A special consequence if this definition is that tangents to the graph of a convex function lie below the function. Another important consequence of this fact is that the local behavior of the function, i.e. its tangent plane at some point, carries some information about the global behavior. Especially, the property that the function lies above its tangent planes allows one to conclude that local minima of the function are also global. Probably the last properties are the ones which give convex functions a distinguished role, especially in the field of optimization.

Some of the previous definitions allow for generalizations into several direction and the quest for the abstract core of the notion of convexity has lead to the field of abstract convexity.

3. Abstract convexity

When searching for abstractions of the notion of convexity one may get confused by the various different approaches. For example there are generalized notions of convexity, e.g. for function of spaces of matrices (e.g. rank-one convexity, quasi-convexity or polyconvexity) and there are also further different notions like pseudo-convexity, invexity or another form ofquasi-convexity. Here we do not have generalization in mind but abstraction. Although both things are somehow related, the way of thinking is a bit different. Our aim is not to find a useful notion which is more general than the notion of convexity but to find a formulation which contains the notion of convexity and abstracts away some ingredients which probably not carry the essence of the notion.

In the literature one also finds several approaches in this direction and I mention only my favorite one (and I restrict myself to abstractly convex functions and not write about abstractly convex sets).

To me, the most appealing notion of an abstract convex function is an abstraction of the definition as “pointwise supremum of affine functions”. Let’s look again at the definition:

A function ${f:V\rightarrow {\mathbb R}}$ is convex and lsc if there is a subset ${S}$ of ${V^*\times {\mathbb R}}$ such that

$\displaystyle f(x) = \sup_{(a,c)\in S} \langle a,x\rangle + c.$

We abstract away the vector space structure and hence, also the duality, but keep the real-valuedness (together with its order) and define:

Definition 7 Let ${X}$ be a set and let ${W}$ be a set of real valued function on ${X}$. Then a function ${f:X\rightarrow{\mathbb R}}$ is said to be ${W}$-convex if there is a subset ${S}$ of ${W\times {\mathbb R}}$ such that

$\displaystyle f(x) = \sup_{(w,c)\in S} w(x) + c.$

What we did in this definition was simply to replace continuous affine functions ${x\mapsto \langle a,x\rangle}$ on a vector space by an arbitrary collection of real valued functions ${x\mapsto w(x)}$ on a set. On sees immediately that for every function ${w\in W}$ and any real number ${c}$ the function ${w+c}$ is ${W}$ convex (similarly to the fact that every affine linear function is convex).

Another nice thing about this approach is, that it allows for some notion of duality/conjugation. For ${f:V\rightarrow{\mathbb R}}$ we define the ${W}$-conjugate by

$\displaystyle f^{W*}(w) = \sup_{w\in W} \Big(w(x)- f(x) \Big)$

and we can even formulate a biconjugate

$\displaystyle f^{W**}(x) = \sup_x \Big(w(x) - f^{W*}(w)\Big).$

We naturally have a Fenchel inequality

$\displaystyle w(x) \leq f(x) + f^{W*}(w)$

and we may even define subgradients as usual. Note that a conventional subgradient is an element of the dual space which defines a tangent plane at the point where the subgradient is taken, that is, ${a\in V^*}$ is a subgradient of ${f}$ at ${x}$, if for all ${y\in V}$ it holds that ${f(y) \geq f(x) + \langle a,y-x\rangle}$ or

$\displaystyle f(y) - \langle a,y\rangle\geq f(x) - \langle a,x\rangle.$

A ${W}$-subgradient is an element of ${W}$, namely we define: ${w\in\partial^{W}f(x)}$ if

$\displaystyle \text{for all}\ y:\quad f(y) -w(y) \geq f(x) - w(x).$

Then we also have a Fenchel equality:

$\displaystyle w\in\partial^W f(x)\iff w(x) = f(x) + f^{W*}(w).$

One may also take dualization as the starting point for an abstraction.

4. Abstract conjugation

We could formulate the ${W}$-conjugate as follows: For ${\Phi(w,x) = w(x)}$ we have

$\displaystyle f^{W*}(x) = \sup_w \Big(\Phi(w,x) - f(x)\Big).$

This opens the door to another abstraction: For some sets ${X,W}$ (without any additional structure) define a coupling function ${\Phi: X\times W \rightarrow {\mathbb R}}$ and define the ${\Phi}$-conjugate as

$\displaystyle f^{\Phi*}(w) = \sup_x \Big(\Phi(w,x) - f(x)\Big)$

and the ${\Phi}$-biconjugate as

$\displaystyle f^{\Phi**}(x) = \sup_x \Big(\Phi(w,x) - f^{W*}(w)\Big)$