Consider a convex optimization problem of the form

\displaystyle \begin{array}{rcl} \min_{x}F(x) + G(Ax) \end{array}

with convex {F} and {G} and matrix {A}. (We formulate everything quite loosely, skipping over details like continuity and such, as they are irrelevant for the subject matter). Optimization problems of this type have a specific type of dual problem, namely the Fenchel-Rockafellar dual, which is

\displaystyle \begin{array}{rcl} \max_{y}-F^{*}(-A^{T}y) - G^{*}(y) \end{array}

and under certain regularity conditions it holds that the optimal value of the dual equals the the objective value of the primal and, moreover, that a pair {(x^{*},y^{*})} is both primal and dual optimal if and only if the primal dual gap is zero, i.e. if and only if

\displaystyle \begin{array}{rcl} F(x^{*})+G(Ax^{*}) + F^{*}(-A^{T}y^{*})+G^{*}(y^{*}) = 0. \end{array}

Hence, it is quite handy to use the primal dual gap as a stopping criteria for iterative methods to solve these problems. So, if one runs an algorithm which produces primal iterates {x^{k}} and dual iterates {y^{k}} one can monitor

\displaystyle \begin{array}{rcl} \mathcal{G}(x^{k},y^{k}) = F(x^{k})+G(Ax^{k}) + F^{*}(-A^{T}y^{k})+G^{*}(y^{k}). \end{array}

and stop if the value falls below a desired tolerance.

There is some problem with this approach which appears if the method produces infeasible iterates in the sense that one of the four terms in {\mathcal{G}} is actually {+\infty}. This may be the case if {F} or {G} are not everywhere finite or, loosely speaking, have linear growth in some directions (since then the respective conjugate will not be finite everywhere). In the rest of the post, I’ll sketch a general method that can often solve this particular problem.

For the sake of simplicity, consider the following primal dual algorithm

\displaystyle \begin{array}{rcl} x^{k+1} & = &\mathrm{prox}_{\tau F}(x^{k}-A^{T}y^{k})\\ y^{k+1} & = &\mathrm{prox}_{\sigma G^{*}}(y^{k}+\sigma A(2x^{k+1}-x^{k})) \end{array}

(also know as primal dual hybrid gradient method or Chambolle-Pock’s algorithm). It converges as soon as {\sigma\tau\leq \|A\|^{-2}}.

While the structure of the algorithm ensures that {F(x^{k})} and {G^{*}(y^{k})} are always finite (since always {\mathrm{prox}_{F}(x)\in\mathrm{dom}(F)}), is may be that {F^{*}(-A^{T}y^{k})} or {G(Ax^{k})} are indeed infinite, rendering the primal dual gap useless.

Let us assume that the problematic term is {F^{*}(-A^{T}y^{k})}. Here is a way out in the case where one can deduce some a-priori bounds on {x^{*}}, i.e. a bounded and convex set {C} with {x^{*}\in C}. In fact, this is often the case (e.g. one may know a-priori that there exist lower bounds {l_{i}} and upper bounds {u_{i}}, i.e. it holds that {l_{i}\leq x^{*}_{i}\leq u_{i}}). Then, adding these constraints to the problem will not change the solution.

Let us see, how this changes the primal dual gap: We set {\tilde F(x) = F(x) + I_{C}(x)} where {C} is the set which models the bound constraints. Since {C} is a bounded convex set and {F} is finite on {C}, it is clear that

\displaystyle \begin{array}{rcl} \tilde F^{*}(\xi) = \sup_{x\in C}\,\langle \xi,x\rangle - F(x) \end{array}

is finite for every {\xi}. This leads to a finite duality gap. However, one should also adapt the prox operator. But this is also simple in the case where the constraint {C} and the function {F} are separable, i.e. {C} encodes bound constraints as above (in other words {C = [l_{1},u_{1}]\times\cdots\times [l_{n},u_{n}]}) and

\displaystyle \begin{array}{rcl} F(x) = \sum_{i} f_{i}(x_{i}). \end{array}

Here it holds that

\displaystyle \begin{array}{rcl} \mathrm{prox}_{\sigma \tilde F}(x)_{i} = \mathrm{prox}_{\sigma f_{i} + I_{[l_{i},u_{i}]}}(x_{i}) \end{array}

and it is simple to see that

\displaystyle \begin{array}{rcl} \mathrm{prox}_{\sigma f_{i} + I_{[l_{i},u_{i}]}}(x_{i}) = \mathrm{proj}_{[l_{i},u_{i}]}\mathrm{prox}_{\tau f_{i}}(x_{i}), \end{array}

i.e., only uses the proximal operator of {F} and project onto the constraints. For general {C}, this step may be more complicated.

One example, where this makes sense is {L^{1}-TV} denoising which can be written as

\displaystyle \begin{array}{rcl} \min_{u}\|u-u^{0}\|_{1} + \lambda TV(u). \end{array}

Here we have

\displaystyle \begin{array}{rcl} F(u) = \|u-u^{0}\|_{1},\quad A = \nabla,\quad G(\phi) = I_{|\phi_{ij}|\leq 1}(\phi). \end{array}

The guy that causes problems here is {F^{*}} which is an indicator functional and indeed {A^{T}\phi^{k}} will usually be dual infeasible. But since {u} is an image with a know range of gray values one can simple add the constraints {0\leq u\leq 1} to the problem and obtains a finite dual while still keeping a simple proximal operator. It is quite instructive to compute {\tilde F} in this case.

Last week Christoph Brauer, Andreas Tillmann and myself uploaded the paper A Primal-Dual Homotopy Algorithm for {\ell_{1}}-Minimization with {\ell_{\infty}}-Constraints to the arXiv (and we missed being the first ever arXiv-paper with a non-trivial five-digit identifier by twenty-something papers…). This paper specifically deals with the optimization problem

\displaystyle \begin{array}{rcl} \min_{x}\|x\|_{1}\quad\text{s.t.}\quad \|Ax-b\|_{\infty}\leq \delta \end{array}

where {A} and {b} are a real matrix and vector, respecively, of compatible size. While the related problem with {\ell_{2}} constraint has been addressed quite often (and the penalized problem {\min_{x}\|x\|_{1} + \tfrac1{2\lambda}\|Ax-b\|_{2}^{2}} is even more popular) there is not much code around to solve this specific problem. Obvious candidates are

  • Linear optimization: The problem can be recast as a linear program: The constraint is basically linear already (rewriting it with help of the ones vector {\mathbf{1}} as {Ax\leq \delta\mathbf{1}+b}, {-Ax\leq \delta\mathbf{1}-b}) and for the objective one can, for example, perform a variable split {x = x^{+}-x^{-}}, {x^{-},x^{+}\geq 0} and then write {\|x\|_{1} = \mathbf{1}^{T}x^{+}+ \mathbf{1}^{T}x^{-}}.
  • Splitting methods: The problem is convex problem of the form {\min_{x}F(x) + G(Ax)} with

    \displaystyle \begin{array}{rcl} F(x) & = & \|x\|_{1}\\ G(y) & = & \begin{cases} 0 & \|y-b\|\leq\delta\\ \infty & \text{else.} \end{cases} \end{array}

    and hence, several methods for these kind of problem are available, such as the alternating direction method of multipliers or the Chambolle-Pock algorithm.

The formulation as linear program has the advantage that one can choose among a lot of highly sophisticated software tools and the second has the advantage that the methods are very easy to code, usually in just a few lines. Drawbacks are, that both methods do exploit too much specific structure of the problem.

Application of the problem with {\ell_{\infty}} constraint are, for example:

  • The Dantzig selector, a statistical estimation technique, were the problem is

    \displaystyle \begin{array}{rcl} \min_{x}\|x\|_{1}\quad\text{s.t.}\quad \|A^{T}(Ax-b)\|_{\infty}\leq\delta. \end{array}

  • Sparse dequantization as elaborated here by Jacques, Hammond and Fadili and applied here to de-quantizaton of speech signals by Christoph, Timo Gerkmann and myself.

We wanted to see if one of the most efficient methods known for sparse reconstruction with {\ell_{2}} penalty, namely the homotopy method, can be adapted to this case. The homotopy method for {\min_{x}\|x\|_{1} + \tfrac1{2\lambda}\|Ax-b\|_{2}^{2}} builds on the observation that the solution for {\lambda\geq \|A^{T}b\|_{\infty}} is zero and that the set of solutions {x_{\lambda}}, parameterized by the parameter {\lambda}, is piecewise linear. Hence, on can start from {\lambda_{0}= \|A^{T}b\|_{\infty}}, calculate which direction to go, how far the breakpoint is away, go there and start over. I’ve blogged on the homotopy method here already and there you’ll find some links to great software packages, but also the fact that there can be up to exponentially many breakpoints. However, in practice the homotopy method is usually blazingly fast and with some care, can be made numerically stable and accurate, see, e.g. our extensive study here (and here is the optimization online preprint).

The problem with {\ell_{\infty}} constraint seems similar, especially it is clear that for {\delta = \|b\|_{\infty}}, {x=0} is a solution. It is also not so difficult to see that there is a piecewise linear path of solutions {x_{\delta}}. What is not so clear is, how it can be computed. It turned out, that in this case the whole truth can be seen when the problem is viewed from a primal-dual viewpoint. The associated dual problem is

\displaystyle \begin{array}{rcl} \max_{y}\ -b^{T}y - \delta\|y\|_{1}\quad\text{s.t.}\quad\|A^{T}y\|_{\infty}\leq\infty \end{array}

and a pair {(x^{*},y^{*})} is primal and dual optimal if and only if

\displaystyle \begin{array}{rcl} -A^{T}y^{*}&\in&\text{Sign}(x^{*})\\ Ax^{*}-b & \in & \delta\text{Sign}(y^{*}) \end{array}

(where {\text{Sign}} denotes the sign function, multivalued at zero, giving {[-1,1]} there). One can note some things from the primal-dual optimality system:

  • You may find a direction {d} such that {(x^{*}+td,y^{*})} stays primal-dual optimal for the constraint {\leq\delta-t} for small {t},
  • for a fixed primal optimal {x_{\delta}} there may be several dual optimal {y_{\delta}}.

On the other hand, it is not that clear which of the probably many dual optimal {y_{\delta}} allows to find a new direction {d} such that {x_{\delta}+td} with stay primal optimal when reducing {\delta}. In fact, it turned out that, at a breakpoint, a new dual variable needs to be found to allow for the next jump in the primal variable. So, the solution path is piecewise linear in the primal variable, but piecewise constant in the dual variable (a situation similar to the adaptive inverse scale space method).

What we found is, that some adapted theorem of the alternative (a.k.a. Farkas’ Lemma) allows to calculate the next dual optimal {y} such that a jump in {x} will be possible.

What is more, is that the calculation of a new primal or dual optimal point amounts to solving a linear program (in contrast to a linear system for {\ell_{2}} homotopy). Hence, the trick of up- and downdating a suitable factorization of a suitable matrix to speed up computation does not work. However, one can somehow leverage the special structure of the problem and use a tailored active set method to progress through the path. Our numerical tests indicated is that the resulting method, which we termed {\ell_{1}}-Houdini, is able to solve moderately large problems faster than a commercial LP-solver (while also not only solving the given problem, but calculating the whole solution path on the fly) as can be seen from this table from the paper:

085_runtime_l1houdini

The code of \ell_1-Houdini is on Christoph’s homepage, you may also reproduce the data in the above table with your own hardware.