Optimization


Taking the derivative of the loss function of a neural network can be quite cumbersome. Even taking the derivative of a single layer in a neural network often results in expressions cluttered with indices. In this post I’d like to show an index-free way to do it.

Consider the map {\sigma(Wx+b)} where {W\in{\mathbb R}^{m\times n}} is the weight matrix, {b\in{\mathbb R}^{m}} is the bias, {x\in{\mathbb R}^{n}} is the input, and {\sigma} is the activation function. Usually {\sigma} denotes both a scalar function (i.e. mapping {{\mathbb R}\rightarrow {\mathbb R}}) and the function mapping {{\mathbb R}^{m}\rightarrow{\mathbb R}^{m}} which applies {\sigma} in each coordinate. In training neural networks, we try to optimize for the best parameters {W} and {b}, so we need to take the derivative with respect to {W} and {b}. Hence, we consider the map

\displaystyle  \begin{array}{rcl}  G(W,b) = \sigma(Wx+b). \end{array}

This map {G} is a composition of the map {(W,b)\mapsto Wx+b} and {\sigma}, and since the former map is linear in the joint variable {(W,b)}, the derivative of {G} should be pretty simple. What makes the computation a little less straightforward is the fact that we are usually not used to viewing the matrix-vector product {Wx} as a linear map in {W} but in {x}. So let's rewrite the thing:

There are two particular notions which come in handy here: The Kronecker product of matrices and the vectorization of matrices. Vectorization takes some {W\in{\mathbb R}^{m\times n}} given columnwise {W = [w_{1}\ \cdots\ w_{n}]} and maps it by

\displaystyle  \begin{array}{rcl}  \mathrm{Vec}:{\mathbb R}^{m\times n}\rightarrow{\mathbb R}^{mn},\quad \mathrm{Vec}(W) = \begin{bmatrix} w_{1}\\\vdots\\w_{n} \end{bmatrix}. \end{array}

The Kronecker product of matrices {A\in{\mathbb R}^{m\times n}} and {B\in{\mathbb R}^{k\times l}} is a matrix in {{\mathbb R}^{mk\times nl}}

\displaystyle  \begin{array}{rcl}  A\otimes B = \begin{bmatrix} a_{11}B & \cdots &a_{1n}B\\ \vdots & & \vdots\\ a_{m1}B & \cdots & a_{mn}B \end{bmatrix}. \end{array}

We will build on the following marvelous identity: For matrices {A}, {B}, {C} of compatible size we have that

\displaystyle  \begin{array}{rcl}  \mathrm{Vec}(ABC) = (C^{T}\otimes A)\mathrm{Vec}(B). \end{array}
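As a quick numerical sanity check, here is a minimal numpy sketch of this identity; note that {\mathrm{Vec}} stacks columns, so we flatten in column-major order:

```python
import numpy as np

# Check Vec(ABC) = (C^T kron A) Vec(B); Vec stacks columns, hence order='F'
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 5))
C = rng.standard_normal((5, 2))

lhs = (A @ B @ C).flatten(order="F")
rhs = np.kron(C.T, A) @ B.flatten(order="F")
print(np.allclose(lhs, rhs))  # True
```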

Why is this helpful? It allows us to rewrite

\displaystyle  \begin{array}{rcl}  Wx & = & \mathrm{Vec}(Wx)\\ & = & \mathrm{Vec}(I_{m}Wx)\\ & = & \underbrace{(x^{T}\otimes I_{m})}_{\in{\mathbb R}^{m\times mn}}\underbrace{\mathrm{Vec}(W)}_{\in{\mathbb R}^{mn}}. \end{array}

So we can also rewrite

\displaystyle  \begin{array}{rcl}  Wx +b & = & \mathrm{Vec}(Wx+b )\\ & = & \mathrm{Vec}(I_{m}Wx + b)\\ & = & \underbrace{ \begin{bmatrix} x^{T}\otimes I_{m} & I_{m} \end{bmatrix} }_{\in{\mathbb R}^{m\times (mn+m)}}\underbrace{ \begin{bmatrix} \mathrm{Vec}(W)\\b \end{bmatrix} }_{\in{\mathbb R}^{mn+m}}\\ &=& ( \underbrace{\begin{bmatrix} x^{T} & 1 \end{bmatrix}}_{\in{\mathbb R}^{1\times(n+1)}}\otimes I_{m}) \begin{bmatrix} \mathrm{Vec}(W)\\b \end{bmatrix}. \end{array}

So our map {G(W,b) = \sigma(Wx+b)} mapping {{\mathbb R}^{m\times n}\times {\mathbb R}^{m}\rightarrow{\mathbb R}^{m}} can be rewritten as

\displaystyle  \begin{array}{rcl}  \bar G( \begin{pmatrix} \mathrm{Vec}(W)\\b \end{pmatrix}) = \sigma( ( \begin{bmatrix} x^{T} & 1 \end{bmatrix}\otimes I_{m}) \begin{bmatrix} \mathrm{Vec}(W)\\b \end{bmatrix}) \end{array}

mapping {{\mathbb R}^{mn+m}\rightarrow{\mathbb R}^{m}}. Since {\bar G} is just a composition of {\sigma}, applied coordinatewise, and a linear map, now given as a matrix, the derivative of {\bar G} (i.e. the Jacobian, a matrix in {{\mathbb R}^{m\times (mn+m)}}) is calculated simply as

\displaystyle  \begin{array}{rcl}  D\bar G( \begin{pmatrix} \mathrm{Vec}(W)\\b \end{pmatrix}) & = & D\sigma(Wx+b)( \begin{bmatrix} x^{T} & 1 \end{bmatrix}\otimes I_{m})\\ &=& \underbrace{\mathrm{diag}(\sigma'(Wx+b))}_{\in{\mathbb R}^{m\times m}}\underbrace{( \begin{bmatrix} x^{T} & 1 \end{bmatrix}\otimes I_{m})}_{\in{\mathbb R}^{m\times(mn+m)}}\in{\mathbb R}^{m\times(mn+m)}. \end{array}

While this representation of the derivative of a single layer of a neural network with respect to its parameters is not particularly simple, it is still index free and moreover, straightforward to implement in languages which provide functions for the Kronecker product and vectorization. If you do this, make sure to take advantage of sparse matrices for the identity matrix and the diagonal matrix as otherwise the memory of your computer will be flooded with zeros.
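Here is a minimal numpy/scipy sketch of this Jacobian; the logistic sigmoid used as activation (and all the names) are just assumptions for this example:

```python
import numpy as np
import scipy.sparse as sp

def layer_jacobian(W, b, x, sigma_prime):
    """Jacobian of G(W,b) = sigma(Wx+b) w.r.t. (Vec(W), b), an m x (mn+m) matrix."""
    m, n = W.shape
    z = W @ x + b
    # [x^T 1] kron I_m, kept sparse to avoid storing lots of zeros
    lin = sp.kron(sp.csr_matrix(np.append(x, 1.0).reshape(1, -1)), sp.identity(m))
    return sp.diags(sigma_prime(z)) @ lin

# example with the logistic sigmoid (just an assumption; any smooth activation works)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
sigmoid_prime = lambda z: sigmoid(z) * (1.0 - sigmoid(z))

rng = np.random.default_rng(1)
m, n = 3, 4
W, b, x = rng.standard_normal((m, n)), rng.standard_normal(m), rng.standard_normal(n)
J = layer_jacobian(W, b, x, sigmoid_prime)
print(J.shape)  # (3, 15) = (m, m*n + m)
```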

Now let’s add a scalar function {L} (e.g. to produce a scalar loss that we can minimize), i.e. we consider the map

\displaystyle  \begin{array}{rcl}  F( \begin{pmatrix} \mathrm{Vec}(W)\\b \end{pmatrix}) = L(G(W,b)) = L(\bar G( \begin{pmatrix} \mathrm{Vec}(W)\\b \end{pmatrix})). \end{array}

The derivative is obtained by just another application of the chain rule:

\displaystyle  \begin{array}{rcl}  DF( \begin{pmatrix} \mathrm{Vec}(W)\\b \end{pmatrix}) = DL(G(W,b))D\bar G( \begin{pmatrix} \mathrm{Vec}(W)\\b \end{pmatrix}). \end{array}

If we want to take gradients, we just transpose the expression and get

\displaystyle  \begin{array}{rcl}  \nabla F( \begin{pmatrix} \mathrm{Vec}(W)\\b \end{pmatrix}) &=& D\bar G( \begin{pmatrix} \mathrm{Vec}(W)\\b \end{pmatrix})^{T} DL(G(W,b))^{T}\\ &=& ([x^{T}\ 1]\otimes I_{m})^{T}\mathrm{diag}(\sigma'(Wx+b))\nabla L(G(W,b))\\ &=& \underbrace{( \begin{bmatrix} x\\ 1 \end{bmatrix} \otimes I_{m})}_{\in{\mathbb R}^{(mn+m)\times m}}\underbrace{\mathrm{diag}(\sigma'(Wx+b))}_{\in{\mathbb R}^{m\times m}}\underbrace{\nabla L(G(W,b))}_{\in{\mathbb R}^{m}}. \end{array}

Note that the right hand side is indeed a vector in {{\mathbb R}^{mn+m}} and hence, can be reshaped to a tuple {(W,b)} of an {m\times n} matrix and an {m} vector.
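As a small sketch of how this gradient can be assembled and reshaped back into {(W,b)}, assume for illustration the quadratic loss {L(y) = \tfrac12\|y-y^{0}\|_{2}^{2}} (so {\nabla L(y) = y-y^{0}}); any other differentiable loss works the same way:

```python
import numpy as np

def layer_gradient(W, b, x, y0, sigma, sigma_prime):
    """Gradient of F(W,b) = L(sigma(W x + b)) with the (assumed) quadratic loss
    L(y) = 0.5*||y - y0||^2, assembled via the Kronecker product formula above."""
    m, n = W.shape
    z = W @ x + b
    v = sigma_prime(z) * (sigma(z) - y0)             # diag(sigma'(Wx+b)) grad L(G(W,b))
    g = np.kron(np.append(x, 1.0), np.eye(m)).T @ v  # ([x^T 1] kron I_m)^T v
    grad_W = g[: m * n].reshape((m, n), order="F")   # undo the column-stacking Vec
    grad_b = g[m * n :]
    # sanity check: the same gradient written without Kronecker products
    assert np.allclose(grad_W, np.outer(v, x)) and np.allclose(grad_b, v)
    return grad_W, grad_b
```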

A final remark: the Kronecker product is related to tensor products. If {A} and {B} represent linear maps {X_{1}\rightarrow Y_{1}} and {X_{2}\rightarrow Y_{2}}, respectively, then {A\otimes B} represents the tensor product of the maps, {X_{1}\otimes X_{2}\rightarrow Y_{1}\otimes Y_{2}}. This relation to tensor products and tensors explains where the tensor in TensorFlow comes from.

 

The problem of optimal transport of mass from one distribution to another can be stated in many forms. Here is the formulation going back to Kantorovich: We have two measurable sets {\Omega_{1}} and {\Omega_{2}}, coming with two measures {\mu_{1}} and {\mu_{2}}. We also have a function {c:\Omega_{1}\times \Omega_{2}\rightarrow {\mathbb R}} which assigns a transport cost, i.e. {c(x_{1},x_{2})} is the cost that it takes to carry one unit of mass from point {x_{1}\in\Omega_{1}} to {x_{2}\in\Omega_{2}}. What we want is a plan that says where the mass in {\mu_{1}} should be placed in {\Omega_{2}} (or vice versa). There are different ways to formulate this mathematically.

A simple way is to look for a map {T:\Omega_{1}\rightarrow\Omega_{2}} which says that the mass at {x_{1}} should be moved to {T(x_{1})}. While natural, there is a serious problem with this: What if not all mass at {x_{1}} should go to the same point in {\Omega_{2}}? This happens in simple situations where all mass in {\Omega_{1}} sits in just one point, but there are at least two different points in {\Omega_{2}} where mass should end up. This is not going to work with a map {T} as above. So, the map {T} is not flexible enough to model all kinds of transport we may need.

What we want is a way to distribute mass from one point in {\Omega_{1}} to the whole set {\Omega_{2}}. This looks like we want maps {\mathcal{T}} which map points in {\Omega_{1}} to functions on {\Omega_{2}}, i.e. something like {\mathcal{T}:\Omega_{1}\rightarrow (\Omega_{2}\rightarrow{\mathbb R})} where {(\Omega_{2}\rightarrow{\mathbb R})} stands for some set of functions on {\Omega_{2}}. We can de-curry this function to some {\tau:\Omega_{1}\times\Omega_{2}\rightarrow{\mathbb R}} by {\tau(x_{1},x_{2}) = \mathcal{T}(x_{1})(x_{2})}. That’s good in principle, but we still run into problems when the target mass distribution {\mu_{2}} is singular in the sense that {\Omega_{2}} is a “continuous” set and there are single points in {\Omega_{2}} that carry some mass according to {\mu_{2}}. Since we are in the world of measure theory already, the way out suggests itself: Instead of a function {\tau} on {\Omega_{1}\times\Omega_{2}} we look for a measure {\pi} on {\Omega_{1}\times \Omega_{2}} as a transport plan.

The demand that we should carry all of the mass in {\Omega_{1}} to reach all of {\mu_{2}} is formulated by marginals. For simplicity we just write these constraints as

\displaystyle \int_{\Omega_{2}}\pi\, d x_{2} = \mu_{1},\qquad \int_{\Omega_{1}}\pi\, d x_{1} = \mu_{2}

(with the understanding that the first equation really means that for all continuous functions {f:\Omega_{1}\rightarrow {\mathbb R}} it holds that {\int_{\Omega_{1}\times \Omega_{2}} f(x_{1})\,d\pi(x_{1},x_{2}) = \int_{\Omega_{1}}f(x_{1})\,d\mu_{1}(x_{1})}).

This leads us to the full transport problem

\displaystyle \min_{\pi}\int_{\Omega_{1}\times \Omega_{2}}c(x_{1},x_{2})\,d\pi(x_{1},x_{2})\quad \text{s.t.}\quad \int_{\Omega_{2}}\pi\, d x_{2} = \mu_{1},\quad \int_{\Omega_{1}}\pi\, d x_{1} = \mu_{2}.
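For finitely supported measures this is just a linear program: the plan becomes a nonnegative matrix with prescribed row and column sums. Here is a small brute-force sketch with scipy (only an illustration of the formulation, not an efficient transport solver; all data are made up):

```python
import numpy as np
from scipy.optimize import linprog

# Discrete Kantorovich problem: mu1, mu2 are probability vectors,
# C[i, j] = c(x_i, y_j), and the plan pi is an m x n matrix.
rng = np.random.default_rng(0)
m, n = 4, 5
x, y = rng.standard_normal(m), rng.standard_normal(n)
C = (x[:, None] - y[None, :]) ** 2          # squared-distance cost
mu1 = np.full(m, 1.0 / m)
mu2 = np.full(n, 1.0 / n)

# marginal constraints: row sums = mu1, column sums = mu2
A_eq = np.vstack([np.kron(np.eye(m), np.ones(n)),    # sums over j
                  np.kron(np.ones(m), np.eye(n))])   # sums over i
b_eq = np.concatenate([mu1, mu2])
res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
pi = res.x.reshape(m, n)
print(res.fun, pi.round(3))
```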

There is the following theorem which characterizes optimality of a plan and which is the topic of this post:

Theorem 1 (Fundamental theorem of optimal transport) Under some technicalities we can say that a plan {\pi} which fulfills the marginal constraints is optimal if and only if one of the following equivalent conditions is satisfied:

  1. The support {\mathrm{supp}(\pi)} of {\pi} is {c}-cyclically monotone.
  2. There exists a {c}-concave function {\phi} such that its {c}-superdifferential contains the support of {\pi}, i.e. {\mathrm{supp}(\pi)\subset \partial^{c}\phi}.

 

A few clarifications: The technicalities involve continuity, integrability, and boundedness conditions of {c} and integrability conditions on the marginals. The full theorem can be found as Theorem 1.13 in A user’s guide to optimal transport by Ambrosio and Gigli. Also the notions {c}-cyclically monotone, {c}-concave and {c}-superdifferential probably need explanation. We start with a simpler notion: {c}-monotonicity:

Definition 2 A set {\Gamma\subset\Omega_{1}\times\Omega_{2}} is {c}-monotone, if for all {(x_{1},x_{2}),(x_{1}',x_{2}')\in\Gamma} it holds that

\displaystyle c(x_{1},x_{2}) + c(x_{1}',x_{2}')\leq c(x_{1},x_{2}') + c(x_{1}',x_{2}).

 

If you find it unclear what this has to do with monotonicity, look at this example:

Example 1 Let {\Omega_{1/2}\subset{\mathbb R}^{d}} and let {c(x_{1},x_{2}) = -\langle x_{1},x_{2}\rangle} be the negative of the usual scalar product. Then {c}-monotonicity is the condition that for all {(x_{1},x_{2}),(x_{1}',x_{2}')\in\Gamma\subset\Omega_{1}\times\Omega_{2}} it holds that

\displaystyle 0\leq \langle x_{1}-x_{1}',x_{2}-x_{2}'\rangle

which may look more familiar. Indeed, when {\Omega_{1}} and {\Omega_{2}} are subsets of the real line, the above condition means that the set {\Gamma} somehow “moves up in {\Omega_{2}}” if we “move right in {\Omega_{1}}”. So {c}-monotonicity for {c(x_{1},x_{2}) = -\langle x_{1},x_{2}\rangle} is something like “monotonically increasing”. Similarly, for {c(x_{1},x_{2}) = \langle x_{1},x_{2}\rangle}, {c}-monotonicity means “monotonically decreasing”.

You may say that both {c(x_{1},x_{2}) = \langle x_{1},x_{2}\rangle} and {c(x_{1},x_{2}) = -\langle x_{1},x_{2}\rangle} are strange cost functions and I can’t argue with that. But here comes: {c(x_{1},x_{2}) = |x_{1}-x_{2}|^{2}} ({|\,\cdot\,|} being the euclidean norm) seems more natural, right? But if we have a transport plan {\pi} for this {c} for some marginals {\mu_{1}} and {\mu_{2}} we also have

\displaystyle \begin{array}{rcl} \int_{\Omega_{1}\times \Omega_{2}}c(x_{1},x_{2})d\pi(x_{1},x_{2}) & = & \int_{\Omega_{1}\times \Omega_{2}}|x_{1}|^{2} d\pi(x_{1},x_{2})\\ &&\quad- 2\int_{\Omega_{1}\times \Omega_{2}}\langle x_{1},x_{2}\rangle d\pi(x_{1},x_{2})\\ && \qquad+ \int_{\Omega_{1}\times \Omega_{2}} |x_{2}|^{2}d\pi(x_{1},x_{2})\\ & = &\int_{\Omega_{1}}|x_{1}|^{2}d\mu_{1}(x_{1}) - 2\int_{\Omega_{1}\times \Omega_{2}}\langle x_{1},x_{2}\rangle d\pi(x_{1},x_{2}) + \int_{\Omega_{2}}|x_{2}|^{2}d\mu_{2}(x_{2}) \end{array}

i.e., the transport cost for {c(x_{1},x_{2}) = |x_{1}-x_{2}|^{2}} differs from the one for {c(x_{1},x_{2}) = -\langle x_{1},x_{2}\rangle} only by a positive factor and a constant independent of {\pi}, so we may as well use the latter.

The fundamental theorem of optimal transport uses the notion of {c}-cyclical monotonicity which is stronger than just {c}-monotonicity:

Definition 3 A set {\Gamma\subset \Omega_{1}\times \Omega_{2}} is {c}-cyclically monotone, if for all {(x_{1}^{i},x_{2}^{i})\in\Gamma}, {i=1,\dots,n} and all permutations {\sigma} of {\{1,\dots,n\}} it holds that

\displaystyle \sum_{i=1}^{n}c(x_{1}^{i},x_{2}^{i}) \leq \sum_{i=1}^{n}c(x_{1}^{i},x_{2}^{\sigma(i)}).

 

For {n=2} we get back the notion of {c}-monotonicity.
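For a finite set of pairs the definition can be checked by brute force over all permutations. Here is a small sketch (exponential in the number of points, purely for illustration); consistent with Example 1, an increasingly matched set on the real line passes the test for {c(x_{1},x_{2})=-x_{1}x_{2}}, while a mixed-up matching does not:

```python
import numpy as np
from itertools import permutations

def is_c_cyclically_monotone(points, c, tol=1e-12):
    """Brute-force check of c-cyclical monotonicity for a finite set of pairs
    (x1, x2); exponential in the number of points, for illustration only."""
    n = len(points)
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    base = sum(c(xs[i], ys[i]) for i in range(n))
    return all(base <= sum(c(xs[i], ys[s[i]]) for i in range(n)) + tol
               for s in permutations(range(n)))

# On the real line with c(x1, x2) = -x1*x2, an increasing matching passes the test
pts_increasing = [(0.0, 0.1), (1.0, 0.5), (2.0, 0.7)]
pts_mixed = [(0.0, 0.7), (1.0, 0.1), (2.0, 0.5)]
c = lambda a, b: -a * b
print(is_c_cyclically_monotone(pts_increasing, c))  # True
print(is_c_cyclically_monotone(pts_mixed, c))       # False
```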

Definition 4 A function {\phi:\Omega_{1}\rightarrow {\mathbb R}} is {c}-concave if there exists some function {\psi:\Omega_{2}\rightarrow{\mathbb R}} such that

\displaystyle \phi(x_{1}) = \inf_{x_{2}\in\Omega_{2}}c(x_{1},x_{2}) - \psi(x_{2}).

 

This definition of {c}-concavity resembles the notion of convex conjugate:

Example 2 Again using {c(x_{1},x_{2}) = -\langle x_{1},x_{2}\rangle} we get that a function {\phi} is {c}-concave if

\displaystyle \phi(x_{1}) = \inf_{x_{2}}-\langle x_{1},x_{2}\rangle - \psi(x_{2}),

and, as an infimum over linear functions, {\phi} is clearly concave in the usual way.

Definition 5 The {c}-superdifferential of a {c}-concave function is

\displaystyle \partial^{c}\phi = \{(x_{1},x_{2})\mid \phi(x_{1}) + \phi^{c}(x_{2}) = c(x_{1},x_{2})\},

where {\phi^{c}} is the {c}-conjugate of {\phi} defined by

\displaystyle \phi^{c}(x_{2}) = \inf_{x_{1}\in\Omega_{1}}c(x_{1},x_{2}) -\phi(x_{1}).

 

Again one may look at {c(x_{1},x_{2}) = -\langle x_{1},x_{2}\rangle} and observe that the {c}-superdifferential is the usual superdifferential related to the supergradient of concave functions (there is a Wikipedia page for subgradient only, but the concept is the same with reversed signs in some sense).

Now let us sketch the proof of the fundamental theorem of optimal transport:

Proof (of the fundamental theorem of optimal transport). Let {\pi} be an optimal transport plan. We aim to show that {\mathrm{supp}(\pi)} is {c}-cyclically monotone and assume the contrary. That is, we assume that there are points {(x_{1}^{i},x_{2}^{i})\in\mathrm{supp}(\pi)} and a permutation {\sigma} such that

\displaystyle \sum_{i=1}^{n}c(x_{1}^{i},x_{2}^{i}) > \sum_{i=1}^{n}c(x_{1}^{i},x_{2}^{\sigma(i)}).

We aim to construct a {\tilde\pi} such that {\tilde\pi} is still feasible but has a smaller transport cost. To do so, we note that continuity of {c} implies that there are neighborhoods {U_{i}} of {x_{1}^{i}} and {V_{i}} of {x_{2}^{i}} such that for all {u_{i}\in U_{i}} and {v_{i}\in V_{i}} it holds that

\displaystyle \sum_{i=1}^{n}c(u_{i},v_{\sigma(i)}) - c(u_{i},v_{i})<0.

We use this to construct a better plan {\tilde \pi}: Take the mass of {\pi} in the sets {U_{i}\times V_{i}} and shift it around. The full construction is a little messy to write down: Define a probability measure {\nu} on the product {X = \bigotimes_{i=1}^{n}U_{i}\times V_{i}} as the product of the measures {\tfrac{1}{\pi(U_{i}\times V_{i})}\pi|_{U_{i}\times V_{i}}}. Now let {P^{U_{i}}} and {P^{V_{i}}} be the projections of {X} onto {U_{i}} and {V_{i}}, respectively, and set

\displaystyle \eta = \tfrac{\min_{i}\pi(U_{i}\times V_{i})}{n}\sum_{i=1}^{n}\left((P^{U_{i}},P^{V_{\sigma(i)}})_{\#}\nu - (P^{U_{i}},P^{V_{i}})_{\#}\nu\right)

where {_{\#}} denotes the pushforward of measures. Note that the new measure {\eta} is signed and that {\tilde\pi = \pi + \eta} fulfills

  1. {\tilde\pi} is a non-negative measure
  2. {\tilde\pi} is feasible, i.e. has the correct marginals
  3. {\int c\,d\tilde \pi<\int c\,d\pi}

which, all together, gives a contradiction to optimality of {\pi}. The implication of item 1 to item 2 of the theorem is not really related to optimal transport but a general fact about {c}-concavity and {c}-cyclical monotonicity (cf. this previous blog post of mine where I wrote a similar statement for convexity). So let us just prove the implication from item 2 to optimality of {\pi}: Let {\pi} fulfill item 2, i.e. {\pi} is feasible and {\mathrm{supp}(\pi)} is contained in the {c}-superdifferential of some {c}-concave function {\phi}. Moreover let {\tilde\pi} be any feasible transport plan. We aim to show that {\int c\,d\pi\leq \int c\,d\tilde\pi}. By definition of the {c}-superdifferential and the {c}-conjugate we have

\displaystyle \begin{array}{rcl} \phi(x_{1}) + \phi^{c}(x_{2}) &=& c(x_{1},x_{2})\ \forall (x_{1},x_{2})\in\partial^{c}\phi\\ \phi(x_{1}) + \phi^{c}(x_{2}) & \leq& c(x_{1},x_{2})\ \forall (x_{1},x_{2})\in\Omega_{1}\times \Omega_{2}. \end{array}

Since {\mathrm{supp}(\pi)\subset\partial^{c}\phi} by assumption, this gives

\displaystyle \begin{array}{rcl} \int_{\Omega_{1}\times \Omega_{2}}c(x_{1},x_{2})\,d\pi(x_{1},x_{2}) & =& \int_{\Omega_{1}\times \Omega_{2}}\phi(x_{1}) + \phi^{c}(x_{2})\,d\pi(x_{1},x_{2})\\ &=& \int_{\Omega_{1}}\phi(x_{1})\,d\mu_{1}(x_{1}) + \int_{\Omega_{2}}\phi^{c}(x_{2})\,d\mu_{2}(x_{2})\\ &=& \int_{\Omega_{1}\times \Omega_{2}}\phi(x_{1}) + \phi^{c}(x_{2})\,d\tilde\pi(x_{1},x_{2})\\ &\leq& \int_{\Omega_{1}\times \Omega_{2}}c(x_{1},x_{2})\,d\tilde\pi(x_{1},x_{2}) \end{array}

which shows the claim.

{\Box}

Corollary 6 If {\pi} is a measure on {\Omega_{1}\times \Omega_{2}} which is supported on a {c}-superdifferential of a {c}-concave function, then {\pi} is an optimal transport plan for its marginals with respect to the transport cost {c}.

 

This is a short follow up on my last post where I wrote about the sweet spot for the stepsize of the Douglas-Rachford iteration. For the case of a {\beta}-Lipschitz plus a {\mu}-strongly monotone operator, the iteration with stepsize {t} converges linearly with rate

\displaystyle r(t) = \tfrac{1}{2(1+t\mu)}\left(\sqrt{2t^{2}\mu^{2}+2t\mu + 1 +2(1 - \tfrac{1}{(1+t\beta)^{2}} - \tfrac1{1+t^{2}\beta^{2}})t\mu(1+t\mu)} + 1\right)

Here is an animated plot of this contraction factor depending on {\beta} and {\mu}, where {t} acts as the time variable:

[Animated plot: the contraction factor {r(t)} as a function of {\beta} and {\mu}, with {t} as the time variable]

What is interesting is that this factor can be increasing or decreasing in {t}, depending on the values of {\beta} and {\mu}.

For each pair {(\beta,\mu)} there is a best stepsize {t^{*}} and also a smallest contraction factor {r(t^{*})}. Here are plots of these quantities:
[Plots: the optimal stepsize {t^{*}} and the optimal contraction factor {r(t^{*})} as functions of {\beta} and {\mu}]

Comparing the plot of the optimal contraction factor to the animated plot above, you see that the right choice of the stepsize matters a lot.

I blogged about the Douglas-Rachford method before here and here. It’s a method to solve monotone inclusions in the form

\displaystyle 0 \in Ax + Bx

with monotone multivalued operators {A,B} from a Hilbert space into itself. Using the resolvent {J_{A} = (I+A)^{-1}} and the reflector {R_{A} = 2J_{A} - I}, the Douglas-Rachford iteration is concisely written as

\displaystyle u^{n+1} = \tfrac12(I + R_{B}R_{A})u^{n}.

The convergence of the method has been clarified in a number of papers, see, e.g.

Lions, Pierre-Louis, and Bertrand Mercier. “Splitting algorithms for the sum of two nonlinear operators.” SIAM Journal on Numerical Analysis 16.6 (1979): 964-979.

for the first treatment in the context of monotone operators and

Svaiter, Benar Fux. “On weak convergence of the Douglas–Rachford method.” SIAM Journal on Control and Optimization 49.1 (2011): 280-287.

for a recent very general convergence result.

Since {tA} is monotone if {A} is monotone and {t>0}, we can introduce a stepsize for the Douglas-Rachford iteration

\displaystyle u^{n+1} = \tfrac12(I + R_{tB}R_{tA})u^{n}.

It turns out that this stepsize matters a lot in practice; too small and too large stepsizes lead to slow convergence. It is a kind of folk wisdom that there is a “sweet spot” for the stepsize. In a recent preprint Quoc Tran-Dinh and I investigated this sweet spot in the simple case of linear operators {A} and {B} and this tweet has a visualization.

A few days ago Walaa Moursi and Lieven Vandenberghe published the preprint “Douglas-Rachford splitting for a Lipschitz continuous and a strongly monotone operator” and derived some linear convergence rates in the special case they mention in the title. One result (Theorem 4.3) goes as follows: If {A} is monotone and Lipschitz continuous with constant {\beta} and {B} is maximally monotone and {\mu}-strongly monotone, then the Douglas-Rachford iterates converge strongly to a solution with a linear rate

\displaystyle r = \tfrac{1}{2(1+\mu)}\left(\sqrt{2\mu^{2}+2\mu + 1 +2(1 - \tfrac{1}{(1+\beta)^{2}} - \tfrac1{1+\beta^{2}})\mu(1+\mu)} + 1\right).

This is a surprisingly complicated expression, but there is a nice thing about it: it allows one to optimize the stepsize! The rate depends on the stepsize as

\displaystyle r(t) = \tfrac{1}{2(1+t\mu)}\left(\sqrt{2t^{2}\mu^{2}+2t\mu + 1 +2(1 - \tfrac{1}{(1+t\beta)^{2}} - \tfrac1{1+t^{2}\beta^{2}})t\mu(1+t\mu)} + 1\right)

and the two plots of this function below show the sweet spot clearly.

 

If one knows the Lipschitz constant of {A} and the constant of strong monotonicity of {B}, one can minimize {r(t)} to get an optimal stepsize (in the sense that the guaranteed contraction factor is as small as possible). As Moursi and Vandenberghe explain in their Remark 5.4, this optimization involves finding the root of a polynomial of degree 5, so it is possible but cumbersome.
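If one does not insist on a closed form, the minimization of {r(t)} can also be done numerically; here is a small sketch with a bounded scalar solver (the concrete values of {\beta} and {\mu} are just placeholders):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def r(t, beta, mu):
    """Contraction factor r(t) from the rate above, as a function of the stepsize t."""
    inner = (2 * t**2 * mu**2 + 2 * t * mu + 1
             + 2 * (1 - 1 / (1 + t * beta)**2 - 1 / (1 + t**2 * beta**2))
               * t * mu * (1 + t * mu))
    return (np.sqrt(inner) + 1) / (2 * (1 + t * mu))

beta, mu = 2.0, 0.5   # placeholder values
opt = minimize_scalar(lambda t: r(t, beta, mu), bounds=(1e-6, 1e3), method="bounded")
print(opt.x, opt.fun)    # numerically found t* and r(t*)
print(r(1.0, beta, mu))  # compare with the unit stepsize
```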

Now I wonder if there is any hope to show that the adaptive stepsize Quoc and I proposed here (which basically amounts to {t_{n} = \|u^{n}\|/\|Au^{n}\|} in the case of single valued {A} – note that the role of {A} and {B} is swapped in our paper) is able to find the sweet spot (experimentally it does).

Consider the saddle point problem

\displaystyle   \min_{x}\max_{y}F(x) + \langle Kx,y\rangle - G(y). \ \ \ \ \ (1)

(where I omit all the standard assumptions, like convexity, continuity and such…). Fenchel-Rockafellar duality says that solutions are characterized by the inclusion

\displaystyle  0 \in\left( \begin{bmatrix} \partial F & 0\\ 0 & \partial G \end{bmatrix} + \begin{bmatrix} 0 & K^{T}\\ -K & 0 \end{bmatrix}\right) \begin{bmatrix} x^{*}\\y^{*} \end{bmatrix}

Noting that the operators

\displaystyle  A = \begin{bmatrix} \partial F & 0\\ 0 & \partial G \end{bmatrix},\quad B = \begin{bmatrix} 0 & K^{T}\\ -K & 0 \end{bmatrix}

are both monotone, we may apply any of the splitting methods available, for example the Douglas-Rachford method. In terms of resolvents

\displaystyle  R_{tA}(z) := (I+tA)^{-1}(z)

this method reads as

\displaystyle  \begin{array}{rcl}  z^{k+1} & = & R_{tB}(\bar z^{k})\\ \bar z^{k+1}& = & R_{tA}(2z^{k+1}-\bar z^{k}) + \bar z^{k}-z^{k+1}. \end{array}

For the saddle point problem, this iteration is (with {z = (x,y)})

\displaystyle  \begin{array}{rcl}  x^{k+1} &=& R_{t\partial F}(\bar x^{k})\\ y^{k+1} &=& R_{t\partial G}(\bar y^{k})\\ \begin{bmatrix} \bar x^{k+1}\\ \bar y^{k+1} \end{bmatrix} & = & \begin{bmatrix} I & tK^{T}\\ -tK & I \end{bmatrix}^{-1} \begin{bmatrix} 2x^{k+1}-\bar x^{k}\\ 2y^{k+1}-\bar y^{k} \end{bmatrix} + \begin{bmatrix} \bar x^{k}- x^{k+1}\\ \bar y^{k}-y^{k+1} \end{bmatrix}. \end{array}

The first two lines involve proximal steps and we assume that they are simple to implement. The last line, however, involves the solution of a large linear system. This can be broken down to a slightly smaller linear system involving the matrix {(I+t^{2}K^{T}K)} as follows: The linear system equals

\displaystyle  \begin{array}{rcl}  \bar x^{k+1} & = & x^{k+1} - tK^{T}(y^{k+1}+\bar y^{k+1}-\bar y^{k})\\ \bar y^{k+1} & = & y^{k+1} + tK(x^{k+1} + \bar x^{k+1}-\bar x^{k}). \end{array}

Plugging {\bar y^{k+1}} from the second equation into the first gives

\displaystyle  \bar x^{k+1} = x^{k+1} - tK^{T}(2y^{k+1}-\bar y^{k}) - t^{2}K^{T}K(x^{k+1}+\bar x^{k+1}-\bar x^{k})

Denoting {d^{k+1}= x^{k+1}+\bar x^{k+1}-\bar x^{k}} this can be written as

\displaystyle  (I+t^{2}K^{T}K)d^{k+1} = (2x^{k+1}-\bar x^{k}) - tK^{T}(2y^{k+1}-\bar y^{k}).

and the second equation is just

\displaystyle  \bar y^{k+1} = y^{k+1} + tKd^{k+1}.

This gives the overall iteration

\displaystyle  \begin{array}{rcl}  x^{k+1} &=& R_{t\partial F}(\bar x^{k})\\ y^{k+1} &=& R_{t\partial G}(\bar y^{k})\\ d^{k+1} &=& (I+t^{2}K^{T}K)^{-1}(2x^{k+1}-\bar x^{k} - tK^{T}(2y^{k+1}-\bar y^{k}))\\ \bar x^{k+1}&=& \bar x^{k}-x^{k+1}+d^{k+1}\\ \bar y^{k+1}&=& y^{k+1}+tKd^{k+1} \end{array}

This is nothing else than using the Schur complement, i.e. expressing the inverse as

\displaystyle  \begin{bmatrix} I & tK^{T}\\ -tK & I \end{bmatrix}^{-1} = \begin{bmatrix} 0 & 0\\ 0 & I \end{bmatrix} + \begin{bmatrix} I\\tK \end{bmatrix} (I + t^{2}K^{T}K)^{-1} \begin{bmatrix} I & -tK^{T} \end{bmatrix}

and has been applied to imaging problems by O’Connor and Vandenberghe in “Primal-Dual Decomposition by Operator Splitting and Applications to Image Deblurring” (doi). For many problems in imaging, the involved inversion may be fairly easy to perform (if {K} is the image gradient, for example, we only need to solve an equation with an operator like {(I - t^{2}\Delta)} and appropriate boundary conditions). However, there are problems where this inversion is itself a bottleneck.

I’d like to show the following trick to circumvent the matrix inversion, which I learned from Bredies and Sun’s “Accelerated Douglas-Rachford methods for the solution of convex-concave saddle-point problems”: Here is a slightly different saddle point problem

\displaystyle   \min_{x}\max_{y,x_{p}}F(x) + \langle Kx,y\rangle + \langle Hx,x_{p}\rangle- G(y) - I_{\{0\}}(x_{p}). \ \ \ \ \ (2)

We added a new dual variable {x_{p}}, which is forced to be zero by the additional indicator functional {I_{\{0\}}}. Hence, the additional bilinear term {\langle Hx,x_{p}\rangle} is also zero, and we see that {(x,y)} is a solution of (1) if and only if {(x,y,0)} is a solution of (2). In other words: The problem just looks differently, but is, in essence, the same as before.

Now let us write down the Douglas-Rachford iteration for (2). We write this problem as

\displaystyle  \min_{x}\max_{\tilde y} F(x) + \langle \tilde Kx,\tilde y\rangle -\tilde G(\tilde y)

with

\displaystyle  \tilde y = \begin{bmatrix} y\\x_{p} \end{bmatrix}, \quad \tilde K = \begin{bmatrix} K\\H \end{bmatrix}, \quad \tilde G(\tilde y) = \tilde G(y,x_{p}) = G(y) + I_{\{0\}}(x_{p}).

Writing down the Douglas-Rachford iteration gives

\displaystyle  \begin{array}{rcl}  x^{k+1} &=& R_{t\partial F}(\bar x^{k})\\ \tilde y^{k+1} &=& R_{t\partial \tilde G}(\bar{ \tilde y}^{k})\\ \begin{bmatrix} \bar x^{k+1}\\ \bar {\tilde y}^{k+1} \end{bmatrix} & = & \begin{bmatrix} I & t\tilde K^{T}\\ -t\tilde K & I \end{bmatrix}^{-1} \begin{bmatrix} 2x^{k+1}-\bar x^{k}\\ 2\tilde y^{k+1}-\bar {\tilde y}^{k} \end{bmatrix} + \begin{bmatrix} \bar x^{k}- x^{k+1}\\ \bar {\tilde y}^{k}-\tilde y^{k+1} \end{bmatrix}. \end{array}

Switching back to variables without a tilde, we get, using {R_{tI_{\{0\}}}(x) = 0},

\displaystyle  \begin{array}{rcl}  x^{k+1} &=& R_{t\partial F}(\bar x^{k})\\ y^{k+1} &=& R_{t\partial G}(\bar{y}^{k})\\ x_{p}^{k+1} &=& 0\\ \begin{bmatrix} \bar x^{k+1}\\ \bar {y}^{k+1}\\ \bar x_{p}^{k+1} \end{bmatrix} & = & \begin{bmatrix} I & tK^{T} & tH^{T}\\ -t K & I & 0\\ -t H & 0 & I \end{bmatrix}^{-1} \begin{bmatrix} 2x^{k+1}-\bar x^{k}\\ 2 y^{k+1}-\bar {y}^{k}\\ 2x_{p}^{k+1}-\bar x_{p}^{k} \end{bmatrix} + \begin{bmatrix} \bar x^{k}- x^{k+1}\\ \bar {y}^{k}-y^{k+1}\\ \bar x_{p}^{k}-x_{p}^{k+1} \end{bmatrix}. \end{array}

First note that {x_{p}^{k+1}=0} throughout the iteration, and from the last line of the linear system we get that

\displaystyle  \begin{array}{rcl}  -tH\bar x^{k+1} + \bar x_{p}^{k+1} = -\bar x_{p}^{k} -tH(\bar x^{k}-x^{k+1}) + \bar x_{p}^{k} \end{array}

which implies that

\displaystyle  \bar x_{p}^{k+1} = tH\bar x^{k+1}.

Thus, both variables {x_{p}^{k}} and {\bar x_{p}^{k}} disappear in the iteration. Now we rewrite the remaining first two lines of the linear system as

\displaystyle  \begin{array}{rcl}  \bar x^{k+1} + tK^{T}\bar y^{k+1} + t^{2}H^{T}H\bar x^{k+1} &=& x^{k+1} + tK^{T}(\bar y^{k}-y^{k+1}) + t^{2}H^{T}H\bar x^{k}\\ \bar y^{k+1}-tK\bar x^{k+1} &=& y^{k+1} + tK(x^{k+1}-\bar x^{k}). \end{array}

Again denoting {d^{k+1}=x^{k+1}+\bar x^{k+1}-\bar x^{k}}, solving the second equation for {\bar y^{k+1}} and plugging the result in the first gives

\displaystyle  (I+t^{2}H^{T}H)\bar x^{k+1} +tK^{T}(y^{k+1}+tKd^{k+1}) = x^{k+1}+tK^{T}(\bar y^{k}-y^{k+1}) + t^{2}H^{T}H\bar x^{k}.

To eliminate {\bar x^{k+1}} we add {(I+t^{2}H^{T}H)(x^{k+1}-\bar x^{k})} on both sides and get

\displaystyle  (I+t^{2}(H^{T}H+K^{T}K))d^{k+1} = 2x^{k+1}-\bar x^{k} -tK^{T}(2y^{k+1}-\bar y^{k}) + t^{2}H^{T}Hx^{k+1}.

In total we obtain the following iteration:

\displaystyle  \begin{array}{rcl}  x^{k+1} &=& R_{t\partial F}(\bar x^{k})\\ y^{k+1} &=& R_{t\partial G}(\bar y^{k})\\ d^{k+1} &=& (I+t^{2}(H^{T}H + K^{T}K))^{-1}(2x^{k+1}-\bar x^{k} - tK^{T}(2y^{k+1}-\bar y^{k}) + t^{2}H^{T}Hx^{k+1})\\ \bar x^{k+1}&=& \bar x^{k}-x^{k+1}+d^{k+1}\\ \bar y^{k+1}&=& y^{k+1}+tKd^{k+1} \end{array}

and note that only the third line changed.

Since the above works for any matrix {H}, we have a lot of freedom. Let us see that it is even possible to avoid any inversion whatsoever: we would like to choose {H} in such a way that {I+t^{2}(H^{T}H + K^{T}K) = \lambda I} for some positive {\lambda}. This is equivalent to

\displaystyle  H^{T}H = \tfrac{\lambda-1}{t^{2}}I - K^{T}K.

As soon as the right hand side is positive semidefinite, a Cholesky factorization shows that such an {H} exists, and this happens if {\lambda\geq 1+t^{2}\|K\|^{2}}. Further note that we do not need {H} itself in any way, but only {H^{T}H}, and we can perform the iteration without ever solving any linear system since the third row reads as

\displaystyle  d^{k+1} = \tfrac{1}{\lambda}\left(2x^{k+1}-\bar x^{k} - tK^{T}(2y^{k+1}-\bar y^{k}) + ((\lambda-1)I - t^{2}K^{T}K)x^{k+1}\right).
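Here is a minimal sketch of the resulting inversion-free iteration with the choice {\lambda = 1+t^{2}\|K\|^{2}}; prox_F and prox_G stand for the resolvents {R_{t\partial F}} and {R_{t\partial G}} and all names are placeholders for this sketch:

```python
import numpy as np

def dr_saddle_inversion_free(prox_F, prox_G, K, t, n_iter=500, x0=None, y0=None):
    """Sketch of the Douglas-Rachford iteration above with lambda = 1 + t^2*||K||^2,
    so that no linear system has to be solved; prox_F and prox_G are the resolvents
    R_{t dF} and R_{t dG}."""
    m, n = K.shape
    lam = 1.0 + t**2 * np.linalg.norm(K, 2)**2        # spectral norm of K
    x_bar = np.zeros(n) if x0 is None else x0.copy()
    y_bar = np.zeros(m) if y0 is None else y0.copy()
    for _ in range(n_iter):
        x = prox_F(x_bar)
        y = prox_G(y_bar)
        # third line of the iteration, with (I + t^2(H^T H + K^T K)) = lam*I
        rhs = (2*x - x_bar - t * K.T @ (2*y - y_bar)
               + (lam - 1) * x - t**2 * K.T @ (K @ x))
        d = rhs / lam
        x_bar = x_bar - x + d
        y_bar = y + t * K @ d
    return prox_F(x_bar), prox_G(y_bar)
```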

I blogged about the Douglas-Rachford method before and in this post I’d like to dig a bit into the history of the method.

As the name suggests, the method has its roots in a paper by Douglas and Rachford and the paper is

Douglas, Jim, Jr., and Henry H. Rachford Jr., “On the numerical solution of heat conduction problems in two and three space variables.” Transactions of the American mathematical Society 82.2 (1956): 421-439.

At first glance, the title does not suggest that the paper may be related to monotone inclusions and if you read the paper you’ll not find any monotone operator mentioned. So let’s start and look at Douglas and Rachford’s paper.

1. Solving the heat equation numerically

So let us see, what they were after and how this is related to what is known as Douglas-Rachford splitting method today.

Indeed, Douglas and Rachford wanted to solve the time-dependent heat equation

\displaystyle \begin{array}{rcl} \partial_{t}u &=& \partial_{xx}u + \partial_{yy}u \\ u(x,y,0) &=& f(x,y) \end{array}

with Dirichlet boundary conditions (they also considered three dimensions, but let us skip that here). They considered a rectangular grid and a very simple finite difference approximation of the second derivatives, i.e.

\displaystyle \begin{array}{rcl} \partial_{xx}u(x,y,t)&\approx& (u^{n}_{i+1,j}-2u^{n}_{i,j}+u^{n}_{i-1,j})/h^{2}\\ \partial_{yy}u(x,y,t)&\approx& (u^{n}_{i,j+1}-2u^{n}_{i,j}+u^{n}_{i,j-1})/h^{2} \end{array}

(with modifications at the boundary to accommodate the boundary conditions). To ease notation, we abbreviate the difference quotients as operators (actually, also matrices) that act at a fixed time step

\displaystyle \begin{array}{rcl} (Au^{n})_{i,j} &=& (u^{n}_{i+1,j}-2u^{n}_{i,j}+u^{n}_{i-1,j})/h^{2}\\ (Bu^{n})_{i,j} &=& (u^{n}_{i,j+1}-2u^{n}_{i,j}+u^{n}_{i,j-1})/h^{2}. \end{array}

With this notation, our problem is to solve

\displaystyle \begin{array}{rcl} \partial_{t}u &=& (A+B)u \end{array}

in time.

Then they give the following iteration:

\displaystyle Av^{n+1}+Bw^{n} = \frac{v^{n+1}-w^{n}}{\tau} \ \ \ \ \ (1)

 

\displaystyle Bw^{n+1} = Bw^{n} + \frac{w^{n+1}-v^{n+1}}{\tau} \ \ \ \ \ (2)

 

(plus boundary conditions which I’d like to swipe under the rug here). If we eliminate {v^{n+1}} from the first equation using the second we get

\displaystyle (A+B)w^{n+1} = \frac{w^{n+1}-w^{n}}{\tau} + \tau AB(w^{n+1}-w^{n}). \ \ \ \ \ (3)

 

This is a kind of implicit Euler method with an additional small term {\tau AB(w^{n+1}-w^{n})}. From a numerical point of view it has one advantage over the implicit Euler method: As equations (1) and (2) show, one does not need to invert {I-\tau(A+B)} in every iteration, but only {I-\tau A} and {I-\tau B}. Remember, this was in the 1950s, and solving large linear systems was a much bigger problem than it is today. In this specific case of the heat equation, the operators {A} and {B} are in fact tridiagonal, and hence, solving with {I-\tau A} and {I-\tau B} can be done by Gaussian elimination without any fill-in in linear time (see the Thomas algorithm). This is a huge time saver when compared to solving with {I-\tau(A+B)} which has a fairly large bandwidth (no matter how you reorder).

How do they prove convergence of the method? They don't, since they wanted to solve a parabolic PDE. They were after stability of the scheme, and this can be done by analyzing the eigenvalues of the iteration. Since the matrices {A} and {B} are well understood, they were able to write down the eigenfunctions of the operator associated to iteration (3) explicitly, and since the finite difference approximation is well understood, they were able to prove approximation properties. Note that the method can also be seen as a means to calculate the steady state of the heat equation.

We reformulate the iteration (3) further to see how {w^{n+1}} is actually derived from {w^{n}}: We obtain

\displaystyle (-I + \tau(A+B) - \tau^{2}AB)w^{n+1} = (-I-\tau^{2}AB)w^{n} \ \ \ \ \ (4)

 

2. What about monotone inclusions?

What does the previous section have to do with solving monotone inclusions? A monotone inclusion is

\displaystyle \begin{array}{rcl} 0\in Tx \end{array}

with a monotone operator, that is, a multivalued mapping {T} from a Hilbert space {X} to (subsets of) itself such that for all {x,y\in X} and {u\in Tx} and {v\in Ty} it holds that

\displaystyle \begin{array}{rcl} \langle u-v,x-y\rangle\geq 0. \end{array}

We are going to restrict ourselves to real Hilbert spaces here. Note that linear operators are monotone if they are positive semi-definite, and further note that monotone linear operators need not be symmetric. A general approach to the solution of monotone inclusions are so-called splitting methods. There one splits {T} additively, {T=A+B}, as a sum of two other monotone operators. Then one tries to use the so-called resolvents of {A} and {B}, namely

\displaystyle \begin{array}{rcl} R_{A} = (I+A)^{-1},\qquad R_{B} = (I+B)^{-1} \end{array}

to obtain a numerical method. By the way, the resolvent of a monotone operator always exists and is single valued (to be honest, one needs a regularity assumption here, namely one needs maximally monotone operators, but we will not deal with this issue here).

The two operators {A = \partial_{xx}} and {B = \partial_{yy}} from the previous section are not monotone, but {-A} and {-B} are, so the equation {-Au - Bu = 0} is a special case of a monotone inclusion. To work with monotone operators we rename

\displaystyle \begin{array}{rcl} A \leftarrow -A,\qquad B\leftarrow -B \end{array}

and write the iteration (4) in terms of monotone operators as

\displaystyle \begin{array}{rcl} (I + \tau(A+B) + \tau^{2}AB)w^{n+1} = (I+\tau^{2}AB)w^{n}, \end{array}

i.e.

\displaystyle \begin{array}{rcl} w^{n+1} = (I+\tau A+\tau B+\tau^{2}AB)^{-1}(I+\tau^{2}AB)w^{n}. \end{array}

Using {I+\tau A+\tau B + \tau^{2}AB = (I+\tau A)(I+\tau B)} and {(I+\tau^{2}AB) = (I-\tau B) + (I + \tau A)\tau B} we rewrite this in terms of resolvents as

\displaystyle \begin{array}{rcl} w^{n+1} & = &(I+\tau B)^{-1}[(I+\tau A)^{-1}(I-\tau B) + \tau B]w^{n}\\ & =& R_{\tau B}(R_{\tau A}(w^{n}-\tau Bw^{n}) + \tau Bw^{n}). \end{array}

This is not really applicable to a general monotone inclusion since there {A} and {B} may be multi-valued, i.e. the term {Bw^{n}} is not well defined (the iteration may be used as is for splittings where {B} is monotone and single valued, though).

But what to do when both {A} and {B} are multivalued? The trick is to introduce a new variable {w^{n} = R_{\tau B}(u^{n})}. Plugging this in throughout leads to

\displaystyle \begin{array}{rcl} R_{\tau B} u^{n+1} & = & R_{\tau B}(R_{\tau A}(R_{\tau B}u^{n}-\tau B R_{\tau B}u^{n}) + \tau B R_{\tau B}u^{n}). \end{array}

We cancel the outer {R_{\tau B}} and use {\tau B R_{\tau B}u^{n} = u^{n} - R_{\tau B}u^{n}} to get

\displaystyle \begin{array}{rcl} u^{n+1} & = & R_{\tau A}(2R_{\tau B}u^{n} - u^{n}) + u^{n} - R_{\tau B}u^{n} \end{array}

and here we go: This is exactly what is known as Douglas-Rachford method (see the last version of the iteration in my previous post). Note that it is not {u^{n}} that converges to a solution, but {w^{n} = R_{\tau B}u^{n}}, so it is convenient to write the iteration in the two variables

\displaystyle \begin{array}{rcl} w^{n} & = & R_{\tau B}u^{n}\\ u^{n+1} & = & R_{\tau A}(2w^{n} - u^{n}) + u^{n} - w^{n}. \end{array}
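For completeness, here is a minimal sketch of this two-variable iteration with the resolvents passed in as functions; the toy example at the end is just made up to show the mechanics (it recovers a soft-thresholding step):

```python
import numpy as np

def douglas_rachford(res_A, res_B, u0, n_iter=100):
    """The Douglas-Rachford iteration in the two-variable form above.
    res_A and res_B are the resolvents R_{tau A} and R_{tau B};
    the sequence w^n (not u^n) approaches a zero of A + B."""
    u = u0.copy()
    for _ in range(n_iter):
        w = res_B(u)
        u = res_A(2 * w - u) + u - w
    return res_B(u)

# toy example: A = subdifferential of |.|_1, B(x) = x - z (single valued), so that
# the zero of A + B is the soft thresholding of z -- purely for illustration
tau, z = 1.0, np.array([2.0, 0.3, -1.5])
res_A = lambda v: np.sign(v) * np.maximum(np.abs(v) - tau, 0)   # R_{tau d|.|_1}
res_B = lambda v: (v + tau * z) / (1 + tau)                     # R_{tau B}
print(douglas_rachford(res_A, res_B, np.zeros(3)))              # approx soft(z, 1)
```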

The observation that this splitting method, which Douglas and Rachford devised for linear problems, has a much wider applicability is due to Lions and Mercier, and the paper is

Lions, Pierre-Louis, and Bertrand Mercier. “Splitting algorithms for the sum of two nonlinear operators.” SIAM Journal on Numerical Analysis 16.6 (1979): 964-979.

Other, much older, splitting methods for linear systems, such as the Jacobi method or the Gauss-Seidel method, use different properties of the matrices, such as the diagonal of the matrix or the upper and lower triangular parts, and as such do not generalize easily to the case of operators on a Hilbert space.

Consider a convex optimization problem of the form

\displaystyle \begin{array}{rcl} \min_{x}F(x) + G(Ax) \end{array}

with convex {F} and {G} and matrix {A}. (We formulate everything quite loosely, skipping over details like continuity and such, as they are irrelevant for the subject matter). Optimization problems of this type have a specific type of dual problem, namely the Fenchel-Rockafellar dual, which is

\displaystyle \begin{array}{rcl} \max_{y}-F^{*}(-A^{T}y) - G^{*}(y) \end{array}

and under certain regularity conditions it holds that the optimal value of the dual equals the optimal value of the primal and, moreover, that a pair {(x^{*},y^{*})} is both primal and dual optimal if and only if the primal dual gap is zero, i.e. if and only if

\displaystyle \begin{array}{rcl} F(x^{*})+G(Ax^{*}) + F^{*}(-A^{T}y^{*})+G^{*}(y^{*}) = 0. \end{array}

Hence, it is quite handy to use the primal dual gap as a stopping criterion for iterative methods to solve these problems. So, if one runs an algorithm which produces primal iterates {x^{k}} and dual iterates {y^{k}} one can monitor

\displaystyle \begin{array}{rcl} \mathcal{G}(x^{k},y^{k}) = F(x^{k})+G(Ax^{k}) + F^{*}(-A^{T}y^{k})+G^{*}(y^{k}). \end{array}

and stop if the value falls below a desired tolerance.

There is some problem with this approach which appears if the method produces infeasible iterates in the sense that one of the four terms in {\mathcal{G}} is actually {+\infty}. This may be the case if {F} or {G} are not everywhere finite or, loosely speaking, have linear growth in some directions (since then the respective conjugate will not be finite everywhere). In the rest of the post, I’ll sketch a general method that can often solve this particular problem.

For the sake of simplicity, consider the following primal dual algorithm

\displaystyle \begin{array}{rcl} x^{k+1} & = &\mathrm{prox}_{\tau F}(x^{k}-\tau A^{T}y^{k})\\ y^{k+1} & = &\mathrm{prox}_{\sigma G^{*}}(y^{k}+\sigma A(2x^{k+1}-x^{k})) \end{array}

(also known as the primal dual hybrid gradient method or Chambolle-Pock’s algorithm). It converges as soon as {\sigma\tau\leq \|A\|^{-2}}.
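As a minimal sketch, with the proximal maps passed in as functions (the names are placeholders), the iteration looks like this:

```python
import numpy as np

def pdhg(prox_tau_F, prox_sigma_Gstar, A, tau, sigma, n_iter=1000, x0=None, y0=None):
    """Sketch of the primal dual hybrid gradient iteration above;
    prox_tau_F and prox_sigma_Gstar are the proximal maps of tau*F and sigma*G*."""
    m, n = A.shape
    x = np.zeros(n) if x0 is None else x0.copy()
    y = np.zeros(m) if y0 is None else y0.copy()
    for _ in range(n_iter):
        x_new = prox_tau_F(x - tau * A.T @ y)
        y = prox_sigma_Gstar(y + sigma * A @ (2 * x_new - x))
        x = x_new
    return x, y
```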

While the structure of the algorithm ensures that {F(x^{k})} and {G^{*}(y^{k})} are always finite (since always {\mathrm{prox}_{F}(x)\in\mathrm{dom}(F)}), it may be that {F^{*}(-A^{T}y^{k})} or {G(Ax^{k})} are indeed infinite, rendering the primal dual gap useless.

Let us assume that the problematic term is {F^{*}(-A^{T}y^{k})}. Here is a way out in the case where one can deduce some a-priori bounds on {x^{*}}, i.e. a bounded and convex set {C} with {x^{*}\in C}. In fact, this is often the case (e.g. one may know a-priori that there exist lower bounds {l_{i}} and upper bounds {u_{i}}, i.e. it holds that {l_{i}\leq x^{*}_{i}\leq u_{i}}). Then, adding these constraints to the problem will not change the solution.

Let us see how this changes the primal dual gap: We set {\tilde F(x) = F(x) + I_{C}(x)} where {C} is the set which models the bound constraints. Since {C} is a bounded convex set and {F} is finite on {C}, it is clear that

\displaystyle \begin{array}{rcl} \tilde F^{*}(\xi) = \sup_{x\in C}\,\langle \xi,x\rangle - F(x) \end{array}

is finite for every {\xi}. This leads to a finite duality gap. However, one should also adapt the prox operator. But this is also simple in the case where the constraint {C} and the function {F} are separable, i.e. {C} encodes bound constraints as above (in other words {C = [l_{1},u_{1}]\times\cdots\times [l_{n},u_{n}]}) and

\displaystyle \begin{array}{rcl} F(x) = \sum_{i} f_{i}(x_{i}). \end{array}

Here it holds that

\displaystyle \begin{array}{rcl} \mathrm{prox}_{\sigma \tilde F}(x)_{i} = \mathrm{prox}_{\sigma f_{i} + I_{[l_{i},u_{i}]}}(x_{i}) \end{array}

and it is simple to see that

\displaystyle \begin{array}{rcl} \mathrm{prox}_{\sigma f_{i} + I_{[l_{i},u_{i}]}}(x_{i}) = \mathrm{proj}_{[l_{i},u_{i}]}\mathrm{prox}_{\sigma f_{i}}(x_{i}), \end{array}

i.e., one only uses the proximal operator of {F} and then projects onto the constraints. For general {C}, this step may be more complicated.
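A minimal sketch of this separable case, taking {f_{i}(x) = |x-x^{0}_{i}|} (which is the data term of the {L^{1}-TV} example below; the concrete choice is just for illustration):

```python
import numpy as np

def prox_l1_box(x, x0, sigma, l, u):
    """Prox of sigma*|.-x0| plus the indicator of [l, u], applied componentwise:
    first soft thresholding around x0, then projection onto the box."""
    p = x0 + np.sign(x - x0) * np.maximum(np.abs(x - x0) - sigma, 0)
    return np.clip(p, l, u)

# small usage example with made-up numbers
print(prox_l1_box(np.array([1.4, -0.2]), np.array([1.0, 0.0]), 0.25, 0.0, 1.0))
# -> [1.0, 0.0]
```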

One example where this makes sense is {L^{1}-TV} denoising, which can be written as

\displaystyle \begin{array}{rcl} \min_{u}\|u-u^{0}\|_{1} + \lambda TV(u). \end{array}

Here we have

\displaystyle \begin{array}{rcl} F(u) = \|u-u^{0}\|_{1},\quad A = \nabla,\quad G(\phi) = \lambda\sum_{i,j}|\phi_{i,j}|. \end{array}

The guy that causes problems here is {F^{*}} which is an indicator functional and indeed {A^{T}\phi^{k}} will usually be dual infeasible. But since {u} is an image with a known range of gray values one can simply add the constraints {0\leq u\leq 1} to the problem and obtain a finite dual while still keeping a simple proximal operator. It is quite instructive to compute {\tilde F} in this case.

Here is a small signal boost for the

Workshop on the Interface of Statistics and Optimization

to  be held at Duke University, Durham, North Carolina from Feb 8 to Feb 10 2017. The workshop is part of the long-year program on optimization currently taking place at the Statistical and Applied Mathematical Sciences Institute (SAMSI).

There will be a lineup of invited speakers from the forefront of Statistics and Optimization each of which has made influential contributions to the other field as well. The planning is still ongoing, and hence, the list of speakers will grow some.

If you can’t make it to North Carolina next February, still mark the date since the talks will be broadcasted via the net and (if tech works out) you may even participate in the Q&A sessions after the talks via your computer.

Last week Christoph Brauer, Andreas Tillmann and myself uploaded the paper A Primal-Dual Homotopy Algorithm for {\ell_{1}}-Minimization with {\ell_{\infty}}-Constraints to the arXiv (and we missed being the first ever arXiv-paper with a non-trivial five-digit identifier by twenty-something papers…). This paper specifically deals with the optimization problem

\displaystyle \begin{array}{rcl} \min_{x}\|x\|_{1}\quad\text{s.t.}\quad \|Ax-b\|_{\infty}\leq \delta \end{array}

where {A} and {b} are a real matrix and vector, respectively, of compatible size. While the related problem with {\ell_{2}} constraint has been addressed quite often (and the penalized problem {\min_{x}\|x\|_{1} + \tfrac1{2\lambda}\|Ax-b\|_{2}^{2}} is even more popular) there is not much code around to solve this specific problem. Obvious candidates are

  • Linear optimization: The problem can be recast as a linear program: The constraint is basically linear already (rewriting it with help of the ones vector {\mathbf{1}} as {Ax\leq \delta\mathbf{1}+b}, {-Ax\leq \delta\mathbf{1}-b}) and for the objective one can, for example, perform a variable split {x = x^{+}-x^{-}}, {x^{-},x^{+}\geq 0} and then write {\|x\|_{1} = \mathbf{1}^{T}x^{+}+ \mathbf{1}^{T}x^{-}}.
  • Splitting methods: The problem is convex problem of the form {\min_{x}F(x) + G(Ax)} with

    \displaystyle \begin{array}{rcl} F(x) & = & \|x\|_{1}\\ G(y) & = & \begin{cases} 0 & \|y-b\|_{\infty}\leq\delta\\ \infty & \text{else.} \end{cases} \end{array}

    and hence, several methods for these kind of problem are available, such as the alternating direction method of multipliers or the Chambolle-Pock algorithm.

The formulation as a linear program has the advantage that one can choose among a lot of highly sophisticated software tools, and the splitting methods have the advantage that they are very easy to code, usually in just a few lines. A drawback of both is that they do not exploit the specific structure of the problem.
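As a sketch of the first candidate, here is the linear programming reformulation spelled out with scipy's linprog; this is a plain, untuned formulation just to make the variable split concrete (it is not the method from our paper):

```python
import numpy as np
from scipy.optimize import linprog

def l1_linf_lp(A, b, delta):
    """Solve min ||x||_1 s.t. ||Ax - b||_inf <= delta via the LP reformulation
    with the split x = x_plus - x_minus."""
    m, n = A.shape
    c = np.ones(2 * n)                         # objective 1^T x_plus + 1^T x_minus
    A_ub = np.vstack([np.hstack([A, -A]),      # A x <= delta*1 + b
                      np.hstack([-A, A])])     # -A x <= delta*1 - b
    b_ub = np.concatenate([delta + b, delta - b])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None))
    x_split = res.x
    return x_split[:n] - x_split[n:]

# made-up test data with a sparse ground truth
rng = np.random.default_rng(0)
A = rng.standard_normal((10, 30))
x_true = np.zeros(30); x_true[[3, 17]] = [1.5, -2.0]
b = A @ x_true
print(np.round(l1_linf_lp(A, b, delta=0.1), 2))
```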

Applications of the problem with {\ell_{\infty}} constraint are, for example:

  • The Dantzig selector, a statistical estimation technique, where the problem is

    \displaystyle \begin{array}{rcl} \min_{x}\|x\|_{1}\quad\text{s.t.}\quad \|A^{T}(Ax-b)\|_{\infty}\leq\delta. \end{array}

  • Sparse dequantization as elaborated here by Jacques, Hammond and Fadili and applied here to de-quantization of speech signals by Christoph, Timo Gerkmann and myself.

We wanted to see if one of the most efficient methods known for sparse reconstruction with {\ell_{2}} penalty, namely the homotopy method, can be adapted to this case. The homotopy method for {\min_{x}\|x\|_{1} + \tfrac1{2\lambda}\|Ax-b\|_{2}^{2}} builds on the observation that the solution for {\lambda\geq \|A^{T}b\|_{\infty}} is zero and that the set of solutions {x_{\lambda}}, parameterized by the parameter {\lambda}, is piecewise linear. Hence, one can start from {\lambda_{0}= \|A^{T}b\|_{\infty}}, calculate which direction to go, how far away the next breakpoint is, go there and start over. I’ve blogged on the homotopy method here already and there you’ll find some links to great software packages, but also the fact that there can be up to exponentially many breakpoints. However, in practice the homotopy method is usually blazingly fast and with some care, can be made numerically stable and accurate, see, e.g. our extensive study here (and here is the optimization online preprint).

The problem with {\ell_{\infty}} constraint seems similar, especially it is clear that for {\delta = \|b\|_{\infty}}, {x=0} is a solution. It is also not so difficult to see that there is a piecewise linear path of solutions {x_{\delta}}. What is not so clear is how it can be computed. It turned out that in this case the whole truth can be seen when the problem is viewed from a primal-dual viewpoint. The associated dual problem is

\displaystyle \begin{array}{rcl} \max_{y}\ -b^{T}y - \delta\|y\|_{1}\quad\text{s.t.}\quad\|A^{T}y\|_{\infty}\leq 1 \end{array}

and a pair {(x^{*},y^{*})} is primal and dual optimal if and only if

\displaystyle \begin{array}{rcl} -A^{T}y^{*}&\in&\text{Sign}(x^{*})\\ Ax^{*}-b & \in & \delta\text{Sign}(y^{*}) \end{array}

(where {\text{Sign}} denotes the sign function, multivalued at zero, giving {[-1,1]} there). One can note some things from the primal-dual optimality system:

  • You may find a direction {d} such that {(x^{*}+td,y^{*})} stays primal-dual optimal for the constraint {\leq\delta-t} for small {t},
  • for a fixed primal optimal {x_{\delta}} there may be several dual optimal {y_{\delta}}.

On the other hand, it is not that clear which of the probably many dual optimal {y_{\delta}} allows to find a new direction {d} such that {x_{\delta}+td} will stay primal optimal when reducing {\delta}. In fact, it turned out that, at a breakpoint, a new dual variable needs to be found to allow for the next jump in the primal variable. So, the solution path is piecewise linear in the primal variable, but piecewise constant in the dual variable (a situation similar to the adaptive inverse scale space method).

What we found is that some adapted theorem of the alternative (a.k.a. Farkas’ Lemma) allows us to calculate the next dual optimal {y} such that a jump in {x} will be possible.

What is more, the calculation of a new primal or dual optimal point amounts to solving a linear program (in contrast to a linear system for the {\ell_{2}} homotopy). Hence, the trick of up- and downdating a suitable factorization of a suitable matrix to speed up computation does not work. However, one can somehow leverage the special structure of the problem and use a tailored active set method to progress through the path. Our numerical tests indicated that the resulting method, which we termed {\ell_{1}}-Houdini, is able to solve moderately large problems faster than a commercial LP-solver (while not only solving the given problem, but calculating the whole solution path on the fly), as can be seen from this table from the paper:

[Table: runtimes of {\ell_{1}}-Houdini compared with a commercial LP solver, taken from the paper]

The code of \ell_1-Houdini is on Christoph’s homepage; you may also reproduce the data in the above table with your own hardware.

Yesterday I uploaded the paper “Linear convergence of the Randomized Sparse Kaczmarz Method” by Frank Schöpfer and myself to the arXiv.

Recall that the Kaczmarz method for linear systems

\displaystyle \begin{array}{rcl} Ax&=&b \end{array}

iterates

\displaystyle \begin{array}{rcl} x^{k+1} &=& x^{k} - \frac{\langle a_{i},x^{k}\rangle-b_{i}}{\|a_{i}\|^{2}}a_{i} \end{array}

where {a_{i}} is the {i}-th row of {A}, {b_{i}} is the {i}-th entry of {b} and the index {i} is chosen according to some rule. We could, e.g. choose the rows in a cyclic manner, i.e. starting with the first row, proceeding to the next row and once we came to the last row we would start over from the first again. It is known (and probably proved by Kaczmarz himself) that the method converges to a solution of {Ax=b} whenever the system has a solution. Moreover, it is easy to see that we converge to the minimum norm solution in case of underdetermined systems when the method is initialized with zero. This is due to the fact that the whole iteration takes place in the range space of {A^{T}}.

In this and this paper we proposed a simple modification of the Kaczmarz method that makes it converge to sparse solutions. The modification is simply

\displaystyle \begin{array}{rcl} z^{k+1} & = & z^{k} - \frac{\langle a_{i},x^{k}\rangle-b_{i}}{\|a_{i}\|^{2}}a_{i}\\ x^{k+1}& = & S_{\lambda}(z^{k+1}) \end{array}

where {S_{\lambda}(x) = \max(|x|-\lambda,0)\text{sign}(x)} is the soft thresholding function. In this paper we proved that this variant converges, when initialized with zero and for a consistent system, to the solution of

\displaystyle \begin{array}{rcl} \min_{x}\lambda\|x\|_{1} + \tfrac12\|x\|_{2}^{2},\quad\text{s.t.}\quad Ax=b. \end{array}

For not too small values of {\lambda} this is indeed a sparse solution of {Ax=b} and Frank also proved that there is a threshold such that for {\lambda} larger than this threshold the solution is also the minimum {\ell^{1}} solution.
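Here is a minimal sketch of the randomized sparse Kaczmarz iteration; the rows are sampled with probability proportional to their squared norms, which is one common choice of randomization (not necessarily the one used in the paper), and the test data is made up:

```python
import numpy as np

def randomized_sparse_kaczmarz(A, b, lam, n_iter=20000, seed=0):
    """Sketch of the randomized sparse Kaczmarz method: a Kaczmarz step on z,
    followed by soft thresholding to get x."""
    m, n = A.shape
    rng = np.random.default_rng(seed)
    probs = (A**2).sum(axis=1) / (A**2).sum()      # rows sampled prop. to ||a_i||^2
    z, x = np.zeros(n), np.zeros(n)
    for _ in range(n_iter):
        i = rng.choice(m, p=probs)
        a = A[i]
        z = z - (a @ x - b[i]) / (a @ a) * a
        x = np.sign(z) * np.maximum(np.abs(z) - lam, 0)   # soft thresholding S_lambda
    return x

# consistent underdetermined system with a sparse solution
rng = np.random.default_rng(1)
A = rng.standard_normal((50, 200))
x_true = np.zeros(200)
x_true[rng.choice(200, 5, replace=False)] = rng.standard_normal(5)
b = A @ x_true
x_hat = randomized_sparse_kaczmarz(A, b, lam=1.0)
print(np.linalg.norm(A @ x_hat - b))   # residual, should become small
```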

In general, convergence rates for the Kaczmarz method (and its sparse variant) are hard to prove. To convince oneself of this fact note that the convergence speed can drastically change if the rows of the system are reordered. The situation changes if one uses a randomization of the sparse Kaczmarz method where, in each iteration, a row is chosen at random. Strohmer and Vershynin proved that this randomization leads to a linear convergence rate. In the above mentioned paper we were able to prove the same result for the randomized sparse Kaczmarz method. While this sounds like an obvious generalization, the methods we use are totally different. While the linear convergence of the randomized Kaczmarz method can be proven in a few lines (well, probably one page) using only very basic tools, we need, among other things, quite intricate error bounds for Bregman projections w.r.t. piecewise linear-quadratic convex functions.

In fact, the linear convergence can be observed in practice and we illustrate the usefulness of the randomization and also the “sparsification” on some examples in the paper. For example the following figure shows the decay of the residual for the randomized Kaczmarz method (black), the randomized sparse Kaczmarz method (red) and the randomized sparse Kaczmarz method with exact steps (green), which I did not explain here.

[Plot: decay of the residual for the randomized Kaczmarz method (black), the randomized sparse Kaczmarz method (red), and the randomized sparse Kaczmarz method with exact steps (green)]

More details are in the paper…
