June 2018


This is a short follow up on my last post where I wrote about the sweet spot of the stepsize of the Douglas-Rachford iteration. For the case \beta-Lipschitz + \mu-strongly monotone, the iteration with stepsize t converges linear with rate

\displaystyle r(t) = \tfrac{1}{2(1+t\mu)}\left(\sqrt{2t^{2}\mu^{2}+2t\mu + 1 +2(1 - \tfrac{1}{(1+t\beta)^{2}} - \tfrac1{1+t^{2}\beta^{2}})t\mu(1+t\mu)} + 1\right)

Here is animated plot of this contraction factor depending on \beta and \mu and t acts as time variable:

DR_contraction

What is interesting is, that this factor has increasing or decreasing in t depending on the values of \beta and \mu.

For each pair (\beta,\mu) there is a best t^* and also a smallest contraction factor r(t^*). Here are plots of these quantities:
DR_opt_stepsize

Comparing the plot of te optimal contraction factor to the animated plot above, you see that the right choice of the stepsize matters a lot.

Advertisements

I blogged about the Douglas-Rachford method before here and here. It’s a method to solve monotone inclusions in the form

\displaystyle 0 \in Ax + Bx

with monotone multivalued operators {A,B} from a Hilbert space into itself. Using the resolvent {J_{A} = (I+A)^{-1}} and the reflector {R_{A} = 2J_{A} - I}, the Douglas-Rachford iteration is concisely written as

\displaystyle u^{n+1} = \tfrac12(I + R_{B}R_{A})u_{n}.

The convergence of the method has been clarified is a number of papers, see, e.g.

Lions, Pierre-Louis, and Bertrand Mercier. “Splitting algorithms for the sum of two nonlinear operators.” SIAM Journal on Numerical Analysis 16.6 (1979): 964-979.

for the first treatment in the context of monotone operators and

Svaiter, Benar Fux. “On weak convergence of the Douglas–Rachford method.” SIAM Journal on Control and Optimization 49.1 (2011): 280-287.

for a recent very general convergence result.

Since {tA} is monotone if {A} is monotone and {t>0}, we can introduce a stepsize for the Douglas-Rachford iteration

\displaystyle u^{n+1} = \tfrac12(I + R_{tB}R_{tA})u^{n}.

It turns out, that this stepsize matters a lot in practice; too small and too large stepsizes lead to slow convergence. It is a kind of folk wisdom, that there is “sweet spot” for the stepsize. In a recent preprint Quoc Tran-Dinh and I investigated this sweet spot in the simple case of linear operators {A} and {B} and this tweet has a visualization.

A few days ago Walaa Moursi and Lieven Vandenberghe published the preprint “Douglas-Rachford splitting for a Lipschitz continuous and a strongly monotone operator” and derived some linear convergence rates in the special case they mention in the title. One result (Theorem 4.3) goes as follows: If {A} is monotone and Lipschitz continuous with constant {\beta} and {B} is maximally monotone and {\mu}-strongly monotone, than the Douglas-Rachford iterates converge strongly to a solution with a linear rate

\displaystyle r = \tfrac{1}{2(1+\mu)}\left(\sqrt{2\mu^{2}+2\mu + 1 +2(1 - \tfrac{1}{(1+\beta)^{2}} - \tfrac1{1+\beta^{2}})\mu(1+\mu)} + 1\right).

This is a surprisingly complicated expression, but there is a nice thing about it: It allows to optimize for the stepsize! The rate depends on the stepsize as

\displaystyle r(t) = \tfrac{1}{2(1+t\mu)}\left(\sqrt{2t^{2}\mu^{2}+2t\mu + 1 +2(1 - \tfrac{1}{(1+t\beta)^{2}} - \tfrac1{1+t^{2}\beta^{2}})t\mu(1+t\mu)} + 1\right)

and the two plots of this function below show the sweet spot clearly.

 

If one knows the Lipschitz constant of {A} and the constant of strong monotonicity of {B}, one can minimize {r(t)} to get on optimal stepsize (in the sense that the guaranteed contraction factor is as small as possible). As Moursi and Vandenberghe explain in their Remark 5.4, this optimization involves finding the root of a polynomial of degree 5, so it is possible but cumbersome.

Now I wonder if there is any hope to show that the adaptive stepsize Quoc and I proposed here (which basically amounts to {t_{n} = \|u^{n}\|/\|Au^{n}\|} in the case of single valued {A} – note that the role of {A} and {B} is swapped in our paper) is able to find the sweet spot (experimentally it does).
<p

The fundamental theorem of calculus relates the derivative of a function to the function itself via an integral. A little bit more precise, it says that one can recover a function from its derivative up to an additive constant (on a simply connected domain).

In one space dimension, one can fix some {x_{0}} in the domain (which has to be an interval) and then set

\displaystyle F(x) = \int_{x_{0}}^{x}f'(t) dt.

Then {F(x) = f(x) + c} with {c= - f(x_{0})}.

Actually, a similar claim is true in higher space dimensions: If {f} is defined on a simply connected domain in {{\mathbb R}^{d}} we can recover {f} from its gradient up to an additive constant as follows: Select some {x_{0}} and set

\displaystyle F(x) = \int_{\gamma}\nabla f(x)\cdot dx \ \ \ \ \ (1)

for any path {\gamma} from {x_{0}} to {x}. Then it holds under suitable conditions that

\displaystyle F(x) = f(x) - f(x_{0}).

And now for something completely different: Convex functions and subgradients.

A function {f} on {{\mathbb R}^{n}} is convex if for every {x} there exists a subgradient {x^{*}\in{\mathbb R}^{n}} such that for all {y} one has the subgradient inequality

\displaystyle f(y) \geq f(x) + \langle x^{*}, y-x\rangle.

Writing this down for {x} and {y} with interchanged roles (and {y^{*}} as corresponding subgradient to {y}), we see that

\displaystyle \langle x-y,x^{*}-y^{*}\rangle \geq 0.

In other words: For a convex function {f} it holds that the subgradient {\partial f} is a (multivalued) monotone mapping. Recall that a multivalued map {A} is monotone, if for every {y \in A(x)} and {y^{*}\in A(y)} it holds that {\langle x-y,x^{*}-y^{*}\rangle \geq 0}. It is not hard to see that not every monotone map is actually a subgradient of a convex function (not even, if we go to “maximally monotone maps”, a notion that we sweep under the rug in this post). A simple counterexample is a (singlevalued) linear monotone map represented by

\displaystyle \begin{bmatrix} 0 & 1\\ -1 & 0 \end{bmatrix}

(which can not be a subgradient of some {f}, since it is not symmetric).

Another hint that monotonicity of a map does not imply that the map is a subgradient is that subgradients have some stronger properties than monotone maps. Let us write down the subgradient inequalities for a number of points {x_{0},\dots x_{n}}:

\displaystyle \begin{array}{rcl} f(x_{1}) & \geq & f(x_{0}) + \langle x_{0}^{*},x_{1}-x_{0}\rangle\\ f(x_{2}) & \geq & f(x_{1}) + \langle x_{1}^{*},x_{2}-x_{1}\rangle\\ \vdots & & \vdots\\ f(x_{n}) & \geq & f(x_{n-1}) + \langle x_{n-1}^{*},x_{n}-x_{n-1}\rangle\\ f(x_{0}) & \geq & f(x_{n}) + \langle x_{n}^{*},x_{0}-x_{n}\rangle. \end{array}

If we sum all these up, we obtain

\displaystyle 0 \geq \langle x_{0}^{*},x_{1}-x_{0}\rangle + \langle x_{1}^{*},x_{2}-x_{1}\rangle + \cdots + \langle x_{n-1}^{*},x_{n}-x_{n-1}\rangle + \langle x_{n}^{*},x_{0}-x_{n}\rangle.

This property is called {n}-cyclical monotonicity. In these terms we can say that a subgradient of a convex function is cyclical monotone, which means that it is {n}-cyclically monotone for every integer {n}.

By a remarkable result by Rockafellar, the converse is also true:

Theorem 1 (Rockafellar, 1966) Let {A} by a cyclically monotone map. Then there exists a convex function {f} such that {A = \partial f}.

Even more remarkable, the proof is somehow “just an application of the fundamental theorem of calculus” where we recover a function by its subgradient (up to an additive constant that depends on the basepoint).

Proof: we aim to “reconstruct” {f} from {A = \partial f}. The trick is to choose some base point {x_{0}} and corresponding {x_{0}^{*}\in A(x_{0})} and set

\displaystyle \begin{array}{rcl} f(x) &=& \sup\left\{ \langle x_{0}^{*}, x_{1}-x_{0}\rangle + \langle x_{1}^{*},x_{2}-x_{1}\rangle + \cdots + \langle x_{n-1}^{*},x_{n}-x_{n-1}\rangle\right.\\&& \qquad\left. + \langle x_{n}^{*},x-x_{n}\rangle\right\}\qquad\qquad (0)\end{array}

where the supremum is taken over all integers {m} and all pairs {x_{i}^{*}\in A(x_{i})} {i=1,\dots, m}. As a supremum of affine functions {f} is clearly convex (even lower semicontinuous) and {f(x_{0}) = 0} since {A} is cyclically monotone (this shows that {f} is proper, i.e. not equal to {\infty} everywhere). Finally, for {\bar x^{*}\in A(\bar x)} we have

\displaystyle \begin{array}{rcl} f(x) &\geq& \sup\left\{ \langle x_{0}^{*}, x_{1}-x_{0}\rangle + \langle x_{1}^{*},x_{2}-x_{1}\rangle + \cdots + \langle x_{n-1}^{*},x_{n}-x_{n-1}\rangle\right.\\ & & \qquad \left.+ \langle x_{n}^{*},\bar x-x_{n}\rangle + \langle \bar x^{*},\bar x-\bar x\rangle\right\} \end{array}

with the supremum taken over all integers {m} and all pairs {x_{i}^{*}\in A(x_{i})} {i=1,\dots, m}. The right hand side is equal to {f(x) + \langle \bar x^{*},x-\bar x\rangle} and this shows that {f} is indeed convex. \Box

Where did we use the fundamental theorem of calculus? Let us have another look at equation~(0). Just for simplicity, let us denote {x_{i}^{*} =\nabla f(x_{i})}. Now consider a path {\gamma} from {x_{0}} to {x} and points {0=t_{0}< t_{1}<\cdots < t_{n}< t_{n+1} = 1} with {\gamma(t_{i}) = x_{i}}. Then the term inside the supremum of~(0) equals

\displaystyle \langle \nabla f(\gamma(t_{0})),\gamma(t_{1})-\gamma(t_{0})\rangle + \dots + \langle \nabla f(\gamma(t_{n})),\gamma(t_{n+1})-\gamma(t_{n})\rangle.

This is Riemannian sum for an integral of the form {\int_{\gamma}\nabla f(x)\cdot dx}. By monotonicity of {f}, we increase this sum, if we add another point {\bar t} (e.g. {t_{i}<\bar t<t_{i+1}}, and hence, the supremum does converge to the integral, i.e.~(0) is equal to

\displaystyle f(x) = \int_{\gamma}\nabla f(x)\cdot dx

where {\gamma} is any path from {x_{0}} to {x}.