This is a short follow up on my last post where I wrote about the sweet spot of the stepsize of the Douglas-Rachford iteration. For the case \beta-Lipschitz + \mu-strongly monotone, the iteration with stepsize t converges linear with rate

\displaystyle r(t) = \tfrac{1}{2(1+t\mu)}\left(\sqrt{2t^{2}\mu^{2}+2t\mu + 1 +2(1 - \tfrac{1}{(1+t\beta)^{2}} - \tfrac1{1+t^{2}\beta^{2}})t\mu(1+t\mu)} + 1\right)

Here is animated plot of this contraction factor depending on \beta and \mu and t acts as time variable:

DR_contraction

What is interesting is, that this factor has increasing or decreasing in t depending on the values of \beta and \mu.

For each pair (\beta,\mu) there is a best t^* and also a smallest contraction factor r(t^*). Here are plots of these quantities:
DR_opt_stepsize

Comparing the plot of te optimal contraction factor to the animated plot above, you see that the right choice of the stepsize matters a lot.

Advertisements

I blogged about the Douglas-Rachford method before here and here. It’s a method to solve monotone inclusions in the form

\displaystyle 0 \in Ax + Bx

with monotone multivalued operators {A,B} from a Hilbert space into itself. Using the resolvent {J_{A} = (I+A)^{-1}} and the reflector {R_{A} = 2J_{A} - I}, the Douglas-Rachford iteration is concisely written as

\displaystyle u^{n+1} = \tfrac12(I + R_{B}R_{A})u_{n}.

The convergence of the method has been clarified is a number of papers, see, e.g.

Lions, Pierre-Louis, and Bertrand Mercier. “Splitting algorithms for the sum of two nonlinear operators.” SIAM Journal on Numerical Analysis 16.6 (1979): 964-979.

for the first treatment in the context of monotone operators and

Svaiter, Benar Fux. “On weak convergence of the Douglas–Rachford method.” SIAM Journal on Control and Optimization 49.1 (2011): 280-287.

for a recent very general convergence result.

Since {tA} is monotone if {A} is monotone and {t>0}, we can introduce a stepsize for the Douglas-Rachford iteration

\displaystyle u^{n+1} = \tfrac12(I + R_{tB}R_{tA})u^{n}.

It turns out, that this stepsize matters a lot in practice; too small and too large stepsizes lead to slow convergence. It is a kind of folk wisdom, that there is “sweet spot” for the stepsize. In a recent preprint Quoc Tran-Dinh and I investigated this sweet spot in the simple case of linear operators {A} and {B} and this tweet has a visualization.

A few days ago Walaa Moursi and Lieven Vandenberghe published the preprint “Douglas-Rachford splitting for a Lipschitz continuous and a strongly monotone operator” and derived some linear convergence rates in the special case they mention in the title. One result (Theorem 4.3) goes as follows: If {A} is monotone and Lipschitz continuous with constant {\beta} and {B} is maximally monotone and {\mu}-strongly monotone, than the Douglas-Rachford iterates converge strongly to a solution with a linear rate

\displaystyle r = \tfrac{1}{2(1+\mu)}\left(\sqrt{2\mu^{2}+2\mu + 1 +2(1 - \tfrac{1}{(1+\beta)^{2}} - \tfrac1{1+\beta^{2}})\mu(1+\mu)} + 1\right).

This is a surprisingly complicated expression, but there is a nice thing about it: It allows to optimize for the stepsize! The rate depends on the stepsize as

\displaystyle r(t) = \tfrac{1}{2(1+t\mu)}\left(\sqrt{2t^{2}\mu^{2}+2t\mu + 1 +2(1 - \tfrac{1}{(1+t\beta)^{2}} - \tfrac1{1+t^{2}\beta^{2}})t\mu(1+t\mu)} + 1\right)

and the two plots of this function below show the sweet spot clearly.

 

If one knows the Lipschitz constant of {A} and the constant of strong monotonicity of {B}, one can minimize {r(t)} to get on optimal stepsize (in the sense that the guaranteed contraction factor is as small as possible). As Moursi and Vandenberghe explain in their Remark 5.4, this optimization involves finding the root of a polynomial of degree 5, so it is possible but cumbersome.

Now I wonder if there is any hope to show that the adaptive stepsize Quoc and I proposed here (which basically amounts to {t_{n} = \|u^{n}\|/\|Au^{n}\|} in the case of single valued {A} – note that the role of {A} and {B} is swapped in our paper) is able to find the sweet spot (experimentally it does).
<p

Consider the saddle point problem

\displaystyle   \min_{x}\max_{y}F(x) + \langle Kx,y\rangle - G(y). \ \ \ \ \ (1)

(where I omit all the standard assumptions, like convexity, continuity ans such…). Fenchel-Rockafellar duality says that solutions are characterized by the inclusion

\displaystyle  0 \in\left( \begin{bmatrix} \partial F & 0\\ 0 & \partial G \end{bmatrix} + \begin{bmatrix} 0 & K^{T}\\ -K & 0 \end{bmatrix}\right) \begin{bmatrix} x^{*}\\y^{*} \end{bmatrix}

Noting that the operators

\displaystyle  A = \begin{bmatrix} \partial F & 0\\ 0 & \partial G \end{bmatrix},\quad B = \begin{bmatrix} 0 & K^{T}\\ -K & 0 \end{bmatrix}

are both monotone, we may apply any of the splitting methods available, for example the Douglas-Rachford method. In terms of resolvents

\displaystyle  R_{tA}(z) := (I+tA)^{-1}(z)

this method reads as

\displaystyle  \begin{array}{rcl}  z^{k+1} & = & R_{tB}(\bar z^{k})\\ \bar z^{k+1}& = & R_{tA}(2z^{k+1}-\bar z^{k}) + \bar z^{k}-z^{k+1}. \end{array}

For the saddle point problem, this iteration is (with {z = (x,y)})

\displaystyle  \begin{array}{rcl}  x^{k+1} &=& R_{t\partial F}(\bar x^{k})\\ y^{k+1} &=& R_{t\partial G}(\bar y^{k})\\ \begin{bmatrix} \bar x^{k+1}\\ \bar y^{k+1} \end{bmatrix} & = & \begin{bmatrix} I & tK^{T}\\ -tK & I \end{bmatrix}^{-1} \begin{bmatrix} 2x^{k+1}-\bar x^{k}\\ 2y^{k+1}-\bar y^{k} \end{bmatrix} + \begin{bmatrix} \bar x^{k}- x^{k+1}\\ \bar y^{k}-y^{k+1} \end{bmatrix}. \end{array}

The first two lines involve proximal steps and we assume that they are simple to implement. The last line, however, involves the solution of a large linear system. This can be broken down to a slightly smaller linear system involving the matrix {(I+t^{2}K^{T}K)} as follows: The linear system equals

\displaystyle  \begin{array}{rcl}  \bar x^{k+1} & = & x^{k+1} - tK^{T}(y^{k+1}+\bar y^{k+1}-\bar y^{k})\\ \bar y^{k+1} & = & y^{k+1} + tK(x^{k+1} + \bar x^{k+1}-\bar x^{k}). \end{array}

Plugging {\bar y^{k+1}} from the second equation into the first gives

\displaystyle  \bar x^{k+1} = x^{k+1} - tK^{T}(2y^{k+1}-\bar y^{k}) - tK^{T}K(x^{k+1}-\bar x^{k+1}-\bar x^{k})

Denoting {d^{k+1}= x^{k+1}+\bar x^{k+1}-\bar x^{k}} this can be written as

\displaystyle  (I+t^{2}K^{T}K)d^{k+1} = (2x^{k+1}-\bar x^{k}) - tK^{T}(2y^{k+1}-\bar y^{k}).

and the second equation is just

\displaystyle  \bar y^{k+1} = y^{k+1} + tKd^{k+1}.

This gives the overall iteration

\displaystyle  \begin{array}{rcl}  x^{k+1} &=& R_{t\partial F}(\bar x^{k})\\ y^{k+1} &=& R_{t\partial G}(\bar y^{k})\\ d^{k+1} &=& (I+t^{2}K^{T}K)^{-1}(2x^{k+1}-\bar x^{k} - tK(2y^{k+1}-\bar y^{k}))\\ \bar x^{k+1}&=& \bar x^{k}-x^{k+1}+d^{k+1}\\ \bar y^{k+1}&=& y^{k+1}+tKd^{k+1} \end{array}

This is nothing else than using the Schur complement or factoring as

\displaystyle  \begin{bmatrix} I & tK^{T}\\ -tK & I \end{bmatrix} = \begin{bmatrix} 0 & 0\\ 0 & I \end{bmatrix} + \begin{bmatrix} I\\tK \end{bmatrix} (I + t^{2}K^{T}K)^{-1} \begin{bmatrix} I & -tK^{T} \end{bmatrix}

and has been applied to imaging problems by O’Connor and Vandenberghe in “Primal-Dual Decomposition by Operator Splitting and Applications to Image Deblurring” (doi). For many problems in imaging, the involved inversion may be fairly easy to perform (if {K} is the image gradient, for example, we only need to solve an equation with an operator like {(I - t^{2}\Delta)} and appropriate boundary conditions). However, there are problems where this inversion is a problem.

I’d like to show the following trick to circumvent the matrix inversion, which I learned from Bredies and Sun’s “Accelerated Douglas-Rachford methods for the solution of convex-concave saddle-point problems”: Here is a slightly different saddle point problem

\displaystyle   \min_{x}\max_{y,x_{p}}F(x) + \langle Kx,y\rangle + \langle Hx,x_{p}\rangle- G(y) - I_{\{0\}}(x_{p}). \ \ \ \ \ (2)

We added a new dual variable {x_{p}}, which is forced to be zero by the additional indicator functional {I_{\{0\}}}. Hence, the additional bilinear term {\langle Hx,x_{p}\rangle} is also zero, and we see that {(x,y)} is a solution of (1) if and only if {(x,y,0)} is a solution of (2). In other words: The problem just looks differently, but is, in essence, the same as before.

Now let us write down the Douglas-Rachford iteration for (2). We write this problem as

\displaystyle  \min_{x}\max_{\tilde y} F(x) + \langle \tilde Kx,\tilde y\rangle -\tilde G(\tilde y)

with

\displaystyle  \tilde y = \begin{bmatrix} y\\x_{p} \end{bmatrix}, \quad \tilde K = \begin{bmatrix} K\\H \end{bmatrix}, \quad \tilde G(\tilde y) = \tilde G(y,x_{p}) = G(y) + I_{\{0\}}(x_{p}).

Writing down the Douglas-Rachford iteration gives

\displaystyle  \begin{array}{rcl}  x^{k+1} &=& R_{t\partial F}(\bar x^{k})\\ \tilde y^{k+1} &=& R_{t\partial \tilde G}(\bar{ \tilde y}^{k})\\ \begin{bmatrix} \bar x^{k+1}\\ \bar {\tilde y}^{k+1} \end{bmatrix} & = & \begin{bmatrix} I & t\tilde K^{T}\\ -t\tilde K & I \end{bmatrix}^{-1} \begin{bmatrix} 2x^{k+1}-\bar x^{k}\\ 2\tilde y^{k+1}-\bar {\tilde y}^{k} \end{bmatrix} + \begin{bmatrix} \bar x^{k}- x^{k+1}\\ \bar {\tilde y}^{k}-\tilde y^{k+1} \end{bmatrix}. \end{array}

Switching back to variables without a tilde, we get, using {R_{tI_{\{0\}}}(x) = 0},

\displaystyle  \begin{array}{rcl}  x^{k+1} &=& R_{t\partial F}(\bar x^{k})\\ y^{k+1} &=& R_{t\partial \tilde G}(\bar{ y}^{k})\\ x_{p}^{k+1} &=& 0\\ \begin{bmatrix} \bar x^{k+1}\\ \bar {y}^{k+1}\\ \bar x_{p}^{k+1} \end{bmatrix} & = & \begin{bmatrix} I & tK^{T} & tH^{T}\\ -t K & I & 0\\ -t H & 0 & I \end{bmatrix}^{-1} \begin{bmatrix} 2x^{k+1}-\bar x^{k}\\ 2 y^{k+1}-\bar {y}^{k}\\ 2x_{p}^{k+1}-\bar x_{p}^{k} \end{bmatrix} + \begin{bmatrix} \bar x^{k}- x^{k+1}\\ \bar {y}^{k}-y^{k+1}\\ \bar x_{p}^{k}-x_{p}^{k+1} \end{bmatrix}. \end{array}

First not that {x_{p}^{k+1}=0} throughout the iteration and from the last line of the linear system we get that

\displaystyle  \begin{array}{rcl}  -tH\bar x^{k+1} + \bar x_{p}^{k+1} = -\bar x_{p}^{k} -tH(\bar x^{k}-x^{k+1}) + \bar x_{p}^{k} \end{array}

which implies that

\displaystyle  \bar x_{p}^{k+1} = tH\bar x^{k+1}.

Thus, both variables {x_{p}^{k}} and {\bar x_{p}^{k}} disappear in the iteration. Now we rewrite the remaining first two lines of the linear system as

\displaystyle  \begin{array}{rcl}  \bar x^{k+1} + tK^{T}\bar y^{k+1} + t^{2}H^{T}H\bar x^{k+1} &=& x^{k+1} + tK^{T}(\bar y^{k}-y^{k+1}) + t^{2}H^{T}H\bar x^{k}\\ \bar y^{k+1}-tK\bar x^{k+1} &=& y^{k+1} + tK(x^{k+1}-\bar x^{k}). \end{array}

Again denoting {d^{k+1}=x^{k+1}+\bar x^{k+1}-\bar x^{k}}, solving the second equation for {\bar y^{k+1}} and plugging the result in the first gives

\displaystyle  (I+t^{2}H^{T}H)\bar x^{k+1} +tK^{T}(y^{k+1}+tKd^{k+1}) = x^{k+1}+tK(\bar y^{k}-y^{k+1}) + t^{2}H^{T}H\bar x^{k}.

To eliminate {\bar x^{k+1}} we add {(I+t^{2}H^{T}H)(x^{k+1}-\bar x^{k})} on both sides and get

\displaystyle  (I+t^{2}(H^{T}H+K^{T}K))d^{k+1} = 2x^{k+1}-\bar x^{k} -tK(y^{k+1}+\bar y^{k+1}-\bar y^{k}) + t^{2}H^{T}Hx^{k+1}.

In total we obtain the following iteration:

\displaystyle  \begin{array}{rcl}  x^{k+1} &=& R_{t\partial F}(\bar x^{k})\\ y^{k+1} &=& R_{t\partial G}(\bar y^{k})\\ d^{k+1} &=& (I+t^{2}(H^{T}H + K^{T}K))^{-1}(2x^{k+1}-\bar x^{k} - tK(2y^{k+1}-\bar y^{k}) + t^{2}H^{T}Hx^{k+1})\\ \bar x^{k+1}&=& \bar x^{k}-x^{k+1}+d^{k+1}\\ \bar y^{k+1}&=& y^{k+1}+tKd^{k+1} \end{array}

and note that only the third line changed.

Since the above works for any matrix {H}, we have a lot of freedom. Let us see, that it is even possible to avoid any inversion whatsoever: We would like to choose {H} in a way that {I+t^{2}(H^{T}H + K^{T}K) = \lambda I} for some positive {\lambda}. This is equivalent to

\displaystyle  H^{T}H = \tfrac{\lambda-1}{t^{2}}I - K^{T}K.

As soon as the right hand side is positive definite, Cholesky decomposition shows that such an {H} exists, and this happens if {\lambda\geq 1+t^{2}\|K\|^{2}}. Further note, that we do need {H} in any way, but only {H^{T}H}, and we can perform the iteration without ever solving any linear system since the third row reads as

\displaystyle  d^{k+1} = \tfrac{1}{\lambda}\left(2x^{k+1}-\bar x^{k} - tK(2y^{k+1}-\bar y^{k}) + ((\lambda-1)I - t^{2}K^{T}K)x^{k+1})\right).

I blogged about the Douglas-Rachford method before and in this post I’d like to dig a bit into the history of the method.

As the name suggests, the method has its roots in a paper by Douglas and Rachford and the paper is

Douglas, Jim, Jr., and Henry H. Rachford Jr., “On the numerical solution of heat conduction problems in two and three space variables.” Transactions of the American mathematical Society 82.2 (1956): 421-439.

At first glance, the title does not suggest that the paper may be related to monotone inclusions and if you read the paper you’ll not find any monotone operator mentioned. So let’s start and look at Douglas and Rachford’s paper.

1. Solving the heat equation numerically

So let us see, what they were after and how this is related to what is known as Douglas-Rachford splitting method today.

Indeed, Douglas and Rachford wanted to solve the instationary heat equation

\displaystyle \begin{array}{rcl} \partial_{t}u &=& \partial_{xx}u + \partial_{yy}u \\ u(x,y,0) &=& f(x,y) \end{array}

with Dirichlet boundary conditions (they also considered three dimensions, but let us skip that here). They considered a rectangular grid and a very simple finite difference approximation of the second derivatives, i.e.

\displaystyle \begin{array}{rcl} \partial_{xx}u(x,y,t)&\approx& (u^{n}_{i+1,j}-2u^{n}_{i,j}+u^{n}_{i-1,j})/h^{2}\\ \partial_{yy}u(x,y,t)&\approx& (u^{n}_{i,j+1}-2u^{n}_{i,j}+u^{n}_{i,j-1})/h^{2} \end{array}

(with modifications at the boundary to accomodate the boundary conditions). To ease notation, we abbreviate the difference quotients as operators (actually, also matrices) that act for a fixed time step

\displaystyle \begin{array}{rcl} (Au^{n})_{i,j} &=& (u^{n}_{i+1,j}-2u^{n}_{i,j}+u^{n}_{i-1,j})/h^{2}\\ (Bu^{n})_{i,j} &=& (u^{n}_{i,j+1}-2u^{n}_{i,j}+u^{n}_{i,j+1})/h^{2}. \end{array}

With this notation, our problem is to solve

\displaystyle \begin{array}{rcl} \partial_{t}u &=& (A+B)u \end{array}

in time.

Then they give the following iteration:

\displaystyle Av^{n+1}+Bw^{n} = \frac{v^{n+1}-w^{n}}{\tau} \ \ \ \ \ (1)

 

\displaystyle Bw^{n+1} = Bw^{n} + \frac{w^{n+1}-v^{n+1}}{\tau} \ \ \ \ \ (2)

 

(plus boundary conditions which I’d like to swipe under the rug here). If we eliminate {v^{n+1}} from the first equation using the second we get

\displaystyle (A+B)w^{n+1} = \frac{w^{n+1}-w^{n}}{\tau} + \tau AB(w^{n+1}-w^{n}). \ \ \ \ \ (3)

 

This is a kind of implicit Euler method with an additional small term {\tau AB(w^{n+1}-w^{n})}. From a numerical point of it has one advantage over the implicit Euler method: As equations (1) and (2) show, one does not need to invert {I-\tau(A+B)} in every iteration, but only {I-\tau A} and {I-\tau B}. Remember, this was in 1950s, and solving large linear equations was a much bigger problem than it is today. In this specific case of the heat equation, the operators {A} and {B} are in fact tridiagonal, and hence, solving with {I-\tau A} and {I-\tau B} can be done by Gaussian elimination without any fill-in in linear time (read Thomas algorithm). This is a huge time saver when compared to solving with {I-\tau(A+B)} which has a fairly large bandwidth (no matter how you reorder).

How do they prove convergence of the method? They don’t since they wanted to solve a parabolic PDE. They were after stability of the scheme, and this can be done by analyzing the eigenvalues of the iteration. Since the matrices {A} and {B} are well understood, they were able to write down the eigenfunctions of the operator associated to iteration (3) explicitly and since the finite difference approximation is well understood, they were able to prove approximation properties. Note that the method can also be seen, as a means to calculate the steady state of the heat equation.

We reformulate the iteration (3) further to see how {w^{n+1}} is actually derived from {w^{n}}: We obtain

\displaystyle (-I + \tau(A+B) - \tau^{2}AB)w^{n+1} = (-I-\tau^{2}AB)w^{n} \ \ \ \ \ (4)

 

2. What about monotone inclusions?

What has the previous section to do with solving monotone inclusions? A monotone inclusion is

\displaystyle \begin{array}{rcl} 0\in Tx \end{array}

with a monotone operator, that is, a multivalued mapping {T} from a Hilbert space {X} to (subsets of) itself such that for all {x,y\in X} and {u\in Tx} and {v\in Ty} it holds that

\displaystyle \begin{array}{rcl} \langle u-v,x-y\rangle\geq 0. \end{array}

We are going to restrict ourselves to real Hilbert spaces here. Note that linear operators are monotone if they are positive semi-definite and further note that monotone linear operators need not to be symmetric. A general approach to the solution of monotone inclusions are so-called splitting methods. There one splits {T} additively {T=A+B} as a sum of two other monotone operators. Then one tries to use the so-called resolvents of {A} and {B}, namely

\displaystyle \begin{array}{rcl} R_{A} = (I+A)^{-1},\qquad R_{B} = (I+B)^{-1} \end{array}

to obtain a numerical method. By the way, the resolvent of a monotone operator always exists and is single valued (to be honest, one needs a regularity assumption here, namely one need maximal monotone operators, but we will not deal with this issue here).

The two operators {A = \partial_{xx}} and {B = \partial_{yy}} from the previous section are not monotone, but {-A} and {-B} are, so the equation {-Au - Bu = 0} is a special case of a montone inclusion. To work with monotone operators we rename

\displaystyle \begin{array}{rcl} A \leftarrow -A,\qquad B\leftarrow -B \end{array}

and write the iteration~(4) in terms of monotone operators as

\displaystyle \begin{array}{rcl} (I + \tau(A+B) + \tau^{2}AB)w^{n+1} = (I+\tau^{2}AB)w^{n}, \end{array}

i.e.

\displaystyle \begin{array}{rcl} w^{n+1} = (I+\tau A+\tau B+\tau^{2}AB)^{-1}(I+\tau AB)w^{n}. \end{array}

Using {I+\tau A+\tau B + \tau^{2}A = (I+\tau A)(I+\tau B)} and {(I+\tau^{2}AB) = (I-\tau B) + (I + \tau A)\tau B} we rewrite this in terms of resolvents as

\displaystyle \begin{array}{rcl} w^{n+1} & = &(I+\tau B)^{-1}[(I+\tau A)^{-1}(I-\tau B) + \tau B]w^{n}\\ & =& R_{\tau B}(R_{\tau A}(w^{n}-\tau Bw^{n}) + \tau Bw^{n}). \end{array}

This is not really applicable to a general monotone inclusion since there {A} and {B} may be multi-valued, i.e. the term {Bw^{n}} is not well defined (the iteration may be used as is for splittings where {B} is monotone and single valued, though).

But what to do, when both and {A} and {B} are multivaled? The trick is, to introduce a new variable {w^{n} = R_{\tau B}(u^{n})}. Plugging this in throughout leads to

\displaystyle \begin{array}{rcl} R_{\tau B} u^{n+1} & = & R_{\tau B}(R_{\tau A}(R_{\tau B}u^{n}-\tau B R_{\tau B}u^{n}) + \tau B R_{\tau B}u^{n}). \end{array}

We cancel the outer {R_{\tau B}} and use {\tau B R_{\tau B}u^{n} = u^{n} - R_{\tau B}u^{n}} to get

\displaystyle \begin{array}{rcl} u^{n+1} & = & R_{\tau A}(2R_{\tau B}u^{n} - u^{n}) + u^{n} - R_{\tau B}u^{n} \end{array}

and here we go: This is exactly what is known as Douglas-Rachford method (see the last version of the iteration in my previous post). Note that it is not {u^{n}} that converges to a solution, but {w^{n} = R_{\tau B}u^{n}}, so it is convenient to write the iteration in the two variables

\displaystyle \begin{array}{rcl} w^{n} & = & R_{\tau B}u^{n}\\ u^{n+1} & = & R_{\tau A}(2w^{n} - u^{n}) + u^{n} - w^{n}. \end{array}

The observation, that these splitting method that Douglas and Rachford devised for linear problems has a kind of much wider applicability is due to Lions and Mercier and the paper is

Lions, Pierre-Louis, and Bertrand Mercier. “Splitting algorithms for the sum of two nonlinear operators.” SIAM Journal on Numerical Analysis 16.6 (1979): 964-979.

Other, much older, splitting methods for linear systems, such as the Jacobi method, the Gauss-Seidel method used different properties of the matrices such as the diagonal of the matrix or the upper and lower triangluar parts and as such, do not generalize easily to the case of operators on a Hilbert space.